6.4 Partial failover

A partial failover is the scenario where a tenant still has his infrastructure up and running, and only one or more virtual machines are having issues. In this situation, nobody wants to run a complete site failover to solve the problems of a few VMs; partial failover makes it possible to start the replicas of one or more VMs at the service provider side while leaving all the other VMs running at the tenant side.

To make this possible, the technology integrated into the Network Extension Appliance (NEA) extends (hence the name) any customer network to the service provider site, so that production VMs can communicate with replicas without any change in the IP addressing.

This happens because the NEA creates, for each involved network, a Layer 2 VPN tunnel that transparently extends the tenant network to the corresponding service provider network.

6.29: Veeam Cloud Connect Partial Failover

The Cloud Gateway at the service provider is responsible for interconnecting the two NEAs, one at the tenant and one at the service provider. Thanks to this interconnection, the OpenVPN client running at the tenant can initiate a VPN tunnel towards the OpenVPN server running at the service provider. The final result is that a Layer 2 tunnel is created between the two networks and, thanks to a proxy-ARP solution running in both appliances, packets can travel inside the tunnel and VMs can communicate with each other, regardless of the site in which they are powered on.
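The proxy-ARP idea mentioned above can be illustrated with a toy model. This is only a sketch of the general technique, not the NEA's actual implementation: the class name `ProxyArp`, the MAC address and the IP addresses are all hypothetical. The responder answers ARP requests on behalf of hosts that live on the other side of the tunnel, so that local machines send their frames to it and it can relay them.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ProxyArp:
    """Toy proxy-ARP responder for one end of a Layer 2 tunnel (hypothetical)."""

    own_mac: str
    # IP addresses of machines that are powered on at the other site.
    remote_ips: Set[str] = field(default_factory=set)

    def handle_request(self, target_ip: str) -> Optional[str]:
        """Return the MAC to advertise for target_ip, or None to stay silent."""
        # Answer with our own MAC only for addresses on the far side, so the
        # sender delivers its frames to us and we can push them into the tunnel.
        if target_ip in self.remote_ips:
            return self.own_mac
        # Local hosts answer ARP requests for themselves.
        return None


# Tenant-side responder: the replica 192.168.1.50 runs at the service provider.
tenant_proxy = ProxyArp(own_mac="00:50:56:00:00:01", remote_ips={"192.168.1.50"})
print(tenant_proxy.handle_request("192.168.1.50"))
print(tenant_proxy.handle_request("192.168.1.10"))
```

In this model, a request for the failed-over VM is answered by the appliance, while a request for a VM that is still running locally is ignored and left to that VM.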

NOTE: virtual machines running at the service provider can reach the internet by using the internet connection of the tenant. Any packet created at the service provider with a destination other than its own subnet is forwarded to the default gateway, which is running at the tenant side. As the 169.254.0.1 address suggests, the internal interface of the provider NEA is enabled but has no IP address, so it cannot be used as a gateway. Instead, it forwards Ethernet frames to the paired ARP proxy at the tenant side.
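The forwarding rule described in this note is the standard on-link versus off-link decision that any IP host makes. A minimal sketch, assuming a replica at 192.168.1.50/24 and a tenant-side default gateway at 192.168.1.1 (both addresses, and the helper name `next_hop`, are made up for illustration):

```python
import ipaddress


def next_hop(destination: str, interface: str, default_gateway: str) -> str:
    """Return the IP a host would deliver a packet to at Layer 2."""
    iface = ipaddress.ip_interface(interface)
    dest = ipaddress.ip_address(destination)
    # Same subnet: deliver directly (through the tunnel if the peer is remote).
    if dest in iface.network:
        return str(dest)
    # Any other destination goes to the default gateway, which in a partial
    # failover is still the one running at the tenant side.
    return default_gateway


print(next_hop("192.168.1.20", "192.168.1.50/24", "192.168.1.1"))  # 192.168.1.20
print(next_hop("8.8.8.8", "192.168.1.50/24", "192.168.1.1"))       # 192.168.1.1
```

Because the replica keeps its original IP configuration, its default gateway is unchanged, which is why internet-bound traffic crosses the tunnel and exits from the tenant site.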

Partial failover operation

To initiate a partial failover, the tenant selects, from the ready replicas, the virtual machine he wants to fail over. Note that Veeam doesn’t verify whether the original virtual machine is still running, so IP address conflicts may occur if the tenant doesn’t check this before starting the partial failover.
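Since this check is left to the tenant, it can be scripted before launching the wizard. The following is a rough sketch and not a Veeam feature: the helper names, the second VM name `db01` and the IP addresses are hypothetical, and a ping reply is only a hint (a firewall may drop ICMP, so a silent address is not proof that the original VM is off).

```python
import platform
import subprocess
from typing import Callable, Dict, List


def ping_once(ip: str) -> bool:
    """Return True if a single ICMP echo request to ip gets a reply."""
    # Windows ping uses -n for the packet count, most other systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_conflicts(
    replica_ips: Dict[str, str],
    is_alive: Callable[[str], bool] = ping_once,
) -> List[str]:
    """List the VMs whose production IP still answers on the network."""
    return [vm for vm, ip in replica_ips.items() if is_alive(ip)]


# Example with a fake probe, so the sketch can be tried without a network.
candidates = {"test-2012": "192.168.1.50", "db01": "192.168.1.51"}
still_running = find_conflicts(candidates, is_alive=lambda ip: ip == "192.168.1.50")
print(still_running)  # ['test-2012']
```

Any VM returned by `find_conflicts` should be shut down or disconnected at the tenant side before its replica is powered on, otherwise both copies would answer on the same address.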

6.30: Start the failover of a single VM

In the wizard, the tenant can add more VMs to the partial failover, and for each of them he can select the restore point he wants to use:

6.31: Select the restore point to be used for the failover

The wizard is finished, and after a few seconds the operation is completed:

6.32: Partial failover is completed successfully

What has happened behind this screen? A few things.

On both sides, the NEAs are started so that the VPN tunnel and the proxy-ARP components are up and running. This is the NEA at the tenant side:

6.33: NEA at tenant side is started

On the service provider side, both the NEA and the requested replica VM are started. The replica VM has the same configuration and IP address as its original copy:

6.34: VM is started at the service provider side

Note that the virtual machine is connected to the network tenant.vsphere-tenant1.vlan3001 created by Cloud Connect.

The service provider can also verify in the Veeam Backup & Replication console that the failover has been started by Tenant 1:

6.35: VM (test-2012) is in failover state at the service provider

He can also see the two tasks originated by the failover:

6.36: Task list at the service provider

There is a completed Cloud Failover task, related to the power-on of the replica VM, and a VPN Tunnel task in active state, as the failover is still in progress.

The final result, for the tenant, is that any connection towards the failed-over VM happens as usual:

6.37: Connection to the original VM and its replica

We can see two ping operations here. The first one, against the original VM, has a time below 1 ms and a TTL of 64, signs that the ping packet is reaching a local VM. The second one has higher latency (never below 1 ms) and a TTL of 62. This is a clear sign that the connection is still over a Layer 2 network (both machines are in the same subnet), but the link reaches a remote location through an ARP proxy that is “eating” two routing hops to pass the packets back and forth (the reduction from 64 to 62).
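The TTL reasoning can be written down as a small calculation. This sketch assumes the sender started from one of the common initial TTL values (64, 128 or 255) and picks the nearest one at or above the observed value; the function name `hops_from_ttl` is made up for illustration.

```python
# Initial TTL values commonly used by operating systems (an assumption here).
COMMON_INITIAL_TTLS = (64, 128, 255)


def hops_from_ttl(observed_ttl: int) -> int:
    """Estimate how many hops decremented the TTL of a received packet."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    # Guess the initial TTL as the smallest common value not below the reply.
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl


print(hops_from_ttl(64))  # 0: the original VM answers from the local network
print(hops_from_ttl(62))  # 2: the replica answers through the two appliances
```

With the values from the two ping tests, the local VM shows no decrement while the replica shows two, matching the two hops consumed on the path through the tunnel.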

The partial failover is working correctly, and can be kept up and running for as long as the tenant needs it. Once the failover is no longer needed, the tenant can choose among different options:

6.38: Options for a failed over VM

Undo failover stops the replica VM at the service provider side, and any change applied to that VM is lost. If the tenant has made changes to the replica VM and wants that version to be the one used from now on, he can choose Failback to production to replicate the replica VM back into his production site.