Serve & Secure: Data Center Workload Mobility and Disaster Recovery

By Carlos Rodriguez posted 05-02-2013 11:21

Over the past several months we have seen a lot of discussion on the eGroups and in our networks about the future of law firm data centers. The discussion has centered on how a data center and its workload can be extended across geographically dispersed facilities dynamically, effectively and efficiently, and on whether we should consider moving everything to the "cloud".

Our good friends Down Under at King & Wood Mallesons in Brisbane, Australia, led by Paul O’Leary, Infrastructure Specialist at the firm, tell us about their experience rolling out Cisco OTV, VMware SRM and EMC RecoverPoint on their successful journey to extend their data center and create an automated Workload Mobility environment. Paul can be reached at Paul.OLeary@au.kwm.com

We will also have a session at the conference on different ways of extending your data center.

Enjoy this post.

The Removal of Complexity: Data Center Workload Mobility and Disaster Recovery at King & Wood Mallesons

"Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it." Alan Perlis

In the IT departments of the world, complexity is assumed. In fact, if complexity is absent, the general consensus is that there must be something wrong with the solution.

“It's not robust, obviously”

“It won't handle my specific corner case”

“There don't seem to be enough options to tweak”

All of these are common complaints about any technology that is presented in a way that removes all extraneous features or abilities and focuses on a single core competency.

A team of us at KWM, from engineering to management, has been fixated on removing some of our administrative complexity.

We have a Disaster Recovery (DR) and Workload Mobility solution that we have deployed and use operationally. It has exactly two options: "Do you want it powered on at the recovery site?" and "What order would you like them to come up in?"

KWM has invested the time and effort to implement OTV from Cisco using Nexus 7k datacentre switches, VMware SRM and EMC RecoverPoint to make a complex task as straightforward as asking those two questions of our Administrators. By owning the difficult design and deployment tasks that apply to every workload we run, we can present only the simple choices to the business and be consistent in every recovery, or even when “teleporting” workloads from site to site.
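To make the two-question model concrete, here is a minimal Python sketch of how a protected workload could be described by nothing more than those two answers. The class and function names (`ProtectedVM`, `recovery_sequence`) and the VM names are hypothetical illustrations, not part of SRM or of the firm's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedVM:
    """One protected workload, described only by the two choices (hypothetical model)."""
    name: str
    power_on_at_recovery: bool  # question 1: power it on at the recovery site?
    start_order: int            # question 2: lower numbers come up first


def recovery_sequence(vms):
    """Return the names of the VMs to power on at the recovery site, in start order."""
    selected = [vm for vm in vms if vm.power_on_at_recovery]
    # Sort by start order; the name is a tie-breaker so the result is deterministic.
    ordered = sorted(selected, key=lambda vm: (vm.start_order, vm.name))
    return [vm.name for vm in ordered]


if __name__ == "__main__":
    group = [
        ProtectedVM("app01", True, 2),
        ProtectedVM("db01", True, 1),
        ProtectedVM("test01", False, 3),  # replicated but left powered off
    ]
    print(recovery_sequence(group))  # ['db01', 'app01']
```

Everything else (network, storage, replication) is deliberately absent from the model, which is the point of the design: those decisions are made once, by the platform, for every workload.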

There are three primary pieces: Overlay Transport Virtualization (OTV), RecoverPoint, and Site Recovery Manager (SRM).

Using OTV we have engineered our subnets to be available across multiple sites. That means IP connectivity is preserved without re-addressing the Operating System, even when moving the whole VM from one physical location to another.
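The practical effect of a stretched subnet can be sketched in a few lines of Python. This is only an illustration of the idea, using the standard `ipaddress` module; the subnet, site names and the `address_after_move` helper are made-up assumptions, not the firm's addressing or any Cisco API.

```python
import ipaddress

# Hypothetical values for illustration only.
STRETCHED_SUBNET = ipaddress.ip_network("10.20.30.0/24")
EXTENDED_SITES = ("site-a", "site-b")  # sites where the subnet is available


def address_after_move(vm_ip, subnet, destination_site, extended_sites):
    """Return the IP a VM keeps after moving, or raise if it would need re-addressing."""
    ip = ipaddress.ip_address(vm_ip)
    if destination_site not in extended_sites:
        raise ValueError(f"{subnet} is not extended to {destination_site}")
    if ip not in subnet:
        raise ValueError(f"{ip} is not in {subnet}")
    # The subnet exists at the destination, so the address is unchanged.
    return str(ip)


if __name__ == "__main__":
    print(address_after_move("10.20.30.15", STRETCHED_SUBNET, "site-b", EXTENDED_SITES))
```

Without the Layer 2 extension, the destination-site check would fail and the VM would need a new address, DNS updates and possibly application changes.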

Using BGP we’re able to make the OTV VLANs available via redundant MPLS connections for remote sites accessing resources in the datacentre. The OTV design took some work to get right, but it has worked very well. When designing a Layer 2 overlay on top of a Layer 3 network we were naturally concerned about introducing loops, latency or connectivity issues, but it has worked well for the 2.5 years since we implemented the technology.

OTV is the cornerstone of being able to recover a system without requiring administrative intervention.

RecoverPoint manages the IO from LUNs located on specific SANs. These writes are logged and replicated without performance penalty to the VM. They are then tracked and managed at the remote site, waiting for the request to mount the LUN. Once mounted, the LUN is a complete block-for-block replica of the original. The journaling aspect of RecoverPoint allows the system to gracefully handle any link loss or bandwidth reduction due to production traffic. Its monitoring capabilities extend to vCenter and the SAN OS, so its alerting is meaningful and does not require constant attention to guarantee the quality of the replication.
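The journaling idea can be shown with a toy model in Python. This is a deliberately simplified sketch of ordered, journaled replication in general, not RecoverPoint's actual implementation; the `WriteJournal` class and its methods are hypothetical.

```python
from collections import deque


class WriteJournal:
    """Toy model: writes are logged in order and shipped to the replica
    whenever the link is available (not RecoverPoint's real design)."""

    def __init__(self):
        self.pending = deque()  # writes logged but not yet applied remotely
        self.replica = {}       # block number -> data held at the remote site
        self.link_up = True

    def write(self, block, data):
        """Log a production write, then ship whatever the link allows."""
        self.pending.append((block, data))
        self.drain()

    def drain(self):
        """Apply pending writes to the replica in their original order."""
        while self.link_up and self.pending:
            block, data = self.pending.popleft()
            self.replica[block] = data

    def set_link(self, up):
        """Simulate the link dropping or returning; catch up when it returns."""
        self.link_up = up
        self.drain()


if __name__ == "__main__":
    journal = WriteJournal()
    journal.write(0, b"a")
    journal.set_link(False)   # link lost: writes queue up in the journal
    journal.write(0, b"b")
    journal.write(1, b"c")
    print(len(journal.pending))  # 2
    journal.set_link(True)    # link restored: journal replays in order
    print(journal.replica)    # {0: b'b', 1: b'c'}
```

Because writes are replayed in their original order, the replica converges on the same block contents as the source once the link returns, which is what lets a link outage be absorbed without manual resynchronisation.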

Lastly, Site Recovery Manager from VMware integrates the RecoverPoint replication, giving us the ability to designate VMs and OSs to be mirrored for recovery at an alternate site. We currently have over 90 VMs that are protected in this way. Each group of VMs can not only be recovered after the catastrophic loss of a site, but can also be 'Planned Migrated' if we need to test the recovery. Even more impressively, we can migrate these groups just to balance loads across sites if we choose. All this occurs in minutes. We have failed over some 5 node groups in as little as 7 minutes, while more complex 24 node groups took around 1 hour.
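As a rough worked comparison of those two timings, the per-node cost can be computed directly. The sketch below uses only the figures quoted above and treats "around 1 hour" as 60 minutes, which is an approximation; the `minutes_per_node` helper is a hypothetical name.

```python
def minutes_per_node(total_minutes, nodes):
    """Average failover time per node for one recovery group."""
    return total_minutes / nodes


if __name__ == "__main__":
    small = minutes_per_node(7, 5)    # 5 node group, 7 minutes
    large = minutes_per_node(60, 24)  # 24 node group, roughly 1 hour
    print(small)  # 1.4
    print(large)  # 2.5
```

The larger group costs more per node, which fits its description as more complex, although two data points are not enough to say how failover time scales in general.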

Again, the key point is that the Administrator did 'nothing' in the recovery process. They requested a migration or a test and we facilitated it. Once the workloads were powered up and operating at the alternate site, they were already pingable, and the administrators RDP'd to their VMs and started their own verification process.
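A first verification step of that kind could be as simple as checking that the RDP port answers at the VM's unchanged address. The following Python sketch is an assumption about what such a check might look like, not the firm's actual process; `tcp_reachable` is a hypothetical helper, and 3389 is the standard RDP port.

```python
import socket


def tcp_reachable(host, port=3389, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical address; replace with a VM in the migrated group.
    print(tcp_reachable("10.20.30.15"))
```

Because the address does not change between sites, the same check works before and after a migration without any per-site configuration.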

No IP changes, no scripts, no logging into SANs to promote replicas, and no tape restores of data. No managing two OSs for a single function, which would duplicate the attack surface and consume limited Administrator hours.

And therein lay the biggest talking point of the whole deployment. Describing and demonstrating such a simple result led to much discussion among Administrators. How could you have only two options for my service to be recovered, and yet still have a reliable solution?

All it took was the first live use of the design, and we had very pleased IT staff. This reduced their compliance burden while still meeting our goal of recovery and business continuity.

Simple.

Almost genius-like.