When Cloud Services Evaporate
I’ve said it before and I’ll say it again: cloud computing just isn’t what naïve, over-optimistic cloud advocates think it is. One of the greatest risks is loss of availability of cloud services. If you don’t believe this, ask IT people who work for the state of Virginia, and specifically the Virginia Information Technologies Agency (VITA). VITA purchased storage services, “infrastructure as a service” (IaaS), from Northrop Grumman, but this service provider’s primary storage area network (SAN) failed in the middle of last week because of a faulty memory card in an EMC DMX-3 storage array. Ironically, this storage array is EMC’s flagship product. The fact that the backup SAN also failed prolonged the outage. Employees at 27 of Virginia’s 89 state agencies were unable to access applications and data. Earlier this week that number was down to seven, and today it was reportedly down to three, but as Murphy’s Law would have it, some of the remaining agencies (the Department of Motor Vehicles, the Department of Taxation, and the Department of Elections) provide essential services.
The services that failed were, as in most cloud computing services nowadays, virtualized. Virtualization is a great thing, but when it fails, it seems to fail big. Suppose that you have a primary data center with, say, 25 physical servers, each of which runs six virtual machines (VMs), each of which runs a particular application. Suppose, too, that the secondary data center has the same number of physical servers and that each runs exactly the same VMs and applications as its counterpart in the primary data center: perfectly mirrored environments. In this case, rollover to the secondary data center should be a piece of cake. The application on VM number 3 on physical server 1 in the primary data center is also the application on VM number 3 on physical server 1 in the secondary data center. Where confusion starts to abound (even though in theory it should not) is when the application on VM number 3 on physical server 1 in the primary data center corresponds to the same application running on VM number 5 on physical server 23 in the secondary data center. Technology that maps VMs and applications across disparate physical servers is available, but I have seen very few organizations use it.
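To make the mapping problem concrete, here is a minimal sketch in Python of the kind of lookup table such mapping technology maintains. All the application, host, and slot names are hypothetical, invented purely for illustration; the point is that failover automation consults a recorded mapping rather than assuming slot-for-slot symmetry between sites.

```python
# Hypothetical sketch of a VM placement map for failover. Each entry records
# where an application's VM lives in the primary data center and where its
# replica lives in the secondary data center. Note that "tax-portal" sits in
# VM slot 3 on one host at the primary site but slot 5 on a different host
# at the secondary site -- exactly the mismatch described above.
FAILOVER_MAP = {
    "tax-portal":  {"primary": ("pserver-01", 3), "secondary": ("sserver-23", 5)},
    "dmv-records": {"primary": ("pserver-01", 4), "secondary": ("sserver-07", 2)},
    "voter-reg":   {"primary": ("pserver-12", 1), "secondary": ("sserver-12", 1)},
}

def secondary_location(app: str) -> tuple[str, int]:
    """Return the (host, vm_slot) of an application's replica at the secondary site."""
    try:
        return FAILOVER_MAP[app]["secondary"]
    except KeyError:
        # Surface a clear error rather than guessing a slot during a failover.
        raise LookupError(f"no secondary mapping recorded for {app!r}")

# During a failover, operators or automation ask the map, not the primary layout:
host, slot = secondary_location("tax-portal")
print(host, slot)  # sserver-23 5
```

Without a table like this (or a commercial equivalent that discovers and maintains it automatically), a recovery team is left reconstructing the placement by hand in the middle of an outage.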
But there is more irony. VITA’s previous CIO was fired after he withheld $15 million of the amount due to be paid to Northrop Grumman, citing the provider’s alleged failure to meet some of the contractual requirements. The state of Virginia had experienced service outages, contractual delays, and cost overruns. He was replaced by the current CIO, Sam Nixon, approximately one year ago. Nixon was told to clean up the problems with Northrop Grumman. By all appearances, he has not gotten very far in this endeavor so far.
According to the latest status update, the primary SAN that failed is running again. The catch is that a massive and time-consuming data restoration effort is necessary to ensure that Virginia’s state agencies and the applications they run have the correct data.
Long live cloud computing, but the Northrop Grumman SAN failure has provided a stellar example of what can go wrong when cloud services fail. Too many cloud fanatics neither recognize the real risk nor sufficiently plan for continuity of services in the event of a loss of cloud service availability. And too many of them still have not really caught on to the potential impact of the fact that most cloud services are delivered over the incredibly public and infinitely attackable Internet.