During Hurricane Sandy, many data centers were damaged or left without electricity. The aftermath of the hurricane shows just how quickly a disaster, whether natural or man-made, can destroy the complex structures that humans have built. Events like this remind us how important it is to plan ahead and prepare for any situation. Without that preparation, all the information collected in data centers can be lost in a split second.
Other recent data center failures have involved Amazon Web Services. One failure in June took major sites such as Quora and Netflix down for several hours because of a power outage. Although Amazon had a backup generator in place, it did not perform as expected, which prolonged the outage until the normal supply of power was restored. Another failure came in late October, with sites like Reddit and Coursera among the victims. According to Amazon's reports, this one was caused by performance problems, specifically in AWS's storage systems. Upon investigation, the source of the outage was traced back to a bug in Amazon's software for collecting performance data. The bug was triggered when some failed networking equipment was replaced, and it caused the software to consume more and more memory until the system slowed to a crawl. These failures have made many people question how reliable Amazon is as a cloud hosting company, and they have seriously hurt Amazon's image.
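To picture the kind of bug Amazon described, here is a minimal, hypothetical sketch in Python. This is not Amazon's actual code; the names (`MetricsCollector`, `record`, `flush`) and the behavior are my own invention, meant only to show how a collector that keeps buffering data for a server that never responds can eat memory without bound:

```python
import collections


class MetricsCollector:
    """Hypothetical agent that buffers performance samples per server
    and periodically flushes each buffer to that server."""

    def __init__(self):
        # One unbounded buffer per server -- the flaw in this sketch.
        self.buffers = collections.defaultdict(list)

    def record(self, server_id, sample):
        # Samples are queued until a flush to `server_id` succeeds.
        self.buffers[server_id].append(sample)

    def flush(self, server_id, send):
        try:
            send(server_id, self.buffers[server_id])
            self.buffers[server_id].clear()
        except ConnectionError:
            # A replaced server never answers, so its buffer is never
            # cleared, and memory use grows with every new sample.
            pass
```

Capping each buffer, or discarding samples destined for a server that has been unreachable for too long, would bound the memory use; that is the sort of safeguard the collection software apparently lacked.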
The North Campus data center seemed well-equipped to deal with problems such as power outages: it had powerful generators and always kept an extra machine on hand. The guide emphasized redundancy, that is, being able to fall back on a backup quickly if the current machine failed. I'm curious how Amazon could have experienced problems as big as the backup generator blunder and the software bug. I would think that they would have more than just one backup for all their data, but then again, that would be incredibly expensive to implement. The reading from this week says, "Drive redundancy increases reliability but by itself does not guarantee that the storage server will be always up. Many other single points of failure also need to be attacked (power supplies, operating system software, etc.), and dealing with all of them incurs extra cost while never assuring fault-free operation." It brings up the point that no matter how many backups or checks you make, you can never be 100% certain that all the data will be protected. So the question is: what is a good balance between redundancy and cost?
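One way to think about that balance is with a back-of-the-envelope calculation. The sketch below assumes each replica fails independently with some probability (the 0.05 failure probability and the per-replica cost are made-up numbers for illustration): cost grows linearly with each added replica, while the chance that every replica fails at once shrinks geometrically, so the returns diminish and never actually reach zero.

```python
# Back-of-the-envelope: assuming each replica fails independently
# with probability p, data is lost only if all n replicas fail.
p = 0.05                  # assumed per-replica failure probability
cost_per_replica = 1000   # assumed cost per replica (arbitrary units)

for n in range(1, 6):
    p_loss = p ** n       # probability that all n replicas fail
    print(f"{n} replica(s): cost = {n * cost_per_replica}, "
          f"P(data loss) = {p_loss:.10f}")
```

Of course, as the reading points out, the independence assumption itself is shaky: a shared power supply or a common operating system bug can take out several "redundant" replicas at once, which is exactly why no amount of redundancy assures fault-free operation.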