Data Center Failures

During Hurricane Sandy, many data centers were damaged, lost electricity, or both. The aftermath of the hurricane shows how quickly a disaster, whether natural or man-made, can destroy the complex structures that humans have built. When something like this happens, we are reminded of how important it is to plan ahead and prepare for any situation. Without this preparation, all the information stored in a data center can be lost in a split second.

Other recent data center failures have involved Amazon Web Services. In June, major sites such as Quora and Netflix were down for several hours because of a power outage. Although Amazon had a backup generator in place, it did not perform as expected, which prolonged the outage until the normal supply of power was restored. Another failure came in late October, with sites like Reddit and Coursera among the victims. According to Amazon’s reports, this one was caused by performance issues, specifically in the storage systems of AWS. Upon investigation, the source of the outage was traced back to a bug in Amazon’s software for collecting performance data. The bug was triggered when some failed networking equipment was replaced, and it caused the software to keep using more memory until the system slowed to a crawl. These failures have made many people wonder how reliable Amazon is as a cloud hosting company, and they have hurt Amazon’s image.

The North Campus data center seemed well-equipped to deal with problems such as power outages; it had powerful generators and always kept an extra machine available. The guide emphasized redundancy: being able to fall back on a backup quickly if the current machine failed. I’m curious to know how Amazon could have experienced problems as big as the backup generator blunder and the software bug. I would think that they would have more than just one backup for all their data, but then again, that would be incredibly expensive to implement. The reading from this week says, “Drive redundancy increases reliability but by itself does not guarantee that the storage server will be always up. Many other single points of failure also need to be attacked (power supplies, operating system software, etc.), and dealing with all of them incurs extra cost while never assuring fault-free operation [78].” It brings up the point that no matter how many backups or checks you add, you can never be 100% certain that all the data will be protected. So the question is: what is a good balance between redundancy and cost?
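
One rough way to frame that question is with a toy model. The sketch below is my own illustration, not something from the reading or from Amazon: it assumes every backup fails independently of the others, that each copy costs the same, and that an hour of downtime has a fixed dollar cost. All the function names and numbers are made up for the example. The independence assumption is a big one, since the generator that failed in June was needed at exactly the moment the main power was already out.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` copies is up.

    Assumes each copy is up with probability `single` and that the
    copies fail independently (a simplification).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def expected_downtime_hours(single: float, replicas: int) -> float:
    """Expected hours per year during which every copy is down at once."""
    return (1.0 - combined_availability(single, replicas)) * HOURS_PER_YEAR


def cheapest_replica_count(single: float,
                           unit_cost: float,
                           downtime_cost_per_hour: float,
                           max_replicas: int = 10) -> int:
    """Replica count that minimizes yearly hardware cost plus downtime cost."""
    def total_cost(n: int) -> float:
        hardware = n * unit_cost
        downtime = expected_downtime_hours(single, n) * downtime_cost_per_hour
        return hardware + downtime

    return min(range(1, max_replicas + 1), key=total_cost)


if __name__ == "__main__":
    # Hypothetical inputs: 99% available machines, $1000 per machine per year,
    # $100 lost per hour of downtime.
    for n in range(1, 5):
        print(n, combined_availability(0.99, n), expected_downtime_hours(0.99, n))
    print("cheapest:", cheapest_replica_count(0.99, 1000.0, 100.0))
```

In this model each extra copy buys less than the one before it, so at some point the hardware bill grows faster than the downtime bill shrinks. Where that point falls depends entirely on how much an hour of downtime costs, and the model says nothing about failures that take out several copies together.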

Posted by sharleeism

2 Responses to Data Center Failures

  1. kristenmayer says:

    I’m also surprised that Amazon would not have had more measures in place to ensure that their services did not go down. While redundancy does cost more money, I think it is also important to consider the consequences of having services go down (loss of data, or damage to the company’s image). Up to a point, redundancy is worth the cost, but too much redundancy is a waste of money. It’s important to try to optimize the relationship between the two: making sure that, for the most part, services will not go down, while keeping costs under control.

    It’s possible that Amazon thought they had optimized the system, but later found out that this assumption was incorrect. It will be interesting to see whether either of these flaws leads them to add more redundancy, or whether power outages and bugs will continue to be an issue.

  2. alexjking11 says:

    Sure, the redundancy vs. cost trade-off is tough. But isn’t it interesting that, as often as websites that pay to be hosted on AWS go offline, Amazon.com itself never does?

    There was an interesting report from Amazon a while back that estimated that if all their pages loaded one second slower over the course of a year, it would cost them $1.6 billion in lost sales (http://www.fastcompany.com/1825005/how-one-second-could-cost-amazon-16-billion-sales). So obviously it’s crucial to them to keep Amazon.com up and running. But other websites that use AWS appear to be less demanding.

    Amazon has spent a lot of money building up AWS. Despite significant profits from the Amazon.com e-commerce operation, the company’s net income has consistently been negative over recent quarters as management continues to invest aggressively in AWS. But the question becomes: will websites like Reddit and Quora continue to pay for service that is less than what Amazon.com requires for itself? Perhaps there is room for a new competitor to emerge who can offer better uptime.
