Outage Report for 9/12/12

We have purchased and installed new, faster hardware for the systems in our Atlanta data center. The new servers have a different generation of CPU than the old ones, so live migration was out of the question for the 35-40 virtual machines making up our hosting environment. On Monday night, we held a planned outage to move the few critical machines that can't be made redundant for technical reasons. That outage went exactly as expected, and the only things left to move on Tuesday were the redundant machines, which could all be moved one at a time without an outage.

Among those redundant machines are the two route servers that handle all networking. We moved the backup route server to the new hardware without issue, so moving the primary should have caused no problems. However, when the primary route server was originally created, the backup was made by copying it, and the copy was never given a new IP address, so both servers had the same IP. As a result, when we shut down the primary to move it, the backup didn't take over and everything went down, including our ability to access the system to get in and fix things.
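A check along the following lines could catch this kind of cloning mistake before a migration. It is only a minimal sketch, not the tooling we actually use: the host names, the addresses (taken from the 192.0.2.0/24 documentation range), and the function name find_duplicate_ips are all hypothetical.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_ips(inventory):
    """Return {ip: [hostnames]} for every address assigned to more than one host.

    `inventory` is a hypothetical mapping of host name -> IP address string.
    """
    hosts_by_ip = defaultdict(list)
    for hostname, address in inventory.items():
        # Parse the address so stray whitespace does not hide a duplicate.
        ip = ipaddress.ip_address(address.strip())
        hosts_by_ip[str(ip)].append(hostname)
    # Keep only addresses shared by two or more hosts.
    return {ip: sorted(hosts) for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


# Hypothetical inventory: the backup was cloned from the primary
# and still carries the primary's address.
sample_inventory = {
    "route-primary": "192.0.2.10",
    "route-backup": "192.0.2.10",
    "web-01": "192.0.2.21",
}

duplicates = find_duplicate_ips(sample_inventory)
for ip, hosts in duplicates.items():
    print(f"{ip} is assigned to more than one host: {', '.join(hosts)}")
```

Run against an inventory of a redundant pair before shutting down either member, a non-empty result would indicate that failover cannot be trusted until the addresses are corrected.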

As you can see, this was an unusual set of circumstances, and we do not anticipate any further problems like this.

