ICG Link News
Power Outage Sungard - updated
Last Monday, capacitors in the A side of UPS-D burned out. The unit failed over to its B side, which also failed. Only then did we learn that our A-B power was more like A-A' power: we were not on two separate power systems, but on the A and B sides of UPS-D. We immediately had Sungard run power from another UPS to restore B power to us. About the time they completed the installation of our new B power from the separate UPS, UPS-D came back online in non-redundant mode, providing us with A power.
At that point Sungard began repairing UPS-D one side at a time to try to avoid further outages. Since then we have had A-B power, with one of the two feeds being the crippled UPS-D running in non-redundant mode. During that period, power from UPS-D has failed twice, for a few seconds each time. The second failure was at 3am Sunday morning when, after completing repairs to both sides of the unit, Sungard tried to bring it back online in fully redundant mode. An issue in the software that handles communication between the two sides of the UPS caused an outage, and they went back to non-redundant mode.
All the while, the new UPS has been providing our B power, so machines using A-B power in our cabinets (including all ICG Link hosting and mail clients) were not affected by the two failures of UPS-D since Monday. Sungard has completed installation of new whips from other UPS systems in the data center so that all customers can be switched off UPS-D before they upgrade the software and try another restoration to redundant mode.
Tonight after 6pm, we will be unplugging the power strip fed by UPS-D and plugging it into the newly installed whip. While no ICG Link hosted clients will be affected, any co-located machines on single power will experience a power outage lasting less than a minute.
ORIGINAL POST 07.11.2011:
PROBLEM: We have long had redundant power at our Nashville Sungard data center. Today, capacitors burned out in one of our UPS systems and it successfully failed over to the backup system. A currently unknown fault in the second unit caused it to be shut down, affecting a significant portion of the data center, including us.
SHORT TERM SOLUTION: We were able to temporarily run power from another UPS to our systems in order to restore power, but running new power was a time-consuming effort, resulting in the outage experienced today. The UPS systems will take at least a day to be restored to normal operations.
LONG TERM SOLUTION: We have already purchased equipment for a newly contracted secondary data center in Atlanta, and part of it has been delivered. That facility will be fully provisioned as a disaster recovery location within 30 days. Had this event occurred 30 days from now, there would have been no effect on customers.
A full report on what happened and what will be done to avoid it in the future will be forthcoming when Sungard reports to us.