ICG Link News
A loss-of-connectivity event occurred at our primary hosting location on July 22 at roughly 2330 CDT and continued until service restoration began at 0800 CDT the following day. A comprehensive Root Cause Analysis of systems and procedures was undertaken and remedial actions were enacted. The causes of the event are identified as follows.
1) The primary external gateway/firewall for the location encountered a software error that caused a system crash, preventing traffic from flowing.
2) The secondary gateway failed to assume the load as it was in bypass mode due to an error earlier in the week during routine configuration maintenance. Restoration of the connection had been delayed until the weekend when a short system outage would be more acceptable.
3) The primary monitoring and notification system for all services resides at the affected location and was unable to notify systems administrators of problems due to loss of the above listed gateways.
4) The backup monitoring and notification system was hosted off site via a well-known "cloud" services provider. This system did correctly see the loss of service and reported appropriate alerts. However, the SMS notifications were blocked by providers because they were incorrectly identified as "SPAM" messages.
5) Outage notifications from customers relied on an email system within the affected location. These messages were received only after service was restored, as they were queued at various mail servers across the Internet awaiting restoration of mail service.
6) Customers who called in after the problem had been identified and was being worked on were informed of that fact upon being connected with a customer service representative. However, no prior notification was available on the phone system itself.
The following corrective actions have been taken:
1) The software fault that caused the primary firewall to crash was identified and repaired.
2) Procedures have been revised with regard to maintenance on backup systems and on those affecting high availability, such that systems are not impaired or diminished any longer than strictly necessary to ensure reliable service.
3) A third-party external monitoring service has been established to provide backup monitoring of the primary monitoring and notification system. This service notifies systems engineers and the executive staff by phone as well as SMS.
4) Secondary DNS services have been moved to an alternate location such that an outage of the primary location will not affect DNS service.
5) The phone system has been reconfigured such that in the event of a customer-reported outage, the system will allow callers to leave an emergency message, with notification of the message going immediately to the cellular phones of the technical staff 24/7/365. The outages email system has also been retained for use in all cases not involving a total loss of service.
6) In addition to the ICG Link news feed which reported on this outage, Twitter and Facebook will be employed in the future to report ongoing status of outages.
7) An alternate message on our phone system will be used in the event of a known outage to inform customers calling in that there is an issue and that we are aware of it.
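For technically minded readers, the idea behind notifying by phone as well as SMS can be sketched in a few lines of Python. This is only an illustration and not our production configuration: the Channel type, the notify_all function, and the stand-in senders are hypothetical names invented for this example. It shows an alert being attempted on every channel, so that one blocked or failing channel (as happened with SMS during this event) does not prevent delivery on another.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Channel:
    """A hypothetical notification channel (illustrative only)."""
    name: str
    send: Callable[[str], bool]  # returns True if delivery was confirmed


def notify_all(channels: List[Channel], message: str) -> List[str]:
    """Try every channel and return the names of those that confirmed delivery."""
    delivered = []
    for channel in channels:
        try:
            if channel.send(message):
                delivered.append(channel.name)
        except Exception:
            # A failing channel must not stop the remaining ones.
            continue
    return delivered


# Stand-in senders that simulate the behaviour of real channels.
def sms_blocked(message: str) -> bool:
    return False  # e.g. filtered as spam by the carrier


def phone_call(message: str) -> bool:
    return True  # call answered and acknowledged


def email_unreachable(message: str) -> bool:
    raise ConnectionError("mail server unreachable")


channels = [
    Channel("sms", sms_blocked),
    Channel("phone", phone_call),
    Channel("email", email_unreachable),
]
delivered = notify_all(channels, "Primary site unreachable")
```

In this simulated run only the phone channel confirms delivery, which is enough for the alert to reach staff. An empty result would mean that no channel succeeded and further escalation is needed.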
These remedial actions will eliminate the possibility of a similar problem in the future and will provide for better communications with customers during any emergency, especially after-hours events. Additionally, we would like to take this opportunity to announce the near completion of a new systems architecture that will bring a whole new level of availability, redundancy and options to our service. This system will be fully operational by year end.