WHAT HAPPENED: On Tuesday 12/20, at approximately 2pm Pacific time, after five months of 99.9% uptime, HipChat experienced an extended interruption of service. We sincerely apologize for this disruption. The underlying cause was a crash of one of HipChat’s MySQL RDS instances, which severely degraded our authentication token flows.
USER IMPACT: The incident lasted approximately 6 hours and 5 minutes. During this time all users of HipChat experienced degraded functionality (e.g. files, links, room creation), but basic chat would have remained available. In addition, a portion of users were logged out of HipChat and were not able to log back in for the duration of the incident.
RECOVERY EFFORTS: To restore functionality, we gradually restricted application traffic to give the MySQL DB the opportunity to fully recover. Once the DB had recovered, we were able to fully restore traffic and all application functionality.
ACTION ITEMS: We are working with our upstream providers to understand the root cause of the DB crash and the conditions that led to it, so we can prevent those conditions in the future. In addition, we are investigating options for making our systems more resilient to DB crashes.
We know your teams rely on HipChat to do your work, and we are extremely sorry for this interruption. Thank you for your patience as we continue to improve the reliability of the service our customers depend on.