HipChat Service Outage
Incident Report for HipChat
Postmortem

WHAT HAPPENED: On Tuesday 12/20, at approximately 2pm Pacific time, after five months of 99.9% uptime, HipChat experienced an extended interruption of service. We sincerely apologize for this disruption. The underlying cause was a crash of one of HipChat’s MySQL RDS instances, which severely degraded our authentication token flows.

USER IMPACT: The incident lasted approximately 6 hours and 5 minutes. During this time, all users of HipChat experienced degraded functionality (e.g. files, links, room creation), although basic chat would have remained available. A portion of users were also logged out of HipChat and were unable to log back in for the duration of the incident.

RECOVERY EFFORTS: To restore functionality, we gradually restricted application traffic to give the MySQL DB the opportunity to fully recover. Once the DB had recovered, we were able to fully restore traffic and all application functionality.
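
As an illustration only, the idea of restricting traffic in steps and then restoring it can be sketched as below. This report does not describe the actual mechanism used, so everything here is an assumption: the names `admitted`, `next_step`, and `RAMP_STEPS`, the percentage values, and the choice to bucket users by a hash of their ID are all hypothetical.

```python
import hashlib

# Hypothetical ramp schedule: percentage of users allowed through at each step.
RAMP_STEPS = [0, 10, 25, 50, 100]


def admitted(user_id: str, admit_percent: int) -> bool:
    """Return True if this user's requests should be let through to the DB."""
    if not 0 <= admit_percent <= 100:
        raise ValueError("admit_percent must be between 0 and 100")
    # Hash the user ID into a stable bucket in 0..99, so the same user gets
    # the same answer on every request at a given percentage.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < admit_percent


def next_step(current_percent: int, db_healthy: bool) -> int:
    """Move one step up the ramp if the DB looks healthy, else one step down."""
    i = RAMP_STEPS.index(current_percent)
    if db_healthy:
        return RAMP_STEPS[min(i + 1, len(RAMP_STEPS) - 1)]
    return RAMP_STEPS[max(i - 1, 0)]
```

Because the bucket for a user is fixed, raising the percentage only ever adds users; nobody who was already admitted is cut off as traffic is restored.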

ACTION ITEMS: We are working with our upstream providers to understand the root cause of the DB crash and the conditions that led to it, so we can prevent those conditions in the future. In addition, we are investigating options for making our systems more resilient to DB crashes.

We know your teams rely on HipChat to do your work, and we are extremely sorry for this interruption. Thank you for your patience as we continue to improve the reliability of the service our customers depend on.

Posted Dec 22, 2016 - 00:37 UTC

Resolved
We are confident that HipChat service is fully restored and will remain stable. Thank you again for your patience while we investigated and fixed this issue. We will share a post-mortem report about this incident as soon as possible.
Posted Dec 21, 2016 - 06:22 UTC
Update
Our systems are now back to normal and we will continue to monitor. For those affected, you should be able to use HipChat as normal across all your apps. Thank you for your patience while we investigated and fixed this issue. We know HipChat is crucial for your work and keeping your team communicating is our top priority. We will have a post-mortem report about this incident for you as soon as possible. Thank you again (heart)
Posted Dec 21, 2016 - 05:02 UTC
Monitoring
HipChat services have been restored and users are able to log in as normal. We will continue to monitor and deploy additional improvements throughout the next hour. Next update in 30 minutes.
Posted Dec 21, 2016 - 04:31 UTC
Update
We are seeing signs that service is stabilizing. We continue working to fully restore service.

Next update in 30 minutes.
Posted Dec 21, 2016 - 03:54 UTC
Update
Fixing HipChat for our customers is our top priority. We are doing everything possible to restore service.

Next update in 30 minutes.
Posted Dec 21, 2016 - 02:52 UTC
Update
Resolving this issue remains our top priority. We continue working as quickly as possible to restore service.

Next update in 30 minutes.
Posted Dec 21, 2016 - 02:14 UTC
Update
Upon further investigation, we have determined the issue to be more complex than our initial findings led us to believe. We assure you that solving this issue is our top priority, and we are working as swiftly as we can to bring our operational level back to 100%.

Next update in 30 minutes.
Posted Dec 21, 2016 - 01:42 UTC
Update
Getting you back on HipChat is our top priority. We have all hands on deck to resolve this issue. Please see our next update in 30 minutes.
Posted Dec 21, 2016 - 01:06 UTC
Update
Our team has found the cause of the issue and is working to fix it promptly. We have all our resources working to get your team back on HipChat. Thank you for bearing with us during this. Our next update will be in 30 minutes.
Posted Dec 21, 2016 - 00:33 UTC
Update
We've determined the cause of the current service disruption and are now working to resolve it. Thank you for your patience and understanding as we work through this. We will continue to update you every 30 minutes.
Posted Dec 21, 2016 - 00:09 UTC
Identified
The team has identified the cause of the service disruption and is working to resolve the issue. Next update in 30 minutes.
Posted Dec 20, 2016 - 23:29 UTC
Update
Thanks for your patience as we continue to investigate the service interruption. The next update will be in 30 minutes.
Posted Dec 20, 2016 - 22:54 UTC
Investigating
HipChat is currently experiencing a disruption in service. We are very sorry for the inconvenience. We are investigating and will have an update for you soon.
Posted Dec 20, 2016 - 22:17 UTC