Timeline of Events:
11:00 am EDT: Club OS noticed lag on one of our two read replica databases and began investigating the root cause: a lock in the database.
11:30 am EDT: We initiated the process to bring up a replacement database while also attempting to fix the lock in the existing database.
1:00 pm EDT: The replacement database came online and we switched traffic to it; however, the lock reappeared, and calls from the web app became stuck behind it.
1:45 pm EDT: The web app began to fail as requests accumulated on the replacement database. Our automatic scaling brought up new servers, but the volume of traffic overwhelmed them.
2:15 pm EDT: With automatic scaling repeatedly failing to handle the web traffic, we began working on removing the original read replica database.
2:45 pm EDT: The affected database was removed and we started directing traffic to the remaining database.
3:04 pm EDT: All traffic was moved over to the remaining database and web services were restored. We are continuing to monitor database and web app latency.
Summary:
One of our two read replica databases experienced a lock, which initially affected KPI and report requests. When we attempted to bring up a replacement database instance, the same issue recurred and ultimately created a backlog of web app requests, taking web services down.
Club OS removed the affected database and routed all traffic to our single stable database, restoring service to all clients.
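The recovery pattern described above (detect a lagging or locked replica, then route all read traffic to the healthy one) can be sketched as a simple health check. This is a hypothetical illustration only, not Club OS's actual tooling; the threshold values and function name are assumptions.

```python
# Hypothetical sketch: if a replica's measured lag stays above a
# threshold for several consecutive checks, drop it from the read
# routing pool. Threshold and check count are illustrative values.

LAG_THRESHOLD_SECONDS = 30   # assumed acceptable replication lag
UNHEALTHY_CHECK_COUNT = 3    # consecutive bad checks before removal

def healthy_replicas(lag_history):
    """lag_history maps replica name -> list of recent lag samples
    in seconds, oldest first. Returns replicas safe to route to."""
    healthy = []
    for replica, samples in lag_history.items():
        recent = samples[-UNHEALTHY_CHECK_COUNT:]
        over_threshold = [s for s in recent if s > LAG_THRESHOLD_SECONDS]
        if len(over_threshold) < UNHEALTHY_CHECK_COUNT:
            healthy.append(replica)
    return healthy

# Example: replica-b has exceeded the lag threshold on its last three
# checks, so only replica-a remains in the routing pool.
pool = healthy_replicas({
    "replica-a": [1, 2, 1, 2],
    "replica-b": [5, 40, 55, 120],
})
# pool == ["replica-a"]
```

Requiring several consecutive bad checks avoids failing over on a single transient lag spike, while a sustained lock (as in this incident) is still caught within a few check intervals.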
To prevent future occurrences of this issue, Club OS will:
FOR MORE INFORMATION
If you have questions about this issue, please open a ticket with us by sending a note to [email protected].