Increased Latency and Response Times
Incident Report for Inc.

Summary of event: Our API experienced a severe increase in response time, preventing login attempts and causing overall system slowness.

Business impacts: The service was intermittently unresponsive from 2:10 PM until roughly 2:23 PM EDT on Thursday, March 17th.

Root cause: The latency spikes, which occurred three times over the past few weeks, were caused by a sub-processor answering our requests too slowly. It responded so slowly that automated recovery systems kicked in and rebooted everything before logs could be sent. Complicating diagnosis, each spike happened shortly after a deployment, which led us to suspect our new code and prompted rollbacks and code review. We eventually discovered a service call that could take up to 5 minutes to complete and was made frequently enough during busy periods that it almost immediately rendered the server non-functional and triggered a reboot.

How was this resolved: We identified and removed the service calls that were occasionally slow enough to take down our service.

Preventative next steps: We are reassessing any unnecessary API calls being made, as well as the length of time we wait for a response, so that long-running requests no longer tie up our server resources. We are also improving our system health monitoring so we can identify and address anomalies more quickly.
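The key preventative idea above is bounding how long a request to a dependency may take. As a minimal sketch (not our actual code; the function name and timeout value are illustrative), an outbound call can be wrapped so it fails fast with a `TimeoutError` instead of blocking for minutes:

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs), but give up after timeout_s seconds
    instead of letting a slow dependency tie up the caller."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"dependency call exceeded {timeout_s}s")
    finally:
        # Don't block waiting for the slow call to finish.
        pool.shutdown(wait=False)
```

Note this wrapper only stops the caller from waiting; the underlying worker thread keeps running, so in practice a client-level timeout (e.g. the `timeout=` parameter on an HTTP client) is the more complete fix.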

Posted Mar 22, 2022 - 11:44 EDT

The fix has been implemented. Our servers are returning to normal capacity, response times are within expected limits, and functionality has been restored.
Posted Mar 17, 2022 - 17:13 EDT
We have identified an issue and are implementing a fix.
Posted Mar 17, 2022 - 15:40 EDT
We are currently investigating this issue.
Posted Mar 17, 2022 - 14:21 EDT
This incident affected: API.