Elevated failure rates when trying to connect to patients
Incident Report for Doxy.me Inc.
Postmortem

From Vonage:

Public PM. IM-539 - 2020-05-27 - DNS Misconfiguration

What Happened

Starting on May 27th, OpenTok clients using a particular set of API servers received private IP addresses in their STUN, TURN/UDP and TURN/TCP server configuration for some of the servers in the East and West Coast regions, making those protocols effectively unusable for those clients. Only TURN/TLS remained accessible, because in that case clients receive server domain names instead of IP addresses.

Clients using those API servers had to fall back to TURN/TLS, which takes longer to establish (making a timeout during negotiation more likely) and requires more CPU resources on the servers. Both factors resulted in an increased rate of error code 1554 on clients using those API servers.

An increase in the CPU load of TURN servers was detected on May 29th, and a temporary fix was deployed the same day while the root cause was being investigated. This reduced the error rates to values close to normal.

Independently, a faulty TURN server involved in the investigation caused a spike in errors on June 6th-8th. The faulty server was removed on June 8th, bringing error rates back to close to normal levels. The root cause investigation continued, and final remediation was put in place on June 12th. Failure rates have been normal since June 8th.

Root Cause

The root cause was a DNS misconfiguration introduced when several API servers were moved to a different cloud provider, starting on May 27th. As a result of that misconfiguration, API servers resolved the domain names of TURN servers in the same region to private IP addresses.

Clients using those API servers and located in the region of the affected TURN servers could not use STUN, TURN/UDP or TURN/TCP, and had to resort to TURN/TLS.
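
To illustrate the failure mode described above, here is a minimal sketch of a check that flags server names resolving to private addresses. It is not Vonage's tooling. The hostnames, addresses and function names are hypothetical, and the check covers only the RFC 1918 IPv4 ranges.

```python
import ipaddress

# RFC 1918 private IPv4 ranges. Addresses in these ranges are not
# reachable from the public internet.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address falls inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def private_entries(resolved: dict[str, str]) -> list[str]:
    """Return the hostnames (sorted) whose resolved address is private."""
    return sorted(host for host, addr in resolved.items() if is_rfc1918(addr))


# Hypothetical resolution results, as an API server might see them.
resolved = {
    "turn-east.example.com": "10.12.0.7",      # private: unusable by clients
    "turn-west.example.com": "203.0.113.20",   # documentation range, stands in for a public address
}

flagged = private_entries(resolved)
```

A client handed an address such as `10.12.0.7` for STUN, TURN/UDP or TURN/TCP cannot reach it from outside the provider's network, which is why only the hostname-based TURN/TLS path kept working.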

Impact

Clients connected to several API servers on the US East and West Coasts, and relying on TURN in those respective regions for connectivity, saw an increased rate of “1554” errors, which relate to timeouts in media connection establishment, during May 27th-28th and June 6th-8th.

Once the connection was established, there was no further impact for those clients.

Clients using relayed (point-to-point) sessions saw a higher relative impact, as those scenarios are more likely to require STUN/TURN.

Clients not using TURN were not affected. Clients connected to API servers outside the region of the affected TURN servers were not affected. Clients in other regions were not affected.

Preventive actions

Update server migration processes with additional configuration and testing steps to enhance validation of DNS configuration and TURN scenarios.

Increase monitoring and tune alerting thresholds for client failure rates.

Review decision criteria to raise incidents and perform Incident Management training across teams.
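
As an illustration of the monitoring action above, here is a minimal sketch of a threshold check on client failure rates. The baseline rate, multiplier and minimum sample size are hypothetical values chosen for the example. They are not figures from this incident or from Vonage's alerting.

```python
def failure_rate(failed: int, attempted: int) -> float:
    """Fraction of connection attempts that failed (0.0 if there were none)."""
    return failed / attempted if attempted else 0.0


def should_alert(
    failed: int,
    attempted: int,
    baseline_rate: float,
    factor: float = 2.0,       # hypothetical: alert above 2x the baseline
    min_attempts: int = 100,   # hypothetical: ignore windows with too little traffic
) -> bool:
    """Return True when the observed failure rate exceeds factor * baseline."""
    if attempted < min_attempts:
        return False
    return failure_rate(failed, attempted) > factor * baseline_rate


# Hypothetical window: 30 failures out of 1000 attempts against a 1% baseline.
alert = should_alert(failed=30, attempted=1000, baseline_rate=0.01)
```

The minimum sample size keeps a handful of failures in a quiet window from triggering an alert. Tuning `factor` trades earlier detection against noise.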

Posted Jun 23, 2020 - 17:44 EDT

Resolved
We have received the Root Cause Analysis from Vonage on this topic and are going to mark it as resolved. We will post the RCA immediately after this post.
Posted Jun 23, 2020 - 17:40 EDT
Monitoring
Starting around June 1st, doxy.me customers on restricted networks (behind firewalls) began seeing elevated failure rates when trying to connect to patients, mostly on mobile devices and/or networks. One of our service providers, Vonage, told us today that they have implemented a fix and that we should start seeing these errors decrease. We will continue to monitor and post updates here as they come.

I would like to take a moment to address the lack of communication on this topic. As is evident from our discussion board (https://discuss.doxy.me/t/all-my-providers-having-issues/30480/50), lots of people were seeing these failures and we did not post anything on our status page. I would like to assure our customers that we believe in transparency and good communication. In this regard, I would say we failed. Going forward, we are going to make a few changes to how we confirm systemic issues and how we communicate them to our customers.

I'm troubled by the amount of frustration this caused our customers and their patients, by the slow reaction in identifying the issue, and by the poor communication. We are working very closely with our vendor (Vonage in this case) to be better informed and updated with important information so we can communicate it clearly. We are also going to start offering IP Whitelisting in the coming weeks to better ensure traffic on restricted networks is properly handled. If you are interested in IP Whitelisting, please reach out to sales@doxy.me or your Customer Success Manager for more information or to request early access.

We will be posting updates and a Root Cause Analysis here once we get them.
Posted Jun 15, 2020 - 16:02 EDT