From Vonage:
What Happened
Starting on May 27th, OpenTok clients using a specific set of API servers received private IP addresses in their STUN, TURN/UDP and TURN/TCP server configuration for some servers in the US east and west coast regions, making those protocols effectively unusable for those clients. Only TURN/TLS remained accessible, because for that protocol clients receive server domain names instead of IP addresses.
Those clients had to fall back to TURN/TLS, which takes longer to establish, making a timeout during negotiation more likely, and which requires more CPU resources on the servers. Both factors increased the rate of errors with code 1554 for clients using those API servers.
An increase in the CPU load of TURN servers was detected on May 29th, and a temporary fix was deployed the same day while the root cause was being investigated. This reduced error rates to near-normal values.
Independently, a faulty TURN server involved in the investigation caused a spike in errors on June 6th-8th. The faulty server was removed on June 8th, bringing error rates back to near-normal levels. The root cause investigation continued, and the final remediation was put in place on June 12th. Failure rates have been normal since June 8th.
Root Cause
The root cause was a DNS misconfiguration introduced when several API servers were moved to a different cloud provider, starting on May 27th. As a result of that misconfiguration, the API servers resolved the domain names of TURN servers in the same region to private IP addresses.
Clients that used those API servers and were in the region of the affected TURN servers could not use STUN, TURN/UDP or TURN/TCP and had to fall back to TURN/TLS.
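To illustrate why only the entries carrying domain names stayed usable, the following is a minimal Python sketch that flags STUN/TURN URIs whose host is a private IP literal. The function names and the sample URIs are hypothetical and are not taken from the OpenTok API; note also that Python's is_private covers more than the RFC 1918 ranges (for example loopback and link-local addresses).

```python
import ipaddress


def host_of(ice_url: str) -> str:
    """Extract the host from a URI such as 'turn:10.0.0.5:3478?transport=udp'."""
    _scheme, _, rest = ice_url.partition(":")
    authority = rest.split("?", 1)[0]
    if authority.startswith("["):  # bracketed IPv6 literal
        return authority[1:authority.index("]")]
    return authority.rsplit(":", 1)[0] if ":" in authority else authority


def has_private_ip_host(ice_url: str) -> bool:
    """True if the host is an IP literal in a non-globally-routable range."""
    try:
        return ipaddress.ip_address(host_of(ice_url)).is_private
    except ValueError:
        # Not an IP literal: a domain name that the client resolves itself.
        return False


# Hypothetical configuration, shaped like the one described above.
sample = [
    "stun:10.1.2.3:3478",
    "turn:10.1.2.3:3478?transport=udp",
    "turns:turn.example.com:443?transport=tcp",
]
unusable = [url for url in sample if has_private_ip_host(url)]
```

In this sample, the two entries with a private IP literal are flagged, while the TURN/TLS entry is left alone because it names a domain rather than an address.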
Impact
Clients connected to several API servers on the US east and west coasts, and relying on TURN in the respective region for connectivity, saw an increased rate of “1554” errors, which relate to timeouts in media connection establishment, during May 27th-28th and June 6th-8th.
Once the connection was established, there was no further impact for those clients.
Clients using relayed (point to point) sessions saw a higher relative impact, as those scenarios are more likely to require STUN/TURN.
Clients not using TURN were not affected. Clients connected to API servers outside the regions of the affected TURN servers were not affected. Clients in other regions were not affected.
Preventive actions
Update server migration processes with additional configuration and testing steps to enhance validation of DNS configuration and TURN scenarios.
Increase monitoring and tune alerting thresholds for client failure rates.
Review decision criteria to raise incidents and perform Incident Management training across teams.
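As an illustration of the DNS validation mentioned in the first action, a post-migration check could resolve each TURN hostname from the migrated API server and fail if any answer is a private address. This is only a sketch under assumptions: the function names and hostnames are hypothetical, and the resolver is injectable so the check can be exercised without network access.

```python
import ipaddress
import socket


def resolve(hostname):
    """Resolve a hostname with the system resolver; returns sorted address strings."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def private_resolutions(hostnames, resolver=resolve):
    """Map each hostname to the private addresses it resolves to, if any."""
    bad = {}
    for name in hostnames:
        private = [
            addr for addr in resolver(name)
            # Drop any IPv6 scope id ('%eth0') before parsing the address.
            if ipaddress.ip_address(addr.split("%")[0]).is_private
        ]
        if private:
            bad[name] = private
    return bad


# Hypothetical records standing in for real DNS answers.
fake_dns = {
    "turn-east.example.com": ["10.0.0.5"],
    "turn-west.example.com": ["8.8.8.8"],
}
failures = private_resolutions(fake_dns, resolver=lambda name: fake_dns[name])
```

A non-empty result would block the migration step. Running the check from the migrated server itself matters, because the resolution result depends on where the query originates.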