2023.10.16 Degraded Performance
Incident Report for Violet
Resolved
This incident has been resolved. Apache CXF, the underlying services framework we use, switched from URLConnection to HttpClient, and this change introduced multiple issues. One was a significant memory leak at scale that forced us to perform regular orchestrated restarts of some pods to prevent out-of-memory (OOM) failures. Another was a bug that limited our service clients to a single TLS protocol version, 1.3, which caused connection failures for any self-hosted merchants using older versions of Certbot (Let's Encrypt) that only supported up to TLS 1.2. Reverting to the last version of Apache CXF that still used URLConnection resolved these issues.
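
For readers who want to picture the TLS symptom described above, here is a minimal sketch of a JDK java.net.http.HttpClient configured to offer both TLS 1.2 and TLS 1.3. It is an illustration only, not Apache CXF's internals and not the fix we applied (we reverted the library version). The class name TlsProtocolSketch and the helper buildClient are hypothetical.

```java
import java.net.http.HttpClient;
import java.util.Arrays;
import javax.net.ssl.SSLParameters;

public class TlsProtocolSketch {

    // Hypothetical helper: builds a JDK HttpClient that offers both
    // TLS 1.3 and TLS 1.2 during the handshake, so a server that only
    // supports up to TLS 1.2 can still negotiate a connection.
    static HttpClient buildClient() {
        SSLParameters params = new SSLParameters();
        params.setProtocols(new String[] {"TLSv1.3", "TLSv1.2"});
        return HttpClient.newBuilder()
                .sslParameters(params)
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = buildClient();
        // Print the protocol versions this client is configured to offer.
        System.out.println(Arrays.toString(client.sslParameters().getProtocols()));
    }
}
```

A client restricted to "TLSv1.3" alone would be unable to complete a handshake with a server that supports only TLS 1.2, which matches the connection failures described above.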

Update: We've received approval from Apache to submit a formal bug report.
Posted Oct 18, 2023 - 07:01 PDT
Monitoring
A temporary patch has been implemented and we are monitoring the results. During this time, full-catalog syncs may still fail to complete.
Posted Oct 16, 2023 - 15:12 PDT
Investigating
We are investigating a memory leak in our syncing service that appeared after library upgrades. During this time, catalog syncs may fail. While a recent increase in system load surfaced this issue, it may have been occurring intermittently during moments of peak usage since 2023.09.27.
Posted Oct 16, 2023 - 14:00 PDT
This incident affected: Production API.