Our point of sale (POS), back office web applications (B2B), and scanners suffered outages on 1/12/2024. This was an unfortunate event, and we take responsibility for ensuring an incident with this root cause does not happen again. Below, I explain the chain of events and the actions we are taking to prevent a recurrence.
Our shared POS and B2B/scanner API servers went down on two occasions and recovered partially between failures. For clients on our native payment rails, the POS fell back to offline mode, and 44% of POS transactions platform-wide were recovered that way. Unfortunately, our offline mode is not compatible with Stripe deployments.
From a back-end uptime perspective, we saw 100% unavailability from 12:05 to 1:05 pm PT, recovery to 92% availability until 1:25 pm PT, and 93% unavailability from 1:25 to 3:00 pm PT, after which the back ends returned to expected uptime.
The cause was saturation of the outbound Internet network interface by the secondary backup of our transactional database to an external AWS storage area. The AWS storage slowed down on writes, which caused overlapping backup syncs to queue up on the network.
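To illustrate the failure mode in concrete terms: when a single sync run takes longer than its schedule, the next run starts on top of it, and the overlapping transfers compete for the same outbound link. Below is a minimal sketch of the kind of overlap guard that prevents this pile-up; the paths, bucket name, and command are placeholders for illustration, not our production configuration.

```python
import subprocess
import sys
from pathlib import Path

# Placeholder lock file path; the real backup tooling and paths are internal.
LOCK = Path("/var/run/db_backup.lock")

def run_backup_sync() -> int:
    """Run one backup sync only if the previous one has finished.

    If the external storage slows down and a sync outlives its schedule,
    the next invocation exits instead of stacking another full transfer
    onto the outbound interface.
    """
    if LOCK.exists():
        print("previous sync still running; skipping this cycle", file=sys.stderr)
        return 1
    LOCK.touch()
    try:
        # Illustrative command; "aws s3 sync" is one common way to copy a
        # backup directory to external object storage.
        return subprocess.call(
            ["aws", "s3", "sync", "/backups/transactional", "s3://example-backup-bucket/"]
        )
    finally:
        LOCK.unlink(missing_ok=True)

if __name__ == "__main__":
    sys.exit(run_backup_sync())
```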
The saturation impeded the startup of our API application servers, which serve the POS and back office in our elastic infrastructure. In an elastic environment, we constantly start and shut down servers to accommodate fluctuating traffic demand and control data center costs. Because our servers fetch artifacts over the outbound gateway at startup, new servers could no longer come up to serve requests, and the servers that were still up became overloaded.
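To make the startup dependency concrete, here is a minimal sketch of an artifact fetch with a local-cache fallback, in the spirit of the changes described below; the URL and paths are hypothetical placeholders rather than our actual infrastructure.

```python
import shutil
import urllib.request
from pathlib import Path

# Placeholder locations; the real artifact repository and cache paths are internal.
ARTIFACT_URL = "https://artifacts.example.com/api-server/current.tar.gz"
LOCAL_CACHE = Path("/opt/artifact-cache/current.tar.gz")
DEST = Path("/srv/api-server/current.tar.gz")

def fetch_startup_artifact(timeout_s: float = 30.0) -> Path:
    """Fetch the startup artifact, preferring the network but falling back
    to a local cache so a saturated outbound link does not block startup."""
    try:
        with urllib.request.urlopen(ARTIFACT_URL, timeout=timeout_s) as resp, \
                open(DEST, "wb") as out:
            shutil.copyfileobj(resp, out)
        return DEST
    except OSError:
        # Outbound fetch failed or timed out: start from the last known
        # good local copy instead of refusing to come up.
        shutil.copy(LOCAL_CACHE, DEST)
        return DEST
```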
We mitigated the network interface saturation by disabling the elastic mechanism, starting servers manually, and bypassing the outbound fetches with local copies distributed to the servers over our VPN. In parallel, we disabled the copying of backups to a separate location, our “air-gap” procedure.
Moving forward, we are assessing several network infrastructure changes and reducing the API servers' dependency on the Internet at startup to mitigate such incidents.
In a decade of operation, this is the first time we have seen an event of this kind. We will eliminate the possibility of a repeat event with changes to our physical architecture. We are also working with AWS to understand why the external storage area slowed down on writes. Once we have the full details from AWS and have designed our changes, I will send a second communication explaining them.
Echeyde Cubillo
CTO/co-founder