Platform Service Disruption - POS and Backoffice
Incident Report for ACME Technologies
Postmortem

Our point of sale (POS), back office web applications (B2B), and scanners suffered outages on January 12, 2024. This was an unfortunate event, and we take responsibility for ensuring that the root cause of this incident does not recur. Below, I explain the chain of events and the actions we are taking to prevent a recurrence.

Our shared POS and B2B/scanner API servers went down on two occasions and recovered partially between failures. For clients on our native payment rails, 44% of POS transactions platform-wide were recovered because the POS had switched to offline mode. Unfortunately, our offline mode is not compatible with Stripe deployments.

From a back-end uptime perspective, we saw 100% unavailability from 12:05 to 1:05 pm PT, a recovery to 92% availability until 1:25 pm PT, and 93% unavailability from 1:25 to 3:00 pm PT, after which the backends returned to expected uptime.
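To put the three windows on one scale, the following sketch computes a time-weighted availability and the equivalent minutes of full downtime. It is illustrative arithmetic only: the window boundaries and percentages come from the paragraph above, reading "92%" as availability is an assumption, and the function names are hypothetical.

```python
# Illustrative arithmetic only. Durations and percentages are taken from the
# uptime paragraph; treating "92%" as availability is an assumption.
WINDOWS = [
    # (label, duration in minutes, availability as a fraction)
    ("12:05-1:05 pm PT", 60, 0.00),  # 100% unavailable
    ("1:05-1:25 pm PT", 20, 0.92),   # partial recovery
    ("1:25-3:00 pm PT", 95, 0.07),   # 93% unavailable
]


def weighted_availability(windows):
    """Average availability across the windows, weighted by duration."""
    total_minutes = sum(minutes for _, minutes, _ in windows)
    available_minutes = sum(minutes * avail for _, minutes, avail in windows)
    return available_minutes / total_minutes


def effective_downtime_minutes(windows):
    """Minutes of full outage that would cause the same total loss."""
    return sum(minutes * (1.0 - avail) for _, minutes, avail in windows)


if __name__ == "__main__":
    print(f"weighted availability: {weighted_availability(WINDOWS):.1%}")
    print(f"effective downtime: {effective_downtime_minutes(WINDOWS):.1f} min")
```

Under those assumptions the result is roughly 14% average availability over the 175-minute period, or about 150 minutes of equivalent full downtime. These are derived illustrations, not reported measurements.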

The cause was saturation of the outbound Internet network interface by the secondary backup of our transactional database to an external AWS storage area. Writes to that AWS storage slowed down, which caused overlapping syncs to queue up on the network.
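One common way to keep periodic syncs from piling up when a previous run is slow is a single-run guard: a new run is skipped if an earlier one is still in progress. The sketch below is a minimal illustration of that general technique, not a description of our backup tooling. The names `single_run_lock` and `run_backup_sync`, and the lock-file approach itself, are assumptions made for the example.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_run_lock(lock_path):
    """Yield True if this run acquired the lock, False if another run holds it.

    Hypothetical helper: uses exclusive file creation as the lock. A crashed
    run would leave a stale lock file; real tooling would need to handle that.
    """
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        fd = None

    if fd is None:
        yield False
        return

    try:
        os.write(fd, str(os.getpid()).encode())
        yield True
    finally:
        os.close(fd)
        os.remove(lock_path)


def run_backup_sync(lock_path, sync_fn):
    """Run sync_fn unless a previous sync is still in progress."""
    with single_run_lock(lock_path) as acquired:
        if not acquired:
            return "skipped"  # previous sync still running; do not overlap
        sync_fn()
        return "ran"
```

With a guard like this, a slow storage target delays the next sync instead of adding a second concurrent transfer on the same network interface.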

This saturation prevented the startup of our API application servers, which serve the POS and back office in our elastic infrastructure. In an elastic environment, we constantly start and shut down services to accommodate fluctuating traffic demand and control data center costs. Because our servers fetch artifacts through the outbound gateway at startup, new servers could no longer come up to serve requests, which overloaded the servers that were still running.
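The startup dependency described above can be softened by falling back to a local copy when the remote fetch fails. The following is a minimal sketch of that idea, assuming artifacts are fetched over HTTP and a cached copy exists on local disk. The function name `fetch_artifact`, its parameters, and the remote-first ordering are all hypothetical choices for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_artifact(url, cache_path, dest_path, timeout_s=5.0,
                   opener=urllib.request.urlopen):
    """Fetch an artifact at startup, falling back to a local cached copy.

    Hypothetical helper: tries the remote source with a short timeout, and
    if the network is unavailable uses the last cached copy instead.
    """
    cache_path = Path(cache_path)
    dest_path = Path(dest_path)
    try:
        with opener(url, timeout=timeout_s) as response:
            data = response.read()
    except OSError:
        # URLError and socket timeouts are OSError subclasses.
        if cache_path.exists():
            shutil.copyfile(cache_path, dest_path)
            return "cache"
        raise  # no local copy: surface the failure to the caller

    dest_path.write_bytes(data)
    cache_path.write_bytes(data)  # refresh the cache for the next startup
    return "remote"
```

A cache-first ordering would remove the network from the startup path entirely, at the cost of needing a separate mechanism to keep the cache current.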

We mitigated the network interface failure by disabling the elastic mechanism, manually starting up servers, and bypassing the outbound fetches by copying the artifacts to the servers locally over our VPN. In parallel, we disabled our “air-gap” procedure, which backs up our backups to a different location.

Moving forward, we are assessing several network infrastructure changes and reducing the API servers' dependency on the Internet at startup to mitigate such incidents.

In a decade of operation, this is the first time we have seen this specific kind of event. We will eliminate the possibility of a repeat event through changes to our physical architecture. We are also working with AWS to understand why writes to the external storage area slowed down. Once we have all the details from AWS and have designed our plans, I will send a second communication explaining the changes we are making.

Echeyde Cubillo
CTO/co-founder

Posted Jan 18, 2024 - 07:07 PST

Resolved
The incident has been resolved and all platform operations have normalized.
Posted Jan 12, 2024 - 13:24 PST
Monitoring
Our team has deployed a fix to our infrastructure systems and service is now gradually being restored to all users. Our team will continue to closely monitor to ensure we maintain normal operation parameters.
Posted Jan 12, 2024 - 13:17 PST
Identified
Our team has identified the root issue and a fix for this is currently underway.
Posted Jan 12, 2024 - 12:58 PST
Investigating
We are aware that users are still unable to access our applications. Our team is still working on determining the root cause of the cloud server disruption so we can return to normal operations as quickly as possible.
Posted Jan 12, 2024 - 12:28 PST
Update
We are continuing to monitor for any further issues.
Posted Jan 12, 2024 - 11:57 PST
Monitoring
A cloud server used to access Backoffice, ACME AC and ACME Sales was experiencing an intermittent performance issue. Service is now being returned to most users. Our team will continue monitoring to see if additional action is needed.
Posted Jan 12, 2024 - 11:51 PST
Investigating
ACME is experiencing a service disruption that is impacting multiple platform components. ACME E-commerce service is still operational.

Our engineering team is investigating the issue and we will update this incident as soon as possible.
Posted Jan 12, 2024 - 11:37 PST
This incident affected: ACME Platform (Access Control, ACME Backoffice (B2B), ACME Sales POS application).