Resilience Notes

Background

The budget app uses DynamoDB. Any call can fail, and assuming it will just work is too optimistic.

Calls should be allowed to fail without leaving inconsistent data behind. It’s OK to tell the user that something didn’t work. It’s not OK to leave partial or incorrect data behind.

Transaction Generation

The system collects events. Persisting an event is a single call. If the call fails, no damage is done: the user is informed and can try again or not.

Transactions are created, updated and deleted by processing events. This should be separate from persisting events.

For example, if we managed to persist a transaction creation event but the call to create and persist the transaction failed, we’re left with partial data.

It is possible to “fix” this with DynamoDB transactions: whenever an event is persisted, also make the change it implies in the same DynamoDB transaction. That works, but it complicates things - now updating transactions is coupled with event persistence.

A (maybe) better approach is to have a separate, resilient process for updating transactions, one that can fail and leave consistent data.

Event Information

Each transaction should include the id of the last event that touched the transaction.

Event Pointer (Transaction)

There will be a separate event pointer that holds the last event that was applied (or no record if that hasn’t happened yet).
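
As a rough illustration of the two pieces of bookkeeping described so far, the records might look like the following. All class and field names here are hypothetical (the notes do not fix a schema), and "fields" stands in for whatever a budget transaction actually carries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    event_id: str        # ULID, so ids sort in creation order
    kind: str            # "create", "update" or "delete"
    transaction_id: str
    fields: dict = field(default_factory=dict)


@dataclass
class Transaction:
    transaction_id: str
    fields: dict
    last_event_id: str   # id of the last event that touched this transaction


@dataclass
class EventPointer:
    last_applied_event_id: str   # id of the last event that was applied


def is_caught_up(pointer: Optional[EventPointer],
                 last_event_id: Optional[str]) -> bool:
    """True when the pointer matches the last persisted event.

    A missing pointer (None) means no event has been applied yet, which is
    only "caught up" when there are no events either.
    """
    applied = pointer.last_applied_event_id if pointer else None
    return applied == last_event_id
```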

Update Process

The update process:

Fetch the last event (events are ordered by ULID, so it’s a simple query)
Fetch the event pointer
While the event pointer is not the last event, in a single transaction:
- Apply the event (create, update or delete transaction)
- Update event pointer conditionally on it being the previous value (or non-existent)
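
The steps above can be sketched with an in-memory stand-in for the tables. This is a minimal sketch, not the real implementation: Store, apply_atomically and catch_up are hypothetical names, the pointer check plays the role of the conditional write, and an "update" event is assumed to replace the transaction data wholesale. The lookup of the next event after the pointer is also an assumption, since the notes only mention fetching the last event.

```python
class ConditionFailed(Exception):
    """Raised when the pointer is not the value the caller expected."""


class Store:
    """In-memory stand-in for the tables; not a real DynamoDB client."""

    def __init__(self):
        self.events = {}        # event id (ULID string) -> event dict
        self.transactions = {}  # transaction id -> transaction dict
        self.pointer = None     # id of the last applied event, None if no record

    def last_event_id(self):
        return max(self.events) if self.events else None

    def next_event_after(self, event_id):
        later = sorted(e for e in self.events if event_id is None or e > event_id)
        return self.events[later[0]] if later else None

    def apply_atomically(self, event, expected_pointer):
        """Apply one event and advance the pointer, or change nothing."""
        if self.pointer != expected_pointer:
            # Someone else got there first; nothing has been modified.
            raise ConditionFailed(event["id"])
        if event["kind"] == "delete":
            self.transactions.pop(event["transaction_id"], None)
        else:  # "create" or "update"
            self.transactions[event["transaction_id"]] = {
                **event["data"], "last_event_id": event["id"]}
        self.pointer = event["id"]


def catch_up(store):
    """Apply events one at a time until the pointer reaches the last event."""
    last = store.last_event_id()
    pointer = store.pointer
    while pointer != last:
        event = store.next_event_after(pointer)
        if event is None:  # nothing left to apply
            break
        store.apply_atomically(event, expected_pointer=pointer)
        pointer = event["id"]
    return pointer
```

If apply_atomically raises part-way through, every event before it has been fully applied and the pointer says exactly how far the process got, so a later call to catch_up simply resumes from there.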

If this (DynamoDB) transaction fails, we’re still in a consistent state. The only thing is that the event pointer does not match the last event - this can be checked, and if we’re trying to fetch transactions we can report an error to the user.

If the (DynamoDB) transaction was successful then we’ve applied the event and updated the pointer, and we’re sure that another process (lambda) didn’t get there before us, because of the conditional write. For “belt and braces” we can also add a condition that the ID of the event being applied is greater than the transaction’s existing “last event to touch the transaction” value. This should hold because we’re using ULIDs, which sort in creation order. It protects against something else updating the transaction first (but that shouldn’t happen anyway).
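
One way the two conditions could be expressed is as the TransactItems list for DynamoDB's TransactWriteItems operation. The sketch below only builds the request and does not call AWS. The table names, the "pk" key attribute, and the "last_applied" and "last_event_id" attribute names are all assumptions made for illustration. It covers create and update events only; a delete event would use a "Delete" entry with the same kind of condition.

```python
from typing import Optional

# Hypothetical table and attribute names, for illustration only.
TRANSACTIONS_TABLE = "transactions"
POINTER_TABLE = "event_pointer"
POINTER_KEY = {"pk": {"S": "transaction-pointer"}}


def build_apply_request(event_id: str, previous_pointer: Optional[str],
                        transaction_item: dict) -> list:
    """Build TransactItems that upsert a transaction and advance the pointer."""
    put_transaction = {
        "Put": {
            "TableName": TRANSACTIONS_TABLE,
            "Item": {**transaction_item, "last_event_id": {"S": event_id}},
            # Belt and braces: only overwrite state left by an older event.
            "ConditionExpression":
                "attribute_not_exists(last_event_id) OR last_event_id < :event_id",
            "ExpressionAttributeValues": {":event_id": {"S": event_id}},
        }
    }
    if previous_pointer is None:
        # No pointer record yet: it must still be absent when we write.
        condition = "attribute_not_exists(pk)"
        values = {":event_id": {"S": event_id}}
    else:
        # The pointer must still hold the value we read earlier.
        condition = "last_applied = :previous"
        values = {":event_id": {"S": event_id},
                  ":previous": {"S": previous_pointer}}
    move_pointer = {
        "Update": {
            "TableName": POINTER_TABLE,
            "Key": POINTER_KEY,
            "UpdateExpression": "SET last_applied = :event_id",
            "ConditionExpression": condition,
            "ExpressionAttributeValues": values,
        }
    }
    return [put_transaction, move_pointer]
```

With boto3, the returned list would be passed as client.transact_write_items(TransactItems=...). If either condition fails, the whole transaction is cancelled (boto3 raises TransactionCanceledException) and neither write is applied.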

If the process fails, we can apply it again asynchronously (that is, catch the failure and put an event on a queue to attempt it again). We can also try it again when we fetch transactions and check the event pointer. In fact we can just run it regardless: it’ll check the pointer, try to update if required, and return success or error.
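
The two retry paths might be wired up roughly as follows. This is a sketch under assumptions: the update process, the queue and the transaction query are passed in as plain callables, and the names apply_or_enqueue, fetch_transactions and TransactionsNotCurrent are hypothetical.

```python
from typing import Callable, List


class TransactionsNotCurrent(Exception):
    """Raised when the pointer could not be brought up to the last event."""


def apply_or_enqueue(catch_up: Callable[[], None],
                     enqueue_retry: Callable[[], None]) -> bool:
    """Write path: try the update process now; on failure, queue a retry."""
    try:
        catch_up()
        return True
    except Exception:
        enqueue_retry()
        return False


def fetch_transactions(catch_up: Callable[[], None],
                       load_transactions: Callable[[], List[dict]]) -> List[dict]:
    """Read path: always run the update process, then read or report an error."""
    try:
        catch_up()
    except Exception as exc:
        raise TransactionsNotCurrent(
            "transactions are behind the event log") from exc
    return load_transactions()
```

Running the update process unconditionally on the read path keeps the caller simple: when the pointer already matches the last event the call does nothing, and when it cannot catch up the user gets an error instead of stale transactions.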

Snapshot Update

The same consideration is true for snapshots - we don’t want to couple snapshot update to persisting events or to updating transactions, but if we do update a transaction we need to leave some information behind to know that snapshots must be invalidated and updated.