Problem
A multi-step workflow had to process onboarding, verification, and fulfillment across several services. The original synchronous approach stalled when any downstream system slowed or failed. Teams needed to trace each step, replay failures, and keep processing even when dependencies were intermittently unavailable.
Constraints
The system had to remain operational under partial outages, guarantee at-least-once delivery, and provide an audit trail of every state transition. Compliance demanded traceable retries and evidence that no messages were lost.
Architecture
Diagram (placeholder): RabbitMQ exchanges per domain, worker pools, and an outbox publisher feeding a central audit log.
The workflow used a RabbitMQ-based event bus with explicit routing keys for each state transition. Producers emitted durable events through an outbox table, and consumers used idempotency keys to guard side effects. Dead-letter queues captured failures with structured metadata for replays.
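A minimal sketch of the outbox relay side of this pattern, using the pika client. The table name (outbox), its columns, the exchange name, and the SQLite stand-in for the application database are all illustrative assumptions, not the project's actual schema.

```python
# Outbox relay sketch: read unpublished rows from an outbox table and publish
# them as persistent messages, marking each row only after the broker confirms.
# Table/column names (outbox, id, routing_key, payload, published_at) and the
# "workflow" exchange are illustrative assumptions.
import pika
import sqlite3  # stand-in for the real application database

conn = sqlite3.connect("app.db")
mq = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = mq.channel()
channel.exchange_declare(exchange="workflow", exchange_type="topic", durable=True)
channel.confirm_delivery()  # publisher confirms: basic_publish blocks until the broker accepts

def relay_outbox() -> None:
    """Publish pending outbox rows in insertion order; rows stay pending until confirmed."""
    rows = conn.execute(
        "SELECT id, routing_key, payload FROM outbox WHERE published_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, routing_key, payload in rows:
        channel.basic_publish(
            exchange="workflow",
            routing_key=routing_key,
            body=payload.encode("utf-8"),  # payload stored as JSON text
            properties=pika.BasicProperties(
                delivery_mode=2,           # persistent message
                message_id=str(row_id),    # doubles as the idempotency key downstream
                content_type="application/json",
            ),
        )
        conn.execute(
            "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?", (row_id,)
        )
        conn.commit()
```

If the relay crashes between publish and the UPDATE, the row is published again on the next pass, which is exactly the at-least-once behaviour the consumers are built to tolerate.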
Key decisions & trade-offs
We chose at-least-once delivery with consumer-side deduplication over exactly-once semantics: exactly-once across independent services would have required extra coordination machinery that is hard to observe and debug, whereas redeliveries guarded by idempotency keys are easy to trace. Message payloads were kept small and versioned so producers and consumers could evolve independently, trading some latency for forward compatibility.
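A consumer-side sketch of that deduplication, assuming the message_id from the outbox relay is the idempotency key. The processed_messages table, the handle() stub, the queue and exchange names, and the schema_version field are illustrative assumptions.

```python
# Consumer sketch for at-least-once delivery: the broker may redeliver, so side
# effects run only once per message_id. Names below are illustrative.
import json
import sqlite3

import pika

db = sqlite3.connect("app.db")
db.execute("CREATE TABLE IF NOT EXISTS processed_messages (message_id TEXT PRIMARY KEY)")

def handle(event: dict) -> None:
    """Real side effect: call the verification service, write a fulfillment record, etc."""

def on_message(ch, method, properties, body):
    msg_id = properties.message_id
    if db.execute(
        "SELECT 1 FROM processed_messages WHERE message_id = ?", (msg_id,)
    ).fetchone():
        ch.basic_ack(delivery_tag=method.delivery_tag)  # duplicate delivery: drop it
        return
    event = json.loads(body)
    if event.get("schema_version", 1) > 1:
        # Unknown future version: dead-letter instead of guessing at semantics.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        return
    handle(event)
    # In production the side effect and this marker would commit in one transaction.
    db.execute("INSERT INTO processed_messages (message_id) VALUES (?)", (msg_id,))
    db.commit()
    ch.basic_ack(delivery_tag=method.delivery_tag)

mq = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = mq.channel()
channel.queue_declare(
    queue="workflow.verification",
    durable=True,
    arguments={"x-dead-letter-exchange": "workflow.dlx"},  # nacked messages go to the DLX
)
channel.basic_consume(queue="workflow.verification", on_message_callback=on_message)
channel.start_consuming()
```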
Security & compliance
All queue access was scoped per service account, with short-lived credentials and audit logging for message consumption. Sensitive payloads were encrypted before publish and redacted in operational tools.
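One way the publish-side encryption and redaction could look, as a sketch using the cryptography package's Fernet primitive. The field names, key handling, and helper names are assumptions; in practice the key would come from the short-lived-credentials service rather than being generated inline.

```python
# Field-level encryption sketch: sensitive fields are encrypted before publish,
# so queue tooling and operational dashboards only ever see ciphertext.
# Field names and key handling are illustrative assumptions.
import json

from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())  # in practice the key comes from a secrets manager

SENSITIVE_FIELDS = {"email", "national_id"}  # illustrative field names

def encrypt_for_publish(event: dict) -> bytes:
    """Return the event as JSON bytes with sensitive fields replaced by ciphertext."""
    protected = dict(event)
    for field in SENSITIVE_FIELDS & protected.keys():
        protected[field] = fernet.encrypt(str(protected[field]).encode()).decode()
    return json.dumps(protected).encode()

def decrypt_for_processing(body: bytes) -> dict:
    """Consumers holding the key recover plaintext; operational tools skip this step."""
    event = json.loads(body)
    for field in SENSITIVE_FIELDS & event.keys():
        event[field] = fernet.decrypt(event[field].encode()).decode()
    return event
```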
Outcome
Processing latency became predictable, retries followed clear semantics, and failures were visible rather than silent. Operations gained the ability to replay stuck workflows without engineering intervention.
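A sketch of what that operator-driven replay can look like: drain a dead-letter queue and republish each message to its original exchange. The queue and exchange names are illustrative, and a real tool would also record who replayed what for the audit trail.

```python
# Replay sketch: republish everything currently parked in a dead-letter queue.
# Queue/exchange names are illustrative assumptions.
import pika

mq = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = mq.channel()

def replay_dead_letters(dlq: str = "workflow.dlq", exchange: str = "workflow") -> int:
    """Republish every message in the DLQ to its original routing key; return the count."""
    replayed = 0
    while True:
        method, properties, body = channel.basic_get(queue=dlq, auto_ack=False)
        if method is None:  # queue drained
            break
        channel.basic_publish(
            exchange=exchange,
            routing_key=method.routing_key,  # RabbitMQ preserves the original key by default
            body=body,
            properties=properties,  # keeps message_id, so consumer dedup still applies
        )
        channel.basic_ack(delivery_tag=method.delivery_tag)
        replayed += 1
    return replayed
```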
What I’d improve next
Introduce fine-grained replay tooling so compliance can isolate specific workflows without re-emitting unrelated messages.