Building Reliable Event-Driven Systems
Event-driven architecture sounds great in theory. In practice, it introduces a whole new class of problems. Here's what I've learned about making it work -- with real patterns, real trade-offs, and the mistakes I've seen teams repeat.
- Event-driven architecture decouples producers from consumers -- but trades application complexity for infrastructure complexity.
- The hard problems are ordering, idempotency, observability, and schema evolution. Solve them before you ship.
- Three patterns you will need: Outbox, Dead Letter Queue, and Correlation IDs.
- Not every system needs events. Simple CRUD apps, strong-consistency requirements, and small teams are often better off without them.
Why Event-Driven?
Most applications start simple: a request comes in, you process it, you return a response. Synchronous, predictable, easy to debug. But as systems grow, this model starts to creak.
You add a notification service. Now your API handler sends an email after processing an order. Then you add analytics tracking. Then inventory updates. Then partner webhooks. Suddenly, your “simple” order endpoint is doing six things, and if any one of them fails, the whole request fails.
The synchronous trap
POST /orders → save order → send email → update inventory → notify partner → track analytics
If any step fails, the entire request fails. Latency compounds with each addition.
Event-driven architecture is the answer to this problem. Instead of doing everything in one request, you publish an event (“order.created”) and let other services react to it independently. The producer does not know — or care — who is listening.
Key Insight
Event-driven is not about events. It is about decoupling the fact that something happened from the decision of what to do about it. The order service announces that an order was created. What happens next is someone else’s problem.
The Promise
The appeal is real and measurable. Here is what you actually gain when event-driven architecture is done well:
Loose Coupling
Services don’t need to know about each other. The order service publishes an event; it doesn’t care who consumes it. Adding a new consumer means deploying a new service, not modifying existing code.
Independent Scalability
Each consumer scales based on its own load. Your email service can handle 100 messages/sec while your analytics pipeline processes 10,000/sec. They don’t contend for the same resources.
Resilience
If the notification service is down, orders still get processed. Notifications catch up when the service recovers. The message broker acts as a buffer between producers and consumers.
Extensibility
Adding a new reaction to an event means adding a new consumer, not modifying existing code. Your order service does not change when you decide to also send SMS confirmations or update a data warehouse.
The Reality: Four Hard Problems
In practice, event-driven systems introduce problems that are genuinely hard. I have seen all four of these bite teams who adopted events without preparing for them. They are not edge cases; they are the default experience.
[Figure: Complexity distribution. Where teams spend debugging time.]
Problem 1: Ordering Guarantees
Events can arrive out of order. If you publish “order.created” followed by “order.updated”, there is no guarantee the consumer sees them in that sequence. This matters more than you think.
At an e-commerce company I worked with, a race condition between “order.paid” and “order.shipped” events caused the fulfillment system to occasionally attempt shipping before payment was confirmed. The bug only appeared under load — about once per 2,000 orders — and took three weeks to diagnose.
Use partition keys in your message broker. Kafka does this well: all events for the same entity go to the same partition, preserving order within that entity. The key is usually the entity ID (order ID, user ID, etc.).
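To make the partition-key idea concrete, here is a minimal sketch in plain Python with no broker involved. The function name `partition_for`, the partition count of 4, and the use of SHA-256 are assumptions for illustration only; Kafka's default partitioner uses a different hash (murmur2), but the principle is the same: the same key always maps to the same partition.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map an entity key to a partition deterministically (illustrative hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Hypothetical stream of (entity key, event type) pairs in publish order.
events = [
    ("order-1", "order.created"),
    ("order-2", "order.created"),
    ("order-1", "order.paid"),
    ("order-1", "order.shipped"),
]

# Each partition is an append-only list, so order within a partition is preserved.
partitions = {p: [] for p in range(4)}
for key, event_type in events:
    partitions[partition_for(key, 4)].append((key, event_type))

order_1_partition = partition_for("order-1", 4)
order_1_events = [t for k, t in partitions[order_1_partition] if k == "order-1"]
print(order_1_events)  # ['order.created', 'order.paid', 'order.shipped']
```

Ordering holds only within one key. Events for different orders may still interleave in any way, which is usually acceptable.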
Problem 2: Exactly-Once Processing
Messages can be delivered more than once. Network hiccups, consumer restarts, broker retries — all of these can cause a message to be delivered again. Your consumer needs to handle this gracefully.
The property you need is idempotency: processing the same event twice must have the same effect as processing it once. It is harder than it sounds. A payment service that processes the same “charge customer” event twice will charge them twice. An inventory service that processes “decrement stock” twice will show the wrong count.
Use an idempotency key (usually the event ID) and store processed event IDs in your database. Check before processing, and record the ID in the same transaction as the side effect so that two concurrent deliveries cannot both pass the check. This adds a database lookup to every event, but it is the only reliable approach I have found.
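A minimal sketch of that check, using an in-memory SQLite database as a stand-in for your real one. The table names `processed_events` and `charges`, the event fields, and the function `handle_charge` are all hypothetical. The event ID insert and the side effect share one transaction, so a redelivery fails on the primary key and nothing is applied twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE charges (customer_id TEXT, amount_cents INTEGER)")

def handle_charge(event: dict) -> bool:
    """Apply the charge once; return False if the event was already processed."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "INSERT INTO charges (customer_id, amount_cents) VALUES (?, ?)",
                (event["customer_id"], event["amount_cents"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate event_id: acknowledge without acting

event = {"event_id": "evt-1", "customer_id": "cust-42", "amount_cents": 1999}
first = handle_charge(event)
second = handle_charge(event)  # simulated redelivery
charge_count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
print(first, second, charge_count)  # True False 1
```

This only works when the side effect lives in the same database as the idempotency table. For external side effects such as calling a payment provider, you would also need to pass the idempotency key to that provider, if it supports one.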
Problem 3: Debugging and Observability
When something goes wrong in a synchronous system, you get a stack trace. In an event-driven system, you get… nothing obvious. The event was published, but was it consumed? By which service? Did it fail silently? Is it sitting in a retry queue?
I once spent two days tracking down a bug where events were being consumed but silently dropped because a schema change in the producer did not match the consumer’s expectations. There was no error log because the consumer’s deserialization caught the exception and logged it at DEBUG level.
Correlation IDs. Generate a unique ID at the entry point (the API request) and propagate it through every event. Log it everywhere. When something goes wrong, you can trace the entire flow across services.
Problem 4: Schema Evolution
Events are contracts. Once you publish an event with a certain shape, consumers depend on that shape. Changing it is like changing a public API — you need versioning, backward compatibility, and a migration plan.
The real danger is that unlike API changes (where you get immediate errors), schema mismatches in events can fail silently. A consumer might ignore unknown fields, or worse, silently use a default value instead of the new field you just added.
Use a schema registry (Confluent Schema Registry for Kafka, or roll your own). Enforce backward compatibility at the schema level. Never remove fields; deprecate them. Version your event types.
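As a rough illustration of what "enforce compatibility at the schema level" can mean, the sketch below checks a proposed schema against the current one using three simplified rules drawn from the advice above: no removed fields, no changed types, and no new required fields without a default. The schema format and the function `compatibility_errors` are invented for this example; a real schema registry applies its own, more detailed rules.

```python
def compatibility_errors(old: dict, new: dict) -> list:
    """Return human-readable problems that would break existing consumers."""
    errors = []
    for name, spec in old.items():
        if name not in new:
            errors.append(f"field removed: {name}")
        elif new[name]["type"] != spec["type"]:
            errors.append(f"type changed: {name}")
    for name, spec in new.items():
        is_new_required = name not in old and spec.get("required", False)
        if is_new_required and "default" not in spec:
            errors.append(f"new required field without default: {name}")
    return errors

# Hypothetical schemas for an order.created event.
v1 = {
    "order_id": {"type": "string", "required": True},
    "total_cents": {"type": "int", "required": True},
}
v2_ok = {**v1, "coupon_code": {"type": "string", "required": False}}
v2_bad = {
    "order_id": {"type": "string", "required": True},
    "total": {"type": "float", "required": True},
}

print(compatibility_errors(v1, v2_ok))   # []
print(compatibility_errors(v1, v2_bad))
# ['field removed: total_cents', 'new required field without default: total']
```

A check like this belongs in CI for the producer, so an incompatible change fails the build instead of failing silently in a consumer.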
Patterns That Work
After building several event-driven systems, here are the three patterns I keep coming back to. Each one addresses a specific failure mode, and together they form a reliable foundation.
The Outbox Pattern
Instead of publishing events directly from your application code, write them to an “outbox” table in your database as part of the same transaction that writes your business data. A separate process (the “relay”) reads the outbox and publishes to the message broker.
This guarantees that your data and your events are consistent. If the transaction rolls back, the event is never published. If the broker is down, events accumulate in the outbox and get published when it recovers.
BEGIN TX → INSERT order + INSERT outbox_event → COMMIT
Relay process: poll outbox → publish to broker → mark as sent
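The two steps above can be sketched with an in-memory SQLite database. Table names, column names, and the `publish` callback are assumptions; in a real system `publish` would wrap your broker client and the relay would run in a loop or a separate process.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "event_type TEXT, payload TEXT, sent INTEGER DEFAULT 0)"
)

def create_order(order_id: str, total_cents: int) -> None:
    """Write the order and its event in one transaction."""
    with conn:  # both inserts commit together or not at all
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"order_id": order_id})),
        )

def relay_once(publish) -> int:
    """Publish unsent rows in insertion order; mark each sent after success."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # raises if the broker is down
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)

published = []
create_order("order-1", 4999)
relay_once(lambda event_type, payload: published.append((event_type, payload)))
print(published)  # [('order.created', {'order_id': 'order-1'})]
```

Note that the relay is at-least-once: if it crashes after publishing but before marking the row as sent, the event is published again on the next poll. That is one more reason consumers need to be idempotent.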
Dead Letter Queues
When a consumer fails to process a message after N retries, move it to a dead letter queue (DLQ) instead of dropping it. This gives you a place to inspect failures, fix bugs, and replay messages. Without a DLQ, failed messages vanish silently — and you discover the data loss days or weeks later.
Set up alerts on DLQ depth, but set sensible thresholds. A few messages per day might be normal (transient errors, bad data). Hundreds per hour means something is broken. If you alert on every single DLQ message, your team will start ignoring the alerts within a week.
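Here is a small, broker-free sketch of the retry-then-park behavior. The limit of three attempts stands in for the "N" above and is an assumed value, as are the function and field names. Most brokers offer DLQ support as configuration, so treat this as an illustration of the logic rather than something to hand-roll.

```python
MAX_ATTEMPTS = 3  # the "N" from the text; an assumed value

def consume(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Process messages; park those that keep failing in a dead letter queue."""
    dead_letters = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                break  # processed successfully, move to the next message
            except Exception as exc:
                if attempt == max_attempts:
                    # Keep the payload and the error so it can be inspected and replayed.
                    dead_letters.append(
                        {"message": message, "error": repr(exc), "attempts": attempt}
                    )
    return dead_letters

def handler(message):
    if message.get("amount") is None:
        raise ValueError("missing amount")

dlq = consume([{"id": 1, "amount": 10}, {"id": 2}], handler)
print([entry["message"]["id"] for entry in dlq])  # [2]
```

A production version would add a delay between attempts and distinguish transient errors from permanently bad messages, which can go to the DLQ without retrying.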
Correlation IDs and Distributed Tracing
Generate a unique correlation ID at the system entry point (usually the API gateway or the first service in the chain). Propagate it through every event and every downstream call. Log it everywhere. This is your lifeline when debugging production issues.
Without correlation IDs
“Order 12345 was not fulfilled.” → check 8 services, 40 log streams, find nothing conclusive. Resolution: 2 days.
With correlation IDs
“Order 12345, correlation abc-789.” → one query shows the full event chain, the exact failure point, and the error message. Resolution: 15 minutes.
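A minimal sketch of the propagation, assuming events are plain dictionaries and logs are structured records. The helper names `new_event` and `log` and the field names are hypothetical. The important part is that a downstream event copies the correlation ID of the event that triggered it instead of generating a new one.

```python
import uuid
from typing import Optional

logs = []  # stand-in for a centralized log store

def new_event(event_type: str, payload: dict, correlation_id: Optional[str] = None) -> dict:
    """Build an event; reuse the caller's correlation ID when one is given."""
    return {
        "event_id": str(uuid.uuid4()),
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "type": event_type,
        "payload": payload,
    }

def log(service: str, message: str, event: dict) -> None:
    """Record a structured log line that always carries the correlation ID."""
    logs.append(
        {"service": service, "correlation_id": event["correlation_id"], "message": message}
    )

# Entry point: the first event creates the correlation ID.
created = new_event("order.created", {"order_id": "12345"})
log("orders", "order created", created)

# Downstream: the fulfillment service reacts and propagates the same ID.
shipped = new_event("order.shipped", {"order_id": "12345"}, created["correlation_id"])
log("fulfillment", "order shipped", shipped)

# The "one query": filter every log line by correlation ID.
trace = [line for line in logs if line["correlation_id"] == created["correlation_id"]]
print([line["service"] for line in trace])  # ['orders', 'fulfillment']
```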
Worked Comparison: Three Messaging Approaches
Not all event-driven architectures are built the same. Here is a practical comparison of three common approaches I have evaluated and deployed. The ratings in the table reflect reliability, setup time, and operational overhead for a mid-sized engineering team (15-30 engineers).
Direct publish (fire-and-forget)
Publish events inline in application code. Simplest to implement. No delivery guarantees if the broker is down.
Outbox + relay
Write events to a DB outbox table in the same transaction. A relay process publishes them. Strong consistency.
CDC (Change Data Capture)
Stream database changes (e.g., Debezium on Postgres WAL). Zero application code changes. High infra complexity.
| Approach | Reliability | Setup Time | Ops Overhead | Best For |
|---|---|---|---|---|
| A: Direct publish | Low | 1-2 days | Minimal | Prototypes, non-critical events |
| B: Outbox + relay | High | 1-2 weeks | Moderate | Most production systems |
| C: CDC (Debezium) | High | 2-4 weeks | Significant | Legacy systems, zero-code-change needs |
The Outbox pattern wins for most teams because it balances reliability with operational simplicity. You get strong consistency guarantees (the event is part of the same database transaction as the business data) without the infrastructure complexity of CDC. Direct publish is fine for prototypes and non-critical events like analytics, but it will lose messages when the broker is unavailable. CDC is powerful for legacy systems where you cannot modify application code, but running Debezium in production requires dedicated infrastructure expertise.
Real-World Case Studies
The patterns above are theoretical until you see them succeed and fail in practice. Here are three situations I have been part of, with the specific decisions that made or broke the implementation.
The e-commerce checkout that stopped losing orders
An e-commerce platform was using direct publish for order events. Under normal load, it worked fine. During Black Friday, the message broker hit capacity limits and started rejecting messages. The team estimated they lost about 340 orders worth approximately $47,000 in a four-hour window.
The fix was the Outbox pattern. Orders were written to the database (which they were already doing) along with an outbox row in the same transaction. A relay process published events at a controlled rate. During the next sale event, the broker went down for 20 minutes. Zero orders were lost — they queued in the outbox and published when the broker recovered.
Lesson: The Outbox pattern’s value is not measured during normal operations. It is measured during the worst 0.1% of operating time, when everything else is breaking.
The payment service that charged customers twice
A fintech company’s payment consumer did not implement idempotency. When the message broker retried a batch of messages after a network partition, 23 customers were charged twice — totaling about $3,200 in overcharges. The incident required manual refunds, customer apology emails, and a post-mortem that consumed a week of the team’s time.
The fix was straightforward but required discipline: an idempotency table keyed on event ID. Before processing any payment event, the consumer checks if the event ID has already been processed. If yes, it acknowledges the message without acting on it. The table added about 2ms of latency per event — a negligible cost compared to the alternative.
Lesson: Idempotency is not optional. Every consumer that has side effects (payments, emails, inventory changes) must be idempotent. It is cheaper to add it upfront than to deal with the fallout of duplicate processing.
The microservices migration that was traced back to sanity
A team migrating from a monolith to microservices initially had no distributed tracing. When a user reported that their profile update was not reflected in search results, diagnosing the issue required checking logs across 6 services manually. The median time to resolve these issues was about 4 hours.
After implementing correlation IDs and integrating with Jaeger for distributed tracing, the same class of issue took 15-20 minutes to diagnose. The correlation ID, generated at the API gateway and propagated through every event, made it possible to see the entire lifecycle of a request in a single trace view.
Lesson: Correlation IDs are not a “nice to have.” They are as essential as logging itself. The cost of implementing them is hours; the cost of not having them is days of debugging per incident.
Implementation Checklist
If you are adopting event-driven architecture, here is the order in which I recommend tackling the implementation. Each step builds on the previous one.
Define your event contracts
Before writing any code, document the events your system will produce and consume. Include the event name, payload schema, and who the producer is. Use a schema registry from day one, even if you only have two services. Retrofitting one later is painful.
Implement the Outbox pattern for critical events
Start with the events that have business consequences if lost (orders, payments, user actions). Less critical events (analytics, activity feeds) can use direct publish initially and migrate later.
Make every consumer idempotent
Add an idempotency check to every consumer. An event_id column in a processed_events table is the simplest approach. Test it by deliberately publishing duplicate messages and verifying that side effects only happen once.
Add correlation IDs from the start
Generate a correlation ID at the API gateway. Include it in every event payload. Log it in every service. Integrate with a tracing tool (Jaeger, Zipkin, or Datadog APM). This investment pays for itself on the first production incident.
Set up Dead Letter Queues and alerts
Configure a DLQ for every consumer. Set up monitoring on DLQ depth with sensible thresholds. Build a simple admin tool to inspect and replay DLQ messages. You will need this sooner than you think.
Add consumer lag monitoring
Track the gap between the latest published event and the latest consumed event for each consumer group. Alert if the lag exceeds a threshold (e.g., 5 minutes for real-time consumers, 1 hour for batch consumers). Consumer lag is the single best early warning signal for event-driven systems.
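For the last step, a time-based lag check can be as simple as the sketch below. It assumes you can obtain the timestamp of the latest published event and of the latest consumed event for a consumer group; how you get those depends on your broker and is not shown. The thresholds are the example values from the text, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Example thresholds from the text: 5 minutes for real-time, 1 hour for batch.
THRESHOLDS = {"realtime": timedelta(minutes=5), "batch": timedelta(hours=1)}

def consumer_lag(latest_published_at: datetime, latest_consumed_at: datetime) -> timedelta:
    """Gap between the newest published and newest consumed event (never negative)."""
    return max(latest_published_at - latest_consumed_at, timedelta(0))

def should_alert(lag: timedelta, consumer_kind: str) -> bool:
    """True when the lag exceeds the threshold for this kind of consumer."""
    return lag > THRESHOLDS[consumer_kind]

published_at = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
consumed_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

lag = consumer_lag(published_at, consumed_at)
print(lag, should_alert(lag, "realtime"), should_alert(lag, "batch"))
# 0:30:00 True False
```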
Where It Goes Wrong
Event-driven architecture is not complicated in theory, but I have seen these recurring mistakes undermine real implementations.
Eventing everything
Not every interaction needs to be asynchronous. If Service A needs an immediate response from Service B (e.g., “is this user authorized?”), a synchronous call is simpler and more reliable. Forcing everything through events creates unnecessary complexity and latency. Use events for things that can happen eventually, not things that need to happen right now.
Ignoring consumer lag
A consumer that falls behind is not just slow — it is a ticking time bomb. If your order consumer is 30 minutes behind, you have 30 minutes of orders that have not been fulfilled. If you are not monitoring consumer lag, you do not know you have a problem until customers complain. By then, you have a backlog that will take hours to clear.
Fat events with too much data
Putting the entire entity in every event (the “fat event” anti-pattern) creates tight coupling through the back door. When you change the order schema, every consumer that deserializes the full order object breaks. Instead, publish thin events with just the entity ID and event type, and let consumers fetch the data they need. Or use a hybrid: include only the fields that changed.
No schema versioning
Events evolve. If you do not version your event schemas from the start, you will eventually deploy a producer change that breaks every consumer simultaneously. Add a version field to every event. Maintain backward compatibility for at least two versions. Use a schema registry to enforce this automatically.
Testing only the happy path
Most teams test that events are published and consumed correctly. Few test what happens when the broker is down, when messages arrive out of order, when a consumer crashes mid-processing, or when a message is delivered twice. These failure modes are not edge cases in event-driven systems — they are the normal operating conditions. Test them explicitly.
Treating the broker as infinitely reliable
Kafka, RabbitMQ, and SQS are all highly reliable — but none of them are infallible. Broker restarts, network partitions, disk failures, and configuration mistakes all happen. Your system needs to handle broker unavailability gracefully. The Outbox pattern is one answer. Circuit breakers on the publish path are another. Assuming the broker will always be up is how you lose data.
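To illustrate the circuit breaker mentioned in the last item, here is a minimal single-threaded sketch. The class and parameter names are hypothetical and the defaults are arbitrary. After a number of consecutive failures the breaker stops calling the broker for a cool-down period, then lets one trial call through.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("publish path is open; broker not called")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker.call(publish, event)`, where `publish` is your own broker call. When `CircuitOpenError` is raised, the caller still has to decide what to do with the event, for example leaving it in the outbox for later.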
When Not to Use Events
Event-driven architecture is a tool, not a religion. Here are three situations where I would recommend against it:
| Scenario | Why Events Are Overkill | Better Alternative |
|---|---|---|
| Simple CRUD app | Minimal side effects, no downstream consumers | Direct database calls |
| Strong consistency required | Need immediate confirmation across services | Synchronous API calls + saga pattern |
| Small team (< 5 engineers) | Operational overhead exceeds the benefit | Monolith with in-process events |
The third point deserves emphasis. Running a message broker, monitoring consumers, debugging distributed flows, managing schema evolution, and handling DLQ replays is a significant operational burden. If your team has fewer than five people, that overhead might consume 20-30% of your engineering capacity. A well-structured monolith with in-process event dispatching (like Rails’ ActiveSupport::Notifications or Spring’s ApplicationEventPublisher) gives you most of the architectural benefits without the infrastructure cost.
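For a sense of how little code in-process dispatching needs, here is a sketch in Python. The `EventBus` class is hypothetical and far simpler than the Rails and Spring facilities named above, but it shows the same decoupling: the code that creates the order does not know which handlers are registered.

```python
from collections import defaultdict

class EventBus:
    """Synchronous, in-process publish/subscribe with no broker."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Handlers run in registration order, in the caller's thread.
        for handler in self._handlers.get(event_type, []):
            handler(payload)

bus = EventBus()
sent_emails = []
inventory = {"sku-1": 10}

def send_confirmation(order):
    sent_emails.append(order["order_id"])

def decrement_stock(order):
    inventory[order["sku"]] -= order["quantity"]

bus.subscribe("order.created", send_confirmation)
bus.subscribe("order.created", decrement_stock)

bus.publish("order.created", {"order_id": "o-1", "sku": "sku-1", "quantity": 2})
print(sent_emails, inventory)  # ['o-1'] {'sku-1': 8}
```

Because handlers run synchronously here, a failing handler still fails the request. You get the code-level decoupling without the resilience of a broker, which is the trade-off described above.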
If your event-driven services cannot be deployed independently, or if changing one event schema requires coordinated deployments across multiple services, you have not built independent services. You have built a distributed monolith, which has all the complexity of microservices with none of the benefits.
Closing Thoughts
Event-driven architecture is a powerful tool, but it is not a free lunch. The complexity does not disappear — it shifts from your application code to your infrastructure and operational practices. Make sure you are trading up, not just trading sideways.
The question is never “should we use events?” It is “which interactions benefit from being asynchronous, and are we prepared to handle the operational complexity that comes with them?”
If you can answer that honestly, you will make the right call for your system.
Pick one interaction in your system that is currently synchronous and causing problems (slow response times, cascading failures, tight coupling). Move that one interaction to events using the Outbox pattern. Live with it for a month. If the operational overhead is manageable and the benefits are real, expand from there. If not, you have learned something valuable at low cost.