Notes

When building data pipelines, it's tempting to write results directly to your database. This creates tight coupling that can make your entire pipeline brittle. A better approach: use a message queue as a buffer between your pipeline and database.

The Problem with Direct Writes

When your pipeline writes directly to a database, several issues emerge:

Coupled failures: The pipeline only succeeds if the database happens to be ready for writes the moment it finishes. If the database is down, slow, has exhausted its connection limit, or is briefly refusing writes (while an index is being built or a materialised view created), the write fails and the whole pipeline fails with it.
Costly re-runs: Recovering from a failed write usually means re-running the entire pipeline. When that pipeline takes hours and runs on expensive compute such as GPUs, a single transient database hiccup at the very end throws all of that work away and forces a costly redo.
Performance bottlenecks: Writing rows one at a time pays a network round trip and per-statement overhead for every row, so it is far slower than a bulk operation.

A More Resilient Architecture

By introducing a message queue, we decouple the pipeline from the database. The consumer can write to the database directly or through whatever API layer fronts it; either way, the pipeline only ever talks to the queue.
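A minimal sketch of this shape, assuming an in-process queue.Queue as a stand-in for a real broker and a plain dict as a stand-in for the database. The names run_pipeline and drain_to_database are hypothetical and exist only for illustration:

```python
import json
import queue

# In-process stand-in for a real broker (Pub/Sub, SQS, Kafka). This is an
# assumption made so the sketch runs without any external service.
results_queue = queue.Queue()


def run_pipeline(inputs):
    """Hypothetical pipeline step: compute results and publish them.

    This function never touches the database; it only talks to the queue.
    """
    for item in inputs:
        result = {"id": item, "score": item * 2}  # placeholder computation
        results_queue.put(json.dumps(result))


def drain_to_database(database):
    """Hypothetical consumer: move every queued message into the database.

    `database` is a plain dict standing in for a real database client.
    Returns the number of records written.
    """
    written = 0
    while True:
        try:
            message = results_queue.get_nowait()
        except queue.Empty:
            return written
        record = json.loads(message)
        database[record["id"]] = record
        written += 1


run_pipeline([1, 2, 3])              # succeeds whether or not the database is up
database = {}
count = drain_to_database(database)  # runs later, once the database is ready
```

With a real broker the two functions would live in separate processes, and the consumer would acknowledge each message only after the write succeeds.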

1. Higher availability: Message queues (like Pub/Sub, SQS, or Kafka) are built to keep accepting messages and typically offer better uptime than a single database. Your pipeline succeeds as long as it can publish messages, even when the database is mid-migration or saturated with connections.

2. Batch processing: The consumer can accumulate messages and perform bulk database operations, significantly improving throughput.

3. Graceful failure handling: If database writes keep failing after retries, messages can be routed to a dead-letter queue for inspection and manual intervention. You can even repair the database by hand from the dead-letter queue, with no need to re-run a compute-heavy pipeline.

4. Independent scaling: Scale the pipeline and the database consumers independently, each according to its own load.

5. Natural backfills: To load results from past runs, replay their messages onto the queue. The consumer processes them exactly like live traffic, so there is no separate backfill path to build and maintain.
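Points 2 and 3 can be sketched together. The example below is one possible consumer, not a definitive one: it assumes in-process queues in place of a real broker, a list in place of the database, and a hypothetical bulk_write that rejects any batch containing a record without an id. When a batch fails, it retries the records one by one so that only the bad ones reach the dead-letter queue:

```python
import queue

BATCH_SIZE = 3  # illustrative; real values depend on the database and message size


def bulk_write(database, batch):
    """Hypothetical all-or-nothing bulk insert into a list-backed 'database'."""
    if any("id" not in record for record in batch):
        raise ValueError("record missing 'id'")
    database.extend(batch)


def flush(batch, dead_letters, database):
    """Write a batch in bulk; on failure, isolate and park the bad records."""
    try:
        bulk_write(database, batch)
    except ValueError:
        # Retry one record at a time so good records are not held back
        for record in batch:
            try:
                bulk_write(database, [record])
            except ValueError:
                dead_letters.put(record)


def consume(source, dead_letters, database):
    """Drain `source` in batches of up to BATCH_SIZE messages."""
    batch = []
    while True:
        try:
            batch.append(source.get_nowait())
        except queue.Empty:
            break
        if len(batch) >= BATCH_SIZE:
            flush(batch, dead_letters, database)
            batch = []
    if batch:  # write whatever is left over
        flush(batch, dead_letters, database)


source = queue.Queue()
dead_letters = queue.Queue()
database = []
for record in [{"id": 1}, {"id": 2}, {"name": "no id"}, {"id": 4}]:
    source.put(record)

consume(source, dead_letters, database)
```

In this sketch a backfill needs no extra code: putting the messages from a past run back onto source sends them through the same consume path as live traffic.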

Implementation Considerations

Message schema: Define clear message contracts (e.g., using Protobuf or JSON Schema).
Idempotency: Most queues deliver messages at least once, so ensure consumers can safely process the same message multiple times.
Monitoring: Track queue depth and consumer lag to detect bottlenecks.
Ordering: Consider whether message ordering matters for your use case.
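As an illustration of the first two considerations, here is a small sketch using SQLite from the Python standard library. The table layout, the message_id and score fields, and the hand-rolled validation are all assumptions for the example; a real system would more likely validate against a Protobuf or JSON Schema definition:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (message_id TEXT PRIMARY KEY, score REAL NOT NULL)"
)


def handle(raw_message):
    """Validate a message against a minimal contract, then write it idempotently."""
    message = json.loads(raw_message)
    # Minimal contract check (hypothetical fields, chosen for this example)
    if not isinstance(message.get("message_id"), str):
        raise ValueError("message_id must be a string")
    if not isinstance(message.get("score"), (int, float)):
        raise ValueError("score must be a number")
    # Keying the write on message_id makes a redelivered message overwrite
    # its earlier copy instead of creating a duplicate row.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO results (message_id, score) VALUES (?, ?)",
            (message["message_id"], message["score"]),
        )


raw = json.dumps({"message_id": "run-42/item-7", "score": 0.91})
handle(raw)
handle(raw)  # simulated redelivery of the same message

row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

The same idea carries over to other databases through their own upsert syntax. The important design choice is that the key comes from the message itself, not from the time or order of delivery.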

What About Reads?

This pattern decouples writes; it does nothing for reads, which still hit the database directly. If read load is the bottleneck, or readers need insulating from write contention, a read replica is the usual next step. How stale that replica is allowed to be depends entirely on the use case.

Is It Worth It?

This architecture is not free. A queue, a consumer service, and a dead-letter queue are extra moving parts to build, deploy, and monitor. Whether that complexity pays off depends on how expensive a failed write is. A pipeline that runs in seconds and is cheap to retry rarely justifies it; one that burns hours of GPU time only to lose all of it because the database was briefly unavailable almost always does. Weigh the added complexity against the cost of failure, and decide where your own pipeline sits.