SerialReads

Pipes-and-Filters Architecture: Building Streaming & Batch Pipelines

Jun 10, 2025

TL;DR: The Pipes-and-Filters architectural pattern divides complex data processing into a pipeline of independent stages. Each filter performs one task, passing data via pipes. This modular design enhances reuse, scalability (parallel stages, back-pressure control), and flexibility across streaming or batch jobs – with built-in support for error isolation, monitoring, and robust, secure data flows.

From Monolithic ETL to Pipeline – Why Pipes & Filters?

Monolithic ETL jobs or ad-hoc scripts that handle extraction, transformation, and loading in one big chunk eventually hit a wall. Imagine a single massive script that reads data, validates it, applies transformations, and loads results. As requirements grow, this monolith becomes brittle – code is tangled, hard to reuse, and any change risks breaking the whole flow. Often the same parsing or cleaning logic is duplicated across multiple such scripts. When one part slows down or fails, the entire job suffers, and scaling is all-or-nothing. These issues with one-size-fits-all ETL modules motivated a better approach.

Enter Pipes-and-Filters. Instead of one opaque program, the work is broken into a series of independent steps (filters), each responsible for a single task (say, parse JSON, validate schema, anonymize PII, aggregate stats). Each filter’s output flows into the next step through a pipe – a conduit that carries data records or messages downstream. This pattern yields composability and reuse: filters are small and self-contained, so you can add, remove, or reorder them without rewriting everything. Need a new pipeline? Mix and match existing filters or plug in new ones as needed, instead of cloning and tweaking a monolith. In short, Pipes-and-Filters turns a rigid script into a flexible assembly line.
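
To make that concrete, here is a minimal sketch of such composability in plain Python – the filter names (parse_json, drop_invalid, anonymize) and the sample records are invented for illustration, with generators standing in for pipes:

import json

def parse_json(lines):                     # filter: text lines -> dict records
    for line in lines:
        yield json.loads(line)

def drop_invalid(records):                 # filter: discard records missing a user id
    for rec in records:
        if "user_id" in rec:
            yield rec

def anonymize(records):                    # filter: mask PII before it flows downstream
    for rec in records:
        rec["user_id"] = "***"
        yield rec

raw = ['{"user_id": "42", "temp_c": 21}', '{"temp_c": 19}']
pipeline = anonymize(drop_invalid(parse_json(raw)))   # pipes are just generator hand-offs
print(list(pipeline))                                 # [{'user_id': '***', 'temp_c': 21}]

Reordering, removing, or inserting a stage is a one-line change to the composition, which is exactly the flexibility the pattern promises.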

Core Anatomy: Filters, Pipes, and the Pipeline Runner

At the heart of this architecture are three concepts: filters, pipes, and the pipeline orchestrator (or runner). A filter is an independent processing stage that performs one operation on the data. It could be as simple as a function in memory or as heavy as a separate process or container – the key is it encapsulates one unit of computation. Filters are often stateless (more on that shortly) and ideally do not share resources, which makes them easy to scale and test in isolation.

A pipe is the channel that connects filters and passes data along. In a program, a pipe might be an in-memory queue, a Unix pipe streaming bytes, or an asynchronous message queue/topic connecting two services. The pipe decouples the filters – a filter just writes output to the pipe, and the next filter reads from it, without needing direct function calls. This decoupling means filters don’t need to know about each other’s internals, only the contract of data that flows through.

The pipeline runner (or orchestrator) is what sets up and coordinates the chain of filters and pipes. In a simple in-process pipeline, this might be just the thread of execution passing data from one function to the next. In more complex systems, the runner could be a framework or engine (like Apache Beam’s runtime or a Unix shell) that takes a pipeline definition and manages threading, scheduling, and data movement. Essentially, it ensures each filter gets data from its inbound pipe, runs the processing, and emits to the outbound pipe, until the pipeline completes.
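
A bare-bones runner can be sketched with standard-library threads and queues; the two stages (str.strip and str.upper) and the sentinel-based shutdown are illustrative choices, not a prescribed API:

import queue, threading

SENTINEL = object()                         # end-of-stream marker

def run_stage(fn, inbound, outbound):
    """Generic filter wrapper: read from the inbound pipe, apply fn, emit downstream."""
    while True:
        item = inbound.get()
        if item is SENTINEL:
            outbound.put(SENTINEL)          # propagate shutdown to the next stage
            break
        outbound.put(fn(item))

pipe_a, pipe_b, results = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=run_stage, args=(str.strip, pipe_a, pipe_b)).start()
threading.Thread(target=run_stage, args=(str.upper, pipe_b, results)).start()

for line in ["  hello ", " world "]:        # the "source" feeds the first pipe
    pipe_a.put(line)
pipe_a.put(SENTINEL)

while (out := results.get()) is not SENTINEL:
    print(out)                              # HELLO, WORLD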

Dataflow Modes: Push vs Pull, Streaming vs Batch

A Pipes-and-Filters pipeline can run in different modes of dataflow depending on how data availability is managed:

Push: upstream stages drive the flow, sending each item downstream as soon as it is produced. This keeps latency low but requires explicit back-pressure so a fast producer cannot overwhelm a slow consumer.
Pull: downstream stages request data when they are ready (like iterating a generator or polling a queue). Back-pressure is implicit, since nothing is produced faster than it is consumed.
Streaming: the pipeline runs continuously over an unbounded flow of events, processing each record (or micro-batch) as it arrives, with latency measured in milliseconds to seconds.
Batch: the pipeline runs over a bounded dataset, processes it end to end, and terminates; throughput and completeness matter more than per-record latency.

The same filters can often serve both streaming and batch jobs – the difference lies in how the runner feeds them and when the pipeline is considered "done."

Filter Design: Stateless vs Stateful, Pure vs Effectful

One important design decision is whether filters hold state. A stateless filter treats each input independently and does not retain anything between messages. For example, a filter that converts Celsius to Fahrenheit or trims text fields – given the same input, it always produces the same output and doesn’t rely on past history. Such filters are much easier to scale and reuse, and they don’t introduce coupling between pipeline stages. In fact, it’s recommended that filters be idempotent and stateless whenever possible – ensuring that re-processing an event or running two instances of a filter yields the same result. Stateless filters can be duplicated to increase throughput without worrying about consistency.

However, some tasks inherently need state. A stateful filter might aggregate counts, compute running averages, or join events over a time window – it must remember past inputs. These filters can still fit in pipelines but require careful handling: e.g. periodic checkpointing of state for recovery, and logic to manage state across parallel instances (sharding by key, etc.). For example, a filter that counts unique users needs to store seen user IDs. When using stateful filters, design them so that state is internal (encapsulated in that filter) and consider making them idempotent – if re-run with the same inputs, they don’t double-count, which aids exactly-once processing.
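
A small Python sketch of the contrast – the event fields and the unique-user example are assumptions for illustration:

def to_fahrenheit(event):                       # stateless: output depends only on the input
    return {**event, "temp_f": event["temp_c"] * 9 / 5 + 32}

class UniqueUserCounter:
    """Stateful filter: remembers seen user ids so replays don't double-count."""
    def __init__(self):
        self.seen = set()                       # internal state, encapsulated in the filter

    def process(self, event):
        self.seen.add(event["user_id"])         # adding an id twice is a no-op -> idempotent
        return {"unique_users": len(self.seen)}

counter = UniqueUserCounter()
for e in [{"user_id": "a", "temp_c": 20}, {"user_id": "a", "temp_c": 21}]:
    print(to_fahrenheit(e), counter.process(e))   # unique_users stays 1 on the replayed user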

We also distinguish pure transformation filters versus sources and sinks (effectful stages). A pure filter takes input, transforms it, and emits output with no external side-effects – e.g. mapping a record to a new schema. In contrast, a source filter has no upstream (it introduces new data into the pipeline, like reading from a file or message queue), and a sink filter consumes data at the end (writing to a database or triggering an API call). Sources and sinks often have side effects (I/O, external interactions). While internal filters ideally avoid side effects, sinks by nature produce an effect (persisting results), and sources often involve external systems (data ingestion). In design interviews, it’s good to mention that sources/sinks wrap the edges of a pipeline, while the filters in between focus on pure transformations whenever possible – this separation makes the middle of the pipeline easier to test and parallelize.

Concurrency and Parallelism: Pipelines at Scale

One of the big advantages of Pipes-and-Filters is the ability to run different filters concurrently. In a monolith, even if you can use threads, the tightly coupled nature often forces sequential flow. In a pipeline, each filter can potentially run in its own thread or even on a separate machine, consuming from its input pipe and pushing to its output pipe asynchronously. This opens the door to parallel filters and distributed pipelines.

If one filter stage is a bottleneck (say it does heavy CPU work or waits on I/O), we can scale out that stage by running multiple instances of the filter in parallel, all reading from the same inbound pipe (queue) and writing to the outbound pipe. For example, if Filter 2 is slower than others, spin up N parallel consumers of the pipe feeding Filter 2 – now the work is divided and the system’s throughput increases. Each filter being independent also allows deploying different stages on different hardware optimized for their workload (CPU-heavy filters on beefy servers, I/O-bound ones on cheaper nodes). This partitioned pipeline approach (also called parallel pipeline) lets us meet demand by scaling individual stages rather than the whole thing.
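
Here is a minimal sketch of that fan-out using standard-library threads, with an invented slow_enrich stage standing in for the bottleneck:

import queue, threading, time

inbound, outbound = queue.Queue(), queue.Queue()

def slow_enrich(record):                    # the bottleneck stage (simulated I/O wait)
    time.sleep(0.1)
    return {**record, "enriched": True}

def worker():
    while True:
        record = inbound.get()
        if record is None:                  # shutdown marker, one per worker
            break
        outbound.put(slow_enrich(record))

N = 4                                       # scale out just this stage
workers = [threading.Thread(target=worker) for _ in range(N)]
for w in workers: w.start()

for i in range(20):
    inbound.put({"id": i})
for _ in workers:
    inbound.put(None)
for w in workers: w.join()
print(outbound.qsize(), "records enriched by", N, "parallel consumers of one pipe")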

Back-pressure is a critical concern in concurrent pipelines. What if an upstream filter produces data faster than the downstream can handle? Without checks, the pipes will fill up (in memory or queue length) and possibly crash or cause high latency. Back-pressure is the mechanism to handle this: it’s essentially feedback from slower consumers upstream to throttle production. In a pull-based pipeline, back-pressure is implicit (downstream only pulls when ready). In push-based systems, frameworks implement protocols to signal demand. For instance, Reactive Streams use a subscription request model where the consumer signals how many items it can handle, and the producer limits itself. In distributed pipelines using queues, the queue depth can serve as a signal – e.g. auto-scaling more consumers when backlog grows. If scaling isn’t an option, the pipeline might pause ingest or shed load (drop messages or divert them) when back-pressure builds. The goal is to resist uncontrolled flow and prevent a slow stage from crashing the whole pipeline.
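
In-process, the simplest form of back-pressure is a bounded pipe: the producer blocks while the queue is full and sheds or diverts load if it stays full. A sketch, with a deliberately tiny queue to force the overflow path:

import queue

pipe = queue.Queue(maxsize=2)               # deliberately small bounded pipe

def handle_overflow(record):
    print("back-pressure: shedding or diverting record", record["id"])

def produce(record):
    try:
        pipe.put(record, timeout=0.1)       # blocks while the downstream consumer is behind
    except queue.Full:
        handle_overflow(record)             # queue stayed full: shed load instead of growing

for i in range(3):                          # no consumer running, so the third put overflows
    produce({"id": i})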

Below is a simple PlantUML diagram of a 4-stage pipeline with a branching path and a back-pressure feedback loop. Stage 2 splits into two parallel filters (Stage 3A and 3B), whose outputs are merged and collected by Stage 4. A dashed red arrow illustrates back-pressure: if the final stage (Stage 4) is overwhelmed, a signal is sent upstream (in this case to Stage 2) to slow down input.

@startuml
skinparam defaultTextAlignment center
skinparam ArrowColor Blue
rectangle "Stage 1:\nSource" as S1
rectangle "Stage 2:\nFilter" as S2
rectangle "Stage 3A:\nFilter" as S3A
rectangle "Stage 3B:\nFilter" as S3B
rectangle "Stage 4:\nSink" as S4
S1 --> S2 : pipe
S2 --> S3A : pipe (branch 1)
S2 --> S3B : pipe (branch 2)
S3A --> S4 : pipe
S3B --> S4 : pipe
S4 -[#red,dashed]-> S2 : back-pressure
@enduml

In this diagram, Stage 1 might be a source (no input pipe, perhaps reading from a file or API). Stage 2 processes the data and then branches into two parallel filters (Stage 3A and 3B) – for example, Stage 3A could enrich data with one service, while 3B performs a different computation. Both results go into Stage 4, which might merge or store the final output. If Stage 4 can’t keep up, the back-pressure dashed arrow indicates that Stage 2 (and thus Stage 1’s intake) will be slowed or paused until Stage 4 catches up. This ensures the pipeline doesn’t blindly push data into a bottleneck.

Data Serialization and Contracts

For filters to work in concert, they must agree on the data format and schema traveling through pipes. This is the data contract. In simple in-memory pipelines, the contract might just be a shared data structure or object. But in most realistic scenarios – especially if filters are separate services or processes – you need to serialize data into a common format as it passes through pipes.

Common choices include CSV, JSON Lines (JSONL), Avro, or Protobuf:

CSV: simple, compact for flat tabular data, and readable everywhere, but untyped, flat-only, and entirely dependent on column-order conventions.
JSON Lines (JSONL): one JSON object per line – human-readable, self-describing field names, easy to stream line by line, but verbose and schemaless unless you add validation.
Avro: a compact binary format with an explicit schema that travels with (or alongside) the data, designed with schema evolution in mind.
Protobuf: a compact binary format whose schemas compile into generated code, giving strong typing and fast serialization across languages.

The choice of format affects coupling. Text formats (CSV, JSONL) are more forgiving – a new field might be ignored by old filters – but they rely on convention. Binary formats with explicit schemas (Avro, Protobuf) enforce the contract strictly, which is safer but means you must manage schema evolution carefully. In any case, it’s critical to define the data contract up front: what fields, types, and semantics each record carries. This contract is what allows independent teams to build or change one filter without breaking others (as long as they respect the contract or negotiate changes through versioning).
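
One lightweight way to pin the contract down in code is a typed record plus an explicit serialize/deserialize pair; the SensorReading fields below are illustrative, and a schema-first format like Avro or Protobuf would replace this hand-rolled step with generated code:

import json
from dataclasses import dataclass, asdict

@dataclass
class SensorReading:                         # the data contract every filter agrees on
    device_id: str
    temp_c: float
    ts: int

def to_jsonl(reading: SensorReading) -> str:
    return json.dumps(asdict(reading))       # one record per line on the pipe

def from_jsonl(line: str) -> SensorReading:
    fields = json.loads(line)
    return SensorReading(**fields)           # unexpected or missing fields fail loudly here

wire = to_jsonl(SensorReading("dev-1", 21.5, 1718000000))
print(from_jsonl(wire))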

Error Handling and Compensation

No pipeline is perfect – errors will happen, whether it’s a malformed record, an unavailable external service, or a bug in filter logic. In a monolith script, often the whole job fails on an unexpected error. In Pipes-and-Filters, we have more nuanced strategies to handle errors and continue processing when possible.

One technique is using Poison Pills. A “poison pill” is a data item that causes a filter to consistently fail – analogous to a toxic message that poisons the pipeline if not dealt with. For example, a corrupt record might make a parsing filter throw an exception every time it’s read. If the pipeline naively retries the same bad message, it will stall (keep failing on the pill). To avoid a single bad apple halting the entire flow, filters implement error-handling policies. A common approach is to catch exceptions within a filter and either skip the bad record (logging it) or send it to an error channel. Many robust pipelines use a Dead Letter Queue (DLQ) for poison pills: when a message keeps failing, it’s published to a separate DLQ instead of propagating the failure downstream. This way, the pipeline can continue with subsequent messages and the problematic data is isolated for offline analysis. The DLQ acts as a holding area for events that were not processed – operators can examine these events later, possibly run a compensation workflow or manual fix, then replay them if needed after the issue is resolved.

There is a trade-off between a “skip-record” and a “halt-pipeline” policy on error. Skipping (or routing to DLQ) prioritizes throughput – the pipeline keeps running even if some inputs are bad, which is crucial in long-running streams. However, you may lose data or at least delay processing those records. Halting the pipeline (fail fast) ensures no data is lost silently – the moment something is wrong, everything stops so the problem can be addressed. In practice, many systems adopt a hybrid: try a few retries (in case the error was transient, e.g. a momentary network glitch), then if still failing, send to DLQ and move on. This provides resilience without giving up on problematic events entirely.
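
A sketch of that hybrid policy, with an in-memory list standing in for a real dead letter queue and a hypothetical flaky_transform filter:

dead_letter_queue = []                       # stand-in for a real DLQ topic/queue

def process_with_retry(record, transform, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            return transform(record)         # happy path; transient errors get retried
        except Exception as exc:
            last_error = exc
    dead_letter_queue.append({"record": record, "error": str(last_error)})
    return None                              # pipeline keeps flowing past the poison pill

def flaky_transform(record):
    if record.get("payload") is None:        # a record this filter can never handle
        raise ValueError("missing payload")
    return {**record, "ok": True}

for rec in [{"id": 1, "payload": "x"}, {"id": 2, "payload": None}]:
    print(process_with_retry(rec, flaky_transform))
print("dead-lettered:", dead_letter_queue)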

Another error-handling pattern is the use of sentinel messages like a special end-of-stream marker or flush command. For instance, when shutting down a pipeline gracefully, one can inject a “poison pill” message that isn’t actual data but a signal for filters to finish up and stop. Each filter, upon receiving this pill, would perform any cleanup and not forward it further. This is a controlled way to terminate a pipeline (especially in multi-threaded scenarios, rather than killing threads abruptly).
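
For example, a small aggregating filter might treat the sentinel as its cue to flush buffered state and stop; END_OF_STREAM here is just an assumed marker object, not a framework feature:

END_OF_STREAM = object()                     # sentinel injected by the source at shutdown

def batch_sum(inbound_items):
    """Buffers a running total and flushes it when the sentinel arrives."""
    total = 0
    for item in inbound_items:
        if item is END_OF_STREAM:
            yield {"final_total": total}     # cleanup: emit what we have, don't forward the pill
            return
        total += item

print(list(batch_sum([3, 4, 5, END_OF_STREAM])))   # [{'final_total': 12}]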

Compensation refers to corrective actions after an error. In the context of a pipeline, this might mean if a filter failed halfway through a batch, you can replay that batch (if your pipeline supports replay) or apply a compensating transaction. For example, if a sink filter wrote partial results to a database before a downstream filter crashed, you might have to roll back those writes or mark them incomplete. Idempotent filter operations greatly simplify compensation: if filters can be safely retried without side effects, the system can just rerun failed pieces. We’ll discuss replay and idempotency more in the resilience section.

Observability: Metrics and Tracing

Building a pipeline is one thing – keeping an eye on its health is another. With multiple filters possibly distributed across systems, observability becomes critical. You’ll want to instrument stage-level metrics: each filter should track how many records it’s processing per second (throughput), how long it takes on average to process one item or a batch (latency), and how often it fails or discards a message (failure/error rate). By monitoring these metrics, you can pinpoint bottlenecks – e.g. if Filter 3’s throughput is lagging behind incoming rate, it’s time to scale it out or investigate why it’s slow. Many pipeline frameworks provide these metrics out of the box, or you can embed counters and timers in your filter code (and export to a monitoring system like Prometheus, CloudWatch, etc.).
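
Instrumentation can be as lightweight as a decorator that counts records, errors, and elapsed time per stage; in production these counters would be exported to a monitoring system rather than printed:

import time
from collections import defaultdict

metrics = defaultdict(lambda: {"processed": 0, "errors": 0, "total_secs": 0.0})

def instrumented(stage_name):
    def wrap(fn):
        def run(record):
            start = time.perf_counter()
            try:
                result = fn(record)
                metrics[stage_name]["processed"] += 1
                return result
            except Exception:
                metrics[stage_name]["errors"] += 1
                raise
            finally:
                metrics[stage_name]["total_secs"] += time.perf_counter() - start
        return run
    return wrap

@instrumented("validate")
def validate(record):
    return record

validate({"id": 1})
print(dict(metrics))    # per-stage throughput/latency/error inputs for dashboards and alerts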

End-to-end tracing is crucial for understanding the journey of data through the pipeline. In a monolith, you might log “started processing record X” and “finished record X” in one process. In a distributed pipeline, record X might pass through four different services – we need a way to correlate events. A common approach is to assign a unique ID or correlation ID to each message (or batch) at the source, and include it in logs/metrics at each stage. This allows using tracing tools (like OpenTelemetry, Jaeger, Zipkin, or cloud-specific ones like AWS X-Ray) to reconstruct the path of a message. Indeed, when filters are spread out, “monitoring and observability become challenging… necessitating distributed tracing solutions… to understand the interactions between filters”. With a trace, if an output is wrong or delayed, you can see which stage caused the issue.

Logs are also part of observability: each filter should log important events, but avoid overwhelming logging in hot loops (that can itself become a performance issue). Structured logs (e.g. JSON logs) with the message ID, stage name, timestamp, and outcome (processed, error, etc.) are invaluable for debugging in production.
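
A structured log line that carries the correlation ID through each stage might look like the sketch below; the field names are a convention assumed for illustration, not a standard:

import json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(stage, message_id, outcome, **extra):
    logging.info(json.dumps({
        "ts": time.time(), "stage": stage,
        "message_id": message_id,            # same id at every stage -> traceable end to end
        "outcome": outcome, **extra,
    }))

msg_id = str(uuid.uuid4())                   # assigned once, at the source
log_event("source", msg_id, "ingested")
log_event("validate", msg_id, "processed", duration_ms=3)
log_event("sink", msg_id, "error", reason="db timeout")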

In summary, treat the pipeline as an assembly line where you have sensors at each station. You want to quickly detect if stage 2 is backing up, or if error rate spikes at stage 4 after a new deploy. Throughput, latency, and error metrics per stage, plus end-to-end latency from source to sink, give a comprehensive picture. Dashboards often visualize each stage’s throughput and queue backlog. And tracing enables drilling into specific instances or flows for diagnostics. Observability isn’t just nice-to-have – in a complex pipeline, it’s what turns “black box” stages into transparent units you can trust and tune.

Resilience Patterns: Checkpointing, Replay, and Exactly-Once

In long-running pipelines (especially streaming ones), things will go wrong: a node might crash, a network partition might occur, etc. We need resilience so that the pipeline can recover from failures without data loss or duplication. A few key patterns help achieve this:

Checkpointing: filters (or the runner) periodically persist their progress – e.g. the last processed offset and any internal state – to durable storage, so a restarted pipeline resumes from the last checkpoint instead of from scratch (sketched below).
Replay: sources that retain data (a log, a queue with retention, or raw files) allow reprocessing from a given point. Combined with checkpoints, replay turns most failures into "rewind and rerun" rather than data loss.
Exactly-once (effectively-once) processing: a crash between processing and checkpointing can cause some records to be replayed, so filters and sinks should be idempotent – deduplicating by key or writing transactionally – so that reprocessing does not produce duplicate effects.
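
A minimal sketch of offset-plus-state checkpointing, assuming a local JSON file as the durable store (a real pipeline would use a database, object store, or the framework's checkpoint mechanism):

import json, os

CHECKPOINT = "pipeline.ckpt"                 # illustrative path

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"offset": 0, "running_total": 0}

def save_checkpoint(state):
    with open(CHECKPOINT, "w") as f:         # in practice: write to a temp file, then rename
        json.dump(state, f)

events = [5, 7, 3, 8, 2]
state = load_checkpoint()
for offset in range(state["offset"], len(events)):
    state["running_total"] += events[offset]
    state["offset"] = offset + 1
    if state["offset"] % 2 == 0:             # checkpoint periodically, not on every event
        save_checkpoint(state)
print(state)                                 # after a crash, the run resumes from the last checkpoint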

Orchestration Options: DIY vs Frameworks

You can implement a basic pipes-and-filters pipeline with just the Unix shell: for example, cat data.txt | grep "ERROR" | sed -e 's/.*ERROR: //' | sort > output.txt represents source -> filter -> filter -> filter -> sink using shell commands connected by |. The shell (or OS) acts as the pipeline runner, piping the stdout of one process to the stdin of the next. This is a quick and vendor-neutral way to orchestrate simple pipelines, especially for text data. The downside is limited error handling and observability (you rely on each command’s exit code and logs), and everything runs on one machine.

For more complex or large-scale pipelines, there are robust frameworks:

Apache Beam: a unified programming model for batch and streaming pipelines – you define the transforms (filters) once and a runner executes them locally, on an engine like Apache Flink, or on a managed service such as Cloud Dataflow.
Apache Flink: a distributed stream-processing engine with built-in support for stateful filters, checkpointing, event-time windows, and back-pressure.
Managed cloud services: offerings such as Cloud Dataflow (and comparable services from other providers) run the pipeline for you, handling provisioning, scaling, and monitoring.
Message-broker-centric pipelines: filters deployed as independent services connected by durable queues or topics, which gives buffering, replay, and dead-letter support out of the box.

DIY vs Framework: Building a custom pipeline runner (perhaps using threads, queues, and custom code) can be appropriate for simple use cases or when learning the fundamentals. It gives full control – you can optimize every part for your exact needs. However, you’ll have to implement many features from scratch: queue management, scaling, checkpoints, metrics, etc. Frameworks and cloud services provide these out of the box, at the cost of some learning curve and less low-level control. In an interview, it’s wise to mention that for most large systems, using a proven framework (Beam, Flink, etc.) is preferable to reinventing the wheel – unless the problem is very unique or the team is extremely specialized. A framework also provides portability: e.g. a Beam pipeline can run on your laptop in batch mode or on Cloud Dataflow in streaming mode with minimal changes.

In summary, a lone developer might string together Python scripts or shell commands as a pipeline (DIY approach), but as complexity grows (need for parallelism, exactly-once, complex windows, etc.), they would likely migrate to a pipeline framework or service that handles those concerns systematically.

Performance Tuning Knobs

Even with a good framework or architecture, performance tuning can make a big difference in a pipeline’s throughput and latency. Here are key levers:

Batch size / micro-batching: grouping records amortizes per-record overhead (network round trips, bulk writes) and boosts throughput, at the cost of added latency (see the sketch below).
Parallelism per stage: the number of threads, processes, or partitions assigned to each filter – scale the slowest stage first.
Buffer and queue sizes: larger pipes smooth out bursts but add memory pressure and latency; smaller ones apply back-pressure sooner.
Serialization cost: the chosen format and how often data is serialized and deserialized between stages can dominate CPU time – keep data in memory across co-located filters where possible.
Stage fusion: merging several trivial filters into one avoids per-hop queueing and serialization overhead when the separation buys nothing.
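
As a concrete example of the first knob, micro-batching groups records so that per-call overhead is amortized; the batch size of 500 is an assumed tunable, not a recommendation:

def micro_batches(records, batch_size=500):
    """Group individual records into batches to amortize per-call overhead downstream."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch                      # larger batches: better throughput, worse latency
            batch = []
    if batch:
        yield batch                          # flush the tail so no records are stranded

for batch in micro_batches(range(1200), batch_size=500):
    print("writing", len(batch), "records in one bulk call")   # 500, 500, 200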

In performance tuning, beware of the downstream effects: speeding up one stage can move the bottleneck to the next stage. Always tune and test the pipeline end-to-end, watching the metrics as you adjust knobs.

Security Considerations

When data flows through multiple stages, possibly across networks or services, security must be considered at each step:

Encryption in transit: pipes that cross process or network boundaries (queues, topics, RPC) should use TLS so data cannot be read or tampered with in flight.
Encryption at rest: any buffered or persisted data – queue storage, checkpoints, dead letter queues, sink databases – should be encrypted and access-controlled.
Authentication and authorization: each filter should authenticate to the pipes and services it uses, with least-privilege credentials (a filter that only reads a topic should not be able to write to the sink).
Data minimization and PII handling: scrub, mask, or tokenize sensitive fields as early in the pipeline as possible so downstream stages and logs never see raw PII.

Security filters can be part of the pipeline itself (like a stage that scrubs data), and security measures like encryption are part of the pipe infrastructure. In system design, mentioning encryption (in-flight and at-rest) and data privacy handling in the pipeline will earn you points.
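
A dedicated scrubbing filter near the front of the pipeline is one common measure; the regex and field names below are illustrative and far from exhaustive:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SENSITIVE_FIELDS = {"ssn", "credit_card"}    # assumed field names for this sketch

def scrub_pii(record):
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            clean[key] = "[REDACTED]"        # mask fields downstream stages never need
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean

print(scrub_pii({"note": "contact jane@example.com", "ssn": "123-45-6789", "amount": 42}))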

Common Pitfalls and Anti-Patterns

While Pipes-and-Filters is powerful, there are traps that engineers should avoid:

Over-fragmentation: splitting logic into dozens of trivial filters adds serialization, queueing, and operational overhead that can outweigh the modularity benefits.
Hidden coupling: filters that share a database, global state, or implicit ordering assumptions are no longer independent – changes ripple in surprising ways.
Smart pipes and leaky contracts: pushing business logic into the pipe layer, or letting filters depend on undocumented fields, defeats the simple-contract principle.
Unbounded buffering: pipes without size limits hide back-pressure problems until memory or disk runs out.
Stateful filters without checkpoints: state that lives only in memory turns every restart into data loss or a full reprocess.

By being aware of these pitfalls, one can better utilize Pipes-and-Filters. Essentially, keep the spirit of the pattern – independence, modularity, and controlled interactions – without over-complicating or undermining it.

Key Take-aways

Pipes-and-Filters decomposes a monolithic data job into small, single-purpose filters connected by pipes, gaining reuse, composability, and independent scaling.
Prefer stateless, idempotent filters; isolate unavoidable state inside a filter and checkpoint it.
Pipes decouple stages, but they need back-pressure (bounded buffers, pull-based demand, or scaling signals) so a slow stage doesn’t sink the whole pipeline.
Define the data contract (format and schema) explicitly, and plan for schema evolution.
Handle bad data deliberately: retry transient failures, route poison pills to a dead letter queue, and keep replays safe through idempotency.
Instrument every stage (throughput, latency, error rate) and correlate records end to end with trace/correlation IDs.
A Unix-style DIY pipeline is fine for simple jobs; reach for a framework (Beam, Flink, managed services) when you need parallelism, state, and exactly-once guarantees at scale.

software-architecture-pattern