Pipes‑and‑Filters Architecture: Streaming & Batch Pipelines

Core ideas, benefits, design choices, and common pitfalls of the Pipes‑and‑Filters pattern.

1. What is the Pipes‑and‑Filters architectural pattern?

It decomposes data processing into a linear (or branching) pipeline of filters (independent stages) connected by pipes that pass data downstream.

2. Why replace a monolithic ETL with a Pipes‑and‑Filters pipeline?

• Reduces tangled code and duplication
• Enables reuse by mixing and matching filters
• Scales individual stages, not the whole job
• Isolates failures—one bad stage no longer brings down the entire flow

3. Name the three core elements of this pattern.

1. Filter – self‑contained processing stage
2. Pipe – conduit carrying data between filters
3. Pipeline runner / orchestrator – sets up, schedules, and coordinates the chain
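
A minimal sketch of the three elements in Python, using generators as filters and generator chaining as the pipe; the `Pipeline` class and filter names are illustrative, not from any particular framework.

```python
from typing import Callable, Iterable, Iterator

# A filter is any callable that consumes an input stream and yields an output stream.
Filter = Callable[[Iterator], Iterator]

def drop_blank(items: Iterator[str]) -> Iterator[str]:
    for item in items:
        if item.strip():
            yield item

def to_upper(items: Iterator[str]) -> Iterator[str]:
    for item in items:
        yield item.upper()

class Pipeline:
    """Toy runner: wires filters together; generator chaining acts as the pipe."""

    def __init__(self, *filters: Filter):
        self.filters = filters

    def run(self, source: Iterable) -> Iterator:
        stream: Iterator = iter(source)
        for f in self.filters:
            stream = f(stream)  # each filter wraps the previous one's output
        return stream

if __name__ == "__main__":
    pipeline = Pipeline(drop_blank, to_upper)
    print(list(pipeline.run(["a", "", "b"])))  # ['A', 'B']
```
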
4. Push vs pull dataflow—what’s the difference?

Push: Producer sends data as soon as it’s ready (low latency, needs back‑pressure).
Pull: Consumer requests data when ready (built‑in throttling).
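Both styles in miniature: in the pull model a generator produces items only when the consumer asks; in the push model the producer invokes a callback as soon as data is ready. All names here are illustrative.

```python
from typing import Callable, Iterator

# Pull: the consumer asks for the next item; nothing is produced until requested.
def pull_source() -> Iterator[int]:
    for i in range(3):
        yield i

for item in pull_source():      # consumer controls the pace
    print("pulled", item)

# Push: the producer calls the consumer as soon as data is ready.
def push_source(consumer: Callable[[int], None]) -> None:
    for i in range(3):
        consumer(i)             # producer controls the pace; needs back-pressure at scale

push_source(lambda item: print("pushed", item))
```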

5. How do streaming, micro‑batch, and batch modes differ?

Streaming – event‑at‑a‑time, near real‑time results
Micro‑batch – tiny, frequent batches (seconds)
Batch – large, discrete data sets (hours / days)
The pattern itself is agnostic; choice affects latency & complexity.

6. Stateless vs stateful filter—why prefer stateless?

• Stateless filters are naturally idempotent (same input, same output), easy to parallelize, and need no checkpointing.
• Stateful filters hold running aggregates or joins and require careful state management.
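An illustrative contrast: the stateless filter below can be replayed or parallelized freely, while the stateful one carries a running aggregate that would need checkpointing to survive a restart.

```python
from typing import Iterator

# Stateless: output depends only on the current record; safe to replay or fan out.
def to_cents(amounts: Iterator[float]) -> Iterator[int]:
    for amount in amounts:
        yield round(amount * 100)

# Stateful: the running total is state that must be checkpointed to survive a crash.
def running_total(amounts: Iterator[float]) -> Iterator[float]:
    total = 0.0
    for amount in amounts:
        total += amount
        yield total

print(list(to_cents(iter([1.25, 2.50]))))       # [125, 250]
print(list(running_total(iter([1.25, 2.50]))))  # [1.25, 3.75]
```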

7. What roles do **source** and **sink** filters play?

Sources ingest external data (no inbound pipe); sinks emit side‑effects (DB writes, API calls). In between, aim for pure transformations.

8. How does the pattern enable concurrency and parallelism?

Each filter can run in its own thread, process, container, or node. Bottleneck stages can be scaled horizontally by adding more filter instances reading from the same inbound pipe.
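A sketch of horizontal scaling with the standard library: three identical instances of one filter consume from a shared inbound queue. The worker function and sentinel convention are illustrative.

```python
import queue
import threading

inbound: queue.Queue = queue.Queue()
SENTINEL = None  # tells a worker instance to stop

def filter_worker(worker_id: int) -> None:
    while True:
        item = inbound.get()
        if item is SENTINEL:
            break
        print(f"worker {worker_id} processed {item * 2}")

# Scale the bottleneck stage by adding instances on the same pipe.
workers = [threading.Thread(target=filter_worker, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for item in range(10):
    inbound.put(item)
for _ in workers:
    inbound.put(SENTINEL)
for w in workers:
    w.join()
```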

9. Define back‑pressure.

A feedback mechanism that slows producers when consumers lag to prevent unbounded queue growth and crashes.
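With a bounded queue, back-pressure falls out naturally: once the pipe is full, `put()` blocks, throttling the producer to the consumer's pace. A minimal sketch:

```python
import queue
import threading
import time

pipe: queue.Queue = queue.Queue(maxsize=2)  # bounded pipe: at most 2 items in flight

def slow_consumer() -> None:
    while True:
        item = pipe.get()
        if item is None:
            break
        time.sleep(0.1)  # simulate a slow filter
        print("consumed", item)

t = threading.Thread(target=slow_consumer)
t.start()
for i in range(5):
    pipe.put(i)  # blocks once the queue is full: the producer is throttled
    print("produced", i)
pipe.put(None)
t.join()
```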

10. Give two serialization formats suited to multi‑process pipes and a key trade‑off.

Avro / Protobuf (binary, schema‑enforced) – compact & fast but not human‑readable.
JSON Lines (text) – easy to debug, flexible, but verbose and slower to parse.
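A quick look at the JSON Lines side of the trade-off (one JSON object per line, inspectable with ordinary text tools):

```python
import json

records = [{"id": 1, "amount": 1.25}, {"id": 2, "amount": 2.50}]

# JSON Lines: one JSON object per line; easy to inspect, grep, and replay.
encoded = "\n".join(json.dumps(r) for r in records)
print(encoded)

decoded = [json.loads(line) for line in encoded.splitlines()]
assert decoded == records
```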

11. What is a ‘poison pill’ and how is it handled?

A bad record that consistently fails a filter. Route it to a Dead‑Letter Queue (DLQ) after limited retries to keep the pipeline healthy.
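A sketch of the retry-then-DLQ policy; `MAX_RETRIES` and the in-memory list stand in for real retry and queue infrastructure.

```python
from typing import Optional

MAX_RETRIES = 3
dead_letter_queue: list = []  # stands in for a real DLQ topic

def process(record: dict) -> dict:
    if record.get("amount") is None:
        raise ValueError("missing amount")  # this record will always fail
    return {**record, "cents": round(record["amount"] * 100)}

def handle(record: dict) -> Optional[dict]:
    last_error = ""
    for _ in range(MAX_RETRIES):
        try:
            return process(record)
        except ValueError as exc:
            last_error = str(exc)
    # Poison pill: after limited retries, park it in the DLQ and move on.
    dead_letter_queue.append({"record": record, "error": last_error})
    return None

handle({"id": 1, "amount": None})
print(dead_letter_queue)  # the bad record is preserved for inspection and replay
```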

12. Why is observability critical in distributed pipelines?

Stage‑level metrics (throughput, latency, error rate) and end‑to‑end tracing pinpoint bottlenecks and failures across multiple services.
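One way to collect stage-level metrics, sketched with an in-process counter dict standing in for a real metrics backend:

```python
import time

metrics = {"processed": 0, "errors": 0, "total_seconds": 0.0}

def instrumented(record: dict) -> dict:
    start = time.perf_counter()
    try:
        result = {**record, "ok": True}  # stands in for the real transformation
        metrics["processed"] += 1
        return result
    except Exception:
        metrics["errors"] += 1
        raise
    finally:
        metrics["total_seconds"] += time.perf_counter() - start

for i in range(3):
    instrumented({"id": i})
print(metrics)  # throughput, error rate, and latency can be derived per stage
```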

13. Explain checkpointing in stream processing.

Periodic snapshots of filter state and input offsets so the pipeline can restart from the last consistent point after a crash.
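A toy checkpoint loop, assuming a replayable input: state and the input offset are snapshotted together so a restart resumes from the last consistent point. The file name and interval are made up.

```python
import json
import os

CHECKPOINT = "checkpoint.json"
events = [5, 3, 8, 1, 9, 2]  # stands in for a replayable input stream

# Restore the last consistent (offset, state) pair, if any.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        snap = json.load(f)
    offset, total = snap["offset"], snap["total"]
else:
    offset, total = 0, 0

for i in range(offset, len(events)):
    total += events[i]           # stateful filter: running sum
    if (i + 1) % 2 == 0:         # checkpoint every 2 events
        with open(CHECKPOINT, "w") as f:
            json.dump({"offset": i + 1, "total": total}, f)

print("total:", total)
```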

14. How do idempotent filters aid exactly‑once semantics?

If re‑processing the same input doesn’t change the final state or duplicate side‑effects, retries and replays become safe.
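Idempotency in miniature: keying writes by record id means a replayed record overwrites itself rather than duplicating a side-effect. The `store` dict stands in for an upsert-capable table.

```python
store: dict = {}  # stands in for a keyed table / upsert target

def idempotent_sink(record: dict) -> None:
    # Upsert by key: replaying the same record leaves the store unchanged.
    store[record["id"]] = record

record = {"id": "order-42", "amount": 1.25}
idempotent_sink(record)
idempotent_sink(record)  # replay after a retry: no duplicate row
print(store)             # {'order-42': {'id': 'order-42', 'amount': 1.25}}
```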

15. List two popular frameworks that implement Pipes‑and‑Filters.

Apache Beam (Dataflow, Flink runners, etc.)
Kafka Streams
(Flink, Spark Structured Streaming, and cloud ETL services also follow the pattern.)

16. Name three performance tuning knobs for a pipeline.

• Pipe buffer size
• Filter parallelism level
• Window size / trigger policy in stateful operators

17. What security measures should accompany inter‑service pipes?

• Encrypt data in transit (TLS)
• Encrypt data at rest (spill files, checkpoints)
• Apply PII masking/anonymization early in the pipeline

18. Describe the ‘chatty pipes’ anti‑pattern.

Filters emit excessively small messages, causing high serialization and network overhead. Batch or fuse operations to improve throughput.
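One common fix, sketched below: buffer records and emit them in batches so per-message serialization and network overhead are amortized.

```python
from typing import Iterator, List

def batch(items: Iterator[dict], size: int = 100) -> Iterator[List[dict]]:
    """Amortize per-message overhead by sending lists instead of single records."""
    buffer: List[dict] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= size:
            yield buffer
            buffer = []
    if buffer:  # flush the partial final batch
        yield buffer

records = ({"id": i} for i in range(7))
for chunk in batch(records, size=3):
    print(len(chunk), "records per send")  # 3, 3, 1
```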

19. Why can a ‘mega‑filter’ undermine the pattern?

Combining many responsibilities into one stage kills modularity, reuse, and fine‑grained scaling—reverting to a mini‑monolith.

20. How does schema evolution relate to contract coupling?

Rigid, unversioned schemas break downstream filters on change. Versioned schemas with backward compatibility prevent fragile coupling.
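A minimal illustration of backward compatibility: the consumer defaults fields added in a later schema version, so v1 and v2 messages both parse. Field names are invented.

```python
# v1 producers omit "currency"; v2 producers include it.
v1_message = {"schema_version": 1, "id": 7, "amount": 9.99}
v2_message = {"schema_version": 2, "id": 8, "amount": 4.50, "currency": "EUR"}

def parse(message: dict) -> dict:
    return {
        "id": message["id"],
        "amount": message["amount"],
        # Backward-compatible read: default the new field instead of breaking on v1.
        "currency": message.get("currency", "USD"),
    }

print(parse(v1_message))  # {'id': 7, 'amount': 9.99, 'currency': 'USD'}
print(parse(v2_message))  # {'id': 8, 'amount': 4.5, 'currency': 'EUR'}
```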

21. What is a Dead‑Letter Queue (DLQ)?

A separate pipe or topic that stores messages the pipeline failed to process after retries, allowing offline inspection and replay.

22. Give an example of fallback logic in a filter.

If an enrichment API is down, the filter inserts default values or tags the record for later re‑enrichment instead of blocking the whole pipeline.
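That fallback, sketched with a hypothetical `enrich_via_api` standing in for the flaky dependency:

```python
def enrich_via_api(record: dict) -> dict:
    raise ConnectionError("enrichment service is down")  # simulate an outage

def enrich(record: dict) -> dict:
    try:
        return enrich_via_api(record)
    except ConnectionError:
        # Fallback: defaults plus a flag so the record can be re-enriched later.
        return {**record, "segment": "unknown", "needs_reenrichment": True}

print(enrich({"id": 1}))  # {'id': 1, 'segment': 'unknown', 'needs_reenrichment': True}
```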

23. How does operator fusion improve performance in engines like Flink?

Adjacent filters are combined into one physical task to avoid unnecessary serialization and context switching, boosting throughput.
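The same idea at toy scale: two logical filters fused into a single pass, so intermediate values never cross a pipe or get re-serialized.

```python
from typing import Iterator

def parse(lines: Iterator[str]) -> Iterator[int]:
    for line in lines:
        yield int(line)

def square(nums: Iterator[int]) -> Iterator[int]:
    for n in nums:
        yield n * n

# Fused operator: one physical pass, no intermediate hand-off between stages.
def parse_and_square(lines: Iterator[str]) -> Iterator[int]:
    for line in lines:
        yield int(line) ** 2

assert list(square(parse(iter(["2", "3"])))) == list(parse_and_square(iter(["2", "3"])))
print("fused output:", list(parse_and_square(iter(["2", "3"]))))  # [4, 9]
```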

24. What metrics signal that a stage has become a bottleneck?

Rising input queue depth, higher per‑item latency, and lower throughput compared to upstream stages.

25. Explain vertical vs horizontal scaling **within** a pipeline context.

Vertical: Give a slow filter more CPU/RAM on one node.
Horizontal: Add more parallel instances of that filter, each consuming from the shared pipe.

26. What design principle guides the granularity of filters?

Single Responsibility: each filter should perform one conceptual transformation to maximize reuse and testability.
