Question: The report is generated only once per day and processes 2 TB of data per day. Would you use batch or streaming?
Answer: Batch, for the following reasons.
- Latency requirements
  - The business doesn't require real-time insights.
  - Streaming systems are designed for low-latency processing, which would be unnecessary complexity here.
- Cost efficiency
  - Streaming pipelines run continuously: compute resources must stay active all the time, which increases cloud costs.
  - Batch processing runs only when the job is scheduled.
- Data volume considerations
  - Both streaming and batch can handle 2 TB, but streaming requires maintaining state and handling continuous processing, which increases operational complexity.
  - Batch processing allows heavy transformations (for example, large joins and aggregations) over the complete daily dataset.
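The cost argument can be made concrete with a rough back-of-the-envelope calculation. The sketch below uses made-up numbers: the hourly rate and the two-hour batch runtime are assumptions for illustration, and it simplifies by assuming both approaches use the same cluster size.

```python
def monthly_compute_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Cost of running a cluster for a fixed number of hours every day."""
    return hourly_rate * hours_per_day * days


# Hypothetical figures, for illustration only.
HOURLY_RATE = 10.0        # assumed cluster price per hour
BATCH_HOURS_PER_DAY = 2   # assumed runtime of the daily batch job

# A streaming pipeline stays up 24 hours a day; the batch job only while it runs.
streaming_cost = monthly_compute_cost(HOURLY_RATE, 24)
batch_cost = monthly_compute_cost(HOURLY_RATE, BATCH_HOURS_PER_DAY)

print(f"streaming: {streaming_cost:.0f}, batch: {batch_cost:.0f}")
```

With these assumed numbers the always-on pipeline costs 12 times as much per month. In practice a streaming cluster may be smaller than a batch cluster, so the real ratio depends on the workload.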
Example Architecture of Batch System:
ERP DB -> Data Ingestion -> Cloud Storage / Data Lake -> Spark or Dataflow batch job -> Data Warehouse -> BI report
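As a miniature picture of the batch job step in this flow, the sketch below aggregates one day's records in a single scheduled run. Plain Python stands in for Spark or Dataflow here, and the record fields (`order_date`, `region`, `amount`) are hypothetical names chosen for the example.

```python
from collections import defaultdict
from datetime import date


def run_daily_batch(records, run_date):
    """Aggregate one day's records into per-region totals."""
    totals = defaultdict(float)
    for rec in records:
        # Process only the partition for the scheduled day.
        if rec["order_date"] == run_date:
            totals[rec["region"]] += rec["amount"]
    return dict(totals)


# Hypothetical extract from the data lake.
records = [
    {"order_date": date(2024, 1, 1), "region": "EU", "amount": 120.0},
    {"order_date": date(2024, 1, 1), "region": "EU", "amount": 80.0},
    {"order_date": date(2024, 1, 1), "region": "US", "amount": 50.0},
    {"order_date": date(2024, 1, 2), "region": "US", "amount": 999.0},
]

# The result would be loaded into the data warehouse for the BI report.
daily_totals = run_daily_batch(records, date(2024, 1, 1))
print(daily_totals)
```

The job sees the whole day's data at once and keeps nothing in memory between runs, which is what makes the batch design simple to operate.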
Example Architecture of Streaming System:
Login Event -> Pub/Sub or Kafka -> Stream Processing Engine (Dataflow / Flink) -> Real-time detection logic -> Alert / Security system
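The "real-time detection logic" step could look something like the sketch below, which flags a user after several failed logins within a short time window. This is an illustrative rule rather than a specific product's logic: the class name, the event fields (`user`, `ts`, `success`), and the threshold of 3 failures in 60 seconds are all assumptions.

```python
from collections import defaultdict, deque


class FailedLoginDetector:
    """Flags a user once `threshold` failed logins fall within `window_seconds`."""

    def __init__(self, threshold=3, window_seconds=60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # per-user timestamps of recent failures

    def process(self, event):
        """Handle one login event; return True if it should raise an alert."""
        if event["success"]:
            return False
        window = self._failures[event["user"]]
        window.append(event["ts"])
        # Drop failures that have fallen out of the time window.
        while window and window[0] <= event["ts"] - self.window_seconds:
            window.popleft()
        return len(window) >= self.threshold


detector = FailedLoginDetector()
events = [
    {"user": "alice", "ts": 0, "success": False},
    {"user": "alice", "ts": 10, "success": False},
    {"user": "alice", "ts": 20, "success": False},   # third failure within 60 s
    {"user": "alice", "ts": 200, "success": False},  # earlier failures have expired
]
alerts = [detector.process(e) for e in events]
print(alerts)
```

Unlike a batch job, the detector handles events one at a time as they arrive and has to remember recent failures between events.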
Streaming System Data Volume Limit
Streaming systems do not have a strict maximum dataset size. What matters is the event rate, not the total dataset size, because streaming systems are designed around event throughput. For example:
100 events/sec
10,000 events/sec
1,000,000 events/sec
The architecture scales horizontally to handle higher event rates (by adding more servers).
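To relate a daily volume to an event rate, divide by the average event size and by the number of seconds in a day. The sketch below does this for 2 TB per day. The 1 KB average event size is an assumption, and the result is an average: peak rates may be considerably higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(bytes_per_day: float, avg_event_bytes: float) -> float:
    """Convert a daily data volume into an average event rate."""
    return bytes_per_day / avg_event_bytes / SECONDS_PER_DAY


# 2 TB per day, assuming an average event size of 1 KB (hypothetical).
rate = average_events_per_second(2 * 10**12, 1_000)
print(f"about {rate:,.0f} events/sec")
```

Under that assumption, 2 TB per day works out to roughly 23,000 events per second on average, which falls between the 10,000 and 1,000,000 events/sec examples above.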
The real challenge in streaming is state management. Many streaming jobs need to remember previous events, for example in fraud detection, session tracking, and user behavior aggregation. Querying an external database for every event would add too much latency, so streaming engines keep this state in local storage. As a result, state growth can become the biggest bottleneck in a streaming pipeline.
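A minimal sketch of why state grows: a per-user aggregation has to keep one entry for every user it has ever seen. The class and field names below are hypothetical, and an in-memory dictionary stands in for an engine's local state store.

```python
from collections import defaultdict


class UserSpendAggregator:
    """Keeps a running total per user in local (in-process) state."""

    def __init__(self):
        self.state = defaultdict(float)  # user_id -> running total

    def process(self, user_id: str, amount: float) -> float:
        # No external database call per event: read and update local state only.
        self.state[user_id] += amount
        return self.state[user_id]


agg = UserSpendAggregator()
for user_id, amount in [("u1", 10.0), ("u2", 5.0), ("u1", 2.5), ("u3", 1.0)]:
    agg.process(user_id, amount)

# State size grows with the number of distinct keys, not with the number of events.
print(len(agg.state), dict(agg.state))
```

With millions of distinct users, and more data kept per user than a single number, this state can outgrow memory, which is why engines back it with dedicated state storage.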
For example, Apache Flink can use an embedded RocksDB state backend, Spark Structured Streaming uses state stores, and Google Cloud Dataflow manages state storage for the job. Databases can still be used in streaming pipelines, for instance for reference data lookups: a streaming fraud detection system may check a list of blocked credit cards, and that list may be stored in a database or a cache such as Redis.
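One way to do such a reference lookup without querying the store on every event is to keep a local copy and refresh it periodically. The sketch below assumes that approach. `fetch_blocked_cards` is a hypothetical callable standing in for the real database or Redis query, and the 300-second refresh interval is an arbitrary choice.

```python
class BlockedCardLookup:
    """Caches a reference list locally and reloads it every `refresh_seconds`."""

    def __init__(self, fetch_blocked_cards, refresh_seconds=300):
        self._fetch = fetch_blocked_cards  # hypothetical wrapper around the DB/cache query
        self._refresh_seconds = refresh_seconds
        self._cached = frozenset()
        self._loaded_at = None

    def is_blocked(self, card_number: str, now: float) -> bool:
        # Reload on first use, or when the cached copy is older than the interval.
        if self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds:
            self._cached = frozenset(self._fetch())
            self._loaded_at = now
        return card_number in self._cached


fetch_calls = []


def fake_fetch():
    """Stand-in for the external store; records each call."""
    fetch_calls.append(1)
    return ["4111-0000", "4222-0000"]


lookup = BlockedCardLookup(fake_fetch)
results = [
    lookup.is_blocked("4111-0000", now=0),    # first call loads the list
    lookup.is_blocked("4999-0000", now=10),   # served from the local copy
    lookup.is_blocked("4222-0000", now=400),  # copy is stale, so it reloads
]
print(results, len(fetch_calls))
```

The trade-off is freshness: a card blocked after the last refresh is not seen until the next reload, so the interval should match how quickly the list changes.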