DE_2. Lambda VS Kappa Architecture

Lambda Architecture

Batch Layer (Spark engine): Stores the master dataset (all historical data). Periodically recomputes views from scratch to ensure data correctness.

For example, user click events are collected continuously and used to compute user preference scores. Later, the company improves the recommendation algorithm. The new algorithm must be applied to all historical events, not just new ones, so the dataset has to be recomputed from the beginning.
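The recomputation in this example can be sketched in a few lines of plain Python. This is only an illustration: the event fields, the two scoring functions (`score_v1`, `score_v2`), and the "books count double" rule are hypothetical and not taken from any real system.

```python
from collections import defaultdict

# Hypothetical master dataset: an append-only list of raw click events.
MASTER_DATASET = [
    {"user": "u1", "category": "books"},
    {"user": "u1", "category": "books"},
    {"user": "u1", "category": "games"},
    {"user": "u2", "category": "games"},
]

def score_v1(events):
    """Old algorithm (assumed): one point per click."""
    scores = defaultdict(int)
    for e in events:
        scores[(e["user"], e["category"])] += 1
    return dict(scores)

def score_v2(events):
    """New algorithm (assumed): clicks on 'books' count double."""
    scores = defaultdict(int)
    for e in events:
        weight = 2 if e["category"] == "books" else 1
        scores[(e["user"], e["category"])] += weight
    return dict(scores)

def recompute_batch_view(algorithm):
    # The batch layer does not patch the old view; it rebuilds it
    # from every raw event in the master dataset.
    return algorithm(MASTER_DATASET)

old_view = recompute_batch_view(score_v1)
new_view = recompute_batch_view(score_v2)
```

Because the raw events are kept unchanged, switching algorithms only means running the new function over the same master dataset.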

Speed Layer (Streaming Framework): Produces real-time views. Usually processes only recent data. (Apache Flink, Spark Streaming, Google Cloud Dataflow)

Serving Layer: Merges results from the batch layer and speed layer. Queries are answered using both batch views (accuracy) and speed views (freshness).
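A minimal sketch of the merge step, assuming the metric is an additive count and that the speed view only contains events that arrived after the last batch run. The view contents and the names `merge_views` and `query` are hypothetical.

```python
def merge_views(batch_view, speed_view):
    """Combine per-key counts from the batch view and the speed view."""
    merged = dict(batch_view)  # copy so the batch view is not mutated
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Batch view: accurate, but only up to the last batch run.
batch_view = {"page_a": 1000, "page_b": 250}
# Speed view: fresh, covers only events since the last batch run.
speed_view = {"page_a": 7, "page_c": 3}

def query(key):
    # A query is answered from both views together.
    return merge_views(batch_view, speed_view).get(key, 0)
```

Here `query("page_a")` combines 1000 historical views with 7 recent ones. Non-additive metrics (for example distinct counts) need a different merge rule.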

It tolerates failures in the speed layer because batch recomputation can correct errors in the streaming process. But the same logic has to be implemented and maintained in two pipelines, which doubles development and maintenance complexity -> Kappa is used to simplify this.

*** Database lookup is not part of Lambda architecture but part of data enrichment. Data enrichment adds external context to events. For example, when a credit card transaction event arrives, the system checks whether the card is on a blocked list. The blocked list may be stored in a database or cache. This enrichment step can exist in both Lambda and Kappa architectures. It is independent of the architecture style.
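The enrichment step can be sketched as follows. The in-memory set `BLOCKED_CARDS` stands in for the database or cache, and the event fields and the `enrich` function are hypothetical names chosen for illustration.

```python
# Hypothetical blocked list; in a real system this lookup would hit
# a database or cache rather than an in-memory set.
BLOCKED_CARDS = {"4111-0000-0000-0001"}

def enrich(event, blocked_cards=BLOCKED_CARDS):
    """Return a copy of the event with external context attached."""
    enriched = dict(event)  # leave the raw event untouched
    enriched["card_blocked"] = event["card_number"] in blocked_cards
    return enriched

tx = {"tx_id": 1, "card_number": "4111-0000-0000-0001", "amount": 42.0}
enriched_tx = enrich(tx)
```

The same `enrich` function could be called from a batch job or a streaming job, which is why the step does not depend on the architecture style.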

*** If the speed layer fails temporarily, real-time results may be missing. However, the batch layer will eventually recompute everything correctly. This correct batch result will replace the incomplete speed-layer results. Therefore the system eventually becomes correct [EVENTUAL CORRECTNESS].
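A small sketch of this behavior, assuming a simple page-view count. The outage (two missed events) and all names here are made up for illustration.

```python
# All five events reached the master dataset.
raw_events = ["page_a"] * 5

# The speed layer was down for two of them, so its view undercounts.
speed_view = {"page_a": 3}
batch_view = {}

def run_batch(events):
    """Recompute the view from scratch over all raw events."""
    view = {}
    for page in events:
        view[page] = view.get(page, 0) + 1
    return view

# Before the batch run: the answer is fresh but incomplete.
before = batch_view.get("page_a", 0) + speed_view.get("page_a", 0)

# After the batch run: the batch view replaces the speed-layer results
# for the period it covers, so the speed view is reset.
batch_view = run_batch(raw_events)
speed_view = {}
after = batch_view.get("page_a", 0) + speed_view.get("page_a", 0)
```

The answer is wrong only in the window between the speed-layer failure and the next batch run.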

Kappa Architecture

Uses only a streaming pipeline (no separate batch pipeline). Historical recomputation is done by replaying the event log.

Event Log (Kafka) -> Streaming Engine -> Derived Views

If a bug is found in the processing logic, the fix is deployed and the system reprocesses the event log from the beginning.
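Replay can be sketched with an in-memory list standing in for the event log. The event fields, the `abs()` bug, and the `replay` function are hypothetical; a real system would read from a retained log such as a Kafka topic instead.

```python
# Hypothetical in-memory stand-in for a retained event log.
EVENT_LOG = [
    {"offset": 0, "user": "u1", "amount": 10},
    {"offset": 1, "user": "u2", "amount": 5},
    {"offset": 2, "user": "u1", "amount": -3},  # a refund
]

def buggy_processor(view, event):
    # Bug: abs() turns refunds into positive amounts.
    view[event["user"]] = view.get(event["user"], 0) + abs(event["amount"])

def fixed_processor(view, event):
    view[event["user"]] = view.get(event["user"], 0) + event["amount"]

def replay(log, processor, from_offset=0):
    """Rebuild a derived view by feeding the log through a processor."""
    view = {}
    for event in log:
        if event["offset"] >= from_offset:
            processor(view, event)
    return view

old_view = replay(EVENT_LOG, buggy_processor)  # wrong balances
new_view = replay(EVENT_LOG, fixed_processor)  # rebuilt from offset 0
```

The same streaming code path handles both live processing and recomputation. Only the starting offset changes.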

This works best when the event log can be retained long-term and when the streaming engine is powerful enough to recompute over the full history. This is practical in modern systems because streaming engines have become more powerful.

*** Lambda architecture stores raw data in a batch storage system, while Kappa architecture stores raw data in an event log. Both approaches preserve the full data history; they simply use different mechanisms.

*** In modern cloud data systems, many companies now prefer Kappa-style architectures, especially for event-driven systems.

*** Even in Kappa architectures, companies often store raw data in data lakes as well. This is used for compliance, backup, and analytics workloads. So real systems often combine ideas from both architectures.

Hailey's Data Journey
