DE_4. Google Cloud Data Engineering Stack

Example Architecture

  1. Streaming Architecture: 
    1. Application logs / mobile app events
    2. Pub/Sub (ingestion)
    3. Dataflow streaming pipeline (data processing engine)
    4. BigQuery (analytics warehouse)
    5. BI Dashboard
  2. Batch Architecture: 
    1. ERP Database
    2. Cloud Storage (raw dataset)
    3. Dataproc Spark job (ETL job / ML preprocessing)
    4. BigQuery warehouse
    5. Business Analytics
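
As a rough illustration of the streaming flow above, the sketch below models each stage as a plain Python function. Nothing here calls a Google Cloud API: the deque `topic` stands in for Pub/Sub, `process_event` for a Dataflow step, and the list `warehouse` for a BigQuery table. All names are hypothetical and chosen only for this example.

```python
from collections import Counter, deque

# Stand-ins (assumptions, not real GCP clients):
#   topic     ~ Pub/Sub topic (ingestion buffer)
#   warehouse ~ BigQuery table (a list of row dicts)
topic = deque()
warehouse = []

def publish(event):
    """Application side: push one raw event into the ingestion buffer."""
    topic.append(event)

def process_event(event):
    """Processing step: normalise one event into a warehouse row."""
    return {"user": event["user"], "action": event["action"].lower()}

def run_streaming_step():
    """Drain whatever is currently buffered and append rows to the warehouse."""
    while topic:
        warehouse.append(process_event(topic.popleft()))

def dashboard_counts():
    """BI side: aggregate warehouse rows by action."""
    return Counter(row["action"] for row in warehouse)

publish({"user": "u1", "action": "CLICK"})
publish({"user": "u2", "action": "click"})
publish({"user": "u1", "action": "Purchase"})
run_streaming_step()
counts = dashboard_counts()
print(dict(counts))
```

The batch flow has the same shape; the difference is that the input is a bounded dataset read from storage on a schedule instead of an unbounded buffer drained continuously.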

Beam (Framework) & Spark (Processing Engine)

  • If we use only Spark, without Apache Beam, for data processing, we need to write and maintain two separate jobs:
    • A batch job (e.g. log analysis once a day)
    • A streaming job (e.g. real-time user event analysis)
  • If we use Beam, we write one pipeline and choose where it runs:
    • Beam pipeline (Read -> Transform -> Filter -> Write)
      • Batch runner (Spark)
      • Streaming runner (Dataflow)
    • The same Beam pipeline code can handle both batch and streaming workloads.
    • Only the choice of runner (execution engine) changes.
    • The same code can run on multiple engines (Spark, Flink, Dataflow).
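
The "one pipeline, several runners" idea can be sketched in plain Python. This is a toy model only and does not use the Apache Beam API: `PIPELINE`, `batch_runner`, and `streaming_runner` are hypothetical names, and the latency field and one-second threshold are made up for the example. The point is that the transform and filter steps are written once and reused unchanged by both runners.

```python
from typing import Callable, Iterable, Iterator, List

# A "pipeline" here is an ordered list of steps; each step maps an iterable
# of records to an iterable of records. Toy model, NOT the Apache Beam API.
Step = Callable[[Iterable[dict]], Iterable[dict]]

def transform(records):
    """Transform: add a derived field to every record."""
    for r in records:
        yield {**r, "latency_s": r["latency_ms"] / 1000}

def keep_slow(records):
    """Filter: keep only records slower than one second."""
    return (r for r in records if r["latency_s"] > 1.0)

PIPELINE: List[Step] = [transform, keep_slow]

def apply_pipeline(records):
    """Chain every step of the shared pipeline definition."""
    for step in PIPELINE:
        records = step(records)
    return records

def batch_runner(dataset: List[dict]) -> List[dict]:
    """Hypothetical batch runner: process a bounded dataset in one go."""
    return list(apply_pipeline(dataset))

def streaming_runner(source: Iterator[dict]) -> Iterator[dict]:
    """Hypothetical streaming runner: emit results one record at a time."""
    for record in source:
        yield from apply_pipeline([record])

logs = [
    {"id": 1, "latency_ms": 250},
    {"id": 2, "latency_ms": 1800},
    {"id": 3, "latency_ms": 3100},
]

batch_out = batch_runner(logs)
stream_out = list(streaming_runner(iter(logs)))
print(batch_out == stream_out)
```

In actual Beam, the runner is typically selected through pipeline options rather than in code, for example `--runner=DirectRunner`, `--runner=SparkRunner`, `--runner=FlinkRunner`, or `--runner=DataflowRunner`, so the pipeline definition itself stays the same.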
