DE_3. Data Lake vs. Data Warehouse

Data Lake

  • Storage system designed to hold large volumes of raw data.
  • Stores data in its original format (structured, semi-structured, unstructured)
    • JSON, CSV, Logs, Images, Videos
  • Massive Scalability & Low storage cost 
  • The storage layer itself does not provide native SQL query capabilities
    • To analyze data in cloud storage, you must attach a separate compute engine (e.g. Spark, Dataflow)
  • SCHEMA-ON-READ: structure is applied when the data is read, not when it is stored
  • Amazon S3, Google Cloud Storage, Azure Data Lake Storage, HDFS
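Schema-on-read can be illustrated with a minimal sketch. It assumes raw events were landed as JSON lines with inconsistent fields; the in-memory `RAW_EVENTS` buffer, the field names, and the `read_with_schema` helper are all hypothetical stand-ins for files in object storage and a real compute engine.

```python
import io
import json

# Hypothetical raw events, stored exactly as they arrived.
# Records do not share the same fields or types.
RAW_EVENTS = io.StringIO(
    '{"user_id": "a", "amount": "12.5"}\n'
    '{"user_id": "b", "amount": 3, "coupon": "X1"}\n'
    '{"user_id": "c"}\n'
)

def read_with_schema(lines):
    """Apply a schema at read time: select fields, cast types, default missing values."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            "user_id": str(record["user_id"]),
            "amount": float(record.get("amount", 0.0)),
        })
    return rows

rows = read_with_schema(RAW_EVENTS)
total = sum(row["amount"] for row in rows)
print(total)
```

Nothing was validated when the records were written; a different reader could apply a different schema (for example, one that keeps `coupon`) to the same raw data.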

Data Warehouse

  • Optimized for analytics and querying
  • Data is typically cleaned, structured, and transformed before loading.
  • Supports fast queries across large datasets.
  • SCHEMA-ON-WRITE: data must match a predefined schema when it is loaded
  • BigQuery, Snowflake, Amazon Redshift, Azure Synapse Analytics
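Schema-on-write can be sketched in the same spirit. This example uses an in-memory SQLite database purely as a stand-in for a warehouse (an assumption for illustration; real warehouses differ in how they enforce types and constraints). The `sales` table, its columns, and the `load` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front; rows that violate it are refused at load time.
conn.execute(
    "CREATE TABLE sales ("
    " user_id TEXT NOT NULL,"
    " amount REAL NOT NULL CHECK (amount >= 0)"
    ")"
)

def load(rows):
    """Insert rows one by one, counting those the schema accepts and rejects."""
    loaded, rejected = 0, 0
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO sales (user_id, amount) VALUES (?, ?)",
                (row.get("user_id"), row.get("amount")),
            )
            loaded += 1
        except sqlite3.IntegrityError:
            rejected += 1
    return loaded, rejected

loaded, rejected = load([
    {"user_id": "a", "amount": 12.5},
    {"user_id": "b", "amount": 3.0},
    {"user_id": "c"},                   # missing amount violates NOT NULL
    {"user_id": "d", "amount": -1.0},   # negative amount violates CHECK
])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(loaded, rejected, total)
```

Because bad rows are rejected before they land, queries against the table can rely on its structure without re-validating every record.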

Lakehouse

  • Combines the low-cost, open-format storage of a data lake with warehouse features such as ACID transactions and schema enforcement
  • Delta Lake, Apache Iceberg, Apache Hudi

