🖥️ Data Stream

Architecture Description
This architecture describes a large-scale data stream processing platform that supports both real-time and batch processing.
Kafka
• Kafka serves as the message queue: it receives and buffers real-time data from the various data sources.
• Data sources can include user operation logs, sensor data, and business system events.
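
The buffering role described above can be illustrated with a small in-memory sketch. This is not the Kafka client API: `TopicSketch` and its methods are hypothetical names used for illustration, and CRC32 stands in for Kafka's own key hashing. It only shows the idea of append-only partitions, key-based routing, and per-consumer-group read offsets.

```python
import zlib
from collections import defaultdict


class TopicSketch:
    """Toy in-memory stand-in for a Kafka topic (hypothetical, not the Kafka API)."""

    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        # Each partition is an append-only list of (key, value) records.
        self.partitions = [[] for _ in range(num_partitions)]
        # Next offset to read, tracked per (consumer group, partition).
        self.offsets = defaultdict(int)

    def partition_for(self, key):
        # Assumption: CRC32 is only a deterministic stand-in for Kafka's partitioner.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def produce(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def poll(self, group, partition, max_records=10):
        """Return unread records for a consumer group and advance its offset."""
        start = self.offsets[(group, partition)]
        batch = self.partitions[partition][start:start + max_records]
        self.offsets[(group, partition)] = start + len(batch)
        return batch


topic = TopicSketch(num_partitions=3)
p, first_offset = topic.produce("user-1", {"event": "click"})
_, second_offset = topic.produce("user-1", {"event": "view"})
records = topic.poll("flink-job", p)
```

Records with the same key land in the same partition, so their order is preserved, and each consumer group reads the buffer at its own pace because offsets are tracked per group.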
Flink
• Flink processes the data in Kafka in real time, performing filtering, aggregation, and complex event processing (CEP).
• Flink jobs continuously consume data from Kafka topics and write the processed results to downstream storage.
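
As a rough illustration of the filtering and aggregation step, the sketch below counts events per key in fixed (tumbling) time windows. It is plain Python, not Flink code: the function name `tumbling_counts`, the event fields `ts`, `key`, and `value`, and the threshold `min_value` are all assumptions made for the example. It also assumes events arrive in non-decreasing timestamp order, which a real stream processor does not require.

```python
def tumbling_counts(stream, window_ms=60_000, min_value=0):
    """Yield (window_start, {key: count}) each time a window closes.

    Assumes event timestamps (milliseconds) are non-decreasing.
    """
    current_start = None
    counts = {}
    for event in stream:
        # Filtering: drop events below the threshold.
        if event["value"] < min_value:
            continue
        # Align the timestamp to the start of its window.
        start = event["ts"] - event["ts"] % window_ms
        if start != current_start:
            # A new window began, so emit the finished one.
            if counts:
                yield current_start, counts
            current_start, counts = start, {}
        # Aggregation: count events per key inside the window.
        counts[event["key"]] = counts.get(event["key"], 0) + 1
    # Flush the last open window when the input ends.
    if counts:
        yield current_start, counts


events = [
    {"ts": 1_000, "key": "a", "value": 5},
    {"ts": 2_000, "key": "a", "value": -1},
    {"ts": 59_999, "key": "b", "value": 1},
    {"ts": 60_000, "key": "a", "value": 2},
]
results = list(tumbling_counts(events))
# results == [(0, {"a": 1, "b": 1}), (60000, {"a": 1})]
```

Each emitted pair corresponds to one result row that a job of this kind could write to downstream storage.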
Spark
• Spark is used for batch analysis of historical data and other large-scale offline data.
• It reads historical data from data stores such as HDFS or S3 and runs large-scale computations and machine learning tasks.
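
The batch side can be pictured as a two-phase computation: each partition of the historical data is aggregated independently, then the partial results are merged. The sketch below shows that pattern in plain Python rather than with the Spark API. The function names and the record fields `user` and `amount` are hypothetical.

```python
from functools import reduce


def partial_sums(partition):
    """Aggregate one partition independently (the parallelizable step)."""
    totals = {}
    for record in partition:
        totals[record["user"]] = totals.get(record["user"], 0) + record["amount"]
    return totals


def merge(a, b):
    """Combine two partial results into one."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


def batch_totals(partitions):
    """Run the per-partition step, then merge all partial results."""
    return reduce(merge, map(partial_sums, partitions), {})


# Assumption: each inner list stands in for one file or block read from HDFS or S3.
partitions = [
    [{"user": "u1", "amount": 10}, {"user": "u2", "amount": 5}],
    [{"user": "u1", "amount": 7}],
]
totals = batch_totals(partitions)
# totals == {"u1": 17, "u2": 5}
```

Because the per-partition step needs no shared state, it is the part a cluster can spread across many workers.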
ClickHouse
• ClickHouse is a high-performance columnar store, used here to store and query the processed aggregates and historical analysis data.
• It supports real-time data insertion and complex queries, which makes it suitable for building BI reports and data analysis systems.
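
To show why a columnar layout suits analytical queries, here is a minimal sketch of a column-oriented table in plain Python. It is not ClickHouse code or its client API: `ColumnTable`, `insert`, and `sum_by` are hypothetical names. The point is that an aggregate over two columns reads only those two columns, no matter how many others the table has.

```python
class ColumnTable:
    """Toy column-oriented table: one Python list per column (hypothetical)."""

    def __init__(self, columns):
        self.columns = {name: [] for name in columns}

    def insert(self, rows):
        """Append rows by splitting each one across the column lists."""
        for row in rows:
            for name, values in self.columns.items():
                values.append(row[name])

    def sum_by(self, group_col, value_col):
        """Group-by sum that touches only the two columns it needs."""
        totals = {}
        for group, value in zip(self.columns[group_col], self.columns[value_col]):
            totals[group] = totals.get(group, 0) + value
        return totals


table = ColumnTable(["window_start", "key", "cnt"])
table.insert([
    {"window_start": 0, "key": "a", "cnt": 1},
    {"window_start": 0, "key": "b", "cnt": 1},
    {"window_start": 60_000, "key": "a", "cnt": 1},
])
per_key = table.sum_by("key", "cnt")
# per_key == {"a": 2, "b": 1}
```

The rows inserted here mirror the kind of windowed aggregates a stream job might produce. A BI report over them would be a query of the `sum_by` shape.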
Other components
• ZooKeeper: Coordinates Kafka brokers and partitions and handles distributed task management.
• Connector/ETL: Imports data from different sources into Kafka or Flink/Spark.
• Dashboard or BI tool: The user interface for visualizing and analyzing the data in ClickHouse.