🖥️Data Stream

Architecture Description

This architecture aims to build a large-scale data stream processing platform with both real-time and batch processing capabilities.

Kafka

As a message queue system, Kafka receives and buffers real-time data from the various data sources.

Data sources can be user operation logs, sensor data, or business system events.
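
For illustration, the sketch below shows how a data source might publish one such event into Kafka with the standard Java producer API. The broker address (localhost:9092), the topic name (user-events), and the JSON event shape are assumptions made for this example, not fixed parts of the architecture.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserEventProducer {
    public static void main(String[] args) {
        // Broker address and topic name are placeholders for this sketch.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A user operation log encoded as JSON; the schema is illustrative only.
            String event = "{\"userId\":\"u-42\",\"action\":\"click\",\"ts\":1700000000000}";
            producer.send(new ProducerRecord<>("user-events", "u-42", event));
            producer.flush();
        }
    }
}
```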

Flink

• Data in Kafka is processed in real time by Flink, which performs filtering, aggregation, and complex event processing (CEP).

• Flink jobs continuously consume data from Kafka topics and write the processed results to downstream storage, as in the sketch below.
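
As a rough illustration of this streaming path, the following sketch consumes the hypothetical user-events topic through the Flink Kafka connector (KafkaSource API, Flink 1.15+ style), filters click events, and counts them per user in one-minute windows. The topic name, group id, and naive JSON handling are assumptions; a real job would use a proper JSON deserializer and write to an actual sink rather than print().

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ClickCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consume raw JSON events from the (hypothetical) user-events topic.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("user-events")
                .setGroupId("flink-stream")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-user-events");

        // Filter to click events and count them per user in 1-minute tumbling windows.
        events.filter(json -> json.contains("\"action\":\"click\""))
                .map(json -> Tuple2.of(extractUserId(json), 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(value -> value.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print(); // stand-in for a real downstream sink (e.g., ClickHouse)

        env.execute("click-count");
    }

    // Naive extraction for the sketch; a real job would use a JSON library.
    private static String extractUserId(String json) {
        int start = json.indexOf("\"userId\":\"") + 10;
        return json.substring(start, json.indexOf('"', start));
    }
}
```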

Spark

• Used for batch analysis of historical data or large-scale offline data.

• Spark reads historical data from data stores (such as HDFS or S3) and performs large-scale computation and machine learning tasks; a minimal batch job is sketched below.
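
The sketch reads Parquet event data from HDFS (an s3a:// path would work the same way), aggregates daily action counts, and writes the result back to the data store for later loading into ClickHouse. The paths and the (userId, action, ts) schema are assumptions made for illustration.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_date;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DailyActionCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-action-counts")
                .getOrCreate();

        // Path and schema (userId, action, ts as a timestamp column) are
        // assumptions for this sketch; an S3 path would work the same way.
        Dataset<Row> events = spark.read().parquet("hdfs:///warehouse/events/");

        // Daily count of each action type.
        Dataset<Row> daily = events
                .withColumn("day", to_date(col("ts")))
                .groupBy(col("day"), col("action"))
                .count();

        // Write the aggregate back to the data store for later loading into ClickHouse.
        daily.write().mode("overwrite").parquet("hdfs:///warehouse/daily_action_counts/");

        spark.stop();
    }
}
```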

ClickHouse

• As a high-performance columnar database, ClickHouse stores and serves queries over the processed aggregate data and historical analysis results.

• ClickHouse supports real-time data ingestion and complex analytical queries, making it suitable for building BI reports and data analysis systems (see the sketch below).
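
As a sketch of how aggregated results could be stored and queried, the example below uses the ClickHouse JDBC driver to create a MergeTree table, insert one aggregated row, and run the kind of query a BI dashboard would issue. The table name, columns, and the default HTTP port 8123 are assumptions for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClickHouseReporting {
    public static void main(String[] args) throws Exception {
        // Assumes the ClickHouse JDBC driver is on the classpath and a local
        // server is listening on the default HTTP port 8123.
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")) {

            try (Statement stmt = conn.createStatement()) {
                // Aggregated results land in a MergeTree table ordered by (day, action).
                stmt.execute(
                    "CREATE TABLE IF NOT EXISTS daily_action_counts (" +
                    "  day Date, action String, cnt UInt64" +
                    ") ENGINE = MergeTree() ORDER BY (day, action)");
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO daily_action_counts (day, action, cnt) VALUES (?, ?, ?)")) {
                ps.setDate(1, java.sql.Date.valueOf("2024-01-01"));
                ps.setString(2, "click");
                ps.setLong(3, 1234L);
                ps.executeUpdate();
            }

            // The same kind of query a BI dashboard would issue against the table.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT action, sum(cnt) AS total FROM daily_action_counts GROUP BY action ORDER BY total DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getString("action") + " -> " + rs.getLong("total"));
                }
            }
        }
    }
}
```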

Other components

ZooKeeper: Coordinates Kafka partition metadata and distributed task management.

Connector/ETL: Imports data from the different sources into Kafka or directly into Flink/Spark.

Dashboard or BI tool: The user interface for visualizing and analyzing the data stored in ClickHouse.
