Data Stream
Last updated
This architecture describes a large-scale data stream processing platform that supports both real-time and batch processing.
• Kafka serves as the message queue, receiving and buffering real-time data from various data sources.
• Data sources can be user operation logs, sensor data, or business system events.
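To make the ingestion path concrete, here is a minimal sketch of producing such an event into Kafka. The event schema, topic name, and broker address are assumptions for illustration, not part of the original design.

```python
import json
import time


def encode_event(user_id, action, ts=None):
    """Serialize a user-action event to JSON bytes for a Kafka topic.
    The field names here are illustrative assumptions."""
    return json.dumps({
        "user_id": user_id,
        "action": action,
        "ts": ts if ts is not None else time.time(),
    }).encode("utf-8")


# With the kafka-python package installed and a broker running
# (hypothetical address and topic), the producer side would look like:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("user-events", encode_event("u42", "click"))
#   producer.flush()
```

Operation logs, sensor readings, or business events would each get their own topic and schema; the serialization step is the same shape in every case.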
• Flink processes data from Kafka in real time, performing filtering, aggregation, and complex event processing (CEP).
• Flink jobs continuously consume data from Kafka topics and write the processed results to downstream storage.
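The kind of keyed aggregation a Flink job runs continuously can be sketched in plain Python. This is not Flink API code; it only illustrates the tumbling-window counting logic that such a job would apply to the Kafka stream:

```python
from collections import defaultdict


def tumbling_window_counts(events, window_secs=60):
    """Count events per key in fixed (tumbling) time windows.

    `events` is an iterable of (timestamp, key) pairs; the result maps
    (window_start, key) -> count. A real Flink job would compute this
    incrementally over an unbounded stream with watermarks and state,
    rather than over a finished list.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = int(ts // window_secs) * window_secs
        counts[(window_start, key)] += 1
    return dict(counts)
```

In Flink proper this corresponds to `keyBy(...)` followed by a tumbling event-time window and an aggregate function, with the results sunk to downstream storage.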
• Spark handles batch analysis of historical data and large-scale offline datasets.
• Spark reads historical data from data stores (such as HDFS, S3) and performs big data computing and machine learning tasks.
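The shape of such a batch job, reduced to its core, is a full-dataset group-and-aggregate. The function below shows that logic in plain Python; the commented PySpark version underneath is a hedged sketch with hypothetical paths and column names:

```python
from collections import defaultdict


def batch_average(records):
    """Average a metric per key over a complete historical dataset.

    `records` is an iterable of (key, value) pairs. This is the in-memory
    analogue of the grouped aggregation Spark would distribute across a
    cluster for data that does not fit on one machine.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for key, value in records:
        totals[key][0] += value
        totals[key][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


# An equivalent PySpark job (paths and column names are assumptions):
#
#   from pyspark.sql import SparkSession, functions as F
#   spark = SparkSession.builder.appName("batch-analysis").getOrCreate()
#   df = spark.read.parquet("s3://bucket/events/")   # or an HDFS path
#   df.groupBy("key").agg(F.avg("value")).write.parquet("s3://bucket/out/")
```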
• As a high-performance columnar storage, ClickHouse is used to store and query processed aggregated or historical analysis data.
• ClickHouse supports real-time data insertion and complex analytical queries, making it well suited for BI reports and data analysis systems.
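A serving table for the aggregated results might look like the following DDL. The table and column names are illustrative assumptions; only the MergeTree engine and sort key reflect standard ClickHouse practice for this kind of workload:

```sql
-- Illustrative table for per-minute aggregates written by the stream jobs.
CREATE TABLE IF NOT EXISTS metrics_1m
(
    window_start DateTime,
    metric_key   String,
    event_count  UInt64
)
ENGINE = MergeTree
ORDER BY (metric_key, window_start);
```

Sorting by `(metric_key, window_start)` keeps rows for one key contiguous on disk, which is what makes the typical BI query (one key over a time range) fast.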
• Zookeeper: coordinates Kafka partition metadata and distributed task management.
• Connectors/ETL: import data from external sources into Kafka or directly into Flink/Spark.
• Dashboard or BI tool: the user-facing layer for visualizing and analyzing data stored in ClickHouse.