# Data Stream

<figure><img src="https://671263767-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fg7wr5wNTWT2nhsQnxVvd%2Fuploads%2F2C2oPZPjV4EBOZ1TLkQ9%2Fimage.png?alt=media&#x26;token=fe3fc90b-47c5-4d93-8d19-7b93ebe10911" alt=""><figcaption></figcaption></figure>

## Architecture Description

This architecture describes a **large-scale data stream processing platform** that supports both **real-time** and **batch processing**.

### Kafka

• **As a message queue system**, Kafka **receives and buffers real-time data** from various data sources.

• **Data sources** can be user operation logs, sensor data, or business system events.
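The buffering role described above can be illustrated with a toy in-memory model. This is only a sketch of the idea (append-only partition logs plus per-consumer-group offsets), not the Kafka client API: `ToyTopic`, `produce`, and `poll` are hypothetical names, and `zlib.crc32` stands in for whatever hash a real partitioner uses.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic: one append-only log per partition."""

    def __init__(self, num_partitions: int = 3):
        self.partitions = [[] for _ in range(num_partitions)]
        # (consumer group, partition) -> next offset to read
        self.offsets = defaultdict(int)

    def produce(self, key: str, value: dict) -> int:
        """Append a record; records with the same key land in the same partition."""
        # Assumption: crc32 is a deterministic placeholder hash for this sketch.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group: str, partition: int, max_records: int = 10) -> list:
        """Return unread records for this group and advance its offset."""
        start = self.offsets[(group, partition)]
        batch = self.partitions[partition][start:start + max_records]
        self.offsets[(group, partition)] = start + len(batch)
        return batch
```

Because each consumer group tracks its own offset, a slow consumer does not block producers, and a second group can read the same buffered records independently.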

### Flink

• Flink processes the data in Kafka in **real time**, performing **filtering, aggregation, and complex event processing (CEP)**.

• Flink jobs continuously **consume data from Kafka topics** and write the processed results to **downstream storage**.
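To make the filtering and aggregation steps concrete, here is a minimal plain-Python sketch of a filter followed by a tumbling-window count. It does not use the Flink API and ignores watermarks, late data, and state backends. The event fields (`ts`, `user`, `type`) and the function name are assumptions made for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60, keep_type="click"):
    """Count events per (window start, user), keeping only one event type.

    Each event is a dict with hypothetical fields:
    'ts' (integer seconds), 'user', and 'type'.
    """
    counts = Counter()
    for event in events:
        # Filtering step: drop events that are not of interest.
        if event["type"] != keep_type:
            continue
        # Assign the event to the fixed-size window that contains its timestamp.
        window_start = event["ts"] - event["ts"] % window_seconds
        # Aggregation step: count per window and key.
        counts[(window_start, event["user"])] += 1
    return dict(counts)
```

In a streaming engine the same logic runs incrementally over an unbounded input, and each window's result is emitted once the window closes rather than after the whole input has been read.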

### Spark

• Spark is used for **batch analysis** of **historical data** or **large-scale offline data**.

• Spark reads historical data from **data stores** (such as HDFS or S3) and performs **large-scale computation and machine learning tasks**.
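The batch side can be pictured as a computation over data that is already split into partitions: each partition is aggregated on its own, and the partial results are then merged. The sketch below shows that pattern in plain Python. It is not the Spark API, and the record field `user` and both function names are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def partial_counts(partition):
    """Aggregate one partition independently (this step could run in parallel)."""
    return Counter(record["user"] for record in partition)


def batch_event_counts(partitions):
    """Merge per-partition partial counts into a single result."""
    partials = map(partial_counts, partitions)
    return dict(reduce(lambda left, right: left + right, partials, Counter()))
```

Splitting the work this way means only the small partial results need to be moved between workers, not the raw records.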

### ClickHouse

• As a **high-performance columnar database**, ClickHouse **stores and serves queries over processed aggregates and historical analysis data**.

• ClickHouse supports **real-time data insertion** and **complex query operations**, and is suitable for building **BI reports** and **data analysis systems**.
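Why a columnar layout suits analytical queries can be shown with a small plain-Python sketch: once rows are stored column by column, an aggregate only has to scan the columns it actually uses. This is a conceptual illustration, not how ClickHouse is implemented. The helper names are hypothetical, and the sketch assumes every row has the same fields.

```python
def to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_by(columns, group_col, value_col):
    """Sum value_col per distinct value of group_col, reading only those two columns."""
    totals = {}
    for group, value in zip(columns[group_col], columns[value_col]):
        totals[group] = totals.get(group, 0) + value
    return totals
```

For example, summing `amount` per `region` never touches a wide `payload` column stored alongside them, which is the main reason such aggregates stay cheap as tables grow wider.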

### Other components

• **ZooKeeper**: Coordinates **Kafka partitions** and supports **distributed task management**.

• **Connector/ETL**: Imports data from **different sources into Kafka, Flink, or Spark**.

• **Dashboard or BI tool**: **User interface** used to **visualize and analyze ClickHouse data**.
