ClickHouse and Friends (12) Magical Materialized Views and Their Implementation
Last updated: 2020-08-31
In ClickHouse, Materialized Views can be described as magical and powerful, with unique applications.
This article analyzes the underlying mechanism to understand how ClickHouse Materialized Views work, enabling better usage.
What is a Materialized View
For most people, the concept of materialized views is quite abstract: materialized? view?…
To better understand it, let’s look at a scenario.
Suppose you’re a “happy” little programmer at *hub. One day, the product manager has a requirement: real-time statistics of hourly video downloads.
User download detail table:
1 | clickhouse> SELECT * FROM download LIMIT 10; |
Calculate hourly downloads:
1 | clickhouse> SELECT toStartOfHour(when) AS hour, userid, count() as downloads, sum(bytes) AS bytes FROM download GROUP BY userid, hour ORDER BY userid, hour; |
Easy, right? But there’s a problem:
Each time you need to compute based on the download table as base data. *hub has too much data, unbearable.
An idea comes to mind: What if we pre-aggregate the download data, save results to a new table download_hour_mv, and update it incrementally in real-time as download grows, then query download_hour_mv each time?
This new table can be considered a materialized view, which in ClickHouse is a regular table.
Creating a Materialized View
1 | clickhouse> CREATE MATERIALIZED VIEW download_hour_mv |
This statement mainly:
- Creates a materialized view
download_hour_mvwithSummingMergeTreeengine - The materialized view’s data source is the
downloadtable, performing corresponding “materialization” operations based on expressions in theselectstatement - Selects a future time (current time is
2020-08-31 18:00:00) as starting pointWHERE when >= toDateTime('2020-09-01 04:00:00'), meaning only data after2020-09-01 04:00:00will be synced todownload_hour_mv
Thus, currently download_hour_mv is an empty table:
1 | clickhouse> SELECT * FROM download_hour_mv ORDER BY userid, hour; |
Note: The official documentation mentions the POPULATE keyword, but it’s not recommended because data written to download during view creation would be lost. This is why we add a WHERE clause as a data synchronization point.
So, how can we consistently sync source table data to download_hour_mv?
Materializing Full Data
After 2020-09-01 04:00:00, we can materialize download historical data with an INSERT INTO SELECT... statement with a snapshot WHERE clause:
1 | clickhouse> INSERT INTO download_hour_mv |
Query the materialized view:
1 | clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads DESC; |
You can see the data has been “materialized” into download_hour_mv.
Materializing Incremental Data
Write some data to the download table:
1 | clickhouse> INSERT INTO download |
Query the materialized view download_hour_mv:
1 | clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads; |
You can see the last row is the incremental materialized aggregation, synced in real-time. How is this achieved?
Materialized View Implementation
The principle of ClickHouse’s materialized views is not complex. When new data is written to the download table, if an associated materialized view is detected, it performs materialization operations on the newly written batch of data.
For example, the new data above was generated by this SQL:
1 | clickhouse> SELECT |
The materialized view executes a statement similar to:
1 | INSERT INTO download_hour_mv |
Code navigation:
Add view OutputStream, InterpreterInsertQuery.cpp
1
2
3
4if (table->noPushingToViews() && !no_destination)
out = table->write(query_ptr, metadata_snapshot, context);
else
out = std::make_shared<PushingToViewsBlockOutputStream>(table, metadata_snapshot, context, query_ptr, no_destination);Construct Insert, PushingToViewsBlockOutputStream.cpp
1
2
3
4ASTPtr insert_query_ptr(insert.release());
InterpreterInsertQuery interpreter(insert_query_ptr, *insert_context);
BlockIO io = interpreter.execute();
out = io.out;
- Materialize new data: PushingToViewsBlockOutputStream.cpp
1 | Context local_context = *select_context; |
Summary
Materialized views have many applications.
For example, they can solve table indexing problems. We can use materialized views to create alternative physical orderings to satisfy queries under certain conditions.
They also enable more flexible table structure changes through real-time data synchronization capabilities.
More powerfully, they can leverage the MergeTree family engines (SummingMergeTree, AggregatingMergeTree, etc.) to achieve real-time pre-aggregation, enabling fast queries.
The principle is to process incremental data according to AS SELECT ... and write it to the materialized view table. Materialized views are regular tables that can be directly read from and written to.