ClickHouse and Friends (12) Magical Materialized Views and Their Implementation [ Bohu's Blog ]

Last updated: 2020-08-31

In ClickHouse, Materialized Views can be described as magical and powerful, with unique applications.

This article analyzes the underlying mechanism to understand how ClickHouse Materialized Views work, enabling better usage.

What is a Materialized View

For most people, the concept of materialized views is quite abstract: materialized? view?…

To better understand it, let’s look at a scenario.

Suppose you’re a “happy” little programmer at *hub. One day, the product manager has a requirement: real-time statistics of hourly video downloads.

User download detail table:

clickhouse> SELECT * FROM download LIMIT 10;
+---------------------+--------+--------+
| when                | userid | bytes  |
+---------------------+--------+--------+
| 2020-08-31 18:22:06 |     19 | 530314 |
| 2020-08-31 18:22:06 |     19 | 872957 |
| 2020-08-31 18:22:06 |     19 | 107047 |
| 2020-08-31 18:22:07 |     19 | 214876 |
| 2020-08-31 18:22:07 |     19 | 820943 |
| 2020-08-31 18:22:07 |     19 | 693959 |
| 2020-08-31 18:22:08 |     19 | 882151 |
| 2020-08-31 18:22:08 |     19 | 644223 |
| 2020-08-31 18:22:08 |     19 | 199800 |
| 2020-08-31 18:22:09 |     19 | 511439 |

... ....

Calculate hourly downloads:

clickhouse> SELECT toStartOfHour(when) AS hour, userid, count() as downloads, sum(bytes) AS bytes FROM download GROUP BY userid, hour ORDER BY userid, hour;
+---------------------+--------+-----------+------------+
| hour                | userid | downloads | bytes      |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 |     19 |      6822 | 3378623036 |
| 2020-08-31 19:00:00 |     19 |     10800 | 5424173178 |
| 2020-08-31 20:00:00 |     19 |     10800 | 5418656068 |
| 2020-08-31 21:00:00 |     19 |     10800 | 5404309443 |
| 2020-08-31 22:00:00 |     19 |     10800 | 5354077456 |
| 2020-08-31 23:00:00 |     19 |     10800 | 5390852563 |
| 2020-09-01 00:00:00 |     19 |     10800 | 5369839540 |
| 2020-09-01 01:00:00 |     19 |     10800 | 5384161012 |
| 2020-09-01 02:00:00 |     19 |     10800 | 5404581759 |
| 2020-09-01 03:00:00 |     19 |      6778 | 3399557322 |
+---------------------+--------+-----------+------------+
10 rows in set (0.13 sec)

Easy, right? But there’s a problem:
Each time you need to compute based on the download table as base data. *hub has too much data, unbearable.

An idea comes to mind: What if we pre-aggregate the download data, save results to a new table download_hour_mv, and update it incrementally in real-time as download grows, then query download_hour_mv each time?

This new table can be considered a materialized view, which in ClickHouse is a regular table.

Creating a Materialized View

clickhouse> CREATE MATERIALIZED VIEW download_hour_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour) ORDER BY (userid, hour)
AS SELECT
  toStartOfHour(when) AS hour,
  userid,
  count() as downloads,
  sum(bytes) AS bytes
FROM download WHERE when >= toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour

This statement mainly:

Creates a materialized view download_hour_mv with SummingMergeTree engine
The materialized view’s data source is the download table, performing corresponding “materialization” operations based on expressions in the select statement
Selects a future time (current time is 2020-08-31 18:00:00) as starting point WHERE when >= toDateTime('2020-09-01 04:00:00'), meaning only data after 2020-09-01 04:00:00 will be synced to download_hour_mv

Thus, currently download_hour_mv is an empty table:

1 2	clickhouse> SELECT * FROM download_hour_mv ORDER BY userid, hour; Empty set (0.02 sec)

Note: The official documentation mentions the POPULATE keyword, but it’s not recommended because data written to download during view creation would be lost. This is why we add a WHERE clause as a data synchronization point.

So, how can we consistently sync source table data to download_hour_mv?

Materializing Full Data

After 2020-09-01 04:00:00, we can materialize download historical data with an INSERT INTO SELECT... statement with a snapshot WHERE clause:

clickhouse> INSERT INTO download_hour_mv
SELECT
  toStartOfHour(when) AS hour,
  userid,
  count() as downloads,
  sum(bytes) AS bytes
FROM download WHERE when < toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour

Query the materialized view:

clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads DESC;
+---------------------+--------+-----------+------------+
| hour                | userid | downloads | bytes      |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 |     19 |      6822 | 3378623036 |
| 2020-08-31 19:00:00 |     19 |     10800 | 5424173178 |
| 2020-08-31 20:00:00 |     19 |     10800 | 5418656068 |
| 2020-08-31 21:00:00 |     19 |     10800 | 5404309443 |
| 2020-08-31 22:00:00 |     19 |     10800 | 5354077456 |
| 2020-08-31 23:00:00 |     19 |     10800 | 5390852563 |
| 2020-09-01 00:00:00 |     19 |     10800 | 5369839540 |
| 2020-09-01 01:00:00 |     19 |     10800 | 5384161012 |
| 2020-09-01 02:00:00 |     19 |     10800 | 5404581759 |
| 2020-09-01 03:00:00 |     19 |      6778 | 3399557322 |
+---------------------+--------+-----------+------------+
10 rows in set (0.05 sec)

You can see the data has been “materialized” into download_hour_mv.

Materializing Incremental Data

Write some data to the download table:

clickhouse> INSERT INTO download
       SELECT
         toDateTime('2020-09-01 04:00:00') + number*(1/3) as when,
         19,
         rand() % 1000000
       FROM system.numbers
       LIMIT 10;

Query the materialized view download_hour_mv:

clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads;
+---------------------+--------+-----------+------------+
| hour                | userid | downloads | bytes      |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 |     19 |      6822 | 3378623036 |
| 2020-08-31 19:00:00 |     19 |     10800 | 5424173178 |
| 2020-08-31 20:00:00 |     19 |     10800 | 5418656068 |
| 2020-08-31 21:00:00 |     19 |     10800 | 5404309443 |
| 2020-08-31 22:00:00 |     19 |     10800 | 5354077456 |
| 2020-08-31 23:00:00 |     19 |     10800 | 5390852563 |
| 2020-09-01 00:00:00 |     19 |     10800 | 5369839540 |
| 2020-09-01 01:00:00 |     19 |     10800 | 5384161012 |
| 2020-09-01 02:00:00 |     19 |     10800 | 5404581759 |
| 2020-09-01 03:00:00 |     19 |      6778 | 3399557322 |
| 2020-09-01 04:00:00 |     19 |        10 |    5732600 |
+---------------------+--------+-----------+------------+
11 rows in set (0.00 sec)

You can see the last row is the incremental materialized aggregation, synced in real-time. How is this achieved?

Materialized View Implementation

The principle of ClickHouse’s materialized views is not complex. When new data is written to the download table, if an associated materialized view is detected, it performs materialization operations on the newly written batch of data.

For example, the new data above was generated by this SQL:

clickhouse> SELECT
    ->          toDateTime('2020-09-01 04:00:00') + number*(1/3) as when,
    ->          19,
    ->          rand() % 1000000
    ->        FROM system.numbers
    ->        LIMIT 10;
+---------------------+------+-------------------------+
| when                | 19   | modulo(rand(), 1000000) |
+---------------------+------+-------------------------+
| 2020-09-01 04:00:00 |   19 |                  870495 |
| 2020-09-01 04:00:00 |   19 |                  322270 |
| 2020-09-01 04:00:00 |   19 |                  983422 |
| 2020-09-01 04:00:01 |   19 |                  759708 |
| 2020-09-01 04:00:01 |   19 |                  975636 |
| 2020-09-01 04:00:01 |   19 |                  365507 |
| 2020-09-01 04:00:02 |   19 |                  865569 |
| 2020-09-01 04:00:02 |   19 |                  975742 |
| 2020-09-01 04:00:02 |   19 |                   85827 |
| 2020-09-01 04:00:03 |   19 |                  992779 |
+---------------------+------+-------------------------+
10 rows in set (0.02 sec)

The materialized view executes a statement similar to:

INSERT INTO download_hour_mv
SELECT
  toStartOfHour(when) AS hour,
  userid,
  count() as downloads,
  sum(bytes) AS bytes
FROM [newly added 10 rows] WHERE when >= toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour

Code navigation:

Add view OutputStream, InterpreterInsertQuery.cpp

if (table->noPushingToViews() && !no_destination)
    out = table->write(query_ptr, metadata_snapshot, context);
else
    out = std::make_shared<PushingToViewsBlockOutputStream>(table, metadata_snapshot, context, query_ptr, no_destination);

Construct Insert, PushingToViewsBlockOutputStream.cpp

ASTPtr insert_query_ptr(insert.release());
InterpreterInsertQuery interpreter(insert_query_ptr, *insert_context);
BlockIO io = interpreter.execute();
out = io.out;

Materialize new data: PushingToViewsBlockOutputStream.cpp

Context local_context = *select_context;
local_context.addViewSource(
    StorageValues::create(
        storage->getStorageID(), metadata_snapshot->getColumns(), block, storage->getVirtuals()));
select.emplace(view.query, local_context, SelectQueryOptions());
in = std::make_shared<MaterializingBlockInputStream>(select->execute().getInputStream()

Summary

Materialized views have many applications.

For example, they can solve table indexing problems. We can use materialized views to create alternative physical orderings to satisfy queries under certain conditions.

They also enable more flexible table structure changes through real-time data synchronization capabilities.

More powerfully, they can leverage the MergeTree family engines (SummingMergeTree, AggregatingMergeTree, etc.) to achieve real-time pre-aggregation, enabling fast queries.

The principle is to process incremental data according to AS SELECT ... and write it to the materialized view table. Materialized views are regular tables that can be directly read from and written to.