ClickHouse and Friends (6) MergeTree Storage Structure [ Bohu's Blog ]

Last updated: 2020-09-28

The previous post Storage Engine Technology Evolution and MergeTree introduced the evolution of storage algorithms.

The storage engine is a database’s chassis — it must be stable and powerful.

Next, we’ll explore ClickHouse MergeTree columnar storage engine together, deconstructing the most important components of this “sports car.”

All storage engines, whether excellent or shoddy, ultimately write data back to disk to meet storage and indexing purposes.

Disk file construction can be said to be the physical embodiment of algorithms. We can even reverse-engineer algorithm implementation from these storage structures.

So, to deeply understand a storage engine, the best starting point is its disk storage structure. Then reflecting on its read/write mechanisms will feel natural.

If you reverse this analysis order, it feels forced. Most online tutorials use this “forced” teaching method. This article will strike at the soul from the bottom layer, thoroughly clarifying 4 questions:

What files does MergeTree have?
How is MergeTree data distributed?
How is MergeTree index organized?
How does MergeTree use indexes to accelerate?

Without further ado, here’s the table:

CREATE TABLE default.mt
(
    `a` Int32,
    `b` Int32,
    `c` Int32,
    INDEX `idx_c` (c) TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY a
ORDER BY b
SETTINGS index_granularity=3

Generate some data:

1
2
3

insert into default.mt(a,b,c) values(1,1,1);
insert into default.mt(a,b,c) values(5,2,2),(5,3,3);
insert into default.mt(a,b,c) values(3,10,4),(3,9,5),(3,8,6),(3,7,7),(3,6,8),(3,5,9),(3,4,10);

Disk Files

1 2	ls ckdatas/data/default/mt/ 1_4_4_0 3_6_6_0 5_5_5_0 detached format_version.txt

You can see 3 data directories were generated. Each directory is called a partition (part) in ClickHouse. The directory name prefix is exactly the value of field a when we wrote: 1, 3, 5, because the table partition is positioned like this: PARTITION BY a.

Now let’s look at the a=3 partition:

1 2	ls ckdatas/data/default/mt/3_6_6_0/ a.bin a.mrk2 b.bin b.mrk2 c.bin checksums.txt c.mrk2 columns.txt count.txt minmax_a.idx partition.dat primary.idx skp_idx_idx_c.idx skp_idx_idx_c.mrk2

*.bin are column data files, sorted by primary key (ORDER BY), here sorted by field b
*.mrk2 mark files, purpose is to quickly locate bin file data positions
minmax_a.idx partition key min-max index file, purpose is to accelerate partition key a lookup
primary.idx primary key index file, purpose is to accelerate primary key b lookup
skp_idx_idx_c.* field c index file, purpose is to accelerate c lookup

On disk, MergeTree has only one physical sort: the ORDER BY primary key order. Other files (like .mrk/.idx) are logical acceleration. Around this single physical sort, the problem to solve is:

With field b physically sorted, how to achieve fast lookup of field a and field c?

MergeTree engine is simple to summarize:
The entire dataset is divided into multiple physical partitions by the partition field. Within each partition, logical files accelerate lookup around the only physical sort.

Storage Structure

Data Files

For storage structure within a single physical partition, first understand one point: MergeTree data has only one copy: *.bin.

a.bin is field a data, b.bin is field b data, c.bin is field c data — the columnar storage everyone’s familiar with.

Each bin file is sort-aligned with b.bin (b is the sort key), as shown:

This has a serious problem:
If *.bin files are large, even reading one row of data requires loading the entire bin file, wasting massive I/O. Unbearable.

granule

High tech arrives! ClickHouse MergeTree divides bin files into multiple granules based on granularity (GRANULARITY). Each granule is compressed and stored separately.

SETTINGS index_granularity=3 means every 3 rows of data form one granule. The partition currently has only 7 rows, so it’s divided into 3 granules (three color blocks):

For easy reading of a granule, *.mrk files record each granule’s offset. Each granule’s header records some metadata for reading and parsing:

This way, we can directly locate the desired granule based on the mark file, then read and verify this individual granule.

Currently, we still lack a mapping: the correspondence between each mark and field values. Which value ranges fall in mark0, which fall in mark1…?

With this mapping, we can achieve minimized granule reading to accelerate queries:

Determine which marks are needed based on query conditions
Read corresponding granules based on marks

Storage Sorting

Before understanding MergeTree’s indexing mechanism, you need to understand two points:

There’s only one complete data copy, stored in *.bin files
*.bin is stored in descending order by ORDER BY field

Sparse Index

Because there’s only one data copy and only one physical sort, MergeTree chose a simple, efficient sparse index mode in index design.

What is a sparse index? It’s selecting some points at intervals from already sorted complete data and recording which mark these points belong to.

1. primary index

Primary key index, can be specified via [PRIMARY KEY expr], defaults to ORDER BY field value.

Note: ClickHouse primary index is not the same concept as MySQL primary key.

For sparse point selection, take each granule’s minimum value:

2. skipping index

Regular index.

INDEXidx_c(c) TYPE minmax GRANULARITY 1 creates a minmax mode index for field c.

GRANULARITY is the granule granularity for sparse point selection. GRANULARITY 1 means select one per 1 granule:

If defined as GRANULARITY 2, then select one per 2 granules:

3. partition minmax index

For partition keys, MergeTree also creates a min/max index to accelerate partition selection.

4. Panoramic View

Query Optimization

Now familiar with MergeTree’s storage structure, let’s experience it through a few queries.

1. Partition Key Query

Statement:

1	select * from default.mt where a=3

The query directly locates a single partition based on a=3:

<Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "a = 3" moved to PREWHERE
<Debug> default.mt (SelectExecutor): Key condition: unknown
<Debug> default.mt (SelectExecutor): MinMax index condition: (column 0 in [3, 3])
<Debug> default.mt (SelectExecutor): Selected 1 parts by a, 1 parts by key, 3 marks by primary key, 3 marks to read from 1 ranges
┌─a─┬──b─┬──c─┐
│ 3 │  4 │ 10 │
│ 3 │  5 │  9 │
│ 3 │  6 │  8 │
│ 3 │  7 │  7 │
│ 3 │  8 │  6 │
│ 3 │  9 │  5 │
│ 3 │ 10 │  4 │
└───┴────┴────┘

2. Primary Key Index Query

Statement:

1	select * from default.mt where b=5

The query first reads primary.idx from 3 partitions, then locates that only one partition meets the condition, finding the mark to read:

<Debug> default.mt (SelectExecutor): Key condition: (column 0 in [5, 5])
<Debug> default.mt (SelectExecutor): MinMax index condition: unknown
<Debug> default.mt (SelectExecutor): Selected 3 parts by a, 1 parts by key, 1 marks by primary key, 1 marks to read from 1 ranges
┌─a─┬─b─┬─c─┐
│ 3 │ 5 │ 9 │
└───┴───┴───┘

3. Index Query

Statement:

1	select * from default.mt where c=5

The query first reads primary.idx and skp_idx_idx_c.idx from 3 partitions for granule filtering (dropping unused ones), then locates that only one granule in partition 3_x_x_x meets the condition:

<Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "b = 5" moved to PREWHERE
<Debug> default.mt (SelectExecutor): Key condition: unknown
<Debug> default.mt (SelectExecutor): MinMax index condition: unknown
<Debug> default.mt (SelectExecutor): Index `idx_c` has dropped 1 / 1 granules.
<Debug> default.mt (SelectExecutor): Index `idx_c` has dropped 1 / 1 granules.
<Debug> default.mt (SelectExecutor): Index `idx_c` has dropped 2 / 3 granules.
<Debug> default.mt (SelectExecutor): Selected 3 parts by a, 1 parts by key, 5 marks by primary key, 1 marks to read from 1 ranges
┌─a─┬─b─┬─c─┐
│ 3 │ 9 │ 5 │
└───┴───┴───┘

Summary

This article starts from disk storage structure to analyze ClickHouse MergeTree’s storage and index design.

Only by understanding these underlying mechanisms can we optimize our SQL and table structure to make execution more efficient.

ClickHouse MergeTree design is simple and efficient. The primary problem it solves is: with one physical sort, how to achieve fast lookup.

To address this problem, ClickHouse uses sparse indexes.

In the official roadmap, an interesting index direction is listed: Z-Order Indexing. The purpose is to encode multiple dimensions into one-dimensional storage. When we provide multi-dimensional conditions, we can quickly locate the spatial position of this condition point set. Currently, ClickHouse has made no progress on this index design.