ClickHouse and Friends (15) Why Group By Is So Fast [ Bohu's Blog ]

Before revealing the secrets of ClickHouse Group By, let’s talk about database performance comparison testing.
In my view, what should a “virtuous” performance comparison tell its readers?

First, respect objective facts. In what scenarios is x faster than y?
Second, why is x faster than y?

If you’ve covered both of the above, there’s one more important point: how long can x’s advantage last? Is it a long-term advantage rooted in the architecture, or just a quick optimization? Can the substance keep up with the numbers?
If you just paste some flashy numbers, that’s not a benchmark, it’s a “benchmarket”.

Okay, back to Group By.
I believe many of you have experienced ClickHouse Group By’s excellent performance. This post analyzes why it’s fast.
First, some reassurance: ClickHouse’s Group By doesn’t rely on any exotic technology. It simply found a relatively optimal approach.

One SQL Statement

SELECT sum(number) FROM numbers(10) GROUP BY number % 3

We’ll use this simple SQL as a thread to see how ClickHouse implements Group By aggregation.

1. Generate AST

EXPLAIN AST
SELECT sum(number)
FROM numbers(10)
GROUP BY number % 3

┌─explain─────────────────────────────────────┐
│ SelectWithUnionQuery (children 1)           │
│  ExpressionList (children 1)                │
│   SelectQuery (children 3)                  │
│    ExpressionList (children 1)              │
│     Function sum (children 1)               │  // sum aggregation
│      ExpressionList (children 1)            │
│       Identifier number                     │
│    TablesInSelectQuery (children 1)         │
│     TablesInSelectQueryElement (children 1) │
│      TableExpression (children 1)           │
│       Function numbers (children 1)         │
│        ExpressionList (children 1)          │
│         Literal UInt64_10                   │
│    ExpressionList (children 1)              │
│     Function modulo (children 1)            │  // number % 3 function
│      ExpressionList (children 2)            │
│       Identifier number                     │
│       Literal UInt64_3                      │
└─────────────────────────────────────────────┘

2. Generate Query Plan

EXPLAIN
SELECT sum(number)
FROM numbers(10)
GROUP BY number % 3

┌─explain───────────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))                                   │
│   Aggregating                                                                 │ // sum aggregation
│     Expression (Before GROUP BY)                                              │ // number % 3
│       SettingQuotaAndLimits (Set limits and quota after reading from storage) │
│         ReadFromStorage (SystemNumbers)                                       │
└───────────────────────────────────────────────────────────────────────────────┘

Code mainly in InterpreterSelectQuery::executeImpl@Interpreters/InterpreterSelectQuery.cpp

3. Generate Pipeline

EXPLAIN PIPELINE
SELECT sum(number)
FROM numbers(10)
GROUP BY number % 3

┌─explain───────────────────────┐
│ (Expression)                  │
│ ExpressionTransform           │
│   (Aggregating)               │
│   AggregatingTransform        │  // sum computation
│     (Expression)              │
│     ExpressionTransform       │  // number % 3 computation
│       (SettingQuotaAndLimits) │
│         (ReadFromStorage)     │
└───────────────────────────────┘

4. Execute Pipeline

The pipeline executes its steps one by one, from bottom to top.

4.1 ReadFromStorage

Execution starts at ReadFromStorage, which generates block1 with the following data:

┌─number─┐
│      0 │
│      1 │
│      2 │
│      3 │
│      4 │
│      5 │
│      6 │
│      7 │
│      8 │
│      9 │
└────────┘
number type is UInt64

4.2 ExpressionTransform

ExpressionTransform contains 2 actions:

Name is number, type is INPUT
Name is modulo(number, 3), type is FUNCTION

After ExpressionTransform processing, a new block2 is generated with data as follows:

┌─number─┬─modulo(number, 3)─┐
│      0 │                 0 │
│      1 │                 1 │
│      2 │                 2 │
│      3 │                 0 │
│      4 │                 1 │
│      5 │                 2 │
│      6 │                 0 │
│      7 │                 1 │
│      8 │                 2 │
│      9 │                 0 │
└────────┴───────────────────┘
number type is UInt64
modulo(number, 3) type is UInt8

Code mainly in ExpressionActions::execute@Interpreters/ExpressionActions.cpp
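
To make the two actions concrete, here is a minimal sketch of what this step computes. It is not ClickHouse code: Block, applyExpression and makeBlock2 are hypothetical names, and columns are modeled as plain std::vector. The divisor is assumed to be at most 256, so the remainder always fits in one byte, which is consistent with the UInt8 result type shown above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a ClickHouse block: two columns stored as plain vectors.
struct Block
{
    std::vector<uint64_t> number;        // INPUT action: the column read from storage
    std::vector<uint8_t> modulo_result;  // FUNCTION action: modulo(number, divisor)
};

// Applies the two actions in order: pass `number` through, then compute the modulo column.
// Assumption: 1 <= divisor <= 256, so every remainder fits in a uint8_t.
Block applyExpression(const std::vector<uint64_t> & number, uint64_t divisor)
{
    Block block;
    block.number = number;
    block.modulo_result.reserve(number.size());
    for (uint64_t value : number)
        block.modulo_result.push_back(static_cast<uint8_t>(value % divisor));
    return block;
}

// Builds the rows of numbers(10) and returns the equivalent of block2.
Block makeBlock2()
{
    std::vector<uint64_t> number;
    for (uint64_t i = 0; i < 10; ++i)
        number.push_back(i);
    return applyExpression(number, 3);
}
```

Calling makeBlock2() yields the same two columns as the table above: number runs from 0 to 9 and the second column repeats 0, 1, 2.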

4.3 AggregatingTransform

AggregatingTransform is the core of Group By’s high performance.
In this example, the type of modulo(number, 3) is UInt8. As an optimization, ClickHouse groups with an array instead of a hashtable in this case. The selection logic is in Interpreters/Aggregator.cpp

When computing sum, it first allocates an array of 1024 slots (256 possible UInt8 keys × an UNROLL_COUNT of 4), then runs an inner loop whose trip count is a compile-time constant, so the compiler can unroll it (code addBatchLookupTable8@AggregateFunctions/IAggregateFunction.h):

static constexpr size_t UNROLL_COUNT = 4;
std::unique_ptr<Data[]> places{new Data[256 * UNROLL_COUNT]};
bool has_data[256 * UNROLL_COUNT]{}; /// Separate flags array to avoid heavy initialization.

size_t i = 0;

/// Aggregate data into different lookup tables.
size_t batch_size_unrolled = batch_size / UNROLL_COUNT * UNROLL_COUNT;
for (; i < batch_size_unrolled; i += UNROLL_COUNT)
{
    for (size_t j = 0; j < UNROLL_COUNT; ++j)
    {
        size_t idx = j * 256 + key[i + j];
        if (unlikely(!has_data[idx]))
        {
            new (&places[idx]) Data;
            has_data[idx] = true;
        }
        func.add(reinterpret_cast<char *>(&places[idx]), columns, i + j, nullptr);
    }
}

For sum(number) … GROUP BY number % 3, the computation works out to:

array[0] = 0 + 3 + 6 + 9 = 18
array[1] = 1 + 4 + 7 = 12
array[2] = 2 + 5 + 8 = 15
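
The same idea can be tried outside ClickHouse with a small self-contained sketch. It keeps the 256 * UNROLL_COUNT layout from the excerpt above, but replaces the generic aggregate function state with a plain uint64_t sum. The names sumByUInt8Key and runExample are hypothetical, and how the tail rows and the final merge of the four lookup tables are handled is my own assumption, since the excerpt stops before that point.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Sums `values` grouped by the matching one-byte `keys`.
// Assumption: keys.size() == values.size().
std::map<uint8_t, uint64_t> sumByUInt8Key(const std::vector<uint8_t> & keys, const std::vector<uint64_t> & values)
{
    constexpr size_t UNROLL_COUNT = 4;
    constexpr size_t KEY_SPACE = 256;

    // One lookup table per unrolled lane, so the four additions of one step touch separate slots.
    std::array<uint64_t, KEY_SPACE * UNROLL_COUNT> sums{};
    std::array<bool, KEY_SPACE * UNROLL_COUNT> has_data{};

    const size_t size = keys.size();
    const size_t size_unrolled = size / UNROLL_COUNT * UNROLL_COUNT;

    size_t i = 0;
    for (; i < size_unrolled; i += UNROLL_COUNT)
    {
        for (size_t j = 0; j < UNROLL_COUNT; ++j)
        {
            const size_t idx = j * KEY_SPACE + keys[i + j];
            has_data[idx] = true;
            sums[idx] += values[i + j];
        }
    }

    // Tail rows that did not fill a whole unrolled step go into lane 0.
    for (; i < size; ++i)
    {
        has_data[keys[i]] = true;
        sums[keys[i]] += values[i];
    }

    // Merge the lanes into one result per key that was actually seen.
    std::map<uint8_t, uint64_t> result;
    for (size_t key = 0; key < KEY_SPACE; ++key)
        for (size_t j = 0; j < UNROLL_COUNT; ++j)
            if (has_data[j * KEY_SPACE + key])
                result[static_cast<uint8_t>(key)] += sums[j * KEY_SPACE + key];

    return result;
}

// Runs the example query: sum(number) over numbers(10), grouped by number % 3.
std::map<uint8_t, uint64_t> runExample()
{
    std::vector<uint8_t> keys;
    std::vector<uint64_t> values;
    for (uint64_t n = 0; n < 10; ++n)
    {
        keys.push_back(static_cast<uint8_t>(n % 3));
        values.push_back(n);
    }
    return sumByUInt8Key(keys, values);
}
```

runExample() returns three groups with the sums 18, 12 and 15 for keys 0, 1 and 2, matching the arithmetic above. The point of the layout is that a one-byte key is its own array index, so no hashing or collision handling is needed.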

This is only the optimization branch for UInt8. How are other key types handled?
ClickHouse provides a whole family of hashtables specialized for different key types (code in Aggregator.h):

using AggregatedDataWithUInt8Key = FixedImplicitZeroHashMapWithCalculatedSize<UInt8, AggregateDataPtr>;
using AggregatedDataWithUInt16Key = FixedImplicitZeroHashMap<UInt16, AggregateDataPtr>;
using AggregatedDataWithUInt32Key = HashMap<UInt32, AggregateDataPtr, HashCRC32<UInt32>>;
using AggregatedDataWithUInt64Key = HashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
using AggregatedDataWithShortStringKey = StringHashMap<AggregateDataPtr>;
using AggregatedDataWithStringKey = HashMapWithSavedHash<StringRef, AggregateDataPtr>;
using AggregatedDataWithKeys128 = HashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
using AggregatedDataWithKeys256 = HashMap<DummyUInt256, AggregateDataPtr, UInt256HashCRC32>;
using AggregatedDataWithUInt32KeyTwoLevel = TwoLevelHashMap<UInt32, AggregateDataPtr, HashCRC32<UInt32>>;
using AggregatedDataWithUInt64KeyTwoLevel = TwoLevelHashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;
using AggregatedDataWithShortStringKeyTwoLevel = TwoLevelStringHashMap<AggregateDataPtr>;
using AggregatedDataWithStringKeyTwoLevel = TwoLevelHashMapWithSavedHash<StringRef, AggregateDataPtr>;
using AggregatedDataWithKeys128TwoLevel = TwoLevelHashMap<UInt128, AggregateDataPtr, UInt128HashCRC32>;
using AggregatedDataWithKeys256TwoLevel = TwoLevelHashMap<DummyUInt256, AggregateDataPtr, UInt256HashCRC32>;
using AggregatedDataWithUInt64KeyHash64 = HashMap<UInt64, AggregateDataPtr, DefaultHash<UInt64>>;
using AggregatedDataWithStringKeyHash64 = HashMapWithSavedHash<StringRef, AggregateDataPtr, StringRefHash64>;
using AggregatedDataWithKeys128Hash64 = HashMap<UInt128, AggregateDataPtr, UInt128Hash>;
using AggregatedDataWithKeys256Hash64 = HashMap<DummyUInt256, AggregateDataPtr, UInt256Hash>;

If we change the query to GROUP BY number*100000, ClickHouse chooses the AggregatedDataWithUInt64Key hashtable for grouping.
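
The idea of picking the grouping structure from the key type can be sketched in a few lines. This is only an illustration under my own assumptions: ArrayTable, HashTableFor, TableFor and sumGroupBy are hypothetical names, std::unordered_map stands in for ClickHouse's own hashtables, and the choice is made with a compile-time type switch purely to keep the sketch short (as far as I understand, ClickHouse decides from the key columns when the query runs).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <unordered_map>
#include <vector>

// Dense table for one-byte keys: the key is the array index, no hashing needed.
struct ArrayTable
{
    std::array<uint64_t, 256> sums{};
    std::array<bool, 256> used{};

    void add(uint8_t key, uint64_t value)
    {
        used[key] = true;
        sums[key] += value;
    }

    size_t groups() const
    {
        size_t count = 0;
        for (bool flag : used)
            count += flag ? 1 : 0;
        return count;
    }

    uint64_t get(uint8_t key) const { return sums[key]; }
};

// General-purpose hashtable for wider keys.
template <typename Key>
struct HashTableFor
{
    std::unordered_map<Key, uint64_t> sums;

    void add(Key key, uint64_t value) { sums[key] += value; }
    size_t groups() const { return sums.size(); }

    uint64_t get(Key key) const
    {
        auto it = sums.find(key);
        return it == sums.end() ? 0 : it->second;
    }
};

// One-byte keys get the array, every other key type gets a hashtable.
template <typename Key>
using TableFor = typename std::conditional<std::is_same<Key, uint8_t>::value, ArrayTable, HashTableFor<Key>>::type;

// Computes sum(values) grouped by keys, using whichever table the key type selects.
template <typename Key>
TableFor<Key> sumGroupBy(const std::vector<Key> & keys, const std::vector<uint64_t> & values)
{
    TableFor<Key> table;
    for (size_t i = 0; i < keys.size(); ++i)
        table.add(keys[i], values[i]);
    return table;
}

// GROUP BY number % 3: one-byte keys, so the array path is used.
TableFor<uint8_t> exampleModulo3()
{
    std::vector<uint8_t> keys;
    std::vector<uint64_t> values;
    for (uint64_t n = 0; n < 10; ++n)
    {
        keys.push_back(static_cast<uint8_t>(n % 3));
        values.push_back(n);
    }
    return sumGroupBy(keys, values);
}

// GROUP BY number * 100000: 64-bit keys, so the hashtable path is used.
TableFor<uint64_t> exampleTimes100000()
{
    std::vector<uint64_t> keys;
    std::vector<uint64_t> values;
    for (uint64_t n = 0; n < 10; ++n)
    {
        keys.push_back(n * 100000);
        values.push_back(n);
    }
    return sumGroupBy(keys, values);
}
```

Both example functions aggregate the same ten rows. The first ends up with 3 groups in a 256-slot array, the second with 10 groups in a hashtable, because the 64-bit keys are far too sparse to index an array directly.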

Moreover, ClickHouse provides a two-level approach for cases with many grouping keys: level 1 first splits the keys into large groups, and the smaller level-2 groups can then be computed in parallel.
For String keys, the hashtables are heavily optimized for different string lengths. Code in HashTable/StringHashMap.h
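
A rough sketch of the two-level idea, under my own assumptions: TwoLevelSums, mergeAll and exampleMerged are hypothetical names, the bucket count of 16 is arbitrary, and std::unordered_map stands in for the real hashtables. The first level picks a bucket from the key's hash, the second level is an ordinary hashtable per bucket. Since a given key always lands in the same bucket, partial results can be merged bucket by bucket without the buckets touching each other. The merge loop below runs sequentially to stay simple, but each iteration is independent and could be handed to its own thread.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

constexpr size_t NUM_BUCKETS = 16; // arbitrary value chosen for this sketch

struct TwoLevelSums
{
    // Level 1: fixed array of buckets. Level 2: one hashtable per bucket.
    std::array<std::unordered_map<uint64_t, uint64_t>, NUM_BUCKETS> buckets;

    static size_t bucketOf(uint64_t key) { return std::hash<uint64_t>{}(key) % NUM_BUCKETS; }

    void add(uint64_t key, uint64_t value) { buckets[bucketOf(key)][key] += value; }

    // Merges one bucket of `other` into the same bucket of this table.
    void mergeBucket(const TwoLevelSums & other, size_t bucket)
    {
        for (const auto & entry : other.buckets[bucket])
            buckets[bucket][entry.first] += entry.second;
    }

    size_t groups() const
    {
        size_t count = 0;
        for (const auto & bucket : buckets)
            count += bucket.size();
        return count;
    }

    uint64_t get(uint64_t key) const
    {
        const auto & bucket = buckets[bucketOf(key)];
        auto it = bucket.find(key);
        return it == bucket.end() ? 0 : it->second;
    }
};

// Merges two partial tables. Each loop iteration only touches one bucket index,
// so the iterations do not depend on each other.
TwoLevelSums mergeAll(const TwoLevelSums & a, const TwoLevelSums & b)
{
    TwoLevelSums result;
    for (size_t bucket = 0; bucket < NUM_BUCKETS; ++bucket)
    {
        result.mergeBucket(a, bucket);
        result.mergeBucket(b, bucket);
    }
    return result;
}

// Two partial tables, as if two workers had each aggregated the ten rows of
// numbers(10) grouped by number * 100000, merged into one result.
TwoLevelSums exampleMerged()
{
    TwoLevelSums first;
    TwoLevelSums second;
    for (uint64_t n = 0; n < 10; ++n)
    {
        first.add(n * 100000, n);
        second.add(n * 100000, n);
    }
    return mergeAll(first, second);
}
```

In exampleMerged() every key appears once in each partial table, so the merged table has 10 groups and each sum is twice the row value.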

Summary

ClickHouse picks the best-fitting hashtable or array as the underlying grouping structure, based on the final type of the Group By key, keeping memory use and computation as close to optimal as possible.

How is this “optimal solution” found? Judging from the test code, through constant experimenting and measuring: a strongly bottom-up philosophy.

hashtable test code: Interpreters/tests

lookuptable test code: tests/average.cpp