Lesser Known Features of ClickHouse
Last updated: May 31, 2022
Introduction
ClickHouse can do a lot more than a typical database. In this post, we will look at some built-in ClickHouse features you might not be aware of.
Run ClickHouse queries without a ClickHouse server
clickhouse-local is a tool that ships with the ClickHouse package and lets you use ClickHouse features without running a ClickHouse server. Here are some examples of what you can use clickhouse-local for:
Data Manipulation
We come across data in various formats (JSON, CSV, TSV and XML) and often need to convert between them, since different tools consume only specific formats.
Let's say you have a CSV file like this:
❯ cat us-counties.csv
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-23,Snohomish,Washington,53061,1,0
2020-01-24,Cook,Illinois,17031,1,0
2020-01-24,Snohomish,Washington,53061,1,0
2020-01-25,Orange,California,06059,1,0
2020-01-25,Cook,Illinois,17031,1,0
2020-01-25,Snohomish,Washington,53061,1,0
2020-01-26,Maricopa,Arizona,04013,1,0
....
You can convert it into any of the supported output formats using clickhouse-local. Let's try converting it to JSON.
❯ clickhouse-local \
--file=us-counties.csv \
--input-format=CSVWithNames \
--table input \
--structure='date Date, county String, state String, fips UInt32, cases UInt32, deaths UInt32' \
--query='SELECT * from input FORMAT JSONEachRow'
{"date":"2020-01-21","county":"Snohomish","state":"Washington","fips":53061,"cases":1,"deaths":0}
{"date":"2020-01-22","county":"Snohomish","state":"Washington","fips":53061,"cases":1,"deaths":0}
{"date":"2020-01-23","county":"Snohomish","state":"Washington","fips":53061,"cases":1,"deaths":0}
{"date":"2020-01-24","county":"Cook","state":"Illinois","fips":17031,"cases":1,"deaths":0}
{"date":"2020-01-24","county":"Snohomish","state":"Washington","fips":53061,"cases":1,"deaths":0}
{"date":"2020-01-25","county":"Orange","state":"California","fips":6059,"cases":1,"deaths":0}
{"date":"2020-01-25","county":"Cook","state":"Illinois","fips":17031,"cases":1,"deaths":0}
{"date":"2020-01-25","county":"Snohomish","state":"Washington","fips":53061,"cases":1,"deaths":0}
{"date":"2020-01-26","county":"Maricopa","state":"Arizona","fips":4013,"cases":1,"deaths":0}
....
Data Transformation
To analyse data, we usually need to transform it first to get meaningful results, and SQL is very powerful for such transformations.
❯ clickhouse-local \
--file=us-counties.csv \
--input-format=CSVWithNames \
--table input \
--structure='date Date, county String, state String, fips UInt32, cases UInt32, deaths UInt32' \
--query='SELECT state State, count() Count, bar(count(), 0, 100000, 100) Histogram from input GROUP BY state ORDER BY state FORMAT PrettyCompact'
┌─State────────────────────┬─Count─┬─Histogram─────────────────────────┐
│ Alabama                  │  9700 │ █████████▋                        │
│ Alaska                   │  2878 │ ██▊                               │
│ Arizona                  │  2320 │ ██▎                               │
│ Arkansas                 │ 10696 │ ██████████▋                       │
│ California               │  8797 │ ████████▋                         │
│ Colorado                 │  8862 │ ████████▋                         │
│ ....                     │  .... │ ....                              │
│ Virginia                 │ 18532 │ ██████████████████▌               │
│ Washington               │  5956 │ █████▊                            │
│ West Virginia            │  7256 │ ███████▎                          │
│ Wisconsin                │ 10133 │ ██████████▏                       │
│ Wyoming                  │  3187 │ ███▏                              │
└──────────────────────────┴───────┴───────────────────────────────────┘
An HN user pointed out that we don't even need to pass the --structure flag; ClickHouse will infer the schema automatically. Wow!!
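For example, on recent ClickHouse versions the conversion above reduces to the following; this is a sketch, and the automatically inferred types may differ slightly from the explicit --structure we passed earlier:
❯ clickhouse-local \
--file=us-counties.csv \
--input-format=CSVWithNames \
--table input \
--query='SELECT * from input FORMAT JSONEachRow'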
Run ClickHouse as an HTTP server with custom static and dynamic handlers
ClickHouse listens on port 8123 (by default) for HTTP requests. You can add custom HTTP handlers to serve arbitrary content, from inside or outside ClickHouse, via this endpoint.
Adding the following config to ClickHouse's config.xml will define an HTTP handler at /hello:
<clickhouse>
<http_handlers>
<rule>
<url>/hello</url>
<methods>GET</methods>
<handler>
<type>static</type>
<status>200</status>
<response_content>Hello from ClickHouse!</response_content>
</handler>
</rule>
</http_handlers>
</clickhouse>
Let's try it out by making an HTTP request:
❯ curl 'localhost:8123/hello'
Hello from ClickHouse!
The above example serves static content; now let's serve data from a ClickHouse table. Adding the following config to ClickHouse's config.xml will define an HTTP handler at /foobar which reads data from the foobar table of the default database.
<clickhouse>
<http_handlers>
<rule>
<url>/foobar</url>
<methods>POST,GET</methods>
<handler>
<type>predefined_query_handler</type>
<query>SELECT * FROM default.foobar</query>
</handler>
</rule>
</http_handlers>
</clickhouse>
Now let's create the foobar table and insert some data into it.
ch-server :) CREATE TABLE foobar
(
`name` String
)
ENGINE = MergeTree
ORDER BY name
Ok.
ch-server :) INSERT INTO foobar(name) VALUES ('John'), ('Michael'), ('Ivan');
Ok.
Now let’s make an HTTP request to the clickhouse endpoint.
❯ curl 'localhost:8123/foobar'
Ivan
John
Michael
Note the changed order; it's due to the ORDER BY name clause we used when creating the table.
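Besides static and predefined_query_handler, there is also a dynamic_query_handler, where the query itself is passed as a URL parameter. A minimal sketch (the /run path and the parameter name query are arbitrary choices):
<clickhouse>
<http_handlers>
<rule>
<url>/run</url>
<methods>GET</methods>
<handler>
<type>dynamic_query_handler</type>
<query_param_name>query</query_param_name>
</handler>
</rule>
</http_handlers>
</clickhouse>
❯ curl 'localhost:8123/run?query=SELECT+1'
1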
You can read more about this feature in the official documentation.
Run user-defined functions in ClickHouse
You can define custom SQL expressions, or write scripts in your favourite language, and ClickHouse will happily call them for you.
ch-server :) CREATE FUNCTION plus_one as (a) -> a + 1
Ok.
ch-server :) SELECT plus_one(10);
┌─plus(10, 1)─┐
│          11 │
└─────────────┘
Note the column header plus(10, 1): ClickHouse inlines SQL-expression UDFs, so the substituted expression becomes the result column's name.
Adding the following config in /etc/clickhouse-server/awkplus_function.xml will add an executable user-defined function:
<functions>
<function>
<type>executable</type>
<name>awk_plus</name>
<argument><type>UInt64</type></argument>
<argument><type>UInt64</type></argument>
<return_type>UInt64</return_type>
<format>TabSeparated</format>
<command>awk '{ print $1 + $2 }'</command>
<execute_direct>0</execute_direct>
</function>
</functions>
ch-server :) SELECT awk_plus(1,2);
┌─awk_plus(1, 2)─┐
│              3 │
└────────────────┘
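The function behaves like any built-in one, so you can also apply it to table columns; for instance, over the built-in system.numbers table (output sketched):
ch-server :) SELECT awk_plus(number, 10) FROM system.numbers LIMIT 3;
┌─awk_plus(number, 10)─┐
│                   10 │
│                   11 │
│                   12 │
└──────────────────────┘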
Run ClickHouse queries on data stored in other databases
ClickHouse can act as a proxy for other databases like SQLite, MySQL and PostgreSQL.
The following example shows ClickHouse connecting to SQLite.
ch-server :) CREATE DATABASE sqlite_data
ENGINE = SQLite('sqlite3.db');
ch-server :) SHOW TABLES FROM sqlite_data;
┌─name─┐
│ taxi │
└──────┘
ch-server :) SELECT count(*) FROM sqlite_data.taxi;
┌──count()─┐
│ 14092414 │
└──────────┘
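MySQL works the same way; a sketch, assuming a local MySQL server with a database named mydb:
ch-server :) CREATE DATABASE mysql_data
ENGINE = MySQL('127.0.0.1:3306', 'mydb', 'user', 'password');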
The following example shows ClickHouse connecting to PostgreSQL.
ch-server :) CREATE DATABASE postgres_data
ENGINE = PostgreSQL('127.0.0.1:5432', 'postgres', 'postgres', 'changeme');
ch-server :) SHOW TABLES FROM postgres_data;
┌─name─┐
│ taxi │
└──────┘
ch-server :) SELECT count(*) FROM postgres_data.taxi;
┌──count()─┐
│ 14092414 │
└──────────┘
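If you don't need a persistent database mapping, the postgresql() table function lets you query a single table ad hoc, with the same connection details:
ch-server :) SELECT count(*)
FROM postgresql('127.0.0.1:5432', 'postgres', 'taxi', 'postgres', 'changeme');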
Run ClickHouse queries on data stored in a file
ClickHouse can run queries on data stored in a file using the File table engine.
ch-server :) CREATE TABLE data_file
(
`id` UInt32,
`name` String
)
ENGINE = File(TabSeparated);
Ok.
ch-server :) SELECT *
FROM data_file;
0 rows in set. Elapsed: 0.002 sec.
Let's insert some tab-separated content into the file (using printf so the tab characters are explicit):
❯ cd /var/lib/clickhouse/data/default/data_file
❯ printf '1\tRob\n' >> data.TabSeparated
❯ printf '2\tMichael\n' >> data.TabSeparated
❯ printf '3\tJohn\n' >> data.TabSeparated
Now let’s query the clickhouse table.
ch-server :) SELECT *
FROM data_file;
┌─id─┬─name────┐
│  1 │ Rob     │
│  2 │ Michael │
│  3 │ John    │
└────┴─────────┘
3 rows in set. Elapsed: 0.003 sec.
This is very useful if you want to run some analysis on data that lives in files.
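A related shortcut is the file() table function, which queries a file directly without creating a table first. Note that it resolves paths relative to the server's user_files_path (/var/lib/clickhouse/user_files by default), so this sketch assumes a copy of the file was placed there:
ch-server :) SELECT *
FROM file('data.TabSeparated', 'TabSeparated', 'id UInt32, name String');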
Anonymize any data using clickhouse-obfuscator
ClickHouse can anonymize your data, whether it is stored inside or outside ClickHouse, using clickhouse-obfuscator.
❯ cat users.csv
1,Rob
2,Michael
3,John
❯ clickhouse-obfuscator \
--seed "$(head -c16 /dev/urandom | base64)" \
--input-format CSV \
--output-format CSV \
--structure 'id UInt32,name String' \
< users.csv
Training models
Processed 3 rows
Generating data
Processed 3 rows
1,""
2,""
3,""
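The names come out empty here most likely because three rows are too little data for the obfuscator to train its model on; on realistically sized datasets it generates fake values that preserve the statistical properties of the original data.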
Read ClickHouse server runtime information like queries, settings, users, roles, etc.
By default, there is a system database which contains multiple tables with information about the clickhouse-server itself.
ch-server :) SHOW TABLES FROM system;
┌─name───────────────────────────┐
│ aggregate_function_combinators │
│ asynchronous_inserts           │
│ ...........                    │
│ ...........                    │
│ storage_policies               │
│ table_engines                  │
│ table_functions                │
│ tables                         │
│ time_zones                     │
│ trace_log                      │
│ user_directories               │
│ users                          │
│ warnings                       │
└────────────────────────────────┘
76 rows in set. Elapsed: 0.003 sec.
ch-server :) SELECT query
FROM system.query_log;
┌─query─────────────────────┐
│ SHOW TABLES FROM system;  │
│ SHOW TABLES FROM default; │
└───────────────────────────┘
ch-server :) SELECT name, value
FROM system.settings;
┌─name─────────────────────────────────────┬───value─┐
│ min_compress_block_size                  │   65536 │
│ max_compress_block_size                  │ 1048576 │
│ max_block_size                           │   65505 │
│ max_insert_block_size                    │ 1048545 │
│ ...............                          │ ....... │
│ input_format_mysql_dump_map_column_names │       1 │
└──────────────────────────────────────────┴─────────┘
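These tables are handy for introspection; for example, a common trick is pulling the slowest recent queries out of system.query_log:
ch-server :) SELECT query, query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 5;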
Show the execution plan of a ClickHouse query
ClickHouse supports EXPLAIN, inspired by PostgreSQL.
ch-server :) EXPLAIN AST
SELECT *
FROM data_file;
┌─explain─────────────────────────────────────┐
│ SelectWithUnionQuery (children 1)           │
│  ExpressionList (children 1)                │
│   SelectQuery (children 2)                  │
│    ExpressionList (children 1)              │
│     Asterisk                                │
│    TablesInSelectQuery (children 1)         │
│     TablesInSelectQueryElement (children 1) │
│      TableExpression (children 1)           │
│       TableIdentifier data_file             │
└─────────────────────────────────────────────┘
9 rows in set. Elapsed: 0.001 sec.
ch-server :) EXPLAIN PLAN
SELECT *
FROM data_file;
┌─explain───────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))                               │
│   SettingQuotaAndLimits (Set limits and quota after reading from storage) │
│     ReadFromStorage (File)                                                │
└────────────────────────────────────────────────────────────────────────────┘
3 rows in set. Elapsed: 0.002 sec.
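EXPLAIN has more modes than AST and PLAN: EXPLAIN SYNTAX shows the query after rewrites and optimizations, and EXPLAIN PIPELINE shows the query execution pipeline; for example:
ch-server :) EXPLAIN PIPELINE
SELECT *
FROM data_file;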
Convert a datetime string to ClickHouse's DateTime irrespective of its format
ClickHouse provides parseDateTimeBestEffort, a function which does the heavy lifting of identifying the timestamp format.
ch-server :) SELECT parseDateTimeBestEffort('31-05-2022 11:11:11') AS result;
┌──────────────result─┐
│ 2022-05-31 11:11:11 │
└─────────────────────┘
1 row in set. Elapsed: 0.002 sec.
ch-server :) SELECT parseDateTimeBestEffort('31/05/2022 11:11:11') AS result;
┌──────────────result─┐
│ 2022-05-31 11:11:11 │
└─────────────────────┘
1 row in set. Elapsed: 0.003 sec.
ch-server :) SELECT parseDateTimeBestEffort('20220531111111') AS result;
┌──────────────result─┐
│ 2022-05-31 11:11:11 │
└─────────────────────┘
Not only that, it even understands Unix epoch timestamps.
ch-server :) SELECT parseDateTimeBestEffort('1653995471') AS result;
┌──────────────result─┐
│ 2022-05-31 11:11:11 │
└─────────────────────┘
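If the input may be malformed, the parseDateTimeBestEffortOrNull variant returns NULL instead of throwing an exception (output sketched):
ch-server :) SELECT parseDateTimeBestEffortOrNull('not a date') AS result;
┌─result─┐
│   ᴺᵁᴸᴸ │
└────────┘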
Create row-level policies for fine-grained user access
This ClickHouse feature is similar to PostgreSQL's row-level security, but it applies only to ClickHouse users with readonly access.
Let's say you have created two users, john and michael, and you want john to read only certain rows, and likewise for michael.
❯ clickhouse-client --user=default
ch-server :) CREATE USER john;
Ok.
ch-server :) CREATE USER michael;
Ok.
ch-server :) CREATE TABLE users
(
`id` UInt32,
`name` String,
`age` UInt32
)
ENGINE = MergeTree
ORDER BY id;
Ok.
ch-server :) INSERT INTO users VALUES (1, 'John', 20), (2, 'Michael', 18);
Ok.
ch-server :) SELECT *
FROM users;
┌─id─┬─name────┬─age─┐
│  1 │ John    │  20 │
│  2 │ Michael │  18 │
└────┴─────────┴─────┘
ch-server :) GRANT SELECT(id, name, age) ON default.users TO john WITH GRANT OPTION;
Ok.
ch-server :) GRANT SELECT(id, name, age) ON default.users TO michael WITH GRANT OPTION;
Ok.
Now you can configure row-level policies for both of these users:
❯ clickhouse-client --user=default
ch-server :) CREATE ROW POLICY john_users ON default.users FOR SELECT USING name = 'John' TO john;
Ok.
ch-server :) CREATE ROW POLICY michael_users ON default.users FOR SELECT USING name = 'Michael' TO michael;
Ok.
This way, when each of these users logs in, they see only the rows they are allowed to read.
❯ clickhouse-client --user=john
ch-server :) SELECT *
FROM users;
┌─id─┬─name─┬─age─┐
│  1 │ John │  20 │
└────┴──────┴─────┘
❯ clickhouse-client --user=michael
ch-server :) SELECT *
FROM users;
┌─id─┬─name────┬─age─┐
│  2 │ Michael │  18 │
└────┴─────────┴─────┘
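You can audit the policies later with SHOW ROW POLICIES (or the system.row_policies table), and drop one with DROP ROW POLICY:
❯ clickhouse-client --user=default
ch-server :) SHOW ROW POLICIES;
ch-server :) DROP ROW POLICY michael_users ON default.users;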
Automatic schema inference for JSON documents
Last but not least, this is my favourite feature. ClickHouse recently added a new JSON Object data type. It infers the data types dynamically, so as ClickHouse end users we don't need to worry about the schema. The inference even works with nested JSON.
We can first run DESCRIBE TABLE to check the schema ClickHouse will infer for us.
ch-server :) DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz', 'JSONEachRow')
SETTINGS input_format_max_rows_to_read_for_schema_inference = 100
┌─name───────┬─type─────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ type       │ Nullable(String)         │              │                    │         │                  │                │
│ actor      │ Object(Nullable('json')) │              │                    │         │                  │                │
│ repo       │ Object(Nullable('json')) │              │                    │         │                  │                │
│ created_at │ Nullable(String)         │              │                    │         │                  │                │
│ payload    │ Object(Nullable('json')) │              │                    │         │                  │                │
└────────────┴──────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
Now we can create the table with a single column of type JSON:
ch-server :) CREATE TABLE github_json
(
`event` JSON
)
ENGINE = MergeTree
ORDER BY tuple()
Ok.
And then import the data:
ch-server :) INSERT INTO github_json SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz', JSONAsObject, 'event JSON')
Ok.
0 rows in set. Elapsed: 25.469 sec. Processed 1.00 million rows, 453.17 MB (39.26 thousand rows/s., 17.79 MB/s.)
Now ClickHouse will add the inferred columns, and you can query them directly.
ch-server :) SELECT
event.type,
event.repo,
event.actor
FROM github_json
LIMIT 1
┌─event.type─┬─event.repo───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─event.actor──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PushEvent │ (429298592,'revacprogramming/pps-test1-Lakshmipatil2021','https://api.github.com/repos/revacprogramming/pps-test1-Lakshmipatil2021') │ ('https://avatars.githubusercontent.com/u/93110249?','Lakshmipatil2021',93110249,'Lakshmipatil2021','https://api.github.com/users/Lakshmipatil2021') │
└────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
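Since the inferred columns behave like regular ones, aggregations work as usual; for example, counting events by type (output omitted):
ch-server :) SELECT event.type, count()
FROM github_json
GROUP BY event.type
ORDER BY count() DESC;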
Conclusion
In conclusion, ClickHouse can act as a proxy, an HTTP server, and much more; this is just a subset of what ClickHouse can do. If you haven't tried ClickHouse yet, I encourage you to give it a try. Thank you for reading the blog.