OLAP Databases
Last updated: Nov 26, 2021
Introduction
Recently, I came across a new embedded analytics database, DuckDB which I found inspired from two of my favorite ones SQLite and ClickHouse. DuckDB looked very interesting since I find that it filled that gap of not having a strong embedded database for OLAP requirements really well.
In this post, we will load one month of NYC Taxi Dataset in each of these three databases and run some queries to get an idea about each of them.
Don’t take it as a performance comparison in any way since SQLite & DuckDB are embedded databases (they don’t have any separate server running) while ClickHouse runs a separate server, so obviously that won’t be apples-to-apples comparison.
Data Loading
Each of the three databases support loading data from CSV files, we used the feature to import data
$ sqlite3 sqlite3.db
SQLite version 3.33.0 2020-08-14 13:23:32
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import yellow_tripdata_2009-01.csv taxi
$ duckdb duckdb.db
SQLite version DuckDB d9bceddc7
Enter ".help" for usage hints.
duckdb> CREATE TABLE taxi AS SELECT * FROM read_csv_auto('yellow_tripdata_2009-01.csv');
For ClickHouse, we need to create the table beforehand with proper schema. In this case, I used all fields as strings since we aren’t thinking about any performance optimization.
$ clickhouse-client
server :) CREATE TABLE default.taxi
(
`vendor_name` String,
`Trip_Pickup_DateTime` String,
`Trip_Dropoff_DateTime` String,
`Passenger_Count` String,
`Trip_Distance` String,
`Start_Lon` String,
`Start_Lat` String,
`Rate_Code` String,
`store_and_forward` String,
`End_Lon` String,
`End_Lat` String,
`Payment_Type` String,
`Fare_Amt` String,
`surcharge` String,
`mta_tax` String,
`Tip_Amt` String,
`Tolls_Amt` String,
`Total_Amt` String
)
ENGINE = MergeTree()
ORDER BY vendor_name
SETTINGS index_granularity = 8192
$ cat yellow_tripdata_2009-01.csv | clickhouse-client --query="INSERT INTO taxi FORMAT CSVWithNames";
Data Storage
Let’s have a look how much disk each of these took to store this csv in their native data format.
The csv file which we imported above is around 2.4G.
$ du -sh yellow_tripdata_2009-01.csv
2.4G yellow_tripdata_2009-01.csv
In SQLite format, it took around 2.7G.
$ ls -lh
Permissions Size User Date Modified Name
.rw-r--r-- 2.7G pradeep 4 Oct 12:30 sqlite3.db
In DuckDB format, it took around 1.9G.
$ ls -lh
Permissions Size User Date Modified Name
.rw-r--r-- 1.9G pradeep 4 Oct 12:20 duckdb.data
.rw-r--r-- 12 pradeep 4 Oct 13:03 duckdb.data.wal
In ClickHouse format, it took around 780M.
$ clickhouse-client
server :) SELECT
concat(database, '.', table) AS table,
formatReadableSize(sum(bytes)) AS size,
sum(rows) AS rows,
sum(bytes) AS bytes_size,
any(engine) AS engine,
formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_keys_size
FROM system.parts
WHERE active
GROUP BY
database,
table
ORDER BY bytes_size DESC
┌─table────────────────┬─size───────┬─────rows─┬─bytes_size─┬─engine────┬─primary_keys_size─┐
│ default.taxi │ 778.20 MiB │ 14092413 │ 816006123 │ MergeTree │ 20.24 KiB │
└──────────────────────┴────────────┴──────────┴────────────┴───────────┴───────────────────┘
Data Query
Counting number of rows
Let’s make a simple query to count the number of rows in each of three datastores.
In SQLite, it took around 1 second to complete the query.
sqlite> .timer on
sqlite> SELECT COUNT(*) from taxi;
14092413
Run Time: real 1.053 user 0.291249 sys 0.759388
In DuckDB, the same query took around around 0.043 second.
duckdb> .timer on
duckdb> SELECT COUNT(*) from taxi;
14092413
Run Time: real 0.043 user 0.042005 sys 0.000352
Wow, that’s around 20x faster.
In ClickHouse, the same query took around 0.006 second.
$ clickhouse-client
server :) SELECT count(*) FROM taxi;
┌──count()─┐
│ 14092413 │
└──────────┘
1 rows in set. Elapsed: 0.006 sec.
Counting distinct rows
Let’s make a query to count distinct rows for column: vendor_name.
sqlite> SELECT COUNT(DISTINCT vendor_name) FROM taxi;
3
Run Time: real 2.027 user 1.392251 sys 0.630733
duckdb> SELECT COUNT(DISTINCT vendor_name) FROM taxi;
3
Run Time: real 0.485 user 0.482315 sys 0.001113
$ clickhouse-client
server :) SELECT COUNT(DISTINCT vendor_name) FROM taxi;
┌─uniqExact(vendor_name)─┐
│ 3 │
└────────────────────────┘
1 rows in set. Elapsed: 0.121 sec. Processed 14.09 million rows, 169.11 MB (116.89 million rows/s., 1.40 GB/s.)
Let’s make a query to count distinct rows for multiple columns.
sqlite> SELECT
...> COUNT(DISTINCT vendor_name),
...> COUNT(DISTINCT Passenger_Count),
...> COUNT(DISTINCT Trip_Distance),
...> COUNT(DISTINCT Rate_Code),
...> COUNT(DISTINCT store_and_forward),
...> COUNT(DISTINCT Payment_Type)
...> FROM taxi;
3,8,5748,1,3,6
Run Time: real 11.952 user 10.962464 sys 0.896248
duckdb> SELECT
...> COUNT(DISTINCT vendor_name),
...> COUNT(DISTINCT Passenger_Count),
...> COUNT(DISTINCT Trip_Distance),
...> COUNT(DISTINCT Rate_Code),
...> COUNT(DISTINCT store_and_forward),
...> COUNT(DISTINCT Payment_Type)
...> FROM taxi;
3|8|5739|0|2|6
Run Time: real 4.181 user 4.138314 sys 0.023508
$ clickhouse-client
server :) SELECT
COUNTDistinct(vendor_name),
COUNTDistinct(Passenger_Count),
COUNTDistinct(Trip_Distance),
COUNTDistinct(Rate_Code),
COUNTDistinct(store_and_forward),
COUNTDistinct(Payment_Type)
FROM taxi FORMAT CSV;
3,8,5748,1,3,6
1 rows in set. Elapsed: 0.595 sec. Processed 14.09 million rows, 1.07 GB (23.69 million rows/s., 1.80 GB/s.)
Conclusion
Based on above data, we can conclude that DuckDB is a very good fit for analytical purposes. Although, it’s still in early phase of development, I am really excited about it. I am looking forward to use it in my projects soon.