DuckDB: Query SQLite and PostgreSQL Data
Introduction
DuckDB can query data stored in SQLite and PostgreSQL using SQLiteScanner and PostgresScanner respectively. DuckDB claims reads via DuckDB client will be faster than reads via their respective datastore client. Let’s test it out.
Note:
- This is not any kind of full-blown performance analysis, we just trying one random query and comparing its query duration.
- To prevent performance difference due to indexes, we are not utilizing indexes at all.
- We didn’t tweak any of the PRAGMA or configuration variables and used the default configuration for all three datastores.
- We ran all the commands in the same physical host.
Let’s load one month of NYC Taxi Dataset (14M rows) into SQLite(v3.33.0) and try to query it via SQLite and DuckDB.
Querying SQLite data from SQLite
sqlite> SELECT
...> COUNT(DISTINCT vendor_name) AS count_vendor,
...> COUNT(DISTINCT Passenger_Count) AS count_passenger,
...> COUNT(DISTINCT Trip_Distance) AS count_trip,
...> COUNT(DISTINCT Rate_Code) AS count_code,
...> COUNT(DISTINCT store_and_forward) AS count_store,
...> COUNT(DISTINCT Payment_Type) AS count_payment
...> FROM taxi;
4|8|5748|1|3|6
Run Time: real 10.936 user 9.542476 sys 0.840094
Querying SQLite data from DuckDB
duckdb> LOAD 'build/release/sqlite_scanner.duckdb_extension';
duckdb> CALL sqlite_attach('sqlite3.db');
duckdb> SELECT
> COUNT(DISTINCT vendor_name) AS count_vendor,
> COUNT(DISTINCT Passenger_Count) AS count_passenger,
> COUNT(DISTINCT Trip_Distance) AS count_trip,
> COUNT(DISTINCT Rate_Code) AS count_code,
> COUNT(DISTINCT store_and_forward) AS count_store,
> COUNT(DISTINCT Payment_Type) AS count_payment
> FROM taxi;
┌──────────────┬─────────────────┬────────────┬────────────┬─────────────┬───────────────┐
│ count_vendor │ count_passenger │ count_trip │ count_code │ count_store │ count_payment │
├──────────────┼─────────────────┼────────────┼────────────┼─────────────┼───────────────┤
│ 4 │ 8 │ 5748 │ 1 │ 3 │ 6 │
└──────────────┴─────────────────┴────────────┴────────────┴─────────────┴───────────────┘
Run Time: real 7.127 user 12.567104 sys 22.495830
Query Via | Query Duration (sec) |
---|---|
SQLite | 11 |
DuckDB | 7 |
Let’s load same one month of NYC Taxi Dataset into PostgreSQL(v14.3) and try to query it via PostgreSQL and DuckDB.
Querying PostgreSQL data from PostgreSQL
postgres> SELECT
COUNT(DISTINCT "vendor_name") AS count_vendor,
COUNT(DISTINCT "Passenger_Count") AS count_passenger,
COUNT(DISTINCT "Trip_Distance") AS count_trip,
COUNT(DISTINCT "Rate_Code") AS count_code,
COUNT(DISTINCT "store_and_forward") AS count_store,
COUNT(DISTINCT "Payment_Type") AS count_payment
FROM taxi;
count_vendor | count_passenger | count_trip | count_code | count_store | count_payment
--------------+-----------------+------------+------------+-------------+---------------
4 | 8 | 5748 | 1 | 3 | 6
(1 row)
Time: 27258.238 ms (00:27.258)
Querying PostgreSQL data from DuckDB
$ ./duckdb/build/release/duckdb
v0.3.5-dev258 fe433d317
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
duckdb> LOAD 'build/release/postgres_scanner.duckdb_extension';
duckdb> CALL POSTGRES_ATTACH('host=127.0.0.1 port=5432 user=postgres dbname=postgres');
duckdb> SELECT
> COUNT(DISTINCT vendor_name) AS count_vendor,
> COUNT(DISTINCT Passenger_Count) AS count_passenger,
> COUNT(DISTINCT Trip_Distance) AS count_trip,
> COUNT(DISTINCT Rate_Code) AS count_code,
> COUNT(DISTINCT store_and_forward) AS count_store,
> COUNT(DISTINCT Payment_Type) AS count_payment
> FROM taxi;
┌──────────────┬─────────────────┬────────────┬────────────┬─────────────┬───────────────┐
│ count_vendor │ count_passenger │ count_trip │ count_code │ count_store │ count_payment │
├──────────────┼─────────────────┼────────────┼────────────┼─────────────┼───────────────┤
│ 4 │ 8 │ 5748 │ 1 │ 3 │ 6 │
└──────────────┴─────────────────┴────────────┴────────────┴─────────────┴───────────────┘
Run Time: real 3.472 user 5.938940 sys 0.657260
Query Via | Query Duration (sec) |
---|---|
PostgreSQL | 27 |
DuckDB | 3.5 |
Conclusion
Overall DuckDB claims were correct and it is already proving itself faster than well known databases.
Need to debug more
I was expecting PostgreSQL to be fastest among all three but it proved to be slowest for the query we took.