Messing with DuckDB – Part 1 – Quick StackOverflow Testing

(Buzzing)

DuckDB is a portable, single-process OLAP-focused database that is getting a lot of traction in various data/database communities, so I decided to try it out. As I had the Postgres version of the StackOverflow database handy (courtesy of Brent Ozar), I had an easy method of accessing some test data.

My goals in part 1 were as follows:

  • Install it
  • Query an OLTP database (Postgres or SQL Server) via DuckDB
  • Create Parquet files from the OLTP database
  • Query the Parquet files
  • Persist the Parquet files in a DuckDB database
  • Copy the Parquet files to another host and test the portability of DuckDB

Installing DuckDB

DuckDB was ‘installed’ on a Windows 11 VM. I use the word ‘installed’ lightly, as it was simply a case of unzipping the executable and placing it somewhere. In this case, I bunged it on the desktop.

Plugging it into Postgres

Conveniently, I had the Postgres version of the StackOverflow database set up on version 17, courtesy of Brent Ozar. I could have plugged it into SQL Server, but the StackOverflow copy I have on SQL Server across my VM estate at the moment is the small one.

Double-clicking the DuckDB executable opens a bog-standard CLI. The Postgres extension needs to be set up before attempting to connect to your Postgres database:

install postgres

We can now load PostgreSQL:

load postgres

Next, to make connecting to Postgres more convenient, I created a temporary secret.

CREATE SECRET (
    TYPE postgres,
    HOST '192.168.0.160',
    PORT 5432,
    DATABASE StackOverflow,
    USER 'postgres',
    PASSWORD '*********'
);

We can now connect to Postgres. We can leave the connection string blank, as DuckDB will automatically refer to the secret:

ATTACH '' AS StackoverFlow (TYPE postgres);

We are in.

Query Postgres in DuckDB

Querying Postgres in DuckDB is straightforward. We just reference the identifier we used when attaching the database (e.g., StackOverflow) like a schema:

SELECT COUNT(*) FROM StackoverFlow.public.posts;

The real-time progress indicator is frankly cool (I am easily impressed).

Creating Parquet files from Postgres data

The next task was to create some Parquet files on disk. Once Parquet files have been created, we have essentially decoupled the data from our source (Postgres) and made the data portable and easily accessible.

Doing this with DuckDB was easy:

COPY (SELECT * FROM StackoverFlow.public.posts) 
TO '\\io01ehuk\data\posts.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

COPY (SELECT * FROM StackoverFlow.public.users) 
TO '\\io01ehuk\data\users.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

COPY (SELECT * FROM StackoverFlow.public.comments) 
TO '\\io01ehuk\data\comments.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

Generating the Parquet files was nice and quick. I intuitively know how fast my rig is, and I was pleased with the performance. The size of the files was about 30GB in total using ZSTD compression.

If you were doing this with real data, you would likely filter the data you need, not just dump out entire tables.

For testing purposes, I just used a standard UNC path to a local file server. Out in the real world, organisations could use:

  • Azure blob
  • AWS S3
  • Google Cloud
  • S3 Compatible

Querying Parquet files

The beauty of DuckDB is the ability to launch the 30MB DuckDB executable and start querying Parquet files instantly. There is no complex installation of software, no connection strings or permissions to wrestle with. You are up and running straight away.

Querying is easy. Here are three basic queries:

-- Basic query against a single file
SELECT COUNT(*) FROM '\\io01ehuk\data\posts.parquet'; --Instant

-- Query with filtering and aggregation
SELECT 
    DATE_TRUNC('month', creationdate) AS month,
    COUNT(*) AS posts
FROM '\\io01ehuk\data\posts.parquet'
WHERE posttypeid = 1  
GROUP BY 1
ORDER BY 1; --3 seconds

-- Join across files
SELECT 
    u.displayname,
    u.reputation,
    COUNT(*) AS post_count,
    AVG(p.score) AS avg_score
FROM '\\io01ehuk\data\posts.parquet' p
JOIN '\\io01ehuk\data\users.parquet' u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20; --6.5 seconds

The queries were much faster than querying the Postgres StackOverflow database directly. The large caveat is that I made zero attempt to look at indexes on the StackOverflow Postgres database (there are some present).

ParquetPostgres
COUNT(*)instant2 seconds
GROUP BY3 seconds43 seconds
JOINS6.5 seconds94 seconds

Nice.

Creating a DuckDB Database With Tables and Views

Up until now, we have just worked on connecting to remote sources and querying ad-hoc Parquet files. There may be times when you want to persist a DuckDB database and create objects like tables and views for easier access to your data.

First, create a DuckDB database:

.open '\\io01ehuk\data\stackoverflow.duckdb'

Next, create our views:

CREATE VIEW posts AS SELECT * FROM '\\io01ehuk\data\posts.parquet';
CREATE VIEW users AS SELECT * FROM '\\io01ehuk\data\users.parquet';
CREATE VIEW comments AS SELECT * FROM '\\io01ehuk\data\comments.parquet';

We can now query them:

SELECT 
    u.displayname,
    COUNT(c.id) AS comments_made
FROM users u
JOIN comments c ON u.id = c.userid
GROUP BY u.id, u.displayname
ORDER BY comments_made DESC
LIMIT 10;

What do views give us over just hitting the Parquet files directly?

  • Easier to write queries for
    • Self explanatory
  • Abstraction over file location
    • If you move the file location, you only need to update the view, not all your queries, which would point directly at some Parquet files
  • Built-in transformations
    • Bake in some filters or computed columns
  • Makes sharing the DuckDB database easier
    • You can hand it to people and say, “Just query posts, comments, users, etc”

What about persisting the data into actual tables? Let’s try it:

CREATE TABLE posts AS SELECT * FROM '\\io01ehuk\data\posts.parquet';
CREATE TABLE users AS SELECT * FROM '\\io01ehuk\data\users.parquet';
CREATE TABLE comments AS SELECT * FROM '\\io01ehuk\data\comments.parquet';

This failed as views and tables share the same namespace in DuckDB. We need to drop the views first:

DROP VIEW posts;
DROP VIEW users;
DROP VIEW comments;

CREATE TABLE posts AS SELECT * FROM '\\io01ehuk\data\posts.parquet';
CREATE TABLE users AS SELECT * FROM '\\io01ehuk\data\users.parquet';
CREATE TABLE comments AS SELECT * FROM '\\io01ehuk\data\comments.parquet';

This caused our DuckDB database to grow to 119GB compared to the Parquet files’ total size of 30GB. I do not know the ins and outs of why this is. Maybe due to the fact that the Parquet files were using ZSTD compression (who knows)?

We can check that the tables are present by running the following:

SHOW ALL TABLES;

--OR

SELECT table_name, table_type 
FROM information_schema.tables 
WHERE table_schema = 'main';

Querying the DuckDB Remotely vs Locally

I decided to test the performance of the longest-running test query against a remote location vs local to cover it off. This is not an exhaustive test.

I took the test query, which takes the longest to run in both flavours:

--DuckDB database

SELECT 
    u.displayname,
    u.reputation,
    COUNT(*) AS post_count,
    AVG(p.score) AS avg_score
FROM posts p
JOIN users u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20; 

--Parquet (remote)

SELECT 
    u.displayname,
    u.reputation,
    COUNT(*) AS post_count,
    AVG(p.score) AS avg_score
FROM '\\io01ehuk\data\posts.parquet' p
JOIN '\\io01ehuk\data\users.parquet' u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20;

--Parquet (local)

SELECT 
    u.displayname,
    u.reputation,
    COUNT(*) AS post_count,
    AVG(p.score) AS avg_score
FROM 'F:\local_data\posts.parquet' p
JOIN 'F:\local_data\users.parquet' u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20;

Here are the results:

Remote LocationLocal Disk (Windows 11)
DuckDB Database4.5s3.5s
Parquet Files4.0s3.4s

Connecting via DBeaver

Of course, for ad-hoc querying, the CLI probably isn’t going to cut it. Fortunately, you can connect to DuckDB using popular tools such as DBeaver:

Just add a connection, pointing to your DuckDB database (it may download a driver). And you should be good to go:

Conclusion

Because DuckDB is so small and easy to set up and use, I recommend you just fire it up and play with it on your workstation. The documentation is nice and straightforward, and I think it has almost limitless utility for organisations and data people.

I didn’t touch even a fraction of what DuckDB is capable of in part 1. In part 2, I may connect things up to SQL Server and do some testing around cloud and flat files, with the final part possibly concerning Local LLMs or embedded in an application use case.

Resources

https://duckdb.org

One comment

Leave a Reply

Discover more from eheaton.com

Subscribe now to keep reading and get access to the full archive.

Continue reading