Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
A Senior SQL Server DBA based in the North West of England
A Senior SQL Server DBA based in the North West of England

(Buzzing)
DuckDB is a portable, single-process OLAP-focused database that is getting a lot of traction in various data/database communities, so I decided to try it out. As I had the Postgres version of the StackOverflow database handy (courtesy of Brent Ozar), I had an easy method of accessing some test data.
My goals in part 1 were as follows:
DuckDB was ‘installed’ on a Windows 11 VM. I use the word ‘installed’ lightly, as it was simply a case of unzipping the executable and placing it somewhere. In this case, I bunged it on the desktop.

Conveniently, I had the Postgres version of the StackOverflow database set up on version 17, courtesy of Brent Ozar. I could have plugged it into SQL Server, but the StackOverflow copy I have on SQL Server across my VM estate at the moment is the small one.
Double-clicking the DuckDB executable opens a bog-standard CLI. The Postgres extension needs to be set up before attempting to connect to your Postgres database:
install postgres

We can now load PostgreSQL:
load postgres

Next, to make connecting to Postgres more convenient, I created a temporary secret.
CREATE SECRET (
TYPE postgres,
HOST '192.168.0.160',
PORT 5432,
DATABASE StackOverflow,
USER 'postgres',
PASSWORD '*********'
);

We can now connect to Postgres. We can leave the connection string blank, as DuckDB will automatically refer to the secret:
ATTACH '' AS StackoverFlow (TYPE postgres);

We are in.
Querying Postgres in DuckDB is straightforward. We just reference the identifier we used when attaching the database (e.g., StackOverflow) like a schema:
SELECT COUNT(*) FROM StackoverFlow.public.posts;

The real-time progress indicator is frankly cool (I am easily impressed).
The next task was to create some Parquet files on disk. Once Parquet files have been created, we have essentially decoupled the data from our source (Postgres) and made the data portable and easily accessible.
Doing this with DuckDB was easy:
COPY (SELECT * FROM StackoverFlow.public.posts) TO '\\io01ehuk\data\posts.parquet' (FORMAT PARQUET, COMPRESSION ZSTD); COPY (SELECT * FROM StackoverFlow.public.users) TO '\\io01ehuk\data\users.parquet' (FORMAT PARQUET, COMPRESSION ZSTD); COPY (SELECT * FROM StackoverFlow.public.comments) TO '\\io01ehuk\data\comments.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);


Generating the Parquet files was nice and quick. I intuitively know how fast my rig is, and I was pleased with the performance. The size of the files was about 30GB in total using ZSTD compression.
If you were doing this with real data, you would likely filter the data you need, not just dump out entire tables.
For testing purposes, I just used a standard UNC path to a local file server. Out in the real world, organisations could use:
The beauty of DuckDB is the ability to launch the 30MB DuckDB executable and start querying Parquet files instantly. There is no complex installation of software, no connection strings or permissions to wrestle with. You are up and running straight away.
Querying is easy. Here are three basic queries:
-- Basic query against a single file
SELECT COUNT(*) FROM '\\io01ehuk\data\posts.parquet'; --Instant
-- Query with filtering and aggregation
SELECT
DATE_TRUNC('month', creationdate) AS month,
COUNT(*) AS posts
FROM '\\io01ehuk\data\posts.parquet'
WHERE posttypeid = 1
GROUP BY 1
ORDER BY 1; --3 seconds
-- Join across files
SELECT
u.displayname,
u.reputation,
COUNT(*) AS post_count,
AVG(p.score) AS avg_score
FROM '\\io01ehuk\data\posts.parquet' p
JOIN '\\io01ehuk\data\users.parquet' u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20; --6.5 seconds



The queries were much faster than querying the Postgres StackOverflow database directly. The large caveat is that I made zero attempt to look at indexes on the StackOverflow Postgres database (there are some present).
| Parquet | Postgres | |
| COUNT(*) | instant | 2 seconds |
| GROUP BY | 3 seconds | 43 seconds |
| JOINS | 6.5 seconds | 94 seconds |
Nice.
Up until now, we have just worked on connecting to remote sources and querying ad-hoc Parquet files. There may be times when you want to persist a DuckDB database and create objects like tables and views for easier access to your data.
First, create a DuckDB database:
.open '\\io01ehuk\data\stackoverflow.duckdb'

Next, create our views:
CREATE VIEW posts AS SELECT * FROM '\\io01ehuk\data\posts.parquet'; CREATE VIEW users AS SELECT * FROM '\\io01ehuk\data\users.parquet'; CREATE VIEW comments AS SELECT * FROM '\\io01ehuk\data\comments.parquet';

We can now query them:
SELECT
u.displayname,
COUNT(c.id) AS comments_made
FROM users u
JOIN comments c ON u.id = c.userid
GROUP BY u.id, u.displayname
ORDER BY comments_made DESC
LIMIT 10;
What do views give us over just hitting the Parquet files directly?
What about persisting the data into actual tables? Let’s try it:
CREATE TABLE posts AS SELECT * FROM '\\io01ehuk\data\posts.parquet'; CREATE TABLE users AS SELECT * FROM '\\io01ehuk\data\users.parquet'; CREATE TABLE comments AS SELECT * FROM '\\io01ehuk\data\comments.parquet';
This failed as views and tables share the same namespace in DuckDB. We need to drop the views first:
DROP VIEW posts; DROP VIEW users; DROP VIEW comments; CREATE TABLE posts AS SELECT * FROM '\\io01ehuk\data\posts.parquet'; CREATE TABLE users AS SELECT * FROM '\\io01ehuk\data\users.parquet'; CREATE TABLE comments AS SELECT * FROM '\\io01ehuk\data\comments.parquet';

This caused our DuckDB database to grow to 119GB compared to the Parquet files’ total size of 30GB. I do not know the ins and outs of why this is. Maybe due to the fact that the Parquet files were using ZSTD compression (who knows)?

We can check that the tables are present by running the following:
SHOW ALL TABLES; --OR SELECT table_name, table_type FROM information_schema.tables WHERE table_schema = 'main';

I decided to test the performance of the longest-running test query against a remote location vs local to cover it off. This is not an exhaustive test.
I took the test query, which takes the longest to run in both flavours:
--DuckDB database
SELECT
u.displayname,
u.reputation,
COUNT(*) AS post_count,
AVG(p.score) AS avg_score
FROM posts p
JOIN users u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20;
--Parquet (remote)
SELECT
u.displayname,
u.reputation,
COUNT(*) AS post_count,
AVG(p.score) AS avg_score
FROM '\\io01ehuk\data\posts.parquet' p
JOIN '\\io01ehuk\data\users.parquet' u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20;
--Parquet (local)
SELECT
u.displayname,
u.reputation,
COUNT(*) AS post_count,
AVG(p.score) AS avg_score
FROM 'F:\local_data\posts.parquet' p
JOIN 'F:\local_data\users.parquet' u ON p.owneruserid = u.id
GROUP BY u.id, u.displayname, u.reputation
ORDER BY post_count DESC
LIMIT 20;
Here are the results:
| Remote Location | Local Disk (Windows 11) | |
| DuckDB Database | 4.5s | 3.5s |
| Parquet Files | 4.0s | 3.4s |
Of course, for ad-hoc querying, the CLI probably isn’t going to cut it. Fortunately, you can connect to DuckDB using popular tools such as DBeaver:
Just add a connection, pointing to your DuckDB database (it may download a driver). And you should be good to go:


Because DuckDB is so small and easy to set up and use, I recommend you just fire it up and play with it on your workstation. The documentation is nice and straightforward, and I think it has almost limitless utility for organisations and data people.
I didn’t touch even a fraction of what DuckDB is capable of in part 1. In part 2, I may connect things up to SQL Server and do some testing around cloud and flat files, with the final part possibly concerning Local LLMs or embedded in an application use case.
[…] Messing with DuckDB – Part 1 – Quick StackOverflow Testing […]