Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
A Senior SQL Server DBA based in the North West of England
A Senior SQL Server DBA based in the North West of England

(Expectant)
In the first part of this series, I played around with DuckDB, which involved setting it up and plugging it into an existing datasource (Postgres). This data source was the copy of StackOverflow from SmartPostgres.
In this part, I am eager to see if things are as seamless when connecting DuckDB to SQL Server, which is my current ‘main’ (fans of fighting games will understand the reference).
Connecting to SQL Server should simply be a case of installing the official core extension, right? Wrong.

There’s nothing on the list here for SQL Server, so we are going to look at community extensions.
The first one we will try is nanodbc. This is on the Community Extension page here. In addition, here is the GitHub page for the extension. This extension is new and still under active development in its formative stages.
This means we can’t create a secret in DuckDB containing our connection details, which was highly convenient in the Postgres extension.
First, install the extension:
install nanodbc FROM community;

Next, load it:
load nanodbc;

Great, we have some semblance of connectivity to SQL Server.
The first option we have is using odbc_scan on a table. Let’s try dbo.Posts…
SELECT * FROM odbc_scan(
connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
table_name='dbo.Posts'
);
Oh:

It looks like odbc_scan does not like LOB columns (Body for example). This is, I believe, a limitation of ODBC (not entirely sure).
Let’s try a table with no NVARCHAR(MAX) column:
SELECT * FROM odbc_scan(
connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
table_name='Votes'
);
Works:

The second and more flexible option is odbc_query. Let’s do a quick count of the posts table:
SELECT * FROM odbc_query(
connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
query='SELECT COUNT(*) AS cnt FROM dbo.Posts WHERE PostTypeId = 1'
);

This approach works, but there are pitfalls we need to consider. Notice I pushed the count to SQL Server as opposed to sending all the rows over to DuckDB and then counting.
The following is probably not how you want to do it:
SELECT COUNT(*) FROM odbc_query(
connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
query='SELECT Id, Title, Score FROM dbo.Posts WHERE PostTypeId = 1'
);
Looking at sp_blitzcache, we can see the difference in row counts between pushing the count to SQL Server vs bringing it across:


The first result from sp_blitzcache is from not pushing the count to SQL Server; the second is obviously from doing the count on SQL Server.
Here is an example of another simple query against SQL Server using DuckDB, plus its native, OPENQUERY and vanilla Linked Server equivalents (just as a comparison)
-- DuckDB (ODBC)
SELECT * FROM odbc_query(
connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
query='SELECT
YEAR(CreationDate) AS yr,
COUNT(*) AS post_count,
AVG(Score) AS avg_score
FROM dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr'
);
-- SQL Server Native
SELECT
YEAR(CreationDate) AS yr,
COUNT(*) AS post_count,
AVG(Score) AS avg_score
FROM dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr;
-- OPENQUERY from another SQL Server to ensure pushdown
SELECT * FROM OPENQUERY([SKSQS25DV\EH_DV],
'SELECT
YEAR(CreationDate) AS yr,
COUNT(*) AS post_count,
AVG(Score) AS avg_score
FROM StackOverflow2010.dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr'
);
-- Linked Server (vanilla)
SELECT
YEAR(CreationDate) AS yr,
COUNT(*) AS post_count,
AVG(Score) AS avg_score
FROM [SKSQS25DV\EH_DV].StackOVerflow2010.dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr;
As the data set is pretty small, they all run quickly (<2 secs) even with missing indexes, so this is not a perf test as such.
Linked Servers have pitfalls with regard to performance, this is out of the scope of this blog post
Whilst testing, I noticed that when executing a simple query from DuckDB like the one below, we get two executions logged by sp_blitzcache:
-- DuckDB (ODBC)
SELECT * FROM odbc_query(
connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
query='SELECT
YEAR(CreationDate) AS yr,
COUNT(*) AS post_count,
AVG(Score) AS avg_score
FROM dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr'
);
Check it out:

I have tried to research this and have seen mention of ODBC running two queries, a first metadata pass and then the query itself. However, looking at sp_blitzcache, everything points to things actually running twice. I have raised this on the Github page as a question.
Since playing with nanodbc, I realised there was actually an MSSQL community extension developed by Vladimir Gribanov. Here is the Github page.
Like nanodbc, this is also very new, with some things unsupported (named instances).
Let’s get this installed:
INSTALL mssql FROM community;

We can now load it:
Load mssql;

The beauty of this extension is that we can create a secret with our connection details:
CREATE SECRET Stack_MSSQL (
TYPE mssql,
host '192.168.10.144',
port 1433,
database 'StackOverflow2010',
user 'sa',
password '********',
use_encrypt false
);
ATTACH '' AS sqlserver (TYPE mssql, SECRET Stack_MSSQL );

We can now attach it:
ATTACH '' AS sqlserver (TYPE mssql, SECRET Stack_MSSQL);
Oh…


I tried several different SQL Servers and experimented with all the options, but to no avail. I believe this extension is not working on Windows x64 architecture yet. Over to Ubuntu Server…
Works!

Also works on my Mac:

I decided to press forward with DuckDB on Ubuntu Server for now. Checking a simple query:
SELECT COUNT(*) FROM sqlserver.dbo.Posts WHERE PostTypeId =1;

This is already more convenient than nanodbc, as we can omit any connection strings and write the query in a more natural format.
Trying a slightly more complex example (to match our nanodbc example):
SELECT
YEAR(CreationDate) AS yr,
COUNT(*) AS post_count,
AVG(Score) AS avg_score
FROM sqlserver.dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr;

Remember, in the nanodbc example, I noticed two executions when only one query was triggered? Using this new community extension, I observe one execution in the plan cache:

I did not have the time to test this extension exhaustively, but it looks very promising. I have decided to hold off on further testing DuckDB with SQL Server to wait until this mssql extension is more mature. In the meantime, I recommend you try the mssql extension first.
When I started playing with DuckDB, I naturally started plugging it into Postgres as I was working with it on an existing project. I naively assumed SQL Server would integrate just as seamlessly. My findings have been as follows:
https://duckdb.org/community_extensions/extensions/nanodbc
https://duckdb.org/community_extensions/extensions/mssql
https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent