Messing with DuckDB – Part 2 – Plugging in SQL Server

(Expectant)

In the first part of this series, I played around with DuckDB, which involved setting it up and plugging it into an existing datasource (Postgres). This data source was the copy of StackOverflow from SmartPostgres.

In this part, I am eager to see if things are as seamless when connecting DuckDB to SQL Server, which is my current ‘main’ (fans of fighting games will understand the reference).

Setting up nanodbc

Connecting to SQL Server should simply be a case of installing the official core extension, right? Wrong.

There’s nothing on the list here for SQL Server, so we are going to look at community extensions.

Trying nanodbc

The first one we will try is nanodbc. This is on the Community Extension page here. In addition, here is the GitHub page for the extension. This extension is new and still under active development in its formative stages.

This means we can’t create a secret in DuckDB containing our connection details, which was highly convenient in the Postgres extension.

First, install the extension:

install nanodbc FROM community;

Next, load it:

load nanodbc;

Great, we have some semblance of connectivity to SQL Server.

Ways to query using nanodbc

The first option we have is using odbc_scan on a table. Let’s try dbo.Posts…

SELECT * FROM odbc_scan(
    connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
    table_name='dbo.Posts'
);

Oh:

It looks like odbc_scan does not like LOB columns (Body for example). This is, I believe, a limitation of ODBC (not entirely sure).

Let’s try a table with no NVARCHAR(MAX) column:

SELECT * FROM odbc_scan(
    connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
    table_name='Votes'
);

Works:

The second and more flexible option is odbc_query. Let’s do a quick count of the posts table:

SELECT * FROM odbc_query(
    connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
    query='SELECT COUNT(*) AS cnt FROM dbo.Posts WHERE PostTypeId = 1'
);

This approach works, but there are pitfalls we need to consider. Notice I pushed the count to SQL Server as opposed to sending all the rows over to DuckDB and then counting.

The following is probably not how you want to do it:

SELECT COUNT(*) FROM odbc_query(
connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
    query='SELECT Id, Title, Score FROM dbo.Posts WHERE PostTypeId = 1'
);

Looking at sp_blitzcache, we can see the difference in row counts between pushing the count to SQL Server vs bringing it across:

The first result from sp_blitzcache is from not pushing the count to SQL Server; the second is obviously from doing the count on SQL Server.

Another odbc_query example

Here is an example of another simple query against SQL Server using DuckDB, plus its native, OPENQUERY and vanilla Linked Server equivalents (just as a comparison)

-- DuckDB (ODBC)
SELECT * FROM odbc_query(
    connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
    query='SELECT 
               YEAR(CreationDate) AS yr,
               COUNT(*) AS post_count,
               AVG(Score) AS avg_score
           FROM dbo.Posts
           WHERE PostTypeId = 1
           GROUP BY YEAR(CreationDate)
           ORDER BY yr'
);

-- SQL Server Native
SELECT 
    YEAR(CreationDate) AS yr,
    COUNT(*) AS post_count,
    AVG(Score) AS avg_score
FROM dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr;

-- OPENQUERY from another SQL Server to ensure pushdown
SELECT * FROM OPENQUERY([SKSQS25DV\EH_DV], 
    'SELECT 
         YEAR(CreationDate) AS yr,
         COUNT(*) AS post_count,
         AVG(Score) AS avg_score
     FROM StackOverflow2010.dbo.Posts
     WHERE PostTypeId = 1
     GROUP BY YEAR(CreationDate)
     ORDER BY yr'
);

-- Linked Server (vanilla)
SELECT 
    YEAR(CreationDate) AS yr,
    COUNT(*) AS post_count,
    AVG(Score) AS avg_score
FROM [SKSQS25DV\EH_DV].StackOVerflow2010.dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr;

As the data set is pretty small, they all run quickly (<2 secs) even with missing indexes, so this is not a perf test as such.

Linked Servers have pitfalls with regard to performance, this is out of the scope of this blog post

Anomalous readings – (an aside)

Whilst testing, I noticed that when executing a simple query from DuckDB like the one below, we get two executions logged by sp_blitzcache:

-- DuckDB (ODBC)
SELECT * FROM odbc_query(
    connection='Driver={ODBC Driver 17 for SQL Server};Server=SKSQS25DV\EH_DV;Database=StackOverflow2010;Trusted_Connection=yes;',
    query='SELECT 
               YEAR(CreationDate) AS yr,
               COUNT(*) AS post_count,
               AVG(Score) AS avg_score
           FROM dbo.Posts
           WHERE PostTypeId = 1
           GROUP BY YEAR(CreationDate)
           ORDER BY yr'
);

Check it out:

I have tried to research this and have seen mention of ODBC running two queries, a first metadata pass and then the query itself. However, looking at sp_blitzcache, everything points to things actually running twice. I have raised this on the Github page as a question.

Introducing the mssql community extension

Since playing with nanodbc, I realised there was actually an MSSQL community extension developed by Vladimir Gribanov. Here is the Github page.

Like nanodbc, this is also very new, with some things unsupported (named instances).

Let’s get this installed:

INSTALL mssql FROM community;

We can now load it:

Load mssql;

The beauty of this extension is that we can create a secret with our connection details:

CREATE SECRET Stack_MSSQL (
    TYPE mssql,
    host '192.168.10.144',
    port 1433,
    database 'StackOverflow2010',
    user 'sa',
    password '********',
    use_encrypt false
);

ATTACH '' AS sqlserver (TYPE mssql, SECRET Stack_MSSQL );

We can now attach it:

ATTACH '' AS sqlserver (TYPE mssql, SECRET Stack_MSSQL);

Oh…

I tried several different SQL Servers and experimented with all the options, but to no avail. I believe this extension is not working on Windows x64 architecture yet. Over to Ubuntu Server…

Works!

Also works on my Mac:

I decided to press forward with DuckDB on Ubuntu Server for now. Checking a simple query:

SELECT COUNT(*) FROM sqlserver.dbo.Posts WHERE PostTypeId =1;

This is already more convenient than nanodbc, as we can omit any connection strings and write the query in a more natural format.

Trying a slightly more complex example (to match our nanodbc example):

SELECT
    YEAR(CreationDate) AS yr,
    COUNT(*)           AS post_count,
    AVG(Score)         AS avg_score
FROM sqlserver.dbo.Posts
WHERE PostTypeId = 1
GROUP BY YEAR(CreationDate)
ORDER BY yr;

Remember, in the nanodbc example, I noticed two executions when only one query was triggered? Using this new community extension, I observe one execution in the plan cache:

I did not have the time to test this extension exhaustively, but it looks very promising. I have decided to hold off on further testing DuckDB with SQL Server to wait until this mssql extension is more mature. In the meantime, I recommend you try the mssql extension first.

Summary

When I started playing with DuckDB, I naturally started plugging it into Postgres as I was working with it on an existing project. I naively assumed SQL Server would integrate just as seamlessly. My findings have been as follows:

  • There are at least two options for querying SQL Server via DuckDB
    • nanodbc
    • mssql community extension
  • Nanodbc is much more fiddly and does not support creating a secret, which is more convenient
  • mssql community extension works, but it is early in development
    • It is very promising, and it is something I will keep an eye on

Resources

https://duckdb.org/community_extensions/extensions/nanodbc

https://duckdb.org/community_extensions/extensions/mssql

https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent

Leave a Reply

Discover more from eheaton.com

Subscribe now to keep reading and get access to the full archive.

Continue reading