One of the ways in which modern databases silently helped rather poorly written programs during the first dot-com boom of the late ’90s was that they handled asynchronous queries, often when the rest of the system didn’t.
This led to some comments from Oracle co-founder Larry Ellison to the Guardian in 2001 that are very interesting in retrospect: “Some of these New Economy companies — take pets.com — it’s good that they’re gone. Selling cat food on the internet was lunacy.”
Coming back to today, the attraction of DuckDB, an open source project that released its version 1.0 in June, is that it is an in-process database. So I build it as part of my executable; I’m not connecting to some totally separate system. It isn’t for selling cat food.
The first thing to get your head around is that the data persistence is not the central concern here — just query processing. You can create an in-memory database that does not persist (i.e., save data) at all, or you can use a local file. The purpose of a DuckDB database is likely to be sucking up some data, querying over it, maybe making some transformations, then going away.
So for working with DuckDB, we just need a library or plug-in, not a new application or service. If I look at the website’s front page, C# isn’t mentioned. However, C# is supported via an open source ADO.NET provider. That this already exists suggests the ecosystem is quite healthy.
Design Goals
Let’s contemplate the very first line of the example code below, before we even fire up Visual Studio Code:
var duckDBConnection = new DuckDBConnection("Data Source=file.db");
Obviously, “database” is synonymous with persistence, even though that isn’t the primary purpose for DuckDB. In the example above, we use a file as a persistent data source.
The file format is forward and backward compatible, which matters for maintenance. But you would probably be unwise to commit to a long-term data strategy with a company that might have a shorter lifetime than your data. That just underlines the same concept: persistence is not the mainstay here. If we left the data source argument out, or used the keyword :memory:, then we would have an in-memory database.
What about concurrency? The project’s design goals clearly point away from supporting complex concurrency scenarios, although there are simple modes that allow for optimistic concurrency. The idea is to set up the data, whack it with queries, then go.
Getting Started
So let’s get started with Visual Studio Code.
We get a fresh Visual Studio Code in a new project folder. I’ve written about setting up Visual Studio Code before; for now, I’ll just say these are the relevant extensions that I have installed for working with C#:
Create a new project using “.NET new project” with the command palette, then add DuckDB.NET.Data.Full using “nuget: add package” with the palette again. These all have command-line equivalents, but I’ll stay in the IDE for this post.
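For reference, the command-line equivalents would be something along the lines of:

dotnet new console
dotnet add package DuckDB.NET.Data.Full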
You should now have an empty project, so let’s add the following into the file “Program.cs”:
using DuckDB.NET.Data;

var duckDBConnection = new DuckDBConnection("Data Source=file.db");
duckDBConnection.Open();

var command = duckDBConnection.CreateCommand();

// DDL: create the table
command.CommandText = "CREATE TABLE integers(foo INTEGER, bar INTEGER);";
var executeNonQuery = command.ExecuteNonQuery();

// Insert three rows, one of them with a NULL
command.CommandText = "INSERT INTO integers VALUES (3, 4), (5, 6), (7, NULL);";
executeNonQuery = command.ExecuteNonQuery();

// Scalar query: how many rows do we have?
command.CommandText = "Select count(*) from integers";
var count = command.ExecuteScalar();
Console.WriteLine($"Rows = {count}");

// Full query, accessed through a data reader
command.CommandText = "SELECT foo, bar FROM integers";
var reader = command.ExecuteReader();

Console.Write($"Columns: ");
for (var index = 0; index < reader.FieldCount; index++)
{
    var column = reader.GetName(index);
    Console.Write($"{column} ");
}
Console.WriteLine();

while (reader.Read())
{
    for (int index = 0; index < reader.FieldCount; index++)
    {
        if (reader.IsDBNull(index))
        {
            Console.WriteLine("NULL");
            continue;
        }
        var val = reader.GetFieldValue<int>(index);
        Console.Write(val);
        Console.Write(" ");
    }
}
Running this, we get the following in the terminal output:
Rows = 3
Columns: foo bar
3 4 5 6 7 NULL
Let’s take a look at what happened.
First, we know we should have created a persistence file called “file.db”, and this is visible in the bin directory. Indeed, if we run this same code twice, we get:
Unhandled exception. DuckDB.NET.Data.DuckDBException (0x0000000D): Catalog Error: Table with name "integers" already exists!
This is correct, but only because we opened up a persistent database. Comment out the Data Definition Language (DDL) and we can run it as many times as we like. Or change the first line to:
var duckDBConnection = new DuckDBConnection("Data Source=:memory:");
to get an in-memory database, which we can run as many times as we like.
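Alternatively, we could keep the file-backed database and simply make the DDL idempotent (a small variation of my own, not from the original example). Note that the INSERT statements would then add three more rows on every run.

command.CommandText = "CREATE TABLE IF NOT EXISTS integers(foo INTEGER, bar INTEGER);";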
We then used some DDL to create the table.
command.CommandText = "CREATE TABLE integers(foo INTEGER, bar INTEGER);"; var executeNonQuery = command.ExecuteNonQuery();
I guess, as it isn’t a query that returns rows, it is executed as a “nonquery.” Somewhat strange nomenclature, to be sure.
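The return value isn’t entirely useless, though: by the usual ADO.NET contract, ExecuteNonQuery reports the number of rows affected, which is more interesting for the inserts. A small sketch of my own, with a hypothetical extra insert reusing the same command object:

command.CommandText = "INSERT INTO integers VALUES (9, 10);";
var rowsAffected = command.ExecuteNonQuery();
// Should report one row affected
Console.WriteLine($"Rows affected: {rowsAffected}");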
When we want a result, we ask specifically for the result type after execution:
command.CommandText = "Select count(*) from integers"; var count = command.ExecuteScalar(); Console.WriteLine($"Rows = {count}");
But count clearly resolves into something we can print.
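Since ExecuteScalar hands back the first column of the first row of the result as a plain object, the same pattern works for any single-value query. A small sketch, using a query that isn’t in the original example:

command.CommandText = "SELECT sum(foo) FROM integers";
// With the rows above, foo holds 3, 5 and 7, so this should print 15
var total = command.ExecuteScalar();
Console.WriteLine($"Sum of foo = {total}");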
For detailed results, we get a “Reader” after executing the query. (Yes, these commands could be better named.)
command.CommandText = "SELECT foo, bar FROM integers"; var reader = command.ExecuteReader();
Even before we use it, we know the Reader holds a result set that we will have to access via the API. We can summarise the rest of the code by looking at how we do this.
for (var index = 0; index < reader.FieldCount; index++)
{
    var column = reader.GetName(index);
    Console.Write($"{column} ");
}
The FieldCount property gives the number of columns, and GetName returns each column’s name. (We could use GetDataTypeName to get the column’s type.)
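For instance, a small variation of the loop above that also prints each column’s declared type might look like this (a sketch of my own, not from the original):

for (var index = 0; index < reader.FieldCount; index++)
{
    // Expected to print something like "foo (INTEGER) bar (INTEGER)"
    Console.Write($"{reader.GetName(index)} ({reader.GetDataTypeName(index)}) ");
}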
Finally, we do a full row-by-row read of the data.
while (reader.Read())
{
    for (int index = 0; index < reader.FieldCount; index++)
    {
        if (reader.IsDBNull(index))
        {
            Console.WriteLine("NULL");
            continue;
        }
        var val = reader.GetFieldValue<int>(index);
        Console.Write(val);
        Console.Write(" ");
    }
}
We see that the loop just calls Read() until it returns false. For each row of records, we can then use GetFieldValue to extract the integer value. This is quicker than copying into a separate result set or similar, but it makes the code less readable, as the Reader is holding both temporary and static data simultaneously.
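If we wanted the data to outlive the Reader, one option (a sketch of my own, not part of the original example) would be to replace the loop above with one that copies each row into a plain list before the reader goes away:

var rows = new List<(int? Foo, int? Bar)>();
while (reader.Read())
{
    // Nullable ints let us keep the NULL in the bar column
    int? foo = reader.IsDBNull(0) ? (int?)null : reader.GetFieldValue<int>(0);
    int? bar = reader.IsDBNull(1) ? (int?)null : reader.GetFieldValue<int>(1);
    rows.Add((foo, bar));
}
reader.Close();
// rows now holds (3, 4), (5, 6) and (7, null), independent of the reader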
Conclusion
DuckDB directly supports Python, R, Java, Node.js, Go and Rust, so your environment is almost certainly catered for. This should be a useful tool for both testing scenarios and transforming data on the fly. But I also like the idea of using it to gain SQL query support without worrying about the weight of a full database system.
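As an example of that last point, DuckDB can query files directly, so a quick transformation need never define a table at all. A sketch, assuming a hypothetical local CSV file called sales.csv with an amount column:

using DuckDB.NET.Data;

var connection = new DuckDBConnection("Data Source=:memory:");
connection.Open();

var command = connection.CreateCommand();
// read_csv_auto infers the schema from the file itself
command.CommandText = "SELECT count(*), sum(amount) FROM read_csv_auto('sales.csv');";
var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine($"{reader.GetValue(0)} rows, total {reader.GetValue(1)}");
}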
But if you are planning on rewriting pets.com, use something else.