DuckDB Crash Course

DuckDB Crash Course

From Big Data to AI Insights: Why DuckDB is Your Essential Tool

5 min read

#databases, #duckdb, #analytics, #friday3

DuckDB is an in-process analytical database management system, similar to SQLite but optimized for analytical queries (OLAP) rather than transactional workloads (OLTP). It's designed to be embedded within applications and excels at processing large datasets with complex queries quickly, using columnar storage to achieve superior performance for analytical workloads. Let's take a closer look.

Installation of DuckDB is pretty simple and straight forward. If you are on Windows the easiest way is to just download the CLI executable and add it to your PATH. If you want to use DuckDB in Python or R you can just install it with the normal package installer.

# Python
pip install duckdb

# R
install.packages("duckdb")

# CLI - Windows
# Download ZIP from official website
wget https://github.com/duckdb/duckdb/releases/download/v0.9.2/duckdb_cli-windows-amd64.zip
# Extract the ZIP file
Expand-Archive -Path duckdb_cli-windows-amd64.zip -DestinationPath C:\duckdb
# Add to PATH (optional)
$env:PATH += ";C:\duckdb"

# Alternative: Install via Chocolatey
choco install duckdb

# Alternative: Install via Scoop
scoop install duckdb

# CLI - macOS
brew install duckdb

# CLI - Linux
wget https://github.com/duckdb/duckdb/releases/download/v0.9.2/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip

DuckDB vs SQLite

DuckDB differs from SQLite in several key ways:

Storage model: DuckDB uses columnar storage (optimized for analytical queries) vs SQLite's row-based storage
Vectorized execution: DuckDB processes data in batches for better CPU utilization
Query optimization: DuckDB has advanced optimization for complex analytical queries
Parallelism: DuckDB supports multi-threading, SQLite is single-threaded
External data: DuckDB can query external files directly (CSV, Parquet, etc.)

DuckDB is written in C++ and has first-party support for languages like Python, R, Java, and C. Other Languages like JavaScript, Ruby, and Go are supported through maintained bindings or extensions. Here are some examples of using DuckDB.

Python Example

import duckdb

# Connect to an in-memory database
con = duckdb.connect()

# Create and query a table
con.execute("CREATE TABLE items(id INTEGER, name VARCHAR)")
con.execute("INSERT INTO items VALUES (1, 'DuckDB'), (2, 'SQLite')")
result = con.execute("SELECT * FROM items").fetchall()
print(result)

# Query a CSV file directly
result = con.execute("SELECT * FROM read_csv('data.csv') LIMIT 5").fetchdf()
print(result)

R Example

library(duckdb)
con <- dbConnect(duckdb())
dbExecute(con, "CREATE TABLE test (id INTEGER, name VARCHAR)")
dbExecute(con, "INSERT INTO test VALUES (1, 'DuckDB'), (2, 'R')")
dbGetQuery(con, "SELECT * FROM test")

Java Example

import org.duckdb.*;

public class DuckDBExample {
    public static void main(String[] args) throws Exception {
        // Create a connection
        Connection conn = DriverManager.getConnection("jdbc:duckdb:");
        Statement stmt = conn.createStatement();

        // Execute queries
        stmt.execute("CREATE TABLE example(id INTEGER, name VARCHAR)");
        stmt.execute("INSERT INTO example VALUES (1, 'DuckDB'), (2, 'Java')");
        ResultSet rs = stmt.executeQuery("SELECT * FROM example");

        // Print results
        while (rs.next()) {
            System.out.println(rs.getInt("id") + ": " + rs.getString("name"));
        }
    }
}

Rust Example

use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    conn.execute(
        "CREATE TABLE test (id INTEGER, name TEXT)",
        []
    )?;

    conn.execute(
        "INSERT INTO test VALUES (1, 'DuckDB'), (2, 'Rust')",
        []
    )?;

    let mut stmt = conn.prepare("SELECT * FROM test")?;
    let rows = stmt.query_map([], |row| {
        Ok((row.get::<_, i32>(0)?, row.get::<_, String>(1)?))
    })?;

    for row in rows {
        let (id, name) = row?;
        println!("{}: {}", id, name);
    }

    Ok(())
}

Extensions in DuckDB

DuckDB also has a robust extension system that adds some cool functionality. Some of the built-in extensions are:

parquet: Read/write Parquet files (loaded by default)
json: JSON processing functions
sqlite: Connect to SQLite databases
postgres_scanner: Connect to PostgreSQL databases
mysql_scanner: Connect to MySQL databases
httpfs: Access remote files over HTTP
excel: Load Excel files

Other extensions are also used often and are super useful.
arrow: Apache Arrow integration
spatial: Geospatial data types and functions
iceberg: Read Apache Iceberg tables
aws: Amazon S3 integration

These extensions must be installed and loaded before use. To Install and Load an extension just add this before your query.

INSTALL extension_name; LOAD extension_name;

Apache Arrow Integration

Apache Arrow is a columnar memory format for cross-language data exchange. DuckDB integrates well with Arrow:

import pyarrow as pa
import duckdb

# Create Arrow table
data = {
    'id': [1, 2, 3],
    'name': ['one', 'two', 'three']
}
arrow_table = pa.Table.from_pydict(data)

# Query Arrow table with DuckDB
con = duckdb.connect()
result = con.execute("SELECT * FROM arrow_table WHERE id > 1").fetchall()
print(result)

# Convert DuckDB result to Arrow
arrow_result = con.execute("SELECT * FROM arrow_table").arrow()

CLI Examples for Data Sources

-- Read CSV with automatic schema detection
SELECT * FROM read_csv('data.csv');

-- Read CSV with schema specification
SELECT * FROM read_csv('data.csv', header=true, columns={'id': 'INTEGER', 'name': 'VARCHAR'});

-- Read Parquet file
SELECT * FROM read_parquet('data.parquet');

-- Read with filters pushed down
SELECT * FROM read_parquet('data.parquet', hive_partitioning=1) WHERE date_column = '2023-01-01';

-- Read JSON file
SELECT * FROM read_json('data.json');

-- Read JSON with specific options
SELECT * FROM read_json('data.json', format='auto', columns={'id': 'INTEGER', 'name': 'VARCHAR'});

-- Read Excel file
INSTALL excel;
LOAD excel;
SELECT * FROM read_excel('data.xlsx', sheet_name='Sheet1');

-- Read from HTTP source
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_csv('https://example.com/data.csv');

Complex Joins Example

In my opinion one of the coolest features of DuckDB is it's ability to join across multiple data sources and types. Here's how to join multiple data sources in DuckDB:

-- Install required extensions
INSTALL sqlite;
LOAD sqlite;
INSTALL postgres_scanner;
LOAD postgres_scanner;
INSTALL mysql_scanner;
LOAD mysql_scanner;

-- Join multiple sources
SELECT
    c.distid, c.name, c.email,
    j.age, j.color,
    m.code, m.name as country_name,
    count(o.ordernumber) as order_count,
    sum(o.order_amount) as order_total,
    count(oi.itemid) as item_count
-- CSV file
FROM read_csv('d:/temp3/names.csv') c
-- JSON file
JOIN read_json('d:/temp3/ages.json') j on j.distid = c.distid 
-- SQLite DB
JOIN sqlite_attach('orders.db').orders o on o.dist_id = c.distid
-- Postgres DB
JOIN postgres_attach('host=localhost user=postgres password=password dbname=master').order_items oi on oi.distid = c.distid
-- MySQL DB
JOIN mysql_attach('host=localhost user=root password=password dbname=dis_live').states m on m.code = o.state and m.country = 'US'
GROUP BY 
    c.distid, c.name, c.email,
    j.age, j.color,
    m.code, m.name;

DuckDB is quickly becoming essential in modern data workflows due to its ability to process analytical queries on large datasets with impressive speed and minimal setup. As an embedded SQL OLAP database, it excels at columnar analytics while integrating seamlessly with Python and R. For data professionals facing growing data volumes, DuckDB offers a lightweight yet powerful alternative to complex data warehouse setups, enabling fast in-memory processing of gigabyte-scale datasets with familiar SQL syntax. Learning DuckDB today means gaining a valuable tool that bridges the gap between local development and production-scale data engineering.