PuppyGraph
Graph Queries on Your Existing Data — No ETL Required
#databases, #postgresql, #mysql, #snowflake, #neo4j

You already have the data. It's sitting in PostgreSQL, MySQL, Snowflake, or a dozen other stores. The problem isn't the data -- it's the relationships buried inside it. Finding how your customers connect, how fraud propagates, or how your supply chain is linked requires graph traversal. And traditionally, that meant standing up a dedicated graph database, migrating data via complex ETL pipelines, keeping everything in sync, and paying the operational cost forever.

PuppyGraph takes a different approach: it sits on top of your existing relational databases and lets you query them as a graph -- zero ETL, zero data migration, operational in under 10 minutes.


What Is PuppyGraph?

PuppyGraph is a graph query engine, not a graph database. That distinction matters. Traditional graph databases (Neo4j, TigerGraph, etc.) require you to load your data into their proprietary storage. PuppyGraph performs the graph computation and pushes down the data fetch to your existing data sources via JDBC or a Parquet file reader, reading directly from tables you already have.

You define a schema -- a JSON document that maps your relational tables to graph vertices and edges -- and PuppyGraph reads only the data each query actually needs, as determined by that mapping.

Key capabilities:

  - Zero ETL: queries run directly against your existing tables, with no migration or sync pipeline
  - Dual query languages: Gremlin and openCypher, freely interchangeable
  - Multi-source federation: one graph schema can span PostgreSQL, MySQL, Snowflake, and more
  - Fast setup: a single Docker container, operational in under 10 minutes
Real-world users include Coinbase (crypto graph analytics), Netskope (cybersecurity), and AMD.


Core Concepts

Term          | Meaning
Catalog       | A connection to a data source (your PostgreSQL instance, Snowflake account, etc.)
Vertex (Node) | A graph node, mapped from a database table
Edge          | A graph relationship between vertices, also mapped from a table
Schema        | A JSON document that defines catalogs, vertices, and edges
Gremlin       | Apache TinkerPop's graph traversal language
openCypher    | Neo4j's declarative graph query language

PuppyGraph exposes three ports:

Port | Purpose
8081 | Web UI + REST API
8182 | Gremlin WebSocket endpoint
7687 | Bolt protocol (openCypher / Neo4j-compatible drivers)

Gremlin vs. openCypher -- Which Should You Use?

PuppyGraph supports both query languages and you can switch between them freely. Understanding what each is optimized for will save you a lot of time.

What Is Gremlin?

Gremlin is the query language of Apache TinkerPop, an open-source graph computing framework. It's a procedural, traversal-based language -- you describe how to walk the graph step by step, chaining operations like a pipeline.

// "Start at marko, walk outward along 'knows' edges,
//  then walk outward along 'created' edges, return names"
g.V().has("name", "marko")
  .out("knows")
  .out("created")
  .values("name")

Every step (out(), has(), values(), where(), etc.) narrows or transforms the traversal. It reads almost like a stream pipeline. Gremlin is:

  - Procedural: you spell out the traversal step by step
  - Natural for deep multi-hop traversals
  - Portable across TinkerPop-compatible engines

Gremlin's weakness is readability for people who think in SQL. Pattern matching (finding all nodes that fit a shape) requires more verbose syntax than Cypher.

What Is openCypher?

openCypher originated at Neo4j and is now an open standard. It's a declarative, pattern-matching language -- you describe what you want the graph to look like, and the engine figures out how to find it. The syntax is inspired by SQL and ASCII art.

-- "Find all people marko knows, and what software those people created"
MATCH (p:person {name: 'marko'})-[:knows]->(:person)-[:created]->(s:software)
RETURN DISTINCT s.name

The -[:knows]-> arrow notation visually represents relationships, making queries easy to read at a glance. Cypher is:

  - Declarative: you describe the pattern, the engine plans the traversal
  - Gentler to learn for anyone who already knows SQL
  - Concise and visual for pattern matching

Cypher's weakness is that very long traversal chains can become harder to express than the equivalent Gremlin.

Side-by-Side Comparison

The same query in both languages:

// Gremlin: all software created by people marko knows, with edge weights > 0.5
g.V().has("name", "marko")
  .outE("knows").has("weight", gt(0.5)).inV()
  .out("created")
  .values("name")

// Cypher: same query
MATCH (p:person {name: 'marko'})-[k:knows]->(friend)-[:created]->(s:software)
WHERE k.weight > 0.5
RETURN s.name

Factor                    | Gremlin                      | openCypher
Style                     | Procedural / traversal       | Declarative / pattern matching
Learning curve            | Steeper (new paradigm)       | Gentler (SQL-like)
Deep multi-hop traversals | Natural                      | Workable but verbose
Pattern matching          | Verbose                      | Concise and visual
Ecosystem                 | TinkerPop-compatible engines | Neo4j ecosystem + growing open standard
PuppyGraph CLI access     | console command              | cypher-console command
External driver           | gremlin-python               | Neo4j Python driver (Bolt)

Rule of thumb: If your team knows SQL and you're doing structural pattern matching, start with openCypher. If you're doing complex multi-hop traversals or working in a TinkerPop ecosystem, use Gremlin. With PuppyGraph you can mix both -- Gremlin for traversal-heavy work, Cypher for quick exploratory queries.


Installation

PuppyGraph runs as a Docker container. The only prerequisite is Docker with Compose.

macOS

brew install --cask docker
# Launch Docker Desktop from Applications, then verify:
docker compose version

Windows

  1. Download Docker Desktop for Windows from docker.com
  2. Run the installer -- it enables WSL 2 integration automatically
  3. Launch Docker Desktop, then verify in PowerShell:
docker compose version

All docker commands work identically in PowerShell or WSL 2 bash.

Run PuppyGraph

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_USERNAME=puppygraph \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  --name puppygraph \
  puppygraph/puppygraph:stable

Open http://localhost:8081 and log in with puppygraph / puppygraph123.

The examples below assume PuppyGraph is running and your databases are already accessible over the network.


The Schema -- How PuppyGraph Sees Your Data

Every PuppyGraph deployment is driven by a schema.json with three top-level sections:

{
  "catalogs": [ ... ],
  "graph": {
    "vertices": [ ... ],
    "edges":    [ ... ]
  }
}

Vertices need at minimum a label, a tableSource, and an id field. Edges additionally need fromVertex, toVertex, fromId, and toId to wire them to vertex tables.

The only thing that changes between database types is the catalog block. The graph section is identical regardless of whether the data lives in PostgreSQL, MySQL, or Snowflake.
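Before uploading, the required fields above can be sanity-checked with a small script. This is an illustrative sketch, not part of PuppyGraph -- it only checks for the keys this article describes, using the oneToOne mapping style shown in the examples:

```python
def validate_schema(schema: dict) -> list[str]:
    """Check a PuppyGraph-style schema dict for the minimum required keys.

    Vertices need a label, a tableSource, and an id; edges additionally
    need fromVertex, toVertex, fromId, and toId.
    """
    problems = []
    graph = schema.get("graph", {})
    for v in graph.get("vertices", []):
        label = v.get("label", "<unlabeled>")
        body = v.get("oneToOne", {})  # the mapping style used in this article
        for key in ("tableSource", "id"):
            if key not in body:
                problems.append(f"vertex {label}: missing {key}")
    for e in graph.get("edges", []):
        label = e.get("label", "<unlabeled>")
        for key in ("tableSource", "id", "fromVertex", "toVertex", "fromId", "toId"):
            if key not in e:
                problems.append(f"edge {label}: missing {key}")
    return problems
```

An empty list means the basics are present; it says nothing about whether the referenced tables or columns actually exist in the catalog.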

Upload schema via REST:

curl -XPOST -H "content-type: application/json" \
  --data-binary @./schema.json \
  --user "puppygraph:puppygraph123" \
  localhost:8081/schema
# {"Status":"OK","Message":"Schema uploaded and gremlin server restarted"}

Connecting to PostgreSQL

Assume you have tables structured like this in an existing PostgreSQL instance:

modern.person   (id, name, age)
modern.software (id, name, lang)
modern.knows    (id, from_id, to_id, weight)
modern.created  (id, from_id, to_id, weight)

schema.json:

{
  "catalogs": [
    {
      "name": "pg_data",
      "type": "postgresql",
      "jdbc": {
        "username": "myuser",
        "password": "mypassword",
        "jdbcUri": "jdbc:postgresql://my-db-host:5432/my_database",
        "driverClass": "org.postgresql.Driver"
      }
    }
  ],
  "graph": {
    "vertices": [
      {
        "label": "person",
        "oneToOne": {
          "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "person" },
          "id": { "fields": [{ "type": "String", "field": "id", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "name", "alias": "name" },
            { "type": "Int",    "field": "age",  "alias": "age"  }
          ]
        }
      },
      {
        "label": "software",
        "oneToOne": {
          "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "software" },
          "id": { "fields": [{ "type": "String", "field": "id", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "name", "alias": "name" },
            { "type": "String", "field": "lang", "alias": "lang" }
          ]
        }
      }
    ],
    "edges": [
      {
        "label": "knows",
        "fromVertex": "person",
        "toVertex": "person",
        "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "knows" },
        "id":     { "fields": [{ "type": "String", "field": "id",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "from_id", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "to_id",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "weight", "alias": "weight" }]
      },
      {
        "label": "created",
        "fromVertex": "person",
        "toVertex": "software",
        "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "created" },
        "id":     { "fields": [{ "type": "String", "field": "id",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "from_id", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "to_id",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "weight", "alias": "weight" }]
      }
    ]
  }
}

Querying -- Gremlin

docker exec -it puppygraph ./bin/puppygraph console
g.V().has("name", "marko").valueMap()
g.V().has("name", "marko").out("knows").values("name")
g.V().has("name", "marko").out("knows").out("created").values("name")
g.V().hasLabel("person").has("age", gt(30)).valueMap("name", "age")
g.V().has("name", "marko").out("knows").out("created").path().by("name")

Querying -- openCypher

[PuppyGraph]> cypher-console
:> MATCH (p:person {name: 'marko'})-[:knows]->(friend)
   RETURN friend.name, friend.age

:> MATCH (p:person {name: 'marko'})-[:knows]->()-[:created]->(s:software)
   RETURN DISTINCT s.name

:> MATCH (p:person)-[:created]->(s:software)
   RETURN p.name, count(s) AS projects
   ORDER BY projects DESC

:> MATCH (p:person)-[k:knows]->(friend)
   WHERE k.weight > 0.5
   RETURN p.name, friend.name, k.weight

Connecting to MySQL

The MySQL catalog block differs only in type, driverClass, jdbcUri, and driverUrl. The graph section is identical to PostgreSQL.

{
  "name": "mysql_data",
  "type": "mysql",
  "jdbc": {
    "username": "myuser",
    "password": "mypassword",
    "jdbcUri": "jdbc:mysql://my-db-host:3306/my_database",
    "driverClass": "com.mysql.cj.jdbc.Driver",
    "driverUrl": "https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar"
  }
}

Note: The driverUrl field tells PuppyGraph where to download the MySQL JDBC driver at startup. PostgreSQL's driver is bundled by default; MySQL's must be specified.

Swap this catalog block into the PostgreSQL schema.json, updating "catalog": "pg_data" references in each tableSource to "catalog": "mysql_data". All the same Gremlin and Cypher queries work without modification.
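That find-and-replace can be done mechanically. A small helper (illustrative, not part of PuppyGraph) that renames a catalog and repoints every tableSource that referenced it:

```python
import copy

def retarget_catalog(schema: dict, old: str, new: str) -> dict:
    """Rename a catalog and update every tableSource reference to it."""
    schema = copy.deepcopy(schema)  # leave the caller's dict untouched
    for cat in schema.get("catalogs", []):
        if cat.get("name") == old:
            cat["name"] = new
    graph = schema.get("graph", {})
    # Vertex mappings live under oneToOne; edge mappings are top-level.
    mappings = [v.get("oneToOne", {}) for v in graph.get("vertices", [])]
    mappings += graph.get("edges", [])
    for m in mappings:
        ts = m.get("tableSource", {})
        if ts.get("catalog") == old:
            ts["catalog"] = new
    return schema
```

For example, retarget_catalog(pg_schema, "pg_data", "mysql_data") leaves the graph section unchanged apart from the catalog references.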


Connecting to Snowflake

Snowflake uses key-pair authentication rather than a plain password, which requires a small amount of upfront setup.

Step 1 -- Generate a Key Pair

# Generate encrypted private key
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8

# Derive the public key
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

Step 2 -- Register the Public Key in Snowflake

Open a SQL worksheet in Snowsight and run:

ALTER USER <your_snowflake_user>
  SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';
  -- paste the body of rsa_key.pub, excluding the -----BEGIN/END----- lines

Step 3 -- Mount the Key into the PuppyGraph Container

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_USERNAME=puppygraph \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -v /path/to/your/keys:/home \
  --name puppygraph \
  puppygraph/puppygraph:stable

The private key will be available inside the container at /home/rsa_key.p8.

Step 4 -- Configure via Web UI

The Snowflake catalog is best configured through the Schema Builder UI at http://localhost:8081. Click Create graph schema → Add more catalogs and fill in:

Field                     | Value
Catalog type              | Snowflake
Catalog name              | Any label you choose
Auth type                 | Key-pair authentication
Server URL                | <account_identifier>.snowflakecomputing.com
User name                 | Your Snowflake username
Database name             | Your Snowflake database
Warehouse name            | Your Snowflake warehouse
Private key file path     | /home/rsa_key.p8
Private key file password | The passphrase set during key generation

Click Save → Submit.

Step 5 -- Or Use schema.json Directly

{
  "catalogs": [
    {
      "name": "snowflake_data",
      "type": "snowflake",
      "snowflake": {
        "username": "my_sf_user",
        "serverUrl": "myorg-myaccount.snowflakecomputing.com",
        "database": "MY_DATABASE",
        "warehouse": "MY_WAREHOUSE",
        "privateKeyPath": "/home/rsa_key.p8",
        "privateKeyPassword": "my_passphrase"
      }
    }
  ],
  "graph": {
    "vertices": [
      {
        "label": "person",
        "oneToOne": {
          "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "PERSON" },
          "id": { "fields": [{ "type": "String", "field": "ID", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "NAME", "alias": "name" },
            { "type": "Int",    "field": "AGE",  "alias": "age"  }
          ]
        }
      },
      {
        "label": "software",
        "oneToOne": {
          "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "SOFTWARE" },
          "id": { "fields": [{ "type": "String", "field": "ID", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "NAME", "alias": "name" },
            { "type": "String", "field": "LANG", "alias": "lang" }
          ]
        }
      }
    ],
    "edges": [
      {
        "label": "knows",
        "fromVertex": "person",
        "toVertex": "person",
        "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "KNOWS" },
        "id":     { "fields": [{ "type": "String", "field": "ID",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "FROM_ID", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "TO_ID",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "WEIGHT", "alias": "weight" }]
      },
      {
        "label": "created",
        "fromVertex": "person",
        "toVertex": "software",
        "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "CREATED" },
        "id":     { "fields": [{ "type": "String", "field": "ID",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "FROM_ID", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "TO_ID",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "WEIGHT", "alias": "weight" }]
      }
    ]
  }
}

Snowflake identifiers are uppercase by default. Field names in your schema JSON must match the case Snowflake uses -- usually all caps unless you created columns with quoted lowercase names.
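If you are adapting a schema originally written against lowercase PostgreSQL columns, the physical identifiers can be uppercased mechanically while the aliases stay as they are. A hedged sketch, assuming unquoted (default-case) Snowflake identifiers:

```python
import copy

# Physical identifiers that must match Snowflake's case; 'alias' is left alone.
PHYSICAL_KEYS = ("field", "schema", "table")

def uppercase_identifiers(schema: dict) -> dict:
    """Uppercase field/schema/table references throughout the graph section."""
    schema = copy.deepcopy(schema)

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in PHYSICAL_KEYS and isinstance(value, str):
                    node[key] = value.upper()
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(schema.get("graph", {}))
    return schema
```

Catalog names and aliases are untouched, so Gremlin and Cypher queries written against the lowercase aliases keep working.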

After uploading the schema, all the same Gremlin and Cypher queries from the PostgreSQL example work identically.


Multi-Source Schemas

One of PuppyGraph's most powerful features is cross-database graph federation. A single schema can reference multiple catalogs, letting you traverse relationships that span different databases in one query.

{
  "catalogs": [
    { "name": "pg_data", "type": "postgresql", "jdbc": { "jdbcUri": "jdbc:postgresql://pg-host:5432/crm", "...": "..." } },
    { "name": "sf_data", "type": "snowflake",  "snowflake": { "serverUrl": "myorg-myaccount.snowflakecomputing.com", "...": "..." } }
  ],
  "graph": {
    "vertices": [
      { "label": "customer", "oneToOne": { "tableSource": { "catalog": "pg_data", "schema": "public", "table": "customers" }, "...": "..." } },
      { "label": "order",    "oneToOne": { "tableSource": { "catalog": "sf_data", "schema": "SALES",  "table": "ORDERS"    }, "...": "..." } }
    ],
    "edges": [
      { "label": "placed", "fromVertex": "customer", "toVertex": "order",
        "tableSource": { "catalog": "pg_data", "schema": "public", "table": "customer_orders" }, "...": "..." }
    ]
  }
}

A single Gremlin or Cypher query can now traverse from a customer vertex in PostgreSQL to an order vertex in Snowflake in one hop -- no join logic needed in the application layer.
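For instance, a Cypher query against this federated schema might look like the following (the property names are illustrative -- they depend on the attributes you actually mapped):

```cypher
// Customers in PostgreSQL, orders in Snowflake, one traversal
MATCH (c:customer)-[:placed]->(o:order)
RETURN c.name, count(o) AS orders
ORDER BY orders DESC
```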


Graph Schema Patterns

One-to-One (most common): One table row = one vertex or edge. All examples above use "oneToOne": { ... }.

Many-to-One: Multiple rows from different tables coalesce into a single vertex type. Useful for denormalized data where the same conceptual entity is spread across several tables.

Edge Tables: Always backed by a table with at minimum three columns -- an ID, a from_id, and a to_id. These reference the ID columns of the vertex tables. Any additional columns become traversable edge properties.


The Web UI

The web UI at http://localhost:8081 provides:

  - The Schema Builder (Create graph schema) for visually defining catalogs, vertices, and edges
  - Schema upload and management, equivalent to the /schema REST endpoint
  - An interactive query workspace for Gremlin and openCypher
For Snowflake specifically, the Schema Builder's Auto Suggestion feature can inspect your table structure and propose the edge wiring automatically -- a significant time saver on large schemas.


Querying via External Drivers

Python -- Gremlin:

from gremlin_python.driver import client, serializer

c = client.Client(
    "ws://localhost:8182/gremlin",
    "g",
    username="puppygraph",
    password="puppygraph123",
    message_serializer=serializer.GraphSONSerializersV2d0()
)
result = c.submit(
    "g.V().has('name', 'marko').out('knows').values('name')"
).all().result()
print(result)  # ['vadas', 'josh']

Python -- openCypher (neo4j driver):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("puppygraph", "puppygraph123"))
with driver.session() as session:
    result = session.run(
        "MATCH (p:person {name: $name})-[:knows]->(friend) RETURN friend.name",
        name="marko"
    )
    for record in result:
        print(record["friend.name"])

PuppyGraph vs. Dedicated Graph Databases

Factor                  | PuppyGraph                       | Neo4j / TigerGraph
Data location           | Your existing store              | Migrated to graph DB
ETL required            | No                               | Yes
Setup time              | ~10 minutes                      | Hours to days
Query languages         | Gremlin + Cypher                 | Cypher / GSQL
Write operations        | Read-only                        | Full read/write
Multi-source federation | Yes                              | No (single store)
Best for                | Graph analytics on existing data | Transactional graph workloads requiring graph-native writes

The primary trade-off is write semantics. PuppyGraph is a query engine -- you read through it, you write through your existing databases using their normal clients. If your use case is analytics on data that already lives in relational stores, PuppyGraph removes an entire category of operational overhead.


Common Use Cases

Fraud detection -- model accounts, transactions, devices, and IP addresses as a graph. Multi-hop traversal reveals rings and shared infrastructure that flat SQL can't surface efficiently.

Network & IT security -- graph traversal across asset, vulnerability, and connection tables identifies blast radius and lateral movement paths.

Supply chain analysis -- model suppliers, factories, distributors, and customers as connected nodes. Path queries expose single points of failure and find alternate routes.

Knowledge graphs for LLMs -- wire PuppyGraph's Gremlin endpoint into a RAG pipeline to give LLMs structured relationship traversal over your enterprise data.

Organizational hierarchies -- distributor downline trees, org charts, and referral networks traversed directly from your existing CRM or ERP tables.


Resources