PuppyGraph
Graph Queries on Your Existing Data — No ETL Required
#databases, #postgresql, #mysql, #snowflake, #neo4j

You already have the data. It's sitting in PostgreSQL, MySQL, Snowflake, or a dozen other stores. The problem isn't the data -- it's the relationships buried inside it. Finding how your customers connect, how fraud propagates, or how your supply chain is linked requires graph traversal. And traditionally, that meant standing up a dedicated graph database, migrating data via complex ETL pipelines, keeping everything in sync, and paying the operational cost forever.

PuppyGraph takes a different approach: it sits on top of your existing relational databases and lets you query them as a graph -- zero ETL, zero data migration, operational in under 10 minutes.


What Is PuppyGraph?

PuppyGraph is a graph query engine, not a graph database. That distinction matters. Traditional graph databases (Neo4j, TigerGraph, etc.) require you to load your data into their proprietary storage. PuppyGraph performs the graph computation and pushes down the data fetch to your existing data sources via JDBC or a Parquet file reader, reading directly from tables you already have.

You define a schema -- a JSON document that maps your relational tables to graph vertices and edges -- and PuppyGraph reads only the data each query actually needs, as determined by that mapping.

Key capabilities:

  - Zero ETL: queries run directly against your existing tables, with no migration or sync pipeline
  - Dual query languages: Gremlin and openCypher, freely interchangeable
  - Multi-source federation: one graph schema can span PostgreSQL, MySQL, Snowflake, and more
  - Fast setup: a single Docker container, operational in under 10 minutes
Real-world users include Coinbase (crypto graph analytics), Netskope (cybersecurity), and AMD.


Core Concepts

Term          | Meaning
Catalog       | A connection to a data source (your PostgreSQL instance, Snowflake account, etc.)
Vertex (Node) | A graph node, mapped from a database table
Edge          | A graph relationship between vertices, also mapped from a table
Schema        | A JSON document that defines catalogs, vertices, and edges
Gremlin       | Apache TinkerPop's graph traversal language
openCypher    | Neo4j's declarative graph query language

PuppyGraph exposes three ports:

Port | Purpose
8081 | Web UI + REST API
8182 | Gremlin WebSocket endpoint
7687 | Bolt protocol (openCypher / Neo4j-compatible drivers)

Gremlin vs. openCypher -- Which Should You Use?

PuppyGraph supports both query languages and you can switch between them freely. Understanding what each is optimized for will save you a lot of time.

What Is Gremlin?

Gremlin is the query language of Apache TinkerPop, an open-source graph computing framework. It's a procedural, traversal-based language -- you describe how to walk the graph step by step, chaining operations like a pipeline.

// "Start at marko, walk outward along 'knows' edges,
//  then walk outward along 'created' edges, return names"
g.V().has("name", "marko")
  .out("knows")
  .out("created")
  .values("name")

Every step (out(), has(), values(), where(), etc.) narrows or transforms the traversal. It reads almost like a stream pipeline. Gremlin is:

  - Procedural: you spell out the traversal step by step
  - Natural for deep multi-hop traversals
  - Portable across TinkerPop-compatible engines

Gremlin's weakness is readability for people who think in SQL. Pattern matching (finding all nodes that fit a shape) requires more verbose syntax than Cypher.

What Is openCypher?

openCypher originated at Neo4j and is now an open standard. It's a declarative, pattern-matching language -- you describe what you want the graph to look like, and the engine figures out how to find it. The syntax is inspired by SQL and ASCII art.

-- "Find all people marko knows, and what software those people created"
MATCH (p:person {name: 'marko'})-[:knows]->(:person)-[:created]->(s:software)
RETURN DISTINCT s.name

The -[:knows]-> arrow notation visually represents relationships, making queries easy to read at a glance. Cypher is:

  - Declarative: you describe the pattern, the engine plans the traversal
  - Gentler to learn for anyone who already knows SQL
  - Concise and visual for pattern matching

Cypher's weakness is that very long traversal chains can become harder to express than the equivalent Gremlin.

Side-by-Side Comparison

The same query in both languages:

// Gremlin: all software created by people marko knows, with edge weights > 0.5
g.V().has("name", "marko")
  .outE("knows").has("weight", gt(0.5)).inV()
  .out("created")
  .values("name")

// Cypher: same query
MATCH (p:person {name: 'marko'})-[k:knows]->(friend)-[:created]->(s:software)
WHERE k.weight > 0.5
RETURN s.name

Factor                    | Gremlin                      | openCypher
Style                     | Procedural / traversal       | Declarative / pattern matching
Learning curve            | Steeper (new paradigm)       | Gentler (SQL-like)
Deep multi-hop traversals | Natural                      | Workable but verbose
Pattern matching          | Verbose                      | Concise and visual
Ecosystem                 | TinkerPop-compatible engines | Neo4j ecosystem + growing open standard
PuppyGraph CLI access     | console command              | cypher-console command
External driver           | gremlin-python               | Neo4j Python driver (Bolt)

Rule of thumb: If your team knows SQL and you're doing structural pattern matching, start with openCypher. If you're doing complex multi-hop traversals or working in a TinkerPop ecosystem, use Gremlin. With PuppyGraph you can mix both -- Gremlin for traversal-heavy work, Cypher for quick exploratory queries.


Installation

PuppyGraph runs as a Docker container. The only prerequisite is Docker with Compose.

macOS

brew install --cask docker
# Launch Docker Desktop from Applications, then verify:
docker compose version

Windows

  1. Download Docker Desktop for Windows from docker.com
  2. Run the installer -- it enables WSL 2 integration automatically
  3. Launch Docker Desktop, then verify in PowerShell:
docker compose version

All docker commands work identically in PowerShell or WSL 2 bash.

Run PuppyGraph

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_USERNAME=puppygraph \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  --name puppygraph \
  puppygraph/puppygraph:stable

Open http://localhost:8081 and log in with puppygraph / puppygraph123.

The examples below assume PuppyGraph is running and your databases are already accessible over the network.


The Schema -- How PuppyGraph Sees Your Data

Every PuppyGraph deployment is driven by a schema.json with three top-level sections:

{
  "catalogs": [ ... ],
  "graph": {
    "vertices": [ ... ],
    "edges":    [ ... ]
  }
}

Vertices need at minimum a label, a tableSource, and an id field. Edges additionally need fromVertex, toVertex, fromId, and toId to wire them to vertex tables.

The only thing that changes between database types is the catalog block. The graph section is identical regardless of whether the data lives in PostgreSQL, MySQL, or Snowflake.
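Before uploading, the required fields above can be sanity-checked with a small script. This is an illustrative sketch, not part of PuppyGraph -- it only checks for the keys this article describes, using the oneToOne mapping style shown in the examples:

```python
def validate_schema(schema: dict) -> list[str]:
    """Check a PuppyGraph-style schema dict for the minimum required keys.

    Vertices need a label, a tableSource, and an id; edges additionally
    need fromVertex, toVertex, fromId, and toId.
    """
    problems = []
    graph = schema.get("graph", {})
    for v in graph.get("vertices", []):
        label = v.get("label", "<unlabeled>")
        body = v.get("oneToOne", {})  # the mapping style used in this article
        for key in ("tableSource", "id"):
            if key not in body:
                problems.append(f"vertex {label}: missing {key}")
    for e in graph.get("edges", []):
        label = e.get("label", "<unlabeled>")
        for key in ("tableSource", "id", "fromVertex", "toVertex", "fromId", "toId"):
            if key not in e:
                problems.append(f"edge {label}: missing {key}")
    return problems
```

An empty list means the basics are present; it says nothing about whether the referenced tables or columns actually exist in the catalog.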

Upload schema via REST:

curl -XPOST -H "content-type: application/json" \
  --data-binary @./schema.json \
  --user "puppygraph:puppygraph123" \
  localhost:8081/schema
# {"Status":"OK","Message":"Schema uploaded and gremlin server restarted"}

Connecting to PostgreSQL

Assume you have tables structured like this in an existing PostgreSQL instance:

modern.person   (id, name, age)
modern.software (id, name, lang)
modern.knows    (id, from_id, to_id, weight)
modern.created  (id, from_id, to_id, weight)

schema.json:

{
  "catalogs": [
    {
      "name": "pg_data",
      "type": "postgresql",
      "jdbc": {
        "username": "myuser",
        "password": "mypassword",
        "jdbcUri": "jdbc:postgresql://my-db-host:5432/my_database",
        "driverClass": "org.postgresql.Driver"
      }
    }
  ],
  "graph": {
    "vertices": [
      {
        "label": "person",
        "oneToOne": {
          "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "person" },
          "id": { "fields": [{ "type": "String", "field": "id", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "name", "alias": "name" },
            { "type": "Int",    "field": "age",  "alias": "age"  }
          ]
        }
      },
      {
        "label": "software",
        "oneToOne": {
          "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "software" },
          "id": { "fields": [{ "type": "String", "field": "id", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "name", "alias": "name" },
            { "type": "String", "field": "lang", "alias": "lang" }
          ]
        }
      }
    ],
    "edges": [
      {
        "label": "knows",
        "fromVertex": "person",
        "toVertex": "person",
        "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "knows" },
        "id":     { "fields": [{ "type": "String", "field": "id",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "from_id", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "to_id",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "weight", "alias": "weight" }]
      },
      {
        "label": "created",
        "fromVertex": "person",
        "toVertex": "software",
        "tableSource": { "catalog": "pg_data", "schema": "modern", "table": "created" },
        "id":     { "fields": [{ "type": "String", "field": "id",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "from_id", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "to_id",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "weight", "alias": "weight" }]
      }
    ]
  }
}

Querying -- Gremlin

docker exec -it puppygraph ./bin/puppygraph console
g.V().has("name", "marko").valueMap()
g.V().has("name", "marko").out("knows").values("name")
g.V().has("name", "marko").out("knows").out("created").values("name")
g.V().hasLabel("person").has("age", gt(30)).valueMap("name", "age")
g.V().has("name", "marko").out("knows").out("created").path().by("name")

Querying -- openCypher

[PuppyGraph]> cypher-console
:> MATCH (p:person {name: 'marko'})-[:knows]->(friend)
   RETURN friend.name, friend.age

:> MATCH (p:person {name: 'marko'})-[:knows]->()-[:created]->(s:software)
   RETURN DISTINCT s.name

:> MATCH (p:person)-[:created]->(s:software)
   RETURN p.name, count(s) AS projects
   ORDER BY projects DESC

:> MATCH (p:person)-[k:knows]->(friend)
   WHERE k.weight > 0.5
   RETURN p.name, friend.name, k.weight

Connecting to MySQL

The MySQL catalog block differs only in type, driverClass, jdbcUri, and driverUrl. The graph section is identical to PostgreSQL.

{
  "name": "mysql_data",
  "type": "mysql",
  "jdbc": {
    "username": "myuser",
    "password": "mypassword",
    "jdbcUri": "jdbc:mysql://my-db-host:3306/my_database",
    "driverClass": "com.mysql.cj.jdbc.Driver",
    "driverUrl": "https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar"
  }
}

Note: The driverUrl field tells PuppyGraph where to download the MySQL JDBC driver at startup. PostgreSQL's driver is bundled by default; MySQL's must be specified.

Swap this catalog block into the PostgreSQL schema.json, updating "catalog": "pg_data" references in each tableSource to "catalog": "mysql_data". All the same Gremlin and Cypher queries work without modification.
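That find-and-replace can be done mechanically. A small helper (illustrative, not part of PuppyGraph) that renames a catalog and repoints every tableSource that referenced it:

```python
import copy

def retarget_catalog(schema: dict, old: str, new: str) -> dict:
    """Rename a catalog and update every tableSource reference to it."""
    schema = copy.deepcopy(schema)  # leave the caller's dict untouched
    for cat in schema.get("catalogs", []):
        if cat.get("name") == old:
            cat["name"] = new
    graph = schema.get("graph", {})
    # Vertex mappings live under oneToOne; edge mappings are top-level.
    mappings = [v.get("oneToOne", {}) for v in graph.get("vertices", [])]
    mappings += graph.get("edges", [])
    for m in mappings:
        ts = m.get("tableSource", {})
        if ts.get("catalog") == old:
            ts["catalog"] = new
    return schema
```

For example, retarget_catalog(pg_schema, "pg_data", "mysql_data") leaves the graph section unchanged apart from the catalog references.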


Connecting to Snowflake

Snowflake uses key-pair authentication rather than a plain password, which requires a small amount of upfront setup.

Step 1 -- Generate a Key Pair

# Generate encrypted private key
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8

# Derive the public key
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

Step 2 -- Register the Public Key in Snowflake

Open a SQL worksheet in Snowsight and run:

ALTER USER <your_snowflake_user>
  SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';
  -- paste the body of rsa_key.pub, excluding the -----BEGIN/END----- lines

Step 3 -- Mount the Key into the PuppyGraph Container

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_USERNAME=puppygraph \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -v /path/to/your/keys:/home \
  --name puppygraph \
  puppygraph/puppygraph:stable

The private key will be available inside the container at /home/rsa_key.p8.

Step 4 -- Configure via Web UI

The Snowflake catalog is best configured through the Schema Builder UI at http://localhost:8081. Click Create graph schema → Add more catalogs and fill in:

Field                     | Value
Catalog type              | Snowflake
Catalog name              | Any label you choose
Auth type                 | Key-pair authentication
Server URL                | <account_identifier>.snowflakecomputing.com
User name                 | Your Snowflake username
Database name             | Your Snowflake database
Warehouse name            | Your Snowflake warehouse
Private key file path     | /home/rsa_key.p8
Private key file password | The passphrase set during key generation

Click Save → Submit.

Step 5 -- Or Use schema.json Directly

{
  "catalogs": [
    {
      "name": "snowflake_data",
      "type": "snowflake",
      "snowflake": {
        "username": "my_sf_user",
        "serverUrl": "myorg-myaccount.snowflakecomputing.com",
        "database": "MY_DATABASE",
        "warehouse": "MY_WAREHOUSE",
        "privateKeyPath": "/home/rsa_key.p8",
        "privateKeyPassword": "my_passphrase"
      }
    }
  ],
  "graph": {
    "vertices": [
      {
        "label": "person",
        "oneToOne": {
          "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "PERSON" },
          "id": { "fields": [{ "type": "String", "field": "ID", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "NAME", "alias": "name" },
            { "type": "Int",    "field": "AGE",  "alias": "age"  }
          ]
        }
      },
      {
        "label": "software",
        "oneToOne": {
          "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "SOFTWARE" },
          "id": { "fields": [{ "type": "String", "field": "ID", "alias": "id" }] },
          "attributes": [
            { "type": "String", "field": "NAME", "alias": "name" },
            { "type": "String", "field": "LANG", "alias": "lang" }
          ]
        }
      }
    ],
    "edges": [
      {
        "label": "knows",
        "fromVertex": "person",
        "toVertex": "person",
        "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "KNOWS" },
        "id":     { "fields": [{ "type": "String", "field": "ID",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "FROM_ID", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "TO_ID",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "WEIGHT", "alias": "weight" }]
      },
      {
        "label": "created",
        "fromVertex": "person",
        "toVertex": "software",
        "tableSource": { "catalog": "snowflake_data", "schema": "MODERN", "table": "CREATED" },
        "id":     { "fields": [{ "type": "String", "field": "ID",      "alias": "id"      }] },
        "fromId": { "fields": [{ "type": "String", "field": "FROM_ID", "alias": "from_id" }] },
        "toId":   { "fields": [{ "type": "String", "field": "TO_ID",   "alias": "to_id"   }] },
        "attributes": [{ "type": "Double", "field": "WEIGHT", "alias": "weight" }]
      }
    ]
  }
}

Snowflake identifiers are uppercase by default. Field names in your schema JSON must match the case Snowflake uses -- usually all caps unless you created columns with quoted lowercase names.
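If you are adapting a schema originally written against lowercase PostgreSQL columns, the physical identifiers can be uppercased mechanically while the aliases stay as they are. A hedged sketch, assuming unquoted (default-case) Snowflake identifiers:

```python
import copy

# Physical identifiers that must match Snowflake's case; 'alias' is left alone.
PHYSICAL_KEYS = ("field", "schema", "table")

def uppercase_identifiers(schema: dict) -> dict:
    """Uppercase field/schema/table references throughout the graph section."""
    schema = copy.deepcopy(schema)

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in PHYSICAL_KEYS and isinstance(value, str):
                    node[key] = value.upper()
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(schema.get("graph", {}))
    return schema
```

Catalog names and aliases are untouched, so Gremlin and Cypher queries written against the lowercase aliases keep working.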

After uploading the schema, all the same Gremlin and Cypher queries from the PostgreSQL example work identically.


Multi-Source Schemas

One of PuppyGraph's most powerful features is cross-database graph federation. A single schema can reference multiple catalogs, letting you traverse relationships that span different databases in one query.

{
  "catalogs": [
    { "name": "pg_data", "type": "postgresql", "jdbc": { "jdbcUri": "jdbc:postgresql://pg-host:5432/crm", "...": "..." } },
    { "name": "sf_data", "type": "snowflake",  "snowflake": { "serverUrl": "myorg-myaccount.snowflakecomputing.com", "...": "..." } }
  ],
  "graph": {
    "vertices": [
      { "label": "customer", "oneToOne": { "tableSource": { "catalog": "pg_data", "schema": "public", "table": "customers" }, "...": "..." } },
      { "label": "order",    "oneToOne": { "tableSource": { "catalog": "sf_data", "schema": "SALES",  "table": "ORDERS"    }, "...": "..." } }
    ],
    "edges": [
      { "label": "placed", "fromVertex": "customer", "toVertex": "order",
        "tableSource": { "catalog": "pg_data", "schema": "public", "table": "customer_orders" }, "...": "..." }
    ]
  }
}

A single Gremlin or Cypher query can now traverse from a customer vertex in PostgreSQL to an order vertex in Snowflake in one hop -- no join logic needed in the application layer.
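For instance, a Cypher query against this federated schema might look like the following (the property names are illustrative -- they depend on the attributes you actually mapped):

```cypher
// Customers in PostgreSQL, orders in Snowflake, one traversal
MATCH (c:customer)-[:placed]->(o:order)
RETURN c.name, count(o) AS orders
ORDER BY orders DESC
```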


Graph Schema Patterns

One-to-One (most common): One table row = one vertex or edge. All examples above use "oneToOne": { ... }.

Many-to-One: Multiple rows from different tables coalesce into a single vertex type. Useful for denormalized data where the same conceptual entity is spread across several tables.

Edge Tables: Always backed by a table with at minimum three columns -- an ID, a from_id, and a to_id. These reference the ID columns of the vertex tables. Any additional columns become traversable edge properties.


The Web UI

The web UI at http://localhost:8081 provides:

  - The Schema Builder (Create graph schema) for visually defining catalogs, vertices, and edges
  - Schema upload and management, equivalent to the /schema REST endpoint
  - An interactive query workspace for Gremlin and openCypher
For Snowflake specifically, the Schema Builder's Auto Suggestion feature can inspect your table structure and propose the edge wiring automatically -- a significant time saver on large schemas.


Querying via External Drivers

Python -- Gremlin:

from gremlin_python.driver import client, serializer

c = client.Client(
    "ws://localhost:8182/gremlin",
    "g",
    username="puppygraph",
    password="puppygraph123",
    message_serializer=serializer.GraphSONSerializersV2d0()
)
result = c.submit(
    "g.V().has('name', 'marko').out('knows').values('name')"
).all().result()
print(result)  # ['vadas', 'josh']

Python -- openCypher (neo4j driver):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("puppygraph", "puppygraph123"))
with driver.session() as session:
    result = session.run(
        "MATCH (p:person {name: $name})-[:knows]->(friend) RETURN friend.name",
        name="marko"
    )
    for record in result:
        print(record["friend.name"])

PuppyGraph vs. Dedicated Graph Databases

Factor                  | PuppyGraph                       | Neo4j / TigerGraph
Data location           | Your existing store              | Migrated to graph DB
ETL required            | No                               | Yes
Setup time              | ~10 minutes                      | Hours to days
Query languages         | Gremlin + Cypher                 | Cypher / GSQL
Write operations        | Read-only                        | Full read/write
Multi-source federation | Yes                              | No (single store)
Best for                | Graph analytics on existing data | Transactional graph workloads requiring graph-native writes

The primary trade-off is write semantics. PuppyGraph is a query engine -- you read through it, you write through your existing databases using their normal clients. If your use case is analytics on data that already lives in relational stores, PuppyGraph removes an entire category of operational overhead.


Common Use Cases

Fraud detection -- model accounts, transactions, devices, and IP addresses as a graph. Multi-hop traversal reveals rings and shared infrastructure that flat SQL can't surface efficiently.

Network & IT security -- graph traversal across asset, vulnerability, and connection tables identifies blast radius and lateral movement paths.

Supply chain analysis -- model suppliers, factories, distributors, and customers as connected nodes. Path queries expose single points of failure and find alternate routes.

Knowledge graphs for LLMs -- wire PuppyGraph's Gremlin endpoint into a RAG pipeline to give LLMs structured relationship traversal over your enterprise data.

Organizational hierarchies -- distributor downline trees, org charts, and referral networks traversed directly from your existing CRM or ERP tables.


Resources