How to Configure Cassandra for JanusGraph Storage

This guide walks an on-call engineer through provisioning Apache Cassandra as a JanusGraph storage backend end to end: the keyspace, janusgraph-cql.properties, a transactional gremlin-python pipeline, and index verification. The goal is graph writes that commit deterministically, instead of the StorageException cascades and silent index drift that mismatched consistency levels and exhausted connection pools produce under sustained ingestion. It is the ground-level procedure under Cassandra Backend Setup. If you need the architectural reasoning behind the write path and sync boundary, read that reference first, because the values below assume you already understand why LOCAL_QUORUM and bounded pools matter.

The write path this guide configures forks after the storage commit. The storage write is synchronous and durable the moment commit() returns; mixed-index updates are dispatched asynchronously afterwards. That fork (storage first, index later) is exactly where drift originates.

Prerequisites

Confirm every item before you create the keyspace. Skipping the topology and reachability checks is the most common cause of a “working” config that fails on the first concurrent traversal load.

JanusGraph 0.6.x or 1.0.x with the CQL storage adapter (storage.backend=cql). The legacy Thrift adapters were removed in 0.6.0, so CQL is the only Cassandra adapter available on either line.
Apache Cassandra 3.11.x or 4.x with native_transport_port 9042 reachable from every JanusGraph node. Verify with nc -zv <host> 9042 before starting.
A defined physical topology. Know your datacenter names and rack layout from nodetool status — the replication factor and local-datacenter hint below must match reality, not a guess.
gremlinpython on the operator host, matching your server’s TinkerPop line (3.5.x for JG 0.6, 3.7.x for JG 1.0). A mismatched driver can silently break transaction semantics.
Keyspace-creation permission on the Cassandra cluster, or a DBA to run the CQL in Step 1.
A sized driver pool plan. Align pool limits with your Gremlin Server thread pool per the connection pooling model before bulk ingestion, so thread starvation is not later misdiagnosed as a storage fault.
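
The reachability prerequisite can also be scripted so it runs against every storage host in one pass. The sketch below is a minimal standard-library equivalent of nc -zv; the function names and the two-second timeout are illustrative assumptions, not part of JanusGraph or Cassandra tooling.

```python
import socket


def port_reachable(host, port=9042, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def unreachable_hosts(hosts, port=9042, timeout=2.0):
    """Return the hosts whose native transport port did not answer."""
    return [h for h in hosts if not port_reachable(h, port, timeout)]
```

Calling unreachable_hosts(["10.0.1.10", "10.0.1.11", "10.0.1.12"]) from each JanusGraph node should return an empty list before you continue. A successful TCP connect only proves the port is open; it does not prove the CQL protocol handshake works.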

Step 1 — Provision the keyspace and replication baseline

Production JanusGraph deployments should use a pre-provisioned keyspace with explicit datacenter routing. Relying on auto-creation bypasses production validation: unless the replication options in the properties file are exactly right, JanusGraph creates the keyspace with SimpleStrategy, which ignores datacenter boundaries. Create the keyspace with NetworkTopologyStrategy so replica placement follows your replication strategies rather than the naive SimpleStrategy default.

cql

CREATE KEYSPACE janusgraph_graph
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 3,
  'dc2': 1
} AND DURABLE_WRITES = true;

USE janusgraph_graph;

DURABLE_WRITES = true routes every write through the commit log, so acknowledged mutations survive an unclean Cassandra shutdown. With it disabled, data that has not yet been flushed from memtables is lost on a crash, which is a direct cause of index drift. Set each datacenter’s replication factor to match its physical replica count from nodetool status. For the syntax on older cluster versions, cross-check the official Apache Cassandra CQL DDL reference.
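
If you provision several environments, rendering the statement from the topology you read off nodetool status avoids hand-editing replication factors. The following is a minimal sketch; build_keyspace_cql is a hypothetical helper, and the IF NOT EXISTS clause is an addition the statement above does not use.

```python
def build_keyspace_cql(keyspace, dc_replication, durable_writes=True):
    """Render a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    dc_replication maps datacenter name -> replication factor,
    for example {"dc1": 3, "dc2": 1}.
    """
    if not dc_replication:
        raise ValueError("at least one datacenter is required")
    for dc, rf in dc_replication.items():
        if not isinstance(rf, int) or rf < 1:
            raise ValueError(
                f"replication factor for {dc!r} must be a positive integer"
            )
    options = ["  'class': 'NetworkTopologyStrategy'"]
    options += [f"  '{dc}': {rf}" for dc, rf in dc_replication.items()]
    body = ",\n".join(options)
    durable = "true" if durable_writes else "false"
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace}\n"
        f"WITH REPLICATION = {{\n{body}\n}} AND DURABLE_WRITES = {durable};"
    )
```

The returned string can be piped to cqlsh -e or handed to a DBA for review. The helper validates only the shape of the input; it cannot know whether the datacenter names match the live cluster.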

Step 2 — Write janusgraph-cql.properties

The properties file must explicitly declare consistency levels, connection limits, and timeouts. Default DataStax driver settings exhaust connection pools under concurrent Gremlin traversal loads and default to a consistency model that does not match the keyspace you just created.

properties

# Core Storage Mapping
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_graph
storage.cql.local-datacenter=dc1

# Consistency & Durability
storage.cql.read-consistency-level=LOCAL_QUORUM
storage.cql.write-consistency-level=LOCAL_QUORUM
storage.cql.replication-strategy-class=NetworkTopologyStrategy
storage.cql.replication-strategy-options=dc1,3,dc2,1

# Connection Pool & Timeouts
storage.cql.max-connections-per-host=32
storage.cql.core-connections-per-host=8
storage.cql.connection-timeout=5000
storage.cql.request-timeout=15000

# Index & Schema Sync
storage.cql.atomic-batch-mutate=true
ids.block-size=100000

Operational constraints for the non-default values:

local-datacenter must equal the datacenter you weighted in Step 1. An empty or wrong value forces the driver into a datacenter-unaware policy and adds cross-DC latency to every bulk commit.
write-consistency-level=LOCAL_QUORUM keeps writes within the local datacenter while still requiring a replica majority, avoiding the split-brain visibility that LOCAL_ONE allows. Reserve LOCAL_ONE only for read-heavy traversal paths where stale reads are acceptable — the tradeoff is covered under eventual vs strong consistency.
max-connections-per-host must equal or exceed your Gremlin Server thread pool size (gremlinPool). A smaller pool produces NoHostAvailableException under concurrent traversal load.
atomic-batch-mutate=true folds vertex/edge mutations and composite (storage-backed) index updates into a single Cassandra LOGGED BATCH, preventing orphaned index entries on partial failure. It does not cover mixed indexes to Elasticsearch/OpenSearch, which are always asynchronous — those are governed by mixed-index routing.
ids.block-size controls ID pre-allocation. Raise it to 500000 for ingestion above 10k vertices/sec, but watch heap to avoid OOM kills.
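
The constraints above can be checked mechanically before a deploy. This is a sketch under stated assumptions: it reads only the keys shown in this guide, gremlin_pool is a value you supply from your Gremlin Server configuration, and both function names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_properties(props, gremlin_pool):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []

    # Replication options are a flat list: dc name, factor, dc name, factor, ...
    opts = props.get("storage.cql.replication-strategy-options", "").split(",")
    datacenters = {opts[i].strip() for i in range(0, len(opts) - 1, 2)}

    local_dc = props.get("storage.cql.local-datacenter", "")
    if not local_dc:
        problems.append("storage.cql.local-datacenter is empty")
    elif datacenters and local_dc not in datacenters:
        problems.append(
            f"local-datacenter {local_dc!r} is not in replication options"
        )

    max_conn = int(props.get("storage.cql.max-connections-per-host", "0"))
    if max_conn < gremlin_pool:
        problems.append(
            f"max-connections-per-host={max_conn} is below gremlinPool={gremlin_pool}"
        )

    for key in ("storage.cql.read-consistency-level",
                "storage.cql.write-consistency-level"):
        if props.get(key) != "LOCAL_QUORUM":
            problems.append(f"{key} is not LOCAL_QUORUM")
    return problems
```

Run it in CI against the file you are about to ship, for example check_properties(parse_properties(open("janusgraph-cql.properties").read()), gremlin_pool=16), and fail the build on a non-empty result. If you deliberately use LOCAL_ONE on a read path, relax the last check accordingly.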

Step 3 — Build the transactional gremlin-python pipeline

Python ingestion pipelines must manage transaction boundaries explicitly to prevent partial commits and index desynchronization. The pattern below opens a transaction-bound traversal source, commits per batch, and rolls back on any failure so a failed batch can be requeued instead of leaving half-written state.

python

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class JanusGraphPipeline:
    def __init__(self, ws_url="ws://gremlin-server:8182/gremlin"):
        self.connection = DriverRemoteConnection(ws_url, "g")
        self.g = traversal().with_remote(self.connection)
        self.batch_size = 500
        self.committed = 0

    def ingest_batch(self, vertices):
        # begin() spawns the transaction-bound source; mutations run on gtx.
        tx = self.g.tx()
        gtx = tx.begin()
        try:
            for v_data in vertices:
                gtx.addV("entity").property("id", v_data["id"]).next()
            tx.commit()
            self.committed += len(vertices)
            logger.info("Committed %d vertices. Total: %d",
                        len(vertices), self.committed)
        except Exception as exc:
            logger.error("Batch failed, rolling back: %s", exc)
            tx.rollback()  # explicit rollback; caller requeues the batch
            raise

    def close(self):
        self.connection.close()
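
The class above leaves batching and requeueing to the caller. One way to drive it is sketched below, assuming any object with an ingest_batch(list) method that raises on failure; chunked, ingest_all, and the max_attempts limit are hypothetical names and values, not part of gremlin-python.

```python
from collections import deque


def chunked(records, size):
    """Yield consecutive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest_all(pipeline, records, batch_size=500, max_attempts=3):
    """Feed records to pipeline.ingest_batch, requeueing failed batches.

    Returns the batches that still failed after max_attempts tries.
    """
    queue = deque((batch, 1) for batch in chunked(records, batch_size))
    dead = []
    while queue:
        batch, attempt = queue.popleft()
        try:
            pipeline.ingest_batch(batch)
        except Exception:
            # ingest_batch has already rolled back, so the batch is safe to retry.
            if attempt < max_attempts:
                queue.append((batch, attempt + 1))
            else:
                dead.append(batch)
    return dead
```

A failed batch goes to the back of the queue rather than being retried immediately, which gives a briefly overloaded cluster time to recover. Anything returned in the dead list needs manual inspection, because repeated failure usually points at bad data or a persistent storage fault rather than a transient one.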

Step 4 — Wire and verify mixed-index synchronization

JanusGraph commits graph mutations to the CQL storage backend first, then enqueues mixed-index updates for asynchronous dispatch to Elasticsearch/OpenSearch. That storage-first ordering is exactly where drift originates: storage can be correct while the index lags. After the first ingestion run, force the indexes to a REGISTERED/ENABLED state and reindex any that are stuck.

groovy

// Gremlin Console, against the same properties file
mgmt = graph.openManagement()
mgmt.printIndexes()      // inspect status of every index
// REINDEX cannot run on an index that is still INSTALLED; move it forward
// with SchemaAction.REGISTER_INDEX first and wait for REGISTERED.
// For an index that is REGISTERED but not yet ENABLED:
mgmt.updateIndex(mgmt.getGraphIndex("byName"),
                 SchemaAction.REINDEX).get()
mgmt.commit()

Verification commands

Confirm each step landed before moving on. Do not treat a successful commit() as proof the graph is queryable — verify storage, then the index, independently.

bash

# Step 1: keyspace uses NetworkTopologyStrategy with the intended RF
cqlsh -e "DESCRIBE KEYSPACE janusgraph_graph;" | grep -i replication

# Every node must report UN (Up/Normal) in the target datacenters
nodetool status janusgraph_graph

python

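# Step 3: assert the write is visible and the pool survived the batch.
# Assumes `pipeline` is the JanusGraphPipeline instance used for ingestion and
# that the graph held no "entity" vertices before the run; otherwise compare
# against the count taken before ingestion.
count = pipeline.g.V().hasLabel("entity").count().next()
assert count == pipeline.committed, (
    f"drift: storage has {count}, pipeline committed {pipeline.committed}"
)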

In the Gremlin Console, mgmt.printIndexes() must show every mixed index as ENABLED; any index left at INSTALLED or REGISTERED will silently fall back to full storage scans, so a label count can look correct while the index is empty. Watch nodetool compactionstats until pending tasks reach zero before declaring the run healthy.
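
The "every node must report UN" check is easy to miss by eye on a large cluster. The sketch below scans captured nodetool status text for nodes in any other state. It assumes the usual layout in which each node row starts with a two-letter status/state code followed by the address; the function name is hypothetical.

```python
import re

# First column of a node row: U or D (up/down) followed by N, L, J, or M
# (normal/leaving/joining/moving), then whitespace and the node address.
_NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def nodes_not_up_normal(status_output):
    """Return (state, address) pairs for every node whose state is not UN."""
    bad = []
    for line in status_output.splitlines():
        match = _NODE_LINE.match(line.strip())
        if match and match.group(1) != "UN":
            bad.append((match.group(1), match.group(2)))
    return bad
```

Feed it the output of nodetool status janusgraph_graph, for example through subprocess.run(..., capture_output=True, text=True).stdout, and treat a non-empty result as a failed verification. If your Cassandra version prints a different row format, adjust the pattern rather than trusting an empty list.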

Explicit fallback procedures

Each fallback maps to the step most likely to produce it. Run the diagnosis command first; do not raise consistency or pool limits blindly.

Step 1 — DESCRIBE KEYSPACE shows SimpleStrategy or the wrong RF. Drop and recreate the keyspace before any data exists. Do not run ALTER KEYSPACE casually on an active JanusGraph instance: the new replica placement takes effect immediately but no data is moved, so reads can miss rows and UnavailableException can spike until the new replicas are populated. If data already exists, add the missing datacenter with ALTER KEYSPACE during a maintenance window, then run nodetool rebuild -- <source_dc> on the new replicas.
Step 2 — persistent NoHostAvailableException or connection timeouts. Reduce max-connections-per-host to 16, raise storage.cql.request-timeout to 30000, and confirm reachability with nc -zv <host> 9042. Cross-reference nodetool tpstats for MutationStage queue depth before increasing pool size again.
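Step 3 — StorageException during commit. Verify no network partition with nodetool status (all nodes UN), then check commit-log disk with df -h /var/lib/cassandra/commitlog. Above 85% utilization, free space by flushing memtables with nodetool flush so Cassandra can recycle old segments; do not delete commit-log files by hand on a running node, because unflushed writes exist only there. If the cluster is still unstable, temporarily drop write-consistency-level to LOCAL_ONE and restore LOCAL_QUORUM as soon as the cluster stabilizes.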
Step 4 — REINDEX stalls on tombstone accumulation. Run nodetool repair -pr janusgraph_graph, then switch the affected table to leveled compaction — ALTER TABLE janusgraph_graph.edgestore WITH compaction = {'class': 'LeveledCompactionStrategy'}; — to force SSTable reorganization, and re-run the reindex.
Index parity lost beyond repair. Export from a traversal source (g = graph.traversal()) with g.io("backup.graphson").write().iterate(), truncate the JanusGraph tables (TRUNCATE janusgraph_graph.edgestore; and TRUNCATE janusgraph_graph.graphindex;), then re-import with g.io("backup.graphson").read().iterate() and rebuild indexes with sequential REINDEX calls. The same detect-rebuild-validate loop for the index side is detailed in Resolving OpenSearch Index Drift in Production.
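
The 85% commit-log threshold from the Step 3 fallback can be watched from a script instead of read off df by hand. This is a minimal sketch using the standard library; the function names are hypothetical, and the ratio it computes can differ slightly from the Use% column of df, which excludes blocks reserved for root.

```python
import shutil


def disk_utilization(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def commitlog_over_threshold(path="/var/lib/cassandra/commitlog",
                             threshold=0.85):
    """Return True when the commit-log filesystem is at or above threshold."""
    return disk_utilization(path) >= threshold
```

Run it on each Cassandra node, not on the JanusGraph host, and alert on True so the flush happens before commits start failing.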

FAQ

Why LOCAL_QUORUM instead of QUORUM for writes? QUORUM requires a replica majority across all datacenters, so every write pays cross-DC round-trip latency and stalls when a remote DC is degraded. LOCAL_QUORUM keeps the majority requirement inside the coordinator’s datacenter, which is the correct default for a graph that ingests in one region and replicates asynchronously to others.
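
The arithmetic behind that answer is small enough to check directly. A quorum over N replicas is floor(N / 2) + 1. The sketch below applies it to the Step 1 topology; the function names are illustrative.

```python
def quorum(replicas):
    """Number of replica acknowledgements needed for a majority."""
    return replicas // 2 + 1


def write_requirements(dc_replication, local_dc):
    """Acks required per write at LOCAL_QUORUM versus QUORUM."""
    total = sum(dc_replication.values())
    return {
        "LOCAL_QUORUM": quorum(dc_replication[local_dc]),
        "QUORUM": quorum(total),
    }


# Step 1 topology: three replicas in dc1, one in dc2.
print(write_requirements({"dc1": 3, "dc2": 1}, "dc1"))
# prints {'LOCAL_QUORUM': 2, 'QUORUM': 3}
```

With this topology LOCAL_QUORUM needs two of the three dc1 replicas and tolerates one local node being down. QUORUM needs three of the four replicas in total, so losing a single dc1 node forces every write to wait on the dc2 replica.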

Does atomic-batch-mutate=true guarantee my Elasticsearch/OpenSearch documents are consistent? No. It only makes storage mutations and composite (storage-backed) index updates atomic within Cassandra. Mixed indexes are dispatched asynchronously after the storage commit, so a separate parity check is always required — see mixed-index routing and the drift-resolution runbook.

Can I let JanusGraph auto-create the keyspace instead of running Step 1? You can, but you lose control over replication strategy and durable-writes, and auto-creation defaults to SimpleStrategy, which breaks multi-datacenter placement. Provision the keyspace explicitly for any production cluster.

My g.V().count() looks correct but index queries return stale results. That is the classic degraded-mixed-index signature: JanusGraph falls back to a full storage scan for the count while the index itself is empty or lagging. Verify with mgmt.printIndexes() and reindex, rather than trusting the count.

Up a level: Cassandra Backend Setup — the parent reference for consistency, index-sync, and connection lifecycle on the CQL backend.
JanusGraph Connection Pool Tuning Guide — size max-connections-per-host against the Gremlin Server thread pool so Step 2 does not starve.
Configuring Multi-Datacenter Replication for Graph Data — the replica-placement companion to the keyspace you build in Step 1.
Optimizing ScyllaDB Read/Write Consistency for Graphs — the same consistency tuning when migrating this backend to ScyllaDB.
Resolving OpenSearch Index Drift in Production — the repair loop for the async index side that Step 4 verifies.