How to Configure Cassandra for JanusGraph Storage

Configuring Cassandra as the Apache JanusGraph Storage Backend & Index Synchronization layer requires strict alignment of replication factors, connection pool sizing, and consistency models. Misconfigured pools or mismatched consistency levels trigger StorageException cascades and index drift during high-throughput ingestion. This guide provides production-ready CQL provisioning, exact janusgraph-cql.properties mappings, and a Python pipeline template with explicit fallback procedures for index parity.

1. Cassandra Keyspace & Replication Baseline

JanusGraph requires a pre-provisioned keyspace with explicit datacenter routing. Relying on auto-creation bypasses production validation checks and introduces unpredictable compaction behavior. Execute the following CQL to establish the storage foundation:

cql
CREATE KEYSPACE janusgraph_graph 
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 3,
  'dc2': 1
} AND DURABLE_WRITES = true;

USE janusgraph_graph;

-- Verify partitioner and compaction strategy defaults
DESCRIBE KEYSPACE janusgraph_graph;

DURABLE_WRITES = true prevents commit log truncation during unclean Cassandra shutdowns, which directly causes index drift. Align the replication factor with your physical topology. For multi-region deployments, consult the Cassandra Backend Setup reference to map local datacenter hints and eliminate cross-DC latency during bulk vertex commits. Reference the official Apache Cassandra CQL Reference for syntax validation on older cluster versions.

Fallback: If DESCRIBE KEYSPACE returns SimpleStrategy or mismatched RF values, drop and recreate the keyspace. Do not use ALTER KEYSPACE on active JanusGraph instances, as it triggers full cluster re-tokenization and temporary UnavailableException spikes.

2. JanusGraph Storage Properties

The janusgraph-cql.properties file must explicitly declare consistency levels, connection limits, and timeout thresholds. Default DataStax driver settings exhaust connection pools under concurrent Gremlin traversal loads.

properties
# Core Storage Mapping
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_graph
storage.cql.local-datacenter=dc1

# Consistency & Durability
storage.cql.read-consistency-level=QUORUM
storage.cql.write-consistency-level=QUORUM
storage.cql.replication-strategy=NetworkTopologyStrategy
storage.cql.replication-factor={"dc1": 3, "dc2": 1}

# Connection Pool & Timeouts
storage.cql.max-connections-per-host=32
storage.cql.core-connections-per-host=8
storage.cql.connection-timeout=5000
storage.cql.read-timeout=12000
storage.cql.write-timeout=12000
storage.cql.request-timeout=15000

# Index & Schema Sync
storage.cql.atomic-batch-mutate=true
ids.block-size=100000

Operational constraints:

  • max-connections-per-host must equal or exceed your Gremlin Server thread pool size (gremlinserver.gremlinPool). Mismatched values cause ConnectionPoolExhaustedException.
  • atomic-batch-mutate=true forces vertex/edge mutations and secondary index updates into a single Cassandra UNLOGGED BATCH. This prevents orphaned index entries during partial failures.
  • ids.block-size controls ID pre-allocation. Increase to 500000 for ingestion rates >10k vertices/sec, but monitor JanusGraph heap usage to prevent OOM kills.

Review the JanusGraph Storage Backend Architecture & Configuration documentation for driver-level connection pooling overrides when deploying behind service meshes.

Fallback: If connection timeouts persist, reduce max-connections-per-host to 16 and increase storage.cql.request-timeout to 30000. Verify Cassandra rpc_address and native_transport_port are reachable via nc -zv <host> 9042.

3. Python Pipeline Integration & Index Synchronization

Python ingestion pipelines must manage transaction boundaries explicitly to prevent partial commits and index desynchronization. The following gremlinpython implementation handles batch ingestion and enforces transactional isolation.

python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class JanusGraphPipeline:
    def __init__(self, ws_url="ws://gremlin-server:8182/gremlin"):
        self.connection = DriverRemoteConnection(ws_url, "g")
        self.g = traversal().withRemote(self.connection)
        self.batch_size = 5000
        self.committed = 0

    def ingest_batch(self, vertices):
        # Explicit transaction boundary ensures atomicity. begin() spawns the
        # transaction-bound traversal source; mutations must run on it.
        tx = self.g.tx()
        gtx = tx.begin()
        try:
            for v_data in vertices:
                gtx.addV("entity").property("id", v_data["id"]).next()
            tx.commit()
            self.committed += len(vertices)
            logger.info(f"Committed {len(vertices)} vertices. Total: {self.committed}")
        except Exception as e:
            logger.error(f"Batch failed: {e}")
            tx.rollback()  # explicit rollback; log for retry queue
            raise

    def close(self):
        self.connection.close()

Index Synchronization Protocol: JanusGraph relies on Cassandra’s write-ahead log and background compaction to materialize mixed indexes. If query latency degrades or index queries return stale results:

  1. Verify index status via Gremlin Console: mgmt = graph.openManagement(); mgmt.printIndexes().
  2. If status is INSTALLED or REGISTERED, run mgmt.updateIndex(mgmt.getGraphIndex("myIndex"), SchemaAction.REINDEX).get().
  3. Monitor Cassandra nodetool compactionstats until pending tasks drop to zero.

Fallback: If REINDEX stalls due to tombstone accumulation, execute nodetool repair -pr janusgraph_graph followed by a manual ALTER TABLE janusgraph_graph.edgestore WITH compaction = {'class': 'LeveledCompactionStrategy'}; to force SSTable reorganization. Consult the Gremlin Python Driver Reference for connection lifecycle management in long-running workers.

4. Diagnostics & Operational Fallbacks

Production deployments require immediate triage paths for storage exceptions and connection exhaustion.

Symptom: org.janusgraph.core.JanusGraphException: Could not execute query due to StorageException Diagnosis: Run cqlsh -e "SELECT * FROM system.local;" on each Cassandra node. Check system_traces for WriteTimeoutException or UnavailableException. Resolution:

  1. Verify network partition: nodetool status must show UN for all nodes.
  2. Check Cassandra commitlog disk space: df -h /var/lib/cassandra/commitlog. Clear if >85% utilized.
  3. Temporarily lower consistency: storage.cql.write-consistency-level=LOCAL_QUORUM until cluster stabilizes.

Symptom: ConnectionPoolExhaustedException or NoHostAvailableException Diagnosis: Inspect JanusGraph logs for PoolState metrics. Cross-reference with Cassandra nodetool tpstats for MutationStage queue depth. Resolution:

  1. Increase storage.cql.max-connections-per-host incrementally by +8.
  2. Enable driver-level retry policies: storage.cql.retry-attempts=3.
  3. If queues remain backed up, throttle ingestion rate at the Python pipeline level using exponential backoff.

Explicit Fallback for Index Parity Loss: When Cassandra and JanusGraph indexes diverge beyond repair:

  1. Disable mixed index queries: graph.getConfiguration().setProperty("index.search.backend", "none").
  2. Export graph to GraphSON: graph.io(IoCore.graphson()).writer().create().writeGraph("backup.graphson").
  3. Truncate JanusGraph tables: TRUNCATE janusgraph_graph.edgestore; TRUNCATE janusgraph_graph.graphindex;.
  4. Re-import and rebuild indexes via OLAP Spark jobs or sequential REINDEX commands.