Why do traversals miss vertices that JanusGraph just committed to Cassandra?

The storage commit is synchronous, but the mixed-index update is asynchronous. A committed vertex is durable in Cassandra before it becomes searchable through a has() predicate. The gap is governed mainly by the Elasticsearch refresh_interval and the IndexProvider queue depth. Poll the index with a bounded deadline, route the read through a storage-backed id lookup, or refresh the index more often (a lower refresh_interval) to shrink the window.

What write and read consistency levels should I use for JanusGraph on Cassandra?

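Use LOCAL_QUORUM for writes so every mutation is acknowledged by a majority of local replicas without cross-datacenter latency, and LOCAL_ONE for latency-sensitive read traversals where a briefly stale vertex is acceptable. Reads that must observe the latest write need LOCAL_QUORUM as well, because only then are the read and write replica sets guaranteed to overlap. Avoid global QUORUM unless cross-DC writes are strictly required, because it widens write-latency variance across every mutation.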

Does raising the Cassandra consistency level fix index staleness?

No. Cassandra consistency governs durability and read repair inside the storage cluster only. Search-index visibility is a separate downstream concern. Raising the storage level increases write latency without shrinking the index replication window.

How do I stop NoHostAvailableException during ingestion bursts?

It is usually pool exhaustion, not node death. Confirm nodes are UN via nodetool status, then raise storage.cql.max-connections-per-host toward node capacity, keep a warm core-connections baseline, and shorten connection-timeout so backpressure surfaces to the producer instead of queuing indefinitely.

Why is my Cassandra coordinator CPU spiking during batch writes?

Oversized multi-partition batches force the coordinator to fan out to many replicas and accelerate tombstone accumulation. Lower storage.cql.batch-statement-size and keep unlogged batches within a single partition so the coordinator's MutationStage pending count stays bounded.

Cassandra Backend Setup

A production-grade Cassandra backend for Apache JanusGraph lives or dies on the alignment between storage consistency guarantees, index synchronization latency, and connection lifecycle management. Get any one of the three wrong and the symptom is the same on-call page: writes succeed, but traversals return stale or missing data. This guide sits under the JanusGraph Storage Backend Architecture & Configuration reference and narrows it to a single storage engine — Apache Cassandra behind the CQL driver. The primary failure modes here are mismatched consistency levels, unbounded connection churn, and unverified mixed-index propagation. Everything below prioritizes deterministic write paths, bounded connection pools, and observable sync workflows over default configuration that will not survive sustained ingestion.

The diagram below shows the LOCAL_QUORUM write path: the coordinator acknowledges once a majority of local replicas confirm the mutation.

Core Configuration & Consistency Tuning

JanusGraph delegates storage semantics to the underlying DataStax CQL driver. Default configurations rarely survive production ingestion loads. You must explicitly define replication topology, consistency boundaries, and compaction strategies before cluster initialization. The baseline janusgraph-cql.properties should enforce LOCAL_QUORUM for writes to prevent split-brain graph mutations, while reserving LOCAL_ONE for read-heavy traversal workloads where eventual visibility is acceptable.

properties

# Core Storage Binding
storage.backend=cql
storage.hostname=cassandra-node-01,cassandra-node-02,cassandra-node-03
storage.port=9042
storage.cql.keyspace=janusgraph_prod
storage.cql.local-datacenter=us-east-1

# Replication & Consistency
storage.cql.replication-strategy-class=NetworkTopologyStrategy
storage.cql.replication-strategy-options=us-east-1,3
storage.cql.write-consistency-level=LOCAL_QUORUM
storage.cql.read-consistency-level=LOCAL_ONE
storage.cql.only-use-local-consistency-for-system-operations=true

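# Compression & Performance
# compression is a boolean switch; the compressor class goes in compression-type
storage.cql.compression=true
storage.cql.compression-type=LZ4Compressor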
storage.cql.batch-statement-size=50
storage.cql.max-requests-per-connection=1024

For granular control over driver behavior and keyspace provisioning — including the CQL DDL that must match your replica counts before JanusGraph initializes its schema — follow the detailed walkthrough in How to Configure Cassandra for JanusGraph Storage.
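As a minimal sketch of keeping the keyspace DDL and the JanusGraph properties in agreement, the helpers below render both from one replica-count mapping. The function names are hypothetical and not part of JanusGraph or the CQL driver; the keyspace and datacenter names are the ones used in the properties file above.

```python
def keyspace_ddl(keyspace: str, dc_replicas: dict[str, int]) -> str:
    """Build CREATE KEYSPACE DDL that uses NetworkTopologyStrategy.

    dc_replicas maps datacenter name -> replication factor.
    """
    if not dc_replicas:
        raise ValueError("at least one datacenter is required")
    options = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(dc_replicas.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}} "
        "AND durable_writes = true;"
    )


def replication_strategy_options(dc_replicas: dict[str, int]) -> str:
    """Render the same counts in the dc,rf[,dc,rf...] form used by
    storage.cql.replication-strategy-options."""
    return ",".join(f"{dc},{rf}" for dc, rf in sorted(dc_replicas.items()))


replicas = {"us-east-1": 3}
print(keyspace_ddl("janusgraph_prod", replicas))
print(replication_strategy_options(replicas))
```

Generating both strings from the same mapping is one way to stop the DDL and the properties file from drifting apart; run the DDL through cqlsh or your provisioning tool before JanusGraph first opens the graph.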

Treat these as hard operational constraints, not suggestions:

Never let JanusGraph auto-create the keyspace in production. Left at its defaults, auto-creation uses SimpleStrategy, which ignores rack and datacenter topology, so replica placement no longer lines up with the LOCAL_ consistency levels you configure. Provision the keyspace with NetworkTopologyStrategy first, then point JanusGraph at it.
batch-statement-size must align with partition-key distribution. Oversized batches spanning many partitions trigger coordinator overload and accelerate tombstone accumulation. Keep unlogged batches to a single partition where possible; 50 is a ceiling, not a target.
Pin local-datacenter explicitly. The DataStax 4.x driver requires an explicit local DC for its default load-balancing policy. Omit it and the driver refuses to route, throwing at startup rather than degrading gracefully.
Set only-use-local-consistency-for-system-operations=true. This keeps JanusGraph’s internal ID-block allocation and schema locks on LOCAL_QUORUM instead of escalating to global QUORUM, which would stall every schema mutation on cross-DC round trips.
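The single-partition batching constraint can be illustrated on the producer side. This is a sketch under assumptions: JanusGraph assembles its own CQL batches internally, so the helper below (a hypothetical name) only shows the grouping idea for a client that chunks work before handing it to an ingestion call, with the partition key supplied by the caller.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def single_partition_batches(
    mutations: Iterable[T],
    partition_key: Callable[[T], Hashable],
    max_size: int = 50,
) -> Iterator[list[T]]:
    """Yield batches that never span partitions and never exceed max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    groups: dict[Hashable, list[T]] = defaultdict(list)
    for mutation in mutations:
        groups[partition_key(mutation)].append(mutation)
    for rows in groups.values():
        # Slice each partition's rows into chunks of at most max_size.
        for start in range(0, len(rows), max_size):
            yield rows[start:start + max_size]


rows = [("a", 1), ("b", 2), ("a", 3)]
for batch in single_partition_batches(rows, lambda r: r[0], max_size=2):
    print(batch)
```

Treating max_size as a ceiling means a partition with three rows and a limit of two produces two batches rather than one oversized one.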

The consistency level is the single most consequential line in the file. The three you will actually choose between:

LOCAL_QUORUM: requires acknowledgment from a majority of replicas within the local datacenter. Graph mutations are durable on a local majority without paying cross-DC latency, and a read issued at LOCAL_QUORUM is guaranteed to overlap that majority. This is the correct write default.
LOCAL_ONE — routes to the nearest replica. Acceptable for traversals where a briefly stale vertex does not corrupt business logic. This is the correct read default for latency-sensitive traversals.
QUORUM (global) — avoid unless cross-DC writes are strictly required. Global coordination introduces unpredictable commit delays and widens write-latency variance across every mutation, not just the ones that need it.

Whichever levels you choose, remember that Cassandra consistency governs durability and read repair inside the storage cluster only. It says nothing about search-index visibility — that boundary is a separate, downstream concern covered under eventual vs strong consistency. Raising the storage level does not shrink index lag; it only adds write latency.

For quorum math, a write at LOCAL_QUORUM with replication factor $RF$ in the local datacenter blocks until $\lfloor RF/2 \rfloor + 1$ replicas acknowledge. With $RF=3$ that is 2 replicas, tolerating one node down; the write latency is bounded by the second-fastest replica, not the slowest. Multi-datacenter topologies change this arithmetic and are worked through in Configuring Multi-Datacenter Replication for Graph Data; align your replication strategies before you tune consistency, because the two are a single decision.
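The arithmetic above can be checked with a few lines of Python. This is a small sketch with hypothetical function names; the overlap test is the standard rule that a read sees the latest acknowledged write when write acknowledgments plus read acknowledgments exceed the replication factor.

```python
def quorum(rf: int) -> int:
    """Replicas that must acknowledge at (LOCAL_)QUORUM: floor(RF/2) + 1."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return rf // 2 + 1


def tolerated_failures(rf: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return rf - quorum(rf)


def read_overlaps_write(write_acks: int, read_acks: int, rf: int) -> bool:
    """True when every read replica set intersects every write replica set."""
    return write_acks + read_acks > rf


rf = 3
print(quorum(rf), tolerated_failures(rf))
# LOCAL_QUORUM write (2 acks) with a LOCAL_ONE read (1 ack): no guaranteed overlap.
print(read_overlaps_write(quorum(rf), 1, rf))
# LOCAL_QUORUM on both sides: overlap is guaranteed.
print(read_overlaps_write(quorum(rf), quorum(rf), rf))
```

With RF=3 the LOCAL_QUORUM write plus LOCAL_ONE read pairing does not satisfy the overlap rule, which is why those reads can be briefly stale.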

Index Synchronization & Lag Metrics

JanusGraph separates storage mutations from mixed-index updates. When you commit a vertex or edge, the CQL transaction completes synchronously — the storage write is the point of no return. Only after that commit does JanusGraph enqueue the corresponding Elasticsearch/OpenSearch index mutation for asynchronous application by a background worker. This decoupling opens a measurable sync window: the vertex is durable in Cassandra but not yet searchable through a has() predicate. Production pipelines must account for this lag with explicit sync verification and idempotent retry logic rather than assuming a committed write is immediately queryable.

The length of that window is governed by two knobs, neither of which JanusGraph guarantees for you, plus one wiring decision that keeps the index out of the graph's failure domain:

Elasticsearch refresh_interval: the dominant term. A document becomes searchable only at the next refresh, so the interval adds up to one full period to the window (the Elasticsearch default is 1s). Set index.search.elasticsearch.create.ext.refresh_interval to 5s or 10s for batch pipelines to cut segment-flush I/O; leave it low only where near-real-time search is a hard requirement. This is the same lever tuned in OpenSearch Sync Patterns for OpenSearch clusters.
IndexProvider queue depth: the count of pending mutations awaiting dispatch. A rising queue under steady load means the index write pool cannot keep pace with storage commits; the window stretches without bound until the producer throttles.
Client-only wiring: set index.search.elasticsearch.client-only=true so JanusGraph never joins the search cluster as a data node, keeping index-node lifecycle out of your graph's failure domain. The flag only matters on legacy builds that embed an Elasticsearch node client; releases that reach the cluster over the REST client are client-only by construction.

Instrument the gap, do not guess at it. Poll the Elasticsearch /_nodes/stats/indices/indexing endpoint for indexing latency and watch the /_cat/thread_pool/write?v write pool for non-zero rejections — a rejection is the index telling you the window just grew. Correlate that with Cassandra-side mutation stages from nodetool tpstats; drift shows up as divergence between the two series before any user files a ticket. When a query must observe its own write, the cheapest discipline is a bounded poll against the mixed index (below) rather than forcing every commit to block on a refresh.
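One way to turn the write-pool check into a signal is to compare two samples of the cumulative `rejected` counter and alert on growth. The sketch below assumes the cat API's `format=json` and `h=` column parameters; the function names and the base URL passed to `fetch_write_pool` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_write_pool(base_url: str, timeout: float = 5.0) -> list[dict]:
    """Fetch one sample of the write thread pool, one row per node."""
    url = (f"{base_url}/_cat/thread_pool/write"
           "?format=json&h=node_name,name,queue,rejected")
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def rejection_deltas(previous: list[dict], current: list[dict]) -> dict[str, int]:
    """Return nodes whose cumulative rejected count grew between samples.

    The cat API reports numbers as strings, so values are cast to int.
    A node absent from the previous sample is compared against zero; a
    counter that went down (node restart) is ignored.
    """
    before = {row["node_name"]: int(row["rejected"]) for row in previous}
    deltas: dict[str, int] = {}
    for row in current:
        grew = int(row["rejected"]) - before.get(row["node_name"], 0)
        if grew > 0:
            deltas[row["node_name"]] = grew
    return deltas


prev = [{"node_name": "es-1", "rejected": "3"}, {"node_name": "es-2", "rejected": "0"}]
curr = [{"node_name": "es-1", "rejected": "7"}, {"node_name": "es-2", "rejected": "0"}]
print(rejection_deltas(prev, curr))
```

Alerting on the delta rather than the absolute value avoids paging on rejections that happened before the current ingestion run.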

Python Integration Pattern

The following pattern demonstrates a batch ingestion workflow using gremlin-python with bounded execution, exponential backoff, explicit transaction boundaries, and post-commit sync polling. The verify_index_sync method isolates the read-your-writes cost to the callers that genuinely need it, leaving the bulk of ingestion on the fast asynchronous path.

python

import logging
import time
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.protocol import GremlinServerError
from gremlin_python.process.traversal import Merge, T
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class JanusGraphIngestionPipeline:
    def __init__(self, ws_url: str):
        self.ws_url = ws_url
        self.connection = None
        self.g = None

    def connect(self):
        try:
            self.connection = DriverRemoteConnection(self.ws_url, 'g')
            self.g = traversal().with_remote(self.connection)
            logger.info("Connected to JanusGraph WebSocket endpoint.")
        except Exception as e:
            logger.error(f"Connection failed: {e}")
            raise

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type((ConnectionError, TimeoutError, GremlinServerError)),
    )
    def batch_ingest(self, vertices: list[dict]):
        tx = self.g.tx()
        gtx = tx.begin()  # begin() returns the transaction-bound traversal source
        try:
            for v in vertices:
                # mergeV keeps the write idempotent so a retry cannot double-insert.
                # "id" is matched as an application property key here, not T.id;
                # the option key must be the Merge token and the label must use T.label.
                gtx.merge_v({"id": v["id"]}) \
                   .option(Merge.on_create, {T.label: v["label"], "name": v["name"]}) \
                   .iterate()
            tx.commit()
            logger.info("Batch of %d committed at LOCAL_QUORUM.", len(vertices))
        except Exception as e:
            logger.error(f"Batch ingestion failed: {e}")
            try:
                tx.rollback()
            except Exception:
                logger.exception("Rollback failed after batch error.")
            raise

    def verify_index_sync(self, vertex_id: str, timeout: float = 5.0,
                          poll_interval: float = 0.5) -> bool:
        """Read-your-writes fallback: poll the mixed index until the just-written
        vertex is searchable, bounded by a deadline. Cheaper than blocking every
        commit on a synchronous index refresh when only some reads need freshness."""
        # A monotonic clock keeps the deadline immune to wall-clock adjustments.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                # Force the query through the mixed index (has predicate),
                # not the storage-backed id lookup.
                if self.g.V().has("id", vertex_id).has_next():
                    return True
            except Exception as e:
                logger.warning(f"Index check failed: {e}")
            time.sleep(poll_interval)
        logger.warning(f"Index sync timeout for vertex {vertex_id}")
        return False

    def close(self):
        if self.connection:
            self.connection.close()

Two disciplines make this pattern safe under load. First, every mutation is idempotent — mergeV on a stable key means a retry after a partial failure reconciles instead of double-writing, which matters because a retried non-idempotent insert propagates into the index as a phantom document. Second, verify_index_sync polls the mixed index rather than the storage-backed ID lookup, so it actually confirms search visibility, not just durability. Keep property keys and their index bindings stable across deploys; changing a key’s type or index mapping mid-flight is a schema evolution concern that must be gated in CI, not discovered in production.
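The bounded-poll discipline can also be factored out of the pipeline class so it is testable without a running graph. This is a minimal sketch: `poll_until` is a hypothetical helper, and the injectable `clock` and `sleep` parameters exist only so a test can drive it with fake time.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True or the deadline passes.

    Always checks at least once, and never sleeps past the deadline.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        sleep(min(interval, remaining))


# Example with a stand-in predicate that becomes true on the third attempt.
attempts = []
print(poll_until(lambda: attempts.append(1) or len(attempts) >= 3,
                 timeout=1.0, interval=0.01))
```

In the pipeline, the predicate would be the mixed-index query, for example a lambda wrapping `g.V().has("id", vertex_id).has_next()`, so the deadline logic and the index query can change independently.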

Connection Lifecycle & Pool Management

Unmanaged WebSocket and CQL connections degrade throughput and exhaust cluster resources. JanusGraph relies on the underlying driver connection pools to multiplex requests across each Cassandra node. You must configure connection ceilings, idle timeouts, and retry policies to prevent cascading failures during network partitions — improper pool sizing correlates directly with NoHostAvailableException and traversal timeouts.

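Key pool parameters for the DataStax CQL driver (confirm the exact option names against the configuration reference for your JanusGraph release; builds on the 4.x driver expose the per-host ceiling as storage.cql.local-max-connections-per-host and take the timeout from the generic storage.connection-timeout):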

storage.cql.max-connections-per-host — cap physical sockets per Cassandra node to align with node capacity and native_transport_max_threads. Oversizing here converts a downstream slowdown into a thundering-herd of concurrent requests that saturates the coordinator.
storage.cql.core-connections-per-host — maintain a baseline of warm connections to avoid cold-start latency when traffic spikes; a pool that starts at zero pays a TLS-handshake tax on the first request of every burst.
storage.cql.connection-timeout — fail fast rather than queue indefinitely during saturation. A short timeout surfaces backpressure to the producer; a long one hides it until the pool is fully exhausted.

Two sizing rules keep the pool honest:

Size to write concurrency, then add headroom. Set max-connections-per-host to peak concurrent writers per node, plus roughly 20% for retry and reconciliation traffic.
Bound the client idle timeout below the server’s. Keep the driver’s idle reaping under Cassandra’s native_transport_idle_timeout so the client closes dead sockets first; a client reusing a server-closed socket surfaces as a spurious GremlinServerError that the retry policy will burn attempts on.

The full sizing model (pool minimums, maximums, and starvation symptoms) lives in Connection Pooling.
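The two sizing rules reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and integer percentages to avoid floating-point rounding; the 20% default is the headroom figure quoted above, not a driver default.

```python
def max_connections_per_host(peak_writers_per_node: int,
                             headroom_percent: int = 20) -> int:
    """Peak concurrent writers plus headroom, rounded up, never below 1."""
    if peak_writers_per_node < 0 or headroom_percent < 0:
        raise ValueError("inputs must be non-negative")
    # Integer ceiling of peak * percent / 100.
    headroom = (peak_writers_per_node * headroom_percent + 99) // 100
    return max(1, peak_writers_per_node + headroom)


def client_reaps_first(client_idle_seconds: float,
                       server_idle_seconds: float) -> bool:
    """True when the client closes idle sockets before the server does."""
    return client_idle_seconds < server_idle_seconds


print(max_connections_per_host(10))
print(client_reaps_first(client_idle_seconds=50, server_idle_seconds=60))
```

Rounding the headroom up means a small pool of 8 writers is sized to 10 rather than 9, which errs toward spare capacity for retries.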

Teams evaluating a CQL-compatible alternative to Apache Cassandra must verify JanusGraph’s compatibility matrix before migrating: schema evolution, compaction behavior, and consistency guarantees differ between Cassandra and drop-in replacements. If your roadmap includes moving off Cassandra, the ScyllaDB migration guide details the driver overrides and index-backend checks required to preserve write semantics.

Diagnostics & Operational Fallbacks

Instrument the coordinator, the pool, and the index together; each of the top failure modes on this backend looks like the others until you read the right metric. The table maps alert to diagnosis to fix.

Symptom	Diagnose	Resolve
`NoHostAvailableException` under load	`nodetool status` shows nodes `UN`, but driver pool utilization is at 100%	Pool exhaustion, not node death — raise `max-connections-per-host` toward node capacity and shorten `connection-timeout` per the Connection Pooling model
Traversals miss vertices committed seconds ago	Vertex present via `g.V(id)` but absent via `has()`; `/_nodes/stats/indices/indexing` latency climbing	Index replication window is stretching: shorten `refresh_interval` if search freshness is the constraint, throttle the producer, or gate the read behind `verify_index_sync`
Write latency spikes and coordinator CPU pegs	`nodetool tpstats` shows a growing `MutationStage` pending count; large multi-partition batches in flight	Lower `batch-statement-size` and keep unlogged batches single-partition to stop coordinator fan-out and tombstone buildup
Schema mutations hang on `apply`	Global `QUORUM` on system operations forcing cross-DC round trips	Set `only-use-local-consistency-for-system-operations=true` so ID allocation and schema locks stay on `LOCAL_QUORUM`
Duplicate / phantom vertices after retries	Retried non-idempotent `addV` committed twice; index count exceeds storage count	Switch ingestion to `mergeV` on a stable key; reconcile existing duplicates by reindexing from the authoritative storage view

When drift persists beyond what a producer throttle or refresh-interval change can close, run a REINDEX through the JanusGraph Management API during a maintenance window rather than dropping and rebuilding the index live. Verify configurations against the official Apache Cassandra Documentation and the JanusGraph Reference Documentation, and keep continuous monitoring on coordinator CPU, GC pause duration, and index-lag metrics — those three series predict every failure in the table above.

Up a level: JanusGraph Storage Backend Architecture & Configuration — the storage tier this backend plugs into.
How to Configure Cassandra for JanusGraph Storage — step-by-step keyspace and properties provisioning.
Connection Pooling — the full CQL pool sizing model and starvation symptoms.
Replication Strategies — datacenter topology and replica-count decisions that precede consistency tuning.
ScyllaDB Migration — driver overrides for moving off Cassandra to a CQL-compatible backend.
Eventual vs Strong Consistency — where to place the index acknowledgment boundary the sync window depends on.