Eventual vs Strong Consistency Tradeoffs in JanusGraph
Decoupling the graph traversal engine from its persistence and indexing layers is JanusGraph’s primary architectural advantage, but it also draws a hard operational boundary around transactional guarantees. Choosing between eventual and strong consistency means explicitly trading write throughput for read freshness across the JanusGraph storage backend and index synchronization boundary. JanusGraph commits vertex and edge mutations to the storage layer first, where Cassandra and ScyllaDB apply tunable consistency levels (HBase provides strong row-level consistency instead). It then forwards mixed-index updates to the external index (Elasticsearch, OpenSearch, Solr), which makes them searchable only after its next refresh (Elasticsearch/OpenSearch) or commit (Solr). No distributed two-phase commit spans both systems, so a storage commit can succeed while the matching index update is delayed or lost. Platform teams must therefore enforce consistency boundaries at the configuration layer, implement application-level verification, and define explicit fallback procedures for SLA breaches.
Configuration Matrix: Storage vs Index Alignment
JanusGraph exposes consistency controls through janusgraph.properties. The following configurations isolate the exact properties required to shift between tightened consistency (near-strong) and high-throughput eventual consistency postures.
Tightened Consistency Posture
This configuration minimizes index lag by forcing rapid segment refreshes and enforcing strict storage quorums. It increases backend I/O pressure and reduces ingestion throughput. Use it only for low-volume, read-critical workloads that need near-immediate query accuracy; a 1s refresh still leaves up to about one second of index lag.
# Storage Backend (Cassandra/ScyllaDB)
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_prod
storage.cql.write-consistency-level=QUORUM
storage.cql.read-consistency-level=QUORUM
# External Index (Elasticsearch/OpenSearch)
index.search.backend=elasticsearch
index.search.hostname=10.0.2.10,10.0.2.11,10.0.2.12
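index.search.connect-timeout=10000
# Optional: uncomment so each bulk request waits until its changes are searchable (raises commit latency)
# index.search.elasticsearch.bulk-refresh=wait_for
# create.ext settings apply only when JanusGraph creates the index; change an existing index's
# refresh_interval through the Elasticsearch _settings API (number_of_shards cannot change after creation)
index.search.elasticsearch.create.ext.refresh_interval=1s
index.search.elasticsearch.create.ext.number_of_replicas=1
index.search.elasticsearch.create.ext.number_of_shards=5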
Eventual Consistency Posture
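Common production posture for high-throughput ingestion pipelines. Each storage write is acknowledged by a single replica in the local datacenter, which minimizes latency, and index updates become searchable only at the next refresh, up to one refresh_interval (30s in the configuration below) after the write. Because LOCAL_ONE is used for both reads and writes, storage reads can also return stale data, and read-after-write queries through the mixed index will miss new elements until the refresh completes. Refer to External Index Synchronization & Consistency Tuning for backend-specific queue tuning.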
# Storage Backend
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.write-consistency-level=LOCAL_ONE
storage.cql.read-consistency-level=LOCAL_ONE
# External Index
index.search.backend=elasticsearch
index.search.hostname=10.0.2.10,10.0.2.11,10.0.2.12
index.search.connect-timeout=10000
# create.ext settings apply only when JanusGraph creates the index
index.search.elasticsearch.create.ext.refresh_interval=30s
index.search.elasticsearch.create.ext.number_of_replicas=2
index.search.elasticsearch.create.ext.number_of_shards=10
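Whether a storage read is guaranteed to see the latest acknowledged write follows from the replica-overlap rule R + W > RF, where R and W are the replica acknowledgements required by the read and write consistency levels and RF is the keyspace replication factor. The sketch below checks both postures, assuming a single datacenter with RF = 3 so that LOCAL_* levels behave like their non-local counterparts; replicas_required and read_sees_latest_write are hypothetical helper names, not JanusGraph or driver APIs. The rule covers storage reads only: mixed-index visibility still depends on the refresh cycle.

```python
# Hypothetical helpers assuming a single-datacenter keyspace, so LOCAL_ONE and
# LOCAL_QUORUM behave like ONE and QUORUM.


def replicas_required(level: str, rf: int) -> int:
    """Replica acknowledgements a Cassandra/ScyllaDB consistency level waits for."""
    level = level.upper()
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level in ("QUORUM", "LOCAL_QUORUM"):
        return rf // 2 + 1
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {level}")


def read_sees_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when every read replica set overlaps every write replica set (R + W > RF)."""
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf


if __name__ == "__main__":
    print(read_sees_latest_write("QUORUM", "QUORUM", 3))        # tightened posture: True
    print(read_sees_latest_write("LOCAL_ONE", "LOCAL_ONE", 3))  # eventual posture: False
```

With RF = 3, QUORUM needs 2 replicas on each side (2 + 2 > 3), while LOCAL_ONE needs only 1 (1 + 1 is not greater than 3), which is why the eventual posture can serve stale storage reads.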
Diagnostic Procedures & Lag Measurement
Consistency boundaries must be validated against actual ingestion rates. Use the following reproducible steps to measure index synchronization lag and verify storage-to-index alignment.
Step 1: Baseline Index Lag Measurement
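Run this sequence in the Gremlin Console against an embedded JanusGraph instance (where g is the graph's traversal source). It writes a timestamped test vertex, commits it, and polls a mixed-index lookup until the vertex becomes visible; the elapsed time is the index synchronization lag. The lookup only measures index lag if diag_id is covered by a mixed index on test_node; a composite index on the same key could answer the query from storage instead.
// 1. Write a test vertex stamped with the write time; commit so the mutation reaches storage and the index
t0 = System.currentTimeMillis()
g.addV('test_node').property('diag_id', 'diag-001').property('ts', t0).iterate(); g.tx().commit()
// 2. Poll the mixed index until the vertex is visible (60 s safety cap); rollback closes each read transaction
while (!g.V().has('test_node', 'diag_id', 'diag-001').hasNext() && System.currentTimeMillis() - t0 < 60000) { g.tx().rollback(); Thread.sleep(100) }
// 3. Elapsed time from write to first successful index hit
lagMs = System.currentTimeMillis() - t0
lagMs is the end-to-end synchronization lag, including commit time. A lag exceeding refresh_interval * 2 indicates an index queue backlog or segment merge contention. If the loop exits at the 60-second cap, the vertex never became visible and the index may be missing updates.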
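The refresh_interval * 2 threshold can be expressed as a small helper for automated checks. This is a sketch, assuming refresh_interval values written with an ms, s, or m suffix as in the configurations above; parse_interval_ms and lag_exceeds_threshold are hypothetical names, not JanusGraph or Elasticsearch APIs.

```python
import re

# Hypothetical helper: converts Elasticsearch-style time values ("500ms", "1s", "30s", "2m")
# into milliseconds. Only the suffixes used in this guide are handled.
_UNITS_MS = {"ms": 1, "s": 1_000, "m": 60_000}


def parse_interval_ms(value: str) -> int:
    match = re.fullmatch(r"(\d+)(ms|s|m)", value.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS_MS[unit]


def lag_exceeds_threshold(lag_ms: int, refresh_interval: str, factor: int = 2) -> bool:
    """Flag lag above factor x refresh_interval (the backlog heuristic from Step 1)."""
    return lag_ms > factor * parse_interval_ms(refresh_interval)


if __name__ == "__main__":
    print(lag_exceeds_threshold(1_500, "1s"))    # 1.5 s lag vs 2 s threshold: False
    print(lag_exceeds_threshold(75_000, "30s"))  # 75 s lag vs 60 s threshold: True
```

Keeping the factor as a parameter lets each posture tune its own alerting margin without changing the parsing logic.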
Step 2: Python Pipeline Verification
Embed this validation routine in ingestion pipelines to detect writes that are not yet visible through the index before downstream consumers act on stale graph state.
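import time
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

def verify_consistency(g, vertex_label, prop_key, prop_value, timeout_sec=10, poll_sec=0.5):
    """Poll a mixed-index lookup on traversal source g until the vertex is visible or the SLA window expires."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        try:
            # limit(1) keeps the probe cheap; an empty list means "not indexed yet"
            if g.V().has(vertex_label, prop_key, prop_value).limit(1).toList():
                return True
        except Exception:
            pass  # treat transient driver/server errors as "not yet visible" and retry
        time.sleep(poll_sec)
    raise TimeoutError(f"Index synchronization exceeded {timeout_sec}s SLA for {prop_key}={prop_value}")

if __name__ == "__main__":
    # Example Gremlin Server endpoint and lookup values; adjust for your deployment
    conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
    try:
        verify_consistency(traversal().withRemote(conn), "test_node", "diag_id", "diag-001")
    finally:
        conn.close()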
Step 3: Backend Metrics Correlation
Monitor Elasticsearch refresh latency directly. A high refresh.total_time_in_millis relative to refresh.total (that is, a high average refresh duration) suggests disk I/O saturation. Query the index stats API:
curl -s "http://10.0.2.10:9200/janusgraph_mixed_index/_stats/refresh?pretty" | jq '.indices[].total.refresh'
Cross-reference with Cassandra/ScyllaDB WriteLatency and ReadLatency metrics. If storage latency is stable but index lag grows, the bottleneck is most likely in the external index's refresh and segment merge work, not in JanusGraph’s transaction manager.
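To turn the _stats output into a single number per index, divide total_time_in_millis by total. The sketch below assumes the response shape returned by the index stats API for the refresh metric (indices.<name>.total.refresh with total and total_time_in_millis fields); avg_refresh_ms is a hypothetical helper, and the sample payload is illustrative, not measured data.

```python
import json


def avg_refresh_ms(stats: dict) -> dict:
    """Average refresh duration per index, from an Elasticsearch _stats/refresh response."""
    result = {}
    for index_name, index_stats in stats.get("indices", {}).items():
        refresh = index_stats["total"]["refresh"]
        count = refresh.get("total", 0)
        # Guard against division by zero on an index that has not refreshed yet
        result[index_name] = refresh["total_time_in_millis"] / count if count else 0.0
    return result


if __name__ == "__main__":
    # Illustrative payload trimmed to the fields used above (not real measurements)
    sample = json.loads("""
    {"indices": {"janusgraph_mixed_index": {"total": {"refresh":
        {"total": 400, "total_time_in_millis": 12000}}}}}
    """)
    print(avg_refresh_ms(sample))
```

In a pipeline, the same function can be applied to the parsed output of the curl command above, for example through json.load(sys.stdin) in a small wrapper script, and the per-index averages logged next to ingestion rates.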
Fallback & Incident Response Protocols
When consistency boundaries degrade or SLAs breach, execute the following fallback procedures in order. Do not attempt to bypass the index without explicit validation of storage state.
Fallback 1: Forced Index Refresh
If read-after-write queries consistently time out or return stale data, trigger an immediate refresh on the mixed index. This makes every operation Elasticsearch has already accepted searchable; it cannot recover updates that never reached the index.
curl -X POST "http://10.0.2.10:9200/janusgraph_mixed_index/_refresh"
Warning: Frequent forced refreshes degrade indexing throughput. Use only during incident response or scheduled maintenance windows.
Fallback 2: Storage-Only Read Bypass
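When the external index is unavailable or severely desynchronized, route critical read operations directly to the storage backend. Express filters on indexed properties as lambdas rather than has() predicates, because JanusGraph folds has() steps that follow g.V() into index queries.
// Bypass the mixed index: the lambda filter runs in memory, so JanusGraph cannot send it to the index
// limit() before filter() caps the scan at 1000 critical_entity vertices; matches beyond that are not returned
g.V().hasLabel('critical_entity').limit(1000).filter { it.get().property('status').orElse(null) == 'active' }.toList()
Constraint: This path cannot use index-backed range or full-text predicates and reads vertices from storage. Keep a strict limit() ahead of the filter to prevent full-table scans in production, and treat the result as partial. If query.force-index=true is set, JanusGraph rejects the traversal because it cannot be answered by an index.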
Fallback 3: Index Rebuild Procedure
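If index corruption or persistent desynchronization occurs, rebuild the mixed index from storage. The reindex call blocks until the full scan completes and adds significant load to both backends, so run it in a maintenance window.
- Pause ingestion pipelines that write to the indexed labels for the duration of the rebuild.
- Execute the JanusGraph management API reindex (in standalone Java, import JanusGraphManagement, SchemaAction, and SchemaStatus from org.janusgraph.core.schema and handle the checked exceptions thrown by get()):
// REINDEX rebuilds the index from storage; get() blocks until the scan job completes
JanusGraphManagement mgmt = graph.openManagement();
mgmt.updateIndex(mgmt.getGraphIndex("mixed_index"), SchemaAction.REINDEX).get();
mgmt.commit();
// In a fresh management transaction, confirm the indexed key reports ENABLED
JanusGraphManagement check = graph.openManagement();
SchemaStatus status = check.getGraphIndex("mixed_index").getIndexStatus(check.getPropertyKey("indexed_property"));
check.rollback();
- Resume ingestion and verify consistency using Step 1 diagnostics.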
Operational Guardrails
- Never set refresh_interval below 1s in production. Sub-second refreshes cause excessive segment creation and merge pressure and can contribute to Elasticsearch circuit breaker trips.
- Align storage.cql.write-consistency-level with your replication factor, not just your node count. With a replication factor of 3, QUORUM (2 replicas) tolerates one unavailable replica; ALL tolerates none.
- Implement idempotent write patterns in Python pipelines. JanusGraph does not guarantee exactly-once semantics across the storage/index boundary during network partitions.
- Log index lag metrics alongside application traces. Correlate refresh.total_time_in_millis with pipeline ingestion rates to predict SLA breaches before they affect consumers.
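The fault-tolerance arithmetic behind the consistency-level guardrail can be checked directly: a level that needs k replica acknowledgements keeps succeeding while up to RF - k replicas are unavailable. A minimal sketch, assuming a single datacenter; tolerated_replica_failures is a hypothetical helper name.

```python
def tolerated_replica_failures(rf: int, level: str) -> int:
    """Replicas that can be down while requests at `level` still succeed (single datacenter)."""
    quorum = rf // 2 + 1
    required = {"ONE": 1, "LOCAL_ONE": 1, "QUORUM": quorum, "LOCAL_QUORUM": quorum, "ALL": rf}
    return rf - required[level.upper()]


if __name__ == "__main__":
    for level in ("LOCAL_ONE", "QUORUM", "ALL"):
        print(level, tolerated_replica_failures(3, level))
```

For RF = 3 this prints LOCAL_ONE 2, QUORUM 1, and ALL 0, matching the guardrail above; raising RF to 5 lets QUORUM survive two unavailable replicas.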