Why does a plain g.V().count() hide OpenSearch index drift?

JanusGraph silently falls back to a full storage scan when a mixed index is degraded, so a label count returns the correct number even when the index is nearly empty. Detection must query OpenSearch directly and force a traversal that must use the index (for example textContainsRegex) so it fails fast when the index is degraded.

How do I stop a replayed reconciliation from duplicating documents?

Derive the OpenSearch _id deterministically from the JanusGraph vertex id in the bulk action line. Every write becomes a create-or-replace, so rerunning the extract or replaying a failed bulk chunk overwrites the existing document instead of creating a second one.

The OpenSearch count is higher than storage — what does that mean?

Stale deletions or orphaned documents from failed transaction rollbacks remain in the index. Because storage is authoritative, a full delete-recreate-and-reingest from storage removes the orphans; incremental deletes are error-prone by comparison.

When should I restore from a snapshot instead of running the rebuild loop?

When the mixed index is unqueryable (the forced traversal errors rather than returning a count) or cluster health is red, treat it as corruption and restore the latest verified snapshot rather than attempting an incremental rebuild. Verify snapshot state is SUCCESS before restoring.

Resolving OpenSearch Index Drift in Production

This guide walks an on-call engineer through a deterministic reconciliation loop (detect, quarantine, extract, rebuild, re-ingest, validate) that resolves OpenSearch index drift in a live JanusGraph cluster and closes the specific failure where a graph mutation commits to storage but its matching document never becomes searchable. It is the repair runbook under OpenSearch Sync Patterns; if you have not yet wired the backend or tuned refresh semantics, do that there first, because the parameters below assume a working async dispatch path.

In Apache JanusGraph, mixed indexes are asynchronous by design: the storage backend (Cassandra, ScyllaDB, or HBase) commits transactional graph mutations first, then index mutations are dispatched to OpenSearch on a separate thread pool. Network partitions, shard-allocation failures, and misaligned refresh intervals routinely open a document-level gap, and once drift crosses an acceptable threshold, query accuracy degrades and read-after-write assumptions fail. Treat drift as a capacity-and-configuration problem, not a transient network event: heuristic retries will not converge, but a deterministic loop will.

The reconciliation loop: quarantine halts drift, then extract → rebuild → re-ingest replays until the storage-to-index delta reaches zero.

Prerequisites

Confirm every item before you touch a production index. Skipping the health and permission checks is the most common cause of a “repair” that widens the delta instead of closing it.

JanusGraph 0.6.x or 1.0.x running against a CQL storage backend. If storage itself is unstable, stabilize it via Cassandra backend setup before attempting index repair — a drifting storage layer makes any authoritative extract meaningless.
OpenSearch 1.x or 2.x reachable from every JanusGraph node, addressed through JanusGraph’s Elasticsearch-compatible backend (the index.search.backend value stays elasticsearch). Cluster health must be at least yellow; a red-status cluster is a fallback scenario, not a reconciliation one.
jq, curl, and gremlinpython on the operator host, with gremlinpython matching your server’s TinkerPop line (3.5.x for JG 0.6, 3.7.x for JG 1.0).
Write access to janusgraph.properties and the ability to route writes to read-only or a standby during a maintenance window.
A known-good driver pool. Size it per the connection pooling model so thread starvation during the extract phase is not misdiagnosed as fresh drift.
A recent OpenSearch snapshot you can restore from. Verify it exists before you delete anything.

Step 1 — Detect and quantify divergence

Drift detection must bypass JanusGraph’s query layer and compare raw storage state against OpenSearch document state directly. Relying on g.V().hasLabel(...).count() masks index-level failures, because JanusGraph silently falls back to full storage scans when a mixed index is degraded — the count looks correct while the index is empty.

Run a direct count comparison: query OpenSearch for its document count, then force a Gremlin traversal that must use the mixed index so it fails fast if the index is degraded.

bash

# 1. Query OpenSearch directly for the indexed document count
curl -s --fail -X GET "https://opensearch-cluster:9200/janusgraph_vertex/_count" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match_all": {}}}' | jq -r '.count'

# 2. Force JanusGraph to use the mixed index (fails fast if the index is degraded)
curl -s --fail -X POST "https://janusgraph-server:8182/gremlin" \
  -H "Content-Type: application/json" \
  -d '{"gremlin": "g.V().hasLabel(\"entity\").has(\"name\", textContainsRegex(\".*\")).count().next()", "bindings": {}}'

Interpret the delta immediately:

OpenSearch count < storage count: index mutations are dropping or queued indefinitely. Dispatch thread exhaustion or bulk-request rejections are the primary suspects.
OpenSearch count > storage count: stale deletions or orphaned documents from failed transaction rollbacks remain in the index.
Counts match but queries fail: index mapping corruption or analyzer misconfiguration, not a count problem.

Cross-reference the delta against JanusGraph logs for the failure signature that tells you why it drifted:

bash

grep -E "IndexMutation|BulkRequest|RejectedExecution|circuit_breaking_exception" \
  /var/log/janusgraph/server.log | tail -40

Correlate the timestamps with OpenSearch thread-pool stats. RejectedExecutionException on the JanusGraph side almost always pairs with a non-zero rejected count on the OpenSearch write pool — that pairing is the root cause, and it is the one this loop fixes. Automate this count-comparison as a scheduled worker and alert when divergence exceeds ~0.5% of indexed cardinality.
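
The interpretation rules and the ~0.5% alert threshold can be folded into a small check that a scheduled worker calls after collecting both counts. This is a minimal sketch: the function name `classify_drift`, the state labels, and the choice to measure divergence relative to the storage count are assumptions for illustration, not part of JanusGraph or OpenSearch.

```python
def classify_drift(storage_count: int, index_count: int, threshold: float = 0.005) -> dict:
    """Compare a storage count with an OpenSearch count and label the result."""
    delta = storage_count - index_count
    # Assumption: divergence is measured against the storage count,
    # because storage is the authoritative side of the comparison.
    if storage_count:
        ratio = abs(delta) / storage_count
    else:
        ratio = 1.0 if index_count else 0.0
    if delta > 0:
        state = "index_behind"   # index mutations dropped or still queued
    elif delta < 0:
        state = "index_ahead"    # stale or orphaned documents in the index
    else:
        state = "in_sync"        # if queries still fail, suspect mapping or analyzers
    return {"delta": delta, "ratio": ratio, "state": state, "alert": ratio > threshold}


if __name__ == "__main__":
    # Hypothetical counts gathered by the two curl calls in this step
    print(classify_drift(1_000_000, 993_000))
```

The worker only needs the two integers, so the same function can serve the Step 6 parity check.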

Step 2 — Quarantine write traffic

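Stop new mutations from compounding the delta before you extract state, or the extract races the live write path and can never reach parity. Route writes to a standby graph or put the affected cluster into read-only mode (for example by setting storage.read-only=true in janusgraph.properties and restarting the server nodes). The management call below only changes the consistency setting on the index; it is not a write block on its own, so pair it with one of those two routes:

bash

# Set lock-based consistency on the index while you reconcile.
# Note: this does not reject writes by itself; quarantine still needs a standby or read-only mode.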
curl -s -X POST "https://janusgraph-server:8182/gremlin" \
  -H "Content-Type: application/json" \
  -d '{"gremlin": "mgmt = graph.openManagement(); mgmt.setConsistency(mgmt.getGraphIndex(\"searchByEntity\"), ConsistencyModifier.LOCK); mgmt.commit()"}'

If you cannot quarantine at the graph, freeze the drift window at the index by disabling refresh so no half-written state churns during the rebuild:

bash

curl -s -X PUT "https://opensearch-cluster:9200/janusgraph_vertex/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.refresh_interval": "-1"}'

Step 3 — Extract authoritative state from storage

The storage backend is the source of truth; the index is a derived projection. Pull the complete set of indexed properties straight from storage with a full-scan traversal and stream it to newline-delimited JSON for bulk ingestion. Deriving the OpenSearch _id deterministically from the vertex id is what makes the whole loop idempotent — a replayed extract overwrites rather than duplicates.

python

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
import json

conn = DriverRemoteConnection('ws://janusgraph-server:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Full scan of indexed properties straight from the storage backend.
# valueMap(True) keys the vertex id and label by the T.id / T.label enums,
# so read the id via T.id and keep only the string-keyed properties.
try:
    with open('authoritative_state.ndjson', 'w') as out:
        for doc in g.V().hasLabel('entity').valueMap(True).toList():
            vid = str(doc[T.id])
            # Property values arrive as lists; take the first value of each.
            source = {k: (v[0] if isinstance(v, list) else v)
                      for k, v in doc.items() if isinstance(k, str)}
            source['id'] = vid  # matches the "id" keyword field in the Step 4 mapping
            # Bulk NDJSON: deterministic _id makes re-ingestion a create-or-replace
            out.write(json.dumps({"index": {"_id": vid}}) + "\n")
            out.write(json.dumps(source, default=str) + "\n")
finally:
    conn.close()

For a very large graph, page this scan by a monotonic cursor property instead of a single toList() so a crashed extract resumes rather than restarting.
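
A cursor-paged scan can be sketched independently of the driver. In this hedged example the page fetcher is injected as a callable, so the loop itself carries no gremlinpython dependency; `paged_scan`, `fetch_page`, and the `seq` cursor property mentioned in the comment are hypothetical names, and the sketch assumes your schema has a monotonically increasing, indexed property to page on.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[int, dict]  # (cursor value, extracted document)


def paged_scan(fetch_page: Callable[[int, int], List[Row]],
               start_after: int = -1,
               page_size: int = 1000) -> Iterator[Row]:
    """Yield rows in cursor order, fetching one bounded page at a time."""
    cursor = start_after
    while True:
        # Against JanusGraph, fetch_page might wrap a traversal such as
        #   g.V().hasLabel('entity').has('seq', P.gt(cursor))
        #        .order().by('seq').limit(page_size)
        # where 'seq' is a hypothetical monotonic property.
        page = fetch_page(cursor, page_size)
        if not page:
            return
        for row in page:
            yield row
        # Persist this value after the page is written to resume after a crash.
        cursor = page[-1][0]


if __name__ == "__main__":
    data = [(i, {"name": f"entity-{i}"}) for i in range(5)]

    def fake_fetch(cursor: int, size: int) -> List[Row]:
        return [r for r in data if r[0] > cursor][:size]

    for seq, doc in paged_scan(fake_fetch, start_after=2, page_size=2):
        print(seq, doc)
```

Resuming is then a matter of passing the last persisted cursor value as `start_after`; because the bulk `_id` is deterministic, rows re-read near the resume point overwrite rather than duplicate.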

Step 4 — Rebuild the OpenSearch index

Delete the drifted index and recreate it with the correct mapping. Pin dynamic: strict so a stray field cannot silently reshape the mapping mid-repair; the shard count you choose here should match the predicate layout described in mixed-index routing.

bash

# 1. Delete the drifted index
curl -s -X DELETE "https://opensearch-cluster:9200/janusgraph_vertex"

# 2. Recreate with an explicit mapping (adjust fields to your schema)
curl -s -X PUT "https://opensearch-cluster:9200/janusgraph_vertex" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {"number_of_shards": 5, "number_of_replicas": 1, "refresh_interval": "-1"},
    "mappings": {"dynamic": "strict", "properties": {"name": {"type": "text"}, "id": {"type": "keyword"}}}
  }'

Step 5 — Bulk re-ingest the authoritative state

Feed the NDJSON from Step 3 into the OpenSearch _bulk endpoint. Because every action line carries a deterministic _id, this write is a create-or-replace — rerunning it after a partial failure converges instead of duplicating.

bash

# Chunk large files to stay under the bulk payload ceiling (~5MB per request).
# Keep the line count even so an action line is never split from its source line.
split -l 20000 authoritative_state.ndjson bulk_chunk_

for chunk in bulk_chunk_*; do
  # Action lines carry no _index, so target the index in the URL
  curl -s -X POST "https://opensearch-cluster:9200/janusgraph_vertex/_bulk" \
    -H "Content-Type: application/x-ndjson" \
    --data-binary "@${chunk}" | jq -e '.errors == false' > /dev/null \
    || echo "ERRORS in ${chunk}: inspect before continuing"
done

# Restore refresh so freshly ingested documents become searchable
curl -s -X PUT "https://opensearch-cluster:9200/janusgraph_vertex/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.refresh_interval": "30s"}'
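
Splitting by line count only approximates the ~5MB ceiling, because document sizes vary. A byte-bounded alternative can be sketched as follows; `chunk_bulk_lines` is a hypothetical helper, and the 5 MiB default simply mirrors the ceiling quoted above rather than any limit read from the cluster.

```python
from typing import Iterable, Iterator, List


def chunk_bulk_lines(lines: Iterable[str],
                     max_bytes: int = 5 * 1024 * 1024) -> Iterator[List[str]]:
    """Group NDJSON lines into chunks of whole action/source pairs under max_bytes."""
    chunk: List[str] = []
    size = 0
    it = iter(lines)
    for action in it:
        source = next(it, None)  # every action line must be followed by a source line
        if source is None:
            raise ValueError("dangling action line without a source line")
        pair_size = len(action.encode("utf-8")) + len(source.encode("utf-8"))
        # Start a new chunk when this pair would push the current one over budget.
        if chunk and size + pair_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.extend([action, source])
        size += pair_size
    if chunk:
        yield chunk


if __name__ == "__main__":
    with open("authoritative_state.ndjson") as src:
        for n, chunk in enumerate(chunk_bulk_lines(src)):
            with open(f"bulk_chunk_{n:05d}", "w") as out:
                out.writelines(chunk)
```

A single pair larger than the budget is still emitted as its own chunk, so an oversized document surfaces as a bulk error instead of being silently dropped.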

Step 6 — Verify parity, then resume writes

Re-run the Step 1 count comparison. The loop is complete only when the delta is 0 and the write pool is quiet.

bash

# Storage count
STORAGE=$(curl -s -X POST "https://janusgraph-server:8182/gremlin" \
  -H "Content-Type: application/json" \
  -d '{"gremlin": "g.V().hasLabel(\"entity\").count().next()"}' | jq -r '.result.data["@value"][0]["@value"]')

# Index count after refresh has run
INDEX=$(curl -s "https://opensearch-cluster:9200/janusgraph_vertex/_count" | jq -r '.count')

echo "storage=${STORAGE} index=${INDEX} delta=$((STORAGE - INDEX))"

# Confirm the write pool is not still shedding load
curl -s "https://opensearch-cluster:9200/_cat/thread_pool/write?v&h=node_name,queue,rejected"

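A delta of 0 with a rejected counter that is no longer climbing means the projection is rebuilt and stable. The counter is cumulative since node start, so after an incident it will usually be non-zero; compare it with the previous sample rather than expecting a literal 0. Lift the quarantine from Step 2, re-enable writes, and watch _cat/thread_pool/write for 15 minutes to confirm dispatch stays clean under live traffic. If rejected climbs again immediately, the drift will recur, so fix capacity in the Hardening section below before you consider the incident closed.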
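
The resume gate can be sketched as a comparison of two write-pool samples taken some minutes apart. The sketch assumes the rows were fetched with `_cat/thread_pool/write?format=json&h=node_name,queue,rejected`, which returns counters as strings; `rejected_growth` and `safe_to_resume` are hypothetical helper names.

```python
from typing import Dict, List


def rejected_growth(before: List[dict], after: List[dict]) -> Dict[str, int]:
    """Per-node change in the write pool 'rejected' counter between two samples."""
    base = {row["node_name"]: int(row["rejected"]) for row in before}
    # Nodes missing from the first sample are compared against 0.
    return {row["node_name"]: int(row["rejected"]) - base.get(row["node_name"], 0)
            for row in after}


def safe_to_resume(storage_count: int, index_count: int,
                   before: List[dict], after: List[dict]) -> bool:
    """True when counts match and no node rejected writes between the samples."""
    growth = rejected_growth(before, after)
    return storage_count == index_count and all(g <= 0 for g in growth.values())


if __name__ == "__main__":
    sample_1 = [{"node_name": "os-1", "queue": "0", "rejected": "412"}]
    sample_2 = [{"node_name": "os-1", "queue": "0", "rejected": "412"}]
    print(safe_to_resume(1_000_000, 1_000_000, sample_1, sample_2))
```

A negative growth value means the node restarted and its counter reset, which this sketch treats as quiet.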

Fallback and rollback procedures

Each step has a defined recovery path. Do not skip verification between recovery actions.

Step 1 (traversal errors instead of returning a count). The mixed index is not merely drifted, it is unqueryable — treat this as corruption and go straight to snapshot restore rather than an incremental rebuild. Verify snapshot integrity first: curl -s "https://opensearch-cluster:9200/_snapshot/repo/snap-latest/_status" | jq -r '.snapshots[].state' must report SUCCESS.
Step 2 (cannot quarantine at the graph). Fall back to disabling refresh at the index (shown above) and, if the delta is still growing, put JanusGraph into index-bypass mode with query.force-index=false so reads fall back to storage scans. Bound scan cost with storage.page-size to avoid OOM, and accept the latency hit to preserve availability.
Step 3 (extract stalls or the JVM is under memory pressure). Page the scan by a monotonic cursor and resume from the last committed value instead of restarting the full toList(). Confirm the driver pool is not starved before blaming storage — thread starvation looks identical to a hung extract.
Step 4 (delete succeeds but recreate fails). Do not resume writes against a missing index — JanusGraph will fall back to storage scans and mask the problem. Restore the mapping from your snapshot: curl -s -X POST "https://opensearch-cluster:9200/_snapshot/repo/snap-latest/_restore" -H "Content-Type: application/json" -d '{"indices": "janusgraph_vertex"}'.
Step 5 (bulk returns "errors": true). Inspect the per-item errors in the response. A mapper_parsing_exception means Step 4’s mapping does not match the extracted fields — fix the mapping and re-ingest (idempotent). A 429/es_rejected_execution_exception means the write pool is saturated; raise thread_pool.write.queue_size, lower the chunk size, and replay only the failed chunks.
Step 6 (delta never reaches 0). If parity refuses to converge after a clean re-ingest, a recent mapping change is the likely culprit. Roll back to the previous index alias, point index.search.hostname at the stable cluster, and restart the JanusGraph server pool. Then re-run the loop from Step 3 against the reverted mapping.
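
For the Step 5 recovery path, the per-item errors in a `_bulk` response can be flattened into a short list that separates mapping failures from saturation. This is a sketch under the assumption that the response body has been parsed into a dict; `failed_items` is a hypothetical helper, and it keys retry decisions on the per-item HTTP status rather than on the exception type name.

```python
from typing import List


def failed_items(bulk_response: dict) -> List[dict]:
    """Flatten a _bulk response body into one record per failed item."""
    failures: List[dict] = []
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response.get("items", []):
        # Each item is a single-key dict named after the action, e.g. "index".
        for result in item.values():
            error = result.get("error")
            if error:
                failures.append({
                    "id": result.get("_id"),
                    "status": result.get("status"),
                    "type": error.get("type"),
                    # 429 signals write-pool saturation: replay later in smaller chunks
                    "retryable": result.get("status") == 429,
                })
    return failures


if __name__ == "__main__":
    sample = {"errors": True, "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception", "reason": "..."}}},
    ]}
    for failure in failed_items(sample):
        print(failure)
```

Items marked retryable can be replayed as-is once the write pool drains; the rest need a mapping fix before the idempotent re-ingest.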

Hardening against recurrence

Reconciliation that is not followed by capacity alignment just schedules the next incident. Align JanusGraph’s dispatch parameters with OpenSearch’s ingest limits:

Thread pool and queue sizing. Keep rejected at 0 on _cat/thread_pool/write; raise thread_pool.write.queue_size in OpenSearch when rejections appear under normal load.
Bulk request limits. Set index.search.elasticsearch.bulk-size so payloads stay below ~5MB; larger payloads trip circuit breakers and inflate retry latency.
Refresh interval. Run index.refresh_interval at 30s–60s for high-throughput pipelines and apply wait_for only to the narrow set of writes that need immediate visibility. See the official OpenSearch index settings for cluster-wide tuning.
Persistent retry queue. Back the pipeline with a disk-durable queue (Kafka or Redis) so index mutations survive a JVM restart and replay after OpenSearch recovers, closing the silent-drop gap that no in-memory retry can cover.

Record the final drift metrics, the exact commands you ran, and the last-good snapshot id in the incident runbook so the next on-call engineer inherits a converged baseline.

Up a level: OpenSearch Sync Patterns — the parent reference for the JanusGraph-to-OpenSearch boundary this runbook repairs.
Syncing JanusGraph with Elasticsearch Step by Step — the initial-sync procedure whose parity check surfaces the drift this page resolves.
Configuring Mixed Index Fallback Chains — shard alignment so a rebuilt index lands documents on balanced shards.
Eventual vs Strong Consistency Tradeoffs in JanusGraph — choosing the acknowledgment and refresh boundary that determines how quickly drift can appear.
JanusGraph Connection Pool Tuning Guide — sizing the driver pool so extract-phase starvation is not mistaken for fresh index lag.