OpenSearch Sync Patterns

The specific failure surface this page owns is the seam where JanusGraph commits a vertex or edge to storage, then dispatches the matching document to an OpenSearch backend that speaks the Elasticsearch REST protocol but ships its own version skew, security defaults, and circuit-breaker behaviour. This page sits under External Index Synchronization & Consistency Tuning and covers the OpenSearch half of that boundary: version-aware transport wiring, bulk-refresh semantics, the idempotent sync pipeline, connection lifecycle, and the triage path when the index falls behind storage. The synchronization model is asynchronous by design — JanusGraph acknowledges the storage commit before the index mutation is searchable — so misaligned refresh intervals, unbounded bulk queues, or missing idempotency guarantees turn into query-latency spikes and silent data drift. Everything below assumes an on-call engineer moving from a janusgraph.properties line to a resolution, not a tutorial.

The critical detail unique to OpenSearch is that JanusGraph has no native opensearch backend value. It addresses the search cluster through its Elasticsearch-compatible index backend, so both index.search.backend and every index.search.elasticsearch.* property key stay elasticsearch even when the target is OpenSearch. Version negotiation is where this leaks: JanusGraph inspects the reported cluster version to choose request formats, and OpenSearch 1.x/2.x report a version lineage that some JanusGraph builds mis-detect. Pin the behaviour explicitly rather than trusting auto-detection.

Version Negotiation & Compatibility

JanusGraph’s Elasticsearch REST client issues a GET / against the target on startup and reads the returned version.number to select request and mapping formats. OpenSearch 1.x and 2.x return their own version lineage — for example 2.11.1 — which a JanusGraph build expecting an Elasticsearch 7.x/8.x number can reject outright or silently mis-handle at mapping-creation time. Do not rely on auto-detection. Force OpenSearch to advertise the Elasticsearch version its API is wire-compatible with:

yaml

# opensearch.yml — make the cluster report as Elasticsearch 7.10.2
compatibility.override_main_response_version: true

With the override on, GET / returns "number": "7.10.2", which the JanusGraph Elasticsearch backend detects cleanly and pins to the ES7 request format that OpenSearch’s _bulk, _mapping, and _search endpoints implement. Confirm the negotiated version before wiring the graph, never after a failed index build:

bash

# What JanusGraph reads on startup — check number AND distribution
curl -sk -u janusgraph_svc:"$OPENSEARCH_SVC_PASSWORD" \
  https://opensearch-cluster-01:9200/ | jq '.version.number, .version.distribution'

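The distribution field reports opensearch whether or not the override is set, so number is the field to check: if it still shows the cluster's native 1.x or 2.x value, the override is not applied on the node that answered. Query each node address directly, because a cluster with mixed settings mis-detects intermittently as client connections land on different nodes, producing index writes that succeed against one node and fail against another. Roll compatibility.override_main_response_version to every node and restart before the first mixed-index mutation, and keep it in the node bootstrap template so a replaced node cannot rejoin without it. OpenSearch deprecated this setting in 1.x and removed it in 2.0, so on a 2.x cluster confirm what GET / actually returns, and test JanusGraph's detection in staging, before depending on the override.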
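
The per-node check can be scripted. This is a minimal sketch, assuming you can list every node's HTTP address yourself; `fetch_root`, `nodes_missing_override`, and the `EXPECTED` constant are illustrative names, not part of JanusGraph or OpenSearch. The only thing read from the cluster is the `version.number` field of the `GET /` body.

```python
import base64
import json
import ssl
import urllib.request
from typing import Optional

EXPECTED = "7.10.2"  # the number the override is expected to report


def fetch_root(url: str, user: str, password: str,
               ctx: Optional[ssl.SSLContext] = None) -> dict:
    """Return the parsed GET / body from one node (hypothetical helper)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req, timeout=10, context=ctx) as resp:
        return json.load(resp)


def nodes_missing_override(roots: dict, expected: str = EXPECTED) -> list:
    """Names of nodes whose GET / body does not report the expected number."""
    return sorted(
        name
        for name, body in roots.items()
        if body.get("version", {}).get("number") != expected
    )
```

Feed `nodes_missing_override` one `fetch_root` result per node address, not one result from a load-balanced hostname; an empty list means every node answered with the overridden number.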

Core Configuration & Consistency Tuning

Index synchronization behaviour is governed entirely by the index.search.* namespace. The block below is a hardened production baseline for an OpenSearch target — every non-default value changes a failure mode, so do not copy it blind. The numbered constraints after it explain the ones that bite in production.

properties

# Storage & index backend binding
storage.backend=cql
storage.hostname=graph-db-cluster-01,graph-db-cluster-02
index.search.backend=elasticsearch
index.search.hostname=opensearch-cluster-01,opensearch-cluster-02
index.search.port=9200

# Transport: REST client only, TLS, basic auth
index.search.elasticsearch.client-only=true
index.search.elasticsearch.ssl.enabled=true
index.search.elasticsearch.http.auth.type=basic
index.search.elasticsearch.http.auth.basic.username=janusgraph_svc
index.search.elasticsearch.http.auth.basic.password=${OPENSEARCH_SVC_PASSWORD}

# Index-creation settings (applied ONCE, at index creation time)
index.search.elasticsearch.create.ext.number_of_shards=5
index.search.elasticsearch.create.ext.number_of_replicas=1
index.search.elasticsearch.create.ext.refresh_interval=30s

# Bulk ingestion & consistency controls
index.search.elasticsearch.bulk-refresh=false
index.search.elasticsearch.bulk-size=1000
index.search.elasticsearch.max-retry-time=300000
# Bulk-import and reindex windows only: set to false before serving live traffic
storage.batch-loading=true

1. bulk-refresh=false protects the segment-merge pipeline. It disables the forced refresh on every bulk request, so OpenSearch creates segments on the refresh_interval cadence instead of thrashing merges under high-throughput graph mutation. Use wait_for only for the narrow set of writes that require read-your-writes visibility; applied globally, it serializes throughput behind refresh.
2. create.ext.* settings are inert after index creation. JanusGraph applies number_of_shards, number_of_replicas, and refresh_interval once, when the mixed index is first materialized, and changing the property afterwards does nothing. number_of_replicas and refresh_interval are dynamic index settings, so adjust them live through the OpenSearch _settings API; number_of_shards is static, so changing it means rebuilding the index in a maintenance window.
3. storage.batch-loading=true is a bulk-import-only flag. It disables the storage-side consistency checks and transaction-log overhead that make concurrent writes safe. Enable it for initial loads or controlled reindex windows, then turn it off; leaving it on in a live cluster invites lost-update anomalies.
4. max-retry-time must be tuned against OpenSearch circuit-breaker thresholds. This governs how long JanusGraph retries a failed index commit. Set it too high and a breaker-tripping cluster propagates backpressure into graph transaction timeouts; set it too low and transient 429s abort commits that would have succeeded on retry.
5. TLS and basic auth are not optional on OpenSearch. The OpenSearch security plugin is enabled by default, so ssl.enabled=true plus a scoped service account (janusgraph_svc with indices:data/write/bulk and index-management privileges) is the minimum. Inject the password from a secret store, never inline.
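
These constraints can be checked mechanically before a deploy. This is a minimal sketch of a pre-flight lint over a janusgraph.properties file; `parse_properties`, `lint_index_config`, and the warning strings are hypothetical, and the checks cover only the values discussed above, not the full configuration reference.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def lint_index_config(props: dict, live_cluster: bool = True) -> list:
    """Return one warning per constraint the config violates."""
    prefix = "index.search.elasticsearch."
    warnings = []
    if props.get("index.search.backend") != "elasticsearch":
        warnings.append("index.search.backend must stay 'elasticsearch'")
    if props.get(prefix + "bulk-refresh", "false") != "false":
        warnings.append("bulk-refresh is not 'false'")
    if live_cluster and props.get("storage.batch-loading", "false") == "true":
        warnings.append("storage.batch-loading=true on a live cluster")
    if props.get(prefix + "ssl.enabled", "false") != "true":
        warnings.append("ssl.enabled is not 'true'")
    return warnings
```

Run it with `live_cluster=False` during a bulk-import window so the batch-loading flag is not reported as a problem.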

Index Synchronization Protocol

JanusGraph writes through a two-phase pattern. Phase one commits the mutation to the storage backend at the configured consistency level. Phase two queues the corresponding index mutation in a local transaction log and flushes it to OpenSearch asynchronously. The gap between phase-one acknowledgement and phase-two visibility is the replication window, and it is the origin of every stale-read incident on this seam. Deciding how tight that window may safely be is the Eventual vs Strong Consistency trade-off applied to the index side.

The total time from commit to searchable is the sum of the queue-drain latency, the bulk round-trip, and the refresh interval:

t_{visible} = t_{queue} + t_{bulk} + t_{refresh}
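
To make the sum concrete, here is a small worked calculation. The 200 ms queue drain and 150 ms bulk round-trip are illustrative assumptions, not measurements; the 30 s refresh term is the baseline refresh_interval taken at its worst case.

```python
def visibility_window_ms(queue_ms: int, bulk_ms: int, refresh_ms: int) -> int:
    """Worst-case commit-to-searchable time: t_queue + t_bulk + t_refresh."""
    return queue_ms + bulk_ms + refresh_ms


# Assumed figures: 200 ms queue drain, 150 ms bulk round-trip,
# and a full 30 s refresh interval as the worst case for t_refresh.
worst_case_ms = visibility_window_ms(200, 150, 30_000)
refresh_share = 30_000 / worst_case_ms

print(worst_case_ms)            # 30350
print(round(refresh_share, 3))  # 0.988
```

Under these assumptions the refresh term accounts for almost the entire window, so shaving queue or bulk latency does little until the refresh interval itself is addressed.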

With refresh_interval=30s, the refresh term dominates: a committed document can stay invisible for up to 30 seconds, plus queue and bulk time. During that window, read-after-write queries against the mixed index return stale results. Contain it with three moves:

Route time-sensitive lookups directly to storage — use g.V().hasId(...) or a has() predicate on a composite index rather than the mixed index, so the query never touches OpenSearch.
Issue an explicit _refresh in ingestion pipelines only after a batch completes, never per document, so you pay the refresh cost once per batch instead of per write.
Poll the lag directly. GET /_nodes/stats/indices/indexing exposes index_current and index_total; the JanusGraph side exposes queue depth through the org.janusgraph.diskstorage.indexing.IndexProvider JMX bean. When the producer rate exceeds the drain rate for more than one refresh interval, the queue is growing and drift is accumulating.
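
The second move, refreshing once per completed batch, can be sketched as below. This assumes a client object exposing `indices.refresh(index=...)`, as the opensearch-py client does, and a caller-supplied `bulk_fn` that sends one batch and returns the number of documents indexed; `ingest_with_batch_refresh` and the index name in the test are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def ingest_with_batch_refresh(
    client,
    index: str,
    batches: Iterable[Sequence[dict]],
    bulk_fn: Callable[[object, Sequence[dict]], int],
) -> int:
    """Send each batch, then refresh the index once that batch has completed."""
    total = 0
    for batch in batches:
        total += bulk_fn(client, batch)
        # One explicit refresh per finished batch, never one per document.
        client.indices.refresh(index=index)
    return total
```

If a whole run only needs to be visible at the end, moving the refresh call outside the loop reduces the cost further to one refresh per run.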

Sample the drain side on an interval and alert on sustained in-flight work rather than a single spike — index_current is the count of operations mid-flight, so a value that never returns to near-zero between refreshes is the signature of a queue outrunning the search cluster:

bash

# Two samples one refresh_interval apart; flag if in-flight work does not drain
prev=$(curl -sk -u janusgraph_svc:"$OPENSEARCH_SVC_PASSWORD" \
  https://opensearch-cluster-01:9200/_nodes/stats/indices/indexing \
  | jq '[.nodes[].indices.indexing.index_current] | add')
sleep 30   # one refresh_interval
curr=$(curl -sk -u janusgraph_svc:"$OPENSEARCH_SVC_PASSWORD" \
  https://opensearch-cluster-01:9200/_nodes/stats/indices/indexing \
  | jq '[.nodes[].indices.indexing.index_current] | add')
echo "in-flight: was ${prev}, now ${curr}"
# curr >= prev AND both non-trivial => drain is not keeping up; throttle the producer
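
The same decision can live in a scheduled Python worker. This is a minimal sketch that takes two parsed `_nodes/stats/indices/indexing` bodies and applies the rule from the comment above; the function names and the `floor` of 50 in-flight operations are assumptions to tune per cluster.

```python
def in_flight_total(nodes_stats: dict) -> int:
    """Sum indexing.index_current across every node in a _nodes/stats body."""
    return sum(
        node["indices"]["indexing"]["index_current"]
        for node in nodes_stats.get("nodes", {}).values()
    )


def drain_is_stalled(prev: int, curr: int, floor: int = 50) -> bool:
    """True when in-flight work did not shrink and both samples exceed floor."""
    return curr >= prev and min(prev, curr) > floor
```

Alert only after `drain_is_stalled` holds across several consecutive sample pairs, so a single burst does not page anyone.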

For the shard-alignment work that keeps this dispatch cheap, pin routing values as described in Mixed Index Routing — a scatter-gather query multiplies the bulk round-trip term across every shard and inflates the window.

Python Integration Pattern

Platform teams frequently bypass JanusGraph’s native dispatch for high-volume backfill and run the sync from a Python worker against the OpenSearch client directly. The pattern that survives production has three properties: deterministic _id derivation for idempotency, explicit retry with backoff on transport failures, and a partial-failure path that raises for reconciliation instead of silently dropping documents.

python

import logging
from opensearchpy import OpenSearch, helpers
from opensearchpy.exceptions import TransportError, ConnectionTimeout
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

logger = logging.getLogger(__name__)


def _doc_id(graph_element_id: str) -> str:
    """Deterministic _id straight from the JanusGraph identifier.

    Reusing the graph id as the document id makes every write a
    create-or-replace, so a replayed batch cannot duplicate documents.
    """
    return str(graph_element_id)


@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((TransportError, ConnectionTimeout)),
    reraise=True,  # surface the original exception instead of tenacity.RetryError
)
def sync_graph_mutations(client: OpenSearch, mutations: list[dict]) -> int:
    """Bulk-index graph mutations with idempotency and bounded retry.

    Raises on partial failure so the caller can trigger reconciliation
    rather than proceeding on an index that silently dropped documents.
    """
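    actions = []
    for m in mutations:
        action = {
            "_op_type": "index",          # create-or-replace = idempotent
            "_index": m["index"],
            "_id": _doc_id(m["graph_id"]),
            "_source": m["source"],
        }
        # Only set routing when one is supplied; a None value would be
        # serialized into the bulk action line as an explicit null.
        if m.get("routing") is not None:
            action["_routing"] = m["routing"]  # align with storage partition key
        actions.append(action)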

    try:
        success, errors = helpers.bulk(
            client,
            actions,
            chunk_size=1000,
            raise_on_error=True,
            raise_on_exception=True,
        )
        logger.info("Indexed %d graph mutations", success)
        return success
    except helpers.BulkIndexError as exc:
        # Document-level rejections: mapping conflicts, breaker trips.
        for item in exc.errors[:20]:
            logger.error("Bulk item rejected: %s", item)
        raise RuntimeError(
            "Partial bulk failure — trigger reconciliation"
        ) from exc

Production requirements for this pattern:

Use _op_type="index" for idempotent overwrite (create-or-replace). Switch to _op_type="update" with doc_as_upsert=True only when you need a partial-field merge instead of a full replace.
Derive _id from the JanusGraph vertex or edge identifier, never from a hash of the payload — the id must be stable across reruns so a replayed batch overwrites rather than duplicates.
Set _routing to the same value that keys the storage partition (tenant, label, or temporal bucket) so a document and its storage row land on aligned shards.
Treat BulkIndexError as a reconciliation trigger, not a log line. A document rejected for a mapping conflict is drift the moment it happens.
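
For the partial-field merge case in the first requirement, the bulk action takes a different shape. This is a minimal sketch of an update action with `doc_as_upsert`, using the same mutation dictionary keys as the pipeline above; `to_upsert_action` is a hypothetical helper name.

```python
def to_upsert_action(mutation: dict) -> dict:
    """Build a bulk update action that merges fields instead of replacing."""
    action = {
        "_op_type": "update",
        "_index": mutation["index"],
        "_id": str(mutation["graph_id"]),  # same deterministic id as the index path
        "doc": mutation["source"],         # fields to merge into the document
        "doc_as_upsert": True,             # create the document if it is missing
    }
    # Add routing only when the mutation carries one.
    if mutation.get("routing") is not None:
        action["_routing"] = mutation["routing"]
    return action
```

Replaying the same upsert is still idempotent for the fields it carries, but fields absent from `doc` keep whatever value is already indexed, which a full `index` replace would have removed.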

Connection Lifecycle & Pool Management

The OpenSearch Python client multiplexes bulk requests over an HTTP connection pool, and mis-sizing it is the most common cause of “index is slow” reports that are actually client-side starvation. Size the pool to the concurrency you actually drive, keep connections warm across batches, and cap retries so a wedged node fails fast instead of holding a worker hostage.

python

import os

from opensearchpy import OpenSearch, RequestsHttpConnection

# Read the service password from the environment, populated by the secret store.
SERVICE_PASSWORD = os.environ["OPENSEARCH_SVC_PASSWORD"]

client = OpenSearch(
    hosts=[
        {"host": "opensearch-cluster-01", "port": 9200},
        {"host": "opensearch-cluster-02", "port": 9200},
    ],
    http_auth=("janusgraph_svc", SERVICE_PASSWORD),
    use_ssl=True,
    verify_certs=True,
    # Pool sizing: one slot per concurrent bulk worker, plus headroom.
    # RequestsHttpConnection reads pool_maxsize; a bare maxsize is ignored by it.
    pool_maxsize=16,
    # Fail fast on a wedged node instead of blocking a worker.
    timeout=30,
    max_retries=3,
    retry_on_timeout=True,
    connection_class=RequestsHttpConnection,
)

Sizing and lifecycle rules:

Pool size follows worker concurrency, not node count. Set the pool size (pool_maxsize when using RequestsHttpConnection) to the number of concurrent bulk workers plus a small buffer. A pool smaller than your worker count serializes requests and looks exactly like index lag; a pool far larger wastes file descriptors and hides backpressure you want to feel.
Keep connections warm. The client reuses HTTP keep-alive connections across helpers.bulk calls, so construct one client per worker process and reuse it — do not build a fresh client per batch, which pays a TLS handshake every time.
Bound the retry budget. max_retries=3 with retry_on_timeout=True covers transient node blips; the application-level tenacity retry above covers transport exceptions. Layer them, but keep both bounded so a genuinely down cluster surfaces as a failure instead of an infinite stall.
Match the storage-side pool discipline. The same starvation model governs the CQL driver feeding graph mutations; sizing them together is the Connection Pooling concern, and an under-sized storage pool upstream will throttle the index pipeline before OpenSearch ever sees load.
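
The first two rules can be sketched as below. The client is built once per worker process and reused, and the pool is sized from worker concurrency; `client_for_this_process`, `pool_size`, and the default headroom of 2 are assumptions, and `factory` stands in for whatever function constructs your configured OpenSearch client.

```python
import os
from typing import Callable

# One cached client per OS process, keyed by pid so that a forked worker
# builds its own client instead of sharing the parent's connections.
_clients: dict = {}


def client_for_this_process(factory: Callable[[], object]) -> object:
    """Return this process's client, building it on first use."""
    pid = os.getpid()
    if pid not in _clients:
        _clients[pid] = factory()
    return _clients[pid]


def pool_size(workers: int, headroom: int = 2) -> int:
    """Pool slots = concurrent bulk workers plus a small buffer."""
    return workers + headroom
```

Every batch then calls `client_for_this_process(build_client)` rather than constructing a client, so the TLS handshake is paid once per process.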

Diagnostics & Operational Fallbacks

Network partitions, breaker trips, and JVM garbage-collection pauses cause drift: missing vertices, stale edge properties, or phantom documents left behind by a rolled-back storage transaction. The triage table maps the symptom you observe to the command that confirms it and the resolution that clears it.

Symptom	Diagnosis	Resolution
Read-after-write returns stale results	`GET /_nodes/stats/indices/indexing` shows `index_current` climbing; IndexProvider JMX queue depth rising	Producer outruns drain: throttle the ingestion worker, or lower `refresh_interval` for the affected index; route time-critical reads to a composite index on storage
Bulk requests return `429 Too Many Requests`	`GET /_nodes/stats/breaker` shows `tripped > 0` on the `parent` or `fielddata` breaker	Reduce `bulk-size`, back off the producer, raise node heap or `indices.memory.index_buffer_size`; confirm `max-retry-time` is long enough to ride out the trip
Document counts diverge from graph cardinality	Compare storage row counts against `GET /_cat/indices?v` doc counts and per-shard `_seq_no`	Reindex only the affected id range through the idempotent Python pipeline; verify with a follow-up count
Searches miss recently committed vertices after a partition	JanusGraph log shows index-mutation replay failures past the transaction-log horizon	Run a `REINDEX` through the Management API in a maintenance window with `bulk-refresh=false`; if the mapping is corrupt, drop and recreate the mixed index and full-sync from storage
Every bulk write rejected with a mapping error	`BulkIndexError` items report `mapper_parsing_exception`; `GET /<index>/_mapping` shows `dynamic` field creation	Enforce `"dynamic": "strict"` on the mapping so unregistered keys fail fast; reconcile the offending property key against the registered graph schema before replaying
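
The breaker check in the 429 row can be automated. This is a minimal sketch that reads a parsed `GET /_nodes/stats/breaker` body and reports every breaker with a non-zero `tripped` count; `tripped_breakers` is a hypothetical helper name.

```python
def tripped_breakers(breaker_stats: dict) -> dict:
    """Map node name -> {breaker name: tripped count} where tripped > 0."""
    result = {}
    for node in breaker_stats.get("nodes", {}).values():
        hits = {
            name: breaker["tripped"]
            for name, breaker in node.get("breakers", {}).items()
            if breaker.get("tripped", 0) > 0
        }
        if hits:
            result[node.get("name", "unknown")] = hits
    return result
```

The `tripped` counter is cumulative, so compare successive samples to tell a breaker that is tripping now from one that tripped in the past.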

Automate the count-comparison and repair as a scheduled worker rather than a manual runbook — the full detection-and-repair loop is documented in Resolving OpenSearch Index Drift in Production. Alert when divergence exceeds roughly 0.5% of total indexed cardinality, scan during low-traffic windows, and always reindex through the idempotent path so a repair pass can never itself create duplicates.
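
The alert rule can be expressed as a small predicate. This is a sketch under the assumption that the storage count and the indexed count are gathered elsewhere; `divergence`, `should_alert`, and the default threshold of 0.005 (the roughly 0.5% figure above) are illustrative.

```python
def divergence(storage_count: int, index_count: int) -> float:
    """Absolute count gap as a fraction of the indexed cardinality."""
    if index_count == 0:
        return 0.0 if storage_count == 0 else float("inf")
    return abs(storage_count - index_count) / index_count


def should_alert(storage_count: int, index_count: int,
                 threshold: float = 0.005) -> bool:
    """True when divergence exceeds the threshold (default 0.5%)."""
    return divergence(storage_count, index_count) > threshold
```

An empty index against a non-empty graph is treated as infinite divergence, so a dropped or never-built index always alerts.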

Standing operational discipline for this seam: keep ingestion pipelines and query workloads on separate resource quotas so a backfill cannot starve live reads, enforce a bounded OpenSearch bulk queue, and validate every schema change against the existing mixed-index mapping before it ships.

Frequently Asked Questions

Why does the backend value stay elasticsearch for an OpenSearch cluster? JanusGraph has no native opensearch backend. It talks to OpenSearch through its Elasticsearch-compatible index backend, so index.search.backend and every index.search.elasticsearch.* property key remain elasticsearch even when the target cluster is OpenSearch. Only the hostnames and security settings change.

How do I stop JanusGraph from mis-detecting the OpenSearch version? Set compatibility.override_main_response_version: true in opensearch.yml on every node so the search cluster reports Elasticsearch 7.10.2 to the JanusGraph REST client. Verify with curl .../ | jq '.version.number' before wiring the graph. A search cluster with the override applied unevenly across nodes mis-detects intermittently, so roll it everywhere and bake it into the node bootstrap template.

Should I set bulk-refresh=wait_for globally? No. Applied globally it serializes ingestion throughput behind index refresh and multiplies thread contention on the search cluster. Keep bulk ingestion on false and apply wait_for only to the narrow set of writes whose immediate visibility is a business requirement.

How do I stop a replayed backfill from duplicating documents? Derive the OpenSearch _id deterministically from the JanusGraph vertex or edge identifier and use _op_type="index". That makes every write a create-or-replace, so rerunning a batch overwrites the existing document instead of creating a second one.

How do I recover after a partition between JanusGraph and OpenSearch? JanusGraph tracks pending index mutations in its transaction log and replays them on reconnect. If drift persists beyond the log horizon, run a REINDEX through the Management API in a maintenance window with bulk-refresh=false; for a corrupt mapping, drop and recreate the mixed index and full-sync from storage.

Up a level: External Index Synchronization & Consistency Tuning — the parent reference for the storage-to-index boundary this page details.
Resolving OpenSearch Index Drift in Production — the automated detection-and-repair loop for the drift this page triages.
Elasticsearch Integration — the Elasticsearch equivalent of this wiring, including legacy transport-client notes.
Eventual vs Strong Consistency — choosing the acknowledgment boundary against workload SLAs.
Mixed Index Routing — shard alignment and predicate routing that keeps the dispatch cheap.
Connection Pooling — the driver pool sizing model that keeps starvation from looking like index lag.
Graph Schema Validation & Modeling — registering property keys so a mixed-index mapping can be pinned to dynamic: strict without rejecting live writes.