Elasticsearch Integration

Wiring Apache JanusGraph to an Elasticsearch mixed-index backend creates a specific failure surface: a graph commit succeeds, but the corresponding full-text document never becomes searchable, or becomes searchable seconds later than the application assumes. This page sits under External Index Synchronization & Consistency Tuning and covers the Elasticsearch half of that boundary: transport wiring, bulk-refresh semantics, the sync pipeline, connection pooling, and the triage path when index lag turns into stale reads. In production the integration is a dual-write architecture: graph mutations commit to the storage backend first, then propagate asynchronously to Elasticsearch through the mixed-index subsystem. Misaligned refresh intervals, transport-client routing mistakes, or unbounded bulk queues manifest as index lag, phantom documents, and transaction timeouts. Everything below assumes an on-call engineer who owns that seam and needs a path from a janusgraph.properties line to a resolution, not a tutorial.

Figure: the dispatch path from a graph mutation to a searchable document, including the bulk retry loop that absorbs 429 and 5xx responses. The forward path assembles and ships each _bulk request; a 429 or 5xx response diverts to the backoff loop and resubmits, and a write is searchable only after the index refresh.

Core Configuration & Consistency Tuning

JanusGraph delegates every mixed-index operation through the index.search.* configuration namespace. The legacy embedded transport client is deprecated for modern deployments because of classpath-isolation conflicts and JVM heap pressure inside the JanusGraph process; production systems must use the Elasticsearch Java REST client with explicit connection pooling and retry semantics. The block below is a hardened baseline — every non-default value changes a failure mode, so do not copy it blind.

properties

# janusgraph.properties

# --- Storage layer (authoritative writes) ---
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=graph_prod

# --- Index layer (eventual by design) ---
index.search.backend=elasticsearch
index.search.hostname=es-cluster-01.internal:9200,es-cluster-02.internal:9200,es-cluster-03.internal:9200
# REST client mode: connect to existing ES nodes, never instantiate an embedded index node.
index.search.elasticsearch.client-only=true

# --- Index creation settings (applied once, at index creation) ---
index.search.elasticsearch.create.ext.refresh_interval=1s
index.search.elasticsearch.create.ext.number_of_shards=3
index.search.elasticsearch.create.ext.number_of_replicas=1

# --- Bulk ingestion / visibility tuning ---
# wait_for = block the bulk request until the primary shard makes the write searchable.
index.search.elasticsearch.bulk-refresh=wait_for
index.search.elasticsearch.bulk-size=500

Numbered operational constraints for this block:

1. client-only=true is mandatory for REST client mode. JanusGraph connects to running Elasticsearch nodes rather than joining the Elasticsearch cluster as an embedded index node. Omit it and JanusGraph attempts to form a node, colliding with your cluster’s discovery and consuming heap inside the graph JVM.
2. refresh_interval, number_of_shards, and number_of_replicas are applied only at index creation. The create.ext.* prefix takes effect when the mixed index is first built; changing these values later in janusgraph.properties has no effect. refresh_interval and number_of_replicas are dynamic Elasticsearch settings and can be changed through the _settings API, while number_of_shards is static and requires rebuilding the index. Plan shard count against your write topology before the first mutation, since over-sharding inflates heap and slows recovery while under-sharding bottlenecks concurrent bursts.
3. bulk-refresh=wait_for trades added latency for read-your-writes visibility. The bulk request blocks until a refresh has made its writes searchable, so for those writes the refresh term of the replication window is already paid when the call returns. It adds roughly 15–40 ms per batch and multiplies thread contention on the search cluster; because wait_for waits for the next scheduled refresh rather than forcing one, an individual request can block for up to refresh_interval. Apply it deliberately, not reflexively (see the FAQ on setting it globally).
4. bulk-size caps documents per _bulk request. Keep the resulting payload under the Elasticsearch HTTP request-size limit (http.max_content_length, 100 MB by default) so the request is not rejected as too large; 500–1000 is the safe production band for typical vertex/edge documents.
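The constraints above lend themselves to a pre-deploy check. The following is a minimal sketch, not part of JanusGraph: `parse_properties` and `lint_index_config` are hypothetical helper names, the key names are the ones used in the block above, and treating an absent bulk-refresh as `false` is an assumption.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse key=value lines, skipping blank lines and # comments."""
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def lint_index_config(props: dict[str, str], index_name: str = "search") -> list[str]:
    """Return human-readable warnings for the constraints listed above."""
    es = f"index.{index_name}.elasticsearch."
    warnings: list[str] = []

    # bulk-refresh rule: any value other than false affects every bulk request.
    bulk_refresh = props.get(es + "bulk-refresh", "false")  # default is an assumption
    if bulk_refresh not in {"true", "false", "wait_for"}:
        warnings.append(f"bulk-refresh has unexpected value {bulk_refresh!r}")
    elif bulk_refresh != "false":
        warnings.append(f"bulk-refresh={bulk_refresh} applies to all bulk requests")

    # bulk-size rule: stay inside the 500-1000 band the text recommends.
    bulk_size = props.get(es + "bulk-size")
    if bulk_size is not None and not 500 <= int(bulk_size) <= 1000:
        warnings.append(f"bulk-size={bulk_size} is outside the 500-1000 band")

    # create.ext rule: these keys only matter before the index exists.
    if any(key.startswith(es + "create.ext.") for key in props):
        warnings.append("create.ext.* settings are ignored once the index exists")
    return warnings


sample = """
# --- Index layer ---
index.search.backend=elasticsearch
index.search.elasticsearch.create.ext.number_of_shards=3
index.search.elasticsearch.bulk-refresh=wait_for
index.search.elasticsearch.bulk-size=5000
"""
for warning in lint_index_config(parse_properties(sample)):
    print(warning)
```

Returning warnings rather than raising keeps the check usable in CI, where a team may want to allow `wait_for` on a dedicated read-after-write deployment while still flagging it for review.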

Because the storage and index consistency models are independent, bulk-refresh tunes only index visibility and says nothing about storage durability. The choice between acknowledging-before-indexing and blocking-until-searchable is analyzed against workload SLAs under Eventual vs Strong Consistency. For clusters requiring cross-datacenter parity or non-default shard assignment, align routing with your graph partitioning strategy per Mixed Index Routing.

Index Synchronization Protocol

JanusGraph relies on eventual consistency for mixed indexes by default. A commit returns as soon as the storage backend acknowledges; the index becomes searchable only after the replication window elapses. That window is the sum of three terms — the time a mutation waits in the index queue, the bulk transport time, and the Elasticsearch refresh interval:

W_{\text{drift}} = t_{\text{queue}} + t_{\text{bulk}} + t_{\text{refresh}}

With the default refresh_interval=1s, the refresh term alone can add up to roughly one second of visibility delay even when the queue and transport are idle. bulk-refresh=wait_for absorbs t_refresh into the request for the specific writes it decorates, which is why it is the correct lever for read-after-write paths and the wrong lever for bulk ingestion. Lucene segment merging and translog flushing work the same way in Elasticsearch and OpenSearch, so the same window model applies if you later migrate to OpenSearch Sync Patterns; index lifecycle management is where the engines differ.
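As a worked instance of the window formula, the sketch below adds the three terms for an idle and a loaded pipeline. The `DriftWindow` name and all numbers are illustrative assumptions, not measurements; `refresh_ms` is the worst-case wait for the next refresh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftWindow:
    """W_drift = t_queue + t_bulk + t_refresh, all in milliseconds."""
    queue_ms: float    # time a mutation waits in the index queue
    bulk_ms: float     # bulk transport time
    refresh_ms: float  # worst-case wait for the next index refresh

    @property
    def total_ms(self) -> float:
        return self.queue_ms + self.bulk_ms + self.refresh_ms


# Idle pipeline with refresh_interval=1s: the refresh term dominates.
idle = DriftWindow(queue_ms=0, bulk_ms=20, refresh_ms=1000)
# Loaded pipeline: queueing now contributes almost as much as the refresh.
loaded = DriftWindow(queue_ms=750, bulk_ms=120, refresh_ms=1000)

print(idle.total_ms)    # 1020
print(loaded.total_ms)  # 1870
```

Keeping the terms separate makes it clear which lever applies: producer throttling shrinks `queue_ms`, while `bulk-refresh=wait_for` only changes where `refresh_ms` is paid.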

Sync polling and lag metrics. Do not infer visibility from wall-clock guesses — measure it. Poll these signals to keep the window observable:

Elasticsearch /_nodes/stats/indices/indexing: the growth rate of index_time_in_millis (relative to index_total) is the leading indicator of a widening window. Alert when its slope exceeds your ingestion SLA.
Elasticsearch /_cat/indices?v: per-index document counts, compared against storage-side counts to detect drift.
The JanusGraph org.janusgraph.diskstorage.indexing.IndexProvider JMX bean: index queue size and bulk-flush duration. A monotonically rising queue is producer backpressure; a rising flush duration is the search cluster falling behind.

When a read genuinely depends on a just-committed write, gate that single path with bulk-refresh=wait_for or issue an explicit _refresh at the transaction boundary — never widen the policy across the whole pipeline to fix one read. Consult the official Elasticsearch Bulk API documentation for payload limits and error-response formats when you set retry boundaries.

Python Integration Pattern

Direct JanusGraph-to-Elasticsearch writes suffice for low-throughput workloads. Production pipelines that consume a mutation stream (CDC feed or transaction journal) need explicit backpressure handling, idempotent document generation, and failure isolation. The worker below submits chunked bulk requests, derives deterministic document IDs from graph identifiers so a replayed batch updates rather than duplicates, and retries transient 429/5xx responses with exponential backoff. The retry decorator wraps the transport call; the fallback logic re-raises BulkIndexError so a partial failure triggers reconciliation instead of silently dropping documents.

python

import logging
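from elasticsearch import ConnectionError as ESConnectionError
from elasticsearch import ConnectionTimeout, Elasticsearch, helpers
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

logger = logging.getLogger("janusgraph_sync")

ES_CLIENT = Elasticsearch(
    hosts=["https://es-cluster-01.internal:9200"],
    api_key=("YOUR_API_KEY_ID", "YOUR_API_KEY_SECRET"),
    retry_on_timeout=True,
    max_retries=3,
    verify_certs=True
)

CHUNK_SIZE = 500

# Retry on the client's own transport exceptions: the Python built-in ConnectionError
# and TimeoutError are unrelated classes and never match what the client raises.
@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((ESConnectionError, ConnectionTimeout, helpers.BulkIndexError)),
    reraise=True,  # after the last attempt, surface the original error instead of tenacity.RetryError
)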
def push_to_index(doc_batch: list[dict]) -> bool:
    try:
        success, errors = helpers.bulk(
            ES_CLIENT,
            doc_batch,
            chunk_size=CHUNK_SIZE,
            raise_on_error=True,
            raise_on_exception=True,
            # wait_for suits read-after-write paths; drop it for bulk backfills (see the FAQ).
            refresh="wait_for"
        )
        logger.info(f"Indexed {success} documents successfully")
        return True
    except helpers.BulkIndexError as e:
        # Document-level rejection (mapping conflict, circuit breaker) — reconcile, do not drop.
        logger.error(f"Bulk indexing failed: {e.errors}")
        raise
    except Exception as e:
        logger.error(f"Unexpected transport error: {e}")
        raise

def process_mutation_stream(mutation_queue: list[dict]):
    batch = []
    for mutation in mutation_queue:
        # Deterministic _id from graph identifiers enforces idempotent overwrite semantics.
        doc_id = f"{mutation['graph_id']}_{mutation['element_id']}"
        batch.append({
            "_op_type": "index",
            "_index": "janusgraph_mixed",
            "_id": doc_id,
            "_routing": mutation["partition_key"],
            "_source": mutation["payload"]
        })

        if len(batch) >= CHUNK_SIZE:
            push_to_index(batch)
            batch.clear()

    if batch:
        push_to_index(batch)

Rules that keep this pipeline aligned with storage:

Derive _id from the JanusGraph vertex/edge identifier so a retried batch overwrites in place. Non-deterministic IDs turn every retry into a duplicate document, which surfaces later as phantom search hits.
Chunked submission prevents heap exhaustion; keep chunk_size in step with the bulk-size you set in janusgraph.properties.
Set _routing to the storage partition key so related documents co-locate on the same shard, which cuts query fan-out — the routing decision logic lives in Mixed Index Routing.
For the full end-to-end workflow, including index definition and a backfill script, follow Syncing JanusGraph with Elasticsearch Step by Step.
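The reconciliation step mentioned above needs to tell transient rejections from permanent ones. The sketch below assumes each entry in `BulkIndexError.errors` is a dict keyed by operation type whose value carries `_id` and `status`; `split_bulk_errors` is a hypothetical helper, and the sample errors are hand-written.

```python
def is_retryable(status) -> bool:
    """429 and 5xx are treated as transient; everything else as permanent."""
    return isinstance(status, int) and (status == 429 or status >= 500)


def split_bulk_errors(errors: list[dict]) -> tuple[list[str], list[str]]:
    """Return (retryable_ids, permanent_ids) from per-document bulk errors."""
    retryable: list[str] = []
    permanent: list[str] = []
    for item in errors:
        # Each item has a single key naming the operation: index, create, update, delete.
        for detail in item.values():
            target = retryable if is_retryable(detail.get("status")) else permanent
            target.append(detail.get("_id"))
    return retryable, permanent


sample_errors = [
    {"index": {"_id": "g1_10", "status": 429, "error": {"type": "es_rejected_execution_exception"}}},
    {"index": {"_id": "g1_11", "status": 400, "error": {"type": "mapper_parsing_exception"}}},
    {"index": {"_id": "g1_12", "status": 503, "error": {"type": "unavailable_shards_exception"}}},
]

retry_ids, dead_ids = split_bulk_errors(sample_errors)
print(retry_ids)  # ['g1_10', 'g1_12']
print(dead_ids)   # ['g1_11']
```

Because document IDs are deterministic, resubmitting only the retryable IDs is safe, and the permanent ones can go to a dead-letter store for inspection instead of being retried until the budget runs out.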

Connection Lifecycle & Pool Management

The REST client holds a pool of persistent HTTP connections to the Elasticsearch nodes, and a mis-sized pool is indistinguishable from index lag until you inspect it. Size and bound the pool explicitly rather than accepting client defaults.

properties

# Fail fast on a partition rather than pinning worker threads on dead sockets.
index.search.elasticsearch.http.connection-timeout=10000
index.search.elasticsearch.http.socket-timeout=60000

# Cap concurrent bulk requests so a write burst cannot exhaust the ES write pool.
index.search.elasticsearch.http.max-connections=50
index.search.elasticsearch.http.max-connections-per-route=20

# Retry budget for transient index-commit failures (ms).
index.search.elasticsearch.max-retry-time=300000

Sizing and lifecycle rules:

Size max-connections to your write concurrency, not higher. An oversized pool lets a burst open enough sockets to saturate the Elasticsearch write thread pool, converting producer pressure into cluster-wide rejections. max-connections-per-route exists so one hot shard route cannot consume the entire budget.
Set connection-timeout low (10 s) and socket-timeout high (60 s). Fast connection failure sheds load during a partition; a generous socket timeout tolerates a slow but healthy bulk flush without spurious retries.
Bound the retry budget. max-retry-time should expire before your upstream transaction timeout, so a dead index node surfaces as a clean failure the pipeline can reconcile rather than an indefinite stall.
Watch for pool starvation masquerading as lag. A starved graph-side driver pool throws TimeoutException that looks exactly like index lag; the sizing model for the storage-side driver is covered under Connection Pooling. Keep the storage keyspace Replication Strategies aligned with the index topology so a replication mismatch does not surface as one-sided stale reads.
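For pipelines that push bulk requests from their own worker threads, the same cap can be enforced on the producer side. This is a minimal sketch using a semaphore; `BoundedBulkSender` and `fake_send` are hypothetical names, and `fake_send` stands in for a real bulk call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedBulkSender:
    """Caps in-flight bulk requests regardless of how many worker threads exist."""

    def __init__(self, send, max_in_flight: int):
        self._send = send
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak_in_flight = 0  # observed high-water mark, useful as a metric

    def submit(self, batch):
        with self._slots:  # blocks the calling worker once the cap is reached
            with self._lock:
                self._in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
            try:
                return self._send(batch)
            finally:
                with self._lock:
                    self._in_flight -= 1


def fake_send(batch):
    """Hypothetical stand-in for the real bulk call; sleeps to simulate latency."""
    time.sleep(0.01)
    return len(batch)


sender = BoundedBulkSender(fake_send, max_in_flight=2)
batches = [[{"n": i}] * 3 for i in range(12)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(sender.submit, batches))

print(sum(results))  # 36
print(sender.peak_in_flight <= 2)  # True
```

Blocking in `submit` is a deliberate choice here: it pushes backpressure onto the producer instead of letting a burst open more concurrent requests than the search cluster can absorb.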

Diagnostics & Operational Fallbacks

The four failure modes below account for most Elasticsearch-integration incidents. Each row is symptom → diagnosis → resolution so an on-call engineer can act without a design discussion.

| Symptom | Diagnose | Resolve |
| --- | --- | --- |
| Recent writes missing from full-text results | `GET /_nodes/stats/indices/indexing` shows rising `index_time_in_millis`; `IndexProvider` queue climbing | Replication window is stretching under load; throttle the producer, or apply `bulk-refresh=wait_for` only to the writes that need read-your-writes |
| `EsRejectedExecutionException` / bulk rejections | `/_cat/thread_pool/write?v` shows non-zero `rejected`; `queue` above 80% of `thread_pool.write.queue_size` | Lower `bulk-size` and `max-connections`; add producer-side backpressure; scale index write threads before retrying into a saturated cluster |
| Phantom documents outlive deleted graph elements | `getIndexStatus(key)` not `ENABLED`; index doc count exceeds storage row count | Reindex via Management API `SchemaAction.REINDEX`; if corrupted, drop/recreate the mixed index and full-sync from storage |
| Traversal timeouts during ingestion bursts | `nodetool tpstats` clean but driver throws `TimeoutException`; pool at 100% | Starved driver pool, not index lag; resize per the Connection Pooling model and cap batch concurrency |
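The 80% queue threshold in the bulk-rejection row can be evaluated mechanically before each flush. The sketch below assumes the rows come from `/_cat/thread_pool/write` requested in JSON form with `queue` and `queue_size` columns, which arrive as strings; `should_shed` is a hypothetical helper and the sample rows are hand-written.

```python
def should_shed(rows: list[dict], threshold: float = 0.8) -> bool:
    """True when any node's write queue is above the threshold share of its capacity."""
    for row in rows:
        queue = int(row["queue"])
        capacity = int(row["queue_size"])
        # A non-positive capacity means the queue is unbounded; skip that node.
        if capacity > 0 and queue > threshold * capacity:
            return True
    return False


healthy = [
    {"node_name": "es-1", "queue": "120", "queue_size": "10000", "rejected": "0"},
    {"node_name": "es-2", "queue": "300", "queue_size": "10000", "rejected": "0"},
]
saturated = [
    {"node_name": "es-1", "queue": "120", "queue_size": "10000", "rejected": "0"},
    {"node_name": "es-2", "queue": "9100", "queue_size": "10000", "rejected": "42"},
]

print(should_shed(healthy))    # False
print(should_shed(saturated))  # True
```

Checking per node rather than on the cluster average matters because a single hot node rejects writes long before the average crosses the threshold.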

Additional guardrails that keep the integration inside its operating envelope:

Circuit-break on bulk saturation. Buffer or shed mutations when /_cat/thread_pool/write?v shows queue above 80% of thread_pool.write.queue_size, rather than retrying into an Elasticsearch cluster that is already rejecting.
Resolve split-brain deterministically. Use version_type=external with a graph-side monotonic version so out-of-order deliveries cannot resurrect stale documents.
Tune the index buffer for write-heavy nodes. Set indices.memory.index_buffer_size to 15–20% of node heap during sustained ingestion.
Reject schema drift at ingestion. Enforce "dynamic": "strict" on mixed-index mappings so an unregistered property key fails fast instead of silently creating a new field.
Retain transaction logs for recovery. Archive stale logs only after a successful sync, and keep at least 24 hours of history to support point-in-time replay. JanusGraph replays queued index mutations from its transaction log on reconnect after a partition; if drift persists beyond the log horizon, run a REINDEX during a maintenance window. For the storage-side consistency benchmarks that determine how tight the index window can safely be, review ScyllaDB Migration.
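The external-versioning guardrail above can be sketched as a bulk action that carries a graph-side version. The `version` field on the mutation is a hypothetical monotonic counter supplied by the pipeline, and `apply_external` is only an in-memory simulation of the rule that an externally versioned write is accepted when its version is strictly higher than the stored one.

```python
def build_versioned_action(mutation: dict, index: str = "janusgraph_mixed") -> dict:
    """Bulk action that lets the graph-side version decide which write wins."""
    return {
        "_op_type": "index",
        "_index": index,
        "_id": f"{mutation['graph_id']}_{mutation['element_id']}",
        "_version": mutation["version"],  # hypothetical monotonic graph-side counter
        "_version_type": "external",
        "_source": mutation["payload"],
    }


def apply_external(store: dict, action: dict) -> bool:
    """In-memory simulation: accept only a strictly higher version."""
    current = store.get(action["_id"])
    if current is not None and action["_version"] <= current["_version"]:
        return False  # the real cluster would answer with a version conflict
    store[action["_id"]] = action
    return True


store: dict = {}
newer = build_versioned_action(
    {"graph_id": "g1", "element_id": 42, "version": 7, "payload": {"name": "new"}})
stale = build_versioned_action(
    {"graph_id": "g1", "element_id": 42, "version": 6, "payload": {"name": "old"}})

print(apply_external(store, newer))  # True
print(apply_external(store, stale))  # False
print(store["g1_42"]["_source"])     # {'name': 'new'}
```

With this scheme a version conflict on a late delivery is an expected outcome, so the reconciliation path should count it as a successful no-op and not retry it.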

Frequently Asked Questions

Do I still need the embedded transport client for Elasticsearch? No. The embedded transport client is deprecated for modern deployments because of classpath-isolation conflicts and heap pressure inside the JanusGraph JVM. Use the REST client by setting index.search.elasticsearch.client-only=true and pointing index.search.hostname at your running Elasticsearch nodes.

Should I set bulk-refresh=wait_for globally? No. Applied globally it serializes throughput behind index refresh and multiplies thread contention on the search cluster. Apply it selectively to the small set of writes whose immediate visibility is a business requirement, and leave bulk ingestion on the default.

Why can I no longer change the shard count in janusgraph.properties? Because create.ext.number_of_shards and its siblings apply only at index creation. After the mixed index exists, the property is inert. Change shard/replica counts through the Elasticsearch _settings API where allowed, or rebuild the index during a maintenance window.

How do I recover after a network partition between JanusGraph and Elasticsearch? JanusGraph tracks pending index mutations in its transaction log and replays queued updates on reconnect. If drift persists beyond the log horizon, run a REINDEX through the Management API in a maintenance window with bulk-refresh=false; for corruption, drop and recreate the mixed index and full-sync from storage.

Up a level: External Index Synchronization & Consistency Tuning — the parent reference for the storage-to-index boundary this page details.
Syncing JanusGraph with Elasticsearch Step by Step — the end-to-end setup and backfill workflow.
OpenSearch Sync Patterns — the version-aware equivalent for OpenSearch backends.
Eventual vs Strong Consistency — choosing the acknowledgment boundary against workload SLAs.
Mixed Index Routing — shard alignment and predicate routing to prevent hot shards.
Connection Pooling — the driver pool sizing model that keeps starvation from looking like index lag.