Automating Property Index Collision Resolution

This page gives you a repeatable, automated procedure that detects conflicting JanusGraph index definitions on a single property key and reconciles them to one canonical definition. The goal is to prevent the SchemaViolationException at mgmt.commit() that stalls an ingestion pipeline when two deployments register incompatible indexes against the same key. It is the deterministic detection-and-remediation loop referenced from the parent Property Indexing Rules cluster, and it assumes you have already used that reference to settle which index type each predicate should route to before automating any repair.

A collision is not a clean crash you can retry through. It leaves the schema in a mixed state (one index ENABLED, a colliding one stuck at INSTALLED or REGISTERED), and a naive retry loop force-commits over the inconsistency instead of resolving it. The procedure below halts, detects the specific divergence, disables and then removes the losing index in two separately committed transactions, registers the canonical definition, and only then confirms the key is queryable again.

The colliding index is torn down across two committed transactions before the canonical index is rebuilt and backfilled to ENABLED — the key is never left serving from more than one index.

Prerequisites

Confirm every item before running the automation — a partial precondition is how a repair pass becomes a second outage.

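JanusGraph 0.6.x or 1.0.x with the CQL storage backend and a mixed-index backend wired (the elasticsearch backend value, which can target Elasticsearch or OpenSearch). The management API calls below are written against both release lines; confirm them, the index-removal calls in particular, against the documentation for the exact release you run.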
Gremlin Server reachable over HTTP at a known host, with the /gremlin endpoint accepting Groovy scripts. The orchestrator talks to that endpoint, not to a Gremlin console.
A service account with schema-management privileges — the ability to open management transactions, call updateIndex, removeIndex, and buildIndex, and commit them. A read-only Gremlin user cannot mutate schema.
A written canonical definition for the key under repair: exact index name, Vertex.class or Edge.class, the property key, its data type and cardinality, and the target index backend. Automation without a single source of truth just picks a winner arbitrarily.
Ingestion writers you can pause. You must be able to quiesce writes against the affected key for the duration of the repair; resolving a collision under live mutation races the removal against new documents. Settle the pause mechanics against your Connection Pooling drain policy so in-flight traversals finish before you cut the index.
Python 3.9+ with the requests library on the runner that executes the orchestrator.
A recent storage repair on ScyllaDB-backed clusters (nodetool repair) so the canonical index is not rebuilt from an under-replicated view — the consistency envelope is bounded by your ScyllaDB Migration benchmarks.
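
The written canonical definition from the list above can be kept as a small immutable record that the orchestrator loads instead of hard-coding values. This is a minimal sketch: the CanonicalIndex class and its field names are hypothetical, not part of JanusGraph, and the String data type and SINGLE cardinality for user_id are assumed purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalIndex:
    """Single source of truth for one property key's index (hypothetical shape)."""

    name: str          # exact index name
    element: str       # "Vertex" or "Edge"
    property_key: str  # key the index covers
    data_type: str     # simple class name of the key's data type
    cardinality: str   # "SINGLE", "SET", or "LIST"
    backend: str       # index backend name passed to buildMixedIndex

    def __post_init__(self) -> None:
        # Reject definitions that could never match a real schema element.
        if self.element not in ("Vertex", "Edge"):
            raise ValueError(f"element must be Vertex or Edge, got {self.element!r}")
        if self.cardinality not in ("SINGLE", "SET", "LIST"):
            raise ValueError(f"unknown cardinality {self.cardinality!r}")


# Assumed example values; replace them with your own written definition.
CANONICAL = CanonicalIndex(
    name="canonical_user_id_idx",
    element="Vertex",
    property_key="user_id",
    data_type="String",
    cardinality="SINGLE",
    backend="search",
)
```

Freezing the record means a repair run cannot quietly adjust the definition it is supposed to enforce.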

Collision Vectors

Automated detection has to classify the divergence, because the three vectors resolve differently:

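Type mismatch. An index registered against a String.class key conflicts with a pipeline that expects the same key as Integer.class, or that expects the String key to be indexed with a different mixed-index mapping (full-text Mapping.TEXT versus exact-match Mapping.STRING). The data type and mapping must match the canonical definition exactly; a TEXT/STRING mapping mismatch silently changes which predicates route to the index.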
Cardinality conflict. SINGLE versus SET or LIST cardinality declared on the same property key. This is settled upstream during Vertex and Edge Validation; a collision here means two deployments disagree on the key’s cardinality contract.
Backend assignment divergence. The same index name built as a mixed index against different backends across environments, or a composite index in one deployment colliding with a mixed index of the same name in another.
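
One way to make that classification mechanical is to compare what a deployment declares against the canonical definition, field by field, and map each disagreement to its vector. The sketch below assumes both definitions are plain dicts with hypothetical field names (data_type, cardinality, backend, index_kind); none of these names come from JanusGraph.

```python
from enum import Enum
from typing import Dict, Set


class Vector(Enum):
    TYPE_MISMATCH = "type mismatch"
    CARDINALITY_CONFLICT = "cardinality conflict"
    BACKEND_DIVERGENCE = "backend assignment divergence"


# Hypothetical definition fields mapped to the vector a disagreement signals.
FIELD_TO_VECTOR = {
    "data_type": Vector.TYPE_MISMATCH,
    "cardinality": Vector.CARDINALITY_CONFLICT,
    "backend": Vector.BACKEND_DIVERGENCE,
    "index_kind": Vector.BACKEND_DIVERGENCE,  # composite vs. mixed under one name
}


def classify(declared: Dict[str, str], canonical: Dict[str, str]) -> Set[Vector]:
    """Return every collision vector on which `declared` departs from `canonical`."""
    return {
        vector
        for field, vector in FIELD_TO_VECTOR.items()
        if declared.get(field) != canonical.get(field)
    }
```

An empty result means the two definitions agree on every compared field; more than one vector in the result means the repair has to address each of them.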

Step-by-Step Procedure

Step 1 — Detect active collisions

Query the JanusGraphManagement interface directly and return a structured report your orchestrator can parse. Run this Groovy diagnostic against the Gremlin Server; it opens a management transaction, only reads from it, and rolls it back, so it never mutates schema.

groovy

mgmt = graph.openManagement()
targetKey = mgmt.getPropertyKey('user_id')
if (targetKey == null) { throw new IllegalArgumentException("Property key 'user_id' not found in schema") }

// Inspect graph (composite/mixed) indexes that include the target key
collisionReport = []
for (idx in mgmt.getGraphIndexes(Vertex.class)) {
    if (!idx.getFieldKeys().contains(targetKey)) { continue }
    collisionReport.add([
        name: idx.name(),
        backend: idx.getBackingIndex(),
        status: idx.getIndexStatus(targetKey).toString(),
        unique: idx.isUnique(),
        keyTypes: idx.getFieldKeys().collect { it.name() + ":" + it.dataType().getSimpleName() }
    ])
}
mgmt.rollback()
collisionReport

Any index where status != "ENABLED", or where backend, keyTypes, or unique diverges from your canonical definition, is a collision requiring remediation.
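
The rule above can be applied to the parsed report in a few lines. This is a sketch that assumes the report has already been decoded into plain Python dicts with the same keys the Groovy diagnostic emits, and that the canonical definition is a dict carrying backend, keyTypes, and unique; the function name find_collisions is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def find_collisions(
    report: List[Dict[str, Any]], canonical: Dict[str, Any]
) -> List[Tuple[str, List[str]]]:
    """Return (index name, reasons) for each report entry that needs remediation."""
    collisions = []
    for entry in report:
        reasons = []
        if entry["status"] != "ENABLED":
            reasons.append(f"status={entry['status']}")
        if entry["backend"] != canonical["backend"]:
            reasons.append("backend")
        # Key order is not significant, so compare sorted copies.
        if sorted(entry["keyTypes"]) != sorted(canonical["keyTypes"]):
            reasons.append("keyTypes")
        if entry["unique"] != canonical["unique"]:
            reasons.append("unique")
        if reasons:
            collisions.append((entry["name"], reasons))
    return collisions
```

Returning the reasons alongside the name keeps the orchestrator's log explicit about why an index was selected for teardown.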

Step 2 — Pause writers against the key

Quiesce every ingestion pipeline that writes the affected property key before mutating schema. Resolving a collision while writes land against the losing index leaves orphaned documents in the search cluster. Drain in-flight traversals rather than killing them mid-transaction.
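
Draining rather than killing can be pictured as a gate that stops admitting new writes and then waits for the in-flight ones to finish. The sketch below is a single-process illustration only: WriterGate is a hypothetical helper, and a real pipeline spread across several processes or hosts would need the equivalent coordination at the scheduler or queue level.

```python
import threading
from contextlib import contextmanager


class WriterGate:
    """Blocks new writes while paused and lets in-flight ones finish (hypothetical)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._paused = False
        self._in_flight = 0

    @contextmanager
    def write(self):
        """Wrap each write against the affected key in this context manager."""
        with self._cond:
            while self._paused:
                self._cond.wait()  # new writes queue here during the repair
            self._in_flight += 1
        try:
            yield
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def pause_and_drain(self, timeout: float) -> bool:
        """Stop admitting writes, then wait for in-flight ones. False on timeout."""
        with self._cond:
            self._paused = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)

    def resume(self) -> None:
        with self._cond:
            self._paused = False
            self._cond.notify_all()
```

The repair would call pause_and_drain before Step 3 and resume only after verification passes; a False return means writes are still in flight and the schema should not be touched yet.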

Step 3 — Run the resolution orchestrator

The Python orchestrator connects to the Gremlin Server, identifies conflicting indexes, disables and removes each one, registers the canonical definition, and commits. It uses the /gremlin HTTP endpoint for reliable management-API execution in CI/CD contexts, with exponential backoff and explicit error handling. Note that JanusGraph requires the DISABLE_INDEX action and the subsequent removeIndex to land in two separate committed transactions — the orchestrator honours that boundary.

python

import requests
import time
import sys
from typing import List, Dict, Any

GREMLIN_SERVER = "http://janusgraph-gremlin:8182"
CANONICAL_INDEX = "canonical_user_id_idx"
PROPERTY_KEY = "user_id"
INDEX_BACKEND = "search"

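def unwrap_graphson(value: Any) -> Any:
    """Recursively strip GraphSON v2/v3 type wrappers into plain Python values."""
    if isinstance(value, dict) and "@type" in value and "@value" in value:
        inner = value["@value"]
        if value["@type"] == "g:Map":
            # GraphSON v3 encodes a map as a flat [k1, v1, k2, v2, ...] list
            return {
                unwrap_graphson(inner[i]): unwrap_graphson(inner[i + 1])
                for i in range(0, len(inner), 2)
            }
        return unwrap_graphson(inner)
    if isinstance(value, dict):
        return {k: unwrap_graphson(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_graphson(v) for v in value]
    return value

def submit_gremlin(script: str, retries: int = 3) -> Any:
    """Execute a Groovy management script via the Gremlin Server HTTP endpoint."""
    payload = {"gremlin": script}
    for attempt in range(retries):
        try:
            resp = requests.post(
                f"{GREMLIN_SERVER}/gremlin",
                json=payload,
                timeout=30,
                headers={"Content-Type": "application/json"},
            )
            resp.raise_for_status()
            # The Gremlin Server HTTP API wraps results in result.data; nested
            # maps and lists may each carry their own GraphSON type wrapper.
            data = resp.json().get("result", {}).get("data")
            return unwrap_graphson(data)
        except requests.exceptions.RequestException as e:
            if attempt == retries - 1:
                raise RuntimeError(
                    f"Gremlin Server execution failed after {retries} attempts: {e}"
                ) from e
            # Exponential backoff: 1 s, 2 s, 4 s, ...
            time.sleep(2 ** attempt)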

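def resolve_collision() -> None:
    diagnostic_script = f"""
    mgmt = graph.openManagement()
    key = mgmt.getPropertyKey('{PROPERTY_KEY}')
    if (key == null) {{
        mgmt.rollback()
        throw new IllegalArgumentException("Property key '{PROPERTY_KEY}' not found in schema")
    }}
    def report = []
    for (idx in mgmt.getGraphIndexes(Vertex.class)) {{
        if (!idx.getFieldKeys().contains(key)) {{ continue }}
        report << [name: idx.name(), status: idx.getIndexStatus(key).toString()]
    }}
    mgmt.rollback()
    report
    """
    current_indexes = submit_gremlin(diagnostic_script)
    conflicting = [i for i in current_indexes if i["status"] != "ENABLED"]

    if not conflicting:
        print("No collisions detected.")
        return

    # If the canonical index is already ENABLED it survives the cleanup and
    # must not be registered a second time.
    canonical_ok = any(
        i["name"] == CANONICAL_INDEX and i["status"] == "ENABLED"
        for i in current_indexes
    )

    for entry in conflicting:
        name = entry["name"]
        # DISABLE and REMOVE must be two separate committed transactions, and the
        # DISABLED status has to reach every instance before the removal runs.
        removal_script = f"""
        import java.time.temporal.ChronoUnit
        import org.janusgraph.core.schema.SchemaAction
        import org.janusgraph.core.schema.SchemaStatus
        import org.janusgraph.graphdb.database.management.ManagementSystem

        mgmt = graph.openManagement()
        idx = mgmt.getGraphIndex('{name}')
        if (idx == null) {{
            mgmt.rollback()
        }} else {{
            mgmt.updateIndex(idx, SchemaAction.DISABLE_INDEX).get()
            mgmt.commit()
            // Wait (inside the 30 s HTTP timeout) for all instances to acknowledge
            status = ManagementSystem.awaitGraphIndexStatus(graph, '{name}').status(SchemaStatus.DISABLED).timeout(20, ChronoUnit.SECONDS).call()
            if (!status.getSucceeded()) {{
                throw new IllegalStateException("Index '{name}' did not reach DISABLED")
            }}
            mgmt = graph.openManagement()
            mgmt.removeIndex(mgmt.getGraphIndex('{name}'))
            mgmt.commit()
        }}
        """
        submit_gremlin(removal_script)
        print(f"Removed conflicting index: {name}")

    if canonical_ok:
        print(f"Canonical index already ENABLED: {CANONICAL_INDEX}")
        return

    # addKey(key) uses the backend's default mapping; add the mapping parameter
    # your canonical definition specifies if it differs.
    registration_script = f"""
    mgmt = graph.openManagement()
    key = mgmt.getPropertyKey('{PROPERTY_KEY}')
    mgmt.buildIndex('{CANONICAL_INDEX}', Vertex.class).addKey(key).buildMixedIndex('{INDEX_BACKEND}')
    mgmt.commit()
    """
    submit_gremlin(registration_script)
    print(f"Registered canonical index: {CANONICAL_INDEX}")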

if __name__ == "__main__":
    try:
        resolve_collision()
    except Exception as e:
        print(f"Resolution failed: {e}", file=sys.stderr)
        sys.exit(1)

Step 4 — Backfill and enable the canonical index

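Because the property key already exists, a freshly built mixed index starts at INSTALLED and moves to REGISTERED once every instance has acknowledged it; it does not serve existing data until a reindex backfills it. Register the index, wait for REGISTERED, then trigger the backfill through the management API and let the index transition REGISTERED → ENABLED.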

groovy

import org.janusgraph.core.schema.SchemaAction
import org.janusgraph.graphdb.database.management.ManagementSystem

mgmt = graph.openManagement()
idx = mgmt.getGraphIndex('canonical_user_id_idx')
mgmt.updateIndex(idx, SchemaAction.REGISTER_INDEX).get()
mgmt.commit()

ManagementSystem.awaitGraphIndexStatus(graph, 'canonical_user_id_idx').call()

mgmt = graph.openManagement()
idx = mgmt.getGraphIndex('canonical_user_id_idx')
mgmt.updateIndex(idx, SchemaAction.REINDEX).get()
mgmt.commit()
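
On the orchestrator side, the wait for ENABLED can be expressed as a bounded polling loop rather than an open-ended sleep. The sketch below is deliberately transport-agnostic: fetch_status is a hypothetical callable you supply (for example, one that submits a status query through the Step 3 helper and returns the status string), and the timeout and interval defaults are assumptions, not recommended values.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    target: str = "ENABLED",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the index reports `target`; return False if the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False return should route to the fallback for a stalled reindex rather than to a blind retry.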

Verification Commands

Confirm the repair before resuming writers. Do not trust the orchestrator’s exit code alone — verify the observable schema state.

Check the canonical index reached ENABLED and no colliding index remains:

groovy

mgmt = graph.openManagement()
mgmt.printIndexes()
key = mgmt.getPropertyKey('user_id')
println mgmt.getGraphIndex('canonical_user_id_idx').getIndexStatus(key)
mgmt.rollback()

The status line must read ENABLED. Re-run the Step 1 diagnostic and assert the report contains exactly one index for the key — the canonical one — with matching backend, keyTypes, and unique values:

python

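# STEP1_DIAGNOSTIC is a placeholder name for the Step 1 Groovy script held as a
# string; submit_gremlin is the helper defined in the Step 3 orchestrator.
report = submit_gremlin(STEP1_DIAGNOSTIC)
assert len(report) == 1, f"expected 1 index for user_id, found {len(report)}"
assert report[0]["name"] == "canonical_user_id_idx", f"unexpected index: {report[0]}"
assert report[0]["status"] == "ENABLED", f"index not enabled: {report[0]}"
print("collision resolved: single ENABLED canonical index confirmed")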

Finally, confirm a predicate on the key routes to an index step rather than a full scan by profiling a representative traversal:

groovy

g.V().has('user_id', 'u-1001').profile()

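The profile output must show the index being used; a JanusGraphStep with no index name means the query still full-scans and the repair is incomplete. If user_id is a String key, check the mapping before concluding the repair failed: a mixed index built with the default mapping indexes strings as full text and answers textContains-style predicates, not the equality predicate above, so an equality lookup only routes to the mixed index when the canonical definition uses a string mapping such as Mapping.STRING.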

Fallback Procedures

If any step fails, do not force-commit over the inconsistency. Work from the failure point:

Detection returns “property key not found.” The key was renamed or never declared. Reconcile the expected key name against the registered schema through Schema Evolution and CI Gating before re-running; automating a repair against a phantom key does nothing.
DISABLE_INDEX never takes effect. The status change only completes once every JanusGraph instance registered with the cluster acknowledges it. If the index never reaches DISABLED, an instance is unreachable or holding a stale schema view, often a crashed instance that is still listed by mgmt.getOpenInstances(); close such dead entries with mgmt.forceCloseInstance(...) and commit. Wait for the JanusGraph management lock timeout to expire, then re-run the diagnostic. If an index remains stuck at INSTALLED after the timeout, restart the JanusGraph instances sequentially to force schema re-propagation, then retry.
removeIndex fails after a successful disable. The index is disabled but not removed, which is a safe intermediate state. Re-run only the removal transaction (reopen management, fetch the index, call removeIndex, commit); because disable and remove are separate commits, replaying the removal is idempotent and does not touch any other index on the key.
buildIndex throws SchemaViolationException on registration. A colliding index of the same name still exists — removal did not fully commit. Return to Step 1, confirm the losing index is gone, and only then re-register.
Reindex stalls in REGISTERED. The backfill job did not start or was interrupted. Confirm storage is healthy (nodetool status), run nodetool repair on ScyllaDB-backed clusters so the backfill reads a fully-replicated view, then re-issue the REINDEX action. The visibility mechanics are the same ones covered under OpenSearch Sync Patterns.
Full rollback. If the repair leaves the schema worse than it started, halt all writers, disable and remove the canonical index with the same two-commit pattern, restore the previously-ENABLED index name from your schema source of truth, reindex it, and confirm ENABLED before resuming. Never resume ingestion against a key with more than one non-disabled index.

Once verification passes and writers resume, integrate the Step 1 diagnostic into your CI/CD validation gates so a colliding index definition fails the pipeline before it reaches production. Validate against the upstream JanusGraph schema management documentation before deviating from the actions above.

Frequently Asked Questions

Why do I have to disable an index before removing it? JanusGraph will not remove an index that is still ENABLED or serving traffic. SchemaAction.DISABLE_INDEX moves it to DISABLED and stops it accepting writes and reads; only then can removeIndex delete its metadata. The two operations must also land in separate committed transactions — the orchestrator commits the disable, reopens management, and commits the removal.

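Why does the automation call .get() on the future that updateIndex returns? updateIndex returns a job future. For REINDEX that future tracks the backfill job, so .get() blocks until the reindex has finished. For status-only actions such as REGISTER_INDEX or DISABLE_INDEX, the change propagates to the other instances after mgmt.commit(), and ManagementSystem.awaitGraphIndexStatus is the call that blocks until every running JanusGraph instance has acknowledged it. That wait is what prevents one node from operating on a stale index view mid-repair. If the awaited status never arrives, an instance is unreachable or holding a stale schema.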

Can I resolve a collision without pausing writers? No. Removing the losing index while writes still land against it leaves orphaned documents in the search cluster and can race new mutations against the canonical index before it is ENABLED. Drain and pause the writers for the affected key, resolve, verify ENABLED, then resume.

My canonical index is stuck at REGISTERED — is it broken? Not necessarily. REGISTERED means the index exists but has not been backfilled. A mixed index only serves existing data after a REINDEX completes. Run the reindex action, and on ScyllaDB-backed storage run nodetool repair first so the backfill reads a fully-replicated view instead of a partial one.

How do I stop these collisions from recurring? Register index changes through a single gated path so two deployments cannot define the same key differently, and run the Step 1 diagnostic as a CI check that fails the pipeline on any non-ENABLED or divergent index. That converts a production SchemaViolationException into a caught build failure.

Up a level: Property Indexing Rules — the parent reference for index-type selection and the consistency window this repair restores.
Vertex and Edge Validation — the cardinality and data-type enforcement that decides which key definition is canonical.
Schema Evolution and CI Gating — gating index changes so a colliding definition fails the build instead of committing to production.
Resolving OpenSearch Index Drift in Production — the search-side analogue when the collision leaves orphaned documents behind.
Connection Pooling — the drain policy that lets you pause writers cleanly before mutating schema.