If ScyllaDB speaks CQL, can I just repoint JanusGraph's storage.hostname and migrate?

No. Protocol compatibility lets JanusGraph reuse the cql adapter, but you must re-create the keyspace with NetworkTopologyStrategy, pin local-datacenter for shard-aware routing, resize the connection pool against cores instead of nodes, and backfill mixed indexes. Repointing the hostname alone leaves round-robin routing, an under-sized pool, and stale search indexes.

How should I size the JanusGraph connection pool for ScyllaDB versus Cassandra?

ScyllaDB pins each shard to a core, so the shard-aware driver benefits from connection-per-core alignment rather than Cassandra's few-connections-per-node model. Start max-connections-per-host near vCPU count divided by two and raise toward one-per-shard only if pool utilization stays at 100% under peak writers, keeping a warm core-connections floor for spikes.

Why do traversals miss vertices right after cutover to ScyllaDB?

Storage writes commit synchronously, but mixed-index updates are asynchronous. During bulk load with batch-loading enabled the search tier lags far behind on purpose. A committed vertex is durable and reachable via g.V(id) before it is searchable via has(). Finish the full mixed-index reindex and confirm the document-count delta is zero before routing any read traffic.

What replaces DowngradingConsistencyRetryPolicy when migrating to the 4.x driver?

That policy was deprecated in the DataStax Java Driver 3.x line and does not exist under that name in 4.x. Implement application-level exponential backoff with jitter, classify only transient errors (overloaded, unavailable, timeout, connection reset) as retriable, and keep the attempt budget small so an overloaded shard sheds load instead of amplifying a retry wave.

What should trigger a rollback during the ScyllaDB cutover?

Revert routing to the standby cluster immediately if P99 traversal latency exceeds the captured baseline by more than 20% or mixed-index staleness passes 60 seconds. Both usually indicate a compaction backlog on the new cluster; let compaction drain before retrying the staged cutover.

ScyllaDB Migration

Moving a production JanusGraph deployment from Cassandra to ScyllaDB looks like a drop-in swap because both speak the CQL wire protocol, and that assumption is exactly what pages the on-call engineer at cutover. This guide sits under the JanusGraph Storage Backend Architecture & Configuration reference and narrows it to a single, high-risk operation: a live backend switch under a zero-downtime constraint. The failure surface is narrow but unforgiving: mismatched shard-aware routing, an index backfill run out of order, or a connection pool sized for Cassandra’s few-connections-per-node model instead of ScyllaDB’s shard-per-core model. The typical symptom at 2 a.m. is that storage writes commit, but traversals return stale or missing data while Overloaded exceptions climb. Everything below prioritizes deterministic configuration mapping, explicit index-sync sequencing, and observable cutover gates over the “it’s CQL-compatible, just repoint the hostname” path, which will not survive the first ingestion burst.

The diagram below outlines the end-to-end migration path from Cassandra to ScyllaDB with index reconciliation gated before cutover.

Core Configuration & Consistency Tuning

ScyllaDB implements the CQL wire protocol, so JanusGraph reuses its Cassandra-compatible cql storage adapter with no code changes. Protocol compatibility is not operational parity. The real work is mapping legacy topology settings to ScyllaDB’s shard-aware routing model and confirming that compaction strategy still aligns with graph traversal access patterns. Align janusgraph.properties with ScyllaDB’s partitioner and local datacenter before any data moves.

properties

# Core storage binding
storage.backend=cql
storage.hostname=scylla-node-01,scylla-node-02,scylla-node-03
storage.port=9042
storage.cql.keyspace=janusgraph_prod
storage.cql.local-datacenter=us-east-1

# Consistency boundaries
storage.cql.read-consistency-level=LOCAL_QUORUM
storage.cql.write-consistency-level=LOCAL_QUORUM
storage.cql.only-use-local-consistency-for-system-operations=true

# Shard-aware pool sizing (see Connection Lifecycle below)
storage.cql.max-connections-per-host=16
storage.cql.core-connections-per-host=8
storage.cql.request-timeout=15000
storage.batch-loading=false

These are not defaults; each line closes a specific failure mode. Observe the following operational constraints when you fill in the values for your topology:

Pin local-datacenter explicitly. ScyllaDB’s shard-aware driver routes each request to a token-owning replica and, within that node, to the owning shard (core). An unset or wrong datacenter name silently falls back to round-robin routing and adds cross-shard hops to every mutation.
Hold writes at LOCAL_QUORUM. It acknowledges once a majority of local replicas confirm, which keeps read-repairable durability without the cross-datacenter latency variance of a global QUORUM. This is the same acknowledgment boundary you tuned for the Cassandra Backend Setup, and it must not change during migration — moving the boundary and the backend at the same time makes every latency regression ambiguous.
Keep only-use-local-consistency-for-system-operations=true. JanusGraph’s ID allocation and schema locks otherwise escalate to global QUORUM and stall on cross-DC round trips exactly when the new cluster is warming.
Set request-timeout above your worst-case compaction or index-flush duration. ScyllaDB’s incremental compaction is fast, but a mistimed timeout during initial load surfaces as retriable WriteTimeout storms that mask the real backlog.
Do not carry over DowngradingConsistencyRetryPolicy. It was deprecated in the DataStax Java Driver 3.x line and does not exist under that name in 4.x. Replace it with application-level exponential backoff (shown below), so consistency degradation is an explicit decision, not a silent driver behavior.
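The constraints above can be checked mechanically before any data moves. The following is a minimal sketch, not part of JanusGraph: `parse_properties` and `preflight` are hypothetical helper names, and the checks cover only the settings discussed in this section, using the property keys shown in the listing above.

```python
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def preflight(props: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means pass."""
    problems: List[str] = []
    if not props.get("storage.cql.local-datacenter"):
        problems.append("storage.cql.local-datacenter is not set")
    for key in (
        "storage.cql.read-consistency-level",
        "storage.cql.write-consistency-level",
    ):
        if props.get(key) != "LOCAL_QUORUM":
            problems.append(f"{key} is {props.get(key)!r}, expected LOCAL_QUORUM")
    local_only = "storage.cql.only-use-local-consistency-for-system-operations"
    if props.get(local_only) != "true":
        problems.append(f"{local_only} should be true")
    # Batch mode belongs to the load phase only, never the serving config.
    if props.get("storage.batch-loading", "false") == "true":
        problems.append("storage.batch-loading=true in a serving config")
    return problems
```

Run `preflight(parse_properties(open("janusgraph.properties").read()))` in the deployment pipeline and refuse to proceed if the returned list is non-empty.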

Because topology decisions precede consistency tuning, confirm your Replication Strategies — NetworkTopologyStrategy with explicit per-datacenter replica counts — are re-created on the ScyllaDB keyspace before bulk load rather than relying on auto-creation, which defaults to SimpleStrategy and ignores rack awareness. ScyllaDB’s own CQL compatibility documentation lists the protocol-level deltas that affect keyspace and compaction declarations.
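To make the keyspace re-creation explicit rather than implicit, the statement can be generated from the intended topology. This is a sketch under stated assumptions: `build_keyspace_cql` is a hypothetical helper, and the replica count of 3 in the usage note is illustrative, not a recommendation from this guide.

```python
from typing import Dict


def build_keyspace_cql(keyspace: str, replicas_per_dc: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # Sort datacenters so the generated statement is deterministic.
    for dc, rf in sorted(replicas_per_dc.items()):
        if rf < 1:
            raise ValueError(f"replica count for {dc} must be >= 1")
        parts.append(f"'{dc}': {int(rf)}")
    replication = "{" + ", ".join(parts) + "}"
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {replication};"
    )
```

For example, `build_keyspace_cql("janusgraph_prod", {"us-east-1": 3})` yields a statement you can review and run through cqlsh before pointing JanusGraph at the cluster. The datacenter names must match what the ScyllaDB nodes report, and the same name goes into storage.cql.local-datacenter.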

Index Synchronization Protocol

The critical path in this migration is not moving vertices — it is keeping JanusGraph’s index state coherent across the transition window. JanusGraph splits index responsibility: graph indexes live in the storage backend and inherit the configured consistency level, while mixed indexes (Elasticsearch/OpenSearch) update asynchronously through a transaction-log dispatch. That split is why a byte-perfect storage copy can still serve stale has() predicates the moment you cut over.

During bulk load into ScyllaDB you deliberately open the async window wider, then close it under control before traffic shifts:

Open the window. Set storage.batch-loading=true for the initial load only. This relaxes consistency checks and lets the storage tier ingest at maximum throughput; mixed-index updates are expected to lag far behind and are reconciled afterward, not inline.
Stabilize storage first. Run ManagementSystem.awaitGraphIndexStatus(graph, indexName).status(SchemaStatus.ENABLED).call() to confirm every graph index is ENABLED on the new cluster before you touch the search tier.
Backfill the search tier in one pass. Trigger a full mixed-index reindex through the JanusGraph Management API after storage stabilizes, rather than incrementally during load — an incremental backfill races the bulk writes and leaves gaps.
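Measure the lag, do not assume it. The management API's getGraphIndex(name) tells you which keys and element type a mixed index covers, but it does not report document counts, so count the matching elements on the storage side and compare that number with the live document count in the search index. Alongside that delta, watch Elasticsearch/OpenSearch _cluster/health and the /_nodes/stats/indices/indexing latency series. The window is closed only when the delta is stable at zero.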
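The "stable at zero" condition in the last step can be sketched as a small gate. Everything here is an assumption for illustration: `index_window_closed` is a hypothetical helper, the two count callables stand in for whatever storage-side and search-side counts you collect, and three samples ten seconds apart is an arbitrary choice.

```python
import time
from typing import Callable


def index_window_closed(
    storage_count: Callable[[], int],
    search_count: Callable[[], int],
    samples: int = 3,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the count delta is zero on every sample."""
    for i in range(samples):
        delta = storage_count() - search_count()
        if delta != 0:
            # Any non-zero sample means the backfill has not converged.
            return False
        if i < samples - 1:
            sleep(interval_s)
    return True
```

Requiring several consecutive zero samples, rather than one, guards against reading a momentary match while writes are still in flight.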

This decouple-then-backfill sequence is the same principle covered in depth for Optimizing ScyllaDB Read/Write Consistency for Graphs, and it depends on where you place the acknowledgment boundary between storage durability and search visibility — the trade-off analyzed under Eventual vs Strong Consistency. If your read path resolves predicates through the search tier, verify parity with the Mixed-Index Routing rules before routing any production traffic, or committed vertices will be durable yet unsearchable at exactly the wrong moment.

Python Integration Pattern

Production migration pipelines need deterministic retry logic to absorb coordinator handoffs during rebalancing, transient node unavailability, and ScyllaDB overload signals — all of which spike precisely during bulk load and cutover. The pattern below executes a gremlin-python traversal with explicit connection lifecycle, transient-error classification, and exponential backoff with jitter, so a retriable Overloaded never cascades into a thundering herd.

python

import logging
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception,
)

logger = logging.getLogger(__name__)

TRANSIENT_MARKERS = ("overloaded", "unavailable", "timeout", "connection reset")


class GraphPipelineError(Exception):
    """Raised when a traversal fails after the retry budget is exhausted."""


def is_transient_error(exc: BaseException) -> bool:
    """Classify recoverable ScyllaDB / JanusGraph errors by message."""
    msg = str(exc).lower()
    return any(marker in msg for marker in TRANSIENT_MARKERS)


@retry(
    retry=retry_if_exception(is_transient_error),
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=2, max=16),
    reraise=True,
)
def execute_traversal(connection_url, traversal_func):
    """Run a Gremlin traversal with a fresh, explicitly closed connection."""
    conn = DriverRemoteConnection(connection_url, "g")
    try:
        g = traversal().with_remote(conn)
        return traversal_func(g).to_list()
    finally:
        conn.close()


def fetch_user_neighbors(g):
    return g.V().has("user", "id", "u_123").out("follows").values("name")


try:
    neighbors = execute_traversal(
        "ws://janusgraph-gateway:8182/gremlin",
        fetch_user_neighbors,
    )
    logger.info("Retrieved %d neighbors.", len(neighbors))
except Exception as exc:  # transient retries exhausted, or a non-transient error
    logger.error("Pipeline halted: %s", exc)
    # Downstream: route to a dead-letter queue or trip a circuit breaker here.
    raise GraphPipelineError("Traversal failed") from exc

Two design points make this fallback-safe rather than merely decorated. First, retry_if_exception(is_transient_error) guarantees that a genuine schema or query error surfaces immediately instead of burning the retry budget — you never want to retry a malformed traversal five times against an already-loaded cluster. Second, wait_exponential_jitter spreads retry timing across producers, which is what actually prevents a synchronized retry wave from re-overloading a shard mid-rebalance. Open a new DriverRemoteConnection per unit of work and close it in finally; sharing a long-lived connection across a retrying batch loader is the most common source of leaked sessions during migration. Driver-level connection management details are in the official DataStax Java Driver documentation.
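To see what the retry budget costs in wall-clock time, the wait schedule can be worked out by hand. The sketch below mirrors, as an assumption, how tenacity's wait_exponential_jitter computes each wait (initial times 2 to the power of the retry index, plus up to one second of uniform jitter, capped at max); `backoff_schedule` is a hypothetical helper and does not call tenacity.

```python
import random
from typing import Callable, List


def backoff_schedule(
    attempts: int = 5,
    initial: float = 2.0,
    exp_base: float = 2.0,
    max_wait: float = 16.0,
    jitter: float = 1.0,
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """Waits between attempts; there is no wait after the final attempt."""
    waits: List[float] = []
    for n in range(attempts - 1):
        raw = initial * exp_base ** n + rng() * jitter
        waits.append(min(raw, max_wait))
    return waits


# Bounds for the settings used above (5 attempts, initial=2, max=16):
lower = sum(backoff_schedule(rng=lambda: 0.0))  # 2 + 4 + 8 + 16
upper = sum(backoff_schedule(rng=lambda: 1.0))  # 3 + 5 + 9 + 16 (capped)
```

Under these assumptions a traversal that fails every attempt spends between 30 and roughly 33 seconds waiting, on top of request time, which is worth comparing against any upstream deadline before raising the attempt count.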

Connection Lifecycle & Pool Management

ScyllaDB’s per-shard architecture is where the Cassandra-tuned pool goes wrong. Cassandra multiplexes many requests over few connections per node; ScyllaDB pins each shard to a core and rewards a connection-per-core alignment so the shard-aware driver can route a request straight to the token-owning shard without an internal hop. Size the pool against cores, not nodes. The broader starvation model lives under Connection Pooling; the migration-specific rules are:

Ceiling on cores, not a fixed constant. Set storage.cql.max-connections-per-host to roughly the node’s vCPU count divided by two as a starting point — a 16-vCPU node warrants about 8 — then raise toward one-per-shard only if pool utilization stays pinned at 100% under peak writers.
Keep a warm floor. storage.cql.core-connections-per-host maintains idle connections so a traffic spike does not pay cold-start latency mid-cutover. Set it to your steady-state concurrency, not zero.
Bound the idle timeout below the server’s. Keep the client idle timeout under ScyllaDB’s native transport idle timeout so the driver, not the server, closes stale connections — server-side closes surface to the producer as spurious connection reset errors.
Make backpressure visible. Expose driver pool-utilization and in-flight-request JMX metrics. Pool exhaustion masquerades as node death (NoHostAvailableException) while nodetool status shows every node UN — the pool, not the ScyllaDB cluster, is the bottleneck. Shorten request-timeout so saturation surfaces to the producer instead of queuing unboundedly.
Retry policy is application-level. With DowngradingConsistencyRetryPolicy gone, the backoff-with-jitter shown above is the entire retry contract. Keep the attempt budget small (5 is plenty) so a genuinely overloaded shard sheds load instead of amplifying it.
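The sizing rule in the first item reduces to a few lines of arithmetic. This is a sketch only: both function names are hypothetical, the step of 2 is an arbitrary increment, and the shard count is something you read from the node rather than assume equals the vCPU count.

```python
def starting_max_connections(vcpus: int) -> int:
    """Starting ceiling: roughly half the node's vCPU count, never below 1."""
    if vcpus < 1:
        raise ValueError("vcpus must be positive")
    return max(1, vcpus // 2)


def next_max_connections(
    current: int, shards: int, pool_utilization: float, step: int = 2
) -> int:
    """Raise the ceiling toward one-per-shard only when the pool is saturated."""
    if pool_utilization < 1.0:
        return current
    # Never exceed one connection per shard, and never lower the ceiling here.
    return max(current, min(shards, current + step))
```

A 16-vCPU node starts at `starting_max_connections(16)`, which is 8, and moves up in small steps only while utilization under peak writers stays pinned at 100%.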

Diagnostics & Operational Fallbacks

Every dangerous failure in this migration looks like another until you read the right metric. Instrument the coordinator, the driver pool, and the mixed index together, then triage from the table.

Symptom	Diagnose	Resolve
`OverloadedException` storms during bulk load	ScyllaDB `reactor_utilization` near 100% on specific shards; `scylla_storage_proxy` write-latency climbing	Shard hotspot — throttle the producer, confirm the partitioner matches Cassandra’s, and widen backoff jitter so retries stop synchronizing
Traversals miss vertices loaded seconds ago	Vertex present via `g.V(id)` but absent via `has()`; `/_nodes/stats/indices/indexing` latency high	Mixed-index backfill incomplete — finish the full reindex and confirm the doc-count delta is zero before routing reads
`NoHostAvailableException` under steady load	`nodetool status` shows nodes `UN` but driver pool utilization pegged at 100%	Pool exhaustion, not node death — raise `max-connections-per-host` toward per-shard alignment and shorten `request-timeout`
Write latency spikes only after cutover	`storage.batch-loading` left `true` in the live config; consistency checks relaxed on the serving path	Set `storage.batch-loading=false` and restart — batch mode is a load-phase setting, never a serving setting
Schema mutations hang on `apply`	System operations escalating to global `QUORUM` across datacenters	Confirm `only-use-local-consistency-for-system-operations=true` so ID allocation and schema locks stay on `LOCAL_QUORUM`
P99 latency 20%+ over baseline post-cutover	Compaction backlog on the new cluster; `scylla_compaction_manager_pending` growing	Trip the rollback gate, revert routing to the standby cluster, and let compaction drain before retrying cutover

When drift persists beyond what a producer throttle can close, run a REINDEX through the JanusGraph Management API during a maintenance window rather than dropping and rebuilding the live index. Keep continuous monitoring on shard reactor utilization, driver pool utilization, and mixed-index lag; those three series predict every row above. Run the cutover in this order: (1) verify all indexes are ENABLED; (2) audit index freshness against storage counts; (3) capture a P95/P99 latency baseline; (4) shift 10% of read traffic and watch Overloaded metrics for 15 minutes; (5) move to 100% while holding the legacy cluster in standby for 24 hours. If P99 exceeds baseline by more than 20% or mixed-index staleness passes 60 seconds, revert routing immediately and investigate the compaction backlog before the next attempt.
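The rollback condition is simple enough to encode directly, so the decision does not depend on someone reading a dashboard at cutover. A minimal sketch, assuming you already collect P99 latency and mixed-index staleness; `should_rollback` and its parameter names are hypothetical.

```python
def should_rollback(
    p99_ms: float,
    baseline_p99_ms: float,
    staleness_s: float,
    latency_tolerance: float = 0.20,
    max_staleness_s: float = 60.0,
) -> bool:
    """True if either rollback gate from the cutover sequence is tripped."""
    latency_breached = p99_ms > baseline_p99_ms * (1 + latency_tolerance)
    staleness_breached = staleness_s > max_staleness_s
    return latency_breached or staleness_breached
```

Evaluate it on every monitoring interval during the 10% and 100% stages, and wire a True result to the routing revert rather than to an alert alone.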

Up a level: JanusGraph Storage Backend Architecture & Configuration — the storage tier this migration retargets.
Optimizing ScyllaDB Read/Write Consistency for Graphs — the decouple-then-backfill consistency model in depth.
Cassandra Backend Setup — the source-backend configuration whose property deltas you map from.
Connection Pooling — the full CQL pool sizing and starvation model behind the per-shard rules above.
Replication Strategies — datacenter topology and replica counts to re-create before bulk load.
Mixed-Index Routing — verify search-tier parity before routing production reads.