ADR 0016 — Async StoreBackend is canonical; sync layer is transitional¶
Context and Problem Statement¶
ADR-0015 (CSA-0046) shipped the StoreBackend Protocol + dual backends (SqliteStore + PostgresStore) synchronously because the existing FastAPI routers were already synchronous and converting them was called out as a separate workstream:
Negative: the sync PostgresStore ties the router surface to blocking I/O. A future ADR will evaluate async conversion once router-side
awaitchains are already justified by other work. — ADR-0015
That separate workstream is now justified. Three concrete forces push the portal into async def end-to-end:
- Copilot streaming —
apps/copilotemits Server-Sent Events over long-lived HTTP responses. Every blockingstore.get(...)on the router thread starves that stream because FastAPI holds the event loop while the sync call-site blocks in the thread-pool executor. - BFF concurrency —
routers/auth_bff.pyalready runs async (ADR-0014). Having the session store async while the persistence store is sync is a confusing dichotomy that regularly trips contributors. - Postgres connection efficiency — SQLAlchemy 2.0's
AsyncEngineasyncpgamortises the AAD token-fetch cost across the pool in a way the syncpsycopgpath cannot (the sync engine blocks the event loop on every new connection while the AAD token is minted).
The question this ADR answers is: given the async conversion is now required, do we rip out the sync layer, run a thread-pool bridge, or keep the sync layer as a transitional compat path?
Decision Drivers¶
- Non-regressive test suite — the 150-test suite against the sync store must not break during the refactor.
- Supply-chain hygiene — no new mandatory deps on
aiosqliteorsqlalchemy[asyncio]for SQLite-only deployments. - Migration runway — consumers that import
from portal.shared.api.persistence import SqliteStoreoutside the tree (there are none in this repo today, but the public surface hints at future re-use) get at least one minor release to migrate. - FastAPI-native concurrency — no
asyncio.run_in_executorpaper bridges. The routes must be genuinely async end-to-end. - Production-grade error surface — typed exceptions (
StoreBackendError,StoreConnectionError,MigrationDigestMismatchError,MigrationPartialFailure) + tenacity retries on transient driver errors. - Data migration must be first-class — ADR-0015 explicitly deferred the migration CLI; without it, operators cannot move production data from SQLite to Postgres.
Considered Options¶
- AsyncStoreBackend as canonical + sync retained as compat (chosen) — add an async Protocol, async SQLite (
aiosqlite) and async Postgres (sqlalchemy.ext.asyncio+asyncpg) stores, convert the routers toawait-style via FastAPIDepends, and leave the sync layer in place — marked deprecated in docstrings — for one minor release. - Stay sync forever — keep
SqliteStore+PostgresStore+ router code synchronous. Simpler in the short term; blocks Copilot streaming, creates ever-more awkward mixing with the already-async BFF path, and bottlenecks Postgres through the synchronous driver. - Thread-pool the sync store from async routes — wrap each sync call in
asyncio.to_thread/run_in_executor. Avoids adding async drivers but leaks threads under FastAPI concurrency, doubles latency on fast reads, and introduces a second class of bug (thread-local connection reuse collisions under load). - Delete the sync layer immediately — cleanest architecturally but any out-of-tree consumer that imports
SqliteStorewould break without warning on upgrade.
Decision Outcome¶
Chosen: Option 1 — AsyncStoreBackend canonical; sync layer transitional.
Concrete layout:
portal/shared/api/persistence_async.py—AsyncStoreBackendProtocol plus:AsyncSqliteStore—aiosqlite-backed; preserves WAL, PRAGMAs,BEGIN IMMEDIATEsemantics;asyncio.Lockreplacesthreading.RLock.AsyncPostgresStore—sqlalchemy.ext.asyncio.AsyncEngine+asyncpg; per-URL engine cache; managed-identity token injection viaDefaultAzureCredential.aio; password-callable bridge so asyncpg picks up rotating tokens without pool restarts.build_async_store_backend()— dispatch onDATABASE_URL, fails closed on unknown schemes (mirrors sync factory exactly).- Typed exceptions + tenacity retries (3 attempts, jittered exponential backoff) on transient driver errors.
close_async_engines()— lifespan shutdown hook disposes the engine pool + closes the AAD credential.
portal/shared/api/dependencies.py— singleton async store instances wired into FastAPI viaDependsgetters (get_sources_store,get_pipelines_store,get_runs_store,get_access_store,get_products_store,get_quality_store).portal/shared/api/routers/{access,marketplace,pipelines,sources,stats}.py— every route isasync def; every persistence call isawait; stores resolved viaAnnotated[AsyncStoreBackend, Depends(get_*_store)].portal/shared/api/main.py— lifespan calls asyncseed_demo_*coroutines on startup andclose_async_engines()on shutdown.scripts/migrate_portal_persistence.py— standalone CLI that walks every portal store (or a--tablessubset), reads via the async SQLite/Postgres store, writes viaINSERT ... ON CONFLICT DO NOTHING, computes sha256 row digests, and emits a structured report. Dry-run mode available. Exit codes document every failure class.- Sync layer (
persistence.py,persistence_postgres.py,persistence_factory.py) retained with deprecated-in-docstring markers. To be removed in the next minor release (CSA-0046 v3) after tenant teams have had one release to migrate.
Consequences¶
- Positive: FastAPI routes now participate in the async event loop end-to-end; Copilot SSE streams and BFF session flows no longer compete with blocking persistence calls.
- Positive: Postgres under load benefits from the asyncpg pool (shared across all six stores) + non-blocking AAD token refresh.
- Positive: first-class migration CLI closes the CSA-0046 deferred item; operators can now migrate SQLite → Postgres with a single command + verified row digests.
- Positive: explicit typed exceptions make test assertions crisp and let operators distinguish "database unreachable" from "row-level digest mismatch" in alerting.
- Negative: dual layer doubles the store surface. Mitigated by the deprecation timeline and by having the async and sync stores share the same physical table schema (both the SQLite and Postgres layouts are unchanged from ADR-0015).
- Negative:
aiosqlite+sqlalchemy[asyncio]+asyncpgnow carry the full async stack; install size for thepostgresextra grows by ~4 MB. Acceptable for production but the sync-SQLite-only dev-loop (pip install -e .[portal]) remains slim because the async SQLite dep lives under thepostgresextra too. - Negative: aiosqlite opens a background executor thread per connection. In the FastAPI process this is a non-issue (one connection per store, reused for the life of the process), but tests that spin up fresh stores every test must close them explicitly.
conftest.pyuses the sync compat layer for that path to avoid a per-testasyncio.run.
Pros and Cons of the Options¶
Option 1 — Async canonical + sync compat (chosen)¶
- Pros: covers the whole codebase in one pass, no stranded behaviour, consumers get a one-release deprecation window, migration CLI is the natural end-to-end smoke test of the async surface.
- Cons: dual layer to maintain for one release; CI test matrix slightly larger.
Option 2 — Stay sync¶
- Pros: zero refactor effort.
- Cons: blocks Copilot streaming + BFF concurrency + async Postgres; technical debt compounds.
Option 3 — Thread-pool the sync store¶
- Pros: no new async drivers; minimal code churn on the sync store.
- Cons: leaks executor threads under FastAPI concurrency; doubles latency for fast reads; breaks the
BEGIN IMMEDIATEserialisation guarantee when the threaded wrapper deadlocks with_WRITE_LOCK. Rejected.
Option 4 — Delete sync layer¶
- Pros: cleanest endpoint.
- Cons: breaks any out-of-tree consumer without a deprecation window. Rejected pending at least one transitional release.
Migration + deprecation timeline¶
| Release | Action |
|---|---|
| CSA-0046 v2 (this ADR) | Async canonical. Sync layer retained + deprecated docstrings. Migration CLI shipped. |
| CSA-0046 v3 (next minor) | Sync layer removed from portal.shared.api.persistence + persistence_postgres. build_store_backend removed. dependencies.py becomes the only public store entry-point. |
Data migration between SQLite and Postgres is performed with scripts/migrate_portal_persistence.py. The script is idempotent (ON CONFLICT DO NOTHING) and non-destructive (never writes to the source), so rollback is "drop the target tables and re-run" — the script documents this in its docstring.
Validation¶
We will know this decision is right if:
python -m pytest portal/shared/tests/passes unchanged for the existing 150-test sync-exercising suite + the newtest_async_persistence.pyandtest_migration_cli.pytests.python scripts/migrate_portal_persistence.py --helpprints usage and exits cleanly.- A full end-to-end migration of a seeded SQLite portal into a fresh Postgres Flexible Server using the CLI completes with
digest_mismatches=0on every table. - Under load, P95 latency on
/api/v1/sourcesdrops vs the sync baseline becauseawaiton the async Postgres driver is non-blocking. - Copilot SSE streams no longer show gaps when the portal is under persistence load (the original driver for this ADR).
References¶
- Decision tree: n/a.
- Related code:
portal/shared/api/persistence_async.pyportal/shared/api/dependencies.pyportal/shared/api/routers/*.pyportal/shared/api/main.pyscripts/migrate_portal_persistence.pyportal/shared/tests/test_async_persistence.pyportal/shared/tests/test_migration_cli.py
- Framework controls: NIST 800-53 AC-3 (access enforcement — async route scoping preserved verbatim), AU-3 (content of audit records — audit emissions still run inside the request scope), CP-9 (inherited from ADR-0015's Postgres choice), SC-28 (encryption at rest — unchanged), IA-2 (passwordless auth —
DefaultAzureCredential.aiosupplies MI tokens). - Preceding ADR: 0015 (Postgres portal persistence — explicitly deferred this work).
- Discussion: CSA-0046 follow-on.