Deploy iteration log: first real provision

A candid record of every iteration through the first real run_mode=full deploy of the CSA Loom Bicep stack against Azure Commercial. Each iteration surfaced new real-Azure failures specific to the tenant, region, or API version that Bicep lint and what-if alone could not catch. This document records what was hit, what was fixed, and what remains as known intermittent issues.

Cumulative result (as of 2026-05-23)

✅ Auth validated end-to-end against Azure Commercial (Wave 2 SP auth fix)
✅ Bicep what-if validates the entire 30+ resource stack against the live tenant
✅ 5 iterations of real provision fixing real Azure regional / state / API issues
⏳ Full provision still has remaining issues — see "Outstanding" below

Iteration log

| # | Run ID | Failure | Fix | PR |
|---|---|---|---|---|
| 1 | 26322792356 | Tenant already has Enterprise Purview account; only 1 per tenant | Default `purviewEnabled = false`; operator opts in | #294 |
| 2 | 26324494445 | (a) KV role assignment: empty admin group; (b) AI Search: regional capacity in eastus2; (c) Container Apps Env: missing LAW sharedKey | (a) Gate role on `!empty(adminEntraGroupId)`; (b) `aiSearchEnabled = false`; (c) wire `lawSharedKey` from monitoring | #297 |
| 3 | 26325085349 | (a) DLZ network CIDR malformed ('10.mw.0.0/16' from uniqueString); (b) DLZ ADX DB referenced non-existent admin cluster | (a) Deterministic `10.100.0.0/16` default; (b) `adxEnabled = false` gate | #299 |
| 4 | 26325723745 | synapse.bicep called shared diag-settings helper without scope override → BadRequest at RG scope | Removed broken module call; inline diag resource was already correct | #300 |
| 5 | 26326256884 | (a) Event Hubs CG referenced `$default` event hub that didn't exist; (b) Storage versioning incompatible with HNS; (c) Cosmos zonal account capacity in eastus2 | (a) Removed manual CG; (b) `isVersioningEnabled = false`; (c) `zoneRedundant` param default false | #301 |
| 6 | 26327020669 | Cosmos containers used slash-path nested name pattern → what-if BadRequest | Refactored to `parent: dbs[i]` array-index reference | #302 |
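
The malformed CIDR from iteration 3 (`10.mw.0.0/16`) is the sort of value a pre-flight check could reject before a run is dispatched. The following is a minimal sketch, assuming a Bash step ahead of provisioning; the function name `is_valid_ipv4_cidr` is hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (an assumption, not part of the CSA Loom repo):
# returns 0 when the argument is a well-formed IPv4 CIDR, 1 otherwise.
is_valid_ipv4_cidr() {
  local cidr="$1" ip prefix octet
  local -a octets
  # Shape check: four dot-separated numeric groups, a slash, a numeric prefix.
  local re='^[0-9]{1,3}(\.[0-9]{1,3}){3}/[0-9]{1,2}$'
  [[ "$cidr" =~ $re ]] || return 1
  ip="${cidr%/*}"
  prefix="${cidr#*/}"
  # 10# forces base 10 so a leading zero is not read as octal.
  (( 10#$prefix <= 32 )) || return 1
  IFS='.' read -r -a octets <<< "$ip"
  for octet in "${octets[@]}"; do
    (( 10#$octet <= 255 )) || return 1
  done
  return 0
}
```

A workflow step could call it on the resolved address space and stop early when it returns non-zero. It checks syntax only; it does not detect overlap with the hub VNet or other peered ranges.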

Continued iteration after #302

| # | Run ID | Failure | Fix | PR |
|---|---|---|---|---|
| 7 | 26328703349 | Synapse, 3 issues: AAD admin needs SID; firewall rules need public access; deployment-script needs valid UAMI | Gate AAD admin on `!empty()`; gate firewall on `managedVnet=false`; new `synapseRoleAssignmentUamiId` param | #304 |
| 8 | 26329796598 | Databricks: NSG `ConflictWithNetworkIntentPolicy`; VNet-injected ADB needs specific worker outbound rules | Added 4 required NSG security rules (worker→Sql, worker→Storage, worker→EventHub, intra-vnet inbound) | #305 |

🎉 Iter #8 SUCCESS

Run 26330578932 — first full successful deploy against Azure Commercial.

Bicep what-if: ✅
Provision: ✅ in 10m21s
Post-provision validation (from inside cluster): ✅
External smoke test (best-effort): ✅
Resources kept per keep_resources=true

Console URL emitted: https://loom-console.delightfulmoss-96202bfd.eastus2.azurecontainerapps.io (VNet-internal — requires Bastion to browse)

Resources successfully provisioned in rg-csa-loom-admin-eastus2 + rg-csa-loom-dlz-single-eastus2:

Hub VNet + 7 subnets + Bastion Standard + Azure Firewall + 17 private DNS zones
7 UAMIs
Key Vault Premium + private endpoint
LAW + AppInsights + Sentinel + 2 AI threat-detection rules
ACR Premium + private endpoint
Container Apps Env (internal, zone-redundant)
Catalog stub + AI defense playbook
DLZ: spoke VNet + ADB-compliant NSG + peering
DLZ storage (ADLS Gen2 + HNS + 5 containers + EG topic + PEs)
DLZ Databricks workspace (Premium, VNet-injected)
DLZ Synapse workspace (Serverless SQL + managed VNet + audit)
DLZ Event Hubs (Kafka surface + PE)
DLZ Cosmos DB (5 workload databases + PE)

Gated off per first-deploy convention (operator opts in module-by-module):

AI Foundry Hub (aiFoundryEnabled)
APIM (apimEnabled)
AI Search (aiSearchEnabled — regional capacity)
ADX DB (adxEnabled — needs cluster pre-provisioned)
App deployments (deployAppsEnabled — needs container images in ACR)
Synapse data-plane RBAC script (synapseRoleAssignmentUamiId)
Purview (purviewEnabled — tenant collision per iter #1)

Iter #9: image build (separate iteration cycle)

Run 26330782170 — image build via GitHub runners.

All 6 builds failed with CONNECTIVITY_REFRESH_TOKEN_ERROR at ACR login.

Root cause: ACR has publicNetworkAccess: Disabled (correct per security posture) — unreachable from external GitHub runners.
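
A build workflow could detect this condition up front instead of failing at ACR login. The sketch below assumes an authenticated `az` session with read access to the registry; the function name `acr_public_access_enabled` is hypothetical, and it reads the `publicNetworkAccess` property returned by `az acr show`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (an assumption, not part of the repo):
#   returns 0 if the registry reports publicNetworkAccess = Enabled,
#   returns 1 if it reports anything else (for example Disabled),
#   returns 2 if the az call itself fails.
acr_public_access_enabled() {
  local acr="$1" state
  state="$(az acr show --name "$acr" --query publicNetworkAccess --output tsv)" || return 2
  [ "$state" = "Enabled" ]
}
```

On a GitHub-hosted runner, a non-zero result would be a reason to stop with a clear message before any image build starts.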

Fix paths (operator chooses):

A — ACR Tasks (build-fiab-images-acr-tasks.yml): ATTEMPTED in iter #10 (run 26331140630) — also failed. Even ACR Tasks runs as Microsoft service infrastructure OUTSIDE the customer VNet, so it gets the same firewall-deny when publicNetworkAccess: Disabled. The networkRuleBypassOptions: AzureServices bypass is overridden when public access is fully disabled.
B — Self-hosted GitHub runner inside Hub VNet (recommended for production): VM-based runner in snet-workloads with VNet access to the ACR private endpoint; tag the build workflow to runs-on: [self-hosted, csa-loom-hub].
C — Temporarily enable ACR public access for the initial build pull:
```
az acr update --name $ACR --public-network-enabled true
gh workflow run build-fiab-images --ref main -f tag_override=v0.1
# After build completes:
az acr update --name $ACR --public-network-enabled false
```
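This is the fastest unblock for first-time deploys. Note that `gh workflow run` returns as soon as the run is dispatched, so wait for the build to finish before re-disabling public access. Public access stays open only for the duration of the build (~10-15 min), and the ACR still requires SP auth, so the exposure is limited.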
D — Operator builds locally + pushes via Bastion: developer machine inside a Bastion-jumpbox session has VNet access to ACR.
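
Option C depends on public access being switched off again even when the build fails. One way to make that harder to forget is a small wrapper, sketched here under assumptions: the function name `with_acr_public_access` is hypothetical, and the command passed to it must block until the build has finished (for example a script that dispatches the workflow and then waits on the run).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (an assumption, not part of the repo): opens ACR public
# access, runs the given command, then closes access again even if the command
# fails. It does not handle signals such as Ctrl-C.
with_acr_public_access() {
  local acr="$1"; shift
  local status=0
  az acr update --name "$acr" --public-network-enabled true || return 1
  # Run the caller's build-and-wait command and remember its exit code.
  "$@" || status=$?
  # Always attempt to restore the locked-down posture.
  az acr update --name "$acr" --public-network-enabled false || status=$?
  return "$status"
}
```

The wrapper returns the wrapped command's exit code unless the final `az acr update` fails, in which case that failure is returned so the operator knows the registry may still be public.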

Outstanding follow-ups

Image build via ACR Tasks — add .github/workflows/build-fiab-images-acr-tasks.yml
Re-dispatch with deployAppsEnabled=true once images exist
Front-end UI validation: requires Bastion + jumpbox in hub VNet; per first-deploy.md Phase 4
GCC + GCC-High validation: per secrets-bootstrap.md, needs SP credentials in those tenants
Build 2026 freshness rescan: auto-fires Jun 8

Recommended next-step pattern

Always tear down first between deploy attempts when state is suspect:
```
gh workflow run teardown-fiab-commercial -f confirm=DELETE
```
Wait for teardown completion (up to 30 min for Cosmos cleanup) before re-dispatching.
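
The wait can be scripted rather than eyeballed. This is a minimal sketch, assuming that teardown is complete once the resource group no longer exists; the function name `wait_for_rg_deleted` is hypothetical, and the 1800-second default mirrors the 30-minute figure above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the repo): polls
# `az group exists` until the resource group is gone or the timeout elapses.
# Arguments: <resource-group> [timeout seconds, default 1800] [poll interval seconds, default 30]
wait_for_rg_deleted() {
  local rg="$1" timeout_s="${2:-1800}" interval_s="${3:-30}"
  local waited=0
  while [ "$(az group exists --name "$rg")" = "true" ]; do
    if (( waited >= timeout_s )); then
      echo "timed out waiting for $rg to be deleted" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$(( waited + interval_s ))
  done
  return 0
}
```

It would be called once per resource group, for example rg-csa-loom-admin-eastus2 and rg-csa-loom-dlz-single-eastus2, before the next dispatch.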

Dispatch with keep_resources=true during iteration so you can inspect live state:

gh workflow run deploy-fiab-commercial -f run_mode=full -f keep_resources=true

Use the post-provision validation script, which curls from inside the cluster to work around the VNet-internal ingress.

Region recommendations

Based on iter #2 + #5 capacity failures in eastus2:

- AI Search: try eastus, centralus, or westus3
- Cosmos zonal: stick with default zoneRedundant=false in eastus2; try eastus for zonal
- All other resources have validated successfully in eastus2

Cost notes

Each run_mode=full attempt that gets past provision spins up resources costing roughly $10-30/hour at F8 capacity SKU + 1 Databricks workspace.
The keep_resources=true workflow input WARNS about ongoing spend in the log; operator manually runs the teardown workflow when done iterating.
Failed deploys roll back via ARM (most resources auto-delete on failure), but Cosmos accounts in particular leave partial state.
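
To put the hourly figure in context when resources are kept between iterations, the arithmetic can be sketched as below. The function name `estimate_spend_usd` is hypothetical, and the rates are just the rough $10-30/hour range quoted above, not measured billing data.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the repo): prints a rough
# low-high USD spend range for a whole number of hours, using the
# $10-30/hour estimate from this document.
estimate_spend_usd() {
  local hours="$1" low_rate=10 high_rate=30
  echo "$(( hours * low_rate ))-$(( hours * high_rate ))"
}
```

For instance, leaving a kept deployment up for 12 hours gives `estimate_spend_usd 12`, which prints `120-360`.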

Loom LAW monitoring + alert pack
Deploy failure runbook
First deploy operator playbook
Workflow: .github/workflows/deploy-fiab-commercial.yml
Teardown workflow: .github/workflows/teardown-fiab-commercial.yml
Script: .github/scripts/fiab-teardown.sh (multi-sub aware; KV purge before RG delete)