🔑 Authentication Failure Playbook

Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Anchor: incident-response-template.md

Audience: On-call engineers, identity admins, platform SREs

Purpose: Diagnose and resolve Fabric authentication failures across Workspace Identity, Service Principals, Managed Identities, OAuth users, and Conditional Access


📑 Table of Contents

  1. Symptoms
  2. Severity Classification
  3. Identity-Type Triage Matrix
  4. Triage Decision Tree
  5. Diagnostic Steps
  6. Resolution Procedures
  7. Verification
  8. Rollback
  9. Post-Incident Actions
  10. Escalation
  11. Quick-Reference Commands
  12. Related Runbooks

Symptoms

| Symptom | Common Surface | Likely Identity Type or Cause |
|---|---|---|
| HTTP 401 Unauthorized from Fabric REST API | Pipeline activities, fabric-cicd deploys, Power BI refresh | SP secret expired, MI not granted, token expired |
| HTTP 403 Forbidden with AuthorizationFailed | Lakehouse query, OneLake shortcut, Key Vault read | Workspace Identity missing role, OneLake DAR misconfigured |
| AADSTS50076 MFA challenge during non-interactive job | Scheduled pipelines, automated refresh | User principal used where SP/MI required |
| AADSTS700016 / AADSTS7000215 invalid client / invalid secret | CI/CD job, fabric-cicd-deploy.py | SP secret rotated, expired, or wrong client_id |
| AADSTS50105 user does not have access to the application | Newly onboarded analyst, guest user | Conditional Access scope, group membership |
| AADSTS53003 access blocked by Conditional Access | Sudden mass-failure across one location | CA policy change, IP firewall, country block |
| AADSTS500011 resource principal not found in tenant | Cross-tenant guest, B2B collaboration | Guest user removed, tenant federation broken |
| Token lifetime exceeded / expired_token | Long-running Spark job, streaming consumer | OAuth refresh token expired |
| Tenant IP firewall: client IP not allowed | Remote employee, new VPN egress | Workspace IP firewall rules |
| Mass user lockout: many users get 401 simultaneously | All consumers of a workspace | Tenant-level CA change, secret-rotation cascade, region outage |
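
The symptom rows above lend themselves to a quick lookup during triage. The following is a minimal Python sketch, assuming the error text contains the raw AADSTS code. The `AADSTS_HINTS` mapping and the `triage_hint` helper are hypothetical names introduced for illustration, and the hints are condensed from the table rather than taken from any official error catalogue.

```python
import re

# Hints condensed from the Symptoms table; extend as new codes appear in incidents.
AADSTS_HINTS = {
    "AADSTS50076": "MFA challenge in a non-interactive job: user principal used where SP/MI required",
    "AADSTS700016": "Invalid client: check for a wrong client_id",
    "AADSTS7000215": "Invalid secret: SP secret rotated or expired",
    "AADSTS50105": "User lacks access to the application: check CA scope and group membership",
    "AADSTS53003": "Blocked by Conditional Access: check CA policy change, IP firewall, country block",
    "AADSTS500011": "Resource principal not found in tenant: check guest state and tenant federation",
}

_CODE_RE = re.compile(r"AADSTS\d+")


def triage_hint(error_text):
    """Return the hint for the first known AADSTS code found in error_text."""
    for code in _CODE_RE.findall(error_text):
        if code in AADSTS_HINTS:
            return f"{code}: {AADSTS_HINTS[code]}"
    return "No known AADSTS code found; fall back to the identity-type triage matrix"


if __name__ == "__main__":
    print(triage_hint("AADSTS7000215: Invalid client secret provided."))
```

An unknown code deliberately falls through to the matrix instead of guessing, so the lookup never hides a new failure class.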

Triage rule: if >10 distinct UPNs fail within 5 minutes, classify SEV1 and check the M365 Service Health Dashboard and Entra ID status page before assuming a Fabric-side issue.
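
The triage rule can be checked mechanically once failed sign-ins are exported. The sketch below assumes the failures are available as `(timestamp, UPN)` pairs (for example, from a sign-in log export); `is_mass_failure` is a hypothetical helper, and the threshold and window simply mirror the rule above.

```python
from datetime import datetime, timedelta


def is_mass_failure(failures, threshold=10, window=timedelta(minutes=5)):
    """Return True if more than `threshold` distinct UPNs failed within any `window`.

    failures: iterable of (timestamp: datetime, upn: str) for failed sign-ins.
    """
    events = sorted(failures)  # order by timestamp
    start = 0
    counts = {}  # upn -> number of events currently inside the sliding window
    for ts, upn in events:
        counts[upn] = counts.get(upn, 0) + 1
        # Evict events older than the window that ends at this event
        while ts - events[start][0] > window:
            old_upn = events[start][1]
            counts[old_upn] -= 1
            if counts[old_upn] == 0:
                del counts[old_upn]
            start += 1
        if len(counts) > threshold:
            return True
    return False


if __name__ == "__main__":
    base = datetime(2026, 4, 27, 12, 0, 0)
    burst = [(base + timedelta(seconds=10 * i), f"user{i}@contoso.com") for i in range(11)]
    print(is_mass_failure(burst))
```

Counting distinct UPNs, not raw failures, keeps one retrying client from triggering a SEV1 page.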


Severity Classification

See the master Severity Matrix for SLAs.

| Severity | Trigger | Examples |
|---|---|---|
| SEV1 | Sudden mass auth failure (>10 UPNs in 5 min); production pipeline halted; compliance reporting blocked; tenant-wide CA misconfiguration | CA policy rolled out tenant-wide blocks all SPs; Workspace Identity disabled in error; KV secret rotated without updating SP binding |
| SEV2 | Single workspace down; primary CI/CD SPN broken; one production pipeline failing >30 min on auth | fabric-cicd SPN secret expired; Workspace Identity lost role assignment after RG redeploy; IP firewall block of CI runner range |
| SEV3 | Single user 403; one guest user can't access; non-prod identity issue | Analyst lacks workspace Viewer role; guest user not refreshed since reseed; dev-environment SPN missing |
| SEV4 | Documentation/UX issue; no production impact | Stale runbook link; non-blocking warning in sign-in logs |

Identity-Type Triage Matrix

Use this matrix as your first move. The identity type drives every downstream step.

| Identity Type | Where Used | Typical Failure Modes | First Diagnostic |
|---|---|---|---|
| Workspace Identity (GA 2026) | Lakehouse → ADLS shortcut, notebook → Key Vault, OneLake DAR | Role assignment missing on target resource; identity disabled at workspace level | List role assignments on principalId from infra/modules/security/workspace-identity.bicep outputs |
| Service Principal (CI/CD) | fabric-cicd-deploy.py, GitHub Actions, ADO pipelines | Secret expired/rotated; client_id wrong; lost workspace Contributor role; CA policy excluded SP | Check Entra app registration → Certificates & secrets; check workspace role assignments |
| System-Assigned Managed Identity | Azure Function calling Fabric REST API | Function app MI not granted Fabric workspace access; resource lacks system-assigned enabled | az resource show ... --query identity |
| User-Assigned Managed Identity | Azure VM, Logic App, Container App | Role assignment missing on target Fabric workspace or downstream Azure resource | az identity show + az role assignment list --assignee <principalId> |
| User OAuth (Entra ID) | Power BI Desktop, Fabric portal, interactive notebook | Token expired; MFA challenge in headless context; CA blocks device | Refresh credentials; inspect token claims at jwt.ms |
| Conditional Access | All identities | Policy change blocks legitimate access; missing exclusion; geo-block | List CA policies via Graph API; check sign-in log conditionalAccessStatus |
| IP Firewall | Workspace, tenant | New VPN egress IP not allowlisted; firewall rules drift | Compare api.ipify.org to firewall ranges |
| Cross-Tenant Guest (B2B) | External auditors, partners (e.g., NIGC, regulators) | Guest user removed; cross-tenant access settings changed; consent revoked | Check guest user state in target tenant; verify cross-tenant access policy |

Triage Decision Tree

flowchart TD
    Start([Auth Failure Reported]) --> Q1{How many<br/>identities affected?}
    Q1 -->|>10 UPNs in 5 min| SEV1[SEV1 - page on-call IC<br/>+ Identity team]
    Q1 -->|Single user/SP| Q2{Identity type?}

    Q2 -->|Workspace Identity| WI[Check role assignment<br/>+ workspace identity enabled]
    Q2 -->|Service Principal| SP[Check secret expiry<br/>+ workspace role + CA exclusion]
    Q2 -->|Managed Identity| MI[Check identity enabled<br/>+ Fabric role grant]
    Q2 -->|User OAuth| User[Check group membership<br/>+ MFA + CA + IP firewall]
    Q2 -->|Cross-Tenant Guest| Guest[Check guest state<br/>+ cross-tenant access policy]

    SEV1 --> Diag[Phase 1 - Detect & Triage]
    WI --> Diag
    SP --> Diag
    MI --> Diag
    User --> Diag
    Guest --> Diag

    Diag --> Resolve[Apply Resolution Procedure]
    Resolve --> Verify[Run Probe Query]
    Verify -->|Pass| Close[Close incident, schedule PIR]
    Verify -->|Fail| Escalate[Escalate to Identity team]
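
The first two branches of the flowchart can be expressed as a small routing function, which is handy for an alert enrichment step or a chat-ops bot. This is only a sketch: the identity-type keys and the `first_move` helper are hypothetical names, while the check lists are copied from the flowchart nodes.

```python
# Checks per identity type, copied from the decision tree nodes.
CHECKS_BY_IDENTITY = {
    "workspace_identity": ["role assignment", "workspace identity enabled"],
    "service_principal": ["secret expiry", "workspace role", "CA exclusion"],
    "managed_identity": ["identity enabled", "Fabric role grant"],
    "user_oauth": ["group membership", "MFA", "CA", "IP firewall"],
    "cross_tenant_guest": ["guest state", "cross-tenant access policy"],
}


def first_move(distinct_upns_5min, identity_type):
    """Route an auth failure report following the triage decision tree."""
    if distinct_upns_5min > 10:
        # Mass failure: page first, identity-specific checks come later
        return {"severity": "SEV1", "action": "page on-call IC + Identity team", "checks": []}
    if identity_type not in CHECKS_BY_IDENTITY:
        raise ValueError(f"unknown identity type: {identity_type}")
    return {
        "severity": None,  # classify using the severity table
        "action": "run identity-specific checks",
        "checks": CHECKS_BY_IDENTITY[identity_type],
    }


if __name__ == "__main__":
    print(first_move(1, "service_principal"))
```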

Diagnostic Steps

1. Pull Entra Sign-In Logs (KQL)

// Sign-ins for the failing identity in last 1 hour
SigninLogs
| where TimeGenerated > ago(1h)
| where UserPrincipalName == "spn-fabric-cicd-prod@contoso.onmicrosoft.com"
   or AppDisplayName has "Fabric"
| project TimeGenerated, UserPrincipalName, AppDisplayName, ResourceDisplayName,
          ResultType, ResultDescription, IPAddress, ConditionalAccessStatus
| order by TimeGenerated desc | take 100

// Mass-failure detection in last 15 min
SigninLogs
| where TimeGenerated > ago(15m) and ResultType != 0
| summarize Failures = count(), Sample = any(ResultDescription)
        by UserPrincipalName, AppDisplayName
| where Failures > 3 | order by Failures desc

// Service principal sign-ins (separate table)
AADServicePrincipalSignInLogs
| where TimeGenerated > ago(1h) and ResultType != 0
| project TimeGenerated, ServicePrincipalName, AppId, ResultType, ResultDescription, IPAddress

2. Inspect Token Claims

For interactive debugging, copy a token (never paste production tokens publicly) and decode at https://jwt.ms. Verify:

  • aud (audience) matches https://api.fabric.microsoft.com or https://analysis.windows.net/powerbi/api
  • appid matches expected Service Principal client_id
  • oid (object id) matches expected principalId
  • roles / wids contain expected app roles
  • exp is in the future (not expired)
  • tid (tenant id) is correct tenant
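
# Acquire and decode token via Azure CLI (dev only - never run in prod terminals)
# JWT segments are base64url without padding, so convert the alphabet and re-pad before decoding
PAYLOAD=$(az account get-access-token --resource https://api.fabric.microsoft.com \
  --query accessToken -o tsv | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
echo "$PAYLOAD" | base64 -d | jq .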
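
The claim checks listed above can also be scripted. The sketch below decodes the payload without verifying the signature (adequate for debugging, not for authorization decisions) and reports mismatches. The helper names `decode_jwt_payload` and `claim_problems` are hypothetical, and the expected tenant and app IDs are supplied by the caller.

```python
import base64
import json
import time

# Audiences named in the checklist above
EXPECTED_AUDIENCES = {
    "https://api.fabric.microsoft.com",
    "https://analysis.windows.net/powerbi/api",
}


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def claim_problems(claims, expected_tid, expected_appid=None, now=None):
    """Return a list of human-readable problems found in the token claims."""
    now = time.time() if now is None else now
    problems = []
    if str(claims.get("aud", "")).rstrip("/") not in EXPECTED_AUDIENCES:
        problems.append("unexpected audience")
    if claims.get("exp", 0) <= now:
        problems.append("token expired")
    if claims.get("tid") != expected_tid:
        problems.append("wrong tenant")
    if expected_appid is not None and claims.get("appid") != expected_appid:
        problems.append("wrong appid")
    return problems
```

An empty list means the checked claims look sane; role claims (`roles` / `wids`) still need a manual look because the expected values depend on the app registration.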

3. Validate Workspace Identity Role Assignments

# Resolve the workspace identity principalId from Bicep outputs
PRINCIPAL_ID=$(az identity show \
  --name "id-fabric-ws-fabricpoc-prod" \
  --resource-group "rg-fabric-prod" \
  --query principalId -o tsv)

# List all role assignments for this identity
az role assignment list \
  --assignee "$PRINCIPAL_ID" \
  --all \
  --output table

# Verify expected scopes per workspace-identity.bicep:
#   - Key Vault Secrets User (4633458b-17de-408a-b874-0445c86b69e6)
#   - Storage Blob Data Contributor (ba92f5b4-2d11-453d-a403-e96b0029c9fe)
#   - Purview Data Curator (af8bf84c-4de3-462a-b576-41e6c7478f52) [optional]
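
Comparing the listing against the expected roles by eye is error-prone during an incident. A minimal sketch, assuming the assignments were exported with `--output json` and that each entry carries a `roleDefinitionId` ending in the role GUID; `missing_roles` is a hypothetical helper and only the two non-optional roles above are treated as required.

```python
import json

# Required roles from the comment block above (the Purview role is optional, so it is omitted)
EXPECTED_ROLE_IDS = {
    "4633458b-17de-408a-b874-0445c86b69e6": "Key Vault Secrets User",
    "ba92f5b4-2d11-453d-a403-e96b0029c9fe": "Storage Blob Data Contributor",
}


def missing_roles(assignments_json):
    """Return the names of expected roles absent from the role assignment listing."""
    assignments = json.loads(assignments_json)
    # roleDefinitionId is a full resource path; the GUID is its last segment
    granted = {a["roleDefinitionId"].rsplit("/", 1)[-1].lower() for a in assignments}
    return sorted(name for role_id, name in EXPECTED_ROLE_IDS.items() if role_id not in granted)


if __name__ == "__main__":
    import sys
    # Usage: az role assignment list --assignee "$PRINCIPAL_ID" --all -o json | python check_roles.py
    print(missing_roles(sys.stdin.read()))
```

This checks role presence only; it does not confirm that each role is granted at the intended scope.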

4. Verify Conditional Access Policies via Graph API

# List enabled CA policies
az rest --method get \
  --url "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies" \
  --query "value[?state=='enabled'].{name:displayName, id:id}" -o table

# Inspect a specific policy
az rest --method get \
  --url "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies/${POLICY_ID}"

# Find CA-blocked sign-ins
az rest --method get \
  --url "https://graph.microsoft.com/v1.0/auditLogs/signIns?\$filter=conditionalAccessStatus eq 'failure'&\$top=20"

5. Check IP Firewall Rules for the Workspace

# Resolve current public egress IP
CURRENT_IP=$(curl -s https://api.ipify.org)
echo "Current public IP: $CURRENT_IP"

# List workspace firewall rules
az rest --method get \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/firewallRules" \
  --query "value[].{name:name, start:startIpAddress, end:endIpAddress}" \
  --output table

# Test reachability (should resolve and accept TLS)
nslookup api.fabric.microsoft.com
# From Windows / SHIR machine
Test-NetConnection -ComputerName "api.fabric.microsoft.com" -Port 443
Test-NetConnection -ComputerName "login.microsoftonline.com" -Port 443
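
Once the egress IP and the rule list are in hand, the comparison itself is a range check. The sketch below assumes rules shaped like the query output above (`startIpAddress` / `endIpAddress`); `ip_allowed` is a hypothetical helper.

```python
import ipaddress


def ip_allowed(client_ip, rules):
    """Return True if client_ip falls inside any start/end range in rules."""
    ip = ipaddress.ip_address(client_ip)
    for rule in rules:
        start = ipaddress.ip_address(rule["startIpAddress"])
        end = ipaddress.ip_address(rule["endIpAddress"])
        if start.version != ip.version:
            continue  # IPv4 and IPv6 addresses are not comparable
        if start <= ip <= end:
            return True
    return False


if __name__ == "__main__":
    rules = [{"name": "example", "startIpAddress": "20.42.0.0", "endIpAddress": "20.42.0.255"}]
    print(ip_allowed("20.42.0.17", rules))
```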

6. Check Recent Identity Changes (Audit Log)

// Recent SPN credential changes + CA policy edits in last 24h
AuditLogs
| where TimeGenerated > ago(24h)
| where OperationName has_any (
    "service principal credentials",
    "Certificates and secrets management",
    "conditional access"
)
| project TimeGenerated, OperationName, InitiatedBy, Result, TargetResources
| order by TimeGenerated desc

Resolution Procedures

A. Rotate Service Principal Secret + Update Key Vault Binding

# 1. Create new SP secret (use --append to retain old secret during rollout)
NEW_SECRET=$(az ad app credential reset --id "${SPN_CLIENT_ID}" --append \
  --years 1 --display-name "rotated-$(date +%Y%m%d)" --query password -o tsv)

# 2. Push to Key Vault (Key Vault retains version history)
az keyvault secret set --vault-name "kv-fabric-prod" \
  --name "fabric-spn-client-secret" --value "$NEW_SECRET"

# 3. Restart cached consumers (Function/Logic apps), then re-trigger failed run
az functionapp restart --name "func-fabric-orchestrator" --resource-group "rg-fabric-prod"
gh workflow run deploy-fabric.yml

Critical: Do NOT delete the old secret until the new one is verified. Keep both active for 24h while cached consumers refresh.

B. Re-Grant Workspace Identity Role Assignments

If the workspace identity lost a role assignment (common after RG redeploy), re-run the Bicep module or grant directly:

# Direct grants matching workspace-identity.bicep role IDs
az role assignment create --assignee "$PRINCIPAL_ID" --role "Key Vault Secrets User" \
  --scope "/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KV_NAME}"

az role assignment create --assignee "$PRINCIPAL_ID" --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.Storage/storageAccounts/${STG_NAME}"

# Or re-deploy Bicep idempotently
az deployment group create --resource-group "$RG" \
  --template-file infra/modules/security/workspace-identity.bicep \
  --parameters projectPrefix=fabricpoc environment=prod \
      keyVaultId="/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KV_NAME}" \
      storageAccountId="/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.Storage/storageAccounts/${STG_NAME}"

C. Refresh OAuth Tokens (REST API)

For long-running consumers that ran past token expiry:

# Client credentials flow (SP)
curl -X POST "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token" \
  -d "client_id=${CLIENT_ID}" -d "client_secret=${CLIENT_SECRET}" \
  -d "scope=https://api.fabric.microsoft.com/.default" -d "grant_type=client_credentials"

# Refresh token flow (delegated user)
curl -X POST "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token" \
  -d "client_id=${CLIENT_ID}" -d "refresh_token=${REFRESH_TOKEN}" \
  -d "grant_type=refresh_token" -d "scope=https://api.fabric.microsoft.com/.default"

In notebooks, prefer Workspace Identity (no refresh needed) or mssparkutils.credentials.getToken() which auto-refreshes.
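
For long-running consumers outside notebooks, a common pattern is to cache the token and refresh it shortly before expiry. This is a sketch under stated assumptions: `TokenCache` is a hypothetical class, `fetch_token` is any callable the caller supplies that returns the token endpoint's JSON response (with `access_token` and `expires_in`), and the five-minute skew is an arbitrary safety margin.

```python
import time


class TokenCache:
    """Cache an access token and refresh it a little before it expires."""

    def __init__(self, fetch_token, skew_seconds=300, clock=time.time):
        self._fetch = fetch_token    # callable returning {"access_token": ..., "expires_in": ...}
        self._skew = skew_seconds    # refresh this many seconds before expiry
        self._clock = clock          # injectable for testing
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            response = self._fetch()
            self._token = response["access_token"]
            self._expires_at = now + int(response["expires_in"])
        return self._token
```

A streaming consumer would call `cache.get()` before each request and never hold the raw token string across iterations.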

D. Update Conditional Access (with Approval)

CHANGE-CONTROL RULE: Modifying CA policies requires approval from the Identity team Change Advisory Board (CAB). For SEV1, get verbal approval from on-call Identity lead and document in the incident channel; file a retroactive RFC within 24h.

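# Add an exclusion for the impacted SP (preferred over disabling the policy).
# Service principal exclusions live under conditions.clientApplications, not conditions.users;
# to exclude a group instead, use conditions.users.excludeGroups.
# PATCH replaces the whole array, so include any existing exclusions from the policy snapshot.
az rest --method patch \
  --url "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies/${POLICY_ID}" \
  --body '{
    "conditions": {
      "clientApplications": {
        "excludeServicePrincipals": ["'"${SPN_OBJECT_ID}"'"]
      }
    }
  }'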
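
Because a PATCH overwrites array properties, an exclusion added in a hurry can silently drop the exclusions already on the policy. The sketch below builds the request body from a policy snapshot (the JSON returned by a GET on the policy) so existing entries are preserved. `build_exclusion_patch` is a hypothetical helper, and it assumes service principal exclusions sit under `conditions.clientApplications` in the Graph schema; confirm this against the snapshot before sending.

```python
import copy
import json


def build_exclusion_patch(policy_snapshot, sp_object_id):
    """Return a PATCH body that adds sp_object_id while keeping existing exclusions."""
    conditions = policy_snapshot.get("conditions") or {}
    # Copy so the snapshot (which doubles as the rollback backup) is never mutated
    client_apps = copy.deepcopy(conditions.get("clientApplications") or {})
    excluded = list(client_apps.get("excludeServicePrincipals") or [])
    if sp_object_id not in excluded:
        excluded.append(sp_object_id)
    client_apps["excludeServicePrincipals"] = excluded
    return {"conditions": {"clientApplications": client_apps}}


if __name__ == "__main__":
    snapshot = {"conditions": {"clientApplications": {"excludeServicePrincipals": ["sp-a"]}}}
    print(json.dumps(build_exclusion_patch(snapshot, "sp-b"), indent=2))
```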

E. Whitelist IP Range (Workspace IP Firewall)

# Add allowed IP range to the workspace
az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/firewallRules" \
  --body '{
    "name": "GHA-runner-egress-2026-04-27",
    "startIpAddress": "20.42.0.0",
    "endIpAddress": "20.42.0.255"
  }'

Reference IP firewall pattern: see network-security.md §IP Firewall.

F. Re-Grant Cross-Tenant Guest Access

# Re-invite the guest user (idempotent - re-issues invitation if accepted state lost)
az rest --method post \
  --url "https://graph.microsoft.com/v1.0/invitations" \
  --body '{
    "invitedUserEmailAddress": "auditor@partner-tenant.com",
    "inviteRedirectUrl": "https://app.fabric.microsoft.com",
    "sendInvitationMessage": true
  }'

# Add to the appropriate Fabric security group
az ad group member add \
  --group "sg-fabric-casino-prod-viewers" \
  --member-id "${GUEST_OBJECT_ID}"

# Verify cross-tenant access policy allows inbound from partner tenant
az rest --method get \
  --url "https://graph.microsoft.com/v1.0/policies/crossTenantAccessPolicy/partners/${PARTNER_TENANT_ID}"

Verification

After applying any resolution, run a probe as the failing identity before declaring resolved.

Probe 1 - Token + Workspace Read

TOKEN=$(curl -s -X POST "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token" \
  -d "client_id=${CLIENT_ID}" -d "client_secret=${NEW_SECRET}" \
  -d "scope=https://api.fabric.microsoft.com/.default" -d "grant_type=client_credentials" \
  | jq -r .access_token)

curl -s -o /dev/null -w "HTTP %{http_code}\n" \
  -H "Authorization: Bearer $TOKEN" \
  "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}"

Probe 2 - Lakehouse Query (Workspace Identity scenarios)

# Run inside the affected notebook as the workspace identity
spark.sql("SELECT COUNT(*) AS row_count FROM lh_bronze.bronze_slot_telemetry").show()

Probe 3 - Pipeline Smoke Test

az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances?jobType=Pipeline"

Verification Checklist

  • Probes 1-3 all pass for the affected identity
  • Sign-in logs show ResultType == 0 for the identity since the fix
  • No new 401/403 alerts firing; affected pipelines complete end-to-end
  • Monitor for 2× the incident duration before closing (per Phase 3 verification)
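
The last checklist item is simple arithmetic, but it is easy to get wrong at 3 a.m. A small sketch, assuming the monitoring window starts when the fix is applied; `earliest_close_time` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def earliest_close_time(started_at, resolved_at, multiplier=2):
    """Return the earliest time the incident may be closed.

    The monitoring window is `multiplier` times the incident duration,
    counted from the moment the fix was applied (an assumption).
    """
    if resolved_at < started_at:
        raise ValueError("resolved_at must not precede started_at")
    return resolved_at + multiplier * (resolved_at - started_at)


if __name__ == "__main__":
    start = datetime(2026, 4, 27, 12, 0)
    fixed = datetime(2026, 4, 27, 12, 45)
    print(earliest_close_time(start, fixed))
```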

Rollback

If the fix made things worse, roll back fast.

| Change | Rollback Step |
|---|---|
| New SP secret pushed to KV | Re-set the previous value (Key Vault keeps version history): find the prior version with az keyvault secret list-versions, read it with az keyvault secret show --version, then write it back with az keyvault secret set. Note that az keyvault secret restore only restores from a backup file created with az keyvault secret backup. |
| CA policy modified | Revert via az rest --method patch with prior policy JSON (always snapshot policy before edit: az rest --method get ... > backup-${POLICY_ID}-$(date +%s).json) |
| IP firewall rule added | az rest --method delete --url ".../firewallRules/${RULE_ID}" |
| Role assignment added | az role assignment delete --assignee ${PRINCIPAL_ID} --role ${ROLE} --scope ${SCOPE} |
| Guest user re-invited | Remove from group: az ad group member remove --group ${GROUP} --member-id ${GUEST_OBJECT_ID} |
| Bicep redeploy of workspace-identity | Re-deploy from prior commit: git checkout ${PRIOR_SHA} -- infra/modules/security/workspace-identity.bicep && az deployment group create ... |

Always snapshot before changing identity config. A 30-second az rest --method get > backup.json saves hours of recovery.
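
When identity changes are scripted, the snapshot can be taken by the same script that applies the change. A minimal sketch: `snapshot_config` is a hypothetical helper, the configuration is whatever JSON the GET call returned, and the file naming mirrors the `backup-${POLICY_ID}-$(date +%s).json` convention used above.

```python
import json
import time
from pathlib import Path


def snapshot_config(name, config, directory=".", now=None):
    """Write config to backup-<name>-<epoch>.json and return the file path."""
    timestamp = int(time.time() if now is None else now)
    path = Path(directory) / f"backup-{name}-{timestamp}.json"
    # sort_keys gives stable output, which makes before/after diffs readable
    path.write_text(json.dumps(config, indent=2, sort_keys=True), encoding="utf-8")
    return path


if __name__ == "__main__":
    print(snapshot_config("example-policy", {"state": "enabled"}))
```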


Post-Incident Actions

| Action | Owner | Due |
|---|---|---|
| File postmortem using template | Incident Commander | 48 hr (SEV1/2) |
| Set up secret-expiry alerts in Action Group (90/30/7-day warnings) | Platform Lead | 1 week |
| Automate SP rotation via Key Vault rotation policy or Azure Automation runbook | Identity Team | 30 days |
| Audit-log forensics: confirm no unauthorized access during outage window | Security | 5 business days |
| Update identity-rbac-patterns.md if a new pattern emerged | Doc owner | 5 business days |
| If recurring class: promote action items to P0 (per PIR rules) | IC | At PIR |
| Verify Workspace Identity preferred over SP for all in-Fabric workloads | Platform Lead | Quarterly review |
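
The 90/30/7-day warning tiers can be prototyped before wiring up a real alert. The sketch below classifies a secret by its expiry timestamp; `expiry_warning` is a hypothetical helper, and it assumes an ISO 8601 expiry string with a `Z` or explicit UTC offset (such as the `endDateTime` of an app credential). Timestamps with more than six fractional-second digits may need trimming on older Python versions.

```python
from datetime import datetime, timezone

WARNING_DAYS = (7, 30, 90)  # tightest tier first


def expiry_warning(end_date_time, now=None):
    """Return "expired", "7-day", "30-day", "90-day", or None if no warning applies."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromisoformat(end_date_time.replace("Z", "+00:00"))
    remaining_days = (expires - now).days  # negative once the secret has expired
    if remaining_days < 0:
        return "expired"
    for tier in WARNING_DAYS:
        if remaining_days <= tier:
            return f"{tier}-day"
    return None


if __name__ == "__main__":
    print(expiry_warning("2026-05-01T00:00:00Z", now=datetime(2026, 4, 27, tzinfo=timezone.utc)))
```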

Long-Term Hardening

  • Federated credentials over secrets: replace SP client secrets with OIDC federation for GitHub Actions (workflow_identity setting); zero-secrets is the goal.
  • Workspace Identity first: for any in-Fabric workload, always prefer Workspace Identity (infra/modules/security/workspace-identity.bicep) over SPs.
  • Conditional Access staging: never roll out a new CA policy tenant-wide without a "report-only" pilot phase first.
  • Secret rotation cadence: enforce 90-day max via tenant policy; alert at 30 days remaining.

Escalation

| When | Escalate To | Channel |
|---|---|---|
| Mass auth failure (>10 UPNs in 5 min) | Identity Team on-call + Incident Commander | Page (SEV1) |
| CA policy needs emergency change | Identity CAB lead | Phone bridge |
| Suspected credential compromise / unauthorized access | Security Operations Center (SOC) | SOC paging line + email security@ |
| Cross-tenant guest issue persists after re-invite | Partner tenant identity admin | Email coordinated by Comms Lead |
| Tenant-wide Entra outage | Microsoft Premier Support (Sev A ticket) | Support portal + reference Service Health incident ID |
| Workspace Identity not propagating roles after 30 min | Microsoft Fabric support (Sev A) | Support portal |
| HIPAA / SOX / NIGC compliance impact during outage | Compliance Officer | Phone + email per comm tree |

Quick-Reference Commands

Azure CLI - Create Service Principal + Role Assignment

# Create app registration + SP
APP=$(az ad app create --display-name "spn-fabric-cicd-prod" --query appId -o tsv)
az ad sp create --id "$APP"

# Add 1-year secret (capture immediately; it is shown only once)
SECRET=$(az ad app credential reset --id "$APP" --years 1 --query password -o tsv)

# Store in Key Vault
az keyvault secret set --vault-name "kv-fabric-prod" --name "fabric-spn-cicd-secret" --value "$SECRET"

# Grant Fabric workspace Contributor role (via Fabric REST)
PRINCIPAL_ID=$(az ad sp show --id "$APP" --query id -o tsv)
az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/roleAssignments" \
  --body "{\"principal\":{\"id\":\"${PRINCIPAL_ID}\",\"type\":\"ServicePrincipal\"},\"role\":\"Contributor\"}"

Log Analytics KQL - Sign-In Investigation

// Last hour failures grouped by ResultType (error code)
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType != 0
| summarize Count = count(), Sample = any(ResultDescription) by ResultType
| order by Count desc

// CA-blocked sign-ins by location
SigninLogs
| where TimeGenerated > ago(24h)
| where ConditionalAccessStatus == "failure"
| summarize Failures = count() by Location, City = tostring(LocationDetails.city)
| order by Failures desc

Bicep - Grant Workspace Identity Role (snippet)

// See infra/modules/security/workspace-identity.bicep for the full module
resource grant 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(targetResource.id, workspaceIdentity.id, roleId)
  scope: targetResource
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', roleId)
    principalId: workspaceIdentity.properties.principalId
    principalType: 'ServicePrincipal'
  }
}

PowerShell - IP Firewall Validation

# Validate connectivity from the failing host
Test-NetConnection -ComputerName "api.fabric.microsoft.com" -Port 443
Test-NetConnection -ComputerName "login.microsoftonline.com" -Port 443
Test-NetConnection -ComputerName "onelake.dfs.fabric.microsoft.com" -Port 443

# Get current public egress IP (compare to firewall allowlist)
(Invoke-RestMethod "https://api.ipify.org?format=json").ip

REST - Probe Token + Workspace Read

TOKEN=$(az account get-access-token --resource https://api.fabric.microsoft.com --query accessToken -o tsv)

curl -s -H "Authorization: Bearer $TOKEN" \
  "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}" | jq .

Related Runbooks

| Runbook | When to Use |
|---|---|
| Incident Response Template | Anchor: severity matrix, comms tree, postmortem template |
| Pipeline Failure Triage | If auth failure surfaced as pipeline activity error |
| Capacity Throttling Response | If 401/403 mistakenly tied to capacity issues |
| Tenant Migration (Dev/Staging/Prod) | If failure followed a recent tenant promotion |
| Multi-Region Failover | If region-wide Entra ID outage suspected |

| Document | Description |
|---|---|
| Identity & RBAC Patterns | Workspace roles, RLS/CLS/OLS, service identity selection |
| Network Security | IP Firewall, Conditional Access integration, private endpoints |
| OneLake Security | Data Access Roles, OneLake-level RLS/CLS |
| Customer-Managed Keys | KV-bound encryption keys; auth-related failures when KV access lost |
| SQL Audit Logs Compliance | Audit trail for compliance investigation post-incident |
| fabric-cicd Deployment | SP-based CI/CD patterns; secret management |

โฌ†๏ธ Back to Top | ๐Ÿ“š Runbooks Index | ๐Ÿ  Home