
AMG Storage Account Health Check

by @1w2w3y

Run only when the user explicitly asks for a fleet-wide Azure Storage Account health check — pulse check for availability, latency, transactions, and error r...

Version: v1.0.0
Downloads: 220
Install:
clawhub install amg-check-storage-account

📖 About This Skill


name: amg-check-storage-account
description: Run only when the user explicitly asks for a fleet-wide Azure Storage Account health check - pulse check for availability, latency, transactions, and error rates across all accounts, then deep-dives into the top 7 most interesting accounts with metrics (E2E latency, server latency, capacity, ingress/egress) and resource logs. Tracks known issues across sessions via persistent report. On first run, auto-discovers datasource UID and prompts for subscription ID.
argument-hint: "[time-range, e.g. 7d, 1d, 3d] [subscription-id]"
disable-model-invocation: true
effort: max
allowed-tools: mcp__amg__amgmcp_pulse_check mcp__amg__amgmcp_query_resource_graph mcp__amg__amgmcp_query_resource_metric mcp__amg__amgmcp_query_resource_metric_definition mcp__amg__amgmcp_query_resource_log mcp__amg__amgmcp_datasource_list mcp__amg__amgmcp_query_activity_log Bash(node *) Glob Read Write Edit

OpenClaw Setup (one-time)

This skill calls MCP tools prefixed with mcp__amg__*, so OpenClaw must have an MCP server registered under the exact name amg. Run this once per workspace before invoking the skill:

openclaw mcp set amg '{"url":"https:///api/azure-mcp","transport":"streamable-http","headers":{"Authorization":"Bearer "}}'

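In the command above, insert your Azure Managed Grafana endpoint as the host in the url value (between https:// and /api/azure-mcp), and put a valid Grafana service-account token (it starts with glsa_) after Bearer in the Authorization header. The server name must be amg: the skill's allowed-tools entries reference mcp__amg__* and will not find tools registered under any other name.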

Verify the server is registered:

openclaw mcp list

> Official skill source: https://github.com/Azure/amg-skills

Runtime Context

  • Current UTC time: !date -u +%Y-%m-%dT%H:%M:%SZ
  • Config: !cat memory/amg-check-storage-account/config.md 2>/dev/null || echo "NOT_CONFIGURED"
  • Prior report: ![ -f memory/amg-check-storage-account/report.md ] && echo "exists ($(grep -c '^### SA-' memory/amg-check-storage-account/report.md) bugs documented)" || echo "not found"
  • Arguments: time-range=$0, subscription-override=$1
    > Known Issues: Before presenting findings, cross-reference results against memory/amg-check-storage-account/report.md.

    Azure Storage Account Health Check

    Analyze Azure Storage Account health using a two-phase approach: a single amgmcp_pulse_check call for fleet-wide summary, followed by targeted deep dives into the top 7 most interesting accounts only.

    Critical Constraints

  • Do NOT use subagents (Agent tool) for MCP queries. Subagents do not have access to MCP tools. All MCP tool calls must be made directly from the main conversation context.
  • Deep dive limit: at most 7 accounts. Select the most interesting accounts from pulse check results. Do not deep-dive the entire fleet.
  • Time format: ISO 8601 UTC with explicit from/to — NEVER use timespan (it causes errors).
  • UsedCapacity only supports interval PT1H — do NOT use PT6H or P1D for this metric.
  • Parallelism cap: 30 concurrent MCP calls per batch. Reduce to 4-5 if rate-limited.

    Prerequisites

  • An AMG-MCP server must be connected (the allowed-tools frontmatter references the MCP server name — update it if your server has a different name)
  • The MCP server's Grafana service account token environment variable must be set

    Configuration

    If Config shows NOT_CONFIGURED: Run First-Run Setup at the bottom of this file, then return here.

    If Config is populated: Extract the datasource UID and subscription ID(s) from the pre-loaded Runtime Context above and use them for all queries. Use $1 as the subscription override if provided.

  • Datasource UID: from ## Azure Monitor Datasource > UID
  • Subscription ID(s): from ## Subscriptions (or $1 if provided)
  • Resource Type: microsoft.storage/storageaccounts (lowercase)
  • ARM ID template: /subscriptions/{SUB}/resourceGroups/{RG}/providers/Microsoft.Storage/storageAccounts/{name}
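
The extraction described above can be sketched in Python. This is a minimal sketch, assuming config.md uses "## " section headings and one list item per line as in the template written by First-Run Setup; `parse_config` and `resolve_subscriptions` are hypothetical helper names, not part of the skill.

```python
import re
from typing import Optional

def parse_config(text: str) -> dict:
    """Pull the datasource UID and subscription IDs out of config.md text."""
    uid = None
    subscriptions = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].strip()  # remember which section we are in
            continue
        item = re.match(r"^[-*•]\s+(.*)$", line)  # accept common bullet markers
        if not item:
            continue
        value = item.group(1).strip()
        if section == "Azure Monitor Datasource" and value.startswith("UID:"):
            uid = value[len("UID:"):].strip()
        elif section == "Subscriptions":
            subscriptions.append(value)
    return {"datasource_uid": uid, "subscriptions": subscriptions}

def resolve_subscriptions(config: dict, override: Optional[str] = None) -> list:
    """The $1 override, when provided, replaces the configured subscriptions."""
    return [override] if override else list(config["subscriptions"])
```

A missing UID comes back as `None`, which a caller could treat the same way as NOT_CONFIGURED.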

    Time Range

    Default is 7 days (pastDays: 7) for pulse check and deep-dive metrics, 24 hours for resource logs. If the user specifies a different range via $ARGUMENTS[0] (e.g., /amg-check-storage-account 3d), adjust accordingly. For resource log queries, keep the range narrow (1-2 days) to avoid timeouts.
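
As an illustration of the time handling above, the sketch below turns an argument such as 3d into a day count and an explicit ISO 8601 UTC from/to pair. It assumes the argument always has the form of a number followed by d, as in the argument hint; `parse_time_range` and `iso_window` are hypothetical names.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_time_range(arg, default_days=7):
    """Return the number of days for an argument like '3d'; empty means default."""
    if not arg:
        return default_days
    match = re.fullmatch(r"(\d+)d", arg.strip().lower())
    if not match or int(match.group(1)) < 1:
        raise ValueError(f"unsupported time range: {arg!r}")
    return int(match.group(1))

def iso_window(days, now=None):
    """Return (from, to) as ISO 8601 UTC strings covering the last `days` days."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), now.strftime(fmt)
```

For example, `iso_window(parse_time_range("3d"))` yields the explicit from/to values that the metric queries expect in place of a timespan.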


    Workflow

    Phase 1: Validate Datasource & Discover Accounts

    Step 1a: Validate Datasource

    Call amgmcp_datasource_list with no parameters.

    Search the results for a datasource with type equal to grafana-azure-monitor-datasource. Extract its uid.

  • If it matches the configured UID, proceed.
  • If it differs, update memory/amg-check-storage-account/config.md, warn the user, and use the new UID.
  • If no Azure Monitor datasource is found, abort with a clear error.

    Step 1b: Discover All Storage Accounts

    Call amgmcp_query_resource_graph once using the configured datasource UID and subscription ID(s):

    azureMonitorDatasourceUid: {DATASOURCE_UID}
    query: |
      resources
      | where type == 'microsoft.storage/storageaccounts'
      | where subscriptionId in ({SUBSCRIPTION_IDS})
      | project name, resourceGroup, location, subscriptionId, properties.provisioningState
      | order by location asc, name asc
    

    Replace {SUBSCRIPTION_IDS} with the configured subscription IDs formatted as comma-separated quoted strings (e.g., 'sub-id-1', 'sub-id-2').
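
The substitution can be sketched as follows. This is only an illustration: `format_subscription_ids` and `build_discovery_query` are hypothetical helpers, and rejecting IDs that contain a quote is an added safety assumption rather than something the skill specifies.

```python
QUERY_TEMPLATE = """resources
| where type == 'microsoft.storage/storageaccounts'
| where subscriptionId in ({SUBSCRIPTION_IDS})
| project name, resourceGroup, location, subscriptionId, properties.provisioningState
| order by location asc, name asc"""

def format_subscription_ids(ids):
    """Render IDs as comma-separated single-quoted strings."""
    cleaned = [s.strip() for s in ids if s and s.strip()]
    if not cleaned:
        raise ValueError("at least one subscription ID is required")
    for sub in cleaned:
        if "'" in sub:
            raise ValueError(f"unexpected quote in subscription ID: {sub!r}")
    return ", ".join(f"'{sub}'" for sub in cleaned)

def build_discovery_query(ids):
    """Fill the {SUBSCRIPTION_IDS} placeholder in the Resource Graph query."""
    return QUERY_TEMPLATE.replace("{SUBSCRIPTION_IDS}", format_subscription_ids(ids))
```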

    Constructing the ARM resource ID: Use subscriptionId from each row:

    /subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.Storage/storageAccounts/{name}
    

    Region summary: Derive from the account list by counting accounts per unique location value.

    Note any accounts not in "Succeeded" provisioning state — flag them immediately.

    If zero accounts are found, report "No Storage Accounts found" and stop.
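
The post-processing of the inventory rows (ARM ID, region summary, provisioning flags) might look like the sketch below. It assumes each row is a dict keyed by the projected column names, and that the projected `properties.provisioningState` column arrives under the key `properties_provisioningState`; that key is an assumption, so it is exposed as a parameter.

```python
from collections import Counter

def arm_id(row):
    """Build the ARM resource ID from one Resource Graph row."""
    return (
        f"/subscriptions/{row['subscriptionId']}"
        f"/resourceGroups/{row['resourceGroup']}"
        f"/providers/Microsoft.Storage/storageAccounts/{row['name']}"
    )

def summarize_inventory(rows, state_key="properties_provisioningState"):
    """Return (accounts per region, names of accounts not in 'Succeeded' state)."""
    regions = Counter(row["location"] for row in rows)
    flagged = [row["name"] for row in rows if row.get(state_key) != "Succeeded"]
    return dict(regions), flagged
```

An empty `rows` list corresponds to the "No Storage Accounts found" case and should stop the workflow before any summary is built.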

    Phase 2: Fleet-Wide Pulse Check

    Call amgmcp_pulse_check once to get a summary across all storage accounts:

    azureMonitorDatasourceUid: {DATASOURCE_UID}
    pastDays: 7
    scenarios: storage_summary
    

    If $1 provides a subscription ID, add subscriptionId to scope the scan. Otherwise, if the config has a single subscription, pass it.

    After the pulse check, verify:

    1. The number of scanned resources is close to the Phase 1 account count.
    2. The scenario shows status: "completed".
    3. If errors occurred, retry once. If still failing, note the failure in the report.

    Cross-reference pulse check results with Phase 1 inventory to enrich each account with its resource group and region from the Resource Graph data.

    Phase 3: Top 7 Deep Dive

    From the pulse check results, select at most 7 accounts for detailed investigation. Prioritize accounts with the most interesting signals:

    1. Availability drops: any account below 99.9% availability
    2. Highest error transaction counts: accounts with the most non-Success responses
    3. Highest latency: accounts with the highest average E2E latency
    4. Unusual capacity or traffic patterns: sudden capacity jumps, abnormally high ingress/egress
    5. Diversity: prefer selecting accounts from different regions to maximize coverage

    If the pulse check shows fewer than 7 accounts with notable signals, only deep-dive those that have something worth investigating. Do not pad to 7.

    If the pulse check shows the entire fleet is healthy with no notable signals, skip Phase 3 entirely and report the fleet as healthy.
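
One way to approximate the selection rules above is sketched below. The shape of the pulse check output is not documented here, so the per-account fields (`availability`, `error_count`, `avg_e2e_latency_ms`) are hypothetical, and the 50 ms cut-off for "notable" latency is borrowed from the WARNING threshold as an assumption. The capacity/traffic and region-diversity criteria are not modelled and remain a judgment call.

```python
def is_notable(account, latency_ms_threshold=50.0):
    """An account is worth a deep dive if any primary signal is off."""
    return (
        account["availability"] < 99.9
        or account["error_count"] > 0
        or account["avg_e2e_latency_ms"] > latency_ms_threshold
    )

def select_for_deep_dive(accounts, limit=7):
    """Rank notable accounts and keep at most `limit`; never pad the list."""
    candidates = [a for a in accounts if is_notable(a)]
    candidates.sort(
        key=lambda a: (
            min(a["availability"], 99.9),  # availability drops first, worst first
            -a["error_count"],             # then the most error transactions
            -a["avg_e2e_latency_ms"],      # then the highest average E2E latency
        )
    )
    return candidates[:limit]
```

An empty result corresponds to the healthy-fleet case, where Phase 3 is skipped.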

    #### Step 3a: Deep Metrics

    For each selected account, query these metrics in parallel using amgmcp_query_resource_metric. Compute from (matching pastDays from Phase 2) and to (now) in ISO 8601 UTC.

    | Metric Name | Aggregation | Interval | Purpose |
    |-------------|-------------|----------|---------|
    | Availability | Average | PT6H | Availability trend |
    | SuccessE2ELatency | Average | PT6H | Client-perceived latency |
    | SuccessE2ELatency | Maximum | PT6H | Tail latency spikes |
    | SuccessServerLatency | Average | PT6H | Storage backend latency |
    | Transactions | Total | PT6H | Total volume (no filter) |
    | Transactions (errors) | Total | PT6H | Error count (filter: ResponseType ne 'Success') |
    | UsedCapacity | Average | PT1H | Current capacity (use last data point) |
    | Ingress | Total | PT6H | Data volume in |
    | Egress | Total | PT6H | Data volume out |

    All 7 accounts can be queried in parallel: 7 accounts × 9 metrics = 63 calls, which stays within the 30-call batch cap when split into 3 batches (30 + 30 + 3).
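
The batching arithmetic can be checked with a short sketch. The dictionary keys used for each planned call are illustrative only and are not the actual parameter names of amgmcp_query_resource_metric; `plan_calls` and `split_batches` are hypothetical helpers.

```python
from itertools import product

# (metric name, aggregation, interval, optional filter), one entry per table row.
METRICS = [
    ("Availability", "Average", "PT6H", None),
    ("SuccessE2ELatency", "Average", "PT6H", None),
    ("SuccessE2ELatency", "Maximum", "PT6H", None),
    ("SuccessServerLatency", "Average", "PT6H", None),
    ("Transactions", "Total", "PT6H", None),
    ("Transactions", "Total", "PT6H", "ResponseType ne 'Success'"),
    ("UsedCapacity", "Average", "PT1H", None),  # PT1H is the only supported interval
    ("Ingress", "Total", "PT6H", None),
    ("Egress", "Total", "PT6H", None),
]

def plan_calls(resource_ids):
    """One planned call per (account, metric) pair."""
    return [
        {"resourceId": rid, "metric": name, "aggregation": agg,
         "interval": interval, "filter": flt}
        for rid, (name, agg, interval, flt) in product(resource_ids, METRICS)
    ]

def split_batches(calls, batch_size=30):
    """Chunk the planned calls so no batch exceeds the parallelism cap."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [calls[i:i + batch_size] for i in range(0, len(calls), batch_size)]
```

If the server starts rate-limiting, the same plan can be re-chunked with `split_batches(calls, 5)`.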

    Correlation analysis — when analyzing metrics together:

  • High E2E latency + low server latency = network issue, not storage
  • High E2E latency + high server latency = storage backend issue (throttling, overloaded partition)
  • High error transactions + normal latency = client errors (auth, not-found), not performance
  • High error transactions + high latency = throttling (503) or backend degradation
  • Availability drop + high error transactions = sustained storage issue
  • High ingress + high latency = large writes saturating bandwidth
  • Dormant (all empty) = no traffic, investigate if expected or orphaned
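
The correlation rules above can be written as a small lookup. In this sketch the caller decides what counts as "high" for each signal and passes plain booleans; `correlate` is a hypothetical helper, and treating "high latency" as either E2E or server latency being high is an assumption.

```python
def correlate(*, high_e2e=False, high_server=False, high_errors=False,
              availability_drop=False, high_ingress=False, dormant=False):
    """Map combinations of signals to the interpretations listed above."""
    if dormant:
        return ["no traffic, investigate if expected or orphaned"]
    findings = []
    high_latency = high_e2e or high_server
    if high_e2e and not high_server:
        findings.append("network issue, not storage")
    if high_e2e and high_server:
        findings.append("storage backend issue (throttling, overloaded partition)")
    if high_errors and not high_latency:
        findings.append("client errors (auth, not-found), not performance")
    if high_errors and high_latency:
        findings.append("throttling (503) or backend degradation")
    if availability_drop and high_errors:
        findings.append("sustained storage issue")
    if high_ingress and high_latency:
        findings.append("large writes saturating bandwidth")
    return findings
```

Several interpretations can apply at once, so the function returns a list rather than a single verdict.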

    #### Step 3b: Resource Logs

    For each selected account, query Storage resource logs using amgmcp_query_resource_log. Keep time range to 1-2 days.

    Log Query 1: Failed requests by status code

    StorageBlobLogs
    | where TimeGenerated between (datetime() .. datetime())
    | where StatusCode >= 400
    | summarize count() by StatusCode, StatusText, bin(TimeGenerated, 1h)
    | order by TimeGenerated asc
    

    Log Query 2: High latency operations

    StorageBlobLogs
    | where TimeGenerated between (datetime() .. datetime())
    | where ServerLatencyMs > 100
    | summarize count(), avg(ServerLatencyMs), max(ServerLatencyMs) by OperationName
    | order by count_ desc
    | take 20
    

    Log Query 3: Error distribution by operation

    StorageBlobLogs
    | where TimeGenerated between (datetime() .. datetime())
    | where StatusCode >= 400
    | summarize count() by OperationName, StatusCode, StatusText
    | order by count_ desc
    | take 30
    

    Log Query 4: Request volume trend

    StorageBlobLogs
    | where TimeGenerated between (datetime() .. datetime())
    | summarize
        TotalRequests=count(),
        FailedRequests=countif(StatusCode >= 400),
        ThrottledRequests=countif(StatusCode == 503),
        AvgLatencyMs=round(avg(ServerLatencyMs), 2)
        by bin(TimeGenerated, 1h)
    | order by TimeGenerated asc
    

    Log Query 5: Authentication failures

    StorageBlobLogs
    | where TimeGenerated between (datetime() .. datetime())
    | where StatusCode == 403
    | summarize count() by AuthenticationType, CallerIpAddress, UserAgentHeader
    | order by count_ desc
    | take 20
    

    > Note: If StorageBlobLogs returns no data, diagnostic settings may not be configured. Try StorageTableLogs or StorageQueueLogs. If none return data, note it and skip logs for that account.


    Classification

    | Severity | Criteria |
    |----------|----------|
    | CRITICAL | Availability avg < 99.0% |
    | WARNING | Availability avg < 99.9%, OR SuccessE2ELatency avg > 50ms, OR sustained latency > 20ms for 6+ hours, OR sustained error transactions across multiple time windows |
    | DORMANT | All metrics return empty timeSeries (no traffic in scan period) |
    | HEALTHY | All metrics within normal ranges |
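
A rough sketch of these criteria in code follows. It makes several assumptions that the table leaves open: inputs are lists of PT6H data points, a single PT6H point above 20 ms counts as "sustained for 6+ hours", "multiple time windows" means at least two points with errors, and only three series are checked for the DORMANT case. `classify` is a hypothetical helper.

```python
from statistics import mean

def classify(availability, e2e_latency_ms, error_counts):
    """Return CRITICAL, WARNING, DORMANT, or HEALTHY for one account."""
    if not availability and not e2e_latency_ms and not error_counts:
        return "DORMANT"  # every series came back empty
    avail_avg = mean(availability) if availability else None
    latency_avg = mean(e2e_latency_ms) if e2e_latency_ms else None
    if avail_avg is not None and avail_avg < 99.0:
        return "CRITICAL"
    warning = (
        (avail_avg is not None and avail_avg < 99.9)
        or (latency_avg is not None and latency_avg > 50)
        or any(point > 20 for point in e2e_latency_ms)       # one 6-hour bucket over 20 ms
        or sum(1 for count in error_counts if count > 0) >= 2  # errors in 2+ windows
    )
    return "WARNING" if warning else "HEALTHY"
```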


    Analysis Guidance

    For known patterns, deep-dive queries, and correlation techniques, see reference/analysis-patterns.md.

    For optional deep-dive queries, see reference/deep-dive-queries.md.


    Output Format

    Present a summary report with these sections:

    1. Fleet Inventory

    Account count by region and subscription. Flag any accounts not in "Succeeded" provisioning state.

    2. Pulse Check Summary

    Fleet-wide summary from the storage_summary pulse check:

  • Total accounts scanned and scan status
  • Key fleet-wide metrics (availability, latency, error counts, capacity)
  • Breakdown by health status (healthy / warning / critical / dormant)
  • Which accounts were selected for deep dive and why each was chosen

    3. Deep Dive Findings (Top 7)

    For each selected account:

  • Metric timeline: Availability, E2E latency, server latency, transactions over the scan window
  • Anomaly characterization: spike vs sustained, onset time, duration, recovery
  • Correlation analysis: which metrics moved together
  • Error rate: error transactions / total transactions percentage
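
The error-rate figure is a simple ratio. The sketch below uses a hypothetical helper, `error_rate_pct`, and returns `None` when there were no transactions so that dormant accounts do not produce a division by zero.

```python
def error_rate_pct(error_transactions, total_transactions):
    """Error transactions as a percentage of all transactions, or None if idle."""
    if total_transactions <= 0:
        return None
    return round(100.0 * error_transactions / total_transactions, 2)
```

For instance, 25 error transactions out of 1,000 total gives 2.5 percent.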

    4. Resource Log Findings

    For each deep-dived account:

  • Error summary: top error status codes, affected operations, throttling timeline
  • Latency analysis: slow operations by type, server vs E2E latency comparison
  • Authentication failures: if any, source IPs and user agents

    5. Known Issue Cross-Reference

    Compare findings against memory/amg-check-storage-account/report.md. For each known bug, state: still active / improving / worsening / resolved.
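
As a starting point for the cross-reference, the known bugs can be listed from the report text. This sketch relies only on the `### SA-` heading prefix that the Runtime Context counts; everything after that prefix (ID format, title) is unknown here and is returned verbatim. `known_issue_headings` and `status_line` are hypothetical helpers.

```python
import re

STATUSES = ("still active", "improving", "worsening", "resolved")

def known_issue_headings(report_text):
    """Return the text of every '### SA-...' heading in report.md."""
    pattern = re.compile(r"^### (SA-.*)$", re.MULTILINE)
    return [m.group(1).strip() for m in pattern.finditer(report_text)]

def status_line(heading, status):
    """Format one cross-reference entry, rejecting unknown statuses."""
    if status not in STATUSES:
        raise ValueError(f"status must be one of {STATUSES}")
    return f"{heading}: {status}"
```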

    6. Action Items

    Prioritized list:

  • Critical: accounts with availability < 99.0%
  • High: accounts with sustained error transactions, high latency, or throttling
  • Medium: accounts with elevated latency, intermittent errors, or capacity concerns
  • Low: dormant accounts (investigate if orphaned), informational items


    Update Known Issues

    After presenting findings, update memory/amg-check-storage-account/report.md:

    1. Read the current file (create if it doesn't exist).
    2. Update status of existing bugs based on today's telemetry.
    3. Add new bugs with: severity, account name, region, metric evidence, log evidence, root cause, recommended action.
    4. Update the "Updated" date in the header.

    Only add genuine issues: sustained availability drops, persistent throttling, high error rates, or latency degradation.


    Error Handling

    See ${CLAUDE_SKILL_DIR}/reference/error-handling.md for the full recovery table.


    Reference

  • Storage resource type: Microsoft.Storage/storageAccounts
  • ARM ID template: /subscriptions/{SUB}/resourceGroups/{RG}/providers/Microsoft.Storage/storageAccounts/{name}
  • Resource log tables: StorageBlobLogs (primary), StorageTableLogs, StorageQueueLogs (alternatives)
  • Key status codes: 503 (throttling), 403 (auth failure), 404 (not found), 409 (conflict)
  • UsedCapacity: PT1H interval only
  • Known issues: memory/amg-check-storage-account/report.md
  • Configuration: memory/amg-check-storage-account/config.md


    First-Run Setup

    Run only when Config shows NOT_CONFIGURED. After completing, return to the Workflow above.

    1. Discover Datasource UID: Call amgmcp_datasource_list. Filter type == "grafana-azure-monitor-datasource". Prefer uid == "azure-monitor-oob" if multiple match. Abort if zero match.

    2. Discover Subscription ID(s): Run this Resource Graph query to list all subscriptions with storage accounts, then present the results as a table and ask the user which subscription(s) to use:

    resources
    | where type == 'microsoft.storage/storageaccounts'
    | join kind=inner (
        resourcecontainers
        | where type == 'microsoft.resources/subscriptions'
        | project subscriptionId, subscriptionName=name
    ) on subscriptionId
    | summarize StorageAccounts=count() by subscriptionId, subscriptionName
    | order by StorageAccounts desc
    

    Present the results as a table with columns: Subscription Name, Subscription ID, Storage Accounts. Then ask the user: *"Which subscription ID(s) should I configure for this health check?"*

    3. Write config: Write memory/amg-check-storage-account/config.md:

    # amg-check-storage-account Configuration

    User-specific values for the Storage Account health check skill. This file is auto-generated on first run and can be edited manually.

    ## Azure Monitor Datasource

  • UID: {discovered_uid}
  • Name: {discovered_name}

    ## Subscriptions

  • {subscription_id_1}
  • {subscription_id_2}

    4. Confirm: Show the resolved config and ask for confirmation before proceeding.
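
Step 1 of the setup above (choosing the datasource) can be sketched as below. It assumes the amgmcp_datasource_list result has already been decoded into a list of dicts with `type`, `uid`, and `name` keys; that shape and the helper name `pick_datasource` are assumptions.

```python
AZURE_MONITOR_TYPE = "grafana-azure-monitor-datasource"

def pick_datasource(datasources, preferred_uid="azure-monitor-oob"):
    """Pick the Azure Monitor datasource, preferring the given UID."""
    matches = [d for d in datasources if d.get("type") == AZURE_MONITOR_TYPE]
    if not matches:
        # Mirrors the "abort if zero match" rule.
        raise LookupError("no Azure Monitor datasource found")
    for candidate in matches:
        if candidate.get("uid") == preferred_uid:
            return candidate
    return matches[0]
```

The same filter applies to the validation in Step 1a, where the selected uid is compared with the configured one.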
