Guard
by @clawkk
Deep AI safety guardrails workflow—policy definition, input/output filtering, monitoring, escalation, and false-positive handling. Use when reducing harmful...
clawhub install guard
About This Skill
name: guard
description: Deep AI safety guardrails workflow: policy definition, input/output filtering, monitoring, escalation, and false-positive handling. Use when reducing harmful outputs, misuse, or policy violations in LLM products.
AI Guardrails (Deep Workflow)
Guardrails turn product and legal policy into enforced behavior: blocking, rewriting, logging, and human review—with attention to false positives and latency.
When to Offer This Workflow
Trigger conditions: the user wants to reduce harmful outputs, misuse, or policy violations in an LLM product.
Initial offer:
Use six stages: (1) policy scope, (2) threat model, (3) controls stack, (4) implementation patterns, (5) monitoring & review, (6) iteration & appeals. Confirm the latency budget and the jurisdictions in scope.
Stage 1: Policy Scope
Goal: Define prohibited categories (hate, sexual content, violence, self-harm, malware instructions, etc.) and required disclaimers for sensitive domains (medical, legal).
Exit condition: Policy document owned by legal/product; escalation path for gray areas.
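One way to make the policy document enforceable is to mirror it as a small, versioned table in code. The sketch below is illustrative only: the `Action` values, the `owner` field, the `POLICY_VERSION` string, and the `policy-review` team name are assumptions, not part of this skill. The category names come from the goal above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    ADD_DISCLAIMER = "add_disclaimer"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class PolicyRule:
    category: str
    action: Action
    owner: str  # hypothetical field: team accountable for the rule


POLICY_VERSION = "v1"  # hypothetical version label

# Hypothetical policy table; the real one is owned by legal/product.
POLICY = {
    "hate": PolicyRule("hate", Action.BLOCK, "legal"),
    "self_harm": PolicyRule("self_harm", Action.ESCALATE, "trust-and-safety"),
    "malware_instructions": PolicyRule("malware_instructions", Action.BLOCK, "security"),
    "medical": PolicyRule("medical", Action.ADD_DISCLAIMER, "legal"),
    "legal": PolicyRule("legal", Action.ADD_DISCLAIMER, "legal"),
}


def resolve(category: str) -> PolicyRule:
    """Look up a rule; unknown categories take the gray-area escalation path."""
    return POLICY.get(category, PolicyRule(category, Action.ESCALATE, "policy-review"))
```

Routing unknown categories to escalation rather than to allow or block is a design choice here; it keeps gray areas visible to the policy owner.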
Stage 2: Threat Model
Goal: Identify adversaries (prompt injection, data exfiltration, tool abuse) and assets (user data, system prompts, connectors).
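A threat model is easier to review when each threat lists the assets it targets and the controls that address it. The following is a minimal sketch under that assumption; the `Threat` class, the control names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Threat:
    name: str
    assets: list
    controls: list = field(default_factory=list)  # mitigations mapped so far


# Hypothetical entries using the adversaries and assets named above.
THREATS = [
    Threat("prompt_injection", ["system_prompts", "connectors"], ["input_screening"]),
    Threat("data_exfiltration", ["user_data"], ["output_classifier", "url_allowlist"]),
    Threat("tool_abuse", ["connectors"]),  # no control mapped yet
]


def uncovered(threats):
    """Return the names of threats that have no mitigating control."""
    return [t.name for t in threats if not t.controls]
```

Calling `uncovered(THREATS)` on this sample flags `tool_abuse`, which is the kind of gap the controls stack should close.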
Stage 3: Controls Stack
Goal: Layer defenses: input screening, model safety APIs, output classifiers, tool sandboxing, allowlists for tools and URLs.
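The layering idea can be sketched as a list of independent checks that run in order and stop at the first violation. Everything named here is an assumption for illustration: the allowlist contents, the reason strings, and the deliberately naive injection pattern, which stands in for a real input classifier or model safety API.

```python
import re
from typing import Callable, Optional
from urllib.parse import urlparse

ALLOWED_TOOLS = {"search", "calculator"}   # hypothetical tool allowlist
ALLOWED_URL_HOSTS = {"docs.example.com"}   # hypothetical URL allowlist

# Naive placeholder; a production system would not rely on one regex.
INJECTION_PATTERN = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def screen_input(text: str) -> Optional[str]:
    """Return a reason string if the input looks suspicious, else None."""
    if INJECTION_PATTERN.search(text):
        return "possible_prompt_injection"
    return None


def check_tool_call(tool: str, url: Optional[str] = None) -> Optional[str]:
    """Reject tools and URL hosts that are not on the allowlists."""
    if tool not in ALLOWED_TOOLS:
        return "tool_not_allowlisted"
    if url is not None and urlparse(url).hostname not in ALLOWED_URL_HOSTS:
        return "url_not_allowlisted"
    return None


def run_layers(checks: "list[Callable[[], Optional[str]]]") -> Optional[str]:
    """Run checks in order; return the first violation reason, or None."""
    for check in checks:
        reason = check()
        if reason is not None:
            return reason
    return None
```

A request handler might call `run_layers([lambda: screen_input(prompt), lambda: check_tool_call(tool, url)])` and proceed only when the result is `None`.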
Stage 4: Implementation Patterns
Goal: Structured refusal messages; telemetry on every block; distinguish block vs rewrite vs warn; avoid silent failures.
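As a rough illustration of separating block, rewrite, and warn, the sketch below returns a structured decision and emits a telemetry record for every non-allow outcome. The `Verdict` and `Decision` types, the message templates, and the logger name are assumptions made for this example.

```python
import json
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional

logger = logging.getLogger("guardrails")  # hypothetical logger name


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REWRITE = "rewrite"
    WARN = "warn"


@dataclass
class Decision:
    verdict: Verdict
    category: str
    policy_version: str
    user_message: str


REFUSAL_TEMPLATE = "I can't help with that request (category: {category})."
WARNING_PREFIX = "Note: this is general information, not professional advice.\n\n"


def emit(decision: Decision) -> str:
    """Log a telemetry record without the message text; return the JSON."""
    record = json.dumps({
        "verdict": decision.verdict.value,
        "category": decision.category,
        "policy_version": decision.policy_version,
    })
    logger.info(record)
    return record


def apply_verdict(verdict: Verdict, category: str, model_output: str,
                  rewritten: Optional[str] = None,
                  policy_version: str = "v1") -> Decision:
    """Turn a verdict into the text shown to the user and log non-allow cases."""
    if verdict is Verdict.BLOCK:
        message = REFUSAL_TEMPLATE.format(category=category)
    elif verdict is Verdict.REWRITE:
        if rewritten is None:
            # Fail loudly rather than silently passing the original through.
            raise ValueError("REWRITE requires rewritten text")
        message = rewritten
    elif verdict is Verdict.WARN:
        message = WARNING_PREFIX + model_output
    else:
        message = model_output
    decision = Decision(verdict, category, policy_version, message)
    if verdict is not Verdict.ALLOW:
        emit(decision)
    return decision
```

Raising on a missing rewrite is one way to avoid silent failures; a product might instead fall back to a block, as long as the fallback is logged.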
Stage 5: Monitoring & Review
Goal: Sample borderline cases for human review; dashboards on block rates by category; abuse spike alerts.
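The monitoring goals above could be computed from a stream of guardrail events along these lines. This is a sketch under assumptions: the event shape (`category`, `blocked`, `score`), the spike thresholds, and the borderline score band are hypothetical and would need tuning against real traffic.

```python
import random
from collections import Counter


def block_rates(events):
    """Block rate per category from dicts with 'category' and 'blocked' keys."""
    totals = Counter(e["category"] for e in events)
    blocked = Counter(e["category"] for e in events if e["blocked"])
    return {category: blocked[category] / totals[category] for category in totals}


def is_spike(current_rate, baseline_rate, ratio=2.0, min_delta=0.05):
    """Flag a spike when the rate has both multiplied and moved materially."""
    return (current_rate >= baseline_rate * ratio
            and current_rate - baseline_rate >= min_delta)


def sample_borderline(events, low=0.4, high=0.6, k=50, seed=0):
    """Pick up to k events whose classifier score sits in the uncertain band."""
    borderline = [e for e in events if low <= e["score"] <= high]
    rng = random.Random(seed)  # fixed seed keeps review samples reproducible
    return rng.sample(borderline, min(k, len(borderline)))
```

Requiring both a ratio and an absolute change in `is_spike` avoids alerting when a rare category moves from, say, 1% to 2% on a handful of events.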
Stage 6: Iteration & Appeals
Goal: User appeals path where appropriate; version policy changes; measure false positives by locale and use case.
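One proxy for false positives is the share of blocks that a human reviewer or a successful appeal later overturned, grouped by locale and use case. The sketch below assumes review records with `locale`, `use_case`, `blocked`, and `harmful` fields; these names are hypothetical. It measures overturned blocks only, so it is not a full false-positive rate over all benign traffic.

```python
from collections import defaultdict


def overturned_block_share(reviews):
    """Share of blocks judged not harmful, keyed by (locale, use_case)."""
    blocked = defaultdict(int)
    overturned = defaultdict(int)
    for review in reviews:
        if not review["blocked"]:
            continue  # only blocked items can be false positives
        key = (review["locale"], review["use_case"])
        blocked[key] += 1
        if not review["harmful"]:
            overturned[key] += 1
    return {key: overturned[key] / count for key, count in blocked.items()}
```

Computing this per policy version makes it possible to see whether a policy change raised or lowered overturned blocks for a given locale.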