Skill Eval
by @xiaoxing9
Skill evaluation framework. Use when: testing trigger rate, comparing quality (with vs. without the skill), or comparing models. Runs via sessions_spawn + sessions_history.
Follow USAGE.md for all workflows.
Quick reference:
| Workflow | What It Tests | USAGE.md Section |
|----------|---------------|------------------|
| Trigger Rate | Does the description trigger SKILL.md reads at the right times? | Workflow 1 |
| Quality Compare | Does the skill improve output vs. a no-skill baseline? | Workflow 2 |
| Model Comparison | Quality and speed across haiku/sonnet/opus | Workflow 3 |
| Latency Profile | Response time p50/p90 | Workflow 4 |
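To make the first and last rows concrete, here is a minimal sketch of how trigger rate and the latency percentiles could be computed from collected run records. The field names (`read_skill_md`, `latency_s`) are illustrative assumptions, not the framework's actual schema.

```python
from statistics import quantiles

def trigger_rate(runs: list[dict]) -> float:
    """Fraction of runs in which the subagent read SKILL.md."""
    return sum(r["read_skill_md"] for r in runs) / len(runs)

def latency_profile(runs: list[dict]) -> tuple[float, float]:
    """p50 / p90 of response times in seconds."""
    qs = quantiles([r["latency_s"] for r in runs], n=10)  # 9 decile cut points
    return qs[4], qs[8]  # index 4 = 50th percentile, index 8 = 90th
```

With ten runs in an iteration, for example, a trigger rate of 0.8 means SKILL.md was read in eight of them.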
Each workflow follows the same four-step pattern (sketched after this list):
1. Agent spawns subagents using sessions_spawn
2. Agent collects histories using sessions_history
3. Agent writes raw data to workspace/{skill}/iter-{n}/raw/
4. Agent runs analysis script via exec
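A minimal sketch of one iteration of that loop, written in Python for illustration. `sessions_spawn` and `sessions_history` are defined here only as hypothetical stand-ins; in practice the agent invokes the real tools directly, and the analysis step runs via exec.

```python
import json
from pathlib import Path

def sessions_spawn(prompt: str, model: str) -> str:
    """Hypothetical stand-in: spawn a subagent session and return its id."""
    raise NotImplementedError("replace with the real sessions_spawn tool call")

def sessions_history(session_id: str) -> list[dict]:
    """Hypothetical stand-in: fetch a finished session's message history."""
    raise NotImplementedError("replace with the real sessions_history tool call")

def run_iteration(skill: str, n: int, prompts: list[str], model: str = "sonnet") -> Path:
    raw_dir = Path(f"workspace/{skill}/iter-{n}/raw")
    raw_dir.mkdir(parents=True, exist_ok=True)

    # 1. Spawn one subagent per test prompt.
    session_ids = [sessions_spawn(p, model=model) for p in prompts]

    # 2. Collect each subagent's history.
    for i, sid in enumerate(session_ids):
        history = sessions_history(sid)
        # 3. Write raw data under workspace/{skill}/iter-{n}/raw/.
        (raw_dir / f"run-{i}.json").write_text(json.dumps(history, indent=2))

    # 4. The agent then runs the analysis script over raw_dir via exec.
    return raw_dir
```

Writing one JSON file per run keeps each subagent's history independently inspectable before analysis.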
Install: `clawhub install openclaw-skill-eval`