
Agent Learner

by @xueyetianya

Benchmark and compare agent prompts and evaluation results. Use when tuning strategies, evaluating outputs, or comparing configurations.

Version: v2.0.2
Downloads: 669
Installs: 2
TERMINAL
clawhub install agent-learner

About This Skill


version: "2.0.1"
name: agent-learner
description: "Benchmark and compare agent prompts and evaluation results. Use when tuning strategies, evaluating outputs, or comparing configurations."
author: BytesAgain
homepage: https://bytesagain.com
source: https://github.com/bytesagain/ai-skills

Agent Learner

An AI toolkit for configuring, benchmarking, comparing, and optimizing agent prompts and evaluation results. Agent Learner provides persistent, file-based logging for each command category with timestamped entries, summary statistics, multi-format export, and full-text search across all records.

Commands

| Command | Description |
|---------|-------------|
| configure | Configure agent settings: log configuration entries or view recent ones |
| benchmark | Benchmark agent performance: log benchmark results or view history |
| compare | Compare agent outputs: log comparison data or view recent comparisons |
| prompt | Prompt management: log prompt variations or view recent prompts |
| evaluate | Evaluate agent outputs: log evaluation results or view history |
| fine-tune | Fine-tune parameters: log fine-tuning sessions or view recent ones |
| analyze | Analyze agent behavior: log analysis entries or view recent analyses |
| cost | Cost tracking: log cost data or view recent cost entries |
| usage | Usage monitoring: log usage metrics or view recent usage data |
| optimize | Optimize configurations: log optimization runs or view history |
| test | Test agent behavior: log test results or view recent tests |
| report | Report generation: log report entries or view recent reports |
| stats | Show summary statistics across all log categories (entry counts, data size, first entry date) |
| export | Export all data in json, csv, or txt format to the data directory |
| search | Full-text search across all log files (case-insensitive) |
| recent | Show the 20 most recent entries from the activity history log |
| status | Health check: show version, data directory, total entries, disk usage, and last activity |
| help | Show the full help message with all available commands |
| version | Print the current version string |

Each data command (configure, benchmark, compare, etc.) works in two modes:

  • Without arguments: displays the 20 most recent entries from that category
  • With arguments: saves the input as a new timestamped entry and reports the total count
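The two modes above can be sketched in a few lines of Bash. This is a minimal illustration, not the skill's actual source: the function name `log_or_show`, the timestamp format, and the wording of the confirmation message are assumptions, and a throwaway directory stands in for the real data directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directory for this sketch; the real tool defaults to
# ~/.local/share/agent-learner/ according to the text.
DATA_DIR="$(mktemp -d)"

# log_or_show is a hypothetical name, not a function taken from the skill.
log_or_show() {
  local category="$1"; shift
  local file="$DATA_DIR/$category.log"
  if [ "$#" -eq 0 ]; then
    # No arguments: show up to the 20 most recent entries, if any exist.
    if [ -f "$file" ]; then tail -n 20 "$file"; fi
  else
    # Arguments: append one timestamp|value line, then report the total.
    # The timestamp format is an assumption.
    printf '%s|%s\n' "$(date '+%Y-%m-%d %H:%M')" "$*" >> "$file"
    echo "Saved. Total $category entries: $(wc -l < "$file" | tr -d ' ')"
  fi
}

log_or_show benchmark "run A: 88.7% accuracy"
log_or_show benchmark "run B: 90.1% accuracy"
log_or_show benchmark
```

Appending with `>>` keeps each category log as a simple append-only file, which is what makes `tail -n 20` sufficient for the "recent entries" view.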
    Data Storage

    All data is stored in plain text files under the data directory:

  • Category logs: $DATA_DIR/.log (one file per command, e.g., configure.log, benchmark.log, prompt.log); each entry is timestamp|value
  • History log: $DATA_DIR/history.log (audit trail of every command executed, with timestamps)
  • Export files: $DATA_DIR/export. (generated by the export command in json, csv, or txt format)
  • Default data directory: ~/.local/share/agent-learner/
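Because each entry is a single `timestamp|value` line, the logs can be read back with nothing more than the shell's `read` builtin. The sketch below is illustrative only: the sample entries, the timestamp format, and the function name `print_entries` are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

DATA_DIR="$(mktemp -d)"   # throwaway directory for the sketch

# Two made-up sample entries; the timestamp format is an assumption.
printf '%s\n' \
  '2025-03-01 10:15|GPT-4o on MMLU: 88.7% accuracy' \
  '2025-03-02 09:40|latency | throughput check' \
  > "$DATA_DIR/benchmark.log"

# print_entries is a hypothetical helper. read splits on the first "|" only:
# everything after it lands in the last variable, so a value that itself
# contains "|" is preserved.
print_entries() {
  local ts value
  while IFS='|' read -r ts value; do
    printf 'when=%s  what=%s\n' "$ts" "$value"
  done < "$1"
}

print_entries "$DATA_DIR/benchmark.log"
```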

    Requirements

  • Bash (with set -euo pipefail support)
  • Standard Unix utilities: grep, cat, date, echo, wc, du, head, tail, basename
  • No external dependencies or API keys required
    When to Use

    1. Benchmarking agent performance: when you need to track and compare benchmark results across different agent configurations, models, or prompt strategies
    2. Prompt engineering iteration: when you're testing multiple prompt variations and want to log each version with results for later comparison
    3. Cost and usage tracking: when you need to monitor API costs and usage metrics over time to optimize spending
    4. Fine-tuning experiments: when running fine-tuning sessions and you want to log parameters, results, and observations for reproducibility
    5. Cross-category analysis: when you need to search across all logged data (benchmarks, prompts, evaluations, costs) to find patterns or specific entries

    Examples

    # Initialize and check status
    agent-learner status

    # Log a benchmark result
    agent-learner benchmark "GPT-4o on MMLU: 88.7% accuracy, 1.2s avg latency"

    # Log a prompt variation
    agent-learner prompt "System: You are a helpful coding assistant. Always explain your reasoning step by step."

    # Compare two configurations
    agent-learner compare "GPT-4o vs Claude-3.5: GPT-4o 12% faster, Claude 5% more accurate on code tasks"

    # Track costs
    agent-learner cost "March batch: 12,450 tokens input, 3,200 tokens output, $0.47 total"

    # View the 20 most recent benchmarks
    agent-learner benchmark

    # Search across all logs for a specific term
    agent-learner search "accuracy"

    # Export all data as JSON
    agent-learner export json

    # View summary statistics
    agent-learner stats

    # Show recent activity
    agent-learner recent
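A case-insensitive search across every category log maps naturally onto `grep`, which is among the listed requirements. The following is a rough sketch of that idea rather than the skill's implementation: `search_logs` is a hypothetical name, the sample entries are made up, and treating the term as a fixed string (`-F`) is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

DATA_DIR="$(mktemp -d)"   # throwaway directory for the sketch

# Made-up sample entries in three category logs.
printf '%s\n' '2025-03-01 10:15|MMLU: 88.7% Accuracy' > "$DATA_DIR/benchmark.log"
printf '%s\n' '2025-03-01 11:00|March batch: $0.47 total' > "$DATA_DIR/cost.log"
printf '%s\n' '2025-03-02 08:30|accuracy regressed on code tasks' > "$DATA_DIR/evaluate.log"

# search_logs is a hypothetical helper.
#   -i  ignore case        -F  treat the term as a literal string
#   -H  prefix each hit with the file it came from
# "|| true" stops set -e from aborting when nothing matches.
search_logs() {
  local term="$1"
  grep -iFH -- "$term" "$DATA_DIR"/*.log || true
}

search_logs "accuracy"
```

Keeping the file name in each hit shows which category an entry came from, which is useful when the same term appears in benchmarks and evaluations alike.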

    Output

    All commands return output to stdout. Export files are written to the data directory:

    agent-learner export json   # → ~/.local/share/agent-learner/export.json
    agent-learner export csv    # → ~/.local/share/agent-learner/export.csv
    agent-learner export txt    # → ~/.local/share/agent-learner/export.txt
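To give a feel for what a CSV export over `timestamp|value` logs might involve, here is a small sketch. The column layout (`category,timestamp,value`), the quoting rules, the sample entries, and the function name `export_csv` are all assumptions; the actual file produced by `agent-learner export csv` may look different.

```shell
#!/usr/bin/env bash
set -euo pipefail

DATA_DIR="$(mktemp -d)"   # throwaway directory for the sketch

# Made-up sample entries, including a quote and a comma inside values.
printf '%s\n' '2025-03-01 10:15|run "A": 88.7% accuracy' > "$DATA_DIR/benchmark.log"
printf '%s\n' '2025-03-01 11:00|March batch, $0.47 total' > "$DATA_DIR/cost.log"

# export_csv is a hypothetical helper; the column layout is an assumption.
export_csv() {
  local out="$DATA_DIR/export.csv" q='"' file category ts value
  echo 'category,timestamp,value' > "$out"
  for file in "$DATA_DIR"/*.log; do
    category="$(basename "$file" .log)"
    while IFS='|' read -r ts value; do
      # Double embedded quotes and wrap every field in quotes so that
      # commas inside a value stay within one cell.
      printf '"%s","%s","%s"\n' "$category" "$ts" "${value//$q/$q$q}" >> "$out"
    done < "$file"
  done
  echo "$out"   # report where the export was written
}

cat "$(export_csv)"
```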
    

    Every command execution is logged to $DATA_DIR/history.log for auditing purposes.


    Powered by BytesAgain | bytesagain.com | hello@bytesagain.com
