Autoresearch Agent
by @alirezarezvani
Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed eval command, and keeps or discards the edit based on how the metric moves.
First Time → Create the Experiment
Run the setup script. The user decides where experiments live:
Project-level (inside repo, git-tracked, shareable with team):
```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```
User-level (personal, in ~/.autoresearch/):
```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```
The --scope flag determines where .autoresearch/ lives:
- project (default) → .autoresearch/ in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- user → ~/.autoresearch/ in the home directory. Everything is personal.
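Both setup commands pass --eval, the command the agent runs after every edit to produce the metric. As a minimal sketch of a custom evaluator for the medium-ctr example (the scoring heuristic and the exact output format the agent parses are assumptions; a generated llm_judge_content evaluator would look different):

```python
# evaluate.py — hypothetical sketch of a custom evaluator for the medium-ctr
# experiment. Assumes the agent reads the metric ("ctr_score") from stdout;
# the real output contract may differ.
from pathlib import Path


def score_title(title: str) -> float:
    """Toy heuristic standing in for a real CTR model or LLM judge."""
    words = title.split()
    has_number = any(w.isdigit() for w in words)
    length_ok = 6 <= len(words) <= 12
    return 0.5 + 0.25 * has_number + 0.25 * length_ok


def main() -> None:
    # Score every non-heading, non-empty line in the target file.
    titles = [
        line.strip()
        for line in Path("content/titles.md").read_text().splitlines()
        if line.strip() and not line.startswith("#")
    ]
    ctr_score = sum(score_title(t) for t in titles) / max(len(titles), 1)
    print(f"ctr_score: {ctr_score:.4f}")


if __name__ == "__main__":
    main()
```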
What Setup Creates

```text
.autoresearch/
├── config.yaml            → Global settings
├── .gitignore             → Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md         → Objectives, constraints, strategy
    ├── config.cfg         → Target, eval cmd, metric, direction
    ├── results.tsv        → Experiment log (gitignored)
    └── evaluate.py        → Evaluation script (if --evaluator used)
```
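For the api-speed example above, config.cfg would capture the four values passed at setup. A hypothetical sketch, assuming an INI-style layout (the actual section and key names written by setup_experiment.py are not shown here):

```ini
# Hypothetical config.cfg for the api-speed experiment (field names assumed)
[experiment]
target    = src/api/search.py
eval      = pytest bench.py --tb=no -q
metric    = p50_ms
direction = lower
```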
results.tsv columns: commit | metric | status | description

- commit → short git hash
- metric → float value or "N/A" for crashes
- status → keep | discard | crash
- description → what changed or why it crashed
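Purely for illustration, a few hypothetical rows (tab-separated, invented values; whether the real file carries a header row is not shown here):

```text
a1b2c3d	148.2	keep	cache parsed query plans
e4f5a6b	N/A	crash	syntax error in search.py
c7d8e9f	151.0	discard	batching change, slower than baseline
```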
Domains

| Domain | Use Cases |
|--------|-----------|
| engineering | Code speed, memory, bundle size, test pass rate, build time |
| marketing | Headlines, social copy, email subjects, ad copy, engagement |
| content | Article structure, SEO descriptions, readability, CTR |
| prompts | System prompts, chatbot tone, agent instructions |
| custom | Anything else with a measurable metric |
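The same setup script covers the custom domain. For example (hypothetical experiment name, target, eval script, and metric, shown only to illustrate the flags documented above), shrinking a container image might look like:

```bash
python scripts/setup_experiment.py \
  --domain custom \
  --name image-size \
  --target Dockerfile \
  --eval "bash measure_image_mb.sh" \
  --metric image_mb \
  --direction lower \
  --scope project
```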
If program.md Already Exists
The user may have written their own program.md. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
evaluate.py is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.

```bash
clawhub install autoresearch-agent
```