🦀 ClawHub
LLM Eval Harness
by @charlie-morrison
Evaluate LLM outputs systematically — run test suites, score responses for accuracy/relevance/safety, compare models, and detect regressions in AI applications.
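The skill's internal API isn't documented on this page, but the core pattern it describes (a test suite, a scorer, an aggregated report) looks roughly like the following minimal Python sketch. All names here (EvalCase, exact_match, run_suite) are illustrative assumptions, not the skill's actual interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    """1.0 if the normalized output equals the expected answer, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())

def run_suite(cases: list[EvalCase], generate: Callable[[str], str]) -> dict:
    """Send every case through a model callable and aggregate the scores."""
    scores = [exact_match(generate(c.prompt), c.expected) for c in cases]
    return {"cases": len(scores), "accuracy": sum(scores) / max(len(scores), 1)}

# Stubbed model call for illustration; swap in a real LLM client.
suite = [EvalCase("What is 2 + 2?", "4"), EvalCase("Capital of France?", "Paris")]
print(run_suite(suite, generate=lambda prompt: "4"))
```

Regression detection and model comparison then reduce to running the same suite against two prompts or two models and diffing the reports.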
💡 Examples
"Evaluate our chatbot responses against the test suite"
"Compare GPT-4 vs Claude on our use cases"
"Run regression tests on the updated system prompt"
"Score these LLM outputs for accuracy and safety"
"Build an eval dataset for our RAG pipeline"
```
clawhub install llm-eval-harness
```