A curated collection of interesting finds for this week.
Karpathy's experiment in automating research loops with LLMs. Worth a look for harness-design ideas.
EleutherAI's de facto standard harness for benchmarking LLMs across hundreds of tasks. The reference tool when you need reproducible eval numbers.
The official MIT-licensed Python implementation of the GEPA optimizer: evolve prompts, code, and arbitrary text artifacts against your own eval.
Popular open-source LLM eval framework — pytest-style assertions, RAG/agent metrics, and built-in GEPA-style prompt optimization.
Open-source framework for building and running agents. A relevant reference point for agent-harness architecture.
DSPy's first-class GEPA integration. The most ergonomic way to run GEPA over a declarative LLM program with a defined metric.
Microsoft's prompt-compression toolkit — compresses prompts up to 20× with minimal performance loss. The serious counterpoint to anecdotal SPR-style compression.