Idler builds evals for coding, finance, science, and defense. Our tasks are fair, frontier-difficulty, long-horizon, diverse, and interesting. Customers use Idler tasks to measure how their models stack up against the competition, to check for regressions across model versions, as training tasks, and as high-taste hills for researchers to climb.
What makes the tasks good: five things every Idler task is
01 Sourced from the real world
Tasks pulled from real engineering, not invented.
02 Compatible with your use-case
Your stack, your tools, your formats.
03 Diverse problem sets
Bundled to meet your diversity requirements.
04 Exhaustive quality control
Every task tested, every failure mode verified.
05 Calibrated to your difficulty
Stratified by type, graded step by step.
Method: from real engineering work to a graded world
What they cover: the engineering work the environments are built from
Debugging
Reproduce, localize, and fix real bugs in a live repo.
Feature work
Build features across an unfamiliar codebase.
Refactors
Restructure code without breaking what works.
Tests & review
Write tests, read diffs, and catch regressions.
Why Idler: real, graded, frontier
Real
Environments from real engineering work, never invented benchmarks. The skill transfers.
Graded
Every step checked against a working result. Dense reward, not just pass or fail.
Frontier
Built for the best models, aimed at the engineering they still get wrong.
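To make the "Graded" point concrete, here is a minimal sketch of the difference between a pass-or-fail signal and a dense, step-level one. It is an illustration only: the `StepResult` type, the step names, and the equal weighting of steps are assumptions made for this example, not a description of how Idler actually scores a task.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StepResult:
    name: str     # hypothetical step label, invented for this sketch
    passed: bool  # whether the step matched the reference result


def sparse_reward(steps: Sequence[StepResult]) -> float:
    """Pass or fail: 1.0 only if every step passed, otherwise 0.0."""
    return 1.0 if steps and all(s.passed for s in steps) else 0.0


def dense_reward(steps: Sequence[StepResult]) -> float:
    """Fraction of steps passed, so partial progress earns partial credit.

    Equal weighting is an assumption; a real grader could weight steps differently.
    """
    if not steps:
        return 0.0
    return sum(s.passed for s in steps) / len(steps)


# A made-up debugging episode: the first two steps succeed, the last two do not.
episode = [
    StepResult("reproduce_bug", True),
    StepResult("localize_fault", True),
    StepResult("apply_fix", False),
    StepResult("run_tests", False),
]

print(sparse_reward(episode))  # 0.0
print(dense_reward(episode))   # 0.5
```

Under pass or fail, this episode looks identical to one where nothing worked. The step-level score still records that the bug was reproduced and localized, which is the kind of signal a dense reward is meant to preserve.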
Notes: method write-ups
Dense reward: Why step-by-step grading beats pass or fail. (Note)
Environments under RL: What a graded world does to a model. (Study)
Shelf life: Representing a codebase as an environment. (Note)
About the studio
A small team building the training worlds for coding agents.
Idler works quietly with frontier labs, turning production engineering into reinforcement-learning environments and keeping a neutral record of what models can actually do.