Idler builds evals for coding, finance, science, and defense. Our tasks are fair, frontier-difficulty, long-horizon, diverse, and interesting. Customers use Idler tasks to measure how their models stack up against the competition, to check for regressions across model versions, to train models, and to give researchers high-taste hills to climb.
Reproduce, localize, and fix real bugs in a live repo.
Build features across an unfamiliar codebase.
Restructure code without breaking what works.
Write tests, read diffs, and catch regressions.
Idler works quietly with frontier labs, turning production engineering into reinforcement-learning environments and keeping a neutral record of what models can actually do.
We are hiring environment engineers. Reach us at team@idler.ai.