Context Research

AI systems solve real-world tasks in a continuous cycle: enriching context, reasoning over context, and retaining what they learn from context. Context should be at the core of AI system design, with the model and the harness co-evolving around it. The model must recognize what context a task requires, and the harness must give it efficient ways to acquire that context. As a task progresses, its context becomes complex, messy, and constantly changing. The model must then be able to plan, manage, and reason over context autonomously, while the harness must provide efficient support for context management. Neither side is enough on its own; real progress requires both to improve together.

Our long-term mission is to develop rigorous measurements of how well models understand and manage context, and to explore scalable ways to build AI systems that handle complex context better.

Our research

CL-bench Life

Evaluating context learning in everyday life

CL-bench

Evaluating context learning in professional domains

Leaderboard

CL-bench Life: Can Language Models Learn From Real-life Context?

Real-life contexts are highly complex, messy, and fragmented. CL-bench Life aims to evaluate models' context learning ability under such contexts.

405 Tasks
5,348 Rubrics
3 Context Categories
Rank Model Organization Solving Rate (±std)

Single run — no standard deviation available.
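The "Solving Rate (±std)" column can be read as a mean over repeated runs with a spread around it. The sketch below is only an illustration of that reading: it assumes each run yields one solved/unsolved flag per task and that the spread is the sample standard deviation across runs. The function names `solving_rate` and `summarize` are hypothetical and are not part of the CL-bench scoring scripts.

```python
from statistics import mean, stdev
from typing import List, Optional, Tuple


def solving_rate(solved: List[bool]) -> float:
    """Percentage of tasks marked as solved in a single run."""
    return 100.0 * sum(solved) / len(solved)


def summarize(runs: List[List[bool]]) -> Tuple[float, Optional[float]]:
    """Return (mean solving rate, sample std) across runs.

    With a single run there is nothing to compute a spread from,
    so the std is reported as None.
    """
    rates = [solving_rate(run) for run in runs]
    if len(rates) < 2:
        return mean(rates), None
    return mean(rates), stdev(rates)


# Hypothetical data: two runs over the same four tasks.
two_runs = [
    [True, False, True, False],  # 50.0
    [True, True, True, False],   # 75.0
]
avg, spread = summarize(two_runs)

# A single run has a rate but no standard deviation.
single_avg, single_spread = summarize([[True, False]])
```

Whether the leaderboard uses the sample or the population standard deviation is not stated on this page; swapping `stdev` for `statistics.pstdev` would give the population version.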

Task pass rate under different thresholds

Task pass rates of different models across pass-score thresholds. Higher thresholds are stricter and therefore yield lower pass rates.
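As a rough illustration of how a pass rate depends on the threshold, the sketch below assumes that a task's score is the fraction of its rubrics that are satisfied and that a task passes when that score reaches the threshold. Both rules, the sample data, and the names `task_scores` and `pass_rates` are assumptions made for this example, not the benchmark's actual scoring code.

```python
from typing import Dict, List

# Thresholds matching the columns of the Model Details table.
THRESHOLDS = [1.00, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65]


def task_scores(rubric_results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of rubrics satisfied for each task (assumed scoring rule)."""
    return {task: sum(r) / len(r) for task, r in rubric_results.items()}


def pass_rates(scores: Dict[str, float], thresholds: List[float]) -> Dict[float, float]:
    """Share of tasks whose score is at least each threshold."""
    n = len(scores)
    return {t: sum(s >= t for s in scores.values()) / n for t in thresholds}


# Hypothetical rubric outcomes for four tasks.
results = {
    "task_a": [True, True, True, True],       # score 1.00
    "task_b": [True, True, True, False],      # score 0.75
    "task_c": [True, False, False, False],    # score 0.25
    "task_d": [True] * 9 + [False],           # score 0.90
}

rates = pass_rates(task_scores(results), THRESHOLDS)
for threshold in THRESHOLDS:
    print(f"{threshold:.0%}: {rates[threshold]:.2f}")
```

Because a task that passes at a strict threshold also passes at every looser one, the pass rate can only stay the same or rise as the threshold is lowered.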

Model Details


Rank Model 100% 95% 90% 85% 80% 75% 70% 65%

Submit Your Model

Run your model on CL-bench Life, score with our scripts, and submit a PR with your results.

Leaderboard

CL-bench: A Benchmark for Context Learning

CL-bench evaluates language models' ability to learn from context in professional domains.

1,899 Tasks
31,607 Rubrics
4 Context Categories
Rank Model Organization Solving Rate (±std)

Single run — no standard deviation available.

Submit Your Model

Run your model on CL-bench, score with our scripts, and submit a PR with your results.