Context Research

AI systems solve real-world tasks in a continuous cycle: enriching context, reasoning over context, and retaining what they learn from context. Context should be at the core of AI system design, with the model and the harness co-evolving around it. The model must recognize what context a task requires, and the harness must give it efficient ways to acquire the right context. As the task progresses, context becomes complex, messy, and constantly changing. The model then must be able to autonomously plan, manage, and reason over context, while the harness must provide efficient support for context management. Neither side is enough on its own, and real progress requires both to improve together.

Our long-term mission: to develop rigorous measurements of how well models understand and manage context, and to explore scalable ways to build AI systems that are better at handling complex context.

Our research

CL-bench Life

Evaluating context learning in everyday life

Blog · Paper · Code · Data

CL-bench

Evaluating context learning in professional domains

Blog · Paper · Code · Data

Leaderboard

CL-bench Life: Can Language Models Learn From Real-life Context?

Real-life contexts are highly complex, messy, and fragmented. CL-bench Life evaluates how well models learn from such contexts.

405 Tasks

5,348 Rubrics

3 Context Categories

Communication & Social Interactions

3 subcategories: Private Conversations / Group Conversations & Meeting Transcripts / Community Interactions

An example case from Group Conversations & Meeting Transcripts

Context: Months of overlapping book-club group chat, covering free-clothing handoffs, a husband returning home, fantasy-football recruitment, baking taste-tests, puppy-sitter requests, August meeting RSVPs, and a long-running thread to lock down the November "fancy" dinner.

Task: Trace the November "fancy" meeting from first proposal to final decision, including restaurant and date changes, every conflicting schedule constraint, who can or can't attend (and why), whether the final plan honored all stated constraints, and who drove the logistics.

Fragmented Information & Revisions

3 subcategories: Personal Information Fragments / Public Information Fragments / Creation & Revision Histories

An example case from Personal Information Fragments

Context: A cyclist's personal note collection spanning years of bike trip logs (Day 12, Day 25, …), maintenance records, tire specs, and scattered observations about weather disruptions, gear failures, and emergency workarounds.

Task: Synthesize the fragmented notes into a structured pre-trip checklist for an upcoming 5-day bike trip. Each item must be justified by a specific note entry and should improve safety and trip success while avoiding impractical workshop-only tools.

Behavioral Records & Activity Trails

3 subcategories: Game Logs / Digital Footprints & Daily-Life Records / Self-Tracking Trajectories

An example case from Self-Tracking Trajectories

Context: Detailed workout log data (TSV) covering dozens of exercises across 2023–2024, including weights, reps, and dates, with a 3-month injury gap before resuming in 2024.

Task: Identify which muscle groups were most negatively impacted by the injury break, spot specific plateaus in the 2024 routine (e.g., lat pulldown stuck at 90 lbs), and recommend concrete adjustments to break through them.

Rank	Model	Organization	Solving Rate (±std)

† Single run; no standard deviation available.

Task pass rate under different thresholds

Task pass rates for each model across pass-score thresholds. Higher thresholds are stricter and therefore yield lower pass rates.
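
The threshold sweep in this chart can be sketched as follows. This is a minimal illustration, not the benchmark's scoring script: it assumes each task receives a score in [0, 1] (for instance, the fraction of its rubrics satisfied) and counts a task as passed when that score meets the threshold. The function name `pass_rates` and the example scores are hypothetical.

```python
from typing import Dict, List

# Thresholds mirror the columns of the Model Details table (100% down to 65%).
THRESHOLDS = [1.00, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65]


def pass_rates(task_scores: List[float],
               thresholds: List[float] = THRESHOLDS) -> Dict[float, float]:
    """Return, for each threshold, the fraction of tasks scoring at or above it."""
    if not task_scores:
        raise ValueError("task_scores must not be empty")
    n = len(task_scores)
    return {t: sum(s >= t for s in task_scores) / n for t in thresholds}


# Hypothetical per-task scores (assumed here to be the fraction of rubrics satisfied).
example_scores = [1.0, 0.9, 0.72, 0.5]
rates = pass_rates(example_scores)
for threshold, rate in rates.items():
    print(f"{threshold:.0%}: {rate:.2f}")
```

Because every task that clears a stricter threshold also clears a looser one, the resulting rates can only stay flat or rise as the threshold drops, which matches the downward trend the chart describes.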

Model Details

Rank	Model	100%	95%	90%	85%	80%	75%	70%	65%

Submit Your Model

Run your model on CL-bench Life, score with our scripts, and submit a PR with your results.

Dataset · Code & Submission

Leaderboard

CL-bench: A Benchmark for Context Learning

CL-bench evaluates language models' ability to learn from context in professional domains.

1,899 Tasks

31,607 Rubrics

4 Context Categories

Domain Knowledge Reasoning

An example case · Genfanad PM onboarding

Context: A raw document dump from a small indie studio building a video game (geography notes, map tiles, character art briefs, audio assets, UI mockups, HR records, marketing plans, and a few files that aren't even about the project), under a PM-bot spec that demands strict tables, explicit confidence scores, PII suppression, and no speculation beyond the source text.

Task: Organize every piece of information into a table keyed by development channel, cite the source document for each entry, attach a bold 0–100% confidence score, and adapt the output cleanly across two revision requests, while also assembling a separate one-sentence summary list of all world-building topics.

Rule System Application

An example case · EZLang teaching-language assistant

Context: A full specification of EZLang, a minimal English-like teaching language with its own var/set syntax, if/for/while constructs, and a small set of system global functions (print, sleep, getTime, …), paired with a coding-coach persona aimed at non-technical high-schoolers.

Task: Write a time-checker program that polls every 30 minutes, stops at 5:30pm, and prints the full log; then explain the program line by line and cite the exact EZLang global-function docs for each call, without smuggling in any keyword (e.g. break) the language doesn't actually define.

Procedural Task Execution

An example case · Shelby's peanut- & dairy-free recipe assistant

Context: A strict recipe-bot spec (no peanuts, no dairy, mandatory Shelby's product inclusion, ≤15 min hands-on prep, rigid output template), plus a long inspiration dump of oxtail mac-and-cheese recipes and Thanksgiving menus the user pastes in.

Task: Turn a dairy-heavy oxtail mac-and-cheese request into a compliant dairy-free elevated version featuring a Shelby's product, scale the recipe down to 1 lb of oxtail, and then pivot mid-dinner to a sub-5-minute no-bake dessert when the user's coconut pie cracks, all while keeping peanuts and dairy out.

Empirical Discovery & Simulation

An example case · Electron in a magnetic field

Context: A raw CSV of roughly 1,000 (t, x, y, z) points sampling an electron's helical trajectory through a uniform magnetic field, paired with a reverse-engineering system prompt that enforces Occam's razor and a strict parameter-tuning mode.

Task: Recover the underlying physics across three turns: the pitch angle at which the electron enters the field (3 sig figs), the magnetic field strength in tesla (1 sig fig), and the initial speed in standard form, reusing earlier numerical estimates instead of re-deriving them.
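
One way to approach the three quantities named in this task is sketched below under explicit assumptions: the field points along the z axis, the motion is non-relativistic, and the samples are ordered in time. Under those assumptions the cyclotron relation omega = eB/m gives the field strength from the rotation rate, and the pitch angle follows from the ratio of perpendicular to parallel speed. The function names and the synthetic track are hypothetical; they are not taken from the benchmark's data or reference solution.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
E_MASS = 9.1093837015e-31    # electron rest mass in kilograms


def estimate_helix(points):
    """Estimate (pitch angle in degrees, |B| in tesla, speed in m/s).

    points is a time-ordered list of (t, x, y, z); B is assumed parallel to z.
    """
    # Finite-difference velocities, each tagged with the midpoint time.
    vel = []
    for (t0, x0, y0, z0), (t1, x1, y1, z1) in zip(points, points[1:]):
        dt = t1 - t0
        vel.append(((t0 + t1) / 2,
                    (x1 - x0) / dt, (y1 - y0) / dt, (z1 - z0) / dt))

    # Speed components perpendicular and parallel to the assumed field axis.
    v_perp = sum(math.hypot(vx, vy) for _, vx, vy, _ in vel) / len(vel)
    v_par = sum(vz for _, _, _, vz in vel) / len(vel)

    # Accumulate the rotation of the in-plane velocity direction,
    # wrapping each step into [-pi, pi) so full turns are counted correctly.
    turned = 0.0
    for (_, ax, ay, _), (_, bx, by, _) in zip(vel, vel[1:]):
        step = math.atan2(by, bx) - math.atan2(ay, ax)
        turned += (step + math.pi) % (2 * math.pi) - math.pi
    omega = abs(turned) / (vel[-1][0] - vel[0][0])  # angular frequency, rad/s

    pitch_deg = math.degrees(math.atan2(v_perp, abs(v_par)))
    b_field = E_MASS * omega / E_CHARGE             # from omega = eB/m
    speed = math.hypot(v_perp, v_par)
    return pitch_deg, b_field, speed


def synthetic_track(b=1e-3, speed=1e6, pitch_deg=30.0, n=1000):
    """Generate a hypothetical helix covering three turns, for checking the estimator."""
    omega = E_CHARGE * b / E_MASS
    v_perp = speed * math.sin(math.radians(pitch_deg))
    v_par = speed * math.cos(math.radians(pitch_deg))
    radius = v_perp / omega
    dt = 3 * (2 * math.pi / omega) / n
    return [(i * dt,
             radius * math.cos(omega * i * dt),
             radius * math.sin(omega * i * dt),
             v_par * i * dt) for i in range(n)]


pitch, b, v = estimate_helix(synthetic_track())
print(f"pitch = {pitch:.3g} deg, B = {b:.1g} T, speed = {v:.3e} m/s")
```

Computing the speed from the already estimated `v_perp` and `v_par`, rather than from a fresh fit, mirrors the task's requirement to reuse earlier numerical estimates across turns.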

Rank	Model	Organization	Solving Rate (±std)

† Single run; no standard deviation available.

Submit Your Model

Run your model on CL-bench, score with our scripts, and submit a PR with your results.

Dataset · Code & Submission