Leaderboard
CL-bench Life: Can Language Models Learn From Real-life Context?
Real-life contexts are complex, messy, and fragmented. CL-bench Life evaluates how well models learn from such contexts.
405 Tasks · 5,348 Rubrics · 3 Context Categories
Communication & Social Interactions
3 subcategories: Private Conversations / Group Conversations & Meeting Transcripts / Community Interactions
An example case from Group Conversations & Meeting Transcripts
Context
Months of overlapping book-club group chat — free-clothing handoffs, a husband returning home, fantasy-football recruitment, baking taste-tests, puppy-sitter requests, August meeting RSVPs, and a long-running thread to lock down the November "fancy" dinner.
Task
Trace the November "fancy" meeting from first proposal to final decision — restaurant and date changes, every conflicting schedule constraint, who can or can't attend (and why), whether the final plan honored all stated constraints, and who drove the logistics.
Fragmented Information & Revisions
3 subcategories: Personal Information Fragments / Public Information Fragments / Creation & Revision Histories
An example case from Personal Information Fragments
Context
A cyclist's personal note collection spanning years of bike trip logs (Day 12, Day 25, …), maintenance records, tire specs, and scattered observations about weather disruptions, gear failures, and emergency workarounds.
Task
Synthesize the fragmented notes into a structured pre-trip checklist for an upcoming 5-day bike trip — each item must be justified by a specific note entry, enhancing safety and trip success while avoiding impractical workshop-only tools.
Behavioral Records & Activity Trails
3 subcategories: Game Logs / Digital Footprints & Daily-Life Records / Self-Tracking Trajectories
An example case from Self-Tracking Trajectories
Context
Detailed workout log data (TSV) covering dozens of exercises across 2023–2024 — including weights, reps, and dates — with a 3-month injury gap before resuming in 2024.
Task
Identify which muscle groups were most negatively impacted by the injury break, spot specific plateaus in the 2024 routine (e.g., lat pulldown stuck at 90 lbs), and recommend concrete adjustments to break through them.
| Rank | Model | Organization | Solving Rate (±std) |
|---|---|---|---|
† Single run — no standard deviation available.
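As a rough illustration of how a "Solving Rate (±std)" cell could be produced from repeated runs, here is a minimal sketch. It assumes the value is the mean of per-run solving rates (in percent) and that the spread is the sample standard deviation; the page does not state which estimator is used. The function name `format_solving_rate` and the input numbers are hypothetical.

```python
import statistics
from typing import List


def format_solving_rate(run_rates: List[float]) -> str:
    """Format per-run solving rates (percent) as 'mean ±std'.

    Assumption: sample standard deviation across runs. A single run has
    no spread, so it is marked with a dagger instead.
    """
    mean = statistics.fmean(run_rates)
    if len(run_rates) < 2:
        return f"{mean:.1f}†"
    std = statistics.stdev(run_rates)
    return f"{mean:.1f} ±{std:.1f}"


# Made-up numbers, for illustration only.
print(format_solving_rate([20.0, 22.0, 24.0]))  # 22.0 ±2.0
print(format_solving_rate([18.5]))              # 18.5†
```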
Task pass rate under different thresholds
Each model's task pass rate at different pass-score thresholds. Higher thresholds are stricter and yield lower pass rates.
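The relationship between thresholds and pass rates can be sketched as follows. This is only an illustration under stated assumptions: it treats a task's score as the fraction of its rubrics that are satisfied and counts the task as passed when that fraction reaches the threshold. The actual scoring scripts may differ, and the function name, task IDs, and verdicts below are hypothetical.

```python
from typing import Dict, List


def task_pass_rates(
    rubric_results: Dict[str, List[bool]],
    thresholds_pct: List[int],
) -> Dict[int, float]:
    """Fraction of tasks whose rubric score meets each threshold (in percent)."""
    rates = {}
    for pct in thresholds_pct:
        passed_tasks = sum(
            # Integer comparison avoids floating-point edge cases at the boundary.
            1
            for verdicts in rubric_results.values()
            if sum(verdicts) * 100 >= pct * len(verdicts)
        )
        rates[pct] = passed_tasks / len(rubric_results)
    return rates


# Made-up rubric verdicts for four hypothetical tasks.
results = {
    "task_a": [True, True, True, True],     # 100% of rubrics satisfied
    "task_b": [True] * 9 + [False],         # 90%
    "task_c": [True, True, True, False],    # 75%
    "task_d": [True, False, False, False],  # 25%
}
print(task_pass_rates(results, [100, 90, 75, 65]))
# {100: 0.25, 90: 0.5, 75: 0.75, 65: 0.75}
```

Because every task that passes at a stricter threshold also passes at a looser one, the pass rate can only stay flat or rise as the threshold drops.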
Model Details
| Rank | Model | 100% | 95% | 90% | 85% | 80% | 75% | 70% | 65% |
|---|---|---|---|---|---|---|---|---|---|
Submit Your Model
Run your model on CL-bench Life, score with our scripts, and submit a PR with your results.
Leaderboard
CL-bench: A Benchmark for Context Learning
CL-bench evaluates language models' ability to learn from context in professional domains.
1,899 Tasks · 31,607 Rubrics · 4 Context Categories
Domain Knowledge Reasoning
An example case · Genfanad PM onboarding
Context
A raw document dump from a small indie studio building a video game — geography notes, map tiles, character art briefs, audio assets, UI mockups, HR records, marketing plans, and a few files that aren't even about the project — under a PM-bot spec that demands strict tables, explicit confidence scores, PII suppression, and no speculation beyond the source text.
Task
Organize every piece of information into a table keyed by development channel, cite the source document for each entry, attach a bold 0–100% confidence score, and adapt the output cleanly across two revision requests — while also assembling a separate one-sentence summary list of all world-building topics.
Rule System Application
An example case · EZLang teaching-language assistant
Context
A full specification of EZLang, a minimal English-like teaching language with its own var/set syntax, if/for/while constructs, and a small set of system global functions (print, sleep, getTime, …), paired with a coding-coach persona aimed at non-technical high-schoolers.
Task
Write a time-checker program that polls every 30 minutes, stops at 5:30pm, and prints the full log; then explain the program line by line and cite the exact EZLang global-function docs for each call, without smuggling in any keyword (e.g. break) the language doesn't actually define.
Procedural Task Execution
An example case · Shelby's peanut- & dairy-free recipe assistant
Context
A strict recipe-bot spec — no peanuts, no dairy, mandatory Shelby's product inclusion, ≤15 min hands-on prep, rigid output template — plus a long inspiration dump of oxtail mac-and-cheese recipes and Thanksgiving menus the user pastes in.
Task
Turn a dairy-heavy oxtail mac-and-cheese request into a compliant dairy-free elevated version featuring a Shelby's product, scale the recipe down to 1 lb of oxtail, and then pivot mid-dinner to a sub-5-minute no-bake dessert when the user's coconut pie cracks — all while keeping peanuts and dairy out.
Empirical Discovery & Simulation
An example case · Electron in a magnetic field
Context
A raw CSV of roughly 1,000 (t, x, y, z) points sampling an electron's helical trajectory through a uniform magnetic field, paired with a reverse-engineering system prompt that enforces Occam's razor and a strict parameter-tuning mode.
Task
Recover the underlying physics across three turns: the pitch angle at which the electron enters the field (3 sig figs), the magnetic field strength in tesla (1 sig fig), and the initial speed in standard form — reusing earlier numerical estimates instead of re-deriving them.
| Rank | Model | Organization | Solving Rate (±std) |
|---|---|---|---|
† Single run — no standard deviation available.
Submit Your Model
Run your model on CL-bench, score with our scripts, and submit a PR with your results.