| Rank | Model | Organization | Solving Rate (±std) |
|---|---|---|---|
CL-bench is a benchmark for evaluating how well language models learn from context. Solving CL-bench tasks requires models to acquire new knowledge from the provided context rather than rely on their pre-trained knowledge.
Model solutions are evaluated with instance-level rubrics and an LM-as-a-judge. Each task contains, on average, 16.6 rubrics that assess solutions across multiple dimensions.
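As a rough illustration of how per-rubric LM judgments can be aggregated into a score, consider the minimal sketch below. The prompt template, the `judge` callable, and the PASS/FAIL scheme are assumptions for illustration only, not CL-bench's actual evaluation code; use the official scripts for real results.

```python
from typing import Callable, List


def build_judge_prompt(task: str, solution: str, rubric: str) -> str:
    """Format a single-rubric judging prompt (hypothetical template)."""
    return (
        f"Task:\n{task}\n\n"
        f"Candidate solution:\n{solution}\n\n"
        f"Rubric:\n{rubric}\n\n"
        "Does the solution satisfy the rubric? Answer PASS or FAIL."
    )


def score_solution(
    task: str,
    solution: str,
    rubrics: List[str],
    judge: Callable[[str], str],  # any wrapper that sends a prompt to a judge LM
) -> float:
    """Return the fraction of rubrics the judge marks as PASS."""
    passed = sum(
        judge(build_judge_prompt(task, solution, r)).strip().upper().startswith("PASS")
        for r in rubrics
    )
    return passed / len(rubrics)
```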
Want to add your model? Download the dataset from Hugging Face, run inference and evaluation using our scripts, then submit a PR with your results.
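As a starting point, loading the dataset with the `datasets` library might look like the sketch below; the repo id and split name are placeholders, so substitute the ones listed on the benchmark's Hugging Face page and prefer the official scripts for inference and evaluation.

```python
from datasets import load_dataset

# Placeholder repo id and split; replace with the benchmark's actual values.
ds = load_dataset("org/CL-bench", split="test")

# Inspect a few examples to see the available fields (e.g. context, task, rubrics).
for example in ds.select(range(3)):
    print(example.keys())
```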