2024-01: Towards the end of 2023, the HuggingFace Open LLM Leaderboard drove a trend of benchmark gaming that has made it (and other synthetic benchmarks) largely useless for testing model capabilities.

The current consensus is that LMSys Chatbot Arena, which lets users compare responses side by side and pick a winner (and ranks models with Elo-style ratings), is the gold standard for ranking general model performance. It obviously has its own weaknesses (biased by its audience and use cases, colored by refusals and other considerations), but it seems to be the best we’ve got at the moment. A minimal sketch of the Elo-style update is included after the list below.

  • LMSys is also collecting datasets from this chat data, which could be extremely useful for training
  • u/DontPlanToEnd posted a benchmark correlation analysis on 2023-12-30 showing that MT-Bench (a GPT-4-judged benchmark) has a 0.89 correlation with Chatbot Arena, probably making it the second-best proxy (MMLU has a 0.85 correlation)
  • u/WolframRavenwolf has been posting his own LLM Comparison/Tests of new models, which are pretty interesting - they test in German (almost assuredly out of distribution) and focus on instruction following, but they’re a good sanity check
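For reference, Elo-style ranking from pairwise votes is just a running update per comparison. Here is a minimal Python sketch (the K-factor of 32 and the 1500 starting rating are illustrative assumptions, not LMSys's actual parameters):

```python
from collections import defaultdict

K = 32          # update step size (assumed for illustration)
BASE = 1500.0   # starting rating (assumed for illustration)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str) -> None:
    """Apply one pairwise vote (winner beat loser) to the ratings dict."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

# Votes as (winner, loser) pairs, e.g. collected from your own arena.
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]

ratings = defaultdict(lambda: BASE)
for winner, loser in votes:
    update_elo(ratings, winner, loser)

for name, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.0f}")
```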

General

Price Perf

Running Your Own

Do your own Chat Arena! https://github.com/Contextualist/lone-arena
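If you just want the core of a self-hosted arena, the loop is: get two anonymized answers, let a human pick, record the vote. A rough sketch (not lone-arena's actual code), assuming two local OpenAI-compatible endpoints with placeholder URLs and model names, feeding into the Elo updater sketched above:

```python
import random
from openai import OpenAI

# Two OpenAI-compatible servers (e.g. llama.cpp / vLLM); URLs are placeholders.
candidates = {
    "model-a": OpenAI(base_url="http://localhost:8001/v1", api_key="sk-local"),
    "model-b": OpenAI(base_url="http://localhost:8002/v1", api_key="sk-local"),
}

def ask(name: str, prompt: str) -> str:
    resp = candidates[name].chat.completions.create(
        model=name,  # placeholder; use whatever model name your server expects
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def blind_match(prompt: str) -> tuple[str, str]:
    """Show two anonymized answers, return (winner, loser) from the user's pick."""
    a, b = random.sample(list(candidates), 2)
    print(f"\nPrompt: {prompt}\n\n[1] {ask(a, prompt)}\n\n[2] {ask(b, prompt)}")
    pick = input("Which is better? [1/2] ").strip()
    return (a, b) if pick == "1" else (b, a)

# Collect votes, then feed them to update_elo() from the sketch above, e.g.:
# votes = [blind_match(p) for p in ["Explain RAG in one paragraph."]]
```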

promptfoo: https://www.promptfoo.dev/docs/intro/
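promptfoo itself is driven by a YAML config and its CLI; as a rough Python equivalent of the same assertion-style pattern (prompt → provider → deterministic checks), not promptfoo's API:

```python
# Toy version of assertion-based prompt testing; not promptfoo's actual API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

test_cases = [
    {
        "prompt": "Reply with exactly one word: what language is CPython written in?",
        "asserts": [lambda out: "c" in out.lower()],
    },
    {
        "prompt": "Return a JSON object with a single key 'ok' set to true.",
        "asserts": [lambda out: '"ok"' in out],
    },
]

def run_case(case: dict, model: str = "gpt-3.5-turbo") -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": case["prompt"]}],
    )
    out = resp.choices[0].message.content or ""
    return all(check(out) for check in case["asserts"])

if __name__ == "__main__":
    for case in test_cases:
        status = "PASS" if run_case(case) else "FAIL"
        print(f"{status}: {case['prompt'][:60]}")
```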

See also:

Hallucination

Vectara Hallucination Leaderboard

Code

See Code Evaluation for code evals.

Roleplay

Context

InfiniteBench

Japanese

New

Contamination

Eval the Evals