Leaderboards
LiveCodeBench
https://livecodebench.github.io/leaderboard.html
- Allows changing the time window of problems to mitigate/test for contamination (see the sketch below)
- Has Self Repair, Test Output Prediction, and Code Execution scenarios, but usually only Code Generation is kept up to date for the latest models
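A minimal sketch of what the time-window control amounts to; the `contest_date` field name and the example records below are assumptions, not LiveCodeBench's actual schema:

```python
# Sketch: compare scores on problems before vs. after a model's training
# cutoff to check for contamination. Field names and records are illustrative.
from datetime import date

problems = [
    {"id": "p1", "contest_date": "2023-05-14"},
    {"id": "p2", "contest_date": "2024-02-03"},
]

def released(problem: dict) -> date:
    return date.fromisoformat(problem["contest_date"][:10])

cutoff = date(2023, 9, 1)  # hypothetical training-data cutoff
pre = [p for p in problems if released(p) < cutoff]
post = [p for p in problems if released(p) >= cutoff]
# Evaluate the model on `pre` and `post` separately; a large score drop on
# `post` (problems the model cannot have seen) suggests contamination.
print(len(pre), len(post))
```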
Aider
https://aider.chat/docs/leaderboards/
- Tests code edits
- Benchmark is 133 Python Exercism practice exercises: https://aider.chat/docs/benchmarks.html#the-benchmark with Markdown instructions, stub Python code, and unit tests (see the harness sketch below)
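A minimal sketch of that edit-and-test loop, assuming a placeholder `edit_with_model` callable and a simplified file layout; aider's real harness differs in detail:

```python
# Sketch: give the model the exercise instructions plus the stub file, let it
# edit the stub, run the unit tests, and allow one retry with the test output.
import subprocess
from pathlib import Path


def run_tests(exercise_dir: Path) -> subprocess.CompletedProcess:
    # Each Exercism exercise ships a pytest file alongside the stub.
    return subprocess.run(
        ["pytest", "-x", str(exercise_dir)],
        capture_output=True, text=True,
    )


def attempt_exercise(exercise_dir: Path, edit_with_model) -> bool:
    instructions = (exercise_dir / "instructions.md").read_text()
    edit_with_model(instructions, exercise_dir)  # first attempt: instructions only
    result = run_tests(exercise_dir)
    if result.returncode == 0:
        return True
    # Second attempt: the model also sees the failing test output.
    edit_with_model(instructions + "\n\nTest output:\n" + result.stdout, exercise_dir)
    return run_tests(exercise_dir).returncode == 0
```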
InfiCoder-Eval
Dev
https://codetlingua.github.io/leaderboard.html
https://aider.chat/docs/leaderboards/
https://intercode-benchmark.github.io/
https://yale-lily.github.io/spider
https://github.com/THUDM/NaturalCodeBench
Running human-eval: https://github.com/abacaj/code-eval
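For reference, a minimal sketch of the flow such repos build on, using OpenAI's human-eval harness (https://github.com/openai/human-eval); `generate_one_completion` is a placeholder for your own model call, not part of the harness:

```python
# Sketch of scoring a model on HumanEval with OpenAI's human-eval harness.
from human_eval.data import read_problems, write_jsonl


def generate_one_completion(prompt: str) -> str:
    # Hypothetical: call your model here and return only the code that
    # completes the function body in `prompt`.
    raise NotImplementedError


problems = read_problems()  # dict of task_id -> problem (prompt, tests, ...)
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then score pass@k in a sandboxed environment:
#   evaluate_functional_correctness samples.jsonl
```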
https://www.reddit.com/r/LocalLLaMA/comments/18m54tw/real_world_multi_step_reasoning_software/
https://galois.com/blog/2023/09/using-gpt-4-to-assist-in-c-to-rust-translation/
https://galois.com/blog/2023/08/applying-gpt-4-to-saw-formal-verification/
https://twitter.com/a_karvonen/status/1717168110505955568
- Zero-Shot Replication Framework - replicate HumanEval, LeetCodeSparks, LeetCode100
- code-eval - scripts for running/reproducing human-eval scores on models
- llm-humaneval-benchmarks - HuggingFace models evaluated against HumanEval+
- Multilingual Code Models Evaluation - base multilingual code generation models
- airate - C++ bug catching test
- phi-1 prompt tests
Overfitting/contamination: LiveCodeBench https://arxiv.org/pdf/2403.07974 and NaturalCodeBench https://arxiv.org/pdf/2405.04520