Leaderboards
LiveCodeBench
https://livecodebench.github.io/leaderboard.html
- Allows changing the time window of problems to mitigate/test for contamination (see the sketch below)
- Has Self Repair, Test Output Prediction, and Code Execution scenarios, but usually only Code Generation is kept up to date for the latest models
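A minimal sketch of what the time-window control amounts to; the `contest_date` field name and the example records below are assumptions, not LiveCodeBench's actual schema:

```python
# Sketch: compare scores on problems before vs. after a model's training
# cutoff to check for contamination. Field names and records are illustrative.
from datetime import date

problems = [
    {"id": "p1", "contest_date": "2023-05-14"},
    {"id": "p2", "contest_date": "2024-02-03"},
]

def released(problem: dict) -> date:
    return date.fromisoformat(problem["contest_date"][:10])

cutoff = date(2023, 9, 1)  # hypothetical training-data cutoff
pre = [p for p in problems if released(p) < cutoff]
post = [p for p in problems if released(p) >= cutoff]
# Evaluate the model on `pre` and `post` separately; a large score drop on
# `post` (problems the model cannot have seen) suggests contamination.
print(len(pre), len(post))
```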
Aider
https://aider.chat/docs/leaderboards/
- Tests code edits
- Benchmark is 133 Python Exercism practice exercises: https://aider.chat/docs/benchmarks.html#the-benchmark with Markdown instructions, stub Python code, and unit tests (see the harness sketch below)
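A minimal sketch of that edit-and-test loop, assuming a placeholder `edit_with_model` callable and a simplified file layout; aider's real harness differs in detail:

```python
# Sketch: give the model the exercise instructions plus the stub file, let it
# edit the stub, run the unit tests, and allow one retry with the test output.
import subprocess
from pathlib import Path


def run_tests(exercise_dir: Path) -> subprocess.CompletedProcess:
    # Each Exercism exercise ships a pytest file alongside the stub.
    return subprocess.run(
        ["pytest", "-x", str(exercise_dir)],
        capture_output=True, text=True,
    )


def attempt_exercise(exercise_dir: Path, edit_with_model) -> bool:
    instructions = (exercise_dir / "instructions.md").read_text()
    edit_with_model(instructions, exercise_dir)  # first attempt: instructions only
    result = run_tests(exercise_dir)
    if result.returncode == 0:
        return True
    # Second attempt: the model also sees the failing test output.
    edit_with_model(instructions + "\n\nTest output:\n" + result.stdout, exercise_dir)
    return run_tests(exercise_dir).returncode == 0
```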
InfiCoder-Eval
Dev
https://codetlingua.github.io/leaderboard.html
https://aider.chat/docs/leaderboards/
https://intercode-benchmark.github.io/
https://yale-lily.github.io/spider
https://github.com/THUDM/NaturalCodeBench
Running human-eval: https://github.com/abacaj/code-eval
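For reference, a minimal sketch of the flow such repos build on, using OpenAI's human-eval harness (https://github.com/openai/human-eval); `generate_one_completion` is a placeholder for your own model call, not part of the harness:

```python
# Sketch of scoring a model on HumanEval with OpenAI's human-eval harness.
from human_eval.data import read_problems, write_jsonl


def generate_one_completion(prompt: str) -> str:
    # Hypothetical: call your model here and return only the code that
    # completes the function body in `prompt`.
    raise NotImplementedError


problems = read_problems()  # dict of task_id -> problem (prompt, tests, ...)
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then score pass@k in a sandboxed environment:
#   evaluate_functional_correctness samples.jsonl
```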
https://www.reddit.com/r/LocalLLaMA/comments/18m54tw/real_world_multi_step_reasoning_software/
https://galois.com/blog/2023/09/using-gpt-4-to-assist-in-c-to-rust-translation/
https://galois.com/blog/2023/08/applying-gpt-4-to-saw-formal-verification/
https://twitter.com/a_karvonen/status/1717168110505955568
- Zero-Shot Replication Framework - replicate HumanEval, LeetCodeSparks, LeetCode100
- code-eval - scripts for running/reproducing human-eval scores on models
- llm-humaneval-benchmarks - HuggingFace models evaluated against HumanEval+
- Multilingual Code Models Evaluation - base multilingual code generation models
- airate - C++ bug catching test
- phi-1 prompt tests
Overfitting/contamination: LiveCodeBench https://arxiv.org/pdf/2403.07974 and NaturalCodeBench https://arxiv.org/pdf/2405.04520