- LiveCodeBench Leaderboard: https://livecodebench.github.io/leaderboard.html
- EvalPlus Leaderboard
- CRUXEval Leaderboard
- CanAiCode Leaderboard
- Big Code Models Leaderboard
- InfiCoder-Eval
- TabbyML Coding LLMs Leaderboard
- SWE-bench
- Dev
- Codetlingua leaderboard: https://codetlingua.github.io/leaderboard.html
- Aider LLM leaderboards: https://aider.chat/docs/leaderboards/
- InterCode benchmark: https://intercode-benchmark.github.io/
- Spider (text-to-SQL, Yale LILY): https://yale-lily.github.io/spider
- NaturalCodeBench: https://github.com/THUDM/NaturalCodeBench
Running human-eval: https://github.com/abacaj/code-eval
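Harnesses like this report pass@k scores. A minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper is below; the function name and the example numbers are illustrative, not taken from code-eval itself.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval/Codex paper).

    n: total samples generated per problem
    c: number of those samples that passed the tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative example: 200 samples per problem, 37 passing
print(pass_at_k(200, 37, 1))   # 0.185, i.e. c/n for k=1
print(pass_at_k(200, 37, 10))  # higher, since any of 10 draws may pass
```

Per-problem scores are then averaged over the benchmark to get the reported pass@k.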
- Real-world multi-step reasoning software tasks (r/LocalLLaMA): https://www.reddit.com/r/LocalLLaMA/comments/18m54tw/real_world_multi_step_reasoning_software/
- Galois: Using GPT-4 to assist in C to Rust translation: https://galois.com/blog/2023/09/using-gpt-4-to-assist-in-c-to-rust-translation/
- Galois: Applying GPT-4 to SAW formal verification: https://galois.com/blog/2023/08/applying-gpt-4-to-saw-formal-verification/
- https://twitter.com/a_karvonen/status/1717168110505955568
- Zero-Shot Replication Framework - replicate HumanEval, LeetCodeSparks, LeetCode100
- code-eval - scripts for running/reproducing human-eval scores on models (see the execution sketch after this list)
- llm-humaneval-benchmarks - HuggingFace models evaluated against HumanEval+
- Multilingual Code Models Evaluation - base multilingual code generation models
- airate - C++ bug catching test
- phi-1 prompt tests
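A minimal sketch of what these HumanEval-style harnesses do per sample: concatenate the problem prompt with a model completion, append the benchmark's test code, and run it in a throwaway subprocess. The field names (`prompt`, `test`, `entry_point`) follow HumanEval's JSONL format; the file path and function name here are assumptions for illustration, and real harnesses add proper sandboxing and resource limits.

```python
import json
import os
import subprocess
import sys
import tempfile

def run_one(problem: dict, completion: str, timeout: float = 10.0) -> bool:
    """Return True if the completion passes the problem's tests.

    NOT a real sandbox: production harnesses restrict imports, memory,
    and filesystem access before executing model-generated code.
    """
    program = "\n".join([
        problem["prompt"] + completion,        # candidate implementation
        problem["test"],                       # defines check(candidate)
        f"check({problem['entry_point']})",    # raises on failure
    ])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

# Illustrative usage (file name assumed):
# problems = [json.loads(line) for line in open("HumanEval.jsonl")]
# passed = run_one(problems[0], model_completion)
```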
Overfitting/contamination concerns: LiveCodeBench (https://arxiv.org/pdf/2403.07974) and NaturalCodeBench (https://arxiv.org/pdf/2405.04520)