Leaderboards

LiveCodeBench

https://livecodebench.github.io/leaderboard.html

  • Lets you change the time window of problems (by release date) to mitigate or test for contamination
  • Also has Self Repair, Test Output Prediction, and Code Execution scenarios, but usually only Code Generation is updated for the latest models
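
The time-window idea above can be sketched in a few lines: compare a model's pass rate on problems released before and after its training cutoff, since a sharp drop after the cutoff hints at contamination. This is only an illustrative sketch; the `Problem` record, the sample results, and the cutoff date are all hypothetical and are not taken from LiveCodeBench's code or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str      # hypothetical identifier
    release_date: date   # when the problem was first published
    solved: bool         # whether the model under test passed it


def pass_rate(problems):
    """Fraction of solved problems, or None for an empty list."""
    if not problems:
        return None
    return sum(p.solved for p in problems) / len(problems)


# Hypothetical per-problem results for one model.
results = [
    Problem("a", date(2023, 6, 1), True),
    Problem("b", date(2023, 8, 15), True),
    Problem("c", date(2024, 1, 10), False),
    Problem("d", date(2024, 2, 20), True),
]

# Assumed training cutoff for the model (not a real figure).
cutoff = date(2023, 9, 1)

before = [p for p in results if p.release_date < cutoff]
after = [p for p in results if p.release_date >= cutoff]

rate_before = pass_rate(before)
rate_after = pass_rate(after)
print(f"before cutoff: {rate_before}, after cutoff: {rate_after}")
```

With real data the windows would need enough problems on each side for the comparison to mean anything.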

Aider

https://aider.chat/docs/leaderboards/

https://codetlingua.github.io/leaderboard.html
https://intercode-benchmark.github.io/
https://yale-lily.github.io/spider
https://github.com/THUDM/NaturalCodeBench

https://www.swebench.com/

Running human-eval: https://github.com/abacaj/code-eval
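
HumanEval results are usually reported as pass@k. A minimal sketch of the standard unbiased estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass. The function name and the example numbers are illustrative assumptions, not code from the repo linked above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for a problem, 3 of them passed.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(p1, p5)
```

The benchmark score is the mean of this value over all problems.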

https://www.reddit.com/r/LocalLLaMA/comments/18m54tw/real_world_multi_step_reasoning_software/
https://galois.com/blog/2023/09/using-gpt-4-to-assist-in-c-to-rust-translation/
https://galois.com/blog/2023/08/applying-gpt-4-to-saw-formal-verification/
https://twitter.com/a_karvonen/status/1717168110505955568

Overfitting:

  • LiveCodeBench paper: https://arxiv.org/pdf/2403.07974
  • NaturalCodeBench paper: https://arxiv.org/pdf/2405.04520

https://huggingface.co/blog/sc2-instruct