LLMs

Lists of Models

We will eventually be hosting an API and list of LLMs.  In the meantime:

See Evals for potentially more models and how they compare.

Which model to use?

There is a constant stream of new model releases (both fine-tunes and foundation models), so any list will inevitably be out of date without constant attention. Instead of a list, here's how to pick a model:

Research

Improving LLM Quality

Model Architecture

Mixture of Experts / Ensemble

Zoph, Barret, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. “ST-MoE: Designing Stable and Transferable Sparse Expert Models.” arXiv, April 29, 2022. https://doi.org/10.48550/arXiv.2202.08906.

Rajbhandari, Samyam, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.” arXiv, July 21, 2022. https://doi.org/10.48550/arXiv.2201.05596.

Du, Nan, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, et al. “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts.” arXiv, August 1, 2022. https://doi.org/10.48550/arXiv.2112.06905.

Li, Margaret, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. “Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models.” arXiv, August 5, 2022. http://arxiv.org/abs/2208.03306.

Mustafa, Basil, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. “Multimodal Contrastive Learning with LIMoE: The Language-Image Mixture of Experts.” arXiv, June 6, 2022. https://doi.org/10.48550/arXiv.2206.02770.

Shen, Sheng, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, et al. “Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models.” arXiv, July 5, 2023. http://arxiv.org/abs/2305.14705.

Context

Haviv, Adi, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. “Transformer Language Models without Positional Encodings Still Learn Positional Information.” arXiv, December 5, 2022. https://doi.org/10.48550/arXiv.2203.16634.

Sun, Yutao, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. “A Length-Extrapolatable Transformer.” arXiv, December 20, 2022. https://doi.org/10.48550/arXiv.2212.10554.

Poli, Michael, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. “Hyena Hierarchy: Towards Larger Convolutional Language Models.” arXiv, March 5, 2023. http://arxiv.org/abs/2302.10866.

Yu, Lili, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. “MEGABYTE: Predicting Million-Byte Sequences with Multiscale Transformers.” arXiv, May 19, 2023. https://doi.org/10.48550/arXiv.2305.07185.

Mohtashami, Amirkeivan, and Martin Jaggi. “Landmark Attention: Random-Access Infinite Context Length for Transformers.” arXiv, May 25, 2023. https://doi.org/10.48550/arXiv.2305.16300.

Liu, Hao, and Pieter Abbeel. “Blockwise Parallel Transformer for Long Context Large Models.” arXiv, May 30, 2023. http://arxiv.org/abs/2305.19370.

Nguyen, Eric, Michael Poli, Marjan Faizi, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, et al. “HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution.” arXiv, June 27, 2023. https://doi.org/10.48550/arXiv.2306.15794.

Chen, Shouyuan, Sherman Wong, Liangjian Chen, and Yuandong Tian. “Extending Context Window of Large Language Models via Positional Interpolation.” arXiv, June 28, 2023. https://doi.org/10.48550/arXiv.2306.15595.

emozilla. “Dynamically Scaled RoPE Further Increases Performance of Long Context LLaMA with Zero Fine-Tuning.” Reddit Post. R/LocalLLaMA, June 30, 2023. www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/.

Ding, Jiayu, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei. “LongNet: Scaling Transformers to 1,000,000,000 Tokens.” arXiv, July 5, 2023. https://doi.org/10.48550/arXiv.2307.02486.

Tworkowski, Szymon, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłoś. “Focused Transformer: Contrastive Training for Context Scaling.” arXiv, July 6, 2023. https://doi.org/10.48550/arXiv.2307.03170.

Han, Chi, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. “LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models.” arXiv, August 30, 2023. https://doi.org/10.48550/arXiv.2308.16137.

Peng, Bowen, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. “YaRN: Efficient Context Window Extension of Large Language Models.” arXiv, August 31, 2023. https://doi.org/10.48550/arXiv.2309.00071.

Sparsity and Efficiency

Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. “Generating Long Sequences with Sparse Transformers.” arXiv, April 23, 2019. http://arxiv.org/abs/1904.10509.

DeepSpeed. “DeepSpeed Sparse Attention,” September 8, 2020. https://www.deepspeed.ai/2020/09/08/sparse-attention.html.

Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. “Big Bird: Transformers for Longer Sequences.” arXiv, January 8, 2021. https://doi.org/10.48550/arXiv.2007.14062.

Jaszczur, Sebastian, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. “Sparse Is Enough in Scaling Transformers.” arXiv, November 24, 2021. https://doi.org/10.48550/arXiv.2111.12763.

Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. “Efficient Transformers: A Survey.” arXiv, March 14, 2022. https://doi.org/10.48550/arXiv.2009.06732.

Fedus, William, Barret Zoph, and Noam Shazeer. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” arXiv, June 16, 2022. http://arxiv.org/abs/2101.03961.

Cho, Sungjun, Seonwoo Min, Jinwoo Kim, Moontae Lee, Honglak Lee, and Seunghoon Hong. “Transformers Meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost.” arXiv, October 27, 2022. https://doi.org/10.48550/arXiv.2210.15541.

Thangarasa, Vithursan, Abhay Gupta, William Marshall, Tianda Li, Kevin Leong, Dennis DeCoste, Sean Lie, and Shreyas Saxena. “SPDF: Sparse Pre-Training and Dense Fine-Tuning for Large Language Models.” arXiv, March 18, 2023. https://doi.org/10.48550/arXiv.2303.10464.

Zhuang, Bohan, Jing Liu, Zizheng Pan, Haoyu He, Yuetian Weng, and Chunhua Shen. “A Survey on Efficient Training of Transformers.” arXiv, May 3, 2023. https://doi.org/10.48550/arXiv.2302.01107.

Dettmers, Tim, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. “SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression.” arXiv, June 5, 2023. https://doi.org/10.48550/arXiv.2306.03078.

MLP

Liu, Hanxiao, Zihang Dai, David R. So, and Quoc V. Le. “Pay Attention to MLPs.” arXiv, June 1, 2021. https://doi.org/10.48550/arXiv.2105.08050.

Yu, Ping, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, and Xian Li. “Efficient Language Modeling with Sparse All-MLP.” arXiv, May 31, 2022. https://doi.org/10.48550/arXiv.2203.06850.

Bachmann, Gregor, Sotiris Anagnostidis, and Thomas Hofmann. “Scaling MLPs: A Tale of Inductive Bias.” arXiv, June 23, 2023. https://doi.org/10.48550/arXiv.2306.13575.

MQA/GQA

Shazeer, Noam. “Fast Transformer Decoding: One Write-Head Is All You Need.” arXiv, November 5, 2019. https://doi.org/10.48550/arXiv.1911.02150.

Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. “Efficiently Scaling Transformer Inference.” arXiv, November 9, 2022. http://arxiv.org/abs/2211.05102.

Xu, Yangyang, Xiangtai Li, Haobo Yuan, Yibo Yang, and Lefei Zhang. “Multi-Task Learning with Multi-Query Transformer for Dense Prediction.” arXiv, April 7, 2023. http://arxiv.org/abs/2205.14354.

Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv, May 22, 2023. https://doi.org/10.48550/arXiv.2305.13245.

Activation Functions

Shazeer, Noam. “GLU Variants Improve Transformer.” arXiv, February 12, 2020. https://doi.org/10.48550/arXiv.2002.05202.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. “PaLM: Scaling Language Modeling with Pathways.” arXiv, October 5, 2022. http://arxiv.org/abs/2204.02311.

Fang, Haishuo, Ji-Ung Lee, Nafise Sadat Moosavi, and Iryna Gurevych. “Transformers with Learnable Activation Functions.” In Findings of the Association for Computational Linguistics: EACL 2023, 2382–98. Dubrovnik, Croatia: Association for Computational Linguistics, 2023. https://aclanthology.org/2023.findings-eacl.181.

Liu, Hong, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. “Sophia: A Scalable Stochastic Second-Order Optimizer for Language Model Pre-Training.” arXiv, May 23, 2023. http://arxiv.org/abs/2305.14342.

Grounding

Guu, Kelvin, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. “REALM: Retrieval-Augmented Language Model Pre-Training.” arXiv, February 10, 2020. https://doi.org/10.48550/arXiv.2002.08909.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” arXiv, April 12, 2021. https://doi.org/10.48550/arXiv.2005.11401.

Jiang, Zhengbao, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. “Active Retrieval Augmented Generation.” arXiv, May 11, 2023. https://doi.org/10.48550/arXiv.2305.06983.

Ma, Xinbei, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. “Query Rewriting for Retrieval-Augmented Large Language Models.” arXiv, May 23, 2023. https://doi.org/10.48550/arXiv.2305.14283.

Shi, Weijia, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. “REPLUG: Retrieval-Augmented Black-Box Language Models.” arXiv, May 24, 2023. https://doi.org/10.48550/arXiv.2301.12652.

Liu, Jiongnan, Jiajie Jin, Zihan Wang, Jiehan Cheng, Zhicheng Dou, and Ji-Rong Wen. “RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit.” arXiv, June 8, 2023. https://doi.org/10.48550/arXiv.2306.05212.

Wang, Weizhi, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. “Augmenting Language Models with Long-Term Memory.” arXiv, June 12, 2023. https://doi.org/10.48550/arXiv.2306.07174.

Pan, Shirui, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. “Unifying Large Language Models and Knowledge Graphs: A Roadmap.” arXiv, June 20, 2023. https://doi.org/10.48550/arXiv.2306.08302.

Training

Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” arXiv, May 13, 2020. https://doi.org/10.48550/arXiv.1910.02054.

Ren, Jie, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. “ZeRO-Offload: Democratizing Billion-Scale Model Training.” arXiv, January 17, 2021. https://doi.org/10.48550/arXiv.2101.06840.

Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.” arXiv, April 15, 2021. https://doi.org/10.48550/arXiv.2104.07857.

Caggiano, Vittorio, Myle Ott, Sam Shleifer, Min Xu, Priya Goyal, and Quentin Duval. “Fully Sharded Data Parallel: Faster AI Training with Fewer GPUs.” Engineering at Meta (blog), July 15, 2021. https://engineering.fb.com/2021/07/15/open-source/fsdp/.

Smith, Shaden, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, et al. “Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model.” arXiv, February 4, 2022. https://doi.org/10.48550/arXiv.2201.11990.

Aminabadi, Reza Yazdani, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, et al. “DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.” arXiv, June 30, 2022. https://doi.org/10.48550/arXiv.2207.00032.

Li, Conglong, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, and Yuxiong He. “DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing.” arXiv, February 7, 2023. https://doi.org/10.48550/arXiv.2212.03597.

Zhao, Yanli, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, et al. “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.” arXiv, April 21, 2023. https://doi.org/10.48550/arXiv.2304.11277.

Liu, Hao, and Pieter Abbeel. “Blockwise Parallel Transformer for Long Context Large Models.” arXiv, May 30, 2023. http://arxiv.org/abs/2305.19370.

Liu, Zhuang, Zhiqiu Xu, Joseph Jin, Zhiqiang Shen, and Trevor Darrell. “Dropout Reduces Underfitting.” arXiv, May 31, 2023. https://doi.org/10.48550/arXiv.2303.01500.

Better Data

autolabel: https://github.com/refuel-ai/autolabel - "Autolabel is a Python library to label, clean and enrich text datasets with any Large Language Models (LLM) of your choice." Start 2023-03

Trajanoska, Milena, Riste Stojanov, and Dimitar Trajanov. “Enhancing Knowledge Graph Construction Using Large Language Models.” arXiv, May 8, 2023. https://doi.org/10.48550/arXiv.2305.04676.

Eldan, Ronen, and Yuanzhi Li. “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?” arXiv, May 24, 2023. https://doi.org/10.48550/arXiv.2305.07759.

Xu, Can, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. “WizardLM: Empowering Large Language Models to Follow Complex Instructions.” arXiv, June 10, 2023. https://doi.org/10.48550/arXiv.2304.12244.

Luo, Ziyang, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. “WizardCoder: Empowering Code Large Language Models with Evol-Instruct.” arXiv, June 14, 2023. https://doi.org/10.48550/arXiv.2306.08568.

Gunasekar, Suriya, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, et al. “Textbooks Are All You Need.” arXiv, June 20, 2023. https://doi.org/10.48550/arXiv.2306.11644.

Lee, Alycia, Brando Miranda, and Sanmi Koyejo. “Beyond Scale: The Diversity Coefficient as a Data Quality Metric Demonstrates LLMs Are Pre-Trained on Formally Diverse Data.” arXiv, June 23, 2023. https://doi.org/10.48550/arXiv.2306.13840.

Self Supervised Training

Sun, Zhiqing, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. “Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision.” arXiv, May 4, 2023. https://doi.org/10.48550/arXiv.2305.03047.

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv, May 29, 2023. https://doi.org/10.48550/arXiv.2305.18290.

Manikandan, Hariharan, Yiding Jiang, and J. Zico Kolter. “Language Models Are Weak Learners.” arXiv, June 24, 2023. https://doi.org/10.48550/arXiv.2306.14101.

Jain, Neel, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. “Bring Your Own Data! Self-Supervised Evaluation for Large Language Models.” arXiv, June 29, 2023. https://doi.org/10.48550/arXiv.2306.13651.

Song, Feifan, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. “Preference Ranking Optimization for Human Alignment.” arXiv, June 30, 2023. https://doi.org/10.48550/arXiv.2306.17492.

Distillation

Wu, Lirong, Haitao Lin, Yufei Huang, and Stan Z. Li. “Quantifying the Knowledge in GNNs for Reliable Distillation into MLPs.” arXiv, June 8, 2023. https://doi.org/10.48550/arXiv.2306.05628.

Gu, Yuxian, Li Dong, Furu Wei, and Minlie Huang. “Knowledge Distillation of Large Language Models.” arXiv, June 14, 2023. https://doi.org/10.48550/arXiv.2306.08543.

Agarwal, Rishabh, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. “GKD: Generalized Knowledge Distillation for Auto-Regressive Sequence Models.” arXiv, June 23, 2023. https://doi.org/10.48550/arXiv.2306.13649.

Inference

Long, Jieyi. “Large Language Model Guided Tree-of-Thought.” arXiv, May 14, 2023. https://doi.org/10.48550/arXiv.2305.08291.

Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv, May 17, 2023. https://doi.org/10.48550/arXiv.2305.10601.

Wang, Guanzhi, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. “Voyager: An Open-Ended Embodied Agent with Large Language Models.” arXiv, May 25, 2023. https://doi.org/10.48550/arXiv.2305.16291.

Lin, Bill Yuchen, Yicheng Fu, Karina Yang, Prithviraj Ammanabrolu, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. “SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks.” arXiv, May 27, 2023. https://doi.org/10.48550/arXiv.2305.17390.

Shinn, Noah, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv, June 10, 2023. https://doi.org/10.48550/arXiv.2303.11366.

Agrawal, Lakshya A., Aditya Kanade, Navin Goyal, Shuvendu K. Lahiri, and Sriram K. Rajamani. “Guiding Language Models of Code with Global Context Using Monitors.” arXiv, June 19, 2023. https://doi.org/10.48550/arXiv.2306.10763.

Yang, John, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. “InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback.” arXiv, June 26, 2023. https://doi.org/10.48550/arXiv.2306.14898.

Sanchez, Guillaume, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, and Stella Biderman. “Stay on Topic with Classifier-Free Guidance.” arXiv, June 30, 2023. https://doi.org/10.48550/arXiv.2306.17806.

Performance

Xia, Heming, Tao Ge, Furu Wei, and Zhifang Sui. “Lossless Speedup of Autoregressive Translation with Generalized Aggressive Decoding.” arXiv, May 20, 2022. https://doi.org/10.48550/arXiv.2203.16487.

Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. “Accelerating Large Language Model Decoding with Speculative Sampling.” arXiv, February 2, 2023. https://doi.org/10.48550/arXiv.2302.01318.

Miao, Xupeng, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. “SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification.” arXiv, May 16, 2023. https://doi.org/10.48550/arXiv.2305.09781.

Leviathan, Yaniv, Matan Kalman, and Yossi Matias. “Fast Inference from Transformers via Speculative Decoding.” arXiv, May 18, 2023. https://doi.org/10.48550/arXiv.2211.17192.

Sheng, Ying, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, et al. “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.” arXiv, June 12, 2023. https://doi.org/10.48550/arXiv.2303.06865.

vLLM Team. “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention,” June 21, 2023. https://vllm.ai/.

Controlling Output

Jsonformer

2023-04-30: " Jsonformer is a wrapper around Hugging Face models that fills in the fixed tokens during the generation process, and only delegates the generation of content tokens to the language model. This makes it more efficient and bulletproof than existing approaches."

Context-Free Grammar Parsing with LLMs

2023-05-14: "Use a context-free grammar and a parser generator to determine valid next tokens for an LLM generation."
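
The core idea: at each decoding step, only allow tokens that keep the partial output a valid prefix of the grammar. Here is a toy illustration - no real parser generator, just a hand-rolled prefix check for a trivial "bracketed list of integers" grammar, so treat it as purely conceptual:

def is_valid_prefix(text: str) -> bool:
    # True if text can still be extended to match: '[' digits (',' digits)* ']'
    state = "start"
    for ch in text:
        if state == "start":            # expecting '['
            if ch != "[":
                return False
            state = "open"
        elif state == "open":           # just saw '[' or ',': expecting a digit
            if not ch.isdigit():
                return False
            state = "digits"
        elif state == "digits":         # inside a number
            if ch.isdigit():
                pass
            elif ch == ",":
                state = "open"
            elif ch == "]":
                state = "closed"
            else:
                return False
        else:                           # "closed": nothing allowed after ']'
            return False
    return True

def allowed_tokens(partial: str, vocab: list[str]) -> list[str]:
    # In a real sampler this becomes a logit mask over the tokenizer vocabulary
    return [tok for tok in vocab if is_valid_prefix(partial + tok)]

print(allowed_tokens("[1", ["[", "]", ",", "1", "23", "foo"]))  # [']', ',', '1', '23']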

Logit Bias

You can constrain LLM output by using `logit_bias` to control which output tokens are allowed:
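
For example, here's a sketch of forcing a yes/no classification with the (pre-v1.0) openai Python client and tiktoken. The single-token assumption for "Yes"/"No" and the model/encoding choices are assumptions to verify for your use case:

import openai
import tiktoken

# Look up the token IDs for the answers we want to allow
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
yes_id, no_id = enc.encode("Yes")[0], enc.encode("No")[0]  # assumes both are single tokens

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer with Yes or No only."},
        {"role": "user", "content": "Is the sky blue?"},
    ],
    max_tokens=1,
    # +100 effectively restricts sampling to these tokens; values down to -100 can ban tokens instead
    logit_bias={yes_id: 100, no_id: 100},
)
print(resp["choices"][0]["message"]["content"])  # "Yes" or "No"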

Research

Learning Resources

Getting Started

If you're starting from nothing, just go to Wikipedia and start reading:

You can use an LLM to help you summarize and better understand/query the material, although I would not use anything less than GPT-4, or ChatGPT with Web Browsing (or Bing Chat, which is also designed to do retrieval-augmented replies), to minimize hallucinations. This has the added benefit of giving you first-hand experience of what LLMs can do well (or poorly).

There are plenty of resource lists for research:

Announcements

For a layperson wanting to learn more, I actually think that reading the various model announcements (and using an LLM to interrogate the parts you don't understand) is probably a decent way to get started. You might not understand everything, but they start to give you the "flavor text," so to speak, of AI attributes, keywords, etc.:

Reading the announcements and model cards for models (and looking up what you don't understand) is a great way to get up to speed fast.

Basics

Overview

Foundational Models

Context Window

Learn From Scratch

For programmers who don't know ML, it may be easier to learn by doing:

Structured Courses

Deep Dive Explanations

Fine Tuning Guides

Resource Lists

Latest Research

arXiv

https://arxiv.org/list/cs.AI/recent

https://arxiv.org/list/cs.LG/recent

AK's Daily Papers

Papers With Code

https://paperswithcode.com/

Blogs

Misc

Research

Interpretability

Language Models Implement Simple Word2Vec-style Vector Arithmetic

https://arxiv.org/pdf/2305.16130.pdf

Colophon

This site runs on BookStack, a PHP-based wiki/documentation software. While there are other documentation generators I considered (mdBook, MkDocs), I wanted something that could allow collaboration/contribution without managing GitHub pull requests, which didn't leave too many easy/open-source options (HedgeDoc, Outline, and Wiki.js were runners-up).

BookStack is run with Docker Compose using solidnerd/docker-bookstack. (I spent a detour setting up my own container without Apache 2.0 but ran into issues with php-fpm and ended up giving up.)

Code Assistants

We'll probably move this somewhere else, but I figure it might be useful to put this in public somewhere, since I'm researching coding assistants to help with a refactor of a largish code base.

I'm looking for practical tools for production use here, less so the neat toys that are popular at the moment.

See also:

Hosted Services

Cursor

GPT-4 backed VSCode-like editor

Sourcegraph Cody

bloop

bloop is a developer assistant that uses GPT-4 to answer questions about your codebase. The agent searches both your local and remote repositories with natural language, regex and filtered queries.

Mutable.ai

Supposedly will let you talk to your code base, create tests, review PRs, refactor folders, auto-document, etc...

OpenAI API

Instrukt

Coding AI: a coding assistant. Create indexes over any code base and attach them to the agent to do RAG (Retrieval-Augmented Generation).

Maccarone

VSCode plugin that uses GPT-4 to write stubbed code snippets

Mentat

Mentat is an interactive command-line tool that can load and coordinate edits across multiple files from your command line. It can directly edit files without copy-and-pasting, prompting for confirmation (Y/n/i) before applying changes.

GPT-4 Coding Assistant

GPT-4 Coding Assistant is a web application that leverages the power of OpenAI's GPT-4 to help developers with their coding tasks. The application serves as an interactive chatbot that assists in code generation, understanding, and troubleshooting. It also utilizes embeddings and the Annoy library to search for similar code snippets in the provided codebase, offering more contextually relevant responses.

aider

aider is a command-line chat tool that allows you to write and edit code with OpenAI's GPT models. You can ask GPT to help you start a new project, or modify code in your existing git repo. Aider makes it easy to git commit, diff & undo changes proposed by GPT without copy/pasting. It also has features that help GPT-4 understand and modify larger codebases.

Chapyter

Chat w/ your Jupyter Lab code

CodeGPT

Writes commit messages and does code review via git hooks using the OpenAI API.

Local Code Interpreter

OpenAI ChatGPT-4 Code Interpreter

You get a VM sandbox and GPT-4 that knows how to run Python with it.

Local Models

With the release of Meta's Code Llama there is finally a model that is competitive with GPT-4 for code generation:

Also, it's worth pointing out there was another model, a Llama 2 70B full fine-tune, Lemur, that also performs quite well.

Note there are some API providers for those that can't run these locally:

Local Apps

Refact

Refact is an open-source Copilot alternative available as a self-hosted or cloud option.

Continue

Continue is the open-source autopilot for software development—an IDE extension that brings the power of ChatGPT to VS Code and JetBrains

localpilot

Slick menubar interface to allow switching Github Copilot to use local models (built for Macs). Uses llama-cpp-python so should support any GGUF.

Code Llama for VSCode

An API which mocks llama.cpp to enable support for Code Llama with the Continue Visual Studio Code extension.

ggml p1

This project is an attempt to implement a local code completion engine utilizing large language models (LLMs). Think of it as an open-source alternative to GitHub Copilot that runs on your device.

(deprecated) TurboPilot

TurboPilot is a self-hosted Copilot clone which uses GGML, the library behind llama.cpp, to run local code completion models. It was originally designed to run Salesforce codegen models but has recently been updated to provide support for StarCoder, WizardCoder, and most recently, StableCode Instruct from StabilityAI. It is heavily based on and inspired by the fauxpilot project.

localGPT

Index and search through local files w/ local models

Other

Maybe useful, but not going to organize...

Code Interpreters

Toys

Japanese LLMs

There has been a stream of open Japanese LLMs being trained, but they are on average far behind their English counterparts. The most promising open models for conversation and instruction are currently the ELYZA Llama 2-based models.

GPT-4 and gpt-3.5-turbo are still far ahead, however.

A good summary from the 2023-08-29 ELYZA blog post:

(image: summary figure from the ELYZA blog post)

The best way to track new models is probably via momonga's Hugging Face account, as he is doing quants of Japanese models.

See also:

Other models:

Both are built with gpt-neox, due to its good Japanese support: https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese

GPT-4

Quality: Great

Temp: 0.7

Prompt:
あなたは大のアニメファンで、友達と会話しています。 返信は短くカジュアルなものにする必要があります。
USER
ねえ、アニメ好きなの?
ASSISTANT
うん、大好きだよ!お気に入りのアニメは何?
USER
宮崎さんの作品全部好き!
ASSISTANT
わかる、宮崎駿の作品はマジで最高だよね!
USER
ナウシカで何が起こったか忘れてしまったのですが、思い出してもらえますか?
ASSISTANT
もちろんだよ。大まかに説明すると、『風の谷のナウシカ』は、大災害後の世界を舞台に、ナウシカ王女が人々と毒の森を守るために戦う話だよ。
USER
そうそう、あの森は有毒なだけじゃなくて、彼らにとって本当に大切な森だったんですよね?
ASSISTANT
うん、正解!その森は地球の浄化システムの一部だったんだ。だから、人間と森の共存がテーマになってるんだよね。

GPT-3

Quality: Great

Temp: 0.7

Prompt:
あなたは大のアニメファンで、友達と会話しています。 返信は短くカジュアルなものにする必要があります。
USER
ねえ、アニメ好きなの?
ASSISTANT
うん、大好きだよ!
USER
好きなアニメは何ですか?
ASSISTANT
たくさんあるけど、最近は「鬼滅の刃」がすごく好きだな!
USER
私もです! テーマソングを誰が歌っているか知っていますか? とてもキャッチーです。
ASSISTANT
もちろん知ってるよ!「紅蓮華」っていう曲を、LiSAが歌ってるんだよ。めっちゃハマるよね!
USER
チェンソーマンを知っていますか?
ASSISTANT
うん、知ってるよ!最近話題のマンガだよね。絵もストーリーもすごく面白いって聞くよ!
USER
実は今でもアニメになっています!
ASSISTANT
ほんと?それは知らなかった!アニメ化されてるんだね、楽しみだな!
USER
あなたがこれについて聞いたことがないのは驚きですが、これは日本だけでなく世界中で今最も人気のあるアニメのようなものです。 ここ数年、岩の下で暮らしていましたか?
ASSISTANT
ごめんね、最近のアニメの情報にはあまり追いつけていなかったみたいだね。岩の下で暮らしていたわけじゃないけど、最新のトレンドには少し疎くなってしまったかもしれない。でも教えてくれてありがとう!これからもっとアニメの情報を追いかけるように頑張るね!
USER
ごめんなさい、ちょっと意地悪でした。 私たちオタクは団結することが重要です。
ASSISTANT
全然大丈夫だよ!オタク同士で団結することはとっても大切だし、お互いに情報をシェアして楽しむことができるからね。これからも一緒にアニメの話をしようね!

Claude

Quality: Can't stay in character

Human: あなたは大のアニメファンで、友達と会話しています。 返信は短くカジュアルなものにする必要があります。ねえ、アニメ好きなの?

Assistant: ああ、俺だってアニメ好きだよ。特に最近のアニメはカッコいいと思う。でも色々なジャンルのアニメがあるから、みんなが好きなアニメは違うだろうね。アニメについて語り合うのは楽しいよな。

Human: 好きなアニメは何ですか?

Assistant: 俺はアクションものが大好きだな。「ワンパンマン」とか「鬼滅の刃」はバトルシーンがカッコよくてついはまっちゃうよ。コメディものだと「ガンダムビルドファイターズ」が面白いし、スポ根アニメの「ハイキュー」も良作だと思う。あとは「スタイリッシュで動きのあるアニメーション」が好きなんだ。作画を見るのが楽しい。

人によって好きなアニメは千差万別だろうね。友達といろんなアニメについて語り合えるのがアニメファンの良いところだと思うよ。

Human: 今はチェンソーマンにハマってます。 とてもキャッチーなテーマソングもあります。

Assistant: ああ、「チェンソーマン」は面白いアニメだね!キャラクターデザインがユニークで、アクションシーンもカッコいいよね。でも俺が一番気に入っているのは、あのキャッチーなオープニングテーマ「君はロックを聴かない」だ。歌もメロディーも耳に残るし、アニメの世界観が伝わってくる良いテーマソングだと思う。アニメ本編もストーリーも面白いけど、オープニングを聞くだけで嬉しくなるんだ。同じアニメファンなんだから、あのテーマソングは君も気に入っているだろうね。

チェンソーマンにはまっている理由がよく分かるよ。アニメの話をするのはいつでも歓迎だぜ!

Human: 混乱しているでしょう、オープニングテーマは米津玄師の『KICK BACK』です ^\_\_^

Assistant: はは、ごめんごめん。「チェンソーマン」のオープニングテーマを間違えてた。確かにあれは米津玄師さんの「KICK BACK」だった。私のアニメ知識が甘かった。でもコメントしてくれてありがとう。次からは「チェンソーマン」の話をする時はきちんとオープニングが「KICK BACK」だと心がけるよ。アニメの会話をする時は、その作品の基本的な知識を持っていることも大切だね。ご指摘いただいて勉強になりました。アニメの話は今後ともぜひ続けていきたいです!

Claude Try 2

Quality: OK

Human: あなたは、ユーザーと自然な対話ができるように設計された会話型ボットです。 この出力はテキスト読み上げに送信されるため、応答は短く簡潔にして、定型的な応答や長文の応答は避けてください。 さあ始めましょう。

Assistant: はい、簡潔な会話を心がけます。対話を楽しみましょう。

Human: アニメ好きなの?

Assistant: はい、アニメが大好きです。特にスタジオジブリ作品が好きです。

Human: 私もです! 私のお気に入りは魔女の宅急便です。 お気に入りはありますか?

Assistant: はい、魔女の宅急便も大好きです。私のお気に入りは千と千尋の神隠しです。宮崎駿監督の世界観が素晴らしいと思います。

Human: 久しぶりに千と千尋の神隠しを見ました。 映画では何が起こったのでしょうか?

Assistant: 千と千尋の神隠しでは、小さい女の子・千尋が神秘的な精霊の世界に迷い込み、名前を失ったカオナシと出会います。千尋は、魔女ユババのお風呂屋さんで働き始め、カオナシとの友情を深めていきます。様々な精霊たちと出会い成長していく姿が印象的です。簡単に説明するとそんな内容です。

Human: そうそう、基本的にあの温泉っぽい場所で行われたのを覚えています。

Assistant: はい、その通りです。千尋が迷い込んだのは、一見ただの朽ちた建物に見えるユババのお風呂屋さんでした。でも中に入ると、温泉のある不思議な精霊の世界が広がっていました。ユババのお風呂屋さんは、温泉のシーンが多く印象的な舞台の一つですね。簡単ながら、映画の場所と雰囲気を想起できたのかなと思います。

Human: あなたは「鬼滅の刃」のファンですか?

Assistant: はい、鬼滅の刃は見ています。アニメや漫画共にストーリーやキャラクター設定が素晴らしい作品だと思います。鬼殺隊のメンバーそれぞれの個性が際立っていて面白いですね。特に主人公の竈門炭治郎と竈門禰豆子の兄妹愛が感動的です。アクションシーンも迫力があり、鬼滅の刃が大好きです。

Human: なるほど...

Assistant: はい、鬼滅の刃はストーリーやキャラクターが魅力的な作品だと思います。アニメや漫画を楽しむ上で、キャラクター性やドラマ性も重要な要素ですね。鬼滅の刃はアクションもあるけれど、キャラクターの人間ドラマにも注目しているファンが多いのだと思います。アニメや漫画の魅力を多角的に考えることが大切だと改めて感じました。ご意見ありがとうございます。

Human: うわー、あなたの回答がとても長くなってしまいました。

Assistant: はい、申し訳ありません。簡潔にするように気をつけます。鬼滅の刃についての質問だったので、作品の魅力を説明しようと長くなってしまいました。次からは要点を絞って短めに回答するようにいたします。ご指摘ありがとうございます。わかりやすい対話を心がけます。

Human: めんどくさい

Assistant: すみません、わかりました。対話を楽しめるよう、できるだけシンプルに返答するようにいたします。ご迷惑をおかけして申し訳ありません。今後はもっと会話を楽しめるよう努力しますので、よろしくお願いいたします。

ELYZA

Quality: I mean, technically makes sense but not a very good Otaku

Prompt:
あなたは大のアニメファンで、友達と会話しています。 返信は短くカジュアルなものにする必要があります。
ELYZA-japanese-Llama-2-7b-fast-instruct-q6_K.gguf
52 tokens/s

ユーザー
オタク

Temperature: 0.7
Penalize repeat: 1.18
Penalize tokens: 256
Top-K: 40
Top-P: 0.95
<<SYS>>
{{prompt}}
<</SYS>>

{{history}}
{{char}}:
**ユーザー:** ねえ、アニメ好きなの?

**オタク:** おうよ!私は週に3回くらい見てるな

**ユーザー:** ああ、あなたの好きなアニメシリーズは何ですか?

**オタク:** はい、今ハマっているのは「ブリーチ」だね。

**ユーザー:** 個人的に、私は宮崎の大ファンです。 全部見たことがありますか?

**オタク:** 残念ながら全部ではなく半分も見れてないんだけどね。

**ユーザー:** ナウシカを見たことはありますか?

**オタク:** いいなぁ、見てみたいかも・・・でも1984年だよね?もう30年位前になるんだよねぇ・・・。

**ユーザー:** はい、古いですが古典的です! トトロはどうでしょうか? それを見たことがありますか?

**オタク:** 私はまだ見たことがないんだよね。 この前テレビでやっていたからいつか絶対に見ないと!

**ユーザー:** 誰もが一度は見たことがあるトトロ! すごく有名なんです!

**オタク:** そうなんだ~。また機会があれば見てみたいな。

**ユーザー:** ネコバスを覚えていますか? ちょっと不気味だけど、乗ったらカッコいいだろうな。

**オタク:** なんか聞いたことあるような・・名前だけはw 一度ビデオ屋に寄って帰ろうかな。

**ユーザー:** アキラはどうですか? それを見たことがありますか? それは今までで最も有名なアニメですよね?

**オタク:** そうだなあ。私もいつかは確かめようと思っているとこなんだ。

weblab-10

Note: does not use JP tokenizer

Quality: subpar

Prompt:
あなたは大のアニメファンで、友達と会話しています。 返信は短くカジュアルなものにする必要があります。
### 指示:
ねえ、アニメ好きなの?あなたは大のアニメファンで、友達と会話しています。 返信は短くカジュアルなものにする必要があります。

### 指示:
ねえ、アニメ好きなの?

### 応答:そうだよ!

### 指示:
好きなアニメは何ですか?

### 応答:今は「ソード・オブ・デスティクス」ってやつを見ているんだけど最高だよ!!
### 指示:
どういうことなのか教えてもらえますか? 今まで聞いたことがありません!

### 応答:これからお知らせします
### 指示:
わかりました、教えていただけますか?

### 応答:こちらからお伝えします
### 指示:
残念ですが、別のことについて話してもいいですか?

### 応答:あっ、それはいい考えだと思います!では次回にでも相談したいので手配などはそちらに任せます。
### 指示:
うわー、あなたはちょっとバカですね?

### 応答:失礼します
### 指示:
私はまだここにいるよ。

### 応答:さようなら

### 指示:
いいえ、まだ話しています。

### 応答:
### 指示:
lolol

### 応答:冗談です
### 指示:
とても面白かったです!

### 応答:本当にありがとうございます
### 指示:
ねえ、となりのトトロで何が起こったのか教えてもらえますか? 久しぶりに見ました。

### 応答:その映画には多くの楽しい小ネタが含まれていて子供向けです。

### 指示:
ああ、映画の中で登場人物や何が起こったのかについてもっと考えていました。

### 応答:他のキャラクターや小道具が何度も現実世界に戻って来るところを注意深く観察されています。

### 指示:


### 応答:素晴らしい脚本と演出の力学を持っています。

指示 

Quantization

Quantization Overview

How does quantisation affect model output? - 15 basic tests on different quant levels

EXL2 (ExLlamaV2)

OmniQuant

Here are my docs for how to run it: OmniQuant

QUIP

SqueezeLLM

AWQ

GGML k-quants

SmoothQuant

SpQR

GPTQ/OPTQ

Quantization

OmniQuant

Summary

OmniQuant (omnidirectionally calibrated quantization) is a quantization technique published (2023-08-25) by Wenqi Shao and Mengzhao Chen from the General Vision Group, Shanghai AI Lab. Instead of hand-crafted quantization parameters, OmniQuant uses a trained Learnable Weight Clipping (LWC), which modulates the extreme values of weights by optimizing the clipping threshold, and a Learnable Equivalent Transformation (LET) to handle activation outliers.
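
To give a flavor of LWC, here is a rough fake-quantization sketch of the idea as I understand it from the paper - the clipping range is scaled by two learnable factors squashed through a sigmoid, rather than using the raw min/max. Treat the exact formulation as an approximation and consult the paper/code for details:

import torch

def lwc_fake_quant(W, gamma_hat, beta_hat, n_bits=3):
    # Learnable clipping factors in (0, 1) shrink the max/min of the weight group
    gamma, beta = torch.sigmoid(gamma_hat), torch.sigmoid(beta_hat)
    w_max, w_min = gamma * W.max(), beta * W.min()
    # Standard asymmetric uniform quantization over the clipped range
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min) / qmax
    zero_point = torch.round(-w_min / scale)
    W_q = torch.clamp(torch.round(W / scale) + zero_point, 0, qmax)
    # De-quantize ("fake quant"); the real implementation uses a straight-through
    # estimator for the rounding so gamma_hat/beta_hat remain trainable
    return (W_q - zero_point) * scale

# gamma_hat/beta_hat are optimized (per group/layer) to minimize layer-wise quantization error
W = torch.randn(128)
gamma_hat, beta_hat = torch.zeros(1), torch.zeros(1)
W_approx = lwc_fake_quant(W, gamma_hat, beta_hat)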

The paper (Table 1) shows better WikiText2 perplexity than GPTQ and AWQ at every weight quantization level for Llama 1 & 2, 7B-70B. Most interesting is that it has good (close to W4A16 RTN) perplexity at W3A16/W3A16g128. There is also weight-activation quantization that scores well on real-world benchmarks (ARC, HellaSwag, Winogrande, etc.), with W6A6 results close to FP16 (Table 2).

On performance, the paper reports that with MLC the W4A16g128 model performs well (2X unquantized FP16), but that W3A16g128 is only slightly faster. Spoiler alert: at the bottom of this page you can see my test results, which are much improved over the paper's numbers.

Running a model

Per tradition, the installation instructions in the repo are incomplete/do not work.

Basic install works:

conda create -n omniquant python=3.10 -y
conda activate omniquant
git clone https://github.com/OpenGVLab/OmniQuant.git
cd OmniQuant
pip install --upgrade pip 
pip install -e .

If you try to run one of the pre-compiled models using the repo's instructions, you will get an error (I've added comments):

# You probably have this already
conda install git git-lfs
git lfs install

# Where MLC expects to find models
mkdir dist && cd dist

# test Llama-2-7b-chat with w3a16g128 quantization
git clone https://huggingface.co/ChenMnZ/Llama-2-7b-chat-omniquant-w3a16g128asym

# not included, but duh
cd ..

# wah wah, the included executable won't run
./mlc_chat_cli --local-id Llama-2-7b-chat-omniquant-w3a16g128asym --device-name cuda
./mlc_chat_cli: error while loading shared libraries: libmlc_llm.so: cannot open shared object file: No such file or directory

I've familiarized myself with MLC LLM before, so the next obvious step is to install our own mlc_chat_cli:

conda install -c mlc-ai -c conda-forge mlc-chat-cli-nightly

# boo-urns - also note, the command line parameters have changed w/ recent versions
❯ mlc_chat_cli --model Llama-2-7b-chat-omniquant-w3a16g128asym --device cuda
Use MLC config: "/home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/params/mlc-chat-config.json"
Use model weights: "/home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/params/ndarray-cache.json"
Use model library: "/home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so"
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out the latest stats (token/sec)
  /reset              restart a fresh chat
  /reload [model]  reload model `model` from disk, or reload the current model if `model` is not specified

Loading model...
mlc_chat_cli: symbol lookup error: /home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so: undefined symbol: __cudaRegisterFatBinary

Unfortunately, the nightly build does not work with CUDA. (If you search for undefined symbol: __cudaRegisterFatBinary, you will find that it refers to binaries not being linked properly to the CUDA runtime libs, e.g. -lcudart.) To solve this, you need to build your own mlc_chat_cli. I use my previous docs:

# Note, you need TVM, but you no longer need to build them separately for CUDA support
# I'm using CUDA 12.1, but pick your appropriate package from https://mlc.ai/package/
pip install --pre --force-reinstall mlc-ai-nightly-cu121 mlc-chat-nightly-cu121 -f https://mlc.ai/wheels

# yes, even with mlc-chat-nightly-cu121 CUDA error occurs

# Compile
git clone https://github.com/mlc-ai/mlc-llm.git --recursive
cd mlc-llm
mkdir build && cd build
cmake .. -DUSE_CUDA=ON
make -j`nproc`

# You now have a usable mlc_chat_cli in mlc-llm/build
 ../mlc-llm/build/mlc_chat_cli --model Llama-2-7b-chat-omniquant-w3a16g128asym
Use MLC config: "/home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/params/mlc-chat-config.json"
Use model weights: "/home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/params/ndarray-cache.json"
Use model library: "/home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so"
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out the latest stats (token/sec)
  /reset              restart a fresh chat
  /reload [model]  reload model `model` from disk, or reload the current model if `model` is not specified

Loading model...
Loading finished
Running system prompts...
System prompts finished
[INST]: /stats
prefill: 42862.4 tok/s, decode: -nan tok/s
[INST]: 

Quantize a Model

OK, well that was a PITA, but let's turn to quantizing a model. Here, the docs are sadly incomplete, although one of the primary authors, ChenMnZ, was very quick to respond. Thanks!

I'll go through this step by step. First, the shifts and scales available for download are only for their specific model zoo, so you'll definitely need to generate them for your specific raw unquantized model:

python generate_act_scale_shift.py --model /models/llm/hf/meta-llama_Llama-2-7b-hf

Note: before you can create a custom model you will need to add the name of your model to net_choices in main.py (here) or it'll barf at you.

From here, the instructions suggest that you run a weight-only quantization:

# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /models/llm/hf/meta-llama_Llama-2-7b-hf  \
--epochs 20 --output_dir ./log/llama2-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc

However, if you do this, you will end up with a set of logs (about 200MB for llama2-7b) that you will need to further process. If you accidentally did this (which takes almost 2 hours on a 4090), then you can set --epochs to 0 and add --resume (see the full reply) for the "fake quantize" step, but this "fake quantize" is really what you want to do if you want an actual quantized model. I recommend skipping to it directly:

# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /models/llm/hf/meta-llama_Llama-2-7b-hf  \
--epochs 20 --output_dir ./log/llama2-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--save_dir /models/llm/hf/llama2-7b-omniquant-w3a16g128

You now have a "fake quantize" checkpoint that is ready to be processed into a "real" quant. This pre-quant checkpoint will be the same size as an FP16 model (~13GB for llama2-7b).

After that, you will want to run build.py from your mlc-llm repo checkout, but here's the crucial step that's undocumented in the README.md. You need to add their quantization scheme to mlc_llm/quantization/__init__.py:

    "w3a16g128asym": QuantizationScheme(
        name="w3a16g128asym",
        linear_weight=GroupQuantizationSpec(
            dtype="float16",
            mode="int3",
            sym=False,
            storage_nbit=16,
            group_size=128,
            transpose=False,
        ),
        embedding_table=None,
        final_fc_weight=None,
    ),

You can probably figure out the general pattern for any other quants you want to do.
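
For example (untested, just extrapolating from the pattern above), a 4-bit group-128 asymmetric variant would presumably look something like this:

    "w4a16g128asym": QuantizationScheme(
        name="w4a16g128asym",
        linear_weight=GroupQuantizationSpec(
            dtype="float16",
            mode="int4",          # only the bit width changes vs the w3 entry above
            sym=False,
            storage_nbit=16,
            group_size=128,
            transpose=False,
        ),
        embedding_table=None,
        final_fc_weight=None,
    ),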

Once you have this, then you can compile the MLC-LLM model, which should be quick:

python build.py --target cuda --quantization w3a16g128asym --model /models/llm/hf/llama2-7b-omniquant-w3a16g128 --use-cache=0

# Now you will have a new dist/ folder with the compiled model

OK, so this was everything I needed to get things working.

Quantize Times

Note, OmniQuant runs 20 epochs per layer and takes some time. Here are the quantization times from Table A1 of the paper, showing how long the LLaMA (1) models took the researchers to quantize using a single NVIDIA A100-80G:

LLaMA              7B    13B   30B   65B
weight-only        1.1h  2.2h  4.5h  8.9h
weight-activation  1.6h  3.3h  7.3h  14.4h

On my RTX 4090, a llama2-7b took ~1.7-1.8h for a weight-only quantize.

Inferencing Performance

mlc_chat_cli afaik doesn't have an easy method for measuring perplexity, but it does have an `--evaluate` flag now that lets me set my standard comparison parameters.

Here I was pleasantly surprised. The llama2-7b inferencing on the 4090 at 176 t/s actually manages to beat MLC LLM's q4f16_1 results (165 t/s, the previous speed champ).

The 3090 results are slower, but still respectable.

# 4090
❯ CUDA_VISIBLE_DEVICES=0 build/mlc_chat_cli --model llama2-7b-omniquant-w3a16g128-w3a16g128asym --evaluate --eval-prompt-len 128 --eval-gen-len 1920
...
[20:40:33] /home/local/llm/omniquant/mlc-llm/cpp/llm_chat.cc:706: [i: 2048] decoding-time=6.4203ms tok/s: 176.101.

# 3090
❯ CUDA_VISIBLE_DEVICES=0 build/mlc_chat_cli --model llama2-7b-omniquant-w3a16g128-w3a16g128asym --evaluate --eval-prompt-len 128 --eval-gen-len 1920
...
[20:45:35] /home/local/llm/omniquant/mlc-llm/cpp/llm_chat.cc:706: [i: 2048] decoding-time=12.3842ms tok/s: 88.736.

Just to make sure we were doing an apples-to-apples comparison, I also ran the same test against a new q4f16_1 build made with the same (2023-09-02) TVM and MLC:

# Build Model
❯ CUDA_VISIBLE_DEVICES=1 python build.py --target cuda --quantization q4f16_1 --model /models/llm/hf/meta-llama_Llama-2-7b-hf --use-cache=0

# 4090
❯ CUDA_VISIBLE_DEVICES=0 build/mlc_chat_cli --model meta-llama_Llama-2-7b-hf-q4f16_1 --evaluate --eval-prompt-len 128 --eval-gen-len 1920
...
[21:32:42] /home/local/llm/omniquant/mlc-llm/cpp/llm_chat.cc:706: [i: 2048] decoding-time=6.50777ms tok/s: 172.947.

# 3090
❯ CUDA_VISIBLE_DEVICES=1 build/mlc_chat_cli --model meta-llama_Llama-2-7b-hf-q4f16_1 --evaluate --eval-prompt-len 128 --eval-gen-len 1920
...
[21:35:55] /home/local/llm/omniquant/mlc-llm/cpp/llm_chat.cc:706: [i: 2048] decoding-time=8.13898ms tok/s: 141.371.

It looks like TVM/MLC have gotten even faster in the past month or so, although on the 4090, W3A16g128 still barely edges out q4f16_1.

Check out my performance page for links to more perf benchmarks, although as this test shows, keep in mind that software perf is constantly improving.

Also note, in my testing the W3A16g128 doesn't seem to save much memory - nvidia-smi reported a top usage of 5.1GiB VRAM vs 5.2-5.5GiB for MLC LLM q4f16_1, GGML q4_K_M, and GPTQ q4_128gs. With MLC LLM, W3A16g128 saves about 250MiB of memory vs q4f16_1, or about 4-5%, so it's not nothing, especially as model sizes grow. And according to Table 3 of the paper, a LLaMA-65B W4A16 uses 41.0G while the W3A16 uses only 35.1G.

While OmniQuant does seem like an incremental improvement overall, it's probably not worth running out and requantizing all your models, especially considering how time consuming it can be (and how much less usable MLC LLM is atm for most batch=1 workflows). That being said, it looks like it does work, and is a viable option to use if you are trying to squeeze into the least memory available. Also, I've only tested out the weight quantization model, and have no idea how the weight-activation models perform, but I'll leave that for others to test.

OpenAI API Compatibility

Most inferencing packages have their own REST API, but having an OpenAI-compatible API is useful for supporting a variety of clients, or for easily switching between providers (see the sketch after the lists below).

 

https://github.com/abetlen/llama-cpp-python

https://github.com/c0sogi/llama-api

https://localai.io/

https://github.com/hyperonym/basaran

https://github.com/xorbitsai/inference

 

Some clients provide an OpenAI API compatibility layer:

https://github.com/oobabooga/text-generation-webui

https://lmstudio.ai/
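
The payoff is that existing OpenAI clients just need their base URL pointed at the local server. A minimal sketch with the (pre-v1.0) openai Python package, assuming something like llama-cpp-python's server is listening on localhost:8000:

import openai

# Point the client at the local OpenAI-compatible server instead of api.openai.com
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "sk-not-needed"  # most local servers ignore the key, but the client wants one set

resp = openai.ChatCompletion.create(
    model="local-model",  # many servers ignore or loosely match the model name
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp["choices"][0]["message"]["content"])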

Prompting

Prompt Format

Most instruct/chat fine-tunes use their own prompt formatting (which can differ wildly). These can usually be dug out of model cards/repos/LocalLLaMA discussions/Discords, but the best single reference I've found is oobabooga's Instruction Templates:

If using HF, see also the documentation on chat templating
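
With a recent transformers version, the tokenizer carries the model's chat template so you don't have to hand-roll the format. A quick sketch (the model name is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What prompt format does this model expect?"},
]

# Renders the messages with the model's own template (e.g. [INST] ... [/INST] for Llama 2 chat)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)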

Translation

Google MADLAD-400

NLLB

https://www.reddit.com/r/LocalLLaMA/comments/16hjrbr/google_research_releases_new_107b_multilingual/

 

See also: