Replit has trained a very strong 3B-parameter code-completion foundation model on The Stack. One fine-tune beats WizardCoder-15B (a StarCoder fine-tune) on HumanEval, making it probably the strongest open code-completion model as of July 2023.

2023-07-12: Sadly, it appears that replit-code-instruct-glaive's extremely strong HumanEval performance may be mostly due to training-data contamination. (Also, I noticed a v2 in progress…)


### Environment
conda create -n replit
conda activate replit
mamba install pip

### Running Replit HF

First, let’s see if we can run the included code. Install any libs it complains about:

git clone
pip install einops sentencepiece transformers torch


# Code from
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = '/data/ai/models/llm/replit/replit-code-instruct-glaive'

# load model
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL, trust_remote_code=True)

PROMPT = '# Python function to call OpenAI Completion API'

x = tokenizer.encode(PROMPT, return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
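For reuse in later scripts, the encode/generate/decode steps above can be wrapped in a helper. A minimal sketch (the `complete` function and its signature are my own, not part of the Replit example code):

```python
def complete(model, tokenizer, prompt, max_length=100):
    """Encode the prompt, sample a completion, and decode it back to text."""
    x = tokenizer.encode(prompt, return_tensors='pt')
    y = model.generate(x, max_length=max_length, do_sample=True,
                       top_p=0.95, top_k=4, temperature=0.2,
                       num_return_sequences=1,
                       eos_token_id=tokenizer.eos_token_id)
    # clean_up_tokenization_spaces=False preserves code whitespace
    return tokenizer.decode(y[0], skip_special_tokens=True,
                            clean_up_tokenization_spaces=False)
```

Then `print(complete(model, tokenizer, PROMPT))` replaces the inline steps.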


❯ time python
You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:04<00:00,  2.05s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Executed in   70.49 secs    fish           external
  • Fine, but slow, let’s ggml

### Convert to GGML

git clone
mkdir build && cd build
# using system CUDA is fine
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc ..
# or -j`nproc`
make -j32 all
pip install -r ../requirements.txt
pip install pygments

# 0 for fp32, 1 for fp16
python ./examples/replit/ [replit_model_folder] 1
# outputs ggml-model-f16.bin in folder

# Optional quantize - for me fp16 is 105ms/tok, q8_0 is 60ms/tok, q5_1 is 50ms/tok
build/bin/replit-quantize ggml-model-f16.bin q8_0.bin 7
build/bin/replit-quantize ggml-model-f16.bin q5_1.bin 9
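The trailing integer selects the ggml quantization type (7 = q8_0 and 9 = q5_1 in the commands above). A small lookup helper; the q4_0/q4_1/q5_0 IDs are my reading of ggml's quantize table and may drift between ggml versions:

```python
# ggml quantization type IDs as passed to replit-quantize.
# q8_0=7 and q5_1=9 match the commands above; the other entries
# are assumptions from ggml's quantize table and may change.
GGML_QUANT_IDS = {
    'q4_0': 2,
    'q4_1': 3,
    'q5_0': 8,
    'q5_1': 9,
    'q8_0': 7,
}

def quantize_cmd(src: str, dst: str, qtype: str) -> str:
    """Build a replit-quantize command line for a named quant type."""
    return f'build/bin/replit-quantize {src} {dst} {GGML_QUANT_IDS[qtype]}'
```

e.g. `quantize_cmd('ggml-model-f16.bin', 'q8_0.bin', 'q8_0')` reproduces the first command above.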


time build/bin/replit -m /data/ai/models/llm/replit/replit-code-instruct-glaive/ggml-model-f16.bin -p "# Python function to call OpenAI Completion API" -n 100
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090
replit_model_load: memory_size =   640.00 MB, n_mem = 65536
main:  predict time = 22038.78 ms / 105.45 ms per token
Executed in   12.97 secs    fish           external
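For intuition, converting the ms/token figures above into throughput:

```python
# ms/token figures reported above for each precision
timings_ms = {'f16': 105, 'q8_0': 60, 'q5_1': 50}

throughput = {name: 1000 / ms for name, ms in timings_ms.items()}
for name, tps in throughput.items():
    print(f'{name}: {tps:.1f} tokens/s')
# f16: 9.5, q8_0: 16.7, q5_1: 20.0 tokens/s
```

So q5_1 roughly doubles generation speed over fp16.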

### C Transformers

We want C Transformers, a Python GGML wrapper, since it will let us use the model with LangChain and other Python projects:

CT_CUBLAS=1 pip install -U ctransformers --no-binary ctransformers

And our test script based off of the usage docs:

from ctransformers import AutoModelForCausalLM
import time

MODEL = '/data/ai/models/llm/replit/replit-code-instruct-glaive/ggml-model-f16.bin'
PROMPT = '# Python function to call OpenAI Completion API'

start = time.time()
print('Loading... ', end='')
llm = AutoModelForCausalLM.from_pretrained(MODEL, model_type='replit', gpu_layers=99)
t = time.time() - start
print(f'{t:.2f}s')

tokens = llm.tokenize(PROMPT)

n = 0
start = time.time()
for token in llm.generate(tokens):
    print(llm.detokenize(token), end='')

    # 100 tokens pls
    n += 1
    if n >= 100:
        break

# seconds per token
tps = (time.time() - start) / 100
print(f'*** {tps:.3f} s/t')


❯ time python
Loading... 0.79s
*** 0.786 s/t
Executed in   13.22 secs    fish           external


We can test how C Transformers works with ChatDocs.

Our chatdocs.yml:

ctransformers:
  model: /data/ai/models/llm/replit/replit-code-instruct-glaive
  model_file: ggml-model-f16.bin
  model_type: replit
  config:
    context_length: 2048


pip install chatdocs

# note you need to make sure chatdocs is using your conda Python
# you can either run: python `which chatdocs` [command]
# or you can modify the chatdocs bin

chatdocs download
chatdocs add /path/to/documents
chatdocs ui
  • I ran into some issues w/ ChromaDB’s indexing. Will need to debug later


For just testing out simple interactive usage, adapting the test script above worked well (just replace the model and GGML path).
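The token loop from the C Transformers test script can be factored into a generator with a clean cutoff, which makes interactive use easier. A sketch assuming the same tokenize/generate/detokenize API as above (the `stream_complete` name is mine):

```python
def stream_complete(llm, prompt, max_new_tokens=100):
    """Yield detokenized text piece by piece, stopping after max_new_tokens."""
    tokens = llm.tokenize(prompt)
    for n, token in enumerate(llm.generate(tokens)):
        if n >= max_new_tokens:
            break
        yield llm.detokenize(token)

# Usage:
# for piece in stream_complete(llm, '# Python function to call OpenAI Completion API'):
#     print(piece, end='', flush=True)
```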