Replit has trained a very strong 3B-parameter code-completion foundation model on The Stack. One fine-tune beats WizardCoder-15B (a StarCoder fine-tune) on HumanEval, making it probably the strongest open code-completion model as of July 2023.
2023-07-12: Sadly, it appears that replit-code-instruct-glaive's extremely strong HumanEval performance may be mostly due to training data contamination: https://huggingface.co/sahil2801/replit-code-instruct-glaive/discussions/3 (also, I noticed a v2 in progress…)
## Setup
### Environment
```bash
conda create -n replit
conda activate replit
mamba install pip
```
## Running Replit HF
First let's see if we can run the included code. Install any missing libraries if it complains:
```bash
git clone https://huggingface.co/sahil2801/replit-code-instruct-glaive
pip install einops sentencepiece transformers torch
```
Our `test.py`:
```python
# Code from https://huggingface.co/replit/replit-code-v1-3b#generation
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = '/data/ai/models/llm/replit/replit-code-instruct-glaive'

# load model
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL, trust_remote_code=True)

PROMPT = '# Python function to call OpenAI Completion API'
x = tokenizer.encode(PROMPT, return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
```
Running:
```
❯ time python test.py
You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.
Loading checkpoint shards: 100%|████████████████████████████████████████| 2/2 [00:04<00:00,  2.05s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
...
Executed in   70.49 secs    fish           external
```
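The attention-mask warning is harmless for a single prompt, but it can be silenced by passing the mask explicitly. A minimal sketch, reusing the tokenizer/model from `test.py` above (falling back to EOS for `pad_token_id` is an assumption, since the tokenizer sets no pad token):

```python
# Build inputs with an explicit attention mask to quiet the generate() warning
inputs = tokenizer(PROMPT, return_tensors='pt')
y = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    pad_token_id=tokenizer.eos_token_id,  # assumption: use EOS since no pad token is set
    max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2,
)
```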
- Fine, but slow; let's try GGML
## Convert to GGML
```bash
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
# using system CUDA is fine
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc ..
# or -j`nproc`
make -j32 all
pip install -r ../requirements.txt
pip install pygments
cd ..
# 0 for fp32, 1 for fp16
python ./examples/replit/convert-h5-to-ggml.py [replit_model_folder] 1
# outputs ggml-model-f16.bin in the model folder
# Optional quantize - for me fp16 is 105ms/tok, q8_0 is 60ms/tok, q5_1 is 50ms/tok
build/bin/replit-quantize ggml-model-f16.bin q8_0.bin 7
build/bin/replit-quantize ggml-model-f16.bin q5_1.bin 9
```
## Test GGML
```
time build/bin/replit -m /data/ai/models/llm/replit/replit-code-instruct-glaive/ggml-model-f16.bin -p "# Python function to call OpenAI Completion API" -n 100
...
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090
replit_model_load: memory_size =   640.00 MB, n_mem = 65536
...
main: predict time = 22038.78 ms / 105.45 ms per token
...
Executed in   12.97 secs    fish           external
```
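The quantized files from the conversion step run the same way; swapping in the q5_1 bin should reflect the ~50 ms/tok figure noted above:

```bash
# Same invocation, just pointing at the quantized file
time build/bin/replit -m q5_1.bin -p "# Python function to call OpenAI Completion API" -n 100
```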
## C Transformers
We want C Transformers, a Python GGML wrapper, since it will let us use the model with LangChain and other Python projects:
```bash
CT_CUBLAS=1 pip install -U ctransformers --no-binary ctransformers
```
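Since LangChain is part of the draw, here's a hedged sketch of loading the GGML bin through LangChain's `CTransformers` wrapper (untested here; passing `gpu_layers` through `config` is an assumption):

```python
# Sketch: drive the GGML model from LangChain via its CTransformers integration
from langchain.llms import CTransformers

llm = CTransformers(
    model='/data/ai/models/llm/replit/replit-code-instruct-glaive/ggml-model-f16.bin',
    model_type='replit',
    config={'max_new_tokens': 100, 'temperature': 0.2},  # gpu_layers may also go here (assumption)
)
print(llm('# Python function to call OpenAI Completion API'))
```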
And our test script, based on the usage docs:
```python
from ctransformers import AutoModelForCausalLM
import time

MODEL = '/data/ai/models/llm/replit/replit-code-instruct-glaive/ggml-model-f16.bin'
PROMPT = '# Python function to call OpenAI Completion API'

start = time.time()
print('Loading... ', end='')
llm = AutoModelForCausalLM.from_pretrained(MODEL, model_type='replit', gpu_layers=99)
t = time.time() - start
print(f'{t:.2f}s')

tokens = llm.tokenize(PROMPT)
n = 0
start = time.time()
for token in llm.generate(tokens):
    print(llm.detokenize(token), end='')
    # 100 tokens pls
    n += 1
    if n >= 100:
        break
spt = (time.time() - start) / 100  # seconds per token
print('\n\n')
print(f'*** {spt:.3f} s/t')
```
Output:
```
❯ time python test-ggml.py
Loading... 0.79s
...
*** 0.786 s/t
...
Executed in   13.22 secs    fish           external
```
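ctransformers also has a simpler high-level interface if you don't need token-level control; a minimal sketch reusing `MODEL` and `PROMPT` from the script above:

```python
llm = AutoModelForCausalLM.from_pretrained(MODEL, model_type='replit', gpu_layers=99)
# The model object is directly callable and returns the generated text
print(llm(PROMPT, max_new_tokens=100, temperature=0.2, top_p=0.95, top_k=4))
```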
## ChatDocs
We can test how CTransformers works with ChatDocs.
Our `chatdocs.yml`:
```yaml
ctransformers:
  model: /data/ai/models/llm/replit/replit-code-instruct-glaive
  model_file: ggml-model-f16.bin
  model_type: replit
  config:
    context_length: 2048
```
Setup:
```bash
pip install chatdocs
# note you need to make sure chatdocs is using your conda Python
# you can either run: python `which chatdocs` [command]
# or you can modify the chatdocs bin
chatdocs download
chatdocs add /path/to/documents
chatdocs ui
```
- I ran into some issues with ChromaDB's indexing. Will need to debug later
## replit-3b-inference
For just testing simple interactive usage, adapting the repo's inference.py worked well (just replace the model and GGML path); a sketch of that shape of loop is below.
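A minimal sketch of that kind of interactive loop with ctransformers (not the repo's actual code; the model path is the one from above):

```python
from ctransformers import AutoModelForCausalLM

MODEL = '/data/ai/models/llm/replit/replit-code-instruct-glaive/ggml-model-f16.bin'
llm = AutoModelForCausalLM.from_pretrained(MODEL, model_type='replit', gpu_layers=99)

while True:
    prompt = input('> ')
    if not prompt:
        break
    # stream=True yields text fragments as they are generated
    for text in llm(prompt, max_new_tokens=200, temperature=0.2, stream=True):
        print(text, end='', flush=True)
    print()
```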