Here we experiment with running a local mixture of experts (LMoE).
Released 2023-08-23: https://x.com/jon_durbin/status/1694360998797250856
Code: https://github.com/jondurbin/airoboros#lmoe
Setup
Run
Uses 17.54GB VRAM
And test:
Client
The current version of the API is quite picky and I couldn’t find anything compatible… here’s a simple client that ChatGPT-4 CI helped me write:
- This is a super simple client; you’d want to add token counting and message log truncation if you were going to use it seriously
- On my fast system (NVMe SSD, 5950X, 4090), the model takes 2-3 minutes to load; bitsandbytes might shorten that…
- The routing works, but obviously not well; it’s just a POC and could be improved tremendously
- llama2-7b is dumb as a box of rocks, lol
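For the token counting and message log truncation mentioned above, here is a rough sketch of what a slightly more careful client could look like. It is not the client from this page. It assumes the server speaks an OpenAI-style chat API at `http://127.0.0.1:8000/v1/chat/completions` and returns replies under `choices[0].message.content`; the URL, the token budget, and the 4-characters-per-token estimate are all assumptions to adjust for your setup.

```python
import json
import urllib.request

# Assumed values, not taken from the server's docs: adjust to your setup.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"  # hypothetical host/port/route
MAX_HISTORY_TOKENS = 2048  # hypothetical budget for the message log


def estimate_tokens(text):
    """Crude estimate of about 4 characters per token; not a real tokenizer."""
    return max(1, len(text) // 4)


def truncate_history(messages, budget=MAX_HISTORY_TOKENS):
    """Keep system messages plus the newest turns that fit in the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if kept and used + cost > budget:
            break
        kept.append(message)  # the newest message is always kept
        used += cost
    return system + kept[::-1]  # restore chronological order


def build_request(messages, model):
    """Build a POST request with an OpenAI-style JSON body (assumed format)."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return response_json["choices"][0]["message"]["content"]


def chat(messages, model):
    """Send the truncated message log to the server and return the reply text."""
    request = build_request(truncate_history(messages), model)
    with urllib.request.urlopen(request, timeout=600) as response:
        return extract_reply(json.load(response))
```

To use it, keep a running list of `{"role": ..., "content": ...}` dicts, append each user message, call `chat(messages, model=...)` with whatever model name your server expects, and append the reply as an `assistant` message. Swapping `estimate_tokens` for the model's real tokenizer would make the budget accurate.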
To test the routing, I recommend some simple queries like:
Some of the issues I hit were bugs that I reported and that have since been fixed:
- /v1/models endpoint bug
- CORS errors
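If you want to check both of those on your own install, a quick smoke test along these lines should do it. This is only a sketch: the base URL and the example browser origin are assumptions, and it only looks at the `Access-Control-Allow-Origin` header rather than doing a full preflight check.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed host/port, adjust to your server


def has_cors_header(headers, origin):
    """True if Access-Control-Allow-Origin allows this origin (or any origin)."""
    allowed = headers.get("Access-Control-Allow-Origin")
    return allowed in ("*", origin)


def check_models_endpoint(origin="http://localhost:3000"):
    """GET /v1/models with an Origin header; return (parsed JSON, CORS ok?)."""
    request = urllib.request.Request(
        BASE_URL + "/v1/models",
        headers={"Origin": origin},  # hypothetical browser origin
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
        return payload, has_cors_header(response.headers, origin)
```

Calling `check_models_endpoint()` against a running server returns the parsed model list and a boolean saying whether a browser client on that origin would be allowed to read the response.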
Just as an FYI, here are the clients I tried that didn’t work: