While some inferencing engines have their own APIs as well, most have standardized on OpenAI API compatibility:
- llama.cpp - has `llama-server`, which is compatible OOTB
  - If you want Python bindings: llama-cpp-python
- mlc-llm - run `mlc_llm serve` for an OpenAI-compatible server
- ExLlamaV2 - not OOTB, but run TabbyAPI for compatibility
- vLLM - comes OOTB with `vllm serve`
- LiteLLM - useful as a single middleware layer for lots of model types (see the LiteLLM sketch below)
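Once any of these servers is running, the standard `openai` Python client can talk to it by overriding the base URL. A minimal sketch against `llama-server` (assuming its default port 8080; the model name is a placeholder, since many local servers ignore the field):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's default port
    api_key="sk-local",  # most local servers accept any non-empty key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; many local servers ignore this field
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

Pointing the same code at vLLM or mlc_llm is just a matter of changing `base_url`.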
Most inferencing packages also have their own native REST API, but an OpenAI-compatible API is useful for working with a wide variety of clients, and for switching easily between providers.
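LiteLLM works the other way around: a single call signature in front of many providers, with routing controlled by the model string. A rough sketch (the model strings and local endpoint are illustrative; check LiteLLM's provider docs for the exact prefixes):

```python
from litellm import completion

messages = [{"role": "user", "content": "Say hello in one sentence."}]

# Hosted provider: routed by the "openai/" prefix, reads OPENAI_API_KEY from the env.
resp = completion(model="openai/gpt-4o-mini", messages=messages)

# Local OpenAI-compatible server: same call, different model string and base URL.
resp_local = completion(
    model="openai/local-model",           # placeholder model name
    api_base="http://localhost:8080/v1",  # e.g. a running llama-server
    api_key="sk-local",                   # dummy key for a local server
    messages=messages,
)

print(resp.choices[0].message.content)
print(resp_local.choices[0].message.content)
```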
Are there any with full support (Assistants, function calling, chat completions)?
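For context, "function calling" here means the standard `tools` field of a chat completions request; a server with full support returns structured `tool_calls` instead of plain text when the model decides to call one. A sketch of the request shape, using the same placeholder server and model and a hypothetical `get_weather` function:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

# Standard OpenAI tools schema; get_weather is a made-up function for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# None if the server/model didn't produce a tool call
print(response.choices[0].message.tool_calls)
```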
https://github.com/xorbitsai/inference
- Python
- Uses various backends (CTransformers, llama-cpp-python; backend support is not well documented)
Some clients provide an OpenAI API compatibility layer:
https://github.com/oobabooga/text-generation-webui
- Uses conda
https://lmstudio.ai/
- Mac or Windows GUI app
- But w/ an OpenAI API layer
- Not open source, but free to use
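These compatibility layers are queried the same way as the servers above, just on a different port. A sketch against LM Studio's local server (its documented default is http://localhost:1234/v1; text-generation-webui's OpenAI extension listens on its own port), this time streaming the response:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",  # any non-empty string works locally
)

# stream=True yields chunks as they are generated instead of one final message.
stream = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whatever model is loaded
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```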