Skip to main content

💥 OpenAI Proxy Server

LiteLLM Server manages:

See LiteLLM Proxy code

Quick Start​

View all the supported args for the Proxy CLI here

$ pip install litellm
$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:8000

Test​

In a new shell, run, this will make an openai.chat.completions request. Ensure you're using openai v1.0.0+

litellm --test

This will now automatically route any requests for gpt-3.5-turbo to bigcode starcoder, hosted on huggingface inference endpoints.

Using LiteLLM Proxy - Curl Request, OpenAI Package​

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
],
}
'

Server Endpoints​

  • POST /chat/completions - chat completions endpoint to call 100+ LLMs
  • POST /completions - completions endpoint
  • POST /embeddings - embedding endpoint for Azure, OpenAI, Huggingface endpoints
  • GET /models - available models on server
  • POST /key/generate - generate a key to access the proxy

Supported LLMs​

All LiteLLM supported LLMs are supported on the Proxy. Seel all supported llms

$ export AWS_ACCESS_KEY_ID=
$ export AWS_REGION_NAME=
$ export AWS_SECRET_ACCESS_KEY=
$ litellm --model bedrock/anthropic.claude-v2

Using with OpenAI compatible projects​

LiteLLM allows you to set openai.api_base to the proxy server and use all LiteLLM supported LLMs in any OpenAI supported project

FLASK - Fine-grained Language Model Evaluation Use litellm to evaluate any LLM on FLASK https://github.com/kaistAI/FLASK

Step 1: Start the local proxy

$ litellm --model huggingface/bigcode/starcoder

Step 2: Set OpenAI API Base & Key

$ export OPENAI_API_BASE=http://0.0.0.0:8000

Step 3 Run with FLASK

git clone https://github.com/kaistAI/FLASK
cd FLASK/gpt_review

Run the eval

python gpt4_eval.py -q '../evaluation_set/flask_evaluation.jsonl'

Proxy Configs​

The Config allows you to set the following params

Param NameDescription
model_listList of supported models on the server, with model-specific configs
litellm_settingslitellm Module settings, example litellm.drop_params=True, litellm.set_verbose=True, litellm.api_base, litellm.cache
general_settingsServer settings, example setting master_key: sk-my_special_key
environment_variablesEnvironment Variables example, REDIS_HOST, REDIS_PORT

Example Config​

model_list:
- model_name: zephyr-alpha
litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
model: huggingface/HuggingFaceH4/zephyr-7b-alpha
api_base: http://0.0.0.0:8001
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: https://<my-hosted-endpoint>

litellm_settings:
drop_params: True
set_verbose: True

general_settings:
master_key: sk-1234 # [OPTIONAL] Only use this if you to require all calls to contain this key (Authorization: Bearer sk-1234)


environment_variables:
OPENAI_API_KEY: sk-123
REPLICATE_API_KEY: sk-cohere-is-okay
REDIS_HOST: redis-16337.c322.us-east-1-2.ec2.cloud.redislabs.com
REDIS_PORT: "16337"
REDIS_PASSWORD:

Config for Multiple Models - GPT-4, Claude-2​

Here's how you can use multiple llms with one proxy config.yaml.

Step 1: Setup Config​

model_list:
- model_name: zephyr-alpha # the 1st model is the default on the proxy
litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
model: huggingface/HuggingFaceH4/zephyr-7b-alpha
api_base: http://0.0.0.0:8001
- model_name: gpt-4
litellm_params:
model: gpt-4
api_key: sk-1233
- model_name: claude-2
litellm_params:
model: claude-2
api_key: sk-claude
info

The proxy uses the first model in the config as the default model - in this config the default model is zephyr-alpha

Step 2: Start Proxy with config​

$ litellm --config /path/to/config.yaml

Step 3: Use proxy​

Curl Command

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "zephyr-alpha",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
],
}
'

Managing Auth - Virtual Keys​

Grant other's temporary access to your proxy, with keys that expire after a set duration.

Requirements:

  • Need to a postgres database (e.g. Supabase)

You can then generate temporary keys by hitting the /key/generate endpoint.

See code

Step 1: Save postgres db url

model_list:
- model_name: gpt-4
litellm_params:
model: ollama/llama2
- model_name: gpt-3.5-turbo
litellm_params:
model: ollama/llama2

general_settings:
master_key: sk-1234 # [OPTIONAL] if set all calls to proxy will require either this key or a valid generated token
database_url: "postgresql://<user>:<password>@<host>:<port>/<dbname>"

Step 2: Start litellm

litellm --config /path/to/config.yaml

Step 3: Generate temporary keys

curl 'http://0.0.0.0:8000/key/generate' \
--h 'Authorization: Bearer sk-1234' \
--d '{"models": ["gpt-3.5-turbo", "gpt-4", "claude-2"], "duration": "20m"}'
  • models: list or null (optional) - Specify the models a token has access too. If null, then token has access to all models on server.

  • duration: str or null (optional) Specify the length of time the token is valid for. If null, default is set to 1 hour. You can set duration as seconds ("30s"), minutes ("30m"), hours ("30h"), days ("30d").

Expected response:

{
"key": "sk-kdEXbIqZRwEeEiHwdg7sFA", # Bearer token
"expires": "2023-11-19T01:38:25.838000+00:00" # datetime object
}

Managing Auth - Upgrade/Downgrade Models​

If a user is expected to use a given model (i.e. gpt3-5), and you want to:

  • try to upgrade the request (i.e. GPT4)
  • or downgrade it (i.e. Mistral)
  • OR rotate the API KEY (i.e. open AI)
  • OR access the same model through different end points (i.e. openAI vs openrouter vs Azure)

Here's how you can do that:

Step 1: Create a model group in config.yaml (save model name, api keys, etc.)

model_list:
- model_name: my-free-tier
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8001
- model_name: my-free-tier
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8002
- model_name: my-free-tier
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8003
- model_name: my-paid-tier
litellm_params:
model: gpt-4
api_key: my-api-key

Step 2: Generate a user key - enabling them access to specific models, custom model aliases, etc.

curl -X POST "https://0.0.0.0:8000/key/generate" \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
"models": ["my-free-tier"],
"aliases": {"gpt-3.5-turbo": "my-free-tier"},
"duration": "30min"
}'
  • How to upgrade / downgrade request? Change the alias mapping
  • How are routing between diff keys/api bases done? litellm handles this by shuffling between different models in the model list with the same model_name. See Code

Managing Auth - Tracking Spend​

You can get spend for a key by using the /key/info endpoint.

curl 'http://0.0.0.0:8000/key/info?key=<user-key>' \
-X GET \
-H 'Authorization: Bearer <your-master-key>'

This is automatically updated (in USD) when calls are made to /completions, /chat/completions, /embeddings using litellm's completion_cost() function. See Code.

Sample response

{
"key": "sk-tXL0wt5-lOOVK9sfY2UacA",
"info": {
"token": "sk-tXL0wt5-lOOVK9sfY2UacA",
"spend": 0.0001065,
"expires": "2023-11-24T23:19:11.131000Z",
"models": [
"gpt-3.5-turbo",
"gpt-4",
"claude-2"
],
"aliases": {
"mistral-7b": "gpt-3.5-turbo"
},
"config": {}
}
}

Save Model-specific params (API Base, API Keys, Temperature, Headers etc.)​

You can use the config to save model-specific information like api_base, api_key, temperature, max_tokens, etc.

Step 1: Create a config.yaml file

model_list:
- model_name: gpt-4-team1
litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
model: azure/chatgpt-v-2
api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
api_version: "2023-05-15"
azure_ad_token: eyJ0eXAiOiJ
- model_name: gpt-4-team2
litellm_params:
model: azure/gpt-4
api_key: sk-123
api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
- model_name: mistral-7b
litellm_params:
model: ollama/mistral
api_base: your_ollama_api_base
headers: {
"HTTP-Referer": "litellm.ai",
"X-Title": "LiteLLM Server"
}

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Load API Keys from Vault​

If you have secrets saved in Azure Vault, etc. and don't want to expose them in the config.yaml, here's how to load model-specific keys from the environment.

os.environ["AZURE_NORTH_AMERICA_API_KEY"] = "your-azure-api-key"
model_list:
- model_name: gpt-4-team1
litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
model: azure/chatgpt-v-2
api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
api_version: "2023-05-15"
api_key: os.environ/AZURE_NORTH_AMERICA_API_KEY

See Code

s/o to @David Manouchehri for helping with this.

Load Balancing - Multiple Instances of 1 model​

If you have multiple instances of the same model,

in the config.yaml just add all of them with the same 'model_name', and the proxy will handle routing requests (using LiteLLM's Router).

In the config below requests with model=zephyr-beta will be routed across multiple instances of HuggingFaceH4/zephyr-7b-beta

model_list:
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8001
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8002
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8003

Step 2: Start Proxy with config​

$ litellm --config /path/to/config.yaml

Step 3: Use proxy​

Curl Command

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "zephyr-beta",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
],
}
'

Fallbacks + Cooldowns + Retries + Timeouts​

If a call fails after num_retries, fall back to another model group.

If the error is a context window exceeded error, fall back to a larger model group (if given).

See Code

Set via config

model_list:
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8001
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8002
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8003
- model_name: gpt-3.5-turbo
litellm_params:
model: gpt-3.5-turbo
api_key: <my-openai-key>
- model_name: gpt-3.5-turbo-16k
litellm_params:
model: gpt-3.5-turbo-16k
api_key: <my-openai-key>

litellm_settings:
num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
request_timeout: 10 # raise Timeout error if call takes longer than 10s
fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if call fails num_retries
context_window_fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo-16k"]}, {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]}] # fallback to gpt-3.5-turbo-16k if context window error
allowed_fails: 3 # cooldown model if it fails > 1 call in a minute.

Set dynamically

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "zephyr-beta",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
],
"fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
"context_window_fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
"num_retries": 2,
"request_timeout": 10
}
'

Config for Embedding Models - xorbitsai/inference​

Here's how you can use multiple llms with one proxy config.yaml. Here is how LiteLLM calls OpenAI Compatible Embedding models

Config​

model_list:
- model_name: custom_embedding_model
litellm_params:
model: openai/custom_embedding # the `openai/` prefix tells litellm it's openai compatible
api_base: http://0.0.0.0:8000/
- model_name: custom_embedding_model
litellm_params:
model: openai/custom_embedding # the `openai/` prefix tells litellm it's openai compatible
api_base: http://0.0.0.0:8001/

Run the proxy using this config

$ litellm --config /path/to/config.yaml

Config for setting Model Aliases​

Set a model alias for your deployments.

In the config.yaml the model_name parameter is the user-facing name to use for your deployment.

In the config below requests with model=gpt-4 will route to ollama/llama2

model_list:
- model_name: text-davinci-003
litellm_params:
model: ollama/zephyr
- model_name: gpt-4
litellm_params:
model: ollama/llama2
- model_name: gpt-3.5-turbo
litellm_params:
model: ollama/llama2

Caching Responses​

Caching can be enabled by adding the cache key in the config.yaml

Step 1: Add cache to the config.yaml​

model_list:
- model_name: gpt-3.5-turbo
litellm_params:
model: gpt-3.5-turbo

litellm_settings:
set_verbose: True
cache: # init cache
type: redis # tell litellm to use redis caching

Step 2: Add Redis Credentials to .env​

LiteLLM requires the following REDIS credentials in your env to enable caching

REDIS_HOST = ""       # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = "" # REDIS_PORT='18841'
REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'

Step 3: Run proxy with config​

$ litellm --config /path/to/config.yaml

Using Caching​

Send the same request twice:

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'

Control caching per completion request​

Caching can be switched on/off per /chat/completions request

  • Caching on for completion - pass caching=True:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": true
    }'
  • Caching off for completion - pass caching=False:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": false
    }'

Set Custom Prompt Templates​

LiteLLM by default checks if a model has a prompt template and applies it (e.g. if a huggingface model has a saved chat template in it's tokenizer_config.json). However, you can also set a custom prompt template on your proxy in the config.yaml:

Step 1: Save your prompt template in a config.yaml

# Model-specific parameters
model_list:
- model_name: mistral-7b # model alias
litellm_params: # actual params for litellm.completion()
model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1"
api_base: "<your-api-base>"
api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
initial_prompt_value: "\n"
roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
final_prompt_value: "\n"
bos_token: "<s>"
eos_token: "</s>"
max_tokens: 4096

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Debugging Proxy​

Run the proxy with --debug to easily view debug logs

litellm --model gpt-3.5-turbo --debug

When making requests you should see the POST request sent by LiteLLM to the LLM on the Terminal output

POST Request Sent from LiteLLM:
curl -X POST \
https://api.openai.com/v1/chat/completions \
-H 'content-type: application/json' -H 'Authorization: Bearer sk-qnWGUIW9****************************************' \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "this is a test request, write a short poem"}]}'

Logging Proxy Input/Output - Langfuse​

We will use the --config to set litellm.success_callback = ["langfuse"] this will log all successfull LLM calls to langfuse

Step 1 Install langfuse

pip install langfuse

Step 2: Create a config.yaml file and set litellm_settings: success_callback

model_list:
- model_name: gpt-3.5-turbo
litellm_params:
model: gpt-3.5-turbo
litellm_settings:
success_callback: ["langfuse"]

Step 3: Start the proxy, make a test request

Start proxy

litellm --config config.yaml --debug

Test Request

litellm --test

Expected output on Langfuse

LiteLLM Proxy Performance​

Throughput - 30% Increase​

LiteLLM proxy + Load Balancer gives 30% increase in throughput compared to Raw OpenAI API

Latency Added - 0.00325 seconds​

LiteLLM proxy adds 0.00325 seconds latency as compared to using the Raw OpenAI API

Proxy CLI Arguments​

--host​

  • Default: '0.0.0.0'
  • The host for the server to listen on.
  • Usage:
    litellm --host 127.0.0.1

--port​

  • Default: 8000
  • The port to bind the server to.
  • Usage:
    litellm --port 8080

--num_workers​

  • Default: 1
  • The number of uvicorn workers to spin up.
  • Usage:
    litellm --num_workers 4

--api_base​

  • Default: None
  • The API base for the model litellm should call.
  • Usage:
    litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

--api_version​

  • Default: None
  • For Azure services, specify the API version.
  • Usage:
    litellm --model azure/gpt-deployment --api_version 2023-08-01 --api_base https://<your api base>"

--model or -m​

  • Default: None
  • The model name to pass to Litellm.
  • Usage:
    litellm --model gpt-3.5-turbo

--test​

  • Type: bool (Flag)
  • Proxy chat completions URL to make a test request.
  • Usage:
    litellm --test

--alias​

  • Default: None
  • An alias for the model, for user-friendly reference.
  • Usage:
    litellm --alias my-gpt-model

--debug​

  • Default: False
  • Type: bool (Flag)
  • Enable debugging mode for the input.
  • Usage:
    litellm --debug

--temperature​

  • Default: None
  • Type: float
  • Set the temperature for the model.
  • Usage:
    litellm --temperature 0.7

--max_tokens​

  • Default: None
  • Type: int
  • Set the maximum number of tokens for the model output.
  • Usage:
    litellm --max_tokens 50

--request_timeout​

  • Default: 600
  • Type: int
  • Set the timeout in seconds for completion calls.
  • Usage:
    litellm --request_timeout 300

--drop_params​

  • Type: bool (Flag)
  • Drop any unmapped params.
  • Usage:
    litellm --drop_params

--add_function_to_prompt​

  • Type: bool (Flag)
  • If a function passed but unsupported, pass it as a part of the prompt.
  • Usage:
    litellm --add_function_to_prompt

--config​

  • Configure Litellm by providing a configuration file path.
  • Usage:
    litellm --config path/to/config.yaml

--telemetry​

  • Default: True
  • Type: bool
  • Help track usage of this feature.
  • Usage:
    litellm --telemetry False