Made by: Kong Inc.
Supported Gateway Topologies: hybrid, db-less, traditional
Supported Konnect Deployments: hybrid, cloud-gateways, serverless
Compatible Protocols: grpc, grpcs, http, https
Minimum Version: Kong Gateway 3.8
Tags: #ai
AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI Proxy Advanced plugin lets you transform and proxy requests to multiple AI providers and models at the same time. This lets you set up load balancing between targets.

The plugin accepts requests in one of a few defined and standardized formats, translates them to the configured target format, and then transforms the response back into a standard format.
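For illustration, a minimal two-target configuration might look like the following sketch (declarative YAML). The field names follow the config.balancer.* and config.targets.* options referenced throughout this page, but the model names and weights are placeholders and the exact schema can vary by plugin version:

```yaml
plugins:
  - name: ai-proxy-advanced
    config:
      balancer:
        algorithm: round-robin        # distribute requests across both targets
      targets:
        - route_type: llm/v1/chat
          weight: 80
          model:
            provider: openai
            name: gpt-4o              # placeholder model name
        - route_type: llm/v1/chat
          weight: 20
          model:
            provider: cohere
            name: command             # placeholder model name
```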

The following table lists the providers the AI Proxy Advanced plugin supports and the minimum Kong Gateway version required for each:

| Provider | Minimum Kong Gateway version |
|---|---|
| OpenAI (GPT-3.5, GPT-4, GPT-4o, and Multi-Modal) | 3.8 |
| Cohere | 3.8 |
| Azure | 3.8 |
| Anthropic | 3.8 |
| Mistral (mistral.ai, OpenAI, raw, and OLLAMA formats) | 3.8 |
| Llama2 (supports Llama2 and Llama3 models and raw, OLLAMA, and OpenAI formats) | 3.8 |
| Amazon Bedrock | 3.8 |
| Gemini | 3.8 |
| Hugging Face | 3.9 |

How it works

The AI Proxy Advanced plugin will mediate the following for you:

  • Request and response formats appropriate for the configured config.targets.model.provider and config.targets.route_type
  • The following service request coordinates (unless the model is self-hosted):
    • Protocol
    • Host name
    • Port
    • Path
    • HTTP method
  • Authentication on behalf of the Kong API consumer
  • Decorating the request with parameters from the config.targets.model.options block, appropriate for the chosen provider
  • Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
  • Optionally, recording all post-transformation request and response messages exchanged between users and the configured LLM
  • Fulfillment of requests to self-hosted models, based on select supported format transformations
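For instance, the authentication and request-decoration items above correspond roughly to a target's auth and model.options blocks. The sketch below assumes the max_tokens and temperature option names; verify the exact fields against your version's schema:

```yaml
# One entry under config.targets (a sketch, not a complete plugin configuration)
- route_type: llm/v1/chat
  model:
    provider: openai
    name: gpt-4o                               # placeholder model name
    options:
      max_tokens: 512                          # assumed option: decorates each upstream request
      temperature: 0.2                         # assumed option
  auth:
    header_name: Authorization                 # header Kong sends upstream on the consumer's behalf
    header_value: "Bearer <OPENAI_API_KEY>"    # placeholder credential, never exposed to the consumer
```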

Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to offer Kong consumers a choice of LLMs with consistent request and response formats, regardless of the backend provider or model.

This plugin currently only supports REST-based full text responses.

Load balancing

This plugin supports several load-balancing algorithms, similar to those used for Kong upstreams, allowing efficient distribution of requests across different AI models. The supported algorithms include:

  • Consistent-hashing (sticky session on a given header value): Routes requests based on a specified header value (X-Hashing-Header). Requests with the same header value are repeatedly routed to the same model, enabling sticky sessions that maintain context or affinity across user interactions.
  • Lowest-latency: Based on the response time of each model. Distributes requests to the models with the lowest response time.
  • Lowest-usage: Based on the volume of usage for each model. Balances the load by distributing requests to the models with the lowest usage, measured by factors such as prompt token counts, response token counts, cost (v3.10+), or other resource metrics.
  • Priority group (v3.10+): Routes requests to groups of models based on assigned weights. Higher-weighted groups are preferred, and if all models in a group fail, the plugin falls back to the next group. This allows for reliable failover and cost-aware routing across multiple AI models.
  • Round-robin (weighted): Distributes requests across models based on their respective weights. For example, if your models gpt-4, gpt-4o-mini, and gpt-3 have weights of 70, 25, and 5 respectively, they receive approximately 70%, 25%, and 5% of the traffic in turn. Requests are distributed proportionally, independent of usage or latency metrics.
  • Semantic: Distributes requests to different models based on the similarity between the prompt in the request and the description provided in the model configuration. This lets Kong automatically select the model best suited for the given domain or use case, which is especially useful when working with a diverse range of AI providers and models.
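For example, the weighted round-robin scenario described above might be expressed like the following sketch. The weights and model names mirror the example in the list, and the exact schema can vary by version:

```yaml
config:
  balancer:
    algorithm: round-robin
  targets:
    - weight: 70
      route_type: llm/v1/chat
      model: { provider: openai, name: gpt-4 }
    - weight: 25
      route_type: llm/v1/chat
      model: { provider: openai, name: gpt-4o-mini }
    - weight: 5
      route_type: llm/v1/chat
      model: { provider: openai, name: gpt-3 }
```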

Retry and fallback

The load balancer has customizable retries and timeouts for requests, and can redirect a request to a different model in case of failure. This allows you to have a fallback in case one of your targets is unavailable.

In versions 3.10 and later, this plugin supports fallback across targets with any supported format. In earlier versions, fallback isn’t supported across targets with different formats; you can still use multiple providers, but only if the formats are compatible. For example, load balancers with the following target combinations are supported:

  • Different OpenAI models
  • OpenAI models and Mistral models with the OpenAI format
  • Mistral models with the OLLAMA format and Llama models with the OLLAMA format

Some errors, such as client errors, result in a failure and don’t fail over to another target.

v3.10+ To trigger failover on conditions other than network errors, set config.balancer.failover_criteria to include:

  • Additional HTTP error codes, like http_429 or http_502
  • The non_idempotent setting, as most AI services accept POST requests
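A sketch of that balancer configuration follows. The retries field name is an assumption; the failover_criteria values come from the list above:

```yaml
config:
  balancer:
    retries: 3                 # assumed field: how many times to retry before giving up
    failover_criteria:
      - http_429               # fail over when the target is rate limited
      - http_502               # fail over on bad gateway responses
      - non_idempotent         # allow retrying POST requests, which most AI services use
```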

Request and response formats

The plugin’s config.targets.route_type should be set according to the target upstream endpoint and model, as shown in this capability matrix:

| Provider name | Provider path | Kong route type | Example model name |
|---|---|---|---|
| OpenAI | /v1/chat/completions | llm/v1/chat | gpt-4 |
| OpenAI | /v1/completions | llm/v1/completions | gpt-3.5-turbo-instruct |
| Cohere | /v1/chat | llm/v1/chat | command |
| Cohere | /v1/generate | llm/v1/completions | command |
| Azure | /openai/deployments/{deployment_name}/chat/completions | llm/v1/chat | gpt-4 |
| Azure | /openai/deployments/{deployment_name}/completions | llm/v1/completions | gpt-3.5-turbo-instruct |
| Anthropic | /v1/complete (version 3.6), /v1/messages (version 3.7 and later) | llm/v1/chat | claude-2.1 |
| Anthropic | /v1/complete | llm/v1/completions | claude-2.1 |
| Mistral | User-defined | llm/v1/chat | User-defined |
| Mistral | User-defined | llm/v1/completions | User-defined |
| Llama2 | User-defined | llm/v1/chat | User-defined |
| Llama2 | User-defined | llm/v1/completions | User-defined |
| Amazon Bedrock | Use the LLM chat upstream path | llm/v1/chat | Use the model name for the specific LLM provider |
| Gemini | llm/v1/chat | llm/v1/chat | gemini-1.5-flash or gemini-1.5-pro |
| Hugging Face | /models/{model_provider}/{model_name} | llm/v1/chat | Use the model name for the specific LLM provider |
| Hugging Face | /models/{model_provider}/{model_name} | llm/v1/completions | Use the model name for the specific LLM provider |
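For example, the Azure llm/v1/chat row above would typically translate into a target like the following sketch. The azure_instance and azure_deployment_id option names are assumptions borrowed from the AI Proxy family of options; verify them against your version's schema:

```yaml
- route_type: llm/v1/chat
  model:
    provider: azure
    name: gpt-4                        # example model name from the matrix
    options:
      azure_instance: my-instance      # assumed option: fills {azure_instance} in the upstream URL
      azure_deployment_id: my-gpt-4    # assumed option: fills {deployment_name} in the provider path
```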

The following upstream URL patterns are used:

| Provider | URL |
|---|---|
| OpenAI | https://api.openai.com:443/{route_type_path} |
| Cohere | https://api.cohere.com:443/{route_type_path} |
| Azure | https://{azure_instance}.openai.azure.com:443/openai/deployments/{deployment_name}/{route_type_path} |
| Anthropic | https://api.anthropic.com:443/{route_type_path} |
| Mistral | As defined in config.targets.model.options.upstream_url |
| Llama2 | As defined in config.targets.model.options.upstream_url |
| Amazon Bedrock | https://bedrock-runtime.{region}.amazonaws.com |
| Gemini | https://generativelanguage.googleapis.com |
| Hugging Face | https://api-inference.huggingface.co |

While only the Llama2 and Mistral models are classed as self-hosted, the target URL can be overridden for any of the supported providers. For example, a self-hosted or otherwise OpenAI-compatible endpoint can be called by setting the same config.targets.model.options.upstream_url plugin option.
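For example, pointing a target at a self-hosted, OpenAI-compatible endpoint might look like this sketch (the host, port, and model name are placeholders):

```yaml
- route_type: llm/v1/chat
  model:
    provider: openai                   # keep the OpenAI request/response format
    name: my-local-model               # placeholder model name
    options:
      upstream_url: http://llm.internal:8080/v1/chat/completions   # placeholder self-hosted endpoint
```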

v3.10+ If you are using each provider’s native SDK, Kong Gateway allows you to transparently proxy the request without any transformation and return the response unmodified. This can be done by setting config.llm_format to a value other than openai, such as gemini or bedrock.

In this mode, Kong Gateway will still provide useful analytics, logging, and cost calculation.
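For example, to pass native Gemini SDK traffic through untouched, the configuration might be as simple as this sketch:

```yaml
config:
  llm_format: gemini       # don't translate to/from the OpenAI format; proxy the native request as-is
```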

Input formats

Kong Gateway mediates the request and response format based on the selected config.targets.model.provider and config.targets.route_type.

v3.10+ By default, Kong Gateway uses the OpenAI format, but you can customize this using config.llm_format. If llm_format is not set to openai, the plugin will not transform the request when sending it upstream and will leave it as-is.

The Kong AI Proxy Advanced plugin accepts the following input formats, standardized across all providers. The config.targets.route_type must be configured to match the required request and response formats.

Response formats

Conversely, the response formats are also transformed to a standard format across all providers.

The request and response formats are loosely based on OpenAI. See the sample OpenAPI specification for more detail on the supported formats.

Templating v3.7+

The plugin allows you to substitute values in the config.model.name and any parameter under config.model.options with specific placeholders, similar to those in the Request Transformer Advanced plugin.

The following templated parameters are available:

  • $(headers.header_name): The value of a specific request header.
  • $(uri_captures.path_parameter_name): The value of a captured URI path parameter.
  • $(query_params.query_parameter_name): The value of a query string parameter.
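For example, letting clients pick the model through a request header might look like the following sketch. The x-model-name header is a hypothetical name chosen for illustration:

```yaml
- route_type: llm/v1/chat
  model:
    provider: openai
    name: "$(headers.x-model-name)"    # hypothetical header; resolved per request
```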

You can combine these parameters with an OpenAI-compatible SDK in multiple ways using the AI Proxy and AI Proxy Advanced plugins, depending on your specific use case:

  • Select different models dynamically on one provider: Allow users to select the target model based on a request header or parameter. Supports flexible routing across different models on the same provider.
  • Use one chat route with dynamic Azure OpenAI deployments: Configure a dynamic route to target multiple Azure OpenAI model deployments.
  • Use unsupported models with OpenAI-compatible SDKs: Proxy models that aren’t officially supported, such as Whisper-2, through an OpenAI-compatible interface using preserve routing.