Install with vLLM AIBrix

This guide provides step-by-step instructions for deploying vLLM Semantic Router with vLLM AIBrix.

About vLLM AIBrix

vLLM AIBrix is an open-source initiative designed to provide essential building blocks to construct scalable GenAI inference infrastructure. AIBrix delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.

Key Features

  • High-Density LoRA Management: Streamlined support for lightweight, low-rank adaptations of models
  • LLM Gateway and Routing: Efficiently manage and direct traffic across multiple models and replicas
  • LLM App-Tailored Autoscaler: Dynamically scale inference resources based on real-time demand
  • Unified AI Runtime: A versatile sidecar enabling metric standardization, model downloading, and management
  • Distributed Inference: Scalable architecture to handle large workloads across multiple nodes
  • Distributed KV Cache: Enables high-capacity, cross-engine KV reuse
  • Cost-efficient Heterogeneous Serving: Enables mixed GPU inference to reduce costs with SLO guarantees
  • GPU Hardware Failure Detection: Proactive detection of GPU hardware issues

Integration Benefits

Integrating vLLM Semantic Router with AIBrix provides several advantages:

  1. Intelligent Request Routing: Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while AIBrix's gateway efficiently manages traffic distribution across model replicas

  2. Enhanced Scalability: AIBrix's autoscaler works seamlessly with Semantic Router to dynamically adjust resources based on routing patterns and real-time demand

  3. Cost Optimization: By combining Semantic Router's intelligent routing with AIBrix's heterogeneous serving capabilities, you can optimize GPU utilization and reduce infrastructure costs while maintaining SLO guarantees

  4. Production-Ready Infrastructure: AIBrix provides enterprise-grade features like distributed KV cache, GPU failure detection, and unified runtime management, making it easier to deploy Semantic Router in production environments

  5. Simplified Operations: The integration leverages Kubernetes-native patterns and Gateway API resources, providing a familiar operational model for DevOps teams

Prerequisites

Before starting, ensure you have the following tools installed:

  • kind - Kubernetes in Docker (optional; only needed if you are creating a local cluster)
  • kubectl - Kubernetes CLI
  • Helm - Package manager for Kubernetes
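
You can confirm each tool is installed and available on your PATH:

# Verify the prerequisite tools
kind version
kubectl version --client
helm version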

Step 1: Create Kind Cluster (Optional)

Create a local Kubernetes cluster optimized for the semantic router workload:

# Generate kind configuration
./tools/kind/generate-kind-config.sh

# Create cluster with optimized resource settings
kind create cluster --name semantic-router-cluster --config tools/kind/kind-config.yaml

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

Note: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores) for running the semantic router and AI gateway components.
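
If pods are later evicted or stuck in Pending, check the node's allocatable resources. This assumes the default single control-plane node, which kind names <cluster-name>-control-plane:

# Inspect allocatable CPU and memory on the kind node
kubectl describe node semantic-router-cluster-control-plane | grep -A 6 Allocatable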

Step 2: Deploy vLLM Semantic Router

Deploy the semantic router service with all required components:

# Deploy semantic router using Kustomize
kubectl apply -k deploy/kubernetes/aibrix/semantic-router

# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
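
If the deployment stays unavailable, the pods are usually still downloading models; you can follow progress in the logs:

# Follow the router logs while models download
kubectl logs -n vllm-semantic-router-system deployment/semantic-router -f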

Step 3: Install vLLM AIBrix

Install the core vLLM AIBrix components:

# Install vLLM AIBrix
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-dependency-v0.4.1.yaml

kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-core-v0.4.1.yaml

# Wait for the deployments to be ready
kubectl wait --timeout=2m -n aibrix-system deployment/aibrix-gateway-plugins --for=condition=Available
kubectl wait --timeout=2m -n aibrix-system deployment/aibrix-metadata-service --for=condition=Available
kubectl wait --timeout=2m -n aibrix-system deployment/aibrix-controller-manager --for=condition=Available
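
Verify that all AIBrix components are running before proceeding:

# Verify AIBrix components
kubectl get pods -n aibrix-system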

Step 4: Deploy Demo LLM

Create a demo LLM to serve as the backend for the semantic router:

# Deploy demo LLM
kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/base-model.yaml

kubectl wait --timeout=2m -n default deployment/vllm-llama3-8b-instruct --for=condition=Available
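
You can verify the backend is up and has loaded the model by checking the deployment and its logs (the deployment name comes from the manifest above):

# Verify the demo LLM backend
kubectl get deployment vllm-llama3-8b-instruct -n default
kubectl logs -n default deployment/vllm-llama3-8b-instruct --tail=20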

Step 5: Create Gateway API Resources

Create the required Gateway API resources for the Envoy gateway:

kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/gwapi-resources.yaml
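
You can confirm the resources were created and that the route was accepted by the aibrix-eg gateway:

# Verify the Gateway API resources
kubectl get gateway -n aibrix-system aibrix-eg
kubectl get httproute -A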

Testing the Deployment

Set up port forwarding to access the gateway locally:

# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
  --selector=gateway.envoyproxy.io/owning-gateway-namespace=aibrix-system,gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
  -o jsonpath='{.items[0].metadata.name}')

kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
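
Note that kubectl port-forward runs in the foreground, so issue the following from a second terminal. As a quick connectivity check, you can list the models exposed through the gateway; this assumes the gateway serves the standard OpenAI-compatible /v1/models route:

# Smoke test: list models available through the gateway
curl -s http://localhost:8080/v1/models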

Send Test Requests

Once the gateway is accessible, test the inference endpoint:

# Test the chat completions endpoint with a math-domain prompt
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
    ]
  }'

You should see a response from the demo LLM, along with additional headers injected by the semantic router:

HTTP/1.1 200 OK
server: fasthttp
date: Thu, 06 Nov 2025 06:38:08 GMT
content-type: application/json
x-inference-pod: vllm-llama3-8b-instruct-984659dbb-gp5l9
x-went-into-req-headers: true
request-id: b46b6f7b-5645-470f-9868-0dd8b99a7163
x-vsr-selected-category: math
x-vsr-selected-reasoning: on
x-vsr-selected-model: vllm-llama3-8b-instruct
x-vsr-injected-system-prompt: true
transfer-encoding: chunked

{"id":"chatcmpl-f390a0c6-b38f-4a73-b019-9374a3c5d69b","created":1762411088,"model":"vllm-llama3-8b-instruct","usage":{"prompt_tokens":42,"completion_tokens":48,"total_tokens":90},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? To be or not to be that is the question. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Testing, testing 1,2,3"}}]}

Cleanup

To remove the entire deployment:

# Remove Gateway API resources and Demo LLM
kubectl delete -f deploy/kubernetes/aibrix/aigw-resources

# Remove semantic router
kubectl delete -k deploy/kubernetes/aibrix/semantic-router
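
If you installed AIBrix in Step 3 and are not deleting the whole cluster, remove its components with the same manifests used for installation:

# Remove vLLM AIBrix core and dependencies
kubectl delete -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-core-v0.4.1.yaml
kubectl delete -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-dependency-v0.4.1.yaml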

# Delete kind cluster
kind delete cluster --name semantic-router-cluster

Next Steps

  • Set up monitoring and observability
  • Implement authentication and authorization
  • Scale the semantic router deployment for production workloads