Install with vLLM AIBrix
This guide provides step-by-step instructions for integrating vLLM Semantic Router with vLLM AIBrix.
About vLLM AIBrix
vLLM AIBrix is an open-source initiative designed to provide essential building blocks to construct scalable GenAI inference infrastructure. AIBrix delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.
Key Features
- High-Density LoRA Management: Streamlined support for lightweight, low-rank adaptations of models
- LLM Gateway and Routing: Efficiently manage and direct traffic across multiple models and replicas
- LLM App-Tailored Autoscaler: Dynamically scale inference resources based on real-time demand
- Unified AI Runtime: A versatile sidecar enabling metric standardization, model downloading, and management
- Distributed Inference: Scalable architecture to handle large workloads across multiple nodes
- Distributed KV Cache: Enables high-capacity, cross-engine KV reuse
- Cost-efficient Heterogeneous Serving: Enables mixed GPU inference to reduce costs with SLO guarantees
- GPU Hardware Failure Detection: Proactive detection of GPU hardware issues
Integration Benefits
Integrating vLLM Semantic Router with AIBrix provides several advantages:
- Intelligent Request Routing: Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while AIBrix's gateway efficiently manages traffic distribution across model replicas
- Enhanced Scalability: AIBrix's autoscaler works seamlessly with Semantic Router to dynamically adjust resources based on routing patterns and real-time demand
- Cost Optimization: By combining Semantic Router's intelligent routing with AIBrix's heterogeneous serving capabilities, you can optimize GPU utilization and reduce infrastructure costs while maintaining SLO guarantees
- Production-Ready Infrastructure: AIBrix provides enterprise-grade features like distributed KV cache, GPU failure detection, and unified runtime management, making it easier to deploy Semantic Router in production environments
- Simplified Operations: The integration leverages Kubernetes-native patterns and Gateway API resources, providing a familiar operational model for DevOps teams
Prerequisites
Before starting, ensure you have the following tools installed:
- kind - Kubernetes in Docker (Optional)
- kubectl - Kubernetes CLI
- Helm - Package manager for Kubernetes
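You can verify that each tool is installed and on your PATH before proceeding:
# Verify tool availability
kind version
kubectl version --client
helm version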
Step 1: Create Kind Cluster (Optional)
Create a local Kubernetes cluster optimized for the semantic router workload:
# Generate kind configuration
./tools/kind/generate-kind-config.sh
# Create cluster with optimized resource settings
kind create cluster --name semantic-router-cluster --config tools/kind/kind-config.yaml
# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
Note: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores) for running the semantic router and AI gateway components.
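If the node does not become Ready within the timeout, you can inspect the cluster state directly:
# Check node status and cluster endpoints
kubectl get nodes -o wide
kubectl cluster-info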
Step 2: Deploy vLLM Semantic Router
Deploy the semantic router service with all required components:
# Deploy semantic router using Kustomize
kubectl apply -k deploy/kubernetes/aibrix/semantic-router
# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
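If the deployment does not become Available within the timeout (slow model downloads are the usual cause), the pod events and logs are the first place to look:
# Inspect pod events and recent logs for the semantic router
kubectl describe pods -n vllm-semantic-router-system
kubectl logs deployment/semantic-router -n vllm-semantic-router-system --tail=50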
Step 3: Install vLLM AIBrix
Install the core vLLM AIBrix components:
# Install vLLM AIBrix
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-dependency-v0.4.1.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-core-v0.4.1.yaml
# wait for deployment to be ready
kubectl wait --timeout=2m -n aibrix-system deployment/aibrix-gateway-plugins --for=condition=Available
kubectl wait --timeout=2m -n aibrix-system deployment/aibrix-metadata-service --for=condition=Available
kubectl wait --timeout=2m -n aibrix-system deployment/aibrix-controller-manager --for=condition=Available
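You can confirm that all AIBrix control-plane components are running:
# List the AIBrix control-plane pods
kubectl get pods -n aibrix-system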
Step 4: Deploy Demo LLM
Create a demo LLM to serve as the backend for the semantic router:
# Deploy demo LLM
kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/base-model.yaml
kubectl wait --timeout=2m -n default deployment/vllm-llama3-8b-instruct --for=condition=Available
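The demo model image can take a while to pull; you can check the deployment and watch its pods come up:
# Check the demo LLM deployment and watch its pods
kubectl get deployment vllm-llama3-8b-instruct -n default
kubectl get pods -n default -w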
Step 5: Create Gateway API Resources
Create the necessary Gateway API resources for Envoy Gateway:
kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/gwapi-resources.yaml
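You can confirm the Gateway and routes were created and accepted:
# Verify the Gateway API resources
kubectl get gateway -n aibrix-system
kubectl get httproute -A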
Testing the Deployment
Method 1: Port Forwarding (Recommended for Local Testing)
Set up port forwarding to access the gateway locally:
# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
--selector=gateway.envoyproxy.io/owning-gateway-namespace=aibrix-system,gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
-o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
Send Test Requests
Once the gateway is accessible, test the inference endpoint:
# Test math domain chat completions endpoint
curl -i -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What is the derivative of f(x) = x^3?"}
]
}'
You should see a response from the demo LLM, along with the additional headers injected by the semantic router:
HTTP/1.1 200 OK
server: fasthttp
date: Thu, 06 Nov 2025 06:38:08 GMT
content-type: application/json
x-inference-pod: vllm-llama3-8b-instruct-984659dbb-gp5l9
x-went-into-req-headers: true
request-id: b46b6f7b-5645-470f-9868-0dd8b99a7163
x-vsr-selected-category: math
x-vsr-selected-reasoning: on
x-vsr-selected-model: vllm-llama3-8b-instruct
x-vsr-injected-system-prompt: true
transfer-encoding: chunked
{"id":"chatcmpl-f390a0c6-b38f-4a73-b019-9374a3c5d69b","created":1762411088,"model":"vllm-llama3-8b-instruct","usage":{"prompt_tokens":42,"completion_tokens":48,"total_tokens":90},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? To be or not to be that is the question. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Testing, testing 1,2,3"}}]}
Cleanup
To remove the entire deployment:
# Remove Gateway API resources and Demo LLM
kubectl delete -f deploy/kubernetes/aibrix/aigw-resources
# Remove semantic router
kubectl delete -k deploy/kubernetes/aibrix/semantic-router
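# Remove vLLM AIBrix components (installed in Step 3)
kubectl delete -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-core-v0.4.1.yaml
kubectl delete -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-dependency-v0.4.1.yaml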
# Delete kind cluster
kind delete cluster --name semantic-router-cluster
Next Steps
- Set up monitoring and observability
- Implement authentication and authorization
- Scale the semantic router deployment for production workloads