Install with Envoy AI Gateway
This guide provides step-by-step instructions for integrating the vLLM Semantic Router with Envoy AI Gateway on Kubernetes, adding advanced traffic management and AI-specific gateway features.
Architecture Overview
The deployment consists of:
- vLLM Semantic Router: Provides intelligent request routing and semantic understanding
- Envoy Gateway: Core gateway functionality and traffic management
- Envoy AI Gateway: AI-specific extensions built on Envoy Gateway for routing to cloud and self-hosted LLM providers
Benefits of Integration
Integrating vLLM Semantic Router with Envoy AI Gateway provides enterprise-grade capabilities for production LLM deployments:
1. Hybrid Model Selection
Seamlessly route requests between cloud LLM providers (OpenAI, Anthropic, etc.) and self-hosted models.
2. Token Rate Limiting
Protect your infrastructure and control costs with fine-grained rate limiting:
- Input token limits: Control request size to prevent abuse
- Output token limits: Manage response generation costs
- Total token limits: Set overall usage quotas per user/tenant
- Time-based windows: Configure limits per second, minute, or hour
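As a rough illustration, the pattern documented for Envoy AI Gateway is to declare token costs on an AIGatewayRoute and enforce them with Envoy Gateway's global rate limit. The trimmed sketch below is not taken from this repository; field names such as llmRequestCosts and the io.envoy.ai_gateway metadata namespace are assumptions to verify against the CRD versions you install.
# Hedged sketch: record token usage for the route (names illustrative, other required fields omitted)
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
  namespace: default
spec:
  llmRequestCosts:
    - metadataKey: llm_total_token   # InputToken and OutputToken cost types also exist
      type: TotalToken
---
# Hedged sketch: enforce a per-user token budget via Envoy Gateway's global rate limit
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: token-limit
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: semantic-router
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id    # assumed client-identity header
                  type: Distinct
          limit:
            requests: 100000         # interpreted here as a token budget per window
            unit: Hour
          cost:
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token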
3. Model/Provider Failover
Ensure high availability with automatic failover mechanisms:
- Detect unhealthy backends and route traffic to healthy instances
- Support for active-passive and active-active failover strategies
- Graceful degradation when primary models are unavailable
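At the traffic layer, much of this can be expressed with Envoy Gateway's BackendTrafficPolicy, which supports retries and passive health checking (outlier ejection). The sketch below is illustrative rather than taken from this deployment; verify the retry and healthCheck fields against the Envoy Gateway version you install.
# Hedged sketch: retry failed requests and stop routing to backends that keep failing
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-failover
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: semantic-router
  retry:
    numRetries: 3
    retryOn:
      httpStatusCodes: [500, 502, 503]
      triggers: [connect-failure, retriable-status-codes]
  healthCheck:
    passive:                        # outlier detection: eject backends returning repeated 5xx
      consecutive5XxErrors: 5
      baseEjectionTime: 30s
      maxEjectionPercent: 100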
4. Traffic Splitting & Canary Testing
Deploy new models safely with progressive rollout capabilities:
- A/B Testing: Split traffic between model versions to compare performance
- Canary Deployments: Gradually shift traffic to new models (e.g., 5% → 25% → 50% → 100%)
- Shadow Traffic: Send duplicate requests to new models without affecting production
- Weight-based routing: Fine-tune traffic distribution across model variants
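With Envoy AI Gateway, a weighted split can be expressed directly on an AIGatewayRoute rule, matching on the model name the gateway extracts from the request body. The example below is a trimmed, hedged sketch: the x-ai-eg-model header, the parentRefs attachment (older releases used targetRefs), and the backend names are assumptions to check against your installed CRDs and the resources in gwapi-resources.yaml.
# Hedged sketch: 90/10 canary split between two backends serving the same model name
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: canary-route
  namespace: default
spec:
  parentRefs:
    - name: semantic-router          # the Gateway created in Step 6
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model    # header derived from the request's "model" field
              value: my-model
      backendRefs:
        - name: my-model-stable      # AIServiceBackend names are illustrative
          weight: 90
        - name: my-model-canary
          weight: 10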
5. LLM Observability & Monitoring
Gain deep insights into your LLM infrastructure:
- Request/Response Metrics: Track latency, throughput, token usage, and error rates
- Model Performance: Monitor accuracy, quality scores, and user satisfaction
- Cost Analytics: Analyze spending patterns across models and providers
- Distributed Tracing: End-to-end visibility with OpenTelemetry integration
- Custom Dashboards: Visualize metrics in Prometheus, Grafana, or your preferred monitoring stack
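If you run the Prometheus Operator, scraping the semantic router is a one-resource job. The sketch below assumes the router's Service carries an app: semantic-router label and exposes a port named metrics; check the actual Service in the vllm-semantic-router-system namespace before applying it.
# Hedged sketch: scrape semantic router metrics with the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: semantic-router
  namespace: vllm-semantic-router-system
spec:
  selector:
    matchLabels:
      app: semantic-router          # assumed Service label
  endpoints:
    - port: metrics                 # assumed metrics port name
      interval: 30s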
Supported LLM Providers
| Provider Name | API Schema Config on AIServiceBackend | Upstream Authentication Config on BackendSecurityPolicy | Status |
|---|---|---|---|
| OpenAI | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| AWS Bedrock | {"name":"AWSBedrock"} | AWS Bedrock Credentials | ✅ |
| Azure OpenAI | {"name":"AzureOpenAI","version":"2025-01-01-preview"} or {"name":"OpenAI", "version": "openai/v1"} | Azure Credentials or Azure API Key | ✅ |
| Google Gemini on AI Studio | {"name":"OpenAI","version":"v1beta/openai"} | API Key | ✅ |
| Google Vertex AI | {"name":"GCPVertexAI"} | GCP Credentials | ✅ |
| Anthropic on GCP Vertex AI | {"name":"GCPAnthropic", "version":"vertex-2023-10-16"} | GCP Credentials | ✅ |
| Groq | {"name":"OpenAI","version":"openai/v1"} | API Key | ✅ |
| Grok | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Together AI | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Cohere | {"name":"Cohere","version":"v2"} or {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Mistral | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| DeepInfra | {"name":"OpenAI","version":"v1/openai"} | API Key | ✅ |
| DeepSeek | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Hunyuan | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Tencent LLM Knowledge Engine | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Tetrate Agent Router Service (TARS) | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| SambaNova | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Anthropic | {"name":"Anthropic"} | Anthropic API Key | ✅ |
| Self-hosted-models | {"name":"OpenAI","version":"v1"} | N/A | ✅ |
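As a concrete illustration, the OpenAI row above corresponds roughly to an AIServiceBackend plus a BackendSecurityPolicy like the pair below. This is a hedged sketch: the Backend reference and the way the policy attaches to the backend differ between Envoy AI Gateway releases, so treat the resource names and field layout as assumptions to verify.
# Hedged sketch: an OpenAI-schema backend authenticated with an API key
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: openai
  namespace: default
spec:
  schema:
    name: OpenAI                  # "API Schema Config" column above
    version: v1
  backendRef:
    name: openai                  # an Envoy Gateway Backend pointing at api.openai.com
    kind: Backend
    group: gateway.envoyproxy.io
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: openai-apikey
  namespace: default
spec:
  type: APIKey
  apiKey:
    secretRef:
      name: openai-apikey         # Kubernetes Secret holding the provider API key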
Prerequisites
Before starting, ensure you have the following tools installed:
- kind - Kubernetes in Docker (Optional)
- kubectl - Kubernetes CLI
- Helm - Package manager for Kubernetes
Step 1: Create Kind Cluster (Optional)
Create a local Kubernetes cluster optimized for the semantic router workload:
# Generate kind configuration
./tools/kind/generate-kind-config.sh
# Create cluster with optimized resource settings
kind create cluster --name semantic-router-cluster --config tools/kind/kind-config.yaml
# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
Note: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores) for running the semantic router and AI gateway components.
Step 2: Deploy vLLM Semantic Router
Deploy the semantic router service with all required components:
# Deploy semantic router using Kustomize
kubectl apply -k deploy/kubernetes/ai-gateway/semantic-router
# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
Step 3: Install Envoy Gateway
Install the core Envoy Gateway for traffic management:
# Install Envoy Gateway using Helm
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
--version v0.0.0-latest \
--namespace envoy-gateway-system \
--create-namespace \
-f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-values.yaml
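# Wait for Envoy Gateway to become available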
kubectl wait --timeout=2m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
Step 4: Install Envoy AI Gateway
Install the AI-specific extensions for inference workloads:
# Install Envoy AI Gateway CRDs first, so the controller finds them at startup
helm upgrade -i aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm --version v0.0.0-latest --namespace envoy-ai-gateway-system --create-namespace
# Install the Envoy AI Gateway controller using Helm
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--create-namespace
# Wait for AI Gateway Controller to be ready
kubectl wait --timeout=300s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
Step 5: Deploy Demo LLM
Create a demo LLM to serve as the backend for the semantic router:
# Deploy demo LLM
kubectl apply -f deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml
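The manifest above ships with the repository. The sketch below only illustrates the general shape of such a backend (an OpenAI-compatible model server behind a Service); it is not the actual contents of base-model.yaml, whose image, model, and ports may differ.
# Illustrative only -- not the contents of base-model.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-llm
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-llm
  template:
    metadata:
      labels:
        app: demo-llm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest              # assumed image; any OpenAI-compatible server works
          args: ["--model", "Qwen/Qwen2.5-0.5B-Instruct", "--port", "8000"]
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: demo-llm
  namespace: default
spec:
  selector:
    app: demo-llm
  ports:
    - port: 8000
      targetPort: 8000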
Step 6: Create Gateway API Resources
Create the necessary Gateway API resources for the AI gateway:
kubectl apply -f deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml
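Among other things, this file defines the Gateway that later steps reference by name. A trimmed sketch of that portion is shown below; the GatewayClass name is an assumption, and the actual file also contains the AI gateway route and backend resources.
# Hedged sketch of the Gateway referenced as "semantic-router" in later steps
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: semantic-router
  namespace: default
spec:
  gatewayClassName: semantic-router   # assumed GatewayClass name; check gwapi-resources.yaml
  listeners:
    - name: http
      protocol: HTTP
      port: 80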
Testing the Deployment
Method 1: Port Forwarding (Recommended for Local Testing)
Set up port forwarding to access the gateway locally:
# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
--selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=semantic-router \
-o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
Send Test Requests
Once the gateway is accessible, test the inference endpoint:
# Test math domain chat completions endpoint
curl -i -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What is the derivative of f(x) = x^3?"}
]
}'
Troubleshooting
Common Issues
Gateway not accessible:
# Check gateway status
kubectl get gateway semantic-router -n default
# Check Envoy service
kubectl get svc -n envoy-gateway-system
AI Gateway controller not ready:
# Check AI gateway controller logs
kubectl logs -n envoy-ai-gateway-system deployment/ai-gateway-controller
# Check controller status
kubectl get deployment -n envoy-ai-gateway-system
Semantic router not responding:
# Check semantic router pod status
kubectl get pods -n vllm-semantic-router-system
# Check semantic router logs
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
Cleanup
To remove the entire deployment:
# Remove Gateway API resources and Demo LLM
kubectl delete -f deploy/kubernetes/ai-gateway/aigw-resources
# Remove semantic router
kubectl delete -k deploy/kubernetes/ai-gateway/semantic-router
# Remove AI gateway controller and CRDs
helm uninstall aieg -n envoy-ai-gateway-system
helm uninstall aieg-crd -n envoy-ai-gateway-system
# Remove Envoy gateway
helm uninstall eg -n envoy-gateway-system
# Delete kind cluster
kind delete cluster --name semantic-router-cluster
Next Steps
- Configure custom routing rules in the AI Gateway
- Set up monitoring and observability
- Implement authentication and authorization
- Scale the semantic router deployment for production workloads