Optimize tool discovery
When vMCP aggregates many backend MCP servers, the total number of tools exposed
to clients can grow quickly. Each tool definition consumes tokens in the AI
model's context, leading to higher costs and slower, less accurate tool
selection. The optimizer addresses this by replacing all individual tool
definitions with two lightweight primitives: find_tool and call_tool.
For the desktop/CLI approach using the MCP Optimizer container, see the MCP Optimizer tutorial. This guide covers the Kubernetes operator implementation using the VirtualMCPServer and EmbeddingServer CRDs.
Overview
Benefits
- Reduced token usage: Only relevant tools are included in context, not the entire toolset
- Improved tool selection: Hybrid semantic and keyword search surfaces the best tools for each query
- Simplified clients: Clients see only two tools (find_tool and call_tool) regardless of how many backends exist
How it works
1. A client sends a prompt that requires tool assistance.
2. The AI calls find_tool with keywords extracted from the prompt.
3. vMCP performs hybrid semantic and keyword search across all backend tools.
4. Only the most relevant tools (up to 8 by default) are returned.
5. The AI calls call_tool to execute the selected tool, and vMCP routes the request to the appropriate backend.
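The flow above can be sketched in a few lines of Python. This is an illustrative model only, not the actual vMCP API: the tool registry, scoring, and routing logic here are hypothetical stand-ins, and real vMCP uses hybrid semantic and keyword search rather than the plain keyword matching shown.

```python
# Toy model of the two-primitive flow: the client only ever sees
# find_tool and call_tool; vMCP resolves the real backend tool.
# All names below are illustrative, not part of the vMCP API.

BACKEND_TOOLS = {
    "github.create_issue": "Create a GitHub issue in a repository",
    "slack.post_message": "Post a message to a Slack channel",
    "jira.create_ticket": "Create a Jira ticket in a project",
}

def find_tool(keywords: list[str], max_tools: int = 8) -> list[str]:
    """Return the most relevant backend tools for the given keywords.
    Real vMCP blends semantic and keyword search; this sketch counts
    keyword hits only."""
    scored = []
    for name, desc in BACKEND_TOOLS.items():
        hits = sum(kw.lower() in (name + " " + desc).lower() for kw in keywords)
        if hits:
            scored.append((hits, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:max_tools]]

def call_tool(name: str, arguments: dict) -> str:
    """Route the call to the backend that owns the tool."""
    if name not in BACKEND_TOOLS:
        raise KeyError(f"unknown tool: {name}")
    backend = name.split(".", 1)[0]
    return f"routed {name} to backend '{backend}' with {arguments}"

matches = find_tool(["github", "issue"])
print(matches)  # → ['github.create_issue']
print(call_tool(matches[0], {"title": "Bug report"}))
```

The key point is that the client's context never contains the full toolset: it holds only the two primitives, and relevant tool definitions arrive on demand through find_tool.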
Quick start
Step 1: Create an EmbeddingServer
Create an EmbeddingServer with default settings. This deploys a text embeddings
inference (TEI) server using the BAAI/bge-small-en-v1.5 model:
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
  name: my-embedding
  namespace: toolhive-system
spec: {}
```
Wait for the EmbeddingServer to reach the Running phase before proceeding. The
first startup may take a few minutes while the model downloads.
```bash
kubectl get embeddingserver my-embedding -n toolhive-system -w
```
Step 2: Add the embedding reference to VirtualMCPServer
Add embeddingServerRef to your existing VirtualMCPServer. This is the only
change needed to enable the optimizer. When you set embeddingServerRef, the
operator automatically enables the optimizer with sensible defaults. You only
need to add an explicit optimizer block if you want to
tune the parameters.
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
  namespace: toolhive-system
spec:
  embeddingServerRef:
    name: my-embedding
  config:
    groupRef: my-group
    incomingAuth:
      type: anonymous
```
Step 3: Verify
Check that the VirtualMCPServer is ready and clients now see only find_tool
and call_tool:
```bash
kubectl get virtualmcpserver my-vmcp -n toolhive-system
```
Clients connecting to the vMCP endpoint now see two tools instead of the full backend toolset.
EmbeddingServer resource
The EmbeddingServer CRD manages the lifecycle of a text embeddings inference
server. An empty spec: {} uses all defaults, which is sufficient for most
deployments. For the complete field reference, see the
EmbeddingServer CRD specification.
The default TEI image (ghcr.io/huggingface/text-embeddings-inference) is
amd64-only. If you are running on ARM64 (for example, Apple Silicon with Kind),
you must pre-load or build an ARM64-compatible image.
Tune the optimizer
To customize optimizer behavior, add the optimizer block under spec.config
in your VirtualMCPServer resource:
```yaml
spec:
  config:
    groupRef: my-group
    optimizer:
      embeddingServiceTimeout: 30s
      maxToolsToReturn: 8
      hybridSearchSemanticRatio: '0.5'
      semanticDistanceThreshold: '1.0'
```
Parameter reference
| Parameter | Description | Default |
|---|---|---|
| embeddingServiceTimeout | HTTP request timeout for calls to the embedding service | 30s |
| maxToolsToReturn | Maximum number of tools returned per search (1-50) | 8 |
| hybridSearchSemanticRatio | Balance between semantic and keyword search: 0.0 = all keyword, 1.0 = all semantic | "0.5" |
| semanticDistanceThreshold | Maximum semantic distance for results (0 = identical, 2 = completely unrelated); results beyond this threshold are filtered out | "1.0" |
hybridSearchSemanticRatio and semanticDistanceThreshold are string-encoded
floats (for example, "0.5" not 0.5). This is a Kubernetes CRD limitation, as
CRDs do not support float types portably.
- Lower semanticDistanceThreshold (for example, "0.6") for higher precision: only very close matches are returned
- Raise semanticDistanceThreshold (for example, "1.4") for higher recall: broader matches are included
- Increase maxToolsToReturn if the AI frequently cannot find the right tool; decrease it to save tokens
- Adjust hybridSearchSemanticRatio toward "1.0" if tool names are not descriptive, or toward "0.0" if exact keyword matching is more useful
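To make the interaction between these two knobs concrete, here is a minimal sketch of a plausible ranking step. This is an assumption about how such a ranker could work, not vMCP's actual scoring code: the distance-to-score mapping and the blend formula are illustrative.

```python
# Illustrative ranking sketch (NOT vMCP's actual implementation).
# Semantic distance is assumed to be cosine distance in [0, 2]:
# candidates past semanticDistanceThreshold are dropped first, then
# the survivors are scored as a blend controlled by the semantic ratio.

def rank(candidates, semantic_ratio=0.5, distance_threshold=1.0, max_tools=8):
    """candidates: list of (tool, semantic_distance, keyword_score in 0..1)."""
    kept = []
    for tool, distance, kw_score in candidates:
        if distance > distance_threshold:
            continue  # filtered by semanticDistanceThreshold
        semantic_score = 1.0 - distance / 2.0  # map distance 0..2 to score 1..0
        blended = semantic_ratio * semantic_score + (1 - semantic_ratio) * kw_score
        kept.append((blended, tool))
    kept.sort(reverse=True)
    return [tool for _, tool in kept[:max_tools]]  # capped by maxToolsToReturn

tools = [
    ("create_issue", 0.3, 0.9),   # close semantically, strong keyword match
    ("post_message", 1.4, 0.8),   # too far semantically: filtered out
    ("create_ticket", 0.9, 0.2),  # passable semantic match, weak keywords
]
print(rank(tools))  # → ['create_issue', 'create_ticket']
```

Note how post_message is excluded despite its strong keyword score: the distance threshold acts as a hard filter before the ratio-weighted blend is applied, which is why lowering the threshold trades recall for precision.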
Advanced example
A production-ready configuration with model caching and tuned optimizer parameters:
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
  name: prod-embedding
  namespace: toolhive-system
spec:
  replicas: 2
  resources:
    requests:
      cpu: '500m'
      memory: '512Mi'
    limits:
      cpu: '2'
      memory: '1Gi'
  modelCache:
    enabled: true
    storageSize: 5Gi
---
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: prod-vmcp
  namespace: toolhive-system
spec:
  embeddingServerRef:
    name: prod-embedding
  config:
    groupRef: prod-tools
    optimizer:
      embeddingServiceTimeout: 15s
      maxToolsToReturn: 10
      hybridSearchSemanticRatio: '0.6'
      semanticDistanceThreshold: '0.8'
    incomingAuth:
      type: oidc
      oidcConfig:
        type: inline
        inline:
          issuer: https://auth.example.com
          audience: vmcp-prod
```
Related information
- MCP Optimizer tutorial — desktop/CLI setup
- Optimizing LLM context — background on tool filtering and context pollution
- Configure vMCP servers
- EmbeddingServer CRD specification
- VirtualMCPServer CRD specification