# LlamaStack
LlamaStack is Meta's open framework for building AI applications with agents, RAG, tool use, and safety. RHOAI 3.3 includes a Llama Stack Operator as a DSC component (llamastackoperator) that installs the operator and its LlamaStackDistribution CRD. This use case deploys a specific LlamaStack instance on top of that operator.
## How It Works -- Two Layers

```mermaid
graph TD
    DSC["DataScienceCluster"] -->|"llamastackoperator: Managed"| LSO["Llama Stack Operator (installed by RHOAI)"]
    LSO -->|"provides CRD"| CRD["LlamaStackDistribution CRD"]
    CRD -->|"instance created by"| UC["usecases/services/llamastack/ manifests"]
    UC --> LSD["LlamaStackDistribution CR"]
    UC --> PG["PostgreSQL 16"]
    UC --> CM["ConfigMap (custom config)"]
```
| Layer | What | Who manages it | Path in this repo |
|---|---|---|---|
| Operator (DSC component) | Installs the Llama Stack Operator and `LlamaStackDistribution` CRD | RHOAI Operator via the DSC | `components/instances/rhoai-instance/` -- set `llamastackoperator: Managed` |
| Instance (use case) | Creates a `LlamaStackDistribution` CR, PostgreSQL database, and custom config | This repo's use case manifests | `usecases/services/llamastack/` |
The operator must be installed first (via the DSC) before the use case manifests can create an instance.
## What This Use Case Deploys
| Component | Resource | Description |
|---|---|---|
| `llamastack` | `LlamaStackDistribution` CR | Runs the LlamaStack server (agents, inference, safety, eval, vector I/O) using a custom patched image |
| `postgres` | Deployment + PVC + Service | PostgreSQL 16 for agent state, conversations, and metadata |
| `llamastack-custom-config` | ConfigMap | LlamaStack v2 config with vLLM inference providers, FAISS, sentence-transformers, and tool runtimes |
## Architecture

```mermaid
graph LR
    Client["Client"] --> LS["LlamaStack Server"]
    LS --> vLLM["vLLM (gpt-oss-120b)"]
    LS --> Embed["Sentence Transformers"]
    LS --> FAISS["FAISS Vector Store"]
    LS --> PG["PostgreSQL"]
```
LlamaStack connects to GPT-OSS-120B for inference. By default, the config points to a remote vLLM endpoint; when the local model is running, you can switch to the in-cluster service endpoint. Embeddings use local sentence-transformers (nomic-embed-text-v1.5) and vector storage uses FAISS.
## Prerequisites

### 1. RHOAI Platform with LlamaStack Operator Enabled

The `llamastackoperator` DSC component must be set to `Managed`. This is included in the `full` and `dev` DSC overlays. If using a custom overlay, add:
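A fragment along these lines enables the component in a DSC patch (field names follow the DataScienceCluster API as of RHOAI 3.x; verify the `apiVersion` and DSC name against your cluster):

```yaml
# Sketch of a DataScienceCluster patch enabling the Llama Stack Operator.
# The metadata.name below is illustrative -- match your DSC instance.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    llamastackoperator:
      managementState: Managed
```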
### 2. Official Dependencies (per RHOAI 3.3 Installation Guide)

**Required before enabling `llamastackoperator` in the DSC.** The official RHOAI 3.3 documentation (Section 3.1.2) lists these requirements:
- Red Hat OpenShift Service Mesh Operator 3.x
- cert-manager Operator
- GPU-enabled nodes -- NFD Operator + NVIDIA GPU Operator installed, GPU worker nodes available
- S3-compatible object storage -- for model artifacts and data persistence
### 3. Secrets

**Secret placeholders are in Git; real values are patched on the cluster.** This use case includes two Secret manifests with `CHANGE_ME` placeholder values. ArgoCD's `ignoreDifferences` for Secret `data`/`stringData` prevents it from overwriting manually patched values on the cluster. After the first GitOps sync, patch the secrets with real values:
`postgres-secret` (key: `password`):

```shell
oc patch secret postgres-secret -n llamastack \
  -p '{"stringData":{"password":"<your-password>"}}'
oc rollout restart deployment/postgres -n llamastack
```
`llama-stack-secret` (keys: `INFERENCE_MODEL`, `VLLM_URL`, `VLLM_TLS_VERIFY`, `VLLM_API_TOKEN`, `VLLM_MAX_TOKENS`):
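The same patch pattern as for `postgres-secret` applies; the values below are illustrative placeholders, not defaults:

```shell
# Patch all llama-stack-secret keys in one call -- replace every value
# with the settings for your inference endpoint.
oc patch secret llama-stack-secret -n llamastack \
  -p '{"stringData":{
    "INFERENCE_MODEL":"gpt-oss-120b",
    "VLLM_URL":"https://<your-vllm-endpoint>/v1",
    "VLLM_TLS_VERIFY":"true",
    "VLLM_API_TOKEN":"<your-token>",
    "VLLM_MAX_TOKENS":"4096"
  }}'
oc rollout restart deployment/llamastack -n llamastack
```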
### 4. Inference Backend

**Default: remote GPT-OSS-120B endpoint.** The default config points both vLLM providers to a remote GPT-OSS-120B endpoint. To use a local model instead, update `VLLM_URL` in `llama-stack-secret` and the `vllm-orchestrator` `base_url` in `llamastack-custom-config.yaml` to point to your in-cluster InferenceService:
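For the orchestrator provider, the corresponding edit in `llamastack-custom-config.yaml` looks like this (the in-cluster hostname matches the predictor service this repo deploys; adjust namespace and port to your cluster):

```yaml
# Excerpt from llamastack-custom-config.yaml -- point the orchestrator
# at the local InferenceService instead of the remote endpoint.
- provider_id: vllm-orchestrator
  provider_type: remote::vllm
  config:
    base_url: http://gpt-oss-120b-predictor.gpt-oss-120b.svc.cluster.local:8080/v1
```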
## Inference Providers
The config defines two vLLM providers for different roles:
| Provider | Role | URL Source | Purpose |
|---|---|---|---|
| `vllm-inference` | Primary inference | `${env.VLLM_URL}` (from `llama-stack-secret`) | Main model for chat completions |
| `vllm-orchestrator` | Orchestration / agents | Hard-coded in ConfigMap | Model for agent orchestration and tool routing |
Both can point to the same model endpoint. Keeping them separate allows swapping one independently (e.g., using a smaller model for orchestration while keeping a large model for inference).
## Deploy
LlamaStack is auto-deployed by the cluster-services ApplicationSet when using the tier1-minimal profile.
After bootstrapping the cluster, the service-llamastack Application is created automatically. After the first sync, patch the secrets with real values (see Prerequisites above).
```shell
# 1. Deploy GPT-OSS-120B model (or use a remote endpoint)
oc apply -k usecases/models/gpt-oss-120b/profiles/tier1-minimal/
oc wait --for=condition=Ready inferenceservice/gpt-oss-120b \
  -n gpt-oss-120b --timeout=3600s

# 2. Ensure the LlamaStack Operator is installed (DSC component)
oc apply -k components/instances/rhoai-instance/overlays/full/

# 3. Deploy the LlamaStack instance (creates namespace, secrets, config, CR)
oc apply -k usecases/services/llamastack/profiles/tier1-minimal/

# 4. Patch secrets with real values
oc patch secret postgres-secret -n llamastack \
  -p '{"stringData":{"password":"<your-password>"}}'
oc patch secret llama-stack-secret -n llamastack \
  -p '{"stringData":{"VLLM_URL":"http://gpt-oss-120b-predictor.gpt-oss-120b.svc.cluster.local:8080/v1"}}'
oc rollout restart deployment/postgres deployment/llamastack -n llamastack
```
## Verify

```shell
# Check the LlamaStack Operator is running (installed by DSC)
oc get pods -n redhat-ods-applications -l app.kubernetes.io/name=llama-stack-operator

# Check PostgreSQL is running
oc get pods -n llamastack -l app=postgres

# Check LlamaStack distribution is ready
oc get llamastackdistribution -n llamastack

# Check the route
oc get route -n llamastack

# Test inference
curl -sk https://$(oc get route llamastack -n llamastack -o jsonpath='{.spec.host}')/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"vllm-orchestrator/gpt-oss-120b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
```
## Sync Wave Ordering

| Wave | Resources | Purpose |
|---|---|---|
| -1 (default) | Namespace, PostgreSQL (Deployment, PVC, Service), Secrets | Infrastructure and database ready first |
| 0 | ConfigMap (`llamastack-custom-config`) | Configuration available before the server starts |
| 1 | `LlamaStackDistribution` CR | Server starts after database and config are ready |
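The ordering above uses the standard ArgoCD sync-wave annotation on each manifest; for example, the `LlamaStackDistribution` CR would carry:

```yaml
# ArgoCD syncs lower waves first, so wave 1 lands after the
# database (wave -1) and ConfigMap (wave 0) are applied.
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```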
## Capabilities
The LlamaStack server exposes these APIs:
| API | Provider | Description |
|---|---|---|
| Inference | `remote::vllm` (x2) | Proxies to vLLM-served models (inference + orchestrator) |
| Agents | `inline::meta-reference` | Stateful agent conversations with tool use |
| Vector I/O | `inline::faiss` | In-memory vector storage for RAG |
| Safety | `inline::llama-guard` | Content safety filtering (requires `SAFETY_MODEL` env var) |
| Eval | `inline::meta-reference` | Model evaluation benchmarks |
| Tool Runtime | `remote::tavily-search`, `remote::brave-search`, `inline::rag-runtime`, `remote::model-context-protocol` | Web search, RAG, and MCP tools |
| Embeddings | `inline::sentence-transformers` | Local embedding model (nomic-embed-text-v1.5) |
## Customization

### Changing the Inference Backend

Edit `llama-stack-secret` to change the `VLLM_URL` for the primary inference provider, or edit the `llamastack-custom-config` ConfigMap to change the `vllm-orchestrator` base URL:
```yaml
providers:
  inference:
    - provider_id: vllm-inference
      provider_type: remote::vllm
      config:
        base_url: ${env.VLLM_URL} # from llama-stack-secret
    - provider_id: vllm-orchestrator
      provider_type: remote::vllm
      config:
        base_url: <hard-coded-endpoint> # edit in ConfigMap
```
### Enabling Safety (Llama Guard)
To enable content safety filtering, deploy a Llama Guard model and set the SAFETY_MODEL environment variable in the LlamaStackDistribution CR. Without this, the safety provider is registered but non-functional.
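A sketch of the env addition on the CR -- the `spec.server.containerSpec.env` path follows the llama-stack-k8s-operator API, and the model name is illustrative; verify both against your operator version:

```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack
  namespace: llamastack
spec:
  server:
    containerSpec:
      env:
        - name: SAFETY_MODEL
          value: llama-guard-3-8b   # illustrative -- the model id your Llama Guard deployment serves
```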
### Gemini Provider
The Gemini inference provider is included in the default config but activates conditionally -- only when `GEMINI_API_KEY` is set (via `gemini-secret`). After deploying, patch the secret with your real API key:

```shell
oc patch secret gemini-secret -n llamastack \
  -p '{"stringData":{"api_key":"<your-gemini-api-key>"}}'
oc rollout restart deployment/llamastack -n llamastack
```
Gemini models (up to 1M token context) are auto-discovered and appear in /v1/models once the provider is active.
### Adding OpenAI Provider
To add OpenAI as an inference provider, add the provider block to llamastack-custom-config.yaml, create an openai-secret, and add the OPENAI_API_KEY env var to the Distribution CR.
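A sketch of the provider block, mirroring the vLLM providers above -- `remote::openai` is an upstream Llama Stack provider type, but verify it is available in this distribution's image:

```yaml
providers:
  inference:
    - provider_id: openai
      provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}   # supplied via the openai-secret you create
```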