# LlamaStack
LlamaStack is Meta's open framework for building AI applications with agents, RAG, tool use, and safety. RHOAI 3.3 includes a Llama Stack Operator as a DSC component (llamastackoperator) that installs the operator and its LlamaStackDistribution CRD. This use case deploys a specific LlamaStack instance on top of that operator.
## How It Works -- Two Layers

```mermaid
graph TD
    DSC["DataScienceCluster"] -->|"llamastackoperator: Managed"| LSO["Llama Stack Operator (installed by RHOAI)"]
    LSO -->|"provides CRD"| CRD["LlamaStackDistribution CRD"]
    CRD -->|"instance created by"| UC["usecases/services/llamastack/ manifests"]
    UC --> LSD["LlamaStackDistribution CR"]
    UC --> PG["PostgreSQL 16"]
    UC --> CM["ConfigMap (custom config)"]
```
| Layer | What | Who manages it | Path in this repo |
|---|---|---|---|
| Operator (DSC component) | Installs the Llama Stack Operator and `LlamaStackDistribution` CRD | RHOAI Operator via the DSC | `components/instances/rhoai-instance/` -- set `llamastackoperator: Managed` |
| Instance (use case) | Creates a `LlamaStackDistribution` CR, PostgreSQL database, and custom config | This repo's use case manifests | `usecases/services/llamastack/` |
The operator must be installed first (via the DSC) before the use case manifests can create an instance.
## What This Use Case Deploys
| Component | Resource | Description |
|---|---|---|
| `llamastack` | `LlamaStackDistribution` CR | Runs the LlamaStack server (agents, inference, safety, eval, vector I/O) using a custom patched image |
| `postgres` | Deployment + PVC + Service | PostgreSQL 16 for agent state, conversations, and metadata |
| `llamastack-custom-config` | ConfigMap | LlamaStack v2 config with vLLM inference providers, FAISS, sentence-transformers, and tool runtimes |
## Architecture

```mermaid
graph LR
    Client["Client"] --> LS["LlamaStack Server"]
    LS --> vLLM["vLLM (gpt-oss-120b)"]
    LS --> Embed["Sentence Transformers"]
    LS --> FAISS["FAISS Vector Store"]
    LS --> PG["PostgreSQL"]
```
LlamaStack connects to GPT-OSS-120B for inference. By default, the config points to a remote vLLM endpoint; when the local model is running, you can switch to the in-cluster service endpoint. Embeddings use local sentence-transformers (nomic-embed-text-v1.5) and vector storage uses FAISS.
## Prerequisites

### 1. RHOAI Platform with LlamaStack Operator Enabled

The `llamastackoperator` DSC component must be set to `Managed`. This is included in the `full` and `dev` DSC overlays. If using a custom overlay, add:
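A fragment along these lines enables the component in a DSC patch (field names follow the DataScienceCluster API as of RHOAI 3.x; verify the `apiVersion` and DSC name against your cluster):

```yaml
# Sketch of a DataScienceCluster patch enabling the Llama Stack Operator.
# The metadata.name below is illustrative -- match your DSC instance.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    llamastackoperator:
      managementState: Managed
```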
### 2. Official Dependencies (per RHOAI 3.3 Installation Guide)

**Required before enabling `llamastackoperator` in the DSC.** The official RHOAI 3.3 documentation (Section 3.1.2) lists these requirements:
- Red Hat OpenShift Service Mesh Operator 3.x
- cert-manager Operator
- GPU-enabled nodes -- NFD Operator + NVIDIA GPU Operator installed, GPU worker nodes available
- S3-compatible object storage -- for model artifacts and data persistence
### 3. Secrets

**Secret placeholders are in Git; real values are patched on the cluster.** This use case includes two Secret manifests with `CHANGE_ME` placeholder values. ArgoCD's `ignoreDifferences` for Secret `data`/`stringData` prevents it from overwriting manually patched values on the cluster. After the first GitOps sync, patch the secrets with real values:
`postgres-secret` (key: `password`):

```shell
oc patch secret postgres-secret -n llamastack \
  -p '{"stringData":{"password":"<your-password>"}}'
oc rollout restart deployment/postgres -n llamastack
```
`llama-stack-secret` (keys: `INFERENCE_MODEL`, `VLLM_URL`, `VLLM_TLS_VERIFY`, `VLLM_API_TOKEN`, `VLLM_MAX_TOKENS`):
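The same patch pattern as for `postgres-secret` applies; the values below are illustrative placeholders, not defaults:

```shell
# Patch all llama-stack-secret keys in one call -- replace every value
# with the settings for your inference endpoint.
oc patch secret llama-stack-secret -n llamastack \
  -p '{"stringData":{
    "INFERENCE_MODEL":"gpt-oss-120b",
    "VLLM_URL":"https://<your-vllm-endpoint>/v1",
    "VLLM_TLS_VERIFY":"true",
    "VLLM_API_TOKEN":"<your-token>",
    "VLLM_MAX_TOKENS":"4096"
  }}'
oc rollout restart deployment/llamastack -n llamastack
```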
### 4. Inference Backend

**Default: remote GPT-OSS-120B endpoint.** The default config points both vLLM providers to a remote GPT-OSS-120B endpoint. To use a local model instead, update `VLLM_URL` in `llama-stack-secret` and the `vllm-orchestrator` `base_url` in `llamastack-custom-config.yaml` to point to your in-cluster InferenceService:
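For the orchestrator provider, the corresponding edit in `llamastack-custom-config.yaml` looks like this (the in-cluster hostname matches the predictor service this repo deploys; adjust namespace and port to your cluster):

```yaml
# Excerpt from llamastack-custom-config.yaml -- point the orchestrator
# at the local InferenceService instead of the remote endpoint.
- provider_id: vllm-orchestrator
  provider_type: remote::vllm
  config:
    base_url: http://gpt-oss-120b-predictor.gpt-oss-120b.svc.cluster.local:8080/v1
```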
## Inference Providers
The config defines two vLLM providers for different roles:
| Provider | Role | URL Source | Purpose |
|---|---|---|---|
| `vllm-inference` | Primary inference | `${env.VLLM_URL}` (from `llama-stack-secret`) | Main model for chat completions |
| `vllm-orchestrator` | Orchestration / agents | Hard-coded in ConfigMap | Model for agent orchestration and tool routing |
Both can point to the same model endpoint. Keeping them separate allows swapping one independently (e.g., using a smaller model for orchestration while keeping a large model for inference).
## Deploy
LlamaStack is auto-deployed by the cluster-services ApplicationSet when using the tier1-minimal profile.
After bootstrapping the cluster, the service-llamastack Application is created automatically. After the first sync, patch the secrets with real values (see Prerequisites above).
```shell
# 1. Deploy GPT-OSS-120B model (or use a remote endpoint)
oc apply -k usecases/models/gpt-oss-120b/profiles/tier1-minimal/
oc wait --for=condition=Ready inferenceservice/gpt-oss-120b \
  -n gpt-oss-120b --timeout=3600s

# 2. Ensure the LlamaStack Operator is installed (DSC component)
oc apply -k components/instances/rhoai-instance/overlays/full/

# 3. Deploy the LlamaStack instance (creates namespace, secrets, config, CR)
oc apply -k usecases/services/llamastack/profiles/tier1-minimal/

# 4. Patch secrets with real values
oc patch secret postgres-secret -n llamastack \
  -p '{"stringData":{"password":"<your-password>"}}'
oc patch secret llama-stack-secret -n llamastack \
  -p '{"stringData":{"VLLM_URL":"http://gpt-oss-120b-predictor.gpt-oss-120b.svc.cluster.local:8080/v1"}}'
oc rollout restart deployment/postgres deployment/llamastack -n llamastack
```
## Verify

```shell
# Check the LlamaStack Operator is running (installed by DSC)
oc get pods -n redhat-ods-applications -l app.kubernetes.io/name=llama-stack-operator

# Check PostgreSQL is running
oc get pods -n llamastack -l app=postgres

# Check LlamaStack distribution is ready
oc get llamastackdistribution -n llamastack

# Check the route
oc get route -n llamastack

# Test inference
curl -sk https://$(oc get route llamastack -n llamastack -o jsonpath='{.spec.host}')/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"vllm-orchestrator/gpt-oss-120b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
```
## Sync Wave Ordering

| Wave | Resources | Purpose |
|---|---|---|
| -1 (default) | Namespace, PostgreSQL (Deployment, PVC, Service), Secrets | Infrastructure and database ready first |
| 0 | ConfigMap (`llamastack-custom-config`) | Configuration available before the server starts |
| 1 | `LlamaStackDistribution` CR | Server starts after database and config are ready |
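The ordering above uses the standard ArgoCD sync-wave annotation on each manifest; for example, the `LlamaStackDistribution` CR would carry:

```yaml
# ArgoCD syncs lower waves first, so wave 1 lands after the
# database (wave -1) and ConfigMap (wave 0) are applied.
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```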
## Capabilities
The LlamaStack server exposes these APIs:
| API | Provider | Description |
|---|---|---|
| Inference | `remote::vllm` (x2) | Proxies to vLLM-served models (inference + orchestrator) |
| Agents | `inline::meta-reference` | Stateful agent conversations with tool use |
| Vector I/O | `inline::faiss` | In-memory vector storage for RAG |
| Safety | `inline::llama-guard` | Content safety filtering (requires `SAFETY_MODEL` env var) |
| Eval | `inline::meta-reference` | Model evaluation benchmarks |
| Tool Runtime | `remote::tavily-search`, `remote::brave-search`, `inline::rag-runtime`, `remote::model-context-protocol` | Web search, RAG, and MCP tools |
| Embeddings | `inline::sentence-transformers` | Local embedding model (nomic-embed-text-v1.5) |
## Customization

### Changing the Inference Backend

Edit `llama-stack-secret` to change the `VLLM_URL` for the primary inference provider, or edit the `llamastack-custom-config` ConfigMap to change the `vllm-orchestrator` base URL:
```yaml
providers:
  inference:
    - provider_id: vllm-inference
      provider_type: remote::vllm
      config:
        base_url: ${env.VLLM_URL} # from llama-stack-secret
    - provider_id: vllm-orchestrator
      provider_type: remote::vllm
      config:
        base_url: <hard-coded-endpoint> # edit in ConfigMap
```
### Enabling Safety (Llama Guard)
To enable content safety filtering, deploy a Llama Guard model and set the SAFETY_MODEL environment variable in the LlamaStackDistribution CR. Without this, the safety provider is registered but non-functional.
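A sketch of the env addition on the CR -- the `spec.server.containerSpec.env` path follows the llama-stack-k8s-operator API, and the model name is illustrative; verify both against your operator version:

```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack
  namespace: llamastack
spec:
  server:
    containerSpec:
      env:
        - name: SAFETY_MODEL
          value: llama-guard-3-8b   # illustrative -- the model id your Llama Guard deployment serves
```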
### Gemini Provider
The Gemini inference provider is included in the default config but activates conditionally -- only when `GEMINI_API_KEY` is set (via `gemini-secret`). After deploying, patch the secret with your real API key:

```shell
oc patch secret gemini-secret -n llamastack \
  -p '{"stringData":{"api_key":"<your-gemini-api-key>"}}'
oc rollout restart deployment/llamastack -n llamastack
```
Gemini models (up to 1M token context) are auto-discovered and appear in /v1/models once the provider is active.
### Adding OpenAI Provider
To add OpenAI as an inference provider, add the provider block to llamastack-custom-config.yaml, create an openai-secret, and add the OPENAI_API_KEY env var to the Distribution CR.
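A sketch of the provider block, mirroring the vLLM providers above -- `remote::openai` is an upstream Llama Stack provider type, but verify it is available in this distribution's image:

```yaml
providers:
  inference:
    - provider_id: openai
      provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}   # supplied via the openai-secret you create
```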