Adding open-source models via API
This guide explains how to register and deploy an open-source model (such as one from Hugging Face) in the Rational AI Control Room through the Management API. The process consists of two steps:
- Register the model: creates a catalog record pointing at a Hugging Face repo via huggingFaceId.
- Deploy the model: spins up the serving pod (consumes GPU) and exposes an in-cluster service URL.
POST /management/v0/models/register -> returns model id
POST /management/v0/models/{id}/deploy -> returns deploymentId
You can register a model without deploying it, and deploy whenever you’re ready.
The Management API (group backend-management-v0) is served under https://[your-domain].rational.is/api. You can browse it interactively in the Swagger UI at https://[your-domain].rational.is/api/swagger/index.html.
Replace [your-domain] with your own tenant identifier throughout this guide.
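To avoid repeating the host in every command, the base URL can be kept in a shell variable. This is a minimal sketch: the names TENANT, API_BASE, and api_url are illustrative and not part of the Management API; only the URL layout comes from this guide.

```shell
# Hypothetical helper names; only the URL layout is taken from this guide.
TENANT="${TENANT:-your-domain}"   # replace with your tenant identifier
API_BASE="https://${TENANT}.rational.is/api"

# Join the base URL and an API path such as /management/v0/models.
# A leading slash on the path is optional.
api_url() {
  printf '%s/%s\n' "$API_BASE" "${1#/}"
}

api_url /management/v0/models
```

The remaining examples in this guide keep the full URL spelled out so that each command can be copied on its own.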
Prerequisites
Before you begin, ensure you have:
- An authenticated session or API key with admin/management rights on the Control Room.
- The exact Hugging Face repo ID of the model you want (e.g. google/gemma-4-12B-it). Verify it exists on huggingface.co before registering, because the API stores the string as-is.
- Available GPU capacity in the cluster if you intend to deploy immediately.
All examples below assume you replace $TOKEN with a valid bearer token / API key. If you’re driving this from an authenticated browser session, the session cookie is used automatically and you can drop the Authorization header.
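When scripting from a shell there is no browser session to fall back on, so it can help to fail early if $TOKEN is missing. The sketch below assumes bearer-token authentication as described above; require_token and mgmt_curl are hypothetical helper names.

```shell
# Fail early when TOKEN is missing instead of sending an empty bearer header.
require_token() {
  if [ -z "${TOKEN:-}" ]; then
    echo "TOKEN is not set; export a bearer token or API key first" >&2
    return 1
  fi
}

# Thin wrapper around curl that adds the headers used throughout this guide.
# All arguments are passed through to curl unchanged.
mgmt_curl() {
  require_token || return 1
  curl -s -H "Accept: application/json" -H "Authorization: Bearer ${TOKEN}" "$@"
}
```

With these defined, a call such as mgmt_curl "https://[your-domain].rational.is/api/management/v0/models" sends the same headers as the explicit examples.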
Step 1 — Check what’s already registered
Avoid duplicates by listing the current catalog first.
curl -s "https://[your-domain].rational.is/api/management/v0/models?page=0&size=50" \
-H "Accept: application/json" \
-H "Authorization: Bearer $TOKEN"
Each entry includes id, name, type, publisher, and huggingFaceId. If the model already appears, skip to Step 3 — Deploy the model using its existing id.
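The duplicate check can be scripted with jq. The sketch below is hedged: the exact response envelope (a bare array or a paged wrapper object) is an assumption, so the filter searches the whole document for entries with a matching huggingFaceId. find_model_id is a hypothetical helper name.

```shell
# Print the id of every catalog entry whose huggingFaceId equals $1.
# Reads the listing JSON from stdin. The recursive search (..) is used because
# the shape of the response envelope is not confirmed by this guide.
find_model_id() {
  jq -r --arg hf "$1" '.. | objects | select(.huggingFaceId == $hf) | .id'
}
```

Pipe the output of the listing call into find_model_id google/gemma-4-12B-it; empty output means no entry references that repo.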
Step 2 — Register the model
POST /management/v0/models/register with a RegisterModelRequest body.
Required fields
| Field | Type | Notes |
|---|---|---|
| name | string | Catalog name. Convention: the HF repo ID, e.g. google/gemma-4-12B-it. |
| type | enum | One of base, awq, ggml, gguf. Use base for full-precision HF weights; use gguf/awq for quantized variants. |
| isFineTuned | boolean | false for a stock open-source model. |
Optional but recommended fields
| Field | Type | Notes |
|---|---|---|
| huggingFaceId | string | The HF repo ID. This is what makes it an open-source / HF model. |
| publisher | string | e.g. google. |
| numberOfParams | string | Parameter count as a string, e.g. "12000000000". |
| quant | string | Quantization label, if applicable. |
| size | string | On-disk size, if known. |
| description | string | Free text / model card summary. |
| card | uri | Link to the HF model card. |
| defaultParameters | object | ModelDefaultParameters: context length, GPU offload, RoPE settings, etc. |
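Building the request body with jq avoids hand-quoting JSON. The sketch below covers the required fields plus a few optional ones. It assumes that name, huggingFaceId, and card all derive from the repo ID and that publisher is the part before the slash; these are conventions, not API rules, and register_body is a hypothetical helper name.

```shell
# Build a RegisterModelRequest body from a Hugging Face repo ID.
# $1: repo ID (e.g. google/gemma-4-12B-it), $2: model type (defaults to base).
register_body() {
  jq -n --arg repo "$1" --arg mtype "${2:-base}" '{
    "name": $repo,
    "type": $mtype,
    "isFineTuned": false,
    "publisher": ($repo | split("/")[0]),
    "huggingFaceId": $repo,
    "card": ("https://huggingface.co/" + $repo)
  }'
}
```

The output can be passed to curl with -d "$(register_body google/gemma-4-12B-it)".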
Example
curl -s -X POST "https://[your-domain].rational.is/api/management/v0/models/register" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"name": "google/gemma-4-12B-it",
"type": "base",
"isFineTuned": false,
"publisher": "google",
"numberOfParams": "12000000000",
"description": "Gemma 4 12B instruction-tuned",
"card": "https://huggingface.co/google/gemma-4-12B-it",
"huggingFaceId": "google/gemma-4-12B-it"
}'
Success: 201 Created, returns the full model record including its id (a UUID). Save that id.
Conflict: 409 Conflict, a model with that name already exists. List the catalog (Step 1) and reuse the existing id.
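In a script, the two outcomes above can be told apart by the HTTP status code. The sketch below maps a status code to the next action; register_outcome is a hypothetical helper, and only the 201 and 409 cases are taken from this guide.

```shell
# Map the HTTP status of the register call to the next action.
register_outcome() {
  case "$1" in
    201) echo "created: save the id from the response body" ;;
    409) echo "exists: list the catalog and reuse the existing id" ;;
    *)   echo "unexpected status $1: inspect the response body" ;;
  esac
}

# Example wiring (not executed here): curl's -w option prints the status code
# while -o writes the response body to a file.
# status=$(curl -s -o register.json -w '%{http_code}' -X POST ...)
# register_outcome "$status"
```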
Step 3 — Deploy the model
POST /management/v0/models/{id}/deploy with a DeploymentParameters body. {id} is the model UUID returned in Step 2, or the existing id found in Step 1.
Body fields
| Field | Type | Required | Notes |
|---|---|---|---|
| forceDeployment | boolean | yes | true to proceed even if a deployment exists / to force a fresh rollout. |
| temperature | float | no | Default sampling temperature. |
| topP | float | no | Nucleus sampling. |
| topK | float | no | Top-k sampling. |
| minP | float | no | Minimum-probability cutoff. |
| repetitionPenalty | float | no | Repetition penalty. |
| maxOutputTokens | int | no | Default max output tokens. |
Existing deployments in this cluster also carry an adapterScale value (e.g. 1). Include "adapterScale": 1 to match the established pattern.
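A small helper can assemble the minimal deploy body and reject a malformed forceDeployment value before the request is sent. This is a sketch under the assumptions above (forceDeployment required, adapterScale included by convention); deploy_body is a hypothetical name.

```shell
# Print a minimal DeploymentParameters body.
# $1: forceDeployment (true or false, defaults to true)
# $2: adapterScale (defaults to 1, matching existing deployments)
deploy_body() {
  case "${1:-true}" in
    true|false) ;;
    *) echo "forceDeployment must be true or false" >&2; return 1 ;;
  esac
  printf '{"forceDeployment":%s,"adapterScale":%s}\n' "${1:-true}" "${2:-1}"
}
```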
Example
curl -s -X POST "https://[your-domain].rational.is/api/management/v0/models/<MODEL_ID>/deploy" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"forceDeployment": true,
"adapterScale": 1
}'
Success: 201 Created, returns the deploymentId as plain text, e.g. model-google-gemma-4-12-b-it-420f1c9176bc.
Step 4 — Verify deployment status
Fetch the model record and inspect its deployments array:
curl -s "https://[your-domain].rational.is/api/management/v0/models/<MODEL_ID>" \
-H "Accept: application/json" \
-H "Authorization: Bearer $TOKEN"
Each deployment shows:
- deploymentId: the serving instance name.
- status: provisioning lifecycle (e.g. waitingForChat once the pod is up and accepting traffic).
- gpuCount: GPUs allocated.
- serviceUrl: in-cluster endpoint, e.g. http://<deploymentId>.rational-ai.svc.cluster.local/v1/ (OpenAI-compatible /v1/ path).
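The fields listed above can be condensed into one line per deployment with jq. The sketch assumes the model record exposes them under a top-level deployments array, as described in this step; summarize_deployments is a hypothetical helper name.

```shell
# Read a model record from stdin and print one tab-separated line per
# deployment: deploymentId, status, gpuCount, serviceUrl.
summarize_deployments() {
  jq -r '.deployments[] | [.deploymentId, .status, .gpuCount, .serviceUrl] | @tsv'
}
```

Pipe the output of the model-record call into summarize_deployments to get a compact status view.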
You can also poll the dedicated status endpoint:
curl -s "https://[your-domain].rational.is/api/management/v0/models/<DEPLOYMENT_ID>/deployment-status" \
-H "Accept: application/json" \
-H "Authorization: Bearer $TOKEN"
Deployment takes a few minutes — the pod must pull the weights before it becomes ready.
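Because readiness takes a while, a bounded polling loop is more practical than re-running the command by hand. The sketch below is generic: it runs whatever command is passed to it and looks for the target status as a substring of the output, since the exact response format of the status endpoint is not confirmed here. wait_for_status is a hypothetical helper name.

```shell
# Poll until the command output contains the target status, or give up.
# $1: target status (e.g. waitingForChat)
# $2: maximum number of attempts
# $3: delay between attempts, in seconds
# remaining arguments: the command that prints the current status
wait_for_status() {
  local target="$1"
  local attempts="$2"
  local delay="$3"
  shift 3
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" | grep -q "$target"; then
      echo "reached $target after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempt(s)" >&2
  return 1
}

# Example wiring (not executed here), polling every 10 seconds for up to 10 minutes:
# wait_for_status waitingForChat 60 10 curl -s \
#   "https://[your-domain].rational.is/api/management/v0/models/$DEPLOYMENT_ID/deployment-status" \
#   -H "Authorization: Bearer $TOKEN"
```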
Managing deployments
| Action | Endpoint |
|---|---|
| List all deployments | GET /management/v0/models/deployments |
| Pause a deployment | POST /management/v0/models/{deploymentId}/pause |
| Cancel a deployment | POST /management/v0/models/{deploymentId}/cancel |
| Update model record | PATCH /management/v0/models/{id} |
| Delete model | DELETE /management/v0/models/{id} |
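The table above can be captured as a lookup so that scripts refer to actions by name. This is a sketch; mgmt_route and its action names (list, pause, cancel, update, delete) are hypothetical, while the methods and paths are copied from the table.

```shell
# Print "METHOD PATH" for a management action.
# $1: action name, $2: model id or deploymentId (unused for list)
mgmt_route() {
  case "$1" in
    list)   echo "GET /management/v0/models/deployments" ;;
    pause)  echo "POST /management/v0/models/$2/pause" ;;
    cancel) echo "POST /management/v0/models/$2/cancel" ;;
    update) echo "PATCH /management/v0/models/$2" ;;
    delete) echo "DELETE /management/v0/models/$2" ;;
    *)      echo "unknown action: $1" >&2; return 1 ;;
  esac
}
```

Note that pause and cancel take a deploymentId, whereas update and delete take the model id.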
Choosing the right type
- base: full-precision Hugging Face weights (FP16/BF16). Highest quality, highest VRAM. Use for standard *-it repos.
- gguf: GGUF-quantized (llama.cpp family). Smaller footprint, runs on less VRAM. Point huggingFaceId at a GGUF repo.
- awq: AWQ-quantized. Good quality/size tradeoff for supported architectures.
- ggml: legacy GGML format.
Pick the type that matches the actual format of the HF repo you reference — the field tells the serving backend how to load the weights.
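A script can reject an unsupported type value before calling the API. The sketch below checks against the four enum values listed in this guide; is_valid_model_type is a hypothetical helper, and treating the values as case-sensitive is an assumption.

```shell
# Return 0 when $1 is one of the type values listed in this guide.
# Matching is case-sensitive (an assumption about the API's enum handling).
is_valid_model_type() {
  case "$1" in
    base|awq|ggml|gguf) return 0 ;;
    *) return 1 ;;
  esac
}
```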
Troubleshooting
- 404 Cannot POST ... (wrong origin or path): calls must go to URLs under https://[your-domain].rational.is/api/. If scripting from a browser, make sure the active page is on the Control Room origin so relative /api/... fetches resolve correctly.
- 409 Conflict on register: name already taken; reuse the existing model id.
- Deployment stuck before waitingForChat: the pod is still pulling weights or waiting on GPU scheduling. Check cluster GPU capacity; large models need proportionally more.
- Architecture mismatch: the Control Room may record a generic architecture value on register. For newer/multimodal architectures, confirm the serving backend actually supports that model architecture before relying on non-text capabilities.
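The 404 case is usually a missing /api prefix, which a script can catch before sending the request. A minimal sketch, assuming every Management API call lives under /api/management/v0/ on a rational.is host as shown in this guide; is_management_url is a hypothetical helper name.

```shell
# Return 0 when the URL targets the Management API under /api/management/v0/.
is_management_url() {
  case "$1" in
    https://*.rational.is/api/management/v0/*) return 0 ;;
    *) return 1 ;;
  esac
}
```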
Quick reference — full flow
# 1. Register
MODEL_ID=$(curl -s -X POST "https://[your-domain].rational.is/api/management/v0/models/register" \
-H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" \
-d '{"name":"google/gemma-4-12B-it","type":"base","isFineTuned":false,
"publisher":"google","huggingFaceId":"google/gemma-4-12B-it",
"card":"https://huggingface.co/google/gemma-4-12B-it"}' \
| jq -r '.id')
# 2. Deploy
curl -s -X POST "https://[your-domain].rational.is/api/management/v0/models/$MODEL_ID/deploy" \
-H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" \
-d '{"forceDeployment":true,"adapterScale":1}'
# 3. Verify
curl -s "https://[your-domain].rational.is/api/management/v0/models/$MODEL_ID" \
-H "Authorization: Bearer $TOKEN" | jq '.deployments'
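One caveat with the flow above: if registration fails, jq -r prints the literal string null (or nothing at all), and the deploy call would then be sent with a bogus id. A guard between steps 1 and 2 avoids that. This is a sketch; require_model_id is a hypothetical helper name.

```shell
# Stop when the register step did not yield a usable model id.
# jq -r prints the literal string "null" when .id is absent, so check for both.
require_model_id() {
  if [ -z "${1:-}" ] || [ "$1" = "null" ]; then
    echo "register did not return a model id; stop before deploying" >&2
    return 1
  fi
}
```

For example, require_model_id "$MODEL_ID" || exit 1 placed after the register step halts the script before any GPU is requested.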