MacLustr LLM API
One clean, OpenAI-compatible endpoint that routes to four local MLX models distributed across the MacLustr cluster. Bring your favourite OpenAI client — just change the base URL.
Quick Start
Send your first request in seconds. Replace the key with your own.
# Streaming chat completion
curl https://llm.maclustr.io/v1/chat/completions \
-H "Authorization: Bearer ml-admin-001" \
-H "Content-Type: application/json" -N \
-d '{
"model": "maclustr-coder",
"messages": [{"role":"user","content":"Write a FastAPI server."}],
"stream": true
}'
Overview
The gateway exposes a single production endpoint while keeping models distributed internally across two Mac Studio (96 GB) nodes. Requests are authenticated, routed by the model field, proxied to the correct internal MLX server, and returned in standard OpenAI format — streaming or not.
Base URL
https://llm.maclustr.io
Protocol
OpenAI/v1 chat completions
Auth
Bearer token
Transport
JSON · SSE streaming
Authentication
All API routes require a bearer token, configured server-side via the MACLUSTR_API_KEY environment variable (never hardcoded).
Authorization: Bearer $MACLUSTR_API_KEY
Requests with a missing or invalid key return 401 with a clean JSON error. Rotate keys by setting a comma-separated list in MACLUSTR_API_KEY.
Available Models
Use the short public aliases. The full internal model id is also accepted. Responses always report the public alias in the model field.
- maclustr-coder: Coding and agent development model
- maclustr-general: General-purpose assistant model
- maclustr-reasoning: Reasoning, math, econometrics, and complex analysis model
- maclustr-fast: Fast model for summaries, routing, rewriting, and lightweight chat
Routing Table
| Public alias | Purpose | Parameters |
|---|---|---|
| maclustr-coder | Coding and agent development model | 30B (MoE · 3B active) |
| maclustr-general | General-purpose assistant model | 32B |
| maclustr-reasoning | Reasoning, math, econometrics, and complex analysis model | 32B |
| maclustr-fast | Fast model for summaries, routing, rewriting, and lightweight chat | 24B |
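The alias handling described above can be sketched as a small lookup table. This is an illustration only: the internal model ids, the hostnames, and the assignment of aliases to ports 8001–8004 are hypothetical placeholders, and `Route`, `resolve`, and `public_response` are names invented for the example rather than the gateway's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    alias: str        # public alias reported back to clients
    internal_id: str  # hypothetical internal model id
    upstream: str     # hypothetical private upstream URL


# Placeholder routing table: ids, hosts, and port assignments are assumptions.
ROUTES = [
    Route("maclustr-coder", "internal/coder-model", "http://node-a.internal:8001"),
    Route("maclustr-general", "internal/general-model", "http://node-a.internal:8002"),
    Route("maclustr-reasoning", "internal/reasoning-model", "http://node-b.internal:8003"),
    Route("maclustr-fast", "internal/fast-model", "http://node-b.internal:8004"),
]

# Index every route under both its public alias and its internal id.
_LOOKUP = {**{r.alias: r for r in ROUTES}, **{r.internal_id: r for r in ROUTES}}


def resolve(model: str) -> Optional[Route]:
    """Return the route for a requested model name, or None if unknown."""
    return _LOOKUP.get(model)


def public_response(route: Route, upstream_body: dict) -> dict:
    """Copy the upstream body, overwriting `model` with the public alias."""
    return {**upstream_body, "model": route.alias}
```

A request for an unknown name yields `None`, which a gateway would presumably translate into the 404 `model_not_found` error.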
Chat Completions
Supports the full OpenAI parameter set: temperature, top_p, stop, seed, max_tokens, presence_penalty, frequency_penalty, logit_bias, tools, and stream_options. It also accepts top_k, which is not part of OpenAI's API. All parameters are passed through to the upstream model server.
Non-streaming
curl https://llm.maclustr.io/v1/chat/completions \
-H "Authorization: Bearer ml-admin-001" \
-H "Content-Type: application/json" \
-d '{
"model": "maclustr-general",
"messages": [{"role":"user","content":"Explain MacLustr in one paragraph."}],
"temperature": 0.7,
"stream": false
}'
Streaming
Set "stream": true to receive Server-Sent Events, identical to OpenAI's streaming format (terminated by data: [DONE]).
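For clients that read the stream without an SDK, consuming this format might look like the sketch below. It assumes each event arrives as a single `data:` line carrying a JSON chunk with `choices[].delta.content`, as in OpenAI's format. The `iter_content` helper and the sample lines are illustrative and not part of the gateway.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines until data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events standing in for a real response body.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_content(sample)))  # prints "Hello"
```

In practice the lines would come from an HTTP response read line by line, such as the output of the curl command that follows.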
curl https://llm.maclustr.io/v1/chat/completions -N \
-H "Authorization: Bearer ml-admin-001" \
-H "Content-Type: application/json" \
-d '{"model":"maclustr-fast","messages":[{"role":"user","content":"Hi"}],"stream":true}'
Python
# pip install openai
from openai import OpenAI
client = OpenAI(base_url="https://llm.maclustr.io/v1", api_key="ml-admin-001")
stream = client.chat.completions.create(
    model="maclustr-reasoning",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; content may be None.
    print(chunk.choices[0].delta.content or "", end="")
JavaScript
// npm i openai
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://llm.maclustr.io/v1", apiKey: "ml-admin-001" });
const res = await client.chat.completions.create({
  model: "maclustr-general",
  messages: [{ role: "user", content: "Hello MacLustr" }],
});
console.log(res.choices[0].message.content);
List Models
curl https://llm.maclustr.io/v1/models -H "Authorization: Bearer ml-admin-001"
Health Check
Returns gateway status and configured models. Add ?deep=true to ping each internal MLX server.
curl https://llm.maclustr.io/health
curl "https://llm.maclustr.io/health?deep=true"
Error Format
Errors use the OpenAI envelope. Internal node addresses are never exposed.
{
  "error": {
    "message": "Model 'foo' not found. Available models: maclustr-coder, ...",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}
| Status | Meaning |
|---|---|
| 401 | Missing or invalid API key |
| 404 | Unknown model alias |
| 400 | Malformed JSON body |
| 502 | Upstream model unreachable / temporarily unavailable |
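On the client side, the envelope above can be turned into a typed exception. The following is a minimal sketch, assuming the body has the `error.message`, `error.type`, and `error.code` fields shown in the example. `GatewayError` and `parse_error` are hypothetical names, and the fallback for non-JSON bodies is a defensive assumption rather than documented gateway behaviour.

```python
import json
from typing import Optional


class GatewayError(Exception):
    """Client-side view of an OpenAI-style error envelope."""

    def __init__(self, status: int, message: str,
                 type_: Optional[str] = None, code: Optional[str] = None):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message
        self.type = type_
        self.code = code


def parse_error(status: int, body: str) -> GatewayError:
    """Build a GatewayError from a non-2xx response, tolerating odd bodies."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(err, dict):
        # Not the expected envelope: keep the raw text as the message.
        return GatewayError(status, body.strip() or "unknown error")
    return GatewayError(status, str(err.get("message", "")),
                        err.get("type"), err.get("code"))


# Sample body mirroring the documented envelope.
sample = json.dumps({"error": {
    "message": "Model 'foo' not found.",
    "type": "invalid_request_error",
    "code": "model_not_found",
}})
err = parse_error(404, sample)
print(err.status, err.code)  # prints "404 model_not_found"
```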
Deployment Notes
The gateway runs on the public-facing node (localhost:9000) and is exposed at https://llm.maclustr.io via ngrok (or a Caddy/Nginx reverse proxy). Four MLX models are served with mlx_lm across two Mac Studio nodes and kept alive by launchd (KeepAlive = auto-restart on failure).
ngrok tunnel: https://YOUR-NGROK-URL.ngrok-free.app/v1/chat/completions · Production: https://llm.maclustr.io/v1/chat/completions
Security Notes
- Bearer token required on every API call; keys come from env, never code.
- Internal node hostnames/ports are never leaked in public error responses.
- Run behind TLS (ngrok or Caddy/Nginx terminate HTTPS).
- Rotate keys via the comma-separated MACLUSTR_API_KEY list.
- Keep the internal MLX ports (8001–8004) on the private cluster network only.
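The key-rotation note above can be illustrated with a small validator. This is a sketch under stated assumptions and not the gateway's code: `load_keys` and `is_authorized` are hypothetical helpers, and the constant-time comparison via `hmac.compare_digest` is a common hardening choice that the source does not confirm the gateway uses.

```python
import hmac
import os
from typing import List, Mapping, Optional


def load_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Split MACLUSTR_API_KEY on commas, dropping blanks and whitespace."""
    raw = env.get("MACLUSTR_API_KEY", "")
    return [k.strip() for k in raw.split(",") if k.strip()]


def is_authorized(header: Optional[str], keys: List[str]) -> bool:
    """Check an Authorization header value against the accepted keys."""
    if not header:
        return False
    scheme, _, token = header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    ok = False
    for key in keys:
        # Compare against every key so timing does not reveal which matched.
        ok |= hmac.compare_digest(token.encode(), key.encode())
    return ok


# During a rotation both the old and the new key are accepted.
keys = load_keys({"MACLUSTR_API_KEY": "old-key, new-key"})
print(is_authorized("Bearer new-key", keys))  # prints "True"
```

Listing both keys lets clients switch over before the old key is removed from the variable.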
Troubleshooting
| Symptom | Fix |
|---|---|
| 401 on every call | Check the Authorization: Bearer header matches MACLUSTR_API_KEY. |
| 404 model_not_found | Use a valid alias (see routing). |
| 502 | The MLX model is still loading or down; check /health?deep=true. |
| Slow first request | First call loads the model into memory; subsequent calls are fast. |
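Because a 502 can simply mean a model is still loading, a client may want to retry with a growing delay instead of failing immediately. The sketch below shows one way to do that. `call_with_retry`, its default attempt count, and its delays are illustrative assumptions, and `send` stands in for whatever function performs the HTTP request and returns a `(status, body)` pair.

```python
import time
from typing import Callable, Tuple


def call_with_retry(send: Callable[[], Tuple[int, str]],
                    attempts: int = 4,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call send() until it returns a non-502 status or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status != 502 or attempt == attempts - 1:
            return status, body
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Simulated upstream: two 502s while the model loads, then success.
responses = iter([(502, "loading"), (502, "loading"), (200, "ok")])
delays = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)  # prints "(200, 'ok') [1.0, 2.0]"
```

Other statuses such as 401 or 404 are returned immediately, since retrying them would not help.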