Private AI cloud · MacLustr cluster

MacLustr LLM API

One clean, OpenAI-compatible endpoint that routes to four local MLX models distributed across the MacLustr cluster. Bring your favourite OpenAI client — just change the base URL.

POST https://llm.maclustr.io/v1/chat/completions
OpenAI-compatible · Streaming (SSE) · Bearer auth · 4 routed models · 100% local

Quick Start

Send your first request in seconds. Replace the key with your own.

# Streaming chat completion
curl https://llm.maclustr.io/v1/chat/completions \
  -H "Authorization: Bearer ml-admin-001" \
  -H "Content-Type: application/json" -N \
  -d '{
    "model": "maclustr-coder",
    "messages": [{"role":"user","content":"Write a FastAPI server."}],
    "stream": true
  }'

Overview

The gateway exposes a single production endpoint while the models stay distributed internally across two Mac Studio nodes (96 GB each). Each request is authenticated, routed by its model field, proxied to the matching internal MLX server, and returned in standard OpenAI format, whether streaming or not.

Base URL: https://llm.maclustr.io
Protocol: OpenAI /v1 chat completions
Auth: Bearer token
Transport: JSON · SSE streaming

Authentication

All /v1 routes require a bearer token. Valid keys are configured server-side via the MACLUSTR_API_KEY environment variable (never hardcoded).

Authorization: Bearer $MACLUSTR_API_KEY
Requests without a valid key receive a 401 with an OpenAI-style JSON error. To rotate keys, set MACLUSTR_API_KEY to a comma-separated list; any key in the list is accepted, so old and new keys can overlap during the switch.
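A minimal sketch of how a gateway could check the Authorization header against a comma-separated MACLUSTR_API_KEY. This is not the gateway's actual code: load_api_keys and is_authorized are hypothetical names. It uses hmac.compare_digest so the comparison does not leak timing information.

```python
import hmac
import os
from typing import Optional


def load_api_keys(raw: Optional[str] = None) -> list[str]:
    """Parse a comma-separated key list (defaults to $MACLUSTR_API_KEY).

    Hypothetical helper: blank entries and surrounding whitespace are ignored.
    """
    if raw is None:
        raw = os.environ.get("MACLUSTR_API_KEY", "")
    return [key.strip() for key in raw.split(",") if key.strip()]


def is_authorized(authorization: Optional[str], keys: list[str]) -> bool:
    """True if the header is 'Bearer <token>' and the token matches an accepted key."""
    if not authorization:
        return False
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    matched = False
    for key in keys:
        # Constant-time comparison; keep looping so timing does not reveal which key matched.
        if hmac.compare_digest(token.encode(), key.encode()):
            matched = True
    return matched
```

During a rotation you would set MACLUSTR_API_KEY="old-key,new-key", move clients to the new key, then drop the old one.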

Available Models

Use the short public aliases. The full internal model id is also accepted. Responses always report the public alias in the model field.

Routing Table

Public alias | Purpose | Parameters
maclustr-coder | Coding and agent development model | 30B total, 3B active (MoE)
maclustr-general | General-purpose assistant model | 32B
maclustr-reasoning | Reasoning, math, econometrics, and complex analysis model | 32B
maclustr-fast | Fast model for summaries, routing, rewriting, and lightweight chat | 24B
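The alias behaviour described above (public alias or full internal id accepted, unknown names rejected with model_not_found) can be sketched as a small lookup. The internal ids and upstream URLs below are placeholders, not the cluster's real configuration, and Route, resolve and ModelNotFound are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    alias: str        # public name, always reported back to clients
    internal_id: str  # full model id on the MLX server (placeholder values)
    upstream: str     # internal MLX server URL (placeholder values, never shown to clients)


# Hypothetical table: internal ids and upstream URLs are illustrative only.
ROUTES = (
    Route("maclustr-coder", "example/coder-30b-a3b", "http://node-a.internal:8081"),
    Route("maclustr-general", "example/general-32b", "http://node-a.internal:8082"),
    Route("maclustr-reasoning", "example/reasoning-32b", "http://node-b.internal:8081"),
    Route("maclustr-fast", "example/fast-24b", "http://node-b.internal:8082"),
)


class ModelNotFound(Exception):
    """Unknown model name; the gateway would answer 404 with code model_not_found."""


def resolve(model: str, routes: tuple[Route, ...] = ROUTES) -> Route:
    """Accept either a public alias or a full internal id."""
    for route in routes:
        if model in (route.alias, route.internal_id):
            return route
    available = ", ".join(r.alias for r in routes)
    # Only aliases are listed, so internal addresses never leak into the error.
    raise ModelNotFound(f"Model '{model}' not found. Available models: {available}")
```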

Chat Completions

POST https://llm.maclustr.io/v1/chat/completions

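Request parameters are forwarded unchanged to the upstream MLX server, including temperature, top_p, top_k, stop, seed, max_tokens, presence_penalty, frequency_penalty, logit_bias, tools, and stream_options. Note that top_k is not part of the OpenAI API, and whether a given parameter takes effect depends on what the upstream mlx_lm server supports.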
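Pass-through means the JSON body travels upstream untouched apart from the model field. A minimal sketch, assuming the gateway swaps the public alias for the internal id on the way in and restores the alias in each response or stream chunk; to_upstream and to_client are hypothetical names.

```python
import copy
from typing import Any


def to_upstream(body: dict[str, Any], internal_id: str) -> dict[str, Any]:
    """Copy the client body, replacing only 'model'; every other parameter passes through."""
    upstream = copy.deepcopy(body)
    upstream["model"] = internal_id
    return upstream


def to_client(response: dict[str, Any], alias: str) -> dict[str, Any]:
    """Report the public alias in the 'model' field of a response or stream chunk."""
    out = copy.deepcopy(response)
    if "model" in out:
        out["model"] = alias
    return out
```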

Non-streaming

curl https://llm.maclustr.io/v1/chat/completions \
  -H "Authorization: Bearer ml-admin-001" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "maclustr-general",
    "messages": [{"role":"user","content":"Explain MacLustr in one paragraph."}],
    "temperature": 0.7,
    "stream": false
  }'
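If you prefer not to install the openai package, the same non-streaming call can be made with Python's standard library. A sketch: build_chat_request and chat are hypothetical helpers, and reading the key from a MACLUSTR_API_KEY variable in your own shell is an assumption about how you store it.

```python
import json
import os
import urllib.request
from typing import Any

BASE_URL = "https://llm.maclustr.io/v1"


def build_chat_request(model: str, messages: list[dict], api_key: str, **params: Any) -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST /v1/chat/completions request."""
    payload = {"model": model, "messages": messages, "stream": False, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(model: str, prompt: str, **params: Any) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(
        model,
        [{"role": "user", "content": prompt}],
        api_key=os.environ["MACLUSTR_API_KEY"],
        **params,
    )
    # Generous timeout: the first call may need to load the model into memory.
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

For example, chat("maclustr-general", "Explain MacLustr in one paragraph.", temperature=0.7) mirrors the curl call above.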

Streaming

Set "stream": true to receive Server-Sent Events, identical to OpenAI's streaming format (terminated by data: [DONE]).

curl https://llm.maclustr.io/v1/chat/completions -N \
  -H "Authorization: Bearer ml-admin-001" \
  -H "Content-Type: application/json" \
  -d '{"model":"maclustr-fast","messages":[{"role":"user","content":"Hi"}],"stream":true}'
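Without an SDK, the stream arrives as data: lines, each holding one JSON chunk, followed by data: [DONE]. A minimal standard-library parser, assuming OpenAI's chunk shape (choices[0].delta.content) with one JSON object per data: line; iter_sse_content is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style SSE stream, stopping at 'data: [DONE]'.

    Assumes one JSON chunk per 'data:' line, as in OpenAI's format.
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # blank event separators, ':' comments, other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        choices = chunk.get("choices") or []  # e.g. a usage-only chunk has no choices
        if choices:
            content = (choices[0].get("delta") or {}).get("content")
            if content:
                yield content
```

A response from urllib.request.urlopen yields lines when iterated, so it can be passed straight in: for text in iter_sse_content(resp): print(text, end="", flush=True).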

Python

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="https://llm.maclustr.io/v1", api_key="ml-admin-001")

stream = client.chat.completions.create(
    model="maclustr-reasoning",
    messages=[{"role":"user","content":"Prove sqrt(2) is irrational."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks without choices (e.g. a trailing usage chunk)
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

JavaScript

// npm i openai
// Run as an ES module (.mjs or "type": "module") so top-level await works.
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "https://llm.maclustr.io/v1", apiKey: "ml-admin-001" });

const res = await client.chat.completions.create({
  model: "maclustr-general",
  messages: [{ role: "user", content: "Hello MacLustr" }],
});
console.log(res.choices[0].message.content);

List Models

GET https://llm.maclustr.io/v1/models
curl https://llm.maclustr.io/v1/models -H "Authorization: Bearer ml-admin-001"
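Because the endpoint is OpenAI-compatible, the response presumably follows the standard list shape ({"object": "list", "data": [{"id": ...}, ...]}); that shape is an assumption here. A small sketch for pulling out the ids, for example to check a model name before sending traffic; model_ids and require_model are hypothetical helpers.

```python
from typing import Any


def model_ids(payload: dict[str, Any]) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]


def require_model(payload: dict[str, Any], model: str) -> None:
    """Fail early on the client if the model is not listed by /v1/models."""
    ids = model_ids(payload)
    if model not in ids:
        raise ValueError(f"{model!r} not in /v1/models: {', '.join(ids)}")
```

With the openai package, client.models.list() returns the same entries with an .id attribute. This check only knows the names that /v1/models lists.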

Health Check

GET https://llm.maclustr.io/health

Returns gateway status and configured models. Add ?deep=true to ping each internal MLX server.

curl https://llm.maclustr.io/health
curl "https://llm.maclustr.io/health?deep=true"
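A deep check fans out to every internal MLX server, so a gateway would typically probe them concurrently. A minimal sketch of that idea; deep_health, the report shape, and the probe callable are hypothetical (a real probe might send a short-timeout HTTP request to the node). Results are keyed by alias, so internal node addresses stay hidden.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def deep_health(
    upstreams: dict[str, str],
    probe: Callable[[str], bool],
    max_workers: int = 4,
) -> dict[str, object]:
    """Probe every upstream concurrently; report per-alias status without exposing URLs."""

    def safe_probe(url: str) -> bool:
        try:
            return bool(probe(url))
        except Exception:
            return False  # an unreachable node counts as down instead of crashing the check

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the aliases.
        results = dict(zip(upstreams, pool.map(safe_probe, upstreams.values())))
    return {
        "status": "ok" if all(results.values()) else "degraded",
        "models": {alias: ("up" if up else "down") for alias, up in results.items()},
    }
```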

Error Format

Errors use the OpenAI error envelope. Internal node addresses never appear in error messages.

{
  "error": {
    "message": "Model 'foo' not found. Available models: maclustr-coder, ...",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}
Status | Meaning
400 | Malformed JSON body
401 | Missing or invalid API key
404 | Unknown model alias
502 | Upstream model unreachable or temporarily unavailable
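On the client, the envelope and status codes can be turned into a typed exception. A sketch: MacLustrAPIError and parse_error are hypothetical names, and treating only 502 as retryable follows the table's "temporarily unavailable" wording.

```python
import json
from typing import Optional


class MacLustrAPIError(Exception):
    """Hypothetical client-side exception carrying the OpenAI-style error fields."""

    def __init__(self, status: int, message: str, type_: Optional[str], code: Optional[str]):
        super().__init__(f"{status} {code or type_ or 'error'}: {message}")
        self.status = status
        self.message = message
        self.type = type_
        self.code = code

    @property
    def retryable(self) -> bool:
        # 502: upstream model unreachable or temporarily unavailable (e.g. still loading).
        return self.status == 502


def parse_error(status: int, body: bytes) -> MacLustrAPIError:
    """Turn an error response into an exception, tolerating bodies that are not JSON envelopes."""
    try:
        err = json.loads(body).get("error")
    except (ValueError, AttributeError):
        err = None
    if not isinstance(err, dict):
        err = {}
    return MacLustrAPIError(
        status,
        err.get("message") or body.decode("utf-8", "replace"),
        err.get("type"),
        err.get("code"),
    )
```

With urllib, catch urllib.error.HTTPError as exc and raise parse_error(exc.code, exc.read()) from exc.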

Deployment Notes

The gateway listens on localhost:9000 on the public-facing node and is exposed at https://llm.maclustr.io via ngrok (or a Caddy/Nginx reverse proxy). The four models are served with mlx_lm across the two Mac Studio nodes, and launchd keeps each server alive (KeepAlive restarts it automatically on failure).
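launchd jobs are defined by property-list files, and Python's plistlib can generate one per model. In this sketch the label, paths, host and port are placeholders, and the command line assumes mlx_lm's bundled server (python -m mlx_lm.server --model ... --host ... --port ...); check those flags against your installed mlx_lm version. Setting KeepAlive to {SuccessfulExit: false} asks launchd to restart the job only when it exits unsuccessfully, which matches "auto-restart on failure". mlx_launchd_plist is a hypothetical helper.

```python
import plistlib


def mlx_launchd_plist(label: str, python: str, model: str, host: str, port: int, log_dir: str) -> bytes:
    """Return launchd plist XML for one mlx_lm server (all arguments are placeholders).

    host must be an address the gateway node can reach (e.g. the node's LAN IP).
    """
    job = {
        "Label": label,
        "ProgramArguments": [
            python, "-m", "mlx_lm.server",
            "--model", model,
            "--host", host,
            "--port", str(port),
        ],
        "RunAtLoad": True,
        # Restart only when the process exits unsuccessfully.
        "KeepAlive": {"SuccessfulExit": False},
        "StandardOutPath": f"{log_dir}/{label}.out.log",
        "StandardErrorPath": f"{log_dir}/{label}.err.log",
    }
    return plistlib.dumps(job)
```

Save the output as <label>.plist in ~/Library/LaunchAgents (or /Library/LaunchDaemons for a system-wide job) and load it with launchctl.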

Two modes are available:
Development/testing: https://YOUR-NGROK-URL.ngrok-free.app/v1/chat/completions
Production: https://llm.maclustr.io/v1/chat/completions

Security Notes

Troubleshooting

Symptom | Fix
401 on every call | Check that the Authorization: Bearer header matches a key in MACLUSTR_API_KEY.
404 model_not_found | Use a valid alias (see Routing Table).
502 | The MLX model is still loading or is down; check /health?deep=true.
Slow first request | The first call loads the model into memory; subsequent calls are fast.
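Since a 502 can simply mean the model is still loading, clients may prefer to retry with backoff rather than fail at once. A minimal sketch: send is any zero-argument callable returning (status, body), and the attempt count and delays are arbitrary choices, not gateway recommendations. with_retry is a hypothetical name.

```python
import time
from typing import Callable


def with_retry(
    send: Callable[[], tuple[int, bytes]],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, bytes]:
    """Retry only on 502, doubling the delay each time (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        status, body = send()
        # Any non-502 status, or the last attempt, is returned to the caller as-is.
        if status != 502 or attempt == attempts - 1:
            return status, body
        sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be >= 1")
```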
MacLustr LLM Gateway · OpenAI-compatible · https://llm.maclustr.io