Private AI cloud · MacLustr cluster

MacLustr LLM API

One clean, OpenAI-compatible endpoint that routes to four local MLX models distributed across the MacLustr cluster. Bring your favourite OpenAI client — just change the base URL.

POST https://llm.maclustr.io/v1/chat/completions

OpenAI-compatible · Streaming (SSE) · Bearer auth · 4 routed models · 100% local

Quick Start

Send your first request in seconds. Replace the key with your own.

# Streaming chat completion
curl https://llm.maclustr.io/v1/chat/completions \
  -H "Authorization: Bearer ml-admin-001" \
  -H "Content-Type: application/json" -N \
  -d '{
    "model": "maclustr-coder",
    "messages": [{"role":"user","content":"Write a FastAPI server."}],
    "stream": true
  }'

Overview

The gateway exposes a single production endpoint while keeping models distributed internally across two Mac Studio (96 GB) nodes. Requests are authenticated, routed by the model field, proxied to the correct internal MLX server, and returned in standard OpenAI format — streaming or not.
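The routing step described above can be sketched as a lookup from the request's model field to an internal server. This is a minimal illustration, not the gateway's actual code: the host names and the pairing of aliases to ports are assumptions (only the 8001–8004 port range appears in this document), and `resolve_upstream` is a hypothetical helper.

```python
from typing import Optional

# Hypothetical alias -> upstream table. Host names and the alias-to-port
# pairing are assumptions; only the port range 8001-8004 is documented.
UPSTREAMS = {
    "maclustr-coder": "http://studio-1.internal:8001",
    "maclustr-general": "http://studio-1.internal:8002",
    "maclustr-reasoning": "http://studio-2.internal:8003",
    "maclustr-fast": "http://studio-2.internal:8004",
}


def resolve_upstream(body: dict) -> Optional[str]:
    """Return the internal chat completions URL for the requested model, or None."""
    model = body.get("model")
    if not isinstance(model, str):
        return None  # missing or malformed model field
    base = UPSTREAMS.get(model)
    if base is None:
        return None  # a gateway would answer 404 model_not_found here
    return f"{base}/v1/chat/completions"
```

Returning None rather than raising keeps the decision about the public error response (and what it reveals) in one place in the caller.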

Base URL	https://llm.maclustr.io
Protocol	OpenAI /v1 chat completions
Auth	Bearer token
Transport	JSON · SSE streaming

Authentication

All API routes require a bearer token, configured server-side via the MACLUSTR_API_KEY environment variable (never hardcoded).

Authorization: Bearer $MACLUSTR_API_KEY

Requests without a valid key receive a 401 response with a JSON error body (see Error Format). To rotate keys, set MACLUSTR_API_KEY to a comma-separated list of accepted keys.
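As an illustration of how a comma-separated key list might be checked, here is a small sketch. The function names `load_keys` and `is_authorized` are hypothetical, and the gateway's real implementation is not shown in this document; the constant-time comparison is a common precaution rather than a documented behavior.

```python
import hmac
import os
from typing import List, Optional


def load_keys(raw: str) -> List[str]:
    """Split a comma-separated key list, ignoring blanks and surrounding spaces."""
    return [k.strip() for k in raw.split(",") if k.strip()]


def is_authorized(header: Optional[str], keys: List[str]) -> bool:
    """True if the Authorization header carries one of the accepted bearer keys."""
    if not header:
        return False
    scheme, _, token = header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return any(hmac.compare_digest(token.encode(), k.encode()) for k in keys)


# Keys come from the environment, as described above.
ACCEPTED_KEYS = load_keys(os.environ.get("MACLUSTR_API_KEY", ""))
```

With a list such as `old-key,new-key`, both keys are accepted until the old one is removed from the variable, which is what makes rotation possible without downtime.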

Available Models

Use the short public aliases. The full internal model id is also accepted. Responses always report the public alias in the model field.

maclustr-coder

Coding and agent development model

30B (MoE · 3B active) parameters

maclustr-general

General-purpose assistant model

32B parameters

maclustr-reasoning

Reasoning, math, econometrics, and complex analysis model

32B parameters

maclustr-fast

Fast model for summaries, routing, rewriting, and lightweight chat

24B parameters

Routing Table

Public alias	Purpose	Parameters
`maclustr-coder`	Coding and agent development model	30B (MoE · 3B active)
`maclustr-general`	General-purpose assistant model	32B
`maclustr-reasoning`	Reasoning, math, econometrics, and complex analysis model	32B
`maclustr-fast`	Fast model for summaries, routing, rewriting, and lightweight chat	24B
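Because both the public alias and the full internal id are accepted while responses always report the alias, a gateway needs some form of normalization. A rough sketch follows; the internal ids below are placeholders invented for illustration (the real ids are not listed in this document), and both helper names are hypothetical.

```python
from typing import Optional

# Placeholder internal ids, for illustration only.
INTERNAL_TO_ALIAS = {
    "example-org/coder-30b-mlx": "maclustr-coder",
    "example-org/general-32b-mlx": "maclustr-general",
    "example-org/reasoning-32b-mlx": "maclustr-reasoning",
    "example-org/fast-24b-mlx": "maclustr-fast",
}
ALIASES = set(INTERNAL_TO_ALIAS.values())


def public_alias(model: str) -> Optional[str]:
    """Map a public alias or an internal id to the public alias, else None."""
    if model in ALIASES:
        return model
    return INTERNAL_TO_ALIAS.get(model)


def relabel_response(response: dict, alias: str) -> dict:
    """Return a copy of an upstream response whose model field is the public alias."""
    return {**response, "model": alias}
```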

Chat Completions

POST https://llm.maclustr.io/v1/chat/completions

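Accepts the OpenAI chat completion parameters temperature, top_p, stop, seed, max_tokens, presence_penalty, frequency_penalty, logit_bias, tools, and stream_options, plus top_k (a sampling option that is not part of OpenAI's own API). All are passed through to the upstream model server.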
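To show how optional parameters sit alongside the required fields, the following sketch builds a request body and leaves out anything set to None. `chat_body` is a hypothetical helper, and the parameter values are arbitrary examples rather than recommended settings.

```python
import json


def chat_body(model: str, prompt: str, **params) -> str:
    """Serialize a chat completion request, omitting parameters left as None."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    body.update({k: v for k, v in params.items() if v is not None})
    return json.dumps(body)


# Example: two sampling parameters set, stop left unset.
example = chat_body("maclustr-fast", "Hi", temperature=0.2, max_tokens=64, stop=None)
```

When using the openai Python client, fields that the client does not model as named arguments can be sent through its `extra_body` argument.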

Non-streaming

curl https://llm.maclustr.io/v1/chat/completions \
  -H "Authorization: Bearer ml-admin-001" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "maclustr-general",
    "messages": [{"role":"user","content":"Explain MacLustr in one paragraph."}],
    "temperature": 0.7,
    "stream": false
  }'

Streaming

Set "stream": true to receive Server-Sent Events, identical to OpenAI's streaming format (terminated by data: [DONE]).

curl https://llm.maclustr.io/v1/chat/completions -N \
  -H "Authorization: Bearer ml-admin-001" \
  -H "Content-Type: application/json" \
  -d '{"model":"maclustr-fast","messages":[{"role":"user","content":"Hi"}],"stream":true}'
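For clients that read the stream without an SDK, the event format can be handled with a few lines of parsing. This sketch assumes the standard OpenAI chunk shape (`choices[].delta.content`) and takes any iterable of already-decoded text lines; `iter_content` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines until data: [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Usage: pass it the lines of the HTTP response body and join or print the pieces as they arrive.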

Python

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="https://llm.maclustr.io/v1", api_key="ml-admin-001")

stream = client.chat.completions.create(
    model="maclustr-reasoning",
    messages=[{"role":"user","content":"Prove sqrt(2) is irrational."}],
    stream=True,
)
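for chunk in stream:
    # Skip chunks without choices (for example a trailing usage-only chunk).
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)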

JavaScript

// npm i openai
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "https://llm.maclustr.io/v1", apiKey: "ml-admin-001" });

const res = await client.chat.completions.create({
  model: "maclustr-general",
  messages: [{ role: "user", content: "Hello MacLustr" }],
});
console.log(res.choices[0].message.content);

List Models

GET https://llm.maclustr.io/v1/models

curl https://llm.maclustr.io/v1/models -H "Authorization: Bearer ml-admin-001"
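The response format is not shown here. Assuming it follows OpenAI's list shape (an object with a `data` array whose entries each carry an `id`), the model ids could be extracted like this; `model_ids` is a hypothetical helper.

```python
import json
from typing import List


def model_ids(response_text: str) -> List[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    payload = json.loads(response_text)
    return [entry["id"] for entry in payload.get("data", [])]
```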

Health Check

GET https://llm.maclustr.io/health

Returns gateway status and configured models. Add ?deep=true to ping each internal MLX server.

curl https://llm.maclustr.io/health
curl "https://llm.maclustr.io/health?deep=true"

Error Format

Errors use the OpenAI envelope. Internal node addresses are never exposed.

{
  "error": {
    "message": "Model 'foo' not found. Available models: maclustr-coder, ...",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}

Status	Meaning
`401`	Missing or invalid API key
`404`	Unknown model alias
`400`	Malformed JSON body
`502`	Upstream model unreachable / temporarily unavailable
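On the client side, the envelope and status table above can be turned into a single exception type. The sketch below is one possible approach, with a hypothetical `GatewayError` class; treating only 502 as retryable is an assumption drawn from the table, not a documented guarantee.

```python
import json
from typing import Optional


class GatewayError(Exception):
    """An error response from the gateway, parsed from the OpenAI envelope."""

    def __init__(self, status: int, message: str,
                 type_: Optional[str], code: Optional[str]):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.type = type_
        self.code = code
        self.retryable = status == 502  # upstream unreachable or still loading


def parse_error(status: int, body: str) -> GatewayError:
    """Build a GatewayError, tolerating bodies that are not the expected JSON."""
    try:
        err = json.loads(body).get("error")
    except (ValueError, AttributeError):
        err = None
    if not isinstance(err, dict):
        err = {}
    return GatewayError(status, err.get("message", "unknown error"),
                        err.get("type"), err.get("code"))
```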

Deployment Notes

The gateway runs on the public-facing node (localhost:9000) and is exposed at https://llm.maclustr.io via ngrok (or a Caddy/Nginx reverse proxy). Four MLX models are served with mlx_lm across two Mac Studio nodes and kept alive by launchd (KeepAlive = auto-restart on failure).
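The launchd job definitions themselves are not shown in this document. As a sketch of what one might look like, the snippet below generates a property list with Python's plistlib. The label, executable path, model id, and host address are placeholders, and using `KeepAlive` with `SuccessfulExit` set to false (restart only after a non-zero exit) is one way to get restart-on-failure, not necessarily the setting used on the cluster.

```python
import plistlib


def mlx_server_plist(label: str, model: str, host: str, port: int) -> bytes:
    """Build a launchd job that runs mlx_lm.server and restarts it after a failure."""
    job = {
        "Label": label,
        "ProgramArguments": [
            # Assumed install location; launchd does not search the shell PATH.
            "/usr/local/bin/mlx_lm.server",
            "--model", model,
            "--host", host,
            "--port", str(port),
        ],
        "RunAtLoad": True,
        # Restart only when the process exits with a non-zero status.
        "KeepAlive": {"SuccessfulExit": False},
    }
    return plistlib.dumps(job)  # XML property list as bytes


# Placeholder values, for illustration only.
PLIST = mlx_server_plist("io.maclustr.mlx.coder", "example-org/coder-30b-mlx",
                         "10.0.0.11", 8001)
```

The resulting bytes would be written to a `.plist` file under a LaunchDaemons or LaunchAgents directory and loaded with launchctl.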

Two modes:
Development/testing: https://YOUR-NGROK-URL.ngrok-free.app/v1/chat/completions
Production: https://llm.maclustr.io/v1/chat/completions

Security Notes

Bearer token required on every API call; keys come from env, never code.
Internal node hostnames/ports are never leaked in public error responses.
Run behind TLS (ngrok or Caddy/Nginx terminate HTTPS).
Rotate keys via the comma-separated MACLUSTR_API_KEY list.
Keep the internal MLX ports (8001–8004) on the private cluster network only.

Troubleshooting

Symptom	Fix
`401` on every call	Check the `Authorization: Bearer` header matches `MACLUSTR_API_KEY`.
`404 model_not_found`	Use a valid alias (see Routing Table).
`502`	The MLX model is still loading or down — check `/health?deep=true`.
Slow first request	First call loads the model into memory; subsequent calls are fast.
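Since a 502 can simply mean the model is still loading, clients may want to retry with a growing delay before giving up. A minimal sketch, assuming a caller-supplied `send` function that returns a status code and body; the attempt count and delays are arbitrary defaults, and `call_with_retry` is a hypothetical helper.

```python
import time
from typing import Callable, Tuple


def call_with_retry(send: Callable[[], Tuple[int, str]],
                    attempts: int = 4,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call send() until it returns a non-502 status or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status != 502 or attempt == attempts - 1:
            return status, body
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Other statuses from the table (400, 401, 404) are returned immediately, since repeating the same request would not change the outcome.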
