Production AI inference at any scale
Serve Zen models and 60+ architectures with CUDA, Metal, and CPU backends. OpenAI-compatible API. Deploy anywhere.
curl -sSL https://engine.hanzo.ai/install.sh | sh

Everything you need to serve large language models at scale, with zero configuration overhead.
All major model families: Llama, Qwen3, Phi, Gemma, Mistral, DBRX, Starcoder2, and more. Load from HuggingFace or local files.
CUDA 12+, Metal, CPU with AVX2/AVX-512. Automatic backend selection based on detected hardware.
Paged attention v2, continuous batching, speculative decoding, tensor parallelism. Built for throughput.
Drop-in replacement. Same API, same client libraries, your infrastructure. No code changes required.
First-class support for every Zen model. Pre-optimized serving profiles, automatic quantization selection, and native architecture support.
hanzo-engine serve \
--model zenlm/zen4 \
--port 8000
hanzo-engine serve \
--model zenlm/zen4-max \
--port 8000
hanzo-engine serve \
--model zenlm/zen4-coder \
--port 8000
hanzo-engine serve \
--model zenlm/zen4-mini \
--port 8000
hanzo-engine serve \
--model zenlm/zen4-ultra \
--port 8000
hanzo-engine serve \
--model zenlm/zen3-omni \
--port 8000
From a single CLI command to a multi-replica Kubernetes deployment. Choose the method that fits your infrastructure.
# Install
curl -sSL https://engine.hanzo.ai/install.sh | sh

# Or via cargo
cargo install hanzo-engine

# Serve a model
hanzo-engine serve --model zenlm/zen4-mini --port 8000

# With CUDA backend
hanzo-engine serve --model zenlm/zen4 --port 8000 --features cuda

# With tensor parallelism across 4 GPUs
hanzo-engine serve --model zenlm/zen4 --port 8000 --tp 4
# Pull and run with GPU support
docker run -p 8000:8000 --gpus all \
  ghcr.io/hanzoai/engine:latest \
  serve --model zenlm/zen4 --port 8000

# With volume mount for local models
docker run -p 8000:8000 --gpus all \
  -v /models:/models \
  ghcr.io/hanzoai/engine:latest \
  serve --model /models/zen4-mini --port 8000

# CPU-only
docker run -p 8000:8000 \
  ghcr.io/hanzoai/engine:latest \
  serve --model zenlm/zen4-mini --port 8000
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-engine
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hanzo-engine
  template:
    metadata:
      labels:
        app: hanzo-engine
    spec:
      containers:
        - name: engine
          image: ghcr.io/hanzoai/engine:latest
          args: ["serve", "--model", "zenlm/zen4", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-engine
spec:
  selector:
    app: hanzo-engine
  ports:
    - port: 8000
      targetPort: 8000
Purpose-built inference primitives that maximize throughput and minimize latency at every layer of the stack.
Efficient KV-cache management with non-contiguous memory blocks. Serves more concurrent requests with less GPU memory. Eliminates fragmentation waste.
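The idea behind paged KV-cache management can be sketched in a few lines. This is an illustrative toy, not the engine's actual implementation; the block size and class names are assumptions. Sequences draw fixed-size blocks from a shared pool only as they grow, so no memory is reserved up front for a maximum length, and freed blocks are immediately reusable.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (assumed value)

class PagedKVCache:
    """Toy paged allocator: each sequence maps to non-contiguous blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.block_table = {}                # seq_id -> [physical block ids]
        self.length = {}                     # seq_id -> tokens stored

    def append(self, seq_id):
        """Reserve KV space for one more token; grab a block only when full."""
        n = self.length.get(seq_id, 0)
        table = self.block_table.setdefault(seq_id, [])
        if n == len(table) * BLOCK_SIZE:     # current blocks are exhausted
            if not self.free:
                raise MemoryError("KV pool exhausted")
            table.append(self.free.pop())    # take one block from the pool
        self.length[seq_id] = n + 1

    def release(self, seq_id):
        """Finished sequences return their blocks to the shared pool."""
        self.free.extend(self.block_table.pop(seq_id, []))
        self.length.pop(seq_id, None)
```

Because allocation happens one block at a time, the only waste is the partially filled last block of each sequence, rather than a whole contiguous reservation.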
Dynamic request scheduling that adds and removes sequences on the fly. No wasted compute on padding. Requests start generating immediately.
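A minimal sketch of the continuous-batching idea, under assumed names and a toy "one token per step" model of generation: waiting requests are admitted the moment a batch slot frees up, and finished sequences retire mid-batch rather than holding their slot until the whole batch drains.

```python
from collections import deque

class ContinuousBatcher:
    """Toy scheduler: sequences join and leave the running batch every step."""
    def __init__(self, max_batch):
        self.waiting = deque()
        self.running = []
        self.max_batch = max_batch

    def submit(self, seq):
        self.waiting.append(seq)

    def step(self):
        # Admit waiting requests into any free batch slots immediately.
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(self.waiting.popleft())
        # "Generate" one token for every running sequence.
        for seq in self.running:
            seq["generated"] += 1
        # Retire finished sequences, freeing slots for the next step.
        done = [s for s in self.running if s["generated"] >= s["target"]]
        self.running = [s for s in self.running if s not in done]
        return done
```

Contrast with static batching, where a new request would wait for the entire current batch to finish before starting.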
Draft model acceleration. A smaller model proposes candidate tokens, verified by the target model in parallel. Mathematically equivalent output.
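The draft-and-verify loop can be sketched as below for the greedy-decoding case, where accepting the longest agreeing prefix plus the target's own correction makes the output token-for-token identical to running the target alone. This is an illustrative toy with assumed names; production speculative sampling additionally uses rejection sampling to preserve the target's distribution when sampling rather than decoding greedily.

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """Toy greedy speculative decoding: draft proposes k tokens, target
    verifies them in one pass and keeps the longest agreeing prefix."""
    # Draft phase: propose k tokens autoregressively with the cheap model.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)
    # Verify phase: the target checks each proposal (in practice, in parallel).
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expect = target_model(ctx)       # what the target would emit here
        if expect != t:
            accepted.append(expect)      # replace first mismatch, then stop
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_model(ctx))  # bonus token on full acceptance
    return accepted
```

When the draft agrees often, each target pass yields several tokens instead of one, which is where the speedup comes from.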
Split model layers across multiple GPUs. Serve models larger than single GPU memory with minimal communication overhead. Linear scaling.
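One common tensor-parallel building block is a column-parallel linear layer: each device holds a vertical slice of the weight matrix, computes its slice locally, and the outputs are concatenated (an all-gather) rather than summed. The sketch below simulates this on nested lists with assumed helper names; it is a conceptual illustration, not the engine's kernel.

```python
def matmul(A, B):
    """Plain dense matmul on nested lists (reference result)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def column_parallel_matmul(A, B, shards=2):
    """Toy column-parallel layer: each 'GPU' owns a slice of B's columns."""
    cols = list(zip(*B))
    per = len(cols) // shards            # columns per shard (assumes even split)
    parts = []
    for s in range(shards):
        # Local weight slice held by shard s, transposed back to row-major.
        local = [list(r) for r in zip(*cols[s * per:(s + 1) * per])]
        parts.append(matmul(A, local))   # each shard computes independently
    # All-gather: concatenate each shard's output columns, row by row.
    return [sum((p[i] for p in parts), []) for i in range(len(A))]
```

The only communication is the final gather of output slices, which is why layer-wise sharding scales well across GPUs.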
Quantize models on load. Load BF16 weights and serve at INT4/INT8 with a single flag. No separate quantization step or tool required.
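The core of quantize-on-load is simple per-tensor rounding at model load time. The sketch below shows symmetric INT8 quantization of one weight row; it is a toy of the general idea (assuming a nonzero row), not the engine's quantizer, which works per-block on real tensors.

```python
def quantize_int8(weights):
    """Toy symmetric INT8 quantization: scale by the max magnitude so
    values land in [-127, 127], then round to integers."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate BF16/FP16-like values from INT8 + scale."""
    return [v * scale for v in q]
```

Each stored value shrinks from 16 bits to 8 (or 4, with a narrower range), at the cost of a small, bounded rounding error per weight.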
Automatically cache and reuse KV computations for shared prompt prefixes across requests. Dramatically reduces time-to-first-token for repeated contexts.
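Prefix caching can be sketched as a lookup keyed on token prefixes: at prefill time, find the longest prefix whose KV entries already exist and compute only the tail. This is an illustrative toy with assumed names; real implementations typically match at block granularity and evict under memory pressure.

```python
class PrefixCache:
    """Toy prefix KV reuse: a shared system prompt is computed once."""
    def __init__(self):
        self.cache = {}    # tuple(prefix tokens) -> cached "KV" marker
        self.computed = 0  # tokens actually (re)computed, for illustration

    def prefill(self, tokens):
        # Find the longest already-cached prefix of this prompt.
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self.cache:
                hit = i
                break
        # Compute KV only for the uncached tail, caching each new prefix.
        for i in range(hit, len(tokens)):
            self.computed += 1
            self.cache[tuple(tokens[:i + 1])] = True
        return hit  # number of tokens served straight from cache
```

Requests sharing a long system prompt pay its prefill cost once, which is what drives the time-to-first-token improvement.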
60+ architectures across text, code, vision, and embedding tasks. All major quantization formats supported.
| Family | Models | Quantization | Context |
|---|---|---|---|
| Zen | zen4, zen4-max, zen4-coder, zen4-mini, zen4-ultra | GGUF, BF16, FP16 | Up to 262K |
| Llama | Llama 3.1, 3.2, 3.3, 4 | GGUF, GPTQ, AWQ, EXL2 | Up to 128K |
| Qwen | Qwen3, QwQ | GGUF, GPTQ | Up to 131K |
| Phi | Phi-3, Phi-4, Phi-MoE | GGUF | Up to 128K |
| Gemma | Gemma 2, Gemma 3 | GGUF | Up to 8K |
| Mistral | Mistral, Mixtral, Mamba | GGUF, EXL2 | Up to 32K |
| Vision | Phi-3V, LLaVA, Idefics 3 | GGUF | Up to 128K |
| Embedding | GTE, E5, BGE | FP16 | Up to 8K |
Drop-in replacement for the OpenAI API. Use existing client libraries, tools, and integrations with zero changes.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "zenlm/zen4-mini",
"messages": [
{"role": "user", "content": "Hello"}
],
"stream": true
}'
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="zenlm/zen4-mini",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)

print(response.choices[0].message.content)
Hanzo Engine is for cloud servers and production API serving. Hanzo Edge is for on-device and in-browser inference. Same model format, different deployment targets.
| | Hanzo Edge | Hanzo Engine |
|---|---|---|
| Target | Devices, browsers | Cloud servers |
| Models | GGUF quantized | All formats |
| GPU | Metal, CPU, WASM | CUDA, Metal, CPU |
| Batching | Single request | Continuous batching |
| Scale | Single user | Production serving |
| Install | cargo install hanzo-edge | cargo install hanzo-engine |
Token generation throughput across hardware configurations. Measured with continuous batching enabled, batch size 1, output length 512.