Cloud Inference Runtime

Hanzo Engine

Production AI inference at any scale

Serve Zen models and 60+ architectures with CUDA, Metal, and CPU backends. OpenAI-compatible API. Deploy anywhere.

curl -sSL https://engine.hanzo.ai/install.sh | sh

Built for production inference

Everything you need to serve large language models at scale, with zero configuration overhead.

60+ Architectures

Llama, Qwen3, Phi, Gemma, Mistral, DBRX, Starcoder2 -- all major model families. Load from HuggingFace or local files.

GPU Accelerated

CUDA 12+, Metal, CPU with AVX2/AVX-512. Automatic backend selection based on detected hardware.

Production Ready

Paged attention v2, continuous batching, speculative decoding, tensor parallelism. Built for throughput.

OpenAI Compatible

Drop-in replacement. Same API, same client libraries, your infrastructure. No code changes required.

Optimized for Zen

First-class support for every Zen model. Pre-optimized serving profiles, automatic quantization selection, and native architecture support.

zen4

Flagship reasoning
Params: 744B MoE · Context: 202K
hanzo-engine serve \
  --model zenlm/zen4 \
  --port 8000

zen4-max

Maximum capability
Params: 1.04T MoE · Context: 256K
hanzo-engine serve \
  --model zenlm/zen4-max \
  --port 8000

zen4-coder

Code generation
Params: 480B MoE · Context: 262K
hanzo-engine serve \
  --model zenlm/zen4-coder \
  --port 8000

zen4-mini

Fast and efficient
Params: 8B dense · Context: 40K
hanzo-engine serve \
  --model zenlm/zen4-mini \
  --port 8000

zen4-ultra

Extended thinking
Params: 744B MoE + CoT · Context: 202K
hanzo-engine serve \
  --model zenlm/zen4-ultra \
  --port 8000

zen3-omni

Multimodal
Params: ~200B · Context: 202K
hanzo-engine serve \
  --model zenlm/zen3-omni \
  --port 8000

Deploy anywhere

From a single CLI command to a multi-replica Kubernetes deployment. Choose the method that fits your infrastructure.

# Install
curl -sSL https://engine.hanzo.ai/install.sh | sh

# Or via cargo
cargo install hanzo-engine

# Serve a model
hanzo-engine serve --model zenlm/zen4-mini --port 8000

# With CUDA backend
hanzo-engine serve --model zenlm/zen4 --port 8000 --features cuda

# With tensor parallelism across 4 GPUs
hanzo-engine serve --model zenlm/zen4 --port 8000 --tp 4

# Docker: pull and run with GPU support
docker run -p 8000:8000 --gpus all \
  ghcr.io/hanzoai/engine:latest \
  serve --model zenlm/zen4 --port 8000

# With volume mount for local models
docker run -p 8000:8000 --gpus all \
  -v /models:/models \
  ghcr.io/hanzoai/engine:latest \
  serve --model /models/zen4-mini --port 8000

# CPU-only
docker run -p 8000:8000 \
  ghcr.io/hanzoai/engine:latest \
  serve --model zenlm/zen4-mini --port 8000

# Kubernetes: Deployment and Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-engine
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hanzo-engine
  template:
    metadata:
      labels:
        app: hanzo-engine
    spec:
      containers:
      - name: engine
        image: ghcr.io/hanzoai/engine:latest
        args: ["serve", "--model", "zenlm/zen4", "--port", "8000"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-engine
spec:
  selector:
    app: hanzo-engine
  ports:
  - port: 8000
    targetPort: 8000

Architecture deep dive

Purpose-built inference primitives that maximize throughput and minimize latency at every layer of the stack.

Paged Attention v2

2-4x throughput improvement

Efficient KV-cache management with non-contiguous memory blocks. Serves more concurrent requests with less GPU memory. Eliminates fragmentation waste.
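As a rough illustration (a toy sketch, not Engine's actual implementation), the block-table idea behind paged attention can be shown in a few lines of Python: sequences own lists of fixed-size cache blocks drawn from a shared free pool, so KV memory need not be contiguous and the only waste is the unused tail of a sequence's last block.

```python
BLOCK_SIZE = 16  # tokens per cache block (illustrative value)

class BlockAllocator:
    """Toy paged KV-cache bookkeeping: seq -> block table -> pooled blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        # Allocate a new block only when the sequence crosses a block boundary.
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free.pop())

    def release(self, seq_id):
        # Finished sequences return their blocks to the pool immediately,
        # so no fragmentation accumulates between requests.
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=8)
for pos in range(40):              # a 40-token sequence
    alloc.append_token("req-1", pos)
assert len(alloc.tables["req-1"]) == 3   # ceil(40/16) blocks, non-contiguous
alloc.release("req-1")
assert len(alloc.free) == 8              # all blocks reusable right away
```

The real kernel additionally indexes attention reads through the block table; the bookkeeping above is the part that lets many sequences share one GPU-memory pool.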

Continuous Batching

Maximum GPU utilization

Dynamic request scheduling that adds and removes sequences on the fly. No wasted compute on padding. Requests start generating immediately.

Speculative Decoding

2-3x faster generation

Draft model acceleration. A smaller model proposes candidate tokens, verified by the target model in parallel. Mathematically equivalent output.
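The accept/reject rule that makes the output mathematically equivalent can be sketched over a toy 3-token vocabulary (the distributions here are made up for illustration): each draft token is kept with probability min(1, p_target/p_draft), and on rejection a token is resampled from the residual distribution.

```python
import random

random.seed(0)
VOCAB = [0, 1, 2]

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(draft_dist, target_dist, k=4):
    """One round: draft proposes up to k tokens, target verifies each."""
    accepted = []
    for _ in range(k):
        tok = sample(draft_dist)
        if random.random() < min(1.0, target_dist[tok] / draft_dist[tok]):
            accepted.append(tok)  # target agrees often enough: keep it
        else:
            # On rejection, resample from the (renormalized) residual
            # max(target - draft, 0) and end the round. This correction is
            # what preserves the target model's exact output distribution.
            residual = [max(t - d, 0.0) for t, d in zip(target_dist, draft_dist)]
            accepted.append(sample(residual))
            break
    return accepted

draft = [0.5, 0.3, 0.2]    # cheap proposal distribution (illustrative)
target = [0.4, 0.4, 0.2]   # expensive verifier distribution (illustrative)
out = speculative_step(draft, target)
assert 1 <= len(out) <= 4 and all(t in VOCAB for t in out)
```

In the real pipeline the target model scores all k draft tokens in a single forward pass, which is where the 2-3x wall-clock speedup comes from.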

Tensor Parallelism

Scale to any model size

Split model layers across multiple GPUs. Serve models larger than single GPU memory with minimal communication overhead. Linear scaling.
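The core invariant, shown here as a toy column-parallel linear layer in plain Python (real deployments shard tensors on-device and use an all-gather instead of a list concat), is that sharded compute reproduces the single-device result exactly:

```python
def matmul(x, w):
    """x: vector of length n, w: n x m matrix -> output vector of length m."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def shard_columns(w, parts):
    """Split a weight matrix column-wise across `parts` devices."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0, 3.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

full = matmul(x, w)                     # single-device reference
shards = shard_columns(w, parts=2)      # each "GPU" holds half the columns
# Each shard computes its slice independently; concatenation plays the role
# of the all-gather collective in a real tensor-parallel setup.
parallel = [v for s in shards for v in matmul(x, s)]
assert parallel == full
```

Because each shard only needs its own slice of the weights, a model larger than any single GPU's memory still fits once split across enough devices.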

ISQ (In-Situ Quantization)

No pre-processing needed

Quantize models on load. Load BF16 weights and serve at INT4/INT8 with a single flag. No separate quantization step or tool required.
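The load-time step amounts to the following (a minimal symmetric int8 sketch; Engine's actual ISQ kernels and quantization formats are more sophisticated): pick a per-tensor scale from the weight range, round into the int8 grid, and dequantize on the fly at compute time.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization at load time."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # int8 codes in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.64, 0.005, -0.3]   # stand-in for BF16 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error is half a quantization step.
err = max(abs(a - b) for a, b in zip(weights, restored))
assert err <= scale / 2
```

Doing this at load means a single BF16 checkpoint serves at INT4/INT8 without any offline conversion artifact to manage.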

Prompt Caching

Reuse computed prefixes

Automatically cache and reuse KV computations for shared prompt prefixes across requests. Dramatically reduces time-to-first-token for repeated contexts.
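At its simplest (a toy exact-prefix cache; the real cache matches longest shared prefixes at block granularity), the idea is to key the KV computation for a prompt prefix by its tokens, so a repeated context skips straight to the new suffix:

```python
kv_cache = {}
computed = 0  # stand-in counter for prefill work actually performed

def prefill(tokens):
    """Return KV state for a prompt prefix, reusing cached work if present."""
    global computed
    key = tuple(tokens)
    if key in kv_cache:
        return kv_cache[key]            # cache hit: zero prefill cost
    computed += len(tokens)             # stand-in for attention FLOPs
    kv = [f"kv({t})" for t in tokens]   # placeholder for real KV tensors
    kv_cache[key] = kv
    return kv

system = ["<sys>", "You", "are", "helpful"]
prefill(system)   # first request pays for the shared system prompt
prefill(system)   # every later request with the same prefix is free
assert computed == len(system)
```

For workloads with a long shared system prompt, this is why time-to-first-token drops sharply after the first request.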

Supported model architectures

60+ architectures across text, code, vision, and embedding tasks. All major quantization formats supported.

Family Models Quantization Context
Zen zen4, zen4-max, zen4-coder, zen4-mini, zen4-ultra GGUF, BF16, FP16 Up to 262K
Llama Llama 3.1, 3.2, 3.3, 4 GGUF, GPTQ, AWQ, EXL2 Up to 128K
Qwen Qwen3, QwQ GGUF, GPTQ Up to 131K
Phi Phi-3, Phi-4, Phi-MoE GGUF Up to 128K
Gemma Gemma 2, Gemma 3 GGUF Up to 8K
Mistral Mistral, Mixtral, Mamba GGUF, EXL2 Up to 32K
Vision Phi-3V, LLaVA, Idefics 3 GGUF Up to 128K
Embedding GTE, E5, BGE FP16 Up to 8K

OpenAI-compatible endpoints

Drop-in replacement for the OpenAI API. Use existing client libraries, tools, and integrations with zero changes.

POST /v1/chat/completions Chat with streaming support
POST /v1/completions Text completions
POST /v1/embeddings Vector embeddings
GET /v1/models List loaded models
GET /health Health check
curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zenlm/zen4-mini",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": true
  }'
Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="zenlm/zen4-mini",
    messages=[
        {"role": "user",
         "content": "Hello"}
    ]
)
print(response.choices[0].message.content)

Choose your runtime

Hanzo Engine is for cloud servers and production API serving. Hanzo Edge is for on-device and in-browser inference. Same model format, different deployment targets.

Hanzo Edge Hanzo Engine
Target Devices, browsers Cloud servers
Models GGUF quantized All formats
GPU Metal, CPU, WASM CUDA, Metal, CPU
Batching Single request Continuous batching
Scale Single user Production serving
Install cargo install hanzo-edge cargo install hanzo-engine

Benchmarks

Per-request token generation throughput across hardware configurations. Measured with continuous batching enabled, a single active request (batch size 1), and an output length of 512 tokens.

zen4-mini
8B dense
A100 80GB ~120 tok/s
M3 Max ~45 tok/s
zen4
744B MoE
8x A100 80GB ~35 tok/s
8x H100 SXM ~58 tok/s
zen4-coder
480B MoE
8x A100 80GB ~40 tok/s
8x H100 SXM ~65 tok/s