Cloud Inference Runtime

Hanzo Engine

Production AI inference at any scale

Serve Zen models and 60+ architectures with CUDA, Metal, and CPU backends. OpenAI-compatible API. Deploy anywhere.

curl -sSL https://engine.hanzo.ai/install.sh | sh

Built for production inference

Everything you need to serve large language models at scale, with zero configuration overhead.

60+ Architectures

Llama, Qwen3, Phi, Gemma, Mistral, DBRX, Starcoder2 -- all major model families. Load from HuggingFace or local files.

GPU Accelerated

CUDA 12+, Metal, CPU with AVX2/AVX-512. Automatic backend selection based on detected hardware.

Production Ready

Paged attention v2, continuous batching, speculative decoding, tensor parallelism. Built for throughput.

OpenAI Compatible

Drop-in replacement. Same API, same client libraries, your infrastructure. No code changes required.

Optimized for Zen

First-class support for every Zen model. Pre-optimized serving profiles, automatic quantization selection, and native architecture support.

zen4

Flagship reasoning
Params: 744B MoE · Context: 202K
hanzo-engine serve \
  --model zenlm/zen4 \
  --port 8000

zen4-max

Maximum capability
Params: 1.04T MoE · Context: 256K
hanzo-engine serve \
  --model zenlm/zen4-max \
  --port 8000

zen4-coder

Code generation
Params: 480B MoE · Context: 262K
hanzo-engine serve \
  --model zenlm/zen4-coder \
  --port 8000

zen4-mini

Fast and efficient
Params: 8B dense · Context: 40K
hanzo-engine serve \
  --model zenlm/zen4-mini \
  --port 8000

zen4-ultra

Extended thinking
Params: 744B MoE + CoT · Context: 202K
hanzo-engine serve \
  --model zenlm/zen4-ultra \
  --port 8000

zen3-omni

Multimodal
Params: ~200B · Context: 202K
hanzo-engine serve \
  --model zenlm/zen3-omni \
  --port 8000

Deploy anywhere

From a single CLI command to a multi-replica Kubernetes deployment. Choose the method that fits your infrastructure.

# Install
curl -sSL https://engine.hanzo.ai/install.sh | sh

# Or via cargo
cargo install hanzo-engine

# Serve a model
hanzo-engine serve --model zenlm/zen4-mini --port 8000

# With CUDA backend
hanzo-engine serve --model zenlm/zen4 --port 8000 --features cuda

# With tensor parallelism across 4 GPUs
hanzo-engine serve --model zenlm/zen4 --port 8000 --tp 4

# Docker: pull and run with GPU support
docker run -p 8000:8000 --gpus all \
  ghcr.io/hanzoai/engine:latest \
  serve --model zenlm/zen4 --port 8000

# With volume mount for local models
docker run -p 8000:8000 --gpus all \
  -v /models:/models \
  ghcr.io/hanzoai/engine:latest \
  serve --model /models/zen4-mini --port 8000

# CPU-only
docker run -p 8000:8000 \
  ghcr.io/hanzoai/engine:latest \
  serve --model zenlm/zen4-mini --port 8000

# Kubernetes: Deployment and Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-engine
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hanzo-engine
  template:
    metadata:
      labels:
        app: hanzo-engine
    spec:
      containers:
      - name: engine
        image: ghcr.io/hanzoai/engine:latest
        args: ["serve", "--model", "zenlm/zen4", "--port", "8000"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-engine
spec:
  selector:
    app: hanzo-engine
  ports:
  - port: 8000
    targetPort: 8000

Architecture deep dive

Purpose-built inference primitives that maximize throughput and minimize latency at every layer of the stack.

Paged Attention v2

2-4x throughput improvement

Efficient KV-cache management with non-contiguous memory blocks. Serves more concurrent requests with less GPU memory. Eliminates fragmentation waste.
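As a rough illustration (a toy sketch, not Engine's actual implementation), the block-table idea behind paged attention can be shown in a few lines of Python: sequences own lists of fixed-size cache blocks drawn from a shared free pool, so KV memory need not be contiguous and the only waste is the unused tail of a sequence's last block.

```python
BLOCK_SIZE = 16  # tokens per cache block (illustrative value)

class BlockAllocator:
    """Toy paged KV-cache bookkeeping: seq -> block table -> pooled blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        # Allocate a new block only when the sequence crosses a block boundary.
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free.pop())

    def release(self, seq_id):
        # Finished sequences return their blocks to the pool immediately,
        # so no fragmentation accumulates between requests.
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=8)
for pos in range(40):              # a 40-token sequence
    alloc.append_token("req-1", pos)
assert len(alloc.tables["req-1"]) == 3   # ceil(40/16) blocks, non-contiguous
alloc.release("req-1")
assert len(alloc.free) == 8              # all blocks reusable right away
```

The real kernel additionally indexes attention reads through the block table; the bookkeeping above is the part that lets many sequences share one GPU-memory pool.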

Continuous Batching

Maximum GPU utilization

Dynamic request scheduling that adds and removes sequences on the fly. No wasted compute on padding. Requests start generating immediately.

Speculative Decoding

2-3x faster generation

Draft model acceleration. A smaller model proposes candidate tokens, verified by the target model in parallel. Mathematically equivalent output.
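The accept/reject rule that makes the output mathematically equivalent can be sketched over a toy 3-token vocabulary (the distributions here are made up for illustration): each draft token is kept with probability min(1, p_target/p_draft), and on rejection a token is resampled from the residual distribution.

```python
import random

random.seed(0)
VOCAB = [0, 1, 2]

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(draft_dist, target_dist, k=4):
    """One round: draft proposes up to k tokens, target verifies each."""
    accepted = []
    for _ in range(k):
        tok = sample(draft_dist)
        if random.random() < min(1.0, target_dist[tok] / draft_dist[tok]):
            accepted.append(tok)  # target agrees often enough: keep it
        else:
            # On rejection, resample from the (renormalized) residual
            # max(target - draft, 0) and end the round. This correction is
            # what preserves the target model's exact output distribution.
            residual = [max(t - d, 0.0) for t, d in zip(target_dist, draft_dist)]
            accepted.append(sample(residual))
            break
    return accepted

draft = [0.5, 0.3, 0.2]    # cheap proposal distribution (illustrative)
target = [0.4, 0.4, 0.2]   # expensive verifier distribution (illustrative)
out = speculative_step(draft, target)
assert 1 <= len(out) <= 4 and all(t in VOCAB for t in out)
```

In the real pipeline the target model scores all k draft tokens in a single forward pass, which is where the 2-3x wall-clock speedup comes from.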

Tensor Parallelism

Scale to any model size

Split model layers across multiple GPUs. Serve models larger than single GPU memory with minimal communication overhead. Linear scaling.
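The core invariant, shown here as a toy column-parallel linear layer in plain Python (real deployments shard tensors on-device and use an all-gather instead of a list concat), is that sharded compute reproduces the single-device result exactly:

```python
def matmul(x, w):
    """x: vector of length n, w: n x m matrix -> output vector of length m."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def shard_columns(w, parts):
    """Split a weight matrix column-wise across `parts` devices."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0, 3.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

full = matmul(x, w)                     # single-device reference
shards = shard_columns(w, parts=2)      # each "GPU" holds half the columns
# Each shard computes its slice independently; concatenation plays the role
# of the all-gather collective in a real tensor-parallel setup.
parallel = [v for s in shards for v in matmul(x, s)]
assert parallel == full
```

Because each shard only needs its own slice of the weights, a model larger than any single GPU's memory still fits once split across enough devices.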

ISQ (In-Situ Quantization)

No pre-processing needed

Quantize models on load. Load BF16 weights and serve at INT4/INT8 with a single flag. No separate quantization step or tool required.
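The load-time step amounts to the following (a minimal symmetric int8 sketch; Engine's actual ISQ kernels and quantization formats are more sophisticated): pick a per-tensor scale from the weight range, round into the int8 grid, and dequantize on the fly at compute time.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization at load time."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # int8 codes in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.64, 0.005, -0.3]   # stand-in for BF16 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error is half a quantization step.
err = max(abs(a - b) for a, b in zip(weights, restored))
assert err <= scale / 2
```

Doing this at load means a single BF16 checkpoint serves at INT4/INT8 without any offline conversion artifact to manage.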

Prompt Caching

Reuse computed prefixes

Automatically cache and reuse KV computations for shared prompt prefixes across requests. Dramatically reduces time-to-first-token for repeated contexts.
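At its simplest (a toy exact-prefix cache; the real cache matches longest shared prefixes at block granularity), the idea is to key the KV computation for a prompt prefix by its tokens, so a repeated context skips straight to the new suffix:

```python
kv_cache = {}
computed = 0  # stand-in counter for prefill work actually performed

def prefill(tokens):
    """Return KV state for a prompt prefix, reusing cached work if present."""
    global computed
    key = tuple(tokens)
    if key in kv_cache:
        return kv_cache[key]            # cache hit: zero prefill cost
    computed += len(tokens)             # stand-in for attention FLOPs
    kv = [f"kv({t})" for t in tokens]   # placeholder for real KV tensors
    kv_cache[key] = kv
    return kv

system = ["<sys>", "You", "are", "helpful"]
prefill(system)   # first request pays for the shared system prompt
prefill(system)   # every later request with the same prefix is free
assert computed == len(system)
```

For workloads with a long shared system prompt, this is why time-to-first-token drops sharply after the first request.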

Supported model architectures

60+ architectures across text, code, vision, and embedding tasks. All major quantization formats supported.

Family Models Quantization Context
Zen zen4, zen4-max, zen4-coder, zen4-mini, zen4-ultra GGUF, BF16, FP16 Up to 262K
Llama Llama 3.1, 3.2, 3.3, 4 GGUF, GPTQ, AWQ, EXL2 Up to 128K
Qwen Qwen3, QwQ GGUF, GPTQ Up to 131K
Phi Phi-3, Phi-4, Phi-MoE GGUF Up to 128K
Gemma Gemma 2, Gemma 3 GGUF Up to 8K
Mistral Mistral, Mixtral, Mamba GGUF, EXL2 Up to 32K
Vision Phi-3V, LLaVA, Idefics 3 GGUF Up to 128K
Embedding GTE, E5, BGE FP16 Up to 8K

OpenAI-compatible endpoints

Drop-in replacement for the OpenAI API. Use existing client libraries, tools, and integrations with zero changes.

POST /v1/chat/completions Chat with streaming support
POST /v1/completions Text completions
POST /v1/embeddings Vector embeddings
GET /v1/models List loaded models
GET /health Health check
curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zenlm/zen4-mini",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": true
  }'
Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="zenlm/zen4-mini",
    messages=[
        {"role": "user",
         "content": "Hello"}
    ]
)
print(response.choices[0].message.content)

Choose your runtime

Hanzo Engine is for cloud servers and production API serving. Hanzo Edge is for on-device and in-browser inference. Same model format, different deployment targets.

Hanzo Edge Hanzo Engine
Target Devices, browsers Cloud servers
Models GGUF quantized All formats
GPU Metal, CPU, WASM CUDA, Metal, CPU
Batching Single request Continuous batching
Scale Single user Production serving
Install cargo install hanzo-edge cargo install hanzo-engine

Benchmarks

Per-request token generation throughput across hardware configurations. Measured with continuous batching enabled, a single active request (batch size 1), and an output length of 512 tokens.

zen4-mini
8B dense
A100 80GB ~120 tok/s
M3 Max ~45 tok/s
zen4
744B MoE
8x A100 80GB ~35 tok/s
8x H100 SXM ~58 tok/s
zen4-coder
480B MoE
8x A100 80GB ~40 tok/s
8x H100 SXM ~65 tok/s