Service

Private Inference

Run AI models on your own infrastructure — no data leaves your network, no per-token bills, no vendor lock-in. Predictable costs, full compliance, and production-grade performance.

Why self-host inference?

  • • Data never leaves your network — no third-party API calls
  • • Predictable costs — no per-token pricing surprises at scale
  • • Compliance-ready — VN AI Law, GDPR, banking secrecy, air-gapped environments
  • • Low latency — no network hop to external providers
  • • Full control — choose any model, swap anytime, no vendor lock-in
  • • Offline capable — runs without internet connectivity
  • • Single binary — deploy Candle or xinfer with zero runtime dependencies

Overview

What is Private Inference?

Private inference means running AI models on hardware you control — your servers, your VPS, or your edge devices — instead of sending data to third-party API providers. The models are the same frontier and open-weight models you would use via Anthropic, OpenAI, or Google, but inference happens locally.

This is not about training models. It is about serving them. You download a pre-trained model, deploy an inference engine on your infrastructure, and make predictions via a local API — just like calling OpenAI, but every request stays inside your network boundary.

Your data

Inference runs on your hardware — no data ever leaves your network boundary

Your model

Choose any open-weight model, swap anytime, no vendor lock-in

Your infrastructure

Single binary deployment — no Python, no Node, no Docker required

🕯️

General purpose

Candle

Minimalist ML framework from Hugging Face. Ideal for most self-hosted inference scenarios — small binary (~8 MB), broad model support (Llama, Mistral, Phi, Gemma, Stable Diffusion, Whisper, YOLO), GPU acceleration via CUDA and Metal, and first-class WASM support for browser or edge deployment.

  • • 20k+ GitHub stars, 260+ contributors — large, active community
  • • Quantization support via GGUF — run models on consumer GPUs
  • • Apache 2.0 licensed — fully open source

Large models, single GPU

xinfer

Blazing-fast LLM inference in pure Rust with zero Python dependencies. Designed for running 30B+ parameter models on single consumer GPUs. Achieves up to 197 tok/s decode for large models with aggressive KV cache compression (TurboQuant 2–4 bit) and continuous batching.

  • • Flash attention, FlashInfer, CUDA Graphs, prefix caching
  • • Runs on older GPUs (V100) with NVFP4 support
  • • OpenAI and Anthropic-compatible API — drop-in replacement
  • • Built-in Web UI and MCP tool calling

Comparison

API vs Self-Hosted Inference

Most teams should start with API-based inference (Claude, GPT, Gemini) — it is the fastest path to value. Self-hosting makes sense when volume, privacy, or compliance requirements outweigh the operational overhead.

API-Based

API-based inference (Claude, GPT, Gemini) is the fastest way to get value from AI. You sign up, get an API key, and start making requests in minutes. The provider handles infrastructure, model updates, and scaling. Per-token pricing is ideal for low-volume, variable, or exploratory workloads where fixed infrastructure costs would be wasteful.

  • • Time to value: Minutes — sign up and call an API
  • • Cost at low volume: Pay per token — no fixed cost
  • • Cost at high volume: Per-token cost grows linearly with usage
  • • Data privacy: Data sent to third-party servers
  • • Compliance: Depends on provider's certifications
  • • Model selection: Access to frontier models (Claude 4, GPT-5, Gemini 2.5)
  • • Latency: Adds network round-trip time
  • • Offline capable: Requires internet connectivity
  • • Operational overhead: Zero — provider manages infrastructure

Self-Hosted

Self-hosted inference runs open-weight models on your own hardware. The same models you would call via API (Llama, Mistral, Qwen, Gemma, Phi) run locally with zero data leaving your network. Candle and xinfer deploy as single Rust binaries with no Python or Node dependencies — just download, configure, and serve.

Our recommendation: Start with API-based inference to validate your use case and prove ROI. Migrate to self-hosted when your volume justifies the fixed infrastructure cost, your data privacy requirements demand it, or you need offline capability. We help you plan the transition at the right time.

  • • Time to value: Days — provision hardware, download model, configure engine
  • • Cost at low volume: Fixed GPU/server cost regardless of usage
  • • Cost at high volume: Fixed cost — marginal cost per request near zero
  • • Data privacy: Data stays on your hardware
  • • Compliance: Full control — meet any regulatory requirement
  • • Model selection: Limited to open-weight models
  • • Latency: Sub-millisecond local API calls
  • • Offline capable: Runs fully offline
  • • Operational overhead: You manage hardware, model updates, monitoring