
Self-Hosted LLM APIs

Project: Custom LLM Inference Engine
Type: R&D / Internal Product
Status: In development

The Problem

Companies using OpenAI/Anthropic APIs face three concerns:

  1. Cost volatility — providers can raise prices anytime
  2. Data privacy — your prompts and data flow through third-party systems
  3. Control — operational consistency depends on someone else's infrastructure

The standard self-hosted alternatives (vLLM, llama.cpp server) are Python-based or bring their own operational complexity.

The Approach

ML should be an engineering concern like databases or caching. No Python. Binary deployment. Custom GPU kernels.

What We're Building

  • GoLang API server with CGo bindings to llama.cpp (see the sketch after this list)
  • Custom binary tokenizer (written from scratch)
  • Custom GPU kernels (Metal first, portable to CUDA):
    • Concurrent Q/K/V dispatch
    • Fused QKV projection
    • Quantized KV cache (Q8)
    • Function constant specialization
  • Dynamic KV cache management (vs static preallocation in vLLM/llama.cpp)
  • Batched attention with continuous batching
  • Single binary deployment
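
A minimal sketch of how the API layer could sit in front of the CGo boundary, under stated assumptions. The `engine_generate` function in the cgo preamble is a hypothetical stub standing in for the real llama.cpp binding; in the actual engine the preamble would include llama.cpp's C headers and link against the compiled library. The HTTP route and JSON shapes are illustrative, not the project's actual API.

```go
package main

/*
// Hypothetical stand-in for the native inference engine. The real
// binding would #include llama.cpp's C API and call into it; this stub
// only exists so the sketch compiles on its own.
#include <stdlib.h>
#include <string.h>

static char* engine_generate(const char* prompt) {
    const char* out = "<generated text>";
    char* buf = (char*)malloc(strlen(out) + 1);
    strcpy(buf, out);
    return buf;
}
*/
import "C"

import (
	"encoding/json"
	"log"
	"net/http"
	"unsafe"
)

type completionRequest struct {
	Prompt string `json:"prompt"`
}

type completionResponse struct {
	Text string `json:"text"`
}

func completionHandler(w http.ResponseWriter, r *http.Request) {
	var req completionRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Cross the CGo boundary: copy the Go string into C memory, call the
	// native engine, copy the result back out, and free both C buffers.
	cPrompt := C.CString(req.Prompt)
	defer C.free(unsafe.Pointer(cPrompt))

	cOut := C.engine_generate(cPrompt)
	defer C.free(unsafe.Pointer(cOut))

	json.NewEncoder(w).Encode(completionResponse{Text: C.GoString(cOut)})
}

func main() {
	http.HandleFunc("/v1/completions", completionHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because everything (HTTP layer, tokenizer, bindings) compiles into one Go binary, deployment is `scp` plus a model file rather than a Python environment.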

Why Custom Kernels

The existing frameworks make tradeoffs we don't want: static KV cache preallocation, Python runtimes, and multi-component deployments. We're optimizing for dynamic memory management and operational simplicity.
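
A rough illustration of the dynamic KV cache idea, under assumptions of my own: instead of reserving a full context window per sequence up front, fixed-size pages are handed out as sequences grow and returned to a pool when they finish. The `Pool` type, page size, and method names below are hypothetical, not the project's implementation.

```go
package kvcache

import "errors"

// pageTokens is how many token positions one page holds. Pages are
// allocated on demand as a sequence grows, rather than reserving the
// whole context window when the sequence starts.
const pageTokens = 64

var ErrOutOfPages = errors.New("kv cache: no free pages")

// Pool tracks which physical pages are free and which pages each live
// sequence currently owns, in order.
type Pool struct {
	free []int         // indices of unused physical pages
	seqs map[int][]int // sequence ID -> owned physical pages
}

func NewPool(totalPages int) *Pool {
	free := make([]int, totalPages)
	for i := range free {
		free[i] = i
	}
	return &Pool{free: free, seqs: make(map[int][]int)}
}

// Append reserves space for one more token of a sequence, allocating a
// new page only when the previous one is full. Tokens are assumed to
// arrive in order per sequence.
func (p *Pool) Append(seqID, tokenIndex int) (page, slot int, err error) {
	slot = tokenIndex % pageTokens
	if slot == 0 { // first token of a fresh page
		if len(p.free) == 0 {
			return 0, 0, ErrOutOfPages
		}
		page = p.free[len(p.free)-1]
		p.free = p.free[:len(p.free)-1]
		p.seqs[seqID] = append(p.seqs[seqID], page)
	}
	pages := p.seqs[seqID]
	return pages[len(pages)-1], slot, nil
}

// Release returns a finished sequence's pages to the free list.
func (p *Pool) Release(seqID int) {
	p.free = append(p.free, p.seqs[seqID]...)
	delete(p.seqs, seqID)
}
```

Under a scheme like this, cache memory tracks the tokens that are actually live rather than worst-case context length per slot, which is the tradeoff the "dynamic KV cache management" bullet above refers to.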

R&D. Not yet deployed to production. Available for early adopter engagements.

Tags: GoLang, CGo, Metal, CUDA