Deploying LLMs On-Prem: A Practical Guide to vLLM, llama.cpp, and KV Cache Optimization

January 15, 2026

A hands-on engineering guide to deploying large language models on-premise using vLLM and llama.cpp, with real-world insights on KV cache optimization, quantization strategies, and throughput tuning from production deployments.


Deploying large language models in production is a different beast than running them in a notebook. When we deployed MiniMax-M2.5 at Presight AI for an internal knowledge assistant, we quickly learned that the gap between "it works on my machine" and "it handles 200 concurrent users on enterprise GPUs" is enormous. This guide distills what we learned into actionable steps for anyone looking to deploy LLMs on-premise.

Why On-Prem LLM Deployment Still Matters

Not every organization can send sensitive data to OpenAI or Anthropic. In regulated industries—finance, defense, healthcare—on-prem LLM deployment is often the only option. At Presight AI, data sovereignty is non-negotiable. We needed full control over the inference stack: the model weights, the serving layer, the hardware, and every byte of data flowing through the system.

On-prem also gives you cost predictability. Once you own the GPUs, inference is essentially free at the margin. For high-throughput workloads—hundreds of thousands of requests per day—the economics of self-hosting beat API pricing within months.
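To make that claim concrete, here is a back-of-the-envelope break-even calculation. Every number in it is a hypothetical placeholder, not our actual hardware or API pricing:

```python
# Hypothetical break-even sketch: self-hosted GPUs vs. pay-per-token API.
# All figures are illustrative assumptions, not real pricing.

def months_to_break_even(
    gpu_capex: float,          # upfront hardware cost
    monthly_opex: float,       # power, rack space, maintenance
    tokens_per_day: float,     # workload volume
    api_cost_per_mtok: float,  # blended API price per million tokens
) -> float:
    """Months until cumulative API spend exceeds self-hosting cost."""
    monthly_api_cost = tokens_per_day * 30 / 1e6 * api_cost_per_mtok
    monthly_savings = monthly_api_cost - monthly_opex
    if monthly_savings <= 0:
        return float("inf")  # self-hosting never pays off at this volume
    return gpu_capex / monthly_savings

# e.g. $60k of GPUs, $2k/month opex, 100M tokens/day, $5/Mtok blended
print(round(months_to_break_even(60_000, 2_000, 100e6, 5.0), 1))
```

At high volume the payback window is measured in months; at low volume it never closes, which is the whole decision in one function.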

Choosing Between vLLM and llama.cpp

The two dominant open-source inference engines are vLLM and llama.cpp. They solve the same problem but make very different tradeoffs.

vLLM: The Throughput King

vLLM is built for GPU-heavy, high-throughput serving. Its killer feature is PagedAttention, which manages KV cache memory like virtual memory pages, eliminating fragmentation and enabling near-optimal GPU utilization. If you have NVIDIA A100s or H100s and need to serve many concurrent users, vLLM is almost always the right choice.

# Install vLLM
pip install vllm

# Launch an OpenAI-compatible API server with MiniMax-M2.5
python -m vllm.entrypoints.openai.api_server \
  --model MiniMaxAI/MiniMax-M1-80k \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --enforce-eager \
  --port 8000 \
  --api-key your-secret-key

Key flags explained:

  • --tensor-parallel-size 4: Shard the model across 4 GPUs. Essential for models that don't fit on a single GPU.
  • --gpu-memory-utilization 0.92: Use 92% of GPU memory. We found 0.90–0.93 to be the sweet spot—higher risks OOM under bursty traffic.
  • --max-model-len 32768: Cap the context window. Reducing this from the model's maximum dramatically reduces KV cache memory.
  • --enforce-eager: Disable CUDA graph capture. Useful during debugging; remove for production to get 10–15% throughput gains.

llama.cpp: The Efficiency Play

llama.cpp shines when you need to run models on constrained hardware—single GPUs, CPU-only servers, or mixed CPU+GPU setups. Its GGUF quantization format is mature, and the inference engine is ruthlessly optimized for memory efficiency.

# Build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Quantize a model to 4-bit GGUF
./build/bin/llama-quantize \
  ./models/minimax-m2.5-f16.gguf \
  ./models/minimax-m2.5-Q4_K_M.gguf \
  Q4_K_M

# Serve with the built-in API server
./build/bin/llama-server \
  -m ./models/minimax-m2.5-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --parallel 4 \
  -t 8

Key flags for llama.cpp serving:

  • -ngl 99: Offload all layers to GPU. Set lower if you want to split between CPU and GPU.
  • -c 8192: Context length. Directly controls KV cache size.
  • --parallel 4: Number of concurrent request slots.
  • -t 8: CPU threads for any layers not offloaded to GPU.

Decision Matrix

| Factor | vLLM | llama.cpp |
|---|---|---|
| Multi-GPU support | Excellent (tensor parallel) | Limited |
| Throughput (high concurrency) | Superior | Good |
| Memory efficiency | Good (PagedAttention) | Excellent (quantization) |
| Hardware flexibility | GPU-centric | CPU, GPU, or mixed |
| Quantization | GPTQ, AWQ, FP8 | GGUF (Q2–Q8, K-quants) |
| Ease of deployment | pip install | Build from source |

When we deployed smaller SLMs (sub-7B parameter models) for low-latency classification tasks, llama.cpp on a single A10G was 40% cheaper than a vLLM setup and met our latency SLAs. For MiniMax-M2.5 serving the main knowledge assistant, vLLM across 4×A100-80GB was the only option that delivered the throughput we needed.

Quantization Strategies That Actually Work

Quantization is the single most impactful optimization you can make. It reduces model size, speeds up inference, and cuts memory usage—but the wrong quantization can destroy output quality.

For vLLM: AWQ and GPTQ

# Serve a pre-quantized AWQ model with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --max-model-len 16384 \
  --port 8000

AWQ (Activation-aware Weight Quantization) consistently outperformed GPTQ in our benchmarks for 4-bit quantization. The quality loss compared to FP16 was negligible on our evaluation suite—less than 1% degradation on internal retrieval-augmented generation accuracy benchmarks.

For llama.cpp: K-Quants Are King

The K-quant variants (Q4_K_M, Q5_K_M, Q6_K) use mixed-precision quantization, keeping more important weights at higher precision. Our testing showed:

  • Q4_K_M: Best bang for the buck. Roughly 4.8 bits per weight, so a 7B model lands around 4 GB on disk. Quality loss barely noticeable for chat and summarization.
  • Q5_K_M: Sweet spot for tasks requiring precision—code generation, structured data extraction.
  • Q6_K: Near-FP16 quality with roughly a 60% memory reduction versus FP16. Use when you can afford the extra VRAM.
  • Q2_K / Q3_K: Avoid for production. Quality drops are steep and unpredictable.
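Before downloading a quant, you can sanity-check whether it fits your VRAM: file size is roughly parameter count times bits per weight. The bpw figures below are approximations; exact sizes vary by architecture because different tensors get different quant types:

```python
# Rough GGUF file-size estimate: params * bits-per-weight / 8.
# bpw values are approximate; exact sizes vary slightly by model.
APPROX_BPW = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "F16": 16.0,
}

def approx_size_gib(num_params: float, quant: str) -> float:
    """Approximate on-disk size in GiB for a quantized model."""
    return num_params * APPROX_BPW[quant] / 8 / 2**30

# Sizes for a 7B-parameter model across quant levels
for quant in ("Q4_K_M", "Q5_K_M", "Q6_K", "F16"):
    print(f"{quant:>7}: {approx_size_gib(7e9, quant):.1f} GiB")
```

Add a safety margin on top of the weights for the KV cache and activation buffers, which the next section covers.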

KV Cache Optimization: The Hidden Performance Lever

The KV (key-value) cache stores attention states for each token in the sequence. For long-context models, it can consume more GPU memory than the model weights themselves. Optimizing KV cache is critical for both throughput and cost.

vLLM KV Cache Tuning

vLLM's PagedAttention allocates KV cache in fixed-size blocks (similar to memory pages). You control it primarily through these parameters:

# vllm_config.py - Production configuration
from vllm import LLM

llm = LLM(
    model="MiniMaxAI/MiniMax-M1-80k",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    max_model_len=32768,
    # KV cache specific tuning
    block_size=16,               # KV cache block size (default 16)
    swap_space=4,                # GiB of CPU swap for KV cache overflow
    max_num_seqs=256,            # Max concurrent sequences
    enable_prefix_caching=True,  # Reuse KV cache for shared prefixes
)

# Enable prefix caching for system prompts
# This avoids recomputing KV cache for the shared system prompt
# across all requests — saves 15-30% compute for RAG workloads

Prefix caching was a game-changer for us. Our RAG pipeline prepends a 2,000-token system prompt to every request. Without prefix caching, vLLM recomputes attention for that prompt on every request. With it enabled, we saw a 22% improvement in time-to-first-token (TTFT).

Estimating KV Cache Memory

A rough formula for KV cache memory per token:

KV cache per token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes

For a model with 80 layers, 8 KV heads, and 128-dimensional heads in FP16:

2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes ≈ 320 KB per token

For 256 concurrent sequences at 8K context each: 256 × 8192 × 320 KB ≈ 640 GB. That is why you need to carefully balance max_model_len, max_num_seqs, and gpu_memory_utilization.
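The arithmetic above is worth scripting as a capacity-planning helper:

```python
# KV-cache capacity planning, implementing the formula above.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes per token: 2 (K and V) x layers x kv_heads x head_dim x dtype."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# The worked example: 80 layers, 8 KV heads, 128-dim heads, FP16
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token // 1024, "KiB per token")   # 320 KiB, matching the text

# 256 concurrent sequences at 8K context each
total = 256 * 8192 * per_token
print(round(total / 2**30), "GiB total")    # 640 GiB
```

Running this for your own model's config (layer count, KV heads, head dimension are in its `config.json`) tells you immediately how max_model_len and max_num_seqs trade off against available VRAM.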

llama.cpp KV Cache Control

llama.cpp gives you direct control over KV cache sizing:

# Serve with explicit KV cache configuration
./build/bin/llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -c 4096 \
  --parallel 8 \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

Using --cache-type-k q8_0 and --cache-type-v q8_0 quantizes the KV cache itself to 8-bit, cutting cache memory by roughly 50% compared to FP16 with minimal quality impact. For workloads where context length matters more than perfect recall, we even tested q4_0 KV cache quantization and found it acceptable for summarization tasks.

Batch Size Tuning and Throughput Optimization

The relationship between batch size and throughput is not linear. Larger batches amortize the cost of loading model weights from GPU HBM, but they also increase latency per request and require more KV cache memory.

Finding the Right Batch Size

# Benchmark vLLM with different concurrent request loads
python -m vllm.entrypoints.openai.api_server \
  --model your-model \
  --max-num-seqs 64 \
  --port 8000 &

# Use vLLM's serving benchmark script (benchmarks/benchmark_serving.py
# in the vLLM repository)
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model your-model \
  --dataset-name random \
  --num-prompts 500 \
  --request-rate 20 \
  --random-input-len 512 \
  --random-output-len 256

For our MiniMax-M2.5 deployment, here is what we observed across different max_num_seqs settings on 4×A100-80GB:

| max_num_seqs | Throughput (tok/s) | Avg Latency (ms) | P99 Latency (ms) | GPU Mem Usage |
|---|---|---|---|---|
| 32 | 2,140 | 89 | 210 | 71% |
| 64 | 3,680 | 112 | 340 | 78% |
| 128 | 4,950 | 158 | 520 | 86% |
| 256 | 5,210 | 243 | 890 | 93% |

We settled on max_num_seqs=128 for production: it delivered 95% of peak throughput while keeping P99 latency under 600 ms and leaving headroom for traffic spikes.
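A quick script makes the diminishing returns in that table explicit (the numbers below are simply the measurements above):

```python
# Marginal gains from the benchmark table: throughput vs. P99 latency.
results = {  # max_num_seqs: (tokens/s, p99_ms)
    32:  (2_140, 210),
    64:  (3_680, 340),
    128: (4_950, 520),
    256: (5_210, 890),
}

peak = max(tps for tps, _ in results.values())
for seqs, (tps, p99) in results.items():
    print(f"{seqs:>3} seqs: {tps / peak:5.1%} of peak throughput, P99 {p99} ms")
```

The last doubling of max_num_seqs buys only a few percent more throughput while P99 latency nearly doubles, which is why we stopped at 128.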

Monitoring Your Deployment

You cannot optimize what you do not measure. We built a lightweight monitoring stack around Prometheus and Grafana:

# vLLM exposes Prometheus metrics natively
# Scrape from http://localhost:8000/metrics

# Key metrics to track:
# vllm:num_requests_running     - Current concurrent requests
# vllm:num_requests_waiting     - Queue depth (alert if consistently > 0)
# vllm:gpu_cache_usage_perc     - KV cache utilization (target: 70-85%)
# vllm:avg_generation_throughput_toks_per_s  - Tokens generated per second
# vllm:e2e_request_latency_seconds           - End-to-end request latency

For llama.cpp, we added a /health endpoint check and parsed the server logs for slot utilization. If all parallel slots are occupied for more than 30 seconds continuously, that is our signal to scale up.
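Spot-checking those metrics takes only a few lines, assuming the standard Prometheus text exposition format; the sample payload below is illustrative:

```python
# Minimal Prometheus text-format parser for spot-checking vLLM metrics.
# The sample payload is illustrative; in production, fetch the body of
# http://localhost:8000/metrics and feed it to parse_metrics().
import re

def parse_metrics(text: str) -> dict:
    """Extract metric name -> value pairs, skipping comments and labels."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r'([a-zA-Z_:][\w:]*)(?:\{[^}]*\})?\s+([-\d.eE+]+)', line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

sample = """\
# HELP vllm:num_requests_running Number of requests currently running
vllm:num_requests_running{model_name="m"} 12
vllm:num_requests_waiting{model_name="m"} 0
vllm:gpu_cache_usage_perc{model_name="m"} 0.74
"""
m = parse_metrics(sample)
print(f"KV cache at {m['vllm:gpu_cache_usage_perc']:.0%}, "
      f"queue depth {m['vllm:num_requests_waiting']:.0f}")
```

A full deployment should scrape with Prometheus proper; this sketch is useful for ad-hoc health checks and smoke tests in CI.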

Alerting Rules That Matter

From months of production operation, these are the alerts that actually caught real issues:

  • KV cache usage > 90% for 5 minutes: You are about to start rejecting requests. Reduce max_model_len or add GPUs.
  • Request queue depth > 10 for 2 minutes: Throughput is insufficient. Increase max_num_seqs or scale horizontally.
  • P99 latency > 2x baseline for 10 minutes: Something changed—check for new long-context prompts or model config drift.
  • GPU utilization < 50% with active requests: CPU bottleneck or suboptimal batching. Check tensor parallelism and data loading.

Putting It All Together: A Production Checklist

When we spin up a new on-prem LLM service at Presight AI, we follow this checklist:

  1. Model selection: Choose the smallest model that meets quality requirements. We always benchmark against our evaluation suite before committing to a model.
  2. Quantization: AWQ for vLLM, Q4_K_M or Q5_K_M for llama.cpp. Test quality before and after.
  3. Context length: Set max_model_len to the actual maximum you need, not the model's maximum.
  4. KV cache: Enable prefix caching if you have shared prompts. Size the cache for your expected concurrency.
  5. Batch tuning: Start with max_num_seqs=64 and benchmark up/down. Watch P99 latency, not just throughput.
  6. Monitoring: Prometheus metrics from day one. Alert on cache usage, queue depth, and latency.
  7. Load testing: Simulate 2x your expected peak traffic before going live.

Key Takeaways

  • vLLM is the default choice for multi-GPU, high-throughput LLM serving. Its PagedAttention mechanism and native tensor parallelism make it hard to beat for enterprise workloads.
  • llama.cpp wins on flexibility and is often the better choice for single-GPU deployments, CPU-only environments, or when you need aggressive quantization (sub-4-bit).
  • KV cache is your biggest memory consumer for long-context models. Use prefix caching, set realistic context length limits, and quantize the cache when possible.
  • Quantization is not optional for production. AWQ or Q4_K_M will get you 90%+ of FP16 quality at a fraction of the memory cost.
  • Measure everything: Throughput numbers mean nothing without latency percentiles and cache utilization metrics. Build monitoring before you optimize.
  • Right-size your deployment: The cheapest GPU is the one you do not need. Start with the smallest model and least hardware that meets your SLAs, then scale up with data.

On-prem LLM deployment is hard, but it is increasingly a solved problem with the right tooling. The open-source inference ecosystem—vLLM, llama.cpp, and the models flowing from the community—has made it possible for any team with decent GPU infrastructure to serve production-grade LLMs. The key is treating it like any other production system: measure, iterate, and never stop optimizing.


Manish Joshi

Senior Data Scientist at Presight AI

Senior Data Scientist with expertise in LLM deployment, agentic AI systems, and scalable ML pipelines. IIT Bombay alumnus. Previously at TikTok and Microsoft.