How the estimates work

Local LLM generation is dominated by memory bandwidth: to produce each token, the hardware streams the model's active weights (and the growing KV cache) out of memory. But raw bandwidth isn't the whole story — there's also a fixed per-token overhead that puts a ceiling on speed, which is why a chip with twice the bandwidth isn't twice as fast. We model the time for each token directly:

tokens/sec = 1 ÷ (weight_read_time + kv_cache_read_time + fixed_overhead)
weight_read_time = active_model_size ÷ (memory_bandwidth × efficiency)
kv_cache_read_time = kv_cache_size ÷ (memory_bandwidth × efficiency)

Active model size = active parameters × bits-per-weight ÷ 8. For mixture-of-experts models only a fraction of parameters run per token, so they are faster than their download size suggests.
KV cache grows with your context length, so longer contexts mean slower generation: the context selector affects the speed estimate, not just whether a model fits. By default we assume an fp16 cache (2 bytes per element); the KV-cache selector models Q8_0 (≈½ the size) and Q4_0 (≈¼) quantization, a runtime setting (e.g. llama.cpp's --cache-type-k and --cache-type-v) that shrinks the cache and speeds up long-context generation at a small cost in quality. It can't be read from a model on Hugging Face, because it's your choice at inference time, so we let you pick it here.
Efficiency and fixed overhead are calibrated per hardware class (Apple unified memory, discrete GPU, CPU).
Concurrent streams (serving several requests at once) share one copy of the weights but each needs its own KV cache, so memory use grows with the stream count and large models may stop fitting. Because the weight read is amortized across the batch, the speed we show is per stream — aggregate throughput is roughly the stream count times that, with diminishing returns as batched decode becomes compute-bound. Real batching efficiency varies by runtime (vLLM is built for it; llama.cpp less so), so treat the multi-stream numbers as a first-order guide.
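As a minimal sketch, the per-token formula above can be written directly in Python. The function name and every number in the example are illustrative assumptions, not the site's calibrated constants; the point is only to show why doubling bandwidth does not double speed once a fixed overhead is present.

```python
def tokens_per_second(active_params, bits_per_weight, kv_cache_bytes,
                      bandwidth_bytes_per_s, efficiency, fixed_overhead_s):
    """Per-token time model: weight read + KV-cache read + fixed overhead."""
    # Active model size = active parameters x bits-per-weight / 8
    active_model_bytes = active_params * bits_per_weight / 8
    effective_bandwidth = bandwidth_bytes_per_s * efficiency
    weight_read_time = active_model_bytes / effective_bandwidth
    kv_cache_read_time = kv_cache_bytes / effective_bandwidth
    return 1.0 / (weight_read_time + kv_cache_read_time + fixed_overhead_s)


# Hypothetical inputs: 8e9 active parameters at 4 bits per weight (4 GB),
# a 1 GB KV cache, efficiency 0.7, and 2 ms of fixed overhead per token.
slow = tokens_per_second(8e9, 4.0, 1e9, 400e9, 0.7, 0.002)
fast = tokens_per_second(8e9, 4.0, 1e9, 800e9, 0.7, 0.002)
print(f"{slow:.1f} tok/s at 400 GB/s, {fast:.1f} tok/s at 800 GB/s")
print(f"speed-up from doubling bandwidth: {fast / slow:.2f}x")
```

With these made-up inputs the speed-up stays below 2x, because the fixed overhead term does not shrink when bandwidth grows.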

Time to first token is a separate story. Before any token is generated the model must process your whole prompt (“prefill”), which is compute-bound (≈ 2 FLOPs per active parameter per token), not bandwidth-bound like generation. So a chip with great memory bandwidth but modest compute can stream tokens quickly yet still take many seconds to start at long context. We estimate prefill from each device's approximate fp16 compute and show the time to first token at your selected context length. These compute figures aren't benchmark-calibrated, so treat the time to first token as a rough ballpark.
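The prefill estimate can be sketched from the 2-FLOPs-per-active-parameter rule alone. The `utilization` parameter is an assumption added here to represent how much of the peak compute a runtime actually reaches; the device figure in the example is hypothetical.

```python
def time_to_first_token_s(active_params, prompt_tokens, compute_flops_per_s,
                          utilization=1.0):
    """Rough prefill time: ~2 FLOPs per active parameter per prompt token."""
    prefill_flops = 2 * active_params * prompt_tokens
    # utilization is an assumed fudge factor, not a value from the source
    return prefill_flops / (compute_flops_per_s * utilization)


# Hypothetical device with 30 TFLOP/s of fp16 compute, 8e9 active parameters
for context in (1024, 8192, 32768):
    ttft = time_to_first_token_s(8e9, context, 30e12)
    print(f"{context:>6} prompt tokens: about {ttft:.1f} s to first token")
```

Prefill time scales linearly with prompt length in this sketch, which is why long contexts can take many seconds to start even on hardware that decodes quickly.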

Number format (NVFP4, INT8/FP8, and quantization) matters for prefill but barely for decode, and the reason is the same compute-bound versus bandwidth-bound split. Decode is bandwidth-bound, so what counts is how many bits each weight occupies: a 4-bit weight streams in the same time whether it's an integer K-quant (Q4) or NVIDIA's NVFP4, a hardware 4-bit float on Blackwell. NVFP4's advantage over a plain 4-bit integer is quality per bit (a shared scale and a floating-point mantissa preserve more of the model), not raw decode speed, so it doesn't move the tokens/sec we show, which are calibrated to llama.cpp/Ollama integer quants. Prefill is compute-bound, and that's where a card's tensor-core formats decide throughput. Each NVIDIA generation added a lower-precision tensor path that roughly doubles peak math: INT8 on Ampere (A100), FP8 on Ada and Hopper (L40S, H100/H200), and FP4/NVFP4 on Blackwell (RTX Pro 6000, B200, GB300). The prefill compute figures for the workstation and datacenter cards below already reflect that tensor-core advantage, which is why they sit well ahead of a consumer card at the same memory bandwidth. NVFP4 itself is consumed today by TensorRT-LLM and vLLM, not GGUF/llama.cpp, so on this site it shows up as faster prefill on Blackwell rather than a new decode number.

Whether a model fits is decided by the total weights at a given quantization, plus the KV cache for your chosen context length, plus runtime overhead — compared against the usable portion of your VRAM or unified memory.
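A minimal sketch of the fit check follows. The KV-cache size uses the standard per-token formula (one key and one value vector per layer per KV head); the model shape, the runtime overhead, and the usable fraction in the example are hypothetical values, not the figures the site uses.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_element=2.0):
    """KV cache size; the leading 2 counts one key and one value tensor."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


def fits(total_params, bits_per_weight, kv_bytes, runtime_overhead_bytes,
         memory_bytes, usable_fraction, n_streams=1):
    """Return (fits?, bytes needed). Uses TOTAL weights, not active ones."""
    weight_bytes = total_params * bits_per_weight / 8
    # Each concurrent stream needs its own KV cache; weights are shared.
    needed = weight_bytes + kv_bytes * n_streams + runtime_overhead_bytes
    return needed <= memory_bytes * usable_fraction, needed


# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 8192-token context
kv = kv_cache_bytes(32, 8, 128, 8192)          # fp16 cache, 2 bytes/element
ok, needed = fits(8e9, 4.5, kv, 1e9, 8e9, 0.9)
print(f"KV cache: {kv / 2**30:.2f} GiB, total needed: {needed / 1e9:.2f} GB, fits: {ok}")
```

Note that the fit check counts every weight in the model, whereas the speed estimate counts only the weights that are active per token.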

On a discrete GPU, a model that overflows VRAM isn't necessarily out of reach: llama.cpp can keep some layers on the GPU and offload the rest to system RAM, which runs — slowly, because the RAM-resident weights are read at a fraction of VRAM bandwidth each token. When that applies we show the offload speed and how much spills to RAM, assuming a typical desktop with ~64 GB of DDR5. Apple unified memory has no VRAM/RAM split, so offload doesn't apply there.
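The offload slowdown can be sketched by splitting the weight read between the two memory pools. This is a simplification under stated assumptions: llama.cpp offloads whole layers rather than arbitrary byte counts, the KV cache is ignored, and the bandwidth and budget numbers are hypothetical.

```python
def offload_tokens_per_second(active_bytes, vram_budget_bytes,
                              vram_bandwidth, ram_bandwidth,
                              fixed_overhead_s=0.0):
    """Return (tokens/sec, bytes spilled to system RAM) for a split model."""
    in_vram = min(active_bytes, vram_budget_bytes)
    in_ram = active_bytes - in_vram
    # Each token reads the VRAM part at VRAM speed and the rest at RAM speed.
    time_per_token = (in_vram / vram_bandwidth
                      + in_ram / ram_bandwidth
                      + fixed_overhead_s)
    return 1.0 / time_per_token, in_ram


# Hypothetical: 40 GB of weights, 22 GB of VRAM available for them,
# 900 GB/s VRAM bandwidth, 60 GB/s system-RAM bandwidth.
tps, spilled = offload_tokens_per_second(40e9, 22e9, 900e9, 60e9)
print(f"{tps:.1f} tok/s with {spilled / 1e9:.0f} GB spilled to RAM")
```

In this example the 18 GB held in RAM accounts for most of the per-token time, even though it is less than half of the model.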

Fine-tuning memory

Fine-tuning is a different memory problem from inference, which is why we give it its own page. Running a model needs the weights plus a KV cache; training also has to hold a gradient and optimizer state for every parameter it updates. With mixed-precision AdamW that's a bf16 gradient (2 bytes), an fp32 master copy (4 bytes), and two fp32 optimizer moments (8 bytes): 14 bytes per trainable parameter on top of the 2-byte bf16 weights, or about 16 bytes in all. So a full fine-tune of a 7–8B model needs well over 100 GB and only fits datacenter GPUs.

Full fine-tune trains every weight — the 16-bytes-per-param cost applies to the whole model.
LoRA freezes the base weights (kept in bf16) and trains only a small low-rank adapter, so gradients and optimizer state apply to a fraction of a percent of the parameters — collapsing that cost to near zero.
QLoRA goes further and holds the frozen base in 4-bit (≈¼ the weight memory) while still training the adapter — which is how a 7B fine-tune fits in single-digit gigabytes.

On top of that sits activation memory, which grows with batch size and sequence length; gradient checkpointing trades extra compute to store only a fraction of it, and an 8-bit optimizer halves the optimizer state. We estimate all of these from the model's shape and your chosen settings. The training figures are calibrated to standard references (a QLoRA-7B run in single-digit GB, a full 7B fine-tune north of 100 GB) but, like the speed numbers, are ballpark estimates: real usage varies with framework, kernel, and config.
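The three fine-tuning modes can be compared with a short sketch of the weight-and-optimizer arithmetic. It counts a 2-byte bf16 weight plus the 2 + 4 + 8 bytes of gradient, master copy, and moments for each trained parameter, roughly 16 bytes in all. Activation memory is deliberately left out, and the `adapter_fraction` default of 0.5% is an assumed stand-in for "a fraction of a percent".

```python
BYTES_BF16_WEIGHT = 2
BYTES_TRAIN_STATE = 2 + 4 + 8   # bf16 gradient + fp32 master copy + two fp32 moments


def finetune_bytes(total_params, method, adapter_fraction=0.005):
    """Weights plus training state, excluding activations (a rough sketch)."""
    per_trained_param = BYTES_BF16_WEIGHT + BYTES_TRAIN_STATE   # about 16 bytes
    adapter_params = total_params * adapter_fraction            # assumed size
    if method == "full":
        return total_params * per_trained_param
    if method == "lora":     # frozen bf16 base + trained adapter
        return total_params * 2 + adapter_params * per_trained_param
    if method == "qlora":    # frozen 4-bit base (0.5 byte/param) + trained adapter
        return total_params * 0.5 + adapter_params * per_trained_param
    raise ValueError(f"unknown method: {method}")


for method in ("full", "lora", "qlora"):
    print(f"{method:>5}: {finetune_bytes(7e9, method) / 1e9:.1f} GB before activations")
```

For a 7B model this sketch lands on the same side of the reference points quoted above: a full fine-tune above 100 GB and QLoRA in single-digit gigabytes, before activations are added.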

Vision-language models

A vision-language model costs more memory than its text size suggests, for two reasons. First, it carries a vision encoder (a ~0.3–0.7B ViT) that stays resident in ~fp16 even when the language weights are quantized. Second, and this is the one people miss, every image becomes hundreds to thousands of tokens that are prepended to the context, so they consume KV cache exactly like a long prompt. We add both terms to the normal weight + KV-cache math: the vision encoder to the weights, and images × tokens-per-image to the context.

Tokens-per-image varies by model and resolution — LLaVA's CLIP encoder emits a fixed 576, while Qwen2-VL and Pixtral scale from a few hundred to a couple thousand — so the calculator has a resolution control, and the image KV cost is shown as its own line. Because that cost scales with the attention shape, it's far heavier on a multi-head model (LLaVA) than a grouped-query one (Qwen-VL) for the same token count. We model the common decoder-only VLMs that prepend visual tokens; cross-attention designs (e.g. Llama 3.2 Vision) condition differently and aren't covered yet. Specs are hand-curated and approximate.
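The two extra terms can be sketched as follows. The attention shapes in the example are illustrative stand-ins for a multi-head and a grouped-query decoder, not the curated specs of any particular model, and the 0.3B encoder size is simply the low end of the range quoted above.

```python
def vlm_extra_bytes(vision_encoder_params, n_images, tokens_per_image,
                    n_layers, n_kv_heads, head_dim, kv_bytes_per_element=2.0):
    """Return (vision encoder bytes, KV-cache bytes consumed by image tokens)."""
    encoder_bytes = vision_encoder_params * 2        # kept resident in ~fp16
    image_tokens = n_images * tokens_per_image       # added to the context
    image_kv_bytes = (2 * n_layers * n_kv_heads * head_dim
                      * image_tokens * kv_bytes_per_element)
    return encoder_bytes, image_kv_bytes


# One 576-token image on two hypothetical 32-layer decoders with head_dim 128:
# multi-head attention (32 KV heads) versus grouped-query attention (8 KV heads).
_, mha_kv = vlm_extra_bytes(0.3e9, 1, 576, 32, 32, 128)
_, gqa_kv = vlm_extra_bytes(0.3e9, 1, 576, 32, 8, 128)
print(f"multi-head: {mha_kv / 2**20:.0f} MiB per image, "
      f"grouped-query: {gqa_kv / 2**20:.0f} MiB per image")
```

With the same token count, the image KV cost scales with the number of KV heads, which is the attention-shape effect described above.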

Calibration

The speed model is fitted against real measured token-generation benchmarks from the llama.cpp benchmark threads, XiongjieDai's GPU-Benchmarks-on-LLM-Inference, and LocalScore. The Apple-silicon fit explains 98% of the variance in the measured data; the discrete-GPU fit, 90%.
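The source does not describe the fitting procedure, so the following is only one plausible sketch. If the KV-cache term is ignored, measured seconds per token is linear in bytes ÷ bandwidth, with slope 1 ÷ efficiency and intercept equal to the fixed overhead, so ordinary least squares recovers both constants. The function name is hypothetical and the data is synthetic, generated from known constants purely to check the arithmetic.

```python
def fit_efficiency_and_overhead(model_bytes, bandwidths, measured_tps):
    """Least-squares fit of seconds/token = x / efficiency + overhead."""
    xs = [b / bw for b, bw in zip(model_bytes, bandwidths)]   # ideal read time
    ys = [1.0 / tps for tps in measured_tps]                  # measured s/token
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of variance in seconds/token explained by the fitted line
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 / slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic "benchmarks" built from efficiency 0.7 and 2 ms overhead
sizes = [4e9, 8e9, 16e9, 40e9]
bws = [400e9, 800e9, 400e9, 800e9]
tps = [1.0 / (s / (bw * 0.7) + 0.002) for s, bw in zip(sizes, bws)]
eff, overhead, r2 = fit_efficiency_and_overhead(sizes, bws, tps)
print(f"efficiency={eff:.3f}, overhead={overhead * 1000:.2f} ms, R^2={r2:.4f}")
```

Real measurements are noisy, so a fit on actual benchmark data would return an R² below 1; the variance-explained figures quoted above may also be computed in a different space (for example tokens per second rather than seconds per token).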

As the crowdsourced reports below accumulate, we periodically re-fit the same constants against the accepted submissions and update them when the data warrants — so every benchmark you contribute directly sharpens the estimates everyone sees.

These are estimates, shown as ranges. They're calibrated to the mainstream llama.cpp / Ollama setup, which is the default; the runtime selector adjusts the estimate for faster backends — MLX on Apple silicon, vLLM or ExLlamaV2 on discrete GPUs — using approximate per-runtime factors (themselves refined over time by the crowdsourced reports, which record the engine used). Real numbers also vary with OS, thermal state, and build. The goal is a reliable ballpark for every machine, not a benchmark. CPU-only estimates are not yet benchmark-calibrated and are rougher.

Measured speeds

Alongside our estimates, we show crowdsourced measured speeds when people report them. On the contribute page anyone can paste the raw timing output from their own llama.cpp or Ollama run; we parse the tokens-per-second from it (never a self-typed number), sanity-check it against the estimate, and store it anonymously. Once a given device, model, and quantization has at least three accepted reports, cards and the chart show the median of them, with a count of how many back it — so no single submission can move the number. A measured median is a real number from real hardware, so trust it over the estimate when both are present — the estimate is the prediction, the measured value is the ground truth filling it in. Submissions are rate-limited and gated by a lightweight proof-of-work check (no third-party CAPTCHA); we keep only the one parsed benchmark line, not your full paste.
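The three-report rule is easy to sketch with the standard library. The function name and the sample values are hypothetical; the example shows why a median over at least three reports keeps a single outlier from moving the displayed number.

```python
from statistics import median

MIN_ACCEPTED_REPORTS = 3   # threshold stated in the text above


def measured_speed(accepted_tps):
    """Return (median tokens/sec, report count), or None below the threshold."""
    if len(accepted_tps) < MIN_ACCEPTED_REPORTS:
        return None
    return median(accepted_tps), len(accepted_tps)


print(measured_speed([40.0, 42.0]))           # too few reports to show
print(measured_speed([40.0, 42.0, 400.0]))    # the outlier does not move the median
```

A mean over the second list would be pulled far above 100 tokens/sec by the single outlier, whereas the median stays at the middle report.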

Capability score

The capability score (0–100) lets you pick the strongest model you can actually run. Its grounding varies by model, and we say which on every card:

Benchmark-anchored — where a model has a public LMArena Elo, the score is anchored to it and the card shows the Elo. These are grounded in a real, independent benchmark.
Editorial estimate — brand-new open models often have no clean, machine-readable public benchmark yet, so their score is our estimate from published results and size class, clearly labeled as such. A score upgrades to benchmark-anchored automatically once the model is rated.

Scores reflect full-precision weights; heavy quantization (e.g. Q4) may run a few percent weaker on math and reasoning. Quality data credit: LMArena (CC BY 4.0).