Vision engine

Can I run this vision model?

A vision-language model needs more memory than its text size suggests. The vision encoder adds resident weights, and every image is turned into hundreds to thousands of tokens that sit in the context, so images eat KV cache the same way a long prompt does. Pick a model, your GPU, and how many images you'll feed it; we'll show where the memory goes and whether it fits.
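If you want a rough feel for the arithmetic before picking anything, here is a minimal sketch of that accounting: weights (text plus vision encoder) plus a KV cache that grows with every image token and text token. It is not this calculator's actual formula. The names `VLMConfig`, `weight_bytes`, `kv_cache_bytes`, and `estimate` are hypothetical, the parameter counts and layer shapes in the usage example are illustrative placeholders rather than the published config of any model, and activation memory and framework overhead are left out.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class VLMConfig:
    """Hypothetical model description; every field is an assumption you supply."""
    text_params: int    # parameters in the language model
    vision_params: int  # parameters in the vision encoder (resident weights)
    n_layers: int       # transformer layers in the language model
    n_kv_heads: int     # key/value heads per layer
    head_dim: int       # dimension of each attention head


def weight_bytes(cfg: VLMConfig, bytes_per_param: float) -> float:
    """Memory for all resident weights at the chosen precision."""
    return (cfg.text_params + cfg.vision_params) * bytes_per_param


def kv_cache_bytes(cfg: VLMConfig, total_tokens: int, kv_bytes: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * kv_bytes
    return per_token * total_tokens


def estimate(cfg: VLMConfig, n_images: int, tokens_per_image: int,
             text_tokens: int, bytes_per_param: float, gpu_gib: float) -> dict:
    """Rough fit check. Ignores activations and runtime overhead, so treat it as a lower bound."""
    total_tokens = n_images * tokens_per_image + text_tokens
    weights = weight_bytes(cfg, bytes_per_param)
    kv = kv_cache_bytes(cfg, total_tokens)
    return {
        "total_tokens": total_tokens,
        "weights_gib": weights / GIB,
        "kv_cache_gib": kv / GIB,
        "fits": weights + kv <= gpu_gib * GIB,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a roughly 7B-parameter text model with a small vision encoder.
    cfg = VLMConfig(text_params=7_000_000_000, vision_params=600_000_000,
                    n_layers=28, n_kv_heads=4, head_dim=128)
    # 4 images at an assumed 1000 tokens each, 4096 text tokens, 16-bit weights, 24 GiB GPU.
    print(estimate(cfg, n_images=4, tokens_per_image=1000,
                   text_tokens=4096, bytes_per_param=2, gpu_gib=24))
```

Tokens per image depends on the model and on image resolution, which is why resolution gets its own input here: doubling the image count or the per-image token count grows the KV cache term linearly while the weight term stays fixed.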

Vision-language model

Your GPU

Images in context

Image resolution

Quantization

Text context

Pick your GPU above to see if Qwen2-VL 7B fits.

Text-only model? Back to the main calculator →