Post

Running NVIDIA Nemotron 30B with Vision on a 24 GB GPU

How to fit NVIDIA's Nemotron-3-Nano-30B multimodal model with vision support on a single RTX 3090 — benchmarks, VRAM tricks, and the surprising resolution behavior.

Running NVIDIA Nemotron 30B with Vision on a 24 GB GPU

I have a single RTX 3090 (24 GB) and I wanted to run NVIDIA’s new Nemotron-3-Nano-30B-A3B-Reasoning — a 31B parameter MoE model with vision support. The question was simple: would it fit? Here’s what I found.

The setup

The model is available as a GGUF on unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF. I downloaded the UD-IQ4_NL quant (18.19 GiB) — a 4-bit dynamic quantization that balances quality and size — and the accompanying mmproj-F16 vision projector (1.48 GiB).

1
2
3
4
5
hf download unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF \
  --include "UD-IQ4_NL*" --local-dir ~/models/nemotron-vl

hf download unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF \
  --include "mmproj*" --local-dir ~/models/nemotron-vl

I run llama-swap as a model proxy, backed by llama.cpp (turboquant fork). The RTX 3090 is isolated as GPU 0 — no splitting across GPUs, which would tank throughput on this MoE model.

Fitting vision on a 24 GB card — the trick

Loading the model + mmproj + 64K context at once caused an immediate OOM — cudaGraphInstantiate failed because CUDA graph capture buffers couldn’t fit in the remaining VRAM.

The trick was --no-mmproj-offload . This flag keeps the vision projector (mmproj) on the CPU instead of loading it into GPU VRAM:

1
2
3
4
5
6
# llama-swap model config — vision variant
name: nemotron-3-nano-30b-a3b-vl
model: ~/models/nemotron-vl/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-IQ4_NL.gguf
mmproj: ~/models/nemotron-vl/mmproj-nemotron-3-nano-30b-f16.gguf
ctx_size: 32768
no_mmproj_offload: true

This saves ~1.5 GB of VRAM — enough to fit the model comfortably. I also dropped context from 64K to 32K for the vision config, since multimodal use rarely needs massive context windows.

With this configuration, VRAM usage sits at 23,119 MiB — leaving about 900 MiB of headroom on the 24 GB card.

Performance — text mode

Text-only inference at 64K context is fast on turboquant:

MetricValue
Prompt processing3,294 tok/s
Text generation104.6 tok/s
TTFT (time to first token)1.38 s
VRAM23,119 MiB

For a 31B MoE model running on consumer hardware, 105 tok/s is solid — roughly on par with a 70B at Q4 on 48 GB.

Performance — vision mode

The vision config (32K, mmproj on CPU) has identical generation performance:

MetricText (64K)Vision (32K)Δ
Prompt processing3,294 tok/s3,335 tok/s+1%
Text generation104.6 tok/s104.6 tok/s=
TTFT1.38 s1.36 s-1%
VRAM23,119 MiB23,119 MiB=

The key insight: generation throughput is not affected by the mmproj being on CPU. Only the image encoding step changes — and that’s a one-time cost per unique image.

The encoding cost

The first time you send an image to the model, the CPU-based mmproj processes it in about 4.7 seconds:

1
2
3
4
# Send an image via the OpenAI-compatible API
curl -X POST http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json

After that, the image tokens remain in the KV cache. Subsequent requests with the same image take only ~30 ms:

RequestPrompt timeGen timeTotal
First (uncached)4,778 ms870 ms~5.6 s
Second (cached)32 ms893 ms~0.9 s

Text generation after encoding consistently runs at 114.9 tok/s — no penalty.

Where it went sideways — the resolution trap

I tested images from 50×50 to 2048×2048, expecting higher-resolution inputs to produce more tokens and better detail:

Image sizePrompt tokensEncoding time
50×502734,665 ms
224×2242734,667 ms
336×3362734,740 ms
512×5122734,651 ms
768×7682734,691 ms
1024×10242734,700 ms
2048×20482734,639 ms

Every single resolution produced exactly 273 prompt tokens. All images are resized to the CLIP ViT’s native 224×224 resolution (patch size 14 → 16×16 = 256 patches) before passing through the mmproj.

This means the model does not support high-resolution input natively. There’s no dynamic tiling (like LLaVA-NeXT), no resolution scaling — everything gets squashed to 224×224. A 2048×2048 image loses just as much detail as a 512×512 one.

If you need vision at higher resolutions, this isn’t the model for that use case. Consider a model with tiling support or a custom preprocessing pipeline.

The config file

Here’s the full llama-swap entry I settled on:

1
2
3
4
5
6
7
8
9
- name: nemotron-3-nano-30b-a3b
  model: ~/models/nemotron-vl/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-IQ4_NL.gguf
  ctx_size: 65536

- name: nemotron-3-nano-30b-a3b-vl
  model: ~/models/nemotron-vl/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-IQ4_NL.gguf
  mmproj: ~/models/nemotron-vl/mmproj-nemotron-3-nano-30b-f16.gguf
  ctx_size: 32768
  no_mmproj_offload: true

Two configs: one for text (64K context, max throughput), one for vision (32K context, CPU-based mmproj). Swap between them via the API.

Takeaway

Nemotron 30B fits on a 24 GB GPU with vision enabled — just barely, and with mmproj on CPU. The tradeoff is a ~5 second encoding cost per unique image, which is acceptable for chat and document analysis but not for video or real-time applications. Generation speed is identical to text-only mode, and KV caching makes repeated image queries instant.

The 224×224 resolution cap is the main limitation. For applications that need to read small text or analyze fine details, this model isn’t the right choice. But for general image description, it works well and fits on hardware most enthusiasts already own.


Written with DeepSeek V4 Pro (deepseek-v4-pro) via Hermes Agent / CrofAI.

This post is licensed under CC BY 4.0 by the author.