Running Qwen 3.6 35B MoE on an RTX 3060 12GB via -ncmoe
How a single flag in llama.cpp turns a 35B Mixture-of-Experts model from OOM to 23 tok/s on a 12GB GPU.
Last week, leftcurve dev posted a benchmark showing how the -ncmoe flag in llama.cpp lets you run Qwen 3.6’s 35B MoE model on an 8GB RTX 3070 Ti — achieving up to 40.9 tok/s. That got my attention.
I have an RTX 3060 12GB: fewer CUDA cores, slower memory. Would the same trick work on a lower-tier card from the same generation?
Short answer: yes. 22.9 tok/s on a 3060, where the model unconditionally OOMs without the flag.
The Problem
Qwen 3.6-35B-A3B is a Mixture-of-Experts model. At UD-IQ3_XXS quantization it’s still 13.2 GB — bigger than the 12 GB VRAM on my 3060. Loading it with the default flags produces an immediate OOM:
    llama_model_load: error loading model: out of memory
The MoE architecture is what makes it so large: each layer has multiple “expert” sub-networks, and with every layer offloaded to the GPU all of them have to sit in VRAM, regardless of how few are active for a given token.
That’s where -ncmoe comes in.
The Trick
-ncmoe N (or --n-cpu-moe N) keeps the MoE expert weights of the first N layers on the CPU + system RAM instead of loading them into VRAM. It’s not full offloading — the attention weights, embeddings, and output projections stay on GPU. Only the expert sub-networks of those N layers are pinned to system memory.
This is a selective, per-layer offload, distinct from the global -ngl flag, which sets how many whole layers are placed on the GPU and leaves the remaining layers entirely on the CPU.
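One way to picture the split is as a rule over tensor names. The sketch below assumes GGUF expert tensors are named like `blk.<layer>.ffn_up_exps.weight` and that everything else lands on the GPU because of `-ngl 99`. The regex and the `placement` helper are illustrative only, not llama.cpp's actual code.

```python
import re

# Assumed naming: routed-expert tensors look like "blk.<layer>.ffn_up_exps.weight".
EXPERT_RE = re.compile(r"^blk\.(\d+)\.ffn_(up|gate|down)_exps\.")


def placement(tensor_name: str, n_cpu_moe: int) -> str:
    """Return "CPU" for expert tensors in the first n_cpu_moe layers, else "GPU"."""
    m = EXPERT_RE.match(tensor_name)
    if m and int(m.group(1)) < n_cpu_moe:
        return "CPU"
    return "GPU"


if __name__ == "__main__":
    for name in [
        "blk.3.attn_q.weight",        # attention stays on the GPU
        "blk.3.ffn_up_exps.weight",   # expert weights, layer 3 (< 25)
        "blk.30.ffn_up_exps.weight",  # expert weights, layer 30 (>= 25)
        "output.weight",              # output projection stays on the GPU
    ]:
        print(f"{name:28s} -> {placement(name, 25)}")
```

With `n_cpu_moe=25`, only the expert tensors of layers 0 to 24 are reported as CPU-resident; attention, embeddings, and the experts of later layers stay on the GPU.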
The Setup
Hardware:
- GPU: NVIDIA RTX 3060 12 GB
- RAM: 31 GB DDR4
- CPU: Intel Xeon E5-2650 v4 @ 2.20 GHz (12 cores / 24 threads)
Model: Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf (13.2 GB) — an Unsloth-optimized quant from TheDrummer
Software: llama.cpp built from source with CUDA support
Server command:
    llama-server \
      -m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
      -ngl 99 \
      -np 1 \
      --flash-attn on \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --ctx-size 65536 \
      --host 0.0.0.0 \
      --main-gpu 1 \
      --split-mode none \
      -ncmoe 25
Key details:
- -np 1: a single server slot. The model is large enough as-is, so there is no VRAM to spare for parallel slots.
- --cache-type-k q8_0 --cache-type-v q8_0: saves VRAM on the KV cache compared to F16.
- --main-gpu 1 --split-mode none: forces all GPU layers onto the 3060 only.
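To get a feel for what the q8_0 KV cache saves, here is a rough size estimate. The layer and head dimensions below are hypothetical placeholders (the article does not list the model's real values, which live in the GGUF metadata), and `n_attn_layers` is meant to count only the layers that actually keep a KV cache. The only firm number is the q8_0 block layout: 32 int8 values plus one fp16 scale, 34 bytes in total.

```python
def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem):
    """K and V each store n_kv_heads * head_dim values per token per layer."""
    return 2 * n_attn_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # q8_0 block: 32 int8 values + one fp16 scale = 34 bytes

# Hypothetical dimensions for illustration only; not taken from the model.
dims = dict(n_ctx=65536, n_attn_layers=10, n_kv_heads=2, head_dim=256)

f16 = kv_cache_bytes(**dims, bytes_per_elem=F16_BYTES)
q8 = kv_cache_bytes(**dims, bytes_per_elem=Q8_0_BYTES)

print(f"F16 : {f16 / 2**30:.2f} GiB")
print(f"q8_0: {q8 / 2**30:.2f} GiB ({q8 / f16:.0%} of F16)")
```

Whatever the real dimensions are, the ratio holds: q8_0 stores the cache in about 53% of the space F16 would need.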
The Results
| Config | VRAM (GPU) | System RAM | Load Time | Avg tok/s |
|---|---|---|---|---|
| -ncmoe OFF | — | — | OOM | — |
| -ncmoe 25 | 7.1 GB | 7.7 GB | 8.0 s | 22.9 |
| -ncmoe 30 | 5.8 GB | 9.2 GB | 8.3 s | 21.1 |
| -ncmoe 35 | 4.5 GB | 10.7 GB | 8.5 s | 19.7 |
Without -ncmoe: OOM immediately. The full 13.2 GB model exceeds the 12 GB VRAM budget at load time.
-ncmoe 25 (sweet spot): 7.1 GB VRAM, leaving ~5 GB headroom for the KV cache and batch processing. At 22.9 tok/s, the model generates fast enough for interactive use and real-time agent tool calls.
-ncmoe 30: Drops to 21.1 tok/s. Moving the experts of 5 more layers to system RAM costs speed without freeing enough VRAM to matter here (5.8 GB vs 7.1 GB: the extra 1.3 GB doesn’t unlock anything new).
-ncmoe 35: 19.7 tok/s at 4.5 GB VRAM. Useful if you’re running multiple models or a ComfyUI workflow alongside the LLM, but the speed loss is noticeable.
Raw data from the benchmark
Each config was tested with 3 prompts — arithmetic (12937 × 48291), code (FizzBuzz), and reasoning (train ETA). All prompts had temperature: 0.7, top_p: 0.9, max_tokens: 256.
-ncmoe 25:
    arith:    23.4 tok/s (256 tok in 10.96s)
    fizzbuzz: 22.0 tok/s ( 91 tok in  4.14s)
    reason:   23.3 tok/s (168 tok in  7.20s)
-ncmoe 30:
    arith:    20.8 tok/s (256 tok in 12.32s)
    fizzbuzz: 21.2 tok/s (256 tok in 12.05s)
    reason:   21.2 tok/s (256 tok in 12.07s)
-ncmoe 35:
    arith:    19.5 tok/s (256 tok in 13.15s)
    fizzbuzz: 19.9 tok/s (256 tok in 12.89s)
    reason:   19.7 tok/s (256 tok in 12.98s)
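As a sanity check, the averages in the results table can be recomputed from the raw runs above. This small sketch assumes the table's "Avg tok/s" is the plain mean of the three per-prompt rates (not total tokens over total time); the `RAW` dictionary simply transcribes the numbers listed above.

```python
# (completion tokens, seconds) per prompt, transcribed from the raw data.
RAW = {
    25: [(256, 10.96), (91, 4.14), (168, 7.20)],
    30: [(256, 12.32), (256, 12.05), (256, 12.07)],
    35: [(256, 13.15), (256, 12.89), (256, 12.98)],
}


def per_prompt_rates(runs):
    """tok/s for each run, rounded to one decimal like the benchmark output."""
    return [round(tok / sec, 1) for tok, sec in runs]


def mean_rate(runs):
    """Unweighted mean of the per-prompt rates."""
    rates = per_prompt_rates(runs)
    return round(sum(rates) / len(rates), 1)


if __name__ == "__main__":
    for ncmoe, runs in RAW.items():
        print(f"-ncmoe {ncmoe}: {per_prompt_rates(runs)} -> avg {mean_rate(runs)} tok/s")
```

The recomputed means match the table: 22.9, 21.1, and 19.7 tok/s.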
Benchmark script
    #!/usr/bin/env python3
    """Benchmark llama-server generation speed for several -ncmoe settings."""
    import json
    import subprocess
    import time

    MODEL = "Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf"
    LLAMA_SERVER = "/path/to/llama-server"
    PORT = 8099

    PROMPTS = {
        "arith": "What is 12937 * 48291? Think step by step.",
        "fizzbuzz": "Write a Python function that prints FizzBuzz for 1 to 100.",
        "reason": "If a train leaves Boston at 60 mph and another leaves NYC at"
                  " 70 mph, and they are 150 miles apart, how long to meet?",
    }


    def kill_server():
        # Kill whatever is still bound to the benchmark port.
        subprocess.run(f"lsof -ti:{PORT} | xargs kill -9 2>/dev/null", shell=True)
        time.sleep(2)


    def start_server(ncmoe=None):
        """Start llama-server; return (proc, load_seconds) or (None, None) on failure."""
        kill_server()
        cmd = [
            LLAMA_SERVER, "-m", MODEL,
            "-ngl", "99", "-np", "1", "--flash-attn", "on",
            "--reasoning", "off",
            "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
            "--ctx-size", "65536", "--host", "0.0.0.0",
            "--port", str(PORT), "--main-gpu", "1", "--split-mode", "none",
        ]
        if ncmoe is not None:
            cmd.extend(["-ncmoe", str(ncmoe)])
        start = time.time()
        # stdout is never read, so discard it instead of leaving a pipe to fill up.
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
        )
        while True:
            line = proc.stderr.readline()
            if "listening" in line.lower():
                return proc, time.time() - start
            # An empty read after the process exited means the load failed (e.g. OOM).
            exited = not line and proc.poll() is not None
            if exited or time.time() - start > 120:
                proc.kill()
                return None, None


    def run_test(proc, prompt_text):
        """Send one completion request; return (tok/s, tokens, seconds)."""
        payload = {
            "prompt": prompt_text, "max_tokens": 256,
            "temperature": 0.7, "top_p": 0.9, "stream": False,
        }
        start = time.time()
        r = subprocess.run(
            ["curl", "-s", "-X", "POST",
             f"http://localhost:{PORT}/v1/completions",
             "-H", "Content-Type: application/json",
             "-d", json.dumps(payload)],
            capture_output=True, text=True, timeout=120,
        )
        # Wall-clock time, so prompt processing and HTTP overhead are included.
        elapsed = time.time() - start
        data = json.loads(r.stdout)
        usage = data.get("usage", {})
        tokens = usage.get("completion_tokens", 0)
        tok_s = round(tokens / elapsed, 1) if elapsed > 0 else 0
        return tok_s, tokens, round(elapsed, 2)


    for ncmoe in [None, 25, 30, 35]:
        label = "BASELINE (no -ncmoe)" if ncmoe is None else f"-ncmoe {ncmoe}"
        proc, load_time = start_server(ncmoe)
        if proc is None:
            print(f"{label}: FAILED TO LOAD")
            continue
        time.sleep(3)
        print(f"\n{label} (loaded in {load_time:.1f}s)")
        for pname, ptext in PROMPTS.items():
            # run_test takes (proc, prompt_text); the prompt name is only a label.
            tok_s, tokens, elapsed = run_test(proc, ptext)
            print(f"  {pname}: {tok_s} tok/s ({tokens} tok in {elapsed}s)")
        proc.kill()
        proc.wait()
        time.sleep(2)
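The script above reports load time and tok/s but not the VRAM column. One way to capture that figure, sketched here as an assumption rather than a record of how the numbers in the table were actually taken, is to query `nvidia-smi` once the server is listening. The helper names and the default GPU index 1 (mirroring `--main-gpu 1`) are illustrative.

```python
import subprocess


def parse_memory_used(csv_text: str) -> int:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into MiB."""
    return int(csv_text.strip().splitlines()[0])


def vram_used_mib(gpu_index: int = 1) -> int:
    """Return memory currently in use on one GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_used(out)
```

Calling `vram_used_mib()` before starting the server and again after it reports ready gives a before/after difference, which keeps other processes on the same GPU out of the measurement.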
The Tradeoff
Each 5-layer step in -ncmoe moves roughly 1.3 GB of expert weights out of VRAM, while system RAM use grows by about 1.5 GB. The speed cost is 1.4–1.8 tok/s per 5 layers shifted to CPU. Across the three settings tested, the relationship is roughly linear:
    speed ≈ 22.9 - (ncmoe - 25) × 0.32 tok/s
Experts held in system RAM are slower to evaluate than experts in VRAM, but the MoE routing in Qwen 3.6 activates only a small subset of the experts per token (the A3B in the model name refers to roughly 3B active parameters out of 35B). Attention stays on the GPU throughout the decode phase, even with a 64K context allocated, so the hybrid split works well.
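For anyone who wants to estimate a setting that was not benchmarked, here is a minimal least-squares fit through the three measured averages from the results table. It is only a sketch: three points cannot confirm linearity, and values outside the tested 25 to 35 range are extrapolation.

```python
# (ncmoe, average tok/s) from the results table.
POINTS = [(25, 22.9), (30, 21.1), (35, 19.7)]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(POINTS)


def predict(ncmoe):
    """Estimated tok/s for a given -ncmoe value."""
    return slope * ncmoe + intercept


if __name__ == "__main__":
    print(f"slope: {slope:.2f} tok/s per layer")
    for n in (25, 28, 30, 32, 35):
        print(f"-ncmoe {n}: ~{predict(n):.1f} tok/s")
```

The fitted slope is about -0.32 tok/s per layer moved to the CPU, or roughly 1.6 tok/s per 5-layer step.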
Comparison: RTX 3060 vs 3070 Ti
| Metric | RTX 3060 | RTX 3070 Ti (leftcurve) |
|---|---|---|
| VRAM | 12 GB | 8 GB |
| CUDA cores | 3584 | 6144 |
| Memory | GDDR6 | GDDR6X |
| -ncmoe 25 | 22.9 tok/s | 40.9 tok/s |
| -ncmoe 30 | 21.1 tok/s | 32.5 tok/s |
| -ncmoe 35 | 19.7 tok/s | 27.5 tok/s |
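The 3060 has ~42% fewer CUDA cores and lower memory bandwidth than the 3070 Ti, which is consistent with the ~44% lower throughput at -ncmoe 25. The gap narrows to ~35% at -ncmoe 30 and ~28% at -ncmoe 35, presumably because more of the work shifts to the CPU and system RAM as more experts are offloaded. Still, 22.9 tok/s on a 10-year-old Xeon with a 3060 running a 35B model is a solid outcome.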
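The percentages can be checked directly from the comparison table. The snippet below only redoes that arithmetic; `pct_lower` is a throwaway helper name.

```python
def pct_lower(a, b):
    """How much lower a is than b, in percent."""
    return round((1 - a / b) * 100, 1)


CUDA_CORES = (3584, 6144)  # RTX 3060, RTX 3070 Ti
TOK_S = {25: (22.9, 40.9), 30: (21.1, 32.5), 35: (19.7, 27.5)}

if __name__ == "__main__":
    print(f"CUDA cores: {pct_lower(*CUDA_CORES)}% fewer on the 3060")
    for ncmoe, (rtx3060, rtx3070ti) in TOK_S.items():
        print(f"-ncmoe {ncmoe}: {pct_lower(rtx3060, rtx3070ti)}% lower tok/s on the 3060")
```

The throughput gap is largest at -ncmoe 25 and shrinks as more expert layers move to the CPU on both machines.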
Why This Matters
-ncmoe effectively turns a VRAM limitation into a memory-bandwidth tradeoff. If you have fast system RAM and a reasonable GPU, you can run models that are larger than your VRAM with acceptable performance: 13.2 GB on a 12 GB card here, and about 1.65× VRAM on leftcurve’s 8 GB 3070 Ti.
For the MoE models coming down the pipeline — and they’ll keep growing — this flag is the difference between “can’t run at all” and “runs at interactive speed.” No model swaps, no context compression, no aggressive quantization to IQ1_S. Just one flag that moves the right weights to the right memory tier.
Written with DeepSeek V4 Pro