Running Qwen 3.6 35B MoE on an RTX 3060 12GB via -ncmoe

How a single flag in llama.cpp turns a 35B Mixture-of-Experts model from OOM to 23 tok/s on a 12GB GPU.

Last week, leftcurve dev posted a benchmark showing how the -ncmoe flag in llama.cpp lets you run Qwen 3.6’s 35B MoE model on an 8GB RTX 3070 Ti — achieving up to 40.9 tok/s. That got my attention.

I have an RTX 3060 12GB: fewer CUDA cores, lower memory bandwidth. Would the same trick work on a lower-tier card from the same generation?

Short answer: yes. 22.9 tok/s on a 3060, where the model unconditionally OOMs without the flag.

The Problem

Qwen 3.6-35B-A3B is a Mixture-of-Experts model. At UD-IQ3_XXS quantization it’s still 13.2 GB — bigger than the 12 GB VRAM on my 3060. Loading it with the default flags produces an immediate OOM:

llama_model_load: error loading model: out of memory

The MoE architecture is what makes it so large: each layer has many “expert” sub-networks, and with every layer offloaded to the GPU all of them have to sit in VRAM, even though only a few are active for any given token.

That’s where -ncmoe comes in.

The Trick

-ncmoe N (or --n-cpu-moe N) keeps the MoE expert weights of the first N layers on the CPU + system RAM instead of loading them into VRAM. It’s not full offloading — the attention weights, embeddings, and output projections stay on GPU. Only the expert sub-networks of those N layers are pinned to system memory.

This is a selective, per-layer offload of the expert tensors only. It is distinct from -ngl, which sets how many whole layers are offloaded to the GPU; any layer not offloaded stays entirely on the CPU.
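
To make the split concrete, here is a minimal sketch of the placement rule described above. It assumes GGUF-style tensor names such as blk.3.ffn_gate_exps.weight for expert weights; the function name placement and the exact name pattern are illustrative assumptions, not llama.cpp's actual code.

```python
import re

# Assumed naming: expert tensors of layer i look like
# "blk.<i>.ffn_{gate,up,down}_exps.weight". This pattern illustrates
# the idea; it is not copied from llama.cpp.
EXPERT_RE = re.compile(r"^blk\.(\d+)\.ffn_(gate|up|down)_exps\.")

def placement(tensor_name: str, n_cpu_moe: int) -> str:
    """Return "CPU" for expert tensors in the first n_cpu_moe layers, else "GPU"."""
    m = EXPERT_RE.match(tensor_name)
    if m and int(m.group(1)) < n_cpu_moe:
        return "CPU"
    return "GPU"

if __name__ == "__main__":
    for name in [
        "blk.0.attn_q.weight",          # attention stays on the GPU
        "blk.0.ffn_gate_exps.weight",   # expert in layer 0
        "blk.24.ffn_down_exps.weight",  # last layer covered by N = 25
        "blk.25.ffn_up_exps.weight",    # first layer not covered
        "output.weight",
    ]:
        print(f"{name:32s} -> {placement(name, 25)}")
```

With N = 25, layers 0 through 24 keep their experts in system RAM, and everything else, including the experts of layer 25 and up, goes to the GPU.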

The Setup

Hardware:

  • GPU: NVIDIA RTX 3060 12 GB
  • RAM: 31 GB DDR4
  • CPU: Intel Xeon E5-2650 v4 @ 2.20 GHz (12 cores / 24 threads)

Model: Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf (13.2 GB) — an Unsloth-optimized quant from TheDrummer

Software: llama.cpp built from source with CUDA support

Server command:

llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -np 1 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 65536 \
  --host 0.0.0.0 \
  --main-gpu 1 \
  --split-mode none \
  -ncmoe 25

Key details:

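  • -np 1: a single server slot, so the context and KV cache serve one request at a time (a VRAM-saving choice, not an MoE requirement)
  • --cache-type-k/v q8_0: saves VRAM on the KV cache compared to F16
  • --main-gpu 1 --split-mode none: keeps everything on one GPU (device index 1, the 3060 on this machine) instead of splitting across devices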
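
A rough sketch of why the q8_0 cache type helps. q8_0 stores blocks of 32 int8 values plus one fp16 scale (34 bytes per 32 elements), against 2 bytes per element for F16. The layer and head dimensions below are hypothetical placeholders, not values from the Qwen 3.6 model card; only the ratio between the two cache types carries over.

```python
# Bytes per element for two KV cache types. q8_0 packs 32 int8 values
# plus one fp16 scale into a 34-byte block.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type):
    """Size of the K and V caches together, in GiB, for one sequence."""
    elems = 2 * n_ctx * n_attn_layers * n_kv_heads * head_dim  # K and V
    return elems * BYTES_PER_ELEM[cache_type] / 2**30

if __name__ == "__main__":
    # HYPOTHETICAL dimensions, chosen only to show the ratio.
    dims = dict(n_ctx=65536, n_attn_layers=10, n_kv_heads=2, head_dim=256)
    for t in ("f16", "q8_0"):
        print(f"{t}: {kv_cache_gib(cache_type=t, **dims):.2f} GiB")
```

Whatever the real dimensions are, q8_0 comes out at about 53% of the F16 size.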

The Results

Config       VRAM (GPU)   System RAM   Load Time   Avg tok/s
-ncmoe OFF   OOM          n/a          n/a         n/a
-ncmoe 25    7.1 GB       7.7 GB       8.0 s       22.9
-ncmoe 30    5.8 GB       9.2 GB       8.3 s       21.1
-ncmoe 35    4.5 GB       10.7 GB      8.5 s       19.7

Without -ncmoe: OOM immediately. The full 13.2 GB model exceeds the 12 GB VRAM budget at load time.

-ncmoe 25 (sweet spot): 7.1 GB VRAM, leaving ~5 GB headroom for the KV cache and batch processing. At 22.9 tok/s, the model generates fast enough for interactive use and real-time agent tool calls.

-ncmoe 30: Drops to 21.1 tok/s. The additional 5 layers on CPU introduce PCIe latency without freeing enough VRAM to be meaningful (5.8 GB vs 7.1 GB — the extra 1.3 GB doesn’t unlock anything new).

-ncmoe 35: 19.7 tok/s at 4.5 GB VRAM. Useful if you’re running multiple models or a ComfyUI workflow alongside the LLM, but the speed loss is noticeable.

Raw data from the benchmark

Each config was tested with 3 prompts — arithmetic (12937 × 48291), code (FizzBuzz), and reasoning (train ETA). All prompts had temperature: 0.7, top_p: 0.9, max_tokens: 256.

-ncmoe 25:

arith:    23.4 tok/s  (256 tok in 10.96s)
fizzbuzz: 22.0 tok/s  ( 91 tok in  4.14s)
reason:   23.3 tok/s  (168 tok in  7.20s)

-ncmoe 30:

arith:    20.8 tok/s  (256 tok in 12.32s)
fizzbuzz: 21.2 tok/s  (256 tok in 12.05s)
reason:   21.2 tok/s  (256 tok in 12.07s)

-ncmoe 35:

arith:    19.5 tok/s  (256 tok in 13.15s)
fizzbuzz: 19.9 tok/s  (256 tok in 12.89s)
reason:   19.7 tok/s  (256 tok in 12.98s)
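
The averages in the results table can be re-derived from these raw numbers. The snippet below is a small check, not part of the benchmark itself; it assumes the table averages the three per-prompt rates after rounding each to 0.1 tok/s.

```python
# (tokens, seconds) per prompt, copied from the raw results above.
RAW = {
    25: {"arith": (256, 10.96), "fizzbuzz": (91, 4.14), "reason": (168, 7.20)},
    30: {"arith": (256, 12.32), "fizzbuzz": (256, 12.05), "reason": (256, 12.07)},
    35: {"arith": (256, 13.15), "fizzbuzz": (256, 12.89), "reason": (256, 12.98)},
}

def avg_tok_s(runs):
    """Mean of the per-prompt rates, each rounded to 0.1 tok/s first."""
    rates = [round(tok / sec, 1) for tok, sec in runs.values()]
    return round(sum(rates) / len(rates), 1)

if __name__ == "__main__":
    for ncmoe, runs in RAW.items():
        print(f"-ncmoe {ncmoe}: {avg_tok_s(runs)} tok/s")
```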

Benchmark script

#!/usr/bin/env python3
import subprocess, time, json

MODEL = "Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf"
LLAMA_SERVER = "/path/to/llama-server"
PORT = 8099

PROMPTS = {
    "arith": "What is 12937 * 48291? Think step by step.",
    "fizzbuzz": "Write a Python function that prints FizzBuzz for 1 to 100.",
    "reason": "If a train leaves Boston at 60 mph and another leaves NYC at"
              " 70 mph, and they are 150 miles apart, how long to meet?",
}

def kill_server():
    subprocess.run(f"lsof -ti:{PORT} | xargs kill -9 2>/dev/null", shell=True)
    time.sleep(2)

def start_server(ncmoe=None):
    kill_server()
    cmd = [
        LLAMA_SERVER, "-m", MODEL,
        "-ngl", "99", "-np", "1", "--flash-attn", "on",
        "--reasoning", "off",
        "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
        "--ctx-size", "65536", "--host", "0.0.0.0",
        "--port", str(PORT), "--main-gpu", "1", "--split-mode", "none",
    ]
    if ncmoe is not None:
        cmd.extend(["-ncmoe", str(ncmoe)])

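    start = time.time()
    # Discard stdout so an unread pipe cannot fill up and stall the server.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    while True:
        line = proc.stderr.readline()
        if "listening" in line.lower():
            return proc, time.time() - start
        # An empty read plus an exit code means the server died (e.g. OOM).
        if line == "" and proc.poll() is not None:
            return None, None
        if time.time() - start > 120:
            proc.kill()
            return None, None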

def run_test(proc, prompt_text):
    payload = {
        "prompt": prompt_text, "max_tokens": 256,
        "temperature": 0.7, "top_p": 0.9, "stream": False,
    }
    start = time.time()
    r = subprocess.run(
        ["curl", "-s", "-X", "POST",
         f"http://localhost:{PORT}/v1/completions",
         "-H", "Content-Type: application/json",
         "-d", json.dumps(payload)],
        capture_output=True, text=True, timeout=120,
    )
    elapsed = time.time() - start
    data = json.loads(r.stdout)
    usage = data.get("usage", {})
    tokens = usage.get("completion_tokens", 0)
    tok_s = round(tokens / elapsed, 1) if elapsed > 0 else 0
    return tok_s, tokens, round(elapsed, 2)

for ncmoe in [None, 25, 30, 35]:
    label = "BASELINE (no -ncmoe)" if ncmoe is None else f"-ncmoe {ncmoe}"
    proc, load_time = start_server(ncmoe)
    if proc is None:
        print(f"{label}: FAILED TO LOAD")
        continue
    time.sleep(3)
    print(f"\n{label} (loaded in {load_time:.1f}s)")
    for pname, ptext in PROMPTS.items():
        tok_s, tokens, elapsed = run_test(proc, ptext)
        print(f"  {pname}: {tok_s} tok/s ({tokens} tok in {elapsed}s)")
    proc.kill()
    time.sleep(2)
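
One caveat about the script: it times the whole HTTP round trip, so prompt processing and curl startup are folded into the tok/s figure. The sketch below shows one way to separate decode speed, assuming the response JSON carries a timings object with predicted_n and predicted_ms fields. llama-server's native /completion endpoint reports these; whether /v1/completions includes them in your build is worth checking. decode_tok_s is a hypothetical helper, not part of the original script.

```python
def decode_tok_s(data: dict, wall_seconds: float) -> float:
    """Prefer server-reported decode timing; fall back to wall-clock time."""
    timings = data.get("timings") or {}
    n = timings.get("predicted_n")
    ms = timings.get("predicted_ms")
    if n and ms:
        # Tokens generated divided by generation time only.
        return round(n / (ms / 1000.0), 1)
    tokens = data.get("usage", {}).get("completion_tokens", 0)
    return round(tokens / wall_seconds, 1) if wall_seconds > 0 else 0.0

if __name__ == "__main__":
    # Made-up sample responses, only to exercise both code paths.
    with_timings = {
        "usage": {"completion_tokens": 256},
        "timings": {"predicted_n": 256, "predicted_ms": 10000.0},
    }
    without_timings = {"usage": {"completion_tokens": 256}}
    print(decode_tok_s(with_timings, 10.96))
    print(decode_tok_s(without_timings, 10.96))
```

The fallback path reproduces what run_test already does, so the helper could replace the tok_s line there without changing results on servers that do not report timings.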

The Tradeoff

Over the tested range, every 5 additional layers moved to the CPU free about 1.3 GB of VRAM (and add about 1.5 GB of system RAM) at a cost of roughly 1.4 to 1.8 tok/s. With only three data points the relationship looks close to linear:

speed ≈ 22.8 - (ncmoe - 25) × 0.32 tok/s
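
A quick way to check a linear approximation against the measured averages is an ordinary least-squares fit over the three configurations. This is a sketch over the numbers in the results table; three points are too few to say much about values outside the 25 to 35 range.

```python
# (ncmoe, avg tok/s) from the results table.
POINTS = [(25, 22.9), (30, 21.1), (35, 19.7)]

def fit_line(points):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

if __name__ == "__main__":
    a, b = fit_line(POINTS)
    print(f"slope: {a:.2f} tok/s per layer")
    for x, y in POINTS:
        print(f"ncmoe {x}: measured {y}, fitted {a * x + b:.1f}")
```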

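Keeping experts in system RAM adds latency on every token, since the offloaded experts are served from slower memory and intermediate results have to cross PCIe. The cost stays moderate because the router activates only a small subset of experts per token (the A3B in the model name refers to roughly 3B active parameters out of 35B), while attention, along with the KV cache for the 64K context, stays on the GPU. That is why the hybrid split works well.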

Comparison: RTX 3060 vs 3070 Ti

Metric       RTX 3060     RTX 3070 Ti (leftcurve)
VRAM         12 GB        8 GB
CUDA cores   3584         6144
Memory       GDDR6        GDDR6X
-ncmoe 25    22.9 tok/s   40.9 tok/s
-ncmoe 30    21.1 tok/s   32.5 tok/s
-ncmoe 35    19.7 tok/s   27.5 tok/s

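The 3060 has ~42% fewer CUDA cores and lower memory bandwidth than the 3070 Ti, which lines up with the ~44% lower throughput at -ncmoe 25. The gap narrows to ~35% at -ncmoe 30 and ~28% at -ncmoe 35 as more expert layers move off the GPU. Still, 22.9 tok/s on a 10-year-old Xeon with a 3060 running a 35B model is a solid outcome.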
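
The relative gaps can be worked out directly from the comparison table. A small sketch; pct_lower is just a helper name chosen for this example.

```python
CORES = {"3060": 3584, "3070 Ti": 6144}
# ncmoe -> (RTX 3060 tok/s, RTX 3070 Ti tok/s), from the comparison table.
TOK_S = {25: (22.9, 40.9), 30: (21.1, 32.5), 35: (19.7, 27.5)}

def pct_lower(a, b):
    """By how many percent a is below b, rounded to 0.1."""
    return round(100 * (1 - a / b), 1)

if __name__ == "__main__":
    print(f"CUDA cores: {pct_lower(CORES['3060'], CORES['3070 Ti'])}% fewer")
    for ncmoe, (slow, fast) in TOK_S.items():
        print(f"-ncmoe {ncmoe}: {pct_lower(slow, fast)}% lower throughput")
```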

Why This Matters

-ncmoe effectively turns a VRAM limitation into a memory-bandwidth tradeoff. If you have fast system RAM and a reasonable GPU, you can run models that exceed your VRAM with acceptable performance: here a 13.2 GB model on a 12 GB card, and leftcurve ran the same 35B model on an 8 GB card.

For the MoE models coming down the pipeline — and they’ll keep growing — this flag is the difference between “can’t run at all” and “runs at interactive speed.” No model swaps, no context compression, no aggressive quantization to IQ1_S. Just one flag that moves the right weights to the right memory tier.


Written with DeepSeek V4 Pro

This post is licensed under CC BY 4.0 by the author.