Running Qwen 3.6 35B MoE on an RTX 3060 12GB via -ncmoe

How a single flag in llama.cpp turns a 35B Mixture-of-Experts model from OOM to 23 tok/s on a 12GB GPU.

Posted May 9, 2026

By Jean Brito

6 min read

Last week, leftcurve dev posted a benchmark showing how the -ncmoe flag in llama.cpp lets you run Qwen 3.6’s 35B MoE model on an 8GB RTX 3070 Ti — achieving up to 40.9 tok/s. That got my attention.

I have an RTX 3060 12GB: more VRAM, but fewer CUDA cores and slower memory. Would the same trick work on a slower card from the same generation?

Short answer: yes. 22.9 tok/s on a 3060, where the model unconditionally OOMs without the flag.

The Problem

Qwen 3.6-35B-A3B is a Mixture-of-Experts model. At UD-IQ3_XXS quantization it’s still 13.2 GB — bigger than the 12 GB VRAM on my 3060. Loading it with the default flags produces an immediate OOM:

llama_model_load: error loading model: out of memory

The MoE architecture is what makes it so large: each layer has multiple “expert” sub-networks, and with every layer offloaded to the GPU, all of them have to sit in VRAM even though only a few are active for any given token.

That’s where -ncmoe comes in.

The Trick

-ncmoe N (or --n-cpu-moe N) keeps the MoE expert weights of the first N layers in system RAM, on the CPU side, instead of loading them into VRAM. It’s not full offloading: the attention weights, embeddings, and output projections stay on the GPU. Only the expert sub-networks of those N layers are kept in system memory.

This is a selective, per-layer offload. It is distinct from the global -ngl flag, which sets how many whole layers go to the GPU and leaves the rest, attention included, on the CPU.
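To make the selection rule concrete, here is a minimal sketch of which tensors end up where. It assumes the expert tensors follow the GGUF naming pattern blk.<layer>.ffn_{gate,up,down}_exps and that -ngl 99 puts everything else on the GPU. The function name `placement` and the sample tensor names are illustrative, not taken from llama.cpp's source.

```python
import re

# Assumed GGUF naming for per-layer expert tensors, e.g. "blk.7.ffn_up_exps.weight".
EXPERT_RE = re.compile(r"^blk\.(\d+)\.ffn_(gate|up|down)_exps\.")

def placement(tensor_name: str, n_cpu_moe: int) -> str:
    """Return "CPU" for expert tensors of the first n_cpu_moe layers, else "GPU"."""
    m = EXPERT_RE.match(tensor_name)
    if m and int(m.group(1)) < n_cpu_moe:
        return "CPU"
    return "GPU"

# Illustrative tensor names: attention and output weights never match the pattern.
for name in [
    "blk.3.attn_q.weight",
    "blk.3.ffn_up_exps.weight",
    "blk.30.ffn_up_exps.weight",
    "output.weight",
]:
    print(f"{name:28s} -> {placement(name, 25)}")
```

With N = 25, only the expert tensors of layers 0 through 24 land in system RAM; the experts of later layers and all non-expert tensors stay on the GPU.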

The Setup

Hardware:

GPU: NVIDIA RTX 3060 12 GB
RAM: 31 GB DDR4
CPU: Intel Xeon E5-2650 v4 @ 2.20 GHz (12 cores / 24 threads)

Model: Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf (13.2 GB) — an Unsloth-optimized quant from TheDrummer

Software: llama.cpp built from source with CUDA support

Server command:

  
llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -np 1 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 65536 \
  --host 0.0.0.0 \
  --main-gpu 1 \
  --split-mode none \
  -ncmoe 25

Key details:

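-np 1: a single server slot, so the full context window goes to one request and no KV cache is set aside for parallel sequences (the model is large enough as-is)
--cache-type-k/v q8_0: saves VRAM on the KV cache compared to F16
--main-gpu 1 --split-mode none: puts all GPU layers on device index 1 only, which is the 3060 on my machine (use the index of your own card)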
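As a rough illustration of why the cache type matters, the sketch below estimates KV cache size for F16 versus q8_0. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen 3.6's real values (those come from the GGUF metadata). The only fixed inputs are the storage costs: F16 uses 2 bytes per element, and q8_0 stores each block of 32 elements in 34 bytes.

```python
# Bytes per element: F16 is 2 bytes; q8_0 packs 32 int8 values plus one
# 2-byte scale into a 34-byte block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type):
    """Estimate K+V cache size in GiB for a standard attention stack."""
    elements = 2 * n_ctx * n_attn_layers * n_kv_heads * head_dim  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30

# HYPOTHETICAL model shape, for illustration only.
shape = dict(n_ctx=65536, n_attn_layers=12, n_kv_heads=4, head_dim=128)

for cache_type in ("f16", "q8_0"):
    print(f"{cache_type:5s} {kv_cache_gib(cache_type=cache_type, **shape):.2f} GiB")
```

Whatever the real shape is, the ratio holds: q8_0 needs 34/64, or about 53%, of the F16 cache size.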

The Results

Config	VRAM (GPU)	System RAM	Load Time	Avg tok/s
`-ncmoe` OFF	—	—	OOM	—
`-ncmoe 25`	7.1 GB	7.7 GB	8.0 s	22.9
`-ncmoe 30`	5.8 GB	9.2 GB	8.3 s	21.1
`-ncmoe 35`	4.5 GB	10.7 GB	8.5 s	19.7

Without -ncmoe: OOM immediately. The full 13.2 GB model exceeds the 12 GB VRAM budget at load time.

-ncmoe 25 (sweet spot): 7.1 GB VRAM, leaving ~5 GB headroom for the KV cache and batch processing. At 22.9 tok/s, the model generates fast enough for interactive use and real-time agent tool calls.

-ncmoe 30: Drops to 21.1 tok/s. The additional 5 layers on CPU introduce PCIe latency without freeing enough VRAM to be meaningful (5.8 GB vs 7.1 GB — the extra 1.3 GB doesn’t unlock anything new).

-ncmoe 35: 19.7 tok/s at 4.5 GB VRAM. Useful if you’re running multiple models or a ComfyUI workflow alongside the LLM, but the speed loss is noticeable.

Raw data from the benchmark

Each config was tested with 3 prompts — arithmetic (12937 × 48291), code (FizzBuzz), and reasoning (train ETA). All prompts had temperature: 0.7, top_p: 0.9, max_tokens: 256.

-ncmoe 25:

arith:    23.4 tok/s  (256 tok in 10.96s)
fizzbuzz: 22.0 tok/s  ( 91 tok in  4.14s)
reason:   23.3 tok/s  (168 tok in  7.20s)

-ncmoe 30:

arith:    20.8 tok/s  (256 tok in 12.32s)
fizzbuzz: 21.2 tok/s  (256 tok in 12.05s)
reason:   21.2 tok/s  (256 tok in 12.07s)

-ncmoe 35:

arith:    19.5 tok/s  (256 tok in 13.15s)
fizzbuzz: 19.9 tok/s  (256 tok in 12.89s)
reason:   19.7 tok/s  (256 tok in 12.98s)
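The averages in the results table can be reproduced from these raw numbers. This is a small sketch that recomputes each per-prompt rate from tokens and seconds and then takes the plain mean; the `RAW` dictionary and `average_tok_s` name are mine.

```python
# (tokens, seconds) per prompt, copied from the raw results above.
RAW = {
    25: [(256, 10.96), (91, 4.14), (168, 7.20)],
    30: [(256, 12.32), (256, 12.05), (256, 12.07)],
    35: [(256, 13.15), (256, 12.89), (256, 12.98)],
}

def average_tok_s(runs):
    """Mean of per-prompt throughput (tokens / wall-clock seconds)."""
    rates = [tokens / seconds for tokens, seconds in runs]
    return sum(rates) / len(rates)

for ncmoe, runs in RAW.items():
    print(f"-ncmoe {ncmoe}: {average_tok_s(runs):.1f} tok/s")
```

Note that two of the -ncmoe 25 runs stopped before the 256-token limit, so the three configs did not generate identical workloads.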

Benchmark script

  
#!/usr/bin/env python3
import json, subprocess, time

MODEL = "Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf"
LLAMA_SERVER = "/path/to/llama-server"
PORT = 8099

PROMPTS = {
    "arith": "What is 12937 * 48291? Think step by step.",
    "fizzbuzz": "Write a Python function that prints FizzBuzz for 1 to 100.",
    "reason": "If a train leaves Boston at 60 mph and another leaves NYC at"
              " 70 mph, and they are 150 miles apart, how long to meet?",
}

def kill_server():
    subprocess.run(f"lsof -ti:{PORT} | xargs kill -9 2>/dev/null", shell=True)
    time.sleep(2)

def start_server(ncmoe=None):
    kill_server()
    cmd = [
        LLAMA_SERVER, "-m", MODEL,
        "-ngl", "99", "-np", "1", "--flash-attn", "on",
        "--reasoning", "off",
        "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
        "--ctx-size", "65536", "--host", "0.0.0.0",
        "--port", str(PORT), "--main-gpu", "1", "--split-mode", "none",
    ]
    if ncmoe is not None:
        cmd.extend(["-ncmoe", str(ncmoe)])

    start = time.time()
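    # stdout is discarded so an unread pipe can never fill up and stall the server.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    while True:
        line = proc.stderr.readline()
        if "listening" in line.lower():
            return proc, time.time() - start
        # An empty read plus an exit code means the server died (e.g. OOM).
        if line == "" and proc.poll() is not None:
            return None, None
        if time.time() - start > 120:
            proc.kill()
            return None, None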

def run_test(proc, prompt_text):
    payload = {
        "prompt": prompt_text, "max_tokens": 256,
        "temperature": 0.7, "top_p": 0.9, "stream": False,
    }
    start = time.time()
    r = subprocess.run(
        ["curl", "-s", "-X", "POST",
         f"http://localhost:{PORT}/v1/completions",
         "-H", "Content-Type: application/json",
         "-d", json.dumps(payload)],
        capture_output=True, text=True, timeout=120,
    )
    elapsed = time.time() - start
    data = json.loads(r.stdout)
    usage = data.get("usage", {})
    tokens = usage.get("completion_tokens", 0)
    tok_s = round(tokens / elapsed, 1) if elapsed > 0 else 0
    return tok_s, tokens, round(elapsed, 2)

for ncmoe in [None, 25, 30, 35]:
    label = "BASELINE (no -ncmoe)" if ncmoe is None else f"-ncmoe {ncmoe}"
    proc, load_time = start_server(ncmoe)
    if proc is None:
        print(f"{label}: FAILED TO LOAD")
        continue
    time.sleep(3)
    print(f"\n{label} (loaded in {load_time:.1f}s)")
    for pname, ptext in PROMPTS.items():
        tok_s, tokens, elapsed = run_test(proc, ptext)
        print(f"  {pname}: {tok_s} tok/s ({tokens} tok in {elapsed}s)")
    proc.kill()
    time.sleep(2)
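Two parts of the script are fragile: it detects readiness by matching a log line, and it times requests by wall clock, which includes prompt processing and HTTP overhead. The sketch below shows one possible alternative. It assumes llama-server exposes GET /health (200 once the model is loaded) and that the completion response may carry a `timings` object with `predicted_per_second`; both are my understanding of the server's HTTP API, so check them against your build. The helper names are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request

def wait_until_ready(port, timeout=120.0):
    """Poll GET /health until the server answers 200; return False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            url = f"http://localhost:{port}/health"
            with urllib.request.urlopen(url, timeout=2) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or still loading (503)
        time.sleep(0.5)
    return False

def decode_rate(response, elapsed):
    """Prefer server-reported decode speed; fall back to wall-clock tokens/s."""
    timings = response.get("timings") or {}
    if "predicted_per_second" in timings:  # assumed field name
        return round(timings["predicted_per_second"], 1)
    tokens = response.get("usage", {}).get("completion_tokens", 0)
    return round(tokens / elapsed, 1) if elapsed > 0 else 0.0
```

In the script above, `wait_until_ready(PORT)` could replace the stderr loop in start_server, and `decode_rate(data, elapsed)` could replace the tok_s calculation in run_test.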

The Tradeoff

Each 5-layer step in -ncmoe frees ~1.3 GB of VRAM and adds ~1.5 GB to system RAM. The speed cost is 1.4–1.8 tok/s per 5 layers shifted to CPU. Across the three measured points (25 to 35) the relationship is close to linear:

speed ≈ 22.8 - (ncmoe - 25) × 0.32 tok/s
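As a check on the trend, the sketch below fits an ordinary least-squares line to the three averages and the VRAM column from the results table. With only three points this is a description of the measured range, not a model to extrapolate from; `fit_line` is a hand-rolled helper, not a library call.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

NCMOE = [25, 30, 35]
TOK_S = [22.9, 21.1, 19.7]   # averages from the results table
VRAM_GB = [7.1, 5.8, 4.5]    # VRAM column from the results table

for label, ys in (("tok/s", TOK_S), ("VRAM GB", VRAM_GB)):
    slope, intercept = fit_line(NCMOE, ys)
    print(f"{label}: {slope:+.2f} per layer, {intercept + slope * 25:.1f} at -ncmoe 25")
```

On these numbers the fit comes out at roughly -0.32 tok/s and -0.26 GB of VRAM per extra layer of experts moved to the CPU.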

Experts held in system RAM cost extra latency on every token, from slower system memory and from traffic over PCIe between CPU and GPU. But the MoE routing in Qwen 3.6 activates only a small fraction of the experts per token (roughly 3B of the 35B parameters, as the A3B in the name indicates), and attention and the KV cache stay on the GPU, so the hybrid split works well.

Comparison: RTX 3060 vs 3070 Ti

Metric	RTX 3060	RTX 3070 Ti (leftcurve)
VRAM	12 GB	8 GB
CUDA cores	3584	6144
Memory	GDDR6	GDDR6X
`-ncmoe 25`	22.9 tok/s	40.9 tok/s
`-ncmoe 30`	21.1 tok/s	32.5 tok/s
`-ncmoe 35`	19.7 tok/s	27.5 tok/s

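The 3060 has ~42% fewer CUDA cores and lower memory bandwidth than the 3070 Ti, which lines up with the ~44% lower throughput at -ncmoe 25. The gap narrows to ~35% at -ncmoe 30 and ~28% at -ncmoe 35, presumably because more of the work shifts to the CPU side, and the host CPU and RAM differ between the two runs, so this is not a controlled comparison. Still, 22.9 tok/s on a 10-year-old Xeon with a 3060 running a 35B model is a solid outcome.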
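The percentages can be recomputed directly from the comparison table. This is a minimal sketch; `pct_lower` and the `RESULTS` layout are my own naming.

```python
RESULTS = {  # -ncmoe: (RTX 3060 tok/s, RTX 3070 Ti tok/s), from the table above
    25: (22.9, 40.9),
    30: (21.1, 32.5),
    35: (19.7, 27.5),
}

def pct_lower(a, b):
    """How much lower a is than b, as a percentage of b."""
    return (1 - a / b) * 100

print(f"CUDA cores: {pct_lower(3584, 6144):.1f}% fewer")
for ncmoe, (rtx3060, rtx3070ti) in RESULTS.items():
    print(f"-ncmoe {ncmoe}: {pct_lower(rtx3060, rtx3070ti):.1f}% lower throughput")
```

The throughput gap is largest at -ncmoe 25 and shrinks as more expert layers move off the GPU on both cards.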

Why This Matters

-ncmoe effectively turns a VRAM limitation into a memory-bandwidth tradeoff. If you have fast system RAM and a reasonable GPU, you can run models larger than your VRAM (a 13.2 GB model on a 12 GB card here, and on an 8 GB card in leftcurve’s run) with acceptable performance.

For the MoE models coming down the pipeline — and they’ll keep growing — this flag is the difference between “can’t run at all” and “runs at interactive speed.” No model swaps, no context compression, no aggressive quantization to IQ1_S. Just one flag that moves the right weights to the right memory tier.

Written with DeepSeek V4 Pro

AI, Engineering

This post is licensed under CC BY 4.0 by the author.