Validated Qwen v4 Snapshot

The current Qwen result is still slower than HF CPU, but v4 moved the baseline forward.

Disk-LLM is an inspectable disk-backed LLM research kit. After the HF CPU image cleanup and a small runtime caching pass, the latest validated full-model Modal run for Qwen/Qwen3.5-9B is now published here as v4: real plots, real CSV-backed numbers, and a measurable improvement over v3 without hiding the remaining gap.

Run: qwen35-9b-postfix-v4 · 32 layers exercised · 427 tensors touched · resolved SHA c202236
Compared with the prior validated v3 snapshot, Disk-LLM throughput improved by 24.4% at 8 prompt tokens and 31.2% at 128 prompt tokens, while peak RSS fell by about 2.6 GB. HF CPU still wins this matchup.
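The headline percentages can be sanity-checked against the throughput numbers in the comparison table: a stated improvement can be inverted to recover the implied v3 baseline. A quick arithmetic sketch, not project code:

```python
def implied_prior(current: float, improvement_pct: float) -> float:
    """Back out the earlier value implied by a stated percentage improvement."""
    return current / (1 + improvement_pct / 100)

# v4 Disk-LLM throughputs are taken from the comparison table in this post
v3_short = implied_prior(0.0183, 24.4)    # implied v3 tokens/s at 8 prompt tokens
v3_long = implied_prior(0.00170, 31.2)    # implied v3 tokens/s at 128 prompt tokens
```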
Snapshot Metrics

The packed artifact is stable, and the runtime path is improving, but it still needs more work.

The storage-side story is steady: the pack is still compact enough to inspect and the full-model path is still honest. What changed in v4 is that the validated baseline moved in the right direction without changing the project's native NumPy memmap identity.

Current validated baseline
Packed tensors: 427 text tensors kept in the current Qwen pack
Packed shards: 34 layer-oriented shards in the memmap layout
Footprint: 16.68 GiB packed on disk
Runtime path: 32 layers executed in the validated benchmark
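The "NumPy memmap identity" mentioned above means tensors are served as memory-mapped views into packed shards rather than read eagerly. A minimal sketch of that access pattern, where the shard path, offset, and shape arguments are illustrative assumptions, not Disk-LLM's real manifest format:

```python
import numpy as np

# Hypothetical illustration of memmap-backed access: the caller supplies the
# shard file, byte offset, and tensor shape (in Disk-LLM these would come
# from a manifest; the names here are made up for the sketch).
def load_tensor(shard_path, offset, shape, dtype=np.float16):
    """Map a tensor view out of a packed shard without reading it eagerly."""
    return np.memmap(shard_path, mode="r", dtype=dtype, offset=offset, shape=shape)
```

The point of the pattern is that opening the view costs almost nothing; bytes are only faulted in from disk as the runtime actually touches them.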
Real Plots

The public visuals now come from the tracked v4 CSV bundle, rendered again for the site with seaborn.

These plots come from the tracked result bundle at modal-results-postfix/qwen35-9b-postfix-v4. They are the current evidence layer for the project and should still be read as a research snapshot, not a victory screen.
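As a rough sketch of how a plot like the throughput chart can be regenerated from the tracked numbers (the rows below are transcribed from the comparison table in this post; the column names are assumptions and the CSV bundle's real schema may differ):

```python
import pandas as pd

# Rows transcribed from the v4 comparison table; column names are assumed.
rows = [
    {"prompt": 8,   "backend": "Disk-LLM", "tokens_per_s": 0.0183},
    {"prompt": 8,   "backend": "HF CPU",   "tokens_per_s": 0.1646},
    {"prompt": 128, "backend": "Disk-LLM", "tokens_per_s": 0.00170},
    {"prompt": 128, "backend": "HF CPU",   "tokens_per_s": 0.0795},
]
df = pd.DataFrame(rows)

# With seaborn installed, the grouped bar chart is then a one-liner:
#   import seaborn as sns
#   sns.barplot(data=df, x="prompt", y="tokens_per_s", hue="backend")
```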

Qwen postfix v4 throughput plot
Throughput improved versus v3, but Disk-LLM still trails the HF CPU reference at both prompt lengths in this validated run.
Qwen postfix v4 first-token latency plot
First-token latency remains the dominant pain point, though v4 shaved meaningful time off both prompt cases compared with v3.
Qwen postfix v4 logical mapped bytes plot
Logical mapped bytes remain a distinctive Disk-LLM telemetry surface. They show how much of the storage path is being touched, even though they should not be read as resident RAM.
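The mapped-versus-resident distinction is easy to demonstrate in miniature. This is not Disk-LLM's telemetry code, just a self-contained sketch of why a mapping is counted long before its pages are resident:

```python
import mmap
import os
import tempfile

# Create a 1 MiB backing file to stand in for a packed shard.
path = os.path.join(tempfile.mkdtemp(), "shard.bin")
with open(path, "wb") as f:
    f.write(b"\0" * (1 << 20))

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    logical_mapped = len(m)   # counted as soon as the mapping exists
    first_byte = m[0]         # touching a page is what actually faults it in
    m.close()
```

Logical mapped bytes grow with every mapping like this, while RSS only grows as pages are touched, which is why the two plots tell different stories.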
Qwen postfix v4 RSS timeline plot
The full-model Disk-LLM baseline now sits lower in RSS than v3, but it still peaks above the HF CPU reference on this Modal setup.
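One portable way to capture the peak-RSS figure such a timeline summarizes (an assumption about tooling, not necessarily how Disk-LLM records it) is the process high-water mark from getrusage:

```python
import resource
import sys

# ru_maxrss is the process's peak resident set size: KiB on Linux, bytes on
# macOS, so normalize before comparing against GB figures like those above.
raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
peak_rss_bytes = raw if sys.platform == "darwin" else raw * 1024
```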
Comparison Table

v4 is better than v3, but the honest reading is still that Qwen is not yet competitive here.

The old checked-in postfix bundle looked much better, but it was not a trustworthy full-model comparison. This table reflects the current validated path instead.

Prompt lengths 8 and 128, generating 2 tokens.

| Prompt | Backend | Tokens/s | First token | Peak RSS | Logical mapped | Notes |
|---|---|---|---|---|---|---|
| 8 | Disk-LLM | 0.0183 | 88.501 s | 21.87 GB | 170,780 MB | v4 improves on v3, but HF CPU is still much faster |
| 8 | HF CPU | 0.1646 | 4.773 s | 19.40 GB | - | Current reference |
| 128 | Disk-LLM | 0.00170 | 1157.395 s | 21.89 GB | 2,220,144 MB | Better than v3, but prompt scaling is still the main pain point |
| 128 | HF CPU | 0.0795 | 17.142 s | 19.41 GB | - | Current reference |
Research Notes

Why publish this anyway?

Disk-LLM is a research repo. The point is not only to show flattering artifacts; it is to make the storage path, the runtime path, and the benchmark truth visible enough that improvements can be measured honestly.

Evidence over optics

The most useful finding from this rerun is that v4 is now a better full-model baseline than v3 without changing the project's core identity. The repo can now show both things at once: the current comparison is still unfavorable, and the direction of travel is finally measurable.