Validated Qwen v4 Snapshot

The current Qwen result is still slower than HF CPU, but v4 moved the baseline forward.

Disk-LLM is an inspectable disk-backed LLM research kit. After the HF CPU image cleanup and a small runtime caching pass, the latest validated full-model Modal run for Qwen/Qwen3.5-9B is now published here as v4: real plots, real CSV-backed numbers, and a measurable improvement over v3 without hiding the remaining gap.

Run: qwen35-9b-postfix-v4 · 32 layers exercised · 427 tensors touched · resolved SHA c202236
Compared with the prior validated v3 snapshot, Disk-LLM throughput improved by 24.4% at 8 prompt tokens and 31.2% at 128 prompt tokens, while peak RSS fell by about 2.6 GB. HF CPU still wins this matchup.
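The headline percentages can be sanity-checked against the throughput numbers in the comparison table: a stated improvement can be inverted to recover the implied v3 baseline. A quick arithmetic sketch, not project code:

```python
def implied_prior(current: float, improvement_pct: float) -> float:
    """Back out the earlier value implied by a stated percentage improvement."""
    return current / (1 + improvement_pct / 100)

# v4 Disk-LLM throughputs are taken from the comparison table in this post
v3_short = implied_prior(0.0183, 24.4)    # implied v3 tokens/s at 8 prompt tokens
v3_long = implied_prior(0.00170, 31.2)    # implied v3 tokens/s at 128 prompt tokens
```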
Snapshot Metrics

The packed artifact is stable, and the runtime path is improving, but it still needs more work.

The storage-side story is steady: the pack is still compact enough to inspect and the full-model path is still honest. What changed in v4 is that the validated baseline moved in the right direction without changing the project's native NumPy memmap identity.

Current validated baseline
Packed tensors: 427 text tensors kept in the current Qwen pack
Packed shards: 34 layer-oriented shards in the memmap layout
Footprint: 16.68 GiB packed on disk
Runtime path: 32 layers executed in the validated benchmark
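The "NumPy memmap identity" mentioned above means tensors are served as memory-mapped views into packed shards rather than read eagerly. A minimal sketch of that access pattern, where the shard path, offset, and shape arguments are illustrative assumptions, not Disk-LLM's real manifest format:

```python
import numpy as np

# Hypothetical illustration of memmap-backed access: the caller supplies the
# shard file, byte offset, and tensor shape (in Disk-LLM these would come
# from a manifest; the names here are made up for the sketch).
def load_tensor(shard_path, offset, shape, dtype=np.float16):
    """Map a tensor view out of a packed shard without reading it eagerly."""
    return np.memmap(shard_path, mode="r", dtype=dtype, offset=offset, shape=shape)
```

The point of the pattern is that opening the view costs almost nothing; bytes are only faulted in from disk as the runtime actually touches them.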
Real Plots

The public visuals now come from the tracked v4 CSV bundle, rendered again for the site with seaborn.

These plots come from the tracked result bundle at modal-results-postfix/qwen35-9b-postfix-v4. They are the current evidence layer for the project and should still be read as a research snapshot, not a victory screen.
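As a rough sketch of how a plot like the throughput chart can be regenerated from the tracked numbers (the rows below are transcribed from the comparison table in this post; the column names are assumptions and the CSV bundle's real schema may differ):

```python
import pandas as pd

# Rows transcribed from the v4 comparison table; column names are assumed.
rows = [
    {"prompt": 8,   "backend": "Disk-LLM", "tokens_per_s": 0.0183},
    {"prompt": 8,   "backend": "HF CPU",   "tokens_per_s": 0.1646},
    {"prompt": 128, "backend": "Disk-LLM", "tokens_per_s": 0.00170},
    {"prompt": 128, "backend": "HF CPU",   "tokens_per_s": 0.0795},
]
df = pd.DataFrame(rows)

# With seaborn installed, the grouped bar chart is then a one-liner:
#   import seaborn as sns
#   sns.barplot(data=df, x="prompt", y="tokens_per_s", hue="backend")
```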

Qwen postfix v4 throughput plot
Throughput improved versus v3, but Disk-LLM still trails the HF CPU reference at both prompt lengths in this validated run.
Qwen postfix v4 first-token latency plot
First-token latency remains the dominant pain point, though v4 shaved meaningful time off both prompt cases compared with v3.
Qwen postfix v4 logical mapped bytes plot
Logical mapped bytes remain a distinctive Disk-LLM telemetry surface. They show how much of the storage path is being touched, even though they should not be read as resident RAM.
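The mapped-versus-resident distinction is easy to demonstrate in miniature. This is not Disk-LLM's telemetry code, just a self-contained sketch of why a mapping is counted long before its pages are resident:

```python
import mmap
import os
import tempfile

# Create a 1 MiB backing file to stand in for a packed shard.
path = os.path.join(tempfile.mkdtemp(), "shard.bin")
with open(path, "wb") as f:
    f.write(b"\0" * (1 << 20))

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    logical_mapped = len(m)   # counted as soon as the mapping exists
    first_byte = m[0]         # touching a page is what actually faults it in
    m.close()
```

Logical mapped bytes grow with every mapping like this, while RSS only grows as pages are touched, which is why the two plots tell different stories.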
Qwen postfix v4 RSS timeline plot
The full-model Disk-LLM baseline now sits lower in RSS than v3, but it still peaks above the HF CPU reference on this Modal setup.
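One portable way to capture the peak-RSS figure such a timeline summarizes (an assumption about tooling, not necessarily how Disk-LLM records it) is the process high-water mark from getrusage:

```python
import resource
import sys

# ru_maxrss is the process's peak resident set size: KiB on Linux, bytes on
# macOS, so normalize before comparing against GB figures like those above.
raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
peak_rss_bytes = raw if sys.platform == "darwin" else raw * 1024
```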
Comparison Table

v4 is better than v3, but the honest reading is still that Qwen is not yet competitive here.

The old checked-in postfix bundle looked much better, but it was not a trustworthy full-model comparison. This table reflects the current validated path instead.

Prompt lengths 8 and 128, generating 2 tokens.

| Prompt | Backend | Tokens/s | First token | Peak RSS | Logical mapped | Notes |
|---|---|---|---|---|---|---|
| 8 | Disk-LLM | 0.0183 | 88.501 s | 21.87 GB | 170,780 MB | v4 improves on v3, but HF CPU is still much faster |
| 8 | HF CPU | 0.1646 | 4.773 s | 19.40 GB | - | Current reference |
| 128 | Disk-LLM | 0.00170 | 1157.395 s | 21.89 GB | 2,220,144 MB | Better than v3, but prompt scaling is still the main pain point |
| 128 | HF CPU | 0.0795 | 17.142 s | 19.41 GB | - | Current reference |
Research Notes

Why publish this anyway?

Disk-LLM is a research repo. The point is not only to show flattering artifacts; it is to make the storage path, the runtime path, and the benchmark truth visible enough that improvements can be measured honestly.

Evidence over optics

The most useful finding from this rerun is that v4 is now a better full-model baseline than v3 without changing the project's core identity. The repo can now show both things at once: the current comparison is still unfavorable, and the direction of travel is finally measurable.