Validated Workflow

Use the pipeline. Publish after the audit.

Disk-LLM works best when inspection, conversion, generation, benchmarking, and review are treated as one chain. The project now enforces that more clearly by refusing to save benchmark rows when telemetry shows a zero-layer execution path.

Current Qwen 3.5 note: The repo now reads the nested text_config correctly and resolves model.language_model.layers.*. The remaining runtime milestone is a native linear_attention adapter; until it lands, no new Disk-LLM vs HF benchmark should be treated as final.
01 / Install

Base package first, then the optional extras.

Keep the core importable with minimal dependencies, then layer in Hugging Face, plotting, demo, and testing support when you need them.

Python 3.11+
pip install -e .

pip install -e ".[hf,demo,test,bench]"
HF baseline note: A CPU PyTorch build is required if you want to run the `hf_cpu` comparison backend.
Local environment note: The test suite stays importable even when optional runtime packages are missing, which is useful for source inspection and CLI work.
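One common way to keep a package importable without its optional extras is to resolve heavy dependencies lazily and degrade to `None` when they are absent. A minimal sketch of that pattern (the helper name `optional_import` is illustrative, not Disk-LLM's actual internal API):

```python
import importlib


def optional_import(name: str):
    """Return the named module if installed, else None.

    Callers check the result and raise a clear error only when the
    optional feature is actually used, so plain imports stay cheap.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


json_mod = optional_import("json")                 # stdlib: always available
missing = optional_import("not_a_real_pkg_xyz")    # absent: degrades to None
```

This is why `pip install -e .` alone is enough for inspection and CLI work: features backed by the `[hf]` or `[demo]` extras only fail at the point of use, with an actionable message, rather than at import time.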
02 / Local Workflow

Inspect, convert, verify, then generate.

The repo is designed so you can understand the model before you ever benchmark it.

CLI first
disk-llm inspect --source-dir /path/to/Qwen3.5-9B

disk-llm convert /path/to/Qwen3.5-9B ./packed-qwen35

disk-llm inspect --manifest ./packed-qwen35/manifest.json

disk-llm generate ./packed-qwen35/manifest.json \
  --prompt "Explain disk-backed inference in one paragraph."
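The second `inspect` call above is the verification step: it confirms the packed manifest describes the model you think it does before you spend time generating. If you want the same check programmatically, a sketch like the following works; note that the manifest keys used here (`num_layers`, `tensors`) are hypothetical placeholders, so match them to whatever your `convert` step actually writes:

```python
import json
from pathlib import Path


def summarize_manifest(path: str) -> dict:
    """Return a quick sanity summary of a packed-model manifest.

    The keys read here are illustrative; inspect the manifest.json
    your own convert run produced and adjust accordingly.
    """
    manifest = json.loads(Path(path).read_text())
    return {
        "num_layers": manifest.get("num_layers", 0),
        "tensor_count": len(manifest.get("tensors", [])),
    }
```

A zero in either field is the kind of red flag the benchmark guard described below in section 03 exists to catch.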
03 / Benchmarking

Benchmarks are part of the codebase, not an afterthought.

The benchmark harness exports repeatable CSVs, RSS timelines, and comparison plots. It also now refuses to save misleading zero-layer Disk-LLM runs.

CSV + plots + guardrails
python scripts/benchmark.py ./packed-qwen35/manifest.json \
  --prompt "Explain disk-backed inference in one paragraph." \
  --tokenizer /path/to/Qwen3.5-9B \
  --backends disk_llm,hf_cpu \
  --hf-model /path/to/Qwen3.5-9B \
  --prompt-lengths 8,64,256,512 \
  --max-new-tokens 16 \
  --runs 3 \
  --output-dir ./benchmark-results/qwen35-cpu

python scripts/plot_results.py ./benchmark-results/qwen35-cpu
Artifacts written: benchmark_runs.csv, benchmark_summary.csv, memory_timeline.csv, benchmark_metadata.json, plots, and a Markdown summary.
Why the new guard matters: If the config expects real layers but telemetry reports that none executed, the benchmark now fails instead of writing rows that look valid on the surface.
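The essence of that guard fits in a few lines. This is an illustrative sketch only; the real check lives in the benchmark harness and may compare richer telemetry than a single layer counter:

```python
def check_layer_telemetry(expected_layers: int, executed_layers: int) -> None:
    """Refuse to record a benchmark row when no transformer layers ran.

    A zero-layer run usually means the model loaded but the forward
    pass silently skipped the real compute, so its timings are
    meaningless as a Disk-LLM vs HF comparison.
    """
    if expected_layers > 0 and executed_layers == 0:
        raise RuntimeError(
            f"config expects {expected_layers} layers but telemetry saw 0; "
            "refusing to save a misleading benchmark row"
        )
```

Failing loudly here is deliberate: a missing CSV row prompts investigation, while a plausible-looking row with bogus timings tends to get published.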
05 / Archived Artifact

Keep the old run visible, but frame it honestly.

The archived Modal run remains useful because it proves the pipeline ran end to end. The figures below are redesigned from the archived numbers, but they should still be treated as pre-fix audit evidence until the Qwen 3.5 linear-attention runtime path is implemented and rerun.

Archive, not final claim
Archived throughput audit figure
Redesigned throughput audit based on the archived Modal artifact bundle.
Archived first-token latency audit figure
Redesigned latency audit retained for the eventual post-fix comparison rerun.