Validated Workflow

Use the pipeline. Publish after the audit.

Disk-LLM works best when inspection, conversion, generation, benchmarking, and review are treated as one chain. The project now enforces that more clearly by refusing to save benchmark rows when telemetry shows a zero-layer execution path.

Current Qwen 3.5 note: The repo now reads the nested text_config correctly and resolves model.language_model.layers.*. The remaining runtime milestone is a native linear_attention adapter; until it lands, no new Disk-LLM vs HF benchmark should be treated as final.
01 / Install

Base package first, then the optional extras.

Keep the core importable with minimal dependencies, then layer in Hugging Face, plotting, demo, and testing support when you need them.

Python 3.11+
pip install -e .

pip install -e ".[hf,demo,test,bench]"
HF baseline note: A CPU PyTorch build is required if you want to run the `hf_cpu` comparison backend.
Local environment note: The test suite stays importable even when optional runtime packages are missing, which is useful for source inspection and CLI work.
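One common way to keep a package importable without its optional extras is to resolve heavy dependencies lazily and degrade to `None` when they are absent. A minimal sketch of that pattern (the helper name `optional_import` is illustrative, not Disk-LLM's actual internal API):

```python
import importlib


def optional_import(name: str):
    """Return the named module if installed, else None.

    Callers check the result and raise a clear error only when the
    optional feature is actually used, so plain imports stay cheap.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


json_mod = optional_import("json")                 # stdlib: always available
missing = optional_import("not_a_real_pkg_xyz")    # absent: degrades to None
```

This is why `pip install -e .` alone is enough for inspection and CLI work: features backed by the `[hf]` or `[demo]` extras only fail at the point of use, with an actionable message, rather than at import time.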
02 / Local Workflow

Inspect, convert, verify, then generate.

The repo is designed so you can understand the model before you ever benchmark it.

CLI first
disk-llm inspect --source-dir /path/to/Qwen3.5-9B

disk-llm convert /path/to/Qwen3.5-9B ./packed-qwen35

disk-llm inspect --manifest ./packed-qwen35/manifest.json

disk-llm generate ./packed-qwen35/manifest.json \
  --prompt "Explain disk-backed inference in one paragraph."
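The second `inspect` call above is the verification step: it confirms the packed manifest describes the model you think it does before you spend time generating. If you want the same check programmatically, a sketch like the following works; note that the manifest keys used here (`num_layers`, `tensors`) are hypothetical placeholders, so match them to whatever your `convert` step actually writes:

```python
import json
from pathlib import Path


def summarize_manifest(path: str) -> dict:
    """Return a quick sanity summary of a packed-model manifest.

    The keys read here are illustrative; inspect the manifest.json
    your own convert run produced and adjust accordingly.
    """
    manifest = json.loads(Path(path).read_text())
    return {
        "num_layers": manifest.get("num_layers", 0),
        "tensor_count": len(manifest.get("tensors", [])),
    }
```

A zero in either field is the kind of red flag the benchmark guard described below in section 03 exists to catch.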
03 / Benchmarking

Benchmarks are part of the codebase, not an afterthought.

The benchmark harness exports repeatable CSVs, RSS timelines, and comparison plots. It also now refuses to save misleading zero-layer Disk-LLM runs.

CSV + plots + guardrails
python scripts/benchmark.py ./packed-qwen35/manifest.json \
  --prompt "Explain disk-backed inference in one paragraph." \
  --tokenizer /path/to/Qwen3.5-9B \
  --backends disk_llm,hf_cpu \
  --hf-model /path/to/Qwen3.5-9B \
  --prompt-lengths 8,64,256,512 \
  --max-new-tokens 16 \
  --runs 3 \
  --output-dir ./benchmark-results/qwen35-cpu

python scripts/plot_results.py ./benchmark-results/qwen35-cpu
Artifacts written: benchmark_runs.csv, benchmark_summary.csv, memory_timeline.csv, benchmark_metadata.json, plots, and a Markdown summary.
Why the new guard matters: If the config expects real layers but telemetry reports that none executed, the benchmark now fails instead of writing rows that look valid on the surface.
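The essence of that guard fits in a few lines. This is an illustrative sketch only; the real check lives in the benchmark harness and may compare richer telemetry than a single layer counter:

```python
def check_layer_telemetry(expected_layers: int, executed_layers: int) -> None:
    """Refuse to record a benchmark row when no transformer layers ran.

    A zero-layer run usually means the model loaded but the forward
    pass silently skipped the real compute, so its timings are
    meaningless as a Disk-LLM vs HF comparison.
    """
    if expected_layers > 0 and executed_layers == 0:
        raise RuntimeError(
            f"config expects {expected_layers} layers but telemetry saw 0; "
            "refusing to save a misleading benchmark row"
        )
```

Failing loudly here is deliberate: a missing CSV row prompts investigation, while a plausible-looking row with bogus timings tends to get published.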
05 / Archived Artifact

Keep the old run visible, but frame it honestly.

The archived Modal run remains useful because it proves the pipeline ran end to end. The figures below are redesigned from the archived numbers, but they should still be treated as pre-fix audit evidence until the Qwen 3.5 linear-attention runtime path is implemented and rerun.

Archive, not final claim
Archived throughput audit figure
Redesigned throughput audit based on the archived Modal artifact bundle.
Archived first-token latency audit figure
Redesigned latency audit retained for the eventual post-fix comparison rerun.