Skip to content

perf(tableau-sum): 5.04x faster GeneralizedTableauSum build (lazy branch materialization + faster fingerprinting)#157

Open
david-pl wants to merge 26 commits into
mainfrom
autotune/tableau-sum-build
Open

perf(tableau-sum): 5.04x faster GeneralizedTableauSum build (lazy branch materialization + faster fingerprinting)#157
david-pl wants to merge 26 commits into
mainfrom
autotune/tableau-sum-build

Conversation

@david-pl

Copy link
Copy Markdown
Collaborator

Summary

Autotune session optimizing GeneralizedTableauSum build time on the msd-noisy workload (magic-state-distillation circuit with single-qubit depolarizing + loss noise).

Build time: 2620 ms → 520 ms — a 5.04× speedup. Sampling throughput is unchanged (~22 µs/shot) and the result is bit-identical: branch count stays 2025, sum_p2 = 0.725135705447, and the top-5 probabilities match to the recorded reference. All changes live in crates/ppvm-tableau-sum/src/.

What landed

# Change Build Step Cumulative
0 baseline 2620 ms 1.0×
1 Lazy branch materialization — noise no longer deep-clones a ~7 KB tableau per branch; it derives each branch's fingerprint cheaply and only clones survivors (~85% of branches merge/drop) 958 ms 2.73× 2.73×
2 Bulk word-hash — gather a row's words once + a single gxhash64, instead of ~340 tiny hash writes 573 ms 1.67× 4.57×
3 Precompute per-row masks — build the splitmix mask table once per op, not per-row-per-entry 552 ms 1.04× 4.75×
4 Direct-word column reads — replace bitvec per-bit indexing with direct storage-word access in the noise hot loops 520 ms 1.04× 5.04×

The dominant lever was #1: profiling showed fork/clone was 47% of build (32% raw memmove); after lazy materialization it dropped to ~0.1%.

Detail on #1 (lazy materialization)

GeneralizedTableau X/Y/Z gates only flip per-row sign bits and leave words/coefficients/is_lost identical to the parent; loss only sets one is_lost bit. So a branch's fingerprint is derivable from the parent without cloning. This PR adds BranchMutation + structurally_equal_mutated + apply_branch_mutation in storage/mod.rs, and an insert_or_merge_mutated_branches trait method (VecStorage does the lazy/index-based path; MapStorage keeps an eager-clone fallback). loss_channel and pauli_error emit virtual branches and only clone the survivors. The two-qubit / correlated-loss / reset-loss paths are left eager (not exercised by msd-noisy).

One discarded experiment (kept for the record)

A scalar single-pass word hash (replacing the gather + gxhash) regressed to 921 ms and was reverted. This confirms the fingerprint rebuild is SIMD-throughput-bound — full re-hashing via gxhash is near-optimal, so the only remaining lever is to avoid re-hashing entirely (incremental/Zobrist fingerprinting). That is not in this PR.

Where the time goes now (520 ms plateau)

  • fingerprint rebuild ~37% (word-hash gather 22% + phase_loss_hash 14.5%) — proven SIMD-compute-bound
  • noise branch-building ~21%
  • inherent per-entry gate application ~23% (cz 10.5%, sqrt_* 12.7%)

The next lever would be cell-level incremental ("Zobrist") fingerprinting maintained through the Clifford gates to skip the rebuild (~10–20% est.). It is deliberately not included — it's a large, clever change (new hash scheme + per-gate delta logic incl. cz cross-terms) that runs against the repo's "readability over cleverness" preference, for a modest further gain.

Testing

  • cargo test -p ppvm-tableau-sum — all crate tests pass
  • Accuracy is guarded by test_word_fingerprint_cache_stays_consistent plus the branches == 2025 invariant in the bench harness, so any fingerprint drift fails loudly.

Notes for reviewers

  • crates/ppvm-tableau-sum/examples/msd-noisy-bench.rs is the deterministic, seeded measurement harness used throughout (median-of-5 build time + accuracy fingerprint). It's included so the numbers are reproducible; drop it if you'd rather not carry a bench example.
  • The session's working notes (per-iteration log, metric ledger, prompt records) were kept out of the PR — they're summarized above.

🤖 Generated with Claude Code

david-pl and others added 22 commits June 23, 2026 20:31
loss_channel and single-qubit pauli_error now describe each branch as a
BranchMutation of a parent entry instead of deep-cloning a tableau up
front. The merge resolves structural identity against the virtual
(parent + mutation) tableau via structurally_equal_mutated and only
clones the parent when the branch survives as a new entry; merges and
below-cutoff drops never clone. The VecStorage path is the optimized
one; MapStorage materializes parents eagerly (correctness only).

Math is unchanged: msd-noisy benchmark still ends at exactly 2025
branches with sum_p = 1.0, and all 76 ppvm-tableau-sum tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ltas

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xhash)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…compare

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-iteration log, metric ledger, and prompt records were working
notes for the tuning session; they are summarized in the PR description
rather than checked in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 24, 2026 06:53
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://QuEraComputing.github.io/ppvm/pr-preview/pr-157/

Built to branch gh-pages at 2026-06-25 08:06 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR speeds up ppvm-tableau-sum’s GeneralizedTableauSum build phase by avoiding deep clones during single-qubit noise branching and by reducing fingerprinting overhead, while aiming to keep sampling behavior unchanged.

Changes:

  • Add lazy branch materialization via BranchMutation + storage support for merging “virtual” branches without cloning unless they survive as new entries.
  • Speed up fingerprint maintenance by bulk-hashing row word bytes and precomputing per-row/qubit phase/loss masks (RowMasks), plus faster per-column bit reads in noise hot loops.
  • Add a deterministic msd-noisy timing/accuracy harness example for reproducible benchmarking.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
crates/ppvm-tableau-sum/src/storage/vec.rs Implements lazy mutated-branch merging for VecStorage and reduces per-entry hash rebuild overhead with shared RowMasks.
crates/ppvm-tableau-sum/src/storage/mod.rs Adds BranchMutation, lazy structural equality checks, bulk word hashing, RowMasks, and fast bit reads used by noise and storage merging.
crates/ppvm-tableau-sum/src/storage/map.rs Adds a correctness-first fallback implementation of mutated-branch insertion for MapStorage and uses shared RowMasks when iterating.
crates/ppvm-tableau-sum/src/storage/entry_store.rs Extends EntryStore with an API for inserting/merging lazily-described mutated branches.
crates/ppvm-tableau-sum/src/noise.rs Switches loss_channel/pauli_error to emit virtual branches (parent index + mutation) and compute phase/loss deltas without cloning.
crates/ppvm-tableau-sum/examples/msd-noisy-bench.rs Adds a deterministic performance harness intended to guard against math-changing regressions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/ppvm-tableau-sum/src/storage/mod.rs
Comment thread crates/ppvm-tableau-sum/src/storage/mod.rs
Comment on lines +51 to +60
/// View a `Copy` plain-old-data value's bytes. Sound because `A: PauliStorage`
/// implies `bytemuck::Pod`: no padding, every bit pattern valid, so the bytes
/// are fully initialized and `u8`-aligned.
#[inline]
fn pod_bytes<A: Copy>(value: &A) -> &[u8] {
// SAFETY: `A` is POD (PauliStorage: bytemuck::Pod); reading its
// `size_of::<A>()` initialized bytes as `[u8]` is sound, and the borrow is
// tied to `value`.
unsafe { std::slice::from_raw_parts(value as *const A as *const u8, std::mem::size_of::<A>()) }
}
Comment on lines +318 to +323
BranchMutation::Pauli { op, addr0 } => match op {
1 => tab.x(addr0),
2 => tab.y(addr0),
3 => tab.z(addr0),
_ => {}
},
Comment thread crates/ppvm-tableau-sum/src/noise.rs
Comment on lines +233 to +241
// Accuracy guard: the optimizations under test must not change the math,
// so the final branch count must stay at the baseline value.
const EXPECTED_BRANCHES: usize = 2025;
if branches != EXPECTED_BRANCHES {
eprintln!(
"WARNING: branch count {} != baseline {} — accuracy/structure changed!",
branches, EXPECTED_BRANCHES
);
}
Comment on lines +12 to +16
use ppvm_pauli_sum::config::fx64hash::Byte8F64;
use ppvm_tableau::prelude::*;
use ppvm_tableau_sum::data::GeneralizedTableauSum;
use ppvm_tableau_sum::storage::EntryStore;

@david-pl david-pl marked this pull request as draft June 24, 2026 06:59
Replace the hand-rolled `unsafe pod_bytes` byte view with
`bytemuck::bytes_of`. `PauliStorage` already requires `bytemuck::Pod`,
so the byte view is sound without `unsafe`, matching the existing idiom
in `PauliWord::rehash`. Identical codegen (same pointer cast), so build
time and accuracy are unchanged (branches=2025, sum_p2 bit-identical).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@david-pl david-pl force-pushed the autotune/tableau-sum-build branch from 2319212 to e35ddd9 Compare June 24, 2026 07:23
@Roger-luo

Copy link
Copy Markdown
Collaborator

@david-pl I want to merge this pr so I can start refactor the trait system

@Roger-luo Roger-luo marked this pull request as ready for review June 24, 2026 17:02
Copilot AI review requested due to automatic review settings June 24, 2026 17:02

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

The depolarizing-branch op was a `u8` (1=X, 2=Y, 3=Z), so both matches
on it carried a dead `_` catch-all that silently ignored invalid ops.
Reuse the existing `ppvm_pauli_word::pattern::NotIdentity` enum (X/Y/Z)
instead, which makes `apply_branch_mutation` and the
`structurally_equal_mutated` flip rule exhaustive with no catch-all —
the invalid state is now unrepresentable.

Promotes `NotIdentity` from `pub(crate)` to `pub` and re-exports it from
the `pattern` module. Matching is by variant name, so the enum's
`X=1, Z=2, Y=3` discriminants don't affect behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@david-pl

Copy link
Copy Markdown
Collaborator Author

@Roger-luo I went through it again and I think it's in an okay shape. So, fine to merge since you want to run cleanup anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants