[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos by JulianCloudNTH · Pull Request #20086 · pytorch/executorch

JulianCloudNTH · 2026-06-06T07:14:57Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
-> [ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086

Adds the fused sdpa_with_kv_cache op (QK attention-weights, softmax, attention-output sub-kernels over the KV cache), composing the three enablers below it: the base graph's inter-dispatch buffer passing (scratch buffers + multi-pass execute), the update_cache op, and the SymInt live-scalar mechanism. The QK/softmax/AV kernels mirror the Vulkan reference's flat-index/GQA/causal-mask math (NCHW, buffer-only, fp32).
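To make the three sub-kernels concrete, here is a minimal CPU sketch of the QK, softmax, and attention-output passes with GQA and a causal mask. It is not the PR's WGSL code: the function name `sdpa_reference`, the flat layouts (`q` and `out` as `[S, Hq, D]`, caches as `[Cmax, Hkv, D]`, batch size 1), and the `1/sqrt(D)` scale are assumptions chosen for illustration. Only the attention-weights layout `[Hq, S, context_len]` and the mask condition `c ≤ s + input_pos` come from the review below.

```python
import math

NEG_INF = float("-inf")


def sdpa_reference(q, k_cache, v_cache, S, Hq, Hkv, D, input_pos):
    """Three-pass attention over flat Python lists (batch size 1).

    Assumed layouts (hypothetical): q/out are [S, Hq, D], caches are
    [Cmax, Hkv, D], and the new tokens are already written into the caches.
    """
    ctx = S + input_pos          # context_len: cached tokens plus new tokens
    group = Hq // Hkv            # query heads that share one KV head (GQA)
    scale = 1.0 / math.sqrt(D)   # assumed scaling factor

    # Pass 1 (QK): attn_weights[h, s, c] at flat index h*S*ctx + s*ctx + c.
    aw = [0.0] * (Hq * S * ctx)
    for h in range(Hq):
        kvh = h // group
        for s in range(S):
            for c in range(ctx):
                if c > s + input_pos:
                    val = NEG_INF  # causal mask: position c is in the future
                else:
                    val = scale * sum(
                        q[(s * Hq + h) * D + d] * k_cache[(c * Hkv + kvh) * D + d]
                        for d in range(D)
                    )
                aw[h * S * ctx + s * ctx + c] = val

    # Pass 2 (softmax): one row of width ctx per (head, query position).
    for row in range(Hq * S):
        base = row * ctx
        row_max = max(aw[base:base + ctx])
        exps = [math.exp(x - row_max) for x in aw[base:base + ctx]]
        row_sum = sum(exps)
        for c in range(ctx):
            aw[base + c] = exps[c] / row_sum if row_sum > 0.0 else 0.0

    # Pass 3 (AV): weighted sum of cached values, written as [S, Hq, D].
    out = [0.0] * (S * Hq * D)
    for s in range(S):
        for h in range(Hq):
            kvh = h // group
            aw_base = h * S * ctx + s * ctx
            for d in range(D):
                out[(s * Hq + h) * D + d] = sum(
                    aw[aw_base + c] * v_cache[(c * Hkv + kvh) * D + d]
                    for c in range(ctx)
                )
    return out
```

With `S=2`, `input_pos=0`, one head, and `D=1`, the first query row can only see cache position 0 while the second sees both positions, which is the causal behaviour the kernels are described as implementing.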

input_pos is consumed dynamically via the SymInt mechanism: the op reads symint_buffer() as a uniform, sizes its scratch + dispatches for the max context length, and registers a resize hook so a single delegate runs an autoregressive decode loop (feed only the new token + advancing input_pos) instead of a fixed baked position. Mirrors the Vulkan SymInt = live uniform-buffer design.
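The live-scalar plus resize-hook idea can be sketched on the CPU as follows. Everything here is hypothetical scaffolding, not the ExecuTorch API: `LiveScalar`, `SdpaDispatchPlan`, `flush`, and `run_decode` are illustrative names, and the element counts follow the sizes quoted in the review below (QK covers Hq·S·context_len elements, softmax Hq·S rows, AV S·Hq·D elements).

```python
class LiveScalar:
    """Hypothetical stand-in for a SymInt backed by a uniform buffer."""

    def __init__(self, value):
        self.value = value
        self.dirty = False
        self.hooks = []

    def set(self, value):
        # Only a changed value marks the scalar dirty.
        if value != self.value:
            self.value = value
            self.dirty = True

    def flush(self):
        # Run the registered resize hooks once per change.
        if self.dirty:
            for hook in self.hooks:
                hook(self.value)
            self.dirty = False


class SdpaDispatchPlan:
    """Tracks the sizes that do and do not depend on context_len."""

    def __init__(self, Hq, S, D, input_pos):
        self.Hq, self.S, self.D = Hq, S, D
        self.softmax_rows = Hq * S   # independent of context_len
        self.av_elems = S * Hq * D   # independent of context_len
        self.hook_calls = 0
        self._apply(input_pos.value)  # bake the build-time position
        input_pos.hooks.append(self.on_resize)

    def _apply(self, pos):
        self.context_len = self.S + pos
        self.qk_elems = self.Hq * self.S * self.context_len

    def on_resize(self, pos):
        self.hook_calls += 1
        self._apply(pos)


def run_decode(num_steps, Hq=4, D=8):
    """Feed one token per step while input_pos advances from 0."""
    input_pos = LiveScalar(0)
    plan = SdpaDispatchPlan(Hq=Hq, S=1, D=D, input_pos=input_pos)
    qk_sizes = []
    for pos in range(num_steps):
        input_pos.set(pos)
        input_pos.flush()
        qk_sizes.append(plan.qk_elems)
    return qk_sizes, plan.hook_calls
```

In this sketch the first step reuses the build-time value, so the hook does not fire for it; only later positions trigger a resize.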

Tests live in the stacked test-suite diff above (clean op diff here).

Authored with assistance from Claude.
@exported-using-ghexport

Differential Revision: D107595125

pytorch-bot · 2026-06-06T07:15:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20086

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

⏳ 134 Pending, 1 Unrelated Failure

As of commit e29cfe2 with merge base 5526971:

BROKEN TRUNK - The following job failed, but the failure was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / build-android (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-06T07:15:35Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-09T19:21:32Z

@claude review

claude · 2026-06-09T19:22:31Z

Claude finished @JulianCloudNTH's task in 3m 59s

Code Review: Fused SDPA (`sdpa_with_kv_cache`) for WebGPU

Read changed files (Sdpa.cpp, 3 WGSL kernels + headers, build, WebGPUGraph.h)
Analyze correctness, performance, and style
Post review feedback

Reviewed against origin/gh/JulianCloudNTH/19/base. Overall this is clean, well-structured work — the flat-index layouts are self-consistent across the three kernels, the GQA/causal-mask math is correct, and the dynamic input_pos resize path is handled carefully. Findings below, ordered by significance.

Correctness (verified, no blocking bugs found)

I traced the index math across all three kernels and the C++ dispatch sizing; they agree:

Layouts consistent. QK writes attn_weights as [Hq, S, context_len] (idx = h·S·ctx + s·ctx + c); softmax reads with row_width = context_len; AV reads aw_base = h·S·ctx + s·ctx. All three use the context_len stride, not Cmax, so rows pack contiguously into the front of the Cmax-capacity scratch and the unused tail is harmless. ✅
Resize hook updates exactly what varies with context_len. Only the QK dispatch's workgroup_count_x depends on context_len (= Hq·S·ctx) and it is rewritten (Sdpa.cpp:521). Softmax (Hq·S rows) and AV (S·Hq·D) dispatch counts are context_len-independent, so leaving their counts fixed while only rewriting their uniforms is correct. ✅
Prefill/first-run path is sound. The build placeholder input_pos = read_symint(...) means that when the first real input_pos equals the build value (e.g. prefill at 0), set_symint won't mark it dirty and the hook won't fire — but the baked build params already match, so output is correct. Nicely consistent. ✅
Masked-row softmax is safe. Causal guarantees ≥1 unmasked entry per row (c ≤ s + input_pos, and context_len = S + input_pos), so row_max is finite and exp(NEG_INF − row_max) → 0. The row_sum > 0 guard in sdpa_softmax.wgsl:91 is belt-and-suspenders. ✅
Idle-thread reduction. When context_len < 64, idle lanes seed local_max = NEG_INF / local_sum = 0 and still hit every workgroupBarrier() in uniform control flow. Correct. ✅
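The masked-row and idle-lane points above can be checked with a small CPU simulation of a 64-lane workgroup reduction. This is a sketch under assumptions: the strided per-lane scan and the halving tree reduction are a common pattern, not necessarily what `sdpa_softmax.wgsl` does, and `workgroup_softmax_row` and `tree_reduce` are hypothetical names. Only `WG_SIZE = 64`, the `NEG_INF` / `0` seeds, and the `row_sum > 0` guard come from the review.

```python
import math

NEG_INF = float("-inf")
WG_SIZE = 64


def tree_reduce(vals, op):
    """Halving reduction over WG_SIZE slots; each pass ends at a barrier point."""
    vals = list(vals)
    stride = len(vals) // 2
    while stride > 0:
        for lane in range(stride):
            vals[lane] = op(vals[lane], vals[lane + stride])
        stride //= 2
    return vals[0]


def workgroup_softmax_row(row):
    """Softmax of one attention row, simulating WG_SIZE cooperating lanes."""
    n = len(row)

    # Each lane scans a strided slice; lanes with no elements keep NEG_INF.
    local_max = [NEG_INF] * WG_SIZE
    for lane in range(WG_SIZE):
        for c in range(lane, n, WG_SIZE):
            local_max[lane] = max(local_max[lane], row[c])
    row_max = tree_reduce(local_max, max)

    # Same pattern for the sum; idle lanes contribute 0.
    local_sum = [0.0] * WG_SIZE
    for lane in range(WG_SIZE):
        for c in range(lane, n, WG_SIZE):
            local_sum[lane] += math.exp(row[c] - row_max)
    row_sum = tree_reduce(local_sum, lambda a, b: a + b)

    # Guard mirrors the row_sum > 0 check mentioned in the review.
    return [
        math.exp(x - row_max) / row_sum if row_sum > 0.0 else 0.0
        for x in row
    ]
```

For a row such as `[0.0, 0.0, NEG_INF]`, 61 lanes are idle, the masked entry contributes `exp(-inf) = 0`, and the two unmasked entries each receive weight 0.5.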

Suggestions (non-blocking)

1. Pipeline/shader/BGL caches go unused — Sdpa.cpp:93 build_dispatch.
The graph exposes get_or_create_shader/pipeline/bgl (WebGPUGraph.h:181-193), but build_dispatch creates a fresh shader module, BGL, pipeline layout, and pipeline on every call. The two update_cache dispatches (K and V) compile the identical kernel twice, and SDPA shaders rebuild from scratch even though other ops in the graph may use them. Build-time only, but routing through the existing caches would cut redundant compilation.

2. Full O(Hq·S·context_len) attention matrix in scratch, ×2 — Sdpa.cpp:303-311.
Two scratch buffers are each Hq·S·Cmax·4 bytes in the dynamic case. Fine for decode (S=1), but large for a long prefill: Hq=32, S=2048, Cmax=2048 gives 32·2048·2048·4 bytes = 512 MiB each. This mirrors the Vulkan reference (no flash-attention tiling), so it's a known limitation rather than a defect, but it's worth a comment noting that prefill memory scales as Hq·S·Cmax so future readers aren't surprised.

3. No bound check on kMaxEntries — Sdpa.cpp:113.
kMaxEntries = 8 and n_storage is currently ≤ 3, so n_storage + 1 is safe today. A static_assert/runtime guard (or just a comment that callers must keep n_storage + 1 ≤ kMaxEntries) would make the silent buffer-overflow risk explicit if a future kernel adds bindings.

4. K/V sequence length is assumed equal to q's S — Sdpa.cpp:327.
kv_numel and the update_cache dispatch derive the new-token count from q.dims (S), never from k/v. If k/v ever had a different seq dim this would silently mis-copy. A k.dims[kn-3] == S check alongside the existing validation would harden this.

5. Softmax workgroup size isn't clamped to device limits.
QK/AV use utils::clamp_workgroup_size and pass the result as an override; sdpa_softmax.wgsl hardcodes WG_SIZE = 64u with a wg_size = 0 sentinel (Sdpa.cpp:428). Since the WebGPU spec floor for maxComputeInvocationsPerWorkgroup is 256, 64 is always safe, so this is fine as-is. Just noting the intentional asymmetry in case it confuses a future reader.
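The memory figure in suggestion 2 can be reproduced with a few lines of arithmetic. The helper name `attn_scratch_bytes` is hypothetical; the formula Hq·S·Cmax·4 bytes per scratch buffer and the example shapes are taken from the suggestion itself.

```python
MIB = 1024 * 1024


def attn_scratch_bytes(Hq, S, Cmax, bytes_per_elem=4):
    """Size of one fp32 attention-weights scratch buffer, in bytes."""
    return Hq * S * Cmax * bytes_per_elem


# Long prefill from the review: 32 heads, 2048 new tokens, 2048 max context.
prefill_bytes = attn_scratch_bytes(Hq=32, S=2048, Cmax=2048)

# Single-token decode with the same head count and max context.
decode_bytes = attn_scratch_bytes(Hq=32, S=1, Cmax=2048)

print(prefill_bytes / MIB)  # 512.0 MiB per buffer
print(decode_bytes / MIB)   # 0.25 MiB per buffer
```

Since there are two such buffers, the prefill case in this example needs about 1 GiB of scratch, while decode needs about 0.5 MiB.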

Style

Matches the repo's WebGPU conventions well — minimal comments, fail-loud validation mirroring the Vulkan reference, explicit param structs with static_assert size checks. No style concerns. Note the op-level tests are in the stacked diff (#20087), so I couldn't validate runtime behavior here.

Branch: gh/JulianCloudNTH/19/head

SS-JIA

Review automatically exported from Phabricator review in Meta.

… input_pos

Pull Request resolved: #20086

Adds the fused `sdpa_with_kv_cache` op (QK attention-weights, softmax, attention-output sub-kernels over the KV cache), composing the three enablers below it: the base graph's inter-dispatch buffer passing (scratch buffers + multi-pass execute), the `update_cache` op, and the SymInt live-scalar mechanism. The QK/softmax/AV kernels mirror the Vulkan reference's flat-index/GQA/causal-mask math (NCHW, buffer-only, fp32).

`input_pos` is consumed dynamically via the SymInt mechanism: the op reads `symint_buffer()` as a uniform, sizes its scratch + dispatches for the max context length, and registers a resize hook so a single delegate runs an autoregressive decode loop (feed only the new token + advancing `input_pos`) instead of a fixed baked position. Mirrors the Vulkan SymInt = live uniform-buffer design.

Tests live in the stacked test-suite diff above (clean op diff here).

Authored with assistance from Claude.
ghstack-source-id: 392609088
@exported-using-ghexport

Differential Revision: [D107595125](https://our.internmc.facebook.com/intern/diff/D107595125/)

@JulianCloudNTH

… input_pos (#20259)

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20086 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/19/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/19/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/19/orig
@diff-train-skip-merge

---------

Co-authored-by: Julian Ng-Thow-Hing <juliannth@meta.com>

Update

ded2b6f

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 6, 2026 07:14

meta-cla Bot added the CLA Signed label Jun 6, 2026

JulianCloudNTH closed this Jun 6, 2026

JulianCloudNTH had a problem deploying to cherry-pick-bot June 6, 2026 07:16 — with GitHub Actions Failure

JulianCloudNTH reopened this Jun 9, 2026

Update

b7d1e44

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 9, 2026

JulianCloudNTH added 6 commits June 8, 2026 22:47

Update

405fde9

[ghstack-poisoned]

Update

65133d0

[ghstack-poisoned]

Update

13b12d5

[ghstack-poisoned]

Update

08b0d9d

[ghstack-poisoned]

Update

88e55d6

[ghstack-poisoned]

Update

2b85840

[ghstack-poisoned]

JulianCloudNTH added 2 commits June 9, 2026 13:17

Update

18b313c

[ghstack-poisoned]

Update

0af217e

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 9, 2026

[ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167

Merged

JulianCloudNTH added 2 commits June 9, 2026 17:17

Update

08abf3b

[ghstack-poisoned]

Update

c4cab5e

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 10, 2026

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Merged

JulianCloudNTH added 2 commits June 10, 2026 14:37

Update

a09a082

[ghstack-poisoned]

Update

bcf608b

[ghstack-poisoned]

This was referenced Jun 11, 2026

[ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) #20226

Merged

[ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep #20227

Merged

psiddh approved these changes Jun 12, 2026

SS-JIA approved these changes Jun 12, 2026

Update

e29cfe2

[ghstack-poisoned]

meta-codesync Bot merged commit e41cf0e into gh/JulianCloudNTH/19/base Jun 12, 2026
175 of 178 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/19/head branch June 12, 2026 23:21

meta-codesync Bot temporarily deployed to cherry-pick-bot June 12, 2026 23:21 Inactive

This was referenced Jun 12, 2026

[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20259

Merged

[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20261

Closed

meta-codesync[bot] merged 15 commits into gh/JulianCloudNTH/19/base from gh/JulianCloudNTH/19/head
