[ExecuTorch][WebGPU] Add et_vk.embedding_q4gsw (4-bit groupwise-symmetric quantized embedding) #20263
JulianCloudNTH wants to merge 1 commit into
Stack from ghstack (oldest at bottom):
Adds the WebGPU backend handler for `et_vk.embedding_q4gsw.default` (a 4-bit groupwise-symmetric quantized embedding gather) plus the host-side integer-input infra it requires.

The op is a single compute dispatch composed of one stage: one thread per 32-element block of each gathered row dequantizes the packed 4-bit table (`q = (nibble - 8) * scale`; even dim = high nibble, odd dim = low) into the fp32 output, mirroring the Vulkan `embedding_q4gsw` reference (flat buffer-backed weight; `is_linear_weight=true` is unsupported and throws). The workgroup size is a `wg_size` pipeline-override constant clamped to the device limit via `WebGPUUtils::clamp_workgroup_size`, the 1D dispatch count goes through `WebGPUUtils::compute_1d_workgroup_count` (validated before any GPU-object allocation), and the embedded WGSL string header is generated by `gen_wgsl_headers.py`.

Embedding indices arrive as int64 at the program boundary but the serialized graph stores them as int32, so the shared input path is extended with a host-side `InputData` view (`{data, nbytes, host_is_int64}`) and `copy_inputs` gains three branches: a byte-for-byte fast path when host and GPU sizes match, an int64->int32 narrowing copy when the buffer is int32 and the host input is twice as wide (mirrors the Vulkan `kLong` -> `kInt` staging cast), and a fail-loud throw otherwise. `WebGPUTensor` gains `elem_size`/`is_int` to drive the narrowing decision, and `update_symints_from_inputs` takes the same `InputData` vector so `execute()` builds a single input list consumed by both.

Differential Revision: D108428753
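
For reviewers who want to sanity-check the nibble convention, here is a minimal CPU-side sketch of the dequantization described above (`q = (nibble - 8) * scale`, even dim = high nibble, odd dim = low). It is not the shader in this PR. The function name `dequant_row_q4gsw`, the row-major packed layout, the fp32 scale type, and the `[row][dim / group_size]` scale layout are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, NOT the WGSL kernel from this PR.
// Assumed layouts (not confirmed by the PR text):
//   packed: row-major, dim / 2 bytes per row, two 4-bit values per byte
//   scales: row-major, dim / group_size fp32 scales per row
std::vector<float> dequant_row_q4gsw(
    const std::vector<uint8_t>& packed,
    const std::vector<float>& scales,
    size_t row,
    size_t dim,
    size_t group_size) {
  std::vector<float> out(dim);
  const size_t bytes_per_row = dim / 2;
  const size_t groups_per_row = dim / group_size;
  for (size_t d = 0; d < dim; ++d) {
    const uint8_t byte = packed[row * bytes_per_row + d / 2];
    // Even dim reads the high nibble, odd dim reads the low nibble.
    const int nibble = (d % 2 == 0) ? (byte >> 4) : (byte & 0x0F);
    const float scale = scales[row * groups_per_row + d / group_size];
    // Symmetric 4-bit: stored values 0..15 map to -8..7 before scaling.
    out[d] = static_cast<float>(nibble - 8) * scale;
  }
  return out;
}
```

With `packed = {0x8F, 0x07}`, `scales = {0.5, 2.0}`, `dim = 4`, and `group_size = 2`, row 0 dequantizes to `{0.0, 3.5, -16.0, -2.0}`. A loop like this can serve as a host-side oracle when comparing GPU output, provided the assumed layouts are adjusted to match the real tensors.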
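
The three `copy_inputs` branches can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `InputData` field types, the helper name `stage_input`, and its parameters (`gpu_nbytes`, `gpu_elem_size`, `gpu_is_int`, standing in for what `WebGPUTensor` exposes) are hypothetical, and the sketch returns a staging vector instead of writing to a GPU buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Field names follow the PR description; the field types are assumptions.
struct InputData {
  const void* data;
  size_t nbytes;
  bool host_is_int64;
};

// Hypothetical helper: produces the bytes that would be uploaded to the GPU.
std::vector<uint8_t> stage_input(
    const InputData& in,
    size_t gpu_nbytes,
    size_t gpu_elem_size,
    bool gpu_is_int) {
  std::vector<uint8_t> staged(gpu_nbytes);
  // Branch 1: sizes match, copy byte for byte.
  if (in.nbytes == gpu_nbytes) {
    if (gpu_nbytes != 0) {
      std::memcpy(staged.data(), in.data, gpu_nbytes);
    }
    return staged;
  }
  // Branch 2: int32 GPU buffer fed by an int64 host input twice as wide.
  if (gpu_is_int && gpu_elem_size == sizeof(int32_t) && in.host_is_int64 &&
      in.nbytes == 2 * gpu_nbytes) {
    const size_t n = gpu_nbytes / sizeof(int32_t);
    const auto* src = static_cast<const uint8_t*>(in.data);
    for (size_t i = 0; i < n; ++i) {
      int64_t wide = 0;
      std::memcpy(&wide, src + i * sizeof(int64_t), sizeof(int64_t));
      // No range check here; this sketch assumes indices fit in int32.
      const int32_t narrow = static_cast<int32_t>(wide);
      std::memcpy(staged.data() + i * sizeof(int32_t), &narrow, sizeof(narrow));
    }
    return staged;
  }
  // Branch 3: any other size mismatch is an error, never a silent truncation.
  throw std::runtime_error("input size does not match GPU buffer size");
}
```

Throwing in the last branch keeps a mismatched input from being partially copied, which matches the fail-loud behavior the description calls for.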