Reduce FLUX int8 test peak memory with sequential offload #13776
jiqing-feng wants to merge 9 commits into
The required change, huggingface/accelerate#4044, has been merged.
Hi @sayakpaul. Would you please review the PR? Thanks!
    # enable_model_cpu_offload moves an entire sub-model to GPU at once, which OOMs on
    # <=24 GB cards for FLUX.1-dev even with int8 quantization.
    # This requires the bitsandbytes fix that preserves Int8Params.SCB across .to() calls.
    self.pipeline_8bit.enable_sequential_cpu_offload()
Why do we keep making the same kind of change? If something fails in your particular environment, it's always better to guard the change accordingly rather than applying it unconditionally like this.
Thanks for the feedback! Updated to guard by device memory instead of unconditionally switching:

    _, total_mem = torch.accelerator.get_memory_info(0)
    if total_mem <= 25 * (1024**3):
        self.pipeline_8bit.enable_sequential_cpu_offload()
    else:
        self.pipeline_8bit.enable_model_cpu_offload()

This keeps the original enable_model_cpu_offload path on large-memory devices and only falls back to sequential offload on ≤24 GB cards. torch.accelerator works across CUDA/XPU/ROCm.
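The guard above can be isolated into a small, device-independent helper, which makes the threshold easy to check without a GPU. This is only a sketch: `choose_offload_strategy` and `apply_offload` are hypothetical names that do not appear in the PR, and the 25 GiB threshold is copied from the snippet above. The pipeline object is assumed to expose the two offload methods used in the test.

```python
GIB = 1024**3  # bytes per GiB


def choose_offload_strategy(total_mem_bytes: int, threshold_gib: int = 25) -> str:
    """Return "sequential" for small-memory devices, "model" otherwise.

    Hypothetical helper: the threshold mirrors the `25 * (1024**3)` check
    in the snippet above, giving 24 GB cards some headroom.
    """
    if total_mem_bytes <= threshold_gib * GIB:
        return "sequential"
    return "model"


def apply_offload(pipeline, total_mem_bytes: int) -> str:
    """Enable the offload mode selected for the given device memory.

    `pipeline` is assumed to provide enable_sequential_cpu_offload() and
    enable_model_cpu_offload(), as the pipeline in the test does.
    """
    strategy = choose_offload_strategy(total_mem_bytes)
    if strategy == "sequential":
        pipeline.enable_sequential_cpu_offload()
    else:
        pipeline.enable_model_cpu_offload()
    return strategy
```

In the test setup, `total_mem_bytes` would come from the second element returned by `torch.accelerator.get_memory_info(0)`, as in the snippet above. Keeping the decision in a pure function means the threshold logic can be unit-tested with plain integers.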
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @sayakpaul. I have addressed your comment; please review the new change. Thanks!
Hi @sayakpaul. Would you please review the PR? Thanks!
Summary
Update the slow FLUX bitsandbytes int8 tests to use sequential CPU offload instead of model CPU offload.

enable_model_cpu_offload() can move an entire sub-model onto the GPU at once. For black-forest-labs/FLUX.1-dev, this can OOM on <=24 GB cards even when the T5 encoder and transformer are loaded from the pre-quantized int8 test checkpoint. Sequential CPU offload keeps peak memory lower by materializing one layer at a time, which lets the int8 FLUX tests run in more constrained environments.

The LoRA-loading assertion tolerance is also relaxed from 1e-3 to 2e-3 to account for small backend-specific numerical differences observed in the slow int8 path.

Changes
- Switch the SlowBnb8bitFluxTests setup from enable_model_cpu_offload() to enable_sequential_cpu_offload().
- Relax the test_lora_loading cosine-distance tolerance to 2e-3.

Validation
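For context on the relaxed tolerance, here is a minimal sketch of a cosine-distance check of the kind the description refers to. It is written in plain Python for illustration and is not the helper used by the actual test. The `cosine_distance` function and the sample vectors are assumptions made up for this example, not values from the PR.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Illustrative values only: a reference slice and a slightly perturbed output.
expected = [0.50, 0.25, 0.75, 1.00]
observed = [0.52, 0.25, 0.74, 1.00]

distance = cosine_distance(expected, observed)
print(f"cosine distance: {distance:.6f}")
print("within 1e-3:", distance < 1e-3)
print("within 2e-3:", distance < 2e-3)
```

A looser bound such as 2e-3 accepts outputs whose direction differs slightly more from the reference, which is the kind of small backend-specific drift the description mentions.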
Run the affected slow tests:
Observed result: