Skip to content

fix: keep queued_requests in sync with ml_tasks (stop unbounded queue growth)#12

Merged
Tanmaypatil123 merged 1 commit into
mainfrom
fix/queued-requests-sync
Jun 19, 2026
Merged

fix: keep queued_requests in sync with ml_tasks (stop unbounded queue growth)#12
Tanmaypatil123 merged 1 commit into
mainfrom
fix/queued-requests-sync

Conversation

@adhikjoshi

Copy link
Copy Markdown
Contributor

Problem

queued_requests (the Redis sorted set scored by queued_at, which queue_num / queue_time are computed from) is zadd-ed on enqueue but only zrem-ed in remove_task_from_queue() (explicit cancel). The normal lifecycle — blpop("ml_tasks") claim → process → complete — never removes the task from queued_requests.

So queued_requests grows without bound: every task ever enqueued stays in it forever. In production one deepfake queue held 934,500 entries of which ~99.5% were already-resolved (success/error) tasks — only ~44 were genuinely pending. This made queue_num/queue_time meaningless (and drove bogus autoscaling/ETA) and bloated the memory-capped Redis containers.

ml_tasks (the actual work list) drains correctly via blpop; it's only the parallel index that leaks.

Fix

Keep queued_requests mirroring ml_tasks at every boundary:

  • Claim (blpop in the worker loop): zrem("queued_requests", task_id) once the task is picked up — it's no longer queued.
  • Re-queue paths re-add it: stuck-processing requeue, delayed-task requeue, and task-not-allowed requeue now zadd("queued_requests", ...) (the not-allowed path also srems processing_tasks).
  • delete_queue() now also clears queued_requests.

Tests

All 49 tests pass (tests/test_base.py, fakeredis). The existing queued_requests assertions are post-enqueue and unaffected.

Note: this prevents future growth. The existing backlog of stale entries should be pruned separately (a one-time purge of ~6.8M entries across 36 queues was already run).

`queued_requests` (the sorted-set index behind queue_num / queue_time) was
added on enqueue but only removed on explicit cancel — never when a worker
claimed or completed a task. So it accumulated every task ever enqueued
(observed one deepfake queue at 934k, ~99% already-resolved), wildly inflating
reported queue depth/ETA and bloating Redis memory.

Now mirror `ml_tasks` at every boundary:
- zrem queued_requests when a worker claims a task (blpop)
- zadd queued_requests on every re-queue (stuck / delayed / task-not-allowed)
- clear queued_requests in delete_queue()

All 49 tests pass.
@Tanmaypatil123 Tanmaypatil123 merged commit bf81549 into main Jun 19, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants