feat: dockerhub-sync worker for repo_docker pull counts (CM-1213) #4163
joanreyero wants to merge 9 commits into
Pull request overview
Adds a new long-running dockerhub-sync worker under services/apps/packages_worker to (1) discover Docker Hub images for GitHub repos and (2) refresh/snapshot Docker Hub pull counts daily, backed by new repos.docker_checked_at and a partitioned repo_docker_pulls_daily table.
Changes:
- Introduces Docker Hub discovery + refresh loop with per-token GitHub parking and per-IP Hub parking.
- Adds Docker Hub fetch + Dockerfile-detection utilities, persistence helpers, and initial unit tests.
- Extends packages-db schema with docker_checked_at, backlog/staleness indexes, and repo_docker_pulls_daily (range-partitioned).
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| services/apps/packages_worker/src/dockerhub/index.ts | Core discovery/refresh loop, rate-limit parking, page processing |
| services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts | Docker Hub API client + error classification |
| services/apps/packages_worker/src/dockerhub/detectDockerfile.ts | GitHub GraphQL probe for Dockerfile presence |
| services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts | Upserts into repo_docker and daily snapshot table |
| services/apps/packages_worker/src/dockerhub/types.ts | Shared types + FetchError |
| services/apps/packages_worker/src/dockerhub/candidates.ts | Candidate image-name generation + validation |
| services/apps/packages_worker/src/dockerhub/tests/fetchDockerhub.test.ts | Unit tests for Hub fetch behavior |
| services/apps/packages_worker/src/dockerhub/tests/candidates.test.ts | Unit tests for candidate generation |
| services/apps/packages_worker/src/bin/dockerhub-sync.ts | Worker entrypoint and shutdown wiring |
| services/apps/packages_worker/src/config.ts | Adds getDockerhubConfig() env parsing |
| services/apps/packages_worker/package.json | Adds start/dev scripts for dockerhub-sync (protected file) |
| scripts/services/dockerhub-sync.yaml | Compose service for running dockerhub-sync |
| backend/src/osspckgs/migrations/V1779710880__initial_schema.sql | Schema: docker_checked_at, indexes, repo_docker_pulls_daily table |
Standalone loop worker (modeled on github-repos-enricher) that:
- discovers Docker images for GitHub repos via Dockerfile-gated <owner>/<name> probing on hub.docker.com/v2
- refreshes pull/star counts daily into repo_docker
- snapshots lifetime pull_count into repo_docker_pulls_daily for delta-at-query-time daily granularity

Schema (V1779710880 edited in place, pre-prod):
- repos.docker_checked_at + partial index for discovery backlog
- repo_docker_pulls_daily partitioned by date (pg_partman, mirrors downloads_daily)
- repo_docker_stale_idx on last_synced_at

Tested against a 1000-repo random sample from prod public.repositories: 2.6% hit rate on Hub; 87% of repos have no Dockerfile; ghcr.io is the dominant registry for the remainder. CI-workflow parsing and ghcr/quay probes scoped as follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
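The delta-at-query-time idea in the commit above can be illustrated with a small TypeScript sketch. This is not the PR's code: the `Snapshot` shape and the `dailyDeltas` helper are hypothetical names, and the sketch assumes the snapshots belong to one image and are already sorted by date. Each day's pulls is the difference from the previous lifetime total, which is roughly what a `LAG()` window would compute in SQL.

```typescript
// Hypothetical shape of one repo_docker_pulls_daily row (field names assumed for illustration).
interface Snapshot {
  date: string;       // ISO date, e.g. "2025-01-02"
  pullsTotal: number; // lifetime pull count reported by Docker Hub on that date
}

interface DailyDelta {
  date: string;
  pulls: number | null; // null when there is no earlier snapshot to diff against
}

// Derive per-day pulls from lifetime totals; input is assumed sorted ascending by date.
function dailyDeltas(snapshots: Snapshot[]): DailyDelta[] {
  const out: DailyDelta[] = [];
  let prev: Snapshot | null = null;
  for (const snap of snapshots) {
    out.push({
      date: snap.date,
      pulls: prev === null ? null : snap.pullsTotal - prev.pullsTotal,
    });
    prev = snap;
  }
  return out;
}
```

If a day's snapshot is missing, the next delta covers the whole gap, so a consumer may want to normalize by the number of days between snapshots.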
6fb6051 to b23c298
- Retry the same row after a GitHub rate-limit park instead of abandoning it (cursor would otherwise advance past unprobed repos until end-of-sweep).
- Serialize Docker Hub calls via a promise chain so the per-token GitHub fan-out cannot fire concurrent requests against the per-IP Hub budget.
- 401/403 from Hub now classified AUTH and propagated, so a misconfigured base URL fails fast instead of silently marking every image gone.
- Stop discarding valid 200 responses when x-ratelimit-remaining=0.
- Wrap repo_docker + repo_docker_pulls_daily writes in a transaction.
- Classify non-JSON GitHub GraphQL bodies as MALFORMED.

Not addressed (replied on PR):
- Inline SQL stays per packages_worker convention (matches enricher/osv).
- repo_docker_pulls_daily partition setup deferred to pg_partman, same as downloads_daily in the same migration.
- Loop-level retry/parking tests deferred; validated against 1065 real repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
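For readers unfamiliar with the promise-chain pattern mentioned in the commit above, here is a minimal TypeScript sketch. `createSerialQueue` is a hypothetical helper written for illustration; the PR's own chain may be structured differently. The idea is that every task is appended to a single chain, so at most one task runs at a time regardless of how many callers enqueue concurrently.

```typescript
// Serialize async tasks through one promise chain so at most one runs at a time.
// `createSerialQueue` is a hypothetical name; this is a sketch, not the PR's code.
function createSerialQueue(): <T>(task: () => Promise<T>) => Promise<T> {
  // The chain itself never rejects, so one failed task cannot block later ones.
  let chain: Promise<unknown> = Promise.resolve();

  return function enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after everything queued before it has settled.
    const result = chain.then(() => task());
    // Swallow the rejection on the chain copy; the caller still sees it via `result`.
    chain = result.catch(() => undefined);
    return result;
  };
}
```

A caller would wrap each Hub request, for example `enqueue(() => fetchImage(name))`, where `fetchImage` stands in for whatever request function is used.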
Per themarolt's review on #4149, packages-db queries belong in services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow, upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from @crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx orchestrator. Query strings unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Hub calls are serialized via hubChain, so a stalled socket would block all subsequent probes indefinitely. Adds AbortSignal.timeout(30s) to both the Hub and GitHub GraphQL requests; aborts surface as TRANSIENT and retry with backoff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
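The timeout behaviour described in the commit above might look roughly like the following sketch. The names `SketchFetchError` and `fetchWithTimeout` are hypothetical and the PR's FetchError type may carry more detail; the only platform API relied on is `AbortSignal.timeout`, which aborts with a `TimeoutError`. The fetch function is passed in as a parameter so the sketch stays independent of any particular HTTP client.

```typescript
// Hypothetical error kinds for illustration only.
type SketchKind = "TRANSIENT" | "OTHER";

class SketchFetchError extends Error {
  readonly kind: SketchKind;
  constructor(kind: SketchKind, message: string) {
    super(message);
    this.kind = kind;
  }
}

// Run a fetch-like call with a hard deadline and map aborts to TRANSIENT.
async function fetchWithTimeout<T>(
  fetchImpl: (url: string, init: { signal: AbortSignal }) => Promise<T>,
  url: string,
  timeoutMs: number = 30_000,
): Promise<T> {
  try {
    // AbortSignal.timeout aborts the signal once timeoutMs has elapsed.
    return await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
  } catch (err) {
    const name = (err as { name?: string } | null)?.name;
    if (name === "TimeoutError" || name === "AbortError") {
      // A stalled socket becomes a retryable error instead of blocking the queue.
      throw new SketchFetchError("TRANSIENT", `request to ${url} timed out after ${timeoutMs}ms`);
    }
    throw err;
  }
}
```

In a Node runtime with a global `fetch`, a call could look like `fetchWithTimeout((u, init) => fetch(u, init), url)`, with the caller's retry loop handling the TRANSIENT case.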
Conflict in packages_worker/package.json scripts: kept both sides; moved dockerhub-sync inspector port 9235 -> 9238 (deps-dev-ingest took 9235 on main). getDockerhubConfig still reads ENRICHER_GITHUB_TOKENS directly so the enricher-v2 GitHub-App switch on main doesn't affect this worker; migrating dockerhub-sync to getGithubAppConfig() is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Aligns with enricher-v2 (#4165). getDockerhubConfig drops the ENRICHER_GITHUB_TOKENS PAT pool; the entrypoint now calls getGithubAppConfig + resolveInstallations + fetchRateLimitDiagnostics, and the discovery fan-out runs one worker per installation id with parkedUntil keyed on installationId. getInstallationToken is called per-request so token refresh/caching is shared with the enricher.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
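As a rough illustration of parking keyed on installation id, the sketch below keeps a per-installation "parked until" timestamp. `ParkingLot` and its methods are hypothetical names, and treating installation ids as numbers is an assumption; the PR's actual bookkeeping may differ.

```typescript
// Tracks rate-limit parking per GitHub App installation.
// Hypothetical helper for illustration; installation ids are assumed to be numbers.
class ParkingLot {
  // installationId -> epoch milliseconds until which the installation is parked
  private readonly parkedUntil = new Map<number, number>();

  park(installationId: number, untilMs: number): void {
    this.parkedUntil.set(installationId, untilMs);
  }

  isParked(installationId: number, nowMs: number = Date.now()): boolean {
    const until = this.parkedUntil.get(installationId);
    return until !== undefined && nowMs < until;
  }

  // Milliseconds to wait before this installation may be used again (0 if not parked).
  waitMs(installationId: number, nowMs: number = Date.now()): number {
    const until = this.parkedUntil.get(installationId) ?? 0;
    return Math.max(0, until - nowMs);
  }
}
```

Because each installation has its own entry, one installation hitting its rate limit leaves the other discovery workers free to continue.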
githubFetchWithRetries was returning null on AUTH (same bucket as NOT_FOUND), which caused discoverRepo to mark docker_checked_at and move on. With a bad installation token that would silently stamp every repo for DOCKERHUB_DISCOVERY_INTERVAL_DAYS. Now AUTH re-throws through processDiscoveryPage so the worker exits and restarts with a fresh resolveInstallations(), symmetric to the Hub AUTH path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
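The distinction this commit draws between a definitive miss and an auth failure can be sketched as below. `KindError` and `fetchOrNull` are hypothetical names, not the PR's githubFetchWithRetries, and backoff between retries is omitted to keep the example short. Only NOT_FOUND maps to null; AUTH is re-thrown so the caller never stamps a repo as checked on the strength of a bad token.

```typescript
// Hypothetical error kinds and error class for illustration.
type ErrorKind = "AUTH" | "NOT_FOUND" | "TRANSIENT";

class KindError extends Error {
  readonly kind: ErrorKind;
  constructor(kind: ErrorKind, message: string) {
    super(message);
    this.kind = kind;
  }
}

// Returns null only for a definitive miss; auth failures always propagate.
async function fetchOrNull<T>(
  attempt: () => Promise<T>,
  maxRetries: number = 3,
): Promise<T | null> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (!(err instanceof KindError)) throw err; // unknown failure: do not guess
      if (err.kind === "NOT_FOUND") return null;  // safe to record the repo as checked
      if (err.kind === "AUTH") throw err;         // fatal: must not look like a miss
      if (i >= maxRetries) throw err;             // TRANSIENT retry budget exhausted
      // A real worker would sleep with backoff here before retrying.
    }
  }
}
```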
- Initial schema is now deployed; revert in-place edits to V1779710880 and move docker_checked_at / repos_docker_pending_idx / repo_docker_stale_idx / repo_docker_pulls_daily into a new migration V1780928852__dockerhub_sync (created via cli scaffold create-packages-migration).
- processRefreshPage: wrap per-row Hub fetch + DB writes in try/catch so a transient DB failure logs and continues instead of crashing the worker. Hub AUTH still propagates fatally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
- Register repo_docker_pulls_daily with pg_partman in the migration (mirrors V1780231200 for downloads_daily; guarded against re-registration).
- Normalize trailing slash on DOCKERHUB_API_BASE_URL to avoid // in the path.
- Update stale 'per-token' comment to 'per-installation' after the App-auth switch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
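The trailing-slash fix in the commit above amounts to something like the following sketch. `normalizeBaseUrl` and `repositoryUrl` are hypothetical helper names; the `/repositories/<owner>/<name>/` path shape is taken from the PR description.

```typescript
// Strip trailing slashes so joining with "/repositories/..." never yields "//".
// Hypothetical helpers for illustration, not the PR's actual functions.
function normalizeBaseUrl(raw: string): string {
  return raw.trim().replace(/\/+$/, "");
}

// Build the Hub repository URL for an <owner>/<name> candidate.
function repositoryUrl(baseUrl: string, owner: string, name: string): string {
  const base = normalizeBaseUrl(baseUrl);
  return `${base}/repositories/${encodeURIComponent(owner)}/${encodeURIComponent(name)}/`;
}
```

With this in place, a base URL configured with or without a trailing slash produces the same request path.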
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit a4528db.
ON CONFLICT (image_name) DO UPDATE SET
  repo_id = COALESCE(repo_docker.repo_id, EXCLUDED.repo_id),
} else {
  log.error({ url: row.url, err }, 'Unexpected discovery error')
  failed++
  done = true
}

Summary
Standalone loop worker (sibling of github-repos-enricher) that discovers Docker Hub images for repos in packages-db and tracks their pull counts with daily granularity.

Discovery (Option B-lite): one GitHub GraphQL call per repo checks for Dockerfile at HEAD:Dockerfile, docker/Dockerfile, build/Dockerfile. If present, probes hub.docker.com/v2/repositories/<owner>/<name>/. Hits are upserted to repo_docker; every repo gets repos.docker_checked_at set so the backlog drains.

Refresh: known images with stale last_synced_at are re-fetched daily; lifetime pull_count is written to repo_docker.pulls and snapshotted into repo_docker_pulls_daily (per-day deltas via LAG() at query time, since Hub doesn't expose daily counts).

Loop: each tick processes one refresh page then one discovery page; idles when both are empty. GitHub calls fan out across ENRICHER_GITHUB_TOKENS with per-token parking; Hub calls are sequential with a single per-IP park (Hub rate limit is per-IP, ~180/window).

Schema (V1779710880 edited in place, pre-prod)
- repos.docker_checked_at timestamptz + partial index repos_docker_pending_idx (WHERE host='github' AND docker_checked_at IS NULL)
- repo_docker_pulls_daily(image_name, date, pulls_total) partitioned by date (register with pg_partman alongside downloads_daily)
- repo_docker_stale_idx on repo_docker(last_synced_at)

Files
- src/dockerhub/{index,types,candidates,detectDockerfile,fetchDockerhub,upsertRepoDocker}.ts + 15 vitest cases
- src/bin/dockerhub-sync.ts, src/config.ts (getDockerhubConfig)
- scripts/services/dockerhub-sync.yaml, package.json scripts (port 9235)

Validation against prod data
Ran against a random 1000-repo sample from public.repositories (prod):

[Results table: rows describing <owner>/<name> on Hub; cell values not recoverable]

7.5 min / 1 token / 0 errors / 0 rate-limits. Top finds: ollama/ollama (140M pulls), hashicorp/packer (47M), semgrep/semgrep (32M).

Follow-ups (scoped, not in this PR)
A CI-workflow-parsing census on the same 1000 repos showed:
- scaleway/cli, qmcgaw/gluetun, nervos/ckb, paketobuildpacks/*, …

Ranked by ROI:
- registry column on repo_docker. ~2× total coverage.
- .github/workflows extraction. +17 Hub images/1000.
- library/ official-image allowlist; broader Dockerfile path probing.

Reviewer notes
- backend/.env.dist.{local,composed} need the DOCKERHUB_* block appended (couldn't write .env* from the dev session; see commit message for values).
- pnpm format in packages_worker is currently broken (strips TS generics); format-check fails on pre-existing files too. Not CI-gated for this workspace. Separate fix needed.
- package.json change is +3 script entries only.

🤖 Generated with Claude Code
Note
Medium Risk
New long-running worker writes to packages-db and calls GitHub/Docker Hub APIs with migration-backed schema changes; failure modes are mostly mitigated (transactions, fail-fast on auth), but operational load and external rate limits affect backlog drain.
Overview
Adds a new dockerhub-sync loop worker (alongside github-repos-enricher) that discovers Docker Hub images for GitHub repos in packages-db and keeps pull/star metadata fresh.

Schema: migration adds repos.docker_checked_at, indexes for discovery/refresh backlogs, and monthly-partitioned repo_docker_pulls_daily for lifetime pull snapshots (daily deltas via LAG() at query time).

Discovery: GitHub GraphQL checks common Dockerfile paths; if present, probes Hub at <owner>/<repo> (validated slugs only, no library/ heuristic). Sets docker_checked_at so the backlog drains.

Refresh: Re-fetches stale repo_docker rows, upserts pulls/stars, writes daily snapshots in a transaction; 404s only touch last_synced_at.

Runtime: getDockerhubConfig, DAL in repoDocker.ts, serialized Hub calls with per-IP rate-limit parking, GitHub App installation fan-out with retry/park on rate limits, Docker Compose service and package.json scripts.

Reviewed by Cursor Bugbot for commit cf32149.