feat: dockerhub-sync worker for repo_docker pull counts (CM-1213)#4163

Open
joanreyero wants to merge 9 commits into main from feat/CM-1213-dockerhub-sync

Conversation

@joanreyero commented Jun 3, 2026

Contributor

Summary

Standalone loop worker (sibling of github-repos-enricher) that discovers Docker Hub images for repos in packages-db and tracks their pull counts with daily granularity.

Discovery (Option B-lite): one GitHub GraphQL call per repo checks for Dockerfile at HEAD:Dockerfile, docker/Dockerfile, build/Dockerfile. If present, probes hub.docker.com/v2/repositories/<owner>/<name>/. Hits are upserted to repo_docker; every repo gets repos.docker_checked_at set so the backlog drains.
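
The one-call-per-repo probe can be pictured as a single GraphQL document with one aliased `object(expression:)` field per candidate path. This is a minimal sketch under that assumption; `buildDockerfileQuery`, `hasDockerfile`, and the `ProbeResponse` shape are hypothetical names for illustration, not the helpers in detectDockerfile.ts.

```typescript
// Candidate paths taken from the PR description.
const DOCKERFILE_PATHS = ['Dockerfile', 'docker/Dockerfile', 'build/Dockerfile'] as const

// Build one GraphQL document that checks every path in a single round trip.
// Each path gets an alias (p0, p1, ...) because the same field cannot be
// selected twice with different arguments unless it is aliased.
function buildDockerfileQuery(paths: readonly string[] = DOCKERFILE_PATHS): string {
  const fields = paths
    .map((path, i) => `p${i}: object(expression: ${JSON.stringify(`HEAD:${path}`)}) { __typename }`)
    .join('\n    ')
  return `query ($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    ${fields}
  }
}`
}

// Assumed response shape: one entry per alias, null when the path is absent.
type ProbeResponse = {
  data?: { repository: Record<string, { __typename: string } | null> | null }
}

// A Dockerfile is present when any alias resolves to a Blob (a file).
function hasDockerfile(res: ProbeResponse): boolean {
  const repo = res.data?.repository
  if (!repo) return false
  return Object.values(repo).some((obj) => obj?.__typename === 'Blob')
}
```

Checking for `Blob` rather than mere non-null avoids treating a directory named `Dockerfile` (a `Tree`) as a hit.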

Refresh: known images with stale last_synced_at are re-fetched daily; lifetime pull_count is written to repo_docker.pulls and snapshotted into repo_docker_pulls_daily (per-day deltas via LAG() at query time — Hub doesn't expose daily counts).
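
To make the "deltas via LAG() at query time" idea concrete, here is the same arithmetic as a small in-memory function. It is only an illustration of what the window function computes for one image; the `Snapshot` and `DailyDelta` types and the `dailyDeltas` name are assumptions, and the real computation is expected to happen in SQL.

```typescript
interface Snapshot {
  date: string // ISO day, e.g. '2026-06-03'
  pullsTotal: number // lifetime pull count recorded on that day
}

interface DailyDelta {
  date: string
  pulls: number | null // null for the first snapshot, where LAG() yields NULL
}

// Mirrors `pulls_total - LAG(pulls_total) OVER (ORDER BY date)` for one image.
function dailyDeltas(snapshots: Snapshot[]): DailyDelta[] {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...snapshots].sort((a, b) => a.date.localeCompare(b.date))
  return sorted.map((snap, i) => ({
    date: snap.date,
    pulls: i === 0 ? null : snap.pullsTotal - sorted[i - 1].pullsTotal,
  }))
}
```

If a day's snapshot is missing, the next delta covers the whole gap, so consumers that need strict per-day numbers would have to account for skipped days.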

Loop: each tick processes one refresh page then one discovery page; idles when both empty. GitHub calls fan out across ENRICHER_GITHUB_TOKENS with per-token parking; Hub calls are sequential with a single per-IP park (Hub rate limit is per-IP, ~180/window).
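
The tick ordering described above can be sketched as follows. The page functions, the `idle` callback, and `shouldStop` are injected placeholders (assumptions for illustration), so this shows only the control flow, not the worker's actual entrypoint.

```typescript
// Each page function resolves to the number of rows it processed.
type PageFn = () => Promise<number>

// One tick: a refresh page first, then a discovery page.
// Resolves to true when both pages were empty, i.e. the worker should idle.
async function runTick(refreshPage: PageFn, discoveryPage: PageFn): Promise<boolean> {
  const refreshed = await refreshPage()
  const discovered = await discoveryPage()
  return refreshed === 0 && discovered === 0
}

// Loop until asked to stop, idling only when a tick found no work.
async function runLoop(
  refreshPage: PageFn,
  discoveryPage: PageFn,
  idle: () => Promise<void>,
  shouldStop: () => boolean,
): Promise<void> {
  while (!shouldStop()) {
    if (await runTick(refreshPage, discoveryPage)) await idle()
  }
}
```

Running refresh before discovery means known images stay fresh even while a large discovery backlog is draining.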

Schema (V1779710880 edited in place — pre-prod)

  • repos.docker_checked_at timestamptz + partial index repos_docker_pending_idx (WHERE host='github' AND docker_checked_at IS NULL)
  • repo_docker_pulls_daily(image_name, date, pulls_total) partitioned by date (register with pg_partman alongside downloads_daily)
  • repo_docker_stale_idx on repo_docker(last_synced_at)

Files

  • src/dockerhub/{index,types,candidates,detectDockerfile,fetchDockerhub,upsertRepoDocker}.ts + 15 vitest cases
  • src/bin/dockerhub-sync.ts, src/config.ts (getDockerhubConfig)
  • scripts/services/dockerhub-sync.yaml, package.json scripts (port 9235)

Validation against prod data

Ran against a random 1000-repo sample from public.repositories (prod):

| Outcome | n | % |
|---|---|---|
| Hit: <owner>/<name> on Hub | 26 | 2.6% |
| Dockerfile present, no <owner>/<name> on Hub | 102 | 10.2% |
| No Dockerfile at probed paths | 869 | 86.9% |
| GitHub 404 | 3 | 0.3% |

7.5 min / 1 token / 0 errors / 0 rate-limits. Top finds: ollama/ollama (140M pulls), hashicorp/packer (47M), semgrep/semgrep (32M).

Follow-ups (scoped, not in this PR)

A CI-workflow-parsing census on the same 1000 repos showed:

  • 66 repos (6.6%) have GHA that publishes a container
  • Registry split: ghcr.io 41 · docker.io 35 · quay.io 8 — ghcr is the dominant target for CDP's LF/CNCF-heavy population
  • CI parsing recovers +17 Hub images v1 misses (org/name differs: scaleway/cli, qmcgaw/gluetun, nervos/ckb, paketobuildpacks/*, …)
  • 31 repos publish only to non-Hub registries

Ranked by ROI:

  1. ghcr.io/quay.io probes — ghcr namespaces == GitHub orgs by design, so the existing heuristic works as-is; needs a registry column on repo_docker. ~2× total coverage.
  2. CI-workflow parsing — replace Dockerfile gate with .github/workflows extraction. +17 Hub images/1000.
  3. library/ official-image allowlist; broader Dockerfile path probing.

Reviewer notes

  • backend/.env.dist.{local,composed} need the DOCKERHUB_* block appended (couldn't write .env* from the dev session — see commit message for values).
  • pnpm format in packages_worker is currently broken (strips TS generics); format-check fails on pre-existing files too. Not CI-gated for this workspace. Separate fix needed.
  • package.json change is +3 script entries only.

🤖 Generated with Claude Code


Note

Medium Risk
New long-running worker writes to packages-db and calls GitHub/Docker Hub APIs with migration-backed schema changes; failure modes are mostly mitigated (transactions, fail-fast on auth), but operational load and external rate limits affect backlog drain.

Overview
Adds a new dockerhub-sync loop worker (alongside github-repos-enricher) that discovers Docker Hub images for GitHub repos in packages-db and keeps pull/star metadata fresh.

Schema: migration adds repos.docker_checked_at, indexes for discovery/refresh backlogs, and monthly-partitioned repo_docker_pulls_daily for lifetime pull snapshots (daily deltas via LAG() at query time).

Discovery: GitHub GraphQL checks common Dockerfile paths; if present, probes Hub at <owner>/<repo> (validated slugs only—no library/ heuristic). Sets docker_checked_at so the backlog drains.

Refresh: Re-fetches stale repo_docker rows, upserts pulls/stars, writes daily snapshots in a transaction; 404s only touch last_synced_at.

Runtime: getDockerhubConfig, DAL in repoDocker.ts, serialized Hub calls with per-IP rate-limit parking, GitHub App installation fan-out with retry/park on rate limits, Docker Compose service and package.json scripts.

Reviewed by Cursor Bugbot for commit cf32149.

Copilot AI review requested due to automatic review settings June 3, 2026 13:40
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Comment thread services/apps/packages_worker/src/dockerhub/index.ts

Copilot AI left a comment

Pull request overview

Adds a new long-running dockerhub-sync worker under services/apps/packages_worker to (1) discover Docker Hub images for GitHub repos and (2) refresh/snapshot Docker Hub pull counts daily, backed by new repos.docker_checked_at and a partitioned repo_docker_pulls_daily table.

Changes:

  • Introduces Docker Hub discovery + refresh loop with per-token GitHub parking and per-IP Hub parking.
  • Adds Docker Hub fetch + Dockerfile-detection utilities, persistence helpers, and initial unit tests.
  • Extends packages-db schema with docker_checked_at, backlog/staleness indexes, and repo_docker_pulls_daily (range-partitioned).

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.

| File | Description |
|---|---|
| services/apps/packages_worker/src/dockerhub/index.ts | Core discovery/refresh loop, rate-limit parking, page processing |
| services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts | Docker Hub API client + error classification |
| services/apps/packages_worker/src/dockerhub/detectDockerfile.ts | GitHub GraphQL probe for Dockerfile presence |
| services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts | Upserts into repo_docker and daily snapshot table |
| services/apps/packages_worker/src/dockerhub/types.ts | Shared types + FetchError |
| services/apps/packages_worker/src/dockerhub/candidates.ts | Candidate image-name generation + validation |
| services/apps/packages_worker/src/dockerhub/__tests__/fetchDockerhub.test.ts | Unit tests for Hub fetch behavior |
| services/apps/packages_worker/src/dockerhub/__tests__/candidates.test.ts | Unit tests for candidate generation |
| services/apps/packages_worker/src/bin/dockerhub-sync.ts | Worker entrypoint and shutdown wiring |
| services/apps/packages_worker/src/config.ts | Adds getDockerhubConfig() env parsing |
| services/apps/packages_worker/package.json | Adds start/dev scripts for dockerhub-sync (protected file) |
| scripts/services/dockerhub-sync.yaml | Compose service for running dockerhub-sync |
| backend/src/osspckgs/migrations/V1779710880__initial_schema.sql | Schema: docker_checked_at, indexes, repo_docker_pulls_daily table |

Comment thread services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/detectDockerfile.ts
Comment thread services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/__tests__/fetchDockerhub.test.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
Comment thread backend/src/osspckgs/migrations/V1779710880__initial_schema.sql Outdated
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Standalone loop worker (modeled on github-repos-enricher) that:
- discovers Docker images for GitHub repos via Dockerfile-gated <owner>/<name>
  probing on hub.docker.com/v2
- refreshes pull/star counts daily into repo_docker
- snapshots lifetime pull_count into repo_docker_pulls_daily for delta-at-query-time
  daily granularity

Schema (V1779710880 edited in place, pre-prod):
- repos.docker_checked_at + partial index for discovery backlog
- repo_docker_pulls_daily partitioned by date (pg_partman, mirrors downloads_daily)
- repo_docker_stale_idx on last_synced_at

Tested against a 1000-repo random sample from prod public.repositories:
2.6% hit rate on Hub; 87% of repos have no Dockerfile; ghcr.io is the dominant
registry for the remainder. CI-workflow parsing and ghcr/quay probes scoped as
follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
@joanreyero joanreyero force-pushed the feat/CM-1213-dockerhub-sync branch from 6fb6051 to b23c298 Compare June 3, 2026 15:42
- Retry the same row after a GitHub rate-limit park instead of abandoning it
  (cursor would otherwise advance past unprobed repos until end-of-sweep).
- Serialize Docker Hub calls via a promise chain so the per-token GitHub fan-out
  cannot fire concurrent requests against the per-IP Hub budget.
- 401/403 from Hub now classified AUTH and propagated, so a misconfigured base
  URL fails fast instead of silently marking every image gone.
- Stop discarding valid 200 responses when x-ratelimit-remaining=0.
- Wrap repo_docker + repo_docker_pulls_daily writes in a transaction.
- Classify non-JSON GitHub GraphQL bodies as MALFORMED.

Not addressed (replied on PR):
- Inline SQL stays per packages_worker convention (matches enricher/osv).
- repo_docker_pulls_daily partition setup deferred to pg_partman, same as
  downloads_daily in the same migration.
- Loop-level retry/parking tests deferred; validated against 1065 real repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
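
The promise-chain serialization mentioned in the commit above can be sketched like this. `hubChain` is the name used later in this PR; `serializeHub` and the exact chaining details are assumptions for illustration, not the code in index.ts.

```typescript
// Tail of the chain; every new Hub call is queued behind it.
let hubChain: Promise<unknown> = Promise.resolve()

// Run `task` only after all previously queued tasks have settled, so at most
// one Hub request is in flight regardless of how many workers call this.
function serializeHub<T>(task: () => Promise<T>): Promise<T> {
  const run = hubChain.then(() => task())
  // Swallow the rejection on the chain itself so one failed call does not
  // block later ones; the caller still sees the rejection through `run`.
  hubChain = run.catch(() => undefined)
  return run
}
```

Because every caller awaits the shared tail, a single stalled request would hold up the whole queue, which is why a request timeout matters for this design.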
Copilot AI review requested due to automatic review settings June 3, 2026 16:11

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.

Comment thread services/apps/packages_worker/src/dockerhub/detectDockerfile.ts
Comment thread services/apps/packages_worker/src/dockerhub/detectDockerfile.ts
Comment thread services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts Outdated
Per themarolt's review on #4149, packages-db queries belong in
services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now
imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow,
upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from
@crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx
orchestrator. Query strings unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Hub calls are serialized via hubChain, so a stalled socket would block all
subsequent probes indefinitely. AbortSignal.timeout(30s) on both the Hub and
GitHub GraphQL requests; aborts surface as TRANSIENT and retry with backoff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
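
A minimal sketch of the timeout handling described in the commit above, assuming aborts are detected by error name. The `FetchError` class here is a hypothetical stand-in (the real one lives in types.ts and its shape is not shown in this PR text), and `fetchWithTimeout` with its injected `fetchImpl` is an illustrative helper, not the worker's function.

```typescript
// Hypothetical stand-in for the FetchError in types.ts.
class FetchError extends Error {
  kind: 'TRANSIENT'
  constructor(message: string) {
    super(message)
    this.kind = 'TRANSIENT'
  }
}

// Anything fetch-shaped: takes a URL and an abort signal.
type FetchLike<T> = (url: string, init: { signal: AbortSignal }) => Promise<T>

async function fetchWithTimeout<T>(
  fetchImpl: FetchLike<T>,
  url: string,
  timeoutMs = 30_000,
): Promise<T> {
  try {
    // AbortSignal.timeout aborts with a TimeoutError once timeoutMs elapses.
    return await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) })
  } catch (err) {
    const name = (err as { name?: unknown } | null)?.name
    if (name === 'TimeoutError' || name === 'AbortError') {
      // Surface aborts as TRANSIENT so the caller's retry/backoff path handles them.
      throw new FetchError(`request to ${url} timed out after ${timeoutMs}ms`)
    }
    throw err
  }
}
```

In a Node runtime with global `fetch`, the built-in can be passed directly, for example `fetchWithTimeout(fetch, url)`.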
Copilot AI review requested due to automatic review settings June 5, 2026 14:07

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 1 comment.

Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
joanreyero and others added 2 commits June 5, 2026 15:14
Conflict in packages_worker/package.json scripts: kept both sides; moved
dockerhub-sync inspector port 9235 -> 9238 (deps-dev-ingest took 9235 on main).

getDockerhubConfig still reads ENRICHER_GITHUB_TOKENS directly so the
enricher-v2 GitHub-App switch on main doesn't affect this worker; migrating
dockerhub-sync to getGithubAppConfig() is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Aligns with enricher-v2 (#4165). getDockerhubConfig drops the
ENRICHER_GITHUB_TOKENS PAT pool; the entrypoint now calls getGithubAppConfig +
resolveInstallations + fetchRateLimitDiagnostics, and the discovery fan-out
runs one worker per installation id with parkedUntil keyed on installationId.
getInstallationToken is called per-request so token refresh/caching is shared
with the enricher.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
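
The per-installation parking described in the commit above amounts to a small map from installation id to a "do not use before" timestamp. The sketch below assumes epoch-millisecond timestamps and uses hypothetical helper names (`park`, `isParked`, `waitMs`); only `parkedUntil` and `installationId` come from the commit message.

```typescript
// installationId -> epoch ms before which that installation must not be used.
const parkedUntil = new Map<number, number>()

// Record a rate-limit reset time for one installation.
function park(installationId: number, resetAtMs: number): void {
  parkedUntil.set(installationId, resetAtMs)
}

// True while the installation's reset time is still in the future.
function isParked(installationId: number, nowMs: number): boolean {
  return (parkedUntil.get(installationId) ?? 0) > nowMs
}

// Milliseconds a worker should wait before retrying the same row; 0 when usable.
function waitMs(installationId: number, nowMs: number): number {
  return Math.max(0, (parkedUntil.get(installationId) ?? 0) - nowMs)
}
```

Keying on the installation id means one rate-limited installation pauses only its own worker while the others keep draining the page.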
Copilot AI review requested due to automatic review settings June 5, 2026 14:19
@joanreyero joanreyero requested review from epipav and themarolt June 5, 2026 14:20
githubFetchWithRetries was returning null on AUTH (same bucket as NOT_FOUND),
which caused discoverRepo to mark docker_checked_at and move on. With a bad
installation token that would silently stamp every repo for
DOCKERHUB_DISCOVERY_INTERVAL_DAYS. Now AUTH re-throws through
processDiscoveryPage so the worker exits and restarts with a fresh
resolveInstallations() — symmetric to the Hub AUTH path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
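
The distinction the commit above draws between NOT_FOUND and AUTH can be sketched as a small catch-block helper. The `FetchError` class, the `ErrorKind` union, and the helper names are illustrative assumptions built from the kinds named in this PR; the real githubFetchWithRetries may be structured differently.

```typescript
// Error kinds named in this PR's commit messages.
type ErrorKind = 'NOT_FOUND' | 'AUTH' | 'TRANSIENT' | 'MALFORMED'

// Hypothetical stand-in for the FetchError in types.ts.
class FetchError extends Error {
  kind: ErrorKind
  constructor(kind: ErrorKind, message: string) {
    super(message)
    this.kind = kind
  }
}

// Read the kind off any thrown value without relying on subclass instanceof checks.
function kindOf(err: unknown): ErrorKind | undefined {
  return err instanceof Error ? (err as Partial<FetchError>).kind : undefined
}

// Use inside a catch block: NOT_FOUND becomes null, so the caller can stamp
// docker_checked_at and move on; AUTH and anything else is re-thrown.
function nullIfNotFound(err: unknown): null {
  if (kindOf(err) === 'NOT_FOUND') return null
  throw err
}
```

Keeping AUTH out of the "return null" bucket is what prevents a bad token from silently marking every repo as checked.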
Comment thread services/apps/packages_worker/src/dockerhub/detectDockerfile.ts

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts Outdated
Comment thread backend/src/osspckgs/migrations/V1779710880__initial_schema.sql Outdated
Comment thread backend/src/osspckgs/migrations/V1779710880__initial_schema.sql Outdated
Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
joanreyero and others added 2 commits June 8, 2026 15:29
- Initial schema is now deployed; revert in-place edits to V1779710880 and
  move docker_checked_at / repos_docker_pending_idx / repo_docker_stale_idx /
  repo_docker_pulls_daily into a new migration V1780928852__dockerhub_sync
  (created via cli scaffold create-packages-migration).
- processRefreshPage: wrap per-row Hub fetch + DB writes in try/catch so a
  transient DB failure logs and continues instead of crashing the worker.
  Hub AUTH still propagates fatally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
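
The per-row error isolation described in the commit above can be sketched as a generic page processor. `processRows`, `isFatal`, and `onError` are hypothetical names; the real processRefreshPage presumably inlines the Hub fetch and DB writes rather than taking callbacks.

```typescript
interface PageResult {
  ok: number
  failed: number
}

// Handle each row independently: a non-fatal failure is reported and counted,
// while a fatal one (for example Hub AUTH) aborts the page by re-throwing.
async function processRows<Row>(
  rows: Row[],
  handle: (row: Row) => Promise<void>,
  isFatal: (err: unknown) => boolean,
  onError: (row: Row, err: unknown) => void,
): Promise<PageResult> {
  const result: PageResult = { ok: 0, failed: 0 }
  for (const row of rows) {
    try {
      await handle(row)
      result.ok++
    } catch (err) {
      if (isFatal(err)) throw err
      onError(row, err)
      result.failed++
    }
  }
  return result
}
```

A row that fails this way keeps its stale last_synced_at only if `handle` did not update it before throwing, so it would normally be picked up again on a later refresh page.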
- Register repo_docker_pulls_daily with pg_partman in the migration
  (mirrors V1780231200 for downloads_daily; guarded against re-registration).
- Normalize trailing slash on DOCKERHUB_API_BASE_URL to avoid // in the path.
- Update stale 'per-token' comment to 'per-installation' after the App-auth
  switch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
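
The trailing-slash normalization from the commit above is a one-line transform. This sketch assumes the base URL is the host root (for example `https://hub.docker.com`) and that the path is the `/v2/repositories/<owner>/<name>/` form from the PR description; `normalizeBaseUrl` and `hubRepoUrl` are hypothetical names.

```typescript
// Strip any trailing slashes so joining with a path never produces `//`.
function normalizeBaseUrl(raw: string): string {
  return raw.replace(/\/+$/, '')
}

// Build the repository endpoint for one image (assumed path layout).
function hubRepoUrl(baseUrl: string, owner: string, name: string): string {
  return `${normalizeBaseUrl(baseUrl)}/v2/repositories/${owner}/${name}/`
}
```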

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit a4528db.

Comment thread backend/src/osspckgs/migrations/V1780928852__dockerhub_sync.sql
Copilot AI review requested due to automatic review settings June 8, 2026 14:33

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.

Comment on lines +91 to +92
ON CONFLICT (image_name) DO UPDATE SET
repo_id = COALESCE(repo_docker.repo_id, EXCLUDED.repo_id),
Comment on lines +262 to +266
  } else {
    log.error({ url: row.url, err }, 'Unexpected discovery error')
    failed++
    done = true
  }
@joanreyero joanreyero requested a review from themarolt June 8, 2026 14:42
3 participants