chore: use packages instead of universe [CM-1226] #4177
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
|
|
| FROM unnest($(packageIds)::bigint[], $(scores)::numeric[]) AS v(package_id, score) | ||
| JOIN packages p ON p.id = v.package_id | ||
| WHERE pu.purl = p.purl`, | ||
| WHERE p.id = v.package_id`, |
Missing packages centrality column
High Severity
mergeCentralityScores now runs UPDATE packages SET centrality_score, but the schema only defines centrality_score on packages_universe, not on packages. PageRank merges will fail at the database until packages gains that column (or the update targets the correct table).
Reviewed by Cursor Bugbot for commit 1885523.
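To make the statement under review easier to read, here is a minimal sketch of how the parallel arrays bound to `unnest` line up with the `UPDATE`. The SQL text is pieced together from the diff fragments above and may not match the file exactly; `CentralityRow` and `buildCentralityMergeParams` are hypothetical names, not identifiers from the PR. As the comment notes, the statement can only succeed once `packages` has a `centrality_score` column.

```typescript
// Hypothetical shape of one PageRank result; not taken from the PR.
interface CentralityRow {
  packageId: number
  score: number
}

// Statement as it appears to read after this change (alias p, matched by id).
// $(name) placeholders follow the named-parameter style used in the diff.
const MERGE_CENTRALITY_SQL = `
  UPDATE packages p
     SET centrality_score = v.score
    FROM unnest($(packageIds)::bigint[], $(scores)::numeric[]) AS v(package_id, score)
   WHERE p.id = v.package_id`

// unnest() pairs the two arrays element by element, so index i of packageIds
// must describe the same package as index i of scores.
function buildCentralityMergeParams(rows: CentralityRow[]): {
  packageIds: number[]
  scores: number[]
} {
  return {
    packageIds: rows.map((r) => r.packageId),
    scores: rows.map((r) => r.score),
  }
}
```

Building both arrays from the same `rows` list in one place is what keeps ids and scores aligned.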
| log.info('Starting packages rank pass') | ||
| const qx = await getPackagesDb() | ||
|
|
||
| const result = await qx.selectOne(`SELECT * FROM rank_packages()`) |
rank_packages SQL missing
High Severity
Worker code now calls rank_packages() (with or without arguments), but the repository migrations only define rank_packages_universe(). Ranking and pnpm run:impact will fail until a migration adds or aliases rank_packages on packages.
Reviewed by Cursor Bugbot for commit 1885523.
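One way to surface this class of failure before the ranking query runs is a preflight check against the catalog. The sketch below is an assumption, not code from the PR: `MinimalQx` is a hypothetical stand-in for the executor returned by `getPackagesDb()`, and `functionExists` is an illustrative helper. `pg_proc` is the PostgreSQL system catalog that lists functions.

```typescript
// Hypothetical minimal executor; the real one from getPackagesDb() may differ.
interface MinimalQx {
  selectOne(sql: string, params?: Record<string, unknown>): Promise<Record<string, unknown>>
}

// Returns true when at least one function with this name exists in pg_proc.
// The lookup is by name only, so it ignores schema and argument lists.
async function functionExists(qx: MinimalQx, name: string): Promise<boolean> {
  const row = await qx.selectOne(
    `SELECT EXISTS (SELECT 1 FROM pg_proc WHERE proname = $(name)) AS present`,
    { name },
  )
  return row.present === true
}
```

A worker could call `functionExists(qx, 'rank_packages')` at startup and fail fast with a clear message instead of erroring partway through an activity.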
| 3. For each matching changed name, fetch the full document from `registry.npmjs.com/<package>`. | ||
| 4. Normalize into `packages`, `versions`, `maintainers`, and `package_maintainers` using the write rules above. | ||
| 4. Downloads: two Temporal workflows — `backfillDailyDownloads` (per-day rows into `downloads_daily`) and `refreshLast30dDownloads` (rolling 30-day windows into `downloads_last_30d`). Both are self-healing: they detect and fill missing windows on each run rather than assuming continuity. Both currently source packages from a static watch list. Once the deps.dev BQ import is operational, `backfillDailyDownloads` will source from `packages` (Tier 2 critical slice) and `refreshLast30dDownloads` will source from `packages_universe` (full Tier 3 population). | ||
| 4. Downloads: two Temporal workflows — `backfillDailyDownloads` (per-day rows into `downloads_daily`) and `refreshLast30dDownloads` (rolling 30-day windows into `downloads_last_30d`). Both are self-healing: they detect and fill missing windows on each run rather than assuming continuity. Both source packages from `packages WHERE ecosystem = 'npm' AND is_critical = TRUE`. |
Daily backfill scope mismatch
Medium Severity
The ADR and last-30d queries now limit npm download work to is_critical = TRUE, but getNpmPackagesNeedingDailyBackfill still selects every npm row in packages. After the full universe lives in packages, daily backfill can target millions of non-critical packages contrary to the documented design.
Reviewed by Cursor Bugbot for commit 1885523.
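The mismatch comes down to two selection queries applying different predicates. A minimal sketch of keeping the scope decision in one place, assuming a hypothetical `buildNpmSelection` helper (the PR has no such function, and the column list is illustrative only):

```typescript
// Hypothetical option: whether a download job targets only critical packages.
interface SelectionScope {
  criticalOnly: boolean
}

// Builds the npm package selection so every caller shares the same predicates.
function buildNpmSelection(scope: SelectionScope): string {
  const predicates = ["p.ecosystem = 'npm'"]
  if (scope.criticalOnly) {
    predicates.push('p.is_critical = TRUE')
  }
  return `SELECT p.purl AS purl FROM packages p WHERE ${predicates.join(' AND ')}`
}
```

Under the ADR wording quoted above, daily backfill would pass `{ criticalOnly: true }`; a job meant to cover every npm row would pass `{ criticalOnly: false }`.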
Pull request overview
This PR updates the oss-packages criticality/ranking architecture to treat packages as the single workspace/source of truth (retiring packages_universe in docs and worker code), and switches the criticality ranking entrypoint from rank_packages_universe() to rank_packages().
Changes:
- Updates npm last-30d selection/mirroring to operate on packages (scoped to ecosystem = 'npm' AND is_critical = TRUE) and mirror into packages.downloads_last_30d.
- Replaces the deps.dev ranking activity rankPackagesUniverse with a new rankPackages activity that calls rank_packages().
- Updates ADR-0001 and various worker strings/comments to consistently reference packages + rank_packages().
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/packages/downloadsLast30d.ts | Switches candidate selection from packages_universe to packages (critical-only) and changes the mirror UPDATE target to packages. |
| services/apps/packages_worker/src/scripts/triggerBootstrap.ts | Updates CLI help text to reference rankPackages. |
| services/apps/packages_worker/src/npm/workflows.ts | Updates comment to reference packages.downloads_last_30d. |
| services/apps/packages_worker/src/npm/schedule.ts | Updates schedule docs to reflect critical-only scope + packages mirroring. |
| services/apps/packages_worker/src/npm/activities.ts | Updates comments describing latest-window mirroring target. |
| services/apps/packages_worker/src/deps-dev/workflows/bootstrapOsspckgs.ts | Updates comment referencing the ranking activity name. |
| services/apps/packages_worker/src/deps-dev/activities/rankPackagesUniverse.ts | Removes the old universe-based ranking activity. |
| services/apps/packages_worker/src/deps-dev/activities/rankPackages.ts | Adds a new activity calling rank_packages(). |
| services/apps/packages_worker/src/deps-dev/activities/index.ts | Re-exports rankPackages instead of rankPackagesUniverse. |
| services/apps/packages_worker/src/criticality/run-impact.ts | Switches the on-demand script from rank_packages_universe(...) to rank_packages(...). |
| services/apps/packages_worker/src/criticality/queries.ts | Changes centrality merge to update packages by id. |
| services/apps/packages_worker/src/criticality/activities.ts | Updates comment to reference merging into packages. |
| services/apps/packages_worker/src/bin/criticality-worker.ts | Updates startup log message to reference rank_packages(). |
| docs/adr/0001-oss-packages-design-decisions.md | Updates ADR to remove packages_universe from the flow and describe packages + rank_packages() as the single path. |
| `UPDATE packages | ||
| SET downloads_last_30d = $(count) | ||
| WHERE purl = $(purl) AND downloads_last_30d IS DISTINCT FROM $(count)`, | ||
| { count, purl }, |
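The `IS DISTINCT FROM` guard in this statement skips the write when the cached count already matches, and unlike `<>` it still matches rows whose cached value is NULL. A sketch of the two comparison rules in TypeScript, with `null` standing in for SQL NULL; all three function names are illustrative and not part of the PR:

```typescript
type SqlInt = number | null // null stands in for SQL NULL

// SQL "a <> b": yields NULL (unknown) when either side is NULL.
function sqlNotEqual(a: SqlInt, b: SqlInt): boolean | null {
  if (a === null || b === null) return null
  return a !== b
}

// SQL "a IS DISTINCT FROM b": always true or false; two NULLs are not distinct.
function isDistinctFrom(a: SqlInt, b: SqlInt): boolean {
  return a !== b
}

// A WHERE clause keeps a row only when its predicate evaluates to true.
function wouldUpdate(predicate: boolean | null): boolean {
  return predicate === true
}
```

With a cached value of NULL and a new count of 120, `wouldUpdate(sqlNotEqual(null, 120))` is false (the row would be skipped), while `wouldUpdate(isDistinctFrom(null, 120))` is true.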
| `SELECT p.purl AS purl, p.first_release_at::text AS first_release_at | ||
| FROM packages p | ||
| LEFT JOIN npm_package_universe_state s ON s.purl = p.purl | ||
| WHERE p.ecosystem = 'npm' | ||
| AND p.is_critical = TRUE | ||
| AND (((hashtext(p.purl) % $(laneCount)) + $(laneCount)) % $(laneCount)) = $(laneIndex) |
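The lane predicate uses a double modulo because `hashtext` returns a signed integer and `%` in PostgreSQL keeps the sign of the dividend, so a single `%` could yield a negative lane that matches no `laneIndex`. A small sketch of the same arithmetic in TypeScript, whose `%` treats negative operands the same way; `laneFor` is an illustrative name, and the hash is passed in rather than computed:

```typescript
// Maps a signed hash to a lane index in [0, laneCount).
// Mirrors ((hash % laneCount) + laneCount) % laneCount from the query.
function laneFor(hash: number, laneCount: number): number {
  if (!Number.isInteger(laneCount) || laneCount <= 0) {
    throw new RangeError('laneCount must be a positive integer')
  }
  return ((hash % laneCount) + laneCount) % laneCount
}
```

For example, a hash of -7 with four lanes lands in lane 1, whereas a single `%` would give -3.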
| `SELECT p.purl AS purl, p.first_release_at::text AS first_release_at, | ||
| s.downloads_30d_last_run_at AS last_run_at | ||
| FROM packages p | ||
| JOIN npm_package_universe_state s ON s.purl = p.purl |
| `UPDATE packages p | ||
| SET centrality_score = v.score | ||
| FROM unnest($(packageIds)::bigint[], $(scores)::numeric[]) AS v(package_id, score) | ||
| JOIN packages p ON p.id = v.package_id | ||
| WHERE pu.purl = p.purl`, | ||
| WHERE p.id = v.package_id`, |
| const result = await qx.selectOne(`SELECT * FROM rank_packages()`) | ||
|
|
| const [result] = await qx.select( | ||
| `SELECT * FROM rank_packages_universe($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`, | ||
| `SELECT * FROM rank_packages($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`, | ||
| { wDownloads, wDepPkgs, wTransitive, topN }, |
| | `packages` | Upsert on `purl`. Each worker only writes columns it owns; ecosystem isolation means column-level conflicts cannot occur in practice. | | ||
| | `packages_universe` | Incremental upsert keyed on `purl`. The deps.dev import only touches rows whose underlying deps.dev snapshot date has advanced since the previous import (initial run is a one-time full backfill). | | ||
| | `packages` | Incremental upsert keyed on `purl`. The deps.dev import only touches rows whose underlying deps.dev snapshot date has advanced since the previous import (initial run is a one-time full backfill). `rank_packages()` scores and flags in place — no separate ranking workspace. | |
| Downloads is the strongest criticality signal. `packages` is the single source of truth — `packages_universe` has been retired. `packages.downloads_last_30d` is the single column used by `rank_packages()`. | ||
|
|
||
| Tier 2 (`packages`) downloads are stored in `downloads_daily` — one row per `(package_id, date)`, consumers sum over any window they need. Tier 3 (`packages_universe`) downloads are stored in `downloads_last_30d` — one row per `(purl, end_date)` capturing a rolling 30-day window — with the latest window's count also cached on `packages_universe.downloads_last_30d bigint` for direct use by `rank_packages_universe()` without a join. | ||
| Per-day history is stored in `downloads_daily` — one row per `(package_id, date)`, consumers sum over any window they need. The latest 30-day window count is written directly to `packages.downloads_last_30d` for `is_critical = TRUE` packages. | ||
|
|
| Weights sum to 1.0 → impact ∈ `[0, 1]`. `dependent_count` is direct dependent packages only; `transitive_dependent_count` is indirect dependents only. All weights are call-time numeric parameters to `rank_packages()` — tunable without schema or code changes. | ||
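To illustrate why weights summing to 1.0 keep impact in `[0, 1]`, here is a minimal sketch. It assumes each signal has already been normalized to `[0, 1]` (the normalization is not shown in this excerpt), and any weight values passed in are the caller's choice, not the ADR's current weights. `impact` and the `Signals` field names are hypothetical; the weight names echo the `wDownloads`, `wDepPkgs`, and `wTransitive` parameters seen in the call to `rank_packages(...)`.

```typescript
// Signals assumed pre-normalized to [0, 1]; normalization is out of scope here.
interface Signals {
  downloads: number
  dependents: number
  transitiveDependents: number
}

// Call-time weights; the function rejects sets that do not sum to 1.0.
interface Weights {
  wDownloads: number
  wDepPkgs: number
  wTransitive: number
}

// Convex combination: with non-negative weights summing to 1 and inputs in
// [0, 1], the result cannot leave [0, 1].
function impact(s: Signals, w: Weights): number {
  const total = w.wDownloads + w.wDepPkgs + w.wTransitive
  if (Math.abs(total - 1) > 1e-9) {
    throw new RangeError('weights must sum to 1.0')
  }
  return (
    w.wDownloads * s.downloads +
    w.wDepPkgs * s.dependents +
    w.wTransitive * s.transitiveDependents
  )
}
```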
|
|
||
| `centrality_score` (PageRank) is computed and stored on `packages_universe` by the criticality worker and will be added to the formula if needed. | ||
| `centrality_score` (PageRank) is computed and stored on `packages` by the criticality worker and will be added to the formula if needed. | ||
|
|
||
| **Current weights** (defaults in `rank_packages_universe()`, iterate once the ranked list is observable): | ||
| **Current weights** (defaults in `rank_packages()`, iterate once the ranked list is observable): |
| FROM packages p | ||
| LEFT JOIN npm_package_universe_state s ON s.purl = p.purl | ||
| WHERE p.ecosystem = 'npm' | ||
| AND p.is_critical = TRUE |
Can you remove this flag for this query specifically? Everything that previously ran against packages_universe should still run against all packages in packages. downloads_last_30d should be run against the entire universe, as we need it for the impact/critical score.
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 6 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit ff92e3b.
|
|
||
| const [result] = await qx.select( | ||
| `SELECT * FROM rank_packages_universe($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`, | ||
| `SELECT * FROM rank_packages($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`, |
Missing rank_packages database function
High Severity
The worker now invokes rank_packages(), but the oss-packages migrations only define rank_packages_universe() on packages_universe. Running pnpm run:impact or the rankPackages activity fails at query time because the renamed function is not present in the schema.
Reviewed by Cursor Bugbot for commit ff92e3b.
| if (mirrorToUniverse) { | ||
| const rowCount = await qx.result( | ||
| `UPDATE packages_universe | ||
| `UPDATE packages |
Updates nonexistent packages column
High Severity
Latest-window download mirroring now runs UPDATE packages SET downloads_last_30d, yet downloads_last_30d exists on packages_universe, not on packages, in the current migrations. The npm last-30d breadth lane errors when it tries to denormalize the latest count.
Reviewed by Cursor Bugbot for commit ff92e3b.
| FROM packages p | ||
| JOIN npm_package_universe_state s ON s.purl = p.purl | ||
| WHERE p.ecosystem = 'npm' | ||
| AND p.is_critical = TRUE |
History backfill limits critical packages
Medium Severity
Last-30d history selection adds p.is_critical = TRUE, so only critical npm packages get older rolling windows. Prior universe-wide behavior and the PR review ask for full-universe downloads_last_30d coverage needed for impact scoring.
Reviewed by Cursor Bugbot for commit ff92e3b.


This pull request updates the architecture decision record to consolidate the criticality scoring workflow onto a single `packages` table, removing the previous use of a separate `packages_universe` ranking workspace. All references to the ranking and scoring process, data flow, and worker implementation have been revised to reflect this simplification. The documentation now consistently describes the use of `rank_packages()` as the scoring function and clarifies which worker is responsible for writing to each table.

Key changes:

Criticality scoring and ranking process:
- […] `packages` table, eliminating the `packages_universe` workspace. The scoring function is now `rank_packages()` instead of `rank_packages_universe()`.
- […] `packages` and calls `rank_packages()`, rather than using a separate workspace.

Data ingestion and worker responsibilities:
- […] `packages`. The documentation clarifies that the criticality worker and other sub-workers are responsible for writing only the columns they own in `packages`.
- […] `downloads_last_30d` workflow now sources directly from `packages WHERE ecosystem = 'npm' AND is_critical = TRUE` for both daily and rolling 30-day download counts, instead of relying on a static watch list or the old universe table.

Documentation and terminology:
- […] `packages` table and `rank_packages()` function, ensuring consistency and clarity throughout the document.

These changes streamline the criticality scoring architecture, reduce redundancy, and clarify ownership and workflow for future development and maintenance.
Note

Medium Risk
Changes which tables and SQL paths ranking, PageRank merges, and download refresh use; a misaligned `rank_packages()` DB function or breadth vs `is_critical` scope could skew criticality scores or increase npm API load.

Overview
Retires `packages_universe` as a ranking workspace: ADR-0001 and worker code now treat `packages` as the single read/write surface for criticality signals, deps.dev universe import, and download denormalization.

Ranking pipeline: `rankPackagesUniverse` (TRUNCATE `packages_universe`, copy from `packages`, `rank_packages_universe()`, propagate `is_critical` back) is removed and replaced by a thin `rankPackages` activity that calls `rank_packages()` in place. CLI `run:impact` and the criticality worker log the new function name and no longer report `propagated_rows`. PageRank `mergeCentralityScores` now `UPDATE packages` by `id` instead of joining through `packages_universe`.

npm downloads (DAL): last-30d selection and upsert mirroring query `packages` directly; the latest-window cache writes `packages.downloads_last_30d` (not `packages_universe`). History backfill selection adds `is_critical = TRUE`. Comments/schedules align with critical-only refresh for scoring inputs.

Reviewed by Cursor Bugbot for commit ff92e3b. Bugbot is set up for automated code reviews on this repo.