chore: use packages instead of universe [CM-1226] #4177

Open
mbani01 wants to merge 6 commits into main from fix/use_packages_instead_of_universe

mbani01 (Contributor) commented Jun 8, 2026

This pull request updates the architecture decision record to consolidate the criticality scoring workflow onto a single packages table, removing the previous use of a separate packages_universe ranking workspace. All references to the ranking and scoring process, data flow, and worker implementation have been revised to reflect this simplification. The documentation now consistently describes the use of rank_packages() as the scoring function and clarifies which worker is responsible for writing to each table.

Key changes:

Criticality scoring and ranking process:

  • All criticality scoring, ranking, and signal storage now occur directly on the packages table, eliminating the packages_universe workspace. The scoring function is now rank_packages() instead of rank_packages_universe().
  • The in-memory computation of transitive dependents and PageRank centrality now merges results into packages and calls rank_packages(), rather than using a separate workspace.
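
As a rough illustration of the in-memory step described above, the sketch below counts direct and indirect dependents by walking a reversed dependency graph breadth-first. The `DependencyEdge` shape and the `countDependents` name are hypothetical and not taken from the worker code; the split between direct and indirect dependents follows the ADR's definitions of `dependent_count` and `transitive_dependent_count`.

```typescript
// Hypothetical edge shape: `from` depends on `to`.
interface DependencyEdge {
  from: string
  to: string
}

interface DependentCounts {
  direct: number
  transitive: number // indirect dependents only, excluding direct ones
}

// Count dependents of `target` with a breadth-first search over reversed edges.
function countDependents(edges: DependencyEdge[], target: string): DependentCounts {
  // Reverse adjacency: package -> packages that depend on it.
  const dependentsOf = new Map<string, string[]>()
  for (const edge of edges) {
    const list = dependentsOf.get(edge.to) ?? []
    list.push(edge.from)
    dependentsOf.set(edge.to, list)
  }

  const direct = new Set<string>(dependentsOf.get(target) ?? [])
  direct.delete(target) // ignore a self-dependency

  // `seen` guards against cycles and against counting a package twice.
  const seen = new Set<string>([target].concat(Array.from(direct)))
  let frontier = Array.from(direct)
  let transitive = 0
  while (frontier.length > 0) {
    const next: string[] = []
    for (const pkg of frontier) {
      for (const dependent of dependentsOf.get(pkg) ?? []) {
        if (!seen.has(dependent)) {
          seen.add(dependent)
          transitive += 1
          next.push(dependent)
        }
      }
    }
    frontier = next
  }
  return { direct: direct.size, transitive }
}
```

A package that is both a direct and an indirect dependent is counted once, as direct, which is one reading of "indirect dependents only"; the real worker may resolve that case differently.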

Data ingestion and worker responsibilities:

  • The deps.dev import and all enrichment jobs now write directly to packages. The documentation clarifies that the criticality worker and other sub-workers are responsible for writing only the columns they own in packages.
  • The downloads_last_30d workflow now sources directly from packages WHERE ecosystem = 'npm' AND is_critical = TRUE for both daily and rolling 30-day download counts, instead of relying on a static watch list or the old universe table.

Documentation and terminology:

  • All diagrams, flowcharts, and table descriptions have been updated to reference the packages table and rank_packages() function, ensuring consistency and clarity throughout the document.
  • The write rules table and worker layout sections have been revised to reflect the new table ownership and update policies.

These changes streamline the criticality scoring architecture, reduce redundancy, and clarify ownership and workflow for future development and maintenance.


Note

Medium Risk
Changes which tables and SQL paths are used by ranking, PageRank merges, and download refresh. A misaligned rank_packages() DB function, or a mismatch between breadth and is_critical scope, could skew criticality scores or increase npm API load.

Overview
Retires packages_universe as a ranking workspace — ADR-0001 and worker code now treat packages as the single read/write surface for criticality signals, deps.dev universe import, and download denormalization.

Ranking pipeline: rankPackagesUniverse (TRUNCATE packages_universe, copy from packages, rank_packages_universe(), propagate is_critical back) is removed and replaced by a thin rankPackages activity that calls rank_packages() in place. CLI run:impact and the criticality worker log the new function name and no longer report propagated_rows. PageRank mergeCentralityScores now runs UPDATE packages by id instead of joining through packages_universe.

npm downloads (DAL): last-30d selection and upsert mirroring query packages directly; the latest-window cache writes packages.downloads_last_30d (not packages_universe). History backfill selection adds is_critical = TRUE. Comments/schedules align with critical-only refresh for scoring inputs.

Reviewed by Cursor Bugbot for commit ff92e3b.

mbani01 added 5 commits June 8, 2026 16:26
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 self-assigned this Jun 8, 2026
Copilot AI review requested due to automatic review settings June 8, 2026 15:31
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

  FROM unnest($(packageIds)::bigint[], $(scores)::numeric[]) AS v(package_id, score)
- JOIN packages p ON p.id = v.package_id
- WHERE pu.purl = p.purl`,
+ WHERE p.id = v.package_id`,

Missing packages centrality column

High Severity

mergeCentralityScores now runs UPDATE packages SET centrality_score, but the schema only defines centrality_score on packages_universe, not on packages. PageRank merges will fail at the database until packages gains that column (or the update targets the correct table).

Reviewed by Cursor Bugbot for commit 1885523.
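
One way to surface this kind of mismatch before the PageRank merge runs is a preflight check against `information_schema.columns`. The sketch below is only an assumption about how such a check could look: `REQUIRED_PACKAGES_COLUMNS`, `COLUMN_LOOKUP_SQL`, and `findMissingColumns` are hypothetical names, and the column list mirrors the `packages` columns mentioned in this PR rather than the actual schema.

```typescript
// Columns this PR reads or writes on `packages`, per the review comments (assumed list).
const REQUIRED_PACKAGES_COLUMNS = ['centrality_score', 'downloads_last_30d', 'is_critical']

// Standard PostgreSQL catalog view; a real check would likely also filter on table_schema.
const COLUMN_LOOKUP_SQL = `
  SELECT column_name
  FROM information_schema.columns
  WHERE table_name = 'packages'`

interface ColumnRow {
  column_name: string
}

// Return the required columns that are absent from the catalog lookup result.
function findMissingColumns(rows: ColumnRow[], required: string[]): string[] {
  const present = new Set(rows.map((row) => row.column_name))
  return required.filter((name) => !present.has(name))
}
```

A worker could run `COLUMN_LOOKUP_SQL` once at startup and refuse to start if `findMissingColumns` returns a non-empty list, which turns a mid-run SQL error into an early, explicit failure.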

log.info('Starting packages rank pass')
const qx = await getPackagesDb()

const result = await qx.selectOne(`SELECT * FROM rank_packages()`)

rank_packages SQL missing

High Severity

Worker code now calls rank_packages() (with or without arguments), but the repository migrations only define rank_packages_universe(). Ranking and pnpm run:impact will fail until a migration adds or aliases rank_packages on packages.

Additional Locations (1)
Reviewed by Cursor Bugbot for commit 1885523.

3. For each matching changed name, fetch the full document from `registry.npmjs.com/<package>`.
4. Normalize into `packages`, `versions`, `maintainers`, and `package_maintainers` using the write rules above.
4. Downloads: two Temporal workflows — `backfillDailyDownloads` (per-day rows into `downloads_daily`) and `refreshLast30dDownloads` (rolling 30-day windows into `downloads_last_30d`). Both are self-healing: they detect and fill missing windows on each run rather than assuming continuity. Both currently source packages from a static watch list. Once the deps.dev BQ import is operational, `backfillDailyDownloads` will source from `packages` (Tier 2 critical slice) and `refreshLast30dDownloads` will source from `packages_universe` (full Tier 3 population).
4. Downloads: two Temporal workflows — `backfillDailyDownloads` (per-day rows into `downloads_daily`) and `refreshLast30dDownloads` (rolling 30-day windows into `downloads_last_30d`). Both are self-healing: they detect and fill missing windows on each run rather than assuming continuity. Both source packages from `packages WHERE ecosystem = 'npm' AND is_critical = TRUE`.
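
The self-healing behaviour described above amounts to diffing the expected dates against the ones already stored. The following is a minimal sketch under that assumption; `findMissingDays` and its ISO-date string inputs are illustrative and not taken from the `backfillDailyDownloads` implementation.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// List every UTC day in [start, end] (inclusive, 'YYYY-MM-DD') that has no stored row.
function findMissingDays(start: string, end: string, stored: string[]): string[] {
  const have = new Set(stored)
  const missing: string[] = []
  const last = Date.parse(`${end}T00:00:00Z`)
  // UTC has no daylight-saving shifts, so stepping by a fixed day length is safe.
  for (let t = Date.parse(`${start}T00:00:00Z`); t <= last; t += MS_PER_DAY) {
    const day = new Date(t).toISOString().slice(0, 10)
    if (!have.has(day)) missing.push(day)
  }
  return missing
}
```

Because the function looks only at what is stored, a run that follows an outage picks up every gap in the range instead of assuming the previous run succeeded.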

Daily backfill scope mismatch

Medium Severity

The ADR and last-30d queries now limit npm download work to is_critical = TRUE, but getNpmPackagesNeedingDailyBackfill still selects every npm row in packages. After the full universe lives in packages, daily backfill can target millions of non-critical packages contrary to the documented design.

Reviewed by Cursor Bugbot for commit 1885523.
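
A possible way to keep the daily backfill aligned with the documented scope is to make the critical-only filter an explicit parameter of the selection query. The sketch below is hypothetical: `buildNpmSelection` and its `scope` option are not part of the data-access layer, and the real `getNpmPackagesNeedingDailyBackfill` query contains conditions that are not shown in this review.

```typescript
type SelectionScope = 'critical' | 'all'

// Build the npm package selection, optionally restricted to critical packages.
function buildNpmSelection(scope: SelectionScope): string {
  const conditions = ["p.ecosystem = 'npm'"]
  if (scope === 'critical') {
    conditions.push('p.is_critical = TRUE')
  }
  return `SELECT p.purl AS purl FROM packages p WHERE ${conditions.join(' AND ')}`
}
```

With the scope spelled out at each call site, a lane that must cover every npm row passes `'all'` and a lane limited to the critical slice passes `'critical'`, so the difference is visible in code review rather than buried in a query string.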

Copilot AI left a comment

Pull request overview

This PR updates the oss-packages criticality/ranking architecture to treat packages as the single workspace/source of truth (retiring packages_universe in docs and worker code), and switches the criticality ranking entrypoint from rank_packages_universe() to rank_packages().

Changes:

  • Updates npm last-30d selection/mirroring to operate on packages (scoped to ecosystem = 'npm' AND is_critical = TRUE) and mirror into packages.downloads_last_30d.
  • Replaces the deps.dev ranking activity rankPackagesUniverse with a new rankPackages activity that calls rank_packages().
  • Updates ADR-0001 and various worker strings/comments to consistently reference packages + rank_packages().

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/packages/downloadsLast30d.ts | Switches candidate selection from packages_universe to packages (critical-only) and changes the mirror UPDATE target to packages. |
| services/apps/packages_worker/src/scripts/triggerBootstrap.ts | Updates CLI help text to reference rankPackages. |
| services/apps/packages_worker/src/npm/workflows.ts | Updates comment to reference packages.downloads_last_30d. |
| services/apps/packages_worker/src/npm/schedule.ts | Updates schedule docs to reflect critical-only scope + packages mirroring. |
| services/apps/packages_worker/src/npm/activities.ts | Updates comments describing latest-window mirroring target. |
| services/apps/packages_worker/src/deps-dev/workflows/bootstrapOsspckgs.ts | Updates comment referencing the ranking activity name. |
| services/apps/packages_worker/src/deps-dev/activities/rankPackagesUniverse.ts | Removes the old universe-based ranking activity. |
| services/apps/packages_worker/src/deps-dev/activities/rankPackages.ts | Adds a new activity calling rank_packages(). |
| services/apps/packages_worker/src/deps-dev/activities/index.ts | Re-exports rankPackages instead of rankPackagesUniverse. |
| services/apps/packages_worker/src/criticality/run-impact.ts | Switches the on-demand script from rank_packages_universe(...) to rank_packages(...). |
| services/apps/packages_worker/src/criticality/queries.ts | Changes centrality merge to update packages by id. |
| services/apps/packages_worker/src/criticality/activities.ts | Updates comment to reference merging into packages. |
| services/apps/packages_worker/src/bin/criticality-worker.ts | Updates startup log message to reference rank_packages(). |
| docs/adr/0001-oss-packages-design-decisions.md | Updates ADR to remove packages_universe from the flow and describe packages + rank_packages() as the single path. |

Comment on lines +163 to 166
`UPDATE packages
SET downloads_last_30d = $(count)
WHERE purl = $(purl) AND downloads_last_30d IS DISTINCT FROM $(count)`,
{ count, purl },
Comment on lines +39 to +44
`SELECT p.purl AS purl, p.first_release_at::text AS first_release_at
FROM packages p
LEFT JOIN npm_package_universe_state s ON s.purl = p.purl
WHERE p.ecosystem = 'npm'
AND p.is_critical = TRUE
AND (((hashtext(p.purl) % $(laneCount)) + $(laneCount)) % $(laneCount)) = $(laneIndex)
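
The doubled modulo in the lane predicate is there because `hashtext()` returns a signed integer and `%` keeps the sign of the dividend, so a single `%` could produce a negative lane. TypeScript's `%` behaves the same way, which makes the idea easy to sketch. `laneFor` is a hypothetical helper; it takes the hash as an input rather than reimplementing `hashtext()`.

```typescript
// Map a signed integer hash to a lane index in [0, laneCount).
function laneFor(hash: number, laneCount: number): number {
  if (!Number.isInteger(laneCount) || laneCount <= 0) {
    throw new RangeError('laneCount must be a positive integer')
  }
  // The first % may be negative; adding laneCount and reducing again makes it non-negative.
  return ((hash % laneCount) + laneCount) % laneCount
}
```

For example, `laneFor(-7, 4)` evaluates `-7 % 4` to `-3`, shifts it to `1`, and returns lane `1`, whereas a single `%` would have returned `-3` and matched no lane.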
Comment on lines +69 to +72
`SELECT p.purl AS purl, p.first_release_at::text AS first_release_at,
s.downloads_30d_last_run_at AS last_run_at
FROM packages p
JOIN npm_package_universe_state s ON s.purl = p.purl
Comment on lines +36 to +39
+ `UPDATE packages p
  SET centrality_score = v.score
  FROM unnest($(packageIds)::bigint[], $(scores)::numeric[]) AS v(package_id, score)
- JOIN packages p ON p.id = v.package_id
- WHERE pu.purl = p.purl`,
+ WHERE p.id = v.package_id`,
Comment on lines +11 to +12
const result = await qx.selectOne(`SELECT * FROM rank_packages()`)

Comment on lines 52 to 54
  const [result] = await qx.select(
- `SELECT * FROM rank_packages_universe($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`,
+ `SELECT * FROM rank_packages($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`,
  { wDownloads, wDepPkgs, wTransitive, topN },
Comment on lines 223 to +224
| `packages` | Upsert on `purl`. Each worker only writes columns it owns; ecosystem isolation means column-level conflicts cannot occur in practice. |
| `packages_universe` | Incremental upsert keyed on `purl`. The deps.dev import only touches rows whose underlying deps.dev snapshot date has advanced since the previous import (initial run is a one-time full backfill). |
| `packages` | Incremental upsert keyed on `purl`. The deps.dev import only touches rows whose underlying deps.dev snapshot date has advanced since the previous import (initial run is a one-time full backfill). `rank_packages()` scores and flags in place — no separate ranking workspace. |
Comment on lines +541 to 544
Downloads is the strongest criticality signal. `packages` is the single source of truth — `packages_universe` has been retired. `packages.downloads_last_30d` is the single column used by `rank_packages()`.

Tier 2 (`packages`) downloads are stored in `downloads_daily` — one row per `(package_id, date)`, consumers sum over any window they need. Tier 3 (`packages_universe`) downloads are stored in `downloads_last_30d` — one row per `(purl, end_date)` capturing a rolling 30-day window — with the latest window's count also cached on `packages_universe.downloads_last_30d bigint` for direct use by `rank_packages_universe()` without a join.
Per-day history is stored in `downloads_daily` — one row per `(package_id, date)`, consumers sum over any window they need. The latest 30-day window count is written directly to `packages.downloads_last_30d` for `is_critical = TRUE` packages.
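
Since `downloads_daily` keeps one row per `(package_id, date)`, a consumer can derive any window by summing rows. A minimal sketch, assuming the rows are already loaded in memory with ISO `date` strings; the `DailyDownloadRow` shape and the `sumDownloads` name are illustrative only.

```typescript
interface DailyDownloadRow {
  packageId: number
  date: string // 'YYYY-MM-DD'
  count: number
}

// Sum downloads for one package over the inclusive window [from, to].
// ISO dates in the same format compare correctly as plain strings.
function sumDownloads(rows: DailyDownloadRow[], packageId: number, from: string, to: string): number {
  return rows
    .filter((row) => row.packageId === packageId && row.date >= from && row.date <= to)
    .reduce((total, row) => total + row.count, 0)
}
```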

Comment on lines +126 to +130
Weights sum to 1.0 → impact ∈ `[0, 1]`. `dependent_count` is direct dependent packages only; `transitive_dependent_count` is indirect dependents only. All weights are call-time numeric parameters to `rank_packages()` — tunable without schema or code changes.
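
To make the `[0, 1]` bound concrete: if every signal is first normalized to `[0, 1]` and the weights sum to 1.0, the weighted sum cannot leave that range. The sketch below assumes exactly that. The normalization step and the `ImpactSignals` shape are placeholders, since the SQL of `rank_packages()` is not shown in this review; only the weight names are borrowed from the `run:impact` call.

```typescript
interface ImpactSignals {
  // Each signal is assumed to be pre-normalized to [0, 1].
  downloads: number
  dependentPkgs: number
  transitiveDependents: number
}

interface ImpactWeights {
  wDownloads: number
  wDepPkgs: number
  wTransitive: number
}

// Weighted sum of normalized signals; rejects weights that do not sum to 1.0.
function impactScore(signals: ImpactSignals, weights: ImpactWeights): number {
  const total = weights.wDownloads + weights.wDepPkgs + weights.wTransitive
  if (Math.abs(total - 1) > 1e-9) {
    throw new RangeError(`weights must sum to 1.0, got ${total}`)
  }
  return (
    weights.wDownloads * signals.downloads +
    weights.wDepPkgs * signals.dependentPkgs +
    weights.wTransitive * signals.transitiveDependents
  )
}
```

Passing the weights as arguments rather than hard-coding them mirrors the call-time tuning described above: a different weighting is a different call, not a schema or code change.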

`centrality_score` (PageRank) is computed and stored on `packages_universe` by the criticality worker and will be added to the formula if needed.
`centrality_score` (PageRank) is computed and stored on `packages` by the criticality worker and will be added to the formula if needed.

**Current weights** (defaults in `rank_packages_universe()`, iterate once the ranked list is observable):
**Current weights** (defaults in `rank_packages()`, iterate once the ranked list is observable):
FROM packages p
LEFT JOIN npm_package_universe_state s ON s.purl = p.purl
WHERE p.ecosystem = 'npm'
AND p.is_critical = TRUE

Contributor

Can you remove this flag for this query specifically? Everything that previously ran against packages_universe should still run against all packages in packages. downloads_last_30d should be run against the entire universe, as we need it for the impact/critical score.

Contributor Author

Good catch! fixed ✅

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 6 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit ff92e3b.


  const [result] = await qx.select(
- `SELECT * FROM rank_packages_universe($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`,
+ `SELECT * FROM rank_packages($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`,

Missing rank_packages database function

High Severity

The worker now invokes rank_packages(), but the oss-packages migrations only define rank_packages_universe() on packages_universe. Running pnpm run:impact or the rankPackages activity fails at query time because the renamed function is not present in the schema.

Additional Locations (1)
Reviewed by Cursor Bugbot for commit ff92e3b.

  if (mirrorToUniverse) {
  const rowCount = await qx.result(
- `UPDATE packages_universe
+ `UPDATE packages

Updates nonexistent packages column

High Severity

Latest-window download mirroring now runs UPDATE packages SET downloads_last_30d, yet downloads_last_30d exists on packages_universe, not on packages, in the current migrations. The npm last-30d breadth lane errors when it tries to denormalize the latest count.

Reviewed by Cursor Bugbot for commit ff92e3b.

FROM packages p
JOIN npm_package_universe_state s ON s.purl = p.purl
WHERE p.ecosystem = 'npm'
AND p.is_critical = TRUE

History backfill limits critical packages

Medium Severity

Last-30d history selection adds p.is_critical = TRUE, so only critical npm packages get older rolling windows. Prior universe-wide behavior and the PR review ask for full-universe downloads_last_30d coverage needed for impact scoring.

Reviewed by Cursor Bugbot for commit ff92e3b.
