feat(build_csr): export a queryable edge-list Parquet (by_src) by ak2k · Pull Request #5 · Mearman/OpenAlex

ak2k · 2026-06-20T20:01:45Z

Implements the by_src half of #4.

The CSR .npz is built for in-RAM linear algebra: scipy.sparse.load_npz pulls the whole matrix into memory and is Python/scipy-specific. For querying the graph with DuckDB/Arrow, like the rest of the dataset, there was no out-of-core option. This adds one.

What

build_edge_list / --edge-list exports each relationship as a deduplicated edge list sorted by (src, tgt), in the original OpenAlex IDs:

csr/work_referenced_works__by_src.parquet   # (src, tgt)

Original IDs → joins the main and relationship tables directly, no dense-index / id-map hop.
Sorted with bounded row groups (1M rows) → DuckDB zonemaps prune WHERE src = X ("what X cites") to a few row groups.
zstd-3, idempotent (provenance fingerprint of the input shards), atomic (temp file + os.replace), and sets temp_directory so the DISTINCT/ORDER BY can spill under a tight --memory-limit — all matching build_csr.
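The idempotent skip and atomic publish described above could look roughly like the sketch below. It is not the code in this PR: `shard_fingerprint`, `build_if_stale`, the `.fingerprint` sidecar file, and hashing name/size/mtime are all assumptions made for illustration. The PR only states that the fingerprint is derived from the input shards and that the write goes through a temp file plus `os.replace`.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def shard_fingerprint(shards):
    """Hash name, size and mtime of each input shard (assumed scheme)."""
    h = hashlib.sha256()
    for p in sorted(Path(s) for s in shards):
        st = p.stat()
        h.update(f"{p.name}:{st.st_size}:{st.st_mtime_ns}\n".encode())
    return h.hexdigest()


def build_if_stale(shards, out_path, write_fn):
    """Call write_fn(tmp_path) only if the inputs changed, then publish atomically.

    Returns True if the artifact was (re)built, False if the build was skipped.
    """
    out_path = Path(out_path)
    # Hypothetical sidecar holding the fingerprint of the last successful build.
    stamp = out_path.with_name(out_path.name + ".fingerprint")
    fp = shard_fingerprint(shards)
    if out_path.exists() and stamp.exists() and stamp.read_text() == fp:
        return False  # idempotent skip: same inputs, artifact already present

    # Temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp)
        os.replace(tmp, out_path)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    stamp.write_text(fp)
    return True
```

Here `write_fn` stands in for whatever actually produces the Parquet file; a crash inside it leaves any existing artifact untouched.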

python3 -m sync.build_csr --all --edge-list
duckdb -c "SELECT tgt FROM 'csr/work_referenced_works__by_src.parquet' WHERE src = 2741809807"

Measured (60M-edge / 3M-node sample)

artifact: 288 MB (zstd-3, original-ID (src, tgt))
out-neighbours of a node, straight off the Parquet with no full load: ~2 ms
the file is verified globally sorted by (src, tgt); row-group min/max stats are tight, which is what makes the pruning work
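To illustrate why tight row-group min/max stats matter, here is a toy pure-Python model of zonemap pruning. It does not use DuckDB or Parquet; the helper names, the group size of 100, and the synthetic data are assumptions chosen to keep the example small.

```python
def row_group_stats(src_column, group_size):
    """Split a column into fixed-size row groups and record (min, max) per group."""
    stats = []
    for start in range(0, len(src_column), group_size):
        chunk = src_column[start:start + group_size]
        stats.append((min(chunk), max(chunk)))
    return stats


def groups_to_scan(stats, x):
    """Indices of row groups whose [min, max] range could contain x."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= x <= hi]


# Toy data: 1,000 nodes with 10 out-edges each, in row groups of 100 rows.
sorted_src = [node for node in range(1000) for _ in range(10)]
# Unsorted layout: ten passes over all nodes, as if each pass came from a different shard.
unsorted_src = [node for _ in range(10) for node in range(1000)]

sorted_hits = groups_to_scan(row_group_stats(sorted_src, 100), 500)
unsorted_hits = groups_to_scan(row_group_stats(unsorted_src, 100), 500)
print(len(sorted_hits), "of 100 row groups scanned when sorted")      # 1
print(len(unsorted_hits), "of 100 row groups scanned when unsorted")  # 10
```

With the column sorted, node 500's rows sit in a single row group; in the interleaved layout the same lookup has to open every group whose range happens to cover 500.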

Scope

Additive — the .npz path and its consumers are untouched; --edge-list is opt-in and builds the edge lists instead of CSR in that invocation, so it never forces a CSR rebuild.
Independent of the perf PRs #1 (perf(extract): compress shards with zstd instead of snappy) through #3 (perf(build_csr): push the dense id remap into DuckDB): this is a new, self-contained function that doesn't touch _build_csr_duckdb.
The reverse direction (who-cites-X, WHERE tgt = X) wants a tgt-sorted copy; that's a small follow-up PR that completes #4 (Add a DuckDB-queryable edge-list Parquet for the citation graph) and carries its own storage trade-off, per the issue.

Tests cover dedup, null-dropping, the sort order, cross-shard dedup, original-ID preservation, DuckDB queryability, and the idempotent skip.
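The output properties listed above (null-dropping, dedup across shards, sort order, original IDs) can be stated as a small pure-Python reference model. This is only a sketch of the expected semantics, not the PR's test code; `reference_edge_list` and the sample shards are hypothetical.

```python
def reference_edge_list(shards):
    """Model of the by_src output: drop nulls, dedup across shards, sort by (src, tgt)."""
    edges = {
        (src, tgt)
        for shard in shards
        for src, tgt in shard
        if src is not None and tgt is not None
    }
    return sorted(edges)


# Two input shards with a duplicate inside one shard, a duplicate across
# shards, and rows with a null endpoint.
shards = [
    [(3, 1), (1, 2), (1, 2), (None, 5)],
    [(1, 2), (2, None), (1, 1)],
]
print(reference_edge_list(shards))  # [(1, 1), (1, 2), (3, 1)]
```

A model like this could serve as an oracle when comparing against the rows read back from the Parquet file on small inputs.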

The CSR .npz is built for in-RAM linear algebra: scipy.sparse.load_npz pulls the whole matrix into memory and is Python-specific. For querying the graph in DuckDB/Arrow — the rest of the dataset's format — there was no out-of-core option. Add build_edge_list / --edge-list, which exports each relationship as a deduplicated edge list sorted by (src, tgt), in the original OpenAlex IDs (so it joins the main and relationship tables directly, no dense-index/id-map hop). Bounded row groups + the sort let DuckDB zonemaps prune WHERE src = X ('what X cites') to a few row groups. On a 60M-edge sample: 288 MB (zstd-3), and an out-neighbours lookup is ~2 ms straight off the Parquet with no full load. Idempotent (provenance fingerprint) and atomic (temp + os.replace), matching build_csr; temp_directory is set so the DISTINCT/ORDER BY can spill under a tight memory_limit. The reverse direction (who-cites-X, WHERE tgt = X) wants a tgt-sorted copy; that is a separate change. Refs Mearman#4.

ak2k mentioned this pull request Jun 20, 2026

feat(build_csr): export the reverse-direction edge list (by_tgt) #6

Open

ak2k wants to merge 1 commit into Mearman:main from ak2k:csr-edge-list-by-src
