feat(build_csr): export the reverse-direction edge list (by_tgt) #6

Open
ak2k wants to merge 2 commits into Mearman:main from ak2k:csr-edge-list-by-tgt

ak2k commented Jun 20, 2026

Builds on #5, and completes #4. Until #5 merges, the diff here shows both commits; the net change in this PR is the second one — reviewable in isolation: ak2k/OpenAlex@csr-edge-list-by-src...csr-edge-list-by-tgt

Why

A by_src edge list (sorted by src) prunes "what X cites" (WHERE src = X) but not "who cites X" (WHERE tgt = X) — a tgt predicate has to scan every row group of a src-sorted file. "Who cites X" is the more common citation query (forward citations, supersession, discovery), so this adds a tgt-sorted copy.
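To make the pruning argument concrete, here is a minimal pure-Python sketch of min/max row-group statistics. It is not DuckDB or Parquet code: the toy graph, the group size, and the helper names (`row_group_stats`, `groups_to_scan`) are all assumptions for illustration. It counts how many row groups a point predicate would still have to read under each sort order.

```python
import random

def row_group_stats(rows, column, group_size):
    """Return (min, max) of `column` for each consecutive row group."""
    stats = []
    for start in range(0, len(rows), group_size):
        values = [row[column] for row in rows[start:start + group_size]]
        stats.append((min(values), max(values)))
    return stats

def groups_to_scan(stats, key):
    """Count row groups whose [min, max] range could contain `key`."""
    return sum(1 for lo, hi in stats if lo <= key <= hi)

# Toy graph (made-up data): distinct random edges over 1,000 node IDs.
rng = random.Random(0)
edges = sorted({(rng.randrange(1000), rng.randrange(1000)) for _ in range(10_000)})
SRC, TGT = 0, 1
GROUP = 500  # rows per simulated row group (arbitrary)

by_src = edges                                           # sorted by (src, tgt)
by_tgt = sorted(edges, key=lambda e: (e[TGT], e[SRC]))   # sorted by (tgt, src)

x = 123
src_hits_src_sorted = groups_to_scan(row_group_stats(by_src, SRC, GROUP), x)
tgt_hits_src_sorted = groups_to_scan(row_group_stats(by_src, TGT, GROUP), x)
tgt_hits_tgt_sorted = groups_to_scan(row_group_stats(by_tgt, TGT, GROUP), x)
print(src_hits_src_sorted, tgt_hits_src_sorted, tgt_hits_tgt_sorted)
```

In the src-sorted copy, the tgt column of every group spans nearly the whole ID range, so a tgt predicate cannot skip any group; sorting by tgt narrows each group's range again.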

What

Generalizes build_edge_list with a direction ('src' | 'tgt'):

csr/work_referenced_works__by_src.parquet   # sorted (src, tgt) → WHERE src = X
csr/work_referenced_works__by_tgt.parquet   # sorted (tgt, src) → WHERE tgt = X

--edge-list now emits both; build_all_edge_lists(directions=("src",)) keeps only the forward copy.
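As a rough illustration of the naming and sort-key convention above, the sketch below maps a direction to its output path and ORDER BY columns. `edge_list_plan` and `plan_all` are hypothetical helpers, not the PR's actual functions; the real `build_edge_list` signature may differ.

```python
from pathlib import Path

# Sort columns per direction, as described in the PR text.
SORT_COLUMNS = {"src": ("src", "tgt"), "tgt": ("tgt", "src")}

def edge_list_plan(relationship, direction, out_dir="csr"):
    """Return (output path, ORDER BY clause) for one direction (hypothetical helper)."""
    if direction not in SORT_COLUMNS:
        raise ValueError(f"direction must be 'src' or 'tgt', got {direction!r}")
    path = Path(out_dir) / f"{relationship}__by_{direction}.parquet"
    order_by = ", ".join(SORT_COLUMNS[direction])
    return path, order_by

def plan_all(relationship, directions=("src", "tgt")):
    """Plan one output file per requested direction."""
    return [edge_list_plan(relationship, d) for d in directions]

for path, order_by in plan_all("work_referenced_works"):
    print(path.as_posix(), "ORDER BY", order_by)
```

Passing `directions=("src",)` to this sketch yields only the forward plan, mirroring the opt-out described above.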

Measured (60M-edge sample)

Same edge set in two sort orders, so lookups in both directions stay in the millisecond range: a who-cites-X lookup takes ~2 ms straight off by_tgt (the file is verified to be globally sorted by (tgt, src), with tight row-group stats).

The trade-off (the open question from #4)

The reverse copy roughly doubles the graph's storage (~14 GB → ~28 GB at full work_referenced_works scale) and requires a second dedup+sort pass. If you'd rather ship only the forward direction, directions=("src",) opts out, and I'm happy to make that the default if you prefer. Closing #4 with both implemented so the choice is a one-liner either way.

Tests cover the by_tgt sort order, WHERE tgt = X queryability, that both directions hold the same edge set, and that --edge-list emits both.
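The invariants those tests check can be sketched on in-memory tuples. This is an illustration only, assuming edges are (src, tgt) pairs with made-up IDs; the PR's real tests read the Parquet files, and `check_edge_lists` and `who_cites` are hypothetical names.

```python
def is_sorted_by(rows, key):
    """True if rows are in non-decreasing order under `key`."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))

def check_edge_lists(by_src, by_tgt):
    """Raise AssertionError if the two copies break an invariant."""
    assert is_sorted_by(by_src, key=lambda e: (e[0], e[1])), "by_src not sorted by (src, tgt)"
    assert is_sorted_by(by_tgt, key=lambda e: (e[1], e[0])), "by_tgt not sorted by (tgt, src)"
    assert len(set(by_tgt)) == len(by_tgt), "by_tgt has duplicate edges"
    assert sorted(by_src) == sorted(by_tgt), "directions hold different edge sets"

def who_cites(by_tgt, x):
    """Equivalent of WHERE tgt = x on the tgt-sorted copy."""
    return [src for src, tgt in by_tgt if tgt == x]

# Toy edges with one duplicate, standing in for a relationship table.
raw = [("W3", "W1"), ("W2", "W1"), ("W3", "W2"), ("W2", "W1")]
by_src = sorted(set(raw))
by_tgt = sorted(set(raw), key=lambda e: (e[1], e[0]))

check_edge_lists(by_src, by_tgt)
print(who_cites(by_tgt, "W1"))
```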

ak2k added 2 commits June 20, 2026 16:01
The CSR .npz is built for in-RAM linear algebra: scipy.sparse.load_npz
pulls the whole matrix into memory and is Python-specific. For querying the
graph in DuckDB/Arrow — the rest of the dataset's format — there was no
out-of-core option.

Add build_edge_list / --edge-list, which exports each relationship as a
deduplicated edge list sorted by (src, tgt), in the original OpenAlex IDs
(so it joins the main and relationship tables directly, no dense-index/id-map
hop). Bounded row groups + the sort let DuckDB zonemaps prune WHERE src = X
('what X cites') to a few row groups.

On a 60M-edge sample: 288 MB (zstd-3), and an out-neighbours lookup is ~2 ms
straight off the Parquet with no full load. Idempotent (provenance
fingerprint) and atomic (temp + os.replace), matching build_csr; temp_directory
is set so the DISTINCT/ORDER BY can spill under a tight memory_limit.
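A minimal sketch of the idempotent, atomic write pattern described above, using only the standard library. Storing the fingerprint in a sidecar file is an assumption (the commit message does not say where the provenance fingerprint lives), and `fingerprint` and `build_once` are hypothetical names, not the PR's code.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

def fingerprint(inputs: dict) -> str:
    """Stable hash of the build inputs (hypothetical provenance record)."""
    payload = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def build_once(out_path, inputs, write_payload):
    """Write out_path atomically; skip if the stored fingerprint matches."""
    out_path = Path(out_path)
    stamp = out_path.with_name(out_path.name + ".fingerprint")  # assumed sidecar
    fp = fingerprint(inputs)
    if out_path.exists() and stamp.exists() and stamp.read_text() == fp:
        return False  # already built from the same inputs
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file sits in the target directory so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    os.close(fd)
    try:
        write_payload(Path(tmp))
        os.replace(tmp, out_path)  # atomic rename over any previous file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    stamp.write_text(fp)
    return True
```

A reader never sees a half-written output: the file at `out_path` is either the previous complete version or the new one.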

The reverse direction (who-cites-X, WHERE tgt = X) wants a tgt-sorted copy;
that is a separate change. Refs Mearman#4.

A by_src edge list (sorted by src) prunes 'what X cites' (WHERE src = X) but
not 'who cites X' (WHERE tgt = X) — a tgt predicate scans every row group of
a src-sorted file. who-cites-X is the more common citation query (forward
citations, supersession, discovery), so add a tgt-sorted copy.

Generalize build_edge_list with a direction ('src'|'tgt'): by_src sorts
(src, tgt), by_tgt sorts (tgt, src), each emitted as <rel>__by_<dir>.parquet.
build_all_edge_lists / --edge-list now emit both; pass directions=('src',)
to keep only the forward copy.

Same edge set, just two sort orders, so both directions stay O(ms): on a
60M-edge sample, a who-cites-X lookup is ~2 ms straight off by_tgt. The cost
is storage — a second sorted copy roughly doubles the graph's footprint
(~14 GB -> ~28 GB at full work_referenced_works scale), and it is a separate
dedup+sort pass. directions=('src',) opts out if that trade-off isn't wanted.

Closes Mearman#4.