feat(build_csr): export the reverse-direction edge list (by_tgt) #6
Open
ak2k wants to merge 2 commits into
The CSR .npz is built for in-RAM linear algebra: scipy.sparse.load_npz
pulls the whole matrix into memory and is Python-specific. For querying the
graph in DuckDB/Arrow — the rest of the dataset's format — there was no
out-of-core option.
Add build_edge_list / --edge-list, which exports each relationship as a
deduplicated edge list sorted by (src, tgt), in the original OpenAlex IDs
(so it joins the main and relationship tables directly, no dense-index/id-map
hop). Bounded row groups + the sort let DuckDB zonemaps prune WHERE src = X
('what X cites') to a few row groups.
On a 60M-edge sample: 288 MB (zstd-3), and an out-neighbours lookup is ~2 ms
straight off the Parquet with no full load. Idempotent (provenance
fingerprint) and atomic (temp + os.replace), matching build_csr; temp_directory
is set so the DISTINCT/ORDER BY can spill under a tight memory_limit.
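The idempotent-and-atomic pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `fingerprint`, `build_if_stale`, the `.fingerprint` sidecar file, and the `write_payload` callback are all hypothetical names, and the real build may store its provenance differently.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def fingerprint(params: dict) -> str:
    """Stable hash of the build inputs (hypothetical provenance scheme)."""
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def build_if_stale(out_path: Path, params: dict, write_payload) -> bool:
    """Write out_path atomically unless an up-to-date copy already exists.

    Returns True if a build ran, False if it was skipped.
    """
    fp = fingerprint(params)
    # Assumption: the fingerprint lives in a sidecar file next to the output.
    sidecar = out_path.with_name(out_path.name + ".fingerprint")
    if out_path.exists() and sidecar.exists() and sidecar.read_text() == fp:
        return False  # idempotent: same inputs, nothing to do

    # Temp file in the same directory, so os.replace is a same-filesystem
    # rename and readers never observe a half-written output.
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            write_payload(fh)
        os.replace(tmp_name, out_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    sidecar.write_text(fp)
    return True
```

Calling `build_if_stale` twice with the same `params` runs the writer once; changing any parameter (for example the sort direction) triggers a rebuild.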
The reverse direction (who-cites-X, WHERE tgt = X) wants a tgt-sorted copy;
that is a separate change. Refs Mearman#4.
A by_src edge list (sorted by src) prunes 'what X cites' (WHERE src = X) but
not 'who cites X' (WHERE tgt = X) — a tgt predicate scans every row group of
a src-sorted file. who-cites-X is the more common citation query (forward
citations, supersession, discovery), so add a tgt-sorted copy.
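To make the pruning argument concrete, here is a toy pure-Python model of min/max row-group statistics. It does not use DuckDB or Parquet; the graph is synthetic, and `row_groups` and `groups_to_scan` are hypothetical helpers that only mimic how a zonemap-style check decides which row groups a predicate can skip.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (src, tgt)
SRC, TGT = 0, 1


def row_groups(edges: List[Edge], size: int) -> List[List[Edge]]:
    """Split already-sorted edges into fixed-size row groups."""
    return [edges[i:i + size] for i in range(0, len(edges), size)]


def groups_to_scan(groups: List[List[Edge]], col: int, value: int) -> int:
    """Count row groups whose [min, max] range on `col` could contain `value`."""
    hits = 0
    for group in groups:
        vals = [edge[col] for edge in group]
        if min(vals) <= value <= max(vals):
            hits += 1
    return hits


# Synthetic graph: 1000 nodes, each with 10 outgoing edges (10,000 edges).
edges = sorted({(s, (s * 37 + k * 101) % 1000) for s in range(1000) for k in range(10)})

by_src = row_groups(sorted(edges), 100)                             # sorted (src, tgt)
by_tgt = row_groups(sorted(edges, key=lambda e: (e[1], e[0])), 100)  # sorted (tgt, src)

print(groups_to_scan(by_src, SRC, 500))  # 1: src predicate on the src-sorted copy
print(groups_to_scan(by_src, TGT, 500))  # 100: tgt predicate cannot skip any group
print(groups_to_scan(by_tgt, TGT, 500))  # 1: tgt predicate on the tgt-sorted copy
```

In this toy setup every row group of the src-sorted copy spans nearly the full range of tgt values, so its statistics rule nothing out; sorting by tgt restores the one-group lookup.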
Generalize build_edge_list with a direction ('src'|'tgt'): by_src sorts
(src, tgt), by_tgt sorts (tgt, src), each emitted as <rel>__by_<dir>.parquet.
build_all_edge_lists / --edge-list now emit both; pass directions=('src',)
to keep only the forward copy.
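The direction-to-output mapping described above might look roughly like this. It is a sketch only: `edge_list_outputs` and `SORT_KEYS` are hypothetical names used for illustration, not the signature of build_edge_list, and only the file-name pattern and the two sort orders are taken from the description.

```python
from typing import Iterable, List, Tuple

# Sort key per direction, as described: by_src -> (src, tgt), by_tgt -> (tgt, src).
SORT_KEYS = {"src": ("src", "tgt"), "tgt": ("tgt", "src")}


def edge_list_outputs(
    rel: str, directions: Iterable[str] = ("src", "tgt")
) -> List[Tuple[str, str]]:
    """Return (file name, ORDER BY clause) for each requested direction."""
    plans = []
    for direction in directions:
        if direction not in SORT_KEYS:
            raise ValueError(f"direction must be 'src' or 'tgt', got {direction!r}")
        first, second = SORT_KEYS[direction]
        plans.append((f"{rel}__by_{direction}.parquet", f"ORDER BY {first}, {second}"))
    return plans
```

With the default both copies are planned; `edge_list_outputs("work_referenced_works", directions=("src",))` yields only the forward one.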
Same edge set, just two sort orders, so both directions stay O(ms): on a
60M-edge sample, a who-cites-X lookup is ~2 ms straight off by_tgt. The cost
is storage — a second sorted copy roughly doubles the graph's footprint
(~14 GB -> ~28 GB at full work_referenced_works scale), and it is a separate
dedup+sort pass. directions=('src',) opts out if that trade-off isn't wanted.
Closes Mearman#4.
Builds on #5, and completes #4. Until #5 merges, the diff here shows both commits; the net change in this PR is the second one — reviewable in isolation: ak2k/OpenAlex@csr-edge-list-by-src...csr-edge-list-by-tgt
Why
A by_src edge list (sorted by src) prunes "what X cites" (WHERE src = X) but not "who cites X" (WHERE tgt = X): a tgt predicate has to scan every row group of a src-sorted file. "Who cites X" is the more common citation query (forward citations, supersession, discovery), so this adds a tgt-sorted copy.
What
Generalizes build_edge_list with a direction ('src'|'tgt'): by_src sorts (src, tgt), by_tgt sorts (tgt, src), each emitted as <rel>__by_<dir>.parquet. --edge-list now emits both; build_all_edge_lists(directions=("src",)) keeps only the forward copy.
Measured (60M-edge sample)
Same edge set, two sort orders, so both directions stay O(ms): a who-cites-X lookup is ~2 ms straight off by_tgt (verified globally sorted by (tgt, src), tight row-group stats).
The trade-off (the open question from #4)
The reverse copy roughly doubles the graph's storage (~14 GB → ~28 GB at full work_referenced_works scale) and is a second dedup+sort pass. If you'd rather ship only the forward direction, directions=("src",) opts out; happy to make that the default if you prefer. Closing #4 with both implemented so the choice is a one-liner either way.
Tests cover the by_tgt sort order, WHERE tgt = X queryability, that both directions hold the same edge set, and that --edge-list emits both.