feat(build_csr): export a queryable edge-list Parquet (by_src) #5
Open
ak2k wants to merge 1 commit into
The CSR .npz is built for in-RAM linear algebra: scipy.sparse.load_npz
pulls the whole matrix into memory and is Python-specific. For querying the
graph in DuckDB/Arrow — the rest of the dataset's format — there was no
out-of-core option.
Add build_edge_list / --edge-list, which exports each relationship as a
deduplicated edge list sorted by (src, tgt), in the original OpenAlex IDs
(so it joins the main and relationship tables directly, no dense-index/id-map
hop). Bounded row groups + the sort let DuckDB zonemaps prune WHERE src = X
('what X cites') to a few row groups.
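
The pruning described above can be illustrated with a small pure-Python model. This is only a sketch of the idea, not DuckDB's or Parquet's implementation: `RowGroup`, `build_row_groups`, and `out_neighbours` are hypothetical names, and a "row group" here is just a slice of the sorted list with min/max statistics on `src`.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A bounded slice of the (src, tgt)-sorted edge list."""
    rows: list  # (src, tgt) tuples, already sorted

    @property
    def min_src(self):
        return self.rows[0][0]

    @property
    def max_src(self):
        return self.rows[-1][0]


def build_row_groups(edges, row_group_size):
    # Deduplicate, sort by (src, tgt), then cut into bounded groups.
    ordered = sorted(set(edges))
    return [
        RowGroup(ordered[i:i + row_group_size])
        for i in range(0, len(ordered), row_group_size)
    ]


def out_neighbours(groups, src):
    """Return (targets, number of row groups actually scanned)."""
    targets, scanned = [], 0
    for group in groups:
        # Zonemap check: skip any group whose src range cannot contain src.
        if group.min_src <= src <= group.max_src:
            scanned += 1
            targets.extend(t for s, t in group.rows if s == src)
    return targets, scanned


# Synthetic graph: every node s points at s+1, s+2, s+3.
edges = [(s, t) for s in range(1000) for t in (s + 1, s + 2, s + 3)]
groups = build_row_groups(edges, row_group_size=100)
targets, scanned = out_neighbours(groups, 417)
print(targets, f"scanned {scanned} of {len(groups)} row groups")
```

Because the data is sorted by `src`, each group covers a narrow, non-overlapping `src` range, so a lookup touches one group (or two when a node's edges straddle a group boundary).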
On a 60M-edge sample: 288 MB (zstd-3), and an out-neighbours lookup is ~2 ms
straight off the Parquet with no full load. Idempotent (provenance
fingerprint) and atomic (temp + os.replace), matching build_csr; temp_directory
is set so the DISTINCT/ORDER BY can spill under a tight memory_limit.
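
A minimal sketch of the idempotent-plus-atomic pattern named above, using only the standard library. How this PR actually stores its provenance fingerprint is not shown here, so the sidecar file, `fingerprint`, and `write_if_stale` are assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def fingerprint(inputs: dict) -> str:
    # Stable hash of whatever describes the inputs (hypothetical schema).
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def write_if_stale(out_path: Path, inputs: dict, produce) -> bool:
    """Rebuild out_path only if the fingerprint changed. Returns True if built."""
    fp = fingerprint(inputs)
    # Assumption: the fingerprint lives in a sidecar file next to the output.
    sidecar = out_path.with_name(out_path.name + ".fingerprint")
    if out_path.exists() and sidecar.exists() and sidecar.read_text() == fp:
        return False  # idempotent skip

    # Write to a temp file in the same directory, then swap it into place,
    # so readers never observe a half-written output.
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(produce())
        os.replace(tmp_name, out_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    sidecar.write_text(fp)
    return True


with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "edges.bin"
    inputs = {"relationship": "work_referenced_works", "shards": ["a", "b"]}
    first = write_if_stale(out, inputs, lambda: b"payload")
    second = write_if_stale(out, inputs, lambda: b"payload")
    third = write_if_stale(out, {**inputs, "shards": ["a", "b", "c"]}, lambda: b"new")
    final = out.read_bytes()
```

The temp file is created in the destination directory so that `os.replace` stays on one filesystem, which is what makes the swap atomic.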
The reverse direction (who-cites-X, WHERE tgt = X) wants a tgt-sorted copy;
that is a separate change. Refs Mearman#4.
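
To see why the reverse lookup wants its own copy, the sketch below counts how many row groups a min/max check would have to scan for `tgt = 500` under each sort order. It uses synthetic edges with deliberately scattered targets; `candidate_groups` is a hypothetical helper, and real citation data may show more locality than this.

```python
SRC, TGT = 0, 1


def candidate_groups(edges, column, value, row_group_size):
    """Count row groups whose min/max on `column` could contain `value`."""
    hits = 0
    for i in range(0, len(edges), row_group_size):
        values = [edge[column] for edge in edges[i:i + row_group_size]]
        if min(values) <= value <= max(values):
            hits += 1
    return hits


# Synthetic graph: three scattered targets per source node.
edges = sorted({(s, (s * 37 + k * 101) % 1000) for s in range(1000) for k in range(3)})

by_src = edges                                          # sorted by (src, tgt)
by_tgt = sorted(edges, key=lambda e: (e[TGT], e[SRC]))  # sorted by (tgt, src)

total_groups = -(-len(edges) // 100)  # ceiling division
src_file_hits = candidate_groups(by_src, TGT, 500, 100)
tgt_file_hits = candidate_groups(by_tgt, TGT, 500, 100)
print(f"by_src file: {src_file_hits}/{total_groups} groups; "
      f"by_tgt file: {tgt_file_hits}/{total_groups} groups")
```

In the `src`-sorted layout every group's `tgt` range spans most of the ID space, so the statistics prune nothing; the `tgt`-sorted layout restores the tight ranges.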
Implements the `by_src` half of #4.

The CSR `.npz` is built for in-RAM linear algebra: `scipy.sparse.load_npz` pulls the whole matrix into memory and is Python/scipy-specific. For querying the graph in DuckDB/Arrow (the rest of the dataset's format) there was no out-of-core option. This adds one.

What

`build_edge_list` / `--edge-list` exports each relationship as a deduplicated edge list sorted by `(src, tgt)`, in the original OpenAlex IDs:

- Joins the `main` and relationship tables directly, no dense-index / id-map hop.
- Bounded row groups plus the sort let DuckDB zonemaps prune `WHERE src = X` ("what X cites") to a few row groups.
- Idempotent (provenance fingerprint), atomic (temp + `os.replace`), and sets `temp_directory` so the `DISTINCT`/`ORDER BY` can spill under a tight `--memory-limit`, all matching `build_csr`.

    python3 -m sync.build_csr --all --edge-list
    duckdb -c "SELECT tgt FROM 'csr/work_referenced_works__by_src.parquet' WHERE src = 2741809807"

Measured (60M-edge / 3M-node sample)

- 288 MB on disk (zstd-3).
- Out-neighbours lookup: ~2 ms straight off the Parquet, with no full load.
- Sorted by `(src, tgt)`; row-group min/max stats are tight, which is what makes the pruning work.

Scope

- The `.npz` path and its consumers are untouched; `--edge-list` is opt-in and builds the edge lists instead of CSR in that invocation, so it never forces a CSR rebuild.
- … `_build_csr_duckdb`.
- The reverse direction (who-cites-X, `WHERE tgt = X`) wants a `tgt`-sorted copy; that's a small follow-up PR that completes #4 ("Add a DuckDB-queryable edge-list Parquet for the citation graph") and carries its own storage trade-off, per the issue.

Tests cover dedup, null-dropping, the sort order, cross-shard dedup, original-ID preservation, DuckDB queryability, and the idempotent skip.
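
The properties those tests check can be stated compactly with an in-memory stand-in. This is not the PR's code (which does the work with DuckDB's `DISTINCT` / `ORDER BY`); `normalise_edges` is a hypothetical helper, and "null-dropping" is assumed here to mean dropping any row with a missing endpoint.

```python
from itertools import chain


def normalise_edges(shards):
    """Merge shards into a deduplicated edge list sorted by (src, tgt).

    Rows with a missing endpoint are dropped; IDs are passed through
    unchanged (no remapping to a dense index).
    """
    merged = chain.from_iterable(shards)
    kept = {(s, t) for s, t in merged if s is not None and t is not None}
    return sorted(kept)


shard_a = [(2741809807, 7), (5, 9), (None, 3)]
shard_b = [(5, 9), (5, 2), (8, None)]  # (5, 9) repeats across shards

result = normalise_edges([shard_a, shard_b])
print(result)
```

The cross-shard duplicate `(5, 9)` collapses to one row, both null rows disappear, and the large ID survives as-is.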