feat(build_csr): export a queryable edge-list Parquet (by_src)#5

Open. ak2k wants to merge 1 commit into Mearman:main from ak2k:csr-edge-list-by-src

ak2k commented Jun 20, 2026

Implements the by_src half of #4.

The CSR .npz is built for in-RAM linear algebra — scipy.sparse.load_npz pulls the whole matrix into memory and is Python/scipy-specific. For querying the graph in DuckDB/Arrow (the rest of the dataset's format) there was no out-of-core option. This adds one.

What

build_edge_list / --edge-list exports each relationship as a deduplicated edge list sorted by (src, tgt), in the original OpenAlex IDs:

csr/work_referenced_works__by_src.parquet   # (src, tgt)

  • Original IDs → joins the main and relationship tables directly, no dense-index / id-map hop.
  • Sorted with bounded row groups (1M rows) → DuckDB zonemaps prune WHERE src = X ("what X cites") to a few row groups.
  • zstd-3, idempotent (provenance fingerprint of the input shards), atomic (temp file + os.replace), and sets temp_directory so the DISTINCT/ORDER BY can spill under a tight --memory-limit — all matching build_csr.

python3 -m sync.build_csr --all --edge-list
duckdb -c "SELECT tgt FROM 'csr/work_referenced_works__by_src.parquet' WHERE src = 2741809807"
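
The idempotent and atomic behaviour listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the helper names (`fingerprint_shards`, `write_if_stale`), the `.fingerprint` sidecar file, and the choice to hash shard names, sizes and mtimes are all assumptions made for the example. build_csr may record provenance differently.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def fingerprint_shards(shards):
    """Hash each input shard's name, size and mtime (hypothetical provenance scheme)."""
    h = hashlib.sha256()
    for p in sorted(Path(s) for s in shards):
        st = p.stat()
        h.update(f"{p.name}:{st.st_size}:{st.st_mtime_ns}\n".encode())
    return h.hexdigest()


def write_if_stale(out_path, shards, produce_bytes):
    """Write out_path atomically unless its recorded fingerprint already matches.

    Returns True if the artifact was (re)written, False if it was skipped.
    """
    out_path = Path(out_path)
    sidecar = out_path.with_name(out_path.name + ".fingerprint")  # hypothetical sidecar
    fp = fingerprint_shards(shards)
    if out_path.exists() and sidecar.exists() and sidecar.read_text() == fp:
        return False  # inputs unchanged: idempotent skip

    # Write to a temp file in the same directory so os.replace stays on one
    # filesystem and readers never observe a half-written artifact.
    fd, tmp = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(produce_bytes())
        os.replace(tmp, out_path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    sidecar.write_text(fp)
    return True
```

In this sketch the fingerprint is written only after the replace succeeds, so an interrupted run leaves either the old artifact or a new artifact without a matching fingerprint, and the next run simply rebuilds.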

Measured (60M-edge / 3M-node sample)

  • artifact: 288 MB (zstd-3, original-ID (src, tgt))
  • out-neighbours of a node, straight off the Parquet with no full load: ~2 ms
  • the file is verified globally sorted by (src, tgt); row-group min/max stats are tight, which is what makes the pruning work
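
Why a global sort plus bounded row groups makes WHERE src = X cheap can be illustrated without DuckDB. The sketch below is a pure-Python model, not DuckDB's implementation: it splits a sorted edge list into fixed-size groups, records each group's min/max src (the kind of statistic a zonemap consults), and counts how many groups a lookup has to read. The function names, the synthetic graph and the tiny group size are assumptions for illustration.

```python
def build_row_groups(edges, group_size):
    """Split (src, tgt)-sorted edges into groups carrying min/max src stats."""
    groups = []
    for i in range(0, len(edges), group_size):
        rows = edges[i:i + group_size]
        # Because the input is sorted, the first and last rows hold the bounds.
        groups.append({"min_src": rows[0][0], "max_src": rows[-1][0], "rows": rows})
    return groups


def out_neighbours(groups, src):
    """Return (targets, groups_scanned), skipping groups whose stats exclude src."""
    targets, scanned = [], 0
    for g in groups:
        if g["min_src"] <= src <= g["max_src"]:
            scanned += 1
            targets.extend(t for s, t in g["rows"] if s == src)
    return targets, scanned


# Synthetic graph: 1,000 nodes with 10 distinct out-edges each, sorted by (src, tgt).
edges = sorted((s, (s * 7 + k) % 1000) for s in range(1000) for k in range(10))
groups = build_row_groups(edges, group_size=500)
targets, scanned = out_neighbours(groups, src=123)
print(len(groups), scanned, len(targets))  # prints: 20 1 10
```

With the same rows in arbitrary order, most groups' min/max ranges would contain the requested src and almost nothing could be skipped, which is why the tight statistics matter.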

Scope

Tests cover dedup, null-dropping, the sort order, cross-shard dedup, original-ID preservation, DuckDB queryability, and the idempotent skip.
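
The properties these tests check can be stated as a small reference model. This is a sketch of the expected semantics only, under two assumptions: each shard is a list of (src, tgt) pairs with None standing in for nulls, and a row is dropped when either endpoint is null. The function name is hypothetical, and the actual export is described as a DISTINCT/ORDER BY that can spill to temp_directory rather than an in-memory Python set.

```python
def reference_edge_list(shards):
    """Merge shards, drop rows with a null endpoint, dedupe, sort by (src, tgt)."""
    unique = {
        (src, tgt)
        for shard in shards
        for src, tgt in shard
        if src is not None and tgt is not None
    }
    return sorted(unique)


# (1, 2) appears twice in shard_a and again in shard_b (cross-shard duplicate).
shard_a = [(3, 1), (1, 2), (1, 2), (None, 5)]
shard_b = [(1, 2), (2, None), (1, 1), (2741809807, 4)]
print(reference_edge_list([shard_a, shard_b]))
# prints: [(1, 1), (1, 2), (3, 1), (2741809807, 4)]
```

A model like this is only practical for small fixtures, but it gives a test an independent answer to compare against the rows read back from the Parquet file.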

The CSR .npz is built for in-RAM linear algebra: scipy.sparse.load_npz
pulls the whole matrix into memory and is Python-specific. For querying the
graph in DuckDB/Arrow — the rest of the dataset's format — there was no
out-of-core option.

Add build_edge_list / --edge-list, which exports each relationship as a
deduplicated edge list sorted by (src, tgt), in the original OpenAlex IDs
(so it joins the main and relationship tables directly, no dense-index/id-map
hop). Bounded row groups + the sort let DuckDB zonemaps prune WHERE src = X
('what X cites') to a few row groups.

On a 60M-edge sample: 288 MB (zstd-3), and an out-neighbours lookup is ~2 ms
straight off the Parquet with no full load. Idempotent (provenance
fingerprint) and atomic (temp + os.replace), matching build_csr; temp_directory
is set so the DISTINCT/ORDER BY can spill under a tight memory_limit.

The reverse direction (who-cites-X, WHERE tgt = X) wants a tgt-sorted copy;
that is a separate change. Refs Mearman#4.