perf(build_csr): push the dense id remap into DuckDB by ak2k · Pull Request #3 · Mearman/OpenAlex

ak2k · 2026-06-20T19:10:15Z

Builds on #2. This branch is #2 (preallocate the indices array) plus one further commit. Until #2 merges, the diff below shows both commits; the net change in this PR is just the second one — reviewable in isolation here: ak2k/OpenAlex@csr-preallocate-indices...csr-duckdb-remap . Easiest to merge #2 first, after which this narrows to the single commit automatically.

What

_build_csr_duckdb read the deduplicated original-ID edges back into Python and remapped each batch to dense indices with two np.searchsorted calls against the sorted id array. On work_referenced_works that array is multi-GB, so every lookup misses cache — the binary search, not the DuckDB dedup/sort, dominated the build.

This moves the remap into DuckDB: a dense id -> idx dimension table (row_number() OVER (ORDER BY id)), joined against the deduplicated edges twice, emitting already-dense, already-sorted (src_idx, tgt_idx) pairs. Python then streams that straight into the preallocated CSR arrays — no binary search, no per-batch index transients.

Result (3M-node / 60M-edge graph)

End-to-end build_csr CLI: 37.5 s → 10.0 s (3.7×); the remap step alone ~13×.
Byte-identical output: .npz and .id_map.npy bit-for-bit identical to the searchsorted path. idx is monotone in id, so ORDER BY (src_idx, tgt_idx) equals the previous ORDER BY (src, tgt) and the dense mapping is unchanged. The build_csr tests (independent reference + determinism) pass.

Query plan

EXPLAIN ANALYZE confirms the shape: projection + filter pushdown into the parquet scans, hash joins with the 3M-row dim table as the (small) build side, a single deterministic sort, and the same three parquet scans as before. No nested loops or cross products.

Trade-off

The join and its spill now live inside DuckDB (bounded by memory_limit) instead of materialising the searchsorted transients on the Python heap. Under a tight memory_limit DuckDB spills more to disk; in exchange the Python side drops both the binary search and the per-batch transient arrays.

Step 4 collected each batch's remapped tgt indices in a Python list and np.concatenate'd them after the loop. For work_referenced_works (~3B deduplicated edges) the joined int32 array is ~12 GB, and concatenate holds the per-batch list and the joined array at once — transiently doubling that 12 GB at the worst moment. n_edges is known exactly before the loop, so preallocate indices = np.empty(n_edges, int32) and fill it slice-by-slice via a running offset. An assert pins that every deduplicated edge is placed exactly once. Output is unchanged: edges stream in (src, tgt) order, so writing them in batch order yields the same array np.concatenate produced. Measured on a 3M-node / 60M-edge synthetic graph the concatenate transient is ~0.33 GB of peak RSS; isolating just the array assembly at 500M edges it is 4.09 GB -> 2.03 GB, i.e. the saving is ~4 bytes/edge and scales to ~12 GB at work_referenced_works. Adds tests/test_build_csr.py — the module had no coverage. It checks the CSR against an independently computed reference (null handling, duplicate collapse, dense remap) at both a small fixture and a 2000-node graph, cross-shard deduplication, the empty-relationship case, idempotent skip on unchanged inputs, and byte-identical output across runs.

Step 4 read deduplicated original-ID edges back into Python and remapped each batch to dense indices with two np.searchsorted calls against the sorted id array. On work_referenced_works that array is multi-GB, so every lookup misses cache — the binary search, not the DuckDB dedup/sort, dominated the build. Move the remap into DuckDB: build a dense id -> idx dimension table with row_number() OVER (ORDER BY id), join the deduplicated edges against it twice, and emit the already-dense, already-sorted (src_idx, tgt_idx) pairs. Python then streams that file straight into the preallocated CSR arrays — no binary search, no per-batch index transients. Output is byte-identical: idx is monotone in id, so ORDER BY (src_idx, tgt_idx) equals the previous ORDER BY (src, tgt) and the dense mapping is unchanged. Verified end-to-end on a 3M-node / 60M-edge graph — the .npz and .id_map.npy are bit-for-bit identical to the searchsorted path, and the build_csr tests (independent reference + determinism) pass. On that graph the full CLI build drops from 37.5s to 10.0s (3.7x); the remap step alone is ~13x faster. Also set DuckDB's temp_directory to the output dir so the dimension-table window, the joins, and the final sort have a spill target — an in-memory connection won't otherwise spill, which would risk OOM under a tight memory_limit now that this work runs inside DuckDB.

ak2k added 2 commits June 20, 2026 15:33

ak2k force-pushed the csr-duckdb-remap branch from f9e701b to 0d5107b Compare June 20, 2026 19:34

This was referenced Jun 20, 2026

Add a DuckDB-queryable edge-list Parquet for the citation graph #4

Open

feat(build_csr): export a queryable edge-list Parquet (by_src) #5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(build_csr): push the dense id remap into DuckDB#3

perf(build_csr): push the dense id remap into DuckDB#3
ak2k wants to merge 2 commits into
Mearman:mainfrom
ak2k:csr-duckdb-remap

ak2k commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ak2k commented Jun 20, 2026

What

Result (3M-node / 60M-edge graph)

Query plan

Trade-off

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant