perf(build_csr): push the dense id remap into DuckDB #3
Open
ak2k wants to merge 2 commits into
Step 4 collected each batch's remapped tgt indices in a Python list and np.concatenate'd them after the loop. For work_referenced_works (~3B deduplicated edges) the joined int32 array is ~12 GB, and concatenate holds the per-batch list and the joined array at once, transiently doubling that 12 GB at the worst moment.

n_edges is known exactly before the loop, so preallocate indices = np.empty(n_edges, int32) and fill it slice-by-slice via a running offset. An assert pins that every deduplicated edge is placed exactly once.

Output is unchanged: edges stream in (src, tgt) order, so writing them in batch order yields the same array np.concatenate produced.

Measured on a 3M-node / 60M-edge synthetic graph, the concatenate transient is ~0.33 GB of peak RSS; isolating just the array assembly at 500M edges, it is 4.09 GB -> 2.03 GB, i.e. the saving is ~4 bytes/edge and scales to ~12 GB at work_referenced_works.

Adds tests/test_build_csr.py; the module had no coverage. It checks the CSR against an independently computed reference (null handling, duplicate collapse, dense remap) at both a small fixture and a 2000-node graph, cross-shard deduplication, the empty-relationship case, idempotent skip on unchanged inputs, and byte-identical output across runs.
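A minimal sketch of the preallocate-and-fill pattern described in this commit message, compared against the np.concatenate shape it replaces. The helper name fill_preallocated and the sample batches are hypothetical and only illustrate the idea; they are not taken from the module.

```python
import numpy as np


def fill_preallocated(batches, n_edges):
    """Write each batch into its slice of one preallocated int32 array."""
    indices = np.empty(n_edges, dtype=np.int32)
    offset = 0
    for batch in batches:
        n = len(batch)
        indices[offset:offset + n] = batch  # no per-batch list is retained
        offset += n
    # Every edge must have been placed exactly once.
    assert offset == n_edges, f"placed {offset} edges, expected {n_edges}"
    return indices


# Hypothetical batches standing in for the remapped tgt indices.
batches = [
    np.array([3, 1], dtype=np.int32),
    np.array([4], dtype=np.int32),
    np.array([0, 2, 5], dtype=np.int32),
]
n_edges = sum(len(b) for b in batches)

preallocated = fill_preallocated(batches, n_edges)
concatenated = np.concatenate(batches)  # the old approach, for comparison
```

Both arrays hold the same values in the same order. The difference is peak memory: the concatenate path keeps every batch alive until the join, while the preallocated path only ever holds the output array plus the current batch.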
Step 4 read deduplicated original-ID edges back into Python and remapped each batch to dense indices with two np.searchsorted calls against the sorted id array. On work_referenced_works that array is multi-GB, so every lookup misses cache: the binary search, not the DuckDB dedup/sort, dominated the build.

Move the remap into DuckDB: build a dense id -> idx dimension table with row_number() OVER (ORDER BY id), join the deduplicated edges against it twice, and emit the already-dense, already-sorted (src_idx, tgt_idx) pairs. Python then streams that file straight into the preallocated CSR arrays, with no binary search and no per-batch index transients.

Output is byte-identical: idx is monotone in id, so ORDER BY (src_idx, tgt_idx) equals the previous ORDER BY (src, tgt) and the dense mapping is unchanged. Verified end-to-end on a 3M-node / 60M-edge graph: the .npz and .id_map.npy are bit-for-bit identical to the searchsorted path, and the build_csr tests (independent reference + determinism) pass. On that graph the full CLI build drops from 37.5s to 10.0s (3.7x); the remap step alone is ~13x faster.

Also set DuckDB's temp_directory to the output dir so the dimension-table window, the joins, and the final sort have a spill target. An in-memory connection won't otherwise spill, which would risk OOM under a tight memory_limit now that this work runs inside DuckDB.
This was referenced Jun 20, 2026
Builds on #2. This branch is #2 (preallocate the indices array) plus one further commit. Until #2 merges, the diff below shows both commits; the net change in this PR is just the second one, reviewable in isolation here: ak2k/OpenAlex@csr-preallocate-indices...csr-duckdb-remap. Easiest to merge #2 first, after which this narrows to the single commit automatically.
What

_build_csr_duckdb read the deduplicated original-ID edges back into Python and remapped each batch to dense indices with two np.searchsorted calls against the sorted id array. On work_referenced_works that array is multi-GB, so every lookup misses cache: the binary search, not the DuckDB dedup/sort, dominated the build.

This moves the remap into DuckDB: a dense id -> idx dimension table (row_number() OVER (ORDER BY id)), joined against the deduplicated edges twice, emitting already-dense, already-sorted (src_idx, tgt_idx) pairs. Python then streams that straight into the preallocated CSR arrays, with no binary search and no per-batch index transients.

Result (3M-node / 60M-edge graph)

build_csr CLI: 37.5 s → 10.0 s (3.7×); the remap step alone ~13×.

.npz and .id_map.npy bit-for-bit identical to the searchsorted path. idx is monotone in id, so ORDER BY (src_idx, tgt_idx) equals the previous ORDER BY (src, tgt) and the dense mapping is unchanged. The build_csr tests (independent reference + determinism) pass.

Query plan

EXPLAIN ANALYZE confirms the shape: projection + filter pushdown into the parquet scans, hash joins with the 3M-row dim table as the (small) build side, a single deterministic sort, and the same three parquet scans as before. No nested loops or cross products.

Trade-off

The join and its spill now live inside DuckDB (bounded by memory_limit) instead of materialising the searchsorted transients on the Python heap. Under a tight memory_limit DuckDB spills more to disk; in exchange the Python side drops both the binary search and the per-batch transient arrays.
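The byte-identical claim in the Result section rests on idx being monotone in id. The small NumPy sketch below illustrates that argument: it remaps edges the way the old searchsorted path did, then checks that sorting by dense pairs gives the same permutation as sorting by original ids. The sample ids and edges are hypothetical, and the indptr/indices layout is the conventional CSR form, assumed here rather than taken from the module.

```python
import numpy as np

# Hypothetical sorted, unique original ids; position in this array is the dense idx.
ids = np.array([17, 42, 900, 1234], dtype=np.int64)

# Hypothetical deduplicated edges in arbitrary order, keyed by original id.
src = np.array([900, 17, 1234, 17, 42], dtype=np.int64)
tgt = np.array([17, 1234, 900, 42, 17], dtype=np.int64)

# Old path: binary-search each endpoint against the sorted id array.
src_idx = np.searchsorted(ids, src)
tgt_idx = np.searchsorted(ids, tgt)

# np.lexsort treats the LAST key as primary, so these sort by src, then tgt.
order_by_id = np.lexsort((tgt, src))
order_by_idx = np.lexsort((tgt_idx, src_idx))

# Because the two orders agree, either one yields the same CSR arrays.
n_nodes = len(ids)
indices = tgt_idx[order_by_idx].astype(np.int32)
indptr = np.zeros(n_nodes + 1, dtype=np.int64)
np.cumsum(np.bincount(src_idx, minlength=n_nodes), out=indptr[1:])
```

Since the mapping preserves order, the two permutations are equal for any input, which is why moving the sort key from (src, tgt) to (src_idx, tgt_idx) leaves the output unchanged.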