Skip to content

perf(build_csr): push the dense id remap into DuckDB#3

Open
ak2k wants to merge 2 commits into
Mearman:mainfrom
ak2k:csr-duckdb-remap
Open

perf(build_csr): push the dense id remap into DuckDB#3
ak2k wants to merge 2 commits into
Mearman:mainfrom
ak2k:csr-duckdb-remap

Conversation

@ak2k

@ak2k ak2k commented Jun 20, 2026

Copy link
Copy Markdown

Builds on #2. This branch is #2 (preallocate the indices array) plus one further commit. Until #2 merges, the diff below shows both commits; the net change in this PR is just the second one — reviewable in isolation here: ak2k/OpenAlex@csr-preallocate-indices...csr-duckdb-remap . Easiest to merge #2 first, after which this narrows to the single commit automatically.

What

_build_csr_duckdb read the deduplicated original-ID edges back into Python and remapped each batch to dense indices with two np.searchsorted calls against the sorted id array. On work_referenced_works that array is multi-GB, so every lookup misses cache — the binary search, not the DuckDB dedup/sort, dominated the build.

This moves the remap into DuckDB: a dense id -> idx dimension table (row_number() OVER (ORDER BY id)), joined against the deduplicated edges twice, emitting already-dense, already-sorted (src_idx, tgt_idx) pairs. Python then streams that straight into the preallocated CSR arrays — no binary search, no per-batch index transients.

Result (3M-node / 60M-edge graph)

  • End-to-end build_csr CLI: 37.5 s → 10.0 s (3.7×); the remap step alone ~13×.
  • Byte-identical output: .npz and .id_map.npy bit-for-bit identical to the searchsorted path. idx is monotone in id, so ORDER BY (src_idx, tgt_idx) equals the previous ORDER BY (src, tgt) and the dense mapping is unchanged. The build_csr tests (independent reference + determinism) pass.

Query plan

EXPLAIN ANALYZE confirms the shape: projection + filter pushdown into the parquet scans, hash joins with the 3M-row dim table as the (small) build side, a single deterministic sort, and the same three parquet scans as before. No nested loops or cross products.

Trade-off

The join and its spill now live inside DuckDB (bounded by memory_limit) instead of materialising the searchsorted transients on the Python heap. Under a tight memory_limit DuckDB spills more to disk; in exchange the Python side drops both the binary search and the per-batch transient arrays.

ak2k added 2 commits June 20, 2026 15:33
Step 4 collected each batch's remapped tgt indices in a Python list and
np.concatenate'd them after the loop. For work_referenced_works (~3B
deduplicated edges) the joined int32 array is ~12 GB, and concatenate holds
the per-batch list and the joined array at once — transiently doubling that
12 GB at the worst moment.

n_edges is known exactly before the loop, so preallocate indices =
np.empty(n_edges, int32) and fill it slice-by-slice via a running offset.
An assert pins that every deduplicated edge is placed exactly once. Output
is unchanged: edges stream in (src, tgt) order, so writing them in batch
order yields the same array np.concatenate produced.

Measured on a 3M-node / 60M-edge synthetic graph the concatenate transient
is ~0.33 GB of peak RSS; isolating just the array assembly at 500M edges it
is 4.09 GB -> 2.03 GB, i.e. the saving is ~4 bytes/edge and scales to ~12 GB
at work_referenced_works.

Adds tests/test_build_csr.py — the module had no coverage. It checks the CSR
against an independently computed reference (null handling, duplicate
collapse, dense remap) at both a small fixture and a 2000-node graph,
cross-shard deduplication, the empty-relationship case, idempotent skip on
unchanged inputs, and byte-identical output across runs.
Step 4 read deduplicated original-ID edges back into Python and remapped
each batch to dense indices with two np.searchsorted calls against the
sorted id array. On work_referenced_works that array is multi-GB, so every
lookup misses cache — the binary search, not the DuckDB dedup/sort,
dominated the build.

Move the remap into DuckDB: build a dense id -> idx dimension table with
row_number() OVER (ORDER BY id), join the deduplicated edges against it
twice, and emit the already-dense, already-sorted (src_idx, tgt_idx) pairs.
Python then streams that file straight into the preallocated CSR arrays —
no binary search, no per-batch index transients.

Output is byte-identical: idx is monotone in id, so ORDER BY (src_idx,
tgt_idx) equals the previous ORDER BY (src, tgt) and the dense mapping is
unchanged. Verified end-to-end on a 3M-node / 60M-edge graph — the .npz and
.id_map.npy are bit-for-bit identical to the searchsorted path, and the
build_csr tests (independent reference + determinism) pass. On that graph
the full CLI build drops from 37.5s to 10.0s (3.7x); the remap step alone
is ~13x faster.

Also set DuckDB's temp_directory to the output dir so the dimension-table
window, the joins, and the final sort have a spill target — an in-memory
connection won't otherwise spill, which would risk OOM under a tight
memory_limit now that this work runs inside DuckDB.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant