perf(build_csr): preallocate the CSR indices array by ak2k · Pull Request #2 · Mearman/OpenAlex

ak2k · 2026-06-20T16:24:09Z

_build_csr_duckdb collected each batch's remapped target indices in a Python list and joined them with np.concatenate after the loop. For work_referenced_works (~3B deduplicated edges) the joined int32 array is ~12 GB, and np.concatenate holds both the per-batch list and the joined result at once, transiently doubling that 12 GB at the worst possible moment.

n_edges is known exactly before the loop, so this preallocates indices = np.empty(n_edges, np.int32) and fills it slice-by-slice via a running offset. An assert pins that every deduplicated edge is placed exactly once.
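A minimal sketch of the preallocate-and-fill pattern described above, not the actual `_build_csr_duckdb` code. The helper name `assemble_indices` and the in-memory `batches` list are assumptions made for illustration; in the real module the batches come from a DuckDB stream.

```python
import numpy as np


def assemble_indices(batches, n_edges):
    """Fill a preallocated int32 array from an iterable of index batches.

    Hypothetical helper: `batches` stands in for the per-batch remapped
    target indices, and `n_edges` for the exact edge count known up front.
    """
    indices = np.empty(n_edges, dtype=np.int32)
    offset = 0
    for batch in batches:
        batch = np.asarray(batch, dtype=np.int32)
        end = offset + len(batch)
        # Write the batch into its slice; no second full-size array is built.
        indices[offset:end] = batch
        offset = end
    # Every edge must have been placed exactly once.
    assert offset == n_edges, f"placed {offset} edges, expected {n_edges}"
    return indices


batches = [
    np.array([1, 2], dtype=np.int32),
    np.array([0], dtype=np.int32),
    np.array([2, 3, 4], dtype=np.int32),
]
result = assemble_indices(batches, n_edges=6)
```

Because batches are written in arrival order, `result` matches what `np.concatenate(batches)` would return, while only one full-size array is ever live.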

Output is unchanged

Edges stream in (src, tgt) order, so writing them in batch order yields the same array np.concatenate produced. Isolating just the array assembly at 500M edges, peak RSS drops from 4.09 GB to 2.03 GB. The saving is ~4 bytes/edge, which scales to ~12 GB at work_referenced_works.
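The scaling claim can be checked with simple arithmetic. The sketch below assumes 1 GB = 10^9 bytes (consistent with ~3B int32 edges being quoted as ~12 GB) and uses the approximate edge counts given in this description; the function name is illustrative.

```python
BYTES_PER_INT32 = 4


def indices_gb(n_edges):
    """Size in GB (10**9 bytes, an assumption) of an int32 array of n_edges entries."""
    return n_edges * BYTES_PER_INT32 / 1e9


# Measured saving from the isolated 500M-edge run quoted above.
measured_saving_gb = 4.09 - 2.03
bytes_per_edge = measured_saving_gb * 1e9 / 500_000_000

size_500m = indices_gb(500_000_000)      # one copy of the indices array at 500M edges
size_3b = indices_gb(3_000_000_000)      # one copy at ~3B edges (work_referenced_works)
```

The measured 2.06 GB saving works out to roughly 4.1 bytes per edge, in line with removing one transient int32 copy of the indices array.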

Tests

Adds tests/test_build_csr.py; the module previously had no coverage. It builds CSR matrices and checks them against an independently computed reference (null handling, duplicate collapse, dense remap). It also covers the empty-relationship case and byte-identical output across repeated runs (the module's documented determinism invariant). build_csr's deps (numpy/scipy/duckdb) are optional relative to the core sync pipeline, so the file skips where they are absent.
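For illustration, an independent reference of the kind described could look like the sketch below. It is not the code in tests/test_build_csr.py. The function name `reference_csr` is hypothetical, and two behaviours are assumptions: the dense remap assigns 0..n-1 to the sorted distinct ids, and ids are taken only from edges with both endpoints present.

```python
def reference_csr(edges):
    """Build (indptr, indices) for a CSR adjacency in plain Python.

    `edges` is an iterable of (src, tgt) pairs; either endpoint may be None.
    """
    # Null handling and duplicate collapse: drop edges with a missing
    # endpoint, and let the set remove repeated pairs.
    clean = {(s, t) for s, t in edges if s is not None and t is not None}

    # Dense remap (assumed scheme): sorted distinct ids -> 0..n-1.
    ids = sorted({v for edge in clean for v in edge})
    dense = {v: i for i, v in enumerate(ids)}

    # Sort by (src, tgt) so rows are contiguous and column order is stable.
    pairs = sorted((dense[s], dense[t]) for s, t in clean)

    # Count edges per row, then prefix-sum the counts into row pointers.
    indptr = [0] * (len(ids) + 1)
    for s, _ in pairs:
        indptr[s + 1] += 1
    for i in range(len(ids)):
        indptr[i + 1] += indptr[i]

    indices = [t for _, t in pairs]
    return indptr, indices


edges = [(10, 30), (10, 20), (10, 20), (None, 20), (30, None), (30, 10)]
indptr, indices = reference_csr(edges)
empty_indptr, empty_indices = reference_csr([])
```

A test can then compare the `indptr` and `indices` arrays of the matrix under test against these lists, and call the builder twice to compare the outputs byte for byte.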

Step 4 collected each batch's remapped tgt indices in a Python list and np.concatenate'd them after the loop. For work_referenced_works (~3B deduplicated edges) the joined int32 array is ~12 GB, and concatenate holds the per-batch list and the joined array at once, transiently doubling that 12 GB at the worst moment.

n_edges is known exactly before the loop, so preallocate indices = np.empty(n_edges, int32) and fill it slice-by-slice via a running offset. An assert pins that every deduplicated edge is placed exactly once.

Output is unchanged: edges stream in (src, tgt) order, so writing them in batch order yields the same array np.concatenate produced.

Measured on a 3M-node / 60M-edge synthetic graph the concatenate transient is ~0.33 GB of peak RSS; isolating just the array assembly at 500M edges it is 4.09 GB -> 2.03 GB, i.e. the saving is ~4 bytes/edge and scales to ~12 GB at work_referenced_works.

Adds tests/test_build_csr.py; the module had no coverage. It checks the CSR against an independently computed reference (null handling, duplicate collapse, dense remap) at both a small fixture and a 2000-node graph, cross-shard deduplication, the empty-relationship case, idempotent skip on unchanged inputs, and byte-identical output across runs.

ak2k mentioned this pull request Jun 20, 2026

perf(build_csr): push the dense id remap into DuckDB #3

Open

ak2k force-pushed the csr-preallocate-indices branch from 5b192ca to b44dbcc on June 20, 2026 19:34

ak2k wants to merge 1 commit into Mearman:main from ak2k:csr-preallocate-indices
