Skip to content

perf(build_csr): preallocate the CSR indices array#2

Open
ak2k wants to merge 1 commit into
Mearman:mainfrom
ak2k:csr-preallocate-indices
Open

perf(build_csr): preallocate the CSR indices array#2
ak2k wants to merge 1 commit into
Mearman:mainfrom
ak2k:csr-preallocate-indices

Conversation

@ak2k

@ak2k ak2k commented Jun 20, 2026

Copy link
Copy Markdown

_build_csr_duckdb collected each batch's remapped target indices in a Python list and np.concatenate'd them after the loop. For work_referenced_works (~3B deduplicated edges) the joined int32 array is ~12 GB, and concatenate holds both the per-batch list and the joined result at once — transiently doubling that 12 GB at the worst possible moment.

n_edges is known exactly before the loop, so this preallocates indices = np.empty(n_edges, np.int32) and fills it slice-by-slice via a running offset. An assert pins that every deduplicated edge is placed exactly once.

Output is unchanged

Edges stream in (src, tgt) order, so writing them in batch order yields the same array np.concatenate produced. Isolating just the array assembly at 500M edges, peak RSS drops from 4.09 GB to 2.03 GB — the saving is ~4 bytes/edge, scaling to ~12 GB at work_referenced_works.

Tests

Adds tests/test_build_csr.py — the module had no coverage. It builds CSR matrices and checks them against an independently computed reference (null handling, duplicate collapse, dense remap), the empty-relationship case, and byte-identical output across repeated runs (the module's documented determinism invariant). build_csr's deps (numpy/scipy/duckdb) are optional relative to the core sync pipeline, so the file skips where they are absent.

Step 4 collected each batch's remapped tgt indices in a Python list and
np.concatenate'd them after the loop. For work_referenced_works (~3B
deduplicated edges) the joined int32 array is ~12 GB, and concatenate holds
the per-batch list and the joined array at once — transiently doubling that
12 GB at the worst moment.

n_edges is known exactly before the loop, so preallocate indices =
np.empty(n_edges, int32) and fill it slice-by-slice via a running offset.
An assert pins that every deduplicated edge is placed exactly once. Output
is unchanged: edges stream in (src, tgt) order, so writing them in batch
order yields the same array np.concatenate produced.

Measured on a 3M-node / 60M-edge synthetic graph the concatenate transient
is ~0.33 GB of peak RSS; isolating just the array assembly at 500M edges it
is 4.09 GB -> 2.03 GB, i.e. the saving is ~4 bytes/edge and scales to ~12 GB
at work_referenced_works.

Adds tests/test_build_csr.py — the module had no coverage. It checks the CSR
against an independently computed reference (null handling, duplicate
collapse, dense remap) at both a small fixture and a 2000-node graph,
cross-shard deduplication, the empty-relationship case, idempotent skip on
unchanged inputs, and byte-identical output across runs.
@ak2k ak2k force-pushed the csr-preallocate-indices branch from 5b192ca to b44dbcc Compare June 20, 2026 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant