perf(build_csr): preallocate the CSR indices array#2
Open
ak2k wants to merge 1 commit into
Open
Conversation
Step 4 collected each batch's remapped tgt indices in a Python list and np.concatenate'd them after the loop. For work_referenced_works (~3B deduplicated edges) the joined int32 array is ~12 GB, and concatenate holds the per-batch list and the joined array at once — transiently doubling that 12 GB at the worst moment. n_edges is known exactly before the loop, so preallocate indices = np.empty(n_edges, int32) and fill it slice-by-slice via a running offset. An assert pins that every deduplicated edge is placed exactly once. Output is unchanged: edges stream in (src, tgt) order, so writing them in batch order yields the same array np.concatenate produced. Measured on a 3M-node / 60M-edge synthetic graph the concatenate transient is ~0.33 GB of peak RSS; isolating just the array assembly at 500M edges it is 4.09 GB -> 2.03 GB, i.e. the saving is ~4 bytes/edge and scales to ~12 GB at work_referenced_works. Adds tests/test_build_csr.py — the module had no coverage. It checks the CSR against an independently computed reference (null handling, duplicate collapse, dense remap) at both a small fixture and a 2000-node graph, cross-shard deduplication, the empty-relationship case, idempotent skip on unchanged inputs, and byte-identical output across runs.
5b192ca to
b44dbcc
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
_build_csr_duckdbcollected each batch's remapped target indices in a Python list andnp.concatenate'd them after the loop. Forwork_referenced_works(~3B deduplicated edges) the joined int32 array is ~12 GB, andconcatenateholds both the per-batch list and the joined result at once — transiently doubling that 12 GB at the worst possible moment.n_edgesis known exactly before the loop, so this preallocatesindices = np.empty(n_edges, np.int32)and fills it slice-by-slice via a running offset. Anassertpins that every deduplicated edge is placed exactly once.Output is unchanged
Edges stream in
(src, tgt)order, so writing them in batch order yields the same arraynp.concatenateproduced. Isolating just the array assembly at 500M edges, peak RSS drops from 4.09 GB to 2.03 GB — the saving is ~4 bytes/edge, scaling to ~12 GB atwork_referenced_works.Tests
Adds
tests/test_build_csr.py— the module had no coverage. It builds CSR matrices and checks them against an independently computed reference (null handling, duplicate collapse, dense remap), the empty-relationship case, and byte-identical output across repeated runs (the module's documented determinism invariant). build_csr's deps (numpy/scipy/duckdb) are optional relative to the core sync pipeline, so the file skips where they are absent.