
repository: add PackWriter and two-phase chunk index update #9723

Merged
ThomasWaldmann merged 2 commits into borgbackup:master from mr-raj12:pack-files-step5-packwriter on Jun 9, 2026

Conversation

@mr-raj12

@mr-raj12 mr-raj12 commented Jun 5, 2026

Contributor

Description

PackWriter buffers (chunk_id, cdata) pairs and flushes them as a pack file via borgstore once max_count chunks have accumulated. At N=1 (max_count=1), pack_id == chunk_id and pack files land at packs/{chunk_id_hex}, so no changes are needed to get() or delete().

UNKNOWN_INT32 = 0xFFFFFFFF is the sentinel for pack location fields that have not been written yet. 0xFFFFFFFF is above MAX_DATA_SIZE (~20 MB), so it can never collide with a real obj_offset. chunks.add() writes the placeholder; update_pack_info() fills in the real values after flush().
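The two-phase index update can be illustrated with a toy index. This is only a sketch: `ToyChunkIndex` and `Entry` are hypothetical stand-ins, the method signatures are assumptions (the real hashindex.pyx signatures are not shown in this thread), and `MAX_DATA_SIZE` is set to a round 20 MiB to match the approximate figure above.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN_INT32 = 0xFFFFFFFF          # sentinel value named in the description
MAX_DATA_SIZE = 20 * 1024 * 1024    # assumed round figure for "~20 MB"


@dataclass
class Entry:
    """Hypothetical index entry; field names follow the description."""
    size: int
    pack_id: Optional[bytes] = None
    obj_offset: int = UNKNOWN_INT32
    obj_size: int = UNKNOWN_INT32


class ToyChunkIndex:
    """Toy stand-in for the chunk index, not borg's ChunkIndex."""

    def __init__(self):
        self._entries = {}

    def add(self, chunk_id, size):
        # Phase 1: the pack location is not known yet, so store sentinels.
        self._entries[chunk_id] = Entry(size=size)

    def update_pack_info(self, chunk_id, pack_id, obj_offset, obj_size):
        # Phase 2: after the pack was flushed, fill in the real location.
        entry = self._entries[chunk_id]
        entry.pack_id = pack_id
        entry.obj_offset = obj_offset
        entry.obj_size = obj_size

    def __getitem__(self, chunk_id):
        return self._entries[chunk_id]
```

Because the sentinel is larger than any valid data size, a reader can tell "not yet written" apart from a real offset with a plain equality check.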

flush() clears _pieces in a try/finally. Without this, a store failure would leave the chunk in the buffer, where it would be re-bundled with the next chunk, triggering the N>1 code path and writing the pack under a hash-derived key instead of the chunk's own id.
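A minimal sketch of the buffering and the try/finally behavior, assuming a store object with a `store(key, value)` method. `ToyPackWriter`, `DictStore`, `FailingStore`, and the shape of the returned tuples are hypothetical and only illustrate the idea; they are not the PR's code.

```python
from hashlib import sha256


class DictStore:
    """Hypothetical in-memory store used only for this sketch."""

    def __init__(self):
        self.data = {}

    def store(self, key, value):
        self.data[key] = value


class FailingStore:
    """Hypothetical store that always fails, to exercise the finally path."""

    def store(self, key, value):
        raise OSError("simulated store failure")


class ToyPackWriter:
    def __init__(self, store, max_count=1):
        self.store = store
        self.max_count = max_count
        self._pieces = []  # buffered (chunk_id, cdata) pairs

    def add(self, chunk_id, cdata):
        self._pieces.append((chunk_id, cdata))
        if len(self._pieces) >= self.max_count:
            return self.flush()
        return []

    def flush(self):
        if not self._pieces:
            return []
        pieces = self._pieces
        try:
            data = b"".join(cdata for _, cdata in pieces)
            # At max_count == 1 the pack is keyed by the chunk's own id.
            if self.max_count == 1:
                pack_id = pieces[0][0]
            else:
                pack_id = sha256(data).digest()
            self.store.store("packs/" + pack_id.hex(), data)
            results, offset = [], 0
            for chunk_id, cdata in pieces:
                # (chunk_id, pack_id, obj_offset, obj_size) per chunk
                results.append((chunk_id, pack_id, offset, len(cdata)))
                offset += len(cdata)
            return results
        finally:
            # Always drop the buffer, even if the store call raised.
            self._pieces = []
```

With `FailingStore`, the exception still propagates to the caller, but the buffer is empty afterwards, so the failed chunk cannot leak into the next pack.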

Changes:

  • repository.py: add PackWriter; put() delegates to
    _pack_writer.add() and returns pack results.
  • constants.py: add UNKNOWN_INT32 = 0xFFFFFFFF.
  • hashindex.pyx: add() uses UNKNOWN_INT32 placeholders; new
    update_pack_info().
  • hashindex.pyi: type stub for update_pack_info().
  • cache.py: add_chunk() calls update_pack_info() from pack results.
  • archive.py: add_reference() in rebuild_archives() does the same.

refs #8572

Checklist

  • PR is against master
  • New code has tests and docs where appropriate
  • Tests pass
  • Commit messages are clean and reference related issues

@codecov

codecov Bot commented Jun 5, 2026


Codecov Report

❌ Patch coverage is 95.65217% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.78%. Comparing base (a1e8e53) to head (0395cc1).
⚠️ Report is 7 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/borg/repository.py | 94.59% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9723      +/-   ##
==========================================
+ Coverage   84.72%   84.78%   +0.05%     
==========================================
  Files          92       92              
  Lines       15007    15047      +40     
  Branches     2243     2250       +7     
==========================================
+ Hits        12715    12757      +42     
+ Misses       1592     1590       -2     
  Partials      700      700              


@mr-raj12 mr-raj12 force-pushed the pack-files-step5-packwriter branch from 182e92a to 049fba2 Compare June 5, 2026 20:17

@ThomasWaldmann ThomasWaldmann left a comment

Member

Looks good overall. Some small optimizations could be done.

Later, when introducing a size limit, it will get a bit more complicated if we want to strictly obey that limit and never exceed it by up to one maximum chunk size in the worst case.

Update: thinking about it, we could also just accept that it is not a strict limit. Simpler code. E.g. when setting 50MB as the limit, a pack could also end up at 70MB.
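The non-strict limit can be sketched as follows, assuming the pack is closed only after the chunk that reaches the limit has been added. This PR has no size limit yet; `pack_sizes` is a hypothetical helper and the sizes are in arbitrary units (read them as MB).

```python
def pack_sizes(chunk_sizes, limit):
    """Group chunk sizes into packs, closing a pack once it reaches `limit`.

    The check runs after adding a chunk, so a pack can exceed `limit`
    by less than the size of the largest chunk.
    """
    packs, current = [], 0
    for size in chunk_sizes:
        current += size
        if current >= limit:
            packs.append(current)
            current = 0
    if current:
        packs.append(current)  # final partial pack
    return packs
```

For example, `pack_sizes([49, 20, 10], 50)` closes the first pack at 69, because the 20-unit chunk arrives while the buffer holds 49, just under the limit.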

Comment thread src/borg/hashindex.pyx Outdated
Comment thread src/borg/repository.py Outdated
Comment thread src/borg/repository.py Outdated
@ThomasWaldmann

Member

range-load using the pack values from index within this PR or in next one?

Comment thread src/borg/archive.py Outdated
Comment thread src/borg/hashindex.pyx Outdated
@mr-raj12

mr-raj12 commented Jun 5, 2026

Contributor Author

> range-load using the pack values from index within this PR or in next one?

in the next one

@ThomasWaldmann

ThomasWaldmann commented Jun 6, 2026

Member

OK, so please finish this one.

For the next one:

I guess that will be a small and simple PR: just use the obj_offset and obj_size from the index to do a range-load. As those are always 0 and the file size right now, nothing should break.
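A rough sketch of what such a range-load amounts to, using an in-memory byte string in place of the store. `range_load` is a hypothetical helper; a real implementation would presumably ask the store for the byte range instead of slicing a fully loaded pack.

```python
UNKNOWN_INT32 = 0xFFFFFFFF  # sentinel for "pack location not written yet"


def range_load(pack_bytes, obj_offset, obj_size):
    """Return the object stored at [obj_offset, obj_offset + obj_size)."""
    if UNKNOWN_INT32 in (obj_offset, obj_size):
        raise ValueError("pack location not known yet")
    end = obj_offset + obj_size
    if end > len(pack_bytes):
        raise ValueError("range exceeds pack size")
    return pack_bytes[obj_offset:end]
```

At max_count=1 the index holds offset 0 and the full file size, so `range_load(pack, 0, len(pack))` returns the whole pack, which is why nothing should change in behavior.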

Idea for the next one after that:

As you now update the index once the sha256 pack_id is known, add an env var (SHA256_PACK_ID=1 or so) to disable the pack_id == chunk_id hack and use the real sha256 pack_id from the index. Still stay at max_count=1.
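One way the toggle could look, as a sketch only. The variable name `SHA256_PACK_ID` is the tentative name from the comment above, and `choose_pack_id` is a hypothetical helper that does not exist in borg.

```python
import os
from hashlib import sha256


def choose_pack_id(chunk_id, pack_data, max_count, environ=os.environ):
    """Pick the pack id, honoring the (tentatively named) env var."""
    use_sha256 = environ.get("SHA256_PACK_ID") == "1"
    if max_count == 1 and not use_sha256:
        # Current hack: single-chunk packs are keyed by the chunk id.
        return chunk_id
    # Real pack id: hash of the pack contents.
    return sha256(pack_data).digest()
```

Passing `environ` as a parameter keeps the helper easy to test without touching the process environment.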

Likely, that will show a lot of problems in the existing code, pointing to all the places that now need to use the index to get the pack_id / work based on packs. Make a priority list of what needs fixing.

The CI could get one informational but otherwise ignored job that runs the tests with that env var, so we can see fewer and fewer tests fail as you fix more and more stuff.

@mr-raj12 mr-raj12 force-pushed the pack-files-step5-packwriter branch from 8616791 to f60a3d1 Compare June 7, 2026 22:23
Comment thread src/borg/repository.py
@mr-raj12 mr-raj12 force-pushed the pack-files-step5-packwriter branch from f60a3d1 to 60ca680 Compare June 7, 2026 23:25
Comment thread src/borg/repository.py Outdated
@mr-raj12 mr-raj12 force-pushed the pack-files-step5-packwriter branch from 60ca680 to 0a9160c Compare June 8, 2026 05:12

@ThomasWaldmann ThomasWaldmann left a comment

Member

LGTM.

Guess the real test will come when actually using the indexed pack-related values.

@mr-raj12 mr-raj12 force-pushed the pack-files-step5-packwriter branch from 0a9160c to 2c60858 Compare June 9, 2026 02:24
@mr-raj12 mr-raj12 requested a review from ThomasWaldmann June 9, 2026 05:37
Comment thread src/borg/repository.py Outdated
mr-raj12 added 2 commits June 9, 2026 14:49
…gbackup#8572

PackWriter buffers (chunk_id, cdata) pairs and flushes as pack files via borgstore.
At N=1 pack_id == chunk_id; UNKNOWN_INT32 (0xFFFFFFFF) placeholders in the index
are replaced by real pack location fields after flush() via update_pack_info().
Update test_chunkindex_add to expect UNKNOWN_INT32 sentinels from add().
…rom PackWriter, refs borgbackup#8572

Fix PackWriter.flush() to use max_count == 1 (not len == 1) for the pack_id hack,
so final partial packs under max_count > 1 correctly use SHA256. Add covering test.
Move sha256 import to module level in repository_test.
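The difference between the two conditions in the second commit message can be sketched like this. Both functions are hypothetical reconstructions for illustration, not the actual code before and after the fix.

```python
from hashlib import sha256


def pack_id_by_len(pieces, max_count):
    """Assumed pre-fix rule: decide by the number of buffered pieces."""
    if len(pieces) == 1:
        return pieces[0][0]
    return sha256(b"".join(cdata for _, cdata in pieces)).digest()


def pack_id_by_max_count(pieces, max_count):
    """Rule from the commit message: decide by the configured max_count."""
    if max_count == 1:
        return pieces[0][0]
    return sha256(b"".join(cdata for _, cdata in pieces)).digest()
```

With max_count=3 and a final partial pack holding a single chunk, the length-based rule would key the pack by the chunk id, while the max_count-based rule uses the SHA256 of the pack contents, as the commit message describes.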
@mr-raj12 mr-raj12 force-pushed the pack-files-step5-packwriter branch from 2c60858 to 0395cc1 Compare June 9, 2026 09:20
@ThomasWaldmann ThomasWaldmann merged commit 7cdaebf into borgbackup:master Jun 9, 2026
28 of 32 checks passed

2 participants