repository: add PackWriter and two-phase chunk index update #9723
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #9723 +/- ##
==========================================
+ Coverage 84.72% 84.78% +0.05%
==========================================
Files 92 92
Lines 15007 15047 +40
Branches 2243 2250 +7
==========================================
+ Hits 12715 12757 +42
+ Misses 1592 1590 -2
Partials 700 700
182e92a to 049fba2
Looks good overall. Some small optimizations could be done.
Later, when introducing a size limit, it will get a bit more complicated if we want to strictly obey that limit and never exceed it, not even by one maximum chunk size in the worst case.
Update: thinking about it, we could also just accept that it is not a strict limit. Simpler code. E.g. with a 50MB limit, a pack could also end up at 70MB.
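A minimal sketch of the "no strict limit" variant discussed above, for illustration only. `SoftLimitBuffer`, `soft_limit` and `flushed_sizes` are hypothetical names, not part of this PR; the point is only that checking the limit after appending bounds the overshoot by the size of one chunk.

```python
class SoftLimitBuffer:
    """Hypothetical buffer that flushes once a soft size limit is reached."""

    def __init__(self, soft_limit):
        self.soft_limit = soft_limit
        self.pieces = []
        self.size = 0
        self.flushed_sizes = []  # sizes of the packs "written" so far

    def add(self, cdata):
        self.pieces.append(cdata)
        self.size += len(cdata)
        # The check happens after appending, so a pack may exceed the limit,
        # but always by less than the size of the chunk that crossed it.
        if self.size >= self.soft_limit:
            self.flush()

    def flush(self):
        if self.pieces:
            self.flushed_sizes.append(self.size)
        self.pieces = []
        self.size = 0
```

With a limit of 50 and chunks of at most 20 bytes, every flushed pack in this sketch stays below 70 bytes, which mirrors the 50MB / 70MB example in the comment.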
range-load using the pack values from index within this PR or in next one?
in the next one
OK, so please finish this one.
For the next one: I guess that will be a small and simple PR; just use the obj_offset and obj_size from the index to do a range-load. As that is always 0 and the file size right now, nothing should break.
Idea for the one after that: since you now update the index after the sha256 pack_id is known, add an env var (SHA256_PACK_ID=1 or so) to disable the pack_id == chunk_id hack and use the real sha256 pack_id from the index. Still stay at max_count=1. Likely, that will reveal a lot of problems in the existing code, pointing to all the places that now need to use the index to get the pack_id / work based on packs. Make a priority list of what needs fixing. The CI could get one informative but otherwise ignored job that runs the tests with that env var, so we can see fewer and fewer tests fail as you fix more and more stuff.
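The two follow-up ideas can be sketched roughly as below. This is an illustration under assumptions, not code from borg: `range_load` and `choose_pack_id` are hypothetical helpers, the pack is modelled as plain bytes in memory, and the env var name is taken from the tentative "SHA256_PACK_ID=1 or so" suggestion.

```python
import os
from hashlib import sha256


def range_load(pack_data, obj_offset, obj_size):
    """Return only the object's byte range from a pack (hypothetical helper)."""
    return pack_data[obj_offset:obj_offset + obj_size]


def choose_pack_id(chunk_id, pack_data, environ=os.environ):
    """Pick the pack id; the env var name is tentative and may differ."""
    if environ.get("SHA256_PACK_ID") == "1":
        # Real pack id: hash of the pack contents.
        return sha256(pack_data).digest()
    # Default: keep the pack_id == chunk_id hack.
    return chunk_id
```

As long as every pack holds exactly one object, `obj_offset` is 0 and `obj_size` equals the file size, so `range_load` returns the whole pack and behaviour does not change.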
8616791 to f60a3d1
f60a3d1 to 60ca680
60ca680 to 0a9160c
ThomasWaldmann left a comment
LGTM.
Guess the real test will come when actually using the indexed pack-related values.
0a9160c to 2c60858
…gbackup#8572
PackWriter buffers (chunk_id, cdata) pairs and flushes as pack files via borgstore. At N=1 pack_id == chunk_id; UNKNOWN_INT32 (0xFFFFFFFF) placeholders in the index are replaced by real pack location fields after flush() via update_pack_info(). Update test_chunkindex_add to expect UNKNOWN_INT32 sentinels from add().
…rom PackWriter, refs borgbackup#8572
Fix PackWriter.flush() to use max_count == 1 (not len == 1) for the pack_id hack, so final partial packs under max_count > 1 correctly use SHA256. Add covering test. Move sha256 import to module level in repository_test.
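To illustrate why the check has to be on `max_count` rather than on the number of buffered pieces, here is a small sketch. `pack_id_for` is a hypothetical helper, and joining the chunk data with no framing is an assumption made only so there is something to hash; the real pack layout may differ.

```python
from hashlib import sha256


def pack_id_for(pieces, max_count):
    """pieces: list of (chunk_id, cdata) tuples buffered for one pack."""
    if max_count == 1:
        # N=1 hack: the single chunk's id doubles as the pack id.
        return pieces[0][0]
    # N>1: hash the pack contents, even if this final pack happens to hold
    # only one piece. Checking len(pieces) == 1 here would wrongly take the
    # N=1 branch for such a partial pack.
    pack_data = b"".join(cdata for _, cdata in pieces)
    return sha256(pack_data).digest()
```

A final partial pack with a single chunk under `max_count=3` therefore still gets a hash-derived id in this sketch.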
2c60858 to 0395cc1
Description
`PackWriter` buffers `(chunk_id, cdata)` pairs and flushes them as a pack file via borgstore once `max_count` chunks accumulate. At N=1 (`max_count=1`), `pack_id == chunk_id` and pack files land at `packs/{chunk_id_hex}`. No changes needed to `get()` or `delete()`.

`UNKNOWN_INT32 = 0xFFFFFFFF` is the sentinel for pack location fields that are not yet written. 0xFFFFFFFF is above MAX_DATA_SIZE (~20 MB), so it can never collide with a real `obj_offset`. `chunks.add()` writes the placeholder; `update_pack_info()` fills in real values after `flush()`.

`flush()` clears `_pieces` in a `try/finally`. Without this, a store failure would leave the chunk in the buffer and it would get re-bundled with the next chunk, triggering the N>1 code path and writing under a hash-derived key instead of the chunk's own id.

Changes:

- `repository.py`: add `PackWriter`; `put()` delegates to `_pack_writer.add()` and returns pack results.
- `constants.py`: add `UNKNOWN_INT32 = 0xFFFFFFFF`.
- `hashindex.pyx`: `add()` uses `UNKNOWN_INT32` placeholders; new `update_pack_info()`.
- `hashindex.pyi`: type stub for `update_pack_info()`.
- `cache.py`: `add_chunk()` calls `update_pack_info()` from pack results.
- `archive.py`: `add_reference()` in `rebuild_archives()` does the same.

refs #8572
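The two-phase index update and the `try/finally` in `flush()` described above might look roughly like the following toy model. It is a sketch under stated assumptions, not the PR's code: `ToyChunkIndex` and `ToyPackWriter` are hypothetical stand-ins, the index is a plain dict, the store is any dict-like object, the `None` placeholder for `pack_id` and the shape of the returned pack results are guesses, and only the N=1 case is modelled.

```python
UNKNOWN_INT32 = 0xFFFFFFFF  # sentinel value named in the PR description


class ToyChunkIndex:
    """Dict-based stand-in for the real chunk index (hypothetical)."""

    def __init__(self):
        self.entries = {}

    def add(self, chunk_id, size):
        # Phase 1: the pack location is not known yet, so store placeholders.
        self.entries[chunk_id] = {
            "size": size,
            "pack_id": None,  # assumed placeholder, not from the PR
            "obj_offset": UNKNOWN_INT32,
            "obj_size": UNKNOWN_INT32,
        }

    def update_pack_info(self, chunk_id, pack_id, obj_offset, obj_size):
        # Phase 2: fill in the real location once the pack has been stored.
        self.entries[chunk_id].update(
            pack_id=pack_id, obj_offset=obj_offset, obj_size=obj_size
        )


class ToyPackWriter:
    """N=1 only: every chunk becomes its own pack and pack_id == chunk_id."""

    def __init__(self, store):
        self.store = store  # any dict-like object mapping key -> bytes
        self._pieces = []

    def add(self, chunk_id, cdata):
        self._pieces.append((chunk_id, cdata))
        return self.flush()

    def flush(self):
        try:
            results = []
            for chunk_id, cdata in self._pieces:
                self.store["packs/" + chunk_id.hex()] = cdata
                # (chunk_id, pack_id, obj_offset, obj_size)
                results.append((chunk_id, chunk_id, 0, len(cdata)))
            return results
        finally:
            # Always drop the buffered pieces, even if the store raised, so a
            # failed chunk is never re-bundled with the next one.
            self._pieces = []
```

A caller would first call `index.add(chunk_id, size)`, then pass each tuple returned by `writer.add(chunk_id, cdata)` to `index.update_pack_info(...)`. If the store raises, the exception still propagates, but the writer's buffer is empty afterwards.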
Checklist
master