Switch rust-htslib to noodles#118
Draft
ewels wants to merge 4 commits into
Draft
Conversation
* Migrate BAM I/O from rust-htslib to noodles (#113) Replace rust-htslib with a pure-Rust noodles backend behind a compatibility shim at src/rna/bam/. The shim preserves the existing Reader/IndexedReader/Record API used across dupRadar, RSeQC, preseq, Qualimap, and samtools-compatible outputs. Key changes: - Add noodles-backed BAM/SAM/CRAM readers with indexed fetch support - Preserve samtools-identical CHK checksums via packed sequence bytes - Update build docs: no cmake/htslib system deps required - All integration and unit tests pass in release mode Co-authored-by: Phil Ewels <phil.ewels@seqera.io> * Update docs, Dockerfile, and CI for noodles migration Replace rust-htslib build prerequisites with noodles (pure Rust) across the website docs, Dockerfile, GitHub Actions workflows, and CHANGELOG. Co-authored-by: Phil Ewels <phil.ewels@seqera.io> * Fix CI: rustfmt and bump MSRV to 1.89 for noodles - Remove extra blank line in main.rs (cargo fmt --check) - Bump rust-version to 1.89 (required by noodles 0.111) - Update MSRV CI job to match Co-authored-by: Phil Ewels <phil.ewels@seqera.io> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Follow-up to the rust-htslib → noodles migration (#115). Three fixes surfaced during code review: - align_record: propagate CIGAR decode errors in `Record::from_bam` instead of silently falling back to an empty CIGAR via `unwrap_or_default()`. A swallowed decode error would have produced wrong coverage/indel statistics with no signal. - align_record: complete the ASCII→4-bit sequence packing table to the full htslib `seq_nt16_table` (all 16 IUPAC codes), so SAM/CRAM records containing ambiguity codes round-trip faithfully and produce samtools-identical CHK checksums. Previously every non-ACGT base was collapsed to N. - io: build the fetch `Region` directly from the reference name and a full interval instead of formatting and re-parsing `name:start-end`. The old approach mangled reference names containing `:` or `-` (e.g. HLA contigs) and lossily decoded non-UTF-8 names. Adds unit tests covering IUPAC packing round-trips, lowercase handling, and odd-length trailing-nibble layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The indexed BAM reader buffered every record of a fetched reference (and all unmapped reads) into a `Vec` before yielding any. dupRadar fetches one chromosome per worker thread concurrently, so peak memory scaled as reference size × thread count — gigabytes for a deep chromosome. Replace the buffering with a lazy `BamCursor` that mirrors noodles' `csi::io::Query` chunk state machine: `fetch(tid)` resolves the reference's index chunks up front (small) and `read()` seeks/decodes one record at a time straight from the BGZF stream, filtering out the occasional neighbouring-reference record that shares a chunk boundary block. `fetch(Unmapped)` seeks to the index's unmapped offset (falling back to a reopen + first-record scan when the index lacks the metadata pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader. CRAM (sequential and indexed) still buffers — noodles' slice-based CRAM iterators don't expose an equivalent low-level chunk API — and is left as a documented follow-up. Output is unchanged: the dupRadar integration tests assert exact counts and matrices over a 2-reference fixture and drive both the per-tid and unmapped streaming paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CRAM was the remaining buffered path after the indexed-BAM streaming fix: the sequential reader collected every record in the file into a `Vec`, and indexed `fetch` collected the whole queried reference (or all unmapped reads) before yielding any. noodles' `Records`/`Query` iterators already stream container-by-container (~one container of records at a time), but they borrow the reader they read from, so they can't be stored next to it in a plain struct — which is why the original code collected eagerly. Unlike BAM, noodles keeps CRAM's per-container decode internals (`Container`/`Slice` helpers, the reference repository accessor) `pub(crate)`, so the hand-rolled state machine used for BAM isn't possible here. Wrap the reader + its borrowing iterator in `ouroboros` self-referential structs (`CramReaderCursor`, `CramIndexedCursor`) so the streaming iterator can live alongside the reader. Sequential reads stream directly; indexed `fetch(tid)` / `fetch(Unmapped)` rebuild the cursor with a fresh query via `into_heads`. Peak CRAM memory drops from whole-file to ~one container. Adds a runtime test that writes a reference-backed CRAM and streams it back through the sequential reader, since CRAM previously had no fixture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 17, 2026
Member
Author
|
Test / benchmark PRs:
Not checked at all yet, so still needs a lot of validation work. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Re-submits the BAM I/O migration from rust-htslib to a pure-Rust noodles backend, together with the follow-up review and performance fixes.
Background
The original migration was merged prematurely as #115 and has now been removed from `main` (`main` is back on rust-htslib). This PR brings the work back as a single, complete unit so it can be reviewed and merged properly. It supersedes #117.
Contents
Supersedes #117.