Skip to content

Switch rust-htslib to noodles#118

Draft
ewels wants to merge 4 commits into
mainfrom
noodles-migration-redo
Draft

Switch rust-htslib to noodles#118
ewels wants to merge 4 commits into
mainfrom
noodles-migration-redo

Conversation

@ewels

@ewels ewels commented Jun 17, 2026

Copy link
Copy Markdown
Member

Re-submits the BAM I/O migration from rust-htslib to a pure-Rust noodles backend, together with the follow-up review and performance fixes.

Background

The original migration was merged prematurely as #115 and has now been removed from `main` (`main` is back on rust-htslib). This PR brings the work back as a single, complete unit so it can be reviewed and merged properly. It supersedes #117.

Contents

  • Migrate BAM I/O from rust-htslib to noodles (originally Migrate BAM I/O from rust-htslib to noodles #115) — noodles-backed BAM/SAM/CRAM readers behind a compatibility shim at `src/rna/bam/`, preserving the existing Reader/IndexedReader/Record API and samtools-identical CHK checksums. No cmake/htslib system deps required.
  • fix(bam): harden noodles compatibility shim from Migrate BAM I/O from rust-htslib to noodles #115 review — propagate CIGAR decode errors, complete the IUPAC 4-bit sequence packing table, and build fetch regions directly from reference name + interval.
  • perf(bam): stream indexed reads instead of buffering whole references — lazy `BamCursor` so peak memory is O(1) per reader instead of reference-size × thread-count.
  • perf(bam): stream CRAM reads via self-referential cursors — CRAM now streams ~one container at a time rather than buffering the whole file/reference.

Supersedes #117.

ewels and others added 4 commits June 17, 2026 10:39
* Migrate BAM I/O from rust-htslib to noodles (#113)

Replace rust-htslib with a pure-Rust noodles backend behind a
compatibility shim at src/rna/bam/. The shim preserves the existing
Reader/IndexedReader/Record API used across dupRadar, RSeQC, preseq,
Qualimap, and samtools-compatible outputs.

Key changes:
- Add noodles-backed BAM/SAM/CRAM readers with indexed fetch support
- Preserve samtools-identical CHK checksums via packed sequence bytes
- Update build docs: no cmake/htslib system deps required
- All integration and unit tests pass in release mode

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>

* Update docs, Dockerfile, and CI for noodles migration

Replace rust-htslib build prerequisites with noodles (pure Rust) across
the website docs, Dockerfile, GitHub Actions workflows, and CHANGELOG.

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>

* Fix CI: rustfmt and bump MSRV to 1.89 for noodles

- Remove extra blank line in main.rs (cargo fmt --check)
- Bump rust-version to 1.89 (required by noodles 0.111)
- Update MSRV CI job to match

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Follow-up to the rust-htslib → noodles migration (#115). Three fixes
surfaced during code review:

- align_record: propagate CIGAR decode errors in `Record::from_bam`
  instead of silently falling back to an empty CIGAR via
  `unwrap_or_default()`. A swallowed decode error would have produced
  wrong coverage/indel statistics with no signal.

- align_record: complete the ASCII→4-bit sequence packing table to the
  full htslib `seq_nt16_table` (all 16 IUPAC codes), so SAM/CRAM records
  containing ambiguity codes round-trip faithfully and produce
  samtools-identical CHK checksums. Previously every non-ACGT base was
  collapsed to N.

- io: build the fetch `Region` directly from the reference name and a
  full interval instead of formatting and re-parsing `name:start-end`.
  The old approach mangled reference names containing `:` or `-`
  (e.g. HLA contigs) and lossily decoded non-UTF-8 names.

Adds unit tests covering IUPAC packing round-trips, lowercase handling,
and odd-length trailing-nibble layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The indexed BAM reader buffered every record of a fetched reference (and
all unmapped reads) into a `Vec` before yielding any. dupRadar fetches
one chromosome per worker thread concurrently, so peak memory scaled as
reference size × thread count — gigabytes for a deep chromosome.

Replace the buffering with a lazy `BamCursor` that mirrors noodles'
`csi::io::Query` chunk state machine: `fetch(tid)` resolves the
reference's index chunks up front (small) and `read()` seeks/decodes one
record at a time straight from the BGZF stream, filtering out the
occasional neighbouring-reference record that shares a chunk boundary
block. `fetch(Unmapped)` seeks to the index's unmapped offset (falling
back to a reopen + first-record scan when the index lacks the metadata
pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader.

CRAM (sequential and indexed) still buffers — noodles' slice-based CRAM
iterators don't expose an equivalent low-level chunk API — and is left
as a documented follow-up.

Output is unchanged: the dupRadar integration tests assert exact counts
and matrices over a 2-reference fixture and drive both the per-tid and
unmapped streaming paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CRAM was the remaining buffered path after the indexed-BAM streaming fix:
the sequential reader collected every record in the file into a `Vec`,
and indexed `fetch` collected the whole queried reference (or all
unmapped reads) before yielding any.

noodles' `Records`/`Query` iterators already stream container-by-container
(~one container of records at a time), but they borrow the reader they
read from, so they can't be stored next to it in a plain struct — which is
why the original code collected eagerly. Unlike BAM, noodles keeps CRAM's
per-container decode internals (`Container`/`Slice` helpers, the reference
repository accessor) `pub(crate)`, so the hand-rolled state machine used
for BAM isn't possible here.

Wrap the reader + its borrowing iterator in `ouroboros` self-referential
structs (`CramReaderCursor`, `CramIndexedCursor`) so the streaming iterator
can live alongside the reader. Sequential reads stream directly; indexed
`fetch(tid)` / `fetch(Unmapped)` rebuild the cursor with a fresh query via
`into_heads`. Peak CRAM memory drops from whole-file to ~one container.

Adds a runtime test that writes a reference-backed CRAM and streams it
back through the sequential reader, since CRAM previously had no fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ewels ewels changed the title Migrate BAM I/O from rust-htslib to noodles (+ follow-up fixes) Switch rust-htslib to noodles Jun 17, 2026
@ewels ewels mentioned this pull request Jun 17, 2026
@ewels ewels marked this pull request as draft June 17, 2026 13:10
@ewels

ewels commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Test / benchmark PRs:

Not checked at all yet, so still needs a lot of validation work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant