fix(file source): handle concatenated gzip streams #25614

Merged
thomasqueirozb merged 10 commits into master from fix/file-source-gzip-multi-stream
Jun 17, 2026

@thomasqueirozb thomasqueirozb commented Jun 12, 2026

Summary

The async file source migration (#23612, v0.50.0) replaced flate2::bufread::MultiGzDecoder with async_compression::tokio::bufread::GzipDecoder. MultiGzDecoder handles concatenated gzip streams by design; GzipDecoder stops after the first member unless .multiple_members(true) is called, which was never done. This caused the file source and fingerprinter to silently drop all but the first gzip stream in multi-member files.
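
The failure mode can be reproduced outside Vector. The following is a minimal sketch in Python, not the Rust code from this PR: it uses the standard-library zlib module as a stand-in for a single-member decoder. A `zlib.decompressobj` opened in gzip mode stops at the end of the first member and leaves the remaining bytes in `unused_data`, which is roughly what a decoder without multi-member support does to a concatenated file. The helper name `decode_first_member` is hypothetical.

```python
import gzip
import zlib


def decode_first_member(data: bytes) -> tuple:
    """Decode only the first gzip member; return (decoded, leftover_bytes)."""
    # wbits=31 (16 + MAX_WBITS) selects gzip framing instead of raw zlib.
    decoder = zlib.decompressobj(wbits=31)
    decoded = decoder.decompress(data)
    # Once the first member's trailer is consumed, the decoder stops and
    # everything after it is left untouched in unused_data.
    return decoded, decoder.unused_data


# Two independent gzip members concatenated, like `gzip -c >> file`.
first = gzip.compress(b"multiple_1hello\n")
second = gzip.compress(b"multiple_2world\n")

decoded, leftover = decode_first_member(first + second)
print(decoded)        # only the first member's payload
print(len(leftover))  # non-zero: the second member was never decoded
```

The leftover bytes are a complete, valid gzip member on their own, so nothing is corrupt; the data is simply never read.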

The fix introduces gzip_multiple_decoder in vector-common::compression, a thin wrapper that constructs a GzipDecoder with multiple_members enabled, and replaces all bare GzipDecoder::new call sites (file watcher, fingerprinter, aws_s3 source). GzipDecoder::new is now a disallowed method in clippy.toml, so any new direct call fails linting and the bug cannot silently recur.
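
The idea behind the wrapper, a single entry point that always decodes every member so that call sites cannot forget to opt in, can be sketched in Python. This is an analogy under stated assumptions, not the Rust implementation: `decompress_all_members` is a hypothetical name, and it works on an in-memory buffer, whereas the real decoder is an async streaming reader.

```python
import gzip
import zlib


def decompress_all_members(data: bytes) -> bytes:
    """Decode every gzip member in data and concatenate the payloads."""
    chunks = []
    remaining = data
    while remaining:
        # A fresh decoder per member; wbits=31 selects gzip framing.
        decoder = zlib.decompressobj(wbits=31)
        chunks.append(decoder.decompress(remaining))
        chunks.append(decoder.flush())
        if not decoder.eof:
            # Input ended before the member's trailer was seen.
            raise ValueError("truncated gzip member")
        # Bytes after this member's trailer belong to the next member.
        remaining = decoder.unused_data
    return b"".join(chunks)


members = [gzip.compress(f"line {i}\n".encode()) for i in range(3)]
result = decompress_all_members(b"".join(members))
print(result)
```

Routing every caller through one function like this is what makes the lint useful: once the bare constructor is disallowed, the multi-member behavior is the only one available.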

Vector configuration

data_dir: /tmp/vector-test

sources:
  files:
    type: file
    include:
      - /tmp/vector-test/*.gz
    fingerprint:
      strategy: checksum
    read_from: beginning

sinks:
  out:
    type: console
    inputs: [files]
    encoding:
      codec: text

How did you test this PR?

Create a multi-member gzip file and a standard single-member gzip file:

mkdir -p /tmp/vector-test

# multi-stream: two separate gzip members concatenated
echo "multiple_1hello" | gzip -c >  /tmp/vector-test/multiple-stream.gz
echo "multiple_2world" | gzip -c >> /tmp/vector-test/multiple-stream.gz

# single-stream: two lines in one gzip member
printf "single_1hello\nsingle_2world\n" | gzip -c > /tmp/vector-test/single-stream.gz

Run vector with the config above. Expected output (order may vary):

multiple_1hello
multiple_2world
single_1hello
single_2world

Before this fix, multiple_2world was silently dropped because GzipDecoder stopped after the first member. To stress the path further, a third member was appended and all three were read correctly.
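
As an independent cross-check of the expected output, the same fixtures can be built and read back with Python's gzip module, which reads concatenated members. This sketch is an assumption-laden reference only: it writes to a temporary directory rather than /tmp/vector-test, does not run Vector, and the helper names `write_fixtures` and `read_lines` are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def write_fixtures(directory: Path) -> None:
    """Create one two-member gzip file and one single-member gzip file."""
    multi = directory / "multiple-stream.gz"
    # "wb" then "ab" produces two separate gzip members in one file,
    # mirroring `gzip -c > file` followed by `gzip -c >> file`.
    with gzip.open(multi, "wb") as f:
        f.write(b"multiple_1hello\n")
    with gzip.open(multi, "ab") as f:
        f.write(b"multiple_2world\n")

    single = directory / "single-stream.gz"
    with gzip.open(single, "wb") as f:
        f.write(b"single_1hello\nsingle_2world\n")


def read_lines(directory: Path) -> list:
    """Read every line from every .gz file, in sorted filename order."""
    lines = []
    for path in sorted(directory.glob("*.gz")):
        with gzip.open(path, "rt") as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines


with tempfile.TemporaryDirectory() as tmp:
    write_fixtures(Path(tmp))
    lines = read_lines(Path(tmp))

print("\n".join(lines))
```

If Vector's console output is missing a line that this script prints, the decoder is dropping a member rather than the fixture being malformed.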

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

@github-actions github-actions Bot added the domain: sources Anything related to the Vector's sources label Jun 12, 2026
@thomasqueirozb thomasqueirozb marked this pull request as ready for review June 15, 2026 14:16
@thomasqueirozb thomasqueirozb requested a review from a team as a code owner June 15, 2026 14:16
@thomasqueirozb thomasqueirozb added the source: file Anything `file` source related label Jun 17, 2026
@thomasqueirozb thomasqueirozb enabled auto-merge June 17, 2026 20:21
@thomasqueirozb thomasqueirozb added this pull request to the merge queue Jun 17, 2026
Merged via the queue into master with commit 5d41252 Jun 17, 2026
79 checks passed
@thomasqueirozb thomasqueirozb deleted the fix/file-source-gzip-multi-stream branch June 17, 2026 21:10
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 17, 2026

Development

Successfully merging this pull request may close these issues.

File source no longer can decompress concatenated gzip streams
