Improve scrapers content extraction #19

Open
mikesiez wants to merge 6 commits into main from improve-scrapers-content-extraction
@mikesiez

Implemented scraping with bs4 (BeautifulSoup) instead of trafilatura.
Reduced the number of unnecessary elements scraped, and fixed necessary elements, such as links and accordions, that were being ignored or improperly parsed.
Added offline test files in tests/ingestion/fixtures and a script to run assertions at tests/ingestion/offline_scraper.py.
Updated the dependency list to include bs4 via uv add beautifulsoup4.
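
For reviewers unfamiliar with bs4, here is a minimal sketch of the kind of extraction described above: strip page chrome, keep link targets, and include accordion content. It is not the code in this PR. The names `extract_content`, `NOISE_TAGS`, and `BLOCK_TAGS` are hypothetical, and it assumes accordions are marked up as `<details>`/`<summary>`, which may not match the sites this project scrapes.

```python
from bs4 import BeautifulSoup

# Hypothetical tag lists for illustration; the PR's actual rules may differ.
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]
BLOCK_TAGS = ["h1", "h2", "h3", "h4", "p", "li", "summary"]


def extract_content(html: str) -> str:
    """Return one line of text per block element, with links kept as 'text (href)'."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that rarely carry main content.
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()

    # Preserve link targets, which plain get_text() would otherwise discard.
    for a in soup.find_all("a", href=True):
        label = a.get_text(" ", strip=True)
        href = a["href"]
        a.replace_with(f"{label} ({href})" if label else href)

    # Prefer <main> when present, then <body>, then the whole document.
    root = soup.find("main") or soup.body or soup

    lines = []
    for block in root.find_all(BLOCK_TAGS):
        # Skip blocks nested in another block (e.g. <p> inside <li>) to avoid duplicates.
        if block.find_parent(BLOCK_TAGS):
            continue
        text = block.get_text(" ", strip=True)
        if text:
            lines.append(text)
    return "\n".join(lines)


SAMPLE = """
<html><body>
<nav><a href="/home">Home</a></nav>
<main>
  <h1>Fees</h1>
  <p>See the <a href="/docs">docs</a> for details.</p>
  <details><summary>When is it due?</summary><p>By the first of the month.</p></details>
</main>
<script>track()</script>
</body></html>
"""

result = extract_content(SAMPLE)
print(result)
```

Because `<summary>` and the paragraphs inside `<details>` are read straight from the static HTML, collapsed accordion content is included without any JavaScript. Text that sits outside the listed block tags is skipped in this sketch.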

Linked to issue #6

mikesiez and others added 3 commits June 14, 2026 20:12
mikesiez linked an issue Jun 20, 2026 that may be closed by this pull request
mikesiez requested a review from AJaccP June 20, 2026 15:37
Development

Successfully merging this pull request may close these issues.

Improve scraper's content extraction