add rss feed scraping #7
📝 Walkthrough
A new Go RSS scraper program fetches Cornell's music feed, parses XML items into time-slot grouped structures (Morning, Afternoon, Evening), and persists sessions, songs, and session-song relationships to the database. The system extracts song metadata (title, artist, source) from HTML-formatted descriptions and includes kudo/rating infrastructure.
Changes: RSS Feed Scraper with Sessions and Kudos
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
models/rss_feed.go (2)
104-138: No test coverage.
The PR description notes testing was done by running `main` and eyeballing output. Since `parseSong` and `parseDescription` are pure functions operating on strings, they're easy to unit-test with a small fixture of the feed's `description` XML. This will lock down the `(from "...")`, ` / `, and `<br>` handling against future RSS format drift.
Want me to draft `rss_feed_test.go` with table-driven tests for `parseSong` and `parseDescription`?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/rss_feed.go` around lines 104 - 138, The PR lacks unit tests for the pure string-parsing functions; add a new table-driven test file (rss_feed_test.go) that exercises parseSong and parseDescription with representative description HTML snippets from the feed (covering "(from \"...\")" source parsing, " / " artist separators, <br> line breaks, and edge cases like missing fields), asserting expected Song and TimeSlot outputs; name tests clearly, include fixtures for multiple item descriptions, and use subtests to make failures actionable so future RSS format drift is caught by CI.
72-102: Shadowed `current` and fragile `<br>` splitting.
Two readability/robustness concerns:
- The outer `var current *TimeSlot` (line 74) is immediately shadowed by the loop variable `current` (line 77) used as a string. It happens to work because the pointer is reassigned on line 96 before any read, but reusing the name makes this code confusing and brittle to future edits. Rename one of them (e.g., `chunk` for the loop variable).
- `strings.Split(desc, "<br>")` won't match `<br/>`, `<br />`, or `<BR>`, which RSS descriptions frequently contain. A single regex split handles all variants.
🔧 Suggested fix
-var tagPattern = regexp.MustCompile(`<[^>]+>`)
+var tagPattern = regexp.MustCompile(`<[^>]+>`)
+var brPattern = regexp.MustCompile(`(?i)<br\s*/?>`)
@@
-	var lines []string
-	for _, current := range strings.Split(desc, "<br>") {
-		for _, line := range strings.Split(current, "\n") {
-			lines = append(lines, line)
-		}
-	}
+	var lines []string
+	for _, chunk := range brPattern.Split(desc, -1) {
+		lines = append(lines, strings.Split(chunk, "\n")...)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/rss_feed.go` around lines 72 - 102, In parseDescription, avoid shadowing the outer pointer variable current and make <br> splitting robust: replace the loop "for _, current := range strings.Split(desc, \"<br>\")" with a loop over chunks (e.g., "for _, chunk := range ...") and split desc using a case-insensitive regex that matches "<br>", "<br/>", "<br />" variants (e.g., regexp.MustCompile(`(?i)<br\s*/?>`).Split(desc, -1)), then iterate lines by splitting each chunk on "\n"; ensure you update references to use the new chunk variable name and keep assigning the pointer current (TimeSlot pointer) only when creating a header slot.
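The splitting change discussed above can be tried in isolation. The following is a minimal sketch, not the PR's code: `splitLines` is a hypothetical helper name, and dropping blank lines after trimming is an added assumption that the suggested fix does not include.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// brPattern matches <br>, <br/>, <br /> and their upper-case variants.
var brPattern = regexp.MustCompile(`(?i)<br\s*/?>`)

// splitLines (hypothetical helper) breaks a description on any <br> variant
// and on newlines. Skipping blank lines is an assumption made for this sketch.
func splitLines(desc string) []string {
	var lines []string
	for _, chunk := range brPattern.Split(desc, -1) {
		for _, line := range strings.Split(chunk, "\n") {
			if line = strings.TrimSpace(line); line != "" {
				lines = append(lines, line)
			}
		}
	}
	return lines
}

func main() {
	desc := "Morning<br/>Song A / Artist A<BR />Song B\nSong C<br>"
	for i, line := range splitLines(desc) {
		fmt.Printf("%d: %q\n", i, line)
	}
}
```

Naming the loop variable `chunk` also avoids shadowing a `current` pointer declared in the enclosing function.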
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@models/rss_feed.go`:
- Around line 52-69: parseSong currently only applies originPattern to the
title, so lines like `Title / Artist (from "X")` leave Song.Source empty and the
origin stuck in Song.Artist; update parseSong to check originPattern against
both title and artist: first attempt to extract source from title (using
originPattern.FindStringSubmatch and ReplaceAllString), and if not found, try
the same extraction on artist, removing the matched text from artist when found;
ensure you TrimSpace both title and artist before assigning Song.Title and
Song.Artist and prefer the source found in title over one found in artist.
- Around line 104-118: The current main function uses http.Get (which uses
http.DefaultClient with no timeout), panics on errors, and feeds non-2xx
responses straight to xml.Unmarshal for the RSS type; change this to use an
http.Client with a reasonable Timeout (e.g., 10s), replace http.Get with
client.Do(req), check resp.StatusCode and return a descriptive error for any
non-2xx status before reading the body, read and unmarshal into the RSS struct
but propagate errors (return error) instead of calling panic so callers can
handle retries/logging; update the function signature (e.g., fetchRSS or
NewRSSFromURL) to return (*RSS, error) and reference the main function, http.Get
usage, resp.StatusCode check, xml.Unmarshal, and RSS type when making these
changes.
- Around line 1-10: The file currently declares "package main" which conflicts
with other files in the models directory; change the package declaration to
"package models" and refactor by moving only library code (type definitions and
helper functions) into a new models file (e.g., rss.go) under package models,
and create a separate command entrypoint (cmd/rss_scraper/main.go) with "package
main" that imports models and implements main() to fetch the feed and call the
library functions; ensure exported types/functions keep capitalized names used
by the new main and update any imports accordingly (look for the package
declaration and functions that parse/fetch RSS to split between models and the
cmd main).
Actionable comments posted: 1
🧹 Nitpick comments (1)
models/rss_feed.go (1)
52-102: ⚡ Quick win: Add table-driven tests for parser edge cases.
Given the format-sensitive parsing (`parseSong`/`parseDescription`), please add tests for delimiter variants (`<br>`, `<br/>`, `<br />`), source-in-artist cases, and blank/noise lines to prevent regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/rss_feed.go` around lines 52 - 102, Add table-driven tests to cover edge cases for the parseSong and parseDescription functions. Create test cases that verify different HTML line break variants (such as <br>, <br/>, and <br /> in the parseDescription function), scenarios where the source pattern appears in the artist portion of the song string in parseSong, and handling of blank or noise lines throughout the parsing flow. Use Go's testing patterns to define test slices with input and expected output pairs, then iterate through them to verify each parser function handles these edge cases correctly and prevent future regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@models/rss_feed.go`:
- Around line 77-81: The description parsing loop in the RSS feed tokenization
only splits on the literal string "<br>", but RSS feeds commonly contain "<br/>"
and "<br />" variants that should also be treated as line breaks. Before the
nested loop that splits on newlines, normalize all br tag variants (including
"<br/>", "<br />", and "<br>") into a common delimiter so the subsequent split
operations properly tokenize the description text and prevent headers and songs
from incorrectly merging together.
- Import models from tran/chimes_sessions branch
- Add Song, Session, SessionSong, Kudo, KudoType structs
- Implement GetOrCreateSong and GetOrCreateSession helpers
- Add RSS scraper to parse and store concert data
+ make db insertions idempotent
Dates from RSS were being interpreted as UTC instead of Eastern Time, causing them to appear one day early in the database. Use ParseInLocation to correctly parse as America/New_York timezone
…dOrCreateUser error handling
- Use middleware.UIDFrom + GetUserByFirebaseUID to correctly resolve the authenticated user in RegisterFCMToken and SendTestNotification instead of c.GetUint(userID), which always returned 0
- Distinguish gorm.ErrRecordNotFound from other DB errors in FindOrCreateUser so real DB failures aren't silently treated as missing users
Replace log.Printf with log.Fatalf for DB and Firebase init errors so the server exits immediately instead of continuing with nil dependencies that cause panics on the first request.
…ting app container
Actionable comments posted: 5
♻️ Duplicate comments (2)
models/rss_feed.go (2)
67-71: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Handle all `<br>` variants during description tokenization.
On Line 67-Line 71, splitting only on literal `<br>` misses `<br/>` and `<br />`, which can merge headers and songs into malformed slots.
Proposed fix
+var brPattern = regexp.MustCompile(`(?i)<br\s*/?>`)
+
 func ParseDescription(desc string) []TimeSlot {
 	var slots []TimeSlot
 	var current *TimeSlot
 	var lines []string
-	for _, current := range strings.Split(desc, "<br>") {
-		for _, line := range strings.Split(current, "\n") {
+	for _, segment := range brPattern.Split(desc, -1) {
+		for _, line := range strings.Split(segment, "\n") {
 			lines = append(lines, line)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/rss_feed.go` around lines 67 - 71, The nested loop structure in the description tokenization code that splits on literal `<br>` only handles one variant of HTML break tags and misses `<br/>` and `<br />`, causing headers and songs to be improperly merged. Before splitting the description in the loop starting with `for _, current := range strings.Split(desc, "<br>")`, normalize all HTML break tag variants by using a regex replacement or similar method to convert all occurrences of `<br>`, `<br/>`, and `<br />` (and their spacing variations) to a consistent delimiter like newline characters, then perform the split operations on the normalized string.
43-58: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Extract `Source` before splitting title/artist.
On Line 46-Line 55, `(from "...")` is only parsed from `title`, so feeds like `Title / Artist (from "X")` persist incorrect `Artist` and empty `Source`.
Proposed fix
 func parseSong(line string) ParsedSong {
 	song := ParsedSong{}
+	if match := originPattern.FindStringSubmatch(line); match != nil {
+		song.Source = match[1]
+		line = strings.TrimSpace(originPattern.ReplaceAllString(line, ""))
+	}
+
 	title, artist, found := strings.Cut(line, " / ")
 	if found {
 		song.Artist = strings.TrimSpace(artist)
 	}
-	match := originPattern.FindStringSubmatch(title)
-
-	if match != nil {
-		song.Source = match[1]
-		title = strings.TrimSpace(originPattern.ReplaceAllString(title, ""))
-	}
-	song.Title = title
+	song.Title = strings.TrimSpace(title)
 	return song
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/rss_feed.go` around lines 43 - 58, The parseSong function currently performs the title and artist split using strings.Cut before extracting the source pattern, which causes the source to be missed if it appears after the artist portion (e.g., "Title / Artist (from \"X\")"). Move the source pattern extraction logic using originPattern.FindStringSubmatch and originPattern.ReplaceAllString to execute before the strings.Cut operation, so that the source is extracted from the entire input line first, then the cleaned line is split into title and artist components.
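For reference, the extract-then-split order can be run standalone as below. This is a sketch under assumptions: the `originPattern` regex and the `ParsedSong` field set are guesses at what the PR defines, since neither definition appears in this conversation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ParsedSong mirrors the fields named in the review; the real struct may differ.
type ParsedSong struct {
	Title, Artist, Source string
}

// originPattern is an assumed definition of the PR's (from "...") matcher.
var originPattern = regexp.MustCompile(`\(from "([^"]*)"\)`)

// parseSong pulls the source out of the whole line first, then splits the
// remainder into title and artist on " / ".
func parseSong(line string) ParsedSong {
	song := ParsedSong{}
	if match := originPattern.FindStringSubmatch(line); match != nil {
		song.Source = match[1]
		line = strings.TrimSpace(originPattern.ReplaceAllString(line, ""))
	}
	title, artist, found := strings.Cut(line, " / ")
	if found {
		song.Artist = strings.TrimSpace(artist)
	}
	song.Title = strings.TrimSpace(title)
	return song
}

func main() {
	for _, line := range []string{
		`Title / Artist (from "X")`,
		`Title (from "X") / Artist`,
		`Title only`,
	} {
		fmt.Printf("%q -> %+v\n", line, parseSong(line))
	}
}
```

Because the source is removed before the split, it no longer matters whether `(from "...")` trails the title or the artist.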
🧹 Nitpick comments (1)
cmd/rss_scraper/main.go (1)
43-43: ⚡ Quick win: Prefer full HTML entity decoding over manual replacement.
Manual replacement misses common entities (for example `&quot;`), which can degrade song/source parsing quality.
Suggested refactor
 import (
 	"encoding/xml"
+	"html"
 	"io"
@@
-	cleanHTML := strings.NewReplacer("&lt;", "<", "&gt;", ">", "&amp;", "&").Replace(item.Description)
+	cleanHTML := html.UnescapeString(item.Description)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/rss_scraper/main.go` at line 43, The manual HTML entity decoding using strings.NewReplacer for the cleanHTML variable assignment only handles three entities (lt, gt, amp) and misses common ones like quot, causing parsing issues. Replace the strings.NewReplacer call with the html.UnescapeString function from Go's standard html package, which properly handles all HTML entities. Make sure to import the html package and update the cleanHTML assignment to use html.UnescapeString(item.Description) instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/rss_scraper/main.go`:
- Around line 20-29: The HTTP request using http.Get() lacks a deadline and the
response status code is never validated before parsing the body, which means
non-OK responses will be treated as valid RSS payload. Replace the http.Get()
call with http.NewRequestWithContext() using a context with a timeout, then
check that resp.StatusCode is http.StatusOK before proceeding to read and parse
the response body. Additionally, at line 81 where time.LoadLocation() is called,
capture and handle the returned error instead of silently ignoring it, as this
error will cause subsequent time.ParseInLocation() calls to fail without proper
diagnostics.
- Around line 81-82: In the parseConcertDate function, the error returned by
time.LoadLocation is currently being ignored (assigned to underscore), which
means if location loading fails, loc will be nil and time.ParseInLocation will
silently fall back to UTC instead of the intended America/New_York timezone.
Capture the error from time.LoadLocation by assigning it to a named variable
instead of the blank identifier, check if the error is not nil, and return the
error immediately to the caller rather than continuing with the parsing
operation. Since parseConcertDate already has an error return type, this error
propagation will be properly handled by callers.
In `@models/setup.go`:
- Line 42: The Session and Song structs lack unique constraints to prevent
duplicate rows from concurrent scraper runs using read-then-insert patterns. Add
GORM uniqueIndex struct tags to enforce database-level constraints: in the
Session struct, add uniqueIndex tags with a composite constraint name to the Day
and TimeOfDay fields, and in the Song struct, add uniqueIndex tags with a
matching composite constraint name to both the SongName and Artist fields. This
will ensure that the AutoMigrate call in setup.go creates the appropriate unique
constraints on the sessions and songs tables during migration, making
GetOrCreateSession and GetOrCreateSong operations idempotent across concurrent
calls.
- Around line 16-17: In the ConnectDatabase() function, the godotenv.Load() call
ignores its return value, which means any errors from parsing or permission
issues will be silently skipped. Capture the error return value from
godotenv.Load() and explicitly check if it is not nil. If an error occurs, log
it appropriately and return the error to maintain consistency with proper error
handling patterns used elsewhere in the ConnectDatabase() function.
In `@models/songs.go`:
- Around line 3-21: The GetOrCreateSong function has a race condition due to a
non-atomic get-or-create pattern where concurrent scraper runs can create
duplicate songs. To fix this, add a composite unique constraint on the
(song_name, artist) pair to the Song struct using GORM tags (hint: use
uniqueIndex tag), then refactor GetOrCreateSong to use atomic upsert by
replacing the separate First() and Create() calls with a single Create() call
that uses clause.OnConflict with Columns set to the constrained fields and
DoNothing: true, followed by a subsequent First() query to retrieve the actual
song record from the database. Apply the same fix to GetOrCreateSession which
has the identical race condition vulnerability.
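The composite unique constraints mentioned in the `models/setup.go` and `models/songs.go` comments could be declared roughly as follows. This is a sketch only: the field types and index names are assumptions, and the structs are trimmed to the fields named in the review. The example uses only the standard library and merely prints the tags, so it does not exercise GORM itself.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Song is a trimmed-down sketch; giving two fields the same uniqueIndex name
// asks GORM to build one composite unique index over both columns.
type Song struct {
	ID       uint
	SongName string `gorm:"uniqueIndex:idx_song_name_artist"`
	Artist   string `gorm:"uniqueIndex:idx_song_name_artist"`
}

// Session follows the same pattern for (Day, TimeOfDay); types are assumed.
type Session struct {
	ID        uint
	Day       time.Time `gorm:"uniqueIndex:idx_session_day_time"`
	TimeOfDay string    `gorm:"uniqueIndex:idx_session_day_time"`
}

// gormTag returns the gorm struct tag of the named field, or "" if absent.
func gormTag(model any, field string) string {
	f, ok := reflect.TypeOf(model).FieldByName(field)
	if !ok {
		return ""
	}
	return f.Tag.Get("gorm")
}

func main() {
	fmt.Println("Song.SongName:    ", gormTag(Song{}, "SongName"))
	fmt.Println("Song.Artist:      ", gormTag(Song{}, "Artist"))
	fmt.Println("Session.Day:      ", gormTag(Session{}, "Day"))
	fmt.Println("Session.TimeOfDay:", gormTag(Session{}, "TimeOfDay"))
}
```

Once the database enforces uniqueness, the get-or-create helpers can insert with `clause.OnConflict{..., DoNothing: true}` and then read the row back, as the review comment suggests.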
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8a85acad-53c3-406e-843b-e8c627e8dd7a
📒 Files selected for processing (6)
cmd/rss_scraper/main.go
models/kudos.go
models/rss_feed.go
models/sessions.go
models/setup.go
models/songs.go
Check for error before dereferencing response.FailureCount. When SendMulticast fails, response is nil, causing a panic. Return early if there's an error.
Copy /usr/share/zoneinfo from the builder to the final distroless image so the RSS scraper can load timezone data (America/New_York) without errors.
Check LoadLocation error instead of ignoring it. Fail fast if tzdata is unavailable rather than silently falling back to UTC.
Use context with 30s timeout on HTTP request to prevent indefinite hangs. Check response status code before parsing body as XML to catch 404/500 errors instead of trying to parse invalid RSS payloads.
Overview
Added rss_feed.go to scrape recent concert information from the Chimes RSS feed (https://apps.chimes.cornell.edu/music/rss.xml)
Changes Made
Test Coverage
Summary by CodeRabbit
`.env` file and expanding migrations for sessions, songs, session-song links, and kudos.