Skip to content

ber3ben/Case-Study-Library-Circulation-Data

Repository files navigation

README

An automated, Python-based data pipeline designed to ingest, clean, and analyze heavily formatted multi-table spreadsheets.

This project transforms raw, layout-shifted library administrative logs into organized, query-ready datasets and compiles them into a clean web-based dashboard.

The Challenge

Municipal and public services datasets often come in formatted Excel Workbooks rather than raw database dumps. These reports are designed for human eyes, not computers.

Key Obstacles Solved:

The "All-in-One" Sheet: Multiple distinct tables (Registration, Circulation, Event Attendance, and Computer Usage) were stacked vertically within a single spreadsheet tab. Layout Shifts: Shifting headers, empty rows, definitions, and floating "Totals" rows made simple programmatic loading impossible. Missing Identifiers (Implicit Hierarchy): County labels were only written once at the top of a group of towns, leaving the remaining rows blank (NaN).

The Goal:

Build a resilient script to slice these stacked tables, repair implicit labels, normalize numbers, and package them into an interactive dashboard.

Pipeline Design & Features

The pipeline is written in Python using a clean, readable sequence of extraction, transformation, visualization, and output compilation.

1. Extract & Align (Label Recovery)

Instead of hardcoding index ranges—which break as soon as next year's data adds a town—the script programmatically identifies boundary flags. It then uses forward-filling (ffill()) to cascade implicit county names down to each municipal row.

2. Segment & Normalize (Transform)

Using the dynamically assigned structural markers, the pipeline splits the master document into four clean, normalized tables:

Registration: Town-by-town service populations, library card numbers, and active users.

Circulation: A multi-year tracking index (2019–2023) separating physical vs. digital checkouts.

Event Attendance: Tracking program hosting counts and community attendance metrics.

Computer Use: Evaluating digital divide metrics like overall sessions and workstation minutes.

During this stage, non-numeric values are programmatically isolated, missing entries are filled with zeros, and column names are standardized.

4. Visual Discovery (Analytics & Dashboard)

Using Plotly Express, the engine automatically constructs four interactive visual models:

  • Ranked Bar Chart: Sorting town branches by total service population.
  • County Comparison: Grouping and summarizing regional populations.
  • Active Card Share: A distribution map of library user market share.
  • Population vs. Active Users: An interactive scatter plot comparing community size to engagement.

The pipeline wraps these figures in a standalone, responsive HTML grid (Library_Data_Dashboard.html), generating a polished executive portal without requiring a web server.

Analytics Highlights (What the Data Revealed)

The Pandemic Pivot

Physical checkouts dropped by -24.4% between 2019 and 2023, while Digital Circulation surged by +87.3% as patrons leaned into ebooks and remote access.

Event Recovery

Community events proved highly resilient, achieving a 114.9% recovery index score relative to pre-pandemic baselines. Workstation Gap: Workstation sessions remained at 51.2% of 2019 levels, suggesting a lingering shift in public computer usage patterns.

Repository Structure

├── data/

│ ├── Library-County-Details-2023.xlsx # Raw multi-table Excel source

│ └── Clean_Library_Registration_2023.csv # Pipeline registration export

├── outputs/

│ └── Library_Data_Dashboard.html # Generated standalone dashboard

├── library_registration_data_part5.ipynb # Main Jupyter Notebook pipeline

├── requirements.txt # Libraries (pandas, openpyxl, plotly)

└── README.md # Project documentation

Running the Project

1. Installation

Clone the repository and install the dependencies: git clone https://github.com/yourusername/library-etl-dashboard.git cd library-etl-dashboard pip install -r requirements.txt

2. Execution

Open the Jupyter Notebook in your preferred environment (VS Code, JupyterLab, or Google Colab) and run all cells.

The notebook will instantly: Load and process Library-County-Details-2023.xlsx. Generate the cleaned .csv database file. Build the interactive visualizations. Export the responsive web dashboard file directly to your workspace.

About

A data cleanup and analysis of past library data from the Seattle Public Library.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors