An automated, Python-based data pipeline designed to ingest, clean, and analyze heavily formatted multi-table spreadsheets.
This project transforms raw, layout-shifted library administrative logs into organized, query-ready datasets and compiles them into a clean web-based dashboard.
Municipal and public services datasets often come in formatted Excel Workbooks rather than raw database dumps. These reports are designed for human eyes, not computers.
The "All-in-One" Sheet: Multiple distinct tables (Registration, Circulation, Event Attendance, and Computer Usage) were stacked vertically within a single spreadsheet tab. Layout Shifts: Shifting headers, empty rows, definitions, and floating "Totals" rows made simple programmatic loading impossible. Missing Identifiers (Implicit Hierarchy): County labels were only written once at the top of a group of towns, leaving the remaining rows blank (NaN).
Build a resilient script to slice these stacked tables, repair implicit labels, normalize numbers, and package them into an interactive dashboard.
The pipeline is written in Python using a clean, readable sequence of extraction, transformation, visualization, and output compilation.
Instead of hardcoding index ranges—which break as soon as next year's data adds a town—the script programmatically identifies boundary flags. It then uses forward-filling (ffill()) to cascade implicit county names down to each municipal row.
Using the dynamically assigned structural markers, the pipeline splits the master document into four clean, normalized tables:
Registration: Town-by-town service populations, library card numbers, and active users.
Circulation: A multi-year tracking index (2019–2023) separating physical vs. digital checkouts.
Event Attendance: Tracking program hosting counts and community attendance metrics.
Computer Use: Evaluating digital divide metrics like overall sessions and workstation minutes.
During this stage, non-numeric values are programmatically isolated, missing entries are filled with zeros, and column names are standardized.
Using Plotly Express, the engine automatically constructs four interactive visual models:
- Ranked Bar Chart: Sorting town branches by total service population.
- County Comparison: Grouping and summarizing regional populations.
- Active Card Share: A distribution map of library user market share.
- Population vs. Active Users: An interactive scatter plot comparing community size to engagement.
The pipeline wraps these figures in a standalone, responsive HTML grid (Library_Data_Dashboard.html), generating a polished executive portal without requiring a web server.
Physical checkouts dropped by -24.4% between 2019 and 2023, while Digital Circulation surged by +87.3% as patrons leaned into ebooks and remote access.
Community events proved highly resilient, achieving a 114.9% recovery index score relative to pre-pandemic baselines. Workstation Gap: Workstation sessions remained at 51.2% of 2019 levels, suggesting a lingering shift in public computer usage patterns.
├── data/
│ ├── Library-County-Details-2023.xlsx # Raw multi-table Excel source
│ └── Clean_Library_Registration_2023.csv # Pipeline registration export
├── outputs/
│ └── Library_Data_Dashboard.html # Generated standalone dashboard
├── library_registration_data_part5.ipynb # Main Jupyter Notebook pipeline
├── requirements.txt # Libraries (pandas, openpyxl, plotly)
└── README.md # Project documentation
Clone the repository and install the dependencies: git clone https://github.com/yourusername/library-etl-dashboard.git cd library-etl-dashboard pip install -r requirements.txt
Open the Jupyter Notebook in your preferred environment (VS Code, JupyterLab, or Google Colab) and run all cells.
The notebook will instantly: Load and process Library-County-Details-2023.xlsx. Generate the cleaned .csv database file. Build the interactive visualizations. Export the responsive web dashboard file directly to your workspace.