MDI Deeds
Links: Code (GitHub) | MDI Deeds | Peters Plan
This Python web-scraping application spiders a website of historical records and combines them into a single searchable web page.
Overview
As part of my research into the history of Acadia National Park, I recently came across a collection of the historical deeds for all of the land that is now part of Acadia. This simple website, part of the Mount Desert Island Cultural History Project and hosted by the MDI Historical Society, is a fascinating and useful resource—but the website itself is not searchable, and each deed has been given its own web page. This makes it very hard to search the collection of deeds for a particular name, location, or feature.
I thought it would be useful to scrape the content from each of these files and combine it into a single HTML page that could then be searched in the browser simply by using Ctrl + F.
Depending on the technologies used on the site, there are already easier ways to spider and save an entire website (or any portion of it); think wget. Google’s cache may also help. But I thought this would be a fun way to learn a little more about Scrapy, an open-source web scraping tool.
Configuring Scrapy involves setting up a spider that visits a starting URL (or series of URLs) and follows links in the format you specify. This takes some analysis of the structure of the particular site you hope to scrape. Once properly configured, running the spider extracts the data and writes it to a file.
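Here is a minimal sketch of what such a spider might look like. The class name, start URL, CSS selectors, and output folder are all placeholders; the real values depend on the markup of the deeds site.

```python
import scrapy
from pathlib import Path


class DeedsSpider(scrapy.Spider):
    """Follows links to individual deed pages and saves each one to its own file."""

    name = "deeds"
    # Hypothetical index page; the real start URL is the site's deed listing.
    start_urls = ["https://example.org/deeds/index.html"]

    def parse(self, response):
        # Follow every link that points to an individual deed page.
        # The CSS selector is an assumption about the site's markup.
        for href in response.css("a.deed-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_deed)

    def parse_deed(self, response):
        # Keep the main content block of the deed page (selector is a guess)
        # and write it to its own HTML file in an output folder.
        outdir = Path("deeds")
        outdir.mkdir(exist_ok=True)
        content = response.css("div.content").get() or response.text
        name = response.url.rstrip("/").split("/")[-1] or "index"
        if not name.endswith(".html"):
            name += ".html"
        (outdir / name).write_text(content, encoding="utf-8")
        self.log(f"Saved {outdir / name}")
```

A standalone spider like this can be run locally with scrapy runspider (or, inside a Scrapy project, with scrapy crawl), which fits the command-line deployment described below.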
After the spider ran and copied the content from each page into separate files in a folder, I created another Python script (combine-deeds.py) using the HTML parser Beautiful Soup to combine the content of all of those files into a single HTML file.
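As a rough sketch of that combining step, the script below reads every saved deed file from a deeds/ folder and appends its content to the body of one combined page. The folder name, output filename, and page title are assumptions, not necessarily what combine-deeds.py actually uses.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Skeleton page that will receive the content of every deed file.
combined = BeautifulSoup(
    "<html><head><title>MDI Deeds</title></head><body></body></html>",
    "html.parser",
)

for path in sorted(Path("deeds").glob("*.html")):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    # Use the <body> of each saved file if it has one, otherwise the whole fragment.
    fragment = soup.body or soup
    # Copy the child list first: appending an element to `combined` moves it,
    # which would otherwise disturb the iteration.
    for element in list(fragment.children):
        combined.body.append(element)
    # Visual separator between deeds in the combined page.
    combined.body.append(combined.new_tag("hr"))

Path("mdi-deeds.html").write_text(str(combined), encoding="utf-8")
```

The result is one static HTML file that any browser can search with Ctrl + F.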
Project type
Application
Roles & responsibilities
Primary role
Application Developer
Responsibilities
- Application development
Technologies used
- Python programming language
- Scrapy framework for spidering and extracting website data
- Beautiful Soup HTML/XML parser
Deployment method
- This application was executed locally from the command line and does not have a web interface. The single HTML file that resulted from the spidering and parsing is available on my website: MDI Deeds