MDI Deeds
Links: Code (GitHub) | MDI Deeds | Peters Plan
This Python web-scraping application spiders a website of historical records and combines them into a single searchable web page.
Overview
As part of my research into the history of Acadia National Park, I recently came across a collection of the historical deeds for all of the land that is now part of Acadia. This simple website, part of the Mount Desert Island Cultural History Project and hosted by the MDI Historical Society, is a fascinating and useful resource—but the website itself is not searchable, and each deed has been given its own web page. This makes it very hard to search the collection of deeds for a particular name, location, or feature.
I thought it would be useful to scrape the content from each of these files and combine it into a single HTML page that could then be searched in the browser simply by using Ctrl + F.
Depending on the technologies used on the site, there are already easier ways to spider and save an entire website (or any portion of it); think wget. Google’s cache may also help. But I thought this would be a fun way to learn a little more about Scrapy, an open-source web scraping tool.
Configuring Scrapy involves setting up a spider that visits a starting URL (or series of URLs) and follows links in the format you specify. This takes some analysis of the structure of the particular site you hope to scrape. Once properly configured, running the spider extracts the data and writes it to a file.
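Here is a minimal sketch of what such a spider might look like. The class name, start URL, CSS selectors, and output folder are all placeholders; the real values depend on the markup of the deeds site.

```python
import scrapy
from pathlib import Path


class DeedsSpider(scrapy.Spider):
    """Follows links to individual deed pages and saves each one to its own file."""

    name = "deeds"
    # Hypothetical index page; the real start URL is the site's deed listing.
    start_urls = ["https://example.org/deeds/index.html"]

    def parse(self, response):
        # Follow every link that points to an individual deed page.
        # The CSS selector is an assumption about the site's markup.
        for href in response.css("a.deed-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_deed)

    def parse_deed(self, response):
        # Keep the main content block of the deed page (selector is a guess)
        # and write it to its own HTML file in an output folder.
        outdir = Path("deeds")
        outdir.mkdir(exist_ok=True)
        content = response.css("div.content").get() or response.text
        name = response.url.rstrip("/").split("/")[-1] or "index"
        if not name.endswith(".html"):
            name += ".html"
        (outdir / name).write_text(content, encoding="utf-8")
        self.log(f"Saved {outdir / name}")
```

A standalone spider like this can be run locally with scrapy runspider (or, inside a Scrapy project, with scrapy crawl), which fits the command-line deployment described below.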
After the spider ran and copied the content from each page into separate files in a folder, I created another Python script (combine-deeds.py) using the HTML parser Beautiful Soup to combine the content of all of those files into a single HTML file.
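As a rough sketch of that combining step, the script below reads every saved deed file from a deeds/ folder and appends its content to the body of one combined page. The folder name, output filename, and page title are assumptions, not necessarily what combine-deeds.py actually uses.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Skeleton page that will receive the content of every deed file.
combined = BeautifulSoup(
    "<html><head><title>MDI Deeds</title></head><body></body></html>",
    "html.parser",
)

for path in sorted(Path("deeds").glob("*.html")):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    # Use the <body> of each saved file if it has one, otherwise the whole fragment.
    fragment = soup.body or soup
    # Copy the child list first: appending an element to `combined` moves it,
    # which would otherwise disturb the iteration.
    for element in list(fragment.children):
        combined.body.append(element)
    # Visual separator between deeds in the combined page.
    combined.body.append(combined.new_tag("hr"))

Path("mdi-deeds.html").write_text(str(combined), encoding="utf-8")
```

The result is one static HTML file that any browser can search with Ctrl + F.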
Project type
Application
Roles & responsibilities
Primary role
Application Developer
Responsibilities
- Application development
Technologies used
- Python programming language
- Scrapy framework for spidering and extracting website data
- Beautiful Soup HTML/XML parser
Deployment method
- This application was executed locally from the command line and does not have a web interface. The single HTML file that resulted from the spidering and parsing is available on my website: MDI Deeds