MDI Deeds

This web-scraping application built in Python spiders a website of historical records and formats them into a single searchable web page.

  1. Overview
  2. Project type
  3. Roles & responsibilities
    1. Primary role
    2. Responsibilities
  4. Technologies used
  5. Deployment method

Overview

As part of my research into the history of Acadia National Park, I recently came across a collection of the historical deeds for all of the land that is now part of Acadia. This simple website, part of the Mount Desert Island Cultural History Project and hosted by the MDI Historical Society, is a fascinating and useful resource—but the website itself is not searchable, and each deed has been given its own web page. This makes it very hard to search the collection of deeds for a particular name, location, or feature.

I thought it would be useful to scrape the content from each of these files and combine them into a single HTML page that could then be searched in the browser simply by using Ctrl + F.

Depending on how a site is built, there are often easier ways to spider and save an entire website (or any portion of it); think wget. Google’s cache may also help. But I thought this would be a fun way to learn a little more about Scrapy, an open-source web scraping tool.
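For comparison, the wget approach might look like the sketch below. The URL and depth are hypothetical, not the actual MDI Historical Society site:

```shell
# Mirror one section of a site (URL and depth are placeholders)
wget --recursive --level=2 --no-parent \
     --accept html --convert-links \
     https://example.org/deeds/
```

The `--no-parent` flag keeps the crawl inside the deeds section, and `--convert-links` rewrites links so the saved copy browses locally.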

Configuring Scrapy involves setting up a spider that will visit a starting URL (or series of URLs) and follow links in the format you specify. This requires some analysis of the structure of the particular site you hope to scrape. Once properly configured, running the spider extracts the data and writes it to a file.

After the spider ran and copied content from each page into separate files in a folder, I created another Python script (combine-deeds.py) using the HTML parser Beautiful Soup to combine the content of all of those files into a single HTML file.
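The combining step could be sketched roughly as follows. This is not the original combine-deeds.py; the function name and the decision to keep only each page's body content are assumptions:

```python
from pathlib import Path

from bs4 import BeautifulSoup


def combine_deeds(input_dir: str, output_file: str) -> None:
    """Merge every scraped deed page into one searchable HTML file."""
    combined = BeautifulSoup(
        "<html><head><title>MDI Deeds</title></head><body></body></html>",
        "html.parser",
    )
    for path in sorted(Path(input_dir).glob("*.html")):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        # Keep only the page body so per-page head/script boilerplate
        # is dropped from the combined file.
        content = soup.body if soup.body else soup
        section = combined.new_tag("section")
        fragment = BeautifulSoup(content.decode_contents(), "html.parser")
        for child in list(fragment.contents):
            section.append(child)
        combined.body.append(section)
    Path(output_file).write_text(str(combined), encoding="utf-8")
```

With every deed wrapped in its own `<section>` of one file, a plain in-browser Ctrl + F search covers the whole collection.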


Project type

Application


Roles & responsibilities

Primary role

Application Developer

Responsibilities

  • Application development

Technologies used

  • Python programming language
  • Scrapy framework for spidering and extracting website data
  • Beautiful Soup HTML/XML parser

Deployment method

  • This application was executed locally from the command line and does not have a web interface. The single HTML file that resulted from the spidering and parsing is available on my website: MDI Deeds


© 2024 Jennifer Galas. All rights reserved.