DataRescue Haverford

#DataRescueHC

A few helpful links for the day:

Join our Slack team to communicate with other participants

Review our Code of Conduct

See our guides for each role

Seeders

Seeders and Sorters canvass our assigned sections of the NOAA and Department of Energy websites, identifying important URLs. They determine whether those URLs can be crawled by the Internet Archive's webcrawler. Crawlable URLs are nominated to the End-of-Term (EOT) project, while pages with uncrawlable data are flagged using the project's Chrome Extension.

Instructions for Seeding
What is Crawlable?
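
For a rough sense of what the crawlability check involves, the sketch below applies two simple heuristics: non-HTTP(S) schemes (such as FTP) and robots.txt disallow rules both keep a page out of a standard web crawl. The URLs and user-agent string here are illustrative only; the actual decision should follow the "What is Crawlable?" guide and be recorded through the EOT nomination tool and the Chrome Extension, not a script like this.

```python
# Rough, illustrative heuristic for judging crawlability.
# The URLs below are hypothetical examples.
from urllib.parse import urlparse
from urllib import robotparser

def looks_crawlable(url, user_agent="ia_archiver"):
    """Return True if a simple static crawl of `url` seems possible."""
    parts = urlparse(url)
    # FTP endpoints and other non-HTTP schemes are not reachable by a web crawler.
    if parts.scheme not in ("http", "https"):
        return False
    # Pages disallowed by robots.txt will not be archived.
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # no robots.txt reachable; assume allowed
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    for candidate in ("https://www.noaa.gov/", "ftp://ftp.example.gov/data/"):
        verdict = "nominate to EOT" if looks_crawlable(candidate) else "flag as uncrawlable"
        print(candidate, "->", verdict)
```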

Researchers

Researchers inspect the "uncrawlable" list to confirm that Seeders' assessments were correct (that is, that the URL/dataset is indeed uncrawlable) and investigate how the dataset could best be harvested.

Instructions for Researchers
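
As a rough illustration of the kind of triage a Researcher might do, the sketch below probes a flagged URL and notes features that often explain why a crawler cannot capture the underlying data, such as a query-driven search interface or a direct file download rather than an HTML page. The URL is hypothetical, and this script is not part of the official workflow.

```python
# Illustrative triage of a flagged URL: confirm it responds and note
# characteristics that usually mean a crawler cannot capture the data.
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def triage(url):
    notes = []
    if urlparse(url).query:
        notes.append("query-string interface: data likely sits behind a search form")
    req = Request(url, method="HEAD", headers={"User-Agent": "DataRescue-triage"})
    try:
        with urlopen(req, timeout=10) as resp:
            ctype = resp.headers.get("Content-Type", "unknown")
            notes.append(f"responds with status {resp.status}, Content-Type: {ctype}")
            if "text/html" not in ctype:
                notes.append("direct file download: may be harvestable with a scripted fetch")
    except OSError as err:
        notes.append(f"request failed ({err}); check manually in a browser")
    return notes

if __name__ == "__main__":
    for note in triage("https://www.example.gov/data-tool?region=NE"):
        print("-", note)
```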

Harvesters

Harvesters take the "uncrawlable" data and work out how to actually capture it, based on the Researchers' recommendations. This is a complex task that can require substantial technical expertise, and different kinds of data call for different techniques. Harvesters should see the Harvesting Toolkit for more details and tools.

DataRefuge's Harvester Toolkit
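
Harvesting approaches vary, but one common pattern is scripting the download of a known list of data files and recording checksums so that Baggers can verify the copies later. The sketch below shows that pattern with hypothetical URLs and paths; the Harvesting Toolkit describes the project's actual procedures and tools.

```python
# Fetch a list of data files and record SHA-256 checksums for later verification.
# URLs and output paths are hypothetical examples.
import hashlib
import os
from urllib.parse import urlparse
from urllib.request import urlopen

DATA_URLS = [
    "https://www.example.gov/files/station_obs_2016.csv",
    "https://www.example.gov/files/station_obs_2015.csv",
]
OUT_DIR = "harvest/data"

def harvest(urls, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    manifest = {}
    for url in urls:
        filename = os.path.basename(urlparse(url).path)
        dest = os.path.join(out_dir, filename)
        sha256 = hashlib.sha256()
        # Stream the download to disk while hashing each chunk.
        with urlopen(url, timeout=30) as resp, open(dest, "wb") as fh:
            for chunk in iter(lambda: resp.read(65536), b""):
                fh.write(chunk)
                sha256.update(chunk)
        manifest[filename] = sha256.hexdigest()
    return manifest

if __name__ == "__main__":
    for name, digest in harvest(DATA_URLS, OUT_DIR).items():
        print(digest, name)
```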

Baggers

Baggers perform quality assurance on the harvested dataset to make sure the content is correct and corresponds to the original URL. They then package the data into a BagIt file (or "bag"), which includes basic technical metadata, and upload it to its final DataRefuge destination.

Instructions for Installing Bagging Prerequisites
Instructions for Bagging
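
As a minimal sketch of the packaging step, the example below uses the Library of Congress bagit-python library (pip install bagit) to turn a harvested directory into a bag and validate its checksums. The directory path and metadata values are hypothetical; the bagging instructions above define the required metadata fields and the final DataRefuge upload step.

```python
# Create and validate a BagIt bag with the bagit-python library.
# Directory path and metadata values below are hypothetical examples.
import bagit

# Converts harvest/data into a bag in place, adding bag-info.txt metadata
# and checksum manifests for every payload file.
bag = bagit.make_bag(
    "harvest/data",
    {
        "Source-Organization": "DataRescue Haverford",
        "External-Identifier": "https://www.example.gov/files/",  # original URL
    },
)

# Verify the checksums before uploading to the DataRefuge destination.
bag.validate()
print("bag is valid:", bag.is_valid())
```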