
DataRescue Haverford
#DataRescueHC
A few helpful links for the day:
Join our Slack team to communicate with other participants
Review our Code of Conduct
See our guides for each role
Seeders
Seeders and Sorters canvass our assigned sections of the NOAA and Department of Energy websites, identifying important URLs and determining whether each one can be crawled by the Internet Archive's web crawler. Crawlable URLs are nominated to the End-of-Term (EOT) project, and pages with uncrawlable data are flagged using the project's Chrome extension.
Instructions for Seeding
What is Crawlable?
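One quick check a Seeder can run is whether a site's robots.txt blocks the Internet Archive's crawler. The Python sketch below assumes the crawler identifies itself as "archive.org_bot"; keep in mind that robots rules are only one reason a page may be uncrawlable, and dynamically generated content is the more common one.

    # Minimal sketch (not part of the official seeding tools): check whether
    # a URL is disallowed by the site's robots.txt for a given crawler.
    # "archive.org_bot" is an assumption about the Internet Archive's user agent.
    from urllib import robotparser
    from urllib.parse import urlparse, urlunparse

    def blocked_by_robots(url, user_agent="archive.org_bot"):
        parts = urlparse(url)
        robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()  # fetch and parse robots.txt
        return not rp.can_fetch(user_agent, url)

    print(blocked_by_robots("https://www.noaa.gov/"))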
Researchers
Researchers inspect the "uncrawlable" list to confirm that Seeders' assessments were correct (that is, that the URL/dataset is indeed uncrawlable), and investigate how the dataset could best be harvested.
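A quick first probe can help with that assessment. The Python sketch below (assuming the requests library is installed) sends a HEAD request to see whether a nominated URL points at a static file the crawler could fetch directly, or at an HTML page whose data is served dynamically and will need a custom harvest.

    # Minimal sketch: inspect status, content type, and size of a nominated URL.
    import requests

    def probe(url):
        resp = requests.head(url, allow_redirects=True, timeout=30)
        return {
            "status": resp.status_code,
            "content_type": resp.headers.get("Content-Type"),
            "size_bytes": resp.headers.get("Content-Length"),
        }

    print(probe("https://www.energy.gov/"))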
Instructions for Researchers
Harvesters
Harvesters take the "uncrawlable" data and work out how to actually capture it, based on the recommendations of the Researchers. This is a complex task that can require substantial technical expertise, and different datasets call for different techniques. Harvesters should see the Harvester Toolkit for more details and tools.
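As one illustration of the simplest case, when a page exposes direct links to data files, a short script can fetch them. The Python sketch below assumes the requests library is installed and uses a hypothetical file URL; real harvests often need scraping, API calls, or recursive wget instead, which is what the toolkit covers.

    # Minimal sketch: download a list of direct file URLs into a local directory.
    import os
    import requests

    def harvest(urls, out_dir="harvest"):
        os.makedirs(out_dir, exist_ok=True)
        for url in urls:
            name = url.rstrip("/").split("/")[-1] or "index.html"
            path = os.path.join(out_dir, name)
            with requests.get(url, stream=True, timeout=60) as resp:
                resp.raise_for_status()
                with open(path, "wb") as fh:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        fh.write(chunk)

    harvest(["https://example.gov/data/file1.csv"])  # hypothetical URL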
DataRefuge's Harvester Toolkit
Baggers
Baggers perform quality assurance on the dataset to make sure the content is correct and corresponds to the original URL. They then package the data into a BagIt file (or "bag"), which includes basic technical metadata, and upload it to its final DataRefuge destination.
Instructions for Installing Bagging Prerequisites
Instructions for Bagging
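One common way to create such a bag is the Library of Congress bagit-python package (pip install bagit). Whether the event's own instructions use exactly this tool is an assumption here, and the metadata fields shown are illustrative rather than the required DataRefuge set.

    # Minimal sketch: turn a harvested directory into a BagIt "bag" in place,
    # adding checksums and a bag-info.txt with basic technical metadata.
    import bagit

    bag = bagit.make_bag("harvest", {
        "Source-Organization": "DataRescue Haverford",       # example value
        "External-Identifier": "https://example.gov/data",   # original URL (hypothetical)
    })

    # Verify checksums and completeness before uploading to DataRefuge.
    print(bag.is_valid())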