Data Scraping for Dartmouth Flood Data

October 5, 2011 - 12:40 -- Josh Goldstein
Revision #6
Washington D.C. or Global!

A simple data mining and scraping library for flood-related data available at Dartmouth Flood Observatory Riverwatch. The application will let the user select the list of locations where water and flood data are available for a given region (http://floodobservatory.colorado.edu/India.htm), either from an Excel file or by automated detection from the HTML page.
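As a rough illustration of the automated detection step, the sketch below collects station IDs from the links on a region page. It assumes that station links end in a numeric ID followed by ".htm", as in the example station URL in this description; the real page layout has not been checked, and the names StationLinkParser and find_station_ids as well as the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: station links end in "<numeric id>.htm".
STATION_HREF = re.compile(r"(\d+)\.htm$")


class StationLinkParser(HTMLParser):
    """Collects numeric station IDs from the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.station_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = STATION_HREF.search(href)
        # Keep the first occurrence of each ID, in page order.
        if match and match.group(1) not in self.station_ids:
            self.station_ids.append(match.group(1))


def find_station_ids(html):
    """Return the station IDs linked from a region page, without duplicates."""
    parser = StationLinkParser()
    parser.feed(html)
    return parser.station_ids


# Made-up HTML standing in for a downloaded region page.
sample = """
<a href="AMSR-E/Gaging/Reaches/215.htm">Site 215</a>
<a href="AMSR-E/Gaging/Reaches/216.htm">Site 216</a>
<a href="index.htm">Home</a>
"""
print(find_station_ids(sample))
```

Links without a numeric ID, such as the "Home" link in the sample, are ignored. A list read from an Excel file could be used in place of the detected IDs.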

The user will then provide a generic template based on the HTML page for one of the locations. For example, in http://floodobservatory.colorado.edu/AMSR-E/Gaging/Reaches/215.htm, 215 is the unique ID that changes for each station, while the rest of the link is generic and the same for all stations in the region. The template defines the variables of interest (Site ID, Latitude, Longitude, River, Mean Annual Runoff, Seven Day Total, etc.), the time series of daily flow measurements, and charts or other objects that the user may be interested in pulling together. Variable types will also be defined in the interface. Additional features include defining other characteristics for a given variable based on HTML features such as colors or font sizes (Red - High, Yellow - Medium, Green - Low).
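One possible shape for such a template is sketched below: a URL pattern with a placeholder for the station ID, plus a mapping from variable name to a regular expression and a type. The "Label: value" text layout, the TEMPLATE contents, the helper names, and the sample values are all assumptions made for illustration; the actual station pages may be laid out differently.

```python
import re
from html.parser import HTMLParser

# The station URL from the description, with the ID turned into a placeholder.
URL_TEMPLATE = (
    "http://floodobservatory.colorado.edu/AMSR-E/Gaging/Reaches/{station_id}.htm"
)

# Hypothetical template: variable name -> (regex with one group, converter).
# Assumes the page text contains "Label: value" pairs.
TEMPLATE = {
    "site_id": (r"Site ID:\s*(\d+)", int),
    "latitude": (r"Latitude:\s*(-?\d+(?:\.\d+)?)", float),
    "longitude": (r"Longitude:\s*(-?\d+(?:\.\d+)?)", float),
    "river": (r"River:\s*(\w+)", str),
}


def station_url(station_id):
    """Build the page URL for one station."""
    return URL_TEMPLATE.format(station_id=station_id)


class TextExtractor(HTMLParser):
    """Gathers the text content of a page, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_fields(html, template):
    """Apply the template to a page; variables not found are set to None."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(" ".join(extractor.chunks).split())
    record = {}
    for name, (pattern, convert) in template.items():
        match = re.search(pattern, text)
        record[name] = convert(match.group(1)) if match else None
    return record


# Made-up page fragment; the values are not real station data.
sample_page = (
    "<p>Site ID: 215</p><p>Latitude: 10.5</p>"
    "<p>Longitude: -20.25</p><p>River: Example</p>"
)
print(station_url(215))
print(extract_fields(sample_page, TEMPLATE))
```

Keeping the regular expressions in a plain dictionary means the user-facing interface only has to produce that dictionary, and the same extraction code can run unchanged for every station in the region.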

The interface needs to be able to detect different variable types (strings, doubles, geolocations, images, integers, time series, etc.). It will then automatically pull the data together, possibly raising an error in case data are missing. The final dataset will be available for download as a single aggregated file (Excel, NetCDF, or CSV) or as a group of files based on the number of variables chosen.
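The type detection and aggregation steps could look roughly like the sketch below, which covers only integers, doubles, and strings and writes CSV only. The names infer_type, to_csv, and MissingDataError are hypothetical, the records are made up, and raising on the first missing value is just one way to handle the missing-data case.

```python
import csv
import io


class MissingDataError(Exception):
    """Raised when a record lacks a value for a requested variable."""


def infer_type(value):
    """Guess a simple variable type from its text form."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def to_csv(records, fieldnames):
    """Aggregate per-station records into one CSV string."""
    buffer = io.StringIO()
    # Ignore any extra keys in a record that were not requested.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        missing = [name for name in fieldnames if record.get(name) is None]
        if missing:
            raise MissingDataError(
                f"station {record.get('site_id')}: missing {missing}"
            )
        writer.writerow(record)
    return buffer.getvalue()


# Made-up records standing in for scraped station pages.
records = [
    {"site_id": 1, "latitude": 10.5},
    {"site_id": 2, "latitude": 11.0},
]
print(infer_type("215"), infer_type("10.5"), infer_type("Example"))
print(to_csv(records, ["site_id", "latitude"]))
```

A variant that logs the missing stations and continues, instead of raising, may suit large regions better; Excel and NetCDF output would need additional libraries.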

Contact:

Hrishikesh Patel

hpatel@worldbank.org

Similar Projects and Resources: 

ScraperWiki - Toolkit to scrape data off the web.

https://scraperwiki.com/

But getting at it isn’t always easy. There's a table here, a report there, a few web pages, PDFs, spreadsheets… And it can be scattered over thousands of different places on the web, making it hard to see the whole picture and the story behind it.

Other offline applications can grab web content (e.g. HTML Grabber 1.0, Web Data Grabber), but none of them can mine very specific data without additional customization.

Qualitative Impact: 
It will automate the process of data grabbing and save users the time needed to download such datasets. The code could be adapted for similar applications.
Quantitative Impact: 
Community of users interested in satellite-based flow data measurements (50-100 people).
Problem Definition Category: 

Comments

There's now a cleaned-up CSV of significant floods thanks to RHOK Oxford, UK. See https://github.com/ghickman/weather/blob/master/data/floods.csv

For more on this solution see http://www.rhok.org/solutions/floodsource

Please contact me for more information.

 

Michael Saunby Dec 05, 2011

I haven't used ScraperWiki in months. It was unstable and almost impossible to use. I did discover the Ruby Mechanize and Nokogiri gems there, and they seem to be state of the art for building scrapers / parsers.

There's a nifty collection of scraping tools at the ProPublica site as well:

http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data

znmeb Dec 05, 2011
