A simple data mining and scraping library for flood-related data available from the Dartmouth Flood Observatory Riverwatch. The application will let the user select the list of locations where water and flood data are available for a given region (http://floodobservatory.colorado.edu/India.htm), either from an Excel file or through automated detection from the region's HTML page.
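The automated-detection step could be sketched as follows. This is a minimal illustration, assuming (as the station URL below suggests) that the region index links each station as AMSR-E/Gaging/Reaches/&lt;ID&gt;.htm; the sample HTML and the function name are hypothetical, not part of the library.

```python
import re

# Hypothetical fragment of a region index page (assumption: station links
# follow the "AMSR-E/Gaging/Reaches/<ID>.htm" pattern used by Riverwatch).
index_html = """
<a href="AMSR-E/Gaging/Reaches/215.htm">Site 215</a>
<a href="AMSR-E/Gaging/Reaches/287.htm">Site 287</a>
"""

def find_station_ids(html):
    """Return the sorted, unique numeric station IDs linked from an index page."""
    return sorted({int(m) for m in re.findall(r"Reaches/(\d+)\.htm", html)})

print(find_station_ids(index_html))  # -> [215, 287]
```

In practice the HTML would be fetched from the region URL; the regex-based extraction stays the same.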
The user then provides a generic template based on the HTML page for one of the locations. For example, in http://floodobservatory.colorado.edu/AMSR-E/Gaging/Reaches/215.htm, 215 is the unique ID that changes for each station, while the rest of the link is generic and identical for all stations in the region. The template defines the variables of interest (Site ID, Latitude, Longitude, River, Mean Annual Runoff, Seven Day Total, etc.), the time series of daily flow measurements, and charts or other objects the user may want to pull together. The variable type is also defined in the interface. Additional features include defining other characteristics for a given variable based on HTML features such as colors or font sizes (Red - High, Yellow - Medium, Green - Low).
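Because only the numeric ID varies between station pages, the template idea reduces to simple string substitution. A minimal sketch (the function name and placeholder syntax here are illustrative assumptions, not the library's API):

```python
# Only the station ID changes between pages; the rest of the URL is fixed.
TEMPLATE = "http://floodobservatory.colorado.edu/AMSR-E/Gaging/Reaches/{site_id}.htm"

def station_url(site_id):
    """Build the page URL for a given station ID from the generic template."""
    return TEMPLATE.format(site_id=site_id)

print(station_url(215))
# -> http://floodobservatory.colorado.edu/AMSR-E/Gaging/Reaches/215.htm
```

The same substitution applied over the full list of detected IDs yields every page the scraper needs to visit.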
The interface must be able to detect different variable types (strings, doubles, geolocations, images, integers, time series, etc.). It will then automatically pull the data together, raising an error or exception in case data are missing. The final dataset will be available as a single aggregated file (Excel, NetCDF, or CSV) or as a group of files, depending on the number of variables chosen. The final output can be downloaded.
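The type-detection and aggregation steps could look like the sketch below: a coercion helper that guesses int, then float, then falls back to string, and a writer that aggregates records into a single CSV and raises when a chosen variable is missing. All names here are hypothetical; this is one possible shape, not the library's implementation.

```python
import csv
import io

def coerce(value):
    """Guess a scraped value's type: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def write_dataset(rows, fieldnames):
    """Aggregate scraped records into one CSV string.

    Raises KeyError if any record lacks one of the chosen variables,
    matching the "error in case data are missing" behaviour above.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        missing = [f for f in fieldnames if f not in row]
        if missing:
            raise KeyError(f"missing variables: {missing}")
        writer.writerow(row)
    return buf.getvalue()

records = [{"SiteID": coerce("215"), "Latitude": coerce("25.4")}]
print(write_dataset(records, ["SiteID", "Latitude"]))
```

The same record list could instead be handed to a NetCDF or Excel writer to produce the alternative output formats.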
ScraperWiki - a toolkit to scrape data off the web. Getting at such data isn't always easy: there's a table here, a report there, a few web pages, PDFs, spreadsheets… and it can be scattered over thousands of different places on the web, making it hard to see the whole picture and the story behind it.
Other offline applications can grab web content (e.g. HTML Grabber 1.0, Web Data Grabber), but none of them can mine very specific data out of the box or without additional customization.