I used to work in market research and would spend hours cleaning up labels from surveys I’d programmed in online tools. The video below shows how easy it can be once you understand regular expressions and a little programming.
Regular expressions can have a steep learning curve, but they are well worth the effort if you regularly receive data that needs cleaning. You can also check out this post of mine showing how they can be used to automate your metrics.
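To give a feel for the kind of label cleanup described above, here is a minimal sketch in Python. It is not the script from the video. The label format (a leading question code such as "Q12a:", leftover HTML tags, uneven spacing) and the name `clean_label` are assumptions made up for illustration, so adjust the patterns to whatever your survey tool actually exports.

```python
import re


def clean_label(raw: str) -> str:
    """Tidy one exported survey label (hypothetical export format)."""
    # Drop leftover HTML tags such as <b> or </span>.
    text = re.sub(r"<[^>]+>", "", raw)
    # Drop a leading question code such as "Q1." or "Q12a:" (assumed format).
    text = re.sub(r"^\s*Q\d+[a-z]?[.:)]\s*", "", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Made-up sample labels, not real survey data.
labels = [
    "Q1. <b>How satisfied</b> are you   overall?",
    "  Q12a: Which brands have you   heard of? ",
]
cleaned = [clean_label(label) for label in labels]
print(cleaned)
```

Each substitution handles one kind of mess, which keeps the patterns short and makes it easy to add or remove a rule when the next export looks slightly different.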
Being able to programmatically navigate to a web page and scrape its contents reliably is one of the best things invented since sliced bread! I spent years manually going through pages and copying and pasting contents from IE to Excel, then spent even more time trying to clean them up. Done properly, you can get the data very close to how it appears on the web with little effort.
The video below walks through using AutoHotKey to obtain basic values from a web page. It also demonstrates a script I wrote that helps write the syntax (yes, I’m that lazy!). The AutoHotKey script I wrote is further down this page and can also be found on the AHK forum here.
In this beginning tutorial I show how to:
1) get a pointer to IE
2) navigate to a page
3) get text from a page
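For readers who want to see the "navigate, then get text" idea outside the video, here is a rough sketch using only the Python standard library. It is a swapped-in analogue, not the AutoHotKey and IE approach demonstrated in the tutorial: there is no browser pointer (step 1), the page is simply downloaded (step 2) and parsed (step 3). The names `TagText`, `fetch` and `text_of` and the sample HTML are assumptions for illustration only.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagText(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the wanted tag
        self.current = []   # text pieces of the element being read
        self.results = []   # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and normalise internal whitespace.
                self.results.append(" ".join("".join(self.current).split()))
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def fetch(url: str) -> str:
    """Step 2: download a page (needs network access; not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def text_of(html: str, tag: str) -> list:
    """Step 3: return the text inside each <tag> element."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample page so the sketch runs without a network connection.
sample = "<html><body><h1>Quarterly  <em>Metrics</em></h1><p>Updated daily.</p></body></html>"
headings = text_of(sample, "h1")
print(headings)
```

Unlike driving a real browser, this version never runs the page's JavaScript, so it only sees what is in the downloaded HTML. To try it on a live page, pass the result of `fetch(...)` to `text_of` instead of the sample string.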
When building my first scraping scripts, I used to waste a ton of time trying to figure out what was broken. Adding some structure to your diagnostic process can greatly speed up detecting what has gone wrong. A copy of the AutoHotKey syntax writer can be found here.
One excellent piece of advice, not exclusive to troubleshooting web scraping, is to keep a bobblehead doll or something similar to talk to. Pretend you’re explaining your issue to a friend; often, when you hear yourself say the words, the cause will become apparent.
This video offers some general tips for troubleshooting web scraping with AutoHotKey.