Recently I needed to download the HTML source of 9,640 pages behind a login. At first I used PHP and cURL, which normally works just fine. With cURL, I first sent a request to the login page, passing my login info. Then, in the same script, I requested the pages I wanted to download and save to my local machine. That didn’t work, so I inspected the headers and set cURL’s header options to match them exactly. That didn’t work either: all I got back was a 301 Moved Permanently response. It seemed like I needed a way to make the remote server think I was no different from a regular user browsing in a web browser. So I tried JavaScript. I logged into the website, fired up Firebug, and pasted the following code into the Console tab’s command line:
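[cc lang="js"]
// Busy-wait for roughly msecs milliseconds. This blocks the page while it runs.
function wait(msecs) {
    var start = new Date().getTime();
    var cur = start;
    // Spin until the elapsed time reaches msecs.
    while (cur - start < msecs) {
        cur = new Date().getTime();
    }
}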
[/cc]