Node.js + jQuery
https://code.tutsplus.com/tutorials/how-to-scrape-web-pages-with-nodejs-and-jquery--net-22478
This tutorial shows how to use Node.js with the Express web server framework, the Jade template engine, and the JSDOM parser (you may prefer the Handlebars template engine over Jade). It also makes use of the Node “request” and “url” modules.
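Below is a minimal sketch of the same idea, assuming the page is fetched with the request module and parsed with the current JSDOM constructor rather than the older jsdom.env API used in the tutorial; the URL is a placeholder.

```js
const request = require('request');
const { JSDOM } = require('jsdom');

// Fetch the page, build a DOM from the HTML, and attach jQuery to that window.
request('https://example.com', (error, response, body) => {
  if (error) return console.error(error);

  const dom = new JSDOM(body);
  const $ = require('jquery')(dom.window);

  // From here on, the usual jQuery selectors work server-side.
  console.log($('h1').text());
});
```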
Node.js + Cheerio
If performance is an issue, you can replace JSDOM and jQuery with Cheerio, which is roughly 16x faster than JSDOM and can still handle complex websites.
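A minimal Cheerio sketch, again assuming the HTML is fetched with the request module and using a placeholder URL:

```js
const request = require('request');
const cheerio = require('cheerio');

request('https://example.com', (error, response, body) => {
  if (error) return console.error(error);

  // cheerio.load() parses the HTML and returns a jQuery-like $ function.
  const $ = cheerio.load(body);

  // Print the text of every list item on the page.
  $('li').each((i, el) => {
    console.log($(el).text());
  });
});
```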
Node.js + Nightmare JS
Nightmare JS is similar to PhantomJS, but it is based on Electron.
https://github.com/segmentio/nightmare
The following example shows how to extract the H1 text and the text of all LI (list item) elements and write them to a file.
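A sketch of that example, assuming a placeholder URL and output filename:

```js
const Nightmare = require('nightmare');
const fs = require('fs');

Nightmare({ show: false })
  .goto('https://example.com')
  .evaluate(() => {
    // This runs inside the page (browser scope).
    const h1 = document.querySelector('h1').innerText;
    const items = Array.from(document.querySelectorAll('li')).map(li => li.innerText);
    return { h1, items };
  })
  .end()
  .then(result => {
    // Write the extracted data to disk as JSON.
    fs.writeFileSync('output.json', JSON.stringify(result, null, 2));
  })
  .catch(err => console.error(err));
```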
If you’d like to scrape a bunch of web pages, you can run Nightmare in a loop, using the Async library to process the URLs in batches and to make sure each page finishes before the next one starts.
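For example, a sketch using async.eachSeries so that one URL is fully processed before the next begins (the URLs are placeholders):

```js
const Nightmare = require('nightmare');
const async = require('async');

const urls = [
  'https://example.com/page1',
  'https://example.com/page2'
];

// Process one URL at a time; eachSeries waits for done() before moving on.
async.eachSeries(urls, (url, done) => {
  Nightmare({ show: false })
    .goto(url)
    .evaluate(() => document.querySelector('h1').innerText)
    .end()
    .then(h1 => {
      console.log(url, '->', h1);
      done();
    })
    .catch(done);
}, err => {
  if (err) return console.error(err);
  console.log('All URLs processed.');
});
```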
Saving to File
In all three cases, if you need to save your scraped data to a file, you can use the fs module, as shown in the Nightmare example above.
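If you prefer streaming writes, a minimal sketch with fs.createWriteStream (the filename and data are placeholders):

```js
const fs = require('fs');

// Stream scraped lines to a file instead of building one big string in memory.
const stream = fs.createWriteStream('output.txt');
stream.write('first scraped item\n');
stream.write('second scraped item\n');
stream.end();
```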
Saving to a File in a Non-existent Path
If the directory you are writing to doesn’t exist, fs.createWriteStream will fail with an ENOENT error; the path must already exist. In that case, you can create the path dynamically, as follows.
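A sketch using fs.mkdirSync with the recursive option (available since Node 10.12); the output path is a placeholder:

```js
const fs = require('fs');
const path = require('path');

const outputPath = 'data/scrapes/output.txt';

// Create any missing parent directories before opening the write stream.
fs.mkdirSync(path.dirname(outputPath), { recursive: true });

const stream = fs.createWriteStream(outputPath);
stream.write('scraped data goes here\n');
stream.end();
```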
Download HTML Source of Multiple Website URLs
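A sketch combining Nightmare and Async to save the full HTML source of each page to its own file (the URLs and filenames are placeholders):

```js
const Nightmare = require('nightmare');
const async = require('async');
const fs = require('fs');

const urls = [
  'https://example.com',
  'https://example.org'
];

async.eachSeries(urls, (url, done) => {
  Nightmare({ show: false })
    .goto(url)
    // Return the page's full HTML source from the browser scope.
    .evaluate(() => document.documentElement.outerHTML)
    .end()
    .then(html => {
      // Derive a simple filename from the URL's hostname.
      const filename = new URL(url).hostname + '.html';
      fs.writeFileSync(filename, html);
      done();
    })
    .catch(done);
}, err => {
  if (err) return console.error(err);
  console.log('Done downloading HTML sources.');
});
```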
To run the examples above, do the following:
- Install Node.js
- Put the script in a folder
- On the command line, go to that folder
- Install dependencies
- npm install --save nightmare
- npm install --save async
- Run the script
- node scrape.js
Browser Scope
Note that variables defined outside the evaluate() function are NOT accessible within it, because code inside evaluate() runs in the browser scope. Whatever code you can run in the browser console can also run inside evaluate().
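To use a Node-side value inside the page, Nightmare lets you pass extra arguments to evaluate(); a sketch with a placeholder URL:

```js
const Nightmare = require('nightmare');

// Defined in Node scope; not visible inside evaluate() unless passed in.
const selector = 'h1';

Nightmare({ show: false })
  .goto('https://example.com')
  // Extra arguments after the function are forwarded into the browser scope.
  .evaluate(sel => document.querySelector(sel).innerText, selector)
  .end()
  .then(text => console.log(text))
  .catch(err => console.error(err));
```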
Note: there’s also Headless Chrome and Puppeteer