Scraping Wikipedia with NightmareJS
I've written a lot of web scrapers in the past, and each approach has its benefits and drawbacks. When I need to do anything more complicated than extract some text from an HTML document, Nightmare has been my favorite way to do it. It's easy to use, fast, lends itself to readable code, and comes with a fully featured headless browser with JavaScript support.
Most recently, I was researching a hobby of mine: finding ghost towns to visit. Wikipedia is a great resource for them. Its list is more comprehensive than anything else I could find, and many of the towns have GPS coordinates. However, what I really wanted was a map, so I could see where each town was in relation to the others. Since I couldn't find one anywhere, I thought I would map it out myself with a custom Google Map. But the prospect of copying and pasting each of the 100+ towns into Google Maps seemed daunting.
So I decided to automate it. First I would need to get the list of ghost towns and their respective URLs from Wikipedia. Then I could grab the coordinates from each page and put them into a CSV file that Google Maps could import. Simple enough.
Installing Nightmare is as easy as running:
npm install nightmare
This installs a version of Electron in your node_modules folder. Whenever you initialize your script with Nightmare, it runs your scraping code inside a sandboxed browser.
Using Nightmare is a lot like scripting a browser. You initialize it with nightmare(), which is a lot like starting up the browser. Visiting new pages is done by calling its .goto(url) function. You can chain any number of these actions, and the Electron instance will evaluate them sequentially. .evaluate(function) allows you to execute JavaScript code within that instance; it gives you everything you'd expect in a browser console, including the document and window objects.
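Here is a minimal sketch of that pattern. The wikiUrl helper and the page it visits are my own illustrations, not the article's actual code:

```javascript
// Guarded require: Nightmare needs Electron, so this sketch still parses
// (and the helper still works) on a machine without it installed.
let Nightmare = null;
try {
  Nightmare = require('nightmare');
} catch (err) {}

// Hypothetical helper (an assumption, not from the original script):
// build a Wikipedia URL from a page title.
function wikiUrl(title) {
  return 'https://en.wikipedia.org/wiki/' + encodeURIComponent(title.replace(/ /g, '_'));
}

if (Nightmare) {
  Nightmare()
    .goto(wikiUrl('Ghost town'))
    .evaluate(() => document.title) // runs inside the page: document/window are in scope
    .end()                          // shut down the Electron instance when finished
    .then((title) => console.log(title))
    .catch((err) => console.error('Scrape failed:', err));
}
```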
I ran into some issues with Wikipedia's rate limiting, as I was fetching over 100 pages asynchronously. To get around this, I used Nightmare's .delay(ms) function, which pauses the browser for the specified number of milliseconds. I added the delay before even going to the page, to prevent too many sequential page loads.
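In context, the delay goes at the front of the chain. The 500 ms figure and the .geo selector (the class Wikipedia uses for its coordinate microformat) are my own assumptions; the article doesn't show its exact values:

```javascript
// Fetch the coordinate text from one Wikipedia page, pausing first so we
// don't trip the rate limiter. Takes a Nightmare instance as an argument.
function fetchCoords(nightmare, url) {
  return nightmare
    .delay(500)   // pause *before* the page load
    .goto(url)
    .evaluate(() => {
      // On many pages the .geo span holds "lat; lon"
      const geo = document.querySelector('.geo');
      return geo ? geo.textContent : null;
    });
}
```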
Nightmare uses asynchronous functions at its core; almost every operation returns a promise. That makes it a great use case for the ES2017 await keyword. Without it, you end up with some fairly nasty nested callbacks, which make the code a lot less readable.
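Putting the pieces together, the overall flow might look like the sketch below. The town list, the .geo selector, and the CSV layout are illustrative assumptions, not the article's actual script:

```javascript
// Hypothetical helper: format one CSV row, quoting the town name.
function toCsvRow(name, coords) {
  // coords arrives as "lat; lon" from Wikipedia's .geo span
  const [lat, lon] = coords.split(';').map((s) => s.trim());
  return `"${name.replace(/"/g, '""')}",${lat},${lon}`;
}

// Visit each town page in turn; each await resolves before the next
// .goto() starts, so the page loads stay strictly sequential.
async function scrapeTowns(nightmare, towns) {
  const rows = ['name,latitude,longitude'];
  for (const town of towns) {
    const coords = await nightmare
      .goto(town.url)
      .evaluate(() => {
        const geo = document.querySelector('.geo');
        return geo ? geo.textContent : null;
      });
    if (coords) rows.push(toCsvRow(town.name, coords));
  }
  await nightmare.end(); // close the Electron instance
  return rows.join('\n');
}
```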
As a last note, if you are running this on a Linux machine without X installed, like I was, Electron will fail silently. To get around this, I installed xvfb and ran my script with xvfb-run node ghost-town-scraper.js.