Scraping Wikipedia with NightmareJS
I've written a lot of web scrapers in the past, and each approach has its benefits and drawbacks. When I need to do anything more complicated than extract some text from an HTML document, Nightmare has been my favorite way to do it. It's easy to use, fast, lends itself to readable code, and comes with a fully featured headless browser with JavaScript support.
Most recently, I was researching a hobby of mine: finding ghost towns to visit. Wikipedia is a great resource for them. Its list is more comprehensive than anything else I could find, and many of the towns have GPS coordinates. However, what I really wanted was a map, so I could see where each town was in relation to the others. Since I couldn't find one anywhere, I thought I would map it out myself with a custom Google Map. But the prospect of copying and pasting each of the 100+ towns into Google Maps seemed daunting.
So I decided to automate it. First I would need to get the list of ghost towns and their respective URLs from Wikipedia. Then I could grab the coordinates from each page and put them into a CSV file that Google Maps could import. Simple enough.
Installing Nightmare is as easy as running:
npm install nightmare
This installs a version of Electron in your node_modules folder. Whenever you initialize your script with Nightmare, it runs your scraping code inside a sandboxed browser.
Using Nightmare is a lot like scripting a browser. You initialize it with nightmare(), which is a lot like starting up the browser. Visiting new pages is done by calling its .goto(url) function. You can chain any number of these actions, and the Electron instance will evaluate them sequentially. .evaluate(function) allows you to execute JavaScript code within that instance; it gives you everything you'd expect in a browser console, including the document and window objects.
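Here is a minimal sketch of that pattern. The wikiUrl helper and the page it visits are my own illustrations, not the article's actual code:

```javascript
// Guarded require: Nightmare needs Electron, so this sketch still parses
// (and the helper still works) on a machine without it installed.
let Nightmare = null;
try {
  Nightmare = require('nightmare');
} catch (err) {}

// Hypothetical helper (an assumption, not from the original script):
// build a Wikipedia URL from a page title.
function wikiUrl(title) {
  return 'https://en.wikipedia.org/wiki/' + encodeURIComponent(title.replace(/ /g, '_'));
}

if (Nightmare) {
  Nightmare()
    .goto(wikiUrl('Ghost town'))
    .evaluate(() => document.title) // runs inside the page: document/window are in scope
    .end()                          // shut down the Electron instance when finished
    .then((title) => console.log(title))
    .catch((err) => console.error('Scrape failed:', err));
}
```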
I ran into some issues with Wikipedia's rate limiting, as I was fetching over 100 pages asynchronously. To get around this, I used Nightmare's .delay(ms) function, which pauses the browser for the specified number of milliseconds. I added the delay before even going to the page, to prevent too many sequential page loads.
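In context, the delay goes at the front of the chain. The 500 ms figure and the .geo selector (the class Wikipedia uses for its coordinate microformat) are my own assumptions; the article doesn't show its exact values:

```javascript
// Fetch the coordinate text from one Wikipedia page, pausing first so we
// don't trip the rate limiter. Takes a Nightmare instance as an argument.
function fetchCoords(nightmare, url) {
  return nightmare
    .delay(500)   // pause *before* the page load
    .goto(url)
    .evaluate(() => {
      // On many pages the .geo span holds "lat; lon"
      const geo = document.querySelector('.geo');
      return geo ? geo.textContent : null;
    });
}
```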
Nightmare uses asynchronous functions at its core; almost every operation returns a promise. That makes it a great use case for the ES2017 await keyword. Without it, you end up with some fairly nasty nested callbacks, which make the code a lot less readable.
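Putting the pieces together, the overall flow might look like the sketch below. The town list, the .geo selector, and the CSV layout are illustrative assumptions, not the article's actual script:

```javascript
// Hypothetical helper: format one CSV row, quoting the town name.
function toCsvRow(name, coords) {
  // coords arrives as "lat; lon" from Wikipedia's .geo span
  const [lat, lon] = coords.split(';').map((s) => s.trim());
  return `"${name.replace(/"/g, '""')}",${lat},${lon}`;
}

// Visit each town page in turn; each await resolves before the next
// .goto() starts, so the page loads stay strictly sequential.
async function scrapeTowns(nightmare, towns) {
  const rows = ['name,latitude,longitude'];
  for (const town of towns) {
    const coords = await nightmare
      .goto(town.url)
      .evaluate(() => {
        const geo = document.querySelector('.geo');
        return geo ? geo.textContent : null;
      });
    if (coords) rows.push(toCsvRow(town.name, coords));
  }
  await nightmare.end(); // close the Electron instance
  return rows.join('\n');
}
```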
As a last note, if you are running this on a Linux machine without X installed, like I was, Electron will fail silently. To get around this, I installed xvfb and ran my script with xvfb-run node ghost-town-scraper.js.