Scraping a book from Kindle (read.amazon.com)

After writing my latest post on scraping text from Amazon's web Kindle reader, I got an idea: how hard would it be to scrape an entire book with the same method?

Here's what I came up with:

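function hashString(str){
	// Polynomial (Java-style) string hash: hash = hash * 31 + charCode.
	// Math.imul multiplies with 32-bit wraparound, so long strings
	// cannot overflow to Infinity and collapse the hash.
	let hash = 0;
	for (let i = 0; i < str.length; i++) {
		hash = (Math.imul(hash, 31) + str.charCodeAt(i)) | 0; // Keep as 32bit integer
	}
	return hash;
}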

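var hashes = {};  // hashes of paragraphs already collected
var content = []; // HTML of each unique paragraph, in reading order

function addDiv(div){
	// Store the div only if its text has not been seen before.
	let hash = hashString(div.innerText);
	if (hashes[hash] === undefined) {
		hashes[hash] = true;
		content.push(div.outerHTML);
	}
}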

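var timeout = null; // run clearTimeout(timeout) in the console to stop the loop
function main() {
	// The reader app lives in an iframe, which in turn contains further iframes.
	var appFrame = document.querySelector('#KindleReaderIFrame').contentDocument;
	var contentFrames = Array.from(appFrame.querySelectorAll('iframe')).map(f => f.contentDocument);
	// contentFrames[1] is the frame the book text is read from.
	Array.from(contentFrames[1].querySelectorAll('body > div')).forEach(addDiv);
	// Schedule the next pass, then turn the page.
	timeout = setTimeout(main, 1000);
	appFrame.getElementById('kindleReader_pageTurnAreaRight').click();
	console.log('content', content.length);
}
main();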

Essentially, this script turns the page every second and looks for any new text in the frame the Kindle app uses to display the book. That frame acts as a sliding buffer of text: it holds more than what is displayed on the screen, and there is no guarantee that turning the page adds new content to it. However, the app does ensure that only whole paragraphs get added or removed, and that the paragraphs stay in order. I reused the string hash code function from an earlier post as an efficient way to check whether a paragraph had already been added.
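
To see why checking each paragraph against what has already been collected is enough, here is a minimal sketch that replays the idea on plain strings instead of the Kindle DOM. The `snapshots` array is made-up sample data standing in for what the content frame might hold after each page turn, and `collectUnique` is a hypothetical helper, not part of the script above. It keeps a `Set` of the paragraph text itself rather than numeric hashes, which rules out hash collisions at the cost of more memory.

```javascript
// Hypothetical helper: merges successive snapshots of a sliding text window,
// keeping each paragraph once, in the order it was first seen.
function collectUnique(snapshots) {
	const seen = new Set();  // paragraph text already collected
	const collected = [];    // paragraphs in first-seen order
	for (const paragraphs of snapshots) {
		for (const text of paragraphs) {
			if (!seen.has(text)) {
				seen.add(text);
				collected.push(text);
			}
		}
	}
	return collected;
}

// Made-up snapshots: each page turn may drop paragraphs from the front
// and append new ones at the end, so consecutive windows overlap.
const snapshots = [
	['p1', 'p2', 'p3'],
	['p1', 'p2', 'p3'],  // a page turn that loaded nothing new
	['p2', 'p3', 'p4'],
	['p4', 'p5'],
];

const result = collectUnique(snapshots);
console.log(result.join(',')); // prints "p1,p2,p3,p4,p5"
```

One caveat that applies to this sketch and to the hash-based version alike: a paragraph whose text genuinely repeats in the book (a one-word line of dialogue, say) is only kept the first time it appears.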

The end result is an array in which each element is the HTML of one paragraph. If you let the script run through the entire book, the array will hold the book's entire contents as HTML.
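
To turn that array into something readable, the fragments can be joined into a single HTML page. The following is only a sketch: `buildHtmlDocument`, `escapeHtml`, and the sample fragments are hypothetical names and data, and it assumes the fragments are already valid HTML (as `outerHTML` strings are), so only the title is escaped.

```javascript
// Hypothetical helper: escapes text so it is safe inside an HTML element.
function escapeHtml(text) {
	return text
		.replace(/&/g, '&amp;')
		.replace(/</g, '&lt;')
		.replace(/>/g, '&gt;')
		.replace(/"/g, '&quot;');
}

// Hypothetical helper: wraps an array of HTML fragments in a minimal page.
function buildHtmlDocument(fragments, title) {
	return [
		'<!DOCTYPE html>',
		'<html>',
		'<head><meta charset="utf-8"><title>' + escapeHtml(title) + '</title></head>',
		'<body>',
		fragments.join('\n'),
		'</body>',
		'</html>',
	].join('\n');
}

// Sample data standing in for the `content` array built by the scraper.
const sampleContent = ['<div>First paragraph.</div>', '<div>Second paragraph.</div>'];
const page = buildHtmlDocument(sampleContent, 'My <Book>');
console.log(page);
```

In the browser console you would pass the real `content` array instead of the sample data. The Chrome and Firefox consoles offer a `copy()` helper that should put the resulting string on the clipboard, for example `copy(buildHtmlDocument(content, 'Book'))`; it is a console convenience, not a standard DOM API.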