Create an e-book from a website with Ruby
I’m going to spend the next two weeks without internet and I want to catch up on some reading. There are a few websites with articles I’d like to read, so I decided to create an e-book out of them.
Those articles have one thing in common: an archive page that provides a list of them. I could chuck them into Pocket, but that wouldn’t be much fun and it would involve lots of clicking. Let’s use Ruby instead.
The first thing to do is to scrape the list of links I’m interested in. Why not do it for this blog?
require 'nokogiri'
require 'open-uri'

archive_url = "https://chodounsky.com/archive/"
link_selector = ".content .archive li a"
domain = "https://chodounsky.com"

# Kernel#open handles URLs here thanks to open-uri; on Ruby 3+ use URI.open instead.
archive = Nokogiri::HTML(open(archive_url))
links = archive.css(link_selector).map { |a| domain + a["href"] }.reverse
We used Nokogiri to parse the archive page and selected all the links to articles with a simple CSS selector. You can be more creative depending on the page structure or your needs, for example when the archive is spread across multiple pages or when you want a specific order or filtering (see the sketch right after this paragraph), but for this example we’ll keep it simple and only reverse the list to start with the oldest articles.
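Purely as an illustration of that idea: if the archive were spread over several pages, you could walk them and filter the results. The page/N/ URL pattern and the year filter below are assumptions for the sketch, not something this blog’s archive actually uses.

links = (1..3).flat_map do |page|
  # assumed pagination URL; adjust to the real archive structure
  page_html = Nokogiri::HTML(open("#{archive_url}page/#{page}/"))
  page_html.css(link_selector).map { |a| domain + a["href"] }
end
# keep only posts whose URL mentions 2014, oldest first (made-up filter)
links = links.select { |url| url.include?("/2014/") }.reverse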
After that, we’ll create a simple data container for storing an article, our new book chapter. It will be able to give us an id and format itself into an HTML string, which we’ll save to a file later.
class Chapter
  attr_accessor :title, :content

  def initialize(title, content)
    @title = title
    @content = content
  end

  def id
    title.downcase.gsub(" ", "_").gsub(/[^0-9a-z_]/i, '')
  end

  def to_s
    <<-eos
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <title>#{title}</title>
    <style>
      img { max-width: 95%; }
    </style>
  </head>
  <body>
    #{content}
  </body>
</html>
    eos
  end
end
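Just to illustrate the id method (this snippet is not part of the final script), the title of this very article slugs like this:

chapter = Chapter.new("Create e-book from website with ruby", "<p>...</p>")
chapter.id # => "create_ebook_from_website_with_ruby"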
The next step is to scrape the content we are interested in. It’s an action, so we’ll wrap it in a service object.
class DownloadChapter
  attr_reader :index
  attr_accessor :article_selector, :title_selector, :configuration

  def initialize
    @index = 0
    yield(self)
  end

  def call(url)
    html = raw_html(url)
    title = html.css(title_selector).text
    article = html.css(article_selector).first
    images = download_images_from(article)
    content = replace_images(article, images).to_s
    save(Chapter.new(title, content))
    @index += 1
  end

  private

  # point the img tags at the downloaded local files and unwrap links
  def replace_images(article, images)
    article.css("img").each_with_index { |img, index| img["src"] = images[index] }
    article.css("a").each { |a| a.replace(a.children) }
    article
  end

  def domain
    @domain ||= configuration.fetch(:domain, "")
  end

  def raw_html(url)
    html = Nokogiri::HTML(open(url))
    configuration[:normalize].call(html) if configuration[:normalize]
    html
  end

  # save every image next to the generated chapters and return the local filenames
  def download_images_from(html)
    html.css("img").map do |img|
      url = domain + img["src"]
      filename = filename_prefix + "_" + url.split("/").last
      File.open(filename, 'wb') { |file| file << open(url).read }
      filename
    end
  end

  def filename_prefix
    '%03d' % @index
  end

  def save(chapter)
    File.open("#{filename_prefix}-#{chapter.id}.html", "w") { |f| f.write(chapter.to_s) }
  end
end
This service is the most complex piece of this small Ruby script. We want to download the appropriate content, meaning text and images. On the other hand, we want to skip comments, ads, sidebars and other distracting elements; that’s where the normalization method kicks in, but more about that later. Article content and title are defined with CSS selectors, and we have to provide a URL to scrape from.
This service has multiple responsibilities, ranging from downloading the images to saving the output to a file, but we are going to keep it in one class for the sake of simplicity. If you intend to do some serious programming, I recommend splitting the responsibilities into separate classes, as sketched below.
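For example, the image handling could live in its own object. This is only a sketch of one possible split, not part of the original class; it relies on the open-uri require from the top of the script and on the same domain and filename prefix ideas as above.

class ImageDownloader
  def initialize(domain, prefix)
    @domain = domain
    @prefix = prefix
  end

  # download every image in the article and return the local filenames
  def call(article)
    article.css("img").map do |img|
      url = @domain + img["src"]
      filename = "#{@prefix}_#{url.split('/').last}"
      File.open(filename, "wb") { |file| file << open(url).read }
      filename
    end
  end
end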
Let’s move to the final step and tie everything together.
download_chapter = DownloadChapter.new do |d|
  d.article_selector = ".post"
  d.title_selector = ".post h1"
  d.configuration = {
    domain: domain,
    normalize: -> (article) { article.css("footer").each { |node| node.remove } }
  }
end

links.each do |url|
  download_chapter.call(url)
end
First, we created the service and passed it a configuration. The configuration contains the domain name and the normalization method. The normalization method is important for stripping out the content we are not interested in. In this case it removes the comment section, but you can remove anything by matching CSS selectors.
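If the page is noisier than just a footer, the same normalization hook can strip several elements at once. The selectors below are invented for illustration and need to be adjusted to the site you are scraping:

aggressive_normalize = -> (article) do
  # hypothetical selectors; replace with whatever clutters your source page
  article.css("footer, aside, .sidebar, #comments, .advert").each { |node| node.remove }
end
# plug it in via d.configuration = { domain: domain, normalize: aggressive_normalize }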
The last three lines call the service for each link we scraped from the archive page. The whole script generates HTML files, with properly linked images, right in your folder.
You might wonder where we create the final product: an actual e-book. I have a Nook, so my preferred format is epub. It is a zipped archive of HTML pages with a bunch of files under a certain hierarchy. There are a few Ruby gems that export content into it, but I didn’t find any of them convenient enough to produce nice results compatible with my reader.
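For a rough idea of that hierarchy, a minimal epub typically unzips into something like the following (the OEBPS folder name and the individual file names are common conventions rather than strict requirements):

mimetype
META-INF/container.xml
OEBPS/content.opf
OEBPS/toc.ncx
OEBPS/000-first_chapter.html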
But there is an excellent e-book creator called Sigil, with which you can produce beautiful e-books with a table of contents and title pages really easily. I highly recommend it, and it is available for all major operating systems.
Oh, and if you own a Kindle and you are after the mobi format, don’t despair. Epub and mobi are convertible between each other, and you can use calibre for that job.
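If you have calibre installed, its ebook-convert command-line tool does the conversion in one step; the file names here are just placeholders:

ebook-convert my-book.epub my-book.mobi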