
Indexing PDF for Searching Using Tika, Nokogiri, and Algolia

Figuring out where to start from

This is the first guide I’ve written, so bear with me, and please provide feedback!

I was working on a project that needed really powerful search capabilities that work across multiple languages and, above all, can search through file contents (I started with PDFs). I needed something working and I didn’t have a lot of time. Usually a quick Google search turns up a tutorial or guide on how to tackle a task like this, but this time I didn’t have any luck, so I decided I’d share my experience once I had something working.

I went ahead and looked at StackShare and found a category for Search as a Service. I had always heard amazing things about Elasticsearch, but I also checked out most of the other tools listed there and contacted their support teams asking for some guidance. I was surprised to get a response only two minutes later (literally) from Adam Surak, director of infrastructure at Algolia (a search service used, surprisingly, by Medium and even Twitch), and he helped me out quite a bit. He directed me towards Apache Tika, which, as their page states:

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Adam also recommended that I split the PDF into paragraphs, because that ensures the searched text isn’t too long and the relevancy is high.

Extracting PDF Contents

I was able to quickly and easily install Tika via Homebrew:

brew install tika

I’m going to use a PDF version of Dracula that I got for free from Planet PDF, with everything but the first 3 pages cut out. I ran:

tika -r dracula-shortened.pdf > dracula.html

The -r option is for pretty printing, and Tika outputs HTML for easy parsing (I used > to redirect the HTML output into a file called dracula.html). Let’s look at the metadata that we got out of that:

<head>
<meta name="date" content="2017-01-05T18:08:24Z"/>
<meta name="pdf:PDFVersion" content="1.7"/>
<meta name="xmp:CreatorTool" content="Acrobat PDFMaker 5.0 for Word"/>
<meta name="Keywords" content="free, PDF, ebook, ebooks, Planet PDF, download, classic, classics"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="subject" content="Free Planet PDF eBooks -- an assortment of some of the most popular classics. Free!"/>
<meta name="dc:creator" content="Bram Stoker"/>
<meta name="dcterms:created" content="2002-09-20T06:10:21Z"/>
<meta name="Last-Modified" content="2017-01-05T18:08:24Z"/>
<meta name="dcterms:modified" content="2017-01-05T18:08:24Z"/>
<meta name="dc:format" content="application/pdf; version=1.7"/>
<meta name="Last-Save-Date" content="2017-01-05T18:08:24Z"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="meta:save-date" content="2017-01-05T18:08:24Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Dracula"/>
<meta name="modified" content="2017-01-05T18:08:24Z"/>
<meta name="cp:subject" content="Free Planet PDF eBooks -- an assortment of some of the most popular classics. Free!"/>
<meta name="Content-Length" content="48741"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="creator" content="Bram Stoker"/>
<meta name="meta:author" content="Bram Stoker"/>
<meta name="dc:subject" content="free, PDF, ebook, ebooks, Planet PDF, download, classic, classics"/>
<meta name="meta:creation-date" content="2002-09-20T06:10:21Z"/>
<meta name="created" content="Fri Sep 20 09:10:21 AST 2002"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="3"/>
<meta name="Creation-Date" content="2002-09-20T06:10:21Z"/>
<meta name="resourceName" content="Dracula_NT-shortened.pdf"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="meta:keyword" content="free, PDF, ebook, ebooks, Planet PDF, download, classic, classics"/>
<meta name="Author" content="Bram Stoker"/>
<meta name="producer" content="Acrobat Distiller 5.0.5 (Windows)"/>
<meta name="access_permission:can_modify" content="true"/>
<title>Dracula</title>
</head>
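If you want those values in Ruby, here is a minimal sketch that turns the meta tags into a hash. It assumes the tags look exactly like the ones above (one self-closing tag with name first and content second), and extract_meta is a name I made up for illustration. A regex is only good enough for this regular output; it doesn’t decode HTML entities, and a real HTML parser is the safer choice in general.

```ruby
# Sketch only: matches Tika's regular <meta name="..." content="..."/> lines.
META_TAG = /<meta name="([^"]+)" content="([^"]*)"\s*\/>/

# Returns a hash of name => array of values, because a name can repeat
# (X-Parsed-By appears twice in the output above).
def extract_meta(html)
  html.scan(META_TAG).each_with_object(Hash.new { |h, k| h[k] = [] }) do |(name, content), meta|
    meta[name] << content
  end
end

# A few lines copied from the Tika output above.
sample_head = <<~HTML
  <head>
  <meta name="dc:creator" content="Bram Stoker"/>
  <meta name="dc:title" content="Dracula"/>
  <meta name="xmpTPg:NPages" content="3"/>
  <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
  <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
  <title>Dracula</title>
  </head>
HTML

meta = extract_meta(sample_head)
puts meta["dc:title"].first
puts meta["xmpTPg:NPages"].first.to_i
puts meta["X-Parsed-By"].length
```

Every value is kept as an array so that repeated names don’t silently overwrite each other.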

There’s a lot of useful data there, and you can use Tika to get metadata, detect the content language, and much more, but in our case we are more interested in the body. Here’s a snippet of what the body looks like (2nd page of the PDF):

<body>

<div class="page">
   <p/>
   <p>Dracula
   </p>
   <p>2 of 684
   </p>
   <p>Chapter 1
   </p>
   <p>Jonathan Harker’s Journal
      3 May. Bistritz.—Left Munich at 8:35 P.M., on 1st
   </p>
   <p>May, arriving at Vienna early next morning; should have
      arrived at 6:46, but train was an hour late. Buda-Pesth
      seems a wonderful place, from the glimpse which I got of
      it from the train and the little I could walk through the
      streets. I feared to go very far from the station, as we had
      arrived late and would start as near the correct time as
      possible.
   </p>
   <p>The impression I had was that we were leaving the
      West and entering the East; the most western of splendid
      bridges over the Danube, which is here of noble width
      and depth, took us among the traditions of Turkish rule.
   </p>
   <p>We left in pretty good time, and came after nightfall to
      Klausenburgh. Here I stopped for the night at the Hotel
      Royale. I had for dinner, or rather supper, a chicken done
      up some way with red pepper, which was very good but
      thirsty. (Mem. get recipe for Mina.) I asked the waiter,
      and he said it was called ‘paprika hendl,’ and that, as it was
      a national dish, I should be able to get it anywhere along
      the Carpathians. </p>
   <p/>
   <div class="annotation">
      <a href="http://www.planetpdf.com/mainpage.asp?eb"/>
   </div>
</div>

[Image: the 2nd page of the PDF, for comparison]

From the output we can see that each page is a div with page as its class, and each paragraph is the inner text of a paragraph element p.

This is very nicely formatted and really easy to parse.

Using Tika in Ruby

For me, the next step was to use Tika in Ruby (because my stack relies on Ruby on Rails). Some quick searching led me to Yomu, and using it in Ruby became as easy as:

require "yomu"
yomu = Yomu.new "dracula-shortened.pdf"
puts yomu.html

Parsing HTML Using Nokogiri

I have used Nokogiri in the past to parse HTML and it’s pretty easy in this case.

require "nokogiri"
require "yomu"

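# Returns true for the empty or whitespace-only strings Tika emitted for this PDF.
def invalid_paragraph?(str)
  disallowed_strings = [ "", " ", "\n", " \n" ]
  disallowed_strings.include?(str)
end

# Returns an array of { text:, page: } hashes, one per non-empty paragraph.
def get_pdf_paragraphs(filename)
  yomu = Yomu.new(filename)
  paragraphs = []
  # Parse Tika's HTML output into a Nokogiri document.
  doc = Nokogiri::HTML(yomu.html)
  page = 0 # zero-based, so the first page of the PDF is page 0

  # Each PDF page is a div with the class "page".
  doc.css('.page').each do |node|
    node.css('p').each do |paragraph|
      paragraph_text = paragraph.inner_text

      next if invalid_paragraph?(paragraph_text)

      paragraphs << { text: paragraph_text, page: page }
    end

    page += 1
  end

  paragraphs
end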

paragraphs = get_pdf_paragraphs("dracula-shortened.pdf")

I use a CSS selector to loop through every element with page as its class, and then I loop through all the paragraph elements p within that page. I increment the page variable after each page because knowing which page the text was found on is important for my use case; note that it starts at 0, so the first page of the PDF is page 0. I had a bunch of paragraphs that were just an empty string, a single space, a newline, or a space followed by a newline, so I added a simple function, invalid_paragraph?, to skip those pieces of text (I could’ve written a regular expression, but it didn’t seem worth it in this case, especially since I wanted an easily readable tutorial).
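For comparison, here is roughly what the regular expression version could look like. This is only a sketch: blank_paragraph? is a name I picked for illustration, and it is a bit broader than invalid_paragraph? because it treats any whitespace-only string as empty, not just the four strings listed above.

```ruby
# Matches strings that are empty or made up only of whitespace.
# [[:space:]] also covers tabs and non-breaking spaces, so this is
# slightly broader than the four literal strings in invalid_paragraph?.
BLANK = /\A[[:space:]]*\z/

def blank_paragraph?(str)
  str.match?(BLANK)
end

# Try it on the four original strings, a tab, and a real paragraph.
["", " ", "\n", " \n", "\t", "Chapter 1 \n"].each do |s|
  puts "#{s.inspect} => #{blank_paragraph?(s)}"
end
```

The \A and \z anchors matter here: they make the pattern match the whole string, so a paragraph that merely contains whitespace is not skipped.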

Let’s take a look at paragraphs (I cut out a good chunk of the PDF before pasting this output so I don’t have a really gigantic example):

[
    {
        :text => "Dracula \nBram Stoker \n",
        :page => 0
    },
    
    {
        :text => " \nThis eBook is designed and published by Planet PDF. For more free \neBooks visit our Web site at http://www.planetpdf.com/.",
        :page => 0
    },
    
    {
        :text => "Dracula \n",
        :page => 1
    },
    
    {
        :text => "2 of 684 \n",
        :page => 1
    },
    
    {
        :text => "Chapter 1 \n",
        :page => 1
    },
    
    {
        :text => "Jonathan Harker’s Journal \n3 May. Bistritz.—Left Munich at 8:35 P.M., on 1st \n",
        :page => 1
    },
    
    {
        :text => "May, arriving at Vienna early next morning; should have \narrived at 6:46, but train was an hour late. Buda-Pesth \nseems a wonderful place, from the glimpse which I got of \nit from the train and the little I could walk through the \nstreets. I feared to go very far from the station, as we had \narrived late and would start as near the correct time as \npossible. \n",
        :page => 1
    },
    
    {
        :text => "The impression I had was that we were leaving the \nWest and entering the East; the most western of splendid \nbridges over the Danube, which is here of noble width \nand depth, took us among the traditions of Turkish rule. \n",
        :page => 1
    },
    
    {
        :text => "We left in pretty good time, and came after nightfall to \nKlausenburgh. Here I stopped for the night at the Hotel \nRoyale. I had for dinner, or rather supper, a chicken done \nup some way with red pepper, which was very good but \nthirsty. (Mem. get recipe for Mina.) I asked the waiter, \nand he said it was called ‘paprika hendl,’ and that, as it was \na national dish, I should be able to get it anywhere along \nthe Carpathians. ",
        :page => 1
    },
]

And there you have it. Thanks to Tika, we easily split the PDF into paragraphs, and thanks to Nokogiri we have parsed it with extreme ease.
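One thing you can see in that output is that the text still carries the line wraps from the PDF layout (the " \n" sequences). If you’d rather index clean sentences, a small sketch like the following could run before uploading. The helper names normalize_text and normalize_paragraphs are mine, and collapsing all whitespace into single spaces is an assumption about what you want, not something Tika or Algolia requires.

```ruby
# Collapse every run of whitespace (including the " \n" line wraps that
# come from the PDF layout) into a single space and trim the ends.
def normalize_text(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

# Returns new hashes with cleaned text; the input array is left untouched.
def normalize_paragraphs(paragraphs)
  paragraphs.map { |para| para.merge(text: normalize_text(para[:text])) }
end

# Two records copied from the output above.
sample = [
  { text: "Dracula \n", page: 1 },
  { text: "The impression I had was that we were leaving the \nWest and entering the East; the most western of splendid \nbridges over the Danube, which is here of noble width \nand depth, took us among the traditions of Turkish rule. \n", page: 1 }
]

normalize_paragraphs(sample).each { |para| puts para[:text] }
```

The page numbers pass through unchanged, so the records can be uploaded in the same way as before.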

Getting the PDF Contents Into Algolia

After playing around with Algolia, I decided to use it because it was extremely easy to use and configure, and it returned search results extremely fast. Getting my PDF content searchable was as easy as doing this:

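require "algoliasearch"

# Replace the placeholders with your own Algolia application ID and API key.
Algolia.init(application_id: 'xxxx', api_key: 'xxxx')
index = Algolia::Index.new("books")

# Upload one record per paragraph, then make only the text attribute searchable.
index.add_objects(paragraphs)
index.set_settings({ "searchableAttributes" => ["text"] })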

Closing Notes

Now obviously you could add more to that example (maybe include the filename and the PDF author/title in each record, or add book IDs linked to your system), but it’s a nice, simple example to get started from. I’d like to thank all the people who worked on the tools I’ve used, because they’re awesome and incredible time savers.
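As a rough sketch of that idea, the paragraphs could be enriched before they are uploaded. Everything here is hypothetical: build_records, book_id, title, and author are names I made up for illustration, and the values would come from your own system or from the Tika metadata. The one Algolia-specific piece is objectID, the attribute Algolia uses to identify a record; giving it a stable value should let you update a paragraph later instead of adding a duplicate, but check the documentation for your client version.

```ruby
# Hypothetical helper: merges per-book fields into every paragraph record.
# book_id, title and author are illustrative names, not part of the
# original example.
def build_records(paragraphs, book_id:, title:, author:)
  paragraphs.each_with_index.map do |para, position|
    para.merge(
      objectID: "#{book_id}-#{position}", # stable ID: book plus paragraph position
      book_id: book_id,
      title: title,
      author: author
    )
  end
end

# Two records in the same shape that get_pdf_paragraphs returns.
paragraphs = [
  { text: "Dracula \n", page: 1 },
  { text: "Chapter 1 \n", page: 1 }
]

records = build_records(paragraphs, book_id: 42, title: "Dracula", author: "Bram Stoker")
puts records.first
```

The resulting records array could then be passed to index.add_objects in place of paragraphs.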

Note: invalid_paragraph? could be written in a more performant way using regexes, but I wanted something easy for all readers to understand.

Here’s the full code of the example as a GitHub Gist:

require "nokogiri"
require "yomu"
require "algoliasearch"

def invalid_paragraph?(str)
  disallowed_strings = [ "", " ", "\n", " \n" ]
  disallowed_strings.include?(str)
end

def get_pdf_paragraphs(filename)
  yomu = Yomu.new(filename)
  paragraphs = []

  doc = Nokogiri::HTML(yomu.html)

  page = 0

  doc.css('.page').each do |node|

    node.css('p').each do |paragraph|
      paragraph_text = paragraph.inner_text

      next if invalid_paragraph?(paragraph_text)

      paragraphs << { text: paragraph_text, page: page }
    end

    page += 1
  end

  paragraphs
end

paragraphs = get_pdf_paragraphs("dracula-shortened.pdf")

Algolia.init(application_id: 'xxxx', api_key: 'xxxx')

index = Algolia::Index.new("books")
index.add_objects(paragraphs)

index.set_settings({ "searchableAttributes" => ["text"] })

Thanks for reading!