Figuring Out Where to Start
This is the first guide I’ve written, so bear with me, and please provide feedback!
I was working on a project that required some really powerful search capabilities that work across multiple languages, especially for searching through file contents (I started with PDFs). I needed something working and I didn’t have a lot of time. Usually a quick Google search turns up a tutorial or guide on how to tackle a task like this, but this time I didn’t have any luck, so I decided I’d share my experience once I had something working.
I went ahead and looked at StackShare and found a category for Search as a Service. I had always heard amazing things about Elasticsearch, but I also checked out most of the other tools available there and contacted their support teams asking for some guidance. I was surprised to get a response only two minutes later (literally) from Adam Surak, director of infrastructure at Algolia (a search service used by Medium, surprisingly, and even Twitch), and he helped me out quite a bit. He directed me towards Apache Tika, which, as their page states:
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Adam also recommended that I split the PDF into paragraphs, because that ensures the searched text isn’t too long and that the relevancy is high.
Extracting PDF Contents
I was able to quickly and easily acquire Tika via Homebrew:
brew install tika

I’m going to use a PDF version of Dracula (cut down to just the first 3 pages) that I acquired for free from PDFPlanet, and I ran:
tika -r dracula-shortened.pdf > dracula.html

The -r option is for pretty-printing, and Tika outputs HTML for easy parsing (I used > to redirect the HTML output into a file called dracula.html). Let’s look at the metadata that we got out of that:
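Tika reports the metadata as meta tags in the head of the generated HTML. The exact fields depend on the file, but for a PDF it looks something like this (an illustrative sketch, not my exact output):

<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="xmpTPg:NPages" content="3"/>
<meta name="Content-Type" content="application/pdf"/>
<title>Dracula</title>
</head>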
There’s a lot of useful data there, and you can use Tika to get metadata, detect content language, and plenty of other powerful things, but in our case we are more interested in the body. Comparing the HTML output for the 2nd page against the 2nd page of the PDF itself, we can see that each page is a div with page as its class, and each paragraph is the inner text of a paragraph element p. This is very nicely formatted and really easy to parse.
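Schematically, the body boils down to this shape (an illustrative sketch rather than the literal Dracula output):

<body>
<div class="page">
<p>...first paragraph of page 1...</p>
<p>...second paragraph of page 1...</p>
</div>
<div class="page">
<p>...first paragraph of page 2...</p>
</div>
</body>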
Using Tika in Ruby
For me, the next step was to use Tika in Ruby (because my stack relies on Ruby on Rails). Some quick searching led me to Yomu, and using Tika in Ruby became as easy as:
require "yomu"
yomu = Yomu.new "dracula-shortened.pdf"
puts yomu.html

Parsing HTML Using Nokogiri
I have used Nokogiri in the past to parse HTML, and it’s pretty easy in this case.
I use a CSS selector to loop through anything with page as its class, then loop through all the paragraph elements p within that page, incrementing a page variable as I go (knowing which page the text was found on is important for my use case). I had a bunch of empty-string paragraphs, paragraphs that were just a single space character or a newline character, and some that were a space followed by a newline, so I added a simple function, invalid_paragraph?, to skip those pieces of text (I could’ve written a regular expression, but it didn’t seem worth it in this case, especially since I wanted an easily readable tutorial).
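Putting that together, the parsing step looks roughly like this (a minimal sketch based on the description above, reusing the yomu object from earlier; not the verbatim code from my project):

require "nokogiri"

# Skip empty strings and the whitespace-only variants mentioned above
def invalid_paragraph?(text)
  ["", " ", "\n", " \n"].include?(text)
end

doc = Nokogiri::HTML(yomu.html)
paragraphs = []
page = 0

# Each page is a div with the class "page"
doc.css(".page").each do |page_div|
  page += 1
  # Each paragraph is the inner text of a p element
  page_div.css("p").each do |p|
    next if invalid_paragraph?(p.text)
    paragraphs << { "text" => p.text, "page" => page }
  end
end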
Let’s take a look at paragraphs (I cut out a good chunk of the PDF before pasting this output so I don’t have a really gigantic example):
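With the shortened PDF, the output has roughly this shape (illustrative; the exact strings depend on which pages you keep):

[{"text"=>"DRACULA", "page"=>1},
 {"text"=>"CHAPTER I", "page"=>2},
 {"text"=>"JONATHAN HARKER'S JOURNAL", "page"=>2},
 ...]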
And there you have it. Thanks to Tika, we easily split the PDF into paragraphs, and thanks to Nokogiri, we parsed it with ease.
Getting the PDF Contents Into Algolia
So I decided to use Algolia after playing around with it, because it was extremely easy to use and configure, and it returned search results extremely fast. Getting my PDF content searchable was as easy as this:
require "algoliasearch"

# Credentials come from the Algolia dashboard
Algolia.init(application_id: 'xxxx', api_key: 'xxxx')
index = Algolia::Index.new("books")
# Push the paragraph records, then limit searching to the text attribute
index.add_objects(paragraphs)
index.set_settings({ "searchableAttributes" => ["text"] })
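Querying is just as short; for example, searching the index returns a hash of hits, each carrying the text and page attributes we stored:

results = index.search("Harker")
puts results["hits"].first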
Closing Notes
Now obviously you could add more to that example (maybe get the filename and the PDF’s author/title in as part of each record, or add book IDs linked to your system, or whatever), but that’s a nice and simple example to get started from. I’d like to thank all the people who worked on any of the tools I’ve used, because they’re awesome and incredible time savers.
Here’s the full code of the example as a GitHub Gist:
invalid_paragraph? could be written in a more performant way using regexes, but I wanted something easy for all readers to understand.

Thanks for reading!