In my previous post, I went through parsing PDFs, indexing them with Algolia, and making them searchable. The end result was that we ended up with a record per paragraph, which meant we could accumulate a very large number of records very fast. At Tam, we were using this on a larger scale in one of our Ruby on Rails projects. One of our use cases required us to be able to delete these records, and we were using a method named delete_by_query in Algolia's library that deletes objects matching a search query (in our case, a query filtering the index by file id).
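As a rough illustration, assuming an index named paragraphs and a file_id attribute (both names are made up for this example, not taken from our actual project), the call looked something like this:

require 'algoliasearch'

Algolia.init(application_id: 'YOUR_APP_ID', api_key: 'YOUR_ADMIN_API_KEY')

# Hypothetical index and attribute names, for illustration only.
index = Algolia::Index.new('paragraphs')

# Delete every paragraph record belonging to a given file,
# using an Algolia filter as the query parameters.
index.delete_by_query('', filters: 'file_id:42')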
It worked fine initially, but in our use case this operation might have to run over millions of records, and the call was taking over an hour to complete, slowing our servers down so much that users stopped getting responses back. After one of my colleagues and I spent some time debugging, we found that delete_by_query was causing the issue. The Ruby library follows a convention of suffixing blocking methods (synchronous methods that wait for responses from the Algolia servers) with an exclamation mark. delete_by_query, however, was a blocking method without the suffix, and there was no delete_by_query! method. The old implementation of delete_by_query looked like this:
#
# Delete all objects matching a query
#
# @param query the query string
# @param params the optional query parameters
#
def delete_by_query(query, params = nil)
  raise ArgumentError.new('query cannot be nil, use the `clear` method to wipe the entire index') if query.nil? && params.nil?
  params ||= {}
  params.delete(:hitsPerPage)
  params.delete('hitsPerPage')
  params.delete(:attributesToRetrieve)
  params.delete('attributesToRetrieve')
  params[:hitsPerPage] = 1000
  params[:attributesToRetrieve] = ['objectID']
  loop do
    res = search(query, params)
    break if res['hits'].empty?
    res = delete_objects(res['hits'].map { |h| h['objectID'] })
    wait_task res['taskID']
  end
end
It would retrieve matching objects in batches of 1,000, then delete each batch, so the method waited on both the retrieval and the deletion of every batch. The reason it waited on deletion was that a repeated search with the same query would very likely return objects already scheduled for deletion by the previous delete_objects call, triggering another unnecessary call to delete them again and prolonging the whole process.
After a discussion with my colleague, going over the Algolia REST API docs to find a better way, and receiving some suggestions from the fine folks over at Algolia, we implemented a new solution: retrieve all the objects to be deleted first, then delete them with a single asynchronous call to the delete_objects method (which uses Algolia's fast, asynchronous batch API) without waiting. There's no need to wait, because having all the objects up front avoids the scenario the old code tried to prevent; we can simply delete them all in one go. We also added delete_by_query!, which runs the new delete_by_query but waits for its result, bringing both methods in line with the convention used by the rest of the library. We had no issues using this new approach with millions of objects: our servers stopped slowing to a crawl, and the method call was quite fast. That said, with a much larger number of objects it would still incur some memory overhead and slower response times, since all the object IDs are held in memory at once.
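Here is a minimal sketch of that approach. It is not the exact code from the pull request; it assumes the client's existing browse, delete_objects, and wait_task methods, and it collects every objectID before issuing one batch deletion:

# Sketch: delete everything matching a query in a single async batch.
# Not the exact merged code; browse, delete_objects, and wait_task are
# existing methods of the Algolia Ruby client.
def delete_by_query(query, params = nil)
  raise ArgumentError.new('query cannot be nil, use the `clear` method to wipe the entire index') if query.nil? && params.nil?
  params ||= {}
  params[:query] = query
  params[:attributesToRetrieve] = ['objectID']

  # Gather every matching objectID first; browse iterates over the
  # whole result set instead of paginating a search.
  ids = []
  browse(params) { |hit| ids << hit['objectID'] }

  # One asynchronous batch deletion; no waiting on the Algolia task.
  delete_objects(ids)
end

# Blocking variant, following the library's exclamation-mark convention.
def delete_by_query!(query, params = nil)
  res = delete_by_query(query, params)
  wait_task(res['taskID']) if res
  res
end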
You can take a look at the pull request here. Although GitHub says it's closed, it was merged into another pull request that eventually made it into the master branch; it happened in an odd way that didn't get me counted as a contributor to the project (something to do with how they rebased, I assume). My colleague also added the same solution to the Go version of Algolia's library.
One final thing to note: this implementation isn't available in all of Algolia's libraries, so if anyone runs into the same issue, consider the approach discussed in this post.