Resilient Elasticsearch-based search and analytics
by Manuel Huez
A common challenge for startups is providing customers with powerful search results and intelligent analytics. At ProcessOut, this need arose from the very beginning: companies need to be able to search for a specific transaction or customer and see how their business is doing, all in real time. For example, we offer our merchants analytics on their conversion and authorization rates, failed-transaction charts, card details and much more.
The naive approach
At first, we went with the naive solution: make everything work within PostgreSQL, our main database engine. We used views to generate computed result tables and built charts on top of them, and we relied on PostgreSQL's full-text search to provide search results.
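For illustration, a minimal sketch of that first approach might look like the following. The table, columns and view are hypothetical and only stand in for our actual schema.

# Illustrative only: hypothetical schema, using psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# A view that pre-aggregates transaction volume per day,
# which charts can then be generated from.
cur.execute("""
    CREATE OR REPLACE VIEW daily_transaction_volume AS
    SELECT date_trunc('day', created_at) AS day,
           count(*)                      AS transactions,
           sum(amount)                   AS volume
    FROM   transactions
    GROUP  BY 1;
""")

# Full-text search over customer names and emails using
# PostgreSQL's built-in tsvector/tsquery machinery.
cur.execute("""
    SELECT id, name, email
    FROM   customers
    WHERE  to_tsvector('english', name || ' ' || email)
           @@ plainto_tsquery('english', %s);
""", ("john doe",))
print(cur.fetchall())

conn.commit()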
However, as the number of rows in our database grew, queries started to slow down. And as our customers' needs grew as well, PostgreSQL's native features became too restrictive for some use-cases.
Choosing a solution
We went through several iterations when choosing a database for search queries and analytics. InfluxDB was one of them: the throughput was extremely good and the queries fast. However, the results weren't accurate enough, and its search queries weren't powerful enough for our needs. We also started to lose data points as the throughput grew. InfluxDB is great for IoT projects, but it isn't very precise, and we needed something that could accurately generate money-related charts.
We then had a choice between Algolia, which we already use for our documentation (and we love it!), and Elasticsearch. Because AWS provides a hosted solution for Elasticsearch, and a big part of ProcessOut’s infrastructure is hosted on Amazon Web Services, we decided to give it a try.
The implementation
Running two databases in parallel raises the issue of keeping them in sync in real time, without impacting the API response time and without preventing the platform from functioning if the secondary database has an issue.
To achieve this, we went with asynchronous indexing. Our first iteration scheduled a job on our workers' pool every time a row was updated in the PostgreSQL database, and that job pushed the updated document to Elasticsearch.
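As a rough sketch, that first iteration boiled down to one Elasticsearch request per updated row, along these lines (the schema, index name and worker function are hypothetical):

# Rough sketch of the first iteration: one indexing job per updated row.
# Schema, index name and function name are hypothetical.
import psycopg2
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=app")
es = Elasticsearch()

def index_transaction(transaction_id):
    # Enqueued on the workers' pool every time a transaction row changes.
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, amount, currency, status FROM transactions WHERE id = %s",
            (transaction_id,),
        )
        row = cur.fetchone()
    doc = {"id": row[0], "amount": row[1], "currency": row[2], "status": row[3]}
    # One HTTP round-trip to Elasticsearch per updated row.
    es.index(index="transactions", id=doc["id"], body=doc)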
This worked well at first and provided near real-time search results. However, as the number of documents grew, indexing them one by one started to slow down both our workers and our Elasticsearch nodes: workers were filling up with indexing jobs, and Elasticsearch handles bulk indexing far better than a flood of single-document requests. We therefore moved the pending IDs to a Redis queue:
# Push an entry to index
# rpush returns the number of elements
# in the queue
rpush indexer:transactions tr_4DDgDAcE
# Retrieve an entry to index
# lpop pops the oldest element of the
# queue and returns it
lpop indexer:transactions
As we were already running a Redis cluster for our cache and queues, we pushed the IDs of newly updated rows to a new queue and flushed it every minute, or once a threshold number of documents had been updated. This was a tremendous success: our workers immediately stopped being flooded, the PostgreSQL database received fewer queries since we could select and index multiple documents at once, and the Elasticsearch nodes handled the bulk indexing far better.
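A simplified version of the flush job could look like the sketch below; the batch threshold, field list and names are illustrative rather than our actual implementation.

# Illustrative flush job: drain the Redis queue, load the matching rows
# from PostgreSQL in a single query, and bulk-index them into Elasticsearch.
# Names, thresholds and the field list are hypothetical.
import psycopg2
import redis
from elasticsearch import Elasticsearch, helpers

r = redis.Redis()
pg = psycopg2.connect("dbname=app")
es = Elasticsearch()

BATCH_SIZE = 1000  # flush threshold; a timer also triggers a flush every minute

def flush_transactions_queue():
    # Pop up to BATCH_SIZE pending IDs from the queue.
    ids = []
    for _ in range(BATCH_SIZE):
        raw = r.lpop("indexer:transactions")
        if raw is None:
            break
        ids.append(raw.decode())
    if not ids:
        return

    # Select all the documents at once instead of one query per row.
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, amount, currency, status FROM transactions WHERE id = ANY(%s)",
            (ids,),
        )
        rows = cur.fetchall()

    # One bulk request for the whole batch.
    actions = (
        {
            "_index": "transactions",
            "_id": row[0],
            "_source": {"id": row[0], "amount": row[1],
                        "currency": row[2], "status": row[3]},
        }
        for row in rows
    )
    helpers.bulk(es, actions)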
However, we also needed a way to recover from Elasticsearch node failures when some documents couldn't be indexed, to prevent data loss. To achieve this, we run cron jobs that select the database entries updated since the last indexing run and push them to the indexing queue. This ensures that any missed document eventually gets indexed, and it also lets us re-index every document when we update our Elasticsearch indexes or create new ones.
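The recovery cron itself can stay simple: re-enqueue everything touched since the last successful run. In the sketch below, the checkpoint key and the schema are hypothetical.

# Illustrative recovery cron: re-enqueue every row updated since the last
# successful indexing run, so missed documents eventually get indexed.
# The checkpoint key, table and column names are hypothetical.
import psycopg2
import redis

r = redis.Redis()
pg = psycopg2.connect("dbname=app")

def reenqueue_missed_transactions():
    # Timestamp of the last successful run, stored in Redis as a checkpoint.
    last_run = r.get("indexer:transactions:last_run") or b"1970-01-01"

    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, updated_at FROM transactions WHERE updated_at > %s",
            (last_run.decode(),),
        )
        rows = cur.fetchall()

    # Push every candidate back onto the indexing queue; re-indexing with
    # the same document ID simply overwrites the existing document.
    for transaction_id, _ in rows:
        r.rpush("indexer:transactions", transaction_id)

    if rows:
        # Move the checkpoint forward to the most recent update we saw.
        newest = max(updated_at for _, updated_at in rows)
        r.set("indexer:transactions:last_run", newest.isoformat())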
What’s next: Zero-downtime Elasticsearch migrations