Dealing with Elasticsearch

Last week our flagship product, Price Guru PRO¹, finally went online with a huge change: its primary database was switched from a classic SQL database to Elasticsearch.

This change was driven mainly by two needs:

- the huge amount of data stored in the PGP system must be quickly available to end users²
- some of the searches we are implementing require tools that are not always available in a classical SQL system³

A lot of time went into R&D, testing various solutions that could meet our requirements. At first we tried some SQL server extensions to obtain the speed and features we needed, but we soon realized that this was not the right path to follow.

After these unsuccessful tests, we concluded that neither standard nor extended SQL systems were capable of giving the project the speed it requires. So we decided to move on to other solutions, which inevitably led us to full-text search and its de facto standards: Lucene and Elasticsearch⁴.

Elasticsearch is a document-based store (a NoSQL server) powered by Lucene. It provides full-text capabilities (proximity matching, n-grams and so on), data aggregation and geolocation searches.
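To give an idea of what n-gram matching means in practice, here is a minimal pure-Python sketch of edge n-grams (prefix n-grams), the variant commonly used for autocomplete. It only illustrates the concept and is not how Lucene implements it; the function names and the sample car models are made up for the example.

```python
from typing import Dict, List, Set


def edge_ngrams(term: str, min_gram: int = 2, max_gram: int = 5) -> List[str]:
    """Return the lowercase prefixes of `term` from min_gram to max_gram characters."""
    term = term.lower()
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_index(terms: List[str]) -> Dict[str, Set[str]]:
    """Map every prefix to the set of original terms that produce it."""
    index: Dict[str, Set[str]] = {}
    for term in terms:
        for gram in edge_ngrams(term):
            index.setdefault(gram, set()).add(term)
    return index


# Hypothetical sample data: a few car model names.
index = build_index(["Golf", "Giulietta", "Giulia"])

# A user who has typed "giu" so far gets both matching models with one lookup.
print(sorted(index.get("giu", set())))
```

Because the prefixes are computed once at indexing time, a suggestion is a single dictionary lookup instead of a scan over every stored term.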

At first glance, it seemed to be exactly what we were looking for, so we spent a considerable amount of time studying this technology to see whether it satisfied our requirements.

The learning curve of Elasticsearch is very steep: there are a lot of concepts that are not immediately clear to people who have never dealt with full-text search. Moreover, while the language used to store data is fairly simple, the language used to query the server is very complicated and not really intuitive. We first spent a lot of time trying to deeply understand how indexing works in Elasticsearch, and then many more hours on a tedious try-error-fix cycle to obtain the right queries for our project.
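As a taste of the query language, the sketch below builds the JSON body of a search that combines a full-text `match` clause with a `geo_distance` filter inside a `bool` query. The field names `model` and `dealer_location` are hypothetical and not taken from our real mapping; `dealer_location` is assumed to be mapped as a `geo_point`. The code only builds and prints the body, it does not contact a server.

```python
import json
from typing import Any, Dict


def build_search_body(text: str, lat: float, lon: float, radius_km: int) -> Dict[str, Any]:
    """Build a query DSL body: full-text match restricted to a radius around a point."""
    return {
        "query": {
            "bool": {
                # Scored full-text clause on a hypothetical "model" field.
                "must": [{"match": {"model": text}}],
                # Non-scoring filter on a hypothetical geo_point field.
                "filter": [
                    {
                        "geo_distance": {
                            "distance": f"{radius_km}km",
                            "dealer_location": {"lat": lat, "lon": lon},
                        }
                    }
                ],
            }
        }
    }


body = build_search_body("giulietta", lat=45.46, lon=9.19, radius_km=25)
print(json.dumps(body, indent=2))
```

Even this small example needs four levels of nesting, which gives a feeling for why getting the queries right took us so many iterations.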

On the other hand, the results obtained with this server were amazing. The time spent waiting for data was dramatically reduced, and n-gram queries and geolocation came for free from the server.

The path was clearly set and, after a huge rewrite of our search system, our customers can now benefit from the improved search speed provided by Elasticsearch.

This journey of learning Elasticsearch has been very profitable for our team. We have only scratched the surface of this amazing and useful index server, and we will surely reuse it in our systems. On the other hand, we ran into several problems that we had to face all by ourselves⁵. At the time of writing, an official book on Elasticsearch has been published, and we strongly encourage anyone who plans to use this server to buy and read it; it will shorten the learning curve and help you understand exactly how Elasticsearch works on your data.

Obviously, Elasticsearch is not a bed of roses. As stated before, writing a query for Elasticsearch requires a lot of experience with its query DSL, which can only be gained through a lot of practice. Moreover, Elasticsearch has its flaws:

- It cannot be used as a primary datastore because it lacks transactions and security
- Some resiliency problems affect Elasticsearch. Most of them will not occur in a production environment, but if your setup is complex you may encounter some of them. Again, the recommendation is to use a companion storage system that guarantees your data will be safe
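The companion store recommendation can be sketched as follows. This is only an illustration of the pattern, not our production code: both stores are plain in-memory dictionaries standing in for a durable database and an Elasticsearch index, and the class and method names are hypothetical.

```python
from typing import Any, Dict


class ListingRepository:
    """Keep a durable store as the source of truth and treat the index as derived data."""

    def __init__(self) -> None:
        self.primary: Dict[str, Dict[str, Any]] = {}       # stand-in for the companion store
        self.search_index: Dict[str, Dict[str, Any]] = {}  # stand-in for the search index

    def save(self, doc_id: str, doc: Dict[str, Any]) -> None:
        # Write to the source of truth first, then to the search index.
        self.primary[doc_id] = dict(doc)
        self.search_index[doc_id] = dict(doc)

    def rebuild_index(self) -> None:
        # If the index is lost or corrupted, recreate it from the primary store.
        self.search_index = {doc_id: dict(doc) for doc_id, doc in self.primary.items()}


repo = ListingRepository()
repo.save("1", {"model": "Golf", "price": 12500})
repo.save("2", {"model": "Giulietta", "price": 14900})

repo.search_index.clear()  # simulate losing the index
repo.rebuild_index()
print(repo.search_index == repo.primary)
```

The point of the design is that losing the index costs a reindex, not the data.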

In the end, this product has greatly benefited our project, but it has to be used wisely to avoid its inherent problems. It requires your team to study and master its functionality but, once you have set up the server and built up your knowledge, your project will surely gain a great speedup from Elasticsearch.

¹ Price Guru PRO is a real-time automotive market analysis system for new and used cars, built using top-notch Big Data and Cloud Computing technologies and able to produce detailed reports about market trends, price history, sales, and much more. ↩
² Internet users tend to abandon a site that takes more than 3 to 5 seconds to load. Waiting for data in PGP for longer than 5 seconds will surely make people leave the system, even if the data inside PGP is valuable for their business. ↩
³ PGP uses geospatial search through polygons and points with a radius, tools that are normally not available in a SQL server (or are available only as external extensions). Other features PGP currently uses are n-gram searches and autocomplete suggestions. Even though this kind of search can be simulated with the SQL LIKE construct, it tends to be very slow when an index cannot be used. ↩
⁴ Elasticsearch is not the only indexing and search server on the market; another well-known server is Solr, a Lucene subproject. However, Elasticsearch has some undoubtedly useful features that Solr lacks:
- Elasticsearch dynamic mapping, the ability to infer the document schema from the data inserted into the index without redefining the mapping (a typical behaviour of a NoSQL system)
- the Elasticsearch scaling system, which is very easy to configure
↩
⁵ The Elasticsearch documentation is very comprehensive about every detail of the server, but it is missing the overall picture of the system, and it is very hard to find a tutorial that solves your particular problem. Luckily, there is a very active community that shares its knowledge and is happy to help new Elasticsearch users. ↩

A Fistful of Datas
