Elasticsearch fuzzy matching optimization for huge server/server cluster - performance

I've got an index with quite complex queries running against it. The main slowdown is the fuzzy queries, which are run against a field containing 2-5 words per record. I mainly have to find records that differ by 1-3 characters.
On my 4-core (with HT), 8 GB RAM machine the queries execute in about 1-2 s each.
On a server with 12 cores (with HT) and 72 GB RAM the same query executes in 0.3-0.5 s. That doesn't seem like reasonable scaling for the hardware provided, and I'm sure there are some hidden options I could tune to adjust query performance.
I've looked through the Elasticsearch guide but couldn't find anything that would help me tune performance based on the number of CPUs or the amount of RAM, or tune Elasticsearch specifically for fuzzy queries.
Another question: how does it scale if I add another server like this? Will the query time be roughly halved?
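For reference, a minimal sketch of what such a query might look like (the index and field names here are placeholders, not taken from the actual setup); note that Lucene-based fuzzy matching caps the edit distance at 2:

# Hypothetical fuzzy query; "records" and "title" are placeholder names.
curl -XGET 'http://localhost:9200/records/_search' -H 'Content-Type: application/json' -d '{
  "query": {
    "fuzzy": {
      "title": { "value": "example", "fuzziness": 2 }
    }
  }
}'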

There are a couple of possibilities here. The first is that your query is I/O bound. In that case, just adding another server might help, because two nodes will be retrieving data from two disks. The other possibility is that your query is CPU bound. To a large degree, searching a single shard is a single-threaded process. Assuming your index was created with default settings, it has 5 shards, so your query cannot significantly benefit from running on more than 5 CPUs. In that case, adding another node would only slow things down because of network overhead. Instead, you need to recreate the index with more shards.
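A rough sketch of that last step (the index name and shard count below are placeholders): the shard count of an existing index cannot be changed in place, so a new index has to be created and the data reindexed into it.

# Hypothetical sketch: create a new index with more primary shards so a
# single search can fan out over more cores/nodes, then reindex into it.
curl -XPUT 'http://localhost:9200/my_index_v2' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 1
  }
}'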

Related

Can I configure Apache Solr 6 to run on multiple CPU for single Core

I have set up Solr 6.6.1 on a machine with 48 cores and 96 GB RAM. There are currently 8 million documents in one core, and that number will grow over time. Before this I had a similar Solr setup on a small machine with 4 cores and 16 GB RAM, but the response time is the same on both machines, which is surprising. From a SlideShare presentation I learned that one Solr core only gets one CPU, which is why it's better to split the index into multiple shards. Is that correct, or is there an alternative or better way to improve response time?
Finally, what is the right way to split shards once their size grows beyond 1 million documents?
Following is the reference slide.
Your observation is correct, but there's an important detail: this is relevant for a single query - i.e. one query only uses one thread/core. Multiple queries will use multiple threads, so in your case you'll be able to handle more simultaneous users instead.
To optimize for the single query use case, splitting your index into multiple shards is, as you say, the way to go. In that case the query will effectively be split into four separate queries, then merged afterwards instead.
There is no hard limit on when to split, as that will depend on your use case and query profile.
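As a hypothetical sketch (collection name and shard counts are placeholders, and this assumes SolrCloud mode, where the Collections API is available), the split can be done either up front or on an existing collection:

# Create a collection with 4 shards so a single query fans out over 4 cores:
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=4&replicationFactor=1'

# Or split an existing shard of an existing collection:
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=my_collection&shard=shard1'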

Elasticsearch sharding basics with respect to query speed

Pardon my ignorance if this question is too broad or vague. I'm playing with Elasticsearch with out-of-the-box settings on my laptop and it works just great.
It's an 8-core MacBook, 6 GB of heap is given to Elasticsearch, and it works pretty well for a fairly large dataset (just over 7 million documents).
I'm keen to set up a multinode cluster (2 machines) and before I assume a few things, I would like to get expert views on a few key points.
I understand "How many shards per node" is a very subjective question and one answer will not fit all situations.
I understand that sharding helps to distribute the indices to multiple nodes so that the storage footprint is optimal per node.
But mainly, I'd like to understand how sharding affects query speed and effective CPU core utilization.
When a single search query comes in, does ES fire internal sub-queries at all the shards in parallel, and can it therefore keep all the cores busy (if the number of shards equals the number of cores)?
Can I also be pointed to a few useful links that will help me? Thanks.
Your understanding is pretty much spot on.
The basic concept to understand is that one query on a single shard will use one thread, and one thread boils down to one CPU core. If the query needs to touch multiple shards, ES will make sure all the shards involved are queried, and each shard will do its part of the job using one thread.
The size of the shard and the complexity of the query translate into how much time is spent in that thread. But the OS will not dedicate one CPU core to that thread all the time; the OS schedules jobs, and other processes get a slice of the CPU core.
Ideally, yes, you would have number of shards = number of cores, but few clusters out there use this setup - mainly those that serve a lot of concurrent requests per second and demand strict response times.
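For illustration (the index name is a placeholder), an index whose primary shard count matches the 8 cores mentioned above would be created like this, and the resulting layout can be checked with the _cat API:

# Hypothetical sketch: one primary shard per core on a single-node setup.
curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 0
  }
}'

# See how the shards are distributed across nodes:
curl 'http://localhost:9200/_cat/shards/my_index?v'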
Thanks for the response.
Just a summary of my understanding to get it validated.
No of shards == No of cores
(-)
Bigger shards.
A thread could take more time to search a single shard, so other threads could be queued up and made to wait.
(+)
Optimal core utilization.
Less chance of context-switching overhead, as the number of threads is limited.
No of shards > No of cores
(-)
More threads will be spawned for queries, so context-switching overhead may apply.
More threads may need more memory for thread stacks, etc.
More shards could require more housekeeping (i.e. managing file handles, etc.) by Elasticsearch.
(+)
A thread could take less time (relatively) to search a single shard.
Could process more concurrent requests (as there are more threads).
Eventually it comes down to which of these gives the best balance between the available hardware, query speed and the concurrency factor, and I think it requires quite a bit of experimenting - or, in other words, finding out which hurts least.

Elasticsearch with different java heaps, does it matter?

Say I've got 2 servers, one of which has -Xmx and -Xms set to 4G and the other to 2G.
Will Elasticsearch account for those performance differences when balancing, or will both servers be addressed purely based on indices, making an OOM (much) more likely on the latter than on the former?
By the way, I've set the properties indices.fielddata.cache.size, indices.breaker.fielddata.limit, indices.breaker.request.limit, and indices.breaker.total.limit on both servers, as Elasticsearch suggests.
This is important to me because, if it does, I'd have to change the index sharding based on guessed index strain, which would be a hassle (if not impossible).
Elasticsearch treats every node the same and balances documents equally between them. This means Elasticsearch won't readjust based on hardware to get you optimal performance.
One thing to remember here is that a herd of bulls is only as fast as its slowest bull, and the same applies here. That said, if the load is small enough that it doesn't exhaust the hardware of the 2 GB machine, you shouldn't see any issue. Otherwise you will see a difference on memory-aggressive operations like aggregations.
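A minimal sketch, assuming an older (1.x/2.x) Elasticsearch where the heap is set through ES_HEAP_SIZE (on 5.x+ the equivalent lives in config/jvm.options): the simplest way around the problem is to give both nodes the same heap.

# Hypothetical sketch: give both nodes the same fixed heap so neither
# becomes the slow node. ES_HEAP_SIZE sets both -Xms and -Xmx on 1.x/2.x.
export ES_HEAP_SIZE=4g
./bin/elasticsearch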

Increase Solr performance when querying a subset of documents

The Usecase
I have an index of potentially millions of documents. I want to run around 20'000 searches against a subset of these documents (around 25'000 documents). These 25'000 documents could take up around 100 MB stored in Solr (consisting of stored and indexed text fields).
The Problem
As the number of indexed documents increases, the performance of the queries decreases a lot. For example, running 20'000 searches that hit 25'000 documents against a 100'000-document index takes around 4 minutes. Running the same searches against a 200'000-document index takes around 20 minutes.
So is there any way to cache these 25'000 documents in RAM before hitting them with searches?
UPDATE
Some things that really helped:
reducing the returned row count (in almost all cases I had to iterate through the returned results, and in almost all cases there were no more than 100 matching results, but I had set rows to a very large value. Reducing the row count improved performance around 2x; see the sketch below. This seemed counter-intuitive: if there are only 79 matches and I set the row count to 100, it performs better than when there are 79 matches and I set the row count to 1000. In the first case Solr already returns the found item count, and does it fast. Why should there be a performance difference?)
reducing multithreading (I had added multiple threads for querying because the development box had more resources available. On the resource-constrained production box this was slowing things down; using only one or two threads got me around a 2x speed improvement.)
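For illustration (the core name is a placeholder), the rows change boils down to capping the parameter near the real result size; numFound still reports the total match count either way:

# Before: rows set far larger than the number of results ever consumed.
curl 'http://localhost:8983/solr/mycore/select?q=name:a&rows=1000'

# After: request only as many rows as are actually iterated.
curl 'http://localhost:8983/solr/mycore/select?q=name:a&rows=100'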
Some things that did not really help:
splitting up filter queries (I was already using filter queries everywhere possible, but I was combining them into one fq per query, e.g. fq=name:a AND type:b. Splitting them up as fq=name:a&fq=type:b caches them separately (see the Apache Solr documentation) and could improve performance, but it did not make a huge difference in this case.)
changing caching settings (in this case filterCache seemed to have the most potential; however, increasing it or changing its settings did not make a huge difference.)
A few things that are recommended for performance:
Have enough spare RAM on the box so index files can be in OS cache
Try to play around with solr caching settings in SolrConfig
Play around with autowarming after commits
Try to develop your queries to limit the result set. Large result sets, specifically if you are using grouping and faceting, will kill performance. Now, a 200,000-document index is really quite small, so you should not have any problems, but I thought I'd mention this for when you scale.
Try to use filter queries (fq) whenever possible. They are much faster than putting field:val in q, plus they are cached (see the sketch below).
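A hypothetical sketch of that last point, with placeholder core and field names: constraints moved out of q into separate fq parameters are each cached independently in the filterCache, so repeated filters become cheap.

# Everything in q - nothing is cached as a filter:
curl 'http://localhost:8983/solr/mycore/select?q=name:a%20AND%20type:b&rows=100'

# Constraints as separate fq parameters - each one cached on its own:
curl 'http://localhost:8983/solr/mycore/select?q=*:*&fq=name:a&fq=type:b&rows=100'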

optimize elasticsearch / JVM

I have a classifieds website. For this I'm using Elasticsearch, Postgres and Rails on the same Ubuntu 14.04 dedicated server, with 256 GB of RAM and 20 cores / 40 threads.
I have 10 indexes in Elasticsearch, each with the default number of shards (5). They contain between 1,000 and 400,000 classifieds depending on the index.
There are approximately 5,000 requests per minute, two thirds of which trigger an Elasticsearch request.
According to htop, the JVM is using around 500% CPU.
I've tried different options: I reduced the number of shards per index, and I also tried changing JAVA_OPTS as follows:
#JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
#JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
but it doesn't seem to change anything.
So, two questions:
When you change any Elasticsearch setting and then restart, should the improvement (if any) be visible immediately, or can it arrive a bit later thanks to caches or anything else?
Can anyone help me find a good JVM / Elasticsearch configuration so it won't consume that many resources?
First, it's a horrible idea to run your web server, database and Elasticsearch server all on the same box. Each of these should be given its own box, at least. In the case of Elasticsearch, it's actually recommended to have at least 3 servers, or nodes. That way you end up with a load-balanced cluster that won't run into split-brain issues.
Further, sharding only makes sense in a cluster. If you only have one node, then all the shards reside on the same node. This causes two performance problems. First, you take the hit that sharding always adds: for every query, Elasticsearch must query each shard individually (each is a separate Lucene index) and then combine and process the results from all the shards to produce the final result. That's a not-insignificant amount of overhead. Second, because all the shards reside on the same node, you're I/O-locked: the shards have to be queried one at a time instead of all at once. Optimally you should have one shard per node; however, since you can't create more shards without reindexing, it's common to have a few extra hanging around for future horizontal scaling. In that scenario, the cost of reindexing what could be hundreds of gigs of data or more outweighs a small performance bottleneck. Still, if you've got 5 shards running on one node, that's probably a large part of your performance problem right there.
Finally, and again with Elasticsearch in particular, swapping is a huge no-no. Most of what makes Elasticsearch efficient is its cache, which all resides in RAM. If swapping occurs, it interferes with that cache in sometimes unpredictable ways. As a result, it's recommended to turn off swapping completely on the box your node(s) run on, and to set Elasticsearch/the JVM to a min and max memory consumption of roughly half the box's available RAM. That's virtually impossible to achieve if you have other things running on it, like a web server or database. Databases in particular aggressively consume RAM to increase throughput, which is why those should likewise live on their own servers.
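A rough sketch of those last recommendations (paths and the heap value are assumptions, and the exact setting name varies by Elasticsearch version):

# Turn swapping off on the box entirely:
sudo swapoff -a

# Ask Elasticsearch to lock its heap in RAM (older versions use
# bootstrap.mlockall instead of bootstrap.memory_lock):
echo 'bootstrap.memory_lock: true' | sudo tee -a /etc/elasticsearch/elasticsearch.yml

# Fix min and max heap to roughly half a dedicated box's RAM,
# e.g. via ES_HEAP_SIZE (1.x/2.x) or config/jvm.options (5.x+).
export ES_HEAP_SIZE=16g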
