Solr asks for more and more rows from shards - performance

I am having performance issues after a Solr 6 upgrade. I have multiple nodes in the cluster and direct queries to one of them with shards=[list of hosts], which takes care of submitting queries to all the shards and aggregating the results. All the original queries have rows=200.
I have found that many log entries with distrib=false have a large rows value, e.g. 1200, 2200, 3200, 4200, ...
What is triggering it? What am I doing wrong to cause this behavior?

So, it turned out that I was not paying attention to the start parameter.
I found that the 'primary' (coordinating) node asks the shard nodes for a very large number of rows when start= is a large value: to merge and re-sort the distributed results it has to request start + rows documents from every shard, which is exactly where the 1200, 2200, 3200, ... values come from.
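For illustration, here is a minimal sketch in Python (using requests against a hypothetical collection name and shard list; adjust to your own setup) of the kind of query that produces those log entries:

```python
import requests

# Hypothetical Solr node, collection, and shard list.
SOLR = "http://solr-node1:8983/solr/mycollection/select"
SHARDS = ("solr-node1:8983/solr/mycollection,"
          "solr-node2:8983/solr/mycollection")

# A page deep into the result set: start=1000, rows=200.
# The coordinating node cannot know in advance which shard holds
# positions 1000-1199 of the merged result, so it asks every shard
# for its top start + rows = 1200 documents (the distrib=false
# requests with rows=1200 in the logs) and merges them locally.
params = {
    "q": "*:*",
    "start": 1000,
    "rows": 200,
    "shards": SHARDS,
}
resp = requests.get(SOLR, params=params)
print(resp.json()["response"]["numFound"])
```

The cost therefore grows with the page depth, not just with rows, which is why deep paging is expensive in distributed mode.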

Related

How to get better relevance without compromising on performance and scalability, and avoid the sharding effect of Elasticsearch

Let's suppose I have a big index consisting of 500 million docs. By default, ES creates 5 primary shards for the reasons below, and I go with the same setting.
Performance: there will be less time to search in a shard with fewer documents (100 million in my use case) than in just 1 shard with a huge number of documents (500 million). It also allows operations to be distributed and parallelized across shards.
Horizontal scalability (HS): horizontally split/scale your content volume.
But when we search, by default the query just goes to 1 shard and gives the result. In this case relevance isn't accurate (as IDF is majorly impacted), and it might not even give any result if my matched document is on another shard. This is called the sharding effect.
The above issue is explained in detail here, and there are the two options below to avoid it, but I think both solutions have some cons:
1. Document routing: in this case all the documents will be on the same shard, which defeats the whole purpose of sharding.
2. dfs_query_then_fetch search type: there is a performance cost associated with it.
I am interested to know the following:
What does ES do by default? Is there any config by which it can be controlled?
Is there any other out-of-the-box solution which ES provides to avoid the sharding effect?
First of all, this part of your question is not accurate:
> But when we search, by default the query just goes to 1 shard and gives the result. In this case relevance isn't accurate (as IDF is majorly impacted), and it might not even give any result if my matched document is on another shard. This is called the sharding effect.
The claim that the search goes to only one shard is false. The search request is sent to all shards (of course, or no one would use Elasticsearch!), but the score is computed on a per-shard basis. So yes, you can have an accuracy problem with multiple shards, but only if you have very few documents. With 500 million documents, accuracy will not be a problem (unless you make bad use of document routing; see here for more information).
So when you search for 10 results for a query, each shard returns its 10 best matches for the query, and the results from the shards are then aggregated by the coordinating node to give the best 10 results for the whole index.
You can use 5 shards without fearing any relevancy problem. But don't try to avoid sharding itself! It is what makes Elasticsearch so cool :D
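To make the two options concrete, here is a minimal sketch in Python (using requests, assuming a cluster at localhost:9200 and a hypothetical index name) of the default query_then_fetch behaviour versus the dfs_query_then_fetch search type mentioned in the question:

```python
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "products"             # hypothetical index name
query = {"query": {"match": {"title": "wireless headphones"}}, "size": 10}

# Default behaviour (query_then_fetch): each shard scores documents with
# its own term statistics, returns its top 10, and the coordinating node
# merges the per-shard lists into the final top 10.
default = requests.post(f"{ES}/{INDEX}/_search", json=query)

# dfs_query_then_fetch: an extra pre-phase collects global term statistics
# from all shards, so IDF is computed over the whole index, at the cost
# of one more round trip per search.
dfs = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"search_type": "dfs_query_then_fetch"},
    json=query,
)

print(default.json()["hits"]["hits"][0]["_score"],
      dfs.json()["hits"]["hits"][0]["_score"])
```

With an index of this size the two scores should be nearly identical, which is the point of the answer above.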

Elasticsearch - Single Index vs Multiple Indexes

I have more than 4000 different fields in one of my indexes, and that number can grow over time.
Elasticsearch gives a default limit of 1000 fields per index, so there must be some reason for it.
Now, I am thinking that I should not increase the limit set by Elasticsearch.
So I should break my single large index into multiple smaller indexes.
Before moving to multiple indexes, I have a few questions:
The number of smaller indexes can increase up to 50. Would searching all 50 indexes at a time slow down search compared to a search on the single large index?
Is there really a need to break my single large index into multiple indexes because of the large number of fields?
When I use multiple smaller indexes, the total number of shards would increase drastically (more than 250 shards). Each index would have 5 shards (the default number, which I don't want to change). Searching these multiple indexes would mean searching all 250 shards at once. Will this affect my search performance? Note: the number of shards might increase over time as well.
When I use a single large index which contains only 5 shards and a large number of documents, won't this overload those 5 shards?
It strongly depends on your infrastructure. If you run a single node with 50 shards, a query will run longer than it would with only 1 shard. If you have 50 nodes holding one shard each, it will most likely run faster than one node with 1 shard (if you have a big dataset). In the end, you have to test with real data to be sure.
When there is a massive number of fields, ES runs into performance problems and errors become more likely. The main problem is that every field has to be stored in the cluster state, which takes a toll on your master node(s). Also, in a lot of cases you have to work with lots of sparse data (90% of fields empty).
As a rule of thumb, one shard should contain between 30 GB and 50 GB of data. I would not worry too much about overloading shards in your use-case; the opposite is true.
I suggest testing your use-case with fewer shards: go down to 1 shard, 1 replica for your index. The overhead of searching multiple shards (5 primaries, multiplied by replicas) and then combining the results again is massive in comparison to your small dataset.
Keep in mind that document_type behaviour changed and will change further. Since 6.x you can only have one document_type per index, and starting in 7.x document_type is removed entirely. As the API listens at _doc, _doc is the suggested document_type to use in 6.x. Either move to one index per _type or introduce a new field that stores your type if you need the data in one index.
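As a concrete starting point for the "1 shard, 1 replica" suggestion, here is a minimal sketch in Python (using requests; the index name, field limit and cluster address are assumptions, not part of the original answer) that creates such an index and shows where the 1000-field default lives if you do decide to keep one big index and raise it deliberately:

```python
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "my_data"              # hypothetical index name

# One primary shard and one replica, as suggested above for a small dataset.
# index.mapping.total_fields.limit defaults to 1000; raising it (to 5000
# here, purely as an illustration) trades a larger cluster state and more
# sparse fields for keeping everything in one index.
body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
        "index.mapping.total_fields.limit": 5000,
    }
}
resp = requests.put(f"{ES}/{INDEX}", json=body)
print(resp.json())
```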

ElasticSearch indexing with hundreds of indices

I have the following scenario:
More than 100 million items and counting (10 million added each month).
8 Elastic servers
12 shards for our one index
Until now, all of those items were indexed in the same index (under different types). In order to improve the environment, we decided to index items by geohash code; our mantra was no more than 30 GB per shard.
The current status is that we have more than 1500 indices, 12 shards per index, and every item is inserted into one of those indices. The number of shards has surpassed 20,000, as you can understand...
Our indices are in the format <Base_Index_Name>_<geohash>
My question is raised due to performance problems which made me question our method. A simple count query in the format GET */_count
takes seconds!
If my intention is to query many indices, is this implementation bad? How many indices should a cluster with 8 virtual servers have? How many shards? We have a lot of data and it is growing fast.
Actually, it depends on your usage. A query to all of the indices takes a long time because the query has to go to all of the shards and the results have to be merged afterwards. 20K shards are not an easy task to query.
If your data is time based, I would advise adding month or date information to the index name and changing your query to GET indexname201602/_search or GET *201602/_search.
That way you can drastically reduce the number of shards your query hits, and it will take much less time.
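A minimal sketch in Python (using requests; the base index name, cluster address and document are made up, and a recent ES version with the _doc endpoint is assumed) of what the time-based naming looks like in practice:

```python
import requests
from datetime import date

ES = "http://localhost:9200"   # assumed cluster address
BASE = "items"                 # hypothetical base index name

# Write each item into a month-stamped index, e.g. items_201602.
doc = {"name": "example item", "created": "2016-02-15"}
month_index = f"{BASE}_{date(2016, 2, 15).strftime('%Y%m')}"
requests.post(f"{ES}/{month_index}/_doc", json=doc)

# A count against a narrow pattern only touches that month's shards,
# instead of every shard behind GET */_count.
narrow = requests.get(f"{ES}/{BASE}_201602/_count")
wide = requests.get(f"{ES}/_count")   # equivalent of GET */_count
print(narrow.json()["count"], wide.json()["count"])
```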

ElasticSearch max shard size

I'm trying to configure my elasticsearch cluster but I need more information.
I understand the rules for choosing the number of shards to use.
What I don't know is the maximum size of a shard. Is it linked to the JVM max size or another parameter?
Thanks
The rule of thumb is not to have a shard larger than 30-50 GB. But this number depends on the use case, your acceptable query response times, your hardware, etc.
You need to test this and establish the number yourself; there is no hard rule for how large a shard can be. Take one node from the cluster, create the index with one primary and no replicas, and add to it as many documents as you can. Test your queries. If you are happy with the response times, add more documents. Test again. And so on. When the response times are no longer satisfactory, that is your hardware limit for the shard size.
Don't forget that this number might be lower if the use-case requires it (many shards on a single node, a huge number of concurrent requests per second, configuration mistakes, etc.).
In conclusion: test as indicated above and you get an "ideal" shard size as a starting point.
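A minimal sketch in Python (using requests; the node address, index name, documents and batch sizes are placeholders, and a recent ES version is assumed) of the test loop described above: one primary, no replicas, keep adding documents, time a representative query, and watch the shard size grow:

```python
import time
import requests

ES = "http://localhost:9200"    # assumed single test node
INDEX = "shard_size_test"       # hypothetical index name

# One primary shard, no replicas, so all documents land in a single shard.
requests.put(f"{ES}/{INDEX}", json={
    "settings": {"number_of_shards": 1, "number_of_replicas": 0}
})

for batch in range(100):
    # Bulk-index a batch of placeholder documents (NDJSON body).
    lines = []
    for i in range(1000):
        lines.append('{"index": {}}')
        lines.append('{"field": "value %d"}' % (batch * 1000 + i))
    requests.post(f"{ES}/{INDEX}/_bulk",
                  data="\n".join(lines) + "\n",
                  headers={"Content-Type": "application/x-ndjson"})

    # Time a representative query.
    start = time.time()
    requests.post(f"{ES}/{INDEX}/_search",
                  json={"query": {"match": {"field": "value"}}})
    elapsed = time.time() - start

    # Check the on-disk size of the single shard.
    shards = requests.get(f"{ES}/_cat/shards/{INDEX}",
                          params={"format": "json", "bytes": "b"}).json()
    print(f"batch {batch}: query {elapsed:.3f}s, shard {shards[0]['store']} bytes")
```

When the measured latency stops being acceptable, the shard size at that point is the practical limit for this hardware and this query mix.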

Primary/Replica Inconsistent Scoring

We have a cluster with 3 primary shards and 2 replicas per primary. The total doc count is the same for the primary/replica shards; however, we're getting 3 distinct scores for the same query/document. When we add preference = primary as a query parameter, we get consistent scores each time.
The only explanation we can think of is different DF counts between the primary/replicas. Where is the inconsistency between the primary/replica shards, and how does one go about fixing this? We're using 1.4.2.
EDIT:
We just reindexed the doctype we were querying, but there's still inconsistent scoring.
Primary and replica shards follow different "paths" when it comes to segment merging, meaning the number and size of the segments can differ between them. Each shard takes care of its own segments independently of the other shards.
This matters for score calculation because merging is the moment when documents that were deleted are actually removed. Until then, deleted documents are only marked as deleted (and filtered out of the query results after the query has already run). So merging can influence the statistics from which the score is calculated.
To be more specific, the total number of docs in a shard (numDocs) and the document frequency (docFreq) are used in the [IDF calculation](http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#idf(long, long)):
```java
return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
```
And this number of docs includes the deleted (marked as deleted, to be more precise) documents. Take a look also at this GitHub issue and Simon's comments regarding the same subject.
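For completeness, here is a minimal sketch in Python (using requests; the index name, query and cluster address are hypothetical, and an ES 1.x cluster is assumed, since _primary as a preference value was removed in later versions) of the workaround mentioned in the question, plus the more general trick of a stable custom preference string:

```python
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "myindex"              # hypothetical index name
query = {"query": {"match": {"title": "inconsistent scoring"}}}

# ES 1.x: run the query on primary shards only, so the deleted-document
# counts (and therefore IDF) always come from the same copies.
primary_only = requests.post(f"{ES}/{INDEX}/_search",
                             params={"preference": "_primary"},
                             json=query)

# Alternative: any stable custom preference string routes the same caller
# to the same shard copies every time, so scores are consistent per caller
# even though primaries and replicas may still differ from each other.
sticky = requests.post(f"{ES}/{INDEX}/_search",
                       params={"preference": "session-1234"},
                       json=query)

print(primary_only.json()["hits"]["max_score"],
      sticky.json()["hits"]["max_score"])
```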

Resources