How to stop the Elasticsearch coordinating node from merging and re-sorting results

For example, the ES cluster has 3 shards and a query wants to get 300 docs.
Normally, the coordinating node gets 300 docs from each shard, i.e. 3*300=900 docs in total, then sorts these 900 docs and returns the top 300.
How can I set up the query so that the coordinating node gets only 100 docs from each shard and returns 3*100=300 docs?

I am curious why you would want every single shard to return only an equally sized share/slice of the resulting hits, as it is very unlikely that the 300 most relevant/important hits are evenly distributed across all shards.
The coordinating node's task is not just to return 300 hits, but the 300 most relevant/important hits. By default hits are sorted by descending score (unless you specify different sorting criteria). Statically taking 100 hits from every single shard could result in a totally meaningless result list.
An example: for simplicity, assume that your index is made up of only 2 (primary) shards and contains documents about mobile phone news back in early 2007. It's very likely that you have many documents in your index about Windows, Nokia and Blackberry phones. Then, all of a sudden, the iPhone gets announced and articles start popping up. Let's further assume that shortly after the presentation of the phone, 100 very relevant articles about iPhones get published and indexed into your Elasticsearch index, and now you query for the best 100 hits about iPhones. With coordinating nodes "optimized" the way you are asking for, the first fifty documents would be retrieved from each of the two shards. As a consequence, you would very likely end up with only something like 60-70 of the relevant articles in your result set, while the other 30-40 very relevant hits are missing (and even worse, 30-40 rather irrelevant articles make it in, just because they mention the term iPhone once).
Actually, coordinating nodes are also "smart" and under certain conditions can skip shards entirely when it's guaranteed that they don't contain any matching document.
Furthermore, if you don't deal with big data and all your documents easily fit into a single shard, configure your index to be made up of 1 shard and the coordinating node does not need to do any merging at all.
If your use case does not rely on relevancy at all, you could think of organizing your data in different indices (rather than multiple shards within one index). Then you can query every single index independently for the first n hits and merge the results on the application side, as sketched below. But as this involves more network round trips, it might eventually even be slower.
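A minimal sketch of that "query each index independently and merge in the application" approach, assuming a local node and using the plain REST API via `requests`; the index names, hit counts and the query itself are made up for illustration.

```python
import requests

ES = "http://localhost:9200"
INDICES = ["news_2007_q1", "news_2007_q2", "news_2007_q3"]  # hypothetical indices
PER_INDEX_HITS = 100

def search_index(index, query):
    # Ask a single index for its own top hits.
    resp = requests.post(
        f"{ES}/{index}/_search",
        json={"query": query, "size": PER_INDEX_HITS},
    )
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]

query = {"match": {"title": "iphone"}}

# Fetch the top hits from every index separately ...
all_hits = []
for index in INDICES:
    all_hits.extend(search_index(index, query))

# ... then merge them on the application side. Note that scores are computed
# per index, so sorting them against each other is only meaningful if the
# indices have similar term statistics (or if you don't care about relevancy).
merged = sorted(all_hits, key=lambda h: h["_score"], reverse=True)[:300]
```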

Related

How to get better relevance without compromising performance and scalability, and avoid the sharding effect of Elasticsearch

Let's suppose I have a big index consisting of 500 million docs. By default ES creates 5 primary shards for the reasons below, and I go with the same setting.
Performance: searching a shard with fewer documents (100 million in my use case) takes less time than searching a single shard with a huge number of documents (500 million). It also allows operations to be distributed and parallelized across shards.
Horizontal scalability (HS): horizontally split/scale your content volume.
But when we search, by default the query just goes to 1 shard and returns the result. In this case relevance isn't accurate (as IDF is majorly impacted), and it might even return no result if my matching document is on another shard. This is called the sharding effect.
The above issue is explained in detail here, and there are 2 options below to avoid it, but I think both solutions have some cons:
1. Document routing: in this case all the documents will be on the same shard, which defeats the whole purpose of sharding.
2. dfs_query_then_fetch search type: there is a performance cost associated with it.
I am interested to know the following:
What does ES do by default? Or is there any config by which it can be controlled?
Is there another out-of-the-box solution which ES provides to avoid the sharding effect?
First of all, this part of your question is not accurate:
"But when we search, by default the query just goes to 1 shard and returns the result. In this case relevance isn't accurate (as IDF is majorly impacted), and it might even return no result if my matching document is on another shard. This is called the sharding effect."
The claim that the search just goes to 1 shard is false. The search request is sent to all shards (of course, or no one would use Elasticsearch!), but the score is computed on a per-shard basis. So yes, you can have an accuracy problem with multiple shards, but only if you have very few documents. With 500 million documents, accuracy will not be a problem (unless you make bad use of document routing; see here for more information).
So when you search for 10 results for a query, each shard returns its 10 best matches, then the results from the shards are aggregated by the coordinating node to give the best 10 results for the whole index.
You can use 5 shards without fearing any relevancy problem. But don't try to avoid the sharding effect! It is what makes Elasticsearch so cool :D
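For completeness, the dfs_query_then_fetch option mentioned in the question can be enabled per request; it is only worth its extra round trip in the rare case (small indices) where per-shard term statistics really do skew scores. A hedged sketch, assuming a local node and a made-up index and query:

```python
import requests

ES = "http://localhost:9200"

# search_type=dfs_query_then_fetch gathers global term frequencies first,
# so scores are computed from index-wide statistics instead of per shard.
resp = requests.post(
    f"{ES}/my_index/_search",
    params={"search_type": "dfs_query_then_fetch"},
    json={"query": {"match": {"title": "iphone"}}, "size": 10},
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```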

Resource usage with rolling indices in Elasticsearch

My question is mostly based on the following article:
https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index
The article advises against having multiple shards per node for two reasons:
Each shard is essentially a Lucene index; it consumes file handles, memory, and CPU resources
Each search request will touch a copy of every shard in the index. Contention arises and performance decreases when the shards are competing for the same hardware resources
The article advocates the use of rolling indices for indices that see many writes and fewer reads.
Questions:
Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
How does searching many small indices compare to searching one large one?
I should mention that in our particular case there is only one ES node, though of course generally applicable answers will be more useful to SO readers.
It's very difficult to spit out general best practices and guidelines when it comes to cluster sizing, as it depends on so many factors. If you ask five ES experts, you'll get ten different answers.
After several years of tinkering and fiddling around with ES, I've found that what works best for me is to always start small (one node, however many indices your app needs, and one shard per index), load a representative data set (ideally your full data set) and load test it to death. Your load testing scenarios should represent the real maximum load you're experiencing (or expecting) in your production environment during peak hours.
Increase the capacity of your cluster (add shards, add nodes, tune knobs, etc.) until your load tests pass, and make sure to increase your capacity by a few more percent in order to allow for future growth. You don't want your production to be fine now, you want it to be fine a year from now. Of course, it depends on how fast your data will grow, and it's very unlikely that you can predict with 100% certainty what will happen a year from now. For that reason, as soon as my load tests pass, if I expect large exponential data growth, I usually increase the capacity by another 50%, knowing that I will have to revisit my cluster topology within a few months or a year.
So to answer your questions:
Yes, if old indices are left open, they will consume resources.
Yes, the more indices you search, the more resources you will need in order to go through every shard of every index. Be careful with aliases spanning many, many rolling indices (especially on a single node).
This is too broad to answer, as it again depends on the amount of data we're talking about and on what kind of query you're sending, whether it uses aggregations, sorting and/or scripting, etc.
Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Yes.
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
Yes.
How does searching many small indices compare to searching one large one?
When ES searches an index it will pick one copy of each shard (be it replica or primary) and ask that copy to run the query on its own set of data. Searching a shard will use one thread from the search threadpool the node has (the threadpool is per node). One thread basically means one CPU core. If your node has 8 cores, then at any given time the node can search 8 shards concurrently.
Imagine you have 100 shards on that node and your query wants to search all of them. ES will initiate the search and all 100 shards will compete for the 8 cores, so some shards will have to wait some amount of time (microseconds, milliseconds, etc.) to get their share of those 8 cores. Having many shards means fewer documents on each and, thus, potentially a faster response time from each. But then the node that initiated the request needs to gather all the shards' responses and aggregate the final result, so the response will only be ready when the slowest shard finally responds with its set of results.
On the other hand, if you have a big index with very few shards, there is not much contention for those CPU cores. But since each shard has a lot of work to do individually, it can take more time to return its individual result.
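If you want to see that per-node search threadpool in action while load testing, the _cat API exposes it. A small sketch, assuming a local node and a recent Elasticsearch version (which allows filtering the threadpool by name):

```python
import requests

ES = "http://localhost:9200"

# Columns: node, pool name, threads currently busy, queued search requests,
# and requests rejected because the queue filled up.
resp = requests.get(
    f"{ES}/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(resp.text)
```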
When choosing the number of shards, many aspects need to be considered. But as a rough guideline, yes, 30GB per shard is a good limit. This won't work for everyone and for every use case, though, and the article fails to mention that. If, for example, your index uses parent/child relationships, those 30GB per shard might be too much and the response time of a single shard can be too slow.
You took this out of context: "The article advises against having multiple shards per node". No, the article advises you to think about how to structure your indices and shards beforehand. One important step here is testing. Please test with your data before deciding how many shards you need.
You mentioned "rolling indices" in the post, and I assume you mean time-based indices. In this case, one question is about the retention period (for how long you need the data). Based on the answer to this question you can determine how many indices you'll have, and knowing how many indices you'll have gives you the total number of shards.
Also, with rolling indices, you need to take care of deleting the expired indices. Have a look at Curator for this.
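Curator is the usual tool for this, but the idea can be sketched by hand: list the time-based indices, parse the date out of each name, and delete the ones older than your retention window. The naming pattern, retention period and endpoint below are all assumptions for illustration.

```python
from datetime import datetime, timedelta
import requests

ES = "http://localhost:9200"
PATTERN = "logs_"            # hypothetical monthly indices: logs_2015_01, ...
RETENTION = timedelta(days=180)

# List the matching index names, one per line.
resp = requests.get(f"{ES}/_cat/indices/{PATTERN}*", params={"h": "index"})
for index in resp.text.split():
    month = datetime.strptime(index[len(PATTERN):], "%Y_%m")
    if datetime.utcnow() - month > RETENTION:
        # Dropping a whole expired index is far cheaper than deleting documents.
        requests.delete(f"{ES}/{index}")
```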

Primary/Replica Inconsistent Scoring

We have a cluster with 3 primary shards and 2 replicas per primary. The total doc count is the same for the primary/replica shards; however, we're getting 3 distinct scores for the same query/document. When we add preference = primary as a query parameter, we get consistent scores each time.
The only explanation we can think of is different DF counts between the primary/replicas. Where is the inconsistency between the primary/replica shards, and how does one go about fixing this? We're using 1.4.2.
EDIT:
We just reindexed the doctype we were querying, but there's still inconsistent scoring.
Primary and replica shards follow different "paths" when it comes to segment merging, meaning the number and size of the segments can differ between them. Each shard takes care of its own segments independently of the other shards.
This matters for score calculation because merging is the moment when documents that were deleted are actually removed. Until then, deleted documents are only marked as deleted (and filtered out of the query results after the query has already run). So merging can influence the statistics from which the score is calculated.
To be more specific, the total number of docs in a shard is used for the [IDF calculation](http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#idf(long, long)) together with the document frequency (docFreq):
return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
And this number of docs includes the deleted (marked as deleted, to be more precise) documents. Take a look also at this github issue and Simon's comments regarding the same subject.
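Two hedged sketches related to the above, assuming a local node and a made-up index name: first, inspecting the per-copy segment statistics to confirm that primaries and replicas really do carry different deleted-document counts; second, the common workaround of pinning queries to a consistent set of shard copies with a stable preference string.

```python
import requests

ES = "http://localhost:9200"
INDEX = "my_index"

# 1) Primaries and replicas merge independently, so their segment layout and
#    deleted-document counts usually differ between copies of the same shard.
resp = requests.get(
    f"{ES}/_cat/segments/{INDEX}",
    params={"v": "true", "h": "shard,prirep,segment,docs.count,docs.deleted"},
)
print(resp.text)

# 2) Work around the score jitter by routing a given user's searches to the
#    same shard copies every time, via a stable preference value.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"preference": "user_42"},   # hypothetical session/user identifier
    json={"query": {"match": {"title": "phone"}}, "size": 10},
)
print([hit["_score"] for hit in resp.json()["hits"]["hits"]])
```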

ElasticSearch Scale Forever

ElasticSearch Community:
Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.
Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.
Assume I do not need to edit any documents once they are indexed. This is strictly for searching.
Referencing the image above, there seem to be some documents which point to 'rolling indexes' ref1 ref2 ref3, whereby I may create a single index (each index named tweets1 -> N) on the fly. When one index fills up, I can simply add a new machine with a new index, and add it to the same cluster and alias for searching.
Does this architecture hold water in production?
Are there any long term ramifications to this 'rolling index' architecture as opposed to predicting a shard count and scaling within that estimate?
A shard in elasticsearch is just a lucene index. An elasticsearch index is just a collection of lucene indices (shards). Given that, for capacity planning in your situation you simply need to figure out how many documents you can store in an index with only one shard and still get the query performance you want.
It is the underlying lucene indices that use up resources. Based on how your documents are indexed within the lucene indices, there is a finite number of shards that any single node in your cluster will be able to handle. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.
It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
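Hiding the numbered indices behind one alias is a single call to the _aliases API. A minimal sketch, with made-up index and alias names and a local endpoint:

```python
import requests

ES = "http://localhost:9200"

requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"add": {"index": "tweets_1", "alias": "tweets"}},
        {"add": {"index": "tweets_2", "alias": "tweets"}},
        # Searches then go to the alias and fan out to every backing index,
        # e.g. POST /tweets/_search
    ]
})
```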
Once you figure out how many documents you can store per shard while keeping the query performance you want, decide how many shards per index you want, multiply those numbers, and cap the index at that number of documents in your code. Once you reach the cap you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):
$index = 'file_' . (int)($fid / $docsPerIndex);
Note that I am using index templates so that a new index is created automatically, without me having to roll over manually when the cap is reached.
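A rough sketch of that index-template idea, using the legacy _template API of the ES 1.x era (newer versions use _index_template with "index_patterns" instead); the template name, pattern, shard count and endpoint are assumptions:

```python
import requests

ES = "http://localhost:9200"

requests.put(f"{ES}/_template/file_indices", json={
    "template": "file_*",                # applied to file_0, file_1, ...
    "settings": {"number_of_shards": 5},
    # Mappings and aliases would normally go here too, so that indexing into a
    # not-yet-existing index like file_7 creates it fully configured.
})
```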
One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.
You need to have enough nodes in your cluster to parallelize the query so that it can easily search across all indices and still respond quickly.
or
You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.
Keep in mind that if you have sequential or predictable ids, then Elasticsearch can perform id-based queries efficiently without actually having to query the whole cluster. If you let ES automatically assign ids (assuming you are using ES >= 1.4.0), it will already use predictable ids (Flake ids). This also speeds up indexing. Random ids create a worst-case scenario.
If your queries are going to be time based, then under this scheme every query has to search the entire set of indices. For time-based data you want to roll your indices over based on some amount of time (e.g. each day or month, depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. By doing so you can narrow the set of indices you have to search at query time based on the requested search time range, as in the sketch below.
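A sketch of narrowing time-based indices at query time; the monthly naming scheme, field name and endpoint are assumptions for illustration:

```python
from datetime import date
import requests

ES = "http://localhost:9200"

def monthly_indices(start, end, prefix="tweets"):
    """Return the index names covering [start, end], e.g. tweets_2015_01."""
    names, year, month = [], start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year}_{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names

indices = monthly_indices(date(2015, 1, 10), date(2015, 3, 5))
# Only the indices that can contain matching documents are searched.
resp = requests.post(
    f"{ES}/{','.join(indices)}/_search",
    json={"query": {"range": {"created_at": {"gte": "2015-01-10", "lte": "2015-03-05"}}}},
)
print(resp.json()["hits"]["total"])
```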

How does elasticsearch handle skip requests (from/size parameter)

I am deploying an approach which uses the from parameter a lot. I wish to understand how 'skip' works in Elasticsearch, or in other such systems in general, so I can judge what performance loss it incurs.
It depends on the search type. If you use the default, i.e. query then fetch, then to fetch page 20 with size 10 (from: 190, size: 10), Elasticsearch will:
ask each shard for the ids and relevance scores of its top 200 documents (which are selected from all docs matching the query, so this means searching the whole shard, but that is the same as when fetching only the first page)
merge the results, sorting by relevance, skip the top 190 hits of the merged list and take the 10 that follow
fetch the actual docs (i.e. those 10) from the relevant shards
It means that if you have e.g. 3 primary shards, the Elasticsearch nodes need to exchange information about 3 * 200 = 600 docs. There are some optimizations that make obtaining particularly 'distant' pages more efficient, but in a nutshell, you need to process more and more documents each time you fetch the next page.
If your use case involves going through a result set sequentially, consider scrolling instead.
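A hedged sketch contrasting the two approaches: deep from/size pagination versus the scroll API for sequential traversal. The index name, query and endpoint are assumptions, and the exact scroll request syntax varies a little between Elasticsearch versions.

```python
import requests

ES = "http://localhost:9200"
INDEX = "my_index"
QUERY = {"match_all": {}}

# Deep pagination: every shard still has to score and sort from+size hits,
# so the per-request work grows with the page number.
page_20 = requests.post(
    f"{ES}/{INDEX}/_search",
    json={"query": QUERY, "from": 190, "size": 10},
).json()["hits"]["hits"]

# Sequential traversal: open a scroll context and keep pulling fixed-size batches.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"scroll": "1m"},
    json={"query": QUERY, "size": 100},
).json()
while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        pass  # process each hit here
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "1m", "scroll_id": resp["_scroll_id"]},
    ).json()
```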
