How does elasticsearch handle skip requests (from/size parameter) - elasticsearch

I am deploying an approach which uses from parameter a lot of times. I wish to understand how 'skip' works in elasticsearch or other such systems in general to judge what performance lost does it incur.

It depends on search type. If you use the default, i.e. query then fetch, then to fetch page 20 with size 10 (from: 190, size: 10), elasticsearch will:
ask each primary shard for ids and relevance scores of top 200 documents (which are selected from all docs matching the query, so this means searching the whole index, but this is the same as with fetching only the first page)
merge the results, sorting by relevance, and skip 190 top hits of such merged list, taking those 10 that follow
fetch actual docs (i.e. 10 of them) from relevant shards
It means that if you have e.g. 3 primary replicas, then elasticsearch nodes need to exchange information about 3 * 200 = 600 docs. There are some optimizations to make obtaining particularly 'distant' pages more efficient, but in a nutshell, you need to process more and more documents each time you fetch next page.
If your use case involves going through a result set sequentially, consider scrolling.

Related

Elastic app search Bulk Indexing Maximum is 100, how to index 1 million documents?

I am in a real pain with Elastic App Search, the indexing limit for bulk index is 100 according to the docs:
https://www.elastic.co/guide/en/app-search/current/limits.html
I was trying to create all the promises and then do promise.all(allPromises), but it's failing to index everything, and the response of when this fails still return 200, and you have to loop over:
res.data (all the 100 documents array), and look if they have error field.
Is there any solution to index lot of document fast? Because indexing 1 million with loop to await between every 100 batch size query is extremely slow.
Unfortunately the limit of 100 makes it a slow operation. We are indexing 1.1 million documents and our solution was to slice that our into ten segments and run ten processes in parallel to decrease the time that it takes.
Since reading from the index is very quick we have a separate job that validates the information in App Search matches our source data and flags anything out of order. So I don't check for errors on a full import or update so that part goes as fast as possible. We only see around a 1% failure rate on bulk imports.
I should note that part of the reason for the second piece is we have on occasion found the search index can get out of alignment with the documents and fields we have, so validation seemed like a good idea.

Elastic Search force merging and merging(to do or not to do force merge)

I have a scenario in which we have indices for each month, and each document has a field, say expiry date, and according to that expiry date, documents will be deleted when they reach their expiry. One-month-old index will be moved to a warm node from a hot node(All my queries below will be pertaining to the indexes that are on warm nodes).
Now I understand Elastic search will merge the segments as needed.
here is my first question, how will elastic search determine that now is the need to merge the segments?
I have come across a property index.merge.policy.expunge_deletes_allowed which has a default value of 10%, does this property dictate when the merging will happen? And it says 10% deleted document, what does that exactly mean? let's suppose if a segment has 100 documents and I deleted 11 of them(that happens to be on the same segment) does that mean the default limit of 10% has been met?
Coming back to the scenario when my documents get deleted at some point there will come a time when all the documents in an index get deleted. What will segments of that index look like then? will it have 0 segments or just 1 to hold index metadata?
Another question regarding the force merge is if I happen to choose force merge to get rid of all the deleted documents from the disk and if force merging resulted in a segment of size greater than 5 GB so as written here. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html#forcemerge-api-desc
Snippet :
Force merge should only be called against an index after you have finished writing to it. Force merge can cause very large (>5GB) segments to be produced, and if you continue to write to such an index then the automatic merge policy will never consider these segments for future merges until they mostly consist of deleted documents. This can cause very large segments to remain in the index which can result in increased disk usage and worse search performance.
When will my segment(greater than 5GB will get merged automatically if at all?) as it says it will when it consists mostly of deleted documents? That's vague what does mostly mean here? what's the threshold?
Another question is, it is suggested that force merge should only be done on indexes that are read-only. Why is that? how does it degrade performance? coming back to the scenario I will have some updates and new documents coming on my indexes on warm nodes even after I force merge them, but the frequency of those updates and new documents will be very less(we can say less than 5% of the documents will be updated, and less than and maybe a couple of hundred new documents could be added to those indexes).
Also what if I am force-merging 4 450GB indexes( each with 16 shards) in parallel, how will it affect my searching speed? I read somewhere that by default each force merge request is executed in a single thread and that too is throttled if need be? does that mean if search requests increases the merging will be paused?
Thank you for your patience and time.

How to let elasticsearch coordinate node don't merge and resort

For example, the ES cluster has 3 shards, a query wants to get 300 docs.
Normally, the coordinate node will get 300 docs from each shard, that's 3*300=900 docs in total, then coordinate node sort these 900 docs and return top 300 docs.
How can I set the query, let coordinate node get 100 docs from each shard and return 3*100=300 docs?
Am curious why you would like every single shard to only return an equally sized share/slice of the resulting hits, as it is very unlikely that the 300 most relevant/important hits are evenly distributed across all shards.
The coordinating node's task is not just to return 300 hits, but the 300 most relevant/important hits. By default hits are sorted by descending score (unless you specify a different sorting criteria). Statically considering 100 hits from every single shard could result in a total meaningless result list.
An example: for simplicity, assume that your index is only made up of 2 (primary) shards and you contains documents about mobile phone news back in early 2007. It's very likely that you have many documents in your index about Windows, Nokia and Blackberry phones. And then, all of a sudden the iPhone got announced and articles start popping up. Let's further assume that short ofter the presentation of the phone there have been 100 very relevant articles about iPhones been published and indexed in your Elasticsearch index and now you are querying for the best 100 hits about iPhones. With coordinating nodes "optimized" the way you are asking for, the first fifty documents would get retrieved from both shards. As a consequence it will be very likely that you only end up having something like 60-70 of the relevant articles in your result set, and the other 30-40 very relevant hits are missing (and even worse, 30-40 articles are rather irrelevant and just made it in, because they mentioned the term iPhone once).
Actually, the coordinating nodes are also "smart" and under certain conditions can skip shards when it's guaranteed that they they don't contain any matching document.
Furthermore, if you don't deal with big data and all your documents easily fix into a single, configure your index to be made up of 1 shard and the coordinating node does not need to do any merging.
If your use-case does not rely on relevancy at all, you could think of organizing your data in different indices (rather than multiple shards within an index). Then you can query every single index independently for the first n hits, and merge the results on application side. But as this involves more network roundtrips, it eventually might be even slower.

How to get better relevance without compromising on performance, scalability and avoid the sharding effect of Elasticsearch

Let's suppose I have a big index, consists 500 million docs and by default, ES creates 5 primary shards for below reasons and I also go with the same setting.
Performance:- There will be less time to search in a shard with less no of documents(100 million in my use case) than in just 1 shard with a huge number of documents(500 million). Also, allows to distribute and parallelize operations across shards.
Horizontal scalability(HS) :- horizontally split/scale your content volume.
But when we search by default it just goes to 1 shard and gives the result. in this case, relevance isn't accurate(as idf be majorly impacted) and also it might even not give any result if my matched document is on another shard. and its called as The Sharding Effect.
Above issue is explained in details here and there are below 2 options to avoid this issue but I think both the solutions have some cons :-
1. Document routing: I this case all the documents will be on the same shards which lose the whole purpose of sharding.
2. dfs_query_then_fetch search type: there is performance cost associated with it.
I am interested to know below:
What ES does by default? or is there is any config by which it can be controlled?
Is there is other Out of the box solution which ES provides to avoid the sharding effect?
first of all this part of your question if not accurate :
But when we search by default it just goes to 1 shard and gives the
result. in this case, relevance isn't accurate(as idf be majorly
impacted) and also it might even not give any result if my matched
document is on another shard. and its called as The Sharding Effect.
The bold part is false. The search request is sent to all shards ( of course, or no one would use elasticsearch !) but the score is computed on shard basis. So yes you can have an accuracy problem with multiple shards but only if you have very few documents. With 500 million the accuracy will not be a problem ( unless you u make a bad usage of document routing see here for more informations
So when you search for 10 results for a query, each shard return the 10 best matches for the query, then the results from the shards are aggregated by the coordination node to give the best 10 results for the whole index.
You can use 5 shards without fearing any relevancy problem. But don't try to avoid sharding effect! It is what makes elasticsearch so cool :D

Performance issues using Elasticsearch as a time window storage

We are using elastic search almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in the ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in the elastic search for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute for each machine and searching using queries like the one above pratically continuously.
We are having a lot of trouble indexing and retrieving these documents, we are not getting a good throughput volume of documents being indexed and returned by ES. We can't get even 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on out part. As I told before, we used Elasticsearch as a sort of a cache, to look for documents inside a 30 minutes time window. We looked for documents in elasticsearch whose content matched some query, and whose insert date was within some range. Elastic would then return us the full document json (which had a whole lot of data, besides the indexed content). Our configuration had elastic indexing the document json field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.
TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some setting that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY, note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background, then back to its default after the bulk operation restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts and the default value is 10.
Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. Its URL in your EDIT was not hyperlink - it's at http://sematext.com/spm . "Indexing" graphs will show how changing of the merge-related settings affects performance.
I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.

Resources