Scroll time increment effect on Elastic Search - elasticsearch

I am working on a project using ElasticSearch and querying it to fetch the member information. It has 3 million records.
I am running a campaign for 2 million users and the user data is present on elasticsearch6.2. I query the ES and fetches the records in batches (50 records at a time) using the scroll. Also, I want to keep the SEARCH context for 1 day because if the campaign running process fails due to any reason, I can resume the campaign from where it was stopped. In this way, I will escape from starting the campaign again from starting. I am also saving the scrollID and will use it to resume campaign.
While testing I found CPU Utilization increased by 50% (ES config: 2 nodes with 4 shards running on aws, Instance Type:i3.xlarge.elasticsearch) and its CPU Utilization remains consistent to 50%.
Is there any relation between CPU Utilization and keeping the search context for 1day. BTW campaigns take 6 hours to finish.

From the documentation
Normally, the background merge process optimizes the index by merging
together smaller segments to create new bigger segments, at which time
the smaller segments are deleted. This process continues during
scrolling, but an open search context prevents the old segments from
being deleted while they are still in use. This is how Elasticsearch
is able to return the results of the initial search request,
regardless of subsequent changes to documents.
So with your scroll cursor expiration to 24h it seems you forbid Lucene to merge your segments, increasing to load of your shards.
Later in the documentation there is an explanation on how to clear your scroll cursor :
Search context are automatically removed when the scroll timeout has
been exceeded. However keeping scrolls open has a cost, as discussed
in the previous section so scrolls should be explicitly cleared as
soon as the scroll is not being used anymore using the clear-scroll
API:
You should try to clear your cursor after a campaign is completed.

Related

elasticsearch - refresh interval of one second

I am aware of how refresh works and refresh happens every second by default. However, what disconnects me more here is
Does it mean any size of data will appear in search after exactly one second or it means it will take at least one second for the searcher to see the new documents .
From Documentation, "The default refresh interval is one second for indices that receive or more search requests in the last 30 seconds." It doesnt seem apply for all the indices, can someone shed more details about this what it really mean by for indices that receive or more search requests in the last 30 seconds in the context of what happens to other indices which didnt receive the search req in last 30 sec
Really nice question, let me try to explain to you.
1. Does it mean any size of data will appear in search after exactly one second or it means it will take at least one second for the searcher to see the new documents.
Answer: Size of data has got nothing to do here, it's simply a background process in elasticsearch which commits data from im-memory(which is not available to searches) to segments(Hope you know what segments in ES and Lucene), so that it's available for searches.
2.The default refresh interval is one second for indices that receive or more search requests in the last 30 seconds.
Answer: This is the smart optimization done by elasticsearch to reduce the overhead of refresh(explained earlier), if your indices didn't get any search request in last 30 seconds, so no need to explicit refresh(as only when you search, you will get to see the latest data, available by using refresh), Hence on indices which have not got any search requests in last 30 seconds, ES can skip the refresh on those indices, even their refresh interval is 1 second.

How to stop auto reindexing in elastic search if any update happens?

I am having a big use case with elasticsearch which has millions of records in it.
I will be updating the records frequently, say 1000 records per hour.
I don't want elastic search to reindex for my every update.
I am planning to reindex it on weekly basis.
Any Idea how to stop auto-reindex while update ?
Or any other better suggestion is welcome . Thanks in advance :)
Elasticsearch(ES) update an existing doc in below manner.
1. Deletes the old doc.
2. Index a new doc with the changes applied to it.
According to ES docs :-
In Elasticsearch, this lightweight process of writing and opening a
new segment is called a refresh. By default, every shard is refreshed
automatically once every second. This is why we say that Elasticsearch
has near real-time search: document changes are not visible to search
immediately, but will become visible within 1 second.
Note that these changes will not be visible/searchable until ES commits/flush these changes to disk cache and disk,which is control by soft-commit(es refresh interval, which is by default 1 second) and hard-commit(which actually write the document to disk, which prevent it being lost permanently and costly affair than a soft-commit).
You need to make sure, you tune your ES refresh interval, and do proper load testing, as setting it very low and very high has its own pros and cons.
for example setting it very less for example 1 second and if you have too many updates happening than it has a performance hit and it might crash your system. Also setting it very high for example 1 hour means you now don't have a NRT(near real time search) and during that time if your memory could contain again millions of doc(depending on your app) and can cause out of memory error, also committing on such a large memory is a very costly affair.

elastic query returns same results after insert

I'm using elasticsearch.js to move a document from one index to another.
1a) Query index_new for all docs and display on the page.
1b) Use query of index_old to obtain a document by id.
2) Use an insert to index_new, inserting result from index_old.
3) Delete document from index_old (by id).
4) Requery index_new to see all docs (including the new one). However, at this point, it returns the same list of results as returned in 1a. Not including the new document.
Is this because of caching? When I refresh the whole page, and 1a is triggered, the new document is there.. But not without a refresh.
Thanks,
Daniel
This is due to the segments merging and refreshing that happens inside the elasticsearch indexes per shard and replica.
Whenever you are writing to the index wou never write to the original index file but rather write to newer smaller files called segment which then gets merged into the bigger file in background batch jobs.
Next question that you might have is
How often does this thing happen or how can one have a control over this
There is a setting in the index level configuration called refresh_interval. It can have multiple values depending upon the kind of strategy that you want to use.
refresh_interval -
-1 : To stop elasticsearch handle the merging and you control at your end with the _refresh API in elasticsearch.
X : x is an integer and has a value in seconds. Hence elasticsearch will refresh all the indexes every x seconds.
If you have replication enabled into your indexes then you might also experience in result value toggling. This happens just because the indexes have multiple shard and a shard has multiple replicas. Hence different replicas have different window pattern for refreshing. Hence while querying the query actually routes to different shard replicas in the meantime which shows different states in the time window.
Hence if you are using a setting to set periods of refresh interval then assume to have a consistent state in next X to 2X seconds at max.
Segment Merge Background details
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/indices-update-settings.html

Issues with ElasticSearch for real-time geo queries

I'm building a service that will allow users to search for other users who are nearby, based on GPS coordinates. I've tried using ElasticSearch's geo spatial indexes. When a user signs in, he submits his GPS location to an ElasticSearch geo index. Other users periodically poll ElasticSearch, querying for new documents that contain GPS coordinates within a few hundred meters.
The problem is that ElasticSearch either doesn't update its index fast enough, or it caches its results, making it unsuitable for retrieving real-time results. I've tried disabling the cache with index.cache.filter.max_size=-1 and passing "_cache=false" with every query. ElasticSearch still returns stale results when polling with the same query, and it can return stale results for up to a few minutes.
Any idea on what could be happening? Maybe it's because I'm keeping the same connection open during polling, and ElasticSearch caches results for each connection? Still, the results can be out of date with subsequent requests.
Elasticsearch results don't become immediately available for search. They are accumulated in a buffer and become available only after operation called refresh. In other words, search is not real time, but "near real time" operation ("near" is because refresh is called every second by default). Please also note that get operation is real-time - you can get document immediately after it is indexed.
While you can force refresh process after each document or make it more often, it's not the best solution for your problem because very frequent refreshing can significantly reduce search and indexing performance. Instead, I would advise you to check Elasticsearch percolators, which were added exactly for the use cases such as yours.

Getting an indexes item count with ElasticSearch

I am writing some code where we are inserting 200,000 items into an ElasticSearch index.
Whilst this works fine, when we get a count of items in the index to ascertain everything went in, we are not getting the same number. However, if we wait a second or two, the count is correct.
Therefore, is there a programmatic way we can get a real count from ElasticSearch without having to sleep or similar?
Newly indexed records become visible in search results only after the Refresh operation. Refresh is called automatically with frequency specified by index.refresh_interval setting, which is 1s by default. When writing elasticsearch tests, it's customary to call refresh after indexing to make sure that all indexed records are available in searches. However, excessive refresh calls (after each record, for example) in production code might hamper the elasticsearch indexing performance.

Resources