Elasticsearch - refresh interval of one second

I know how refresh works and that it happens every second by default. However, what confuses me is this:
Does it mean that data of any size will appear in search after exactly one second, or does it mean that it will take at least one second for the searcher to see the new documents?
The documentation says: "The default refresh interval is one second for indices that receive one or more search requests in the last 30 seconds." So it doesn't seem to apply to all indices. Can someone explain what "for indices that receive one or more search requests in the last 30 seconds" really means, and what happens to the other indices that didn't receive a search request in the last 30 seconds?

Really nice question, let me try to explain.
1. Does it mean any size of data will appear in search after exactly one second or it means it will take at least one second for the searcher to see the new documents?
Answer: The size of the data has nothing to do with it. Refresh is simply a background process in Elasticsearch that writes data from the in-memory buffer (where it is not available to searches) to segments (hopefully you know what segments are in ES and Lucene), so that it becomes available for searches.
2. The default refresh interval is one second for indices that receive one or more search requests in the last 30 seconds.
Answer: This is a smart optimization done by Elasticsearch to reduce the overhead of refresh (explained above). If your indices didn't get any search request in the last 30 seconds, there is no need for an explicit refresh, since you only get to see the latest data (made available by refresh) when you search. Hence, on indices that have not received any search request in the last 30 seconds, ES can skip the refresh, even though their refresh interval is 1 second.
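To make the idea concrete, here is a toy Python model of the behaviour described above. It is only a sketch, not Elasticsearch's real implementation: the `ToyShard` class and its methods are hypothetical names, and the only values taken from the documentation are the 1 second interval and the 30 second idle window. It also assumes, as the documentation describes for idle shards, that the first search on an idle shard triggers the pending refresh itself.

```python
from dataclasses import dataclass

SEARCH_IDLE_AFTER_S = 30.0  # idle window quoted in the documentation
REFRESH_INTERVAL_S = 1.0    # default refresh interval quoted in the documentation


@dataclass
class ToyShard:
    """Hypothetical stand-in for a shard; not an Elasticsearch data structure."""
    last_search_at: float = 0.0
    pending_docs: int = 0      # indexed, but not yet visible to searches
    searchable_docs: int = 0   # visible to searches

    def index(self, n: int) -> None:
        # New documents land in the in-memory buffer first.
        self.pending_docs += n

    def _refresh(self) -> None:
        # A refresh makes everything indexed so far searchable.
        self.searchable_docs += self.pending_docs
        self.pending_docs = 0

    def _is_search_idle(self, now: float) -> bool:
        return now - self.last_search_at >= SEARCH_IDLE_AFTER_S

    def scheduled_tick(self, now: float) -> bool:
        """Called every REFRESH_INTERVAL_S; returns True if a refresh ran."""
        if self._is_search_idle(now):
            return False  # nobody is searching, so the refresh is skipped
        self._refresh()
        return True

    def search(self, now: float) -> int:
        """Returns how many documents this search can see."""
        if self._is_search_idle(now) and self.pending_docs:
            self._refresh()  # the search on an idle shard does the refresh
        self.last_search_at = now
        return self.searchable_docs


shard = ToyShard()
shard.index(5)
print(shard.scheduled_tick(now=1.0))   # searched recently: refresh runs
shard.index(3)
print(shard.scheduled_tick(now=45.0))  # idle for 45 s: refresh is skipped
print(shard.search(now=45.0))          # search refreshes first, sees all 8
```

In this model the skipped refresh costs nothing for readers, because the next search pays for it instead. That is also why the first search after a long quiet period can be slower than the following ones.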

Related

Why is ElasticSearch index searchable when refresh_interval is set to -1 on initial data upload?

I'm performing a large upload of data to an empty index.
This article suggests setting "refresh_interval=-1" and "number_of_replicas=0" to increase upload performance, and then restoring the original values afterwards.
The interesting thing is that even if I don't restore them, I can still send queries to the newly created index and get results.
I'd like to know why that is and what I got wrong. (My expectation was that I should get zero results because indexing is disabled.)
One more thing I'd like to understand: if I set refresh_interval back to the original value, do I need to execute the /_refresh operation?
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds. You can change this default interval using the
index.refresh_interval setting.
So, according to the documentation, when you send a search request to such an index, a refresh is triggered along with it. That is why you can still search your data, although the first search may be very slow or may miss some data. It is better to have a refresh_interval set if you keep indexing new data into your indices.
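For reference, the bulk-load sequence from the question can be sketched as a list of REST calls. This is a minimal sketch under some assumptions: `bulk_load_plan` is a hypothetical helper, `my-index` is a placeholder index name, and restoring one replica assumes the index used the default replica count before. The code only builds the requests; it does not send them. A JSON `null` for a setting resets it to its default, and the final explicit refresh is optional and only matters if you want the documents visible right away.

```python
import json
from typing import List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Step = Tuple[str, str, Optional[dict]]


def bulk_load_plan(index: str, replicas_after: int = 1) -> List[Step]:
    """Hypothetical helper: the settings changes around a large upload."""
    settings_path = f"/{index}/_settings"
    return [
        # 1. Before the upload: disable periodic refresh and replicas.
        ("PUT", settings_path,
         {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}),
        # ... the bulk indexing requests would go here ...
        # 2. After the upload: None becomes JSON null, which resets
        #    refresh_interval to its default; replicas are restored.
        ("PUT", settings_path,
         {"index": {"refresh_interval": None,
                    "number_of_replicas": replicas_after}}),
        # 3. Optional: force one refresh instead of waiting for the next one.
        ("POST", f"/{index}/_refresh", None),
    ]


for method, path, body in bulk_load_plan("my-index"):
    print(method, path, json.dumps(body) if body is not None else "")
```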

Elasticsearch refresh interval when index is not searched

The documentation on refreshes says:
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.
What happens if the index was not queried in the last 30 seconds? When does it get refreshed? If, for example, I write to an index, don't search it for a long time and then search it, will I get up-to-date results? When I write, I get the results immediately on the first search, so I seem to be misunderstanding something.

Scroll time increment effect on Elastic Search

I am working on a project that uses Elasticsearch and queries it to fetch member information. It has 3 million records.
I am running a campaign for 2 million users, and the user data is stored in Elasticsearch 6.2. I query ES and fetch the records in batches (50 records at a time) using scroll. I also want to keep the search context alive for 1 day, so that if the campaign process fails for any reason, I can resume the campaign from where it stopped instead of restarting it from the beginning. I am also saving the scroll ID and will use it to resume the campaign.
While testing, I found that CPU utilization increased by 50% (ES config: 2 nodes with 4 shards running on AWS, instance type: i3.xlarge.elasticsearch) and that it remains consistently at 50%.
Is there any relation between CPU utilization and keeping the search context for 1 day? BTW, campaigns take 6 hours to finish.
From the documentation
Normally, the background merge process optimizes the index by merging
together smaller segments to create new bigger segments, at which time
the smaller segments are deleted. This process continues during
scrolling, but an open search context prevents the old segments from
being deleted while they are still in use. This is how Elasticsearch
is able to return the results of the initial search request,
regardless of subsequent changes to documents.
So with your scroll cursor expiration set to 24h, it seems you prevent the old, already merged segments from being deleted for as long as the context stays open, which increases the load on your shards.
Later in the documentation there is an explanation of how to clear your scroll cursor:
Search context are automatically removed when the scroll timeout has
been exceeded. However keeping scrolls open has a cost, as discussed
in the previous section so scrolls should be explicitly cleared as
soon as the scroll is not being used anymore using the clear-scroll
API:
You should try to clear your cursor after a campaign is completed.
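As a rough illustration, the clear-scroll call could be built like this with Python's standard library. It is a sketch under assumptions: `build_clear_scroll_request` is a hypothetical helper, and the base URL and scroll ID in the example are placeholders. It only constructs the request; passing it to `urllib.request.urlopen` would actually send it.

```python
import json
from typing import List
from urllib.request import Request


def build_clear_scroll_request(base_url: str, scroll_ids: List[str]) -> Request:
    """Hypothetical helper: builds DELETE /_search/scroll for the given IDs."""
    body = json.dumps({"scroll_id": scroll_ids}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/_search/scroll",
        data=body,
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )


# Placeholder values; use the scroll ID you saved for the campaign.
req = build_clear_scroll_request("http://localhost:9200", ["saved-scroll-id"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling this as soon as the campaign finishes (after roughly 6 hours in your case) frees the search context without waiting for the 1 day timeout.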

Elastic search - get last second changes

How do I get all changes that were indexed in my Elasticsearch cluster within the last second?
I've tried adding a timestamp and querying on it, but indexing some of the items took a few seconds, and therefore these items were missing from the result (I didn't get them in the next second either, since the timestamp refers to the start of the indexing process).

Getting an index's item count with ElasticSearch

I am writing some code where we are inserting 200,000 items into an ElasticSearch index.
Whilst this works fine, when we get a count of items in the index to ascertain everything went in, we are not getting the same number. However, if we wait a second or two, the count is correct.
Therefore, is there a programmatic way we can get a real count from ElasticSearch without having to sleep or similar?
Newly indexed records become visible in search results only after the Refresh operation. Refresh is called automatically with frequency specified by index.refresh_interval setting, which is 1s by default. When writing elasticsearch tests, it's customary to call refresh after indexing to make sure that all indexed records are available in searches. However, excessive refresh calls (after each record, for example) in production code might hamper the elasticsearch indexing performance.
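A minimal sketch of the "refresh once, then count" pattern, using only Python's standard library. The helper name `refresh_then_count`, the base URL and the index name `items` are assumptions for illustration. The code only builds the two requests; sending them in this order with `urllib.request.urlopen` is left to the caller.

```python
from typing import Tuple
from urllib.request import Request


def refresh_then_count(base_url: str, index: str) -> Tuple[Request, Request]:
    """Hypothetical helper: one explicit refresh, followed by a count."""
    root = f"{base_url.rstrip('/')}/{index}"
    # Makes everything indexed so far visible to searches and counts.
    refresh = Request(f"{root}/_refresh", method="POST")
    # The number of documents is in the "count" field of the JSON response.
    count = Request(f"{root}/_count", method="GET")
    return refresh, count


refresh_req, count_req = refresh_then_count("http://localhost:9200", "items")
for req in (refresh_req, count_req):
    print(req.get_method(), req.full_url)
```

The point of the design is to refresh once after the whole batch of 200,000 items rather than after each record, which keeps the indexing cost mentioned above low.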
