Why does Elasticsearch perform indexing every 'n' seconds when there are no writes? - elasticsearch

I have a basic question regarding Elasticsearch.
As per documentation : By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html#refresh-api-desc
Also as per documentation: When a document is stored, it is indexed and fully searchable in near real-time--within 1 second.
Reference : https://www.elastic.co/guide/en/elasticsearch/reference/7.14/documents-indices.html
So when a write happens, indexing happens. When no writes are happening and the documents are already indexed, why does Elasticsearch index the existing documents every 1 second?

It's not indexing existing documents; that's already been done.
It's checking whether there are any in-memory indexing requests that need to be written to disk to make them searchable.
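For illustration, here is a minimal sketch of how that refresh behaviour can be tuned or triggered by hand; the index name my-index and the localhost:9200 endpoint are assumptions, not part of the original question:

# Change how often in-memory indexing buffers are made searchable (default is 1s)
curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "refresh_interval": "30s" } }'

# Force a refresh right away instead of waiting for the interval
curl -X POST "localhost:9200/my-index/_refresh"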

Related

Why is ElasticSearch index searchable when refresh_interval is set to -1 on initial data upload?

I'm performing a large upload of data to an empty index.
This article suggests setting "refresh_interval=-1" and "number_of_replicas=0" to increase upload performance, and then setting them back afterwards.
The interesting thing is that even if I don't set it back, I can still send queries to the newly created index and get results.
I'd like to know why that is and what I got wrong. (My expectation was that I should get zero results because indexing is disabled.)
And one more thing I'd like to understand: if I set refresh_interval back to the original value, do I need to execute a /_refresh operation?
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.
So the documentation says that when you send a search request to an index that has not been refreshed recently, a refresh is triggered along with it. That is why you can still search your data, but the first search may be slow or miss some documents. It is better to keep a refresh_interval if you index new data on your indices.
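As a rough sketch of the bulk-load pattern discussed above (the index name my-index and the localhost:9200 endpoint are made up for the example):

# Disable periodic refreshes and replicas before the bulk upload
curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 } }'

# ... run the bulk upload here ...

# Set the values back and force one refresh so everything becomes searchable immediately
curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }'
curl -X POST "localhost:9200/my-index/_refresh"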

How to debug document not available for search in Elasticsearch

I am trying to search and fetch documents from Elasticsearch, but in some cases I am not getting the updated documents. By updated I mean that we update the documents periodically in Elasticsearch. The documents are updated at an interval of 30 seconds, and the number of documents could range from 10 to 100 thousand. I am aware that updates are generally a slow process in Elasticsearch.
I suspect this is happening because Elasticsearch accepted the documents but they were not yet available for searching. Hence I have the following questions:
Is there a way to measure the time between indexing and the documents becoming available for search? Is there a setting in Elasticsearch which can log more information in the Elasticsearch logs?
Is there a setting in Elasticsearch which enables logging whenever the merge operation happens?
Any other suggestion to help in optimizing the performance?
Thanks in advance for your help.
By default the refresh_interval parameter is set to 1 second, so unless you have changed this parameter each update will be searchable after at most 1 second.
If you want to make the results searchable as soon as you have performed the update operation you can use the refresh parameter.
Using refresh=wait_for, the endpoint will respond once a refresh has occurred. If you use refresh=true, a refresh operation will be triggered. Be careful using refresh=true if you have many updates, since it can impact performance.
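For example, here is a hedged sketch of an update that only returns once the change is visible to search; the index name, document id and field are made up:

# The request returns only after the next refresh has made the change searchable
curl -X POST "localhost:9200/my-index/_update/1?refresh=wait_for" -H 'Content-Type: application/json' -d'
{ "doc": { "status": "updated" } }'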

ElasticSearch Frequent Full Index Updating affect on search response

I have to build an index in Elasticsearch which will have more than 500,000 unique documents. The documents have nested fields as well.
All the documents in the index are updated every 10 mins (using PUT).
I read that updating a document involves reindexing the document and that it can affect search performance.
Has anyone faced a similar scenario using Elasticsearch? If so, can you share your experience with search/query response times on such an index, given that the expected query response time is under 2 seconds?
Update:
Now, I indexed a document with id 1 using an update request. Then I updated the document (id=1) using PUT to /_update with
"doc_as_upsert": true and a doc field. I see that the response contains the same version as before the update and has the attribute result = "noop" in the output.
I assume that indexing didn't happen, since the version of the document was not incremented.
Does this reduce the impact on search response (assuming there are 100 requests/second) and indexing response for my use case, if I do the same for 500,000 documents every 10 minutes, compared to using PUT (the Index API)?
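To make the scenario concrete, a minimal sketch of the kind of request described above (the index name and field are made up; the documented method for the Update API is POST):

# Upsert document 1; if the supplied fields match what is already stored,
# the version is not bumped and the response contains "result": "noop"
curl -X POST "localhost:9200/my-index/_update/1" -H 'Content-Type: application/json' -d'
{ "doc": { "title": "same value as before" }, "doc_as_upsert": true }'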

Is it possible to limit a size of an Elasticsearch index?

I have an Elasticsearch instance for indexing log records. Naturally the data grows over time and I would like to limit its size (to about 10 GB), something like a MongoDB capped collection.
I'm not interested in old log records anyway.
I haven't found any config for this and I'm not sure that I can just remove data files.
Any suggestions?
The Elasticsearch "way" of dealing with "old" data is to create time-based indices. Meaning, for each day or each week you create an index. Index everything belonging to that day/week in that index.
You decide how many days you want to keep around and stick to that number. Let's say that the data for 7 days amounts to 10 GB. On the 8th day you create the new index, as usual, and then you delete the index from 8 days before.
At any given time you'll have 7 indices in your cluster.
Using ttl as the other poster suggested is not recommended, because it is far more difficult and it creates additional pressure on the cluster. The ttl mechanism checks for expired documents every indices.ttl.interval (60 seconds by default), creates bulk requests out of them, and deletes them. This means unnecessary requests coming to the cluster.
Instead, deleting an index is very easy and quick.
Take a look at this and how to easily manage time based indices with Curator.
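As an illustration of the time-based pattern (the index names and dates below are hypothetical):

# Writes for each day go to a fresh dated index, e.g.:
curl -X PUT "localhost:9200/logs-2015.09.08"

# On day 8, drop the oldest index; deleting a whole index is cheap compared to deleting documents
curl -X DELETE "localhost:9200/logs-2015.09.01"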
From what I remember, a capped collection in MongoDB is just a circular-buffer type of collection that removes the oldest entries when there's no more room. Unfortunately there's nothing like this out of the box in Elasticsearch; you have to add this functionality yourself, for example by removing single documents (or batches of documents) using ES's API. A more performant way is described in their documentation under retiring data.
You can provide a per-index/type default _ttl (time to live) value as follows:
{
  "tweet" : {
    "_ttl" : { "enabled" : true, "default" : "1d" }
  }
}
You will find more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Regards,
Alain

Getting an indexes item count with ElasticSearch

I am writing some code where we are inserting 200,000 items into an ElasticSearch index.
Whilst this works fine, when we get a count of items in the index to ascertain everything went in, we are not getting the same number. However, if we wait a second or two, the count is correct.
Therefore, is there a programmatic way we can get a real count from ElasticSearch without having to sleep or similar?
Newly indexed records become visible in search results only after the refresh operation. Refresh is called automatically with the frequency specified by the index.refresh_interval setting, which is 1s by default. When writing Elasticsearch tests, it's customary to call refresh after indexing to make sure that all indexed records are available in searches. However, excessive refresh calls (after each record, for example) in production code might hamper Elasticsearch indexing performance.
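A minimal sketch of that approach, assuming an index named my-index on localhost:9200:

# Make everything indexed so far visible to search, then count
curl -X POST "localhost:9200/my-index/_refresh"
curl -X GET "localhost:9200/my-index/_count"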
