How to debug document not available for search in Elasticsearch - elasticsearch

I am trying to search and fetch the documents from Elasticsearch but in some cases, I am not getting the updated documents. By updated I mean, we update the documents periodically in Elasticsearch. The documents in ElasticSearch are updated at an interval of 30 seconds, and the number of documents could range from 10-100 Thousand. I am aware that the update is generally a slow process in Elasticsearch.
I am suspecting it is happening because Elasticsearch though accepted the documents but the documents were not available for searching. Hence I have the following questions:
Is there a way to measure the time between indexing and the documents being available for search? There is setting in Elasticsearch which can log more information in Elasticsearch logs?
Is there a setting in Elasticsearch which enables logging whenever the merge operation happens?
Any other suggestion to help in optimizing the performance?
Thanks in advance for your help.

By default the refresh_interval parameter is set to 1 second, so unless you changed this parameter each update will be searchable after maximum 1 second.
If you want to make the results searchable as soon as you have performed the update operation you can use the refresh parameter.
Using refresh=wait_for the endpoint will respond once a refresh has occured. If you use refresh=true a refresh operation will be triggered. Be careful using refresh=true if you have many update since it can impact performances.

Related

When no write, why Elasticsearch performs indexing every 'n' seconds?

I have basic question regarding elastic search.
As per documentation : By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html#refresh-api-desc
Also as per documentation: When a document is stored, it is indexed and fully searchable in near real-time--within 1 second.
Reference : https://www.elastic.co/guide/en/elasticsearch/reference/7.14/documents-indices.html
So when write happens, indexing happen. When write is not happening and documents are already indexed, then why elastic search indexes every 1 second existing documents?
it's not indexing existing documents, that's already been done
it's checking to see if it needs to write any in memory indexing requests that need to be written to disk to make them searchable

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue in the number of docs getting deleted in an elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs are increasing, I have also been seeing some non-zero values in the docs deleted column. I am unable to understand from where did this number come from.
I tried reading whether the update doc first deletes the doc and then re-indexes it so in this way the delete count gets increased. However, I could not get any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct that updates are the cause that you see a count for documents delete.
If we talk about lucene then there is nothing like update there. It can also be said that documents in lucene are immutable.
So how does elastic provides the feature of update?
It does so by making use of _source field. Therefore it is said that _source should be enabled to make use of elastic update feature. When using update api, elastic refers to the _source to get all the fields and their existing values and replace the value for only the fields sent in update request. It marks the existing document as deleted and index a new document with the updated _source.
What is the advantage of this if its not an actual update?
It removes the overhead from application to always compile the complete document even when a small subset of fields need to update. Rather than sending the full document, only the fields that need an update can be sent using update api. Rest is taken care by elastic.
It reduces some extra network round-trips, reduce payload size and also reduces the chances of version conflict.
You can read more how update works here.

ElasticSearch Frequent Full Index Updating affect on search response

I have to built an index in Elastic Search which will have more than 500,000 unique documents. The documents have nested fields as well.
All the documents in the index are updated every 10 mins (using PUT).
I read that updating an document includes reindexing the document and it can affect the search performance.
Did anyone faced similar scenario in using EL and if someone can share their experience on the search/query response time across such an index if the expected response for query is under 2 seconds?
Update:
Now, I Indexed document with id as 1 using update request. Then, I updated document (id=1) using PUT to /_update with
"doc_as_upsert" : true and doc field, I see the response contains the same version as before update for the document and has attribute result ="noop" in the output.
I assume that indexing didn't happened as version of the document is not updated.
Does this reduce impact on search response(assuming there are 100 requests/second happening) and indexing response for my use case if do the same but for 500,000 documents every 10 mins compared to using PUT (INDEX API)?

ElasticSearch - Configuration to Analyse a document on Indexing

In a single request, I want to retrieve documents from a SOR, store them in ElasticSearch, and then search those documents using the ES search API.
There seems to be some lag from the time the document is indexed and the time it is analyzed and ready to be searched.
Is there any way to configure ES to not return from the request to index a document until the analyzer has analyzed it and so that it can immediately be searched?
Elasticsearch is "near real-time" by nature, i.e. all indices are refreshed every second (by default). While it may seem enough in a majority of cases, it might not, such as in your case.
If you need your documents to be available immediately, you need to refresh your indices explicitly by calling
POST /_refresh
or if you only want to refresh one index
POST /my_index/_refresh
The refresh needs to happen after the indexing call returned and before the search call is sent off.
Note that doing this on every document indexing will hurt the performance of your system. It might be better to make your application aware of the near real-time nature of ES and handle this on the client-side.
The refresh API, as suggested in the accepted answer, is heavy in nature and you may not want to call this API after every index operation, if you are going to do a significant number of indexing operations.
What happens under the hood is that the translog maintained by elasticsearch is written to the in memory segment which elasticsearch maintains. This operations is best left to the discretion of elasticsearch, however, there are some configuration parameters you can play around with.
There is an alternative approach you can take, it may or may not suit your specific use case, but here it goes.
Query the index/_stats/refresh api and retrieve the status of refresh from there, index your document and then keep performing the same stats query again. If the version has increased since your indexing time, it means you are good for searching your document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html

Listening to recently added documents that matches a given query

I need to listen to a subset of documents added to an Elasticsearch index. The subset is defined by an Elasticsearch query. I can tolerate a 10 second delay between the index operation and the listening callback.
Obviously, I can do this by sending search requests to the server every 10 seconds. But searching the entire index for recent documents seems redundant. I can store the id of the last document I've fetched and use that to narrow down the search further, which I will do if there is no easier way.
I thought, however, there may be a plugin that will catch any newly inserted document, try to match the query against the document and push it to my listener if the match was successful. Does such plugin exists? Is this at least possible?
You can take a look at the Percolator feature, which does what you described. Also, there are significant changes in the upcoming 1.0 release, see: Master Documentation
Edit: As of Elasticsearch 5.0, the Percolator has been deprecated and replaced with the percolator query
the delay depends on how many queries you need to match. recently, I've test the percolate feature in the 1.0 beta version. to percolate one document via 100 registered queries, less than 10ms, 1000, above 15ms, 10000 above 100ms, seems the delay increase linearly with the number of
registered queries. very bad. after reading "how does it work under the hook" section of beta 1.0 percolate documentation, I'm confirmed that the "percolating" is done in a linear way.

Resources