Last updated time for an index in Elasticsearch - elasticsearch

I have a use case where I ran a batch code to first create and then subsequently update my index in elasticsearch.
My program crashed pre-maturedly and now I want to know what was the last time that an update was made to my elasticsearch index.
Is there any api which could give me the information for the last update time of the index.
I have not been able to find any such resources. I looked specifically in https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
and tried,
curl http://{myhost}/{indexName}/_stats

Related

Elasticsearch count of searches against an index resets to zero after cluster restart

We use Elasticsearch - one cluster is 7.16 and another is 8.4. Behavior is the same in both.
We need to be able to get a count of search queries run against an index since the index's creation.
We retrieve the amount of searches that have been run against a given index by using the _stats endpoint as such:
GET /_stats?filter_path=indices.my_index.primaries.search.query_total
The problem is that this stat resets to zero after a cluster reboot. Does this data persist anywhere for a given index such that I can get the total since inception of the index? If not, is there an action I can take to somehow record that stat before a reboot so I can always access the full total number?
EDIT - this is the only item I was able to find on this subject, and the answer in this discussion does not look promising: https://discuss.elastic.co/t/why-close-reopen-index-will-reset-index-stats-to-zero/170830
As far as I know, there is no Out of the box solution to achieve your use-case, but its not that hard to build it yourself either, You can simply call the same _stats API periodically and store it in some other index of Elasticsearch or DB so that its not reset. IMHO Its not that big work.

How to check the index is used for searching or indexing

I've a lot of elasticsearch clusters which hold the historical indices(more than 10 years old), some of these indices are created newly with latest settings and fields, but old ones are not deleted.
Now I need to delete the old indices which are not receiving any search and index requests.
I've already gone to elasticsearch curator but it would not work with older version of ES.
Is there is any API which can just gives the last time of index and search request in ES, that would serve my purpose very well.
EDIT:- I've also check https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html but this also doesn't give the last time when indexing or search request came. all it gave is the number of these requests from last restart.

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue in the number of docs getting deleted in an elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs are increasing, I have also been seeing some non-zero values in the docs deleted column. I am unable to understand from where did this number come from.
I tried reading whether the update doc first deletes the doc and then re-indexes it so in this way the delete count gets increased. However, I could not get any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct that updates are the cause that you see a count for documents delete.
If we talk about lucene then there is nothing like update there. It can also be said that documents in lucene are immutable.
So how does elastic provides the feature of update?
It does so by making use of _source field. Therefore it is said that _source should be enabled to make use of elastic update feature. When using update api, elastic refers to the _source to get all the fields and their existing values and replace the value for only the fields sent in update request. It marks the existing document as deleted and index a new document with the updated _source.
What is the advantage of this if its not an actual update?
It removes the overhead from application to always compile the complete document even when a small subset of fields need to update. Rather than sending the full document, only the fields that need an update can be sent using update api. Rest is taken care by elastic.
It reduces some extra network round-trips, reduce payload size and also reduces the chances of version conflict.
You can read more how update works here.

Ways to only process new(index after last run) data in Elasticsearch?

Is there a way to get the date and time that an elastic search document was written?
I am running es queries via spark and would prefer NOT to look through all documents that I have already processed. Instead I would like read the only documents that were ingested between the last time the program ran and now.
What is the best most efficient way to do this?
I have looked at;
updating to add a field with an array with booleans for if its been looked at by which analytic. The negative is waiting for the update to occur.
index per time frame method, which would be to break down the current indexes into smaller ones so by hour.The negative I see is the number of open file descriptors.
??
Elasticsearch version 5.6
I posted the question on the elasticsearch discussion board and it appears using the ingest pipeline is the best option.
I am running es queries via spark and would prefer NOT to look through
all documents that I have already processed. Instead I would like read
the only documents that were ingested between the last time the
program ran and now.
A workaround could be :
While inserting data using Logstash to Elasticsearch, Logstash appends a #timestamp key to the document which represents the time (in UTC) at which the document is created or we can use an ingest pipline
After that we can query based on the timestamp.
For more on this please have a look at :
Mapping changes
There is no way to ask ES to insert a timestamp at index time
Elasticsearch doesn't have such functionality.
You need manually save with each document date. In this case you will be able to search by date range.

Elasticsearch: time to index document

I am updating existing documents by deleting and reindexing them. I did it this way because the documents have nested components and it was easier to massage the document myself rather than construct an update operation.
Mostly this works fine but occasionally the system updates the same document twice very quickly. I think what is happening is that the the search for the second update gets the original document (before it was updated the first time) because the the previous updates have not yet been reflected in the indexes. By the time I try to delete the document (by id) the index has updated and it comes up as not found.
I am not doing bulk updates.
Is this a known issue and if so how does one work around it?
I can't find any reference to problems like this anywhere so I am puzzled.

Resources