Elasticsearch count of searches against an index resets to zero after cluster restart - elasticsearch

We use Elasticsearch - one cluster is 7.16 and another is 8.4. Behavior is the same in both.
We need to be able to get a count of search queries run against an index since the index's creation.
We retrieve the number of searches that have been run against a given index by using the _stats endpoint, like so:
GET /_stats?filter_path=indices.my_index.primaries.search.query_total
The problem is that this stat resets to zero after a cluster reboot. Does this data persist anywhere for a given index such that I can get the total since inception of the index? If not, is there an action I can take to somehow record that stat before a reboot so I can always access the full total number?
EDIT - this is the only item I was able to find on this subject, and the answer in this discussion does not look promising: https://discuss.elastic.co/t/why-close-reopen-index-will-reset-index-stats-to-zero/170830

As far as I know, there is no out-of-the-box solution for your use case, but it is not hard to build one yourself. You can call the same _stats API periodically and store the result in another Elasticsearch index or in a database, so that it survives the reset. IMHO it is not much work.
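A minimal sketch of that polling approach, in Python. It assumes the cluster is reachable at a hypothetical http://localhost:9200 and that the caller persists the running total somewhere durable (a file, a database row, or another index). The function names are illustrative and not part of any Elasticsearch client. Because query_total drops back to zero after a restart, the sketch treats any decrease in the raw counter as a reset.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster


def fetch_query_total(index, base_url=ES_URL):
    """Read primaries.search.query_total for one index from the _stats API."""
    url = (
        f"{base_url}/_stats"
        f"?filter_path=indices.{index}.primaries.search.query_total"
    )
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return body["indices"][index]["primaries"]["search"]["query_total"]


def update_lifetime_total(lifetime_total, last_raw, current_raw):
    """Fold a new raw counter reading into a persistent lifetime total.

    Returns (new_lifetime_total, new_last_raw); store both between polls.
    """
    if current_raw >= last_raw:
        # Normal case: the counter only grew since the previous poll.
        delta = current_raw - last_raw
    else:
        # The counter went backwards, so it was reset (restart, close/reopen).
        # Everything it shows now has happened since the reset.
        delta = current_raw
    return lifetime_total + delta, current_raw
```

Searches that run between the last poll and a restart are never seen by this scheme, so a shorter polling interval narrows the gap but does not close it.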

Related

Does ElasticSearch Keep Count The Number Of Times A Record Is Returned In A Given Period Of Time?

I have an ElasticSearch instance and it does one type of search - it takes a few parameters and returns the companies in its index that match the parameters given.
I'd like to be able to pull some stats that essentially says "This company has been returned from search queries X number of times in the past week".
Does ElasticSearch store metadata that will allow to pull this kind of info from it? If this kind of data isn't stored in ES out of the box, is there a way to enable it?
Elasticsearch (not ElasticSearch ;) ) does not do this natively, no. You can build something using the slow log, where you set the threshold to 0 to get it to log every query, but it then logs everything, which may be too noisy to be useful.
Things like https://www.elastic.co/enterprise-search, built on top of Elasticsearch, do provide this sort of insight.
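For the slow-log route, a sketch of the settings body that lowers the search slow-log thresholds to zero. It is written in Python only to build and print the JSON; the helper name is made up for illustration, and the body would be sent with PUT /<index>/_settings. The setting names are the documented index.search.slowlog.threshold.* keys, but check them against the documentation for your version.

```python
import json


def slowlog_everything_settings():
    """Settings body that makes the search slow log record every query.

    A threshold of "0ms" logs everything at that level; "-1" disables it.
    """
    return {
        "index.search.slowlog.threshold.query.trace": "0ms",
        "index.search.slowlog.threshold.fetch.trace": "0ms",
    }


# Body for: PUT /<index>/_settings
print(json.dumps(slowlog_everything_settings(), indent=2))
```

Setting both thresholds back to "-1" turns the logging off again once enough data has been collected.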

Implements popular keyword in ElasticSearch

I'm using Elasticsearch on AWS EC2,
and I want to implement a "today's popular keywords" feature in ES.
There are 3 indexes (place, genre, name), and I want to see today's popular keywords for the name index only.
I tried using the ES slowlog and Logstash, but the slowlog writes one entry per shard
(e.g. with 5 shards, 5 log entries are saved for a single query).
Is there a good and easy way to implement popular keywords in ES?
As far as I know, this is not supported by Elasticsearch, and you need to build your own custom solution.
The slowlog design you mentioned is not a good fit because, as you noted, it logs on a per-shard basis. Even if you do some extra computing and manage to merge the entries and relate them to a single search at the index level, it would still not be good, because:
You have to change the slow log configuration, and every index needs its own threshold. You can set it to 0ms to make sure all search queries end up in the slow log, but that would take a huge amount of disk space and would hurt Elasticsearch performance.
You have to parse the slow log in your application, and doing that at runtime would be very costly.
I think you can maintain a distributed cache in your application where you store the top searched keywords, like the leaderboard of a multiplayer gaming app. A leaderboard changes very frequently, but in your case you don't even have to update the cache that often. I won't go into implementation details, but a simple hash map with the search term as key and the count as value would solve the issue.
Hope this helps. Let me know if you have questions.
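The hash map idea can be sketched in a few lines of Python. This is a single-process, in-memory illustration with a made-up class name; a real deployment with several application instances would need a shared store, and "today's" keywords would need the tally to be reset or keyed by date.

```python
from collections import Counter


class PopularKeywords:
    """In-memory tally of search terms (single-process sketch)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, term):
        # Normalise so "Pizza" and " pizza " count as the same keyword.
        key = term.strip().lower()
        if key:
            self._counts[key] += 1

    def top(self, n=10):
        # Returns (keyword, count) pairs, most frequent first.
        return self._counts.most_common(n)


tracker = PopularKeywords()
for term in ["pizza", "Sushi", "pizza ", "ramen", "sushi", "pizza"]:
    tracker.record(term)
print(tracker.top(2))
```

The application calls record() at the point where it sends the query to the name index, so nothing has to be parsed out of Elasticsearch logs afterwards.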

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue with the number of docs reported as deleted in an Elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs is increasing, I have also been seeing non-zero values in the docs deleted column. I am unable to understand where this number comes from.
I tried to find out whether updating a doc first deletes it and then re-indexes it, which would increase the delete count. However, I could not find any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct that updates are the reason you see a count for deleted documents.
Lucene itself has no notion of an in-place update; documents in Lucene are effectively immutable.
So how does Elasticsearch provide the update feature?
It does so by making use of the _source field, which is why _source must be enabled to use the update feature. When you use the update API, Elasticsearch reads the _source to get all the fields and their existing values, and replaces the values of only the fields sent in the update request. It then marks the existing document as deleted and indexes a new document with the updated _source.
What is the advantage of this if it is not an actual update?
It removes the burden on the application of always assembling the complete document even when only a small subset of fields needs to change. Rather than sending the full document, you can send only the fields that need updating through the update API. The rest is taken care of by Elasticsearch.
It also saves some network round-trips, reduces payload size, and reduces the chance of version conflicts.
You can read more about how update works here.
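To see the figure the question refers to, here is a small Python sketch that splits a _cat/indices line into named columns. It assumes the default column order (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size) and a fully populated row; the class and function names are illustrative only.

```python
from typing import NamedTuple


class CatIndexRow(NamedTuple):
    health: str
    status: str
    index: str
    uuid: str
    pri: int
    rep: int
    docs_count: int
    docs_deleted: int
    store_size: str
    pri_store_size: str


def parse_cat_indices_line(line):
    """Split one default-format _cat/indices row into named fields."""
    f = line.split()
    return CatIndexRow(
        f[0], f[1], f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        f[8], f[9],
    )


# The row from the question above.
row = parse_cat_indices_line(
    "yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 "
    "zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb"
)
print(row.docs_count, row.docs_deleted)
```

The docs_deleted value counts documents that are marked as deleted but still present in their segments, which is what the old copy of an updated document becomes; the number drops again when segments are merged.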

Elasticsearch query a specific node for scroll

I have a scan/scroll query where each document that comes back has something done to it, and then the changes are written back. Basically, I am mapping over the whole index (or document type, actually).
If the function applied during this mapping starts to become too slow then I need to find a way to split this across several machines.
I could share a scroll ID across multiple machines using Zookeeper or something but will there be issues querying ES from 2 clients at almost the same time?
Alternatively, is there a way to write a query that will only run against one specified node? This way, if I had one 'mapping process' on the same box as one node then I could remove the network overhead.
Check the "_only_node" or "_prefer_node" options of the search preference parameter in the Elasticsearch API.
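A sketch of how such a preference value ends up on a scroll search URL, in Python. The helper name and the default host are assumptions for illustration. The exact spelling of the preference value differs between releases (older ones use the singular _only_node, more recent ones _only_nodes), so it is passed in as a parameter here; check the search preference documentation for your version.

```python
from urllib.parse import urlencode


def scroll_search_url(index, node_id, preference_key="_only_nodes",
                      scroll="1m", base_url="http://localhost:9200"):
    """Build a scroll search URL pinned to one node via `preference`."""
    params = {
        "scroll": scroll,
        # e.g. "_only_nodes:<node-id>"; the spelling depends on the version.
        "preference": f"{preference_key}:{node_id}",
    }
    return f"{base_url}/{index}/_search?{urlencode(params)}"


print(scroll_search_url("my_index", "abc"))
```

Each worker would pass the ID of the node it is co-located with, so its scroll only touches shard copies held by that node.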

Getting an indexes item count with ElasticSearch

I am writing some code where we are inserting 200,000 items into an ElasticSearch index.
Whilst this works fine, when we get a count of items in the index to ascertain everything went in, we are not getting the same number. However, if we wait a second or two, the count is correct.
Therefore, is there a programmatic way we can get a real count from ElasticSearch without having to sleep or similar?
Newly indexed records become visible in search results only after a refresh operation. Refresh is called automatically at the frequency specified by the index.refresh_interval setting, which is 1s by default. When writing Elasticsearch tests, it's customary to call refresh after indexing to make sure that all indexed records are available in searches. However, excessive refresh calls (after each record, for example) in production code might hamper Elasticsearch indexing performance.
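A minimal Python sketch of "refresh once, then count", using the _refresh and _count endpoints. The default host is an assumption, the function names are illustrative, and the transport is injectable so the logic can be exercised without a running cluster.

```python
import json
import urllib.request


def http_send(method, url):
    """Default transport: perform the request and decode the JSON body."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def refreshed_count(index, base_url="http://localhost:9200", send=http_send):
    """Refresh the index once, then return its document count."""
    # Make everything indexed so far visible to search and count.
    send("POST", f"{base_url}/{index}/_refresh")
    # Now the count reflects all documents indexed before the refresh.
    return send("GET", f"{base_url}/{index}/_count")["count"]
```

Calling this once after the whole batch of 200,000 items has been sent keeps the refresh cost to a single call instead of one per record.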
