Unexplainable count results in ElasticSearch - elasticsearch

We have an index running with 241.047 items in it. These items can have any number of subitems, which are indexed as nested documents. The total number of subitems is 381.705.
Neither include_in_parent nor include_in_root is set in the mapping, which means that each nested document is indexed as an additional document. This should mean there is a total of 241.047 + 381.705 = 622.752 documents in the index.
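For illustration, a minimal sketch of such a mapping (index and field names are assumed here, and this uses current mapping syntax rather than the version the question was written against):
curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "subitems": { "type": "nested" }
    }
  }
}'
Each entry in the subitems array is then stored as its own hidden Lucene document alongside the root document.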
When I run the following curl command to look up the number of documents in the index, I get a different number. It's not far off, but I'm wondering why it differs and doesn't return the number I'm expecting.
curl -XGET 'http://localhost:9200/catawiki_development/_status?pretty' returns 622.861
Next to that, when I run a curl command to get the number of root documents, I get a different number than when I run a match_all query and ask for the number of documents returned:
curl -XGET 'http://localhost:9200/elasticsearch_development/_count?pretty' returns 241.156
The match_all query returns the correct number of documents, 241.047.
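For reference, a minimal sketch of such a match_all check, reading hits.total from the response (index name as in the _status call above):
curl -XGET 'http://localhost:9200/catawiki_development/_search?pretty' -d '
{
  "query": { "match_all": {} }
}'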
How can these differences be explained?

The path of a count API request is quite different from the path of a normal search request. In fact it is a shortcut that allows you to get only the count of the documents matching a query, that's it. It also differs from a search with search_type=count, which is effectively only the first part of a search: the request is broadcast to all shards, but there is no reduce/fetch phase since we only want to return the total number of matching documents. You can also add facets etc. to a search request (including one with search_type=count), which is something you cannot do with the count API.
That said, I'm not that surprised you see a difference, for the reason above; it would be nice to understand exactly what the problem is, though. The best thing would be to reproduce the problem with a small number of documents and open an issue including a curl recreation so that we can have a look at it.
In the meantime, I would suggest using a search request with search_type=count if you have problems with the count API. That one is guaranteed to return the same number of documents as a normal search, simply because it goes through exactly the same logic.
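For example, a sketch of such a request (index name assumed; search_type=count applies to the older Elasticsearch versions this answer targets), with the total reported in hits.total:
curl -XGET 'http://localhost:9200/catawiki_development/_search?search_type=count&pretty' -d '
{
  "query": { "match_all": {} }
}'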

Related

Get last document from index in Elasticsearch

I'm playing around with the package github.com/olivere/elastic; all works fine, but I have a question: is it possible to get the last N inserted documents?
The From statement has 0 as the default starting point for the Search action, and I didn't understand whether it is possible to omit it in a search.
TL;DR
I am not aware of a feature in the Elasticsearch API to retrieve the latest inserted documents.
However, there is a way to achieve something similar if you store the ingest time of the documents.
Then you can sort on the ingest time and retrieve the top N documents.
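A minimal sketch of that approach (pipeline, index and field names are assumptions): an ingest pipeline stamps each document with its ingest time, and a search then sorts on that field to return the N most recently inserted documents.
curl -XPUT 'http://localhost:9200/_ingest/pipeline/add-ingest-time' -H 'Content-Type: application/json' -d '
{
  "processors": [
    { "set": { "field": "ingest_time", "value": "{{_ingest.timestamp}}" } }
  ]
}'
curl -XGET 'http://localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 10,
  "sort": [ { "ingest_time": "desc" } ],
  "query": { "match_all": {} }
}'
The index would also need to use the pipeline, for example via the index.default_pipeline setting.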

How to return frequencies of matched terms in Elasticsearch query-string searches?

I am trying to adapt an existing boolean query that runs query-string searches against multiple fields so that it returns the number of hits for each matched term, for each search result.
This seems like it should be a straightforward request since the default relevancy scoring takes the matched term frequencies into account on a doc-by-doc basis, and highlighting must parse the fields to identify the positions of matched terms, but from scouring the docs there doesn't seem to be an easy way to do this that doesn't require some additional parsing of the results returned by Elasticsearch.
I would really like to avoid having to do more than one call to Elasticsearch per query for performance reasons, so would like to adapt the existing search queries if possible.
I know that the search API has an "explain" option that, when set to "true", causes a nested "_explanation" object to be returned for each search result, but most of the information in these objects is irrelevant to what I want to know (the matched term frequencies), and I haven't found a way to exclude any of that information from being returned in the search results. I am reluctant to use this option because a) I've seen advice that it should only be used for debugging purposes and not in production, and b) the queries I'm running are not for individual terms but for query strings that can contain an arbitrary number of matched terms per query, making the explanation objects much larger in some cases (and therefore increasing the response payload) and more complex to parse. It's also not clear whether the "_explanation" object has a well-defined structure anyway.
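For reference, a sketch of that option on a query-string search (index and field names here are assumptions):
curl -XGET 'http://localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "explain": true,
  "query": {
    "query_string": {
      "query": "some search terms",
      "fields": ["title", "body"]
    }
  }
}'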
I've also considered parsing highlighted text fragments to determine matched term frequencies, since I'm already returning highlighted fields as part of the query. However, again this would require some additional parsing of the API response which would uncomfortably couple the method for obtaining matched term frequencies with the custom pre- and post-tags of the highlighted fields.
Edit: Just to clarify, I would be open to a separate Elasticsearch call per search if necessary, but not any that would require submitting a set of document IDs matched from the first query, because this would mean the API calls couldn't be done in parallel and because the number of returned results in the first call could be quite high, which I presume would impact performance of the second call.

How to get all the index patterns which never had any documents?

For Kibana server decommissioning purposes, I want to get a list of the index patterns that never had a single document, as well as those that do have documents.
How to achieve this using Kibana only?
I tried this but it doesn't give the list based on the document count.
GET /_cat/indices
Also, getting the count at the individual index level to check whether documents are there is time consuming.
GET index-pattern*/_count
You can try this; v is for verbose and s stands for sort.
GET /_cat/indices?v&s=store.size:desc
From the docs:
These metrics are retrieved directly from Lucene, which Elasticsearch uses internally to power indexing and search. As a result, all document counts include hidden nested documents.
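If the goal is specifically to spot index patterns without any documents, a variant of the same call can sort on the doc count column and show only the relevant columns (a sketch, not part of the original answer):
GET /_cat/indices?v&s=docs.count:asc&h=index,docs.count
Indices with a docs.count of 0 then appear at the top, keeping in mind the nested-document caveat from the docs quote above.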

Elasticsearch: paginate over 10K results

Elasticsearch's search feature only supports 10K results by default. I know I can specify the "size" parameter in the search query, but this only applies to the number of results to get back in one call.
If I want to iterate over 20K results using size=100, making 200 calls in total, how should I do it?
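For context, a sketch of plain from/size paging (index name assumed); by default this stops working once from + size exceeds the index.max_result_window setting of 10,000:
GET /my_index/_search
{
  "from": 10000,
  "size": 100,
  "query": { "match_all": {} }
}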

Why does Elasticsearch return docs in a different order for the same query?

In Elasticsearch 7.9, I have an index with 1 shard and 1 replica. I use a simple datetime filter to get docs between a start time and an end time, but I often get the same result set in a different order. I do not want to use a sort clause and compute scores; I just want to get results in the same order.
So is there any way to do this without using sort?
It may be happening because you have 1 replica for your index, which might have some differences or different values for your timestamp field. You can use the preference param to make sure your search results are always returned from the same shard.
Refer to the bouncy results issue blog post in ES for more info.
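A sketch of that, with an arbitrary preference string and assumed index/field names:
GET /my_index/_search?preference=my_session_id
{
  "query": {
    "range": {
      "created_at": { "gte": "2021-01-01", "lte": "2021-01-02" }
    }
  }
}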
