Elasticsearch index stats differ from search hits - elasticsearch

When examining the status of the indices in our Elasticsearch instance using curl 'http://localhost:9200/_cat/indices?v', the number of documents (docs.count) in each index is frequently larger than the number of search results returned when searching all documents in that index.
Sometimes it is an integer multiple of the search hits, but not always. In one case there are 98160 hits for match_all but 805383 documents in the index.
Note that there are no nested documents in the mappings.
What is the explanation? Note that search does seem to be functioning normally.

This could be because your data is sharded across multiple nodes (a multi-node cluster) with no replicas, and one of the nodes is down while you are running search queries.
For instance, if I have a cluster of only one node, and that node has 1 index with 4 documents, I get the following output when I examine the indices:
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open blog 5 1 4 0 10.9kb 10.9kb
Now, if I run match_all query,
{
"query": {
"match_all": {}
}
}
I will get,
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [........
Notice that docs.count equals the hit count. In the output above, note the number of shards, which is 5. All of those shards are assigned to a single node.
But in a multi-node setup with no replicas configured, those shards are distributed among multiple nodes.
Assume a two-node cluster (Node 1 and Node 2) with a total of 5 shards, of which shards 0, 1, and 3 are assigned to Node 2, and that node is down for maintenance or otherwise unavailable. In this scenario only shards 2 and 4 are available, through Node 1. If you now search the data, Elasticsearch serves results from the surviving node only, i.e. Node 1.
The number of hits in this case will be lower than the docs.count value whenever the unavailable shards hold documents.
This kind of uncertainty can be avoided by using replicas.
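The scenario above can be sketched with plain arithmetic. This is only an illustration: the helper visible_hits and the per-shard document distribution are made up for the example (the answer does not say which shard holds which document), and no Elasticsearch call is involved.

```python
def visible_hits(docs_per_shard, available_shards):
    """Sum the documents held by the shards that can actually be searched."""
    return sum(
        count
        for shard, count in docs_per_shard.items()
        if shard in available_shards
    )


# Hypothetical spread of the 4 documents over the 5 primary shards.
docs_per_shard = {0: 1, 1: 1, 2: 1, 3: 0, 4: 1}

# Every shard reachable: the hit count matches the 4 indexed documents.
all_up = visible_hits(docs_per_shard, {0, 1, 2, 3, 4})

# Node 2 (shards 0, 1, 3) is down: only shards 2 and 4 answer the search.
node2_down = visible_hits(docs_per_shard, {2, 4})

print(all_up, node2_down)
```

With replicas, a copy of shards 0, 1, and 3 would still be reachable on another node, so the second figure would match the first.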

The match_all query matches all documents, giving them all a _score of 1.0.
One thing to note is that this query will not work as expected if the email field is analyzed, which is the default for fields in Elasticsearch. In this case, the email field will be broken up into three parts: joe, blogs, and com. This means that it will match searches and documents for any three of those terms.

Related

Elasticsearch Count Discrepancies

I have an ElasticSearch (v7.4) cluster with 3 master nodes and 4 data nodes. When gathering statistics about the number of documents, I have come across a few apparent inconsistencies:
The number of documents as returned by GET https://<my_ip>/<my_index>/_count: 66717419 (24 shards)
The number of documents as returned by GET https://<my_ip>/<my_index>/_search?track_total_hits=true: 66717419 (same as above)
Now I checked the number of documents where the id starts with 0:
curl 'https://<my_ip>/<my_index>/_search?track_total_hits=true' -H 'content-type: application/json' -d '{
"query": {
"prefix": {
"id": {
"value": "0"
}
}
},
"track_total_hits": true, "size": 0
}'
{"took":5,"timed_out":false,"_shards":{"total":24,"successful":24,"skipped":0,"failed":0},"hits":{"total":{"value":57565,"relation":"eq"},"max_score":null,"hits":[]}}
Repeating the same query for all of [0-9a-f] also returns numbers between 57k and 58k for each of these letters/digits for hits.total.value (just as shown in the example query for 0 above).
Repeating the same query for any other letters returns 0 results (as expected).
These totals sum up to ~912k total documents (16*57k)
So I see ~900k documents that have an id starting with any of [0-9a-f], 0 starting with other ids. At the same time, ES reports a total of 66M documents in the index.
Where does the discrepancy come from? Can there be documents with no id? Does ES count deleted or updated documents somehow?
According to the ID Field documentation
Each document has an _id that uniquely identifies it
Could it be related to sharding? From the results shown above, however, it looks like each of my queries hits all 24 shards.
The documentation for the Count API or the Search API don't seem to indicate any peculiar behaviour in that regard. What else could explain these numbers?
Update:
The index statistics:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open <my_index> BUWfFDsBQAGcl64-J7gzHQ 24 1 66717419 23791236 1.6tb 873gb
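The per-prefix tally described in the question could be scripted roughly as follows. This sketch only builds the request bodies and sums hits.total.value from responses that have already been fetched; the helper names prefix_count_body and sum_totals are invented for illustration, and the sample response reuses the one shown above for prefix "0".

```python
import string

# The 16 hexadecimal characters the ids were observed to start with.
HEX_PREFIXES = string.digits + "abcdef"


def prefix_count_body(prefix, field="id"):
    """Build a search body that counts documents whose field starts with prefix."""
    return {
        "query": {"prefix": {field: {"value": prefix}}},
        "track_total_hits": True,  # ask for an exact total
        "size": 0,                 # only the count is needed, no documents
    }


def sum_totals(responses):
    """Add up hits.total.value over a list of 7.x-style search responses."""
    return sum(r["hits"]["total"]["value"] for r in responses)


bodies = [prefix_count_body(p) for p in HEX_PREFIXES]

# Response for prefix "0" as posted in the question (trimmed to the totals).
sample = {"hits": {"total": {"value": 57565, "relation": "eq"}}}
print(len(bodies), sum_totals([sample]))
```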

return empty result for a nested bool query on fields that don't have data

I'm doing the following query:
The ns.ns field is configured (both the mapping and the settings were set up successfully), but there is no source data for this field, and I get an empty result back from Elasticsearch. Is that right? I mean, without data this query would return an empty result, correct? I'm still learning ES; thanks for the help.
The ns.ns field has configured (has both mapping and setting set up
successfully) but there is no source data for this field. and I get
empty result returned from ElasticSearch. is that right?
without data this query would return an empty result, is that correct?
Since, as you mentioned, the ns field is mapped as type nested and the index already exists, the search query will not fail with an "index_not_found_exception".
The search API returns search hits that match the query defined in the request.
When you hit the search query, mentioned in the question above, the following response is there:
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
The response provides the following information about the search request:
took – how long it took Elasticsearch to run the query, in
milliseconds
timed_out – whether or not the search request timed out
_shards – how many shards were searched and a breakdown of how many shards succeeded, failed, or were skipped.
max_score – the score of the most relevant document found
hits.total.value - how many matching documents were found
hits.hits is the array of documents that match your search query. Here it is an empty array ([]): since no documents are indexed, nothing matches the query.
The max_score value in the response above is null because there are no hits to score; the _score in Elasticsearch is a measure of how relevant a match is to the query.
Refer to the ES documentation to learn more about how scoring works in ES.

Elasticsearch number of results changes with pagination

I'm using Elasticsearch 7.6.0 and have paginated one of my queries. It seems to work well, and I can vary the number of results per page and the selected page using the search from and size parameters.
query = 'sample query'
items_per_page = 12
page = 0
es_query = {'query': {
'bool': {
'must': [{
'multi_match': {
'query': query,
"fuzziness": "AUTO",
"operator": "and",
'fields': ['title^2', 'description']
},
}]
}
}, 'min_score': 5.0}
res = es.search(index='my-index', body=es_query, size=items_per_page, from_=items_per_page*page)
hits = sorted(res['hits']['hits'], key=lambda x: x['_score'], reverse=True)
print(res['hits']['total']['value']) # This changes depending on the page provided
I've noticed that the number of results returned depends on the page provided, which makes no sense to me! The number of results also oscillates which further confuses me: Page 0, 233 items. Page 1, 157 items. Page 2, 157 items. Page 3, 233 items...
Why does res['hits']['total']['value'] depend on the size and from parameters?
The search is distributed: it is sent to all the nodes holding shards of the searched indices, and the results are then merged and returned. Sometimes not all shards can be searched. This happens when:
The cluster is very busy
The specific shard is not available due to recovery process
The search has been optimized and the shard has been omitted.
In the response, there is a _shards section like this:
{
"took": 1,
"timed_out": false,
"_shards":{
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits":{...}
}
Check if there is any value other than 0 for failed shards. If so, check the logs and cluster and index status.
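A minimal sketch of that check, working on the response dictionary that es.search already returns. The function name shard_problems and the wording of the messages are made up for this example. It assumes that "successful" lower than "total" means some shards did not answer.

```python
def shard_problems(response):
    """Return human-readable warnings if a search did not cover every shard."""
    shards = response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    failed = shards.get("failed", 0)

    problems = []
    if failed:
        problems.append(f"{failed} shard(s) failed")
    if successful < total:
        # Assumption: shards that are counted in total but not in successful
        # did not contribute any hits to this response.
        problems.append(f"only {successful} of {total} shards answered")
    if response.get("timed_out"):
        problems.append("search timed out, results may be partial")
    return problems


healthy = {"timed_out": False,
           "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0}}
partial = {"timed_out": False,
           "_shards": {"total": 5, "successful": 3, "skipped": 0, "failed": 2}}

print(shard_problems(healthy))
print(shard_problems(partial))
```

Running this on the responses of two consecutive pages would show whether the oscillating totals line up with a different set of shards answering each time.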
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-track-total-hits
Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hit accurately up to 10,000 hits. It is a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.
When set to true the search response will always track the number of hits that match the query accurately (e.g. total.relation will always be equal to "eq" when track_total_hits is set to true). Otherwise the "total.relation" returned in the "total" object in the search response determines how the "total.value" should be interpreted. A value of "gte" means that the "total.value" is a lower bound of the total hits that match the query and a value of "eq" indicates that "total.value" is the accurate count.
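The rule quoted above can be applied to a response like this. The helper describe_total is a hypothetical name. It also accepts the plain-integer hits.total that older versions return, as in the 1.x and 2.x responses shown elsewhere on this page.

```python
def describe_total(response):
    """Return (value, is_exact) for the hits.total of a search response."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # 7.x format: {"value": N, "relation": "eq" | "gte"}
        return total["value"], total.get("relation", "eq") == "eq"
    # Older format: hits.total is a plain integer.
    return total, True


exact = {"hits": {"total": {"value": 57565, "relation": "eq"}}}
lower_bound = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
legacy = {"hits": {"total": 6761727}}

for resp in (exact, lower_bound, legacy):
    value, is_exact = describe_total(resp)
    print(value, "exact" if is_exact else "lower bound")
```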
len(res['hits']['hits']) will always return the same number as specified in items_per_page (i.e. 12 in your case), except for the last page, where it might return a number smaller or equal to 12.
However, res['hits']['total']['value'] is the total number of documents matching your query, not the number of results returned on the page. If that number increases, it means that new matching documents were indexed between the last query and the current one.
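To make the difference between the page length and the total concrete, here is a small sketch. expected_page_length is an invented helper, and the total of 233 is simply one of the values reported in the question.

```python
def expected_page_length(total_hits, items_per_page, page):
    """Number of hits a page should contain, given from_ = items_per_page * page."""
    start = items_per_page * page
    # Clamp between 0 (page past the end) and a full page.
    return max(0, min(items_per_page, total_hits - start))


total_hits = 233       # res['hits']['total']['value'] from one of the runs
items_per_page = 12

first_page = expected_page_length(total_hits, items_per_page, 0)
last_page = expected_page_length(total_hits, items_per_page, 19)
past_end = expected_page_length(total_hits, items_per_page, 20)

print(first_page, last_page, past_end)
```

The total stays the same for every page of the same query. Only the length of res['hits']['hits'] depends on from and size.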

Elastic search document count

I am running Elasticsearch version 2.2. I have created an index and loaded sample documents, and I found a problem. When I run
GET index/type/_count
I am getting the correct answer
{
"count": 9998,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
}
}
But when I check http://IP:9200/_cat/indices?v, I get:
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open index 5 1 79978 0 32.1mb 32.1mb
Here docs.count is 79978, which is wrong.
Why am I seeing the wrong value for docs.count? The exact document count is 9998.
GET index/type/_count will return the top-level document count.
docs.count in _cat/indices returns the count of all documents, including artificial documents that have been created for nested fields.
That's why you see a difference:
The former count (i.e. 9998) will tell you how many Elasticsearch documents are in your index, i.e. how many you have indexed.
The latter count (i.e. 79978) will tell you how many Lucene documents are in your index.
So if one ES document contains a nested field with 5 sub-elements, you'll see 1 ES document, but 6 Lucene documents. Judging by the counts (79978 / 9998 ≈ 8 Lucene documents per ES document), each of your ES documents has about 7 nested elements on average.
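The arithmetic behind the two counts can be written out as a short sketch. Both function names are invented for the example, and the calculation assumes a single level of nesting, where every nested element becomes exactly one extra Lucene document.

```python
def lucene_doc_count(nested_counts):
    """Lucene documents for a list of ES documents.

    nested_counts[i] is the number of nested elements in ES document i;
    each ES document adds 1 (itself) plus one per nested element.
    """
    return sum(1 + n for n in nested_counts)


def average_nested_per_doc(lucene_docs, es_docs):
    """Average number of nested elements per ES document."""
    return lucene_docs / es_docs - 1


# One ES document with 5 nested sub-elements.
single = lucene_doc_count([5])

# The counts from the question: docs.count vs. _count.
average = average_nested_per_doc(79978, 9998)

print(single, round(average, 2))
```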

Inconsistent doc count

Hi I am running Elasticsearch 1.5.2
I indexed 6,761,727 documents in one of my indexes.
When I run the following query....
GET myindex/mytype/_search
{
"size": 0
}
The hits.total count keeps alternating between 2 values...
"hits": {
"total": 6761727,
"max_score": 0,
"hits": []
}
and
"hits": {
"total": 6760368,
"max_score": 0,
"hits": []
}
No matter how many times I run the query the count goes back and forth between the 2.
I searched around a bit and found that primary and replica shards apparently don't hold exactly the same number of docs. If I use preference=primary, the doc count returned is correct.
What is the easiest way to check which shard is the culprit and fix it without re-indexing everything?
Set the replica count to 0 for that index
PUT /my_index/_settings
{
"index": {
"number_of_replicas": 0
}
}
Wait until GET /_cat/shards/my_index?v shows no more replicas for that index, then set number_of_replicas back to its initial value.
This deletes all the replicas for that index and then makes a fresh copy of each primary.
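To find which shard copies disagree before resetting the replicas, the rows of GET /_cat/shards/my_index?format=json can be compared. The sketch below works on rows that have already been fetched. The helper names and the sample rows are hypothetical, and it assumes the usual index, shard, prirep, and docs columns, with docs given as a string. Counts can also differ briefly between copies that have not refreshed yet, so a mismatch is a hint rather than proof.

```python
from collections import defaultdict


def mismatched_shards(rows):
    """Map (index, shard) to its copies when their doc counts differ."""
    by_shard = defaultdict(list)
    for row in rows:
        if row.get("docs") is None:  # unassigned copies report no doc count
            continue
        key = (row["index"], row["shard"])
        by_shard[key].append((row["prirep"], int(row["docs"])))
    return {
        key: copies
        for key, copies in by_shard.items()
        if len({docs for _, docs in copies}) > 1
    }


def replicas_remaining(rows):
    """Count replica rows; 0 means the replicas have all been removed."""
    return sum(1 for row in rows if row["prirep"] == "r")


# Made-up sample rows for illustration only.
rows = [
    {"index": "myindex", "shard": "0", "prirep": "p", "docs": "1000"},
    {"index": "myindex", "shard": "0", "prirep": "r", "docs": "1000"},
    {"index": "myindex", "shard": "1", "prirep": "p", "docs": "2000"},
    {"index": "myindex", "shard": "1", "prirep": "r", "docs": "1990"},
]

print(mismatched_shards(rows))
print(replicas_remaining(rows))
```

replicas_remaining can be polled after setting number_of_replicas to 0, to decide when it is safe to restore the original value.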
