Elasticsearch document count - elasticsearch

I am running Elasticsearch version 2.2. I have created an index and loaded sample documents, but I found a problem with the document counts. When I run
GET index/type/_count
I get the correct answer:
{
  "count": 9998,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  }
}
But when I check the indices using http://IP:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open index 5 1 79978 0 32.1mb 32.1mb
Here docs.count is 79978, which looks wrong.
Why am I seeing docs.count with this value? The exact document count is 9998.

GET index/type/_count will return the top-level document count.
docs.count in _cat/indices returns the count of all documents, including artificial documents that have been created for nested fields.
That's why you see a difference:
The former count (i.e. 9998) will tell you how many Elasticsearch documents are in your index, i.e. how many you have indexed.
The latter count (i.e. 79978) will tell you how many Lucene documents are in your index.
So if one ES document contains a nested field with 5 sub-elements, you'll see 1 ES document, but 6 Lucene documents. Judging by the counts, each of your ES documents has about 7 nested elements within it.
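The relationship between the two counts can be sketched with a little arithmetic (a minimal illustration of the rule above, not Elasticsearch code; the numbers come from the question):

```python
def lucene_doc_count(nested_counts):
    """Each ES document is stored as 1 parent Lucene document
    plus one hidden Lucene document per nested sub-element."""
    return sum(1 + n for n in nested_counts)

# One ES document with 5 nested sub-elements -> 6 Lucene documents.
print(lucene_doc_count([5]))  # 6

# Working backwards from the numbers in the question:
es_docs, lucene_docs = 9998, 79978
avg_nested = (lucene_docs - es_docs) / es_docs
print(round(avg_nested))  # 7 -> roughly 7 nested elements per ES document
```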

Related

ElasticSearch Count Discrepancies

I have an ElasticSearch (v7.4) cluster with 3 master nodes and 4 data nodes. When gathering statistics about the number of documents, I have come across a few apparent inconsistencies:
The number of documents as returned by GET https://<my_ip>/<my_index>/_count: 66717419 (24 shards)
The number of documents as returned by GET https://<my_ip>/<my_index>/_search?track_total_hits=true: 66717419 (same as above)
Now I checked the number of documents where the id starts with 0:
curl 'https://<my_ip>/<my_index>/_search?track_total_hits=true' -H 'content-type: application/json' -d '{
  "query": {
    "prefix": {
      "id": {
        "value": "0"
      }
    }
  },
  "track_total_hits": true,
  "size": 0
}'
{"took":5,"timed_out":false,"_shards":{"total":24,"successful":24,"skipped":0,"failed":0},"hits":{"total":{"value":57565,"relation":"eq"},"max_score":null,"hits":[]}}
Repeating the same query for all of [0-9a-f] also returns numbers between 57k and 58k for each of these letters/digits for hits.total.value (just as shown in the example query for 0 above).
Repeating the same query for any other letters returns 0 results (as expected).
These totals sum up to ~912k documents (16 × 57k).
So I see ~900k documents whose id starts with one of [0-9a-f], and none starting with anything else. At the same time, ES reports a total of 66M documents in the index.
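The sum-versus-total comparison described above can be reproduced locally (a sketch using the approximate per-prefix hit counts from the question; in practice each count would come from one prefix query per hex character):

```python
# Approximate hits.total.value per leading hex character
# (each prefix query in the question returned between 57k and 58k hits).
per_prefix_hits = {c: 57565 for c in "0123456789abcdef"}

prefix_sum = sum(per_prefix_hits.values())
reported_total = 66_717_419  # from GET <my_index>/_count

print(prefix_sum)                   # 921040 (~912k when rounded to 57k per prefix)
print(prefix_sum < reported_total)  # True: the prefix queries account for only a tiny fraction
```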
Where does the discrepancy come from? Can there be documents with no id? Does ES count deleted or updated documents somehow?
According to the ID Field documentation
Each document has an _id that uniquely identifies it
Could it be related to sharding? From the results shown above, however, it looks like each of my queries hits all 24 shards.
Neither the documentation for the Count API nor that of the Search API seems to indicate any peculiar behaviour in that regard. What else could explain these numbers?
Update:
The index statistics:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open <my_index> BUWfFDsBQAGcl64-J7gzHQ 24 1 66717419 23791236 1.6tb 873gb

ElasticSearch Segment merge not happening when deleted documents count is greater than 50%

Elasticsearch version: 7.10.0
I have an Elasticsearch index with 8 shards on 8 different nodes and a document count greater than 25 million documents (nested not included). It's a heavy-update index, and the index size grows over time because of deleted documents. I searched this issue and read blog posts like the one below, which say a segment will automatically be merged when the deleted docs count in that segment is greater than 50%.
https://discuss.elastic.co/t/too-many-deleted-docs/84964/4
I did a /_segments for the index and found segments like the below
"segments": {
"_bbx": {
"generation": 14685,
"num_docs": 27901732,
"deleted_docs": 23290932,
"size_in_bytes": 5071187083,
"memory_in_bytes": 137008,
"committed": true,
"search": true,
"version": "8.7.0",
"compound": false,
"attributes": {
"Lucene87StoredFieldsFormat.mode": "BEST_SPEED"
}
},
The full response of the /_segments call can be found here:
https://drive.google.com/file/d/1mLE2xw0u7lnogHnfzz65rWCBS8JrcnNm/view?usp=sharing
In many segments like the one above, the deleted_docs count is more than 75% of num_docs, but the segments are still not getting merged. We haven't set max_merged_segment, so the default of 5gb applies. We also haven't changed the MergePolicy and are using the defaults as of ES version 7.10.0.
Is my understanding correct?
Any thoughts on this would be helpful. Thanks in advance.
num_docs contains only the live documents and doesn't include the deleted ones.
So in this case we have 23,290,932 deleted documents out of a total of 51,192,664 (27,901,732 + 23,290,932) documents in the segment, which means only 45.5% are deleted. Hence the segment merge didn't happen.
Note: I posted the same question in the Elasticsearch forums and got this reply:
https://discuss.elastic.co/t/elasticsearch-segment-merge-not-happening-when-deleted-documents-count-is-greater-than-50/277209
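The percentage calculation in the answer can be sketched as follows (numbers taken from the _segments output above):

```python
def deleted_pct(num_docs, deleted_docs):
    """Fraction of a segment's Lucene documents that are deleted.
    num_docs reports only live documents, so the segment total
    is num_docs + deleted_docs."""
    total = num_docs + deleted_docs
    return 100 * deleted_docs / total

pct = deleted_pct(num_docs=27_901_732, deleted_docs=23_290_932)
print(round(pct, 1))  # 45.5 -> below the 50% threshold, so no automatic merge
```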

return empty result for a nested bool query on fields that don't have data

I'm doing the following query:
The ns.ns field has been configured (both the mapping and the settings were set up successfully), but there is no source data for this field, and I get an empty result back from Elasticsearch. Is that right? I mean, without data, would this query return an empty result? Still learning ES, and thanks for the help.
The ns.ns field has been configured (both mapping and settings set up successfully) but there is no source data for this field, and I get an empty result returned from Elasticsearch. Is that right?
Without data this query would return an empty result, is that correct?
Since, as you mentioned above, the ns field is mapped as type nested, the index already exists, so when you hit the search query you will not get an "index_not_found_exception".
The search API returns search hits that match the query defined in the request.
When you hit the search query, mentioned in the question above, the following response is there:
{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}
The response provides the following information about the search request:
took – how long it took Elasticsearch to run the query, in milliseconds
timed_out – whether or not the search request timed out
_shards – how many shards were searched, and a breakdown of how many succeeded, failed, or were skipped
max_score – the score of the most relevant document found
hits.total.value – how many matching documents were found
hits.hits above is an empty array ([]); hits.hits is the array of found documents that meet your search query. Since no documents are indexed here, no documents match the search query.
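Checking for this empty-result case in client code can be sketched as follows (the response dict is abbreviated to the fields discussed above):

```python
# Abbreviated version of the search response shown above.
response = {
    "timed_out": False,
    "hits": {
        "total": {"value": 0, "relation": "eq"},
        "max_score": None,
        "hits": [],
    },
}

total = response["hits"]["total"]["value"]
if total == 0 and not response["hits"]["hits"]:
    print("query matched no documents")  # the expected outcome with no source data
```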
Refer to this ES documentation to learn more about how scoring works in ES.
In the response above the max_score value is null; the _score in Elasticsearch is a way of determining how relevant a match is to the query.

Elasticsearch index stats differ from search hits

When examining the status of indices in our Elasticsearch instance using curl 'http://localhost:9200/_cat/indices?v', the number of documents, docs.count, in each index is frequently larger than the number of search results returned when searching all documents in that index.
Sometimes it is an integer multiple of the search hits, but not always. In one case there are 98160 hits for match_all but 805383 documents in the index.
Note that there are no nested documents in the mappings.
What is the explanation? Note that search does seem to be functioning normally.
This could potentially be because your data is sharded across multiple nodes (a multi-node cluster setup) with no replicas, and one of the nodes is down while you are performing search queries.
For instance,
If I have a cluster with only one node, and the node has 1 index with 4 documents, I will get the following output when I examine the indices:
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open blog 5 1 4 0 10.9kb 10.9kb
Now, if I run a match_all query,
{
  "query": {
    "match_all": {}
  }
}
I will get,
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [........
Notice how docs.count equals the hits count. In the above output, observe the number of shards, which is 5. All those shards are assigned to a single node.
But if I had a multi-node setup with no replicas configured, those shards would be distributed among multiple nodes.
Assume I have a two-node cluster with Node 1 and Node 2 and a total of 5 shards, and that shards 0, 1, and 3 were assigned to Node 2, which is down for maintenance or otherwise unavailable. In this scenario, only shards 2 and 4 are available through Node 1. Now if you attempt to retrieve or search data, what will happen? Elasticsearch will serve search results from the surviving node, i.e. Node 1.
The number of hits in this case will always be less than the docs.count value.
This kind of uncertainty can be avoided by using replicas.
(A match_all query matches all documents, giving them all a _score of 1.0.)
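One way to detect the partial-results situation described above is to compare _shards.successful against _shards.total in the search response; if they differ, some shards did not contribute hits. A sketch, using the field names from the standard search-response format shown earlier:

```python
def searched_all_shards(response):
    """True when every shard contributed to the search result.
    When nodes holding unreplicated shards are down, successful < total
    and hits.total undercounts docs.count."""
    shards = response["_shards"]
    return shards["successful"] == shards["total"]

# With the single-node example response above, all 5 shards responded:
resp = {"_shards": {"total": 5, "successful": 5, "failed": 0}}
print(searched_all_shards(resp))  # True

# A degraded cluster where 3 of 5 shards are unreachable:
degraded = {"_shards": {"total": 5, "successful": 2, "failed": 3}}
print(searched_all_shards(degraded))  # False
```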

How to run term statistics efficiently in Elasticsearch?

I use the following code to find term frequency for a document.
POST myindex/mydoc/1/_termvectors?fields=fields.bodyText&pretty=true
{
  "term_statistics": true,
  "filter": {
    "max_doc_freq": 300,
    "min_doc_freq": 50
  }
}
In my index there are 1 million documents. How can I gather these statistics more efficiently for each document?
By efficiently I mean, for example: the word the in doc 1 can also appear in doc 2, so when I run the statistics for doc 2 there is no need to calculate them again (assuming that my index has not been updated between documents).
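One direction worth exploring (an assumption on my part, not an answer from the original thread) is batching requests with the Multi termvectors API (_mtermvectors), which lets the shared parameters be specified once for many document ids, saving one round trip per document. A sketch that builds such a request body:

```python
# Hypothetical helper that builds a Multi termvectors (_mtermvectors) request
# body: one "parameters" object is applied to every id in "ids", instead of
# issuing a separate _termvectors call per document.
def mtermvectors_body(doc_ids, field, max_doc_freq, min_doc_freq):
    return {
        "ids": doc_ids,
        "parameters": {
            "fields": [field],
            "term_statistics": True,
            "filter": {
                "max_doc_freq": max_doc_freq,
                "min_doc_freq": min_doc_freq,
            },
        },
    }

body = mtermvectors_body(["1", "2", "3"], "fields.bodyText", 300, 50)
# POST myindex/mydoc/_mtermvectors with this body (ES 2.x-style URL)
print(body["ids"])  # ['1', '2', '3']
```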
