ElasticSearch Count Discrepancies - elasticsearch

I have an ElasticSearch (v7.4) cluster with 3 master nodes and 4 data nodes. When gathering statistics about the number of documents, I have come across a few apparent inconsistencies:
The number of documents as returned by GET https://<my_ip>/<my_index>/_count: 66717419 (24 shards)
The number of documents as returned by GET https://<my_ip>/<my_index>/_search?track_total_hits=true: 66717419 (same as above)
Now I checked the number of documents where the id starts with 0:
curl 'https://<my_ip>/<my_index>/_search?track_total_hits=true' -H 'content-type: application/json' -d '{
  "query": {
    "prefix": {
      "id": {
        "value": "0"
      }
    }
  },
  "track_total_hits": true,
  "size": 0
}'
{"took":5,"timed_out":false,"_shards":{"total":24,"successful":24,"skipped":0,"failed":0},"hits":{"total":{"value":57565,"relation":"eq"},"max_score":null,"hits":[]}}
Repeating the same query for each of [0-9a-f] also returns between 57k and 58k for hits.total.value per letter/digit (just as shown in the example query for 0 above).
Repeating the same query for any other letter returns 0 results (as expected).
These totals sum to roughly 912k documents (16 × 57k).
So I see ~900k documents whose id starts with one of [0-9a-f], and none starting with anything else. At the same time, ES reports a total of 66M documents in the index.
Where does the discrepancy come from? Can there be documents with no id? Does ES count deleted or updated documents somehow?
According to the ID Field documentation
Each document has an _id that uniquely identifies it
Could it be related to sharding? From the results shown above, however, it looks like each of my queries hits all 24 shards.
Neither the documentation for the Count API nor the documentation for the Search API seems to indicate any peculiar behaviour in that regard. What else could explain these numbers?
Update:
The index statistics:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open <my_index> BUWfFDsBQAGcl64-J7gzHQ 24 1 66717419 23791236 1.6tb 873gb
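For what it's worth, the same docs.count / docs.deleted pair can also be pulled directly from the index stats API; this is just a sketch using the same placeholders as the queries above:
# Returns the document stats that _cat/indices reads from,
# i.e. live (docs.count) and deleted (docs.deleted) documents.
curl 'https://<my_ip>/<my_index>/_stats/docs'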

Related

Avoid ranking all matching documents in elasticsearch search query

I have an Elasticsearch index with many millions of documents, and I am running the following search query.
POST testIndex/_search?size=200
{
  "query": {
    "query_string": {
      "query": "(title:QA Manager OR title:QA Lead) AND (skills:JIRA OR skills:Software Development OR skills:Test Case)"
    }
  }
}
Even though we have passed size=200, it seems Elasticsearch still ranks all the matching documents and returns the top 200 with the highest rank.
Is there a way to limit ranking, i.e. rank at most 1000 matching documents?
Elasticsearch considers all of your data for search and ranking; that is how it works. It executes your query in two phases: query and fetch.
In the query phase, it executes your query on all shards and gets the document IDs and scores from each shard, which are returned to the requesting node. So in your scenario, with size set to 200, it will get 200 document IDs from each shard and return them to the requesting node.
On the requesting node, all the document IDs and scores are merged and sorted by score, and the top documents are selected based on the size parameter.
In the fetch phase, the actual documents selected in the query phase are retrieved by ID from the individual shards where they reside, and the results are returned to the client.
If you don't want to calculate a score for part of your query, you can move that part into the filter clause of a bool query, as sketched below.
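As a rough sketch of that suggestion, reusing the query_string query from the question, moving it into the filter clause of a bool query keeps it restricting the result set while skipping score calculation for it:
POST testIndex/_search?size=200
{
  "query": {
    "bool": {
      "filter": {
        "query_string": {
          "query": "(title:QA Manager OR title:QA Lead) AND (skills:JIRA OR skills:Software Development OR skills:Test Case)"
        }
      }
    }
  }
}
With everything inside filter, all matching documents get the same constant score, so relevance ordering is lost; keep whatever you still want ranked in the must clause instead.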

ElasticSearch Segment merge not happening when deleted documents count is greater than 50%

Elasticsearch version: 7.10.0
I have an Elasticsearch index with 8 shards on 8 different nodes and a document count greater than 25 million documents (nested not included). It is a heavily updated index. The index size grows over time because of deleted documents. I searched for this issue and read posts like the one below, which say a segment will automatically be merged when the deleted-docs count in that segment is greater than 50%.
https://discuss.elastic.co/t/too-many-deleted-docs/84964/4
I did a /_segments for the index and found segments like the below
"segments": {
"_bbx": {
"generation": 14685,
"num_docs": 27901732,
"deleted_docs": 23290932,
"size_in_bytes": 5071187083,
"memory_in_bytes": 137008,
"committed": true,
"search": true,
"version": "8.7.0",
"compound": false,
"attributes": {
"Lucene87StoredFieldsFormat.mode": "BEST_SPEED"
}
},
The full response of the /_segments call can be found here:
https://drive.google.com/file/d/1mLE2xw0u7lnogHnfzz65rWCBS8JrcnNm/view?usp=sharing
In many segments like the one above, deleted_docs is more than 75% of num_docs, but the segment is still not getting merged. We haven't set max_merged_segment, so the default of 5gb applies. We also haven't changed the merge policy and are using the defaults of ES version 7.10.0.
Is my understanding correct?
Any thoughts on this would be helpful. Thanks in advance.
num_docs contains only the live documents and doesn't include the deleted documents.
So in this case there are 23,290,932 deleted documents out of a total of 51,192,664 (27,901,732 + 23,290,932), which means 45.5% of the documents in that segment are deleted. Hence the segment merge didn't happen.
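If you want to eyeball these ratios per segment without reading the whole /_segments response, the cat segments API exposes the same live and deleted counts (a sketch; the index name is a placeholder):
GET _cat/segments/<my_index>?v&h=shard,segment,docs.count,docs.deleted,size
The ratio the merge policy cares about is docs.deleted / (docs.count + docs.deleted), per the explanation above.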
Note: I posted the same question on the Elasticsearch forums and got this reply:
https://discuss.elastic.co/t/elasticsearch-segment-merge-not-happening-when-deleted-documents-count-is-greater-than-50/277209

Elasticsearch index stats differ from search hits

When examining the status of indices in our Elasticsearch instance using curl 'http://localhost:9200/_cat/indices?v', the number of documents (docs.count) in each index is frequently larger than the number of search results returned when searching all documents in that index.
Sometimes it is an integer multiple of the search hits but not always. In one case there are 98160 hits for match_all but 805383 documents in the index.
Note that there are no nested documents in the mappings.
What is the explanation? Note that search does seem to be functioning normally.
This could potentially be because your data is sharded across multiple nodes (a multi-node cluster setup) with no replicas, and one of the nodes is down while you are performing search queries.
For instance,
If I have a cluster of only one node, and the node has 1 index with 4 documents, I will get the following output when I examine the indices:
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open blog 5 1 4 0 10.9kb 10.9kb
Now, if I run a match_all query, which matches all documents and gives each of them a _score of 1.0:
{
  "query": {
    "match_all": {}
  }
}
I will get,
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [........
Notice how docs.count equals the hits count. In the above output, observe the number of shards, which is 5. All those shards are assigned to a single node.
But if I had a multi-node setup with replicas not configured, those shards would be distributed among multiple nodes.
Assume I have a two-node cluster with Node 1 and Node 2 and a total of 5 shards, and that shards 0, 1, and 3 were assigned to Node 2, which is down for maintenance or unavailable for whatever reason. In this scenario, only shards 2 and 4 are available through Node 1. Now if you attempt to retrieve or search data, what will happen? Elasticsearch will serve search results from the surviving node, i.e. Node 1.
The number of hits in this case will always be less than the docs.count value.
This kind of uncertainty can be avoided by using replicas.
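If you suspect this scenario, one quick check (a sketch, using the example blog index from above) is to list the shards and look for any in an UNASSIGNED state:
# Shards on a node that is down show up with state UNASSIGNED
# instead of STARTED.
curl 'http://localhost:9200/_cat/shards/blog?v'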

How to run term statistics efficiently in Elasticsearch?

I use the following code to find term frequency for a document.
POST myindex/mydoc/1/_termvectors?fields=fields.bodyText&pretty=true
{
  "term_statistics": true,
  "filter": {
    "max_doc_freq": 300,
    "min_doc_freq": 50
  }
}
My index contains 1 million documents. How can I run these statistics more efficiently for each document?
By efficiently I mean, for example: the word "the" in doc 1 can also appear in doc 2, so when I run the statistics for doc 2 there should be no need to calculate it again (assuming that my index has not been updated in between).
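One thing that may at least reduce the per-document round trips is the multi term vectors API, which accepts a list of IDs plus shared parameters. This is only a sketch, not a solution to the caching question itself, and it assumes the filter options can be shared via parameters in the same way as in the single-document request above:
POST myindex/mydoc/_mtermvectors
{
  "ids": ["1", "2", "3"],
  "parameters": {
    "fields": ["fields.bodyText"],
    "term_statistics": true,
    "filter": {
      "max_doc_freq": 300,
      "min_doc_freq": 50
    }
  }
}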

Elastic search document count

I have Elasticsearch version 2.2 running. I have created an index and loaded sample documents, and I found a problem with the counts. When I run
GET index/type/_count
I am getting the correct answer
{
  "count": 9998,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  }
}
But when I look at the indices using http://IP:9200/_cat/indices?v, I see:
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open index 5 1 79978 0 32.1mb 32.1mb
Here docs.count is 79978, which looks wrong. Why am I seeing a wrong docs.count value? The exact document count is 9998.
GET index/type/_count will return the top-level document count.
docs.count in _cat/indices returns the count of all documents, including artificial documents that have been created for nested fields.
That's why you see a difference:
The former count (i.e. 9998) will tell you how many Elasticsearch documents are in your index, i.e. how many you have indexed.
The latter count (i.e. 79978) will tell you how many Lucene documents are in your index.
So if one ES document contains a nested field with 5 sub-elements, you'll see 1 ES document but 6 Lucene documents. Judging by the counts (79978 / 9998 ≈ 8), each of your ES documents has about 7 nested elements within it.
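As a minimal sketch of that explanation (index and field names are made up, and this uses current single-type syntax rather than the 2.2 syntax from the question):
PUT nested_demo
{
  "mappings": {
    "properties": {
      "comments": { "type": "nested" }
    }
  }
}

PUT nested_demo/_doc/1?refresh
{
  "comments": [
    { "text": "first" },
    { "text": "second" },
    { "text": "third" }
  ]
}

GET nested_demo/_count
GET _cat/indices/nested_demo?v
Here _count should report 1 (the ES document), while docs.count in _cat/indices should report 4 Lucene documents (the root document plus its 3 nested elements).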
