Inconsistent doc count - elasticsearch

Hi, I am running Elasticsearch 1.5.2.
I indexed 6,761,727 documents in one of my indexes.
When I run the following query:
GET myindex/mytype/_search
{
"size": 0
}
The hits.total count keeps alternating between two values:
"hits": {
"total": 6761727,
"max_score": 0,
"hits": []
}
and
"hits": {
"total": 6760368,
"max_score": 0,
"hits": []
}
No matter how many times I run the query, the count goes back and forth between the two.
I searched around a bit and found that the primary and replica shards apparently don't hold exactly the same number of docs. If I use preference=_primary, the doc count returned is correct.
What is the easiest way to check which shard is the culprit and fix it without re-indexing everything?

Set the replica count to 0 for that index:
PUT /my_index/_settings
{
"index": {
"number_of_replicas": 0
}
}
Wait until GET /_cat/shards/my_index?v shows no more replicas for that index, then set number_of_replicas back to its initial value.
This deletes all the replicas for that index and then makes fresh copies from the primaries.
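To see which shard copy is out of sync before dropping the replicas, one option is to compare the docs column of the primary and replica rows in the _cat/shards output. A minimal sketch, assuming the default column order (index, shard, prirep, state, docs, ...). The helper name mismatched_shards and the sample text are made up for illustration and are not taken from the cluster in the question:

```python
from collections import defaultdict


def mismatched_shards(cat_shards_text):
    """Return {shard_number: {"p": [docs], "r": [docs]}} for shards whose
    primary and replica copies report different doc counts.

    Assumes the text came from GET /_cat/shards/<index>?v with the
    default columns: index shard prirep state docs store ip node.
    """
    lines = [ln.split() for ln in cat_shards_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    col = {name: i for i, name in enumerate(header)}
    per_shard = defaultdict(lambda: {"p": [], "r": []})
    for row in rows:
        # Copies that are not started (e.g. UNASSIGNED) have no docs value.
        if row[col["state"]] != "STARTED":
            continue
        shard = int(row[col["shard"]])
        kind = row[col["prirep"]]  # "p" for primary, "r" for replica
        per_shard[shard][kind].append(int(row[col["docs"]]))
    return {
        shard: copies
        for shard, copies in per_shard.items()
        if len(set(copies["p"] + copies["r"])) > 1
    }


# Hypothetical output for a two-shard index with one replica.
sample = """
index    shard prirep state   docs    store ip       node
my_index 0     p      STARTED 3380864 1.1gb 10.0.0.1 node-1
my_index 0     r      STARTED 3380864 1.1gb 10.0.0.2 node-2
my_index 1     p      STARTED 3380863 1.1gb 10.0.0.2 node-2
my_index 1     r      STARTED 3379504 1.1gb 10.0.0.1 node-1
"""
print(mismatched_shards(sample))
```

In this made-up sample only shard 1 is reported, because its replica holds fewer docs than its primary.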

Related

return empty result for a nested bool query on fields that don't have data

I'm doing the following query:
The ns.ns field is configured (both the mapping and the settings were set up successfully), but there is no source data for this field, and I get an empty result back from Elasticsearch. Is that right? I mean, without data this query would return an empty result, correct? I'm still learning ES, and thanks for the help.
The ns.ns field is configured (both the mapping and the settings were set up
successfully), but there is no source data for this field, and I get an
empty result back from Elasticsearch. Is that right?
Without data this query would return an empty result, is that correct?
As you mentioned, the ns field is mapped as type nested, so the index already exists and the search will not fail with "index_not_found_exception".
The search API returns the search hits that match the query defined in the request.
Running the search query from the question gives the following response:
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
The response provides the following information about the search request:
took - how long it took Elasticsearch to run the query, in milliseconds
timed_out - whether or not the search request timed out
_shards - how many shards were searched, with a breakdown of how many succeeded, failed, or were skipped
max_score - the score of the most relevant document found
hits.total.value - how many matching documents were found
hits.hits is the array of documents that match your search query. Here it is an empty array ([]): since no documents are indexed with data for this field, nothing matches the query.
max_score is null in the response above for the same reason. The _score in Elasticsearch is a way of determining how relevant a match is to the query, and with no matches there is nothing to score.
Refer to the ES documentation to learn more about how scoring works.
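The responses quoted in these threads show hits.total in two shapes: a plain number in the older ones and an object with value and relation in the one above. A minimal sketch of reading either shape on the client side. The helper names total_hits and describe are made up for illustration:

```python
def total_hits(response):
    """Return the number of matching documents from a search response.

    Handles both shapes shown in these threads: a plain integer
    ("total": 4) and an object ("total": {"value": 0, "relation": "eq"}).
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        return total["value"]
    return total


def describe(response):
    """Summarise a search response in one line."""
    n = total_hits(response)
    if n == 0:
        # No matches: hits.hits is [] and max_score is null (None in Python).
        return "no documents matched"
    return f"{n} documents matched, best score {response['hits']['max_score']}"


# The empty response from the answer above, as a Python dict.
empty = {
    "took": 17,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {"total": {"value": 0, "relation": "eq"}, "max_score": None, "hits": []},
}
print(describe(empty))
```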

Elasticsearch index stats differ from search hits

When examining the status of indices in our Elasticsearch instance using curl 'http://localhost:9200/_cat/indices?v', the number of documents (docs.count) in each index is frequently larger than the number of search results returned when searching all documents on that index.
Sometimes it is an integer multiple of the search hits but not always. In one case there are 98160 hits for match_all but 805383 documents in the index.
Note that there are no nested documents in the mappings.
What is the explanation? Note that search does seem to be functioning normally.
This could be because your data is sharded across multiple nodes (a multi-node cluster) with no replicas, and one of the nodes is down while you are performing search queries.
For instance, if I have a cluster of only one node, and that node has one index with 4 documents, I get the following output when I examine the indices:
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   blog    5   1          4            0     10.9kb         10.9kb
Now, if I run a match_all query:
{
"query": {
"match_all": {}
}
}
I get:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [........
Notice that docs.count equals the hit count. Also observe the number of shards in the output above: 5, all assigned to a single node.
In a multi-node setup with no replicas configured, those shards would be distributed among multiple nodes.
Assume a two-node cluster, Node 1 and Node 2, with 5 shards in total, where shards 0, 1 and 3 are assigned to Node 2 and that node is down for maintenance or otherwise unavailable. In this scenario only shards 2 and 4 are available, through Node 1. If you now try to retrieve or search data, Elasticsearch will serve results from the surviving node only, i.e. Node 1.
The number of hits in this case will always be lower than the docs.count value.
This kind of uncertainty can be avoided by using replicas.
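One way to spot this situation from the client side is to look at the _shards section of each search response. A minimal sketch, assuming that shards which could not be searched show up as successful being lower than total. The partial response and the per-shard document counts below are made up for illustration:

```python
def search_is_complete(response):
    """True when every shard of the index answered the search."""
    shards = response["_shards"]
    return shards["failed"] == 0 and shards["successful"] == shards["total"]


def visible_docs(docs_per_shard, available_shards):
    """Count the documents a search can see when only some shards are up."""
    return sum(n for shard, n in docs_per_shard.items() if shard in available_shards)


# Hypothetical spread of the 4 documents over the 5 shards.
docs_per_shard = {0: 1, 1: 1, 2: 1, 3: 0, 4: 1}

# All shards reachable, as in the single-node response above.
full = {"_shards": {"total": 5, "successful": 5, "failed": 0}}
# Assumed response shape when only shards 2 and 4 are reachable.
partial = {"_shards": {"total": 5, "successful": 2, "failed": 0}}

print(search_is_complete(full), visible_docs(docs_per_shard, {0, 1, 2, 3, 4}))
print(search_is_complete(partial), visible_docs(docs_per_shard, {2, 4}))
```

With only shards 2 and 4 reachable, the sketch counts 2 of the 4 documents, which mirrors the lower hit count described above.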
The match_all query matches all documents, giving them all a _score of 1.0.
One thing to note is that this query will not work as expected if the email field is analyzed, which is the default for fields in Elasticsearch. In this case, the email field will be broken up into three parts: joe, blogs, and com. This means that it will match searches and documents for any of those three terms.

Getting different sequence of documents when upgraded from ES 1.4 to ES 2.3

I ran the query curl 'localhost:9200/tweets/user/_search?size=25' on ES 1.4.2 and got the following result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 294633,
"max_score": 1,
"hits": [
...
with a list of documents.
When I ran the same query on ES 2.3.0, I got the same total number of hits, but the documents returned were completely different.
What could be the reason?
The documentation says the order will be random:
This will apply a constant score (default of 1) to all documents. It will perform the same as the above query, and all documents will be returned randomly like before, they’ll just have a score of one instead of zero.
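If a repeatable order matters, one option is to sort explicitly rather than rely on the default order. A minimal sketch that only builds the request body. The field names created_at and screen_name are hypothetical and would need to exist in your mapping as sortable fields:

```python
import json


def ordered_search_body(size, sort_field, tie_breaker):
    """Build a match_all search body with an explicit sort order.

    The tie_breaker field should be unique per document; otherwise
    documents with equal sort values can still come back in any order.
    """
    return {
        "size": size,
        "query": {"match_all": {}},
        "sort": [
            {sort_field: {"order": "asc"}},
            {tie_breaker: {"order": "asc"}},
        ],
    }


# Hypothetical field names, for illustration only.
body = ordered_search_body(25, "created_at", "screen_name")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same _search request on both versions, so the two results can be compared document by document.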

Retrieve all docs using top_hits aggregation ElasticSearch

I am using a top_hits aggregation to retrieve documents along with counts. I need to retrieve all the documents (see my earlier post), and I thought passing size 0 would do it, but it throws the following error:
org.elasticsearch.search.query.QueryPhaseExecutionException: [my-demo][3]: query[ConstantScore(*:*)],from[0],size[10]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalArgumentException: numHits must be > 0; please use TotalHitCountCollector if you just need the total hit count
at org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:254)
at org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:238)
at org.elasticsearch.search.aggregations.metrics.tophits.TopHitsAggregator.collect(TopHitsAggregator.java:108)
at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucketNoCounts(BucketsAggregator.java:74)
at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:63)
at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucket(BucketsAggregator.java:55)
at org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$WithHash.collect(GlobalOrdinalsStringTermsAggregator.java:236)
at org.elasticsearch.search.aggregations.AggregatorFactories$1.collect(AggregatorFactories.java:114)
at org.elasticsearch.search.aggregations.BucketCollector$2.collect(BucketCollector.java:81)
at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucketNoCounts(BucketsAggregator.java:74)
According to the Elasticsearch documentation, size is the maximum number of top matching hits to return per bucket, and by default the top three matching hits are returned. So size = 0 means no documents (I think); try sending a maximum value instead. – progrrammer 31 mins ago
The top_hits aggregation response has this format:
"top_tags_hits": {
"hits": {
"total": 25365,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "question",
"_id": "602679",
"_score": 1,
"_source": {
"title": "Windows port opening"
},
"sort": [
1370143231177
]
}
]
}
}
Here hits -> total gives the total number of hits. You can use pagination (from and size), as in the search API, to get the documents, or use the maximum integer value [(2^31)-1] to get all the documents.
Hope this helps.
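As a rough illustration of the suggestion above, here is a sketch that builds a terms aggregation with a top_hits sub-aggregation and refuses a size of 0, since the stack trace in the question shows that value being rejected. The aggregation name top_tags and the field name tags are hypothetical; top_tags_hits is the name used in the response above:

```python
import json

MAX_INT = 2**31 - 1  # the (2^31)-1 value mentioned above


def terms_with_top_hits(field, size, from_=0):
    """Build a terms aggregation with a top_hits sub-aggregation.

    size must be positive; from_ and size page through the hits
    of each bucket.
    """
    if size <= 0:
        raise ValueError("top_hits size must be > 0")
    return {
        "aggs": {
            "top_tags": {  # hypothetical aggregation name
                "terms": {"field": field},
                "aggs": {
                    "top_tags_hits": {
                        "top_hits": {"from": from_, "size": size}
                    }
                },
            }
        }
    }


# "tags" is a hypothetical field name.
body = terms_with_top_hits("tags", MAX_INT)
print(json.dumps(body, indent=2))
```

Whether a very large size is accepted or affordable depends on the Elasticsearch version and the data volume, so paging with from and size may be the safer route.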

Deleting documents in multiple indices from elasticsearch

I'm trying to delete all documents of a certain type across multiple indices (documents are created by logstash so there is an index for each day).
I've tried this:
DELETE _all/_query?q=type:iss
The result looks something like:
{
"_indices": {
"logstash-2014.01.18": {
"_shards": {
"total": 5,
"successful": 0,
"failed": 5
}
},
"_indices": {
"logstash-2014.01.18": {
"_shards": {
"total": 5,
"successful": 2,
"failed": 3
}
},
...
}
Every time I run it, I get a different number of successes/failures in each index. The first query above initially seemed to work. If I look in elasticsearch-head and Kibana, it seems like at least some of the documents have been deleted. However, if I then query for them:
POST _search {"query":{"match":{"type":"iis"}}}
or
GET _search?q=type:iis
it still returns all results. I don't believe this is a caching problem, as I've done everything possible to rule that out (cleared browser data, restarted Elasticsearch and the server, etc.).
I also tried:
DELETE _all/iis/_query {"query":{"match_all":{}}}
Again I get the inconsistent success/failure results but it does seem to have deleted documents when I run the search queries again. It only seems to be deleting a few every time though.
Why is this so inconsistent and what can I do to get this working consistently?
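To narrow down where the failures happen, it may help to list the indices that report failed shards on each run. A minimal sketch that assumes the response has the _indices / _shards shape shown in the question. The helper name failed_indices and the second index name are made up for illustration:

```python
def failed_indices(delete_response):
    """Map index name -> number of failed shards, for indices with failures."""
    return {
        index: info["_shards"]["failed"]
        for index, info in delete_response.get("_indices", {}).items()
        if info["_shards"]["failed"] > 0
    }


# Shaped like the result in the question; the second index is hypothetical.
response = {
    "_indices": {
        "logstash-2014.01.18": {"_shards": {"total": 5, "successful": 2, "failed": 3}},
        "logstash-2014.01.19": {"_shards": {"total": 5, "successful": 5, "failed": 0}},
    }
}
print(failed_indices(response))
```

Comparing this list between runs would show whether the same indices fail every time or whether the failures move around.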
