ElasticSearch count returned result - elasticsearch

I want to count the number of documents returned as a result of a query with a size limit. For example, I run the following query:
curl -XGET 'http://localhost:9200/logs_-*/a_logs/_search?pretty=true' -d '
{
"query" : {
"match_all" : { }
},
"size" : 5,
"from" : 8318
}'
and I get:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 159,
"successful" : 159,
"failed" : 0
},
"hits" : {
"total" : 8319,
"max_score" : 1.0,
"hits" : [ {
....
Total documents matching my query are 8319, but I fetched at most 5. Only 1 document was returned since I queried "from" 8318.
In the response, I do not know how many documents are returned. I want to write a query such that the number of documents being returned is also present in some field. Maybe some facet may help, but I could not figure it out. Kindly help.

Your query:
{
"query" : {
"match_all" : { }
},
=> Means that you ask for all your data
"size" : 5,
=> You want to display only 5 results
"from" : 8318
=> You start from record 8318
ElasticSearch response:
....
"hits" : {
"total" : 8319,
...
=> Elasticsearch told you that there are 8319 results in its index.
You asked it for all the results, starting from 8318.
8319 - 8318 = 1, so you have 1 result.
Try removing the from.

Looking through the documentation, it's not clear how to make the query return this -- if indeed the API supports it. If you just want to have the count of the returned hits, the easiest way seems to be to actually count them yourself after parsing the response.
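
Counting the returned hits client-side, as the answer above suggests (an added example, not code from the thread). `page_summary`, `total_matches` and the sample response are constructed here; the code accepts `hits.total` both as a plain integer and as the `{"value", "relation"}` object that appears further down this page, and flags timed-out or shard-failed responses, whose counts may be incomplete.

```python
import json

# Constructed response modelled on the one in the question:
# 8319 matches, queried with "from": 8318 and "size": 5.
SAMPLE = json.loads("""
{
  "took": 5,
  "timed_out": false,
  "_shards": {"total": 159, "successful": 159, "failed": 0},
  "hits": {
    "total": 8319,
    "max_score": 1.0,
    "hits": [{"_index": "constructed-index", "_id": "constructed-1", "_score": 1.0}]
  }
}
""")


def total_matches(response):
    """Return (count, relation) for either shape of hits.total."""
    total = response["hits"]["total"]
    if isinstance(total, dict):            # newer shape: {"value": 2, "relation": "eq"}
        return total["value"], total.get("relation", "eq")
    return total, "eq"                     # older shape: plain integer


def page_summary(response, from_, size):
    """Count the hits actually returned and compare with what paging predicts."""
    total, relation = total_matches(response)
    returned = len(response["hits"]["hits"])
    shards = response.get("_shards", {})
    return {
        "returned": returned,
        "total": total,
        "total_is_exact": relation == "eq",
        # What from/size should yield when the total is exact.
        "expected": max(0, min(size, total - from_)),
        "more_pages": from_ + returned < total,
        # A timeout or failed shard means both counts may be too low.
        "partial": bool(response.get("timed_out")) or shards.get("failed", 0) > 0,
    }


if __name__ == "__main__":
    print(page_summary(SAMPLE, from_=8318, size=5))
```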

Related

Elasticsearch max of field combined with a unique field

I have an index with two fields:
name: uuid
version: long
I now only want to count the documents (on a very large index [1 million+ entries]) where the version of the name is the highest. E.g., a query on an index with the following documents:
{name="a", version=1}
{name="a", version=2}
{name="a", version=3}
{name="b", version=1}
... would return:
count=2
Is this somehow possible? I can not find a solution for this particular problem.

You are effectively describing a count of distinct names, which you can do with a cardinality aggregation.
Request:
GET test1/_search
{
"aggs" : {
"distinct_count" : {
"cardinality" : {
"field" : "name.keyword"
}
}
},
"size": 0
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"distinct_count" : {
"value" : 2
}
}
}

Elasticsearch Distinct query after setting fielddata to true

I am trying to get distinct values of a field "vip_name" on an index.
This is what I tried to begin with:
curl -XGET http://172.31.38.157:9200/cb_inventory/_search -d '{"size":0,"aggs":{"vips":{"terms":{"field":"vip_name"}}}}'
{"error":{"root_cause":
[{"type":"illegal_argument_exception","reason":"Fielddata is disabled
on text fields by default. Set fielddata=true on [vip_name] in order to
load fielddata in memory by uninverting the inverted index. Note that
this can however use significant memory. Alternatively use a keyword
field
instead."}],"type":"search_phase_execution_exception","reason":"all
shards failed","phase":"query","grouped":true,"failed_shards": [{"shard":0,
"index":"cb_inventory","node":"7_t7zG82QsS__Q_vRHWy9A","reason":
{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields
by default.
OK. So I set fielddata to true as below:
curl -XPUT http://172.31.38.157:9200/cb_inventory/_mapping/cb_inventory -d '{"properties":{"vip_name":{"type":"text","fielddata":true}}}'
{"acknowledged":true}
Now I do the search and get back the below:
curl -XGET 'http://172.31.38.157:9200/cb_inventory/_search?pretty=true' -d '{"size":0,"aggs":{"vips":{"terms":{"field":"vip_name","size":1000}}}}'
{"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"vips" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "domain.com",
"doc_count" : 3
},
{
"key" : "ppcbcl00021",
"doc_count" : 3
}
]
}
}
}
This is a bit funny, since I have only one distinct value, ppcbcl00021.domain.com. Now it is showing up as 2 broken distinct values.
How do I go about getting a distinct value as "ppcbcl00021.domain.com"?

This is because vip_name is set to text, not keyword. So, even though you have ppcbcl00021.domain.com, in ES it will be stored as chunks of text, i.e. ppcbcl00021 and domain.com.
Try again by setting vip_name to keyword:
curl -XPUT http://172.31.38.157:9200/cb_inventory/_mapping/cb_inventory -d '{"properties":{"vip_name":{"type":"keyword"}}}'

Elasticsearch has_child not returning back all parent documents

Here is the mapping data for both customer and customer_query documents, where customer is the parent and customer_query the child document.
When I run a generic search against all customer_query documents, I get back 127 documents.
However, when I run the following query against the parent
curl -XGET "http://localhost:9200/fts_index/customer/_search" -d'
{
"query": {
"has_child" : {
"type" : "customer_query",
"query" : { "match_all": {} }
}
}
}'
I get back only 23 documents. There should be 127 documents returned back since each customer_query document has a unique parent id assigned to it that does match up to the customer type.
When I retry creating my customer_query documents, I get a different number of documents back each time leading me to think it is some kind of shard issue. I have 5 shards assigned to the index.
{
"took" : 59,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 23,
"max_score" : 1.0,
"hits" : [ {
"_index" : "fts_index",
"_type" : "customer",
"_id" : "7579f2c0-e4e4-4374-82d7-bf4c508fc51d",
"_score" : 1.0,
"_routing" : "8754248f-1c51-46bf-970a-493c349c70a7",
"_parent" : "8754248f-1c51-46bf-970a-493c349c70a7",
....
I can't wrap my head around this issue. Any thoughts on what could be the issue? Is this a routing issue? If so, how do I rectify that with my search?

How to get search hits results when executing aggregation?

As stated in the ElasticSearch documentation:
In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.
I want to execute searches returning hits when I have queries out for the aggregation, but I am not sure how I can achieve the above.
I am using the following query:
curl -XPOST 'localhost:9200/employee/_search?pretty' -d '
{
"size": 0,
"aggs": {
"group_by_domain": {
"terms": {
"field": "domain"
}
}
}
}'
and here is the result which I am getting:
{
"took" : 92,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1000,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_domain" : {
"doc_count_error_upper_bound" : 5,
"sum_other_doc_count" : 744,
"buckets" : [ {
"key" : "finance",
"doc_count" : 30
}]
}
}
}
As we can see, the hits array is empty. I am not sure how to get those hits. Any suggestions?

The hits are empty because you have set the size of the returning query to 0 when you specify:
"size": 0,
You can remove size completely, in which case you'll get 10 hits, which is the default, or you can set the size you want; for instance, if you specify 100 you'll get 100 hits in the response. This is related to the search results.
Now, if you also want to get results in the aggregation, you can use the Top Hits Aggregation for that.
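
The Top Hits Aggregation mentioned in the answer above, nested under the question's `group_by_domain` terms aggregation (an added example, not code from the thread). The sub-aggregation name `top_docs`, the helper functions, the `sales` bucket and the document ids are hypothetical; the response fragment is constructed.

```python
def terms_with_top_hits(field, per_bucket=2, page_size=10):
    """Search body returning ordinary hits plus the top documents of each bucket."""
    return {
        "size": page_size,  # non-zero, so hits.hits is populated as well
        "aggs": {
            "group_by_domain": {
                "terms": {"field": field},
                # top_hits as a sub-aggregation; "top_docs" is a hypothetical name
                "aggs": {"top_docs": {"top_hits": {"size": per_bucket}}},
            }
        },
    }


def hits_per_bucket(response, agg="group_by_domain", sub="top_docs"):
    """Map each bucket key to (doc_count, ids of the documents top_hits returned)."""
    result = {}
    for bucket in response["aggregations"][agg]["buckets"]:
        ids = [hit["_id"] for hit in bucket[sub]["hits"]["hits"]]
        result[bucket["key"]] = (bucket["doc_count"], ids)
    return result


# Constructed response fragment in the shape returned for the body above;
# document ids and the second bucket are made up for illustration.
SAMPLE = {
    "hits": {"total": 1000, "max_score": 1.0, "hits": [{"_id": "e1"}, {"_id": "e2"}]},
    "aggregations": {
        "group_by_domain": {
            "buckets": [
                {"key": "finance", "doc_count": 30,
                 "top_docs": {"hits": {"total": 30, "hits": [{"_id": "e1"}, {"_id": "e7"}]}}},
                {"key": "sales", "doc_count": 12,
                 "top_docs": {"hits": {"total": 12, "hits": [{"_id": "e2"}, {"_id": "e9"}]}}},
            ]
        }
    },
}

if __name__ == "__main__":
    print(terms_with_top_hits("domain"))
    print(hits_per_bucket(SAMPLE))
```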

Score calculation for query of size = 0

I am using queries with size param set to 0 for getting fast counts without fetching docs data.
{
"query": {
<query_body>
},
"size": 0
}
Am I right with my assumption that the score calculation is not being performed in such cases?
I have some doubts. E.g. when I am querying with a sort other than _score I get "max_score": null, which confirms that the score is not being calculated in that case. But in this current case ("size": 0) I get "max_score": 0, which looks more like the score is being calculated, but no docs are returned, so the max_score is 0.

Might not be the answer you are looking for, but still: It could well be that the score is still calculated. In your case I would use another solution. You should use the search type of a query:
?search_type=count
More information can be found here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#search-request-search-type

You can use a combination of size: 1 and _source: false to limit the size of the return results but you need to have at least size: 1 to have the max_score appear. By default it should be sorting by score so the top return result will have the highest score.
Here is what I did:
{
"size": 1,
"_source": false,
"query": {
<query_body>
}
}
Which returns a result like this:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.37365946,
"hits" : [
{
"_index" : "examples",
"_type" : "_doc",
"_id" : "9deff2cf-4e9b-46a9-8e56-2d97d1b2535a",
"_score" : 0.37365946
}
]
}
}
So you get one hit which lacks its _source field, which means that this payload should always be this size exactly, regardless of your documents.
