I want to apply a filter after an aggregation. For example, with the aggregation query below, I want to get only the buckets whose key contains "windows".
Note: we do not want to use include, because it takes a regular expression, which is slow and cannot ignore case.
Query:
GET /record_new/_search
{"size":0, "aggs" : {
"software_tags" : {
"terms" : {
"field" : "software_tags.keyword",
"size" : 100
}
}
}
}
Response:
{
"took": 77,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5706542,
"max_score": 0,
"hits": []
},
"aggregations": {
"software_tags": {
"doc_count_error_upper_bound": 5514,
"sum_other_doc_count": 581800,
"buckets": [
{
"key": "Microsoft Windows",
"doc_count": 70641
},
{
"key": "Bitcoin",
"doc_count": 35423
},
{
"key": "Linux",
"doc_count": 33230
},
{
"key": "ICQ",
"doc_count": 21934
},
{
"key": "PHP",
"doc_count": 20562
},
{
"key": "Windows XP",
"doc_count": 19720
},
{
"key": "Android (operating system)",
"doc_count": 17774
},
{
"key": "C++",
"doc_count": 14792
},
{
"key": "Pretty Good Privacy",
"doc_count": 14307
},
{
"key": "Tor (anonymity network)",
"doc_count": 14110
}
]
}
}
}
I also tried adding a filter, but I am getting incorrect output: the result contains Linux as well. I don't know what is happening here.
GET /record_new/_search
{"size":0, "query": {
"constant_score": {
"filter":
{ "term": { "software_tags": "windows" }}
}
}, "aggs" : {
"software_tags" : {
"terms" : {
"field" : "software_tags.keyword",
"size" : 10
}
}
}
}
Output:
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 93181,
"max_score": 0,
"hits": []
},
"aggregations": {
"software_tags": {
"doc_count_error_upper_bound": 1640,
"sum_other_doc_count": 171831,
"buckets": [
{
"key": "Microsoft Windows",
"doc_count": 70641
},
{
"key": "Windows XP",
"doc_count": 19720
},
{
"key": "Windows 7",
"doc_count": 12692
},
{
"key": "Linux",
"doc_count": 12311
},
{
"key": "Windows Vista",
"doc_count": 10172
},
{
"key": "Windows NT",
"doc_count": 5417
},
{
"key": "Windows Registry",
"doc_count": 5055
},
{
"key": "Windows 8",
"doc_count": 4829
},
{
"key": "Windows 2000",
"doc_count": 4738
},
{
"key": "Windows 10",
"doc_count": 4611
}
]
}
}
}
Try this query; it should look for records with windows in software_tags:
{
"size":0,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "software_tags: *windows* AND NOT *linux* AND NOT *<next OS name to exclude>*",
"analyze_wildcard": true
}
}
]
}
}, "aggs" : {
"software_tags" : {
"terms" : {
"field" : "software_tags.keyword",
"size" : 10
}
}
}
}
It might be a bit slower than the usual queries, but that's because of the wildcard characters in the query.
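Linux shows up in the filtered output because the query only narrows which documents are aggregated. The terms aggregation then counts every tag on those documents, and a record tagged "Windows XP" can presumably carry a "Linux" tag too. If a regular-expression include is ruled out, one option is to drop the unwanted buckets on the client. A minimal sketch in Python, assuming the response has already been parsed into a dict; the function name filter_buckets is made up for illustration:

```python
def filter_buckets(response, agg_name, needle):
    """Keep only buckets whose key contains `needle`, ignoring case."""
    buckets = response["aggregations"][agg_name]["buckets"]
    needle = needle.lower()
    return [b for b in buckets if needle in str(b["key"]).lower()]


# Trimmed copy of the filtered response shown in the question.
response = {
    "aggregations": {
        "software_tags": {
            "buckets": [
                {"key": "Microsoft Windows", "doc_count": 70641},
                {"key": "Windows XP", "doc_count": 19720},
                {"key": "Linux", "doc_count": 12311},
                {"key": "Windows Vista", "doc_count": 10172},
            ]
        }
    }
}

windows_only = filter_buckets(response, "software_tags", "windows")
for bucket in windows_only:
    print(bucket["key"], bucket["doc_count"])
```

Because the filtering happens after Elasticsearch has already cut the list to "size" buckets, the request would need a larger "size" than the number of buckets finally wanted.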
Related
I need a query that returns data from the last year, grouped by day. So far I have written a query that returns data for the entire year (I hope it's correct), but I don't know how to group the data by day.
"query" : {
"range" : {
"timestamp" : {
"gt" : "2017-01-01 00:00:00",
"lt" : "2018-01-01 00:00:00"
}
}
}
Any help would be much appreciated.
I am using Elasticsearch 6.2.2.
You can use the date_histogram aggregation:
POST my_index/my_type/_search
{
"size": 0,
"aggs": {
"bucketName": {
"date_histogram": {
"field": "timestamp",
"interval": "day",
"min_doc_count": 1,
"format": "yyyy-MM-dd",
"order": {"_key": "desc"}
}
}
}
}
It will return something like this:
{
"took": 23,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 112233,
"max_score": 0,
"hits": []
},
"aggregations": {
"bucketName": {
"buckets": [
{
"key_as_string": "2018-03-07",
"key": 1520380800000,
"doc_count": 1
},
{
"key_as_string": "2018-03-06",
"key": 1520294400000,
"doc_count": 93
},
{
"key_as_string": "2018-03-05",
"key": 1520208000000,
"doc_count": 99
},
{
"key_as_string": "2018-03-04",
"key": 1520121600000,
"doc_count": 33
},
{
"key_as_string": "2018-03-03",
"key": 1520035200000,
"doc_count": 29
}
]
}
}
}
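The answer's request has no query, so it buckets every document in the index. To cover only the year from the question, the range query and the date_histogram can go in the same body. A sketch of building that body in Python, under a few assumptions: the helper name daily_counts_body and the bucket name per_day are invented here, the lower bound uses "gte" rather than the question's "gt" so that midnight on the first day is included, and "interval" is the parameter name used by the 6.x version mentioned in the question (later releases use "calendar_interval" instead).

```python
import json


def daily_counts_body(field, start, end):
    """Build a search body: range filter on `field` plus one bucket per day."""
    return {
        "size": 0,  # only the buckets are needed, not the hits
        "query": {
            "range": {
                field: {
                    "gte": start,
                    "lt": end,
                    # tells Elasticsearch how to parse the two bounds above
                    "format": "yyyy-MM-dd HH:mm:ss",
                }
            }
        },
        "aggs": {
            "per_day": {
                "date_histogram": {
                    "field": field,
                    "interval": "day",
                    "min_doc_count": 1,  # skip days without documents
                    "format": "yyyy-MM-dd",
                }
            }
        },
    }


body = daily_counts_body("timestamp", "2017-01-01 00:00:00", "2018-01-01 00:00:00")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request.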
I have a few million documents with name and version (both of type keyword) as properties in each. What is the equivalent Elastic query for group by name, version?
I have tried the following query:
{
"size":0,
"query": {
"bool": {
"filter": {
"range": {
"time": {
"gte": "2017-01-28",
"lte": "2017-02-28"
}
}
}
}
},
"aggs": {
"group_by_name": {
"terms": {
"field": "name"
},
"aggs": {
"group_by_version": {
"terms": {
"field": "version"
}
}
}
}
}
}
However, the results are not the same as a GROUP BY name, version.
The results are grouped by name and, within each name, by version.
How do I modify the above query to group by the (name, version) tuple and return the results in descending order of count?
Your help is greatly appreciated.
Update:
What I get is:
{
"took": 1424,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 115,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by_name": {
"doc_count_error_upper_bound": 2,
"sum_other_doc_count": 115,
"buckets": [
{
"key": "product1",
"doc_count": 50,
"group_by_version": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 50,
"buckets": [
{
"key": "1.0",
"doc_count": 40
},
{
"key": "2.0",
"doc_count": 10
}
]
}
},
{
"key": "product3",
"doc_count": 35,
"group_by_version": {
"doc_count_error_upper_bound": 4,
"sum_other_doc_count": 35,
"buckets": [
{
"key": "8.0",
"doc_count": 20
},
{
"key": "9.0",
"doc_count": 15
}
]
}
},
{
"key": "product2",
"doc_count": 30,
"group_by_version": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 30,
"buckets": [
{
"key": "4.0",
"doc_count": 25
},
{
"key": "5.0",
"doc_count": 5
}
]
}
}
]
}
}
}
What I want is:
name      version  count
product1  1.0      40
product2  4.0      25
product3  8.0      20
product3  9.0      15
product1  2.0      10
product2  5.0      5
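One way to get from the nested response to the flat list is to flatten the buckets on the client and sort them there. A minimal sketch in Python, assuming the response has been parsed into a dict; flatten_buckets is a made-up name, and the sample below is a trimmed copy of the response in the question:

```python
def flatten_buckets(response):
    """Turn nested name -> version buckets into (name, version, count) rows."""
    rows = []
    for name_bucket in response["aggregations"]["group_by_name"]["buckets"]:
        for version_bucket in name_bucket["group_by_version"]["buckets"]:
            rows.append(
                (name_bucket["key"], version_bucket["key"], version_bucket["doc_count"])
            )
    # Sort by count, largest first, as in the desired output.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def _bucket(name, versions):
    # Helper that rebuilds one name bucket from (version, count) pairs.
    return {
        "key": name,
        "group_by_version": {
            "buckets": [{"key": v, "doc_count": c} for v, c in versions]
        },
    }


response = {
    "aggregations": {
        "group_by_name": {
            "buckets": [
                _bucket("product1", [("1.0", 40), ("2.0", 10)]),
                _bucket("product3", [("8.0", 20), ("9.0", 15)]),
                _bucket("product2", [("4.0", 25), ("5.0", 5)]),
            ]
        }
    }
}

rows = flatten_buckets(response)
for name, version, count in rows:
    print(name, version, count)
```

This only sees the names and versions that the two terms aggregations actually return, so their "size" would have to be large enough to cover the combinations of interest.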
I am using Elasticsearch 1.7, and I need to exclude from an aggregation the keys that also occur under certain other values of a field.
Below is the structure of the documents:
{"RU": "2016-06-25T15:07:46.144","zt":"bl","zi":"z101"}
{"RU": "2016-06-25T15:07:46.144","zt":"bl","zi":"z102"}
{"RU": "2016-06-25T15:07:46.144","zt":"bl","zi":"z103"}
{"RU": "2016-06-25T15:07:46.144","zt":"un","zi":"z201"}
{"RU": "2016-06-25T15:07:46.144","zt":"un","zi":"z202"}
{"RU": "2016-06-25T15:07:46.144","zt":"g1","zi":"z101"}
{"RU": "2016-06-25T15:07:46.144","zt":"g1","zi":"z502"}
{"RU": "2016-06-25T15:07:46.144","zt":"g2","zi":"z201"}
{"RU": "2016-06-25T15:07:46.144","zt":"g2","zi":"z503"}
My query:
{"size": 0,
"aggs": {
"findunique": {
"filter": {
"bool": {
"must_not": [
{
"terms": {
"zt": [
"bl",
"un"
]
}
}
],
"must": [
{
"terms": {
"zt": [
"g1",
"g2"
]
}
}
]
}
},
"aggs": {
"uniquezi": {
"terms": {
"field": "zi"
}
}
}
}
}
}
Output:
{"aggregations": {
"findunique": {
"doc_count": 4,
"uniquezi": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "z101",
"doc_count": 1
},
{
"key": "z201",
"doc_count": 1
},
{
"key": "z502",
"doc_count": 1
},
{
"key": "z503",
"doc_count": 1
}
]
}
}
}
}
Now I want zi = z101 and zi = z201 to be left out of the list, because they also belong to zt = bl and zt = un.
Please advise. Thanks!
As a suggestion, you could try adding two aggregations with a filter set on the "zt" field.
This way you get two sets and can later extract, in code, everything from "wanted" that is not in "unwanted".
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"wanted" : { "terms" : { "zt" : [ "g1", "g2" ] }},
"unwanted" : { "terms" : { "zt" : [ "bl", "un" ] }}
}
},
"aggs" : {
"distinctValuesAgg" : {
"terms": {"field" : "zi"}
}
}
}
}
}
Response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"messages": {
"buckets": {
"wanted": {
"doc_count": 4,
"distinctValuesAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "z101",
"doc_count": 1
},
{
"key": "z201",
"doc_count": 1
},
{
"key": "z502",
"doc_count": 1
},
{
"key": "z503",
"doc_count": 1
}
]
}
},
"unwanted": {
"doc_count": 5,
"distinctValuesAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "z101",
"doc_count": 1
},
{
"key": "z102",
"doc_count": 1
},
{
"key": "z103",
"doc_count": 1
},
{
"key": "z201",
"doc_count": 1
},
{
"key": "z202",
"doc_count": 1
}
]
}
}
}
}
}
}
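The "later in code" step is a set difference between the two bucket lists. A minimal sketch in Python, assuming the response has been parsed into a dict; keys_only_in_wanted is a made-up name, and the aggregation names are taken from the response shown above:

```python
def keys_only_in_wanted(response, filters_agg="messages", sub_agg="distinctValuesAgg"):
    """Return keys from the "wanted" group that never occur in "unwanted"."""
    groups = response["aggregations"][filters_agg]["buckets"]
    wanted = [b["key"] for b in groups["wanted"][sub_agg]["buckets"]]
    unwanted = {b["key"] for b in groups["unwanted"][sub_agg]["buckets"]}
    # Keep the order Elasticsearch returned for the wanted keys.
    return [key for key in wanted if key not in unwanted]


def _terms(keys):
    # Helper that rebuilds a terms result with one document per key.
    return {"buckets": [{"key": k, "doc_count": 1} for k in keys]}


response = {
    "aggregations": {
        "messages": {
            "buckets": {
                "wanted": {
                    "doc_count": 4,
                    "distinctValuesAgg": _terms(["z101", "z201", "z502", "z503"]),
                },
                "unwanted": {
                    "doc_count": 5,
                    "distinctValuesAgg": _terms(["z101", "z102", "z103", "z201", "z202"]),
                },
            }
        }
    }
}

result = keys_only_in_wanted(response)
print(result)
```

As with any terms aggregation, both lists are limited by "size", so the unwanted list has to be complete for the difference to be reliable.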
We're using Elasticsearch to find offers based on 5 fields, such as some 'free text', offer state and client name. We also need to aggregate on two of those fields, client name and offer state. So when someone enters some free text and we find, say, 10 docs with state closed and 8 with state open, the 'state filter' should show closed(10) and open(8).
Now the problem is that when I select the state 'closed' in the filter, the aggregation result for open changes to 0. I want it to remain 8. So how can I prevent the filter on the aggregated field from influencing the aggregation itself?
Here is the first query, searching for 'java':
{
"query": {
"bool": {
"filter": [
],
"must": {
"simple_query_string": {
"query" : "java"
}
}
}
},
"aggs": {
"OFFER_STATE_F": {
"terms": {
"size": 0,
"field": "offer_state_f",
"min_doc_count": 0
}
}
},
"from": 0,
"size": 1,
"fields": ["offer_id_ft", "offer_state_f"]
}
The result is this:
{
"hits": {
"total": 960,
"max_score": 0.89408284000000005,
"hits": [
{
"_type": "offer",
"_index": "select",
"_id": "40542",
"fields": {
"offer_id_ft": [
"40542"
],
"offer_state_f": [
"REJECTED"
]
},
"_score": 0.89408284000000005
}
]
},
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"timed_out": false,
"aggregations": {
"OFFER_STATE_F": {
"buckets": [
{
"key": "REJECTED",
"doc_count": 778
},
{
"key": "ACCEPTED",
"doc_count": 130
},
{
"key": "CANCELED",
"doc_count": 22
},
{
"key": "WITHDRAWN",
"doc_count": 13
},
{
"key": "LONGLIST",
"doc_count": 12
},
{
"key": "SHORTLIST",
"doc_count": 5
},
{
"key": "INTAKE",
"doc_count": 0
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
},
"took": 2
}
As you can see, the sum of the OFFER_STATE_F buckets is equal to the total hits (960). Now I include one of the states in the query, say 'ACCEPTED'. So my query becomes:
{
"query": {
"bool": {
"filter": [
{
"bool": {
"should": [
{
"term": {
"offer_state_f": "ACCEPTED"
}
}
]
}
}
],
"must": {
"simple_query_string": {
"query" : "java"
}
}
}
},
"aggs": {
"OFFER_STATE_F": {
"terms": {
"size": 0,
"field": "offer_state_f",
"min_doc_count": 0
}
}
},
"from": 0,
"size": 1,
"fields": ["offer_id_ft", "offer_state_f"]
}
What I want is 130 results, with the OFFER_STATE_F buckets still summing up to 960. But what I got is this:
{
"hits": {
"total": 130,
"max_score": 0.89408284000000005,
"hits": [
{
"_type": "offer",
"_index": "select",
"_id": "16884",
"fields": {
"offer_id_ft": [
"16884"
],
"offer_state_f": [
"ACCEPTED"
]
},
"_score": 0.89408284000000005
}
]
},
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"timed_out": false,
"aggregations": {
"OFFER_STATE_F": {
"buckets": [
{
"key": "ACCEPTED",
"doc_count": 130
},
{
"key": "CANCELED",
"doc_count": 0
},
{
"key": "INTAKE",
"doc_count": 0
},
{
"key": "LONGLIST",
"doc_count": 0
},
{
"key": "REJECTED",
"doc_count": 0
},
{
"key": "SHORTLIST",
"doc_count": 0
},
{
"key": "WITHDRAWN",
"doc_count": 0
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
},
"took": 10
}
As you can see, only the ACCEPTED bucket is filled, all the others are 0.
You need to move your filters into the post_filter section instead of the query section.
That way, the filtering is applied after the aggregations are computed: the aggregations cover everything the query matches, while the returned hits are only those matching your filters.
OK, I found the answer with the help of a colleague, and Val is right; +1 for him. What I had done was place ALL of my query filters in the post_filter, and that was the problem. Only the filters on the fields I want to aggregate on belong in the post_filter. Thus:
{
"query": {
"bool": {
"filter": [
{
"term": {
"broker_f": "false"
}
}
],
"must": {
"simple_query_string": {
"query" : "java"
}
}
}
},
"aggs": {
"OFFER_STATE_F": {
"terms": {
"size": 0,
"field": "offer_state_f",
"min_doc_count": 0
}
}
},
"post_filter" : {
"bool": {
"should": [
{
"term": {
"offer_state_f": "SHORTLIST"
}
}
]
}
},
"from": 0,
"size": 1,
"fields": ["offer_id_ft", "offer_state_f"]
}
And now the result is correct:
{
"hits": {
"total": 5,
"max_score": 0.76667790000000002,
"hits": [
{
"_type": "offer",
"_index": "select",
"_id": "24454",
"fields": {
"offer_id_ft": [
"24454"
],
"offer_state_f": [
"SHORTLIST"
]
},
"_score": 0.76667790000000002
}
]
},
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"timed_out": false,
"aggregations": {
"OFFER_STATE_F": {
"buckets": [
{
"key": "REJECTED",
"doc_count": 777
},
{
"key": "ACCEPTED",
"doc_count": 52
},
{
"key": "CANCELED",
"doc_count": 22
},
{
"key": "LONGLIST",
"doc_count": 12
},
{
"key": "WITHDRAWN",
"doc_count": 12
},
{
"key": "SHORTLIST",
"doc_count": 5
},
{
"key": "INTAKE",
"doc_count": 0
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
},
"took": 4
}
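The rule from the answer (filters on aggregated fields go into post_filter, everything else stays in the query) can be applied mechanically when the request is built. A sketch in Python under stated assumptions: build_search_body is a made-up helper, filters are simple term filters passed as a field-to-value dict, and the "size": 0 and "fields" settings from the original request are left out because they depend on the Elasticsearch version in use.

```python
def build_search_body(text, filters, aggregated_fields):
    """Split term filters between the query and post_filter.

    `filters` maps a field name to the selected value. Filters on fields
    listed in `aggregated_fields` must not narrow the aggregations, so
    they are placed in post_filter.
    """
    query_filters = []
    post_filters = []
    for field, value in filters.items():
        clause = {"term": {field: value}}
        if field in aggregated_fields:
            post_filters.append(clause)
        else:
            query_filters.append(clause)

    body = {
        "query": {
            "bool": {
                "must": {"simple_query_string": {"query": text}},
                "filter": query_filters,
            }
        },
        "aggs": {
            field.upper(): {"terms": {"field": field, "min_doc_count": 0}}
            for field in aggregated_fields
        },
    }
    if post_filters:
        # Applied to the hits only, after the aggregations are computed.
        body["post_filter"] = {"bool": {"must": post_filters}}
    return body


body = build_search_body(
    "java",
    {"broker_f": "false", "offer_state_f": "SHORTLIST"},
    ["offer_state_f"],
)
```

With more than one faceted field, a selection on one field should usually still narrow the counts of the other fields; that would need a filter aggregation per facet, which this sketch does not cover.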
How can a field of type string be included in the result set of an aggregation?
For example, given the following mapping:
{
"sport": {
"mappings": {
"runners": {
"properties": {
"name": {
"type": "string"
},
"city": {
"type": "string"
},
"region": {
"type": "string"
},
"sport": {
"type": "string"
}
}
}
}
}
}
Sample data:
curl -XPOST "http://localhost:9200/sport/_bulk" -d'
{"index":{"_index":"sport","_type":"runner"}}
{"name":"Gary", "city":"New York","region":"A","sport":"Soccer"}
{"index":{"_index":"sport","_type":"runner"}}
{"name":"Bob", "city":"New York","region":"A","sport":"Tennis"}
{"index":{"_index":"sport","_type":"runner"}}
{"name":"Mike", "city":"Atlanta","region":"B","sport":"Soccer"}
'
How can the field name be included in the result set of this aggregation:
{
"size": 0,
"aggregations": {
"agg": {
"terms": {
"field": "city"}
}
}
}
This seems to do what you want, if I'm understanding you correctly:
POST /sport/_search
{
"size": 0,
"aggregations": {
"city_terms": {
"terms": {
"field": "city"
},
"aggs": {
"name_terms": {
"terms": {
"field": "name"
}
}
}
}
}
}
With the data you provided, it returns:
{
"took": 43,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"city_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new",
"doc_count": 2,
"name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "bob",
"doc_count": 1
},
{
"key": "gary",
"doc_count": 1
}
]
}
},
{
"key": "york",
"doc_count": 2,
"name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "bob",
"doc_count": 1
},
{
"key": "gary",
"doc_count": 1
}
]
}
},
{
"key": "atlanta",
"doc_count": 1,
"name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "mike",
"doc_count": 1
}
]
}
}
]
}
}
}
(You may want to add "index":"not_analyzed" to one or both fields in your mapping, if these results are not what you were expecting.)
Here's the code I used to test it:
http://sense.qbox.io/gist/07735aadc082c1c60409931c279f3fd85a340dbb
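The buckets "new" and "york" in the response above come from the string fields being analyzed: the city is indexed as separate lowercase tokens, and the terms aggregation counts those tokens. The Python sketch below only imitates that effect to show where the counts come from. The rough_tokens function is a crude stand-in invented here (lowercase, split on anything that is not a letter or digit) and not the real analyzer.

```python
import re
from collections import Counter


def rough_tokens(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


# City values from the three sample documents.
cities = ["New York", "New York", "Atlanta"]

# Analyzed field: each document contributes each of its distinct tokens once.
analyzed = Counter(tok for city in cities for tok in set(rough_tokens(city)))

# not_analyzed field: each document contributes the whole string as one term.
exact = Counter(cities)

print(sorted(analyzed.items()))
print(sorted(exact.items()))
```

The first counter matches the doc_count values in the response (2 for "new", 2 for "york", 1 for "atlanta"), while the second shows what the city buckets would look like with "index":"not_analyzed".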