Applying filters on results of aggregation in elastic search - elasticsearch

I am stuck on a problem where I need to filter the results of an aggregation in Elasticsearch.
For example, assume that the documents have the following fields:
event_name, location, time, user_id
My requirement is to get the IDs of users who have performed a specific action (let's say "logged_in") at least 5 times in the last month. I can already get the users who have logged in during the last month, but how do I filter the results further?
The query I have written is:
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"range":{
"time":{
"from": 1412312824,
"to": 1422142824
}
}
},
{
"term": {
"action": "logged_in"
}
}
]
}
}
}
},
"aggs": {
"result": {
"terms": {
"field": "user_id"
}
}
}
}
Sample output:
user_id, doc_count
1 10
2 25
3 1
4 2
I need to apply a filter to the above result. How do I do it?

I believe you can just add a min_doc_count key to your terms aggregation, like so:
...
"aggs": {
"result": {
"terms": {
"field": "user_id",
"min_doc_count": 5
}
}
}
...
Source: https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-aggregations-bucket-terms-aggregation.html#_minimum_document_count
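To make the effect of min_doc_count concrete, here is a minimal pure-Python sketch that mimics what a terms aggregation does with that option. The terms_agg helper and the in-memory docs list are hypothetical stand-ins (not part of any Elasticsearch client); the counts are taken from the sample output in the question.

```python
from collections import Counter


def terms_agg(docs, field, min_doc_count=1, size=10):
    """Rough imitation of a terms aggregation: count documents per value,
    drop buckets below min_doc_count, return the largest buckets first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    buckets = [
        {"key": key, "doc_count": count}
        for key, count in counts.most_common()
        if count >= min_doc_count
    ]
    return buckets[:size]


# Documents that already passed the time range and action filters,
# reproducing the sample output: user 1 -> 10, 2 -> 25, 3 -> 1, 4 -> 2.
docs = (
    [{"user_id": 1}] * 10
    + [{"user_id": 2}] * 25
    + [{"user_id": 3}] * 1
    + [{"user_id": 4}] * 2
)

buckets = terms_agg(docs, "user_id", min_doc_count=5)
print(buckets)
```

Note that a terms aggregation returns only the top buckets (10 by default), so if more than ten users could qualify, the size parameter would probably need to be raised as well.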

Related

Deduplicate and perform composite aggregation on deduced result

I have an index in Elasticsearch that contains daily transaction data. Each doc mainly has the following fields:
TxnId, Status, TxnType, userId
Two documents can have the same TxnId.
I'm looking for a query that aggregates over status and txnType for unique txnIds. Basically, I'm looking for something like: select unique txnIds from user_table group by status,txnType.
I have an ES query that deduplicates on txnId, and another ES query that performs a composite aggregation on status and txnType. I want to do both in a single query.
I tried the collapse feature, and I also tried the cardinality and dedup features, but the query does not give the correct output:
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"streamSource": 3
}
}
]
}
},
"collapse": {
"field": "txnId"
},
"aggs": {
"buckets": {
"composite": {
"size": 30,
"sources": [
{
"status": {
"terms": {
"field": "status"
}
}
},
{
"txnType": {
"terms": {
"field": "txnType"
}
}
}
]
}
}
}
}

Ignore "match" clause from query in aggregation

I have a query with aggregations. One of the aggregations is on the field starsCount. There is a query clause that filters on the starsCount field, along with other match clauses (hidden for clarity).
I want the starsCount aggregation to ignore the starsCount filtering in its results (the aggregation's result should be as if I had run the same query without the match clause on the starsCount field), while the other aggregation keeps its current behavior.
Can this be done in a single query, or should I use multiple queries?
Here is the (simplified) query:
{
[...]
"aggs": {
"group_by_service": {
"comment": "keep current behaviour",
"terms": {
"field": "services",
"size": 46
}
},
"group_by_stars": {
"comment": "ignore the filter on the starsCount field",
"terms": {
"field": "starsCount",
"size": 100
}
}
},
"query": {
"bool": {
"must": [
[...] filters on other properties, non-relevant
{
"match": {
"starsCount": {
"query": "2"
}
}
}
]
}
}
}
Yes, you can achieve this in a single query by using a post filter and a filter aggregation.
Follow the steps below to build the query:
Remove the starsCount match query from the main query, as it should not affect the group_by_stars aggregation.
Since the starsCount match query should still filter the returned documents, move it to post_filter. Any query inside post_filter filters the documents after the aggregations have been calculated.
Now that starsCount is no longer part of the main query, none of the aggregations are affected by it. What is required, however, is that this filter affects every aggregation except group_by_stars. To achieve this, use a filter aggregation and apply it to all the aggregations except group_by_stars.
The resulting query is shown below. (Note that I have used a term query instead of a match query. You can still use match, but in this case term is a better choice.):
{
"aggs": {
"some_other_agg":{
"filter": {
"term": {
"starsCount": "2"
}
},
"aggs": {
"some_other_agg_filtered": {
"terms": {
"field": "some_other_field"
}
}
}
},
"group_by_service": {
"filter": {
"term": {
"starsCount": "2"
}
},
"aggs": {
"group_by_service_filtered": {
"terms": {
"field": "services",
"size": 46
}
}
}
},
"group_by_stars": {
"terms": {
"field": "starsCount",
"size": 100
}
}
},
"query": {
"bool": {
"must": [
{...} //filter on other properties
]
}
},
"post_filter": {
"term": {
"starsCount": "2"
}
}
}
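When there are many aggregations, wrapping each one by hand gets repetitive. The following is a small sketch of how the request body above could be assembled programmatically; scope_aggs is a hypothetical helper (not an Elasticsearch API), and the empty must list is only a placeholder for the other filters that were elided in the question.

```python
import json


def scope_aggs(aggs, filter_clause, unfiltered):
    """Wrap every aggregation in a filter aggregation, except those whose
    names are listed in `unfiltered`, which are passed through untouched."""
    scoped = {}
    for name, body in aggs.items():
        if name in unfiltered:
            scoped[name] = body
        else:
            scoped[name] = {
                "filter": filter_clause,
                "aggs": {name + "_filtered": body},
            }
    return scoped


stars_filter = {"term": {"starsCount": "2"}}

aggs = {
    "group_by_service": {"terms": {"field": "services", "size": 46}},
    "group_by_stars": {"terms": {"field": "starsCount", "size": 100}},
}

request = {
    # Placeholder: the filters on other properties go here.
    "query": {"bool": {"must": []}},
    "aggs": scope_aggs(aggs, stars_filter, unfiltered={"group_by_stars"}),
    # Applied to the hits only, after the aggregations are computed.
    "post_filter": stars_filter,
}

print(json.dumps(request, indent=2))
```

The same filter clause is reused in both places on purpose: the hits and the wrapped aggregations then see the same set of documents, while group_by_stars still sees every star rating.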

Elasticsearch scoped aggregation not desired results

I have the following query, but the aggregation doesn't seem to be acting on top of the query.
The query returns 3 results, but there are 10 items in the aggregation. It looks like the aggregation is acting on all queried results.
Basically, how do I get the aggregation to take the given query as its input?
{
"query": {
"filtered": {
"filter": {
"and": [
{
"geo_distance": {
"coordinates": [
-79.3931,
43.6709
],
"distance": "15km"
}
},
{
"term": {
"user.type": "2"
}
}
]
},
"query": {
"match": {
"user.shoes": "314"
}
}
}
},
"aggs": {
"dedup": {
"terms": { "field": "user.id" },
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1
}
}
}
}
}
}
So as it turns out, I was expecting the aggregation to act on the paginated results returned by the query, and that's incorrect.
The aggregation takes as input all results of the query, not just the current page.
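The difference can be illustrated with a tiny pure-Python model of a search request. This is only a sketch: the search function, the flat shoes and user_id fields, and the 40 generated documents are made up for illustration and do not call Elasticsearch.

```python
from collections import Counter


def search(docs, predicate, size=10, from_=0):
    """Toy model: the query selects the matching documents, pagination only
    trims the returned hits, and the aggregation sees every match."""
    matched = [doc for doc in docs if predicate(doc)]
    hits = matched[from_:from_ + size]
    buckets = Counter(doc["user_id"] for doc in matched)
    return {"total": len(matched), "hits": hits, "buckets": dict(buckets)}


# 40 matching documents spread over 10 distinct users (invented data).
docs = [{"user_id": i % 10, "shoes": "314"} for i in range(40)]

result = search(docs, lambda doc: doc["shoes"] == "314", size=3)
print(len(result["hits"]), len(result["buckets"]))
```

With a page size of 3, only three hits come back, yet the aggregation still produces one bucket per user across all 40 matches, which mirrors the "3 results, 10 buckets" observation above.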

Elasticsearch, counting not included terms

I'm trying to use a single ES request, or a couple of them, to count the terms I have not included in my current search.
Let me elaborate.... My front-end looks like this:
I have Closed currently selected, so the other items should show how many results they would add if I were to include that term.
Assume that closed == 500 and rejected == 100.
While I have closed selected, the rejected field should have the number 100 appended to it. If I deselect closed, it should show the number 500. If I select rejected but not closed, it should also show 500.
Easy enough, huh? We just add a bucket counting the status field, which returns a bucket for each of these items; we then get the value from it and display it.
That part I got :) However, when I actually add a term (for example, one that filters on NoOffer), the buckets no longer include the other statuses...
This is what my query looks like (global buckets by: ChintanShah25)
{
"size": 50,
"from": 1,
"sort": [
{
"createdAt": "desc"
}
],
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"wildcard": {
"fromPlace": "*rotter*"
}
}
]
}
},
{
"bool": {
"should": [
{
"wildcard": {
"status": "closed"
}
}
]
}
}
]
}
},
"aggs": {
"status": {
"global": {},
"aggs": {
"all_status": {
"terms": {
"field": "status.raw",
"size": 10
}
}
}
}
}
}
The global aggregation now shows all the different status codes, but it doesn't take the rest of the query into account. The "fromPlace" filter doesn't get applied.
I guess you are looking for the global aggregation, which includes all documents regardless of the query. You could also use a filter aggregation for selective stats if you want.
{
"query": {
"term": {
"status": {
"value": "closed"
}
}
},
"size": 0,
"aggs": {
"everything": {
"global": {},
"aggs": {
"all_status": {
"terms": {
"field": "status.raw",
"size": 10
}
}
}
}
}
}
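One way to combine the two suggestions is to nest a filter aggregation inside the global one, so the status counts ignore the status selection but still respect the "fromPlace" clause. The sketch below only builds the request body as a Python dict. It assumes Elasticsearch 2.0 or later, where an arbitrary query such as wildcard can be used inside a filter aggregation; the aggregation name matching_place is made up, and the query has not been run against a cluster.

```python
import json

# Clauses copied from the question.
place_clause = {"wildcard": {"fromPlace": "*rotter*"}}
status_clause = {"wildcard": {"status": "closed"}}

request = {
    "size": 50,
    "sort": [{"createdAt": "desc"}],
    # The hits are restricted by both the place and the selected status.
    "query": {"bool": {"must": [place_clause, status_clause]}},
    "aggs": {
        "status": {
            # global: start again from every document in the index ...
            "global": {},
            "aggs": {
                "matching_place": {
                    # ... then re-apply everything except the status clause.
                    "filter": place_clause,
                    "aggs": {
                        "all_status": {
                            "terms": {"field": "status.raw", "size": 10}
                        }
                    },
                }
            },
        }
    },
}

print(json.dumps(request, indent=2))
```

The counts would then be read from status.matching_place.all_status.buckets in the response, one bucket per status value among the documents matching the place clause.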

ElasticSearch filtering by field1 THEN field2 THEN take max of field3

I am struggling to get the information that I need from ElasticSearch.
My log statements are like this:
field1: Example
field2: Example2
field3: Example3
I would like to search a timeframe (using last 24 hours) to find all data that has this in field1 and that in field2.
There then may be multiple this.that.[field3] entries, so I want to only return the maximum of that field.
In fact, in my data, field3 is actually the key of the entry.
What is the best way of retrieving the information I need? I have managed to get the results returned using aggs, but the data is in buckets, and I am only interested in the data with the max value of field3.
I have added an example of the query that I am looking to do: https://jsonblob.com/54535d49e4b0d117eeaf6bb4
{
"size": 0,
"aggs": {
"agg_129": {
"filters": {
"filters": {
"CarName: Toyota": {
"query": {
"query_string": {
"query": "CarName: Toyota"
}
}
}
}
},
"aggs": {
"agg_130": {
"filters": {
"filters": {
"Attribute: TimeUsed": {
"query": {
"query_string": {
"query": "Attribute: TimeUsed"
}
}
}
}
},
"aggs": {
"agg_131": {
"terms": {
"field": "#timestamp",
"size": 0,
"order": {
"_count": "desc"
}
}
}
}
}
}
}
},
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2014-10-27T00:00:00.000Z",
"lte": "2014-10-28T23:59:59.999Z"
}
}
}
],
"must_not": []
}
}
}
}
}
So, the example above shows only the documents that have CarName = Toyota and Attribute = TimeUsed.
My data is as follows:
There are x cars (CarName), each car has y Attributes, and each of those Attributes has documents with a timestamp.
To begin with, I was looking for a query for the latest CarName.Attribute.timestamp. However, if I could use just ONE query to get the latest timestamp for EVERY Attribute of EVERY CarName, that would reduce the number of query calls from ~50 to one.
If you are using Elasticsearch v1.3+, you can add a top_hits aggregation with the parameter size: 1 and a descending sort on the field3 value.
This will return the whole document with maximum value on the field, as you wish.
This example in the documentation might do the trick.
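As a rough illustration of that suggestion, the request body could look like the dict built below. The helper top_hit_request and the aggregation name latest are hypothetical, field3 stands for whatever field holds the value to maximise, and the query has not been tested against a cluster.

```python
import json


def top_hit_request(sort_field):
    """Build a request whose only aggregation returns the single document
    with the highest value of `sort_field`."""
    return {
        "size": 0,  # no regular hits, only the aggregation result
        "aggs": {
            "latest": {
                "top_hits": {
                    "size": 1,
                    "sort": [{sort_field: {"order": "desc"}}],
                }
            }
        },
    }


request = top_hit_request("field3")
print(json.dumps(request, indent=2))
```

Placed as a sub-aggregation under a terms aggregation, the same top_hits block would return one such document per bucket.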
Edit:
Ok, it seems you don't need the whole document, but only the maximum timestamp value. You can use a max aggregation instead of using a top_hits one.
The following query (not tested) should give you, in a single request, the maximum timestamp value for each of the top 10 Attribute values within each of the top 10 CarName values.
A terms aggregation is like a GROUP BY clause, so you should not have to query 50 times to retrieve the values for each CarName/Attribute combination: this is the point of nesting a terms aggregation on Attribute inside the CarName aggregation.
Note that, to work properly, the CarName and Attribute fields should be not_analyzed. If it's not the case, you will have "funny" results in your buckets. The problem (and possible solution) is very well described here.
Feel free to change the size parameter of the terms aggregation to fit to your case.
{
"size": 0,
"aggs": {
"by_carnames": {
"terms": {
"field": "CarName",
"size": 10
},
"aggs": {
"by_attribute": {
"terms": {
"field": "Attribute",
"size": 10
},
"aggs": {
"max_timestamp": {
"max": {
"field": "#timestamp"
}
}
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2014-10-27T00:00:00.000Z",
"lte": "2014-10-28T23:59:59.999Z"
}
}
}
]
}
}
}
}
}
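To show what the nested terms and max aggregations compute, here is a small pure-Python equivalent of the grouping. The sample documents (including the Honda and Mileage values and the plain timestamp key) are invented for illustration, and the function only imitates the result shape rather than calling Elasticsearch.

```python
from collections import defaultdict


def max_timestamp_by_car_and_attribute(docs):
    """Group by CarName, then by Attribute, keeping the largest timestamp,
    like a max aggregation nested under two terms aggregations."""
    result = defaultdict(dict)
    for doc in docs:
        car, attribute, ts = doc["CarName"], doc["Attribute"], doc["timestamp"]
        current = result[car].get(attribute)
        # ISO 8601 strings in the same format and time zone sort
        # chronologically, so plain string comparison is enough here.
        if current is None or ts > current:
            result[car][attribute] = ts
    return {car: dict(attributes) for car, attributes in result.items()}


docs = [
    {"CarName": "Toyota", "Attribute": "TimeUsed", "timestamp": "2014-10-27T08:00:00.000Z"},
    {"CarName": "Toyota", "Attribute": "TimeUsed", "timestamp": "2014-10-28T09:30:00.000Z"},
    {"CarName": "Toyota", "Attribute": "Mileage", "timestamp": "2014-10-27T12:00:00.000Z"},
    {"CarName": "Honda", "Attribute": "TimeUsed", "timestamp": "2014-10-28T01:00:00.000Z"},
]

latest = max_timestamp_by_car_and_attribute(docs)
print(latest)
```

Each inner value corresponds to one max_timestamp result in the response, reached through the by_carnames and by_attribute buckets.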
