Elasticsearch 2.1: Intersection of aggregations

I have some sample data in Elasticsearch, which looks like this:
Data1: {
"name": "rahul",
"socialnetwork": "facebook",
"day": 1
}
Data2: {
"name": "rahul",
"searchengine": "google",
"day": 1
}
Data3: {
"name": "vivek",
"socialnetwork": "facebook",
"day": 1
}
Data4: {
"name": "devendra",
"searchengine": "google",
"day": 2
}
Data5: {
"name": "rahul",
"socialnetwork": "facebook",
"day": 2
}
I need to get aggregations on the "name" field, where socialnetwork = "facebook" and searchengine = "google".
As far as I know, we can use two aggregations and get an intersection of aggregations.
1st aggregation :
{
"query": {
"match": {
"searchengine": "google"
}
},
"aggs": {
"searcheng": {
"terms": {
"field": "name"
}
}
}
}
2nd aggregation :
{
"query": {
"match": {
"socialnetwork": "facebook"
}
},
"aggs": {
"socialnet": {
"terms": {
"field": "name"
}
}
}
}
Then take the common buckets (i.e. the intersection) of the two aggregations.
But I am not able to get the intersection using Elasticsearch.
I have tried many things, including sub-aggregations (which don't help in this case), the significant terms aggregation (whose results are not good enough), filters and pipeline aggregations, but couldn't find a solution.
The sample data above is a simplified version of a much larger data set, and there are around 20 filters rather than two.

No, you don't need an intersection of two aggregations.
This can be achieved with a bool query. For your desired output you can use a should clause, combined with min_doc_count on the terms aggregation.
{
"query": {
"bool": {
"should": [
{
"match": {
"searchengine": "google"
}
},
{
"match": {
"socialnetwork": "facebook"
}
}
],
"minimum_number_should_match": 1
}
},
"aggs": {
"searcheng": {
"terms": {
"field": "name",
"min_doc_count" :2
}
}
}
}
Hope it helps.
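Note that min_doc_count counts every matching document per name, so a name with two facebook documents and no google document would also pass the query above. If a strict intersection is needed, one option is to run the two terms aggregations from the question and intersect the bucket keys on the client. A minimal Python sketch, assuming each response has the usual aggregations.<name>.buckets shape; the helper name intersect_bucket_keys and the hand-written sample responses are made up for illustration:

```python
def intersect_bucket_keys(responses):
    """Return the bucket keys present in every terms-aggregation response.

    `responses` is a list of (response_dict, aggregation_name) pairs.
    """
    key_sets = []
    for response, agg_name in responses:
        buckets = response["aggregations"][agg_name]["buckets"]
        key_sets.append({bucket["key"] for bucket in buckets})
    if not key_sets:
        return set()
    return set.intersection(*key_sets)


# Hand-written responses matching the sample data in the question.
google = {"aggregations": {"searcheng": {"buckets": [
    {"key": "devendra", "doc_count": 1},
    {"key": "rahul", "doc_count": 1},
]}}}
facebook = {"aggregations": {"socialnet": {"buckets": [
    {"key": "rahul", "doc_count": 2},
    {"key": "vivek", "doc_count": 1},
]}}}

common = intersect_bucket_keys([(google, "searcheng"), (facebook, "socialnet")])
print(sorted(common))
```

A terms aggregation only returns its top buckets (10 by default), so with real data the size of each aggregation would have to be large enough to cover all names before intersecting.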

Related

Filter and sort based on attributes in Terms lookup document in Elasticsearch

I have some documents in my index:
POST "/index/thing/_bulk" -s -d'
{ "index":{ "_id": 1 } }
{ "title":"One thing"}
{ "index":{ "_id": 2 } }
{ "title":"Second thing"}
{ "index":{ "_id": 3 } }
{ "title":"Three things"}
{ "index":{ "_id": 4 } }
{ "title":"And so fourth"}
{ "index":{ "_id": 5 } }
{ "title":"Five things"}
'
I also have documents that contain a user's collection. They are linked to the other documents (things) through the document's id attribute, like so:
PUT /index/collection/1
{
"items": [
{"id": 1, "time_added": "2017-08-07T09:07:15.000Z", "condition": "fair"},
{"id": 3, "time_added": "2019-08-07T09:07:15.000Z", "condition": "good"},
{"id": 4, "time_added": "2016-08-07T09:07:15.000Z", "condition": "poor"}
]
}
I then use a terms lookup to get all the things in a user's collection, like so:
GET /documents/_search
{
"query" : {
"terms" : {
"_id" : {
"index" : "index",
"type" : "collection",
"id" : 1,
"path" : "items.id"
}
}
}
}
This works fine. I get the three documents in the collection and can search, sort and use aggregations like I want.
But is there a way to aggregate, filter and sort those documents based on the attributes (time_added or condition in this case) in the collection document? Say I wanted to sort by time_added, or filter for condition == "good", from the collection?
Maybe a script that can be applied to the collection to sort or filter the items in there? It feels like this is getting pretty close to a SQL-like left join, so maybe Elasticsearch is the wrong tool?
It looks like you need the nested data type.
Taking your data as an example:
Without nested type:
POST collection/_bulk?filter_path=_
{"index":{}}
{"items":[{"id":11,"time_added":"2017-08-07T09:07:15.000Z","condition":"fair"},{"id":13,"time_added":"2019-08-07T09:07:15.000Z","condition":"good"},{"id":14,"time_added":"2016-08-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":21,"time_added":"2017-09-07T09:07:15.000Z","condition":"fair"},{"id":23,"time_added":"2019-09-07T09:07:15.000Z","condition":"good"},{"id":24,"time_added":"2016-09-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":31,"time_added":"2017-10-07T09:07:15.000Z","condition":"fair"},{"id":33,"time_added":"2019-10-07T09:07:15.000Z","condition":"good"},{"id":34,"time_added":"2016-10-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":41,"time_added":"2017-11-07T09:07:15.000Z","condition":"fair"},{"id":43,"time_added":"2019-11-07T09:07:15.000Z","condition":"good"},{"id":44,"time_added":"2016-11-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":51,"time_added":"2017-12-07T09:07:15.000Z","condition":"fair"},{"id":53,"time_added":"2019-12-07T09:07:15.000Z","condition":"good"},{"id":54,"time_added":"2016-12-07T09:07:15.000Z","condition":"poor"}]}
Query (returns incorrect results: one hit is expected, but all five documents match):
GET collection/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"items.condition": {
"value": "good"
}
}
},
{
"range": {
"items.time_added": {
"lte": "2019-09-01"
}
}
}
]
}
}
}
Aggregation (incorrect results: the first bucket, "2016-08-01T00:00:00.000Z", contains three CONDITION sub-buckets, one for every condition type):
GET collection/_search
{
"size": 0,
"aggs": {
"DATE": {
"date_histogram": {
"field": "items.time_added",
"calendar_interval": "month"
},
"aggs": {
"CONDITION": {
"terms": {
"field": "items.condition.keyword",
"size": 10
}
}
}
}
}
}
With nested type
DELETE collection
PUT collection
{
"mappings": {
"properties": {
"items": {
"type": "nested"
}
}
}
}
# and POST the same data from above
Query (returns just one result)
GET collection/_search
{
"query": {
"nested": {
"path": "items",
"query": {
"bool": {
"must": [
{
"term": {
"items.condition": {
"value": "good"
}
}
},
{
"range": {
"items.time_added": {
"lte": "2019-09-01"
}
}
}
]
}
}
}
}
}
Aggregation (the first date bucket contains just one CONDITION sub-bucket)
GET collection/_search
{
"size": 0,
"aggs": {
"ITEMS": {
"nested": {
"path": "items"
},
"aggs": {
"DATE": {
"date_histogram": {
"field": "items.time_added",
"calendar_interval": "month"
},
"aggs": {
"CONDITION": {
"terms": {
"field": "items.condition.keyword",
"size": 10
}
}
}
}
}
}
}
}
Hope that helps :)
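Why the non-nested query over-matches can be sketched in a few lines of Python. Without the nested type, an array of objects is indexed as independent multi-valued fields, so each clause is checked against the whole document rather than against a single item. The sketch below is only a simulation of that behaviour on two shortened documents from the sample data; the function names are made up, and dates are compared as plain date strings:

```python
CUTOFF = "2019-09-01"

docs = [
    {"items": [
        {"id": 11, "time_added": "2017-08-07T09:07:15.000Z", "condition": "fair"},
        {"id": 13, "time_added": "2019-08-07T09:07:15.000Z", "condition": "good"},
    ]},
    {"items": [
        {"id": 21, "time_added": "2017-09-07T09:07:15.000Z", "condition": "fair"},
        {"id": 23, "time_added": "2019-09-07T09:07:15.000Z", "condition": "good"},
    ]},
]


def matches_flattened(doc):
    # Each clause is checked against all values in the document, so "good"
    # and the early date may come from two different items.
    conditions = [item["condition"] for item in doc["items"]]
    dates = [item["time_added"][:10] for item in doc["items"]]
    return "good" in conditions and any(d <= CUTOFF for d in dates)


def matches_nested(doc):
    # Both clauses must hold for the same item.
    return any(
        item["condition"] == "good" and item["time_added"][:10] <= CUTOFF
        for item in doc["items"]
    )


flat_hits = [d for d in docs if matches_flattened(d)]
nested_hits = [d for d in docs if matches_nested(d)]
print(len(flat_hits), len(nested_hits))
```

In the flattened case the second document matches because its "fair" item supplies the early date while a different item supplies "good", which is the same effect that makes all five documents match in the query above.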

How to aggregate documents in different buckets and then apply filters to the result

I have many elasticsearch documents in this format:
{
"_index": "testIndex",
"_type": "_doc",
"_id": "0kt102sBt5sWDQMwsMNJ",
"_score": 1.4376891,
"_source": {
"id": "8dJs76YI",
"entity": "movie",
"actor": "Pier",
"action": "like",
"source": "tablet",
"tag": [
"drama"
],
"location": "3.698492,-73.697308",
"country": "",
"city": "",
"timestamp": "2019-07-04T05:35:01Z"
}
}
This index stores all the activities performed against a movie entity. id is the movie id, action can be like, view, share, etc., and actor is the name of the user.
I want to use an aggregation to get the movies that have between 1000 and 10000 likes in total and were also liked by the actor Pier, restricted to movies tagged comedy.
The query needs to combine bool, terms and range queries with aggregations. I have tried the filters aggregation, but the example in the official documentation is not enough for this case.
Can anyone please give an example of how to build this query?
Thanks.
I'd begin by writing the query for the data that isn't part of the aggregation: the actor, the tag and the action.
{
"query": {
"bool": {
"filter": [
{
"term": {
"actor": "Pier"
}
},
{
"term": {
"tag": "comedy"
}
},
{
"term": {
"action": "like"
}
}
]
}
}
}
This should keep only like events performed by the user Pier on movies tagged comedy.
The next thing is aggregating and getting counts per movie, so it certainly makes sense to use terms aggregation to group everything by id.
{
"query": {
"bool": {
"filter": [
{
"term": {
"actor": "Pier"
}
},
{
"term": {
"tag": "comedy"
}
},
{
"term": {
"action": "like"
}
}
]
}
},
"aggs": {
"movies": {
"terms": {
"field": "id",
"min_doc_count": 1000
}
}
}
}
With this query you should already get counts per movie. Because the query filter has already been applied, these counts cover only like events by Pier on comedy movies, not every like a movie has received. min_doc_count enforces the lower bound of 1000; each bucket still has to be filtered to enforce the upper bound.
To add the maximum number of likes per movie, use a bucket selector:
{
"query": {
"bool": {
"filter": [
{
"term": {
"actor": "Pier"
}
},
{
"term": {
"tag": "comedy"
}
},
{
"term": {
"action": "like"
}
}
]
}
},
"aggs": {
"movieIds": {
"terms": {
"field": "id",
"min_doc_count": 1000
},
"aggs": {
"likesWithinRange": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": {
"inline": "params.doc_count < 10000"
}
}
}
}
}
}
}
Hopefully that works, or at least points you in the right direction.
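The selection rule that min_doc_count and the bucket selector implement together can be sketched in Python as a plain filter over terms buckets. This is only an illustration of the logic, not something Elasticsearch runs; the helper name select_buckets and the bucket keys other than 8dJs76YI are made up:

```python
def select_buckets(buckets, min_count=1000, max_count=10000):
    """Keep buckets with min_count <= doc_count < max_count."""
    return [b for b in buckets if min_count <= b["doc_count"] < max_count]


buckets = [
    {"key": "8dJs76YI", "doc_count": 4200},
    {"key": "movieA", "doc_count": 999},     # dropped: below min_doc_count
    {"key": "movieB", "doc_count": 10000},   # dropped: fails doc_count < 10000
]

selected = select_buckets(buckets)
print([b["key"] for b in selected])
```

The lower bound is inclusive because min_doc_count keeps buckets with at least that many documents, while the upper bound is exclusive because the script uses a strict less-than comparison.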

Elasticsearch - Aggregations on part of bool query

Say I have this bool query:
"bool" : {
"should" : [
{ "term" : { "FirstName" : "Sandra" } },
{ "term" : { "LastName" : "Jones" } }
],
"minimum_should_match" : 1
}
meaning I want to match all the people with first name Sandra OR last name Jones.
Now, is there any way that I can perform an aggregation on all the documents that matched the first term only?
For example, I want to get all of the unique values of "Prizes" that anybody named Sandra has. Normally I'd just do:
"query": {
"match": {
"FirstName": "Sandra"
}
},
"aggs": {
"Prizes": {
"terms": {
"field": "Prizes"
}
}
}
Is there any way to combine the two so I only have to perform a single query which returns all of the people with first name Sandra or last name Jones, AND an aggregation only on the people with first name Sandra?
Thanks a lot!
Use post_filter.
Please refer to the following query. post_filter makes sure that your bool should clause doesn't affect the scope of your aggregation.
Aggregations are restricted by the main query, but they are unaffected by post_filter.
{
"from": 0,
"size": 20,
"aggs": {
"filtered_lastname": {
"filter": {
"query": {
"match": {
"FirstName": "sandra"
}
}
},
"aggs": {
"prizes": {
"terms": {
"field": "Prizes",
"size": 10
}
}
}
}
},
"post_filter": {
"bool": {
"should": [{
"term": {
"FirstName": "Sandra"
}
}, {
"term": {
"LastName": "Jones"
}
}],
"minimum_should_match": 1
}
}
}
Running a filter inside the aggs before aggregating on prizes lets you achieve your desired use case.
Thanks
Hope this helps
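The difference in scope can be sketched with a small in-memory simulation in Python: the aggregation is computed over everything the main query matches (here, all documents, since no main query is given) and narrowed by its own filter, while post_filter only trims the returned hits. The sample people and the search helper are made up for illustration:

```python
people = [
    {"FirstName": "Sandra", "LastName": "Smith", "Prizes": "gold"},
    {"FirstName": "Sandra", "LastName": "Jones", "Prizes": "silver"},
    {"FirstName": "Tom", "LastName": "Jones", "Prizes": "bronze"},
    {"FirstName": "Ann", "LastName": "Lee", "Prizes": "gold"},
]


def search(docs, agg_filter, post_filter):
    # No main query is given, so the aggregation sees every document and is
    # narrowed only by its own filter.
    prizes = sorted({d["Prizes"] for d in docs if agg_filter(d)})
    # post_filter is applied to the hits only, after aggregations are computed.
    hits = [d for d in docs if post_filter(d)]
    return hits, prizes


hits, prizes = search(
    people,
    agg_filter=lambda d: d["FirstName"] == "Sandra",
    post_filter=lambda d: d["FirstName"] == "Sandra" or d["LastName"] == "Jones",
)
print(len(hits), prizes)
```

Tom Jones appears in the hits but his prize does not appear in the aggregation, which is the behaviour the question asks for.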

Document count aggregation via query in Elasticsearch (like facet.query in solr)

I have a main query and I need the number of matches for a couple of sub-queries.
In Solr terms, I need a facet.query. What I am missing is a simple doc_count aggregation, analogous to the value_count aggregation.
Any suggestions?
I found two possible solutions, neither of which I like:
Use filter aggregation with value_count metric on _id:
example:
GET _search
{
"query": {
"match_main": {}
},
"aggs": {
"facetvalue1": {
"filter": {
"bool": {
"should": [
{"match": { "name": "fred" }},
{"term": { "lastname": "krueger" }}
]
}
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"facetvalue2": {
"filter": {
"term": { "name": "freddy" }
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
}
}
}
Use Multi Search API
example:
GET _msearch
{"index":"myindex"}
{"query":{"match_main": {}}}
{"index":"myindex"}
{"size": 0, "query":{"match_main": {}}, "filter": {"bool": {"should":[{"match": { "name": "fred" }},{"term": { "lastname": "krueger" }}]}}}
{"index":"myindex"}
{"size": 0, "query":{"match_main": {}},"filter": {"term": { "name": "freddy" }}}
Solution 2 is faster, but imagine match_main being a complex query!
So I would prefer solution 1 if there were a doc_count: {} instead of value_count: {"field": "_id"}.
But back to my basic question: what is the counterpart of Solr's facet.query in Elasticsearch?
You can use a filters aggregation for this. Note the additional s, that is different from the filter aggregation you already mentioned.
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"values": {
"filters": {
"filters": {
"value1": {
"bool": {
"should": [
{
"match": {
"name": "fred"
}
},
{
"term": {
"lastname": "krueger"
}
}
]
}
},
"value2": {
"term": {
"name": "freddy"
}
}
}
}
}
}
}
This will return something like
"aggregations": {
"values": {
"buckets": {
"value1": {
"doc_count": 4
},
"value2": {
"doc_count": 1
}
}
}
}
Edit: As a general note, you don't have to use a metric aggregation on your bucket aggregations. If you don't provide any subaggregations, you will just get the document count. In this case, filters will provide the buckets, but multiple filter aggregations should work as well.
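On the client side, the named buckets can be turned into a facet.query-style mapping of name to count. A minimal Python sketch, assuming the response shape shown above; the helper name filter_counts is made up:

```python
def filter_counts(response, agg_name):
    """Map each named filter bucket to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {name: bucket["doc_count"] for name, bucket in buckets.items()}


# The response fragment from the answer above.
response = {"aggregations": {"values": {"buckets": {
    "value1": {"doc_count": 4},
    "value2": {"doc_count": 1},
}}}}

counts = filter_counts(response, "values")
print(counts)
```

This relies on the filters being given as a named object, as in the request above, so that the buckets come back keyed by name.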

Multiple filters and an aggregate in elasticsearch

How can I use a filter in connection with an aggregate in elasticsearch?
The official documentation gives only trivial examples for filters and for aggregations, and no formal description of the query DSL (compare it with the Postgres documentation, for example).
By trial and error I found the following query, which Elasticsearch accepts (no parsing errors) but which ignores the given filters:
{
"filter": {
"and": [
{
"term": {
"_type": "logs"
}
},
{
"term": {
"dc": "eu-west-12"
}
},
{
"term": {
"status": "204"
}
},
{
"range": {
"@timestamp": {
"from": 1398169707,
"to": 1400761707
}
}
}
]
},
"size": 0,
"aggs": {
"time_histo": {
"date_histogram": {
"field": "@timestamp",
"interval": "1h"
},
"aggs": {
"name": {
"percentiles": {
"field": "upstream_response_time",
"percents": [
98.0
]
}
}
}
}
}
}
Some people suggest using query instead of filter, but the official documentation generally recommends the opposite for filtering on exact values. Another issue with query: while filters offer an and, query does not.
Can somebody point me to documentation, a blog or a book that describes writing non-trivial queries, with at least an aggregation plus multiple filters?
I ended up using a filter aggregation, not a filtered query. So now I have three nested aggs elements.
I also use a bool filter instead of and, as recommended by @alex-brasetvik, because of http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
My final implementation:
{
"aggs": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"_type": "logs"
}
},
{
"term": {
"dc": "eu-west-12"
}
},
{
"term": {
"status": "204"
}
},
{
"range": {
"@timestamp": {
"from": 1398176502000,
"to": 1400768502000
}
}
}
]
}
},
"aggs": {
"time_histo": {
"date_histogram": {
"field": "@timestamp",
"interval": "1h"
},
"aggs": {
"name": {
"percentiles": {
"field": "upstream_response_time",
"percents": [
98.0
]
}
}
}
}
}
}
},
"size": 0
}
Put your filter in a filtered-query.
The top-level filter is for filtering search hits only, and not facets/aggregations. It was renamed to post_filter in 1.0 due to this quite common confusion.
Also, you might want to look into this post on why you often want to use bool and not and/or: http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
More on @geekQ's answer: to support filter strings that contain spaces when filtering on multiple terms, use match_phrase as below:
{ "aggs": {
"aggresults": {
"filter": {
"bool": {
"must": [
{
"match_phrase": {
"term_1": "some text with space 1"
}
},
{
"match_phrase": {
"term_2": "some text with also space 2"
}
}
]
}
},
"aggs" : {
"all_term_3s" : {
"terms" : {
"field":"term_3.keyword",
"size" : 10000,
"order" : {
"_term" : "asc"
}
}
}
}
} }, "size": 0 }
Just for reference, as of version 7.2, I used something like the following to apply multiple filters to an aggregation.
I use a filter aggregation to restrict the documents being aggregated, and bool to build the compound query:
POST movies/_search?size=0
{
"size": 0,
"aggs": {
"test": {
"filter": {
"bool": {
"must": {
"term": {
"genre": "action"
}
},
"filter": {
"range": {
"year": {
"gte": 1800,
"lte": 3000
}
}
}
}
},
"aggs": {
"year_hist": {
"histogram": {
"field": "year",
"interval": 50
}
}
}
}
}
}
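When the same pattern is needed for many filter combinations, the request body can be assembled programmatically. A rough Python sketch, assuming the filter-aggregation shape used in the answers above; the helper name filtered_aggs is made up, and it places every clause in the bool filter list (rather than splitting them between must and filter) on the assumption that scoring does not matter inside an aggregation:

```python
import json


def filtered_aggs(filters, aggs, name="filtered"):
    """Wrap `aggs` in a filter aggregation that requires every clause in `filters`."""
    return {
        "size": 0,  # no hits needed, only aggregation results
        "aggs": {
            name: {
                "filter": {"bool": {"filter": list(filters)}},
                "aggs": aggs,
            }
        },
    }


body = filtered_aggs(
    filters=[
        {"term": {"genre": "action"}},
        {"range": {"year": {"gte": 1800, "lte": 3000}}},
    ],
    aggs={"year_hist": {"histogram": {"field": "year", "interval": 50}}},
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request, for example against the movies index from the answer above.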
