elasticsearch group-by multiple fields - elasticsearch

I am looking for the best way to group data in Elasticsearch.
Elasticsearch doesn't support something like 'group by' in SQL.
Let's say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course you need some metadata (icon, link-target, seo-titles, ...) and custom sorting for the categories.
Using Aggregations:
Example: https://found.no/play/gist/8124563
Looks usable if you have to group by one field, and need some extra fields.
Using multiple Fields in a Facet (won't work):
Example: https://found.no/play/gist/1aa44e2114975384a7c2
Here we lose the relationship between the different fields.
Building funny Facets:
https://found.no/play/gist/8124810
For example, building a category tree using these 3 "solutions" sucks.
Solution 1 may work (ES 1 isn't stable right now).
Solution 2 doesn't work.
Solution 3 is a pain because it feels ugly: you need to prepare a lot of data, and the facets blow up.
Maybe an alternative could be to not store any category data in ES, just the id:
https://found.no/play/gist/a53e46c91e2bf077f2e1
Then you could get the associated category from another system, like redis, memcache or the database.
This would result in clean code, but the performance could become a problem. For example, loading 1k categories from Memcache / Redis / a database could be slow. Another problem is that syncing two databases is harder than syncing one.
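The lookup-and-merge step of that alternative can be sketched in Python. Here category_store is just a dict standing in for Redis / Memcache / the database, and the metadata field names (name, icon) are made up for illustration:

```python
def enrich_buckets(buckets, category_store):
    """Attach category metadata from a side store to terms-aggregation buckets."""
    enriched = []
    for bucket in buckets:
        # bucket["key"] is the category id stored in ES
        meta = category_store.get(bucket["key"], {})
        enriched.append({
            "id": bucket["key"],
            "product_count": bucket["doc_count"],
            "name": meta.get("name"),
            "icon": meta.get("icon"),
        })
    return enriched

buckets = [
    {"key": "cat-1", "doc_count": 120},
    {"key": "cat-2", "doc_count": 40},
]
store = {
    "cat-1": {"name": "Shoes", "icon": "shoe.png"},
    "cat-2": {"name": "Hats", "icon": "hat.png"},
}
print(enrich_buckets(buckets, store))
```

With a real Redis you would batch the lookups (MGET) instead of one get per bucket, otherwise 1k round trips is exactly the slowness described above.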
How do you deal with such problems?
I am sorry for the links, but I can't post more than 2 in one article.

The aggregations API allows grouping by multiple fields, using sub-aggregations. Suppose you want to group by fields field1, field2 and field3:
{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }
        }
      }
    }
  }
}
Of course this can go on for as many fields as you'd like.
Update:
For completeness, here is how the output of the above query looks. Also below is Python code for generating the aggregation query and flattening the result into a list of dictionaries.
{
  "aggregations": {
    "agg1": {
      "buckets": [
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": {
            "buckets": [
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    ...
                  ]
                }
              },
              ...
            ]
          }
        },
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": { ... same structure as above ... }
        },
        ...
      ]
    }
  }
}
The following Python code performs the group-by given the list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need this on Elasticsearch 2.0 and later, where the terms aggregation supports missing values directly).
def group_by(es, fields, include_missing):
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}
    if include_missing:
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }
        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing
        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        buckets.append(agg_result[current_field + '_missing'])
    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]
    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)
    return result

You can use a Composite Aggregation query as follows. This type of query also paginates the results when the number of buckets exceeds the ES default; by using the after field you can access the rest of the buckets:
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"field1": {
"terms": {
"field": "field1"
}
}
},
{
"field2": {
"terms": {
"field": "field2"
}
}
},
{
"field3": {
"terms": {
"field": "field3"
}
}
},
]
}
}
}
You can find more detail on the ES page bucket-composite-aggregation.
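A sketch in Python of following that pagination: search is assumed to be any callable that takes a request body and returns the parsed response (e.g. a thin wrapper around es.search); the after_key handling is the part that matters.

```python
def iterate_composite(search, sources, page_size=100):
    """Yield every composite bucket, following after_key pagination."""
    after = None
    while True:
        composite = {"sources": sources, "size": page_size}
        if after is not None:
            composite["after"] = after  # resume from the last page's after_key
        body = {"size": 0, "aggs": {"my_buckets": {"composite": composite}}}
        agg = search(body)["aggregations"]["my_buckets"]
        buckets = agg.get("buckets", [])
        for bucket in buckets:
            yield bucket
        after = agg.get("after_key")
        if after is None or len(buckets) < page_size:
            break  # last page reached

# Fake two-page response sequence standing in for a real es.search:
pages = [
    {"aggregations": {"my_buckets": {
        "buckets": [{"key": {"field1": "a"}, "doc_count": 3}],
        "after_key": {"field1": "a"},
    }}},
    {"aggregations": {"my_buckets": {
        "buckets": [{"key": {"field1": "b"}, "doc_count": 1}],
    }}},
]
search = lambda body: pages.pop(0)
results = list(iterate_composite(search, [{"field1": {"terms": {"field": "field1"}}}], page_size=1))
print(results)
```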

Some developers will likely be looking for the same implementation with Spring Data ES and the Java ES API. Please find it below:
List<FieldObject> fieldObjectList = Lists.newArrayList();
SearchQuery aSearchQuery = new NativeSearchQueryBuilder().withQuery(matchAllQuery()).withIndices(indexName).withTypes(type)
        .addAggregation(
                terms("ByField1").field("field1").subAggregation(AggregationBuilders.terms("ByField2").field("field2")
                        .subAggregation(AggregationBuilders.terms("ByField3").field("field3")))
        )
        .build();

Aggregations aField1Aggregations = elasticsearchTemplate.query(aSearchQuery, new ResultsExtractor<Aggregations>() {
    @Override
    public Aggregations extract(SearchResponse aResponse) {
        return aResponse.getAggregations();
    }
});

Terms aField1Terms = aField1Aggregations.get("ByField1");
aField1Terms.getBuckets().stream().forEach(aField1Bucket -> {
    String field1Value = aField1Bucket.getKey();
    Terms aField2Terms = aField1Bucket.getAggregations().get("ByField2");
    aField2Terms.getBuckets().stream().forEach(aField2Bucket -> {
        String field2Value = aField2Bucket.getKey();
        Terms aField3Terms = aField2Bucket.getAggregations().get("ByField3");
        aField3Terms.getBuckets().stream().forEach(aField3Bucket -> {
            String field3Value = aField3Bucket.getKey();
            Long count = aField3Bucket.getDocCount();
            FieldObject fieldObject = new FieldObject();
            fieldObject.setField1(field1Value);
            fieldObject.setField2(field2Value);
            fieldObject.setField3(field3Value);
            fieldObject.setCount(count);
            fieldObjectList.add(fieldObject);
        });
    });
});
The imports needed for the same:
import static org.elasticsearch.index.query.QueryBuilders.matchAllQuery;
import static org.elasticsearch.search.aggregations.AggregationBuilders.terms;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.collect.Lists;
import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.TermFilterBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.filter.InternalFilter;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.ResultsExtractor;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;

Sub-aggregations are what you need. Though this is never explicitly stated in the docs, it follows implicitly from how aggregations are structured: each sub-aggregation is computed as if the query were filtered down to the bucket of the enclosing aggregation. That is effectively what happens here:
{
  "aggregations": {
    "VALUE1AGG": {
      "terms": {
        "field": "VALUE1"
      },
      "aggregations": {
        "VALUE2AGG": {
          "terms": {
            "field": "VALUE2"
          }
        }
      }
    }
  }
}
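Flattening such a two-level response into rows, roughly what a SQL GROUP BY on both columns would return, can be done client-side. A small Python sketch against a hand-made response (the aggregation names match the query above):

```python
def flatten_two_levels(aggregations, outer="VALUE1AGG", inner="VALUE2AGG"):
    """Turn nested terms buckets into (value1, value2, count) rows."""
    rows = []
    for b1 in aggregations[outer]["buckets"]:
        for b2 in b1[inner]["buckets"]:
            rows.append((b1["key"], b2["key"], b2["doc_count"]))
    return rows

sample = {
    "VALUE1AGG": {"buckets": [
        {"key": "a", "doc_count": 5, "VALUE2AGG": {"buckets": [
            {"key": "x", "doc_count": 3},
            {"key": "y", "doc_count": 2},
        ]}},
    ]}
}
print(flatten_two_levels(sample))   # [('a', 'x', 3), ('a', 'y', 2)]
```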

Related

Elastic Search - Aggregating on Sub Aggregations

I am looking for a way to group aggregation results so I can filter them down. Currently my response is pretty large (>1 MB) and I'm hoping to return only the top matching filters.
I'm not sure if Elastic is capable of grouping aggregations by the sub aggregation without using nesting, but I figured I would give it a try.
The filter data is stored in an array on each of my objects:
// document a
"attributeValues" : [
"A12345|V12345",
"A22345|V22345",
...
]
// document b
"attributeValues" : [
"A12345|V15555",
"A22345|V22345",
...
]
I am currently aggregating on the values and getting results like this:
{
  "key": "A12345|V12345",
  "doc_count": 10
},
{
  "key": "A12345|V15555",
  "doc_count": 7
},
{
  "key": "A22345|V22345",
  "doc_count": 5
}
I would like to be able to group these aggregations by the first part of the string so that I can return only the top 10 matches and get something like this:
"topAttributes" : {
"buckets" : [
{
"key" : "A12345",
"doc_count" : 17,
"attributes" : {
"buckets" : [
{
"key": "A12345|V12345",
"doc_count": 10
},
{
"key": "A12345|V15555",
"doc_count": 7
},
I have tried filtering with a script, but I cannot seem to find anywhere online (I've checked many questions) how to group on the sub-aggregation's results.
The script would look something like this:
GET test_index/_search
{
  "size": 0,
  "aggs": {
    "attributeValuesTop": {
      "terms": {
        "size": 10,
        "script": {
          "source": """
            return attributes.splitOnToken('|')[0];
          """
        }
      },
      "aggs": {
        "attributes": {
          "terms": {
            "field": "attributeValues",
            "size": 10000
          }
        }
      }
    }
  }
}
NOTE: I know we can use a nested solution, but nested is too slow for the number of documents we have (millions of records) and our target of sub-300ms searches.
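One client-side alternative (an assumption, not something the question confirms is fast enough): keep the flat terms aggregation on the raw "attr|value" keys as already shown, then group the buckets by the part before '|' in application code and keep the top attributes. A Python sketch:

```python
from collections import defaultdict

def group_by_attribute(buckets, top_n=10):
    """Group flat "attr|value" buckets by attribute prefix, summing counts."""
    groups = defaultdict(lambda: {"doc_count": 0, "attributes": []})
    for bucket in buckets:
        attr = bucket["key"].split("|")[0]
        groups[attr]["doc_count"] += bucket["doc_count"]
        groups[attr]["attributes"].append(bucket)
    # Keep only the top_n attributes by total document count.
    top = sorted(groups.items(), key=lambda kv: kv[1]["doc_count"], reverse=True)
    return [{"key": k, **v} for k, v in top[:top_n]]

buckets = [
    {"key": "A12345|V12345", "doc_count": 10},
    {"key": "A12345|V15555", "doc_count": 7},
    {"key": "A22345|V22345", "doc_count": 5},
]
print(group_by_attribute(buckets))
```

This reproduces the topAttributes shape from the question; the trade-off is that the flat aggregation response still has to come back from ES before trimming.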

Elasticsearch Terms Aggregation on array -- filter to buckets that match your query?

I'm using an Elasticsearch terms aggregation to bucket based on an array property on each document. I'm running into an issue where I get back buckets that are not in my query, and I'd like to filter those out.
Let's say each document is a Post, and has an array property media which specifies which social media website the post is on (and may be empty):
{
  id: 1,
  media: ["facebook", "twitter", "instagram"]
}
{
  id: 2,
  media: ["twitter", "instagram", "tiktok"]
}
{
  id: 3,
  media: ["instagram"]
}
{
  id: 4,
  media: []
}
And, let's say there's another index of Users, which stores a favorite_media property of the same type.
{
  id: 42,
  favorite_media: ["twitter", "instagram"]
}
I have a query that uses a terms lookup to filter, then does a terms aggregation.
{
  "query": {
    "filter": {
      "terms": {
        "index": "user_index",
        "id": 42,
        "path": "favorite_media"
      }
    }
  },
  "aggs": {
    "Posts_by_media": {
      "terms": {
        "field": "media",
        "size": 1000
      }
    }
  }
}
This will result in:
{
  ...
  "aggregations": {
    "Posts_by_media": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "instagram",
          "doc_count": 3
        },
        {
          "key": "twitter",
          "doc_count": 2
        },
        {
          "key": "facebook",
          "doc_count": 1
        },
        {
          "key": "tiktok",
          "doc_count": 1
        }
      ]
    }
  }
}
Because media is an array property, any document that matches the filter will be used to create buckets, so I'll have buckets that don't match my filter. Here I want to get back only the twitter and instagram buckets, since those are the two that I'm filtering to (via the terms lookup).
I know terms aggregations offer an include option, but that doesn't work for me here since I'm using a terms lookup and don't know the data in favorite_media at query time.
How can I limit my buckets to be only those that match the filters in my query?
Thank you for your help!
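One workaround (an assumption, not from the question): resolve the user's favorite_media with a separate GET on the user index first, then either pass those values as the terms aggregation's include list or drop the unwanted buckets client-side. The client-side variant is trivial in Python:

```python
def filter_buckets(buckets, allowed):
    """Keep only aggregation buckets whose key is in the allowed set."""
    allowed = set(allowed)
    return [b for b in buckets if b["key"] in allowed]

buckets = [
    {"key": "instagram", "doc_count": 3},
    {"key": "twitter", "doc_count": 2},
    {"key": "facebook", "doc_count": 1},
    {"key": "tiktok", "doc_count": 1},
]
print(filter_buckets(buckets, ["twitter", "instagram"]))
# [{'key': 'instagram', 'doc_count': 3}, {'key': 'twitter', 'doc_count': 2}]
```

The extra GET costs one round trip, but it also unlocks the include parameter on the terms aggregation, which avoids shipping unwanted buckets at all.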

Find all distinct values for particular field in index by using Elastic4s client

How can I use elastic4s to get all distinct values for particular field?
Example in JSON:
GET persons/_search
{
  "size": 0,
  "aggs": {
    "uniq_gender": {
      "terms": { "field": "Gender" }
    }
  }
}
Use a terms aggregation. It looks like this:
index: 'test',
body: {
  "aggs": {
    "uniq_gender": {
      "terms": { "field": "Gender" }
    }
  }
},
The response would look a little something like:
{
  "uniq_gender": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "male",
        "doc_count": 55
      },
      {
        "key": "female",
        "doc_count": 38
      },
      {
        "key": "other",
        "doc_count": 1
      }
    ]
  }
}
Important to note from the docs:
The terms aggregation should target a field of type keyword, or any other data type suitable for bucket aggregations. In order to use it with text you will need to enable fielddata.
Meaning, if your gender field is indexed as text, you have to set fielddata to true on the mapping; find out how to do so here.
For elastic4s client, use termsAgg(name: String, field: String).
For your case, code looks like this:
search("persons")
.size(0)
.aggs(
termsAgg("uniq_gender", "Gender")
)

ElasticSearch group by and aggregate

I have a bunch of network traffic logs in ES and want to get some high level stats for each source:dest pair.
In SQL, I’d do something like:
SELECT src, dst, SUM(bytes)
FROM net_traffic
WHERE start>1518585000000
AND end<1518585300000
GROUP BY src, dst
(start and end are just epoch times during which the traffic was seen)
How can I extract the same information from the data stored in ES?
I’m coding the solution in Ruby but ideally just want an ES query to pull out the data - so solution is hopefully agnostic of implementation language.
Elasticsearch supports sub-aggregations. You can use them and then, on the application side, convert the query result into what you want.
Query:
{
  "size": 0,
  "aggs": {
    "src_agg": {
      "terms": {
        "field": "src"
      },
      "aggs": {
        "dst_agg": {
          "terms": {
            "field": "dst"
          }
        }
      }
    }
  }
}
Sample of the result:
{
  "key": "X1",
  "doc_count": 5,
  "dst_agg": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Y1",
        "doc_count": 2 // ***
      },
      {
        "key": "Y2",
        "doc_count": 3 // ***
      }
    ]
  }
}
You can extract the desired data from the lines marked *** in the result:
(X1, Y1) = 2, (X1, Y2) = 3
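Note that this gives document counts, not SUM(bytes) as in the SQL. To reproduce the sum, you can add a sum sub-aggregation under dst_agg and put the time window into the query. A Python sketch of building that request body (total_bytes is an assumed aggregation name; start, end and bytes are the field names from the SQL):

```python
def traffic_sum_query(start_ms, end_ms):
    """Build a request body grouping by src and dst and summing bytes."""
    return {
        "size": 0,
        # WHERE start > ? AND end < ? from the SQL, as range filters
        "query": {"bool": {"filter": [
            {"range": {"start": {"gt": start_ms}}},
            {"range": {"end": {"lt": end_ms}}},
        ]}},
        # GROUP BY src, dst with a SUM(bytes) metric on each leaf bucket
        "aggs": {"src_agg": {
            "terms": {"field": "src"},
            "aggs": {"dst_agg": {
                "terms": {"field": "dst"},
                "aggs": {"total_bytes": {"sum": {"field": "bytes"}}},
            }},
        }},
    }

body = traffic_sum_query(1518585000000, 1518585300000)
```

In the response, each per-destination sum then appears on the leaf bucket as bucket["total_bytes"]["value"].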

Elasticsearch counts of multiple indices

I am creating a report to compare the actual counts from the database with the indexed records.
I have three indices: index1, index2 and index3.
To get the count for a single index I am using the following URL:
http://localhost:9200/index1/_count?q=_type:invoice
=> {"count":50,"_shards":{"total":5,"successful":5,"failed":0}}
For multiple indices:
http://localhost:9200/index1,index2/_count?q=_type:invoice
=> {"count":80,"_shards":{"total":5,"successful":5,"failed":0}}
Now the counts are added up; I want them grouped by index. Also, how can I pass filters to group by a specific field, to get output like this:
{"index1_count":50,"index2_count":50,"approved":10,"rejected":40 ,"_shards":{"total":5,"successful":5,"failed":0}}
You can use _search?search_type=count (in newer versions of Elasticsearch, use size: 0 instead) and do an aggregation on the _index field to distinguish between the indices:
GET /index1,index2/_search?search_type=count
{
"aggs": {
"by_index": {
"terms": {
"field": "_index"
}
}
}
}
and the result would be something like this:
"aggregations": {
"by_index": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "index1",
"doc_count": 50
},
{
"key": "index2",
"doc_count": 80
}
]
}
}
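Reshaping those buckets into the flat form the question asks for ("index1_count": 50, ...) is then a small client-side step, e.g. in Python:

```python
def counts_per_index(aggregations):
    """Flatten by_index buckets into {"<index>_count": <doc_count>} pairs."""
    return {
        b["key"] + "_count": b["doc_count"]
        for b in aggregations["by_index"]["buckets"]
    }

sample = {"by_index": {"buckets": [
    {"key": "index1", "doc_count": 50},
    {"key": "index2", "doc_count": 80},
]}}
print(counts_per_index(sample))   # {'index1_count': 50, 'index2_count': 80}
```

For the approved/rejected breakdown, the same idea applies with a terms aggregation on that status field instead of _index.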
