ElasticSearch group by and aggregate

I have a bunch of network traffic logs in ES and want to get some high level stats for each source:dest pair.
In SQL, I’d do something like:
SELECT src, dst, SUM(bytes)
FROM net_traffic
WHERE start>1518585000000
AND end<1518585300000
GROUP BY src, dst
(start and end are just epoch times during which the traffic was seen)
How can I extract the same information from the data stored in ES?
I’m coding the solution in Ruby but ideally just want an ES query to pull out the data - so solution is hopefully agnostic of implementation language.

Elasticsearch supports sub-aggregations. Use them here, then convert the query result into the shape you need on the application side.
Query:
{
"size": 0,
"aggs": {
"src_agg": {
"terms": {
"field": "src"
},
"aggs": {
"dst_agg": {
"terms": {
"field": "dst"
}
}
}
}
}
}
Sample of result:
{
"key": "X1",
"doc_count": 5,
"dst_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "Y1",
"doc_count": 2 // ***
},
{
"key": "Y2",
"doc_count": 3 // ***
}]
}
}
You can extract the desired data from the lines marked *** in the result:
(X1, Y1) = 2, (X1, Y2) = 3
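
The query above counts documents per pair, but it does not reproduce the WHERE clause or the SUM(bytes) from the question. A minimal sketch of how both could be added, assuming the fields are named src, dst, bytes, start and end as in the SQL, and that src and dst are mapped so a terms aggregation works on them. The helper names build_pair_query and flatten_pairs and the aggregation name total_bytes are made up for illustration:

```python
def build_pair_query(start_after, end_before, size=100):
    """Build a request body roughly equivalent to the SQL in the question."""
    return {
        "size": 0,  # no hits needed, only the aggregations
        "query": {
            "bool": {
                "filter": [
                    {"range": {"start": {"gt": start_after}}},
                    {"range": {"end": {"lt": end_before}}},
                ]
            }
        },
        "aggs": {
            "src_agg": {
                "terms": {"field": "src", "size": size},
                "aggs": {
                    "dst_agg": {
                        "terms": {"field": "dst", "size": size},
                        # hypothetical name for the SUM(bytes) metric
                        "aggs": {"total_bytes": {"sum": {"field": "bytes"}}},
                    }
                },
            }
        },
    }


def flatten_pairs(aggregations):
    """Turn the nested buckets into (src, dst, total_bytes) rows."""
    rows = []
    for src_bucket in aggregations["src_agg"]["buckets"]:
        for dst_bucket in src_bucket["dst_agg"]["buckets"]:
            rows.append(
                (src_bucket["key"], dst_bucket["key"], dst_bucket["total_bytes"]["value"])
            )
    return rows
```

The body can be sent with whatever client you use (Ruby included), since it is plain JSON. Note that a terms aggregation only returns the top `size` buckets per level, so this is not an exhaustive GROUP BY unless `size` covers every distinct value.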

Related

Elastic Search - Aggregating on Sub Aggregations

I am looking for a way to group aggregation results so I can filter them down. Currently my response is pretty large (> 1mb) and I'm hoping to return only the top matching filters.
I'm not sure if Elastic is capable of grouping aggregations by the sub aggregation without using nesting, but I figured I would give it a try.
The filter data is stored in an array on each of my objects:
// document a
"attributeValues" : [
"A12345|V12345",
"A22345|V22345",
...
]
// document b
"attributeValues" : [
"A12345|V15555",
"A22345|V22345",
...
]
I am currently aggregating on the values and getting results like this:
{
"key": "A12345|V12345",
"doc_count": 10
},
{
"key": "A12345|V15555",
"doc_count": 7
},
{
"key": "A22345|V22345",
"doc_count": 5
},
I would like to be able to group these aggregations by the first part of the string so that I can return only the top 10 matches and get something like this:
"topAttributes" : {
"buckets" : [
{
"key" : "A12345",
"doc_count" : 17,
"attributes" : {
"buckets" : [
{
"key": "A12345|V12345",
"doc_count": 10
},
{
"key": "A12345|V15555",
"doc_count": 7
},
I have tried to filter using a script on the field, but I cannot find anywhere online (I checked many questions) how to get at the sub-aggregation's results.
The script would look something like this:
GET test_index/_search
{
"size" : 0,
"aggs": {
"attributeValuesTop": {
"terms": {
"size": 10,
"script": {
"source": """
return attributes.splitOnToken('|')[1];
"""
}
},
"aggs": {
"attributes": {
"terms": {
"field": "attributeValues",
"size": 10000
}
}
}
}
}
}
NOTE: I know we can use a nested solution, but nested is too slow for the amount of documents we have (millions of records) and the target of sub 300ms searches.
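
To make the desired output concrete, here is a minimal client-side sketch that groups the flat buckets by the part of the key before the separator. It does not shrink the Elasticsearch response itself; it only shows the target transformation. The function name group_by_attribute is hypothetical, and the summed doc_count counts value occurrences, so a document holding two values of the same attribute would be counted twice:

```python
from collections import defaultdict


def group_by_attribute(buckets, top_n=10, separator="|"):
    """Group flat terms buckets by the attribute prefix of their key."""
    grouped = defaultdict(list)
    for bucket in buckets:
        attribute = bucket["key"].split(separator, 1)[0]
        grouped[attribute].append(bucket)

    result = [
        {
            "key": attribute,
            # sum of the member counts, not a distinct document count
            "doc_count": sum(b["doc_count"] for b in members),
            "attributes": {"buckets": members},
        }
        for attribute, members in grouped.items()
    ]
    # largest groups first, then keep only the top N
    result.sort(key=lambda group: group["doc_count"], reverse=True)
    return result[:top_n]
```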

Elasticsearch Terms Aggregation on array -- filter to buckets that match your query?

I'm using an Elasticsearch terms aggregation to bucket based on an array property on each document. I'm running into an issue where I get back buckets that are not in my query, and I'd like to filter those out.
Let's say each document is a Post, and has an array property media which specifies which social media website the post is on (and may be empty):
{
id: 1
media: ["facebook", "twitter", "instagram"]
}
{
id: 2
media: ["twitter", "instagram", "tiktok"]
}
{
id: 3
media: ["instagram"]
}
{
id: 4
media: []
}
And, let's say there's another index of Users, which stores a favorite_media property of the same type.
{
id: 42
favorite_media: ["twitter", "instagram"]
}
I have a query that uses a terms lookup to filter, then does a terms aggregation.
{
"query": {
"bool": {
"filter": {
"terms": {
"media": {
"index": "user_index",
"id": 42,
"path": "favorite_media"
}
}
}
}
},
"aggs": {
"Posts_by_media": {
"terms": {
"field": "media",
"size": 1000
}
}
}
}
This will result in:
{
...
"aggregations": {
"Posts_by_media": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "instagram",
"doc_count": 3
},
{
"key": "twitter",
"doc_count": 2
},
{
"key": "facebook",
"doc_count": 1
},
{
"key": "tiktok",
"doc_count": 1
}
]
}
}
}
Because media is an array property, any document that matches the filter will be used to create buckets, and I'll have buckets that don't match my filter. Here I want to get back only the buckets twitter and instagram, since those are the two that I'm filtering to (via the terms lookup).
I know terms aggregations offer an include option, but that doesn't work for me here since I'm using a terms lookup and don't know the data in favorite_media at query time.
How can I limit my buckets to be only those that match the filters in my query?
Thank you for your help!
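
For comparison, this is roughly what the two-step variant would look like if an extra round trip were acceptable: fetch the user's favorite_media first, then pass the same list to both the filter and the aggregation's include option. This is only a sketch under that assumption, not a way to do it in a single request, and the function name build_posts_query is made up:

```python
def build_posts_query(favorite_media):
    """Build a search body that filters and buckets on the same list of values."""
    values = list(favorite_media)
    return {
        "size": 0,
        "query": {"bool": {"filter": {"terms": {"media": values}}}},
        "aggs": {
            "Posts_by_media": {
                "terms": {
                    "field": "media",
                    "size": 1000,
                    # only create buckets for the exact values we filtered on
                    "include": values,
                }
            }
        },
    }
```

With favorite_media set to ["twitter", "instagram"], the facebook and tiktok buckets would no longer be created.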

Subaggregation leads to missing data

Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?
Question in detail: I have a search query with a subaggregation (buckets in buckets) as follows:
{
"size": 0,
"aggs": {
"outer_docs": {
"terms": {"size": 20, "field": "field_1_to_aggregate_on"},
"aggs": {
"inner_docs": {
"terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
"aggs": "things to display here"
}
}
}
}
}
When I execute this query, for some outer_docs I do not receive all of the inner_docs associated with them. In the output below, there are three inner docs for outer doc key_1.
{
"hits": {
"total": 9853,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"outer_docs": {
"doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
"buckets": [
{
"key": "key_1", "doc_count": 3,
"inner_docs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{"key": "1", "doc_count": 1, "some": "data here"},
...
{"key": "3", "doc_count": 1, "some": "data here"},
]
}
},
...
]
}
}
}
Now I add a query that selects a single outer_doc, one that would have been in the first 20 anyway.
"query": {"bool": {"must": [{"term": {"field_1_to_aggregate_on": "key_1"}}]}}
In this case I do get all inner_docs: the output below shows seven inner docs for outer doc key_1.
{
"hits": {
"total": 8,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"outer_docs": {
"doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
"buckets": [
{
"key": "key_1", "doc_count": 8,
"inner_docs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{"key": "1", "doc_count": 1, "some": "data here"},
...
{"key": "7", "doc_count": 2, "some": "data here"},
]
}
},
...
]
}
}
}
I have specified explicitly that I want 10,000 inner_docs per outer_doc. What is preventing me from getting all data?
This is my version information:
{
'build_date': '2018-09-26T13:34:09.098244Z',
'build_flavor': 'default',
'build_hash': '04711c2',
'build_snapshot': False,
'build_type': 'deb',
'lucene_version': '7.4.0',
'minimum_index_compatibility_version': '5.0.0',
'minimum_wire_compatibility_version': '5.6.0',
'number': '6.4.2'
}
EDIT: After digging a bit more, I found out that the issue is not related to subaggregation, but to aggregation itself and the use of shards. I have opened these reports with Elastic about it:
https://discuss.elastic.co/t/bug-in-aggregation-result-when-using-shards/164161
https://github.com/elastic/elasticsearch/issues/37425

Check your Elasticsearch deprecation log file. You will probably find warnings like this:
This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
search.max_buckets is a dynamic cluster setting that defaults to 10,000 buckets in 7.0.
This is not documented anywhere, but in my experience, allocating more than 10,000 buckets terminates the query, and you get back whatever results had been collected up to that point. That would explain the missing data in your result.
Using the composite aggregation will help. Your other option is to increase search.max_buckets. Be careful with that: every bucket has a cost in RAM, so you can crash your entire cluster that way. It does not matter whether you actually use all the allocated buckets; you can crash with empty buckets alone.
See:
https://www.elastic.co/guide/en/elasticsearch/reference/master/breaking-changes-7.0.html#_literal_search_max_buckets_literal_in_the_cluster_setting
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html
https://github.com/elastic/elasticsearch/issues/35896

How about using the composite aggregation for this? Pretty sure that solves your problem.
GET /_search
{
"aggs" : {
"all_docs": {
"composite" : {
"size": 1000,
"sources" : [
{ "outer_docs": { "terms": { "field": "field_1_to_aggregate_on" } } },
{ "inner_docs": { "terms": { "field": "field_2_to_aggregate_on" } } }
]
}
}
}
}
If you have many buckets, the composite aggregation will help you scroll through each of them using size/after.
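
A minimal sketch of that size/after loop, assuming the response carries an after_key next to the buckets (as composite aggregation responses do in recent versions). The `search` argument is a placeholder for whatever sends the body and returns the parsed response as a dict; iter_composite_buckets is a made-up name:

```python
import copy


def iter_composite_buckets(search, body, agg_name):
    """Yield every bucket of a composite aggregation, one page at a time."""
    body = copy.deepcopy(body)  # do not modify the caller's request body
    while True:
        agg = search(body)["aggregations"][agg_name]
        buckets = agg["buckets"]
        if not buckets:
            return
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            return
        # ask for the page that starts after the last bucket we received
        body["aggs"][agg_name]["composite"]["after"] = after_key
```

With an elasticsearch-py client this could be called as something like `iter_composite_buckets(lambda b: es.search(index="my_index", body=b), body, "all_docs")`, where the index name is an assumption.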

It turned out that the problem was not due to subaggregation, and that it is an actual feature of ElasticSearch. We are using 5 shards, and when using shards, aggregations only return approximate results.
We have made this problem reproducible, and posted it in the Elastic discuss forum. There, we learned that aggregations do not always return all data, with a link to the documentation where this is explained in more detail.
We also learned that using only 1 shard solves the issue, and when that is not possible, the parameter shard_size can alleviate the problem.
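
For reference, shard_size goes inside the terms aggregation next to size. Each shard then returns its own top shard_size terms to the coordinating node, so a larger value lowers the chance that a term is dropped before the merge, at the cost of more work per shard. A small sketch, with terms_agg as a made-up helper and 200 as an arbitrary example value:

```python
def terms_agg(field, size, shard_size=None):
    """Build a terms aggregation, optionally with an explicit shard_size."""
    terms = {"field": field, "size": size}
    if shard_size is not None:
        # number of candidate terms each shard sends to the coordinating node
        terms["shard_size"] = shard_size
    return {"terms": terms}


outer = terms_agg("field_1_to_aggregate_on", size=20, shard_size=200)
```

When shard_size is left out, the documented default is size * 1.5 + 10.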

Elasticsearch counts of multiple indices

I am creating a report to compare the actual count from the database with the indexed records.
I have three indices: index1, index2 and index3.
To get the count for a single index I am using the following URL:
http://localhost:9200/index1/_count?q=_type:invoice
=> {"count":50,"_shards":{"total":5,"successful":5,"failed":0}}
For multiple indices:
http://localhost:9200/index1,index2/_count?q=_type:invoice
=> {"count":80,"_shards":{"total":5,"successful":5,"failed":0}}
Now the count is added up, but I want it grouped by index. Also, how can I pass filters and group by a specific field
to get output like this:
{"index1_count":50,"index2_count":50,"approved":10,"rejected":40 ,"_shards":{"total":5,"successful":5,"failed":0}}

You can use _search?search_type=count and do an aggregation on the _index field to distinguish between the indices:
GET /index1,index2/_search?search_type=count
{
"aggs": {
"by_index": {
"terms": {
"field": "_index"
}
}
}
}
and the result would be something like this:
"aggregations": {
"by_index": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "index1",
"doc_count": 50
},
{
"key": "index2",
"doc_count": 30
}
]
}
}
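
The second half of the question (grouping by a specific field as well) can be covered with a sub-aggregation under by_index. A minimal sketch, assuming the approved/rejected values live in a field called status (a hypothetical name; the question does not say). It uses "size": 0 in the body instead of search_type=count, since that search type was removed in later Elasticsearch releases:

```python
def build_count_body(status_field="status"):
    """Count documents per index, broken down by a status-like field."""
    return {
        "size": 0,  # aggregations only, no hits
        "aggs": {
            "by_index": {
                "terms": {"field": "_index"},
                "aggs": {"by_status": {"terms": {"field": status_field}}},
            }
        },
    }


def summarize(aggregations):
    """Flatten the response into {index: {"count": n, "by_status": {...}}}."""
    summary = {}
    for index_bucket in aggregations["by_index"]["buckets"]:
        summary[index_bucket["key"]] = {
            "count": index_bucket["doc_count"],
            "by_status": {
                b["key"]: b["doc_count"] for b in index_bucket["by_status"]["buckets"]
            },
        }
    return summary
```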

elasticsearch group-by multiple fields

I am looking for the best way to group data in Elasticsearch.
Elasticsearch doesn't support something like 'group by' in SQL.
Let's say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course you need some metadata (icon, link-target, seo-titles,...) and custom sorting for the categories.
Using Aggregations:
Example: https://found.no/play/gist/8124563
Looks usable if you have to group by one field, and need some extra fields.
Using multiple Fields in a Facet (won't work):
Example: https://found.no/play/gist/1aa44e2114975384a7c2
Here we lose the relationship between the different fields.
Building funny Facets:
https://found.no/play/gist/8124810
For example, building a category tree using these 3 "solutions" sucks.
Solution 1 May work (ES 1 isn't stable right now)
Solution 2 Doesn't work
Solution 3 Is a pain because it feels ugly, you need to prepare a lot of data and the facets blow up.
Maybe an alternative could be not to store any category data in ES, just the id
https://found.no/play/gist/a53e46c91e2bf077f2e1
Then you could get the associated category from another system, like redis, memcache or the database.
This would result in clean code, but performance could become a problem.
For example, loading 1k categories from Memcache / Redis / a database could be slow.
Another problem is that syncing two databases is harder than syncing one.
How do you deal with such problems?
I am sorry for the links, but I can't post more than 2 in one article.

The aggregations API allows grouping by multiple fields, using sub-aggregations. Suppose you want to group by fields field1, field2 and field3:
{
"aggs": {
"agg1": {
"terms": {
"field": "field1"
},
"aggs": {
"agg2": {
"terms": {
"field": "field2"
},
"aggs": {
"agg3": {
"terms": {
"field": "field3"
}
}
}
}
}
}
}
}
Of course this can go on for as many fields as you'd like.
Update:
For completeness, here is what the output of the above query looks like. Below it is Python code for generating the aggregation query and flattening the result into a list of dictionaries.
{
  "aggregations": {
    "agg1": {
      "buckets": [{
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          },
          {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          }, ...
          ]
        }
      },
      {
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          },
          {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          }, ...
          ]
        }
      }, ...
      ]
    }
  }
}
The following Python code performs the group-by given the list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch thanks to this)
def group_by(es, fields, include_missing):
    # Build one terms aggregation per field, each nested inside the previous one.
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            # The "missing" bucket needs the same sub-aggregations as the terms buckets.
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    # Recursively flatten the nested buckets into a list of dictionaries.
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        buckets.append(agg_result[(current_field + '_missing')])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result

You can use a composite aggregation query as follows. This type of query also paginates the results when there are more buckets than fit in a single response. By using the 'after' field you can access the rest of the buckets:
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"field1": {
"terms": {
"field": "field1"
}
}
},
{
"field2": {
"terms": {
"field": "field2"
}
}
},
{
"field3": {
"terms": {
"field": "field3"
}
}
}
]
}
}
}
You can find more detail on the bucket-composite-aggregation page of the ES documentation.
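
One thing that makes the composite form convenient: each bucket's key is already an object with one entry per source, so the result flattens into rows without any recursion. A minimal sketch (composite_rows is a made-up name, and it assumes no source is itself named doc_count):

```python
def composite_rows(buckets):
    """Turn composite buckets into flat dicts: one key per source plus doc_count."""
    return [dict(bucket["key"], doc_count=bucket["doc_count"]) for bucket in buckets]
```

It would be applied to the list found under aggregations, my_buckets, buckets in the response.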

I think some developers will be looking for the same implementation with Spring Data ES and the Java ES API.
Here it is:
List<FieldObject> fieldObjectList = Lists.newArrayList();
// Three nested terms aggregations: field1 -> field2 -> field3
SearchQuery aSearchQuery = new NativeSearchQueryBuilder()
    .withQuery(matchAllQuery())
    .withIndices(indexName)
    .withTypes(type)
    .addAggregation(
        terms("ByField1").field("field1")
            .subAggregation(AggregationBuilders.terms("ByField2").field("field2")
                .subAggregation(AggregationBuilders.terms("ByField3").field("field3")))
    )
    .build();
Aggregations aField1Aggregations = elasticsearchTemplate.query(aSearchQuery, new ResultsExtractor<Aggregations>() {
    @Override
    public Aggregations extract(SearchResponse aResponse) {
        return aResponse.getAggregations();
    }
});
// Walk the nested buckets and flatten each (field1, field2, field3) combination
Terms aField1Terms = aField1Aggregations.get("ByField1");
aField1Terms.getBuckets().stream().forEach(aField1Bucket -> {
    String field1Value = aField1Bucket.getKey();
    Terms aField2Terms = aField1Bucket.getAggregations().get("ByField2");
    aField2Terms.getBuckets().stream().forEach(aField2Bucket -> {
        String field2Value = aField2Bucket.getKey();
        Terms aField3Terms = aField2Bucket.getAggregations().get("ByField3");
        aField3Terms.getBuckets().stream().forEach(aField3Bucket -> {
            String field3Value = aField3Bucket.getKey();
            Long count = aField3Bucket.getDocCount();
            FieldObject fieldObject = new FieldObject();
            fieldObject.setField1(field1Value);
            fieldObject.setField2(field2Value);
            fieldObject.setField3(field3Value);
            fieldObject.setCount(count);
            fieldObjectList.add(fieldObject);
        });
    });
});
The following imports are needed:
import static org.elasticsearch.index.query.QueryBuilders.matchAllQuery;
import static org.elasticsearch.search.aggregations.AggregationBuilders.terms;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.collect.Lists;
import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.TermFilterBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.filter.InternalFilter;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.ResultsExtractor;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;

Sub-aggregations are what you need. Although this is never explicitly stated in the docs, it follows implicitly from how aggregations are structured.
Each sub-aggregation is computed as if the query had been filtered by the bucket of the aggregation above it.
That does appear to be what actually happens.
{
  "aggregations": {
    "VALUE1AGG": {
      "terms": {
        "field": "VALUE1"
      },
      "aggregations": {
        "VALUE2AGG": {
          "terms": {
            "field": "VALUE2"
          }
        }
      }
    }
  }
}
