Elasticsearch Java API: get the average of a terms aggregation

I'm using Elasticsearch with the Java API, and I'm trying to get the average of the lowest value from each bucket of a terms aggregation. One solution I found is to get the results like this
AggregationBuilders.terms("group_by_flights").field("flight_id")
.subAggregation(AggregationBuilders.min("minimum").field("duration"))
and then compute the average on the client side. The problem is that with a lot of buckets, this allocates a lot of memory. I would like to do it on the Elasticsearch side.
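For reference, the client-side computation described above (the minimum per group, then the average of those minimums) could look roughly like the sketch below. It is plain Java with no Elasticsearch dependency; the `Flight` class and the sample values are hypothetical stand-ins for the indexed documents, not part of the original question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AverageOfMinimums {
    // Hypothetical stand-in for one indexed document.
    static final class Flight {
        final String flightId;
        final double duration;
        Flight(String flightId, double duration) {
            this.flightId = flightId;
            this.duration = duration;
        }
    }

    // Average of the smallest duration found in each flightId group.
    static double averageOfGroupMinimums(List<Flight> flights) {
        // One entry per group, so memory grows with the number of groups.
        Map<String, Double> minPerGroup = flights.stream()
            .collect(Collectors.toMap(f -> f.flightId, f -> f.duration, Double::min));
        return minPerGroup.values().stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(Double.NaN); // no groups at all
    }

    public static void main(String[] args) {
        List<Flight> flights = Arrays.asList(
            new Flight("A", 120), new Flight("A", 90),
            new Flight("B", 200), new Flight("B", 150),
            new Flight("C", 60));
        // Minimums are A=90, B=150, C=60.
        System.out.println(averageOfGroupMinimums(flights)); // prints 100.0
    }
}
```

This is the same value an avg_bucket over the per-bucket minimums is meant to produce, only computed after all buckets have been transferred to the client.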
I found that there is an avg bucket pipeline aggregation, which can be added as a sibling aggregation to terms (and others)
"the average": {
"avg_bucket": {
"buckets_path": "some_bucket_path"
}
}
The problem is that in the Java API you can add a pipeline aggregation only as a sub-aggregation. So if we construct our aggregation like this, the terms aggregation won't be visible to it
AggregationBuilders.terms("group_by_flights").field("flight_id")
.subAggregation(PipelineAggregatorBuilders.avgBucket("avg", "group_by_flights.duration")) // this path won't be seen, because the pipeline is a sub-aggregation of the terms aggregation
I was thinking about creating an empty top-level aggregation and then adding all the aggregations to it as sub-aggregations, but that seems like a silly workaround, and I suspect I'm misunderstanding something.
Any ideas?

The only solution I have found so far is to make the aggregations sub-aggregations of an "empty" global aggregation
AggregationBuilders.global("global_aggregation")
.subAggregation((AggregationBuilders.terms("group_by_flights").field("flight_id"))
.subAggregation(AggregationBuilders.min("min").field("duration")))
.subAggregation(PipelineAggregatorBuilders.avgBucket("avg_bucket_aggs","group_by_flights>min"))

My solution is to use a FilterAggregationBuilder, which can also filter the data. The first sub-aggregation builds the buckets, and the second sub-aggregation combines the bucket data.
AggregationBuilders.filter("global_aggregation", bool)
.subAggregation((AggregationBuilders.terms("group_by_flights").field("flight_id"))
.subAggregation(AggregationBuilders.min("min").field("duration")))
.subAggregation(PipelineAggregatorBuilders.avgBucket("avg_bucket_aggs", "group_by_flights>min"));

Related

How to filter the aggregation results in Kibana (elastic search)?

I want to filter the Elasticsearch aggregation results in Kibana (v6.2). For example, I want to show only the sums of hours that are greater than 100 (like the HAVING clause in SQL). I know that we can filter the results on other fields in the filter section, but I don't know how to apply a filter to aggregation functions. I tried to use post_filter in the filter section in Kibana, but it didn't work.
Any ideas?
You can augment the aggregation query in the advanced field.
Whatever you enter there is added to the request.
The other question is what to put into this field. You can check the script values for the sum aggregation.

Elasticsearch: group into buckets, reduce to one document per bucket, group these documents

I'm looking for a way to compute the bounce rate of web pages with Elasticsearch.
We collect data in the following simplified structure
{"id":"1", "timestamp":"2017-01-25:15:23", "sessionid":"s1", "page":"index"}
{"id":"2", "timestamp":"2017-01-25:15:24", "sessionid":"s1", "page":"checkout"}
{"id":"3", "timestamp":"2017-01-25:15:25", "sessionid":"s1", "page":"confirm"}
{"id":"4", "timestamp":"2017-01-25:15:26", "sessionid":"s2", "page":"index"}
{"id":"5", "timestamp":"2017-01-25:15:27", "sessionid":"s2", "page":"checkout"}
{"id":"6", "timestamp":"2017-01-25:15:26", "sessionid":"s3", "page":"product_a"}
{"id":"7", "timestamp":"2017-01-25:15:28", "sessionid":"s3", "page":"checkout"}
For this sample the result of the analysis should be:
2/3 of the users get lost at the checkout page.
1/3 of the users get lost at the confirm page
More formally, I'm looking for a generic way to implement the following algorithm in an Elasticsearch query:
1. group documents by a field
2. sort each group (bucket) by a second field and reduce it to the topmost document
3. group all these remaining documents by a third field
4. sort the groups by number of documents
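To make the four steps concrete, here is a minimal client-side sketch in plain Java, run against the sample documents from the question. It is not an Elasticsearch query; the `Hit` class and method names are hypothetical, and it assumes the timestamps are fixed-width strings so that comparing them as text matches chronological order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ExitPages {
    // Hypothetical stand-in for one page-view document.
    static final class Hit {
        final String timestamp, sessionId, page;
        Hit(String timestamp, String sessionId, String page) {
            this.timestamp = timestamp;
            this.sessionId = sessionId;
            this.page = page;
        }
    }

    // page -> number of sessions whose last hit was on that page, most frequent first.
    static Map<String, Long> exitPageCounts(List<Hit> hits) {
        // Steps 1 and 2: group by session, keep only the latest hit of each session.
        Map<String, Hit> lastHitPerSession = hits.stream().collect(Collectors.toMap(
            h -> h.sessionId, h -> h,
            (a, b) -> a.timestamp.compareTo(b.timestamp) >= 0 ? a : b));
        // Step 3: group the surviving hits by page and count them.
        Map<String, Long> counts = lastHitPerSession.values().stream()
            .collect(Collectors.groupingBy(h -> h.page, Collectors.counting()));
        // Step 4: order the groups by descending count.
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    static List<Hit> sampleHits() {
        return Arrays.asList(
            new Hit("2017-01-25:15:23", "s1", "index"),
            new Hit("2017-01-25:15:24", "s1", "checkout"),
            new Hit("2017-01-25:15:25", "s1", "confirm"),
            new Hit("2017-01-25:15:26", "s2", "index"),
            new Hit("2017-01-25:15:27", "s2", "checkout"),
            new Hit("2017-01-25:15:26", "s3", "product_a"),
            new Hit("2017-01-25:15:28", "s3", "checkout"));
    }

    public static void main(String[] args) {
        System.out.println(exitPageCounts(sampleHits())); // prints {checkout=2, confirm=1}
    }
}
```

The output matches the expected analysis: two of three sessions end on checkout and one ends on confirm.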
My first attempt was to solve this with a terms aggregation followed by a top_hits aggregation, and finally to use a terms_pipeline aggregation to group the pages.
(simplified aggregation structure)
aggs
  terms
    field: sessionid
    aggs
      top_hits
        sort: timestamp desc
        size: 1
  terms_pipeline
    bucket_path: terms>top_hits
    field: page
... but unfortunately there is no such thing as a terms_pipeline aggregation. My bad.
Any ideas for an alternative approach?
Maybe I misunderstood something, but if you want to know where your users are bouncing, since all pages are in a sequence, you could simply have a terms aggregation on the page field (to know which pages were visited) and a cardinality one on the sessionid field (to know how many unique sessions you have). In this case, cardinality(sessionid) would yield 3.
Then again, since all pages are in a sequence, I think you don't really need to know what happened within a given session.
In your example, from the terms(page) aggregation, you'd know that 3 users landed on the checkout page but only one went to the confirm one. Using the cardinality of the sessions, this implicitly means that 2 users (3 total sessions - 1 confirm page hit) bounced on the checkout page.

Elasticsearch subaggregation not working as expected

I am trying to perform an aggregation on a term and then a sub-aggregation on the result set to filter the results by a date range. But the sub-aggregation filter has no effect on the search response. The search response always returns all the documents without applying the filter.
For example:
TermsBuilder aggregationBuilders = AggregationBuilders.terms("form.id").field("form.id").size(0);
aggregationBuilders.subAggregation(AggregationBuilders.filter("indexDate").filter(QueryBuilders.rangeQuery("indexDate").lte(date)));
You need to use filter aggregations the other way around, i.e. as a top aggregation and then you add the terms aggregation as a sub-aggregation.
TermsBuilder formBuckets = AggregationBuilders.terms("form.id")
.field("form.id")
.size(0);
FilterAggregationBuilder dateFilter = AggregationBuilders.filter("indexDate")
.filter(QueryBuilders.rangeQuery("indexDate").lte(date))
.subAggregation(formBuckets);
I see in your other question, you have somehow "solved" this issue by moving the filter on indexDate to the query section. That will also work in your case.

Elasticsearch Two sibling aggregations cannot have the same name

I'm running Elastic 1.4.4 and have created some code with a range aggregation and a sub-aggregation (cardinality).
If I run my code with a matchall query and some filterquery to get a subset all works fine.
I see the range aggregation and the expected values for the sub-aggregation.
But as soon as I add a query things fall apart.
I run into an error: "Two sibling aggregations cannot have the same name".
This seems strange since I have the same aggregation/sub-aggregation defined before without any problem.
This is basically what I do:
SearchResponse response =
client.prepareSearch(esIndex)
.setQuery(query)
.setFrom(startAt)
.setSize(topSize)
.addAggregation(AggregationBuilders.cardinality(UNIQUE_IPS).field(Constants.MAIN_COLLAPSE_FIELD))
.addAggregation(collapse)
.addAggregation(filtersetbranche)
.addAggregation(Constants.FTE_RANGES_AGGS.subAggregation(AggregationBuilders.cardinality("ftecounts").field(Constants.MAIN_COLLAPSE_FIELD)))
.execute()
.actionGet();
I found the issue.
I added a subaggregation to the Constants.FTE_RANGES_AGGS dynamically.
So, the first time I call this piece of code all is fine.
But the next time I added the same subaggregation with the same name hence the error "Two sibling aggregations cannot have the same name".
So, I have to rewrite it to use a copy of the Constant or do it in a different way.
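The pitfall can be reproduced without Elasticsearch. The sketch below uses a hypothetical `RangeAgg` class as a stand-in for a mutable aggregation builder; it is not an Elasticsearch class, and it rejects the duplicate name eagerly, whereas the real error is reported when the request is processed. It contrasts mutating a shared constant on every call with building a fresh instance per request.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SharedBuilderPitfall {
    // Hypothetical stand-in for a mutable aggregation builder.
    static final class RangeAgg {
        final String name;
        final Set<String> subNames = new LinkedHashSet<>();

        RangeAgg(String name) {
            this.name = name;
        }

        // Mutates this builder and returns it, like a fluent builder would.
        RangeAgg subAggregation(String subName) {
            if (!subNames.add(subName)) {
                throw new IllegalArgumentException(
                    "Two sibling aggregations cannot have the same name: [" + subName + "]");
            }
            return this;
        }
    }

    // Shared constant: every call below mutates the same instance.
    static final RangeAgg SHARED = new RangeAgg("fte_ranges");

    // Works once, fails on the second call because "ftecounts" is already attached.
    static RangeAgg buildWithSharedConstant() {
        return SHARED.subAggregation("ftecounts");
    }

    // Safe: a new builder per request, so nothing accumulates between calls.
    static RangeAgg buildFresh() {
        return new RangeAgg("fte_ranges").subAggregation("ftecounts");
    }

    public static void main(String[] args) {
        buildFresh();
        buildFresh(); // fine, independent instances
        buildWithSharedConstant(); // first use of the shared constant is fine
        try {
            buildWithSharedConstant();
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Turning the constant into a factory method that returns a new builder each time is one way to "use a copy", assuming nothing else relies on the shared instance.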

RethinkDB: custom scoring (like Elasticsearch)

I recently discovered RethinkDB, and find its query language to be much simpler than Elasticsearch's. The only use case I haven't been able to find a solution for is specifying how to score results based on the document's fields, like you can do in Elasticsearch (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/script-score.html). Is there a way to score the query results in RethinkDB and return only the top-n results?
If you have a query like r.table('comments').filter(r.row('name').eq('tldr')), then you can do something like r.table('comments').filter(r.row('name').eq('tldr')).map({score: CALCULATE_SCORE(r.row), row: r.row}).orderBy(r.desc('score')).limit(n) to return the top n results (orderBy sorts ascending by default, so r.desc is needed to put the highest scores first). Note that this does work proportional to the number of results in the original query. If that's too expensive, you can do something similar with an index by writing r.table('comments').indexCreate('score', CALCULATE_SCORE(r.row)) and then writing r.table('comments').orderBy({index: r.desc('score')}).limit(n).
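The score, sort, limit pattern from the answer can be sketched in plain Java as follows. This is not RethinkDB code: `topN` is a hypothetical helper, the scoring function is supplied by the caller in place of CALCULATE_SCORE, and "top" is assumed to mean highest score first.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

public class TopNByScore {
    // Scores every row, sorts by descending score, and keeps the first n rows.
    // Like the non-indexed query, this does work proportional to the input size.
    static <T> List<T> topN(List<T> rows, ToDoubleFunction<T> score, int n) {
        return rows.stream()
            .sorted(Comparator.comparingDouble(score).reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> comments = Arrays.asList("a", "ccc", "bb");
        // Toy score: the length of the comment text.
        System.out.println(topN(comments, s -> s.length(), 2)); // prints [ccc, bb]
    }
}
```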
