ElasticSearch Java API sorting aggregation

I have a terms aggregation and need to sort the resulting buckets by another field (a date). Alternatively, I need to add two sub-aggregations: a max (with a top hit) and a min (with a top hit).
I couldn't find any API that allows me to do this.
I think I could add a max sub-aggregation with a top hit to the main terms aggregation, and create a second terms aggregation with a min and top hits sub-aggregation, but that would be a heavy job.
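One possible approach, sketched here as a raw search request body rather than through the Java client: a terms aggregation can be ordered by a single-value metric sub-aggregation, and the min, max and top_hits sub-aggregations can all sit under the same terms aggregation. The field names `category` and `date` and the aggregation names are assumptions made for illustration; the same structure would have to be expressed with the builders of whichever Java client version is in use.

```python
import json


def terms_sorted_by_date(group_field, date_field, size=10):
    """Build a search body whose terms buckets are ordered by each bucket's max date."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "groups": {
                "terms": {
                    "field": group_field,
                    "size": size,
                    # order buckets by the "latest_date" sub-aggregation below
                    "order": {"latest_date": "desc"},
                },
                "aggs": {
                    "latest_date": {"max": {"field": date_field}},
                    "earliest_date": {"min": {"field": date_field}},
                    # one document per bucket: the newest and the oldest
                    "latest_doc": {
                        "top_hits": {"size": 1, "sort": [{date_field: {"order": "desc"}}]}
                    },
                    "earliest_doc": {
                        "top_hits": {"size": 1, "sort": [{date_field: {"order": "asc"}}]}
                    },
                },
            }
        },
    }


# "category" and "date" are hypothetical field names.
body = terms_sorted_by_date("category", "date")
print(json.dumps(body, indent=2))
```

Because everything lives under one terms aggregation, the documents are bucketed once instead of running two separate terms aggregations.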

Related

Elasticsearch Stats and Sum Aggregation under the hood

I want to understand how Elasticsearch works under the hood for the stats aggregation and the sum aggregation.
My use case needs a date histogram aggregation as the primary aggregation and a sum or stats aggregation as the nested aggregation. I ran queries using both aggregations on the same amount of data in Kibana, and both took a similar time to execute. So, if there is no performance difference between the stats and sum aggregations, we might use the stats aggregation for all our use cases.
I couldn't find any detailed information about the internal workings of these aggregations. Please share any information on this, or point me to documentation describing how these aggregations work under the hood.
Elasticsearch version : 7.1
Thank You
When in doubt, go to the source.
If you look at the implementation of StatsAggregator.java and SumAggregator.java, you'll see that they are very similar.
SumAggregator only computes a sum, while StatsAggregator computes sum, min, max, count and avg. Even though the latter appears to do more work, it also iterates through the data only once, and the additional metrics it computes along the way are not computationally expensive.
So if you know you need just the sum, use SumAggregator, but if you also need either min, max, count or avg, then go for StatsAggregator instead, so you only iterate once through the data.
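To illustrate why the two cost about the same, here is a simplified sketch in plain Python. It is not the Elasticsearch implementation, only a model of the idea that both aggregators make a single pass over the values and the stats variant does a few extra cheap operations per value. The function names are made up for this example.

```python
def sum_only(values):
    """One pass over the data, tracking only the running sum."""
    total = 0.0
    for v in values:
        total += v
    return total


def stats(values):
    """One pass over the data, tracking sum, min, max and count; avg is derived."""
    count = 0
    total = 0.0
    lowest = float("inf")
    highest = float("-inf")
    for v in values:
        count += 1
        total += v
        lowest = min(lowest, v)
        highest = max(highest, v)
    if count == 0:
        return {"count": 0, "sum": 0.0, "min": None, "max": None, "avg": None}
    return {
        "count": count,
        "sum": total,
        "min": lowest,
        "max": highest,
        "avg": total / count,  # computed once at the end, not per value
    }


print(sum_only([3, 1, 2]))  # 6.0
print(stats([3, 1, 2]))
```

Both functions are O(n) in the number of values; the difference is a couple of comparisons per value.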

Kibana visualization without time aggregation

I have data with the following format:
{
timestamp: Date,
x: number
}
I want to display these values simply in a line, without any aggregation over x, but in Kibana it always requires me to select some kind of aggregation, like average.
It is possible to create the line chart you describe, but I'm afraid Kibana needs an aggregation in order to create a visualization.
Kibana bases its visualizations on buckets (Date, x-axis) and metrics (x, y-axis). Buckets are aggregations of documents over a specified search (almost 30 aggregation methods). Metrics are value(s) based on the documents contained in each bucket (almost 20 aggregation methods) (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html).
However, you could try to create buckets with 'date_histogram' using a time interval small enough that each bucket contains one document. Then, for the metric aggregation, you could select the min or max aggregation (note: this assumes that your timestamp is unique for each document).
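The request Kibana would issue for such a chart looks roughly like the sketch below. The field names `timestamp` and `x` come from the question; the one-second interval, the aggregation names and the `fixed_interval` parameter are assumptions (older Elasticsearch versions use `interval` instead of `fixed_interval`).

```python
import json


def per_document_line_request(timestamp_field="timestamp", value_field="x", interval="1s"):
    """Build a date_histogram with buckets small enough to hold one document each."""
    return {
        "size": 0,
        "aggs": {
            "points": {
                "date_histogram": {
                    "field": timestamp_field,
                    "fixed_interval": interval,  # assumed parameter name, see lead-in
                    "min_doc_count": 1,  # skip empty buckets between data points
                },
                # with one document per bucket, max (or min) returns that document's x
                "aggs": {"value": {"max": {"field": value_field}}},
            }
        },
    }


request = per_document_line_request()
print(json.dumps(request, indent=2))
```

If two documents ever fall into the same bucket, the chart shows only the larger x for that bucket, which is why the uniqueness assumption matters.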

Applying custom filters for aggregation in Elastic Search

I have more than 2 million documents, each with a price and a discount. I have to get the percentage of products with a 10%, 20%, 30%, ..., 90%, 100% discount (discounts are rounded off). It's not possible to fetch the data and aggregate it in the application layer, as that would take too much time. I am also afraid that it would create lag for other users, as the thread would be busy for a long time.
Is there any way I can create custom filters upon aggregation logic?
It doesn't look like you need a custom filter here. This functionality is a standard part of the histogram aggregation. You can also take a look at the range aggregation in case you need more flexible ranges.
If you need complete flexibility in how terms are calculated, you can also use a script in a terms aggregation to return the value you want to group your records by. However, with 2 million documents, it might be better to pre-calculate the discount before indexing the document, store this value as a separate field, and then use the histogram aggregation.
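A minimal sketch of the histogram approach, assuming the discount percentage is stored in a numeric field named `discount` (a hypothetical name) and an interval of 10. The second function turns bucket counts into percentages on the client side; the sample buckets are made-up illustration data, not real results.

```python
def discount_histogram_request(field="discount", interval=10):
    """Build a histogram aggregation that groups documents into discount steps."""
    return {
        "size": 0,
        "aggs": {"by_discount": {"histogram": {"field": field, "interval": interval}}},
    }


def bucket_percentages(buckets):
    """Convert histogram buckets (key, doc_count) into percentage of all documents."""
    total = sum(b["doc_count"] for b in buckets)
    if total == 0:
        return {}
    return {b["key"]: 100.0 * b["doc_count"] / total for b in buckets}


# Hypothetical buckets in the shape a histogram aggregation returns.
sample_buckets = [
    {"key": 10.0, "doc_count": 50},
    {"key": 20.0, "doc_count": 150},
]
print(bucket_percentages(sample_buckets))  # {10.0: 25.0, 20.0: 75.0}
```

Only the handful of bucket counts travels to the application, so the cost of the percentage step does not grow with the number of documents.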

How do I compute facets/aggregations for the top n documents, with pagination in Elasticsearch?

Suppose I have an index for cars on a dealer's car lot. Each document resembles the following:
{
color: 'red',
model_year: '2015',
date_added: '2015-07-20'
}
Suppose I have a million cars.
Suppose I want to present a view of the most recently added 1000 cars, along with facets over those 1000 cars.
I could just use from and size to paginate the results up to a fixed limit of 1000, but in doing so the totals and facets on model_year and color (i.e. aggregations) I get back from Elasticsearch aren't right--they're over the entire matched set.
How do I limit my search to the most recently added 1000 documents for pagination and aggregation?
As you probably saw in the documentation, the aggregations are performed on the scope of the query itself. If no query is given, the aggregations are performed on a match_all list of results. Even if you would use size at the query level, it will still not give you what you need because size is just a way of returning a set of documents from all the documents the query matched. Aggregations operate on what the query matches.
This feature request is not new; it has been asked for before.
In 1.7 there is no straightforward solution. Maybe you can use the limit filter or the terminate_after in-body request parameter, but neither returns documents chosen after sorting. They give you the first terminate_after documents that matched the query, and this number is per shard. The cut-off is not applied after the sorting.
In ES 2.0 there is also the sampler aggregation, which works more or less the same way as terminate_after, except that it uses the score of the documents to decide which ones to consider from each shard. If you just sort by date_added and the query is a match_all, all the documents will have the same score and it will return an irrelevant set of documents.
In conclusion:
There is no good solution for this, only workarounds based on a number of documents per shard. So, if you want 1000 cars, take this number, divide it by the number of primary shards, and use the result in a sampler aggregation or with terminate_after to get a set of documents.
My suggestion is to use a query to limit the number of documents (cars) by a different criterion instead. For example, show (and aggregate) the cars added in the last 30 days or something similar. That is, the criterion should be included in the query itself, so that the resulting set of documents is the one you want aggregated. Applying aggregations to a certain number of documents, after they have been sorted, is not easy.
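The suggestion to put the limiting criterion into the query can be sketched as the request body below. The fields `date_added`, `color` and `model_year` come from the question; the 30-day window, the page size and the aggregation names are assumptions for illustration.

```python
import json


def recent_cars_request(days=30, page=0, page_size=20):
    """Restrict the query to recently added cars so hits and aggregations share one scope."""
    return {
        # the date range is part of the query, so aggregations see only these cars
        "query": {"range": {"date_added": {"gte": f"now-{days}d/d"}}},
        "sort": [{"date_added": {"order": "desc"}}],
        # from/size only page through the hits; they do not affect the aggregations
        "from": page * page_size,
        "size": page_size,
        "aggs": {
            "colors": {"terms": {"field": "color"}},
            "model_years": {"terms": {"field": "model_year"}},
        },
    }


request = recent_cars_request(days=30, page=2, page_size=20)
print(json.dumps(request, indent=2))
```

The facet counts then describe exactly the set being paginated, though that set is defined by a date window rather than by a fixed count of 1000.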

Limiting aggregation to the top X hits in elasticsearch

ElasticSearch builds the aggregation results based on all the hits of the query, independently of the from and size parameters. This is what we want in most cases, but I have a particular case in which I need to limit the aggregation to the top N hits. The limit filter is not suitable, as it does not fetch the best N items but only the first X matching the query (per shard), independently of their score.
Is there any way to build a query whose hit count has an upper limit N in order to be able to build an aggregation limited to those top N results? And if so how?
Subsidiary question: Limiting the score of matching documents could be an alternative even though in my case I would require a fixed bound. Does the min_score parameter affect aggregation?
You are looking for Sampler Aggregation.
I have a similar answer explained here
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
If you are using an ElasticSearch cluster with version > 1.3, you can use the top_hits aggregation by nesting it in your aggregation, ordering on the field you want and setting the size parameter to X.
The related documentation can be found here.
I need to limit the aggregation to the top N hits
With nested aggregations, your top bucket can represent those N hits, with nested aggregations operating on that bucket. I would try a filter aggregation for the top level aggregation.
The tricky part is to make use of the _score somehow in the filter and to limit it to exactly N entries... There is a limit filter that works per shard, but I don't think it would work in this context.
It looks like Sampler Aggregation can now be used for this purpose. Note that it is only available as of Elastic 2.0.
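A sketch of how a sampler aggregation wraps another aggregation, written as a raw request body. The `shard_size` setting caps the number of top-scoring documents collected on each shard, so the limit is per shard rather than an exact global N. The query, the field name `tags` and the aggregation names are assumptions made for this example.

```python
import json


def sampled_terms_request(query, field, shard_size):
    """Run a terms aggregation over only the top-scoring documents of each shard."""
    return {
        "query": query,
        "size": 0,
        "aggs": {
            "top_docs": {
                # per-shard cap on the best-scoring documents passed to sub-aggregations
                "sampler": {"shard_size": shard_size},
                "aggs": {"values": {"terms": {"field": field}}},
            }
        },
    }


# Hypothetical query and field, for illustration only.
request = sampled_terms_request({"match": {"title": "elasticsearch"}}, "tags", 200)
print(json.dumps(request, indent=2))
```

Since the sample is chosen by score, this only helps when the query produces meaningful scores; with a match_all query every document scores the same.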
