Elasticsearch 5 (Searchkick) Aggregation Bucket Averages

We have an ES index holding scores given for different products. What we're trying to do is aggregate on product names and then get the average scores for each of product name 'buckets'. Currently the default aggregation functionality only gives us the counts for each bucket - is it possible to extend this to giving us average score per product name?
We've looked at pipeline aggregations but the documentation is pretty dense and doesn't seem to quite match what we're trying to do.
Here's where we've got to:
{
  "aggs" => {
    "prods" => {
      "terms" => {
        "field" => "product_name"
      },
      "aggs" => {
        "avgscore" => {
          "avg" => {
            "field" => "score"
          }
        }
      }
    }
  }
}
Either this is wrong, or is there something in how Searchkick compiles its ES queries that is breaking things?
Thanks!

I think this is the pipeline aggregation you want:
POST /_search
{
  "size": 0,
  "aggs": {
    "product_count": {
      "terms": {
        "field": "product"
      },
      "aggs": {
        "total_score": {
          "sum": {
            "field": "score"
          }
        }
      }
    },
    "avg_score": {
      "avg_bucket": {
        "buckets_path": "product_count>total_score"
      }
    }
  }
}
Hopefully I have that the right way round, if not - switch the first two buckets.
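Note that the nested avg sub-aggregation from the question already returns a per-bucket average, while avg_bucket yields a single overall number. Below is a minimal sketch of pulling the per-product averages out of a response shaped like the nested version returns; the sample response is hypothetical, not real index data.

```python
# Sketch: map each terms-bucket key to its avg sub-aggregation value.
# The sample response below is made up for illustration.

def bucket_averages(response):
    """Return {bucket key: average score} from the nested-avg response."""
    buckets = response["aggregations"]["prods"]["buckets"]
    return {b["key"]: b["avgscore"]["value"] for b in buckets}

sample = {
    "aggregations": {
        "prods": {
            "buckets": [
                {"key": "widget", "doc_count": 3, "avgscore": {"value": 4.5}},
                {"key": "gadget", "doc_count": 2, "avgscore": {"value": 3.0}},
            ]
        }
    }
}

result = bucket_averages(sample)
print(result)  # {'widget': 4.5, 'gadget': 3.0}
```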

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic; it contains a field filename and a date when to indicate the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads - it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregations don't allow sorting on the value field.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations. Part of the tradeoff is that you lose things like ordering by doc count, since that isn't known until after all the docs have been collected.
I have no experience with Transforms (part of X-Pack and licensed), but you can try that out. Apart from this, I don't see a way to get the expected output.
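For the pagination itself, here is a minimal sketch of walking all composite pages with after_key. The `search` callable stands in for whatever sends the body to Elasticsearch and returns the parsed JSON; the canned pages below are an assumption for illustration.

```python
# Sketch: loop over composite-aggregation pages, feeding after_key back
# into the request body until no after_key is returned.

def iter_composite_buckets(search, page_size=100):
    body = {
        "size": 0,
        "query": {"range": {"when": {"gte": "now-3M"}}},
        "aggs": {
            "downloads_agg": {
                "composite": {
                    "size": page_size,
                    "sources": [
                        {"downloads": {"terms": {"field": "filename.keyword"}}}
                    ],
                }
            }
        },
    }
    while True:
        agg = search(body)["aggregations"]["downloads_agg"]
        yield from agg["buckets"]
        if "after_key" not in agg:
            return
        body["aggs"]["downloads_agg"]["composite"]["after"] = agg["after_key"]

# Fake two-page response, standing in for a real client call.
pages = iter([
    {"aggregations": {"downloads_agg": {
        "after_key": {"downloads": "b.txt"},
        "buckets": [{"key": {"downloads": "a.txt"}, "doc_count": 2},
                    {"key": {"downloads": "b.txt"}, "doc_count": 7}]}}},
    {"aggregations": {"downloads_agg": {
        "buckets": [{"key": {"downloads": "c.txt"}, "doc_count": 1}]}}},
])
files = [b["key"]["downloads"]
         for b in iter_composite_buckets(lambda body: next(pages), page_size=2)]
print(files)  # ['a.txt', 'b.txt', 'c.txt']
```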

Get max bucket of terms aggregation (with pipeline aggregation)

I was wondering how to get the bucket with the highest doc_count when using a terms aggregation with Elasticsearch. I'm using the Kibana sample data kibana_sample_data_flights:
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "destinations": {
      "terms": {
        "field": "DestCityName"
      }
    }
  }
}
If there was a single bucket with the max doc_count value, I could set the size of the terms aggregation to 1; however, this doesn't work if there are two buckets with the same max doc_count value.
Since I came across pipeline aggregations, I feel there should be an easy way to achieve this. The max bucket aggregation seems to be able to deal with multiple max buckets, since the guide says this:
[...] which identifies the bucket(s) with the maximum value of [...]
However, the only way I could make this work was a work-around using a value_count sub-aggregation:
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "destinations": {
      "terms": {
        "field": "DestCityName"
      },
      "aggs": {
        "counter": {
          "value_count": {
            "field": "_id"
          }
        }
      }
    },
    "max_destination": {
      "max_bucket": {
        "buckets_path": "destinations>counter"
      }
    }
  }
}
a) Is there a better way in general, to find the terms bucket with the max value?
b) Is there a better way using pipeline aggregations?
Thanks in advance!
You can simplify as below; you don't need the value_count aggregation.
Unfortunately, though, max_bucket is the only way to get what you are looking for.
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "destinations": {
      "terms": {
        "field": "DestCityName"
      }
    },
    "max_destination": {
      "max_bucket": {
        "buckets_path": "destinations>_count"
      }
    }
  }
}
Note the use of the built-in _count path in buckets_path, which refers to each bucket's doc_count.
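Since max_bucket returns a list of keys, ties come back naturally. A minimal sketch of reading that part of the response (the sample response is hypothetical):

```python
# Sketch: the max_bucket result exposes "keys" as a list, so every
# bucket tied for the maximum is preserved.

def max_destination(response):
    agg = response["aggregations"]["max_destination"]
    return agg["keys"], agg["value"]

sample = {
    "aggregations": {
        "max_destination": {"value": 691.0, "keys": ["Zurich", "Vienna"]}
    }
}

keys, value = max_destination(sample)
print(keys, value)  # ['Zurich', 'Vienna'] 691.0
```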
Hope this helps!

Finding unique documents in an index in elastic search

I have duplicate entries in my index and I want to find only the unique documents in the index. A top_hits aggregation solves this problem, but my other requirement is to support sorting on the results (across buckets), so I can't use top_hits.
The other options I can think of are writing a plugin or using a Painless script.
I need help solving this; it would be great if you could point me to some examples.
The top_hits aggregation finds values from the complete result set, while cardinality gives you only the filtered result set.
You can use cardinality aggregation like below:
{
  "aggs": {
    "UNIQUE_COUNT": {
      "cardinality": {
        "field": "your_field"
      }
    }
  }
}
This aggregation comes with a caveat (the count is approximate); see the Elasticsearch documentation below to understand it better.
Link: Cardinality Aggregation
For sorting, you can refer to the example below, where the terms buckets are ordered by the cardinality sub-aggregation:
{
  "aggs": {
    "AGG_NAME": {
      "terms": {
        "field": "your_field",
        "size": 10,
        "order": {
          "UNIQUE_COUNT": "asc"
        },
        "min_doc_count": 1
      },
      "aggs": {
        "UNIQUE_COUNT": {
          "cardinality": {
            "field": "your_field"
          }
        }
      }
    }
  }
}
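A minimal sketch of reading each bucket's cardinality value from the ordered response; the sample response below is hypothetical.

```python
# Sketch: map each terms-bucket key to its cardinality (unique count).

def unique_counts(response):
    return {b["key"]: b["UNIQUE_COUNT"]["value"]
            for b in response["aggregations"]["AGG_NAME"]["buckets"]}

sample = {
    "aggregations": {
        "AGG_NAME": {
            "buckets": [
                {"key": "x", "doc_count": 4, "UNIQUE_COUNT": {"value": 1}},
                {"key": "y", "doc_count": 9, "UNIQUE_COUNT": {"value": 3}},
            ]
        }
    }
}

counts = unique_counts(sample)
print(counts)  # {'x': 1, 'y': 3}
```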

Restructuring Elasticsearch model for fast aggregations

My business domain is real estate listings, and I'm trying to build a faceted UI. So I need to do aggregations to know how many listings have 1 bed, 2 beds, how many are in this price range, how many have a pool, etc. Pretty standard stuff.
Currently my model is like this:
{
  "beds": 1,
  "baths": 1,
  "price": 100000,
  "features": ["pool", "aircon"],
  "inspections": [{
    "startsOn": "2019-01-20"
  }]
}
To build my faceted UI, I'm doing multiple aggregations, e.g.:
{
  "aggs": {
    "beds": {
      "terms": { "field": "beds" }
    },
    "baths": {
      "terms": { "field": "baths" }
    },
    "features": {
      "terms": { "field": "features" }
    }
  }
}
You get the idea. If I've got 10 fields, I'm doing 10 aggregations.
But after seeing this article, I'm thinking I should just re-structure my model like this:
{
  "beds": 1,
  "baths": 1,
  "price": 100000,
  "features": ["pool", "aircon"],
  "attributes": ["bed_1", "bath_1", "price_100000-200000", "has_pool", "has_aircon", "has_inspection_tomorrow"]
}
Then I only need the one agg:
{
  "aggs": {
    "attributes": {
      "terms": {
        "field": "attributes"
      }
    }
  }
}
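The "logic moved to the client" can be sketched like this, using a hypothetical 100000-wide price band; the bucketing rules here are assumptions, not anything Elasticsearch provides.

```python
# Sketch: encode a listing into the flat "attributes" strings the
# restructured model uses. The price-band width is an assumption.

def encode_attributes(listing):
    attrs = [f"bed_{listing['beds']}", f"bath_{listing['baths']}"]
    low = listing["price"] // 100000 * 100000
    attrs.append(f"price_{low}-{low + 100000}")
    attrs += [f"has_{feature}" for feature in listing["features"]]
    return attrs

listing = {"beds": 1, "baths": 1, "price": 100000, "features": ["pool", "aircon"]}
encoded = encode_attributes(listing)
print(encoded)
# ['bed_1', 'bath_1', 'price_100000-200000', 'has_pool', 'has_aircon']
```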
So I've got a couple of questions.
Is the only drawback of this approach that logic is moved to the client? If so, I'm happy with that - for performance, since I don't see this logic changing very often.
Can I leverage this field in my queries too? For example, what if I wanted to match all documents with 1 bedroom and price = 100000 and with a pool, etc.? Terms queries work on an 'any' match, but how can I find documents where the array of values contains all the provided terms?
Alternatively, if you can think of a better structure for modelling for search speed, please let me know!
Thanks
For the second point, you can use the terms set query (doc here).
This query is like a terms query, but you have control over how many terms must match.
You can configure it through a script like this:
GET /my-index/_search
{
  "query": {
    "terms_set": {
      "attributes": {
        "terms": ["bed_1", "bath_1", "price_100000-200000"],
        "minimum_should_match_script": {
          "source": "params.num_terms"
        }
      }
    }
  }
}
The params.num_terms script requires all the provided terms to match.
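A minimal sketch of building that terms_set body programmatically; the helper name is made up here, and the field/term values come from the question's model.

```python
# Sketch: build a terms_set query that requires every provided term to
# match, via the built-in params.num_terms script variable.

def all_terms_query(field, terms):
    return {
        "query": {
            "terms_set": {
                field: {
                    "terms": terms,
                    "minimum_should_match_script": {"source": "params.num_terms"},
                }
            }
        }
    }

q = all_terms_query("attributes", ["bed_1", "bath_1", "has_pool"])
print(q["query"]["terms_set"]["attributes"]["terms"])
# ['bed_1', 'bath_1', 'has_pool']
```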

instruct elasticsearch to return random results from different types

I have an index in ES with, say, 3 types A, B, C. Each type holds 1000 products. When the user makes a query with no scoring, ES returns first all results from A, then all from B, and then all from C.
What I need is to present mixed results from the 3 types.
I looked into random scoring, but it's not quite what I need.
Any ideas?
Do you really need randomness, or simply 3 results per type? Three results from each type can be achieved with the top hits aggregation: first aggregate on the _type field, then apply the top hits aggregation:
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "random_score": {
        "seed": 137677928418000
      }
    }
  },
  "aggs": {
    "all_type": {
      "terms": {
        "field": "_type"
      },
      "aggs": {
        "by_top_hit": {
          "top_hits": {
            "size": 3
          }
        }
      }
    }
  }
}
Edit: I added random scoring to get random results. I think getting a specific number of documents for each _type is difficult; a workaround is probably to request just enough hits from every _type.
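To actually present the types mixed rather than grouped, the per-type top_hits buckets can be interleaved round-robin on the client; a minimal sketch, with a hypothetical sample response:

```python
# Sketch: round-robin the hits of each _type bucket so the final list
# alternates types instead of returning them block by block.
from itertools import zip_longest

def interleave_hits(response):
    hit_lists = [b["by_top_hit"]["hits"]["hits"]
                 for b in response["aggregations"]["all_type"]["buckets"]]
    return [h for group in zip_longest(*hit_lists) for h in group if h is not None]

sample = {
    "aggregations": {
        "all_type": {
            "buckets": [
                {"key": "A", "by_top_hit": {"hits": {"hits": [{"_id": "a1"}, {"_id": "a2"}]}}},
                {"key": "B", "by_top_hit": {"hits": {"hits": [{"_id": "b1"}]}}},
                {"key": "C", "by_top_hit": {"hits": {"hits": [{"_id": "c1"}, {"_id": "c2"}]}}},
            ]
        }
    }
}

mixed = [h["_id"] for h in interleave_hits(sample)]
print(mixed)  # ['a1', 'b1', 'c1', 'a2', 'c2']
```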
