Elasticsearch: sort aggregation by selected field

How can I sort the output from an aggregation by a field that is in the source data, but not part of the output of the aggregation?
In my source data I have a date field, and I would like the output of the aggregation to be sorted by that date.
Is that possible? I've looked at using "order" within the aggregation, but it doesn't seem to be able to see the date field to use it for sorting.
I've also tried adding a sub-aggregation which includes the date field, but again, I cannot get it to sort on this field.
I'm calculating a hash for each document in my ETL on the way into Elasticsearch. My data set contains a lot of duplication, so I'm trying to use the aggregation on the hash field to filter out duplicates, and that works fine. I need the output from the aggregation to retain a date sort order so that I can work with the output in Angular.
The documents are like this:
{
  "_id": 123,
  "_source": {
    "hash": "01010101010101",
    "user": "1",
    "dateTime": "2001/2/20 09:12:21",
    "action": "Login"
  }
},
{
  "_id": 124,
  "_source": {
    "hash": "01010101010101",
    "user": "1",
    "dateTime": "2001/2/20 09:12:21",
    "action": "Login"
  }
},
{
  "_id": 132,
  "_source": {
    "hash": "0202020202020",
    "user": "1",
    "dateTime": "2001/2/20 09:20:43",
    "action": "Logout"
  }
},
{
  "_id": 200,
  "_source": {
    "hash": "0303030303030303",
    "user": "2",
    "dateTime": "2001/2/22 09:32:14",
    "action": "Login"
  }
}
So I want to use an aggregation on the hash value to remove duplicates from my set and then render the response in date order.
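The intended result (one document per hash, ordered by date) can be sketched client-side as well. A minimal Python sketch over the sample documents above; the field names come from the documents, while the dedupe/sort logic is my reading of the intended behaviour, not something Elasticsearch produces here:

```python
from datetime import datetime

# Sample documents, as in the question (duplicates share a hash).
docs = [
    {"hash": "01010101010101", "user": "1", "dateTime": "2001/2/20 09:12:21", "action": "Login"},
    {"hash": "01010101010101", "user": "1", "dateTime": "2001/2/20 09:12:21", "action": "Login"},
    {"hash": "0202020202020", "user": "1", "dateTime": "2001/2/20 09:20:43", "action": "Logout"},
    {"hash": "0303030303030303", "user": "2", "dateTime": "2001/2/22 09:32:14", "action": "Login"},
]

def dedupe_and_sort(docs):
    # Keep the first document seen per hash (duplicates are identical anyway)...
    unique = {}
    for d in docs:
        unique.setdefault(d["hash"], d)
    # ...then order the survivors by their dateTime field.
    return sorted(unique.values(),
                  key=lambda d: datetime.strptime(d["dateTime"], "%Y/%m/%d %H:%M:%S"))

result = dedupe_and_sort(docs)
```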
My query:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "action": "Login" } }
          ]
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "md5": {
      "terms": {
        "field": "hash",
        "size": 0
      },
      "aggs": {
        "byDate": {
          "terms": {
            "field": "dateTime",
            "size": 0
          }
        }
      }
    }
  }
}
Currently the output is ordered on the hash and I need it ordered on the date field within each hash bucket. Is that possible?

If the aggregation on "hash" is just for removing duplicates, it might work for you to simply aggregate on "dateTime" first, followed by the terms aggregation on "hash". For example:
GET my_index/test/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "action": "Login" } }
          ]
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "byDate": {
      "terms": {
        "field": "dateTime",
        "order": { "_term": "asc" }   <---- EDIT: must specify order here
      },
      "aggs": {
        "byHash": {
          "terms": {
            "field": "hash"
          }
        }
      }
    }
  }
}
This way, your results would be sorted by "dateTime" first.
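Walking that response client-side then yields one hash per date bucket, already in date order. A Python sketch; the response shape below follows the standard terms-aggregation bucket format, trimmed to the fields that matter, and the variable names are illustrative:

```python
# A trimmed Elasticsearch-style response for the byDate -> byHash aggregation.
response = {
    "aggregations": {
        "byDate": {
            "buckets": [
                {"key_as_string": "2001/2/20 09:12:21",
                 "byHash": {"buckets": [{"key": "01010101010101", "doc_count": 2}]}},
                {"key_as_string": "2001/2/20 09:20:43",
                 "byHash": {"buckets": [{"key": "0202020202020", "doc_count": 1}]}},
            ]
        }
    }
}

def hashes_in_date_order(response):
    # Outer buckets arrive date-sorted, so a simple walk preserves the order.
    out = []
    for date_bucket in response["aggregations"]["byDate"]["buckets"]:
        for hash_bucket in date_bucket["byHash"]["buckets"]:
            out.append((date_bucket["key_as_string"], hash_bucket["key"]))
    return out

pairs = hashes_in_date_order(response)
```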

Related

Elasticsearch : How to do 'group by' with painless in scripted fields?

I would like to do something like the following using painless:
SELECT day, SUM(price) / SUM(quantity) AS ratio
FROM data
GROUP BY day
Is it possible?
I want to do this in order to visualize the ratio field in Kibana, since Kibana itself doesn't have the ability to divide aggregated values, but I would gladly listen to alternative solutions beyond scripted fields.
Yes, it's possible, you can achieve this with the bucket_script pipeline aggregation:
{
  "aggs": {
    "days": {
      "date_histogram": {
        "field": "dateField",
        "interval": "day"
      },
      "aggs": {
        "price": {
          "sum": {
            "field": "price"
          }
        },
        "quantity": {
          "sum": {
            "field": "quantity"
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "sumPrice": "price",
              "sumQuantity": "quantity"
            },
            "script": "params.sumPrice / params.sumQuantity"
          }
        }
      }
    }
  }
}
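The arithmetic the bucket_script performs per bucket is simply the ratio of the two sibling sums. As a sanity check, the same computation in Python (the rows below are illustrative data, not from the question):

```python
from collections import defaultdict

# Illustrative rows: (day, price, quantity).
rows = [
    ("2020-02-01", 600.0, 8),
    ("2020-02-01", 400.0, 4),
    ("2020-02-02", 300.0, 6),
]

def ratio_per_day(rows):
    # Per-day running sums of price and quantity (the two sum aggregations).
    sums = defaultdict(lambda: [0.0, 0.0])
    for day, price, quantity in rows:
        sums[day][0] += price
        sums[day][1] += quantity
    # Mirrors "params.sumPrice / params.sumQuantity" in the bucket_script.
    return {day: p / q for day, (p, q) in sums.items()}

ratios = ratio_per_day(rows)
```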
UPDATE:
You can use the above query through the Transform API, which will create an aggregated index out of the source index.
For instance, I indexed a few documents in a test index; we can then dry-run the above aggregation query to see what the target aggregated index would look like:
POST _transform/_preview
{
  "source": {
    "index": "test2",
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "transtest"
  },
  "pivot": {
    "group_by": {
      "days": {
        "date_histogram": {
          "field": "#timestamp",
          "calendar_interval": "day"
        }
      }
    },
    "aggregations": {
      "price": {
        "sum": {
          "field": "price"
        }
      },
      "quantity": {
        "sum": {
          "field": "quantity"
        }
      },
      "ratio": {
        "bucket_script": {
          "buckets_path": {
            "sumPrice": "price",
            "sumQuantity": "quantity"
          },
          "script": "params.sumPrice / params.sumQuantity"
        }
      }
    }
  }
}
The response looks like this:
{
  "preview" : [
    {
      "quantity" : 12.0,
      "price" : 1000.0,
      "days" : 1580515200000,
      "ratio" : 83.33333333333333
    }
  ],
  "mappings" : {
    "properties" : {
      "quantity" : { "type" : "double" },
      "price" : { "type" : "double" },
      "days" : { "type" : "date" }
    }
  }
}
What you see in the preview array are the documents that are going to be indexed in the transtest target index, which you can then visualize in Kibana like any other index.
So what a transform actually does is run the aggregation query I gave you above and then store each bucket as a document in another index that can be used.
I found a solution to get the ratio of sums with a TSVB visualization in Kibana.
You may see the image here for an example.
First, you have to create two sum aggregations, one that sums price and another that sums quantity. Then you choose the 'Bucket Script' aggregation to divide the aforementioned sums, using a Painless script.
The only drawback I found is that you cannot aggregate on multiple columns.

Elasticsearch: Aggregate all unique values of a field and apply a condition or filter by another field

My documents look like this:
{
  "ownID": "Val_123",
  "parentID": "Val_456",
  "someField": "Val_78",
  "otherField": "Val_90",
  ...
}
I am trying to get all (unique, as in one instance) results for a list of ownID values, while filtering by a list of parentID values and vice-versa.
What I did so far is:
Get (separate!) unique values for ownID and parentID in key1 and key2
{
  "size": 0,
  "aggs": {
    "key1": {
      "terms": {
        "field": "ownID",
        "include": {
          "partition": 0,
          "num_partitions": 10
        },
        "size": 100
      }
    },
    "key2": {
      "terms": {
        "field": "parentID",
        "include": {
          "partition": 0,
          "num_partitions": 10
        },
        "size": 100
      }
    }
  }
}
Use filter to get (some) results matching either ownID OR parentID
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "terms": { "ownID": ["Val_1", "Val_2", "Val_3"] } },
        { "terms": { "parentID": ["Val_8", "Val_9"] } }
      ]
    }
  },
  "aggs": {
    "my_filter": {
      "top_hits": {
        "size": 30000,
        "_source": {
          "include": ["ownID", "parentID", "otherField"]
        }
      }
    }
  }
}
However, I need to get separate results for each filter in the second query, and get:
(1) the parentID of the documents matching some value of ownID
(2) the ownID for the documents matching some value of parentID.
So far I managed to do it using two similar queries (see below for (1)), but I would ideally want to combine them and query only once.
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "terms": { "ownID": ["Val_1", "Val_2", "Val_3"] } }
      ]
    }
  },
  "aggs": {
    "my_filter": {
      "top_hits": {
        "size": 30000,
        "_source": {
          "include": "parentID"
        }
      }
    }
  }
}
I'm using Elasticsearch version 5.2
If I understood your question correctly, you need the aggregation counts computed irrespective of the filter query, but in the search hits you want only the filtered documents. For this, Elasticsearch has another type of filter, the "post filter"; refer to this: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-post-filter.html
It's really simple: it just filters the results after the aggregations have been computed.
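The post_filter semantics can be mocked up in plain Python: aggregations are computed over every document matched by the query, while the filter is applied only to the hits that are returned. A minimal sketch with illustrative documents; the field names follow the question, and the toy `search` function stands in for Elasticsearch:

```python
def search(docs, post_filter):
    # Aggregations see all query matches (here, a terms-style count on parentID)...
    agg_counts = {}
    for d in docs:
        agg_counts[d["parentID"]] = agg_counts.get(d["parentID"], 0) + 1
    # ...while the returned hits are filtered only afterwards.
    hits = [d for d in docs if post_filter(d)]
    return {"aggregations": agg_counts, "hits": hits}

docs = [
    {"ownID": "Val_1", "parentID": "Val_8"},
    {"ownID": "Val_2", "parentID": "Val_8"},
    {"ownID": "Val_3", "parentID": "Val_9"},
]

result = search(docs, post_filter=lambda d: d["ownID"] == "Val_1")
```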

Sort multi-bucket aggregation by source fields inside inner multi-bucket aggregation

TL;DR: Using an inner multi-bucket aggregation (top_hits with size: 1) inside an outer multi-bucket aggregation, is it possible to sort the buckets of the outer aggregation by the data in the inner buckets?
I have the following index mappings
{
  "parent": {
    "properties": {
      "children": {
        "type": "nested",
        "properties": {
          "child_id": { "type": "keyword" }
        }
      }
    }
  }
}
and each child (in data) has also the properties last_modified: Date and other_property: String.
I need to fetch a list of children (of all the parents but without the parents), but only the one with the latest last_modified per each child_id. Then I need to sort and paginate those results to return manageable amounts of data.
I'm able to get the data and paginate over it with a combination of nested, terms, top_hits, and bucket_sort aggregations (and also get the total count with cardinality)
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "children": {
      "nested": {
        "path": "children"
      },
      "aggs": {
        "totalCount": {
          "cardinality": {
            "field": "children.child_id"
          }
        },
        "oneChildPerId": {
          "terms": {
            "field": "children.child_id",
            "order": { "_term": "asc" },
            "size": 1000000
          },
          "aggs": {
            "lastModified": {
              "top_hits": {
                "_source": [ "children.other_property" ],
                "sort": {
                  "children.last_modified": { "order": "desc" }
                },
                "size": 1
              }
            },
            "paginate": {
              "bucket_sort": {
                "from": 36,
                "size": 3
              }
            }
          }
        }
      }
    }
  }
}
but after more than a solid day of going through the docs and experimenting, I seem to be no closer to figuring out how to sort the buckets of my oneChildPerId aggregation by the other_property of the single child retrieved by the lastModified aggregation.
Is there a way to sort a multi-bucket aggregation by results in a nested multi-bucket aggregation?
What I've tried:
I thought I could use bucket_sort for that too, but apparently its sort can only be used with paths containing other single-bucket aggregations and ending in a metric one.
I've tried to find a way to somehow transform the 1-result multi-bucket of lastModified into a single-bucket, but haven't found any.
I'm using ElasticSearch 6.8.6 (the bucket_sort and similar tools weren't available in ES 5.x and older).
I had the same problem. I needed a terms aggregation with a nested top_hits, and want to sort by a specific field inside the nested aggregation.
Not sure how performant my solution is, but the desired behaviour can be achieved with a single-value metric aggregation on the same level as the top_hits. Then you can sort by this new aggregation in the terms aggregation with the order field.
Here an example:
POST books/_doc
{ "genre": "action", "title": "bookA", "pages": 200 }
POST books/_doc
{ "genre": "action", "title": "bookB", "pages": 35 }
POST books/_doc
{ "genre": "action", "title": "bookC", "pages": 170 }
POST books/_doc
{ "genre": "comedy", "title": "bookD", "pages": 80 }
POST books/_doc
{ "genre": "comedy", "title": "bookE", "pages": 90 }
GET books/_search
{
  "size": 0,
  "aggs": {
    "by_genre": {
      "terms": {
        "field": "genre.keyword",
        "order": { "max_pages": "asc" }
      },
      "aggs": {
        "top_book": {
          "top_hits": {
            "size": 1,
            "sort": [{ "pages": { "order": "desc" } }]
          }
        },
        "max_pages": { "max": { "field": "pages" } }
      }
    }
  }
}
by_genre has the order field, which sorts by a sub-aggregation called max_pages. max_pages has only been added for this purpose: it creates a single-value metric that order can sort by.
The query above returns (I've shortened the output for clarity):
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
If you change "order": {"max_pages": "asc"} to "order": {"max_pages": "desc"}, the output becomes:
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
The type of the max_pages aggregation can be changed as needed, as long as it is a single-value metric aggregation (e.g. sum, avg, etc.)
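The ordering trick can be checked client-side: sort the genre buckets by their max-pages metric, then take each bucket's top hit. A Python sketch over the sample books above; the helper name and its shape are illustrative:

```python
books = [
    {"genre": "action", "title": "bookA", "pages": 200},
    {"genre": "action", "title": "bookB", "pages": 35},
    {"genre": "action", "title": "bookC", "pages": 170},
    {"genre": "comedy", "title": "bookD", "pages": 80},
    {"genre": "comedy", "title": "bookE", "pages": 90},
]

def top_book_per_genre(books, descending=False):
    # The terms aggregation: one bucket per genre.
    buckets = {}
    for b in books:
        buckets.setdefault(b["genre"], []).append(b)
    # max_pages plays the role of the single-value metric used by "order".
    ordered = sorted(buckets.values(),
                     key=lambda bs: max(b["pages"] for b in bs),
                     reverse=descending)
    # top_hits with size 1, sorted by pages descending.
    return [max(bs, key=lambda b: b["pages"]) for bs in ordered]

asc = top_book_per_genre(books)
desc = top_book_per_genre(books, descending=True)
```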

Faceted search in webshop with Elastic

I've seen a few examples of faceted search in Elastic, but all of them assume you know in advance which fields you want to create buckets on.
How should I work when I have a webshop with multiple categories, where the properties of the items are different in every category?
Is there a way to describe what properties your documents have when you run a query (e.g. filter by category)?
I have this query right now:
{
  "from": 0, "size": 10,
  "query": {
    "bool": {
      "must": [
        { "terms": { "color": ["red", "green", "purple"] } },
        { "terms": { "make": ["honda", "toyota", "bmw"] } }
      ]
    }
  },
  "aggregations": {
    "all_cars": {
      "global": {},
      "aggs": {
        "colors": {
          "filter": { "terms": { "make": ["honda", "toyota", "bmw"] } },
          "aggregations": {
            "filtered_colors": { "terms": { "field": "color.keyword" } }
          }
        },
        "makes": {
          "filter": { "terms": { "color": ["red", "green"] } },
          "aggregations": {
            "filtered_makes": { "terms": { "field": "make.keyword" } }
          }
        }
      }
    }
  }
}
How can I know which fields I can aggregate on? Is there a way to describe the properties of a document after running a query, so I can know what the possible fields to aggregate on are?
Right now I am storing all properties of my article in an array and I can quickly aggregate them like this:
{
  "size": 0,
  "aggregations": {
    "array_aggregation": {
      "terms": {
        "field": "properties.keyword",
        "size": 10
      }
    }
  }
}
This is a step in the right direction, but that way I don't know what the type of a property is.
Here's a sample object:
{
  "price": 10000,
  "color": "red",
  "make": "honda",
  "sold": "2014-10-28",
  "properties": [
    "price",
    "color",
    "make",
    "sold"
  ]
}
You can use the filter aggregation, which will filter first, and then create a terms aggregation inside it.
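One way to recover per-category facet fields, sketched client-side: read the properties array of a matching document and infer a type from each sample value. The field names come from the sample object above; the crude type-inference rule is an assumption for illustration, not something Elasticsearch provides:

```python
sample = {
    "price": 10000,
    "color": "red",
    "make": "honda",
    "sold": "2014-10-28",
    "properties": ["price", "color", "make", "sold"],
}

def facet_fields(doc):
    # Map each declared property to a rough type guess based on its value.
    fields = {}
    for name in doc["properties"]:
        value = doc[name]
        if isinstance(value, (int, float)):
            fields[name] = "number"
        else:
            fields[name] = "string"
    return fields

fields = facet_fields(sample)
```

In practice you would aggregate this over the documents matching the current category filter, merging the per-document guesses.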

Query or Filter for minimum field value?

Example: a document stored in an index represents test scores and meta data about each test.
{ "test": 1, "user":1, "score":100, "meta":"other data" },
{ "test": 2, "user":2, "score":65, "meta":"other data" },
{ "test": 3, "user":2, "score":88, "meta":"other data" },
{ "test": 4, "user":1, "score":23, "meta":"other data" }
I need to be able to filter out all but the lowest test score and return the associated metadata with that test for each test taker. So my expected result set would be:
{ "test": 2, "user":2, "score":65, "meta":"other data" },
{ "test": 4, "user":1, "score":23, "meta":"other data" }
The only way I see to do this now is by first doing a terms aggregation by user with a nested min aggregation to get their lowest score.
POST user/tests/_search
{
  "aggs": {
    "users": {
      "terms": {
        "field": "user",
        "order": { "lowest_score": "asc" }
      },
      "aggs": {
        "lowest_score": { "min": { "field": "score" } }
      }
    }
  },
  "size": 0
}
Then I'd have to take the results of that query and do a filtered query for EACH user, filtering on their lowest score value to grab the rest of the metadata. Yuk.
POST user/tests/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "user": { "value": "1" } } },
            { "term": { "score": { "value": "23" } } }
          ]
        }
      }
    }
  }
}
I'd like to know if there is a way to return one response that has the lowest test score for each test taker and includes the original _source document.
Solutions?
UPDATE - SOLVED
The following gives me the lowest score document for each user and is ordered by the overall lowest score. And, it includes the original document.
GET user/tests/_search?search_type=count
{
  "aggs": {
    "users": {
      "terms": {
        "field": "user",
        "order": { "lowest_score": "asc" }
      },
      "aggs": {
        "lowest_score": { "min": { "field": "score" } },
        "lowest_score_top_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{ "score": { "order": "asc" } }]
          }
        }
      }
    }
  }
}
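What that combined aggregation computes can be mirrored in Python: group documents by user, keep the single lowest-score document per bucket, and order the buckets by that minimum. A sketch using the sample documents from the question:

```python
tests = [
    {"test": 1, "user": 1, "score": 100, "meta": "other data"},
    {"test": 2, "user": 2, "score": 65, "meta": "other data"},
    {"test": 3, "user": 2, "score": 88, "meta": "other data"},
    {"test": 4, "user": 1, "score": 23, "meta": "other data"},
]

def lowest_score_per_user(docs):
    # The terms aggregation: one bucket per user.
    buckets = {}
    for d in docs:
        buckets.setdefault(d["user"], []).append(d)
    # top_hits with size 1 sorted by score asc == the min-score document,
    # and the buckets themselves are ordered by that minimum score.
    winners = [min(bs, key=lambda d: d["score"]) for bs in buckets.values()]
    return sorted(winners, key=lambda d: d["score"])

lowest = lowest_score_per_user(tests)
```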
Maybe you could try this with top hits aggregation:
GET user/tests/_search?search_type=count
{
  "aggs": {
    "users": {
      "terms": {
        "field": "user",
        "order": { "_term": "asc" }
      },
      "aggs": {
        "lowest_score": {
          "min": { "field": "score" }
        },
        "agg_top": {
          "top_hits": { "size": 1 }
        }
      }
    }
  },
  "size": 20
}