Sub-field aggregation (group by / order by) in Elasticsearch

I am unable to find the correct syntax to get an aggregation of a sub-object ordered by a count field.
A good example of this is a Twitter document:
{
  "properties": {
    "id": {
      "type": "long"
    },
    "message": {
      "type": "string"
    },
    "user": {
      "type": "object",
      "properties": {
        "id": {
          "type": "long"
        },
        "screenName": {
          "type": "string"
        },
        "followers": {
          "type": "long"
        }
      }
    }
  }
}
How would I go about getting the Top Influencers for a given set of tweets? This would be a unique list of the top 10 "user" objects ordered by the "user.followers" field.
I have tried using top_hits but get an exception:
org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA]
Data too large, data for [user.id]
"aggs": {
"top-influencers": {
"terms": {
"field": "user.id",
"order": {
"top_hit": "desc"
}
},
"aggs": {
"top_tags_hits": {
"top_hits": {}
},
"top_hit": {
"max": {
"field": "user.followers"
}
}
}
}
}
I can get almost what I want using the "sort" field on the query (no aggregation); however, if a user has multiple tweets they will appear twice in the result. I need to group by the sub-object "user" and return each user only once.
---UPDATE---
I have managed to get a list of the top users returning in very good time. Unfortunately, it still isn't unique. Also, the docs say top_hits is designed to be used as a sub-aggregation, whereas I am using it as a top-level aggregation...
"aggs": {
"top_influencers": {
"top_hits": {
"sort": [
{
"user.followers": {
"order": "desc"
}
}
],
"_source": {
"include": [
"user.id",
"user.screenName",
"user.followers"
]
},
"size": 10
}
}
}

Try this:
{
  "aggs": {
    "GroupByType": {
      "terms": {
        "field": "user.id",
        "size": 10000
      },
      "aggs": {
        "Group": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": ["user.id", "user.screenName", "user.followers"]
            },
            "sort": [
              {
                "user.followers": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
You can then take the top 10 results of this query. Note that a normal search in Elasticsearch only returns up to 10,000 records by default (the index.max_result_window setting).
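A minimal sketch of an alternative that skips the client-side sort, assuming the mapping above and a hypothetical index named tweets: ordering the terms buckets by a max sub-aggregation returns the top 10 unique users directly, with a one-hit top_hits per bucket to carry the user fields.
POST tweets/_search
{
  "size": 0,
  "aggs": {
    "top_influencers": {
      "terms": {
        "field": "user.id",
        "size": 10,
        "order": { "max_followers": "desc" }
      },
      "aggs": {
        "max_followers": {
          "max": { "field": "user.followers" }
        },
        "influencer": {
          "top_hits": {
            "size": 1,
            "_source": { "includes": [ "user.id", "user.screenName", "user.followers" ] }
          }
        }
      }
    }
  }
}
Each bucket is one unique user, and the buckets come back already sorted by follower count.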

Related

Elasticsearch query - Most recent log for each user, for field logtype='x'

I am trying to query Elasticsearch to get the most recent log for each userID, including only one log per user where the field logtype='x':
if logtype='x', then get one log per userID, where the date of this log is the most recent for that userID.
Example log: {"logtype":"x", "number":232423, "userID":123, "time":"2021-02-03T20:25:44.603045+05:30"}
How can I create this query?
Assuming your index mapping looks like:
{
  "properties": {
    "logtype": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "number": {
      "type": "long"
    },
    "time": {
      "type": "date"
    },
    "userID": {
      "type": "long"
    }
  }
}
You'll need these aggregations: a terms ordered by the result of a max, plus a top_hits to fetch the most recent log per userID (the terms size is 1 below, so only the single most recently active user comes back; raise it to cover more users):
POST logs/_search
{
  "size": 0,
  "query": {
    "match": {
      "logtype": "x"
    }
  },
  "aggs": {
    "by_user_id": {
      "terms": {
        "field": "userID",
        "size": 1,
        "order": {
          "latest_date": "desc"
        }
      },
      "aggs": {
        "latest_date": {
          "max": {
            "field": "time"
          }
        },
        "latest_log": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "time": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
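As an aside, on versions with field collapsing (5.3+) a plain search can return the single most recent hit per userID without any aggregations. A hedged sketch against the same assumed logs index:
POST logs/_search
{
  "query": {
    "match": {
      "logtype": "x"
    }
  },
  "collapse": {
    "field": "userID"
  },
  "sort": [
    { "time": { "order": "desc" } }
  ]
}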

Elasticsearch : How to do 'group by' with painless in scripted fields?

I would like to do something like the following using Painless:
SELECT day, SUM(price) / SUM(quantity) AS ratio
FROM data
GROUP BY day
Is it possible?
I want to do this in order to visualize the ratio field in Kibana, since Kibana itself can't divide aggregated values, but I would gladly listen to alternative solutions beyond scripted fields.
Yes, it's possible: you can achieve this with the bucket_script pipeline aggregation:
{
  "aggs": {
    "days": {
      "date_histogram": {
        "field": "dateField",
        "interval": "day"
      },
      "aggs": {
        "price": {
          "sum": {
            "field": "price"
          }
        },
        "quantity": {
          "sum": {
            "field": "quantity"
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "sumPrice": "price",
              "sumQuantity": "quantity"
            },
            "script": "params.sumPrice / params.sumQuantity"
          }
        }
      }
    }
  }
}
UPDATE:
You can use the above query through the Transform API, which will create an aggregated index out of the source index.
For instance, I've indexed a few documents in a test index, and we can dry-run the above aggregation query to see what the target aggregated index would look like:
POST _transform/_preview
{
  "source": {
    "index": "test2",
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "transtest"
  },
  "pivot": {
    "group_by": {
      "days": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "day"
        }
      }
    },
    "aggregations": {
      "price": {
        "sum": {
          "field": "price"
        }
      },
      "quantity": {
        "sum": {
          "field": "quantity"
        }
      },
      "ratio": {
        "bucket_script": {
          "buckets_path": {
            "sumPrice": "price",
            "sumQuantity": "quantity"
          },
          "script": "params.sumPrice / params.sumQuantity"
        }
      }
    }
  }
}
The response looks like this:
{
  "preview": [
    {
      "quantity": 12.0,
      "price": 1000.0,
      "days": 1580515200000,
      "ratio": 83.33333333333333
    }
  ],
  "mappings": {
    "properties": {
      "quantity": {
        "type": "double"
      },
      "price": {
        "type": "double"
      },
      "days": {
        "type": "date"
      }
    }
  }
}
What you see in the preview array are the documents that are going to be indexed in the transtest target index, which you can then visualize in Kibana like any other index.
So what a transform actually does is run the aggregation query given above and store each bucket as a document in another index.
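If the preview looks right, you would then persist and start the transform via the transform APIs. A sketch, where the transform id daily-ratio is a made-up placeholder and the body mirrors the preview request:
PUT _transform/daily-ratio
{
  "source": {
    "index": "test2"
  },
  "dest": {
    "index": "transtest"
  },
  "pivot": {
    "group_by": {
      "days": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "day"
        }
      }
    },
    "aggregations": {
      "price": { "sum": { "field": "price" } },
      "quantity": { "sum": { "field": "quantity" } },
      "ratio": {
        "bucket_script": {
          "buckets_path": {
            "sumPrice": "price",
            "sumQuantity": "quantity"
          },
          "script": "params.sumPrice / params.sumQuantity"
        }
      }
    }
  }
}
POST _transform/daily-ratio/_start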
I found a solution to get the ratio of sums with a TSVB visualization in Kibana.
First, create two sum aggregations: one that sums price and another that sums quantity. Then choose the 'Bucket Script' aggregation to divide the aforementioned sums, using a Painless script.
The only drawback I found is that you cannot aggregate on multiple columns.

Count nested objects no more than once in each document in Elasticsearch

I have an index with documents of the following structure:
{
  "_id": "1234567890abcdef",
  ...
  "entities": [
    {
      "name": "beer",
      "evidence_start": 12,
      "evidence_end": 16
    },
    {
      "name": "water",
      "evidence_start": 55,
      "evidence_end": 60
    },
    {
      "name": "beer",
      "evidence_start": 123,
      "evidence_end": 127
    },
    ...
  ]
}
entities is an object of type nested here. I need to count how many documents contain mentions of beer.
The issue is that the obvious bucket aggregation returns the number of mentions, not documents, so if beer is mentioned twice in the same document, it adds 2 to the total result.
A query I use to do that is:
{
  ...
  "aggs": {
    "entities": {
      "nested": {
        "path": "entities"
      },
      "aggs": {
        "entity_count": {
          "terms": {
            "field": "entities.name",
            "size": 20
          }
        }
      }
    }
  },
  ...
}
Is there a way of counting only distinct mentions without scripting?
Many thanks in advance.
You simply need to add a reverse_nested aggregation as a sub-aggregation, to count the number of "main" documents instead of nested documents.
You should try:
{
  ...
  "aggs": {
    "entities": {
      "nested": {
        "path": "entities"
      },
      "aggs": {
        "entity_count": {
          "terms": {
            "field": "entities.name",
            "size": 20
          },
          "aggs": {
            "main_document_count": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  },
  ...
}
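In the response, each entities.name bucket then carries two counts: doc_count (the number of nested mentions) and main_document_count.doc_count (the number of distinct parent documents, which is the figure you want). An illustrative bucket, with made-up numbers:
{
  "key": "beer",
  "doc_count": 42,
  "main_document_count": {
    "doc_count": 27
  }
}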

Elasticsearch one record for one matching query

I have an Elasticsearch index with many records. There is a username field, and I want to get the latest post for each username by passing comma-separated values, for example:
john,shahid,mike,jolie
and I want the latest post for each of these usernames. How can I do this? I can do it by passing one username at a time, but that results in many HTTP requests; I want to do it in one request.
You could use a filtered terms aggregation coupled with a top_hits one in order to achieve what you need:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "username": [ "john", "shahid", "mike", "jolie" ]
        }
      }
    }
  },
  "aggs": {
    "usernames": {
      "filter": {
        "terms": {
          "username": [ "john", "shahid", "mike", "jolie" ]
        }
      },
      "aggs": {
        "by_username": {
          "terms": {
            "field": "username"
          },
          "aggs": {
            "top1": {
              "top_hits": {
                "size": 1,
                "sort": { "created_date": "desc" }
              }
            }
          }
        }
      }
    }
  }
}
This query gives you all the posts of these four usernames sorted by post_date in descending order; you can then process that data to get the result.
{
  "sort": [
    { "post_date": { "order": "desc" } }
  ],
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "username": [ "john", "shahid", "mike", "jolie" ]
        }
      }
    }
  }
}

get buckets count in elasticsearch aggregations

I am using Elasticsearch to search a database with a lot of duplicates.
I am using field collapse and it works; however, it returns the number of hits (including duplicates) and not the number of buckets.
"aggs": {
"uniques": {
"terms": {
"field": "guid"
},
"aggs": {
"jobs": { "top_hits": { "_source": "title", "size": 1 }}
}
}
}
I can count the buckets by making another request using cardinality (but it only returns the count, not the documents):
{
  "aggs": {
    "uniques": {
      "cardinality": {
        "field": "guid"
      }
    }
  }
}
Is there a way to return both results (buckets + total bucket count) in one search?
Thanks
You can combine both of these aggregations into one request.
{
  "aggs": {
    "uniques": {
      "cardinality": {
        "field": "guid"
      }
    },
    "uniquesTerms": {
      "terms": {
        "field": "guid"
      },
      "aggs": {
        "jobs": { "top_hits": { "_source": "title", "size": 1 } }
      }
    }
  }
}
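Both results then come back side by side under aggregations: read uniques.value for the total bucket count and uniquesTerms.buckets for the documents. An illustrative response shape, trimmed and with made-up values:
{
  "aggregations": {
    "uniques": {
      "value": 42
    },
    "uniquesTerms": {
      "buckets": [
        {
          "key": "some-guid",
          "doc_count": 3,
          "jobs": {
            "hits": {
              "hits": [
                { "_source": { "title": "Some job title" } }
              ]
            }
          }
        }
      ]
    }
  }
}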
