Count nested objects no more than once in each document in Elasticsearch - elasticsearch

I have an index with documents of the following structure:
{
"_id" : "1234567890abcdef",
...
"entities" : [
{
"name" : "beer",
"evidence_start" : 12,
"evidence_end" : 16
},
{
"name" : "water",
"evidence_start" : 55,
"evidence_end" : 60
},
{
"name" : "beer",
"evidence_start" : 123,
"evidence_end" : 127
},
...
]
}
Here, entities is a field of type nested. I need to count how many documents mention beer.
The issue is that the obvious bucket aggregation returns the number of mentions, not the number of documents: if beer is mentioned twice in the same document, it adds 2 to the total.
A query I use to do that is:
{
...
"aggs": {
"entities": {
"nested": {
"path": "entities"
},
"aggs": {
"entity_count": {
"terms": {
"field": "entities.name",
"size" : 20
}
}
}
}
},
...
}
Is there a way of counting only distinct mentions without scripting?
Many thanks in advance.

You simply need to add a reverse_nested aggregation as a sub-aggregation of the terms aggregation, so that it counts the number of main (parent) documents instead of nested documents.
You should try:
{
...
"aggs": {
"entities": {
"nested": {
"path": "entities"
},
"aggs": {
"entity_count": {
"terms": {
"field": "entities.name",
"size" : 20
},
"aggs": {
"main_document_count": {
"reverse_nested": {}
}
}
}
}
}
},
...
}
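To see why the extra reverse_nested level changes the numbers, here is a minimal Python sketch that mimics the two counts on in-memory data. The sample documents and the helper names are made up for illustration; this is not Elasticsearch code, only the counting logic.

```python
from collections import Counter

# Hypothetical sample documents shaped like the ones in the question.
docs = [
    {"_id": "a", "entities": [{"name": "beer"}, {"name": "water"}, {"name": "beer"}]},
    {"_id": "b", "entities": [{"name": "beer"}]},
    {"_id": "c", "entities": [{"name": "water"}]},
]


def count_mentions(documents):
    """Count every nested entity, like the plain nested terms aggregation."""
    return Counter(e["name"] for d in documents for e in d["entities"])


def count_documents(documents):
    """Count each name at most once per document, like reverse_nested."""
    return Counter(
        name for d in documents for name in {e["name"] for e in d["entities"]}
    )


mentions = count_mentions(docs)
documents_with = count_documents(docs)
print(mentions["beer"], documents_with["beer"])
```

In the real response, the doc_count of each terms bucket corresponds to the first count, and main_document_count.doc_count inside that bucket corresponds to the second.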

Related

How to perform sub-aggregation in elasticsearch?

I have a set of article documents in elasticsearch with fields content and publish_datetime.
I am trying to retrieve most frequent words from articles with publish year == 2021.
GET articles/_search
{
"query": {
"match_all": {}
},
"aggs": {
"word_counts": {
"terms": {
"field": "content"
}
},
"publish_datetime": {
"terms": {
"field": "publish_datetime"
}
},
"aggs": {
"word_counts_2021": {
"bucket_selector": {
"buckets_path": {
"word_counts": "word_counts",
"pd": "publish_datetime"
},
"script": "LocalDateTime.parse(params.pd).getYear() == 2021"
}
}
}
}
}
This fails on
{
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "Unknown aggregation type [word_counts_2021]",
"line" : 17,
"col" : 25
}
],
"type" : "parsing_exception",
"reason" : "Unknown aggregation type [word_counts_2021]",
"line" : 17,
"col" : 25,
"caused_by" : {
"type" : "named_object_not_found_exception",
"reason" : "[17:25] unknown field [word_counts_2021]"
}
},
"status" : 400
}
which does not make sense, because word_counts_2021 is the name of the aggregation according to the docs, not an aggregation type. I am the one who picks the name, so I thought it could be basically any value.
Does anyone have any idea what's going on here? So far, this seems like a pretty unintuitive service to me.
The error comes from the placement of "aggs": because it sits directly inside the top-level "aggs" object, it is parsed as an aggregation named aggs, and word_counts_2021 is then read as its type. As written, the agg appears intended to filter the publish_datetime buckets so that only those from 2021 are kept. To do that, you must nest the sub-aggregation under that particular terms aggregation.
Like so:
GET articles/_search
{
"query": {
"match_all": {}
},
"aggs": {
"word_counts": {
"terms": {
"field": "content"
}
},
"publish_datetime": {
"terms": {
"field": "publish_datetime"
},
"aggs": {
"word_counts_2021": {
"bucket_selector": {
"buckets_path": {
"pd": "publish_datetime"
},
"script": "LocalDateTime.parse(params.pd).getYear() == 2021"
}
}
}
}
}
}
But, if that field has a date time type, I would suggest simply filtering with a range query and then aggregating your documents.
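The range-then-aggregate suggestion could look roughly like the request body built below. This is a sketch under assumptions: publish_datetime is mapped as a date whose format accepts yyyy-MM-dd strings, content is aggregatable (a keyword field, or a text field with fielddata enabled), and build_word_count_query is a hypothetical helper name.

```python
import json


def build_word_count_query(year, date_field="publish_datetime", text_field="content"):
    """Build a search body that filters to one calendar year, then aggregates terms."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "range": {
                date_field: {
                    "gte": f"{year}-01-01",      # first day of the year, inclusive
                    "lt": f"{year + 1}-01-01",   # first day of the next year, exclusive
                }
            }
        },
        "aggs": {
            "word_counts": {"terms": {"field": text_field}}
        },
    }


body = build_word_count_query(2021)
print(json.dumps(body, indent=2))
```

Filtering in the query means the terms aggregation only ever sees 2021 documents, so no bucket_selector script is needed.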

How to count number of fields inside nested field? - Elasticsearch

I created the following mapping. I would like to count the number of products in the nested field "products" for each document separately. I would also like to run a histogram aggregation over those counts, so that I know how many documents fall into each bucket size.
PUT /receipts
{
"mappings": {
"properties": {
"id" : {
"type": "integer"
},
"user_id" : {
"type": "integer"
},
"date" : {
"type": "date"
},
"sum" : {
"type": "double"
},
"products" : {
"type": "nested",
"properties": {
"name" : {
"type" : "text"
},
"number" : {
"type" : "double"
},
"price_single" : {
"type" : "double"
},
"price_total" : {
"type" : "double"
}
}
}
}
}
}
I've tried this query, but I get the number of all the products instead of number of products for each document separately.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products"
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count" : 6552,
"bucket_size" : {
"value" : 0
}
}
}
UPDATE
Now I have this code where I make separate buckets for each id and count the number of products inside them.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size" : 0,
"aggs": {
"terms":{
"terms":{
"field": "_id"
},
"aggs": {
"nested": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count_error_upper_bound" : 5,
"sum_other_doc_count" : 490,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"nested" : {
"doc_count" : 21,
"bucket_size" : {
"value" : 21
}
}
},
{
"key" : "10",
"doc_count" : 1,
"nested" : {
"doc_count" : 5,
"bucket_size" : {
"value" : 5
}
}
},
{
"key" : "100",
"doc_count" : 1,
"nested" : {
"doc_count" : 12,
"bucket_size" : {
"value" : 12
}
}
},
...
Is it possible to group these values (21, 5, 12, ...) into buckets to make a histogram of them?
products is only the path to the array of individual products, not an aggregatable field. So you'll need to run the aggregation on one of the product's fields, such as number:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
Note that if a product has no number, it will not contribute to the total count. It's therefore best practice to always include an ID in each product and then aggregate on that field.
Alternatively, you could use a script to account for missing values. Luckily, value_count does not deduplicate, meaning that if two products are alike and/or have empty values, they'll still be counted as two:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"script": {
"source": "doc['products.number'].toString()"
}
}
}
}
}
}
}
UPDATE
You could also use a nested composite aggregation which'll give you the histogrammed product count w/ the corresponding receipt id:
GET /receipts/_search
{
"size": 0,
"aggs": {
"my_aggs": {
"nested": {
"path": "products"
},
"aggs": {
"composite_parent": {
"composite": {
"sources": [
{
"receipt_id": {
"terms": {
"field": "_id"
}
}
},
{
"product_number": {
"histogram": {
"field": "products.number",
"interval": 1
}
}
}
]
}
}
}
}
}
}
The interval is modifiable.
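If the goal is a histogram of the per-receipt product counts themselves (21, 5, 12, ...), one option is to bucket them on the client after fetching the counts. The sketch below assumes the counts have already been extracted into a dict keyed by receipt id; histogram_of_counts and the interval of 5 are illustrative choices, not part of the answer above.

```python
from collections import Counter


def histogram_of_counts(counts_per_receipt, interval=5):
    """Group per-receipt product counts into buckets of width `interval`."""
    buckets = Counter(
        (count // interval) * interval for count in counts_per_receipt.values()
    )
    # Sort by bucket lower bound so the histogram reads left to right.
    return dict(sorted(buckets.items()))


# Product counts taken from the three buckets shown in the question's output.
counts = {"1": 21, "10": 5, "100": 12}
print(histogram_of_counts(counts))
```

Each key is the lower bound of a bucket and each value is the number of receipts whose product count falls in it.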

Elasticsearch one record for one matching query

I have an Elasticsearch index with a large number of records. There is a username field, and I want to get the latest post for each username by passing comma-separated values, for example:
john,shahid,mike,jolie
How can I get the latest post for each of these usernames? I can do it by passing one username at a time, but that requires many HTTP requests. I want to do it in one request.
You could use a filtered terms aggregation coupled with a top_hits one in order to achieve what you need:
{
"size": 0,
"query": {
"bool": {
"filter": {
"terms": {
"username": [ "john", "shahid", "mike", "jolie" ]
}
}
}
},
"aggs": {
"usernames": {
"filter": {
"terms": {
"username": [ "john", "shahid", "mike", "jolie" ]
}
},
"aggs": {
"by_username": {
"terms": {
"field": "username"
},
"aggs": {
"top1": {
"top_hits": {
"size": 1,
"sort" : {"created_date" : "desc"}
}
}
}
}
}
}
}
}
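Since the usernames arrive as a comma-separated string, the request body can be generated from it. The following is a sketch, assuming username is a keyword-like field and created_date is the sort field, as in the query above; build_latest_post_query is a hypothetical helper. It drops the filter aggregation because the query already restricts the documents, and it sets the terms size explicitly because terms returns only 10 buckets by default.

```python
def build_latest_post_query(usernames_csv, sort_field="created_date"):
    """Build one search body returning the newest post for each username."""
    usernames = [u.strip() for u in usernames_csv.split(",") if u.strip()]
    if not usernames:
        raise ValueError("at least one username is required")
    return {
        "size": 0,  # hits come from top_hits, not from the main result list
        "query": {"bool": {"filter": {"terms": {"username": usernames}}}},
        "aggs": {
            "by_username": {
                # One bucket per requested username.
                "terms": {"field": "username", "size": len(usernames)},
                "aggs": {
                    "top1": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{sort_field: {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = build_latest_post_query("john, shahid,mike,jolie")
```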
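This query returns all posts by these four usernames, sorted by post_date in descending order. You can then post-process the results to pick the latest post per username. Note that the filtered query was removed in Elasticsearch 5.0; on newer versions, use a bool query with a filter clause instead.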
{
"sort" : [
{ "post_date" : {"order" : "desc"}}
],
"query" : {
"filtered" : {
"filter" : {
"terms" : {
"username" : ["john","shahid","mike","jolie"]
}
}
}
}
}

get buckets count in elasticsearch aggregations

I am using elasticsearch to search a database with a lot of duplicates.
I am using field collapsing and it works; however, it returns the number of hits (including duplicates) and not the number of buckets.
"aggs": {
"uniques": {
"terms": {
"field": "guid"
},
"aggs": {
"jobs": { "top_hits": { "_source": "title", "size": 1 }}
}
}
}
I can count the buckets by making another request using cardinality (but it only returns count, not the documents):
{
"aggs" : {
"uniques" : {
"cardinality" : {
"field" : "guid"
}
}
}
}
Is there a way to return both requests (buckets + total bucket count) in one search?
Thanks
You can combine both of these aggregations into 1 request.
{
"aggs" : {
"uniques" : {
"cardinality" : {
"field" : "guid"
}
},
"uniquesTerms": {
"terms": {
"field": "guid"
},
"aggs": {
"jobs": { "top_hits": { "_source": "title", "size": 1 }}
}
}
}
}
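Reading both results out of the single response might look like the sketch below. The sample response is invented for illustration and trimmed to the fields used; summarize_uniques is a hypothetical helper. Keep in mind that cardinality is an approximate count and that terms returns only the top 10 buckets by default, so the bucket list can be shorter than the reported total.

```python
def summarize_uniques(response):
    """Return (total unique guids, {guid: title of its top hit})."""
    aggs = response["aggregations"]
    total = aggs["uniques"]["value"]
    titles = {
        bucket["key"]: bucket["jobs"]["hits"]["hits"][0]["_source"]["title"]
        for bucket in aggs["uniquesTerms"]["buckets"]
    }
    return total, titles


# Hypothetical response, reduced to the parts the helper reads.
sample_response = {
    "aggregations": {
        "uniques": {"value": 2},
        "uniquesTerms": {
            "buckets": [
                {"key": "guid-1", "doc_count": 3,
                 "jobs": {"hits": {"hits": [{"_source": {"title": "Engineer"}}]}}},
                {"key": "guid-2", "doc_count": 1,
                 "jobs": {"hits": {"hits": [{"_source": {"title": "Designer"}}]}}},
            ]
        },
    }
}

total, titles = summarize_uniques(sample_response)
```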

sub field aggregation group by order by in elasticsearch

I am unable to find the correct syntax to get an aggregation of a sub object ordered by a count field.
A good example of this is a twitter document:
{
"properties" : {
"id" : {
"type" : "long"
},
"message" : {
"type" : "string"
},
"user" : {
"type" : "object",
"properties" : {
"id" : {
"type" : "long"
},
"screenName" : {
"type" : "string"
},
"followers" : {
"type" : "long"
}
}
}
}
}
How would I go about getting the Top Influencers for a given set of tweets? This would be a unique list of the top 10 "user" objects ordered by the "user.followers" field.
I have tried using top_hits but get an exception:
org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA]
Data too large, data for [user.id]
"aggs": {
"top-influencers": {
"terms": {
"field": "user.id",
"order": {
"top_hit": "desc"
}
},
"aggs": {
"top_tags_hits": {
"top_hits": {}
},
"top_hit": {
"max": {
"field": "user.followers"
}
}
}
}
}
I can get almost what I want using the "sort" field on the query (no aggregation), however if a user has multiple tweets then they will appear twice in the result. I need to be able to group by the sub object "user" and only return each user once.
---UPDATE---
I have managed to get a list of the top users returned in very good time. Unfortunately, it still isn't unique. Also, the docs say top_hits is designed to be a sub-aggregation, but I am using it as a top-level aggregation:
"aggs": {
"top_influencers": {
"top_hits": {
"sort": [
{
"user.followers": {
"order": "desc"
}
}
],
"_source": {
"include": [
"user.id",
"user.screenName",
"user.followers"
]
},
"size": 10
}
}
}
Try this:
{
"aggs": {
"GroupByType": {
"terms": {
"field": "user.id",
"size": 10000
},
"aggs": {
"Group": {
"top_hits":{
"size":1,
"_source": {
"includes": ["user.id", "user.screenName", "user.followers"]
},
"sort":[{
"user.followers": {
"order": "desc"
}
}]
}
}
}
}
}
}
You can then sort the returned buckets by followers on the client and take the top 10. Note that a normal Elasticsearch search only pages up to 10,000 results by default (the index.max_result_window setting).
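Picking the top users out of that response could be sketched as follows. The sample response is made up and reduced to the fields the code reads, and top_influencers is a hypothetical helper; the aggregation names GroupByType and Group match the query above.

```python
def top_influencers(response, limit=10):
    """Return the `limit` users with the most followers, one entry per user."""
    users = [
        bucket["Group"]["hits"]["hits"][0]["_source"]["user"]
        for bucket in response["aggregations"]["GroupByType"]["buckets"]
    ]
    # Buckets arrive ordered by doc_count, so re-sort by follower count.
    return sorted(users, key=lambda u: u["followers"], reverse=True)[:limit]


def _bucket(user_id, name, followers):
    """Build one hypothetical terms bucket with a single top hit."""
    source = {"user": {"id": user_id, "screenName": name, "followers": followers}}
    return {"key": user_id, "Group": {"hits": {"hits": [{"_source": source}]}}}


sample_response = {
    "aggregations": {
        "GroupByType": {
            "buckets": [_bucket(1, "ann", 50), _bucket(2, "bob", 900), _bucket(3, "cy", 300)]
        }
    }
}

top_two = top_influencers(sample_response, limit=2)
print([u["screenName"] for u in top_two])
```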

Resources