Elasticsearch significant terms minimum

I've got something like this:
GET index_*/_search?search_type=count
{
  "aggs": {
    "products": {
      "terms": {
        "field": "products_id",
        "size": 100
      },
      "aggs": {
        "significant_products": {
          "significant_terms": {
            "field": "also_purchased_id",
            "size": 40
          }
        }
      }
    }
  }
}
I want significant_terms to give me more results. Sometimes it returns only 10 buckets even when doc_count says 400. If I add "min_doc_count": 10 to significant_terms, it just does weird things: some keys return no results at all, and some only 3 or 4. How can I get more results?
Thanks!
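One thing worth trying, as a sketch rather than a definitive fix: significant_terms trims candidates both globally (min_doc_count, which defaults to 3) and per shard (shard_min_doc_count, shard_size), so terms can be cut before the final reduce step even when the parent bucket has a high doc_count. Raising shard_size and lowering the cutoffs may surface more results; the values below are illustrative:
GET index_*/_search?search_type=count
{
  "aggs": {
    "products": {
      "terms": {
        "field": "products_id",
        "size": 100
      },
      "aggs": {
        "significant_products": {
          "significant_terms": {
            "field": "also_purchased_id",
            "size": 40,
            "shard_size": 400,         <---- consider more candidates per shard than the final size
            "min_doc_count": 1,        <---- lower the global cutoff (default is 3)
            "shard_min_doc_count": 1   <---- lower the per-shard cutoff as well
          }
        }
      }
    }
  }
}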

Related

Elasticsearch - get N top items in group

I store data in Elasticsearch with the following structure:
"_source" : {
"artist" : "Roger McGuinn",
"track_id" : "TRBIACM128F930021A",
"title" : "The Bells Of Rhymney",
"score" : 0,
"user_id" : "61583201a0b70d3f7ed79b60",
"timestamp" : 1634991817
}
How can I get the top N songs with the best score for each user? If a user has rated a song several times, I would like to take only the most recent rating into account.
I came up with this, but instead of the top 10 songs per user, I just get the first 10 songs found, without the score being taken into account:
{
  "size": 0,
  "aggs": {
    "group_by_user": {
      "terms": {
        "field": "user_id.keyword",
        "size": 1
      },
      "aggs": {
        "group_by_track": {
          "terms": {
            "field": "track_id.keyword"
          },
          "aggs": {
            "take_the latest_score": {
              "terms": {
                "field": "timestamp",
                "size": 1
              },
              "aggs": {
                "take N tracks": {
                  "top_hits": {
                    "size": 10
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
What I understand is that you want to return a list of users with their highest-rated track, based on date/time.
You can make use of a Date Histogram aggregation followed by a Terms aggregation, on which you can further extend the pipeline with a Top Hits aggregation:
Aggregation Query:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "songs_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h",      <---- Note this. Change this to 1d if you'd want to return results on a daily basis
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_user": {
          "terms": {
            "field": "user_id.keyword",
            "size": 10               <---- Note this. To return 10 users
          },
          "aggs": {
            "take N tracks": {
              "top_hits": {
                "sort": [
                  {
                    "score": {
                      "order": "desc"    <---- Also note this, to sort based on score
                    }
                  }
                ],
                "_source": {
                  "includes": ["track_id", "score"]    <---- To return track_id and score
                },
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}
For example, since I'm using a fixed_interval of 1h, this would return, for every hour, the highest-rated track of each user in that window.
Feel free to filter the documents first using a Range query, on top of which you can run the above aggregation.
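A minimal sketch of that combination, assuming timestamp is mapped as a date in epoch seconds (the bounds below are hypothetical placeholders):
POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": 1634900000,   <---- hypothetical epoch-second bounds; adjust to your window
        "lte": 1635000000
      }
    }
  },
  "aggs": {
    "songs_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h",
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_user": {
          "terms": {
            "field": "user_id.keyword",
            "size": 10
          },
          "aggs": {
            "take N tracks": {
              "top_hits": {
                "sort": [
                  { "score": { "order": "desc" } }
                ],
                "_source": {
                  "includes": ["track_id", "score"]
                },
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}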

Suspiciously low result on Elasticsearch

The following query returns 24 buckets:
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "partnerCategory": 6
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "uniqcnpjs": {
      "terms": {
        "field": "partnerId"
      }
    }
  }
}
I expected about 750 buckets, so 24 is very low.
On top of that, if you add up the doc_count of each bucket, the sum doesn't match the number of hits you get without the aggregation: it should be at least 20k, but it's only 2.5k.
So, can anyone tell me what's going on? Am I doing something wrong?
Have you tried setting the size option of the terms aggregation to a very high value? E.g.:
"aggs": {
"uniqcnpjs": {
"terms": {
"field": "partnerId",
"size": 1000
}
}
}
Also, check whether the result of a cardinality aggregation is lower than what you expect, e.g.:
"aggs": {
"cardinality_partnerid": {
"cardinality": {
"field": "partnerId"
}
}
}
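If the counts still look off after raising size, one diagnostic sketch (assuming the shard-level approximation of the terms aggregation is involved): the response already reports sum_other_doc_count for documents that fell outside the returned buckets, and show_term_doc_count_error adds a per-bucket doc_count_error_upper_bound; raising shard_size makes each shard consider more candidates before the final reduce:
"aggs": {
  "uniqcnpjs": {
    "terms": {
      "field": "partnerId",
      "size": 1000,
      "shard_size": 5000,                  <---- consider more candidates per shard
      "show_term_doc_count_error": true    <---- report a per-bucket doc_count_error_upper_bound
    }
  }
}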

How do I filter after an aggregation?

I am trying to filter after a top_hits aggregation, to determine whether the first appearance of an error was in a given range, but I can't find a way.
I have seen something about the bucket selector aggregation but can't get it to work:
POST log-*/_search
{
  "size": 100,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "#timestamp": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
With this top_hits I get the first appearance of a particular errorID, as I have many documents with the same errorID, but what I want to find out is whether that first appearance falls within a given range of dates.
I think a valid solution would be to filter the results of the aggregation and check whether each one is in the range, but I don't know how to do that.
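A sketch of one possible approach, as an assumption rather than a confirmed answer from this thread: bucket_selector can only read numeric values through buckets_path, so instead of relying on the top_hits sort, compute the first appearance with a min aggregation on the timestamp and compare it against the bounds of the date range. The epoch-millisecond bounds in the script are hypothetical placeholders:
POST log-*/_search
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "first_seen": {
          "min": {
            "field": "#timestamp"
          }
        },
        "in_range": {
          "bucket_selector": {
            "buckets_path": {
              "firstSeen": "first_seen"
            },
            "script": "params.firstSeen >= 1609459200000L && params.firstSeen < 1612137600000L"
          }
        }
      }
    }
  }
}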

How to get specific _source fields in aggregation

I am exploring Elasticsearch, to be used in an application which will handle large volumes of data and generate some statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with Terms Aggregation, with the following query:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "type_count": {
      "terms": {
        "field": "val.keyword",
        "size": 100
      }
    }
  }
}
The query returns all the values of the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this with Elasticsearch? In other words, is it possible to include specific _source fields in the buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody could point me in the right direction. Thanks in advance!
I tried to include _source in the following ways:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"_source":["val_len"]
}
}
and
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"_source":["val_len"]
}
}
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use another sub-aggregation called top_hits, like this:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"aggs": {
"hits": {
"top_hits": {
"_source":["val_len"],
"size": 1
}
}
}
}
}
Another way of doing it is to use an avg sub-aggregation, which also lets you sort on it:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"order": {
"length": "desc"
}
},
"aggs": {
"length": {
"avg": {
"field": "val_len"
}
}
}
}
}

Result filter and pagination in Elasticsearch

I need some help or an idea for the correct procedure.
I have already indexed a large number of documents. Now I have found out that there are some documents with almost the same content, e.g.:
{
  "title": "myDocument",
  "date": "2017-09-18",
  "page": 1
}
{
  "title": "myDocument",
  "date": "2017-09-18",
  "page": 2
}
The title field is mapped as text, date as date, and page as integer. As you can see, the only difference is the page value.
Now I want to run a query and filter out these duplicates. Field collapsing seems a good way to do it, but in that case I can't get the correct count of results, and that's important for me.
Another way would be to get all results first and then filter out the duplicates manually, but then I have a problem with pagination.
Try something like this.
GET index/type/_search
{
  "aggs": {
    "count_by_title_date_page": {
      "terms": {
        "field": "title.keyword",
        "size": 100
      },
      "aggs": {
        "date": {
          "terms": {
            "field": "date",
            "size": 100
          },
          "aggs": {
            "page": {
              "terms": {
                "field": "page",
                "size": 100
              }
            }
          }
        }
      }
    }
  }
}
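If the correct total count is the main blocker, here is a sketch of an alternative worth considering (an assumption, not part of the answer above): use field collapsing for the paginated results and a cardinality aggregation for the group count, since hits.total counts the pre-collapse documents. Collapsing works on a single field, so for a title-plus-date key you would need to index a dedicated dedup field:
GET index/type/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "title.keyword"   <---- or a dedicated field combining title and date
  },
  "from": 0,
  "size": 10,
  "aggs": {
    "total_distinct": {
      "cardinality": {
        "field": "title.keyword"   <---- approximate number of distinct groups, usable for pagination
      }
    }
  }
}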
