Use a distinct field for counts with significant_terms in Elasticsearch - elasticsearch

Is there a way to get the significant_terms aggregation to use document counts based on a distinct field?
I have an index of posts and their hashtags. The posts come from multiple sources, so several documents can share the same permalink field, but I only want to count unique permalinks for each hashtag. I have managed to get the unique totals using the cardinality aggregation (i.e. "cardinality": { "field": "permalink.keyword" }), but I can't work out how to do this with the significant_terms aggregation. My query is as follows:
GET /posts-index/_search
{
"aggregations": {
"significant_hashtag": {
"significant_terms": {
"background_filter": {
"bool": {
"filter": [
{
"range": {
"created": {
"gte": 1656414622,
"lte": 1656630000
}
}
}
]
}
},
"field": "hashtag.keyword",
"mutual_information": {
"background_is_superset": false,
"include_negatives": true
},
"size": 100
}
}
},
"query": {
"bool": {
"filter": [
{
"range": {
"created": {
"gte": 1656630000,
"lte": 1659308400
}
}
}
]
}
},
"size": 0
}
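One direction that might be worth trying, offered as an unverified sketch rather than a confirmed answer: significant_terms is a bucket aggregation, so a cardinality sub-aggregation on permalink.keyword can be nested under it to report the number of unique permalinks for each returned hashtag. This does not change how the significance score itself is computed, which still uses raw document counts. The Python below only builds the request body; the helper name with_unique_permalinks and the sub-aggregation name unique_permalinks are made up for illustration.

```python
import json


def with_unique_permalinks(search_body, agg_name="significant_hashtag"):
    """Return a copy of search_body with a cardinality sub-aggregation added."""
    body = json.loads(json.dumps(search_body))  # deep copy via JSON round-trip
    # Hypothetical sub-aggregation name; it counts distinct permalinks per bucket.
    body["aggregations"][agg_name]["aggs"] = {
        "unique_permalinks": {
            "cardinality": {"field": "permalink.keyword"}
        }
    }
    return body


# Shortened version of the request from the question.
original = {
    "size": 0,
    "aggregations": {
        "significant_hashtag": {
            "significant_terms": {"field": "hashtag.keyword", "size": 100}
        }
    },
}

print(json.dumps(with_unique_permalinks(original), indent=2))
```

Each hashtag bucket in the response would then carry its own unique_permalinks value next to doc_count.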

Related

Get very large total result count from pipeline aggregation

I have a query that I'm executing on an event table, which finds all productIds for product events where the active field changed from one date to another. This query returns an extremely large dataset, which I plan to paginate using partitions.
In order to know how large my partitions should be, I need a total count of the docs returned by this query. However, if I run the query itself and return all of the docs, I unsurprisingly get a memory error (this occurs even if I use a filter to return just the count).
Is there a way to process and return just the total result count?
{
"query": {
"bool": {
"should": [
{
"range": {
"timeRange": { "gte": "2022-05-22T00:00:00.000Z", "lte": "2022-05-22T00:00:00.000Z" }
}
},
{
"range": {
"timeRange": { "gte": "2022-05-01T00:00:00.000Z", "lte": "2022-05-01T00:00:00.000Z" }
}
}
]
}
},
"version": true,
"aggs": {
"total_entities": {
"stats_bucket": {
"buckets_path": "group_by_entity_id>distinct_val_count"
}
},
"group_by_entity_id": {
"terms": {
"field": "productId",
"size": 500000
},
"aggs": {
"distinct_val_count": {
"cardinality": {
"field": "active"
}
},
"distinct_val_count_filter": {
"bucket_selector": {
"buckets_path": {
"distinct_val_count": "distinct_val_count"
},
"script": "params.distinct_val_count > 1"
}
}
}
}
}
}

How to search for an array of terms in Elasticsearch?

For context: I have a query that searches for a term in two fields, and the result should return items that match the wildcard pattern. Eventually, though, I'll receive a list of search terms.
I use this query when I get only one string:
{
"query": {
"bool": {
"filter": [
{
"bool": {
"should": [
{
"wildcard": {
"shortName": "BAN*"
}
},
{
"wildcard": {
"name": "BAN*"
}
}
]
}
},
{
"range": {
"dhCot": {
"gte": "2022-04-11T00:00:00.000Z",
"lt": "2022-04-12T00:00:00.000Z"
}
}
}
]
}
},
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "dtBuy",
"interval": "1H",
"format": "yyyy-MM-dd:HH:mm:ssZ"
},
"aggs": {
"documents": {
"top_hits": {
"size": 100
}
}
}
}
}
}
But sometimes I will get an array of strings, like this: ["BANANA","APPLE","ORANGE"]
So, how do I search for items that exactly match the items within the array? Is it possible?
The document indexed in Elasticsearch looks like this:
{
"name": "BANANA",
"priceDay": 1,
"priceWeek": 3,
"variation": 2,
"dataBuy":"2022-04-11T11:01:00.585Z",
"shortName": "BAN"
}
If you want to search for items that exactly match the items within the array, you can use the terms query:
{
"query": {
"terms": {
"name": ["BANANA","APPLE","ORANGE"]
}
}
}
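You can include the terms query in your existing query, in either the should clause or the must clause depending on your use case. Note that terms matches exact, unanalyzed values, so if name is mapped as an analyzed text field you would target a keyword sub-field instead (name.keyword under the default dynamic mapping).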
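As a rough sketch of how the terms query could be combined with the date range from the question, the Python below builds the request body as a plain dict. Placing the terms clause in the existing filter array (in place of the two wildcard clauses) is an assumption about the use case, and the function name build_exact_match_query is made up for illustration.

```python
import json


def build_exact_match_query(names, start, end):
    """Build a search body that keeps the date range and matches any exact name."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # terms matches documents whose field equals ANY listed value
                    {"terms": {"name": names}},
                    # same date range as in the question
                    {"range": {"dhCot": {"gte": start, "lt": end}}},
                ]
            }
        }
    }


body = build_exact_match_query(
    ["BANANA", "APPLE", "ORANGE"],
    "2022-04-11T00:00:00.000Z",
    "2022-04-12T00:00:00.000Z",
)
print(json.dumps(body, indent=2))
```

The aggs section from the original request can be added to the returned dict unchanged.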

Bucket sort on dynamic aggregation name

I would like to sort my aggregations by their quantity value.
My problem is that each aggregation has a name that can't be known in advance.
Given this query:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"datetime": {
"gte": "2021-01-01",
"lte": "2021-12-09"
}
}
}
]
}
},
"aggs": {
"sorting": {
"bucket_sort": {
"sort": [
{
"year>quantity": {
"order": "desc"
}
}
]
}
},
"UNKNOWN_1": {
"aggs": {
"year": {
"filter": {
"bool": {
"must": [
{
"range": {
"datetime": {
"gte": "2021-01-01",
"lte": "2021-12-09"
}
}
}
]
}
},
"aggs": {
"quantity": {
"sum": {
"field": "item.quantity"
}
}
}
}
}
},
"UNKNOWN_2": {
"aggs": {
"year": {
"aggs": {
"quantity": {
"sum": {
"field": "item.quantity"
}
}
}
}
}
},
....
}
}
my bucket_sort aggregation is missing one level in its path to reach that quantity value.
Here is one Elasticsearch record:
{
"datetime": "2021-12-01",
"item.quantity": 5
}
Note that I have removed most of the request for readability, such as the filter aggregations, etc.
I tried something with a wildcard:
"sorting": {
"bucket_sort": {
"sort": [
{
"*>year>quantity": {
"order": "desc"
}
}
]
}
},
But I got the same error.
Is it possible to achieve this behaviour?
I think you have misunderstood the bucket_sort aggregation: it doesn't sort your aggregations; it sorts the buckets produced by one multi-bucket aggregation. The bucket_sort aggregation also has to be a sub-aggregation of that multi-bucket aggregation.
From the docs:
[The bucket sort aggregation is] "a parent pipeline aggregation which sorts the buckets of its parent multi-bucket aggregation"
If I understand correctly, you are trying to create "buckets" with specific filter aggregations, and you can't know in advance how many of those filter aggregations you will create.
For that you can use the filters aggregation (a multi-bucket, "multi filters" aggregation), where you can specify as many named filters as you want, and each of them creates a bucket.
As a sub-aggregation of that filters aggregation you can create a single sum aggregation on item.quantity.
Also as a sub-aggregation of the filters aggregation, you then add your bucket_sort aggregation, which only has to name the sibling sum aggregation.
All in all, it might look like this:
{
"aggs": {
"your_filters": {
"filters": {
"filters": {
"unknown_1": {
"range": {
"datetime": {
"gte": "2021-01-01",
"lte": "2021-12-09"
}
}
},
"unknown_2": {
/** more filters here... **/
}
}
},
"aggs": {
"quantity": {
"sum": {
"field": "item.quantity"
}
},
"sorting": {
"bucket_sort": {
"sort": [
{ "quantity": { "order": "desc" } }
]
}
}
}
}
}
}
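Because the filter names are only known at runtime, the request body would typically be assembled in application code. The following is a minimal Python sketch of that step, assuming the named filters arrive as a dict; build_sorted_filters_agg is a hypothetical helper name, and the second filter's date range is invented because the original elides it.

```python
import json


def build_sorted_filters_agg(named_filters, sum_field="item.quantity"):
    """Build one filters aggregation with a sum and a bucket_sort beneath it."""
    return {
        "size": 0,
        "aggs": {
            "your_filters": {
                # One bucket per entry in named_filters.
                "filters": {"filters": dict(named_filters)},
                "aggs": {
                    "quantity": {"sum": {"field": sum_field}},
                    # bucket_sort refers to its sibling "quantity" by name.
                    "sorting": {
                        "bucket_sort": {
                            "sort": [{"quantity": {"order": "desc"}}]
                        }
                    },
                },
            }
        },
    }


named_filters = {
    "unknown_1": {"range": {"datetime": {"gte": "2021-01-01", "lte": "2021-12-09"}}},
    # Invented example filter; the original leaves this one unspecified.
    "unknown_2": {"range": {"datetime": {"gte": "2020-01-01", "lte": "2020-12-31"}}},
}

request_body = build_sorted_filters_agg(named_filters)
print(json.dumps(request_body, indent=2))
```

The sort path stays "quantity" no matter how many filters are passed in, which is what removes the need for a wildcard.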

Elasticsearch Pagination with timestamp range

The official Elasticsearch documentation explains that pagination can be implemented with composite aggregations.
A composite aggregation has to be requested multiple times to retrieve all results.
So my question is: can I use a range from now-1h to now when I execute a composite aggregation?
If I can, how does the composite aggregation keep the underlying data set unchanged when every request resolves now to a different time?
If I can't, note that my query below runs without error and the result seems to be right.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "now-1h"
}
}
}
]
}
},
"aggs": {
"user_device": {
"composite": {
"after": {
"user_name": "alen.lv"
},
"size": 100,
"sources": [
{
"user_name": {
"terms": {
"field": "user_name"
}
}
}
]
},
"aggs": {
"user_mac": {
"terms": {
"field": "user_mac",
"size": 1000
}
}
}
}
}
}
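One possible way to keep the window stable across pages, sketched here under assumptions rather than given as a confirmed answer, is to resolve now-1h and now once on the client and send the same absolute bounds with every page. The helper names are hypothetical, search stands for any callable that sends a request body and returns the parsed response, and the ISO 8601 strings assume the timestamp field's mapping accepts that format.

```python
from datetime import datetime, timedelta, timezone


def fixed_last_hour(now=None):
    """Resolve 'now-1h .. now' once, so every page uses the same bounds."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(hours=1)).isoformat(), now.isoformat()


def build_page_request(gte, lte, after_key=None, page_size=100):
    """Build one composite page request with absolute time bounds."""
    composite = {
        "size": page_size,
        "sources": [{"user_name": {"terms": {"field": "user_name"}}}],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume after the previous page
    return {
        "size": 0,
        "query": {
            "bool": {"filter": [{"range": {"timestamp": {"gte": gte, "lte": lte}}}]}
        },
        "aggs": {
            "user_device": {
                "composite": composite,
                "aggs": {
                    "user_mac": {"terms": {"field": "user_mac", "size": 1000}}
                },
            }
        },
    }


def iter_buckets(search, gte, lte):
    """Yield every composite bucket, following after_key until it runs out."""
    after_key = None
    while True:
        response = search(build_page_request(gte, lte, after_key))
        agg = response["aggregations"]["user_device"]
        yield from agg["buckets"]
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            break
```

Documents indexed after the fixed upper bound are excluded from every page, so later pages see the same window as the first one.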

Filter/Query support in Elasticsearch Top hits Aggregation

The Elasticsearch documentation states: "The top_hits aggregation returns regular search hits, because of this many per hit features can be supported." Crucially, the list includes "Named filters and queries".
But trying to add any filter or query throws SearchParseException: Unknown key for a START_OBJECT.
Use case: I have items that each have a list of nested comments:
items{id} -> comments {date, rating}
I want to get the top-rated comment from the last week for each item.
{
"query": {
"match_all": {}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"comment": {
"nested": {
"path": "comments"
},
"aggs": {
"top_comment": {
"top_hits": {
"size": 1,
//need filter here to select only comments of last week
"sort": {
"comments.rating": {
"order": "desc"
}
}
}
}
}
}
}
}
}
}
So is the documentation wrong, or is there any way to add a filter?
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-metrics-top-hits-aggregation.html
Are you sure you have mapped them as nested? I've just tried running such a query on my data and it worked fine.
If so, you could simply add a filter aggregation right after the nested aggregation (hopefully I haven't messed up the curly brackets):
POST data/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "comments",
"query": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
}
}
}
}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"nested": {
"nested": {
"path": "comments"
},
"aggs": {
"filterComments": {
"filter": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
},
"aggs": {
"topComments": {
"top_hits": {
"size": 1,
"sort": {
"comments.rating": "desc"
}
}
}
}
}
}
}
}
}
}
}
P.S. Always include the FULL path for nested objects.
So this query will:
1. Filter documents down to those that actually have comments younger than one week, to narrow the set used for aggregation (filtered query)
2. Run a terms aggregation on the id field
3. Open the nested sub-documents (comments)
4. Filter them by date
5. Return the most badass one (highest rated)
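The request above targets Elasticsearch 2.x. The filtered query was removed in Elasticsearch 5.0, so on newer versions the same steps are usually written with a bool query and a filter clause. The Python below is a sketch of that variant which only builds the request body; the helper names are made up, and the field names follow the answer.

```python
import json


def last_week_range():
    """Range clause shared by the top-level filter and the filter aggregation."""
    return {"range": {"comments.date": {"gte": "now-1w", "lte": "now"}}}


def build_top_comment_request():
    """Top-rated comment of the last week per item, using bool/filter."""
    top_comment = {
        "top_hits": {
            "size": 1,
            "sort": [{"comments.rating": {"order": "desc"}}],
        }
    }
    return {
        "size": 0,
        # Step 1: keep only items that have at least one recent comment.
        "query": {
            "bool": {
                "filter": [
                    {"nested": {"path": "comments", "query": last_week_range()}}
                ]
            }
        },
        "aggs": {
            # Step 2: one bucket per item id.
            "items": {
                "terms": {"field": "id", "size": 10},
                "aggs": {
                    # Step 3: step into the nested comments.
                    "nested": {
                        "nested": {"path": "comments"},
                        "aggs": {
                            # Steps 4 and 5: filter by date, keep the best one.
                            "filterComments": {
                                "filter": last_week_range(),
                                "aggs": {"topComments": top_comment},
                            }
                        },
                    }
                },
            }
        },
    }


print(json.dumps(build_top_comment_request(), indent=2))
```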
