Elasticsearch terms aggregation on an array: filter to buckets that match your query?

I'm using an Elasticsearch terms aggregation to bucket documents based on an array property. I'm running into an issue where I get back buckets that are not in my query, and I'd like to filter those out.
Let's say each document is a Post, and has an array property media which specifies which social media website the post is on (and may be empty):
{
id: 1
media: ["facebook", "twitter", "instagram"]
}
{
id: 2
media: ["twitter", "instagram", "tiktok"]
}
{
id: 3
media: ["instagram"]
}
{
id: 4
media: []
}
And, let's say there's another index of Users, which stores a favorite_media property of the same type.
{
id: 42
favorite_media: ["twitter", "instagram"]
}
I have a query that uses a terms lookup to filter, then runs a terms aggregation.
{
"query": {
"bool": {
"filter": {
"terms": {
"media": {
"index": "user_index",
"id": "42",
"path": "favorite_media"
}
}
}
}
},
"aggs": {
"Posts_by_media": {
"terms": {
"field": "media",
"size": 1000
}
}
}
}
This will result in:
{
...
"aggregations": {
"Posts_by_media": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "instagram",
"doc_count": 3
},
{
"key": "twitter",
"doc_count": 2
},
{
"key": "facebook",
"doc_count": 1
},
{
"key": "tiktok",
"doc_count": 1
}
]
}
}
}
Because media is an array property, every value of any document that matches the filter is used to create buckets, so I get buckets that don't match my filter. Here I want to get back only the buckets twitter and instagram, since those are the two values I'm filtering on (via the terms lookup).
I know terms aggregations offer an include option, but that doesn't work for me here since I'm using a terms lookup and don't know the data in favorite_media at query time.
How can I limit my buckets to be only those that match the filters in my query?
Thank you for your help!
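One possible workaround, sketched here under assumptions rather than taken from the thread: resolve favorite_media in a separate request first (for example, by fetching the user document), then pass the same list both to a plain terms filter and to the include option of the terms aggregation. The helper names build_posts_query and keep_matching_buckets are hypothetical, and the second helper shows the alternative of dropping unwanted buckets on the client after the original query.

```python
def build_posts_query(favorite_media):
    """Build a search body that filters and buckets on the same list of values.

    Assumes favorite_media was already fetched from the user document.
    """
    favorites = list(favorite_media)
    return {
        "size": 0,
        "query": {"bool": {"filter": {"terms": {"media": favorites}}}},
        "aggs": {
            "Posts_by_media": {
                # "include" with an array keeps only buckets with these exact keys.
                "terms": {"field": "media", "size": 1000, "include": favorites}
            }
        },
    }


def keep_matching_buckets(buckets, favorite_media):
    """Client-side alternative: drop buckets whose key is not a favorite."""
    wanted = set(favorite_media)
    return [bucket for bucket in buckets if bucket["key"] in wanted]


favorites = ["twitter", "instagram"]
body = build_posts_query(favorites)

# Buckets as returned by the original query in the question.
buckets = [
    {"key": "instagram", "doc_count": 3},
    {"key": "twitter", "doc_count": 2},
    {"key": "facebook", "doc_count": 1},
    {"key": "tiktok", "doc_count": 1},
]
filtered = keep_matching_buckets(buckets, favorites)
```

The first helper costs an extra round trip but lets Elasticsearch discard the unwanted buckets; the second keeps the single terms-lookup request and trims the response afterwards.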

Related

Elastic Search - Aggregating on Sub Aggregations

I am looking for a way to group aggregation results so I can filter them down. Currently my response is pretty large (> 1 MB), and I'm hoping to return only the top matching filters.
I'm not sure if Elastic is capable of grouping aggregations by the sub aggregation without using nesting, but I figured I would give it a try.
The filter data is stored in an array on each of my objects:
// document a
"attributeValues" : [
"A12345|V12345",
"A22345|V22345",
...
]
// document b
"attributeValues" : [
"A12345|V15555",
"A22345|V22345",
...
]
I am currently aggregating on the values and getting results like this:
{
"key": "A12345|V12345",
"doc_count": 10
},
{
"key": "A12345|V15555",
"doc_count": 7
},
{
"key": "A22345|V22345",
"doc_count": 5
},
I would like to be able to group these aggregations by the first part of the string so that I can return only the top 10 matches and get something like this:
"topAttributes" : {
"buckets" : [
{
"key" : "A12345",
"doc_count" : 17,
"attributes" : {
"buckets" : [
{
"key": "A12345|V12345",
"doc_count": 10
},
{
"key": "A12345|V15555",
"doc_count": 7
},
I have tried to do this with a script on the terms aggregation, but I cannot find anywhere online (I have checked many questions) how to access the sub-aggregation's results from it.
The script would look something like this:
GET test_index/_search
{
"size" : 0,
"aggs": {
"attributeValuesTop": {
"terms": {
"size": 10,
"script": {
"source": """
return attributes.splitOnToken('|')[1];
"""
}
},
"aggs": {
"attributes": {
"terms": {
"field": "attributeValues",
"size": 10000
}
}
}
}
}
}
NOTE: I know we can use a nested solution, but nested is too slow for the number of documents we have (millions of records) and our target of sub-300 ms searches.
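As a point of comparison, the grouping described above can be done on the client once the flat buckets are returned. The sketch below is only an illustration: the function name group_by_attribute and its parameters are hypothetical, and it does not reduce the size of the Elasticsearch response itself.

```python
from collections import defaultdict


def group_by_attribute(buckets, top_n=10, separator="|"):
    """Group flat terms buckets by the part of the key before the separator."""
    groups = defaultdict(list)
    for bucket in buckets:
        attribute = bucket["key"].split(separator, 1)[0]
        groups[attribute].append(bucket)

    grouped = [
        {
            "key": attribute,
            # Sum of the member counts; a document holding two values of the
            # same attribute is counted twice.
            "doc_count": sum(b["doc_count"] for b in members),
            "attributes": {"buckets": members},
        }
        for attribute, members in groups.items()
    ]
    grouped.sort(key=lambda group: group["doc_count"], reverse=True)
    return grouped[:top_n]


flat_buckets = [
    {"key": "A12345|V12345", "doc_count": 10},
    {"key": "A12345|V15555", "doc_count": 7},
    {"key": "A22345|V22345", "doc_count": 5},
]
top_attributes = group_by_attribute(flat_buckets)
```

With the sample buckets from the question, the first group is A12345 with a summed count of 17, matching the desired output shape.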

Elasticsearch filter based on field similarity

For reference, I'm using Elasticsearch 6.4.0
I have an Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits whose text field values are too similar. My query is:
{
"size": 10,
"collapse": {
"field": "author_id"
},
"query": {
"function_score": {
"boost_mode": "replace",
"score_mode": "avg",
"functions": [
{
//my custom query function
}
],
"query": {
"bool": {
"must_not": [
{
"term": {
"author_id": MY_ID
}
}
]
}
}
}
},
"aggs": {
"book_name_sample": {
"sampler": {
"shard_size": 10
},
"aggs": {
"frequent_words": {
"significant_text": {
"field": "book_name",
"filter_duplicate_text": true
}
}
}
}
}
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). The problem is that, for some people, it returns books with very similar names (e.g., The Life of George Washington, Good Times with George Washington, Who Was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using a sampler aggregation with a significant_text sub-aggregation to group the hits by text similarity, and the query gives me something like:
...,
"aggregations": {
"book_name_sample": {
"doc_count": 10,
"frequent_words": {
"doc_count": 10,
"bg_count": 482626,
"buckets": [
{
"key": "George",
"doc_count": 3,
"score": 17.278715785140975,
"bg_count": 9718
},
{
"key": "Washington",
"doc_count": 3,
"score": 15.312204414323656,
"bg_count": 10919
}
]
}
}
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch, i.e., remove hits with a book_name_sample doc_count less than X? I know I can do this in PHP or whatever language consumes the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregation like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents that the hit count falls below the requested size, is it possible to tell ES to fetch the next top-scoring hits so that the hit count is filled out?
Why not use top hits inside the aggregation to get the relevant documents that match each bucket? You can specify how many top hits you want inside the top hits aggregation, so this gives you a fixed number of documents for each bucket.

Elasticsearch counts of multiple indices

I am creating a report to compare actual count from database with indexed records.
I have three indices index1, index2 and index3
To get the count for a single index, I am using the following URL:
http://localhost:9200/index1/_count?q=_type:invoice
=> {"count":50,"_shards":{"total":5,"successful":5,"failed":0}}
For multiple indices:
http://localhost:9200/index1,index2/_count?q=_type:invoice
=> {"count":80,"_shards":{"total":5,"successful":5,"failed":0}}
Now the counts are added together, but I want them grouped by index. How can I also pass filters, or group by a specific field, to get output like this:
{"index1_count":50,"index2_count":50,"approved":10,"rejected":40 ,"_shards":{"total":5,"successful":5,"failed":0}}
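You can use _search?search_type=count (on Elasticsearch 2.0 and later, send "size": 0 in the request body instead, since the count search type was deprecated and later removed) and run a terms aggregation on the _index field to distinguish between the indices: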
GET /index1,index2/_search?search_type=count
{
"aggs": {
"by_index": {
"terms": {
"field": "_index"
}
}
}
}
and the result would be something like this:
"aggregations": {
"by_index": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "index1",
"doc_count": 50
},
{
"key": "index2",
"doc_count": 80
}
]
}
}
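To get from that response to the flat shape asked for in the question, the buckets can be reshaped on the client. This is a minimal sketch; the function name counts_by_index is hypothetical, and the sample response simply repeats the values shown above.

```python
def counts_by_index(response, agg_name="by_index"):
    """Turn the terms-aggregation buckets into an {"<index>_count": n} mapping."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {f"{bucket['key']}_count": bucket["doc_count"] for bucket in buckets}


sample_response = {
    "aggregations": {
        "by_index": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "index1", "doc_count": 50},
                {"key": "index2", "doc_count": 80},
            ],
        }
    }
}
report = counts_by_index(sample_response)
```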

Retrieve document frequency for terms in query result with aggregations

For some of my queries to ElasticSearch I want three pieces of information back:
Which terms T occurred in the result document set?
How often does each element of T occur in the result document set?
How often does each element of T occur in the entire index (--> document frequency)?
The first two points are easily answered using the default terms facet or, nowadays, the terms aggregation.
So my question is really about the third point.
Before ElasticSearch 1.x, i.e. before the switch to the 'aggregation' paradigm, I could use a term facet with the 'global' option set to true and a QueryFilter to get the document frequency ('global counts') of the exact terms occurring in the document set specified by the QueryFilter.
At first I thought I could do the same thing using a global aggregation, but it seems I can't. The reason is, if I understand correctly, that the original facet mechanism was centered around terms, whereas aggregation buckets are defined by the set of documents belonging to each bucket.
I.e. specifying the global option of a term facet with a QueryFilter first determined the terms hit by the filter and then computed facet values. Since the facet was global I would receive the document counts.
With aggregations, it's different. The global aggregation can only be used as a top-level aggregation, causing it to ignore the current query results and compute its sub-aggregations (e.g., a terms aggregation) on all documents in the index. For me, that's too much, since I WANT to restrict the returned terms ('buckets') to the terms in the result document set. But if I use a filter sub-aggregation with a terms sub-aggregation inside it, I restrict the term buckets to the filter again, and so retrieve normal facet counts rather than document frequencies. The reason is that the buckets are determined after the filter, so they are "too small". I don't want to restrict the bucket size; I want to restrict the buckets to the terms in the query result set.
How can I get the document frequency of those terms in a query result set using aggregations (since facets are deprecated and will be removed)?
Thanks for your time!
EDIT: Here comes an example of how I tried to achieve the desired behaviour.
I will define two aggregations:
global_agg_with_filter_and_terms
global_agg_with_terms_and_filter
Both have a global aggregation at the top because that's the only valid position for it. Then, in the first aggregation, I first filter the results to the original query and then apply a terms sub-aggregation.
In the second aggregation, I do mostly the same, only that here the filter aggregation is a sub-aggregation of the terms aggregation. Hence the similar names, only the order of aggregation differs.
{
"query": {
"query_string": {
"query": "text: my query string"
}
},
"aggs": {
"global_agg_with_filter_and_terms": {
"global": {},
"aggs": {
"filter_agg": {
"filter": {
"query": {
"query_string": {
"query": "text: my query string"
}
}
},
"aggs": {
"terms_agg": {
"terms": {
"field": "facets"
}
}
}
}
}
},
"global_agg_with_terms_and_filter": {
"global": {},
"aggs": {
"document_frequency": {
"terms": {
"field": "facets"
},
"aggs": {
"term_count": {
"filter": {
"query": {
"query_string": {
"query": "text: my query string"
}
}
}
}
}
}
}
}
}
}
Response:
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 221,
"max_score": 0.9839197,
"hits": <omitted>
},
"aggregations": {
"global_agg_with_filter_and_terms": {
"doc_count": 1978,
"filter_agg": {
"doc_count": 221,
"terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "fid8",
"doc_count": 155
},
{
"key": "fid6",
"doc_count": 40
},
{
"key": "fid9",
"doc_count": 10
},
{
"key": "fid5",
"doc_count": 9
},
{
"key": "fid13",
"doc_count": 5
},
{
"key": "fid7",
"doc_count": 2
}
]
}
}
},
"global_agg_with_terms_and_filter": {
"doc_count": 1978,
"document_frequency": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "fid8",
"doc_count": 1050,
"term_count": {
"doc_count": 155
}
},
{
"key": "fid6",
"doc_count": 668,
"term_count": {
"doc_count": 40
}
},
{
"key": "fid9",
"doc_count": 67,
"term_count": {
"doc_count": 10
}
},
{
"key": "fid5",
"doc_count": 65,
"term_count": {
"doc_count": 9
}
},
{
"key": "fid7",
"doc_count": 63,
"term_count": {
"doc_count": 2
}
},
{
"key": "fid13",
"doc_count": 55,
"term_count": {
"doc_count": 5
}
},
{
"key": "fid10",
"doc_count": 11,
"term_count": {
"doc_count": 0
}
},
{
"key": "fid11",
"doc_count": 9,
"term_count": {
"doc_count": 0
}
},
{
"key": "fid12",
"doc_count": 5,
"term_count": {
"doc_count": 0
}
}
]
}
}
}
}
First, look at the first two term buckets returned by both aggregations, with keys fid8 and fid6. We can easily see that those terms appear in the result set 155 and 40 times, respectively. Now look at the second aggregation, global_agg_with_terms_and_filter. The terms aggregation is within the scope of the global aggregation, so here we can actually see the document frequencies, 1050 and 668, respectively. So this part looks good. The issue arises when you scan further down the list of term buckets, to the buckets with keys fid10 to fid12. While we receive their document frequency, we can also see that their term_count is 0. This is because those terms did not occur in the results of our query, which we also used for the filter sub-aggregation. So the problem is that the document frequency and the facet count with regard to the actual query result are returned for ALL terms (global scope!). But I need this to be done only for the terms that occurred in the query result, i.e., for exactly the terms returned by the first aggregation, global_agg_with_filter_and_terms.
Perhaps there is a possibility to define some kind of filter that removes all buckets whose filter sub-aggregation term_count has a zero doc_count?
Hello and sorry if the answer is late.
You should have a look at the Significant Terms aggregation. Like the terms aggregation, it returns one bucket for each term occurring in the results set, with the number of occurrences available through doc_count, but you also get the number of occurrences in a background set through bg_count. This means it only creates buckets for terms appearing in documents of your query results set.
The default background set comprises all documents in the index, but it can be narrowed down to any subset you want using background_filter.
You can use a scripted bucket scoring function to rank the buckets the way you want by combining several metrics:
_subset_freq: number of documents the term appears in the results set,
_superset_freq: number of documents the term appears in the background set,
_subset_size: number of documents in the results set,
_superset_size: number of documents in the background set.
Request:
{
"query": {
"query_string": {
"query": "text: my query string"
}
},
"aggs": {
"terms": {
"significant_terms": {
"field": "facets",
"script_heuristic": {
"script": "_subset_freq"
},
"size": 100
}
}
}
}
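If the terms-then-filter aggregation from the question is kept instead, the buckets with a zero term_count can also be dropped on the client. The sketch below assumes the response shape shown in the question; the function name document_frequencies and the output keys result_count and document_frequency are hypothetical.

```python
def document_frequencies(response):
    """Keep only terms that occur in the query result, with both counts."""
    buckets = response["aggregations"]["global_agg_with_terms_and_filter"][
        "document_frequency"
    ]["buckets"]
    return {
        bucket["key"]: {
            "result_count": bucket["term_count"]["doc_count"],
            "document_frequency": bucket["doc_count"],
        }
        for bucket in buckets
        if bucket["term_count"]["doc_count"] > 0
    }


# A shortened copy of the response from the question.
sample_response = {
    "aggregations": {
        "global_agg_with_terms_and_filter": {
            "doc_count": 1978,
            "document_frequency": {
                "buckets": [
                    {"key": "fid8", "doc_count": 1050, "term_count": {"doc_count": 155}},
                    {"key": "fid6", "doc_count": 668, "term_count": {"doc_count": 40}},
                    {"key": "fid10", "doc_count": 11, "term_count": {"doc_count": 0}},
                ]
            },
        }
    }
}
frequencies = document_frequencies(sample_response)
```

Note that this only sees the buckets the terms aggregation actually returned, so the aggregation's size must be large enough to cover every term of interest.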

elasticsearch group-by multiple fields

I am looking for the best way to group data in Elasticsearch.
Elasticsearch doesn't support something like 'group by' in SQL.
Let's say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course you need some metadata (icon, link-target, seo-titles,...) and custom sorting for the categories.
Using Aggregations:
Example: https://found.no/play/gist/8124563
Looks usable if you have to group by one field, and need some extra fields.
Using multiple Fields in a Facet (won't work):
Example: https://found.no/play/gist/1aa44e2114975384a7c2
Here we lose the relationship between the different fields.
Building funny Facets:
https://found.no/play/gist/8124810
For example, building a category tree using these 3 "solutions" sucks.
Solution 1 May work (ES 1 isn't stable right now)
Solution 2 Doesn't work
Solution 3 Is a pain because it feels ugly, you need to prepare a lot of data and the facets blow up.
Maybe an alternative could be not to store any category data in ES, just the id
https://found.no/play/gist/a53e46c91e2bf077f2e1
Then you could get the associated category from another system, like redis, memcache or the database.
This would end up in clean code, but the performance could become a problem.
For example, loading 1k categories from Memcache / Redis / a database could be slow.
Another problem is that syncing two databases is harder than syncing one.
How do you deal with such problems?
I am sorry for the links, but I can't post more than 2 in one article.
The aggregations API allows grouping by multiple fields, using sub-aggregations. Suppose you want to group by fields field1, field2 and field3:
{
"aggs": {
"agg1": {
"terms": {
"field": "field1"
},
"aggs": {
"agg2": {
"terms": {
"field": "field2"
},
"aggs": {
"agg3": {
"terms": {
"field": "field3"
}
}
}
}
}
}
}
}
Of course this can go on for as many fields as you'd like.
Update:
For completeness, here is how the output of the above query looks. Also below is python code for generating the aggregation query and flattening the result into a list of dictionaries.
{
"aggregations": {
"agg1": {
"buckets": [{
"doc_count": <count>,
"key": <value of field1>,
"agg2": {
"buckets": [{
"doc_count": <count>,
"key": <value of field2>,
"agg3": {
"buckets": [{
"doc_count": <count>,
"key": <value of field3>
},
{
"doc_count": <count>,
"key": <value of field3>
}, ...
]
},
{
"doc_count": <count>,
"key": <value of field2>,
"agg3": {
"buckets": [{
"doc_count": <count>,
"key": <value of field3>
},
{
"doc_count": <count>,
"key": <value of field3>
}, ...
]
}, ...
]
},
{
"doc_count": <count>,
"key": <value of field1>,
"agg2": {
"buckets": [{
"doc_count": <count>,
"key": <value of field2>,
"agg3": {
"buckets": [{
"doc_count": <count>,
"key": <value of field3>
},
{
"doc_count": <count>,
"key": <value of field3>
}, ...
]
},
{
"doc_count": <count>,
"key": <value of field2>,
"agg3": {
"buckets": [{
"doc_count": <count>,
"key": <value of field3>
},
{
"doc_count": <count>,
"key": <value of field3>
}, ...
]
}, ...
]
}, ...
]
}
}
}
The following Python code performs the group-by given the list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch thanks to this)
def group_by(es, fields, include_missing):
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}
    if include_missing:
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing
    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }
        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing
        current_level_terms = next_level_terms
    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)

def get_docs_from_agg_result(agg_result, fields, include_missing):
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        buckets.append(agg_result[(current_field + '_missing')])
    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]
    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)
    return result
You can use a composite aggregation as follows. This type of aggregation also paginates the results when there are more buckets than Elasticsearch returns in a single response. By passing the last returned key in the 'after' field, you can fetch the remaining buckets:
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"field1": {
"terms": {
"field": "field1"
}
}
},
{
"field2": {
"terms": {
"field": "field2"
}
}
},
{
"field3": {
"terms": {
"field": "field3"
}
}
}
]
}
}
}
You can find more details on the Elasticsearch documentation page bucket-composite-aggregation.
Some developers will probably be looking for the same implementation in Spring Data Elasticsearch and the Elasticsearch Java API.
Here it is:
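The pagination loop can be sketched as follows. This is an illustration under assumptions: search stands for any hypothetical callable that takes a request body and returns the response as a dict (for instance a thin wrapper around a client's search call), and fake_search below only simulates a three-bucket index so the loop can be exercised without a cluster.

```python
def iter_composite_buckets(search, fields, page_size=100):
    """Yield every composite bucket, following after_key from page to page."""
    sources = [{field: {"terms": {"field": field}}} for field in fields]
    after_key = None
    while True:
        composite = {"size": page_size, "sources": sources}
        if after_key is not None:
            composite["after"] = after_key
        body = {"size": 0, "aggs": {"my_buckets": {"composite": composite}}}
        result = search(body)["aggregations"]["my_buckets"]
        buckets = result["buckets"]
        if not buckets:
            return
        yield from buckets
        after_key = result.get("after_key")
        if after_key is None:
            return


# Stand-in for a real cluster: three buckets served page by page.
ALL_BUCKETS = [
    {"key": {"field1": "a"}, "doc_count": 3},
    {"key": {"field1": "b"}, "doc_count": 2},
    {"key": {"field1": "c"}, "doc_count": 1},
]


def fake_search(body):
    composite = body["aggs"]["my_buckets"]["composite"]
    after = composite.get("after")
    start = 0
    if after is not None:
        start = [bucket["key"] for bucket in ALL_BUCKETS].index(after) + 1
    page = ALL_BUCKETS[start:start + composite["size"]]
    result = {"buckets": page}
    if page:
        result["after_key"] = page[-1]["key"]
    return {"aggregations": {"my_buckets": result}}


keys = [
    bucket["key"]["field1"]
    for bucket in iter_composite_buckets(fake_search, ["field1"], page_size=2)
]
```

The generator stops when a page comes back empty or without an after_key, so callers can iterate without knowing the total number of buckets in advance.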
List<FieldObject> fieldObjectList = Lists.newArrayList();
SearchQuery aSearchQuery = new NativeSearchQueryBuilder().withQuery(matchAllQuery()).withIndices(indexName).withTypes(type)
        .addAggregation(
                terms("ByField1").field("field1").subAggregation(AggregationBuilders.terms("ByField2").field("field2")
                        .subAggregation(AggregationBuilders.terms("ByField3").field("field3")))
        )
        .build();
Aggregations aField1Aggregations = elasticsearchTemplate.query(aSearchQuery, new ResultsExtractor<Aggregations>() {
    @Override
    public Aggregations extract(SearchResponse aResponse) {
        return aResponse.getAggregations();
    }
});
Terms aField1Terms = aField1Aggregations.get("ByField1");
aField1Terms.getBuckets().stream().forEach(aField1Bucket -> {
    String field1Value = aField1Bucket.getKey();
    Terms aField2Terms = aField1Bucket.getAggregations().get("ByField2");
    aField2Terms.getBuckets().stream().forEach(aField2Bucket -> {
        String field2Value = aField2Bucket.getKey();
        Terms aField3Terms = aField2Bucket.getAggregations().get("ByField3");
        aField3Terms.getBuckets().stream().forEach(aField3Bucket -> {
            String field3Value = aField3Bucket.getKey();
            Long count = aField3Bucket.getDocCount();
            FieldObject fieldObject = new FieldObject();
            fieldObject.setField1(field1Value);
            fieldObject.setField2(field2Value);
            fieldObject.setField3(field3Value);
            fieldObject.setCount(count);
            fieldObjectList.add(fieldObject);
        });
    });
});
The following imports are required:
import static org.elasticsearch.index.query.QueryBuilders.matchAllQuery;
import static org.elasticsearch.search.aggregations.AggregationBuilders.terms;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.collect.Lists;
import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.TermFilterBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.filter.InternalFilter;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.ResultsExtractor;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;
Sub-aggregations are what you need. Although this is never stated explicitly in the docs, it follows from how aggregations are structured:
each sub-aggregation is computed as if the query had been filtered by the bucket of the parent aggregation.
At least, that is what appears to happen.
{
"aggregations": {
"VALUE1AGG": {
"terms": {
"field": "VALUE1"
},
"aggregations": {
"VALUE2AGG": {
"terms": {
"field": "VALUE2"
}
}
}
}
}
}
