"Filter then Aggregation" or just "Filter Aggregation"? - elasticsearch

I have been working with Elasticsearch recently and found that I can get almost the same result in two ways, but I have no clear idea of the DIFFERENCE between them.
"Filter then Aggregation"
POST kibana_sample_data_flights/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"term": {
"DestCountry": "CA"
}
}
}
},
"aggs": {
"ca_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
"Filter Aggregation"
POST kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"ca": {
"filter": {
"term": {
"DestCountry": "CA"
}
},
"aggs": {
"_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
}
}
My Questions
Why are there two similar functions? I believe I am wrong about them being the same, but what's the difference then?
(please do ignore the result format, it's not the question I am asking ;p)
Which is better if I want to filter out the unrelated/unmatched documents first and then run the aggregation on a large number of documents?

When you use it in "query", you're creating a context over ALL the docs in your index. In this case it acts like a normal filter, as in: SELECT * FROM index WHERE (my_filter_condition1 AND my_filter_condition2 OR my_filter_condition3...).
When you use it in "aggs", you're creating a context over the docs that reach the aggregation, which may or may not have been filtered by a query beforehand. Say you have a structure like:
#OPTION A
{
"aggs":{
"t_shirts" : {
"filter" : { "term": { "type": "t-shirt" } }
}
}
}
Without a "query", this is exactly the same as having
#OPTION B
{
"query":{
"bool": {
"filter" : { "term": { "type": "t-shirt" } }
}
}
}
BUT the results will be returned in different fields.
In Option A, the results will be returned in the aggregations field.
In Option B, the results will be returned in the hits field.
I would recommend always applying your filters in the query part, so that subsequent aggregations work on the already filtered docs. Also, aggregations cost more than queries in terms of performance.
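To make the scoping difference concrete, here is a toy in-memory model in Python. It is not Elasticsearch and says nothing about performance; the `search` helper and the sample documents are made up for illustration. It only mimics the rule described above: a filter in "query" limits both the hits and the aggregations, while a filter aggregation limits only its own bucket.

```python
from collections import Counter

# Hypothetical sample documents, loosely modelled on the flights index.
DOCS = [
    {"DestCountry": "CA", "DestWeather": "Rain"},
    {"DestCountry": "CA", "DestWeather": "Sunny"},
    {"DestCountry": "CA", "DestWeather": "Rain"},
    {"DestCountry": "US", "DestWeather": "Sunny"},
    {"DestCountry": "IT", "DestWeather": "Clear"},
]


def matches(doc, term):
    """Return True if the doc matches every field/value pair of a term filter."""
    return all(doc.get(field) == value for field, value in term.items())


def search(docs, query_term=None, agg_term=None, agg_field="DestWeather"):
    """Toy model: the query scopes hits AND aggregations, the agg filter only its bucket."""
    hits = [d for d in docs if query_term is None or matches(d, query_term)]
    bucket = [d for d in hits if agg_term is None or matches(d, agg_term)]
    return {
        "hits_total": len(hits),
        "agg_doc_count": len(bucket),
        "agg_terms": dict(Counter(d[agg_field] for d in bucket)),
    }


# "Filter then Aggregation": the filter lives in the query.
filter_then_agg = search(DOCS, query_term={"DestCountry": "CA"})
# "Filter Aggregation": no query, the filter lives inside the aggregation.
filter_agg = search(DOCS, agg_term={"DestCountry": "CA"})

print(filter_then_agg)
print(filter_agg)
```

In this model both calls produce the same weather counts; only the number of matching hits differs.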
Hope this is helpful! :D

Both filters, used in isolation, are equivalent. If you load no results (hits), there is no difference. But you can combine listing and aggregations: you can query or filter your docs for the listing, and calculate aggregations on a bucket further limited by the aggs filter. Like this:
POST kibana_sample_data_flights/_search
{
"size": 100,
"query": {
"bool": {
"filter": {
"term": {
... some other filter
}
}
}
},
"aggs": {
"ca_filter": {
"filter": {
"term": {
"DestCountry": "CA"
}
},
"aggs": {
"ca_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
}
}
But more likely you will need it the other way round, i.e. aggregate over all docs to display summary information while listing only the docs from a specific query. In this case you need to combine aggregations with post_filter.
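A sketch of that second case, written as a Python dict so it can be serialized and sent with any HTTP client. `post_filter` is the top-level search parameter mentioned above; the aggregation name `all_weathers` and the choice of fields are assumptions borrowed from the question.

```python
import json

# Aggregate over every document, but list only the CA flights as hits.
body = {
    "size": 100,
    "aggs": {
        # Not affected by post_filter: computed over all documents matched by the query.
        "all_weathers": {"terms": {"field": "DestWeather"}}
    },
    # Applied after the aggregations are computed; narrows the hits only.
    "post_filter": {"term": {"DestCountry": "CA"}},
}

print(json.dumps(body, indent=2))
```

Since there is no "query" here, the aggregation sees the whole index while the hits list contains only the documents that pass the post_filter.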

An answer from @Val's comment, quoted here for reference:
In option A, the aggregation will be run on ALL documents. In option B, the documents are first filtered and the aggregation will be run only on the selected documents. Say you have 10M documents and the filter selects only 100 of them; it's pretty evident that option B will always be faster.

Related

Filter documents prior to bucketing using GeoTile Aggregation in Elasticsearch

I am looking for an example where documents are filtered prior to bucketing via the GeoTile aggregation. For example, I would like to have buckets that hold the number of documents where some value is greater than x. Any pointers would be appreciated. Right now I have:
{
"aggs": {
"avg_my_field": {
"avg": {
"field": "properties.my_field"
}
},
"aggs": {
"large-grid": {
"geotile_grid": {
"field": "coordinates",
"precision": 8
}
}
}
}
}
I don't know where to go from here. Any pointers would be appreciated.
Simply add a top-level filter aggregation.
In pseudo code:
POST /your-index/_search
{
aggs:
filter_agg_name:
filter:
...actual filters
aggs:
...the rest of your aggs
}
Applied to your particular use case:
POST _search
{
"aggs": {
"my_applicable_filters": {
"filter": {
"bool": {
"must": [
{
"range": {
"some_numeric_or_date_field": {
"gte": 42
}
}
}
]
}
},
"aggs": {
"avg_my_field": {
"avg": {
"field": "properties.my_field"
}
},
"large-grid": {
"geotile_grid": {
"field": "coordinates",
"precision": 8
}
}
}
}
}
}
Note that your original aggregation query wasn't syntactically correct. You were close but keep in mind that:
1. Some aggregations can have direct children (sub-aggregations) of the form:
POST /your-index/_search
{
aggs:
top_level_agg_name:
agg_type:
...agg_def
aggs:
1st_child_name:
...1st_child_defs
2nd_child_name:
...2nd_child_defs
...
}
I said some because the avg aggregation does not support sub-aggregations (since it's not a bucket aggregation). That's the reason I've applied the following instead:
2. Aggregations can run irrespective of each other while specified in a single request:
POST /your-index/_search
{
aggs:
some_agg_name:
agg_type:
...agg_def
other_agg_name:
agg_type:
...agg_def
...
}
That way, you can get the average of properties.my_field AND geo-cluster your coordinates at the same time.
Conversely, when you realize that geotile_grid is indeed a bucket aggregation capable of accepting sub-aggregations, you can first group your docs by the corresponding geo hash and then calculate the average. Now that I think about it, that may've been your original intent 😉.
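A sketch of that alternative, with the average nested under each tile, built as a Python dict. The field names come from the question, the range field is the same placeholder used in the request above, and `build_tile_average_body` is a made-up helper name.

```python
import json


def build_tile_average_body(min_value, precision=8):
    """Filter first, bucket by geotile, then average inside each tile."""
    return {
        "size": 0,
        "aggs": {
            "my_applicable_filters": {
                "filter": {
                    "range": {"some_numeric_or_date_field": {"gte": min_value}}
                },
                "aggs": {
                    "large-grid": {
                        "geotile_grid": {
                            "field": "coordinates",
                            "precision": precision,
                        },
                        # Sub-aggregation: one average per geotile bucket.
                        "aggs": {
                            "avg_my_field": {
                                "avg": {"field": "properties.my_field"}
                            }
                        },
                    }
                },
            }
        },
    }


body = build_tile_average_body(42)
print(json.dumps(body, indent=2))
```

The only structural change from the sibling version is that `avg_my_field` moves inside `large-grid`, so each bucket carries its own average.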
Speaking of moments of clarity, you can learn a lot about how aggregations relate to each other in my recently released Elasticsearch Handbook.

how to achieve an exists filter on ES5.0?

The exists filter has been replaced by an exists query in ES5.0.
So how can we achieve the equivalent within the same query? In other words, we don't want to run two queries but just one for various aggregations, including the exists count.
So I want to count the number of times the field "the_field" exists (or is not null)
"aggs":{
"exists_count":{
"filter":{
"exists":{
"field":"the_field"
}
}
}
}
I think you can use a stats aggregation; the count it returns covers only the values actually present for the field:
{ "aggs" :
{ "time_stats" :
{ "extended_stats" :
{ "field" : "time" }
}
}
}
See the Elasticsearch stats aggregation documentation.
With Elastic 5.0, filters weren't so much replaced by queries as combined with them. Syntactically they look the same, but the context in which you use a clause determines whether it is interpreted as a query (and factors into scoring) or as a filter that simply weeds out documents. The code below should achieve exactly what you want:
{
"query": {
"match_all": {}
},
"aggs": {
"field_exists": {
"filter": {
"exists": {
"field": "name"
}
}
}
}
}
The aggregation returned will look something like this, with doc_count representing the number of documents where the "name" field exists. Hope this helps!
{
"aggregations": {
"field_exists": {
"doc_count": 11984
}
}
}
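If the response is handled in client code, the count can be read straight from that structure. A minimal Python sketch, assuming the response has already been parsed into a dict; `exists_count` is a hypothetical helper name, not part of any client library.

```python
def exists_count(response, agg_name="field_exists"):
    """Return the doc_count of a filter aggregation, or 0 if it is absent."""
    agg = response.get("aggregations", {}).get(agg_name, {})
    return agg.get("doc_count", 0)


# Sample response copied from the answer above.
sample = {"aggregations": {"field_exists": {"doc_count": 11984}}}
print(exists_count(sample))  # 11984
```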

Elasticsearch terms query on array of values

I have data on ElasticSearch index that looks like this
{
"title": "cubilia",
"people": [
"Ling Deponte",
"Dana Madin",
"Shameka Woodard",
"Bennie Craddock",
"Sandie Bakker"
]
}
Is there a way for me to search for all the people whose name starts with
"ling" (should be case insensitive) and get distinct terms properly cased, i.e. "Ling Deponte", not "ling deponte"?
I am fine with changing the mappings on the index in any way.
Edit: this does what I want, but it is a really bad query:
{
"size": 0,
"aggs": {
"person": {
"filter": {
"bool":{
"should":[
{"regexp":{
"people.raw":"(.* )?[lL][iI][nN][gG].*"
}}
]}
},
"aggs": {
"top-colors": {
"terms": {
"size":10,
"field": "people.raw",
"include":
{
"pattern": ["(.* )?[lL][iI][nN][gG].*"]
}
}
}
}
}
}
}
people.raw is not_analyzed
Yes, and you can do it without a regular expression by taking advantage of Elasticsearch's full text capabilities.
GET /test/_search
{
"query": {
"match_phrase": {
"people": "Ling"
}
}
}
Note: This could also be match or match_phrase_prefix in this case. The match_phrase* queries imply an order of the values in the text. match simply looks for any of the values. Since you only have one value, it's pretty much irrelevant.
The problem is that you cannot limit the document responses to just that name because the search API returns documents. With that said, you can use nested documents and get the desired behavior via inner_hits.
You want to avoid wildcard prefixing whenever possible because it simply does not work at scale. To put it in SQL terms, that's like doing a full table scan; you effectively lose the benefit of the inverted index because it has to walk it entirely to find the actual start.
Combining the two should work pretty well though. Here, I use the query to whittle down the results to what you are interested in, then I use your inner aggregation to include terms based on their value only.
{
"size": 0,
"query": {
"match_phrase": {
"people": "Ling"
}
},
"aggs": {
"person": {
"terms": {
"size":10,
"field": "people.raw",
"include": {
"pattern": ["(.* )?[lL][iI][nN][gG].*"]
}
}
}
}
}
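The include pattern above spells out both cases of every letter by hand. A small Python sketch of how such a pattern could be generated for any prefix, and sanity-checked locally. Python's `re` module stands in for Lucene's regular expression engine here, which is an assumption: the two syntaxes differ in places, and `re.fullmatch` is used because Lucene patterns always match the whole term. The helper name and the two extra names in the list are hypothetical.

```python
import re


def case_insensitive_prefix_pattern(prefix):
    """Build a pattern matching values in which some word starts with prefix, ignoring case."""
    parts = []
    for ch in prefix:
        if ch.isalpha():
            # Spell out both cases, e.g. "l" -> "[lL]".
            parts.append("[" + ch.lower() + ch.upper() + "]")
        else:
            parts.append(re.escape(ch))
    # "(.* )?" lets the prefix start the value or follow a space.
    return "(.* )?" + "".join(parts) + ".*"


pattern = case_insensitive_prefix_pattern("ling")
people = ["Ling Deponte", "Dana Madin", "Mei Lingard", "Darling Smith"]
matched = [name for name in people if re.fullmatch(pattern, name)]

print(pattern)
print(matched)
```

"Darling Smith" is rejected because "ling" does not start a word there, which is the behavior the "(.* )?" prefix is meant to give.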
Hi, please find the query below; it may help with your request.
GET skills/skill/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"wildcard": {
"skillNames.raw": "jav*"
}
}
]
}
}
}
}
}
My intention is to find documents whose skill names start with "jav".

Return parent data with child document from Elasticsearch

Is it possible to return parent data with a search for child documents within an Elasticsearch query?
I have two document types, e.g. Book and Chapter, that are related as Parent/Child (not nested).
I want to run a search on the child document and return the child document, with some of the fields from the parent document. I'm trying to avoid executing a separate query on the parent.
Update
The only way possible I can find is to use the has_child query and then a series of aggregations to drill back to the children and apply the query/filter again. However, this seems overly complicated and inefficient.
GET index/_search
{
"size": 10,
"query": {
"has_child": {
"type": "chapter",
"query": {
"term": {
"field": "value"
}
}
}
},
"aggs": {
"name1": {
"terms": {
"size": 50,
"field": "id"
},
"aggs": {
"name2": {
"top_hits": {
"size": 50
}
},
"name3": {
"children": {
"type": "type2"
},
"aggs": {
"docFilter": {
"filter": {
"query": {
"match": {
"_all": "value"
}
}
},
"aggs": {
"docs": {
"top_hits": {
"size": 50
}
}
}
}
}
}
}
}
}
}
It is possible to do a has_child query to return the parent docs, with a top hits aggregation to return the child docs, but it is a bit cumbersome.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
The Inner Hits feature that is due to be released in 1.5.0 will do what you want.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/search-request-inner-hits.html
You could build the source from master and try it out.
This can now be done with Elasticsearch. Just use 'has_parent' in the search query:
'has_parent': {
'parent_type': 'book',
'query': {
'match_all': {}
},
'inner_hits': {}
}
The results will appear in the inner_hits of the response.
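A minimal Python sketch of reading those inner hits from a parsed response. The response below is a hypothetical, heavily trimmed example, and the inner hit name ("book" here) is an assumption that depends on the Elasticsearch version and on whether a name is set explicitly in inner_hits.

```python
def children_with_parent(response, inner_name="book"):
    """Pair each child hit's _source with the _source of its first parent inner hit."""
    merged = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(inner_name, {})
        parents = inner.get("hits", {}).get("hits", [])
        merged.append({
            "child": hit["_source"],
            "parent": parents[0]["_source"] if parents else None,
        })
    return merged


# Hypothetical, heavily trimmed response for illustration only.
sample = {
    "hits": {
        "hits": [
            {
                "_source": {"title": "Chapter 1"},
                "inner_hits": {
                    "book": {"hits": {"hits": [{"_source": {"title": "Some Book"}}]}}
                },
            }
        ]
    }
}

print(children_with_parent(sample))
```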
As Dan Tuffery says in his comment, this can currently be achieved with Inner Hits. In Java, the following snippet may make it easier to understand.
SearchResponse searchResponse = this.transportClient.prepareSearch("your_index")
.setTypes("your_type")
.setQuery(QueryBuilders.filteredQuery(
null,
FilterBuilders.hasParentFilter(
"parent_type_name",
FilterBuilders.termFilter("foo", "foo"))
.innerHit(new QueryInnerHitBuilder()))
)
.execute().actionGet();
List<YourObject> list = new ArrayList<>();
for (SearchHit searchHit : searchResponse.getHits().getHits()) {
YourObject yourObject = this.objectMapper.readValue(searchHit.getSourceAsString(), YourObject.class);
yourObject.setYourParentObject(this.objectMapper.readValue(searchHit.getInnerHits().get("parent_type_name").getAt(0).getSourceAsString(), YourParentObject.class));
list.add(yourObject);
}

Is there a way to have elasticsearch return a hit per generated bucket during an aggregation?

Right now I have a query like this:
{
"query": {
"bool": {
"must": [
{
"match": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
},
"aggs": {
"query": {
"terms": [
{
"field": "query",
"size": 3
}
]
}
}
}
The aggregation works perfectly well, but I can't seem to find a way to control the hit data that is returned. I can use the size parameter at the top of the DSL, but the hits are not returned in the same order as the buckets, so the bucket results do not line up with the hit results. Is there any way to correct this, or do I have to issue 2 separate queries?
To expand on Filipe's answer, it seems like the top_hits aggregation is what you are looking for, e.g.
{
"query": {
... snip ...
},
"aggs": {
"query": {
"terms": {
"field": "query",
"size": 3
},
"aggs": {
"top": {
"top_hits": {
"size": 42
}
}
}
}
}
}
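On the client side, the hits then sit inside each bucket rather than in the top-level hits list. A small Python sketch of collecting them, assuming the aggregation names used above ("query" and "top"); the sample response and the `hits_per_bucket` helper are made up for illustration.

```python
def hits_per_bucket(response, agg_name="query", top_name="top"):
    """Map each terms bucket key to the _source of its top_hits documents."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {
        bucket["key"]: [h["_source"] for h in bucket[top_name]["hits"]["hits"]]
        for bucket in buckets
    }


# Hypothetical, trimmed response with two buckets.
sample = {
    "aggregations": {
        "query": {
            "buckets": [
                {
                    "key": "a",
                    "doc_count": 2,
                    "top": {"hits": {"hits": [
                        {"_source": {"query": "a", "n": 1}},
                        {"_source": {"query": "a", "n": 2}},
                    ]}},
                },
                {
                    "key": "b",
                    "doc_count": 1,
                    "top": {"hits": {"hits": [{"_source": {"query": "b", "n": 3}}]}},
                },
            ]
        }
    }
}

grouped = hits_per_bucket(sample)
print(grouped)
```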
Your query uses exact matches (match and range) and binary logic (must, bool) and thus should probably be converted to use filters instead:
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
}
}
As for the aggregations,
The hits that are returned do not represent all the buckets that were returned. So if I have buckets for terms 'a', 'b', and 'c', I want to have hits that represent those buckets as well.
Perhaps you are looking to control the scope of the buckets? You can make an aggregation bucket global so that it will not be influenced by the query or filter.
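A sketch of what such a global bucket could look like, written as a Python dict. The `global` aggregation takes an empty body and wraps the aggregations that should ignore the query; the name `all_docs` is an assumption, and the uuid is the same placeholder as in the question.

```python
import json

body = {
    # The query still decides which hits are returned.
    "query": {"term": {"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"}},
    "aggs": {
        "all_docs": {
            # Empty body: this bucket holds every document, regardless of the query.
            "global": {},
            "aggs": {
                "query": {"terms": {"field": "query", "size": 3}}
            },
        }
    },
}

print(json.dumps(body, indent=2))
```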
Keep in mind that Elasticsearch will not "group" hits in any way -- it is always a flat list ordered according to score and additional sorting options.
Aggregations can be organized in a nested structure and return computed or extracted values, in a specific order. In the case of terms aggregation, it is in descending count (highest number of hits first). The hits section of the response is never influenced by your choice of aggregations. Similarly, you cannot find hits in the aggregation sections.
If your goal is to group documents by a certain field, yes, you will need to run multiple queries in the current Elasticsearch release.
I'm not 100% sure, but I think there's no way to do that in the current version of Elasticsearch (1.2.x). The good news is that there will be when version 1.3.x gets released:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
