ElasticSearch: Aggregations of URLs keeps splitting field - elasticsearch

I'm trying to write an elasticsearch query that groups all blogs with the same blog domain (wordpress.com, blog.com, etc). This is how my query looks like:
{
"engagements": [
"blogs"
],
"query": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"range": {
"weight": {
"gte": 120,
"lte": 150
}
}
}
]
}
}
}
},
"facets": {
"my_facet": {
"terms": {
"field": "blog_domain" <-------------------------------------
}
}
}
},
"api": "_search"
}
However, it's returning this:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
...
]
},
"facets": {
"my_facet": {
"_type": "terms",
"missing": 0,
"total": 21,
"other": 3,
"terms": [
{
"term": "http",
"count": 3
},
{
"term": "noblepig.com",
"count": 2
},
{
"term": "hawaiian",
"count": 2
},
{
"term": "dream",
"count": 2
},
{
"term": "dessert",
"count": 2
},
{
"term": "2015",
"count": 2
},
{
"term": "05",
"count": 2
},
{
"term": "www.bt",
"count": 1
},
{
"term": "photos",
"count": 1
},
{
"term": "images.net",
"count": 1
}
]
}
}
}
This isn't what I want.
Right now my database has three records:
"http://www.bt-images.net/8-cute-photos-cats/",
"http://noblepig.com/2015/05/hawaiian-dream-dessert/",
"http://noblepig.com/2015/05/hawaiian-dream-dessert/"
I want it to return something like:
"facets": {
"my_facet": {
"_type": "terms",
"missing": 0,
"total": 21,
"other": 3,
"terms": [
{
"term": "http://noblepig.com/2015/05/hawaiian-dream-dessert/",
"count": 2
},
{
"term": "http://www.bt-images.net/8-cute-photos-cats/",
"count": 1
},
How would I do this? I looked it up and saw people recommending mappings but I don't know where to put that in this query and my table has 100 million records so it's too late to do that. If you have suggestions, could you please paste the whole query?
The same happens when I use aggs:
{
"engagements": [
"blogs"
],
"query": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"range": {
"weight": {
"gte": 13,
"lte": 75
}
}
}
]
}
}
}
},
"aggs": {
"blah": {
"terms": {
"field": "blog_domain"
}
}
}
},
"api": "_search"
}

The right way to do this is to have a different mapping for that field. You can change the mapping on the way by adding a sub-field to blog_domain but you cannot change the documents that were already indexed. The mapping change will take effect for the new documents.
Just for the sake of mentioning this, your blog_domain should look like this:
"blog_domain": {
"type": "string",
"fields": {
"notAnalyzed": {
"type": "string",
"index": "not_analyzed"
}
}
}
meaning it should have a sub-field (in my sample is called notAnalyzed) and in your aggregation you should use blog_domain.notAnalyzed.
But, if you don't want to or can't make this change, there is a way but I believe it's slower: using scripted aggregation. Something like this:
{
"aggs": {
"blah": {
"terms": {
"script": "_source.blog_domain",
"size": 10
}
}
}
}
And you need to enable dynamic scripting, if you don't have it enabled.

If you use Elasticsearch 5.x, you could the mapping below
PUT your_index
{
"mappings": {
"your_type": {
"properties": {
"blog_domain": {
"type": "keyword",
"index": "not_analyzed"
}
}
}
}
}

Related

Calculate the change percent of a sliding time window in Elasticsearch

Elasticsearch newbie here. I have a series of log messages like these
{
"#timestamp": "whatever",
"type": "toBeMonitored",
"success": true
}
I was tasked to react on a change of -30% of the total amount of successful messages compared to yesterday's same interval. So if I do the check at 8 AM today, I should compare today's total count from midnight to 8 AM to yesterday's same interval.
I tried creating a date histogram aggregation but I would like to have the diff percentage as a query result and not do the math on the development side.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"type": "toBeMonitored"
}
},
{
"term": {
"status": true
}
},
{
"range": {
"#timestamp": {
"gte": "now-1d/d",
"lte": "now/h"
}
}
}
]
}
},
"aggs": {
"histo": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "1h"
}
}
}
}
Any idea on how this might be accomplished?
You can leverage the derivative pipeline aggregation to achieve exactly what you expect:
POST /sales/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"type": "toBeMonitored"
}
},
{
"term": {
"status": true
}
},
{
"range": {
"#timestamp": {
"gte": "now-1d/d",
"lte": "now/h"
}
}
}
]
}
},
"aggs": {
"histo": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "1h"
},
"aggs": {
"successDiff": {
"derivative": {
"buckets_path": "_count"
}
}
}
}
}
}
In each bucket you're going to get the difference between the document count in the previous bucket vs the current bucket.
Ended up dropping the date_histogram aggregation and using the date_range one. It's much easier to work with, even though it does not return the difference compared to yesterday's same time period. I did that in code.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"type": "toBeMonitored"
}
},
{
"term": {
"status": true
}
},
{
"range": {
"#timestamp": {
"gte": "now-1d/d",
"lte": "now/h"
}
}
}
]
}
},
"aggs": {
"ranged_documents": {
"date_range": {
"field": "#timestamp",
"ranges": [
{
"key": "yesterday",
"from": "now-1d/d",
"to": "now-24h/h"
},
{
"key": "today",
"from": "now/d",
"to": "now/h"
}
],
"keyed": true
}
}
}
}
This query would yield a result similar to the one below
{
"_shards": {
"total": 42,
"failed": 0,
"successful": 42,
"skipped": 0
},
"hits": {
"hits": [],
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null
},
"took": 134,
"timed_out": false,
"aggregations": {
"ranged_documents": {
"buckets": {
"yesterday": {
"from_as_string": "2020-10-12T00:00:00.000Z",
"doc_count": 268300,
"to_as_string": "2020-10-12T12:00:00.000Z",
"from": 1602460800000,
"to": 1602504000000
},
"today": {
"from_as_string": "2020-10-13T00:00:00.000Z",
"doc_count": 251768,
"to_as_string": "2020-10-13T12:00:00.000Z",
"from": 1602547200000,
"to": 1602590400000
}
}
}
}
}

Nested object aggregation term with mixed nested/non-nested filter

We have facets showing the number of results that will show when clicking the filters (and combining them). Something like this:
Before we introduced nested objects, the following would do the job:
GET /x_v1/_search/
{
"size": 0,
"aggs": {
"FilteredDescriptiveFeatures": {
"filter": {
"bool": {
"must": [
{
"terms": {
"breadcrumbs.categoryIds": [
"category"
]
}
},
{
"terms": {
"products.sterile": [
"0"
]
}
}
]
}
},
"aggs": {
"DescriptiveFeatures": {
"terms": {
"field": "products.descriptiveFeatures",
"size": 1000
}
}
}
}
}
}
This gives the result:
"aggregations": {
"FilteredDescriptiveFeatures": {
"doc_count": 280,
"DescriptiveFeatures": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "somekey",
"doc_count": 42
},
We needed to make products a nested object though, and I'm currently trying rewrite the above to work with this change.
My attempt looks like the following. It doesn't give the correct result though, and doesn't seem properly connected to the filter.
GET /x_v2/_search/
{
"size": 0,
"aggs": {
"FilteredDescriptiveFeatures": {
"filter": {
"bool": {
"must": [
{
"terms": {
"breadcrumbs.categoryIds": [
"category"
]
}
},
{
"nested": {
"path": "products",
"query": {
"terms": {
"products.sterile": [
"0"
]
}
}
}
}
]
}
},
"aggs": {
"nested": {
"nested": {
"path": "products"
},
"aggregations": {
"DescriptiveFeatures": {
"terms": {
"field": "products.descriptiveFeatures",
"size": 1000
}
}
}
}
}
}
}
}
This gives the result:
"aggregations": {
"FilteredDescriptiveFeatures": {
"doc_count": 280,
"nested": {
"doc_count": 1437,
"DescriptiveFeatures": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "somekey",
"doc_count": 164
},
I've also tried to put the nested definition higher up to contain both the filter and aggs, but then the filter term breadcrumbs.categoryId, which is not in the nested object, won't work.
Is what I'm trying to do even possible?
And how can it be solved?
In your FilteredDescriptiveFeatures step, you return all documents that have one product with sterile = 0
But after in the nested step you dont specify again this filter. So all nested products are return in this step, thus you make your terms aggregations on all products, not only products with sterile = 0
You should move your sterile filter in the nested step. And like Richa points out, you need to use a reverse_nested aggregation in the final step to count elasticsearch document and not nested products sub-documents.
Could you try this query ?
{
"size": 0,
"aggs": {
"filteredCategory": {
"filter": {
"terms": {
"breadcrumbs.categoryIds": [
"category"
]
}
},
"aggs": {
"nestedProducts": {
"nested": {
"path": "products"
},
"aggs": {
"filteredByProductsAttributes": {
"filter": {
"terms": {
"products.sterile": [
"0"
]
}
},
"aggs": {
"DescriptiveFeatures": {
"terms": {
"field": "products.descriptiveFeatures",
"size": 1000
},
"aggs": {
"productCount": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
}
}
}
What I under stand from the description is that you want to filter your results on the basis of some Nested and Non Nested Fields and then apply aggregations on the Nested Field. I created a sample Index and data with some Nested and Non Nested Fields and created a query
Mapping
PUT stack-557722203
{
"mappings": {
"_doc": {
"properties": {
"category": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"user": {
"type": "nested", // NESTED FIELD
"properties": {
"fName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
Sample Data
POST _bulk
{"index":{"_index":"stack-557722203","_id":"1","_type":"_doc"}}
{"category":"X","user":[{"fName":"A","lName":"B","type":"X"},{"fName":"A","lName":"C","type":"X"},{"fName":"P","lName":"B","type":"Y"}]}
{"index":{"_index":"stack-557722203","_id":"2","_type":"_doc"}}
{"category":"X","user":[{"fName":"P","lName":"C","type":"Z"}]}
{"index":{"_index":"stack-557722203","_id":"3","_type":"_doc"}}
{"category":"X","user":[{"fName":"A","lName":"C","type":"Y"}]}
{"index":{"_index":"stack-557722203","_id":"4","_type":"_doc"}}
{"category":"Y","user":[{"fName":"A","lName":"C","type":"Y"}]}
Query
GET stack-557722203/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"nested": {
"path": "user",
"query": {
"term": {
"user.fName.keyword": {
"value": "A"
}
}
}
}
},
{
"term": {
"category.keyword": {
"value": "X"
}
}
}
]
}
},
"aggs": {
"group BylName": {
"nested": {
"path": "user"
},
"aggs": {
"group By lName": {
"terms": {
"field": "user.lName.keyword",
"size": 10
},
"aggs": {
"reverse Nested": {
"reverse_nested": {} // NOTE THIS
}
}
}
}
}
}
}
Output
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"group BylName": {
"doc_count": 4,
"group By lName": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "B",
"doc_count": 2,
"reverse Nested": {
"doc_count": 1
}
},
{
"key": "C",
"doc_count": 2,
"reverse Nested": {
"doc_count": 2
}
}
]
}
}
}
}
As per the discrepancy in data where you are getting, more documents in doc_count when you changed the mapping to Nested is because of the way Nested and Object(NonNested) documents are stored. See here to understand how are they internally stored. In order to connect them back to the root Document , you can use Reverse Nested aggregation and then you will have the same result.
Hope this helps!!

How to get the parent document in a nested top_hits aggregation?

This is my document/mapping with a nested prices array:
{
"name": "Foobar",
"type": 1,
"prices": [
{
"date": "2016-03-22",
"price": 100.41
},
{
"date": "2016-03-23",
"price": 200.41
}
]
}
Mapping:
{
"properties": {
"name": {
"index": "not_analyzed",
"type": "string"
},
"type": {
"type": "byte"
},
"prices": {
"type": "nested",
"properties": {
"date": {
"format": "dateOptionalTime",
"type": "date"
},
"price": {
"type": "double"
}
}
}
}
}
I use a top_hits aggregation to get the min price of the nested price array. I also have to filter the prices by date. Here is the query and the response:
POST /index/type/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"prices": {
"nested": {
"path": "prices"
},
"aggs": {
"date_filter": {
"filter": {
"range": {
"prices.date": {
"gte": "2016-03-21"
}
}
},
"aggs": {
"min": {
"top_hits": {
"sort": {
"prices.price": {
"order": "asc"
}
},
"size": 1
}
}
}
}
}
}
}
}
Response:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"prices": {
"doc_count": 4,
"date_filter": {
"doc_count": 4,
"min": {
"hits": {
"total": 4,
"max_score": null,
"hits": [
{
"_index": "index",
"_type": "type",
"_id": "4225796ALL2016061541031",
"_nested": {
"field": "prices",
"offset": 0
},
"_score": null,
"_source": {
"date": "2016-03-22",
"price": 100.41
},
"sort": [
100.41
]
}
]
}
}
}
}
}
}
Is there a way to get the parent source document (or some fields from it) with _id="4225796ALL2016061541031" in the response (e.g. name)? A second query is not an option.
Instead of applying aggregations use query and inner_hits like :
{
"query": {
"nested": {
"path": "prices",
"query": {
"range": {
"prices.date": {
"gte": "2016-03-21"
}
}
},
"inner_hits": {
"sort": {
"prices.price": {
"order": "asc"
}
},
"size": 1
}
}
}
}
Fetch data of parent_documentdata from _source and actual data from inner_hits.
Hope it helps

How to get analyzed word count by Elasticsearch?

I would like to count each token analyzed.
First, I tried following codes:
mapping:
{
"docs": {
"mappings": {
"doc": {
"dynamic": "false",
"properties": {
"text": {
"type": "string",
"analyzer": "kuromoji"
}
}
}
}
}
}
query:
{
"query": {
"match_all": {}
},
"aggs": {
"word-count": {
"terms": {
"field": "text",
"size": "1000"
}
}
},
"size": 0
}
I queried my index after inserting my data, I got a following result:
{
"took": 41
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10000,
"max_score": 0,
"hits": []
},
"aggregations": {
"word-count": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 36634,
"buckets": [
{
"key": "はい",
"doc_count": 4734
},
{
"key": "いただく",
"doc_count": 2440
},
...
]
}
}
}
Unfortunately, term aggregation provides only a doc_count. It's not a word count. So, I think the way to get approximate word count using _index['text']['TERM'].df() and _index['text']['TERM'].ttf().
Maybe the approximate word count is the following equation:
WordCount = doc_count['TERM'] / _index['text']['TERM'].df() * _index['text']['TERM'].ttf()
'TERM' is key in buckets. I tried to write a scripted metric aggregation, but i didn't know how to get keys in buckets.
{
"query": {
"match_all": {}
},
"aggs": {
"doc-count": {
"terms": {
"field": "text",
"size": "1000"
}
},
"aggs": {
"word-count": {
"scripted_metric": {
// ???
}
}
}
},
"size": 0
}
How can I get keys in buckets?
If it is impossible, how can I get a analyzed word count?
You can try with the token count data type. Simply add a sub-field of that type to your text field:
{
"docs": {
"mappings": {
"doc": {
"dynamic": "false",
"properties": {
"text": {
"type": "string",
"analyzer": "kuromoji"
},
"fields": {
"nb_tokens": {
"type": "token_count",
"analyzer": "kuromoji"
}
}
}
}
}
}
}
Then you can use text.nb_tokens in your aggregation.
Can you try dynamic_scripting,though this will affect performance..
{
"query": {
"match_all": {}
},
"aggs": {
"word-count": {
"terms": {
"script": "_source.text",
"size": "1000"
}
}
},
"size": 0
}

Nested query in nested, filter aggregation fails

I am trying to use a nested query filter inside of a nested, filter aggregation. When I do so, the aggregation returns with no items. If I change the query to just a plain old match_all filter, I do get items back in the bucket.
Here is a simplified version of the mapping I'm working with:
"player": {
"properties": {
"rating": {
"type": "float"
},
"playerYears": {
"type": "nested",
"properties": {
"schoolsOfInterest": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
This query, with a match_all filter on the aggregation:
GET /players/_search
{
"size": 0,
"aggs": {
"rating": {
"nested": {
"path": "playerYears"
},
"aggs": {
"rating-filtered": {
"filter": {
"match_all": {}
},
"aggs": {
"rating": {
"histogram": {
"field": "playerYears.rating",
"interval": 1
}
}
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"match_all": {}
}
}
}
}
returns the following:
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 167316,
"max_score": 0,
"hits": []
},
"aggregations": {
"rating": {
"doc_count": 363550,
"rating-filtered": {
"doc_count": 363550,
"rating": {
"buckets": [
{
"key_as_string": "-1",
"key": -1,
"doc_count": 20978
},
{
"key_as_string": "0",
"key": 0,
"doc_count": 312374
},
{
"key_as_string": "1",
"key": 1,
"doc_count": 1162
},
{
"key_as_string": "2",
"key": 2,
"doc_count": 12104
},
{
"key_as_string": "3",
"key": 3,
"doc_count": 9558
},
{
"key_as_string": "4",
"key": 4,
"doc_count": 5549
},
{
"key_as_string": "5",
"key": 5,
"doc_count": 1825
}
]
}
}
}
}
}
But this query, which has a nested filter in the aggregation, returns an empty bucket:
GET /players/_search
{
"size": 0,
"aggs": {
"rating": {
"nested": {
"path": "playerYears"
},
"aggs": {
"rating-filtered": {
"filter": {
"nested": {
"query": {
"match_all": {}
},
"path": "playerYears.schoolsOfInterest"
}
},
"aggs": {
"rating": {
"histogram": {
"field": "playerYears.rating",
"interval": 1
}
}
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"match_all": {}
}
}
}
}
the empty bucket:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 167316,
"max_score": 0,
"hits": []
},
"aggregations": {
"rating": {
"doc_count": 363550,
"rating-filtered": {
"doc_count": 0,
"rating": {
"buckets": []
}
}
}
}
}
Is it possible to use nested filters inside of nested, filtered aggregations? Is there a known bug in elasticsearch about this? The nested filter works fine in the query context of the search, and it works fine if I don't use a nested aggregation.
Based on the information provided, and a few assumptions, I would like to provide two suggestions. I hope it helps solve your problem.
Case 1: using reverse nested aggregation:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"rating": {
"nested": {
"path": "playerYears.schoolsOfInterest"
},
"aggs": {
"rating-filtered": {
"filter": {
"match_all": {}
},
"aggs": {
"rating_nested": {
"reverse_nested": {},
"aggs": {
"rating": {
"histogram": {
"field": "rating",
"interval": 1
}
}
}
}
}
}
}
}
}
}
Case 2: changes to filtered aggregation:
{
"size": 0,
"aggs": {
"rating-filtered": {
"filter": {
"nested": {
"query": {
"match_all": {}
},
"path": "playerYears.schoolsOfInterest"
}
},
"aggs": {
"rating": {
"histogram": {
"field": "playerYears.rating",
"interval": 1
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"match_all": {}
}
}
}
}
I would suggest you to use case 1 and verify your required results.

Resources