How to get an analyzed word count with Elasticsearch?

I would like to count the occurrences of each analyzed token.
First, I tried the following:
mapping:
{
"docs": {
"mappings": {
"doc": {
"dynamic": "false",
"properties": {
"text": {
"type": "string",
"analyzer": "kuromoji"
}
}
}
}
}
}
query:
{
"query": {
"match_all": {}
},
"aggs": {
"word-count": {
"terms": {
"field": "text",
"size": "1000"
}
}
},
"size": 0
}
After inserting my data, I queried the index and got the following result:
{
"took": 41,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10000,
"max_score": 0,
"hits": []
},
"aggregations": {
"word-count": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 36634,
"buckets": [
{
"key": "はい",
"doc_count": 4734
},
{
"key": "いただく",
"doc_count": 2440
},
...
]
}
}
}
Unfortunately, the terms aggregation provides only a doc_count, which is not a word count. So I think the way to get an approximate word count is to use _index['text']['TERM'].df() and _index['text']['TERM'].ttf().
Maybe the approximate word count is given by the following equation:
WordCount = doc_count['TERM'] / _index['text']['TERM'].df() * _index['text']['TERM'].ttf()
Here 'TERM' is the key of a bucket. I tried to write a scripted metric aggregation, but I didn't know how to get the keys of the buckets.
{
"query": {
"match_all": {}
},
"aggs": {
"doc-count": {
"terms": {
"field": "text",
"size": "1000"
}
},
"aggs": {
"word-count": {
"scripted_metric": {
// ???
}
}
}
},
"size": 0
}
How can I get the keys of the buckets?
If that is impossible, how can I get an analyzed word count?
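The approximation proposed in the question can be checked with plain arithmetic. The sketch below only restates that equation in Python; the function name and the sample numbers are made up for illustration, and it does not query Elasticsearch.

```python
def approx_word_count(doc_count: int, df: int, ttf: int) -> float:
    """Scale a bucket's doc_count by the term's average frequency per document (ttf / df)."""
    if df <= 0:
        raise ValueError("df must be positive")
    return doc_count / df * ttf


# Hypothetical statistics for one term: it occurs in 4 documents of the index (df),
# 10 times in total (ttf), and 2 of those documents match the query (doc_count).
print(approx_word_count(doc_count=2, df=4, ttf=10))  # 5.0
```

The estimate assumes the term is spread evenly over the documents that contain it, which is why it is only approximate.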

You can try the token count data type. Simply add a sub-field of that type to your text field:
{
"docs": {
"mappings": {
"doc": {
"dynamic": "false",
"properties": {
"text": {
"type": "string",
"analyzer": "kuromoji",
"fields": {
"nb_tokens": {
"type": "token_count",
"analyzer": "kuromoji"
}
}
}
}
}
}
}
}
Then you can use text.nb_tokens in your aggregation.
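As a rough illustration of using text.nb_tokens in an aggregation, the sketch below builds a request body with a sum aggregation over that sub-field. The function name and the aggregation name total_tokens are assumptions, and note that this gives the total number of tokens across the matching documents, not a count per term.

```python
import json


def token_sum_request(count_field: str = "text.nb_tokens") -> dict:
    """Build a search body that sums a token_count sub-field over all matching documents."""
    return {
        "query": {"match_all": {}},
        "size": 0,
        # "total_tokens" is an arbitrary aggregation name chosen for this sketch.
        "aggs": {"total_tokens": {"sum": {"field": count_field}}},
    }


print(json.dumps(token_sum_request(), indent=2))
```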

You could try dynamic scripting, though this will affect performance:
{
"query": {
"match_all": {}
},
"aggs": {
"word-count": {
"terms": {
"script": "_source.text",
"size": "1000"
}
}
},
"size": 0
}

Related

How to count number of values per group?

I have an index with the following mapping:
"my_index":{
"mapping": {
"properties": {
"rec_values": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"schm_p": {
"type": "keyword"
},
"tbl_p": {
"type": "keyword"
},
I want to count the number of values for each schm_p, something like:
select count(*)
from my_index
group by rec_values.schm_p
How can I do this?
You need to do a Composite Aggregation, like this:
{
"size": 0,
"aggs": {
"parameters": {
"nested": {
"path": "rec_values"
},
"aggs": {
"group": {
"composite": {
"size": 100, // your size
"sources": [{
"count_schm_p": {
"terms": {
"field": "rec_values.schm_p"
}
}
}]
}
}
}
}
}
}
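A composite aggregation returns its buckets in pages, and each page carries an after_key that can be sent back as after to fetch the next one. The sketch below shows one way such a loop might look; search is a stand-in for whatever function sends the body to Elasticsearch, and fake_search with its canned buckets exists only so the example runs on its own.

```python
from typing import Callable, Iterator, Optional


def composite_body(after: Optional[dict] = None, page_size: int = 100) -> dict:
    """Build the nested + composite request, resuming after `after` if given."""
    composite = {
        "size": page_size,
        "sources": [{"count_schm_p": {"terms": {"field": "rec_values.schm_p"}}}],
    }
    if after is not None:
        composite["after"] = after
    return {
        "size": 0,
        "aggs": {
            "parameters": {
                "nested": {"path": "rec_values"},
                "aggs": {"group": {"composite": composite}},
            }
        },
    }


def iter_buckets(search: Callable[[dict], dict], page_size: int = 100) -> Iterator[dict]:
    """Yield every composite bucket, following after_key until no page remains."""
    after = None
    while True:
        group = search(composite_body(after, page_size))["aggregations"]["parameters"]["group"]
        yield from group["buckets"]
        after = group.get("after_key")
        if not group["buckets"] or after is None:
            return


def fake_search(body: dict) -> dict:
    """Hypothetical stand-in for a real search call; serves one page, then an empty one."""
    after = body["aggs"]["parameters"]["aggs"]["group"]["composite"].get("after")
    if after is None:
        group = {
            "after_key": {"count_schm_p": "type_b"},
            "buckets": [
                {"key": {"count_schm_p": "type_a"}, "doc_count": 3},
                {"key": {"count_schm_p": "type_b"}, "doc_count": 2},
            ],
        }
    else:
        group = {"buckets": []}
    return {"aggregations": {"parameters": {"group": group}}}


counts = {b["key"]["count_schm_p"]: b["doc_count"] for b in iter_buckets(fake_search)}
print(counts)  # {'type_a': 3, 'type_b': 2}
```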
You need to use an aggregation for this, something like:
GET my_index/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"count_schm_p": {
"terms": {
"field": "rec_values.schm_p.keyword",
"size": 100
}
}
}
}
This query would return a response like this:
{
"took": 561,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"count_schm_p": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 829099,
"buckets": [
{
"key": "type_a",
"doc_count": 1704640
},
{
"key": "type_b",
"doc_count": 1454079
},
{
"key": "type_c",
"doc_count": 894678
},
{
"key": "type_d",
"doc_count": 208489
}
]
}
}
}
The count for each schm_p is the doc_count of the bucket whose key is that value.
Note: the size in your query needs to be at least the number of distinct schm_p values you have.

Nested object aggregation term with mixed nested/non-nested filter

We have facets showing the number of results that will show when clicking the filters (and combining them).
Before we introduced nested objects, the following would do the job:
GET /x_v1/_search/
{
"size": 0,
"aggs": {
"FilteredDescriptiveFeatures": {
"filter": {
"bool": {
"must": [
{
"terms": {
"breadcrumbs.categoryIds": [
"category"
]
}
},
{
"terms": {
"products.sterile": [
"0"
]
}
}
]
}
},
"aggs": {
"DescriptiveFeatures": {
"terms": {
"field": "products.descriptiveFeatures",
"size": 1000
}
}
}
}
}
}
This gives the result:
"aggregations": {
"FilteredDescriptiveFeatures": {
"doc_count": 280,
"DescriptiveFeatures": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "somekey",
"doc_count": 42
},
We needed to make products a nested object though, and I'm currently trying to rewrite the above to work with this change.
My attempt looks like the following. It doesn't give the correct result though, and doesn't seem properly connected to the filter.
GET /x_v2/_search/
{
"size": 0,
"aggs": {
"FilteredDescriptiveFeatures": {
"filter": {
"bool": {
"must": [
{
"terms": {
"breadcrumbs.categoryIds": [
"category"
]
}
},
{
"nested": {
"path": "products",
"query": {
"terms": {
"products.sterile": [
"0"
]
}
}
}
}
]
}
},
"aggs": {
"nested": {
"nested": {
"path": "products"
},
"aggregations": {
"DescriptiveFeatures": {
"terms": {
"field": "products.descriptiveFeatures",
"size": 1000
}
}
}
}
}
}
}
}
This gives the result:
"aggregations": {
"FilteredDescriptiveFeatures": {
"doc_count": 280,
"nested": {
"doc_count": 1437,
"DescriptiveFeatures": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "somekey",
"doc_count": 164
},
I've also tried to put the nested definition higher up so that it contains both the filter and the aggs, but then the filter term breadcrumbs.categoryIds, which is not in the nested object, won't work.
Is what I'm trying to do even possible?
And how can it be solved?
In your FilteredDescriptiveFeatures step, you return all documents that have at least one product with sterile = 0.
But in the nested step that follows you don't apply this filter again, so all nested products of those documents are returned, and your terms aggregation runs on all of their products, not only on products with sterile = 0.
You should move your sterile filter into the nested step. And as Richa points out, you need a reverse_nested aggregation in the final step to count Elasticsearch documents rather than nested product sub-documents.
Could you try this query?
{
"size": 0,
"aggs": {
"filteredCategory": {
"filter": {
"terms": {
"breadcrumbs.categoryIds": [
"category"
]
}
},
"aggs": {
"nestedProducts": {
"nested": {
"path": "products"
},
"aggs": {
"filteredByProductsAttributes": {
"filter": {
"terms": {
"products.sterile": [
"0"
]
}
},
"aggs": {
"DescriptiveFeatures": {
"terms": {
"field": "products.descriptiveFeatures",
"size": 1000
},
"aggs": {
"productCount": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
}
}
}
What I understand from the description is that you want to filter your results on some nested and non-nested fields and then apply aggregations on the nested field. I created a sample index and data with some nested and non-nested fields and wrote a query.
Mapping
PUT stack-557722203
{
"mappings": {
"_doc": {
"properties": {
"category": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"user": {
"type": "nested", // NESTED FIELD
"properties": {
"fName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
Sample Data
POST _bulk
{"index":{"_index":"stack-557722203","_id":"1","_type":"_doc"}}
{"category":"X","user":[{"fName":"A","lName":"B","type":"X"},{"fName":"A","lName":"C","type":"X"},{"fName":"P","lName":"B","type":"Y"}]}
{"index":{"_index":"stack-557722203","_id":"2","_type":"_doc"}}
{"category":"X","user":[{"fName":"P","lName":"C","type":"Z"}]}
{"index":{"_index":"stack-557722203","_id":"3","_type":"_doc"}}
{"category":"X","user":[{"fName":"A","lName":"C","type":"Y"}]}
{"index":{"_index":"stack-557722203","_id":"4","_type":"_doc"}}
{"category":"Y","user":[{"fName":"A","lName":"C","type":"Y"}]}
Query
GET stack-557722203/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"nested": {
"path": "user",
"query": {
"term": {
"user.fName.keyword": {
"value": "A"
}
}
}
}
},
{
"term": {
"category.keyword": {
"value": "X"
}
}
}
]
}
},
"aggs": {
"group BylName": {
"nested": {
"path": "user"
},
"aggs": {
"group By lName": {
"terms": {
"field": "user.lName.keyword",
"size": 10
},
"aggs": {
"reverse Nested": {
"reverse_nested": {} // NOTE THIS
}
}
}
}
}
}
}
Output
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"group BylName": {
"doc_count": 4,
"group By lName": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "B",
"doc_count": 2,
"reverse Nested": {
"doc_count": 1
}
},
{
"key": "C",
"doc_count": 2,
"reverse Nested": {
"doc_count": 2
}
}
]
}
}
}
}
The reason you get a larger doc_count after changing the mapping to nested lies in the way nested and object (non-nested) fields are stored: each nested object is indexed as a separate hidden document, so a nested aggregation counts those sub-documents rather than the root documents. To connect them back to the root document, you can use a reverse_nested aggregation, and then you will get the same result as before.
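The difference between the two counts can be reproduced without Elasticsearch. The sketch below is a plain Python imitation of the sample query over the four sample documents above, not the engine's actual implementation: it counts once per nested user object (like the terms aggregation inside nested) and once per distinct parent document (like reverse_nested).

```python
from collections import defaultdict

# The four sample documents from the bulk request above.
docs = {
    "1": {"category": "X", "user": [{"fName": "A", "lName": "B", "type": "X"},
                                     {"fName": "A", "lName": "C", "type": "X"},
                                     {"fName": "P", "lName": "B", "type": "Y"}]},
    "2": {"category": "X", "user": [{"fName": "P", "lName": "C", "type": "Z"}]},
    "3": {"category": "X", "user": [{"fName": "A", "lName": "C", "type": "Y"}]},
    "4": {"category": "Y", "user": [{"fName": "A", "lName": "C", "type": "Y"}]},
}


def matches(doc: dict) -> bool:
    """Mimic the bool query: category X and at least one user with fName A."""
    return doc["category"] == "X" and any(u["fName"] == "A" for u in doc["user"])


nested_counts = defaultdict(int)  # one count per nested user object
root_ids = defaultdict(set)       # distinct parent documents per lName

for doc_id, doc in docs.items():
    if not matches(doc):
        continue
    for user in doc["user"]:
        nested_counts[user["lName"]] += 1
        root_ids[user["lName"]].add(doc_id)

root_counts = {name: len(ids) for name, ids in root_ids.items()}
print(dict(nested_counts))  # {'B': 2, 'C': 2}
print(root_counts)          # {'B': 1, 'C': 2}
```

These numbers agree with the doc_count values and the reverse Nested counts in the output shown above.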
Hope this helps!!

Elasticsearch. Terms aggregation on nested field with duplicated values

I have some problem with nested aggregation in Elasticsearch. I have mapping with nested field:
POST my_index/my_type/_mapping
{
"properties": {
"name": {
"type": "keyword"
},
"nested_fields": {
"type": "nested",
"properties": {
"key": {
"type": "keyword"
},
"value": {
"type": "keyword"
}
}
}
}
}
Then I add one document to index:
POST my_index/my_type
{
"name":"object1",
"nested_fields":[
{
"key": "key1",
"value": "value1"
},
{
"key": "key1",
"value": "value2"
}
]
}
As you see, my nested array has two items with the same key field but different value fields. Then I want to run this query:
GET /my_index/my_type/_search
{
"query": {
"nested": {
"path": "nested_fields",
"query": {
"bool": {
"must": [
{
"term": {
"nested_fields.key": {
"value": "key1"
}
}
},
{
"terms": {
"nested_fields.value": [
"value1",
"value2"
]
}
}
]
}
}
}
},
"aggs": {
"agg_nested_fields": {
"nested": {
"path": "nested_fields"
},
"aggs": {
"agg_nested_fields_key": {
"terms": {
"field": "nested_fields.key",
"size": 10
}
}
}
}
}
}
As you see, I want to find all documents that have at least one object in the nested_fields array with the key property equal to key1 and one of the provided values (value1 or value2). Then I want to group the documents found by nested_fields.key. But I get this response:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.87546873,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "AVuLNXxiryKmA7VEwOfV",
"_score": 0.87546873,
"_source": {
"name": "object1",
"nested_fields": [
{
"key": "key1",
"value": "value1"
},
{
"key": "key1",
"value": "value2"
}
]
}
}
]
},
"aggregations": {
"agg_nested_fields": {
"doc_count": 2,
"agg_nested_fields_key": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "key1",
"doc_count": 2
}
]
}
}
}
}
As you see from the response, I have one hit (it is correct), but the document was counted two times in aggregation (see doc_count: 2), because it has two items with 'key1' value in nested_fields array. How can I get the right count in aggregation?
You will have to use a reverse_nested aggregation inside the nested aggregation to get the count of root documents.
{
"query": {
"nested": {
"path": "nested_fields",
"query": {
"bool": {
"must": [{
"term": {
"nested_fields.key": {
"value": "key1"
}
}
},
{
"terms": {
"nested_fields.value": [
"value1",
"value2"
]
}
}
]
}
}
}
},
"aggs": {
"agg_nested_fields": {
"nested": {
"path": "nested_fields"
},
"aggs": {
"agg_nested_fields_key": {
"terms": {
"field": "nested_fields.key",
"size": 10
},
"aggs": {
"back_to_root": {
"reverse_nested": {}
}
}
}
}
}
}
}

ElasticSearch: Aggregations of URLs keeps splitting field

I'm trying to write an elasticsearch query that groups all blogs with the same blog domain (wordpress.com, blog.com, etc). This is how my query looks like:
{
"engagements": [
"blogs"
],
"query": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"range": {
"weight": {
"gte": 120,
"lte": 150
}
}
}
]
}
}
}
},
"facets": {
"my_facet": {
"terms": {
"field": "blog_domain" <-------------------------------------
}
}
}
},
"api": "_search"
}
However, it's returning this:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
...
]
},
"facets": {
"my_facet": {
"_type": "terms",
"missing": 0,
"total": 21,
"other": 3,
"terms": [
{
"term": "http",
"count": 3
},
{
"term": "noblepig.com",
"count": 2
},
{
"term": "hawaiian",
"count": 2
},
{
"term": "dream",
"count": 2
},
{
"term": "dessert",
"count": 2
},
{
"term": "2015",
"count": 2
},
{
"term": "05",
"count": 2
},
{
"term": "www.bt",
"count": 1
},
{
"term": "photos",
"count": 1
},
{
"term": "images.net",
"count": 1
}
]
}
}
}
This isn't what I want.
Right now my database has three records:
"http://www.bt-images.net/8-cute-photos-cats/",
"http://noblepig.com/2015/05/hawaiian-dream-dessert/",
"http://noblepig.com/2015/05/hawaiian-dream-dessert/"
I want it to return something like:
"facets": {
"my_facet": {
"_type": "terms",
"missing": 0,
"total": 21,
"other": 3,
"terms": [
{
"term": "http://noblepig.com/2015/05/hawaiian-dream-dessert/",
"count": 2
},
{
"term": "http://www.bt-images.net/8-cute-photos-cats/",
"count": 1
},
How would I do this? I looked it up and saw people recommending mappings but I don't know where to put that in this query and my table has 100 million records so it's too late to do that. If you have suggestions, could you please paste the whole query?
The same happens when I use aggs:
{
"engagements": [
"blogs"
],
"query": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"range": {
"weight": {
"gte": 13,
"lte": 75
}
}
}
]
}
}
}
},
"aggs": {
"blah": {
"terms": {
"field": "blog_domain"
}
}
}
},
"api": "_search"
}
The right way to do this is to have a different mapping for that field. You can change the mapping on the fly by adding a sub-field to blog_domain, but you cannot change the documents that were already indexed: the mapping change will only take effect for new documents.
Just for the sake of mentioning this, your blog_domain should look like this:
"blog_domain": {
"type": "string",
"fields": {
"notAnalyzed": {
"type": "string",
"index": "not_analyzed"
}
}
}
meaning it should have a sub-field (called notAnalyzed in my sample), and in your aggregation you should use blog_domain.notAnalyzed.
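To see what the not_analyzed sub-field changes, it may help to imitate the counting in plain Python. This is only an analogy: a not_analyzed field keeps each value as a single term, so a terms aggregation on it behaves like counting exact, whole strings, which the sketch does for the three values listed in the question.

```python
from collections import Counter

# The three stored values listed in the question.
blog_domains = [
    "http://www.bt-images.net/8-cute-photos-cats/",
    "http://noblepig.com/2015/05/hawaiian-dream-dessert/",
    "http://noblepig.com/2015/05/hawaiian-dream-dessert/",
]

# Count whole values instead of the tokens an analyzer would split them into.
exact_counts = Counter(blog_domains)

for term, count in exact_counts.most_common():
    print(count, term)
```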
But, if you don't want to or can't make this change, there is a way but I believe it's slower: using scripted aggregation. Something like this:
{
"aggs": {
"blah": {
"terms": {
"script": "_source.blog_domain",
"size": 10
}
}
}
}
And you need to enable dynamic scripting, if you don't have it enabled.
If you use Elasticsearch 5.x, you could use the mapping below:
PUT your_index
{
"mappings": {
"your_type": {
"properties": {
"blog_domain": {
"type": "keyword",
"index": "not_analyzed"
}
}
}
}
}

ElasticSearch - Dot in field name of nested object

I have data of this form:
{
"workers": {
"worker.1": {
"jobs": 1234
}
},
"total_jobs": 1234
}
and I'm trying to deal with having the "dot" in the field-name. I tried this mapping:
{
"worker_stats": {
"properties": {
"workers": {
"type": "object",
"properties": {
"worker.1": {
"type": "nested",
"index_name": "worker_1",
"properties": {
"jobs": {
"type": "integer"
}
}
}
}
},
"total_jobs": {
"type": "integer"
}
}
}
}
but when I fetch my mapping, the index_name is nowhere to be seen, and when I add a document, it still has the dot.
Ultimately, I'm just trying to do some aggregations:
{
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
},
"aggs": {
"worker1_stats": {
"aggs": {
"stats": {
"stats": {
"field": "workers.worker.1.jobs"
}
}
},
"nested": {
"path": "workers.worker.1"
}
}
}
}
but the dot interferes.
What can I do to deal with this dot? Is there a way to use script instead of field? (Is my use of nested even correct?)
I think you can use index_name, path, and type: object in your mapping to change the name of that field during indexing.
Here is my example:
PUT /twitter/
{
"settings" : {
"number_of_shards" : 5,
"number_of_replicas" : 0
},
"mappings": {
"tweet":{
"properties": {
"desc.youbet":{"type":"object","path":"just_name",
"properties": {
"one": {
"type": "integer", "index_name":"one"
}
}
}
}
}
}
}
PUT /twitter/tweet/1
{
"name":"chicken",
"desc.youbet":{
"one":1
}
}
PUT /twitter/tweet/2
{
"name":"chicken",
"desc.youbet":{
"one":1
}
}
You can now use one to search and aggregate on what was desc.youbet.one in your document, so this:
POST /twitter/tweet/_search
{
"query": {"match_all": {}},
"aggs":{
"stats": {
"stats": {"field": "one"}
}
}, "size":0
}
Results in something like this:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"stats": {
"count": 2,
"min": 1,
"max": 1,
"avg": 1,
"sum": 2
}
}
}
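A different workaround, not the one used in the answer above, is to rename the dotted keys on the client before indexing. The sketch below is a minimal version of that idea; the function name and the choice of "_" as the replacement character are assumptions, and two keys such as a.b and a_b would collide after renaming.

```python
def replace_dots_in_keys(value, replacement="_"):
    """Recursively copy a JSON-like structure, replacing '.' in every dict key."""
    if isinstance(value, dict):
        return {
            key.replace(".", replacement): replace_dots_in_keys(item, replacement)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [replace_dots_in_keys(item, replacement) for item in value]
    return value  # scalars are returned unchanged


# The document from the question.
doc = {"workers": {"worker.1": {"jobs": 1234}}, "total_jobs": 1234}
print(replace_dots_in_keys(doc))
# {'workers': {'worker_1': {'jobs': 1234}}, 'total_jobs': 1234}
```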
