Broken aggregation in elasticsearch - elasticsearch

I'm getting erroneous results when performing a terms aggregation on the names field in my index.
This is the mapping I used for the names field:
{
"dbnames": {
"properties": {
"names": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
Here are the results I'm getting for a simple terms aggregation on the field:
"aggregations": {
"names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Martin",
"doc_count": 1
},
{
"key": "John martin",
"doc_count": 1
},
{
"key": " Victor Moses",
"doc_count": 1
}
]
}
}
As you can see, the same name with different casing shows up as separate buckets in the aggregation. What I want is for the names to be grouped together irrespective of case.

The easiest way would be to make sure you properly case the value of your names field at indexing time.
If that is not an option, the other way to go about it is to define an analyzer that will do it for you and set that analyzer as the index_analyzer for the names field. Such a custom analyzer needs to use the keyword tokenizer (i.e. take the whole value of the field as a single token) and the lowercase token filter (i.e. lowercase the value):
curl -XPUT localhost:9200/your_index -d '{
"settings": {
"index": {
"analysis": {
"analyzer": {
"casing": { <--- custom casing analyzer
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"names": {
"type": "string",
"index_analyzer": "casing" <--- use your custom analyzer
}
}
}
}
}'
Then we can index some data:
curl -XPOST localhost:9200/your_index/your_type/_bulk -d '
{"index":{}}
{"names": "John Martin"}
{"index":{}}
{"names": "John martin"}
{"index":{}}
{"names": "Victor Moses"}
'
And finally, the terms aggregation on the names field returns the expected results:
curl -XPOST localhost:9200/your_index/your_type/_search -d '{
"size": 0,
"aggs": {
"dbnames": {
"terms": {
"field": "names"
}
}
}
}'
Results:
{
"dbnames": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "john martin",
"doc_count": 2
},
{
"key": "victor moses",
"doc_count": 1
}
]
}
}
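To see why the keyword tokenizer plus the lowercase filter merges the two "John Martin" buckets, here is a minimal pure-Python sketch of the idea. It does not call Elasticsearch: casing_analyzer and terms_buckets are hypothetical helper names that only imitate the custom analyzer and the terms aggregation, and the tie-breaking order is an assumption made for stable output.

```python
from collections import Counter

def casing_analyzer(value):
    # keyword tokenizer: the whole value is a single token
    # lowercase filter: that token is lowercased
    return [value.lower()]

def terms_buckets(docs, field, analyzer):
    # Count, for each indexed term, how many documents contain it (doc_count)
    counts = Counter()
    for doc in docs:
        for term in set(analyzer(doc[field])):
            counts[term] += 1
    # Highest doc_count first, then key ascending (assumed tie-break)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

docs = [
    {"names": "John Martin"},
    {"names": "John martin"},
    {"names": "Victor Moses"},
]
buckets = terms_buckets(docs, "names", casing_analyzer)
print(buckets)
```

Because the bucket key is the indexed term and not the original value, both spellings end up under the lowercased key.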

There are two options here:
1) Use the not_analyzed option. This has the disadvantage that the same string with different casing won't be seen as one.
2) Use the keyword tokenizer + lowercase filter. This does not have the above issue.
I have outlined these two approaches and how to use them here: https://qbox.io/blog/elasticsearch-aggregation-custom-analyzer

Related

Fielddata is disabled on text fields by default in elasticsearch

I have a problem: I upgraded from Elasticsearch 2.x to 5.1, and some of my data no longer works in the newer version because of the error "Fielddata is disabled on text fields by default" (https://www.elastic.co/guide/en/elasticsearch/reference/5.1/fielddata.html). In 2.x it seems to have been enabled.
Is there a way to enable fielddata automatically on text fields?
I tried something like this:
curl -XPUT http://localhost:9200/_template/template_1 -d '
{
"template": "*",
"mappings": {
"_default_": {
"properties": {
"fielddata-*": {
"type": "text",
"fielddata": true
}
}
}
}
}'
but it looks like Elasticsearch does not understand a wildcard in the field name there. As a temporary solution, I run a Python script every 30 minutes that scans all indices and adds fielddata=true to new fields.
The problem is that I have string data like "this is cool" in Elasticsearch:
curl -XPUT 'http://localhost:9200/example/exampleworking/1' -d '
{
"myfield": "this is cool"
}'
when trying to aggregate that:
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield"
}
}
}
}'
"Fielddata is disabled on text fields by default. Set fielddata=true on [myfield]"
The Elasticsearch documentation suggests using .keyword instead of enabling fielddata. However, that does not return the data I want.
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield.keyword"
}
}
}
}'
returns:
"buckets" : [
{
"key" : "this is cool",
"doc_count" : 1
}
]
which is not correct. When I set fielddata to true, everything works:
curl -XPUT 'http://localhost:9200/example/_mapping/exampleworking' -d '
{
"properties": {
"myfield": {
"type": "text",
"fielddata": true
}
}
}'
and then aggregate
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield"
}
}
}
}'
returns the correct result:
"buckets" : [
{
"key" : "cool",
"doc_count" : 1
},
{
"key" : "is",
"doc_count" : 1
},
{
"key" : "this",
"doc_count" : 1
}
]
How can I add fielddata=true automatically to all text fields in all indices? Is that even possible? In Elasticsearch 2.x this worked out of the box.
I will answer my own question:
curl -XPUT http://localhost:9200/_template/template_1 -d '
{
"template": "*",
"mappings": {
"_default_": {
"dynamic_templates": [
{
"strings2": {
"match_mapping_type": "string",
"mapping": {
"type": "text",
"fielddata": true
}
}
}
]
}
}
}'
This does what I want. Now all indices get fielddata=true by default.
Adding "fielddata": true allows the text field to be aggregated, but this has performance problems at scale. A better solution is to use a multi-field mapping.
Unfortunately, this is hidden a bit deep in Elasticsearch's documentation, in a warning under the fielddata mapping parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#before-enabling-fielddata
Here's a complete example of how this helps with a terms aggregation, tested on Elasticsearch 7.12 as of 2021-04-24:
Mapping (in ES7, under the mappings property of the body of a "put index template" request etc):
{
"properties": {
"bio": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
Four documents indexed:
{
"bio": "Dogs are the best pet."
}
{
"bio": "Cats are cute."
}
{
"bio": "Cats are cute."
}
{
"bio": "Cats are the greatest."
}
Aggregation query:
{
"size": 0,
"aggs": {
"bios_with_cats": {
"filter": {
"match": {
"bio": "cats"
}
},
"aggs": {
"bios": {
"terms": {
"field": "bio.keyword"
}
}
}
}
}
}
Aggregation query results:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"bios_with_cats": {
"doc_count": 3,
"bios": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Cats are cute.",
"doc_count": 2
},
{
"key": "Cats are the greatest.",
"doc_count": 1
}
]
}
}
}
}
Basically, this aggregation says "Of the documents whose bios are like 'cats', how many of each distinct bio are there?" The one document without "cats" in its bio property is excluded, and then the remaining documents are grouped into buckets, one of which has one document and the other has two documents.
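The split between searching on the analyzed text and bucketing on the keyword sub-field can be imitated in a few lines of plain Python. This is only a rough sketch under stated assumptions: text_tokens is a hypothetical stand-in for the standard analyzer (lowercased word tokens), and nothing here talks to Elasticsearch.

```python
import re
from collections import Counter

def text_tokens(value):
    # Rough stand-in for the standard analyzer: lowercased word tokens
    return re.findall(r"\w+", value.lower())

bios = [
    "Dogs are the best pet.",
    "Cats are cute.",
    "Cats are cute.",
    "Cats are the greatest.",
]

# "filter" step: match on the analyzed "bio" field
matching = [bio for bio in bios if "cats" in text_tokens(bio)]

# "terms" step on "bio.keyword": the untouched original string is the bucket key
bio_buckets = Counter(matching).most_common()
print(len(matching), bio_buckets)
```

The match works on individual lowercased tokens, while the bucket keys keep the full original sentences, which is the same division of labour as bio versus bio.keyword.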

Autocomplete functionality using elastic search

I have an Elasticsearch index with the following documents and I want autocomplete functionality over the specified fields:
mapping: https://gist.github.com/anonymous/0609b1d110d91dceb9a90faa76d1d5d4
Use case:
My queries are prefixes, e.g. "sta", "star", "star w" ... "star war", with an additional filter such as tags = "science fiction". The queries could also match other fields like description or actors (in the cast field; note this is nested). I also want to know which field was matched.
I investigated two ways of doing this, but neither seems to address the use case above:
1) Suggester autocomplete:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-suggesters-completion.html
With this, it seems I have to add another field called "suggest" that replicates the data, which is not desirable.
2) using a prefix filter/query:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-prefix-filter.html
This gives the whole document back, not the exact matching terms.
Is there a clean way of achieving this? Please advise.
Don't create the mapping separately; insert the data directly into the index and Elasticsearch will create a default mapping for it. Use the query below for autocomplete.
GET /netflix/movie/_search
{
"query": {
"query_string": {
"query": "sta*"
}
}
}
I think the completion suggester would be the cleanest way, but if that is undesirable you could use aggregations on the name field.
This is a sample index (I am assuming from your question that you are using ES 1.7):
PUT netflix
{
"settings": {
"analysis": {
"analyzer": {
"prefix_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim",
"edge_filter"
]
},
"keyword_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim"
]
}
},
"filter": {
"edge_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
},
"mappings": {
"movie":{
"properties": {
"name":{
"type": "string",
"fields": {
"prefix":{
"type":"string",
"index_analyzer" : "prefix_analyzer",
"search_analyzer" : "keyword_analyzer"
},
"raw":{
"type": "string",
"analyzer": "keyword_analyzer"
}
}
},
"tags":{
"type": "string", "index": "not_analyzed"
}
}
}
}
}
Using multi-fields, the name field is analyzed in different ways. name.prefix uses the keyword tokenizer with an edge n-gram filter, so that the string star wars is broken into s, st, sta, etc. While searching, however, keyword_analyzer is used so that the search query does not get broken into multiple small tokens. name.raw is used for aggregation.
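As a rough illustration of what the two analyzers produce, here is a small pure-Python sketch. The function names mirror the analyzer names in the settings above, but the code is only an approximation written for this explanation; it does not use Elasticsearch.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    # All prefixes of the token between min_gram and max_gram characters long
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

def prefix_analyzer(value):
    # Index time: keyword tokenizer + lowercase + trim + edge n-gram filter
    return edge_ngrams(value.lower().strip())

def keyword_analyzer(value):
    # Search time: keyword tokenizer + lowercase + trim, so one token and no n-grams
    return [value.lower().strip()]

indexed_terms = set(prefix_analyzer("Star Wars"))

def matches(query):
    # A match query on name.prefix succeeds if the single search token was indexed
    return keyword_analyzer(query)[0] in indexed_terms

print(prefix_analyzer("Star Wars")[:4])
print(matches("sta"), matches("Star W"), matches("wars"))
```

Only prefixes of the whole title are indexed, so "sta" and "Star W" match while "wars" does not. That is a consequence of using the keyword tokenizer instead of a word-level tokenizer.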
The following query will give the top 10 suggestions.
GET netflix/movie/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"tags": "sci-fi"
}
},
"query": {
"match": {
"name.prefix": "sta"
}
}
}
},
"size": 0,
"aggs": {
"unique_movie_name": {
"terms": {
"field": "name.raw",
"size": 10
}
}
}
}
Results will be something like
"aggregations": {
"unique_movie_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "star trek",
"doc_count": 1
},
{
"key": "star wars",
"doc_count": 1
}
]
}
}
UPDATE :
I think you could use highlighting for this purpose. The highlight section gives you the whole word and the field it matched. You can also use inner hits, with highlighting inside them, to get the nested docs.
{
"query": {
"query_string": {
"query": "sta*"
}
},
"_source": false,
"highlight": {
"fields": {
"*": {}
}
}
}

Removing stopwords from basic Terms aggregation in Elasticsearch?

I'm a little new to Elasticsearch, but basically I have a single index called posts with multiple post documents that take the following form:
"post": {
"id": 123,
"message": "Some message"
}
I'm trying to get the most frequently occurring words in the message field across the entire index, with a simple Terms aggregation:
curl -XPOST 'localhost:9200/posts/_search?pretty' -d '
{
"aggs": {
"frequent_words": {
"terms": {
"field": "message"
}
}
}
}
'
Unfortunately, this aggregation includes stopwords, so I end up with a list of words like "and", "the", "then", etc. instead of more meaningful words.
I've tried applying an analyzer to exclude those stopwords, but to no avail:
curl -XPUT 'localhost:9200/posts/?pretty' -d '
{
"settings": {
"analysis": {
"analyzer": {
"standard": {
"type": "standard",
"stopwords": "_english_"
}
}
}
}
}'
Am I applying the analyzer correctly, or am I going about this the wrong way? Thanks!
I guess you forgot to set the analyzer on the message field of your type. Elasticsearch uses the indexed terms when aggregating, so stopwords will not show up in the aggregation if the field is analyzed correctly. You can check this link. I used the Sense plugin for Kibana to execute the following requests. Check the index and mapping creation requests:
PUT /posts
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "standard",
"stopwords": ["test", "testable"]
}
}
}
}
}
### Don't forget these lines
POST /posts/post/_mapping
{
"properties": {
"message": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
POST posts/post/1
{
"id": 1,
"message": "Some messages"
}
POST posts/post/2
{
"id": 2,
"message": "Some testable message"
}
POST posts/post/3
{
"id": 3,
"message": "Some test message"
}
POST /posts/_search
{
"aggs": {
"frequent_words": {
"terms": {
"field": "message"
}
}
}
}
This is my result set for this search request:
{
"hits": {
...
},
"aggregations": {
"frequent_words": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "some",
"doc_count": 3
},
{
"key": "message",
"doc_count": 2
},
{
"key": "messages",
"doc_count": 1
}
]
}
}
}
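The effect of the custom stopword list on the buckets can be reproduced with a small pure-Python sketch. It is a simplification under stated assumptions: my_analyzer here is a hypothetical imitation of the standard analyzer with the stopwords from the settings above, and no Elasticsearch call is made.

```python
import re
from collections import Counter

STOPWORDS = {"test", "testable"}  # same list as in the my_analyzer settings above

def my_analyzer(value):
    # Rough stand-in for the standard analyzer with a custom stopword list:
    # lowercase, split into word tokens, drop the stopwords
    return [t for t in re.findall(r"\w+", value.lower()) if t not in STOPWORDS]

messages = ["Some messages", "Some testable message", "Some test message"]

doc_counts = Counter()
for message in messages:
    # A document counts once per distinct term, like doc_count in a terms bucket
    doc_counts.update(set(my_analyzer(message)))

frequent_words = doc_counts.most_common()
print(frequent_words)
```

The stopwords are removed before anything is indexed, so they can never appear as bucket keys. This is why the analyzer has to be attached to the field mapping and not merely defined in the index settings.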
In the latest version, 5.5, the string type has been changed to text/keyword. I enabled stopwords for the field title and it works for search, meaning that if I search for "the", nothing is returned. But if I use the following for aggregation:
"field": "message_analyzed.keyword"
I get the stopwords in the aggregation buckets too.
Any suggestions are welcome.
Thanks

ElasticSearch: How do I aggregate whole string sentence by term of an analyzed field?

I have an analyzed field; let's call it "motto". I want to run a full-text search for "life" and aggregate the matches by count.
...
"query":{
"term":{
"motto":"life"
}
},
"aggs": {
"match_count": {
"terms": { "field": "motto" }
}
}
...
The result I want is:
...
{
...
"buckets": [
{
"key":"life is good",
"doc_count":3
}
]
...
}
...
The actual result is:
{
...
"buckets": [
{
"key": "life",
"doc_count": 3
},
{
"key": "good",
"doc_count": 3
},
{
"key": "is",
"doc_count": 3
}
]
...
}
How do I aggregate them the way I want?
What you can do is create a not_analyzed sub-field of the motto field, like this:
curl -XPUT localhost:9200/your_index/your_type/_mapping -d '{
"your_type": {
"properties": {
"motto": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}'
When done, you need to re-index your data in order to populate the motto.raw sub-field.
And finally, you'll be able to run a query like this, i.e. search on motto but aggregate on motto.raw:
...
"query":{
"term":{
"motto":"life"
}
},
"aggs": {
"match_count": {
"terms": { "field": "motto.raw" }
}
}
...

Elasticsearch Query aggregated by unique substrings (email domain)

I have an Elasticsearch query that queries an index and then aggregates on a specific field, sender_not_analyzed. I use a terms aggregation on that field, which returns buckets for the top "senders". My query is currently:
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "sender_not_analyzed"
}
}
}
}
which returns buckets that look like:
"aggregations": {
"sender-stats": {
"buckets": [
{
"key": "<Mike <mike#fizzbuzz.com>#MISSING_DOMAIN>",
"doc_count": 5017
},
{
"key": "jon.doe#foo.com",
"doc_count": 3963
},
{
"key": "jane.doe#foo.com",
"doc_count": 2857
},
{
"key": "jon.doe#bar.com",
"doc_count":1544
}
How can I write an aggregation such that I get a single bucket for each unique email domain, e.g. foo.com would have a doc_count of (3963 + 2857) 6820? Can I accomplish this with a regex aggregation, or do I need to write some kind of custom analyzer to split the string from the # to the end of the string?
This is pretty late, but I think this can be done using a pattern_replace char filter that captures the domain name with a regex. This is my setup:
POST email_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"char_filter": [
"domain"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"domain": {
"type": "pattern_replace",
"pattern": ".*#(.*)",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"domain": {
"type": "string",
"analyzer": "my_custom_analyzer"
},
"sender_not_analyzed": {
"type": "string",
"index": "not_analyzed",
"copy_to": "domain"
}
}
}
}
}
Here the domain char filter captures the domain name. We need the keyword tokenizer to keep the domain as it is. I am using the lowercase filter, but it is up to you whether to use it. The copy_to parameter copies the value of sender_not_analyzed to the domain field; the _source field won't be modified to include this value, but we can still query it.
GET email_index/_search
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "domain"
}
}
}
}
This will give you the desired result.
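To check what the pattern_replace step does to each sender, here is a minimal pure-Python sketch using the same regex. It is an approximation for illustration only: domain_analyzer is a hypothetical helper, the asciifolding filter is left out, and the counts are the doc_count values from the buckets in the question.

```python
import re
from collections import Counter

def domain_analyzer(sender):
    # pattern_replace char filter: keep only what follows the last '#'
    domain = re.sub(r".*#(.*)", r"\1", sender)
    # keyword tokenizer + lowercase filter (asciifolding is omitted in this sketch)
    return domain.lower()

# doc_count per sender, taken from the buckets in the question
sender_counts = {
    "jon.doe#foo.com": 3963,
    "jane.doe#foo.com": 2857,
    "jon.doe#bar.com": 1544,
}

domain_counts = Counter()
for sender, count in sender_counts.items():
    domain_counts[domain_analyzer(sender)] += count

print(domain_counts.most_common())
```

Because the leading .* is greedy, a value containing several # characters keeps only the part after the last one.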
