Elasticsearch Completion Suggester - How to discard non-letter characters during indexing? - elasticsearch

Here's my index:
PUT autocomplete-food
{
"mappings": {
"properties": {
"suggest": {
"type": "completion"
}
}
}
}
Adding a document to this index:
PUT autocomplete-food/_doc/1?refresh
{
"suggest": [
{
"input": "Starbucks",
"weight": 10
},
{
"input": ["+(Coffee","Latte","Flat White"],
"weight": 5
}
]
}
Search query for suggestions:
POST autocomplete-food/_search?pretty
{
"suggest": {
"suggest": {
"prefix": "coff",
"completion": {
"field": "suggest"
}
}
}
}
Search result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"suggest" : [
{
"text" : "coff",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "+(Coffee",
"_index" : "autocomplete-food",
"_type" : "_doc",
"_id" : "1",
"_score" : 5.0,
"_source" : {
"suggest" : [
{
"input" : "Starbucks",
"weight" : 10
},
{
"input" : [
"+(Coffee",
"Latte",
"Flat White"
],
"weight" : 5
}
]
}
}
]
}
]
}
}
Notice the "text" value is "+(Coffee". I don't want to index/get the non-letter characters. I was expecting that as the default analyzer is "simple" analyzer, this won't happen. But the "input" field in the response also contains the special characters.
How do I achieve discarding the non-letter characters?
P.S - Elasticsearch version 7.17
I tried changing the analyzer from default (simple) one to standard. But it did not help.

If you look at the output of the simple analyzer for the input +(Coffee you can find this:
POST _analyze
{
"analyzer": "simple",
"text": "+(Coffee"
}
Results =>
{
"tokens" : [
{
"token" : "coffee",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 0
}
]
}
As you can see, the simple analyzer doesn't index the non-letter characters, which is why you can find the +(Coffee suggestion by inputting just coff otherwise it would not work.
Maybe there's a misconception about how analyzers work, because you cannot expect them to modify the content of your documents.
Regarding suggesters, whatever you add as input are the suggestions you'd like to be returned, so you're in charge of making them look like suggestions you'd like to be returned, but the analyzer will not make those modifications for you, only index those terms in the suggester's FST so you can find them.

Related

Why does elastic search wildcard query return no results?

Query #1 in Kibana returns results, however Query #2 returns no results. I search for only "bob" and get results, but when searching for "bob smith", no results, even though "Bob Smith" exists in the index. Any reason why?
Query #1: returns results
GET people/_search
{
"query": {
"wildcard" : {
"name" : "*bob*"
}
}
}
Results:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 23,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "people",
"_type" : "_doc",
"_id" : "xxxxx",
"_score" : 1.0,
"_source" : {
"name" : "Bob Smith",
...
Query #2: returns nothing.. why(?)
GET people/_search
{
"query": {
"wildcard" : {
"name" : "*bob* *smith*"
}
}
}
results...nothing
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Look like the reason of the empty result is your index mapping. If you use "text" type field, you actually search in the inverted index, mean you search in the token "bob" and token "smith" (standard analyzer) and not in the "Bob Smith". If you want to search in "Bob Smith" as one token, you need to use "keyword" type (maybe with lowercase normalizer, if you want to use not key sensetive search)
For example:
PUT test
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"char_filter": [],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "lowercase_normalizer"
}
}
}
}
PUT test/_doc/1
{
"name" : "Bob Smith"
}
GET test/_search
{
"query": {
"wildcard": {
"name": "*bob* *Smith*"
}
}
}

elasticsearch does not return expected returns

I'm complete new on elasticsearch. I tried search API but it's not returning what I expected
What I did
POST /test/_doc/1
{
"name": "Hello World"
}
GET /test/_doc/1
Response:
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_version" : 5,
"_seq_no" : 28,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "Hello World"
}
}
GET /test/_mapping
Response:
{
"test" : {
"mappings" : {
"properties" : {
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"query" : {
"properties" : {
"term" : {
"properties" : {
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}
}
GET /test/_search
{
"query": {
"term": {
"name": "Hello"
}
}
}:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
GET /test/_search
{
"query": {
"term": {
"name": "Hello World"
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
My elasticsearch version is 7.3.2
The last two search should return me document 1, is that correct? Why does it hit nothing?
Problem is that you have term queries. Term queries are not analysed. Hence Hello didn't match the term hello in your index. Note the case difference.
Unlike full-text queries, term-level queries do not analyze search terms. Instead, term-level queries match the exact terms stored in a field.
Reference
Whereas match queries analyse the search term also.
{
"query": {
"match": {
"name": "Hello"
}
}
}
You can use _analyze to check how your terms are indexed.

elasticsearch match_phrase query for exact sub-string search

I used match_phrase query for search full-text matching.
But it did not work as I thought.
Query:
POST /_search
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"browsing_url": "/critical-illness"
}
}
],
"minimum_should_match": 1
}
}
}
Results:
"hits" : [
{
"_source" : {
"browsing_url" : "https://www.google.com/url?q=https://industrytoday.co.uk/market-research-industry-today/global-critical-illness-commercial-insurance-market-to-witness-a-pronounce-growth-during-2020-2025&usg=afqjcneelu0qvjfusnfjjte1wx0gorqv5q"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness&tbm=nws"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness+-insurance%3f"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness%3f"
}
}
]
expectation:
To only get results where the given string is an exact sub-string in the field. For example:
https://www.example.com/critical-illness OR
https://www.example.com/critical-illness-insurance
Mapping:
"browsing_url": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
The results are not what I expected. I expected to get the results exactly as the search /critical-illness as a substring of the stored text.
The reason you're seeing unexpected results is because both your search query, and the field itself, are being run through an analyzer. Analyzers will break down text into a list of individual terms that can be searched on. Here's an example using the _analyze endpoint:
GET _analyze
{
"analyzer": "standard",
"text": "example.com/critical-illness"
}
{
"tokens" : [
{
"token" : "example.com",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "critical",
"start_offset" : 12,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "illness",
"start_offset" : 21,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
So while your documents true value is example.com/critical-illness, behind the scenes Elasticsearch will only use this list of tokens for matches. The same thing goes for your search query since you're using match_phrase, which tokenizes the phrase passed in. The end result is Elasticsearch trying to match the token list ["critical", "illness"] against your documents token lists.
Most of the time the standard analyzer does a good job of removing unnecessary tokens, however in your case you care about characters like / since you want to match against them. One way to solve this is to use a different analyzer like a reversed path hierarchy analyzer. Below is an example of how to configure this analyzer and use it for your browsing_url field:
PUT /browse_history
{
"settings": {
"analysis": {
"analyzer": {
"url_analyzer": {
"tokenizer": "url_tokenizer"
}
},
"tokenizer": {
"url_tokenizer": {
"type": "path_hierarchy",
"delimiter": "/",
"reverse": true
}
}
}
},
"mappings": {
"properties": {
"browsing_url": {
"type": "text",
"norms": false,
"analyzer": "url_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Now if you analyze a URL you'll now see URL paths kept whole:
GET browse_history/_analyze
{
"analyzer": "url_analyzer",
"text": "example.com/critical-illness?src=blah"
}
{
"tokens" : [
{
"token" : "example.com/critical-illness?src=blah",
"start_offset" : 0,
"end_offset" : 37,
"type" : "word",
"position" : 0
},
{
"token" : "critical-illness?src=blah",
"start_offset" : 12,
"end_offset" : 37,
"type" : "word",
"position" : 0
}
]
}
This lets you do a match_phrase_prefix to find all documents with URLs that contain a critical-illness path:
POST /browse_history/_search
{
"query": {
"match_phrase_prefix": {
"browsing_url": "critical-illness"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.7896894,
"hits" : [
{
"_index" : "browse_history",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.7896894,
"_source" : {
"browsing_url" : "https://www.example.com/critical-illness"
}
}
]
}
}
EDIT:
Previous answer before revision was to use the keyword field and a regexp, however this is a pretty costly query to make.
POST /browse_history/_search
{
"query": {
"regexp": {
"browsing_url.keyword": ".*/critical-illness"
}
}
}

enabled fielddata on text field in ElasticSearch but aggregation is not working

According to the documentation you can run ElasticSearch aggregations on fields that are type keyword or not a text field or which have fielddata set to true in the index mapping.
I am trying to count city_names in an nginx log. It works fine with the int field result. But it does not work with the field city_name even when I updated the index mapping for that to put fielddata=true. The should have been not required as it was of type keyword.
To say it does not work means that:
"aggregations" : {
"cities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
Here is the field mapping:
"city_name" : {
"type" : "text",
"fielddata" : true
},
And here is the aggression query:
curl -XGET --user $pwd --header 'Content-Type: application/json' https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash/_search?pretty -d '{
"aggs" : {
"cities": {
"terms" : { "field": "city_name"}
}
}
}'
If you don't get any error when executing your search it seems that is more like a problem with the data. Are you sure you have, at least, one document with the field city_name filled?
I tried to reproduce your issue with ElasticSearch 6.6.2.
I created an index
PUT cities
{
"mappings": {
"city": {
"dynamic": "true",
"properties": {
"id": {
"type": "long"
},
"city_name": {
"type": "text",
"fielddata": true
}
}
}
}
}
I added one document without the city_name
PUT cities/city/1
{
"id": "1"
}
When i performed the search:
GET cities/_search
{
"aggs": {
"cities": {
"terms" : { "field": "city_name"}
}
}
}
I got no buckets in the cities aggregation. But when I added one document with the city name filled:
PUT cities/city/2
{
"id": "2",
"city_name": "London"
}
I got the expected result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "cities",
"_type" : "city",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : "2",
"city_name" : "london"
}
},
{
"_index" : "cities",
"_type" : "city",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : "1"
}
}
]
},
"aggregations" : {
"cities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "london",
"doc_count" : 1
}
]
}
}
}

Why isn't my elastic search query returning the text analyzed by english analyzer?

I have an index named test_blocks
{
"test_blocks" : {
"aliases" : { },
"mappings" : {
"block" : {
"dynamic" : "false",
"properties" : {
"content" : {
"type" : "string",
"fields" : {
"content_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"id" : {
"type" : "long"
},
"title" : {
"type" : "string",
"fields" : {
"title_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"user_id" : {
"type" : "long"
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1438642440687",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"version" : {
"created" : "1070099"
},
"uuid" : "45vkIigXSCyvHN6g-w5kkg"
}
},
"warmers" : { }
}
}
When I do a search for killing, a word in the content, the search results return as expected.
http://localhost:9200/test_blocks/_search?q=killing&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.07431685,
"hits" : [ {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "218",
"_score" : 0.07431685,
"_source":{"block":{"id":218,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":82}}
}, {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "219",
"_score" : 0.07431685,
"_source":{"block":{"id":219,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":83}}
} ]
}
}
However given that I have an english analyzer for the content field (content_en), I would have expected it to return me the same document for the query kill. But it doesn't. I get 0 hits.
http://localhost:9200/test_blocks/_search?q=kill&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
My understanding through this analyze query is that "killing" would have got broken down in to "kill"
http://localhost:9200/_analyze?analyzer=english&text=killing
{
"tokens" : [ {
"token" : "kill",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
So why isn't the query "kill" match that document ? Are my mappings incorrect or is it my search that is incorrect?
I am using elasticsearch v1.7.0
You need to use fuzzysearch (some introduction available here):
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"match": {
"title": {
"query": "kill",
"fuzziness": 2,
"prefix_length": 1
}
}
}
}'
UPD. Having content_en field with content which was given by stemmer, it makes sense to actually query that field:
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "kill",
"fields": ["block.title", "block.title.title_en"]
}
}
}'
The following queries http://localhost:9200/_search?q=kill. ,http://localhost:9200/_search?q=kill. end up searching across
_all field .
_all field uses the default analyzer which unless overridden happens to be standard analyzer and not english analyzer .
For making the above query work you would need to add english analyzer to _all field and re-index
Example:
{
"mappings": {
"block": {
"_all" : {"analyzer" : "english"}
}
}
Also would point out the mapping in OP doesn't seem consistent with the document structure. As #EugZol pointed our the content is within block object so the mapping should be something on these lines :
{
"mappings": {
"block": {
"properties": {
"block": {
"properties": {
"content": {
"type": "string",
"analyzer": "standard",
"fields": {
"content_en": {
"type": "string",
"analyzer": "english"
}
}
},
"id": {
"type": "long"
},
"title": {
"type": "string",
"analyzer": "standard",
"fields": {
"title_en": {
"type": "string",
"analyzer": "english"
}
}
},
"user_id": {
"type": "long"
}
}
}
}
}
}
}

Resources