Elasticsearch match_phrase query for exact sub-string search

I used a match_phrase query for full-text matching, but it did not work as I expected.
Query:
POST /_search
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"browsing_url": "/critical-illness"
}
}
],
"minimum_should_match": 1
}
}
}
Results:
"hits" : [
{
"_source" : {
"browsing_url" : "https://www.google.com/url?q=https://industrytoday.co.uk/market-research-industry-today/global-critical-illness-commercial-insurance-market-to-witness-a-pronounce-growth-during-2020-2025&usg=afqjcneelu0qvjfusnfjjte1wx0gorqv5q"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness&tbm=nws"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness+-insurance%3f"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness%3f"
}
}
]
Expectation:
To only get results where the given string is an exact sub-string in the field. For example:
https://www.example.com/critical-illness OR
https://www.example.com/critical-illness-insurance
Mapping:
"browsing_url": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
The results are not what I expected; I only want documents where the search string /critical-illness appears as an exact substring of the stored value.

The reason you're seeing unexpected results is that both your search query and the field itself are being run through an analyzer. Analyzers break text down into a list of individual terms that can be searched on. Here's an example using the _analyze endpoint:
GET _analyze
{
"analyzer": "standard",
"text": "example.com/critical-illness"
}
{
"tokens" : [
{
"token" : "example.com",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "critical",
"start_offset" : 12,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "illness",
"start_offset" : 21,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
So while your document's true value is example.com/critical-illness, behind the scenes Elasticsearch will only use this list of tokens for matching. The same goes for your search query, since you're using match_phrase, which tokenizes the phrase passed in. The end result is Elasticsearch trying to match the token list ["critical", "illness"] against your documents' token lists.
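You can verify this by running the query string itself through the same analyzer; with the standard analyzer it yields just the tokens critical and illness, with the leading / stripped:
GET _analyze
{
  "analyzer": "standard",
  "text": "/critical-illness"
}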
Most of the time the standard analyzer does a good job of removing unnecessary tokens; however, in your case you care about characters like / since you want to match against them. One way to solve this is to use a different analyzer, such as a reversed path hierarchy analyzer. Below is an example of how to configure this analyzer and use it for your browsing_url field:
PUT /browse_history
{
"settings": {
"analysis": {
"analyzer": {
"url_analyzer": {
"tokenizer": "url_tokenizer"
}
},
"tokenizer": {
"url_tokenizer": {
"type": "path_hierarchy",
"delimiter": "/",
"reverse": true
}
}
}
},
"mappings": {
"properties": {
"browsing_url": {
"type": "text",
"norms": false,
"analyzer": "url_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Now if you analyze a URL, you'll see the URL paths kept whole:
GET browse_history/_analyze
{
"analyzer": "url_analyzer",
"text": "example.com/critical-illness?src=blah"
}
{
"tokens" : [
{
"token" : "example.com/critical-illness?src=blah",
"start_offset" : 0,
"end_offset" : 37,
"type" : "word",
"position" : 0
},
{
"token" : "critical-illness?src=blah",
"start_offset" : 12,
"end_offset" : 37,
"type" : "word",
"position" : 0
}
]
}
This lets you do a match_phrase_prefix to find all documents with URLs that contain a critical-illness path:
POST /browse_history/_search
{
"query": {
"match_phrase_prefix": {
"browsing_url": "critical-illness"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.7896894,
"hits" : [
{
"_index" : "browse_history",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.7896894,
"_source" : {
"browsing_url" : "https://www.example.com/critical-illness"
}
}
]
}
}
EDIT:
The previous version of this answer used the keyword field and a regexp query; however, that is a pretty costly query to run.
POST /browse_history/_search
{
"query": {
"regexp": {
"browsing_url.keyword": ".*/critical-illness"
}
}
}

Related

Elasticsearch Completion Suggester - How to discard non-letter characters during indexing?

Here's my index:
PUT autocomplete-food
{
"mappings": {
"properties": {
"suggest": {
"type": "completion"
}
}
}
}
Adding a document to this index:
PUT autocomplete-food/_doc/1?refresh
{
"suggest": [
{
"input": "Starbucks",
"weight": 10
},
{
"input": ["+(Coffee","Latte","Flat White"],
"weight": 5
}
]
}
Search query for suggestions:
POST autocomplete-food/_search?pretty
{
"suggest": {
"suggest": {
"prefix": "coff",
"completion": {
"field": "suggest"
}
}
}
}
Search result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"suggest" : [
{
"text" : "coff",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "+(Coffee",
"_index" : "autocomplete-food",
"_type" : "_doc",
"_id" : "1",
"_score" : 5.0,
"_source" : {
"suggest" : [
{
"input" : "Starbucks",
"weight" : 10
},
{
"input" : [
"+(Coffee",
"Latte",
"Flat White"
],
"weight" : 5
}
]
}
}
]
}
]
}
}
Notice the "text" value is "+(Coffee". I don't want to index/get the non-letter characters. I was expecting that as the default analyzer is "simple" analyzer, this won't happen. But the "input" field in the response also contains the special characters.
How do I achieve discarding the non-letter characters?
P.S - Elasticsearch version 7.17
I tried changing the analyzer from default (simple) one to standard. But it did not help.
If you look at the output of the simple analyzer for the input +(Coffee, you can see this:
POST _analyze
{
"analyzer": "simple",
"text": "+(Coffee"
}
Results =>
{
"tokens" : [
{
"token" : "coffee",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 0
}
]
}
As you can see, the simple analyzer doesn't index the non-letter characters, which is why you can find the +(Coffee suggestion by typing just coff; otherwise it would not work.
Maybe there's a misconception about how analyzers work: you cannot expect them to modify the content of your documents.
Regarding suggesters, whatever you add as input is exactly what gets returned as a suggestion, so you're in charge of making the inputs look the way you want the suggestions to look. The analyzer will not make those modifications for you; it only indexes those terms in the suggester's FST so they can be found.
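In other words, if you don't want +( to show up in the returned suggestions, strip those characters from the input before indexing. For example, a hypothetical cleaned-up version of the document above:
PUT autocomplete-food/_doc/1?refresh
{
  "suggest": [
    {
      "input": "Starbucks",
      "weight": 10
    },
    {
      "input": ["Coffee", "Latte", "Flat White"],
      "weight": 5
    }
  ]
}
The prefix coff would then return Coffee as the suggestion text.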

Elasticsearch English stemming not working correctly

I've added an english stemmer analyzer and filter to our query, but it doesn't seem to be working correctly with plurals that stem from 'y' => 'ies'.
For example, when I search for 'raspberry' the results never include 'raspberries', and so on.
I've tried both english and minimal_english, but I still get the same result.
Here's the analyzer and settings:
analysis: {
  analyzer: {
    custom_analyzer: {
      type: "custom",
      tokenizer: "standard",
      filter: ["lowercase", "english_stemmer"],
    },
  },
  filter: {
    english_stemmer: {
      type: "stemmer",
      language: "english",
    },
  },
}
What am I doing wrong?
Though english should work for the example you mentioned, you can also go for porter_stem instead, which is equivalent to the stemmer filter with language english.
porter_stem in action:
POST /_analyze
{
"tokenizer": "standard",
"filter": ["porter_stem"],
"text": ["raspberry", "raspberries"]
}
Response of the above request:
{
"tokens" : [
{
"token" : "raspberri",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "raspberri",
"start_offset" : 10,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 101
}
]
}
You can see that both raspberry and raspberries get tokenized to raspberri, so searching for raspberry will also match raspberries and vice versa.
Make sure that the field you are indexing into and searching against has its analyzer defined as custom_analyzer (according to the settings you stated in your question).
Working example:
Mapping:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"english_stemmer"
]
}
},
"filter": {
"english_stemmer": {
"type": "stemmer",
"language": "english"
}
}
}
},
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
Indexing:
PUT test/_doc/1
{
"field1": "raspberries"
}
PUT test/_doc/2
{
"field1": "raspberry"
}
Search:
GET test/_search
{
"query": {
"match": {
"field1": {
"query": "raspberry"
}
}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.18232156,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.18232156,
"_source" : {
"field1" : "raspberries"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.18232156,
"_source" : {
"field1" : "raspberry"
}
}
]
}
}
You can also have a look at another stemmer, kstem.
Unfortunately, porter_stem doesn't always work, e.g. for virus and viruses. Someone suggested snowball, but I haven't tried it yet...
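Both kstem and snowball can be tried directly through the _analyze API before committing to a re-index. This is a quick sketch (swap in "kstem" as a plain filter name to compare); whether either one actually unifies virus and viruses is worth checking on your version:
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "snowball", "language": "English" }
  ],
  "text": ["virus", "viruses"]
}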

How to do an exact match query in ElasticSearch?

I want to do an exact match query against an Elasticsearch index. I have the following data:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.21110919,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.21110919,
"_source" : {
"id" : 1,
"name" : "test"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.160443,
"_source" : {
"id" : 2,
"name" : "test two"
}
}
]
}
}
I want to query the field name. I am searching for the name test, but it returns both documents. The expected result is only document 1.
Mapping is as follows -
{
"test" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
I tried the following -
GET /test/_search
{
"query": {
"bool": {
"must": {
"term" : {
"name": "test"
}
}
}
}
}
GET /test/_search
{
"query": {
"match": {
"name": "test"
}
}
}
In addition to the link to the answer I provided in a comment, I would suggest defining the name field as:
{
"name":{
"type": "text",
"fields":{
"keyword":{
"type": "keyword"
}
}
}
}
and then query the field name.keyword whenever you require an exact (case-sensitive) match, and name if you want a partial match, such as searching on the first name only.
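For instance, with the keyword sub-field in place (your current mapping already has one), an exact match is a term query against it; a quick sketch:
GET /test/_search
{
  "query": {
    "term": {
      "name.keyword": "test"
    }
  }
}
This should return only document 1, since document 2's keyword value is the whole string "test two".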
Looks like you are using the text datatype on your name field, which splits test two into two tokens, test and two, hence it matches your search query test: the match query is analyzed with the same analyzer, and the resulting tokens are matched against the document tokens present in the inverted index.
Solution using your example
Index definition:
{
"mappings": {
"properties": {
"name": {
"type": "keyword" --> note use of `keyword` type
}
}
}
}
Index your sample docs:
{
"name" : "test two"
}
{
"name" : "test"
}
Search query (same as yours):
{
"query": {
"match": {
"name": "test"
}
}
}
Search results, as you expect:
"hits": [
{
"_index": "so_key",
"_type": "_doc",
"_id": "1",
"_score": 0.6931471,
"_source": {
"name": "test"
}
}
]
Important note: you can use the _analyze API to see how your data is indexed. For example, using standard (the default analyzer) on the text field:
POST _analyze
{
"text": "test two",
"analyzer" : "standard" --> Change analyzer to keyword and see diff
}
Tokens
{
"tokens": [
{
"token": "test",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "two",
"start_offset": 5,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
}
]
}

Elasticsearch custom analyzer not working

I am using Elasticsearch as my search engine. I am now trying to create a custom analyzer that makes the field value just lowercase. The following is my code:
Create index and mapping
Create an index with a custom analyzer named test_lowercase:
curl -XPUT 'localhost:9200/test/' -d '{
"settings": {
"analysis": {
"analyzer": {
"test_lowercase": {
"type": "pattern",
"pattern": "^.*$"
}
}
}
}
}'
Create a mapping using the test_lowercase analyzer for the address field:
curl -XPUT 'localhost:9200/test/_mapping/Users' -d '{
"Users": {
"properties": {
"name": {
"type": "string"
},
"address": {
"type": "string",
"analyzer": "test_lowercase"
}
}
}
}'
To verify that the test_lowercase analyzer works:
curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
As we can see, the string 'Beijing China' is indexed as a single lowercased whole term 'beijing china', so the test_lowercase analyzer works fine.
To verify if the field 'address' is using the lowercase analyzer:
curl -XGET 'http://localhost:9200/test/_analyze?field=address&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
curl -XGET 'http://localhost:9200/test/_analyze?field=name&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "beijing",
"start_offset" : 1,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "china",
"start_offset" : 9,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
As we can see, for the same string 'Beijing China', analyzing with field=address creates a single token 'beijing china', while with field=name we get two tokens, 'beijing' and 'china'. So it seems the address field is indeed using my custom analyzer 'test_lowercase'.
Insert a document into the test index to see if the analyzer works for documents:
curl -XPUT 'localhost:9200/test/Users/12345?pretty' -d '{"name": "Jinshui Tang", "address": "Beijing China"}'
Unfortunately, although the document was inserted successfully, the address field has not been analyzed as expected. I can't find it using the following wildcard query:
curl -XGET 'http://localhost:9200/test/Users/_search?pretty' -d '
{
"query": {
"wildcard": {
"address": "*beijing ch*"
}
}
}'
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
List all terms analyzed for the document:
I ran the following command to see all the terms of the document, and I found that 'Beijing China' is not in the term vectors at all.
curl -XGET 'http://localhost:9200/test/Users/12345/_termvector?fields=*&pretty'
{
"_index" : "test",
"_type" : "Users",
"_id" : "12345",
"_version" : 3,
"found" : true,
"took" : 2,
"term_vectors" : {
"name" : {
"field_statistics" : {
"sum_doc_freq" : 2,
"doc_count" : 1,
"sum_ttf" : 2
},
"terms" : {
"jinshui" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
} ]
},
"tang" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 1,
"start_offset" : 8,
"end_offset" : 12
} ]
}
}
}
}
}
We can see that the name is correctly analyzed into the two terms 'jinshui' and 'tang', but the address terms are missing entirely.
Can anyone please help? Is there anything missing?
Thanks a lot!
To lowercase the text you don't need a pattern analyzer; its pattern defines the token separators rather than the tokens themselves, so a pattern that matches the entire string leaves nothing to index, which is why the address terms were lost. Use something like this instead:
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"test_lowercase": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
}
PUT /test/_mapping/Users
{
"Users": {
"properties": {
"name": {
"type": "string"
},
"address": {
"type": "string",
"analyzer": "test_lowercase"
}
}
}
}
PUT /test/Users/12345
{"name": "Jinshui Tang", "address": "Beijing China"}
And to verify you did the right thing, use this:
GET /test/Users/_search
{
"fielddata_fields": ["name", "address"]
}
And you will see exactly how Elasticsearch is indexing your data:
"fields": {
"name": [
"jinshui",
"tang"
],
"address": [
"beijing",
"china"
]
}
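With the address now indexed as a single lowercased token, the wildcard query from the question should match; a sketch reusing the same request:
GET /test/Users/_search
{
  "query": {
    "wildcard": {
      "address": "*beijing ch*"
    }
  }
}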

Why isn't my Elasticsearch query returning the text analyzed by the english analyzer?

I have an index named test_blocks
{
"test_blocks" : {
"aliases" : { },
"mappings" : {
"block" : {
"dynamic" : "false",
"properties" : {
"content" : {
"type" : "string",
"fields" : {
"content_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"id" : {
"type" : "long"
},
"title" : {
"type" : "string",
"fields" : {
"title_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"user_id" : {
"type" : "long"
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1438642440687",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"version" : {
"created" : "1070099"
},
"uuid" : "45vkIigXSCyvHN6g-w5kkg"
}
},
"warmers" : { }
}
}
When I do a search for killing, a word in the content, the search results return as expected.
http://localhost:9200/test_blocks/_search?q=killing&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.07431685,
"hits" : [ {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "218",
"_score" : 0.07431685,
"_source":{"block":{"id":218,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":82}}
}, {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "219",
"_score" : 0.07431685,
"_source":{"block":{"id":219,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":83}}
} ]
}
}
However, given that I have an english analyzer for the content field (content_en), I would have expected it to also return the same documents for the query kill. But it doesn't; I get 0 hits.
http://localhost:9200/test_blocks/_search?q=kill&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
My understanding from this analyze query is that "killing" would have been broken down into "kill":
http://localhost:9200/_analyze?analyzer=english&text=killing
{
"tokens" : [ {
"token" : "kill",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
So why doesn't the query "kill" match that document? Are my mappings incorrect, or is my search incorrect?
I am using Elasticsearch v1.7.0.
You need to use fuzzy search (some introduction available here):
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"match": {
"title": {
"query": "kill",
"fuzziness": 2,
"prefix_length": 1
}
}
}
}'
UPD: Since the content_en field holds the content produced by the stemmer, it makes sense to actually query that field:
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "kill",
"fields": ["block.title", "block.title.title_en"]
}
}
}'
A query like http://localhost:9200/_search?q=kill ends up searching across the _all field.
The _all field uses the default analyzer, which, unless overridden, is the standard analyzer and not the english analyzer.
To make the above query work, you would need to add the english analyzer to the _all field and re-index.
Example:
{
  "mappings": {
    "block": {
      "_all": { "analyzer": "english" }
    }
  }
}
I would also point out that the mapping in the OP doesn't seem consistent with the document structure. As @EugZol pointed out, the content is within a block object, so the mapping should be something along these lines:
{
"mappings": {
"block": {
"properties": {
"block": {
"properties": {
"content": {
"type": "string",
"analyzer": "standard",
"fields": {
"content_en": {
"type": "string",
"analyzer": "english"
}
}
},
"id": {
"type": "long"
},
"title": {
"type": "string",
"analyzer": "standard",
"fields": {
"title_en": {
"type": "string",
"analyzer": "english"
}
}
},
"user_id": {
"type": "long"
}
}
}
}
}
}
}
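As a follow-up, once documents are re-indexed against a mapping like the one above, querying the stemmed sub-field directly should match the document. This is a sketch assuming that corrected mapping (the block.content.content_en path comes from it):
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
  "query": {
    "match": {
      "block.content.content_en": "kill"
    }
  }
}'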
