Matching multiple terms using match_phrase in Elasticsearch

I'm trying to fetch two documents that match the searched params; searching for each document separately works fine.
The query:
{
"query":{
"bool":{
"should":[
{
"match_phrase":{
"email":"elpaso"
}
},
{
"match_phrase":{
"email":"walker"
}
}
]
}
}
}
I'm expecting to retrieve both documents that have these words in their email field, but the query only returns the first one (elpaso).
Is this an issue related to the index mapping? I'm using type text for this field.
Is there any concept I'm missing?
Index mapping:
{
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"name":{
"type": "text"
},
"email":{
"type" : "text"
}
}
}
}
Sample data:
{
"id":"4a43f351-7b62-42f2-9b32-9832465d271f",
"name":"Walker, Gary (Mr.) .",
"email":"walkergrym#mail.com"
}
{
"id":"1fc18c05-da40-4607-a901-3d78c523cea6",
"name":"Texas Chiropractic Association P.A.C.",
"email":"txchiro#mail.com"
}
{
"id":"9a2323f4-e008-45f0-9f7f-11a1f4439042",
"name":"El Paso Energy Corp. PAC",
"email":"elpaso#mail.com"
}
I also noticed that if I use elpaso and txchiro instead of walker, the query works as expected!
I noticed that the issue happens when I use only part of the field. If I search by the exact, entire email address, everything works fine.
Is this expected from match_phrase?

You are not getting any result for walker because Elasticsearch uses the standard analyzer when no analyzer is specified, which tokenizes walkergrym#mail.com as follows:
GET /_analyze
{
"analyzer" : "standard",
"text" : "walkergrym#mail.com"
}
The following tokens are generated:
{
"tokens": [
{
"token": "walkergrym",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "mail.com",
"start_offset": 11,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Since there is no token for walker, you are not getting "walkergrym#mail.com" in your search results.
For "txchiro#mail.com", the tokens generated are txchiro and mail.com, and for "elpaso#mail.com" they are elpaso and mail.com, which is why those two searches work.
You can use the edge_ngram tokenizer to achieve your required result.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 6,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"email": {
"type": "text",
"analyzer": "my_analyzer"
},
"id": {
"type": "keyword"
},
"name": {
"type": "text"
}
}
}
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"match": {
"email": "elpaso"
}
},
{
"match": {
"email": "walker"
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "66907434",
"_type": "_doc",
"_id": "1",
"_score": 3.9233165,
"_source": {
"id": "4a43f351-7b62-42f2-9b32-9832465d271f",
"name": "Walker, Gary (Mr.) .",
"email": "walkergrym#mail.com"
}
},
{
"_index": "66907434",
"_type": "_doc",
"_id": "3",
"_score": 3.9233165,
"_source": {
"id": "9a2323f4-e008-45f0-9f7f-11a1f4439042",
"name": "El Paso Energy Corp. PAC",
"email": "elpaso#mail.com"
}
}
]
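To verify why walker now matches, you can run the custom analyzer through the Analyze API (a quick check; run it against whatever index you created the mapping under):
GET <your_index_name>/_analyze
{
"analyzer": "my_analyzer",
"text": "walkergrym#mail.com"
}
This should emit the tokens wal, walk, walke, walker, mai, mail, and com, so a match query for walker now has a token to hit.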

Related

Why is the shingle token filter with a custom analyzer not yielding expected results?

Hi, here are my index details:
PUT shingle_test
{
"settings": {
"analysis": {
"analyzer": {
"evolutionAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"custom_shingle"
]
}
},
"filter": {
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "10",
"output_unigrams": false
}
}
}
},
"mappings": {
"legacy" : {
"properties": {
"name": {
"type": "text",
"fields": {
"shingles": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "evolutionAnalyzer"
},
"as_is": {
"type": "keyword"
}
},
"analyzer": "standard"
}
}
}
}
}
Added 2 docs
PUT shingle_test/legacy/1
{
"name": "Chandni Chowk 2 Banglore"
}
PUT shingle_test/legacy/2
{
"name": "Chandni Chowk"
}
Nothing is returned if I do this:
GET shingle_test/_search
{
"query": {
"match": {
"name": {
"query": "Chandni Chowk",
"analyzer": "evolutionAnalyzer"
}
}
}
}
I looked at all the possible solutions online but didn't find any that worked.
Also, if I set "output_unigrams": true, then it just works like a match query and gives results.
The thing I'm trying to achieve:
Having these documents:
Chandni Chowk 2 Bangalore
Chandni Chowk
CCD Bangalore
Istah shawarma and biryani
Istah
So,
searching for "Chandni Chowk 2 Bangalore" should return 1, 2
searching for "Chandni Chowk" should return 1, 2
searching for "Istah shawarma and biryani" should return 4, 5
searching for "Istah" should return 4, 5
searching for "CCD Bangalore" should return 3
Note: the search keyword will always be exactly equal to the value of the name field in the document. For example, in this particular index we can query "Chandni Chowk 2 Bangalore", "Chandni Chowk", "CCD Bangalore", "Istah shawarma and biryani", or "Istah". "CCD" alone won't be queried on this index.
The analyzer parameter specifies the analyzer used for text analysis when indexing or searching a text field.
Modify your index mapping as follows:
{
"settings": {
"analysis": {
"analyzer": {
"evolutionAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"custom_shingle"
]
}
},
"filter": {
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "10",
"output_unigrams": true // note this
}
}
}
},
"mappings": {
"legacy" : {
"properties": {
"name": {
"type": "text",
"fields": {
"shingles": {
"type": "text",
"analyzer": "evolutionAnalyzer", // note this
"search_analyzer": "evolutionAnalyzer"
},
"as_is": {
"type": "keyword"
}
},
"analyzer": "standard"
}
}
}
}
}
And the modified search query will be:
{
"query": {
"match": {
"name.shingles": {
"query": "Chandni Chowk"
}
}
}
}
Search Results:
"hits": [
{
"_index": "66127416",
"_type": "_doc",
"_id": "2",
"_score": 0.25759193,
"_source": {
"name": "Chandni Chowk"
}
},
{
"_index": "66127416",
"_type": "_doc",
"_id": "1",
"_score": 0.19363807,
"_source": {
"name": "Chandni Chowk 2 Banglore"
}
}
]
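To see what the shingle filter actually emits, the Analyze API helps here too (a sketch; it assumes you recreated shingle_test with the modified mapping above):
GET shingle_test/_analyze
{
"analyzer": "evolutionAnalyzer",
"text": "Chandni Chowk"
}
With "output_unigrams": true this should return Chandni, Chandni Chowk, and Chowk, which is why the match on name.shingles now finds both documents.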

Elasticsearch: how to stem plurals ending in "es"?

For the EN language I have a custom analyser using the porter_stem. I want queries with the words "virus" and "viruses" to return the same results.
What I am finding is that porter stems virus->viru and viruses->virus. Consequently I get differing results.
How can I handle this?
You can achieve your use case, i.e. queries with the words "virus" and "viruses" returning the same results, by using the snowball token filter, which stems words to their root form.
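For contrast, you can reproduce the porter_stem behavior described in the question directly with the Analyze API (a quick sketch; the inline filter chain is illustrative):
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "lowercase", "porter_stem" ],
"text": "virus"
}
This should return the token viru, confirming the mismatch; the snowball setup below stems both virus and viruses to virus.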
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_snow"
]
}
},
"filter": {
"my_snow": {
"type": "snowball",
"language": "English"
}
}
}
},
"mappings": {
"properties": {
"desc": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Analyze API
GET /_analyze
{
"analyzer" : "my_analyzer",
"text" : "viruses"
}
The following tokens are generated:
{
"tokens": [
{
"token": "virus",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Index Data:
{
"desc":"viruses"
}
{
"desc":"virus"
}
Search Query:
{
"query": {
"match": {
"desc": {
"query": "viruses"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65707743",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"desc": "viruses"
}
},
{
"_index": "65707743",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"desc": "virus"
}
}
]

Elasticsearch full string match not working

I am using the elastic-builder npm package:
esb.termQuery(Email, "test")
Mapping:
"CompanyName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
Database fields:
"Email": "test#mycompany.com",
"CompanyName": "my company"
Query JSON: { term: { CompanyName: 'my' } } or { term: { Email: 'test' } }
Result:
"Email": "test#mycompany.com",
"CompanyName": "my company"
Expectation:
No result; I need a full (exact) string match. The match here is acting like a SQL 'like' or a queryStringQuery.
I have 3 filters: prefix, exact match, and include.
The standard analyzer is the default analyzer, used if none is specified. It provides grammar-based tokenization.
In your example, you are not specifying any analyzer explicitly in the index mapping, so text fields are analyzed by default with the standard analyzer.
The following tokens are generated if no analyzer is defined:
POST /_analyze
{
"analyzer" : "standard",
"text" : "test#mycompany.com"
}
Tokens are:
{
"tokens": [
{
"token": "test",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "mycompany.com",
"start_offset": 5,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 1
}
]
}
If you want a full-text search, you can define a custom analyzer with a lowercase filter; the lowercase filter ensures that all letters are changed to lowercase before indexing the document and before searching.
The normalizer property of keyword fields is similar to analyzer
except that it guarantees that the analysis chain produces a single
token.
The uax_url_email tokenizer is like the standard tokenizer except that
it recognises URLs and email addresses as single tokens.
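As a quick check of the difference, you can run the uax_url_email tokenizer through the Analyze API (a sketch; the # in the sample data presumably stands for @, and it is the @ form that uax_url_email recognizes as an email address):
GET /_analyze
{
"tokenizer": "uax_url_email",
"text": "test@mycompany.com"
}
This should return the single token test@mycompany.com of type <EMAIL>, instead of the test / mycompany.com split shown above.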
Index Mapping:
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": [
"lowercase"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "uax_url_email"
}
}
}
},
"mappings": {
"properties": {
"CompanyName": {
"type": "keyword",
"normalizer": "my_normalizer"
},
"Email": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"Email": "test#mycompany.com",
"CompanyName": "my company"
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"match": {
"CompanyName": "My Company"
}
},
{
"match": {
"Email": "test"
}
}
],
"minimum_should_match": 1
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64220291",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"Email": "test#mycompany.com",
"CompanyName": "my company"
}
}
]
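With the normalizer in place, an exact full-string filter also behaves as expected, since the normalizer is applied to term query input as well (a sketch using the mapping above):
{
"query": {
"term": {
"CompanyName": "My Company"
}
}
}
This should match "my company" case-insensitively, while a term query for just "my" returns nothing, which is the exact-match behavior the question asks for.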

Filter by Score with Ngrams

I have a search string of Resta and currently my results include:
"Save at any restaurant!",
"Save at any gas station!"
The reason is my index:
{
"rewards": {
"aliases": {},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"fields": {
"name": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "rewards",
"creation_date": "1555542654894",
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"ngram_analyzer": {
"filter": [
"lowercase",
"ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "Nzf6KNHkQIeKP0HbVFK1lw",
"version": {
"created": "6060299"
}
}
}
}
}
When I look at the term vectors for the document with "Save at any gas station!", sure enough I see sta as an ngram:
{
"_index": "rewards",
"_type": "_doc",
"_id": "6",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"name": {
"field_statistics": {
"sum_doc_freq": 73,
"doc_count": 3,
"sum_ttf": 73
},
"terms": {
"any": {
"term_freq": 1,
"tokens": [
{
"position": 2,
"start_offset": 8,
"end_offset": 11
}
]
},
"save": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 4
}
]
},
"sta": {
"term_freq": 1,
"tokens": [
{
"position": 4,
"start_offset": 16,
"end_offset": 23
}
]
}
}
}
}
}
(I omitted many others for brevity)
Query used:
{
"bool": {
"should": [
{
"multi_match": {
"query": "restaurant",
"fields": [
"name",
"category"
],
"operator": "and"
}
}
]
}
}
When I search, I get back a score:
["Save at any restaurant!", 1.1967528]
["Save at any gas station!", 0.7141209]
The user here is in fact looking for Restaurant, and I'm wondering how to filter or exclude results by score. I can't seem to find a good definition of score (it seems relative), but how do I avoid showing "Save at any gas station!" here (eventually)?
Even giving it a full search phrase of restaurant, the scores only get slightly better:
["Save at any restaurant!", 1.253743]
["Save at any gas station!", 0.7141209]
You can simply create an edge n-gram analyzer in the mapping and make use of it only in the search request.
What edge n-gram does is create tokens using only the starting letters of a word, e.g. re, res, rest, resta, restau, restaur, restaura, restauran, restaurant.
I've added an edge n-gram analyzer below; notice that I'm not using this analyzer on any of the fields. I would make use of this analyzer only during the search query.
That means the query would only look up the above-mentioned tokens of restaurant in the inverted index.
Below is a sample mapping and its query.
Mapping
PUT <your_index_name>
{
"mappings":{
"mydocs":{
"properties":{
"name":{
"type":"text",
"fields":{
"name":{
"type":"text",
"analyzer":"ngram_analyzer"
}
}
}
}
}
},
"settings":{
"index":{
"number_of_shards":"5",
"analysis":{
"filter":{
"ngram_filter":{
"type":"ngram",
"min_gram":"2",
"max_gram":"20"
},
"edgengram_filter":{
"type":"edge_ngram",
"min_gram":"2",
"max_gram":"20"
}
},
"analyzer":{
"ngram_analyzer":{
"filter":[
"lowercase",
"ngram_filter"
],
"type":"custom",
"tokenizer":"standard"
},
"edgengram_analyzer":{
"filter":[
"lowercase",
"edgengram_filter"
],
"type":"custom",
"tokenizer":"standard"
}
}
},
"number_of_replicas":"1"
}
}
}
Below is how your query would be:
Query
POST <your_index_name>/_search
{
"query":{
"bool":{
"should":[
{
"multi_match":{
"query":"restaurant",
"fields":[
"name",
"category"
],
"operator":"and",
"analyzer":"edgengram_analyzer" <---- Added this
}
}
]
}
}
}
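To sanity-check the analyzer before wiring it into the query, you can run it through the Analyze API (a quick sketch against the index above):
POST <your_index_name>/_analyze
{
"analyzer": "edgengram_analyzer",
"text": "restaurant"
}
This should return only the prefixes re through restaurant; since the gas-station document's n-grams contain no token like resta, the "and" operator drops it from the results.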
You would be able to see the required result.
Hope it helps.

Elasticsearch Ngrams: Unexpected behavior for autocomplete

Here's a simplification of what I have:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
PUT my_index/_doc/1
{
"title": "Quick Foxes"
}
PUT my_index/_doc/2
{
"title": "Quick Fuxes"
}
PUT my_index/_doc/3
{
"title": "Foxes Quick"
}
PUT my_index/_doc/4
{
"title": "Foxes Slow"
}
I am trying to search for Quick Fo to test the autocomplete:
GET my_index/_search
{
"query": {
"match": {
"title": {
"query": "Quick Fo",
"operator": "and"
}
}
}
}
The problem is that this query also returns Foxes Quick, where I expected only Quick Foxes:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"title": "Quick Foxes"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "3",
"_score": 0.5753642,
"_source": {
"title": "Foxes Quick" <<<----- WHY???
}
}
]
}
}
What can I tweak to get a classic "autocomplete", where "Quick Fo" won't return "Foxes Quick" but only "Quick Foxes"?
---- ADDITIONAL INFO -----------------------
This worked for me:
PUT my_index1
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
PUT my_index1/_doc/1
{
"text": "Quick Brown Fox"
}
PUT my_index1/_doc/2
{
"text": "Quick Frown Fox"
}
PUT my_index1/_doc/3
{
"text": "Quick Fragile Fox"
}
GET my_index1/_search
{
"query": {
"match": {
"text": {
"query": "quick br",
"operator": "and"
}
}
}
}
The issue is due to your search analyzer autocomplete_search, in which you are using the lowercase tokenizer: your search term Quick Fo is divided into two terms, quick and fo (note the lowercase), which are then matched against the tokens generated by the autocomplete analyzer on your indexed docs.
Now, the title Foxes Quick is indexed with the autocomplete analyzer and therefore has both quick and fo among its tokens, hence it matches the search-term tokens.
You can simply use the _analyze API to check the tokens generated for your documents, as well as for your search term, to understand this better.
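For example (a sketch against the index above):
GET my_index/_analyze
{
"analyzer": "autocomplete_search",
"text": "Quick Fo"
}
GET my_index/_analyze
{
"analyzer": "autocomplete",
"text": "Foxes Quick"
}
The first call should yield quick and fo; the second should yield fo, fox, foxe, foxes, qu, qui, quic, and quick. Both search tokens are present in the indexed title, hence the match.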
Please refer to the official ES doc https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html on how to implement autocomplete. It also uses a different search-time analyzer, but there are certain limitations and it can't solve all use-cases (esp. if you have docs like yours), hence I implemented autocomplete using a different design based on the business requirements.
Hope I was clear on explaining why the second doc is returned in your case.
EDIT: Also, in your case, IMO a match_phrase_prefix query would be more useful.
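A minimal sketch of that alternative (note: match_phrase_prefix respects word order, so Quick Fo would match Quick Foxes but not Foxes Quick; with the n-gram mappings above the exact behavior depends on token positions, and it is most predictable against a standard-analyzed field):
GET my_index/_search
{
"query": {
"match_phrase_prefix": {
"title": "Quick Fo"
}
}
}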
