Elasticsearch match certain fields exactly but not others

I need Elasticsearch to match certain fields exactly; I am currently using multi_match.
For example, a user types in long beach chiropractor.
I want long beach to match the city field exactly, and not return results for seal beach or glass beach.
At the same time chiropractor should also match chiropractic.
Here is the current query I am using:
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": [
"title",
"location_address_address_1.value",
"location_address_city.value^2",
"location_address_state.value",
"specialty" // e.g. chiropractor
],
"query": "chiropractor long beach",
"boost": 6,
"type": "cross_fields"
}
}
]
}
},

The right approach would be to separate the search term from the location and store the location as a keyword type. If that's not possible, you can use a synonym token filter to index locations as single tokens, but this requires a list of all possible locations, e.g.
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"long beach=>long-beach"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Now if you call
POST /my_index/_analyze?analyzer=my_synonyms
{
"text": ["chiropractor long beach"]
}
the response is
{
"tokens": [
{
"token": "chiropractor",
"start_offset": 0,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "long-beach",
"start_offset": 13,
"end_offset": 23,
"type": "SYNONYM",
"position": 1
}
]
}
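The first suggestion in the answer above (search term and location kept apart, location stored as a keyword) could be sketched roughly as follows. This is only a sketch: the sub-field name location_address_city.keyword, the reduced field list, and the helper build_query are assumptions for illustration, and the `filter` clause of `bool` assumes a reasonably recent Elasticsearch version.

```python
import json

# Hypothetical field name: assumes the city is also indexed as an unanalyzed
# keyword sub-field. Adjust to whatever the real mapping provides.
CITY_KEYWORD_FIELD = "location_address_city.keyword"
# Assumed subset of the analyzed fields from the question.
TEXT_FIELDS = ["title", "specialty"]


def build_query(term, city):
    """Build a bool query: analyzed match on the term, exact filter on the city."""
    return {
        "query": {
            "bool": {
                # Analyzed full-text match for the thing being searched.
                "must": [
                    {"multi_match": {"query": term, "fields": TEXT_FIELDS}}
                ],
                # Exact, unanalyzed match on the whole city value.
                "filter": [
                    {"term": {CITY_KEYWORD_FIELD: city}}
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query("chiropractor", "long beach"), indent=2))
```

A term filter compares the whole stored value, so "seal beach" cannot match "long beach", but the city has to be sent in exactly the form in which it was indexed (including case). Matching chiropractic from chiropractor is a separate concern that depends on how the text fields are analyzed.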

Related

Synonym token filter

I created a test index with synonym token filter
PUT /synonyms-index
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"shares","equity","stock"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Then I ran the analyze API:
POST synonyms-index/_analyze
{
"analyzer":"my_synonyms",
"text":"equity awesome"
}
I ran this to see which tokens go into the inverted index. I was expecting "shares" and "stock" to be added as per the synonym rule, but they are not. Am I missing anything here?
{
"tokens": [
{
"token": "equity",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "awesome",
"start_offset": 7,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Posting the answer for the community:
This is a common pitfall with JSON. Everything that constitutes one rule must go inside a single pair of double quotes; the rule then follows simple expansion. We need to write
"synonyms": [ "shares,equity,stock" ]
rather than
"synonyms": [
"shares","equity","stock"
]
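To see why the two spellings behave differently, here is a deliberately simplified Python model of equivalent-synonym rules. It is not Elasticsearch's implementation: expand_rules and analyze are hypothetical helpers, whitespace splitting stands in for the standard tokenizer, and token positions and "=>" rules are ignored.

```python
def expand_rules(rules):
    """Simplified model: each array element is one rule, and the
    comma-separated terms inside it are treated as equivalent."""
    mapping = {}
    for rule in rules:
        terms = [t.strip().lower() for t in rule.split(",") if t.strip()]
        if len(terms) < 2:
            continue  # a one-word rule relates the word to nothing else
        for term in terms:
            mapping[term] = terms
    return mapping


def analyze(text, rules):
    """Lowercase, split on whitespace, then emit each token's synonyms."""
    mapping = expand_rules(rules)
    out = []
    for token in text.lower().split():
        out.extend(mapping.get(token, [token]))
    return out


if __name__ == "__main__":
    # Three separate one-word rules: nothing is expanded.
    print(analyze("equity awesome", ["shares", "equity", "stock"]))
    # One rule containing three terms: "equity" expands to all of them.
    print(analyze("equity awesome", ["shares,equity,stock"]))
```

In this model the first call leaves "equity" untouched, which mirrors the response in the question, while the second call adds "shares" and "stock".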

Elasticsearch - Stop analyzer doesn't allow number

I'm trying to build a search utility using Elasticsearch 6.3.0 where any term can be searched within the database. I have applied the Stop Analyzer to exclude some generic words. However, since adding that analyzer, the system no longer matches terms that contain numbers.
For example, if I search for news24, it removes 24 and searches only for the term "news" in all records. I am unsure why.
Below is the query I am using:
{
"from": 0,
"size": 10,
"explain": false,
"stored_fields": [
"_source"
],
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "news24",
"analyzer": "stop",
"fields": [
"title",
"keywords",
"url"
]
}
},
"functions": [
{
"script_score": {
"script": "( (doc['isSponsered'].value == 'y') ? 100 : 0 )"
}
},
{
"script_score": {
"script": "doc['linksCount'].value"
}
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"script_fields": {
"custom_score": {
"script": {
"lang": "painless",
"source": "params._source.linksArray"
}
}
},
"highlight": {
"pre_tags": [
""
],
"post_tags": [
"<\/span>"
],
"fields": {
"title": {
"type": "plain"
},
"keywords": {
"type": "plain"
},
"description": {
"type": "plain"
},
"url": {
"type": "plain"
}
}
}
}
That is because the stop analyzer is just an extension of the Simple Analyzer, which uses the Lowercase Tokenizer. That tokenizer breaks text into tokens whenever it encounters a character that is not a letter (and also lowercases all the terms).
So basically, if you have something like news24, it emits only the token news, because it breaks at the 2.
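The splitting described above can be imitated in a few lines of Python. This is an approximation for illustration only, not the Lucene code: lowercase_tokenize and stop_analyze are hypothetical helpers, and the stop word set is the default English list used by Lucene's stop filter.

```python
import re

# Default English stop words used by Lucene's stop filter.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def lowercase_tokenize(text):
    """Approximate the lowercase tokenizer: keep runs of letters, lowercased.
    Digits, punctuation and whitespace all act as separators."""
    return [run.lower() for run in re.findall(r"[^\W\d_]+", text)]


def stop_analyze(text):
    """Approximate the stop analyzer: lowercase tokenizer plus stop word removal."""
    return [t for t in lowercase_tokenize(text) if t not in ENGLISH_STOP_WORDS]


if __name__ == "__main__":
    print(stop_analyze("news24"))
    print(stop_analyze("the name of the channel is news24"))
```

Because digits are treated as separators, the 24 never becomes part of any token, so the query effectively searches for news alone.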
This is the default behaviour of the stop analyzer. If you intend to make use of stop words and still want to keep numerics in the picture, you need to create a custom analyzer as shown below:
Mapping:
PUT sometestindex
{
"settings":{
"analysis":{
"analyzer":{
"my_english_analyzer":{
"type":"standard",
"stopwords":"_english_"
}
}
}
}
}
This makes use of the Standard Analyzer, which internally uses the Standard Tokenizer, and also removes stop words.
Analysis Query To Test
POST sometestindex/_analyze
{
"analyzer": "my_english_analyzer",
"text": "the name of the channel is news24"
}
Query Result
{
"tokens": [
{
"token": "name",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "channel",
"start_offset": 16,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "news24",
"start_offset": 27,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 6
}
]
}
You can see in the tokens above that news24 is preserved as a token.
Hope it helps!

Elasticsearch keyword tokenizer doesn't work with phonetic analyzer

I want to add a custom phonetic analyzer, and I don't want my given string to be split into separate tokens. Suppose I have two strings:
KAMRUL ISLAM
KAMRAL ISLAM
I don't want to get any result for the query string KAMRUL, but I want both of them as results for the query string KAMRUL ISLAM.
For this, I have used a custom phonetic analyzer with a keyword tokenizer.
Index Settings :
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"dbl_metaphone": {
"tokenizer": "keyword",
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "keyword",
"filter": "dbl_metaphone"
}
}
}
}
}
Type Mappings:
PUT /my_index/_mapping/my_type
{
"properties": {
"name": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
I have inserted data with :
PUT /my_index/my_type/5
{
"name": "KAMRUL ISLAM"
}
And my query:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "KAMRAL"
}
}
}
}
Unfortunately, I get both strings back. I am using ES 1.7.1. Is there any way to solve this?
Additionally, when I ran
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRUL ISLAM'
I got the result:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 1
}
]
}
And when running:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRAL'
I have got:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 1
}
]
}

Phonetic search results for integers with Elasticsearch

Forgive me as I am new to Elasticsearch, but I am following the Phonetic start guide found here: Phonetic Matching
I have the following
POST /app
{
"settings": {
"index": {
"analysis": {
"filter": {
"dbl_metaphone": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "standard",
"filter": "dbl_metaphone"
}
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
},
"year": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
}
}
} }
I add some documents by doing:
POST /app/movie
{ "title": "300", "year": "2006" } & { "title": "500 days of summer", "year": "2009" }
I want to query for the movie '300' by entering this query though:
POST /app/movie/_search
{
"query": {
"match": {
"title.phonetic": {
"query": "three hundred"
}
}
}
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
"tokens": [
{
"token": "300",
"start_offset": 0,
"end_offset": 3,
"type": "<NUM>",
"position": 0
}
]
}
I see that only a number token is returned, not an alphanumeric version like:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
"tokens": [
{
"token": "0R",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "TR",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "HNTR",
"start_offset": 6,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Is there something that I am missing with my phonetic query that I am supposed to define to get both the numerical and alphanumeric tokens?
That is not possible. Double Metaphone is a phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This makes it easier to search for terms, such as names, that may be spelled differently but sound the same.
As you can see from the algorithm, double metaphone ignores numbers and numeric characters.
You can read more about double metaphone here.
A better case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
Shouldn't be too hard to find/write a function that converts integers to English.
Call your function when constructing your document to ingest into ES.
Alternately, write it in Groovy, and call it as a Transform script in your mapping.
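As a rough illustration of the index-time approach described above, here is a small Python sketch of such a conversion function. The names to_words and with_spelled_numbers are hypothetical, the range is limited to 0..999999, and how the extra text is stored (an additional field, or appended to the title) is left open as an assumption.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def to_words(n):
    """Spell out an integer in the range 0..999999 in English."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return to_words(thousands) + " thousand" + (" " + to_words(rest) if rest else "")


def with_spelled_numbers(title):
    """Return the title followed by an English spelling of each numeric word."""
    extra = [
        to_words(int(word))
        for word in title.split()
        if re.fullmatch(r"[0-9]{1,6}", word)  # plain ASCII digits, within range
    ]
    return " ".join([title] + extra)


if __name__ == "__main__":
    print(with_spelled_numbers("300"))
    print(with_spelled_numbers("500 days of summer"))
```

Calling with_spelled_numbers on the title while building each document would put both "300" and "three hundred" into the text that the phonetic analyzer sees.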

Confusing query_string search results

I've got Elasticsearch set up and am running queries against it, but I'm getting odd results, and can't figure out why:
For example, here's one relevant portion of my mapping:
"classification": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
And here are some of the queries and results. For all of these, there are objects with a classification value of "Jewelry & Adornment":
Query:
"query": {
"bool": {
"must": [
{
"match_all": {}
},
{
"query_string": {
"query": "(classification:/jewel.*/)"
}
}
]
}
}
Result:
"hits": {
"total": 2541,
"max_score": 1.4142135,
"hits": [
{
...
Yet if I add "ry":
Query:
"query_string": {
"query": "(classification:/jewelry.*/)"
}
Result:
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
I've also tried running the queries:
"query_string": {
"query": "(classification\\*:/jewelry.*/)"
}
(should match either "classification" or "classification.raw")
And:
"query_string": {
"query": "(classification.raw:/jewelry.*/)"
}
I've also tried case variations, e.g. "Jewelry" vs. "jewelry", to no effect. All of these return no results. This makes no sense to me. Even when querying "classification.raw" with "Jewelry" (same case and on a completely unanalyzed field), I get no results. Any ideas?
UPDATE
As requested by @keety:
{
"tokens": [
{
"token": "jewelri",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "adorn",
"start_offset": 10,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
I imagine the fact that it's stemming "jewelry" to "jewelri" is my problem, but not sure why it's doing that or how to fix it.
UPDATE #2
These are the analyzers in play:
"analyzer": {
"default_index": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"index_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
},
"default_search": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"search_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
}
}
UPDATE #3
I ran an _explain query on one of the objects that should be matching but isn't and got the following:
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0.70710677,
"description": "ConstantScore(*:*), product of:",
"details": [
{
"value": 1,
"description": "boost"
},
{
"value": 0.70710677,
"description": "queryNorm"
}
]
},
{
"value": 0,
"description": "no match on required clause (ConstantScore())"
}
]
}
I don't know what "required clause (ConstantScore())" is. The only thing I can find related is Constant Score Query, but I'm not employing this particular query anywhere.
UPDATE #4
Okay, this is getting a little long-winded. Sorry about that. However, I just discovered that the problem seems to lie in using the regex syntax. If I just use a basic wildcard (along with "analyze_wildcard": true), then all my queries start working.
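One way to picture the results on the analyzed classification field is the small Python sketch below. It assumes that a regexp pattern is not run through the analyzer and has to match a whole indexed term; Python's re module stands in for Lucene's regular expression engine, which is adequate for these simple patterns but is not the same syntax in general. The helper regexp_hits is hypothetical.

```python
import re

# Tokens the analyzer produced for "Jewelry & Adornment"
# (taken from the _analyze output in UPDATE above).
INDEXED_TERMS = ["jewelri", "adorn"]


def regexp_hits(pattern, terms):
    """Return the terms that the pattern matches in full (anchored at both ends)."""
    return [term for term in terms if re.fullmatch(pattern, term)]


if __name__ == "__main__":
    print(regexp_hits("jewel.*", INDEXED_TERMS))    # the stemmed term starts with "jewel"
    print(regexp_hits("jewelry.*", INDEXED_TERMS))  # no indexed term starts with "jewelry"
```

Under that assumption, jewel.* finds the stemmed term jewelri while jewelry.* finds nothing, which is consistent with the hit counts reported above.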
