Elasticsearch synonym analyzer not working - filter

EDIT: To add to this, the synonyms seem to be working with basic query_string queries:
"query_string" : {
"default_field" : "location.region.name.raw",
"query" : "nh"
}
This returns all of the results for New Hampshire, but a "match" query for "nh" returns no results.
I'm trying to add synonyms to my location fields in my Elastic index, so that if I do a location search for "Mass," "Ma," or "Massachusetts" I'll get the same results each time. I added the synonyms filter to my settings and changed the mapping for locations. Here are my settings:
analysis":{
"analyzer":{
"synonyms":{
"filter":[
"lowercase",
"synonym_filter"
],
"tokenizer": "standard"
}
},
"filter":{
"synonym_filter":{
"type": "synonym",
"synonyms":[
"United States,US,USA,USA=>usa",
"Alabama,Al,Ala,Ala",
"Alaska,Ak,Alas,Alas",
"Arizona,Az,Ariz",
"Arkansas,Ar,Ark",
"California,Ca,Calif,Cal",
"Colorado,Co,Colo,Col",
"Connecticut,Ct,Conn",
"Deleware,De,Del",
"District of Columbia,Dc,Wash Dc,Washington Dc=>Dc",
"Florida,Fl,Fla,Flor",
"Georgia,Ga",
"Hawaii,Hi",
"Idaho,Id,Ida",
"Illinois,Il,Ill,Ills",
"Indiana,In,Ind",
"Iowa,Ia,Ioa",
"Kansas,Kans,Kan,Ks",
"Kentucky,Ky,Ken,Kent",
"Louisiana,La",
"Maine,Me",
"Maryland,Md",
"Massachusetts,Ma,Mass",
"Michigan,Mi,Mich",
"Minnesota,Mn,Minn",
"Mississippi,Ms,Miss",
"Missouri,Mo",
"Montana,Mt,Mont",
"Nebraska,Ne,Neb,Nebr",
"Nevada,Nv,Nev",
"New Hampshire,Nh=>Nh",
"New Jersey,Nj=>Nj",
"New Mexico,Nm,N Mex,New M=>Nm",
"New York,Ny=>Ny",
"North Carolina,Nc,N Car=>Nc",
"North Dakota,Nd,N Dak, NoDak=>Nd",
"Ohio,Oh,O",
"Oklahoma,Ok,Okla",
"Oregon,Or,Oreg,Ore",
"Pennsylvania,Pa,Penn,Penna",
"Rhode Island,Ri,Ri & PP,R Isl=>Ri",
"South Carolina,Sc,S Car=>Sc",
"South Dakota,Sd,S Dak,SoDak=>Sd",
"Tennessee,Te,Tenn",
"Texas,Tx,Tex",
"Utah,Ut",
"Vermont,Vt",
"Virginia,Va,Virg",
"Washington,Wa,Wash,Wn",
"West Virginia,Wv,W Va, W Virg=>Wv",
"Wisconsin,Wi,Wis,Wisc",
"Wyomin,Wi,Wyo"
]
}
}
}
And the mapping for the location.region field:
"region":{
"properties":{
"id":{"type": "long"},
"name":{
"type": "string",
"analyzer": "synonyms",
"fields":{"raw":{"type": "string", "index": "not_analyzed" }}
}
}
}
But the synonyms analyzer doesn't seem to be doing anything. This query for example:
"match" : {
"location.region.name" : {
"query" : "Massachusetts",
"type" : "phrase",
"analyzer" : "synonyms"
}
}
This returns hundreds of results, but if I replace "Massachusetts" with "Ma" or "Mass" I get 0 results. Why isn't it working?

The order of the filters is
filter":[
"lowercase",
"synonym_filter"
]
So if Elasticsearch lowercases the tokens first, then by the time it executes the second step, synonym_filter, the lowercased tokens no longer match any of the mixed-case entries you have defined.
To solve the problem, I would define the synonyms in lower case.
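A sketch of what a few of the entries might look like after lowercasing (only a handful shown here; the rest of the list would follow the same pattern):
"filter":{
"synonym_filter":{
"type": "synonym",
"synonyms":[
"massachusetts,ma,mass",
"new hampshire,nh",
"new york,ny"
]
}
}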

You can also define your synonyms filter as case insensitive:
"filter":{
"synonym_filter":{
"type": "synonym",
"ignore_case" : "true",
"synonyms":[
...
]
}
}
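In either case, a quick way to confirm that the analyzer behaves as expected is to run it directly through the _analyze API and inspect the emitted tokens. A sketch, assuming the index is called my_index and the synonyms analyzer above is defined on it (the exact _analyze syntax varies slightly between Elasticsearch versions):
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=synonyms&text=Mass&pretty'
With equivalence-style synonyms, the output should contain tokens for all of the Massachusetts variants at the same position, not just "mass".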

Related

How can I index a field using two different analyzers in Elastic search

Say that I have a field "productTitle" which I want my users to use to search for products.
I also want to apply autocomplete functionality, so I'm using an autocomplete_analyzer with the following filter:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
However, when users actually make a search I don't want the "edge_ngram" analysis to be applied, since it produces a lot of irrelevant results.
For example, when users want to search for "mi" and start typing "m", "mi", they should get results starting with "m" or "mi" as auto-complete options. However, when they actually run the query, they should only get results containing the word "mi". Currently they also see results with "mini", etc.
Therefore, is it possible to have "productTitle" indexed using two different analyzers? Is multi-field type an option for me?
EDIT: Mapping for productTitle
"productTitle" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
And the "second" analyzer:
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
}
}
So when I'm querying for:
"filtered" : {
"query" : {
"match" : {
"productTitle" : {
"query" : "mi",
"type" : "boolean",
"minimum_should_match" : "2<75%"
}
}
}
}
I also get results like "mini". But I need to only get results including just "mi"
Thank you
Hmm... as far as I know, there is no way to apply multiple analyzers to the same field. What you can do is use "multi fields".
Here is an example of how to apply different analyzers to "subfields":
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#_multi_fields_with_multiple_analyzers
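Roughly along those lines, a multi-field version of the mapping could look like this (just a sketch; the sub-field name autocomplete is a placeholder, and it reuses the "second" analyzer defined in the question):
"productTitle" : {
"type" : "string",
"analyzer" : "standard",
"fields" : {
"autocomplete" : {
"type" : "string",
"analyzer" : "second"
}
}
}
Strict queries would then go against productTitle, while the search-as-you-type queries would go against productTitle.autocomplete.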
The correct way to prevent what you describe in your question is to specify both analyzer and search_analyzer in your field mapping, like this:
"productTitle": {
"type": "string",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard"
}
The autocomplete analyzer will kick in at indexing time and tokenize your title according to your edge_ngram configuration, while the standard analyzer will kick in at search time without applying the edge_ngram step.
In this context, there is no need for multi-fields unless you need to tokenize the productTitle field in different ways.

Exact phrase match in ElasticSearch

I'm trying to achieve exact phrase search in Elastic, using my existing (full-text) index. When a user searches for, say, "Sanity Testing", the result should bring back all the docs with "Sanity Testing" (case-insensitive), but not "Sanity tested".
My mapping:
{
"doc": {
"properties": {
"file": {
"type": "attachment",
"path": "full",
"fields": {
"file": {
"type": "string",
"term_vector":"with_positions_offsets",
"analyzer":"o3analyzer",
"store": true
},
"title" : {"store" : "yes"},
"date" : {"store" : "yes"},
"keywords" : {"store" : "yes"},
"content_type" : {"store" : "yes"},
"content_length" : {"store" : "yes"},
"language" : {"store" : "yes"}
}
}
}
}
}
As I understand it, there's a way to add another index with a "raw" analyzer, but I'm not sure this will work, given the need for case-insensitive search. I also don't want to rebuild the indexes, as there are hundreds of machines with tons of documents already indexed, so it may take ages.
Is there a way to run such a query? I'm now trying to search using the following query:
{
"query": {
"match_phrase": {
"file": "Sanity Testing"
}
}
}
and it brings me both "Sanity Testing" and "Sanity Tested".
Any help appreciated!

Elasticsearch - Fuzzy, phrase, completion suggestor and dashes

So I have been asking separate questions while trying to achieve the search functionality I would like, but I'm still falling short, so I thought I would just ask people what they suggest for the optimal Elasticsearch settings, mappings, indexing, and query structure to do what I am looking for.
I need a search-as-you-type solution that queries categories. If I typed in "mex" I would expect to get back results like "Mexican Restaurant", "Mexican Grocery Store", "Tex-Mex Restaurant" and "Medical Supplies". "Medical Supplies" would come back because fuzzy matching could assume you meant to type "med". The categories containing "Mexican" should be listed first, though. On the topic of priority, if a user typed in "bar" I would expect "Bar" to be in the list before "Barn" or "Barbecue".
On top of this, I am also looking for the ability for a user to search "Mexican Store" and still have "Mexican Grocery Store" returned, and likewise for "Store Mexican" to still return "Mexican Grocery Store".
As well as the above features I need a way to handle dashes. If a user were to type any variation of "tex mex", "tex-mex", "texmex" I would expect to get "Tex-Mex Restaurant".
If you have read this far I really appreciate it. I have implemented a few solutions already, but none of them have been able to do all of what I described above.
My current configuration:
settings
curl -XPUT http://localhost:9200/objects -d '{
"settings": {
"analysis": {
"analyzer": {
"lower": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "lowercase" ]
}
}
}
}
}'
mapping
curl -XPUT http://localhost:9200/objects/object/_mapping -d '{
"object" : {
"properties" : {
"objectDescription" : {
"type" : "string",
"fields" : {
"lower": {
"type": "string",
"analyzer": "lower"
}
}
},
"suggest" : {
"type" : "completion",
"analyzer" : "simple",
"search_analyzer" : "simple",
"payloads" : true
}
}
}
}'
index
{
"id":6663521500659712,
"objectDescription":"Mexican Grocery Store",
"suggest":{
"input":["Mexican Grocery Store"],
"output":"Mexican Grocery Store",
"payload":{
"id":6663521500659712
}
}
}
query
{
"query":{
"bool":{
"should":[
{
"fuzzy":{
"objectDescription.lower":{"value":"med"}
}
},
{
"term":{
"objectDescription":{"value":"med"}
}
}
]
}
},
"from":0,
"size":20,
"suggest":{
"object-suggest":{
"text":"med",
"completion":{
"field":"suggest",
"fuzzy":{
"fuzzy":true
}
}
}
}
}

Elasticsearch Synonym filter not working

I added a synonyms analyzer and filter to my Elastic index so that, for example, searching by state for "Massachusetts," "Ma," or "Mass" would return the same results. These are the settings that I have:
analysis":{
"analyzer":{
"synonyms":{
"filter":[
"lowercase",
"synonym_filter"
],
"tokenizer": "standard"
},
"my_analyzer":{
"filter":["standard", "lowercase", "my_soundex" ],
"tokenizer": "standard"}
},
"filter":{
"my_soundex":{
"replace": "false",
"type": "phonetic",
"encoder": "soundex"
},
"synonym_filter":{
"type": "synonym",
"synonyms":[
"United States,US,USA,USA=>usa",
"Alabama,Al,Ala,Ala",
"Alaska,Ak,Alas,Alas",
"Arizona,Az,Ariz",
"Arkansas,Ar,Ark",
"California,Ca,Calif,Cal",
"Colorado,Co,Colo,Col",
"Connecticut,Ct,Conn",
"Deleware,De,Del",
"District of Columbia,Dc,Wash Dc,Washington Dc=>Dc",
"Florida,Fl,Fla,Flor",
"Georgia,Ga",
"Hawaii,Hi",
"Idaho,Id,Ida",
"Illinois,Il,Ill,Ills",
"Indiana,In,Ind",
"Iowa,Ia,Ioa",
"Kansas,Kans,Kan,Ks",
"Kentucky,Ky,Ken,Kent",
"Louisiana,La",
"Maine,Me",
"Maryland,Md",
"Massachusetts,Ma,Mass",
"Michigan,Mi,Mich",
"Minnesota,Mn,Minn",
"Mississippi,Ms,Miss",
"Missouri,Mo",
"Montana,Mt,Mont",
"Nebraska,Ne,Neb,Nebr",
"Nevada,Nv,Nev"
"New Hampshire,Nh=>Nh",
"New Jersey,Nj=>Nj",
"New Mexico,Nm,N Mex,New M=>Nm",
"New York,Ny=>Ny",
"North Carolina,Nc,N Car=>Nc",
"North Dakota,Nd,N Dak, NoDak=>Nd",
"Ohio,Oh,O",
"Oklahoma,Ok,Okla",
"Oregon,Or,Oreg,Ore",
"Pennsylvania,Pa,Penn,Penna",
"Rhode Island,Ri,Ri & PP,R Isl=>Ri",
"South Carolina,Sc,S Car=>Sc",
"South Dakota,Sd,S Dak,SoDak=>Sd",
"Tennessee,Te,Tenn",
"Texas,Tx,Tex",
"Utah,Ut",
"Vermont,Vt",
"Virginia,Va,Virg",
"Washington,Wa,Wash,Wn",
"West Virginia,Wv,W Va, W Virg=>Wv",
"Wisconsin,Wi,Wis,Wisc",
"Wyomin,Wi,Wyo"
]
}
}
}
However, the synonyms filter doesn't seem to be working. Here are two queries that I tried:
"match": {
"location.location_raw": {
"type": "boolean",
"operator": "AND",
"query": "Massachusetts",
"analyzer": "synonyms"
}
}
"match": {
"location.location_raw": {
"type": "boolean",
"operator": "AND",
"query": "Mass",
"analyzer": "synonyms"
}
}
With the synonyms filter I should get the same number of results for both queries, but I get 6 results for "Massachusetts" and only 2 for "Mass." When I look at the results, all of the location_raw fields for the first query contain "Massachusetts" while all of the location_raw fields for the second query contain "Mass" exactly. It seems like the synonyms analyzer is just being ignored.
What am I missing here?

ElasticSearch nGram filters out punctuation

In my ElasticSearch dataset we have unique IDs that are separated with a period. A sample number might look like c.123.5432
Using an nGram I'd like to be able to search for: c.123.54
This doesn't return any results. I believe the tokenizer is splitting on the period. To account for this I added "punctuation" to the token_chars, but there's no change in results. My analyzer/tokenizer is below.
I've also tried: "token_chars": [] <--Per the documentation this should keep all characters.
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "10",
"token_chars": [ "letter", "digit", "whitespace", "punctuation", "symbol" ]
}
}
}
}
},
Edit (more info):
This is the mapping of the relevant field:
"ProjectID":{"type":"string","store":"yes", "copy_to" : "meta_data"},
And this is the field I'm copying it into (which also has the nGram analyzer):
"meta_data" : { "type" : "string", "store":"yes", "index_analyzer": "my_ngram_analyzer"}
This is the command I'm using in Sense to see if my search worked (note that it's searching the "meta_data" field):
GET /_search?pretty=true
{
"query": {
"match": {
"meta_data": "c.123.54"
}
}
}
Solution from s1monw at https://github.com/elasticsearch/elasticsearch/issues/5120
When only an index_analyzer is specified, the search side falls back to the standard analyzer. To fix it I changed index_analyzer to analyzer. Keep in mind that the number of results will increase greatly, so changing min_gram to a higher number may be necessary.
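The corrected copy-to field mapping would then look roughly like this (a sketch based on the mapping shown above):
"meta_data" : { "type" : "string", "store" : "yes", "analyzer" : "my_ngram_analyzer" }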
