Elasticsearch synonym filter not working

I added a synonyms analyzer and filter to my Elastic index so that, for example, searching by state for "Massachusetts," "Ma," or "Mass" would return the same results. These are the settings that I have:
"analysis": {
  "analyzer": {
    "synonyms": {
      "filter": [
        "lowercase",
        "synonym_filter"
      ],
      "tokenizer": "standard"
    },
    "my_analyzer": {
      "filter": [ "standard", "lowercase", "my_soundex" ],
      "tokenizer": "standard"
    }
  },
  "filter": {
    "my_soundex": {
      "replace": "false",
      "type": "phonetic",
      "encoder": "soundex"
    },
    "synonym_filter": {
      "type": "synonym",
      "synonyms": [
        "United States,US,USA=>usa",
        "Alabama,Al,Ala",
        "Alaska,Ak,Alas",
        "Arizona,Az,Ariz",
        "Arkansas,Ar,Ark",
        "California,Ca,Calif,Cal",
        "Colorado,Co,Colo,Col",
        "Connecticut,Ct,Conn",
        "Delaware,De,Del",
        "District of Columbia,Dc,Wash Dc,Washington Dc=>Dc",
        "Florida,Fl,Fla,Flor",
        "Georgia,Ga",
        "Hawaii,Hi",
        "Idaho,Id,Ida",
        "Illinois,Il,Ill,Ills",
        "Indiana,In,Ind",
        "Iowa,Ia,Ioa",
        "Kansas,Kans,Kan,Ks",
        "Kentucky,Ky,Ken,Kent",
        "Louisiana,La",
        "Maine,Me",
        "Maryland,Md",
        "Massachusetts,Ma,Mass",
        "Michigan,Mi,Mich",
        "Minnesota,Mn,Minn",
        "Mississippi,Ms,Miss",
        "Missouri,Mo",
        "Montana,Mt,Mont",
        "Nebraska,Ne,Neb,Nebr",
        "Nevada,Nv,Nev",
        "New Hampshire,Nh=>Nh",
        "New Jersey,Nj=>Nj",
        "New Mexico,Nm,N Mex,New M=>Nm",
        "New York,Ny=>Ny",
        "North Carolina,Nc,N Car=>Nc",
        "North Dakota,Nd,N Dak,NoDak=>Nd",
        "Ohio,Oh,O",
        "Oklahoma,Ok,Okla",
        "Oregon,Or,Oreg,Ore",
        "Pennsylvania,Pa,Penn,Penna",
        "Rhode Island,Ri,Ri & PP,R Isl=>Ri",
        "South Carolina,Sc,S Car=>Sc",
        "South Dakota,Sd,S Dak,SoDak=>Sd",
        "Tennessee,Te,Tenn",
        "Texas,Tx,Tex",
        "Utah,Ut",
        "Vermont,Vt",
        "Virginia,Va,Virg",
        "Washington,Wa,Wash,Wn",
        "West Virginia,Wv,W Va,W Virg=>Wv",
        "Wisconsin,Wi,Wis,Wisc",
        "Wyoming,Wy,Wyo"
      ]
    }
  }
}
However, the synonyms filter doesn't seem to be working. Here are two queries that I tried:
"match": {
  "location.location_raw": {
    "type": "boolean",
    "operator": "AND",
    "query": "Massachusetts",
    "analyzer": "synonyms"
  }
}
"match": {
  "location.location_raw": {
    "type": "boolean",
    "operator": "AND",
    "query": "Mass",
    "analyzer": "synonyms"
  }
}
With the synonyms filter I should get the same number of results for both queries, but I get 6 results for "Massachusetts" and 2 results for "Mass," and when I look at the results, all of the location_raw fields for the first query contain "Massachusetts" while all of the location_raw fields for the second query contain "Mass" exactly. It seems like the synonyms analyzer is simply being ignored.
What am I missing here?
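One way to see what the analyzer is actually emitting is the _analyze API; a minimal sketch, assuming the index is called my_index:

```json
GET /my_index/_analyze
{
  "analyzer": "synonyms",
  "text": "Mass"
}
```

If the synonyms analyzer is wired up correctly, the returned token list should include the lowercased synonym terms (e.g. "massachusetts"); if it returns only "mass", the analyzer is not being applied to the field you are querying.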

Related

Combining terms with synonyms - ElasticSearch

I am new to Elasticsearch and have a synonym analyzer in place which looks like-
{
"settings": {
"index": {
"analysis": {
"filter": {
"graph_synonyms": {
"type": "synonym_graph",
"synonyms": [
"gowns, dresses",
"backpacks, bags",
"coats, jackets"
]
}
},
"analyzer": {
"search_time_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"graph_synonyms"
]
}
}
}
}
}
}
And the mapping looks like-
{
"properties": {
"category": {
"type": "text",
"search_analyzer": "search_time_analyzer",
"fields": {
"no_synonyms": {
"type": "text"
}
}
}
}
}
If I search for gowns, it gives me proper results for both gowns as well as dresses.
But the problem is that if I search for red gowns (the system does not have any red gowns), the expected behavior is to search for red dresses and return those results. Instead, it returns results for gowns and dresses irrespective of the color.
I would want to configure the system such that it considers both the terms and their respective synonyms if any and then return the results.
For reference, this is what my search query looks like-
"query":
{
"bool":
{
"should":
[
{
"multi_match":
{
"boost": 300,
"query": term,
"type": "cross_fields",
"operator": "or",
"fields": ["bu.keyword^10", "bu^10", "category.keyword^8", "category^8", "category.no_synonyms^8", "brand.keyword^7", "brand^7", "colors.keyword^2", "colors^2", "size.keyword", "size", "hash.keyword^2", "hash^2", "name"]
}
}
]
}
}
Sample document:
_source: {
productId: '12345',
name: 'RUFFLE FLORAL TRIM COTTON MAXI DRESS',
brand: [ 'self-portrait' ],
mainImage: 'http://test.jpg',
description: 'Self-portrait presents this maxi dress, crafted from cotton, to offer your off-duty ensembles an elegant update. Trimmed with ruffled broderie details, this piece is an effortless showcase of modern femininity.',
status: 'active',
bu: [ 'womenswear' ],
category: [ 'dresses', 'gowns' ],
tier1: [],
tier2: [],
colors: [ 'WHITE' ],
size: [ '4', '6', '8', '10' ],
hash: [
'ballgown', 'cotton',
'effortless', 'elegant',
'floral', 'jar',
'maxi', 'modern',
'off-duty', 'ruffle',
'ruffled', '1',
'2', 'crafted'
],
styleCode: '211274856'
}
How can I achieve the desired output? Any help would be appreciated. Thanks
You can configure an index-time analyzer instead of a search-time analyzer, like below:
{
"properties": {
"category": {
"type": "text",
"analyzer": "search_time_analyzer",
"fields": {
"no_synonyms": {
"type": "text"
}
}
}
}
}
Once you are done with the index mapping change, reindex your data and try the query below.
Please note that I have changed the operator to and and the analyzer to standard:
{
"query": {
"multi_match": {
"boost": 300,
"query": "gowns red",
"analyzer": "standard",
"type": "cross_fields",
"operator": "and",
"fields": [
"category",
"colors"
]
}
}
}
Why your current query is not working:
Indexing:
Your current index mapping indexes data with the standard analyzer, so none of your category values are indexed with their synonyms.
Searching:
Your current query uses the or operator, so if you search for red gowns it builds a query like red OR gowns OR dresses, which returns results irrespective of the color. Also, if you change the operator to and with your existing configuration, it will return zero results, because it builds a query like red AND gowns AND dresses.
Solution: Once you make the changes I suggested, synonyms are indexed for the category field as well, and it works with the and operator. So if you try the query gowns red, it builds a query like gowns AND red, which matches because the category field contains both gowns and dresses thanks to the synonyms applied at index time.
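After reindexing, you can verify that synonyms are now applied at index time by analyzing text against the field itself; a sketch, assuming the index is named products:

```json
GET /products/_analyze
{
  "field": "category",
  "text": "gowns"
}
```

With the analyzer moved to index time, the token list should contain both gowns and dresses; if it contains only gowns, the mapping change has not taken effect.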

elasticsearch synonyms & shingle conflict

Let me jump straight to the code.
PUT /test_1
{
"settings": {
"analysis": {
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"university of tokyo => university_of_tokyo, u_tokyo",
"university => college, educational_institute, school"
],
"tokenizer": "whitespace"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": [
"shingle",
"synonym"
]
}
}
}
}
}
output
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Token filter [shingle] cannot be used to parse synonyms"
}
],
"type": "illegal_argument_exception",
"reason": "Token filter [shingle] cannot be used to parse synonyms"
},
"status": 400
}
Basically, let's say I have the following index-time synonyms:
"university => university, college, educational_institute, school"
"tokyo => tokyo, japan_capitol"
"university of tokyo => university_of_tokyo, u_tokyo"
If I search for "college" I expect it to match "university of tokyo",
but since the index contains only "university of tokyo" => university_of_tokyo, u_tokyo, the search fails.
I was expecting that with analyzer{'filter': ["shingle", "synonym"]}:
university of tokyo -shingle-> university -synonyms-> college, institute
How do I obtain the desired behaviour?
I was getting a similar error, though I was using a synonym graph.
I tried using lenient=true in the synonym graph definition and got rid of the error. I'm not sure if there is a downside.
"graph_synonyms" : {
"lenient": "true",
"type" : "synonym_graph",
"synonyms_path" : "synonyms.txt"
},
According to this link, tokenizers should produce single tokens before a synonym filter.
But to answer your problem: first of all, your second rule should be modified like this so that all of the terms are synonyms of each other:
university, college, educational_institute, school
Second, because of the underscores in the first rule's replacement (university_of_tokyo), all occurrences of "university of tokyo" are indexed as the single token university_of_tokyo, which no longer carries its individual tokens. To overcome this problem I would suggest a char filter with a rule like this:
university of tokyo => university_of_tokyo university of tokyo
and then in your synonyms rule:
university_of_tokyo , u_tokyo
This is a way to handle the multi-term synonym problem as well.
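Putting the suggestion together, the settings might look like the following sketch (the index name test_2, the char filter name, and the analyzer wiring are illustrative, not from the question):

```json
PUT /test_2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "uot_char_filter": {
          "type": "mapping",
          "mappings": [
            "university of tokyo => university_of_tokyo university of tokyo"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": [
            "university_of_tokyo, u_tokyo",
            "university, college, educational_institute, school"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [ "uot_char_filter" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "synonym" ]
        }
      }
    }
  }
}
```

The mapping char filter duplicates the phrase before tokenization, so the text is indexed both as the single token university_of_tokyo and as its individual words, letting the word-level synonym rules fire as well.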

Elasticsearch - Fuzzy, phrase, completion suggestor and dashes

So I have been asking separate questions trying to achieve the search functionality I would like to achieve but still falling short so thought I would just ask people what they suggest for the optimal Elasticsearch settings, mappings, indexing and query structure to do what I am looking for.
I need a search as you type solution that queries categories. If I typed in "mex" I am looking to get back results like "Mexican Restaurant", "Mexican Grocery Store", "Tex-Mex Restaurant" and "Medical Supplies". The "Medical Supplies" would come back because the fuzzy could think you wanted to type "med". The categories with "Mexican" in it should be listed first though. On the topic of priority if a user typed in "bar" I would expect "Bar" to be in the list before "Barn" or "Barbecue".
On top of this I am also looking for the ability for a user to search "Mexican Store" and "Mexican Grocery Store" would still be returned. Also if a user typed in "Store Mexican" for "Mexican Grocery Store" to still be returned.
As well as the above features I need a way to handle dashes. If a user were to type any variation of "tex mex", "tex-mex", "texmex" I would expect to get "Tex-Mex Restaurant".
If you have read this far I really appreciate it. I have implemented a few solutions already but none of them have been able to do all of what I described above.
My current configuration:
settings
curl -XPUT http://localhost:9200/objects -d '{
"settings": {
"analysis": {
"analyzer": {
"lower": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "lowercase" ]
}
}
}
}
}'
mapping
curl -XPUT http://localhost:9200/objects/object/_mapping -d '{
"object" : {
"properties" : {
"objectDescription" : {
"type" : "string",
"fields" : {
"lower": {
"type": "string",
"analyzer": "lower"
}
}
},
"suggest" : {
"type" : "completion",
"analyzer" : "simple",
"search_analyzer" : "simple",
"payloads" : true
}
}
}
}'
index
{
"id":6663521500659712,
"objectDescription":"Mexican Grocery Store",
"suggest":{
"input":["Mexican Grocery Store"],
"output":"Mexican Grocery Store",
"payload":{
"id":6663521500659712
}
}
}
query
{
"query":{
"bool":{
"should":[
{
"fuzzy":{
"objectDescription.lower":{"value":"med"}
}
},
{
"term":{
"objectDescription":{"value":"med"}
}
}
]
}
},
"from":0,
"size":20,
"suggest":{
"object-suggest":{
"text":"med",
"completion":{
"field":"suggest",
"fuzzy":{
"fuzzy":true
}
}
}
}
}
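For the dash variants in particular ("tex mex", "tex-mex", "texmex"), one option worth sketching (not part of the configuration above; the filter and analyzer names here are made up) is a word_delimiter filter with catenate_words enabled, so that "tex-mex" is indexed both as the split tokens and as the joined form:

```json
"analysis": {
  "filter": {
    "dash_words": {
      "type": "word_delimiter",
      "generate_word_parts": true,
      "catenate_words": true,
      "preserve_original": true
    }
  },
  "analyzer": {
    "dash_analyzer": {
      "tokenizer": "whitespace",
      "filter": [ "lowercase", "dash_words" ]
    }
  }
}
```

With this chain, "Tex-Mex" should produce tokens like tex, mex, texmex, and the original tex-mex, so any of the three user spellings can match.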

Elasticsearch synonym analyzer not working

EDIT: To add on to this, the synonyms seem to be working with basic querystring queries.
"query_string" : {
"default_field" : "location.region.name.raw",
"query" : "nh"
}
This returns all of the results for New Hampshire, but a "match" query for "nh" returns no results.
I'm trying to add synonyms to my location fields in my Elastic index, so that if I do a location search for "Mass," "Ma," or "Massachusetts" I'll get the same results each time. I added the synonyms filter to my settings and changed the mapping for locations. Here are my settings:
"analysis": {
"analyzer":{
"synonyms":{
"filter":[
"lowercase",
"synonym_filter"
],
"tokenizer": "standard"
}
},
"filter":{
"synonym_filter":{
"type": "synonym",
"synonyms":[
"United States,US,USA=>usa",
"Alabama,Al,Ala",
"Alaska,Ak,Alas",
"Arizona,Az,Ariz",
"Arkansas,Ar,Ark",
"California,Ca,Calif,Cal",
"Colorado,Co,Colo,Col",
"Connecticut,Ct,Conn",
"Delaware,De,Del",
"District of Columbia,Dc,Wash Dc,Washington Dc=>Dc",
"Florida,Fl,Fla,Flor",
"Georgia,Ga",
"Hawaii,Hi",
"Idaho,Id,Ida",
"Illinois,Il,Ill,Ills",
"Indiana,In,Ind",
"Iowa,Ia,Ioa",
"Kansas,Kans,Kan,Ks",
"Kentucky,Ky,Ken,Kent",
"Louisiana,La",
"Maine,Me",
"Maryland,Md",
"Massachusetts,Ma,Mass",
"Michigan,Mi,Mich",
"Minnesota,Mn,Minn",
"Mississippi,Ms,Miss",
"Missouri,Mo",
"Montana,Mt,Mont",
"Nebraska,Ne,Neb,Nebr",
"Nevada,Nv,Nev",
"New Hampshire,Nh=>Nh",
"New Jersey,Nj=>Nj",
"New Mexico,Nm,N Mex,New M=>Nm",
"New York,Ny=>Ny",
"North Carolina,Nc,N Car=>Nc",
"North Dakota,Nd,N Dak, NoDak=>Nd",
"Ohio,Oh,O",
"Oklahoma,Ok,Okla",
"Oregon,Or,Oreg,Ore",
"Pennsylvania,Pa,Penn,Penna",
"Rhode Island,Ri,Ri & PP,R Isl=>Ri",
"South Carolina,Sc,S Car=>Sc",
"South Dakota,Sd,S Dak,SoDak=>Sd",
"Tennessee,Te,Tenn",
"Texas,Tx,Tex",
"Utah,Ut",
"Vermont,Vt",
"Virginia,Va,Virg",
"Washington,Wa,Wash,Wn",
"West Virginia,Wv,W Va, W Virg=>Wv",
"Wisconsin,Wi,Wis,Wisc",
"Wyoming,Wy,Wyo"
]
}
}
}
And the mapping for the location.region field:
"region":{
"properties":{
"id":{"type": "long"},
"name":{
"type": "string",
"analyzer": "synonyms",
"fields":{"raw":{"type": "string", "index": "not_analyzed" }}
}
}
}
But the synonyms analyzer doesn't seem to be doing anything. This query for example:
"match" : {
"location.region.name" : {
"query" : "Massachusetts",
"type" : "phrase",
"analyzer" : "synonyms"
}
}
This returns hundreds of results, but if I replace "Massachusetts" with "Ma" or "Mass" I get 0 results. Why isn't it working?
The order of the filters is
"filter": [
"lowercase",
"synonym_filter"
]
So Elasticsearch lowercases the tokens first, and when it executes the second step, synonym_filter, the tokens no longer match any of the entries you have defined.
To solve the problem, I would define the synonyms in lower case.
You can also define your synonym filter as case-insensitive:
"filter":{
"synonym_filter":{
"type": "synonym",
"ignore_case" : "true",
"synonyms":[
...
]
}
}
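The first suggestion, lowercasing the synonym entries themselves, would look like this (only a couple of entries shown for illustration):

```json
"synonym_filter": {
  "type": "synonym",
  "synonyms": [
    "massachusetts,ma,mass",
    "new hampshire,nh=>nh"
  ]
}
```

Note that the contraction-style rules (the =>Nh entries) would likewise need lowercase replacement terms, since they too are compared against the already-lowercased token stream.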

Difference in handling possessive (apostrophes) with english stemmer between 1.2 and 1.4

We have two instances of Elasticsearch, one running 1.2.1 and one running 1.4; the settings and the mapping are identical on the indices in both instances, yet the results are different.
The setting for the default analyzer:
....
analysis: {
filter: {
ourEnglishStopWords: {
type: "stop",
stopwords: "_english_"
},
ourEnglishFilter: {
type: "stemmer",
name: "english"
}
},
analyzer: {
default: {
filter: [
"asciifolding",
"lowercase",
"ourEnglishStopWords",
"ourEnglishFilter"
],
tokenizer: "standard"
}
}
},
...
The difference between the Elasticsearch versions appears when indexing/searching for possessive forms:
whereas in 1.2.1 "player", "players" and "player's" would return the same results, in 1.4
the first two ("player" and "players") have identical result sets, while "player's" does not match the set.
Is it a known difference? What is the right way to get the same behavior in 1.4 and up?
I think this is the change, introduced in 1.3.0:
The StemmerTokenFilter had a number of issues:
english returned the slow snowball English stemmer
porter2 returned the snowball Porter stemmer (v1)
Changes:
english now returns the fast PorterStemmer (for indices created from
v1.3.0 onwards)
porter2 now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
According to that github issue, you can either to change your mapping to:
"ourEnglishFilter": {
"type": "stemmer",
"name": "porter2"
}
or try something else:
"filter": {
"ourEnglishStopWords": {
"type": "stop",
"stopwords": "_english_"
},
"ourEnglishFilter": {
"type": "stemmer",
"name": "english"
},
"possesiveEnglish": {
"type": "stemmer",
"name": "possessive_english"
}
},
"analyzer": {
"default": {
"filter": [
"asciifolding",
"lowercase",
"ourEnglishStopWords",
"possesiveEnglish",
"ourEnglishFilter"
],
"tokenizer": "standard"
}
}
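A quick way to confirm that possessive_english strips the apostrophe-s is the _analyze API with an ad hoc filter chain; note this request body shape is for Elasticsearch 5.x and later (on 1.x the analyze API took query-string parameters instead):

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "possessive_english" ],
  "text": "player's"
}
```

With possessive_english in the chain, the output token should be player, which the subsequent stemmer then treats identically to "player" and "players".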