Elasticsearch synonyms & shingle conflict

Let me jump straight to the code.
PUT /test_1
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": [
            "university of tokyo => university_of_tokyo, u_tokyo",
            "university => college, educational_institute, school"
          ],
          "tokenizer": "whitespace"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "shingle",
            "synonym"
          ]
        }
      }
    }
  }
}
output
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [shingle] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [shingle] cannot be used to parse synonyms"
  },
  "status": 400
}
Basically,
Let's say I have the following index-time synonyms:
"university => university, college, educational_institute, school"
"tokyo => tokyo, japan_capitol"
"university of tokyo => university_of_tokyo, u_tokyo"
If I search for "college" I expect it to match "university of tokyo",
but since the index contains only "university of tokyo" => university_of_tokyo, u_tokyo, the search fails.
I was expecting that with analyzer{'filter': ["shingle", "synonym"]} the analysis would go:
university of tokyo -shingle-> university -synonym-> college, educational_institute, school
How do I obtain the desired behaviour?

I was getting a similar error, though I was using synonym graph....
I tried using lenient=true in the synonym graph definition and got rid of the error. Not sure if there is a downside....
"graph_synonyms": {
  "lenient": "true",
  "type": "synonym_graph",
  "synonyms_path": "synonyms.txt"
},
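For context, a minimal sketch of how that lenient synonym_graph filter might sit inside full index settings, with an analyzer that references it. The index name, the analyzer name, and the synonyms.txt path are assumptions, not from the answer:
PUT /test_synonym_graph
{
  "settings": {
    "analysis": {
      "filter": {
        "graph_synonyms": {
          "type": "synonym_graph",
          "lenient": true,
          "synonyms_path": "synonyms.txt"
        }
      },
      "analyzer": {
        "my_graph_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "graph_synonyms"
          ]
        }
      }
    }
  }
}
Setting lenient to true makes the filter skip synonym rules it cannot parse instead of failing index creation.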

According to this link, tokenizers should produce single tokens before a synonym filter.
But to answer your problem: first of all, your second rule should be modified like this to make all of the terms synonyms of each other:
university, college, educational_institute, school
Second, because of the underscore in the tail of the first rule (university_of_tokyo), all occurrences of "university of tokyo" are indexed as university_of_tokyo, which is not aware of its single tokens. To overcome this I would suggest a char filter with a rule like this:
university of tokyo => university_of_tokyo university of tokyo
and then in your synonym rules:
university_of_tokyo, u_tokyo
This is a way to handle the multi-term synonyms problem as well.
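A minimal sketch of how that suggestion could be wired together; the index, char filter, and analyzer names below are made up, not taken from the answer:
PUT /test_2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "uot_char_filter": {
          "type": "mapping",
          "mappings": [
            "university of tokyo => university_of_tokyo university of tokyo"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": [
            "university_of_tokyo, u_tokyo",
            "university, college, educational_institute, school"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["uot_char_filter"],
          "tokenizer": "whitespace",
          "filter": ["synonym"]
        }
      }
    }
  }
}
With this setup, a document containing "university of tokyo" keeps both the combined token university_of_tokyo and the individual tokens, so the synonym rules can expand university to college and the combined token to u_tokyo, and a search for "college" can still match.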

Related

Compound synonyms in Elasticsearch

I'm trying to extract synonyms from a sentence. When synonyms are just words it works well. The problem occurs when synonyms are compound words.
For example, I have registered the following synonyms:
car, big car, vehicle
If I run Analyze on the following sentence:
"The car was moving fast", I get the correct synonyms.
If I search only for "big car" I also get the correct set of synonyms.
However, if I search for "The big car moved fast", I don't get the synonyms for "big car".
I'm using the following configuration:
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym_graph",
          "synonyms": [
            "car, big car, vehicle"
          ]
        }
      },
      "analyzer": {
        "keywords_token": {
          "filter": [
            "my_synonym",
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  }
}
This configuration is not ideal.
What I need is to get the composed synonym words back so I can return them to the interface.
A solution that returns only individual tokens does not work either, as it does not return the synonym tokens in order.
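One common workaround, sketched here rather than taken from the thread: keep synonym_graph but switch from the keyword tokenizer to a word-level tokenizer, so the multi-word rule "big car" can be matched inside a longer sentence. The index and analyzer names are hypothetical:
PUT compound_synonyms_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym_graph",
          "synonyms": [
            "car, big car, vehicle"
          ]
        }
      },
      "analyzer": {
        "synonym_words": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym"
          ]
        }
      }
    }
  }
}

GET compound_synonyms_test/_analyze
{
  "analyzer": "synonym_words",
  "text": "The big car moved fast"
}
The _analyze output should show car and vehicle injected at the "big car" position; whether that is enough to reconstruct the composed synonym for the interface, as the question asks, is a separate concern.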

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches for its keyword value. But I'm having some issues:
Some records say UK and others say GB
Some records say US and others say USA
And the list goes on..
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms in my query filter?
What you are looking for is a way to have your tokens treated as equivalent to similar tokens that may or may not share characters. This is only possible using synonyms.
Elasticsearch lets you configure synonyms and have your query use them, returning results accordingly.
I have configured a field with a custom analyzer that uses a synonym token filter. Below is a sample mapping and query so that you can play with it and see if it fits your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
  "name": "uk is pretty cool country"
}
And when you make use of the below query, it does return the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
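As a quick sanity check (not part of the original answer), you can also run the analyzer directly and confirm that both codes end up in the token stream:
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "gb"
}
With the equivalence rule "uk, gb", both uk and gb should come back at the same position, which is why the match query for gb also finds the document indexed with uk.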
Refer to their official documentation to understand more on this. Hope this helps!
To handle this within ES itself without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
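A minimal sketch of how that processor might be wired into a pipeline and used at index time; the pipeline name and the index/type are assumptions, not from the answer:
PUT _ingest/pipeline/countrycode-normalize
{
  "description": "Normalize alternative country codes",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    }
  ]
}

PUT my_index/mydocs/1?pipeline=countrycode-normalize
{
  "countryCode": "GB"
}
Note that this rewrites the stored field itself, unlike the synonym approach, which leaves the source untouched and only normalizes at analysis time.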

elasticsearch: how to map common character mistakes?

I would like to map common mistakes in my language, such as:
xampu -> shampoo
Shampoo is an English word, but it is commonly used in Brazil. In Portuguese, "ch" sounds like "x", and sometimes "s" sounds like "z". We also do not have "y" in our language, but it is common in names and foreign words; it sounds like "i".
So I would like to map a character replacement, but also keep the original word at the same position.
So a mapping table would be:
ch -> x
sh -> x
y -> i
ph -> f
s -> z
I have taken a look at the "Character Filters", but they seem to only support replacement:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
I want to form derivative words based on the original so users can find the correct word even if it is typed wrong. To achieve this, the following product name:
SHAMPOO NIVEA MEN
Should be tokenized as:
0: SHAMPOO, XAMPOO
1: NIVEA
2: MEN
I am using the synonym filter, but with synonyms I would need to map every word.
Is there any way to do this?
Thanks.
For your use case, a multi-field seems to suit best. You can keep your field analyzed in two ways: one using the standard analyzer and the other using a custom analyzer built with a mapping char filter.
It would look like:
Index Creation
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ch => x",
            "sh => x",
            "y => i",
            "ph => f",
            "s => z"
          ]
        }
      }
    }
  }
}
MultiField creation
POST my_index/_mapping/my_type
{
  "properties": {
    "field_name": {
      "type": "text",
      "analyzer": "standard",
      "fields": {
        "mapped": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
The above mapping creates two versions of field_name: one analyzed using the standard analyzer, the other analyzed using the custom analyzer created above.
In order to query both, you can use should clauses on both versions.
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "field_name": "xampoo"
          }
        },
        {
          "match": {
            "field_name.mapped": "shampoo"
          }
        }
      ]
    }
  }
}
Hope this helps you!!
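A quick way to verify the char filter, not shown in the answer: run _analyze against the custom analyzer (lowercase input, since the mapping rules above are lowercase and the analyzer has no lowercase filter):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "shampoo nivea men"
}
The first token should come back as xampoo, while field_name keeps the original shampoo, so the should query across both sub-fields matches either spelling.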

Elasticsearch combining language and char_filter in an analyzer

I'm trying to combine a language analyzer with a char_filter, but when I look at the _termvectors for the field I can see values in there that come from the HTML/XML markup, such as attributes of custom XML tags like "22anchor_titl".
My idea was to extend the german language filter:
settings:
  analysis:
    analyzer:
      node_body_analyzer:
        type: 'german'
        char_filter: ['html_strip']
mappings:
  mappings:
    node:
      body:
        type: 'string'
        analyzer: 'node_body_analyzer'
        search_analyzer: 'node_search_analyzer'
Is there an error in my configuration, or is the concept of deriving a new analyzer from 'german' by adding a char_filter simply not possible? If so, would I have to create a type: 'custom' analyzer, implement the whole thing as in the documentation, and add the char filter there?
Cheers
Yes, you need to do that. What if you wanted to add another token filter? Where should ES have placed it in the list of already existing token filters (since the order matters)? You need something like this:
"analysis": {
  "filter": {
    "german_stop": {
      "type": "stop",
      "stopwords": "_german_"
    },
    "german_keywords": {
      "type": "keyword_marker",
      "keywords": ["ghj"]
    },
    "german_stemmer": {
      "type": "stemmer",
      "language": "light_german"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "german_stop",
        "german_keywords",
        "german_normalization",
        "german_stemmer"
      ],
      "char_filter": "html_strip"
    }
  }
}
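To tie this back to the question's mapping, the rebuilt analyzer is then referenced by name on the field. A minimal sketch, reusing the field and type names from the question (string type as in that ES version):
"mappings": {
  "node": {
    "properties": {
      "body": {
        "type": "string",
        "analyzer": "my_analyzer"
      }
    }
  }
}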

Elasticsearch synonym analyzer not working

EDIT: To add on to this, the synonyms seem to be working with basic querystring queries.
"query_string": {
  "default_field": "location.region.name.raw",
  "query": "nh"
}
This returns all of the results for New Hampshire, but a "match" query for "nh" returns no results.
I'm trying to add synonyms to my location fields in my Elastic index, so that if I do a location search for "Mass," "Ma," or "Massachusetts" I'll get the same results each time. I added the synonyms filter to my settings and changed the mapping for locations. Here are my settings:
"analysis": {
"analyzer":{
"synonyms":{
"filter":[
"lowercase",
"synonym_filter"
],
"tokenizer": "standard"
}
},
"filter":{
"synonym_filter":{
"type": "synonym",
"synonyms":[
"United States,US,USA,USA=>usa",
"Alabama,Al,Ala,Ala",
"Alaska,Ak,Alas,Alas",
"Arizona,Az,Ariz",
"Arkansas,Ar,Ark",
"California,Ca,Calif,Cal",
"Colorado,Co,Colo,Col",
"Connecticut,Ct,Conn",
"Deleware,De,Del",
"District of Columbia,Dc,Wash Dc,Washington Dc=>Dc",
"Florida,Fl,Fla,Flor",
"Georgia,Ga",
"Hawaii,Hi",
"Idaho,Id,Ida",
"Illinois,Il,Ill,Ills",
"Indiana,In,Ind",
"Iowa,Ia,Ioa",
"Kansas,Kans,Kan,Ks",
"Kentucky,Ky,Ken,Kent",
"Louisiana,La",
"Maine,Me",
"Maryland,Md",
"Massachusetts,Ma,Mass",
"Michigan,Mi,Mich",
"Minnesota,Mn,Minn",
"Mississippi,Ms,Miss",
"Missouri,Mo",
"Montana,Mt,Mont",
"Nebraska,Ne,Neb,Nebr",
"Nevada,Nv,Nev",
"New Hampshire,Nh=>Nh",
"New Jersey,Nj=>Nj",
"New Mexico,Nm,N Mex,New M=>Nm",
"New York,Ny=>Ny",
"North Carolina,Nc,N Car=>Nc",
"North Dakota,Nd,N Dak, NoDak=>Nd",
"Ohio,Oh,O",
"Oklahoma,Ok,Okla",
"Oregon,Or,Oreg,Ore",
"Pennsylvania,Pa,Penn,Penna",
"Rhode Island,Ri,Ri & PP,R Isl=>Ri",
"South Carolina,Sc,S Car=>Sc",
"South Dakota,Sd,S Dak,SoDak=>Sd",
"Tennessee,Te,Tenn",
"Texas,Tx,Tex",
"Utah,Ut",
"Vermont,Vt",
"Virginia,Va,Virg",
"Washington,Wa,Wash,Wn",
"West Virginia,Wv,W Va, W Virg=>Wv",
"Wisconsin,Wi,Wis,Wisc",
"Wyomin,Wi,Wyo"
]
}
}
}
And the mapping for the location.region field:
"region": {
  "properties": {
    "id": { "type": "long" },
    "name": {
      "type": "string",
      "analyzer": "synonyms",
      "fields": { "raw": { "type": "string", "index": "not_analyzed" } }
    }
  }
}
But the synonyms analyzer doesn't seem to be doing anything. This query for example:
"match": {
  "location.region.name": {
    "query": "Massachusetts",
    "type": "phrase",
    "analyzer": "synonyms"
  }
}
This returns hundreds of results, but if I replace "Massachusetts" with "Ma" or "Mass" I get 0 results. Why isn't it working?
The order of the filters is
"filter": [
  "lowercase",
  "synonym_filter"
]
So, since Elasticsearch lowercases the tokens first, when it executes the second step, synonym_filter, the tokens won't match any of the entries you have defined (which contain uppercase letters).
To solve the problem, I would define the synonyms in lower case.
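A couple of the entries rewritten that way, purely as an illustration (same rules as above, just lowercased so they survive the lowercase filter):
"synonym_filter": {
  "type": "synonym",
  "synonyms": [
    "massachusetts, ma, mass",
    "new hampshire, nh => nh"
  ]
}
Analyzing "Mass" then produces the token mass, which matches the first rule and expands to massachusetts as well.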
You can also define your synonyms filter as case insensitive:
"filter": {
  "synonym_filter": {
    "type": "synonym",
    "ignore_case": "true",
    "synonyms": [
      ...
    ]
  }
}
