Some of the sentences I want to define as synonyms have commas in them, for example:
"Employment Information" and "Your Activity, Your Job" are synonyms.
However, if I define them in the following way, the result is not what I envisioned, since "," has a special meaning in the Elasticsearch synonym format:
Employment Information=>Your Activity, Your Job
Is using the WordNet synonym format my only option in this case, or can I just ignore the comma entirely and take it out?
I don't think the comma will be an issue. If you are using the standard analyzer, it will remove the comma. This is my test setup:
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"employment information=>your activity your job"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
It is good to use the lowercase filter to avoid case-sensitivity issues. So now this query
GET my_index/_analyze?text=employment Information&analyzer=my_synonyms
will give you the following tokens:
{
"tokens": [
{
"token": "your",
"start_offset": 0,
"end_offset": 10,
"type": "SYNONYM",
"position": 1
},
{
"token": "activity",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 2
},
{
"token": "your",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 3
},
{
"token": "job",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 4
}
]
}
There is a gotcha with multiword synonyms: if you analyze the output of
GET my_index/_analyze?text=employment Information is useful&analyzer=my_synonyms
you will get unexpected results like this:
{
"tokens": [
{
"token": "your",
"start_offset": 0,
"end_offset": 10,
"type": "SYNONYM",
"position": 1
},
{
"token": "activity",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 2
},
{
"token": "is",
"start_offset": 23,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "your",
"start_offset": 23,
"end_offset": 25,
"type": "SYNONYM",
"position": 3
},
{
"token": "useful",
"start_offset": 26,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "job",
"start_offset": 26,
"end_offset": 32,
"type": "SYNONYM",
"position": 4
}
]
}
You can solve this issue with simple contraction; write the synonyms like this:
"synonyms": [
"employment information,your activity your job=>sentence1"
]
If you are using the keyword analyzer, you could use a pattern replace character filter to remove the comma.
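For example, a char filter along these lines (the name remove_commas is just illustrative) would strip commas from the input before tokenization:
"char_filter": {
  "remove_commas": {
    "type": "pattern_replace",
    "pattern": ",",
    "replacement": ""
  }
}
You would then list remove_commas under char_filter in your custom analyzer definition.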
Related
I know that Elasticsearch's standard analyzer uses the standard tokenizer to generate tokens.
The Elasticsearch docs say it does grammar-based tokenization, but the separators used by the standard tokenizer are not clear.
My use case is as follows:
In my Elasticsearch index I have some fields which use the default standard analyzer.
In those fields I want the # character to be searchable and . to act as an additional separator.
Can I achieve my use case with the standard analyzer?
I checked which tokens it generates for the string hey john.s #100 is a test name.
POST _analyze
{
"text": "hey john.s #100 is a test name",
"analyzer": "standard"
}
It generated the following tokens:
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "john.s",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "100",
"start_offset": 12,
"end_offset": 15,
"type": "<NUM>",
"position": 2
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6
}
]
}
So now I have a doubt: does the standard tokenizer use only whitespace as a separator?
Thank you in advance.
Let's first see why it does not break tokens on . for some of the words:
The standard analyzer does use the standard tokenizer, but the standard tokenizer provides grammar-based tokenization following the Unicode Text Segmentation algorithm (UAX #29); it is not simply splitting on whitespace.
Now let's see how you can tokenize on the . dot but not on #:
You can use the Character Group tokenizer and provide the list of characters on which you want to split.
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
".",
"\n"
]
},
"text": "hey john.s #100 is a test name"
}
Response:
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "john",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "s",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 2
},
{
"token": "#100",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "word",
"position": 4
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "word",
"position": 5
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 6
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "word",
"position": 7
}
]
}
My use case is to search for edge_ngrams with synonym support where the tokens to match should be in sequence.
While trying out the analysis, I observed two different behaviours of the filter chain with respect to position increments.
With the filter chain lowercase, synonym there is no position increment from the SynonymFilter.
With the filter chain lowercase, edge_ngram, synonym there is a position increment from the SynonymFilter.
Here are the queries I'm running for each case:
Case 1. No position increment
PUT synonym_test
{
"index": {
"analysis": {
"analyzer": {
"by_smart": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_synonym"
]
}
},
"filter": {
"custom_synonym": {
"type": "synonym",
"synonyms": [
"begin => start"
]
}
}
}
}
}
GET synonym_test/_analyze
{
"text": "begin working",
"analyzer": "by_smart"
}
Output:
{
"tokens": [
{
"token": "start",
"start_offset": 0,
"end_offset": 5,
"type": "SYNONYM",
"position": 0
},
{
"token": "working",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 1
}
]
}
Case 2. Position increment
PUT synonym_test
{
"index": {
"analysis": {
"analyzer": {
"by_smart": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_edge_ngram",
"custom_synonym"
]
}
},
"filter": {
"custom_synonym": {
"type": "synonym",
"synonyms": [
"begin => start"
]
},
"custom_edge_ngram": {
"type": "edge_ngram",
"min_gram": "2",
"max_gram": "60"
}
}
}
}
}
GET synonym_test/_analyze
{
"text": "begin working",
"analyzer": "by_smart"
}
Output:
{
"tokens": [
{
"token": "be",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "beg",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "begi",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "start",
"start_offset": 0,
"end_offset": 5,
"type": "SYNONYM",
"position": 1
},
{
"token": "wo",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "wor",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "work",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "worki",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "workin",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "working",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
}
]
}
Notice how in Case 1 the token begin is replaced by start at the same position, with no position increment. However, in Case 2, when the begin token is replaced by start, the position is incremented for the subsequent token stream.
Now here are my questions :
Why does this happen only in Case 2 and not in Case 1?
The main issue this causes: when the input query is begi wor with a match_phrase query (and the default slop of 0), it doesn't match begin work,
since begi and wor are 2 positions apart. Any suggestions on how I can achieve this behaviour without impacting my use case?
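For reference, this is roughly the query that fails (the field name my_field is a placeholder, not from my real mapping):
GET synonym_test/_search
{
  "query": {
    "match_phrase": {
      "my_field": {
        "query": "begi wor",
        "slop": 0
      }
    }
  }
}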
I'm using Elasticsearch version 5.6.8 with Lucene version 6.6.1.
I've read several documentation links and articles, but I couldn't find one that properly explains why this happens and whether there is some setting to get my desired behaviour.
Below is the Elasticsearch mapping with one field called hostname and another field called catch_all, which is basically a copy_to field (there will be many more fields copying values to it):
{
"settings": {
"analysis": {
"filter": {
"myNGramFilter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 40
}},
"analyzer": {
"myNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "myNGramFilter"]
}
}
}
},
"mappings": {
"test": {
"properties": {
"catch_all": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"store": true,
"ignore_above": 256
},
"grams": {
"type": "text",
"store": true,
"analyzer": "myNGramAnalyzer"
}
}
},
"hostname": {
"type": "text",
"copy_to": "catch_all"
}
}
}
}
}
When I run the following _analyze request, I get these tokens:
GET index/_analyze
{
"analyzer": "myNGramAnalyzer",
"text": "Dell PowerEdge R630"
}
{
"tokens": [
{
"token": "d",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "de",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "del",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dell",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "p",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "po",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "pow",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powe",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "power",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powere",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powered",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredg",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredge",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "r",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r6",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r63",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r630",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
There is a token called "poweredge".
Right now we search with the query below:
{
"query": {
"multi_match": {
"fields": ["catch_all.grams"],
"query": "poweredge",
"operator": "and"
}
}
}
When we query with "poweredge" we get 1 result, but when we search for just "edge" there are no results.
Even a match query does not yield results for the search word "edge".
Can somebody help here?
I suggest not querying with multi_match for your use case, but using a match query instead. The edge_ngram filter works this way: it emits n-grams from each token produced by the tokenizer on your text. As written in the documentation:
The edge_ngram tokenizer first breaks text down into words whenever it
encounters one of a list of specified characters, then it emits
N-grams of each word where the start of the N-gram is anchored to the
beginning of the word.
As you have verified with your call to the _analyze API, it does not produce "edge" as an n-gram of poweredge, because it produces n-grams anchored to the beginning of the word; look at the output of your _analyze API call. Take a look here: https://www.elastic.co/guide/en/elasticsearch/guide/master/ngrams-compound-words.html
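If you need "edge" to match in the middle of poweredge, one possible approach (a sketch only; the min_gram/max_gram values are assumptions you would tune) is to replace the edge_ngram filter with an ngram filter, which emits grams from every position of the token rather than only from the start, at the cost of a larger index:
"myNGramFilter": {
  "type": "ngram",
  "min_gram": 3,
  "max_gram": 4
}
With such a filter the token poweredge produces, among others, the gram edge, so a match query for "edge" can hit. You would typically also set a plain search_analyzer (e.g. standard) on the field so the query text itself is not n-grammed.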
I have a problem with some Turkish characters and need some kind of alternative-character handling to fix it.
Example: "İzmir" is one of our cities, but some users search for it as "ızmır", some as "Izmır", and sometimes as "izmir".
How can I treat "ı", "İ", "I" and "i" as the same character whenever a user types any of them?
You just need to change your mapping from English to Turkish; if you have already done so, can you paste your mapping here?
The mapping determines how search behaves.
The Turkish analyzer doesn't do that out of the box for you. Example:
GET _analyze
{
"analyzer": "turkish",
"text": "ızmır Izmır izmir"
}
{
"tokens": [
{
"token": "ızmır",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "ızmır",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "izmir",
"start_offset": 12,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
}
]
}
But you could use ASCII folding for that purpose — creating an index with a custom analyzer and testing the example against it:
PUT /asciifold_example
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "asciifolding"]
}
}
}
}
}
GET asciifold_example/_analyze
{
"analyzer": "my_analyzer",
"text": "ızmır Izmır izmir"
}
{
"tokens": [
{
"token": "izmir",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "izmir",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "izmir",
"start_offset": 12,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
}
]
}
I have the following text:
Lurasidone is a dopamine D<sub>2</sub>
I would like to tokenize it such that I get the following tokens:
Lurasidone
dopamine
D2
How do I achieve this using a tokenizer or filter? I've attempted to use the html_strip character filter; however, D<sub>2</sub> is tokenized as:
D
2
whereas I need it to tokenize as:
D2
You can use the Pattern Replace Char Filter.
This is what I did:
"char_filter": {
"html_pattern": {
"type": "pattern_replace",
"pattern": "<.*>(.*)<\\/.*>",
"replacement": "$1"
}
}
I included that in my custom analyzer like this:
"my_custom_analyzer": {
"tokenizer": "standard",
"char_filter": [
"html_pattern"
],
"filter": ["stop"]
}
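Assuming the analyzer is defined in the settings of an index (my_index below is a placeholder name), the tokens can be reproduced with:
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Lurasidone is a dopamine D<sub>2</sub>"
}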
These are the tokens generated for your text:
{
"tokens": [
{
"token": "Lurasidone",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "dopamine",
"start_offset": 16,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "D2",
"start_offset": 25,
"end_offset": 38,
"type": "<ALPHANUM>",
"position": 5
}
]
}
I hope this helps.