I have a string property called summary that has analyzer set to trigrams and search_analyzer set to words.
"filter": {
"words_splitter": {
"type": "word_delimiter",
"preserve_original": "true"
},
"english_words_filter": {
"type": "stop",
"stopwords": "_english_"
},
"trigrams_filter": {
"type": "ngram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"words": {
"filter": [
"lowercase",
"words_splitter",
"english_words_filter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"trigrams": {
"filter": [
"lowercase",
"words_splitter",
"trigrams_filter",
"english_words_filter"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
I need query strings such as React and HTML (or React, html) to match documents whose summary contains the words React, reactjs, react.js, html or html5. The more keywords a document matches, the higher its score should be (ideally, I would expect lower scores for documents in which only one word matches, and not even exactly).
At the moment I suspect react.js is split into both react and js, since I also get every document that contains js. On the other hand, Reactjs returns nothing. I also think I need words_splitter in order to ignore the comma.
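You can handle names like react.js by adding a keyword marker filter to an analyzer built on the standard tokenizer. The standard tokenizer emits react.js as a single token instead of splitting it into react and js, and the keyword marker flags that token as a keyword so that the stemmer later in the filter chain leaves it unchanged.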
Here is an example configuration for the filter:
"filter": {
"keywords": {
"type": "keyword_marker",
"keywords": [
"react.js"
]
}
}
And the analyzer:
"analyzer": {
"main_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"keywords",
"synonym_filter",
"german_stop",
"german_stemmer"
]
}
}
You can see whether your analyzer behaves as required using the analyze command:
GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"
This should return the following tokens, with react.js kept as a single token:
{
"tokens": [
{
"token": "react.js",
"start_offset": 1,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "is",
"start_offset": 10,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "a",
"start_offset": 13,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "nice",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "library",
"start_offset": 20,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 4
}
]
}
For words that are similar but not identical, such as React.js and Reactjs, you could use a synonym filter. Do you have a fixed set of keywords that you want to match?
I found a solution.
Basically, I define the word_delimiter filter with catenate_all enabled:
"words_splitter": {
"catenate_all": "true",
"type": "word_delimiter",
"preserve_original": "true"
}
and use it in the words analyzer together with a keyword tokenizer:
"words": {
"filter": [
"words_splitter"
],
"type": "custom",
"tokenizer": "keyword"
}
Calling http://localhost:9200/sample_index/_analyze?analyzer=words&pretty=true&text=react.js I get the following tokens:
{
"tokens": [
{
"token": "react.js",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "react",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "reactjs",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "js",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 1
}
]
}
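To see why this configuration produces both the split parts and the joined spelling, the behaviour of word_delimiter with preserve_original and catenate_all can be approximated in a few lines of Python. This is only a rough sketch: the function name word_delimiter_sketch is hypothetical, it splits on any non-alphanumeric character, and it ignores the filter's other rules (case changes, letter/digit transitions, positions and offsets), so the order of its output may differ from the real filter.

```python
import re

def word_delimiter_sketch(token, preserve_original=True, catenate_all=True):
    """Rough approximation of word_delimiter for a single input token."""
    # Split on runs of non-alphanumeric characters, dropping empty pieces.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    out = []
    if preserve_original:
        out.append(token)          # keep the untouched input
    out.extend(parts)              # the individual sub-words
    if catenate_all and len(parts) > 1:
        out.append("".join(parts)) # all sub-words joined together
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(out))

print(sorted(word_delimiter_sketch("react.js")))
print(word_delimiter_sketch("reactjs"))
```

The token set for react.js includes reactjs, which is what allows the concatenated spelling to match.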
In my situation, the field holds a value like "abc,123", and I want it to be searchable by either "abc" or "123".
My index mapping looks like this:
{
"myfield": {
"type": "text",
"analyzer": "stop",
"search_analyzer": "stop"
}
}
But when I test it with the Elasticsearch _analyze API, I get this result:
{
"tokens": [
{
"token": "abc",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
}
]
}
"123" was lost.
To handle this case, do I need to choose a different analyzer, or just add some special configuration?
You need to use the standard analyzer instead. The stop analyzer breaks text into terms whenever it encounters a character that is not a letter, and it also removes stop words like 'the'. In your case, "abc,123" yields only the token abc with the stop analyzer, because the digits are not letters and are discarded. With the standard analyzer, it returns abc and 123, as shown below:
POST _analyze
{
"analyzer": "standard",
"text": "abc, 123"
}
Output:
{
"tokens": [
{
"token": "abc",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "123",
"start_offset": 5,
"end_offset": 8,
"type": "<NUM>",
"position": 1
}
]
}
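The difference between the two analyzers can be illustrated with a small Python approximation. It is a sketch under simplifying assumptions, not Lucene's implementation: stop_like and standard_like are hypothetical names, the stop-word list is a tiny sample, and the regular expressions only mimic "keep runs of letters" and "keep runs of letters and digits".

```python
import re

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # tiny sample list

def stop_like(text):
    # Lowercase, keep runs of letters only, then drop stop words.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def standard_like(text):
    # Lowercase, keep runs of letters and digits.
    return re.findall(r"[^\W_]+", text.lower())

print(stop_like("abc,123"))       # digits are not letters, so 123 disappears
print(standard_like("abc,123"))   # both abc and 123 survive
```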
EDIT 1: Using the simple pattern split tokenizer
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern_split",
"pattern": ","
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "abc,123"
}
Output:
{
"tokens": [
{
"token": "abc",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "123",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 1
}
]
}
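The offsets in this output follow directly from splitting on the comma. As a rough illustration, the hypothetical helper split_with_offsets below does the same with a literal separator in Python; the real tokenizer accepts a restricted regular expression rather than a plain string, so this only covers the simplest case.

```python
def split_with_offsets(text, sep=","):
    """Split on a literal separator and report (token, start, end) tuples."""
    tokens, start = [], 0
    for piece in text.split(sep):
        end = start + len(piece)
        if piece:                      # skip empty pieces such as in "a,,b"
            tokens.append((piece, start, end))
        start = end + len(sep)         # jump over the separator itself
    return tokens

print(split_with_offsets("abc,123"))
```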
The problem is that a search for any character sequence containing the boost operator ^ (caret) does not return any results.
However, according to the Elasticsearch documentation at
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_reserved_characters
the characters && || ! ( ) { } [ ] ^ " ~ * ? : \ can be escaped with a \ symbol.
I have a requirement to do a "contains" search using an n-gram analyzer in Elasticsearch.
Below is the mapping structure for the sample use case:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"nGram_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "ngram_tokenizer"
},
"whitespace_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "whitespace"
}
},
"tokenizer": {
"ngram_tokenizer": {
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
],
"min_gram": "2",
"type": "nGram",
"max_gram": "20"
}
}
}
}
},
"mappings": {
"employee": {
"properties": {
"employeeName": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
I have an employee name like the one below, with special characters included:
xyz%^&*
The sample query used for the contains search is as follows:
GET
{
"query": {
"bool": {
"must": [
{
"match": {
"employeeName": {
"query": "xyz%^",
"type": "boolean",
"operator": "or"
}
}
}
]
}
}
}
Even if we try to escape the caret as "query": "xyz%\^", the request errors out, so we are unable to run a contains search on any string that includes the ^ (caret) symbol.
Any help is greatly appreciated.
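On the escaping error mentioned above: inside a JSON string, a backslash may only be followed by one of a fixed set of escape characters, and ^ is not among them, so a body containing "xyz%\^" is rejected by the JSON parser before any query is run. The Python sketch below illustrates this with the standard json module; it only demonstrates JSON parsing and makes no claim about how Elasticsearch treats the query afterwards.

```python
import json

bad_body = r'{"query": "xyz%\^"}'   # backslash followed by ^ is not valid JSON
try:
    json.loads(bad_body)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# Letting a JSON library build the body avoids hand-written escapes.
good_body = json.dumps({"query": "xyz%^"})

print(parsed_ok)   # False
print(good_body)   # {"query": "xyz%^"}
```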
There is a bug in the ngram tokenizer related to this issue.
Essentially, ^ is not treated as a Symbol, Letter or Punctuation character by the ngram tokenizer.
As a result, the tokenizer splits the input on ^.
Example (xyz%^, URL-encoded):
GET <index_name>/_analyze?tokenizer=ngram_tokenizer&text=xyz%25%5E
The response from the analyze API contains no token with ^ in it:
{
"tokens": [
{
"token": "xy",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "xyz",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "xyz%",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "yz",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 3
},
{
"token": "yz%",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "z%",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
}
]
}
Since ^ is never indexed, there is nothing for the query to match.
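One way to see why the caret is treated differently from % is to look at Unicode general categories. In the sketch below, the mapping from token_chars classes to categories is an assumption made for illustration (in particular, that the symbol class covers Sm, Sc and So but not Sk); it is not taken from the Elasticsearch source. Under that assumption, a small Python n-gram function (ngram_sketch, a hypothetical name) reproduces the token list above.

```python
import unicodedata

# ASSUMPTION: illustrative mapping of token_chars classes to Unicode
# general categories; "Sk" (modifier symbol) is deliberately absent.
TOKEN_CHARS = {
    "letter": {"Lu", "Ll", "Lt", "Lm", "Lo"},
    "digit": {"Nd"},
    "punctuation": {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"},
    "symbol": {"Sm", "Sc", "So"},
}

def ngram_sketch(text, classes, min_gram=2, max_gram=20):
    allowed = set().union(*(TOKEN_CHARS[c] for c in classes))
    # Break the text into runs of allowed characters.
    runs, current = [], ""
    for ch in text:
        if unicodedata.category(ch) in allowed:
            current += ch
        else:
            if current:
                runs.append(current)
            current = ""
    if current:
        runs.append(current)
    # Emit every n-gram of each run, shortest first from each start offset.
    grams = []
    for run in runs:
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    grams.append(run[start:start + size])
    return grams

classes = ["letter", "digit", "punctuation", "symbol"]
print(unicodedata.category("%"), unicodedata.category("^"))  # Po Sk
print(ngram_sketch("xyz%^", classes))
```

Because ^ falls in the Sk category, it acts as a separator in this model, and no n-gram containing it is ever produced.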
I'm looking into supporting folding of non-ASCII characters, as this guide suggests.
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
}
}
Strangely enough, I'm not able to replicate the sample in the first snippet of code.
When I execute
GET /my_index/_analyze?analyzer=folding&text=My œsophagus caused a débâcle
the following tokens are returned:
sophagus, caused, a, d, b, cle
What I want to achieve is:
Variations of the spelling of words like "école" (e.g. ecole, ècole) should be treated as the same word.
Right now, if I execute
GET /my_index/_analyze?analyzer=folding&text=école ecole
I get the tokens cole, ecole
These are the settings I currently use for the text analysis of the documents
"analysis": {
"filter": {
"french_stop": {
"type": "stop",
"stopwords": "_french_"
},
"french_elision": {
"type": "elision",
"articles": [
"l",
"m",
"t",
"qu",
"n",
"s",
"j",
"d",
"c",
"jusqu",
"quoiqu",
"lorsqu",
"puisqu"
]
},
"french_stemmer": {
"type": "stemmer",
"language": "light_french"
}
},
"analyzer": {
"index_French": {
"filter": [
"french_elision",
"lowercase",
"french_stop",
"french_stemmer"
],
"char_filter": [
"html_strip"
],
"type": "custom",
"tokenizer": "standard"
},
"sort_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
My idea was to change the filters of the index_French analyzer so that the list is the following:
"filter": ["french_elision","lowercase","asciifolding","french_stop","french_stemmer"]
Thanks for your help.
In Sense, you need to call the _analyze endpoint with the text in the request body, like this, and it will work:
POST /foldings/_analyze
{
"text": "My œsophagus caused a débâcle",
"analyzer": "folding"
}
You'll get
{
"tokens": [
{
"token": "my",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "oesophagus",
"start_offset": 3,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "caused",
"start_offset": 13,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "a",
"start_offset": 20,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "debacle",
"start_offset": 22,
"end_offset": 29,
"type": "<ALPHANUM>",
"position": 4
}
]
}
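If you still want to pass the text in the query string, it has to be percent-encoded as UTF-8 first. The truncated tokens in the question (sophagus, d, b, cle) are consistent with the non-ASCII characters being lost before analysis, although that is an inference rather than something confirmed here. A minimal Python sketch using only the standard library; the host, port and index name are placeholders:

```python
from urllib.parse import quote, unquote

text = "école ecole"
encoded = quote(text)  # UTF-8 percent-encoding; spaces become %20

# Placeholder host and index name, for illustration only.
url = f"http://localhost:9200/my_index/_analyze?analyzer=folding&text={encoded}"

print(encoded)  # %C3%A9cole%20ecole
print(url)
```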
I'm using Elasticsearch 2.2.0 and I'm trying to use the lowercase + asciifolding filters on a field.
This is the output of http://localhost:9200/myindex/
{
"myindex": {
"aliases": {},
"mappings": {
"products": {
"properties": {
"fold": {
"analyzer": "folding",
"type": "string"
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"folding": {
"token_filters": [
"lowercase",
"asciifolding"
],
"tokenizer": "standard",
"type": "custom"
}
}
},
"creation_date": "1456180612715",
"number_of_replicas": "1",
"number_of_shards": "5",
"uuid": "vBMZEasPSAyucXICur3GVA",
"version": {
"created": "2020099"
}
}
},
"warmers": {}
}
}
And when I try to test the custom folding analyzer using the _analyze API, this is the output I get from http://localhost:9200/myindex/_analyze?analyzer=folding&text=%C3%89sta%20est%C3%A1%20loca
{
"tokens": [
{
"end_offset": 4,
"position": 0,
"start_offset": 0,
"token": "Ésta",
"type": "<ALPHANUM>"
},
{
"end_offset": 9,
"position": 1,
"start_offset": 5,
"token": "está",
"type": "<ALPHANUM>"
},
{
"end_offset": 14,
"position": 2,
"start_offset": 10,
"token": "loca",
"type": "<ALPHANUM>"
}
]
}
As you can see, the returned tokens are Ésta, está, loca instead of esta, esta, loca. What's going on? It seems that the folding analyzer is being ignored.
This looks like a simple typo in the settings you used when creating your index.
In your "analysis":{"analyzer":{...}} block, this:
"token_filters": [...]
should be
"filter": [...]
Check the documentation for confirmation of this. Because the filter array wasn't named correctly, ES ignored it completely, so the analyzer ran the standard tokenizer with no token filters at all (which is why Ésta was not even lowercased). Here is a small example written using the Sense Chrome plugin. Execute the requests in order:
DELETE /test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "standard"
}
}
}
}
}
GET /test/_analyze
{
"analyzer":"folding",
"text":"Ésta está loca"
}
And the results of the last GET /test/_analyze:
"tokens": [
{
"token": "esta",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "esta",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "loca",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
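For intuition, the effect of lowercase followed by asciifolding on this input can be approximated in Python by decomposing each character and dropping the combining marks. This is only a sketch (fold_sketch is a hypothetical helper): the real asciifolding filter uses its own mapping table and also handles characters, such as œ, that Unicode decomposition leaves untouched.

```python
import unicodedata

def fold_sketch(text):
    # Lowercase, decompose accented letters, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Whitespace split stands in for the tokenizer on this simple input.
    return stripped.split()

print(fold_sketch("Ésta está loca"))  # ['esta', 'esta', 'loca']
```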
I have the following text:
Lurasidone is a dopamine D<sub>2</sub>
I would like to tokenize it such that I get the following tokens:
Lurasidone
dopamine
D2
How do I achieve this using a tokenizer or filter? I've attempted to use the html filter; however, D<sub>2</sub> is tokenized as:
D
2
whereas I need it to tokenize as:
D2
You can use the Pattern Replace Char Filter.
This is what I did:
"char_filter": {
"html_pattern": {
"type": "pattern_replace",
"pattern": "<.*>(.*)<\\/.*>",
"replacement": "$1"
}
}
I included that in my custom analyzer like this
"my_custom_analyzer": {
"tokenizer": "standard",
"char_filter": [
"html_pattern"
],
"filter": ["stop"]
}
These are the tokens generated for your text
{
"tokens": [
{
"token": "Lurasidone",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "dopamine",
"start_offset": 16,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "D2",
"start_offset": 25,
"end_offset": 38,
"type": "<ALPHANUM>",
"position": 5
}
]
}
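One caveat, offered as a suggestion rather than part of the tested configuration: a greedy pattern of the form <.*>(.*)</.*> works for a single tag pair, but when a value contains several tag pairs it can swallow the text between them. The Python sketch below compares it with a tighter pattern. Python's re module is used only for illustration (it writes the group reference as \1 where the char filter uses $1), and the two-tag sample string is made up.

```python
import re

GREEDY = r"<.*>(.*)</.*>"
TIGHT = r"<[^>]+>([^<]*)</[^>]+>"   # never crosses a tag boundary

one_tag = "Lurasidone is a dopamine D<sub>2</sub>"
two_tags = "D<sub>2</sub> and D<sub>3</sub>"   # hypothetical sample input

greedy_one = re.sub(GREEDY, r"\1", one_tag)
greedy_two = re.sub(GREEDY, r"\1", two_tags)
tight_two = re.sub(TIGHT, r"\1", two_tags)

print(greedy_one)  # Lurasidone is a dopamine D2
print(greedy_two)  # D3
print(tight_two)   # D2 and D3
```

With a single tag pair both patterns give the same result, so the tokens shown above would not change.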
I hope this helps.