Using Elasticsearch 2.2, as a simple experiment, I want to remove the last character from any word that ends with the lowercase character "s". For example, the word "sounds" would be indexed as "sound".
I'm defining my analyzer like this:
{
"template": "document-index-template",
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"sFilter": {
"type": "pattern_replace",
"pattern": "([a-zA-Z]+)([s]( |$))",
"replacement": "$2"
}
},
"analyzer": {
"tight": {
"type": "standard",
"filter": [
"sFilter",
"lowercase"
]
}
}
}
}
}
Then when I analyze the term "sounds of silences" using this request:
<index>/_analyze?analyzer=tight&text=sounds%20of%20silences
I get:
{
"tokens": [
{
"token": "sounds",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "of",
"start_offset": 7,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "silences",
"start_offset": 10,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 2
}
]
}
I am expecting "sounds" to be "sound" and "silences" to be "silence"
The above analyzer setting is invalid .I think what you intended to use is an analyzer of type custom with tokenizer set to standard
Example:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"sFilter": {
"type": "pattern_replace",
"pattern": "([a-zA-Z]+)s$",
"replacement": "$1"
}
},
"analyzer": {
"tight": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"sFilter"
]
}
}
}
}
}
Related
In elasticsearch, I am trying to use an analyzer on a field which will use a filter to replace all characters after a ? is encountered into a whitespace. To do so, I am using the following filter.
"filter_name":{
"type": "pattern_replace",
"pattern": "\\?(.*)",
"replacement": ""
}
But this is not working as expected. Is there something I'm missing?
Use pattern: "(?<=\\?)(.*)" and replacement: ""
Please see the below. I've created a sample mapping and a sample _analyze query to see how tokens are created.
Mapping
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(?=.*)\\?(.*)",
"replacement": ""
}
}
}
}
}
Query
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Do you know? Life is crazy"
}
Analysis Result
{
"tokens": [
{
"token": "Do",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "you",
"start_offset": 3,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "know",
"start_offset": 7,
"end_offset": 26,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Hope this helps!
I have one field in which I am storing values like O5467508 (Starting with alphabet "O")
Below is my query.
{"from":0,"size":10,"query":{"bool":{"must":[{"match":{"field_LIST_105":{"query":"o5467508","type":"phrase_prefix"}}},{"bool":{"must":[{"match":{"RegionName":"Virginia"}}]}}]}}}
it is giving me correct result, But when i am searching for only numeric value "5467508", query result is empty.
Thanks in advance.
One of the possible solution that could help you, use word_delimiter filter, with the option preserve_original, which will save original token.
Something like this:
{
"settings": {
"analysis": {
"analyzer": {
"so_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"my_word_delimiter"
]
}
},
"filter": {
"my_word_delimiter": {
"type": "word_delimiter",
"preserve_original": true
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"field_LIST_105": {
"type": "text",
"analyzer": "so_analyzer"
}
}
}
}
}
I did a quick test of analysis, and this is the tokens that it give to me.
{
"tokens": [
{
"token": "o5467508",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "o",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "5467508",
"start_offset": 1,
"end_offset": 8,
"type": "word",
"position": 1
}
]
}
For more information - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
The problem is any character sequence having boost operator "^(caret symbol)" does not returning any search results.
But as per the below elastic search documentation
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_reserved_characters
&& || ! ( ) { } [ ] ^ " ~ * ? : \ characters can be escaped with \ symbol.
Have a requirement to do a contains search using n-gram analyser in elastic search.
Below is the mapping structure of the sample use case and the
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"nGram_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "ngram_tokenizer"
},
"whitespace_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "whitespace"
}
},
"tokenizer": {
"ngram_tokenizer": {
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
],
"min_gram": "2",
"type": "nGram",
"max_gram": "20"
}
}
}
}
},
"mappings": {
"employee": {
"properties": {
"employeeName": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
Have a employee name like below with special characters included
xyz%^&*
Also the sample query used for the contains search as below
GET
{
"query": {
"bool": {
"must": [
{
"match": {
"employeeName": {
"query": "xyz%^",
"type": "boolean",
"operator": "or"
}
}
}
]
}
}
}
Even if we try to escape as "query": "xyz%\^" its errors out. So not able to search any character contains search having "^(caret symbol)"
Any help is greatly appreciated.
There is a bug in ngram tokenizer related to issue.
Essentially ^ is not considered either Symbol |Letter |Punctuation by ngram-tokenizer.
As a result it tokenizes the input on ^.
Example: (url encoded xyz%^):
GET <index_name>/_analyze?tokenizer=ngram_tokenizer&text=xyz%25%5E
The above result of analyze api shows there is no ^ as shown in the response below :
{
"tokens": [
{
"token": "xy",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "xyz",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "xyz%",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "yz",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 3
},
{
"token": "yz%",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "z%",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
}
]
}
Since '^' is not indexed therefore there are no matches
I'm using Elasticsearch 2.2.0 and I'm trying to use the lowercase + asciifolding filters on a field.
This is the output of http://localhost:9200/myindex/
{
"myindex": {
"aliases": {},
"mappings": {
"products": {
"properties": {
"fold": {
"analyzer": "folding",
"type": "string"
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"folding": {
"token_filters": [
"lowercase",
"asciifolding"
],
"tokenizer": "standard",
"type": "custom"
}
}
},
"creation_date": "1456180612715",
"number_of_replicas": "1",
"number_of_shards": "5",
"uuid": "vBMZEasPSAyucXICur3GVA",
"version": {
"created": "2020099"
}
}
},
"warmers": {}
}
}
And when I try to test the folding custom filter using the _analyze API, this is what I get as an output of http://localhost:9200/myindex/_analyze?analyzer=folding&text=%C3%89sta%20est%C3%A1%20loca
{
"tokens": [
{
"end_offset": 4,
"position": 0,
"start_offset": 0,
"token": "Ésta",
"type": "<ALPHANUM>"
},
{
"end_offset": 9,
"position": 1,
"start_offset": 5,
"token": "está",
"type": "<ALPHANUM>"
},
{
"end_offset": 14,
"position": 2,
"start_offset": 10,
"token": "loca",
"type": "<ALPHANUM>"
}
]
}
As you can see, the returned tokens are: Ésta, está, loca instead of esta, esta, loca. What's going on? it seems that this folding analyzer is being ignored.
Looks like a simple typo when you are creating your index.
In your "analysis":{"analyzer":{...}} block, this:
"token_filters": [...]
Should be
"filter": [...]
Check the documentation for confirmation of this. Because your filter array wasn't named correctly, ES completely ignored it, and just decided to use the standard analyzer. Here is a small example written using the Sense chrome plugin. Execute them in order:
DELETE /test
PUT /test
{
"analysis": {
"analyzer": {
"folding": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "standard"
}
}
}
}
GET /test/_analyze
{
"analyzer":"folding",
"text":"Ésta está loca"
}
And the results of the last GET /test/_analyze:
"tokens": [
{
"token": "esta",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "esta",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "loca",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
I need it so that words with periods inside them are equal to the non-period variant.
I see there's a section in the docs about analyzers and token filters but I'm finding rather terse and am not sure how to go about it.
Use a char filter to eliminate the dots, like this for example:
PUT /no_dots
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
".=>"
]
}
},
"analyzer": {
"my_no_dots_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_mapping"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"analyzer": "my_no_dots_analyzer"
}
}
}
}
}
And to test it GET /no_dots/_analyze?analyzer=my_no_dots_analyzer&text=J.J Abrams returns:
{
"tokens": [
{
"token": "JJ",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Abrams",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 2
}
]
}