How to use pattern replace in Elasticsearch to replace "—" with "–"

Some of my documents include the "—" em dash and I would like to replace it with the "–" en dash. From what I read in the Elasticsearch docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html), I can use a pattern replace char filter, which takes a regular expression.
Something like this:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
What should I specify in pattern and replacement? Or is there any other way to replace the "—" em dash with the "–" en dash in Elasticsearch other than a pattern match across all the documents? Any help would be appreciated.
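One possible approach, as a minimal sketch (not from the original thread, and using a made-up filter name dash_char_filter): skip the regex and use a mapping char filter that translates the em dash (\u2014) directly into the en dash (\u2013). A pattern_replace char filter with "—" as the pattern and "–" as the replacement should behave the same way. Note that a char filter only changes what gets analyzed and indexed; the stored _source documents are left untouched.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "dash_char_filter"
          ]
        }
      },
      "char_filter": {
        "dash_char_filter": {
          "type": "mapping",
          "mappings": [
            "\\u2014 => \\u2013"
          ]
        }
      }
    }
  }
}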

Related

How to remove spaces in between words before indexing

How do I remove spaces between words before indexing?
For example:
I want to be able to search for 0123 7784 9809 7893
when I query "0123 7784 9809 7893", "0123778498097893", or "0123-7784-9809-7893"
My idea is to remove all spaces and dashes and combine the parts into a whole string (0123 7784 9809 7893 to 0123778498097893) before indexing, and also to add an analyzer on the query side so as to find my desired result.
I have tried
"char_filter": {
  "neglect_dash_and_space_filter": {
    "type": "mapping",
    "mappings": [
      "- => ",
      "' ' => "
    ]
  }
}
It seems that only the dash is removed but not the spaces. I also tested a custom shingle filter, but it's still not working. Kindly advise. Thanks.
You can use a pattern_replace token filter. The pattern [^0-9] replaces anything other than digits:
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "whitespace_remove": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "whitespace_remove"
          ]
        }
      }
    }
  }
}
EDIT1: You can use \uXXXX notation for spaces in a mapping char filter:
PUT index41
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "\\u0020 => ",
            "- => "
          ]
        }
      }
    }
  }
}
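To verify the mapping, the _analyze API can be run against the custom analyzer (a quick sketch using the index41 example above; the sample text is invented):
GET index41/_analyze
{
  "analyzer": "my_analyzer",
  "text": "0123-7784 9809 7893"
}
Since char filters run before the tokenizer, the dashes and spaces should already be gone by the time the standard tokenizer sees the text, leaving the single token 0123778498097893.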

Elasticsearch - Custom stem override with wildcard character

I have implemented light English stemming in Elasticsearch.
I'm able to add a custom stem override so that "Guitarist" => "Guitar", for example, but I would like to add this as a general rule, so that "Guitarist" => "Guitar", "Violinist" => "Violin", etc.
Can I achieve this without using regex?
For anyone looking at a similar problem, it appears that regex is the only solution. The example below handles words ending in "ist".
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "tokenizer": "standard",
        "char_filter": [
          "ist_filter"
        ],
        "filter": [
          "lowercase",
          "my_stem"
        ]
      }
    },
    "filter": {
      "my_stem": {
        "type": "stemmer",
        "language": "light_english"
      }
    },
    "char_filter": {
      "ist_filter": {
        "type": "pattern_replace",
        "pattern": "(.*)ist$",
        "replacement": "$1"
      }
    }
  }
}
Exclusions can be added to the pattern, e.g. the one below ignores the words "mist" and "twist", but this is only practical for a (very) limited number of exclusions.
"pattern": "^(?!m|tw)(.*)ist$"

Search a list of names and categorizing each letter type

I want to index a large list of names using ES.
I want to distinguish between consonants and vowels in each word, and be able to search based on the position of each letter and if it is a consonant or a vowel.
So say the name is:
JOHN
I want to enter this:
CVCC
and when I run the search, JOHN should be in the result set.
Is it possible somehow to index names in Elasticsearch such that I could index and then search them using the tokens C for consonant and V for vowel?
Somehow Elasticsearch will have to index the character type at each position of each word; how can this be done?
You can do it with pattern_replace char filters in a custom analyzer. Also, in my solution I have used a sub-field for the custom analyzer, on the assumption that you will want other kinds of searches on the name field, with the consonants-vowels one being only one of them.
DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "replace_filter_lowercase_CONS",
            "replace_filter_uppercase_CONS",
            "replace_filter_lowercase_VOW",
            "replace_filter_uppercase_VOW"
          ]
        }
      },
      "char_filter": {
        "replace_filter_lowercase_CONS": {
          "type": "pattern_replace",
          "pattern": "[b-df-hj-np-tv-z]{1}",
          "replacement": "c"
        },
        "replace_filter_uppercase_CONS": {
          "type": "pattern_replace",
          "pattern": "[B-DF-HJ-NP-TV-Z]{1}",
          "replacement": "C"
        },
        "replace_filter_lowercase_VOW": {
          "type": "pattern_replace",
          "pattern": "[aeiou]{1}",
          "replacement": "v"
        },
        "replace_filter_uppercase_VOW": {
          "type": "pattern_replace",
          "pattern": "[AEIOU]{1}",
          "replacement": "V"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "cons_vow": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
POST /test/test/1
{"name":"JOHN"}
POST /test/test/2
{"name":"Andrew"}
POST /test/test/3
{"name":"JOhn DOE"}
GET /test/_search
{
  "query": {
    "term": {
      "name.cons_vow": {
        "value": "CVCC"
      }
    }
  }
}
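Because the char filters run before the keyword tokenizer, JOHN is indexed in the cons_vow sub-field as the single token CVCC, so the term query above should return document 1. The output can be inspected with _analyze (a sketch against the test index above):
GET /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "JOHN"
}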

Elasticsearch replace whitespace

I'm trying to find a tokenizer in Elasticsearch that would remove all the whitespace and join multiple words into a single word.
For example: Abd al Qadir ===> Abdalqadir
A way to achieve that would be to create a custom filter using the pattern_replace filter, and then a custom analyzer that combines that filter with the lowercase one.
Here's an example of what the configuration would look like:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
}
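Assuming those settings are applied to an index (called names here purely for illustration), the result can be checked with _analyze; because of the lowercase filter the output is fully lower-cased:
GET names/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Abd al Qadir"
}
This should return the single token abdalqadir.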

Elasticsearch "pattern_replace", replacing whitespaces while analyzing

Basically I want to remove all whitespace and tokenize the whole string as a single token. (I will use nGram on top of that later on.)
These are my index settings:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"meliuz_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
Instead of "pattern": " ", I tried "pattern": "\\u0020" and \\s , too.
But when I analyze the text "beleza na web", it still creates three separate tokens: "beleza", "na" and "web", instead of one single "belezanaweb".
The analyzer first tokenizes the string and then applies the token filters. You have specified the standard tokenizer, so the input is already split into separate tokens ("beleza", "na", "web") before the pattern_replace filter runs; the filter is applied to each token individually, and by then there are no spaces left to remove.
Use the keyword tokenizer instead of the standard tokenizer. The rest of the mapping is fine.
You can change your mapping as below:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"meliuz_analyzer": {
"filter": [
"lowercase",
"whitespace_remove",
"nGram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
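The difference is easy to see with the _analyze API (a sketch, not from the original answer, assuming the settings above are applied to an index called my_index): with the keyword tokenizer the filters receive the whole string as one token, so whitespace_remove produces "belezanaweb" before the nGram filter splits it into grams.
POST my_index/_analyze
{
  "analyzer": "meliuz_analyzer",
  "text": "beleza na web"
}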
