Elasticsearch shingle token filter not working

I'm trying this on a local Elasticsearch 1.7.5 installation:
http://localhost:9200/_analyze?filter=shingle&tokenizer=keyword&text=alkis stack
I see this
{
"tokens":[
{
"token":"alkis stack",
"start_offset":0,
"end_offset":11,
"type":"word",
"position":1
}
]
}
And I expected to see something like this
{
"tokens":[
{
"token":"alkis stack",
"start_offset":0,
"end_offset":11,
"type":"word",
"position":1
},
{
"token":"stack alkis",
"start_offset":0,
"end_offset":11,
"type":"word",
"position":1
}
]
}
Am I missing something?
Update
{
"number_of_shards": 2,
"number_of_replicas": 0,
"analysis": {
"char_filter": {
"map_special_chars": {
"type": "mapping",
"mappings": [
"- => \\u0020",
". => \\u0020",
"? => \\u0020",
", => \\u0020",
"` => \\u0020",
"' => \\u0020",
"\" => \\u0020"
]
}
},
"filter": {
"permutate_fullname": {
"type": "shingle",
"max_shingle_size": 4,
"min_shingle_size": 2,
"output_unigrams": true,
"token_separator": " ",
"filler_token": "_"
}
},
"analyzer": {
"fullname_analyzer_search": {
"char_filter": [
"map_special_chars"
],
"filter": [
"asciifolding",
"lowercase",
"trim"
],
"type": "custom",
"tokenizer": "keyword"
},
"fullname_analyzer_index": {
"char_filter": [
"map_special_chars"
],
"filter": [
"asciifolding",
"lowercase",
"trim",
"permutate_fullname"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
And I'm trying to test it like this:
http://localhost:9200/INDEX_NAME/_analyze?analyzer=fullname_analyzer_index&text=alkis stack

Index the first name and last name in two separate fields in ES, just as you have them in the DB. The text received as a query can be analyzed (match does it, for example; query_string does it too), and there are ways to search both fields at the same time with all the terms in the search string. I think you are over-complicating the use case by handling the full name in one go and creating name permutations at indexing time.
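For illustration, here is a minimal sketch of that approach. The index name, document type, and field names are assumptions for the example; the point is that a cross_fields multi_match with operator and matches all terms of the query across both fields, in either order, without generating permutations at index time:

// hypothetical index and field names, chosen just for this sketch
PUT /people
{
  "mappings": {
    "person": {
      "properties": {
        "first_name": { "type": "string" },
        "last_name": { "type": "string" }
      }
    }
  }
}

// index the names exactly as they come from the DB
POST /people/person/1
{ "first_name": "alkis", "last_name": "stack" }

// both "alkis stack" and "stack alkis" match this document
POST /people/_search
{
  "query": {
    "multi_match": {
      "query": "stack alkis",
      "fields": [ "first_name", "last_name" ],
      "type": "cross_fields",
      "operator": "and"
    }
  }
}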

Related

Elasticsearch: updating index settings analyzer

I have a Books index which contains multiple subjects:
chemistry
biology
etc.
Each subject has its own set of synonyms as well as global synonyms.
PUT /books/_settings
{
"analysis": {
"filter": {
"biology_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa"
]
},
"chemistry_synonyms": {
"type": "synonym",
"synonyms": [
"c, cc, ccc"
]
},
"global_synonyms": {
"type": "synonym",
"synonym": [
"x, xx, xxx"
]
}
},
"analyzer": {
"chemistry_analyzer": {
"filter": [
"global_synonyms", "chemistry_synonyms"
]
},
"biology_analyzer": {
"filter": [
"global_synonyms", "biology_synonyms"
]
}
}
}
}
Let's say that at any point in time I want to add a new subject named "Astronomy".
Now the problem is: how do I update the index settings to add a new "Astronomy_synonyms" filter and an "Astronomy_analyzer"?
My application requires me to append to the existing filters and analyzers; I don't want to overwrite (replace) the settings.
You can definitely append new token filters and analyzers; however, you need to close your index before updating the settings and reopen it when done. In what follows, I assume the index already exists.
Let's say you create your index with the following initial settings:
PUT /books
{
"settings": {
"analysis": {
"filter": {
"biology_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa"
]
},
"chemistry_synonyms": {
"type": "synonym",
"synonyms": [
"c, cc, ccc"
]
},
"global_synonyms": {
"type": "synonym",
"synonyms": [
"x, xx, xxx"
]
}
},
"analyzer": {
"chemistry_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"chemistry_synonyms"
]
},
"biology_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"biology_synonyms"
]
}
}
}
}
}
Then you need to close your index:
POST books/_close
Then you can append new analyzers and token filters:
PUT /books/_settings
{
"analysis": {
"filter": {
"astronomy_synonyms": {
"type": "synonym",
"synonyms": [
"x, xx, xxx"
]
}
},
"analyzer": {
"astronomy_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"astronomy_synonyms"
]
}
}
}
}
And finally, reopen your index:
POST books/_open
If you then check your index settings, you'll see that everything has been properly merged.
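For example, fetching the settings back should show the original and the newly appended filters and analyzers side by side:

GET /books/_settings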
You can only define new analyzers on closed indices.
To add an analyzer, you must close the index, define the analyzer, and reopen the index.
POST /books/_close
PUT /books/_settings
{
"analysis": {
"filter": {
"astronomy_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa=>a"
]
}
},
"analyzer": {
"astronomy_analyzer": {
"tokenizer" : "whitespace",
"filter": [
"global_synonyms", "astronomy_synonyms"
]
}
}
}
}
POST /books/_open

How to remove spaces in between words before indexing

How do I remove spaces between words before indexing?
For example:
I want to be able to search for 0123 7784 9809 7893
when I query "0123 7784 9809 7893", "0123778498097893", or "0123-7784-9809-7893"
My idea is to remove all spaces and dashes and combine the parts into one whole string (0123 7784 9809 7893 to 0123778498097893) before indexing, and also to add an analyzer on the query side so as to find my desired result.
I have tried
"char_filter" : {
"neglect_dash_and_space_filter" : {
"type" : "mapping",
"mappings" : [
"- => ",
"' ' => "
]
}
}
It seems that only the dash is removed but not the spaces. I also tested a custom shingle filter, but it still doesn't work. Kindly advise. Thanks.
You can use a pattern_replace token filter. The pattern [^0-9] replaces anything other than digits with an empty string:
{
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": "[^0-9]", ---> it will replace anything other than digits
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"whitespace_remove"
]
}
}
}
}
}
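Assuming the index above is created as my_index (the index name is not given in the answer), the behaviour can be verified with the _analyze API. All three input variants should collapse to the same single token 0123778498097893, because the keyword tokenizer emits the whole input as one token and the pattern_replace filter then strips every non-digit character:

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "0123-7784-9809-7893"
}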
You can use \uXXXX notation for spaces:
EDIT1:
PUT index41
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"\\u0020 => ",
"- => "
]
}
}
}
}
}
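To check this variant, a quick _analyze call against index41 should likewise return a single token, since the mapping char filter removes spaces and dashes before the standard tokenizer ever sees the text:

POST /index41/_analyze
{
  "analyzer": "my_analyzer",
  "text": "0123-7784 9809 7893"
}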

Preventing tokenisation of certain alpha characters in ElasticSearch

I would like to prevent - and / from being tokenised or stemmed for a particular field.
I thought I had some code to achieve such behavior:
"char_filters": {
"type": "word_delimiter",
"type_table": [
"- => ALPHA",
"/ => ALPHA"
]
},
However, it errors:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Token filter [char_filters] cannot be used to parse synonyms"
}
],
"type": "illegal_argument_exception",
"reason": "Token filter [char_filters] cannot be used to parse synonyms"
},
"status": 400
}
Looking online I've found PatternReplaceFilterFactory and a few other methods; however, these substitute the characters. I want the analyzer to treat the two characters as part of the token.
So I would like the string 5/3mm to be tokenised as such, not split into 5 and 3mm.
Could someone please advise the correct way to achieve this? Here's a simplified PUT and some POST /_analyze requests.
// doc 1 contains what I would like to match
POST /products_example/_doc/1
{
"ProductDescription_stripped":"RipCurl 5/3mm wetsuit omega",
"ProductDescription_da_stripped": "RipCurl 5/3mm wetsuit omega"
}
// doc 2 contains only 3mm. Should be prioritised below 5/3mm (match 1)
POST /products_example/_doc/2
{
"ProductDescription_stripped":"RipCurl 3mm wetsuit omega",
"ProductDescription_da_stripped": "RipCurl 5/3mm wetsuit omega"
}
// here you can see that 5/3mm has been tokenised into separate tokens, whereas it should have been preserved
POST /products_example/_analyze
{
"tokenizer": "standard",
"filter": [ "lowercase","asciifolding","synonym","stop","kstem"],
"text": "5/3mm ripcurl wetsuit omega"
}
PUT /products/
{
"settings": {
"index.mapping.total_fields.limit": 1000000,
"index.max_ngram_diff" : 2,
"analysis": {
"filter": {
"char_filters": {
"type": "word_delimiter",
"type_table": [
"- => ALPHA",
"/ => ALPHA"
]
},
"description_stemmer_da" : {"type" : "stemmer","name" : "danish"},
"stop_da" : {"type" : "stop","stopwords": "_danish_"},
"synonym" : {
"type" : "synonym",
"synonyms" : ["ripcurl, ripccurl => rip curl"]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram", "min_gram": 3, "max_gram": 5,
"token_chars": ["letter","digit"]
}
},
"analyzer": {
"description" : {
"type": "custom",
"tokenizer": "standard",
"filter": [
"char_filters",
"lowercase",
"asciifolding",
"synonym",
"stop",
"kstem"
]
},
"description_da": {
"type":"custom", "tokenizer":"standard",
"filter": [
"char_filters",
"lowercase",
"asciifolding",
"synonym",
"stop_da",
"description_stemmer_da"
]
}
}
}
},
"mappings": {
"properties": {
"ProductDescription_stripped": {
"type": "text",
"analyzer" : "description"
},
"ProductDescription_da_stripped": {
"type": "text",
"analyzer": "danish"
}
}
}
}
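Note that the split actually happens in the standard tokenizer itself: it already breaks 5/3mm into 5 and 3mm before any token filter runs, so a token filter alone cannot undo it. A quick way to see the difference is to swap in the whitespace tokenizer, which keeps 5/3mm intact as a single token (this request is a sketch for comparison, not part of the original setup above):

POST /products_example/_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ],
  "text": "5/3mm ripcurl wetsuit omega"
}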

Custom char filter in Elasticsearch is not working

I have the custom analyzer shown below
{
"settings": {
"analysis": {
"analyzer": {
"query_logging_analyzer": {
"tokenizer": "whitespace",
"char_filter": [
{
"type": "mapping",
"mappings": [
"'=>\\u0020",
"{=>\\u0020",
"}=>\\u0020",
",=>\\u0020",
":=>\\u0020",
"[=>\\u0020",
"]=>\\u0020"
]
}
]
}
}
}
}
}
Basically I want an analyzer that takes in a JSON string and emits the keys of that JSON. For that purpose, I need to tidy up the JSON by removing JSON punctuation such as [ ] { } , : "
This somehow is not working. The char_filter isn't taking effect at all.
I had a similar char_filter using pattern_replace which didn't work either:
"char_filter": [
{
"type": "pattern_replace",
"pattern": "\"|\\,|\\:|\\'|\\{|\\}",
"replacement": ""
}
]
What exactly is 'not working'?
Your custom analyzer seems to work just fine:
GET _analyze?filter_path=tokens.token
{
"tokenizer": "whitespace",
"text": [
"""{ "names": [ "Foo", "Bar" ] }"""
],
"char_filter": [
{
"type": "mapping",
"mappings": [
"'=>\\u0020",
"{=>\\u0020",
"}=>\\u0020",
",=>\\u0020",
":=>\\u0020",
"[=>\\u0020",
"]=>\\u0020"
]
}
]
}
Returns:
{
"tokens": [
{
"token": """"names""""
},
{
"token": """"Foo""""
},
{
"token": """"Bar""""
}
]
}
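If the leftover double quotes are the issue, one possible variation is to map the quote character to a space as well; it only needs to be escaped in the JSON body:

GET _analyze?filter_path=tokens.token
{
  "tokenizer": "whitespace",
  "text": [
    """{ "names": [ "Foo", "Bar" ] }"""
  ],
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "\"=>\\u0020",
        "'=>\\u0020",
        "{=>\\u0020",
        "}=>\\u0020",
        ",=>\\u0020",
        ":=>\\u0020",
        "[=>\\u0020",
        "]=>\\u0020"
      ]
    }
  ]
}

With the quote mapped as well, the same input comes back as the bare tokens names, Foo and Bar.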

Elasticsearch cluster-level analyzer

How can I define one custom analyzer that will be used in more than one index (at the cluster level)? All the examples I can find show how to create a custom analyzer on a specific index.
My analyzer for example:
PUT try_index
{
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
},
"mappings": {
"record": {
"properties": {
"name": {
"type": "string",
"analyzer":"standard",
"search_analyzer": "od_analyzer"
}
}
}
}
}
Any idea how to change my analyzer's scope to the cluster level?
Thanks
There is no "scope" for analyzers. But you can do something similar with index templates:
PUT /_template/some_name_here
{
"template": "a*",
"order": 0,
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
}
}
And at "template" you should put the name of the indices that this template should be applied to when the index is created. You could very well specify "*" and matching all the indices. I think that's the best you can do for what you want.
