How to use Smart Chinese Analysis for Elasticsearch? - elasticsearch

I have installed Smart Chinese Analysis for Elasticsearch on our ES cluster, but I cannot find documentation on how to specify the correct analyzer. I would expect that I need to set a tokenizer and a filter specifying stopwords and a stemmer ...
For example, in Dutch:
"dutch": {
"type": "custom",
"tokenizer": "uax_url_email",
"filter": ["lowercase", "asciifolding", "dutch_stemmer_filter", "dutch_stop_filter"]
}
with:
"dutch_stemmer_filter": {
"type": "stemmer",
"name": "dutch"
},
"dutch_stop_filter": {
"type": "stop",
"stopwords": ["_dutch_"]
}
How do I configure my analyzer for Chinese?

Try this for a certain index (the analyzer is 'smartcn' and the tokenizer is 'smartcn_tokenizer'):
PUT /test_chinese
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "smartcn"
          }
        }
      }
    }
  }
}
GET /test_chinese/_analyze?text='叻出色'
It should output two tokens (test taken from the plugin test classes):
{
  "tokens": [
    {
      "token": "叻",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "出色",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 3
    }
  ]
}
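If you prefer a custom analyzer shaped like your Dutch one, you can combine the plugin's smartcn_tokenizer with standard token filters. Here is a minimal sketch (the index name, the analyzer name chinese_custom, and the tiny stopword list are assumptions, not something the plugin ships):
PUT /test_chinese_custom
{
  "settings": {
    "analysis": {
      "filter": {
        "chinese_stop_filter": {
          "type": "stop",
          "stopwords": ["的", "了", "和"]
        }
      },
      "analyzer": {
        "chinese_custom": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer",
          "filter": ["lowercase", "chinese_stop_filter"]
        }
      }
    }
  }
}
Note there is no stemmer step: Chinese is not inflected, so the segmentation done by smartcn_tokenizer takes the place of stemming.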

Related

Elastic Search Query with # (hash sign) brings the same results as without

I'm trying to match text with a "#" prefix, e.g. "#stackoverflow", on ElasticSearch. I'm using a boolean query, and both of these return the exact same results, ignoring my # sign entirely:
Query 1 with #:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"#stackoverflow"}}]}},"size":20}
Query 2 without:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"stackoverflow"}}]}},"size":20}
My Mapping:
{"posts":{"mappings":{"post":{"properties":{"upvotes":{"type":"long"},"created_time":{"type":"date","format":"strict_date_optional_time||epoch_millis"},"ratings":{"type":"long"},"link":{"type":"string"},"pic":{"type":"string"},"text":{"type":"string"},"id":{"type":"string"}}}}}}
I've tried encoding it to \u0040 but that didn't make any difference.
Your text field is analyzed by default by the standard analyzer, which means that #stackoverflow will be indexed as stackoverflow after the analysis process, as can be seen below:
GET /_analyze?analyzer=standard&text=#stackoverflow
{
  "tokens": [
    {
      "token": "stackoverflow",
      "start_offset": 1,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
You probably want to either use the keyword type if you need exact matching or specify a different analyzer, such as whitespace, which will preserve the # sign in your data:
GET /_analyze?analyzer=whitespace&text=#stackoverflow
{
  "tokens": [
    {
      "token": "#stackoverflow",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
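If exact matching is the goal, the keyword route looks like this on ES 5.x and later (a sketch; the raw sub-field name is an assumption):
"text": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    }
  }
}
A term query against text.raw then matches the stored value verbatim, # included.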
UPDATE:
Then I suggest using a custom analyzer for that field so you can control how the values are indexed. Recreate your index like this and then you should be able to do your searches:
PUT posts
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [ "lowercase" ]
          }
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "upvotes": {
          "type": "long"
        },
        "created_time": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        "ratings": {
          "type": "long"
        },
        "link": {
          "type": "string"
        },
        "pic": {
          "type": "string"
        },
        "text": {
          "type": "string",
          "analyzer": "my_analyzer"
        },
        "id": {
          "type": "string"
        }
      }
    }
  }
}
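Once the documents are reindexed, a match query should verify the fix (a sketch, assuming the data was reindexed into posts):
POST posts/_search
{
  "query": {
    "match": {
      "text": "#stackoverflow"
    }
  }
}
The query string is analyzed with the same whitespace-based analyzer as the field, so the # is preserved on both sides and only true #stackoverflow mentions match.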

ElasticSearch - search using abbreviations

I am trying to set up an existing/custom analyzer that enables search using abbreviations. For example, if the text field is "Bank Of America", searching for BOfA, BOA, BofA, etc. should match this record.
How can I do that?
You can probably use a synonym token filter in a custom analyzer.
For example, with the following mappings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym_filter"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "bank of america,boa"
          ],
          "expand": true
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fielddata": true
        }
      }
    }
  }
}
You can definitely add more entries to the list or use a synonyms file.
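You can verify the expansion with _analyze before wiring up queries (a sketch; my_index stands in for whatever index you created with the settings above):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Bank of America"
}
With expand set to true, the output should show a boa token injected at the same position as bank, alongside the original tokens.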
For query use cases like BOfA, BOA, or BofA, two approaches can work:
1) More synonyms with these possible combinations:
"synonyms": [
"bank of america,boa"
"bank of america,bofa"
]
2) Or keep the abbreviations intact and use a fuzzy query:
{
  "query": {
    "match": {
      "text": {
        "query": "bofa",
        "fuzziness": 2
      }
    }
  }
}
Either way, you will need synonyms to supply abbreviations to ES.
I figured out something approximating this using pattern_replace:
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\B.)",
      "replacement": ""
    },
    {
      "type": "pattern_replace",
      "pattern": "(\\s)",
      "replacement": ""
    },
    "uppercase",
    {
      "type": "ngram",
      "min_gram": 3,
      "max_gram": 5
    }
  ],
  "text": "foxes jump lazy dogs"
}
which produces:
{
  "tokens": [
    {
      "token": "FJL",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "FJLD",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "JLD",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    }
  ]
}
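To use this at index time, you would wrap the same chain into a custom analyzer in the index settings (a sketch; the names my_index, first_letters, strip_spaces, abbrev_ngram, and abbrev_analyzer are assumptions):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "first_letters": {
          "type": "pattern_replace",
          "pattern": "(\\B.)",
          "replacement": ""
        },
        "strip_spaces": {
          "type": "pattern_replace",
          "pattern": "(\\s)",
          "replacement": ""
        },
        "abbrev_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "abbrev_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["first_letters", "strip_spaces", "uppercase", "abbrev_ngram"]
        }
      }
    }
  }
}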

No result when using search analyzer

I'm trying to search for a string with accented characters using query_string in Elasticsearch.
When I use query_string without an analyzer for the query, I only get results on an exact match (I'm searching for the string "Ředitel kvality", so when I type "Reditel kvality" I get no results).
When I use the same analyzer that is used in the mappings, I get no results at all, with or without the accented characters.
analyzers & filters:
"analysis": {
"filter": {
"cs_CZ": {
"recursion_level": "0",
"locale": "cs_CZ",
"type": "hunspell",
"dedup": "true"
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
},
"czech_stop": {
"type": "stop",
"stopwords": "_czech_"
}
},
"analyzer": {
"cz": {
"filter": [
"standard",
"lowercase",
"czech_stop",
"icu_folding",
"cs_CZ",
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
},
"folding": {
"filter": [
"standard",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
}
mappings:
"index1": {
"mappings": {
"type1": {
"properties": {
"revisions": {
"type": "nested",
"properties": {
"title": {
"type": "text",
"boost": 10.0,
"fields": {
"folded": {
"type": "text",
"boost": 6.0,
"analyzer": "folding"
}
},
"analyzer": "cz"
Here are the term vectors, which look fine:
"term_vectors": {
"revisions.title": {
"field_statistics": {
"sum_doc_freq": 764,
"doc_count": 201,
"sum_ttf": 770
},
"terms": {
"kvalita": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 8,
"end_offset": 15
}
]
},
"reditel": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 7
}
]
}
}
}
}
And when I run analyze on my query, index1/_analyze?field=type1.revisions.title&text=Ředitel%20kvality, I get the same tokens:
{
  "tokens": [
    {
      "token": "reditel",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "kvalita",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
I can't figure out what is wrong and why ES will not match "Reditel kvality" with "Ředitel kvality".
This is the query I'm using:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "\u0158editel kvality*",
            "rewrite": "scoring_boolean",
            "analyzer": "cz",
            "default_operator": "AND"
          }
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
My ES version is 5.2.2.
Found out what's wrong.
The _all field must also be defined in the mappings with the analyzer.
I got the impression from the docs that this is automatic and that the _all field is magically created from the analyzed fields.
So now in the mappings I have:
"_all": {
  "enabled": true,
  "analyzer": "cz"
},
And it's working.
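For reference, the relevant part of the mapping would then look something like this (a sketch assembled from the excerpts above; boost values omitted, and only the _all block is what actually changed):
PUT index1
{
  "mappings": {
    "type1": {
      "_all": {
        "enabled": true,
        "analyzer": "cz"
      },
      "properties": {
        "revisions": {
          "type": "nested",
          "properties": {
            "title": {
              "type": "text",
              "analyzer": "cz",
              "fields": {
                "folded": {
                  "type": "text",
                  "analyzer": "folding"
                }
              }
            }
          }
        }
      }
    }
  }
}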
Thanks a lot to Xylakant on IRC for guiding me.

Elasticsearch - how do I remove s from end of words

Using Elasticsearch 2.2, as a simple experiment, I want to remove the last character from any word that ends with the lowercase character "s". For example, the word "sounds" would be indexed as "sound".
I'm defining my analyzer like this:
{
  "template": "document-index-template",
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "sFilter": {
          "type": "pattern_replace",
          "pattern": "([a-zA-Z]+)([s]( |$))",
          "replacement": "$2"
        }
      },
      "analyzer": {
        "tight": {
          "type": "standard",
          "filter": [
            "sFilter",
            "lowercase"
          ]
        }
      }
    }
  }
}
Then when I analyze the term "sounds of silences" using this request:
<index>/_analyze?analyzer=tight&text=sounds%20of%20silences
I get:
{
  "tokens": [
    {
      "token": "sounds",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "of",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "silences",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
I am expecting "sounds" to be "sound" and "silences" to be "silence".
The above analyzer setting is invalid. I think what you intended to use is an analyzer of type custom with the tokenizer set to standard.
Example:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "sFilter": {
          "type": "pattern_replace",
          "pattern": "([a-zA-Z]+)s$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "tight": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "sFilter"
          ]
        }
      }
    }
  }
}
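With that settings block, re-running the same request from the question,
<index>/_analyze?analyzer=tight&text=sounds%20of%20silences
should now produce the tokens sound, of, and silence (this is the expected behavior, not output copied from a live run): the anchored pattern ([a-zA-Z]+)s$ only rewrites tokens that end in s, so of passes through unchanged. Note that the corrected example drops the lowercase filter, so add it back to the filter chain if you also need case folding.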

How to normalize periods in elastic search query (such that JJ Abrams == J.J Abrams)?

I need it so that words with periods inside them are equal to the non-period variant.
I see there's a section in the docs about analyzers and token filters, but I'm finding it rather terse and am not sure how to go about it.
Use a char filter to eliminate the dots, like this for example:
PUT /no_dots
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            ".=>"
          ]
        }
      },
      "analyzer": {
        "my_no_dots_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_mapping"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_no_dots_analyzer"
        }
      }
    }
  }
}
And to test it, GET /no_dots/_analyze?analyzer=my_no_dots_analyzer&text=J.J Abrams returns:
{
  "tokens": [
    {
      "token": "JJ",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Abrams",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
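Since the same analyzer runs at both index and search time, a plain match query then treats both variants identically (a sketch, assuming documents were indexed into no_dots):
GET /no_dots/_search
{
  "query": {
    "match": {
      "text": "JJ Abrams"
    }
  }
}
This matches documents containing either "JJ Abrams" or "J.J Abrams", because the char filter strips the dots before tokenization in both cases.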
