I am trying to set up an existing/custom analyzer that enables search using abbreviations. For example, if the text field is "Bank Of America", searching for BOfA, BOA, BofA, etc. should match this record.
How can I do that?
You can probably use a synonym token filter in a custom analyzer.
For example, with the following mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym_filter"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "bank of america,boa"
          ],
          "expand": true
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fielddata": true
        }
      }
    }
  }
}
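Once the index is created with these settings, you can verify the synonym expansion with the _analyze API (a sketch; the index name my_index is only a placeholder):
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Bank Of America"
}
Because "expand" is true, the output should contain a boa token alongside bank, of and america, so a query analyzed with the same analyzer should match on "boa" as well.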
You can definitely add more synonyms to the list, or use a synonyms file.
For the query-side variants BOfA, BOA, BofA, two approaches can work.
1) Add more synonyms covering these possible combinations:
"synonyms": [
  "bank of america,boa",
  "bank of america,bofa"
]
2) Or keep the abbreviations intact and use a fuzzy query:
{
  "query": {
    "match": {
      "text": {
        "query": "bofa",
        "fuzziness": 2
      }
    }
  }
}
Either way, you will need synonyms to supply the abbreviations to ES.
I figured out something close to this using pattern_replace:
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\B.)",
      "replacement": ""
    },
    {
      "type": "pattern_replace",
      "pattern": "(\\s)",
      "replacement": ""
    },
    "uppercase",
    {
      "type": "ngram",
      "min_gram": 3,
      "max_gram": 5
    }
  ],
  "text": "foxes jump lazy dogs"
}
which produces:
{
  "tokens": [
    {
      "token": "FJL",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "FJLD",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "JLD",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    }
  ]
}
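If these are the tokens you want, the same chain can be declared as a custom analyzer in the index settings so it is applied at index and search time. A sketch (the index, analyzer and filter names are placeholders):
PUT /acronym_index
{
  "settings": {
    "analysis": {
      "filter": {
        "first_letters": {
          "type": "pattern_replace",
          "pattern": "(\\B.)",
          "replacement": ""
        },
        "strip_spaces": {
          "type": "pattern_replace",
          "pattern": "(\\s)",
          "replacement": ""
        },
        "acronym_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "acronym_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "first_letters",
            "strip_spaces",
            "uppercase",
            "acronym_ngram"
          ]
        }
      }
    }
  }
}
Depending on your Elasticsearch version you may also need to raise index.max_ngram_diff, since min_gram and max_gram differ by more than one here.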
I have the following index settings:
{
  "analysis": {
    "filter": {
      "dutch_stop": {
        "type": "stop",
        "stopwords": "_dutch_"
      },
      "my_word_delimiter": {
        "type": "word_delimiter",
        "preserve_original": "true"
      }
    },
    "analyzer": {
      "dutch_search": {
        "filter": [
          "lowercase",
          "dutch_stop"
        ],
        "char_filter": [
          "special_char_filter"
        ],
        "tokenizer": "whitespace"
      },
      "dutch_index": {
        "filter": [
          "lowercase",
          "dutch_stop"
        ],
        "char_filter": [
          "special_char_filter"
        ],
        "tokenizer": "whitespace"
      }
    },
    "char_filter": {
      "special_char_filter": {
        "pattern": "/",
        "type": "pattern_replace",
        "replacement": " "
      }
    }
  }
}
Mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      },
      "analyzer": "dutch_search",
      "search_analyzer": "dutch_search"
    }
  }
}
Here is one document that I have inserted:
{
  "title": "This is test data."
}
Now I'm searching for the word "data", and my query for this is:
{
  "query": {
    "multi_match": {
      "query": "data",
      "fields": [
        "title"
      ]
    }
  }
}
But it returned zero records.
I know this is because of the whitespace tokenizer, but I need that as well, so can anyone suggest a solution? How can I use a whitespace analyzer and still search for a word that appears before a dot (.)?
You need to use the mapping char filter to remove the . character; this should solve your issue.
Below is the working example:
GET http://localhost:9200/_analyze
{
  "tokenizer": "whitespace",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ".=>"
      ]
    }
  ],
  "text": "This is test data."
}
which returns the tokens below:
{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "test",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "data",
      "start_offset": 13,
      "end_offset": 18,
      "type": "word",
      "position": 3
    }
  ]
}
Or you can modify your current pattern_replace character filter as follows:
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "\\.", // note this
"replacement": ""
}
}
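Applied to your settings, only the char_filter section and the analyzers' char_filter lists need to change; everything else stays as it is. A sketch (the dot_strip name is a placeholder), shown here for dutch_search, with dutch_index getting the same char_filter list:
"char_filter": {
  "special_char_filter": {
    "pattern": "/",
    "type": "pattern_replace",
    "replacement": " "
  },
  "dot_strip": {
    "type": "mapping",
    "mappings": [
      ".=>"
    ]
  }
},
"analyzer": {
  "dutch_search": {
    "filter": [
      "lowercase",
      "dutch_stop"
    ],
    "char_filter": [
      "special_char_filter",
      "dot_strip"
    ],
    "tokenizer": "whitespace"
  }
}
After reindexing, "This is test data." is tokenized to this, is, test, data, and the multi_match for "data" should find the document.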
This question is based on the "Tidying up Punctuation" section at https://www.elastic.co/guide/en/elasticsearch/guide/current/char-filters.html
Specifically, that this:
"char_filter": {
"quotes": {
"type": "mapping",
"mappings": [
"\\u0091=>\\u0027",
"\\u0092=>\\u0027",
"\\u2018=>\\u0027",
"\\u2019=>\\u0027",
"\\u201B=>\\u0027"
]
}
will turn "weird" apostrophes into a normal one.
But it doesn't seem to work.
I create this index:
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "analysis": {
        "char_filter": {
          "char_filter_quotes": {
            "type": "mapping",
            "mappings": [
              "\\u0091=>\\u0027",
              "\\u0092=>\\u0027",
              "\\u2018=>\\u0027",
              "\\u2019=>\\u0027",
              "\\u201B=>\\u0027"
            ]
          }
        },
        "analyzer": {
          "analyzer_Text": {
            "type": "standard",
            "char_filter": [ "char_filter_quotes" ]
          }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "Text": {
          "type": "text",
          "analyzer": "analyzer_Text",
          "search_analyzer": "analyzer_Text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
Add this document:
{
  "Text": "Fred's Jim‘s Pete’s Mark‘s"
}
Run this search and get a hit (on "Fred's" with "Fred's" highlighted):
{
  "query": {
    "match": {
      "Text": "Fred's"
    }
  },
  "highlight": {
    "fragment_size": 200,
    "pre_tags": [ "<span class='search-hit'>" ],
    "post_tags": [ "</span>" ],
    "fields": { "Text": { "type": "fvh" } }
  }
}
If I change the above search like this:
"Text": "Fred‘s"
I get no hits. Why not? I thought the search_analyzer would turn "Fred‘s" into "Fred's", which should hit. Also, if I search on
"Text": "Mark's"
I get nothing but
"Text": "Mark‘s"
does hit. The whole point of the exercise was to keep apostrophes but allow for the fact that, occasionally, non-standard apostrophes slip through and still get a hit.
Even more confusingly, if I analyze this at http://127.0.0.1:9200/esidx_json_gs_entry/_analyze:
{
  "char_filter": [ "char_filter_quotes" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "Fred's Jim‘s Pete’s Mark‛s"
}
I get exactly what I would expect:
{
  "tokens": [
    {
      "token": "fred's",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "jim's",
      "start_offset": 7,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "pete's",
      "start_offset": 13,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "mark's",
      "start_offset": 20,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
In the search, the search analyzer appears to do nothing. What am I missing?
There is a small mistake in your analyzer. It should be
"tokenizer": "standard"
not
"type": "standard"
With "type": "standard" you are configuring the built-in standard analyzer, which ignores the char_filter setting; giving it a tokenizer instead makes it a custom analyzer, so your char filter is actually applied.
Also, once you have indexed a document, you can check the actual indexed terms by using _termvectors.
So in your example you can do a GET on:
http://127.0.0.1:9200/esidx_json_gs_entry/_doc/1/_termvectors
I'm trying to search for a string with accented characters using query_string in Elasticsearch.
When I use query_string without an analyzer for the query, I get results only on an exact match (I'm searching for the string "Ředitel kvality", so when I search for "Reditel kvality" I get no results).
When I use the same analyzer that is used in the mappings, I get no results at all, with or without accented characters.
analyzers & filters:
"analysis": {
"filter": {
"cs_CZ": {
"recursion_level": "0",
"locale": "cs_CZ",
"type": "hunspell",
"dedup": "true"
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
},
"czech_stop": {
"type": "stop",
"stopwords": "_czech_"
}
},
"analyzer": {
"cz": {
"filter": [
"standard",
"lowercase",
"czech_stop",
"icu_folding",
"cs_CZ",
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
},
"folding": {
"filter": [
"standard",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
}
mappings:
"index1": {
"mappings": {
"type1": {
"properties": {
"revisions": {
"type": "nested",
"properties": {
"title": {
"type": "text",
"boost": 10.0,
"fields": {
"folded": {
"type": "text",
"boost": 6.0,
"analyzer": "folding"
}
},
"analyzer": "cz"
Here are the term vectors, which look fine:
"term_vectors": {
"revisions.title": {
"field_statistics": {
"sum_doc_freq": 764,
"doc_count": 201,
"sum_ttf": 770
},
"terms": {
"kvalita": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 8,
"end_offset": 15
}
]
},
"reditel": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 7
}
]
}
}
}
}
And when I run _analyze on my query, index1/_analyze?field=type1.revisions.title&text=Ředitel%20kvality, I get the same tokens:
{
  "tokens": [
    {
      "token": "reditel",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "kvalita",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
I can't figure out what is wrong and why ES will not match "Reditel kvality" with "Ředitel kvality".
This is the query I'm using:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "\u0158editel kvality*",
            "rewrite": "scoring_boolean",
            "analyzer": "cz",
            "default_operator": "AND"
          }
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
My ES version is 5.2.2.
I found out what's wrong.
The _all field must also be defined in the mappings with the analyzer.
I got the impression from the docs that this is automatic and that the _all field is magically created from the analyzed fields.
So now in the mappings I have:
_all": {
"enabled": true,
"analyzer": "cz"
},
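Since query_string searches the _all field by default when no fields are given, _all needs the same analyzer as the title field. In context, the relevant part of the mapping looks roughly like this (an abbreviated sketch; the boosts and the folded subfield are omitted):
"mappings": {
  "type1": {
    "_all": {
      "enabled": true,
      "analyzer": "cz"
    },
    "properties": {
      "revisions": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "cz"
          }
        }
      }
    }
  }
}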
And it's working.
Thanks a lot to Xylakant on IRC for guiding me.
Using Elasticsearch 2.2, as a simple experiment, I want to remove the last character from any word that ends with the lowercase character "s". For example, the word "sounds" would be indexed as "sound".
I'm defining my analyzer like this:
{
  "template": "document-index-template",
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "sFilter": {
          "type": "pattern_replace",
          "pattern": "([a-zA-Z]+)([s]( |$))",
          "replacement": "$2"
        }
      },
      "analyzer": {
        "tight": {
          "type": "standard",
          "filter": [
            "sFilter",
            "lowercase"
          ]
        }
      }
    }
  }
}
Then when I analyze the term "sounds of silences" using this request:
<index>/_analyze?analyzer=tight&text=sounds%20of%20silences
I get:
{
  "tokens": [
    {
      "token": "sounds",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "of",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "silences",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
I am expecting "sounds" to become "sound" and "silences" to become "silence".
The above analyzer setting is invalid. I think what you intended is an analyzer of type custom with the tokenizer set to standard.
Example:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "sFilter": {
          "type": "pattern_replace",
          "pattern": "([a-zA-Z]+)s$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "tight": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "sFilter"
          ]
        }
      }
    }
  }
}
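Once an index is created with these settings, re-running the _analyze call from the question should confirm the change (a sketch; <index> stands for your index name):
<index>/_analyze?analyzer=tight&text=sounds%20of%20silences
This should now return the tokens sound, of and silence; the pattern_replace filter rewrites the token text but leaves the offsets untouched.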
I need words with periods inside them to be treated the same as their non-period variants.
I see there's a section in the docs about analyzers and token filters, but I'm finding it rather terse and am not sure how to go about it.
Use a char filter to eliminate the dots, like this for example:
PUT /no_dots
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            ".=>"
          ]
        }
      },
      "analyzer": {
        "my_no_dots_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_mapping"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_no_dots_analyzer"
        }
      }
    }
  }
}
And to test it, GET /no_dots/_analyze?analyzer=my_no_dots_analyzer&text=J.J Abrams returns:
{
  "tokens": [
    {
      "token": "JJ",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Abrams",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
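Since the analyzer is applied at both index and search time, a plain match query should then find the document whether or not the query text contains the dots. A sketch (the document id 1 is arbitrary):
PUT /no_dots/test/1
{
  "text": "J.J Abrams"
}

GET /no_dots/_search
{
  "query": {
    "match": {
      "text": "JJ Abrams"
    }
  }
}
Searching for "J.J Abrams" instead of "JJ Abrams" should return the same hit, because the char filter strips the dot on both the indexed text and the query.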