This question is based on the "Tidying up Punctuation" section at https://www.elastic.co/guide/en/elasticsearch/guide/current/char-filters.html
Specifically, it says that this:
"char_filter": {
"quotes": {
"type": "mapping",
"mappings": [
"\\u0091=>\\u0027",
"\\u0092=>\\u0027",
"\\u2018=>\\u0027",
"\\u2019=>\\u0027",
"\\u201B=>\\u0027"
]
}
}
will turn "weird" apostrophes into a normal one.
But it doesn't seem to work.
I create this index:
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 1,
"analysis": {
"char_filter": {
"char_filter_quotes": {
"type": "mapping",
"mappings": [
"\\u0091=>\\u0027",
"\\u0092=>\\u0027",
"\\u2018=>\\u0027",
"\\u2019=>\\u0027",
"\\u201B=>\\u0027"
]
}
},
"analyzer": {
"analyzer_Text": {
"type": "standard",
"char_filter": [ "char_filter_quotes" ]
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"Text": {
"type": "text",
"analyzer": "analyzer_Text",
"search_analyzer": "analyzer_Text",
"term_vector": "with_positions_offsets"
}
}
}
}
}
Add this document:
{
"Text": "Fred's Jim‘s Pete’s Mark‘s"
}
Run this search and get a hit (on "Fred's" with "Fred's" highlighted):
{
"query":
{
"match":
{
"Text": "Fred's"
}
},
"highlight":
{
"fragment_size": 200,
"pre_tags": [ "<span class='search-hit'>" ],
"post_tags": [ "</span>" ],
"fields": { "Text": { "type": "fvh" } }
}
}
If I change the above search like this:
"Text": "Fred‘s"
I get no hits. Why not? I thought the search_analyzer would turn "Fred‘s" into "Fred's", which should hit. Also, if I search on
"Text": "Mark's"
I get nothing but
"Text": "Mark‘s"
does hit. The whole point of the exercise was to keep apostrophes but allow for the fact that, occasionally, non-standard apostrophes slip through and still get a hit.
Even more confusingly, if I analyze this at http://127.0.0.1:9200/esidx_json_gs_entry/_analyze:
{
"char_filter": [ "char_filter_quotes" ],
"tokenizer" : "standard",
"filter" : [ "lowercase" ],
"text" : "Fred's Jim‘s Pete’s Mark‛s"
}
I get exactly what I would expect:
{
"tokens": [
{
"token": "fred's",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "jim's",
"start_offset": 7,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "pete's",
"start_offset": 13,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "mark's",
"start_offset": 20,
"end_offset": 26,
"type": "<ALPHANUM>",
"position": 3
}
]
}
In the search, the search analyzer appears to do nothing. What am I missing?
TVMIA,
Adam (Editors - yes I know that saying "Thank you" is "fluff" but I wish to be polite so please leave it in.)
There is a small mistake in your analyzer. It should be
"tokenizer": "standard"
not
"type": "standard"
Also, once you have indexed a document, you can check the actual terms using the _termvectors API.
So in your example, you can do a GET on
http://127.0.0.1:9200/esidx_json_gs_entry/_doc/1/_termvectors
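For reference, here is roughly what the corrected analysis block from the question looks like once the custom analyzer names a tokenizer instead of a type (everything else is unchanged):
{
  "analysis": {
    "char_filter": {
      "char_filter_quotes": {
        "type": "mapping",
        "mappings": [
          "\\u0091=>\\u0027",
          "\\u0092=>\\u0027",
          "\\u2018=>\\u0027",
          "\\u2019=>\\u0027",
          "\\u201B=>\\u0027"
        ]
      }
    },
    "analyzer": {
      "analyzer_Text": {
        "tokenizer": "standard",
        "char_filter": [ "char_filter_quotes" ]
      }
    }
  }
}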
Related
I have the following index settings:
{
"analysis": {
"filter": {
"dutch_stop": {
"type": "stop",
"stopwords": "_dutch_"
},
"my_word_delimiter": {
"type": "word_delimiter",
"preserve_original": "true"
}
},
"analyzer": {
"dutch_search": {
"filter": [
"lowercase",
"dutch_stop"
],
"char_filter": [
"special_char_filter"
],
"tokenizer": "whitespace"
},
"dutch_index": {
"filter": [
"lowercase",
"dutch_stop"
],
"char_filter": [
"special_char_filter"
],
"tokenizer": "whitespace"
}
},
"char_filter": {
"special_char_filter": {
"pattern": "/",
"type": "pattern_replace",
"replacement": " "
}
}
}}
Mapping
{
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "dutch_search",
"search_analyzer": "dutch_search"
}
}}
Here is one document that I have inserted:
{
"title": "This is test data."
}
Now I'm searching for the word "data", and my query for this is:
{
"query": {
"multi_match": {
"query": "data",
"fields": [
"title"
]
}
}
}
but it returns zero records.
I know this is because of the whitespace tokenizer, but I need it as well, so can anyone suggest a solution? How can I use a whitespace tokenizer and still match a word that comes before a dot (.)?
You need to use a mapping char filter to remove the . character; this should solve your issue.
Below is a working example:
GET http://localhost:9200/_analyze
{
"tokenizer": "whitespace",
"char_filter": [
{
"type": "mapping",
"mappings": [
".=>"
]
}
],
"text": "This is test data."
}
which returns the following tokens:
{
"tokens": [
{
"token": "This",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "test",
"start_offset": 8,
"end_offset": 12,
"type": "word",
"position": 2
},
{
"token": "data",
"start_offset": 13,
"end_offset": 18,
"type": "word",
"position": 3
}
]
}
Or you can modify your current pattern_replace character filter as follows (note the escaped dot in the pattern):
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "\\.", // note this
"replacement": ""
}
}
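If you would rather keep your existing pattern_replace filter for slashes and add the dot removal next to it, a sketch of the analysis settings could look like this (the name dot_strip is just illustrative):
"analysis": {
  "filter": {
    "dutch_stop": {
      "type": "stop",
      "stopwords": "_dutch_"
    }
  },
  "char_filter": {
    "special_char_filter": {
      "type": "pattern_replace",
      "pattern": "/",
      "replacement": " "
    },
    "dot_strip": {
      "type": "mapping",
      "mappings": [ ".=>" ]
    }
  },
  "analyzer": {
    "dutch_search": {
      "tokenizer": "whitespace",
      "char_filter": [ "special_char_filter", "dot_strip" ],
      "filter": [ "lowercase", "dutch_stop" ]
    },
    "dutch_index": {
      "tokenizer": "whitespace",
      "char_filter": [ "special_char_filter", "dot_strip" ],
      "filter": [ "lowercase", "dutch_stop" ]
    }
  }
}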
I am trying to set up an existing/custom analyzer that enables searching using abbreviations. For example, if the text field is "Bank Of America", searching for BOfA, BOA, BofA, etc. should match this record.
How can I do that?
You can probably use a synonym token filter in a custom analyzer.
For example, the following mappings:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "synonym_filter"]
}
},
"filter": {
"synonym_filter": {
"type": "synonym",
"synonyms": [
"bank of america,boa"
],
"expand": true
}
}
}
},
"mappings": {
"document": {
"properties": {
"text": {
"type": "text",
"analyzer": "my_analyzer",
"fielddata": true
}
}
}
}
}
Definitely you can add more to the list or use a synonym file.
For query use cases like BOfA, BOA, or BofA, two approaches can work:
1) Add more synonyms with these possible combinations:
"synonyms": [
"bank of america,boa"
"bank of america,bofa"
]
2) Or keep the abbreviations intact and use a fuzzy query:
{
"query": {
"match": {
"text" : {
"query": "bofa",
"fuzziness": 2
}
}
}
}
You will need synonyms to supply abbreviations to ES.
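Once the index exists, you can verify the expansion with the _analyze API; something along these lines (the index name is illustrative) should show the boa token emitted alongside the original terms:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Bank Of America"
}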
I figured out something approximating this using pattern_replace:
GET /_analyze
{
"tokenizer": "keyword",
"filter": [
{
"type": "pattern_replace",
"pattern": "(\\B.)",
"replacement": ""
},
{
"type": "pattern_replace",
"pattern": "(\\s)",
"replacement": ""
},
"uppercase",
{
"type": "ngram",
"min_gram": 3,
"max_gram": 5
}
],
"text": "foxes jump lazy dogs"
}
which produces:
{
"tokens": [
{
"token": "FJL",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
},
{
"token": "FJLD",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
},
{
"token": "JLD",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}
]
}
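To use this outside the _analyze API, roughly the same chain can be registered as a custom analyzer in the index settings and then assigned to the relevant field in the mapping; all names below are illustrative:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "keep_initials": {
          "type": "pattern_replace",
          "pattern": "(\\B.)",
          "replacement": ""
        },
        "strip_spaces": {
          "type": "pattern_replace",
          "pattern": "(\\s)",
          "replacement": ""
        },
        "abbrev_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "abbrev_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "keep_initials", "strip_spaces", "uppercase", "abbrev_ngram" ]
        }
      }
    }
  }
}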
I'm trying to search for a string with accented characters using query_string in Elasticsearch.
When I use query_string without an analyzer for the query, I get results only on an exact match (I'm searching for the string "Ředitel kvality", so when I search for "Reditel kvality" I get no results).
When I use the same analyzer as the one used in the mappings, I get no results for either string, with or without accented characters.
analyzers & filters:
"analysis": {
"filter": {
"cs_CZ": {
"recursion_level": "0",
"locale": "cs_CZ",
"type": "hunspell",
"dedup": "true"
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
},
"czech_stop": {
"type": "stop",
"stopwords": "_czech_"
}
},
"analyzer": {
"cz": {
"filter": [
"standard",
"lowercase",
"czech_stop",
"icu_folding",
"cs_CZ",
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
},
"folding": {
"filter": [
"standard",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
}
mappings:
"index1": {
"mappings": {
"type1": {
"properties": {
"revisions": {
"type": "nested",
"properties": {
"title": {
"type": "text",
"boost": 10.0,
"fields": {
"folded": {
"type": "text",
"boost": 6.0,
"analyzer": "folding"
}
},
"analyzer": "cz"
Here are the term vectors, which look fine:
"term_vectors": {
"revisions.title": {
"field_statistics": {
"sum_doc_freq": 764,
"doc_count": 201,
"sum_ttf": 770
},
"terms": {
"kvalita": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 8,
"end_offset": 15
}
]
},
"reditel": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 7
}
]
}
}
}
}
And when I run _analyze on my query, index1/_analyze?field=type1.revisions.title&text=Ředitel%20kvality, I get the same tokens:
{
"tokens": [
{
"token": "reditel",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "kvalita",
"start_offset": 8,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
}
]
}
I can't figure out what is wrong and why ES will not match "Reditel kvality" with "Ředitel kvality".
This is the query I'm using:
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"\u0158editel kvality*",
"rewrite":"scoring_boolean",
"analyzer":"cz",
"default_operator":"AND"
}
}
]
}
},
"size":10,
"from":0
}
My ES version is 5.2.2.
Found out what's wrong.
The _all field must also be defined in the mappings with an analyzer.
I got the feeling from the docs that this is automatic and that the _all field is magically created from the analyzed fields.
So now in the mappings I have:
_all": {
"enabled": true,
"analyzer": "cz"
},
And it's working.
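For context, the _all block sits at the top level of the type mapping; a trimmed-down sketch for this index (keeping only the title field and the cz analyzer from the settings above) would be:
"mappings": {
  "type1": {
    "_all": {
      "enabled": true,
      "analyzer": "cz"
    },
    "properties": {
      "revisions": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "cz"
          }
        }
      }
    }
  }
}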
Thanks a lot to Xylakant on IRC for guiding me.
I have one field in which I am storing values like O5467508 (starting with the letter "O").
Below is my query:
{"from":0,"size":10,"query":{"bool":{"must":[{"match":{"field_LIST_105":{"query":"o5467508","type":"phrase_prefix"}}},{"bool":{"must":[{"match":{"RegionName":"Virginia"}}]}}]}}}
It gives me the correct result, but when I search for only the numeric value "5467508", the query result is empty.
Thanks in advance.
One possible solution that could help you: use the word_delimiter filter with the option preserve_original, which will keep the original token.
Something like this:
{
"settings": {
"analysis": {
"analyzer": {
"so_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"my_word_delimiter"
]
}
},
"filter": {
"my_word_delimiter": {
"type": "word_delimiter",
"preserve_original": true
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"field_LIST_105": {
"type": "text",
"analyzer": "so_analyzer"
}
}
}
}
}
I did a quick test of the analysis, and these are the tokens it gives me:
{
"tokens": [
{
"token": "o5467508",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "o",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "5467508",
"start_offset": 1,
"end_offset": 8,
"type": "word",
"position": 1
}
]
}
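With the original token preserved alongside the split parts, a plain match query on just the numeric portion should now find the document; a quick check could look like this (the index name is illustrative, the field name is from the question):
GET my_index/_search
{
  "query": {
    "match": {
      "field_LIST_105": "5467508"
    }
  }
}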
For more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
I need it so that words with periods inside them are equal to the non-period variant.
I see there's a section in the docs about analyzers and token filters, but I find it rather terse and am not sure how to go about it.
Use a char filter to eliminate the dots, like this for example:
PUT /no_dots
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
".=>"
]
}
},
"analyzer": {
"my_no_dots_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_mapping"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"analyzer": "my_no_dots_analyzer"
}
}
}
}
}
And to test it, GET /no_dots/_analyze?analyzer=my_no_dots_analyzer&text=J.J Abrams returns:
{
"tokens": [
{
"token": "JJ",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Abrams",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 2
}
]
}
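Because no separate search_analyzer is configured, the same analyzer runs at both index and search time, so a query for either "J.J" or "JJ" should match the same document. For example, using the index and field from the settings above:
GET /no_dots/_search
{
  "query": {
    "match": {
      "text": "J.J Abrams"
    }
  }
}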