How to make Elasticsearch more flexible?

I am currently using this elasticsearch DSL query:
{
"_source": [
"title",
"bench",
"id_",
"court",
"date"
],
"size": 15,
"from": 0,
"query": {
"bool": {
"must": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title",
"content"
]
}
},
"filter": [],
"should": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title.standard^16",
"content.standard"
]
}
}
}
},
"highlight": {
"pre_tags": [
"<tag1>"
],
"post_tags": [
"</tag1>"
],
"fields": {
"content": {}
}
}
}
Here's what's happening: if I search for I.r coelho it returns the correct results, but if I search for I R coelho (without the period) it returns different results. How do I prevent this from happening? I want the search to behave the same even if there are extra periods, spaces, commas, etc.
Mapping
{
"courts_2": {
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"content": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"title": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"verdict": {
"type": "text"
}
}
}
}
}
Settings:
{
"courts_2": {
"settings": {
"index": {
"highlight": {
"max_analyzed_offset": "19000000"
},
"number_of_shards": "5",
"provided_name": "courts_2",
"creation_date": "1581094116992",
"analysis": {
"filter": {
"my_metaphone": {
"replace": "true",
"type": "phonetic",
"encoder": "metaphone"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"my_metaphone"
],
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "MZSecLIVQy6jiI6YmqOGLg",
"version": {
"created": "7010199"
}
}
}
}
}
EDIT
Here are the results for I.R coelho from my analyzer:
{
"tokens": [
{
"token": "IR",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "KLH",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Standard analyzer:
{
"tokens": [
{
"token": "i.r",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "coelho",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}

The reason you see different behaviour when searching for I.r coelho versus I R coelho is that you are using different analyzers on the same fields: my_analyzer for title and content (the must block), and standard (the default) for title.standard and content.standard (the should block).
The two analyzers generate different tokens and therefore produce different scores depending on whether you search for I.r coelho (e.g., 2 tokens with the standard analyzer) or I R coelho (e.g., 3 tokens with the standard analyzer). You can test the behaviour of your analyzers with the _analyze API (see the Elastic documentation).
You have to decide whether this is the behaviour you want.
Updates (after requested clarifications from OP)
The results of the _analyze query confirm the hypothesis: the two analyzers make different score contributions and therefore return different results depending on whether or not your query includes punctuation characters.
If you don't want the results of your query to be affected by characters such as dots, or by upper/lower case, you will need to reconsider which analyzers you apply; the ones currently in use will never satisfy your requirements. If I understood your requirements correctly, the simple built-in analyzer should be the right one for your use case.
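For instance, both variants of the query produce identical tokens under the simple analyzer. A quick check with the _analyze API (no index is needed for a built-in analyzer):

GET /_analyze
{
  "analyzer": "simple",
  "text": "I.R coelho"
}

GET /_analyze
{
  "analyzer": "simple",
  "text": "I R coelho"
}

Both calls should return the same three tokens (i, r, coelho), so dots and extra whitespace no longer influence matching or scoring.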
In a nutshell: (1) consider replacing the standard built-in analyzer with the simple one, and (2) decide whether you want your query to score hits differently through different analyzers (i.e., the custom phonetic one on the title and content fields, and the simple one on their respective subfields).
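For illustration only, a minimal sketch of such a mapping. The index name courts_3 is made up, and the phonetic filter still requires the analysis-phonetic plugin your current index already relies on:

PUT courts_3
{
  "settings": {
    "analysis": {
      "filter": {
        "my_metaphone": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_metaphone"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "standard": { "type": "text", "analyzer": "simple" }
        }
      },
      "content": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "standard": { "type": "text", "analyzer": "simple" }
        }
      }
    }
  }
}

After reindexing, the should clause on title.standard and content.standard becomes tolerant of punctuation, while the must clause keeps the phonetic matching.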

Related

Mapping search analyzer (with apostrophes) not working

This question is based on the "Tidying up Punctuation" section at https://www.elastic.co/guide/en/elasticsearch/guide/current/char-filters.html
Specifically that this:
"char_filter": {
"quotes": {
"type": "mapping",
"mappings": [
"\\u0091=>\\u0027",
"\\u0092=>\\u0027",
"\\u2018=>\\u0027",
"\\u2019=>\\u0027",
"\\u201B=>\\u0027"
]
}
}
will turn "weird" apostrophes into a normal one.
But it doesn't seem to work.
I create this index:
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 1,
"analysis": {
"char_filter": {
"char_filter_quotes": {
"type": "mapping",
"mappings": [
"\\u0091=>\\u0027",
"\\u0092=>\\u0027",
"\\u2018=>\\u0027",
"\\u2019=>\\u0027",
"\\u201B=>\\u0027"
]
}
},
"analyzer": {
"analyzer_Text": {
"type": "standard",
"char_filter": [ "char_filter_quotes" ]
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"Text": {
"type": "text",
"analyzer": "analyzer_Text",
"search_analyzer": "analyzer_Text",
"term_vector": "with_positions_offsets"
}
}
}
}
}
Add this document:
{
"Text": "Fred's Jim‘s Pete’s Mark‘s"
}
Run this search and get a hit (on "Fred's" with "Fred's" highlighted):
{
"query":
{
"match":
{
"Text": "Fred's"
}
},
"highlight":
{
"fragment_size": 200,
"pre_tags": [ "<span class='search-hit'>" ],
"post_tags": [ "</span>" ],
"fields": { "Text": { "type": "fvh" } }
}
}
If I change the above search like this:
"Text": "Fred‘s"
I get no hits. Why not? I thought the search_analyzer would turn "Fred‘s" into "Fred's", which should hit. Also, if I search on
"Text": "Mark's"
I get nothing but
"Text": "Mark‘s"
does hit. The whole point of the exercise was to keep apostrophes but allow for the fact that, occasionally, non-standard apostrophes slip through and still get a hit.
Even more confusingly, if I analyze this at http://127.0.0.1:9200/esidx_json_gs_entry/_analyze:
{
"char_filter": [ "char_filter_quotes" ],
"tokenizer" : "standard",
"filter" : [ "lowercase" ],
"text" : "Fred's Jim‘s Pete’s Mark‛s"
}
I get exactly what I would expect:
{
"tokens": [
{
"token": "fred's",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "jim's",
"start_offset": 7,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "pete's",
"start_offset": 13,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "mark's",
"start_offset": 20,
"end_offset": 26,
"type": "<ALPHANUM>",
"position": 3
}
]
}
In the search, the search analyzer appears to do nothing. What am I missing?
TVMIA,
Adam (Editors - yes I know that saying "Thank you" is "fluff" but I wish to be polite so please leave it in.)
There is a small mistake in your analyzer. It should be
"tokenizer": "standard"
not
"type": "standard"
Also, once you have indexed a document, you can check the actual terms by using _termvectors.
So in your example you can do a GET on
http://127.0.0.1:9200/esidx_json_gs_entry/_doc/1/_termvectors
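For reference, here is the analyzer from the question rewritten as a custom analyzer. This is only a sketch of the analysis block; it also adds a lowercase filter to mirror the _analyze test above:

"analysis": {
  "char_filter": {
    "char_filter_quotes": {
      "type": "mapping",
      "mappings": [
        "\\u0091=>\\u0027",
        "\\u0092=>\\u0027",
        "\\u2018=>\\u0027",
        "\\u2019=>\\u0027",
        "\\u201B=>\\u0027"
      ]
    }
  },
  "analyzer": {
    "analyzer_Text": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [ "lowercase" ],
      "char_filter": [ "char_filter_quotes" ]
    }
  }
}

After recreating the index with this definition and reindexing the document, both "Fred's" and "Fred‘s" should analyze to fred's and the match query should hit.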

Elastic Search Query with @ (at sign) brings the same as without

I'm trying to match text with an "@" prefix, e.g. "@stackoverflow", in Elasticsearch. I'm using a boolean query, and both of these return exactly the same results, effectively ignoring my @ sign:
Query 1 with @:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"#stackoverflow"}}]}},"size":20}
Query 2 without:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"stackoverflow"}}]}},"size":20}
My Mapping:
{"posts":{"mappings":{"post":{"properties":{"upvotes":{"type":"long"},"created_time":{"type":"date","format":"strict_date_optional_time||epoch_millis"},"ratings":{"type":"long"},"link":{"type":"string"},"pic":{"type":"string"},"text":{"type":"string"},"id":{"type":"string"}}}}}}
I've tried encoding it to \u0040 but that didn't make any difference.
Your text field is an analyzed string field and goes through the standard analyzer by default, which means that @stackoverflow will be indexed as stackoverflow after the analysis process, as can be seen below:
GET /_analyze?analyzer=standard&text=@stackoverflow
{
"tokens": [
{
"token": "stackoverflow",
"start_offset": 1,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
}
]
}
You probably want to either use the keyword type if you need exact matching, or specify a different analyzer, such as whitespace, which will preserve the @ sign in your data:
GET /_analyze?analyzer=whitespace&text=@stackoverflow
{
"tokens": [
{
"token": "#stackoverflow",
"start_offset": 0,
"end_offset": 14,
"type": "word",
"position": 0
}
]
}
UPDATE:
Then I suggest using a custom analyzer for that field so you can control how the values are indexed. Recreate your index like this and then you should be able to do your searches:
PUT posts
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [ "lowercase" ]
}
}
}
}
},
"mappings": {
"post": {
"properties": {
"upvotes": {
"type": "long"
},
"created_time": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"ratings": {
"type": "long"
},
"link": {
"type": "string"
},
"pic": {
"type": "string"
},
"text": {
"type": "string",
"analyzer": "my_analyzer"
},
"id": {
"type": "string"
}
}
}
}
}
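Once the index is recreated and the documents reindexed, a match query should respect the @. A quick sanity check, assuming a document containing "@stackoverflow" has been indexed:

POST posts/_search
{
  "query": {
    "match": {
      "text": "@stackoverflow"
    }
  },
  "size": 20
}

With the whitespace tokenizer the query term is analyzed to @stackoverflow, so it no longer matches documents that only contain stackoverflow, and vice versa.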

ElasticSearch - search using abbreviations

I am trying to set up an existing/custom analyzer that enables searching using abbreviations. For example, if the text field is "Bank Of America", searching for BOfA, BOA, BofA, etc. should match this record.
How can I do that?
You can probably use a synonym token filter in a custom analyzer.
For example, with the following mappings:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "synonym_filter"]
}
},
"filter": {
"synonym_filter": {
"type": "synonym",
"synonyms": [
"bank of america,boa"
],
"expand": true
}
}
}
},
"mappings": {
"document": {
"properties": {
"text": {
"type": "text",
"analyzer": "my_analyzer",
"fielddata": true
}
}
}
}
}
You can definitely add more entries to the list, or use a synonyms file.
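If you go the file route, the filter can point at a synonyms file instead of an inline list. A sketch; the path analysis/synonyms.txt is hypothetical and the file must exist in the config directory of every node:

"filter": {
  "synonym_filter": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt",
    "expand": true
  }
}

The file uses one rule per line, in the same bank of america,boa format as the inline synonyms.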
For query terms like BOfA, BOA, or BofA, two approaches can work:
1) Add more synonyms covering these possible combinations:
"synonyms": [
"bank of america,boa"
"bank of america,bofa"
]
2) Or keep the abbreviations intact and use a fuzzy query:
{
"query": {
"match": {
"text" : {
"query": "bofa",
"fuzziness": 2
}
}
}
}
You will need synonyms to supply abbreviations to ES.
I figured out something approximating this using pattern_replace:
GET /_analyze
{
"tokenizer": "keyword",
"filter": [
{
"type": "pattern_replace",
"pattern": "(\\B.)",
"replacement": ""
},
{
"type": "pattern_replace",
"pattern": "(\\s)",
"replacement": ""
},
"uppercase",
{
"type": "ngram",
"min_gram": 3,
"max_gram": 5
}
],
"text": "foxes jump lazy dogs"
}
which produces:
{
"tokens": [
{
"token": "FJL",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
},
{
"token": "FJLD",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
},
{
"token": "JLD",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}
]
}
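If that output looks right for your data, the same chain can be registered as a named analyzer and applied to a subfield so the original text stays searchable as well. This is a sketch using ES 7+ typeless mappings; the index, analyzer and filter names are made up, a separate search analyzer keeps query terms from being n-grammed, and max_ngram_diff is needed on recent versions because max_gram - min_gram exceeds 1:

PUT companies
{
  "settings": {
    "index": {
      "max_ngram_diff": 2,
      "analysis": {
        "filter": {
          "initials_only": {
            "type": "pattern_replace",
            "pattern": "(\\B.)",
            "replacement": ""
          },
          "strip_spaces": {
            "type": "pattern_replace",
            "pattern": "(\\s)",
            "replacement": ""
          },
          "abbrev_ngram": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 5
          }
        },
        "analyzer": {
          "abbreviations": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["initials_only", "strip_spaces", "uppercase", "abbrev_ngram"]
          },
          "abbreviations_search": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["uppercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "abbrev": {
            "type": "text",
            "analyzer": "abbreviations",
            "search_analyzer": "abbreviations_search"
          }
        }
      }
    }
  }
}

A match query for boa or fjld against text.abbrev is then uppercased to a single term and compared against the indexed n-grams.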

Filtering with term on a tag key with a value containing a dot does not return any result

I'm using ElasticSearch 5.2
My query is:
POST /_search
{
"query": {
"bool": {
"filter": [
{ "term": { "tag": "server-dev.user-log" }}
]
}
}
}
I can filter with a tag value like abcd, but it seems I cannot with ab.cd.
I guess this is because of the tokenizer. Is there a way to ask for strict equivalence or, if the problem comes from the ., a way to escape it?
The tag mapping is:
"tag": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
Most likely you have the standard analyzer on your tag field, so for the value server-dev.user-log the following tokens will be produced:
{
"tokens": [
{
"token": "server",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dev.user",
"start_offset": 7,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "log",
"start_offset": 16,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
That's the reason you do not get a match. To fix it, add a mapping for the tag field with a tokenizer that preserves the whole token. The simplest choice is the keyword tokenizer, with index settings like this:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "keyword"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "tag",
"analyzer": "my_analyzer"
}
}
}
}
}
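With that mapping in place you can check that the whole value is kept as a single term. A sketch, where my_index stands in for your actual index name:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "server-dev.user-log"
}

This should return a single token, server-dev.user-log, so a term filter on the exact value will match.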
Finally, I've been able to make it work like this:
POST /_search
{
"query": {
"bool": {
"filter": [
{ "terms": { "tag": ["server", "dev.user", "log"] }}
]
}
}
}
It seems the - is a token delimiter.
I just want to add that my configuration is very standard: I didn't modify the mapping; it was created by fluentd.
EDIT
If you replace tag with tag.keyword, you don't need the above workaround anymore (which, by the way, does not work for every value):
POST /_search
{
"query": {
"bool": {
"filter": [
{ "term": { "tag.keyword": "server-dev.user-log" }}
]
}
}
}

No result when using search analyzer

I'm trying to search string using query_string in elasticsearch with accented characters.
When I use query_string without an analyzer for the query, I get results only on an exact match (I'm searching for the string "Ředitel kvality", so when I try "Reditel kvality" I get no results).
When I use the same analyzer as in the mappings, I get no results for either string, with or without accented characters.
analyzers & filters:
"analysis": {
"filter": {
"cs_CZ": {
"recursion_level": "0",
"locale": "cs_CZ",
"type": "hunspell",
"dedup": "true"
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
},
"czech_stop": {
"type": "stop",
"stopwords": "_czech_"
}
},
"analyzer": {
"cz": {
"filter": [
"standard",
"lowercase",
"czech_stop",
"icu_folding",
"cs_CZ",
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
},
"folding": {
"filter": [
"standard",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
}
mappings:
"index1": {
"mappings": {
"type1": {
"properties": {
"revisions": {
"type": "nested",
"properties": {
"title": {
"type": "text",
"boost": 10.0,
"fields": {
"folded": {
"type": "text",
"boost": 6.0,
"analyzer": "folding"
}
},
"analyzer": "cz"
Here are the term vectors, which look fine:
"term_vectors": {
"revisions.title": {
"field_statistics": {
"sum_doc_freq": 764,
"doc_count": 201,
"sum_ttf": 770
},
"terms": {
"kvalita": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 8,
"end_offset": 15
}
]
},
"reditel": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 7
}
]
}
}
}
}
And when I run _analyze on my query (index1/_analyze?field=type1.revisions.title&text=Ředitel%20kvality) I get the same tokens:
{
"tokens": [
{
"token": "reditel",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "kvalita",
"start_offset": 8,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
}
]
}
I can't figure out what is wrong and why ES will not match "Reditel kvality" with "Ředitel kvality".
This is the query I'm using:
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"\u0158editel kvality*",
"rewrite":"scoring_boolean",
"analyzer":"cz",
"default_operator":"AND"
}
}
]
}
},
"size":10,
"from":0
}
My ES version is 5.2.2.
Found out what's wrong.
The _all field must also be defined in the mappings, with the analyzer.
I got the impression from the docs that this is automatic and that the _all field is magically created from the analyzed fields.
So now in the mappings I have:
_all": {
"enabled": true,
"analyzer": "cz"
},
And it's working.
Thanks a lot to Xylakant on IRC for guiding me.
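For context, a sketch of where that block sits in the type mapping, abridged from the question; it assumes the cz and folding analyzers are defined in the index settings as shown above:

"mappings": {
  "type1": {
    "_all": {
      "enabled": true,
      "analyzer": "cz"
    },
    "properties": {
      "revisions": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "cz",
            "fields": {
              "folded": {
                "type": "text",
                "analyzer": "folding"
              }
            }
          }
        }
      }
    }
  }
}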
