I'm trying to match text with a "#" prefix, e.g. "#stackoverflow", in Elasticsearch. I'm using a boolean query, and both of the queries below return exactly the same results, effectively ignoring my # sign:
Query 1 with #:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"#stackoverflow"}}]}},"size":20}
Query 2 without:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"stackoverflow"}}]}},"size":20}
My Mapping:
{"posts":{"mappings":{"post":{"properties":{"upvotes":{"type":"long"},"created_time":{"type":"date","format":"strict_date_optional_time||epoch_millis"},"ratings":{"type":"long"},"link":{"type":"string"},"pic":{"type":"string"},"text":{"type":"string"},"id":{"type":"string"}}}}}}
I've tried encoding it to \u0040, but that didn't make any difference.
Your text field is an analyzed string field, which means it goes through the standard analyzer by default, so #stackoverflow will be indexed as stackoverflow after analysis, as can be seen below:
GET /_analyze?analyzer=standard&text=#stackoverflow
{
"tokens": [
{
"token": "stackoverflow",
"start_offset": 1,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
}
]
}
You probably want to either use the keyword type if you need exact matching or specify a different analyzer, such as whitespace, which will preserve the # sign in your data:
GET /_analyze?analyzer=whitespace&text=#stackoverflow
{
"tokens": [
{
"token": "#stackoverflow",
"start_offset": 0,
"end_offset": 14,
"type": "word",
"position": 0
}
]
}
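If exact matching is enough, here is a sketch of that first option as a multi-field on the string mapping shown in the question (pre-5.x syntax, where a not_analyzed string is the equivalent of today's keyword type; the raw sub-field name is illustrative):
PUT posts
{
  "mappings": {
    "post": {
      "properties": {
        "text": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
A term query on text.raw then only matches the exact stored value, # sign included, which is useful when the field holds the hashtag on its own rather than a longer sentence.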
UPDATE:
Then I suggest using a custom analyzer for that field so you can control how the values are indexed. Recreate your index like this and then you should be able to do your searches:
PUT posts
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [ "lowercase" ]
}
}
}
}
},
"mappings": {
"post": {
"properties": {
"upvotes": {
"type": "long"
},
"created_time": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"ratings": {
"type": "long"
},
"link": {
"type": "string"
},
"pic": {
"type": "string"
},
"text": {
"type": "string",
"analyzer": "my_analyzer"
},
"id": {
"type": "string"
}
}
}
}
}
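With the index recreated, the whitespace-based analyzer keeps the # sign at both index and search time, so the original query should now return only documents that actually contain the hashtag; a quick sanity check reusing the query from the question (a sketch):
POST posts/_search
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "text",
            "default_operator": "AND",
            "query": "#stackoverflow"
          }
        }
      ]
    }
  }
}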
I am currently indexing the element field as "element" : "dog,cat,mouse" with the following configuration:
ES config:
"settings": {
"analysis": {
"analyzer": {
"search_synonyms": {
"tokenizer": "whitespace",
"filter": [
"graph_synonyms",
"lowercase",
"asciifolding"
]
},
"comma" : {
"type" : "custom",
"tokenizer" : "comma"
}
},
"filter": {
"graph_synonyms": {
...
}
},
"normalizer": {
"normalizer_1": {
...
}
},
"tokenizer" : {
"comma" : {
"type" : "pattern",
"pattern" : ","
}
}
}
},
Fields mapping:
"mappings": {
"properties": {
"element": {
"type": "keyword",
"normalizer": "normalizer_1"
}
................
}
}
The value dog,cat,mouse is afterwards used as a filter, but I want to separate the values and use each one as a separate filter. I tried to use the multi-fields feature and made the following changes, but I'm still not sure what else I should do.
"element": {
"type": "keyword",
"normalizer": "normalizer_1",
"fields": {
"separated": {
"type": "text",
"analyzer": "comma"
}
}
},
If I understand correctly, you have a field where you are storing the value dog,cat,mouse and you need the values separately, like dog, cat, and mouse. For that you can simply store them in a text field, which uses the default standard analyzer; the standard analyzer splits tokens on the comma.
Analyze API call to show the tokens:
POST /_analyze
{
"text": "dog,cat,mouse",
"analyzer": "standard"
}
Tokens generated:
{
"tokens": [
{
"token": "dog",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "cat",
"start_offset": 4,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "mouse",
"start_offset": 8,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
}
]
}
As per the comment, adding a sample of how to define the element field so that the standard analyzer is used. Note that it is currently defined as keyword with a normalizer, hence the standard analyzer is not used.
Index mapping
PUT /your-index/
{
"mappings": {
"properties": {
"name": {
"element": "text"
}
}
}
}
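With element mapped as text, each comma-separated value becomes its own token, so a single value can be used as a filter on its own; a sketch against the hypothetical your-index above:
GET /your-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "element": "cat" } }
      ]
    }
  }
}
This matches any document whose element value contains the cat token, regardless of the other animals stored alongside it.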
I am currently using this Elasticsearch DSL query:
{
"_source": [
"title",
"bench",
"id_",
"court",
"date"
],
"size": 15,
"from": 0,
"query": {
"bool": {
"must": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title",
"content"
]
}
},
"filter": [],
"should": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title.standard^16",
"content.standard"
]
}
}
}
},
"highlight": {
"pre_tags": [
"<tag1>"
],
"post_tags": [
"</tag1>"
],
"fields": {
"content": {}
}
}
}
Here's what's happening: if I search for I.r coelho, it returns the correct results. But if I search for I R coelho (without the period), then it returns different results. How do I prevent this from happening? I want the search to behave the same even if there are extra periods, spaces, commas, etc.
Mapping
{
"courts_2": {
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"content": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"title": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"verdict": {
"type": "text"
}
}
}
}
}
Settings:
{
"courts_2": {
"settings": {
"index": {
"highlight": {
"max_analyzed_offset": "19000000"
},
"number_of_shards": "5",
"provided_name": "courts_2",
"creation_date": "1581094116992",
"analysis": {
"filter": {
"my_metaphone": {
"replace": "true",
"type": "phonetic",
"encoder": "metaphone"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"my_metaphone"
],
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "MZSecLIVQy6jiI6YmqOGLg",
"version": {
"created": "7010199"
}
}
}
}
}
EDIT
Here are the results for I.R coelho from my_analyzer:
{
"tokens": [
{
"token": "IR",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "KLH",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Standard analyzer:
{
"tokens": [
{
"token": "i.r",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "coelho",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}
The reason why you see different behaviour when searching for I.r coelho and I R coelho is that you are using different analyzers on the same fields, i.e., my_analyzer for title and content (must block), and standard (the default) for title.standard and content.standard (should block).
The two analyzers generate different tokens, and therefore different scores, when you search for I.r coelho (e.g., 2 tokens with the standard analyzer) versus I R coelho (e.g., 3 tokens with the standard analyzer). You can test the behaviour of your analyzers using the _analyze API (see the Elastic documentation).
You have to decide whether this is your desired behaviour.
Updates (after requested clarifications from OP)
The results of the _analyze calls confirmed the hypothesis: the two analyzers lead to different score contributions and, consequently, to different results depending on whether your query includes symbol characters or not.
If you don't want the results of your query to be affected by symbols such as dots or upper/lower case, you will need to reconsider what analyzers you want to apply. The ones currently used will never satisfy your requirements. If I understood your requirements correctly, the simple built-in analyzer should be the right one for your use case.
In a nutshell, (1) you should consider replacing the built-in standard analyzer with the simple one, and (2) you should decide whether you want your query to score hits differently based on the different analyzers (i.e., the custom phonetic one on the title and content fields, and the simple one on their respective standard subfields).
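As a quick check of that suggestion, the built-in simple analyzer splits on every non-letter character and lowercases, so I.R coelho and I R coelho produce exactly the same tokens (i, r, coelho); a sketch against the _analyze API:
POST /_analyze
{
  "analyzer": "simple",
  "text": "I.R coelho"
}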
I am trying to set up an existing/custom analyzer that enables search using abbreviations. For example, if the text field is "Bank Of America", searching for BOfA, BOA, BofA, etc. should match this record.
How can I do that?
You can probably use a synonym token filter in a custom analyzer.
For example, the following mappings:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "synonym_filter"]
}
},
"filter": {
"synonym_filter": {
"type": "synonym",
"synonyms": [
"bank of america,boa"
],
"expand": true
}
}
}
},
"mappings": {
"document": {
"properties": {
"text": {
"type": "text",
"analyzer": "my_analyzer",
"fielddata": true
}
}
}
}
}
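With that mapping in place, a plain match query for the abbreviation should already hit the full name, because the synonym filter maps boa and bank of america to the same tokens at both index and search time; a sketch against a hypothetical bank-test index:
POST /bank-test/_search
{
  "query": {
    "match": {
      "text": "BOA"
    }
  }
}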
You can definitely add more entries to the list or use a synonyms file.
For query use cases like BOfA, BOA, or BofA, two approaches can work:
1) More synonyms with these possible combinations:
"synonyms": [
"bank of america,boa"
"bank of america,bofa"
]
2) Or keep the abbreviations intact and use a fuzzy query:
{
"query": {
"match": {
"text" : {
"query": "bofa",
"fuzziness": 2
}
}
}
}
You will need synonyms to supply abbreviations to ES.
I figured out something approaching this using pattern_replace:
GET /_analyze
{
"tokenizer": "keyword",
"filter": [
{
"type": "pattern_replace",
"pattern": "(\\B.)",
"replacement": ""
},
{
"type": "pattern_replace",
"pattern": "(\\s)",
"replacement": ""
},
"uppercase",
{
"type": "ngram",
"min_gram": 3,
"max_gram": 5
}
],
"text": "foxes jump lazy dogs"
}
which produces:
{
"tokens": [
{
"token": "FJL",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
},
{
"token": "FJLD",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
},
{
"token": "JLD",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}
]
}
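That filter chain can be packaged into the index settings so it runs at index time; below is a sketch assuming a 7.x-style typeless mapping (all names are illustrative, max_ngram_diff is raised because the gram range spans two, and a simpler search-time analyzer is used so that the query text is not itself collapsed to initials):
PUT /acronym-test
{
  "settings": {
    "index": {
      "max_ngram_diff": 2,
      "analysis": {
        "filter": {
          "initials_only": {
            "type": "pattern_replace",
            "pattern": "(\\B.)",
            "replacement": ""
          },
          "strip_spaces": {
            "type": "pattern_replace",
            "pattern": "(\\s)",
            "replacement": ""
          },
          "acronym_ngrams": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 5
          }
        },
        "analyzer": {
          "acronym_index": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["initials_only", "strip_spaces", "uppercase", "acronym_ngrams"]
          },
          "acronym_search": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["uppercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "acronym_index",
        "search_analyzer": "acronym_search"
      }
    }
  }
}
With this setup, indexing "Bank Of America" stores grams of its initials, so a query such as BOA can match; variants like BOfA would still need synonyms or fuzziness, as discussed above.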
I'm trying to search for a string with accented characters using query_string in Elasticsearch.
When I use query_string without an analyzer for the query, I get results only on an exact match (I'm searching for the string "Ředitel kvality", so when I type "Reditel kvality" I get no results).
When I use the same analyzer as is used in the mappings, I get no results for either string, with or without accented characters.
analyzers & filters:
"analysis": {
"filter": {
"cs_CZ": {
"recursion_level": "0",
"locale": "cs_CZ",
"type": "hunspell",
"dedup": "true"
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
},
"czech_stop": {
"type": "stop",
"stopwords": "_czech_"
}
},
"analyzer": {
"cz": {
"filter": [
"standard",
"lowercase",
"czech_stop",
"icu_folding",
"cs_CZ",
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
},
"folding": {
"filter": [
"standard",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
}
mappings:
"index1": {
"mappings": {
"type1": {
"properties": {
"revisions": {
"type": "nested",
"properties": {
"title": {
"type": "text",
"boost": 10.0,
"fields": {
"folded": {
"type": "text",
"boost": 6.0,
"analyzer": "folding"
}
},
"analyzer": "cz"
Here are the term vectors, which look fine:
"term_vectors": {
"revisions.title": {
"field_statistics": {
"sum_doc_freq": 764,
"doc_count": 201,
"sum_ttf": 770
},
"terms": {
"kvalita": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 8,
"end_offset": 15
}
]
},
"reditel": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 7
}
]
}
}
}
}
And when I run the analyze API on my query (index1/_analyze?field=type1.revisions.title&text=Ředitel%20kvality), I get the same tokens:
{
"tokens": [
{
"token": "reditel",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "kvalita",
"start_offset": 8,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
}
]
}
I can't figure out what is wrong and why ES will not match "Reditel kvality" with "Ředitel kvality".
This is the query I'm using:
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"\u0158editel kvality*",
"rewrite":"scoring_boolean",
"analyzer":"cz",
"default_operator":"AND"
}
}
]
}
},
"size":10,
"from":0
}
My ES version is 5.2.2.
I found out what's wrong.
The _all field must also be defined in the mappings with the analyzer.
I got the impression from the docs that this is automatic and that the _all field is magically created from the analyzed fields.
So now in the mappings I have:
_all": {
"enabled": true,
"analyzer": "cz"
},
And it's working.
Thanks a lot to Xylakant on IRC for guiding me.
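For reference, a sketch of where that _all setting sits in the index creation request on ES 5.x (only the relevant parts of the mapping from the question are shown; the cz and folding analyzers would go in the same request's settings block):
PUT index1
{
  "mappings": {
    "type1": {
      "_all": {
        "enabled": true,
        "analyzer": "cz"
      },
      "properties": {
        "revisions": {
          "type": "nested",
          "properties": {
            "title": {
              "type": "text",
              "analyzer": "cz",
              "fields": {
                "folded": {
                  "type": "text",
                  "analyzer": "folding"
                }
              }
            }
          }
        }
      }
    }
  }
}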
I have installed Smart Chinese Analysis for Elasticsearch on our ES cluster, but I cannot find documentation on how to specify the correct analyzer. I would expect that I need to set a tokenizer and a filter specifying stopwords and a stemmer...
For example, in Dutch:
"dutch": {
"type": "custom",
"tokenizer": "uax_url_email",
"filter": ["lowercase", "asciifolding", "dutch_stemmer_filter", "dutch_stop_filter"]
}
with:
"dutch_stemmer_filter": {
"type": "stemmer",
"name": "dutch"
},
"dutch_stop_filter": {
"type": "stop",
"stopwords": ["_dutch_"]
}
How do I configure my analyzer for Chinese?
Try this for a certain index (the analyzer is 'smartcn' and the tokenizer is 'smartcn_tokenizer'):
PUT /test_chinese
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"default": {
"type": "smartcn"
}
}
}
}
}
}
GET /test_chinese/_analyze?text='叻出色'
It should output two tokens (test taken from the plugin test classes):
{
"tokens": [
{
"token": "叻",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 2
},
{
"token": "出色",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 3
}
]
}
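If you prefer to apply it only to specific fields rather than as the index default, the analyzer can also be set per field in the mapping; a sketch with an illustrative index and field name (older versions would wrap properties in a type name):
PUT /test_chinese_fields
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "smartcn"
      }
    }
  }
}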