Elasticsearch fuzzy query seems to ignore Brazilian stopwords

I have stopwords for Brazilian Portuguese configured on my index, but if I search for the term "ios" (it's an iOS course), a bunch of other documents are returned, because the term "nos" (a Brazilian stopword) seems to be treated as a valid fuzzy expansion of the query term.
But if I search just for the term "nos", nothing is returned. Shouldn't the stopword be ignored by the fuzzy query as well? I'm confused.
Is there any alternative to this? The main goal is that when a user searches for "ios", documents that only match a stopword like "nos" are not returned, while I keep the fuzziness for other, more complex searches users make.
An example of query:
GET /index/_search
{
"explain": true,
"query": {
"bool" : {
"must" : [
{
"terms" : {
"document_type" : [
"COURSE"
],
"boost" : 1.0
}
},
{
"multi_match" : {
"query" : "ios",
"type" : "best_fields",
"operator" : "OR",
"slop" : 0,
"fuzziness" : "AUTO",
"prefix_length" : 0,
"max_expansions" : 50,
"zero_terms_query" : "NONE",
"auto_generate_synonyms_phrase_query" : true,
"fuzzy_transpositions" : true,
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
}
Part of the explain output:
"description": "weight(corpo:nos in 52) [PerFieldSimilarity], result of:",
[image: stopword filter configuration]
Thanks.
I tried adding a prefix length, but what I really want is for the stopwords to be ignored.
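For reference, the fuzzy expansions can be inspected with the validate API (a diagnostic sketch; the field name corpo is taken from the explain output above, and the rewrite flag is assumed to be available on your Elasticsearch version):
GET /index/_validate/query?explain=true&rewrite=true
{
  "query": {
    "multi_match": {
      "query": "ios",
      "fields": ["corpo"],
      "fuzziness": "AUTO"
    }
  }
}
The rewritten query in the response lists the expanded terms, such as corpo:nos, confirming that the stopword exists as a term in the index.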

I believe the correct way to configure stopwords per language is shown below:
PUT idx_teste
{
"settings": {
"analysis": {
"filter": {
"brazilian_stop_filter": {
"type": "stop",
"stopwords": "_brazilian_"
}
},
"analyzer": {
"teste_analyzer": {
"tokenizer": "standard",
"filter": ["brazilian_stop_filter"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "teste_analyzer"
}
}
}
}
POST idx_teste/_analyze
{
"analyzer": "teste_analyzer",
"text":"course nos advanced"
}
Note that the term "nos" was removed:
{
"tokens": [
{
"token": "course",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "advanced",
"start_offset": 11,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
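Once the field is indexed with this analyzer, the stopword never reaches the inverted index, so a fuzzy search for "ios" has no "nos" term left to expand to. A minimal follow-up sketch against the test index above (the fuzziness mirrors the original query):
GET idx_teste/_search
{
  "query": {
    "multi_match": {
      "query": "ios",
      "fields": ["name"],
      "fuzziness": "AUTO"
    }
  }
}
Documents that only contained "nos" should no longer be returned, while fuzziness still applies to the terms that remain in the index.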

Related

How to get an index item that has "name": "McLaren" by searching for "mclaren" in Elasticsearch 1.7?

Here is the tokenizer -
"tokenizer": {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
Mapping -
"name": {
"type": "string",
"analyzer": "filename_index",
"include_in_all": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "naturalsort"
}
}
},
Analyzer -
"filename_index" : {
"tokenizer" : "filename",
"filter" : [
"word_delimiter",
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
},
I would like to get the index item by searching for "mclaren", but the name indexed is "McLaren".
I would like to stick with query_string because a lot of other functionality is based on it. Here is the query that does not give the expected result -
{
"query": {
"filtered": {
"query": {
"query_string" : {
"query" : "mclaren",
"default_operator" : "AND",
"analyze_wildcard" : true,
}
}
}
},
"size" :50,
"from" : 0,
"sort": {}
}
How could I accomplish this? Thank you!
I got it! The problem is certainly around the word_delimiter token filter.
By default it:
Splits tokens at letter case transitions. For example: PowerShot → Power, Shot
Cf. the documentation.
So "macLaren" generates two tokens -> [mac, Laren], while "maclaren" only generates one token -> ['maclaren'].
Analyze example:
POST _analyze
{
"tokenizer": {
"pattern": """[^\p{L}\d]+""",
"type": "pattern"
},
"filter": [
"word_delimiter"
],
"text": ["macLaren", "maclaren"]
}
Response:
{
"tokens" : [
{
"token" : "mac",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "Laren",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "maclaren",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 102
}
]
}
So I think one option is to configure your word_delimiter with the option split_on_case_change set to false (see the parameters doc).
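A sketch of what the analysis settings could look like with that change (the filter name my_word_delimiter is illustrative, and the russian_*/czech_* filters referenced by the analyzer are assumed to be defined elsewhere in your settings, as in the original index):
"analysis": {
  "filter": {
    "my_word_delimiter": {
      "type": "word_delimiter",
      "split_on_case_change": false
    }
  },
  "tokenizer": {
    "filename": {
      "pattern": "[^\\p{L}\\d]+",
      "type": "pattern"
    }
  },
  "analyzer": {
    "filename_index": {
      "tokenizer": "filename",
      "filter": [
        "my_word_delimiter",
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer",
        "czech_stop",
        "czech_keywords",
        "czech_stemmer"
      ]
    }
  }
}
With split_on_case_change disabled, "McLaren" stays a single token and is lowercased to "mclaren", so the query_string search for "mclaren" should match.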
PS: remember to remove the settings you previously added (cf. comments), since with that setting your query_string query would only target the name field, which does not exist.

Does Elasticsearch support nested or object fields in MultiMatch?

I have an object field named "FullTitleFts". It has a field "text" inside. This query works fine (and returns some entries):
GET index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"fullTitleFts.text": "Ivan"
}
}
]
}
}
}
But this query returns nothing:
GET index/_search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "Ivan",
"fields": [
"fullTitleFts.text"
]
}
}
]
}
}
}
Mapping of the field:
"fullTitleFts": {
"copy_to": [
"text"
],
"type": "keyword",
"fields": {
"text": {
"analyzer": "analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "text"
}
}
}
"analyzer": {
"filter": [
"lowercase",
"hypocorisms",
"protect_kw"
],
"char_filter": [
"replace_char_filter",
"e_char_filter"
],
"expand": "true",
"type": "custom",
"tokenizer": "standard"
}
e_char_filter is for replacing the Cyrillic char "ё" with "е", and replace_char_filter is for removing "�" from text. protect_kw is a keyword_marker for some Russian conjunctions. hypocorisms is a synonym_graph for generating other forms of names.
Example of analyzer output:
GET index/_analyze
{
"analyzer": "analyzer",
"text": "Алёна�"
}
{
"tokens" : [
{
"token" : "аленка",
"start_offset" : 0,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "аленушка",
"start_offset" : 0,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "алена",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
I've also found this question, and it seems that the answer didn't really work: the author had to add the "include_in_root" option to the mapping. So I'm wondering whether multi_match supports nested or object fields at all. I also can't find anything about it in the docs.
Based on the index mapping you have provided, your field is defined as a multi-field, not as a nested or object field. So both match and multi_match should work without providing a path: just use the field name fullTitleFts.text when you need to search the text type and fullTitleFts when you need to search the keyword type.
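For illustration, a multi_match that targets both the keyword parent field and its text sub-field could look like this (field names taken from the mapping above; this is a sketch, not the exact query from the question):
GET index/_search
{
  "query": {
    "multi_match": {
      "query": "Ivan",
      "fields": ["fullTitleFts", "fullTitleFts.text"]
    }
  }
}
Here fullTitleFts only contributes on an exact keyword match, while fullTitleFts.text is analyzed with the custom analyzer, just like the plain match query that already works.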

How to match with multiple inputs in Elasticsearch

I'm trying to query all possible logs on 3 environments (dev, test, prod) with the query below using terms. I tried both must and should.
curl -vs -X POST 'http://localhost:9200/*/_search?pretty=true' -H 'Content-Type: application/json' -d '
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment": ["can-prod", "can-test", "can-dev"]
}
},
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-05-02T17:22:29.069Z",
"lt": "2020-05-23T17:23:29.069Z"
}
}
}, {
"terms": {
"can.level": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
}'
gives:
{
"took" : 871,
"timed_out" : false,
"_shards" : {
"total" : 391,
"successful" : 389,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
However, if I replace terms with match it works, but then I can't query with the other inputs, like querying WARN messages or logs related to the ParserService class, etc.:
curl -vs -X POST 'http://localhost:9200/*/_search?pretty=true' -H 'Content-Type: application/json' -d '
{
"query": {
"bool": {
"should":
[{"match": {"can.deployment": "can-prod"}}],
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-03-20T17:22:29.069Z",
"lt": "2020-05-01T17:23:29.069Z"
}
}
},{
"match": {
"can.level": "ERROR"
}
},{
"match": {
"can.class": "MTMessage"
}
}
]
}
}
}'
How do I accomplish this, with or without terms/match?
I tried this with no luck; I get 0 search results:
"match": {
"can.level": "ERROR"
}
},{
"match": {
"can.level": "WARN"
}
},{
"match": {
"can.class": "MTMessage"
}
}
Any hints will certainly help. TIA!
[EDIT]
Adding the mappings (/_mapping?pretty=true):
"can" : {
"properties" : {
"class" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"deployment" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"level" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Adding sample docs:
{
"took" : 50,
"timed_out" : false,
"_shards" : {
"total" : 391,
"successful" : 387,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 5.44714,
"hits" : [
{
"_index" : "filebeat-6.1.2-2020.05.21",
"_type" : "doc",
"_id" : "AXI9K_cggA4T9jvjZc03",
"_score" : 5.44714,
"_source" : {
"#timestamp" : "2020-05-21T02:59:25.373Z",
"offset" : 34395681,
"beat" : {
"hostname" : "4c80d1588455-661e-7054-a4e5-73c821d7",
"name" : "4c80d1588455-661e-7054-a4e5-73c821d7",
"version" : "6.1.2"
},
"prospector" : {
"type" : "log"
},
"source" : "/var/logs/packages/gateway_mt/1a27957180c2b57a53e76dd686a06f4983bf233f/logs/gateway_mt.log",
"message" : "[2020-05-21 02:59:25.373] ERROR can_gateway_mt [ActiveMT SNAP Worker 18253] --- ClientIdAuthenticationFilter: Cannot authorize publishing from client ThingPayload_4
325334a89c9 : not authorized",
"fileset" : {
"module" : "can",
"name" : "services"
},
"fields" : { },
"can" : {
"component" : "can_gateway_mt",
"instancename" : "canservices/0",
"level" : "ERROR",
"thread" : "ActiveMT SNAP Worker 18253",
"message" : "Cannot authorize publishing from client ThingPayload_4325334a89c9 : not authorized",
"class" : "ClientIdAuthenticationFilter",
"timestamp" : "2020-05-21 02:59:25.373",
"deployment" : "can-prod"
}
}
}
]
}
}
Expected output:
I am trying to get a dump of the whole document that matches the criteria, something like the sample doc above.
"query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment": ["can-prod", "can-test", "can-dev"]
}
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-05-02T17:22:29.069Z",
"lt": "2020-05-23T17:23:29.069Z"
}
}
}, {
"terms": {
"can.level": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
I suppose the above search query didn't work because your fields can.deployment, can.level and can.class are text fields. Elasticsearch analyzes text fields with the standard analyzer by default, which splits the text into tokens and converts them to lowercase. You can read more about it here.
In your case, for example, the can.deployment field value can-prod would be analyzed as
{
"tokens": [
{
"token": "can",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "prod",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
}
]
}
A terms query matches exact words (a case-sensitive search), but since Elasticsearch analyzes your text, splitting it and converting it to lowercase, you are not able to find the exact search text.
To solve this, while creating the mapping of the index for these 3 fields (can.deployment, can.level and can.class), you can create a keyword sub-field, which basically tells Elasticsearch not to analyze the field and to store it as it is.
You can create the mapping for these 3 fields something like this:
Mapping:
"mappings": {
"properties": {
"can.class": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"can.deployment": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"can.level": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Now you can perform a terms search using these keyword fields.
Search query:
{ "query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment.keyword": ["can-prod", "can-test", "can-dev"]
}
},
"filter": [ {
"terms": {
"can.level.keyword": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class.keyword": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
}
This terms query will only work for case-sensitive searches. You can read more about it here.
If you want to do a case-insensitive search, you can use a match query instead.
Search query:
{
"query": {
"bool": {
"must": [
{
"match": {
"level": "warn error"
}
},
{
"match": {
"class": "MTMessage ParserService JsonParser"
}
},
{
"match": {
"deployment": "can-test can-prod can-dev"
}
}
]
}
}
}
This works because Elasticsearch by default analyzes the match query text with the same analyzer as the index analyzer. Since in your case that is the standard analyzer, it will lowercase the match query text and split it into tokens. You can read more about it here.
For example, the search value MTMessage ParserService JsonParser will get analyzed internally as:
{
"tokens": [
{
"token": "mtmessage",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "parserservice",
"start_offset": 10,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "jsonparser",
"start_offset": 24,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 2
}
]
}
and since the values in your documents were analyzed in the same way, they will match.
There is one issue here: the value can-test can-prod can-dev will get analyzed as:
{
"tokens": [
{
"token": "can",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "test",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "can",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "prod",
"start_offset": 13,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "can",
"start_offset": 18,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "dev",
"start_offset": 22,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
}
]
}
Now, if your index contains a document like this:
{
"can.deployment": "can",
"can.level": "WARN",
"can.class": "JsonParser"
}
then this document will also be shown in your search results.
So, based on what kind of search you want to perform and what kind of data you have, you can decide whether to use a terms query or a match query.
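If the time range from the original query is still needed, it can simply be added back to the filter clause next to the keyword-based terms filters, for example (a sketch reusing the dates from the question):
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": {
        "terms": {
          "can.deployment.keyword": ["can-prod", "can-test", "can-dev"]
        }
      },
      "filter": [
        {
          "range": {
            "#timestamp": {
              "gte": "2020-05-02T17:22:29.069Z",
              "lt": "2020-05-23T17:23:29.069Z"
            }
          }
        },
        {
          "terms": {
            "can.level.keyword": ["WARN", "ERROR"]
          }
        },
        {
          "terms": {
            "can.class.keyword": ["MTMessage", "ParserService", "JsonParser"]
          }
        }
      ]
    }
  }
}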

Elasticsearch query_string problem with lowercase and underscore

I created a custom analyzer in Elasticsearch. I want to split the words in my field "my_field" only on whitespace, and my search needs to be case insensitive, so for that I used the lowercase filter.
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"trim"
]
}
}
}
},
"mappings" : {
"my_type" : {
"properties" : {
"my_field" : {
"type" : "string",
"analyzer" : "my_custom_analyzer"
}
}
}
}
}
After creating the index, I analyze my sample data:
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "my_Sample_TEXT"
}
and the output is
{
"tokens": [
{
"token": "my_sample_text",
"start_offset": 0,
"end_offset": 14,
"type": "word",
"position": 0
}
]
}
I have many documents where "my_field" contains "my_Sample_TEXT", but when I search for this text using query_string, the result returns 0 hits:
GET my_index/_search
{
"query": {
"query_string" : {
"default_field" : "my_type",
"query" : "*my_sample_text*",
"analyzer" : "my_custom_analyzer",
"enable_position_increments": true,
"default_operator": "AND"
}
}
}
My result is:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
I found that this problem happens when my text has underscores and uppercase letters. Can anyone help me fix this problem?
Could you try changing the mapping of "my_field" as below:
"my_field" : {
"type" : "string",
"analyzer" : "my_custom_analyzer"
"search_analyzer": "my_custom_analyzer"
}
Because ES uses the standard analyzer when you do not set any analyzer, and the standard analyzer can create multiple tokens from your underscored text.
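For completeness, a sketch of a query_string search after applying the suggested mapping change, with default_field pointed at the analyzed field from the mapping (the field and analyzer names are taken from the question):
GET my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "my_field",
      "query": "my_Sample_TEXT"
    }
  }
}
With my_custom_analyzer used at both index and search time, the query text is lowercased to my_sample_text and matches the stored token.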

Elasticsearch 1.6 field norm calculation with shingle filter

I am trying to understand the fieldNorm calculation in Elasticsearch (1.6) for documents indexed with a shingle analyzer - it does not seem to include the shingled terms. If so, is it possible to configure the calculation to include them? Specifically, this is the analyzer I used:
{
"index" : {
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
This is the mapping used:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer"}
}
}
}
And I posted a few documents:
{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...
When using the following query with the explain API,
{
"query": {
"match": {
"text" : "the"
}
}
}
I get the following fieldnorms (other details omitted for brevity):
"_source": {
"text": "the quick"
},
"_explanation": {
"value": 0.625,
"description": "fieldNorm(doc=0)"
}
"_source": {
"text": "the quick brown fox jumps over the"
},
"_explanation": {
"value": 0.375,
"description": "fieldNorm(doc=0)"
}
The values seem to suggest that ES sees 2 terms for the 1st document ("the quick") and 7 terms for the 2nd document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to calculate the field norm with the shingled terms too (i.e. all terms returned by the analyzer)?
You would need to customize the default similarity by disabling the discount_overlaps flag.
Example:
{
"index" : {
"similarity" : {
"no_overlap" : {
"type" : "default",
"discount_overlaps" : false
}
},
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
Mapping:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer", "similarity
" : "no_overlap"}
}
}
}
To expand further:
By default, overlaps (i.e. tokens with a position increment of 0) are ignored when computing the norm.
The example below shows the positions of the tokens generated by the "my_analyzer" described in the OP:
get <index_name>/_analyze?field=text&text=the quick
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "the quick",
"start_offset": 0,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "quick",
"start_offset": 4,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2
}
]
}
According to the Lucene documentation, the length norm calculation for the default similarity is implemented as follows:
state.getBoost() * lengthNorm(numTerms)
where numTerms is
FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, and
FieldInvertState.getLength() - FieldInvertState.getNumOverlap() otherwise.
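As a rough cross-check (assuming the classic TF/IDF length norm of 1/sqrt(numTerms), lossily encoded into a single byte), the values in the explain output above line up with the term counts that exclude overlaps:
lengthNorm(2) = 1/sqrt(2) ≈ 0.707, stored as 0.625 ("the quick")
lengthNorm(7) = 1/sqrt(7) ≈ 0.378, stored as 0.375 ("the quick brown fox jumps over the")
With discount_overlaps set to false, "the quick" would instead count 3 terms (the shingle included), giving 1/sqrt(3) ≈ 0.577, which the byte encoding would presumably store as 0.5.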
