How to match with multiple inputs in Elasticsearch - elasticsearch

I'm trying to query all possible logs across 3 environments (dev, test, prod) with the below query using terms. I have tried both must and should.
curl -vs -X POST "http://localhost:9200/*/_search?pretty=true" -H 'Content-Type: application/json' -d '
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment": ["can-prod", "can-test", "can-dev"]
}
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-05-02T17:22:29.069Z",
"lt": "2020-05-23T17:23:29.069Z"
}
}
}, {
"terms": {
"can.level": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
}'
gives:
{
"took" : 871,
"timed_out" : false,
"_shards" : {
"total" : 391,
"successful" : 389,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
However, if I replace terms with match it works, but then I can't query with the other inputs, e.g. querying WARN messages, querying logs related to the ParserService class, etc.:
curl -vs -X POST "http://localhost:9200/*/_search?pretty=true" -H 'Content-Type: application/json' -d '
{
"query": {
"bool": {
"should":
[{"match": {"can.deployment": "can-prod"}}],
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-03-20T17:22:29.069Z",
"lt": "2020-05-01T17:23:29.069Z"
}
}
},{
"match": {
"can.level": "ERROR"
}
},{
"match": {
"can.class": "MTMessage"
}
}
]
}
}
}'
How do I accomplish this, with or without terms/match?
I also tried this, with no luck; I get 0 search results:
"match": {
"can.level": "ERROR"
}
},{
"match": {
"can.level": "WARN"
}
},{
"match": {
"can.class": "MTMessage"
}
}
Any hints will certainly help. TIA!
[EDIT]
Adding mappings (/_mapping?pretty=true):
"can" : {
"properties" : {
"class" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"deployment" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"level" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Adding sample docs:
{
"took" : 50,
"timed_out" : false,
"_shards" : {
"total" : 391,
"successful" : 387,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 5.44714,
"hits" : [
{
"_index" : "filebeat-6.1.2-2020.05.21",
"_type" : "doc",
"_id" : "AXI9K_cggA4T9jvjZc03",
"_score" : 5.44714,
"_source" : {
"#timestamp" : "2020-05-21T02:59:25.373Z",
"offset" : 34395681,
"beat" : {
"hostname" : "4c80d1588455-661e-7054-a4e5-73c821d7",
"name" : "4c80d1588455-661e-7054-a4e5-73c821d7",
"version" : "6.1.2"
},
"prospector" : {
"type" : "log"
},
"source" : "/var/logs/packages/gateway_mt/1a27957180c2b57a53e76dd686a06f4983bf233f/logs/gateway_mt.log",
"message" : "[2020-05-21 02:59:25.373] ERROR can_gateway_mt [ActiveMT SNAP Worker 18253] --- ClientIdAuthenticationFilter: Cannot authorize publishing from client ThingPayload_4
325334a89c9 : not authorized",
"fileset" : {
"module" : "can",
"name" : "services"
},
"fields" : { },
"can" : {
"component" : "can_gateway_mt",
"instancename" : "canservices/0",
"level" : "ERROR",
"thread" : "ActiveMT SNAP Worker 18253",
"message" : "Cannot authorize publishing from client ThingPayload_4325334a89c9 : not authorized",
"class" : "ClientIdAuthenticationFilter",
"timestamp" : "2020-05-21 02:59:25.373",
"deployment" : "can-prod"
}
}
}
]
}
}
Expected output:
I'm trying to get a dump of the whole document that matches the criteria, something like the above sample doc.

"query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment": ["can-prod", "can-test", "can-dev"]
}
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-05-02T17:22:29.069Z",
"lt": "2020-05-23T17:23:29.069Z"
}
}
}, {
"terms": {
"can.level": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
I suppose the above search query didn't work because your fields can.deployment, can.level and can.class are text fields. Elasticsearch analyzes text fields with the standard analyzer by default, which splits the text into tokens on word boundaries and converts them to lowercase. You can read more about this in the Elasticsearch documentation on the standard analyzer.
In your case, for example, the can.deployment field value can-prod would be analyzed as:
{
"tokens": [
{
"token": "can",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "prod",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
}
]
}
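You can reproduce this output yourself with the _analyze API (a quick sketch; no index is needed since only the built-in standard analyzer is involved):
POST _analyze
{
  "analyzer": "standard",
  "text": "can-prod"
}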
The terms query matches exact terms (a case-sensitive search), but since Elasticsearch has analyzed your text, splitting it apart and lowercasing it, your exact search values are never found in the index.
To solve this, when creating the index mapping for these 3 fields (can.deployment, can.level and can.class) you can add a keyword sub-field, which tells Elasticsearch not to analyze the value and to index it as-is.
You can create a mapping for these 3 fields something like this:
Mapping :
"mappings": {
"properties": {
"can.class": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"can.deployment": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"can.level": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
and now you can perform a terms search using these keyword fields:
Search Query :
{ "query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment.keyword": ["can-prod", "can-test", "can-dev"]
}
},
"filter": [ {
"terms": {
"can.level.keyword": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class.keyword": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
}
This terms query only works for exact, case-sensitive values. You can read more about it in the terms query documentation.
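For completeness, the keyword-based terms query can also be combined with the #timestamp range filter from the original question (a sketch reusing the question's endpoint and dates; the Content-Type header is included because recent Elasticsearch versions require it for JSON bodies):
curl -s -X POST "http://localhost:9200/*/_search?pretty=true" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": {
        "terms": {
          "can.deployment.keyword": ["can-prod", "can-test", "can-dev"]
        }
      },
      "filter": [{
        "range": {
          "#timestamp": {
            "gte": "2020-05-02T17:22:29.069Z",
            "lt": "2020-05-23T17:23:29.069Z"
          }
        }
      }, {
        "terms": {
          "can.level.keyword": ["WARN", "ERROR"]
        }
      }, {
        "terms": {
          "can.class.keyword": ["MTMessage", "ParserService", "JsonParser"]
        }
      }]
    }
  }
}'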

If you want to do a case-insensitive search, you can use a match query instead:
Search Query :
{
"query": {
"bool": {
"must": [
{
"match": {
"level": "warn error"
}
},
{
"match": {
"class": "MTMessage ParserService JsonParser"
}
},
{
"match": {
"deployment": "can-test can-prod can-dev"
}
}
]
}
}
}
This works because Elasticsearch by default analyzes the match query text with the same analyzer that was used for the field at index time. Since in your case that is the standard analyzer, it lowercases the match query text and splits it into individual terms. You can read more about this in the match query documentation.
For example, the search value MTMessage ParserService JsonParser will be analyzed internally as:
{
"tokens": [
{
"token": "mtmessage",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "parserservice",
"start_offset": 10,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "jsonparser",
"start_offset": 24,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 2
}
]
}
and since the values of this field in your documents were analyzed the same way, they will match.
There is one issue here, though: the value can-test can-prod can-dev will be analyzed as:
{
"tokens": [
{
"token": "can",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "test",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "can",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "prod",
"start_offset": 13,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "can",
"start_offset": 18,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "dev",
"start_offset": 22,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
}
]
}
Now, if your index contains a document like this:
{
"can.deployment": "can",
"can.level": "WARN",
"can.class": "JsonParser"
}
then this document will also show up in your search results.
So, based on the kind of search you want to perform and the data you have, you can decide whether to use a terms query or a match query.
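If you prefer the case-insensitive behaviour of match but want to avoid that partial-token problem, one option (my own sketch, not part of the original answer) is to send one match clause per value with "operator": "and", so a document has to contain every token of at least one complete value:
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "match": { "can.deployment": { "query": "can-prod", "operator": "and" } } },
        { "match": { "can.deployment": { "query": "can-test", "operator": "and" } } },
        { "match": { "can.deployment": { "query": "can-dev",  "operator": "and" } } }
      ]
    }
  }
}
With this shape, a document whose can.deployment only contains the token "can" no longer matches, because "prod", "test" or "dev" would also be required.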

Related

elasticsearch fuzzy query seems to ignore brazilian stopwords

I have stopwords for Brazilian Portuguese configured in my index, but if I search for the term "ios" (it's an iOS course), a bunch of other documents are returned, because the term "nos" (a Brazilian stopword) seems to be treated as a valid term for the fuzzy query.
But if I search just for the term "nos", nothing is returned. Shouldn't the iOS course then not be returned by the fuzzy query either? I'm confused.
Is there any alternative? The main goal is that when a user searches for "ios", documents that only match through a stopword like "nos" are not returned, while I keep the fuzziness for other, more complex searches made by users.
An example of query:
GET /index/_search
{
"explain": true,
"query": {
"bool" : {
"must" : [
{
"terms" : {
"document_type" : [
"COURSE"
],
"boost" : 1.0
}
},
{
"multi_match" : {
"query" : "ios",
"type" : "best_fields",
"operator" : "OR",
"slop" : 0,
"fuzziness" : "AUTO",
"prefix_length" : 0,
"max_expansions" : 50,
"zero_terms_query" : "NONE",
"auto_generate_synonyms_phrase_query" : true,
"fuzzy_transpositions" : true,
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
}
part of explain query:
"description": "weight(corpo:nos in 52) [PerFieldSimilarity], result of:",
[image with the stopword configuration]
thanks
I tried adding a prefix length, but I want the stopwords to be ignored.
I believe the correct way to handle stopwords per language is shown below:
PUT idx_teste
{
"settings": {
"analysis": {
"filter": {
"brazilian_stop_filter": {
"type": "stop",
"stopwords": "_brazilian_"
}
},
"analyzer": {
"teste_analyzer": {
"tokenizer": "standard",
"filter": ["brazilian_stop_filter"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "teste_analyzer"
}
}
}
}
POST idx_teste/_analyze
{
"analyzer": "teste_analyzer",
"text":"course nos advanced"
}
Look term "nos" was removed.
{
"tokens": [
{
"token": "course",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "advanced",
"start_offset": 11,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
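Once the name field is indexed with teste_analyzer, the stopword "nos" never makes it into the index, so a fuzzy search for "ios" can no longer match documents through that term (a sketch against the idx_teste index defined above; the multi_match options mirror the question's query):
GET idx_teste/_search
{
  "query": {
    "multi_match": {
      "query": "ios",
      "fields": ["name"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}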

ElasticSearch - Search without apostrophe

I'm trying to allow users to search without entering an apostrophe.
E.g. typing Johns should still bring up results for John's.
I've tried multiple things including adding the stemmer filter but with no luck.
I thought I could potentially do something manual such as
GET /_analyze
{
"char_filter": [{
"type": "pattern_replace",
"pattern": "\\s*([a-zA-Z0-9]+)\\'s",
"replacement": "$1 $1s $1's "
}],
"tokenizer": "standard",
"text": "john's dog jumped"
}
And I get the following response:
{
"tokens" : [
{
"token" : "john",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "johns",
"start_offset" : 5,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "john's",
"start_offset" : 5,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "dog",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "jumped",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
However, I still don't get a match when I search for "johns" without the apostrophe.
My settings look like:
"analyzer" : {
"my_custom_search" : {
"char_filter" : [ "flexible_plurals" ],
"tokenizer" : "standard"
}
},
"char_filter" : {
"flexible_plurals" : {
"pattern" : """\s*([a-zA-Z0-9]+)\'s""",
"type" : "pattern_replace",
"replacement" : " $1 $1s $1's "
}
}
My mapping looks like:
"search-terms" : {
"type" : "text",
"analyzer" : "my_custom_search"
}
I am using the match query to query the data
You are almost correct. Make sure you are using the match query and that you have defined your field as text with the custom analyzer. If you use a text field without your custom analyzer (the one that uses your char_filter), Elasticsearch simply applies the standard analyzer, the johns token is never generated, and hence there is no match.
Complete Working example
Index setting and mapping
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"apostrophe_filter": {
"type": "pattern_replace",
"pattern": "\\s*([a-zA-Z0-9]+)\\'s",
"replacement": "$1 $1s $1's "
}
},
"analyzer": {
"custom_analyzer": {
"filter": [
"lowercase"
],
"char_filter": [
"apostrophe_filter"
],
"tokenizer": "standard"
}
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
Index sample document
{
"title" : "john's"
}
And search for johns
{
"query": {
"match": {
"title": "johns"
}
}
}
Search results
"hits": [
{
"_index": "72937076",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "john's" --> note `john's`
}
}
]
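You can also double-check what the custom analyzer produces for the stored value (a sketch; 72937076 is the index name that appears in the search results above):
GET 72937076/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "john's dog jumped"
}
This should return john, johns and john's as separate tokens, which is why the plain "johns" query now matches.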

elasticsearch match_phrase query for exact sub-string search

I used a match_phrase query for full-text matching, but it did not work as I expected.
Query:
POST /_search
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"browsing_url": "/critical-illness"
}
}
],
"minimum_should_match": 1
}
}
}
Results:
"hits" : [
{
"_source" : {
"browsing_url" : "https://www.google.com/url?q=https://industrytoday.co.uk/market-research-industry-today/global-critical-illness-commercial-insurance-market-to-witness-a-pronounce-growth-during-2020-2025&usg=afqjcneelu0qvjfusnfjjte1wx0gorqv5q"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness&tbm=nws"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness+-insurance%3f"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness%3f"
}
}
]
Expectation:
I only want results where the given string appears as an exact substring of the field. For example:
https://www.example.com/critical-illness OR
https://www.example.com/critical-illness-insurance
Mapping:
"browsing_url": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
The results are not what I expected. I expected to get only documents where the search string /critical-illness appears as a substring of the stored text.
The reason you're seeing unexpected results is because both your search query, and the field itself, are being run through an analyzer. Analyzers will break down text into a list of individual terms that can be searched on. Here's an example using the _analyze endpoint:
GET _analyze
{
"analyzer": "standard",
"text": "example.com/critical-illness"
}
{
"tokens" : [
{
"token" : "example.com",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "critical",
"start_offset" : 12,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "illness",
"start_offset" : 21,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
So while your document's true value is example.com/critical-illness, behind the scenes Elasticsearch will only use this list of tokens for matches. The same goes for your search query, since you're using match_phrase, which tokenizes the phrase passed in. The end result is Elasticsearch trying to match the token list ["critical", "illness"] against your documents' token lists.
Most of the time the standard analyzer does a good job of removing unnecessary tokens; however, in your case you care about characters like / since you want to match against them. One way to solve this is to use a different analyzer, such as a reversed path hierarchy analyzer. Below is an example of how to configure this analyzer and use it for your browsing_url field:
PUT /browse_history
{
"settings": {
"analysis": {
"analyzer": {
"url_analyzer": {
"tokenizer": "url_tokenizer"
}
},
"tokenizer": {
"url_tokenizer": {
"type": "path_hierarchy",
"delimiter": "/",
"reverse": true
}
}
}
},
"mappings": {
"properties": {
"browsing_url": {
"type": "text",
"norms": false,
"analyzer": "url_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
If you analyze a URL, you'll now see the URL paths kept whole:
GET browse_history/_analyze
{
"analyzer": "url_analyzer",
"text": "example.com/critical-illness?src=blah"
}
{
"tokens" : [
{
"token" : "example.com/critical-illness?src=blah",
"start_offset" : 0,
"end_offset" : 37,
"type" : "word",
"position" : 0
},
{
"token" : "critical-illness?src=blah",
"start_offset" : 12,
"end_offset" : 37,
"type" : "word",
"position" : 0
}
]
}
This lets you do a match_phrase_prefix to find all documents with URLs that contain a critical-illness path:
POST /browse_history/_search
{
"query": {
"match_phrase_prefix": {
"browsing_url": "critical-illness"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.7896894,
"hits" : [
{
"_index" : "browse_history",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.7896894,
"_source" : {
"browsing_url" : "https://www.example.com/critical-illness"
}
}
]
}
}
EDIT:
The previous revision of this answer used the keyword field with a regexp query; however, that is a pretty costly query to run.
POST /browse_history/_search
{
"query": {
"regexp": {
"browsing_url.keyword": ".*/critical-illness"
}
}
}

Why this elasticsearch query doesn't return anything

I performed the below Elasticsearch query.
GET amasyn/_search
{
"query": {
"bool" : {
"filter" : {
"term": {"ordernumber": "112-9550919-9141020"}
}
}
}
}
But it does not get any hits
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
But I do have a document with this ordernumber in the index. ordernumber is a text field.
When I change the above query by replacing term with match, I do get hits for the given query.
Please explain what's happening here and how to solve it.
This is because you defined the ordernumber field with type text, so it is getting analyzed. For the difference between text and keyword, please refer to this answer: Difference between keyword and text in ElasticSearch.
You can define both text and keyword for your ordernumber field like this:
Mapping
{
"mappings": {
"properties": {
"ordernumber": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
and then you can use term query as below :
{
"query": {
"bool" : {
"filter" : {
"term": {"ordernumber.keyword": "112-9550919-9141020"}
}
}
}
}
See below how the text and keyword fields tokenize your value.
Standard analyzer
This analyzer is used when you define your field as text.
POST _analyze
{
"analyzer": "standard",
"text" : "112-9550919-9141020"
}
result :
{
"tokens": [
{
"token": "112",
"start_offset": 0,
"end_offset": 3,
"type": "<NUM>",
"position": 0
},
{
"token": "9550919",
"start_offset": 4,
"end_offset": 11,
"type": "<NUM>",
"position": 1
},
{
"token": "9141020",
"start_offset": 12,
"end_offset": 19,
"type": "<NUM>",
"position": 2
}
]
}
Keyword Analyzer
This analyzer is used when you define your field as keyword.
POST _analyze
{
"analyzer": "keyword",
"text" : "112-9550919-9141020"
}
Result
{
"tokens": [
{
"token": "112-9550919-9141020",
"start_offset": 0,
"end_offset": 19,
"type": "word",
"position": 0
}
]
}
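If the amasyn index already exists, the keyword sub-field can be added to the live mapping and then back-filled for documents indexed before the change (a sketch assuming the index name from the question; existing documents only pick up the new sub-field after being reindexed or updated in place):
PUT amasyn/_mapping
{
  "properties": {
    "ordernumber": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

POST amasyn/_update_by_query?conflicts=proceed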

elasticsearch 1.6 field norm calculation with shingle filter

I am trying to understand the fieldnorm calculation in elasticsearch (1.6) for documents indexed with a shingle analyzer - it does not seem to include shingled terms. If so, is it possible to configure the calculation to include the shingled terms? Specifically, this is the analyzer I used:
{
"index" : {
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
This is the mapping used:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer"}
}
}
}
And I posted a few documents:
{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...
When using the following query with the explain API,
{
"query": {
"match": {
"text" : "the"
}
}
}
I get the following fieldnorms (other details omitted for brevity):
"_source": {
"text": "the quick"
},
"_explanation": {
"value": 0.625,
"description": "fieldNorm(doc=0)"
}
"_source": {
"text": "the quick brown fox jumps over the"
},
"_explanation": {
"value": 0.375,
"description": "fieldNorm(doc=0)"
}
The values seem to suggest that ES sees 2 terms for the 1st document ("the quick") and 7 terms for the 2nd document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to calculate field norm with the shingled terms too (ie. all terms returned by the analyzer)?
You would need to customize the default similarity by disabling the discount overlap flag.
Example:
{
"index" : {
"similarity" : {
"no_overlap" : {
"type" : "default",
"discount_overlaps" : false
}
},
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
Mapping:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer", "similarity
" : "no_overlap"}
}
}
}
To expand further:
By default, overlaps, i.e. tokens with a position increment of 0, are ignored when computing the norm.
The example below shows the positions of the tokens generated by the "my_analyzer" described in the OP:
get <index_name>/_analyze?field=text&text=the quick
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "the quick",
"start_offset": 0,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "quick",
"start_offset": 4,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2
}
]
}
According to the Lucene documentation, the length norm calculation for the default similarity is implemented as follows:
state.getBoost()*lengthNorm(numTerms)
where numTerms is
if setDiscountOverlaps(boolean) is false
FieldInvertState.getLength()
else
FieldInvertState.getLength() - FieldInvertState.getNumOverlap()
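Putting numbers on the "the quick" document from the question (my own arithmetic, assuming the classic 1/sqrt(numTerms) length norm of the default similarity; stored norms are lossily encoded, which is why the observed values are rounded):
// my_analyzer emits 3 tokens for "the quick": [the], [the quick], [quick]
// the shingle [the quick] has position increment 0, i.e. it counts as an overlap
discount_overlaps = true  (default): numTerms = 3 - 1 = 2  ->  1/sqrt(2) ≈ 0.707, observed as fieldNorm 0.625
discount_overlaps = false          : numTerms = 3          ->  1/sqrt(3) ≈ 0.577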
