Elasticsearch range query does not work with icu_collation for Turkish words - elasticsearch

I have documents whose customer_address property contains Turkish words such as "şa, za, sb, şc, sd, şe".
I indexed the documents as described in the documentation linked below, because I want to sort them by the customer_address field. Sorting works well.
Sorting and Collations
Now I'm trying to run a range query over the "customer_address" field. When I send the query below, I get an empty result. (expected result: sb, sd, şa, şd)
curl -XGET http://localhost:9200/sampleindex/_search?pretty -d '{"query":{"bool":{"filter":[{"range":{"customer_address.sort":{"from":"plaj","to":"şcam","include_lower":true,"include_upper":true,"boost":1.0}}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}}}'
When I queried the index, I saw that my field values are stored in an encoded form (collation keys), as the documentation describes.
curl -XGET http://localhost:9200/sampleindex/_search?pretty -d '{"aggs":{"myaggregation":{"terms":{"field":"customer_address.sort","size":10000}}},"size":0}'
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"a" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "⚕䁁䀠怀\u0001",
"doc_count" : 1
},
{
"key" : "⚗䁁䀠怀\u0001",
"doc_count" : 1
},
{
"key" : "✁ੀ⃀ၠ\u0000\u0000",
"doc_count" : 1
},
{
"key" : "✁ୀ⃀ၠ\u0000\u0000",
"doc_count" : 1
},
{
"key" : "✁ీ⃀ၠ\u0000\u0000",
"doc_count" : 1
},
{
"key" : "ⶔ䁁䀠怀\u0001",
"doc_count" : 1
}
]
}
}
}
So, how should I pass the bounds in the range query to get the expected result?
Thanks in advance.
My Mapping:
curl -XGET http://localhost:9200/sampleindex?pretty
{
"sampleindex" : {
"aliases" : { },
"mappings" : {
"invoice" : {
"properties" : {
"customer_address" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
},
"sort" : {
"type" : "text",
"analyzer" : "turkish",
"fielddata" : true
}
}
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : "5",
"provided_name" : "sampleindex",
"max_result_window" : "2147483647",
"creation_date" : "1521732167023",
"analysis" : {
"filter" : {
"turkish_phonebook" : {
"variant" : "#collation=phonebook",
"country" : "TR",
"language" : "tr",
"type" : "icu_collation"
},
"turkish_lowercase" : {
"type" : "lowercase",
"language" : "turkish"
}
},
"analyzer" : {
"turkish" : {
"filter" : [
"turkish_lowercase",
"turkish_phonebook"
],
"tokenizer" : "keyword"
}
}
},
"number_of_replicas" : "1",
"uuid" : "ChNGX459TUi8VnBLTMn-Ng",
"version" : {
"created" : "5020099"
}
}
}
}
}

I solved my problem by defining an analyzer with a char filter at index creation. I don't know whether it is a good solution, but I could not solve it with the ICU "turkish_phonebook" filter, and this approach seems to work for now.
First, I created the index with a "turkish_collation_analyzer". Then, for each property that needs it, I added a sub-field "property.tr" that uses this analyzer. Finally, in range queries, I convert the bounds into the form this field expects.
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "sampleindex",
"max_result_window": "2147483647",
"creation_date": "1522050241730",
"analysis": {
"analyzer": {
"turkish_collation_analyzer": {
"char_filter": [
"turkish_char_filter"
],
"tokenizer": "keyword"
}
},
"char_filter": {
"turkish_char_filter": {
"type": "mapping",
"mappings": [
"a => x01",
"b => x02",
.,
.,
.,
]
}
}
},
"number_of_replicas": "1",
"uuid": "hiEqIpjYTLePjF142B8WWQ",
"version": {
"created": "5020099"
}
}
}
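The idea behind the char-filter mapping can be sketched outside Elasticsearch. The Python below is only an illustration, not the analyzer itself: TURKISH_ALPHABET, turkish_lower and sort_key are hypothetical helpers, and the sketch assumes each letter is replaced by a code point equal to its position in the Turkish alphabet. It shows why the range bounds have to go through the same conversion as the indexed values.

```python
# Illustrative only: emulates a letter-to-rank mapping like the char filter above.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"

# Each letter maps to a control character whose code point is its alphabet rank.
RANK = {letter: chr(rank) for rank, letter in enumerate(TURKISH_ALPHABET, start=1)}


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules for dotted and dotless I."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def sort_key(text: str) -> str:
    """Convert a value to the form that would be stored in the index."""
    return "".join(RANK.get(ch, ch) for ch in turkish_lower(text))


words = ["şa", "za", "sb", "şc", "sd", "şe"]

# Plain code-point order puts "z" before "ş"; the mapped keys do not.
codepoint_order = sorted(words)
turkish_order = sorted(words, key=sort_key)

# Range bounds are converted with the same function as the indexed values.
lower, upper = sort_key("plaj"), sort_key("şcam")
in_range = [w for w in turkish_order if lower <= sort_key(w) <= upper]

print(codepoint_order)
print(turkish_order)
print(in_range)
```

Comparing the raw strings "plaj" and "şcam" against converted values would mix two different encodings, which is one plausible reason an unconverted range query comes back empty.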

Related

Why does elastic search wildcard query return no results?

Query #1 in Kibana returns results, but Query #2 returns none. When I search for only "bob" I get results, but when I search for "bob smith" I get none, even though "Bob Smith" exists in the index. Why?
Query #1: returns results
GET people/_search
{
"query": {
"wildcard" : {
"name" : "*bob*"
}
}
}
Results:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 23,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "people",
"_type" : "_doc",
"_id" : "xxxxx",
"_score" : 1.0,
"_source" : {
"name" : "Bob Smith",
...
Query #2: returns nothing. Why?
GET people/_search
{
"query": {
"wildcard" : {
"name" : "*bob* *smith*"
}
}
}
Results: nothing
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
The empty result appears to be caused by your index mapping. With a "text" field you search the inverted index, which means you search the separate tokens "bob" and "smith" (as produced by the standard analyzer) rather than the whole value "Bob Smith". A wildcard pattern is matched against one token at a time, so a pattern that contains a space cannot match. If you want to search "Bob Smith" as a single token, use the "keyword" type (possibly with a lowercase normalizer, if you want a case-insensitive search).
For example:
PUT test
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"char_filter": [],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "lowercase_normalizer"
}
}
}
}
PUT test/_doc/1
{
"name" : "Bob Smith"
}
GET test/_search
{
"query": {
"wildcard": {
"name": "*bob* *Smith*"
}
}
}
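The difference between the two field types can be illustrated without a cluster. This is a rough Python sketch, not Elasticsearch code: text_terms, keyword_terms and wildcard_hits are hypothetical helpers, the regular expression only approximates the standard analyzer, and fnmatchcase stands in for wildcard matching (it also accepts [seq] patterns, which Elasticsearch wildcards do not).

```python
import re
from fnmatch import fnmatchcase


def text_terms(value: str) -> list:
    """Rough stand-in for the standard analyzer: split into lowercase words."""
    return [word.lower() for word in re.findall(r"\w+", value)]


def keyword_terms(value: str) -> list:
    """Keyword field with a lowercase normalizer: one term for the whole value."""
    return [value.lower()]


def wildcard_hits(terms: list, pattern: str) -> bool:
    """A wildcard pattern has to match a single indexed term in full."""
    return any(fnmatchcase(term, pattern) for term in terms)


name = "Bob Smith"

# Text field: terms are "bob" and "smith", so no single term contains a space.
print(wildcard_hits(text_terms(name), "*bob*"))
print(wildcard_hits(text_terms(name), "*bob* *smith*"))

# Keyword field: the single term "bob smith" contains the space.
print(wildcard_hits(keyword_terms(name), "*bob* *smith*"))
```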

ElasticSearch join data within the same index

I am quite new to Elasticsearch and I am collecting some application logs in a single index. They have this format:
{
"_index" : "app_logs",
"_type" : "_doc",
"_id" : "JVMYi20B0a2qSId4rt12",
"_source" : {
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "STARTED",
"ts" : "2019-10-02T08:11:53Z"
}
}
I can have different event types. In this case I am interested in STARTED and FINISHED. I would like to query ES to get all the apps that started on a certain day and enrich them with their end time. Basically I want to create start/end pairs (an end might be missing, but that's fine).
I have realized that SQL-style joins cannot be used in ES, and I was wondering whether I can use some other feature to get this result in one query.
Edit: these are the details of the index mapping
{
"app_logs" : {
"mappings" : {
"_doc" : {
"properties" : {
"event_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"app_id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"ts" : {
"type" : "date"
},
"event_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}}}}
As I understand it, you want to collect the documents that share the same app_id and have a status of either STARTED or FINISHED.
Elasticsearch is not meant to perform JOIN operations. You can emulate them, but then you have to design your documents as described in this link.
What you need is an aggregation query.
Below are a sample mapping, sample documents, the aggregation query, and its response, which should help you get the desired result.
Mapping:
PUT mystatusindex
{
"mappings": {
"properties": {
"username":{
"type": "keyword"
},
"app_id":{
"type": "keyword"
},
"event_type":{
"type":"keyword"
},
"ts":{
"type": "date"
}
}
}
}
Sample Documents
POST mystatusindex/_doc/1
{
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "STARTED",
"ts" : "2019-10-02T08:11:53Z"
}
POST mystatusindex/_doc/2
{
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "FINISHED",
"ts" : "2019-10-02T08:12:53Z"
}
POST mystatusindex/_doc/3
{
"username" : "mapred",
"app_id" : "application_1569623930006_490201",
"event_type" : "STARTED",
"ts" : "2019-10-02T09:30:53Z"
}
POST mystatusindex/_doc/4
{
"username" : "mapred",
"app_id" : "application_1569623930006_490202",
"event_type" : "STARTED",
"ts" : "2019-10-02T09:45:53Z"
}
POST mystatusindex/_doc/5
{
"username" : "mapred",
"app_id" : "application_1569623930006_490202",
"event_type" : "FINISHED",
"ts" : "2019-10-02T09:45:53Z"
}
POST mystatusindex/_doc/6
{
"username" : "mapred",
"app_id" : "application_1569623930006_490203",
"event_type" : "STARTED",
"ts" : "2019-10-03T09:30:53Z"
}
POST mystatusindex/_doc/7
{
"username" : "mapred",
"app_id" : "application_1569623930006_490203",
"event_type" : "FINISHED",
"ts" : "2019-10-03T09:45:53Z"
}
Query:
POST mystatusindex/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"ts": {
"gte": "2019-10-02T00:00:00Z",
"lte": "2019-10-02T23:59:59Z"
}
}
}
],
"should": [
{
"match": {
"event_type": "STARTED"
}
},
{
"match": {
"event_type": "FINISHED"
}
}
],
"minimum_should_match": 1
}
},
"aggs": {
"application_IDs": {
"terms": {
"field": "app_id"
},
"aggs": {
"ids": {
"top_hits": {
"size": 10,
"_source": ["event_type", "app_id"],
"sort": [
{ "event_type": { "order": "desc"}}
]
}
}
}
}
}
}
Notice that for filtering I've used a Range Query, since you only want documents for that date, and added bool should clauses for STARTED and FINISHED. Keep in mind that when a bool query also has a must clause, its should clauses are optional and only affect scoring unless minimum_should_match is set to at least 1.
Once I have the documents, I use a Terms Aggregation and a Top Hits Aggregation to get the desired result.
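How must and should interact in a bool query can be sketched in plain Python. This is a simplified model and not Elasticsearch code: bool_matches and the clause functions are hypothetical, filter and must_not clauses are left out, and the KILLED event is invented purely for illustration.

```python
from typing import Callable, Optional, Sequence

Clause = Callable[[dict], bool]


def bool_matches(
    doc: dict,
    must: Sequence[Clause] = (),
    should: Sequence[Clause] = (),
    minimum_should_match: Optional[int] = None,
) -> bool:
    """Simplified matching rule of a bool query (scoring is ignored)."""
    if not all(clause(doc) for clause in must):
        return False
    if minimum_should_match is None:
        # Default: 1 if there are only should clauses, otherwise 0.
        minimum_should_match = 1 if should and not must else 0
    matched = sum(1 for clause in should if clause(doc))
    return matched >= minimum_should_match


def on_day(doc: dict) -> bool:
    return doc["ts"].startswith("2019-10-02")


def is_started(doc: dict) -> bool:
    return doc["event_type"] == "STARTED"


def is_finished(doc: dict) -> bool:
    return doc["event_type"] == "FINISHED"


# Hypothetical event type, used only to show the difference.
killed = {"event_type": "KILLED", "ts": "2019-10-02T10:00:00Z"}
clauses = dict(must=[on_day], should=[is_started, is_finished])

print(bool_matches(killed, **clauses))                          # should is optional
print(bool_matches(killed, minimum_should_match=1, **clauses))  # should is required
```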
Result
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"application_IDs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "application_1569623930006_490200", <----- APP ID
"doc_count" : 2,
"ids" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "1", <--- Document with STARTED status
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490200"
},
"sort" : [
"STARTED"
]
},
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "2", <--- Document with FINISHED status
"_score" : null,
"_source" : {
"event_type" : "FINISHED",
"app_id" : "application_1569623930006_490200"
},
"sort" : [
"FINISHED"
]
}
]
}
}
},
{
"key" : "application_1569623930006_490202",
"doc_count" : 2,
"ids" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "4",
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490202"
},
"sort" : [
"STARTED"
]
},
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "5",
"_score" : null,
"_source" : {
"event_type" : "FINISHED",
"app_id" : "application_1569623930006_490202"
},
"sort" : [
"FINISHED"
]
}
]
}
}
},
{
"key" : "application_1569623930006_490201",
"doc_count" : 1,
"ids" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490201"
},
"sort" : [
"STARTED"
]
}
]
}
}
}
]
}
}
}
Note that the last document with only STARTED appears in the aggregation result as well.
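For comparison, the grouping that the terms and top_hits aggregations perform can be mimicked on the client. The Python below is a sketch over the sample documents above and not part of the Elasticsearch answer: pair_events is a hypothetical helper, and it applies the same day filter to both event types, as the query does.

```python
from collections import defaultdict
from datetime import datetime

# (app_id, event_type, ts) taken from the sample documents above.
EVENTS = [
    ("application_1569623930006_490200", "STARTED", "2019-10-02T08:11:53Z"),
    ("application_1569623930006_490200", "FINISHED", "2019-10-02T08:12:53Z"),
    ("application_1569623930006_490201", "STARTED", "2019-10-02T09:30:53Z"),
    ("application_1569623930006_490202", "STARTED", "2019-10-02T09:45:53Z"),
    ("application_1569623930006_490202", "FINISHED", "2019-10-02T09:45:53Z"),
    ("application_1569623930006_490203", "STARTED", "2019-10-03T09:30:53Z"),
    ("application_1569623930006_490203", "FINISHED", "2019-10-03T09:45:53Z"),
]


def pair_events(events, day: str) -> dict:
    """Group STARTED/FINISHED timestamps by app_id for one calendar day."""
    pairs = defaultdict(dict)
    for app_id, event_type, ts in events:
        parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        if event_type in ("STARTED", "FINISHED") and parsed.date().isoformat() == day:
            pairs[app_id][event_type] = ts
    return dict(pairs)


pairs = pair_events(EVENTS, "2019-10-02")
for app_id, times in pairs.items():
    # A missing FINISHED entry means no end event was logged on that day.
    print(app_id, times.get("STARTED"), times.get("FINISHED"))
```

Because the day filter applies to both event types, an application that finishes after midnight shows up without an end time, in this sketch and in the query alike.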
Updated Answer
{
"size":0,
"query":{
"bool":{
"must":[
{
"range":{
"ts":{
"gte":"2019-10-02T00:00:00Z",
"lte":"2019-10-02T23:59:59Z"
}
}
}
],
"should":[
{
"term":{
"event_type.keyword":"STARTED" <----- Changed this
}
},
{
"term":{
"event_type.keyword":"FINISHED" <----- Changed this
}
}
],
"minimum_should_match":1
}
},
"aggs":{
"application_IDs":{
"terms":{
"field":"app_id.keyword" <----- Changed this
},
"aggs":{
"ids":{
"top_hits":{
"size":10,
"_source":[
"event_type",
"app_id"
],
"sort":[
{
"event_type.keyword":{ <----- Changed this
"order":"desc"
}
}
]
}
}
}
}
}
}
Note the changes I've made. Whenever you need exact matches or want to use aggregations, you need to use a keyword field.
In the mapping you've shared there is no username field, but there are two event_type fields. I'm assuming that's just a human error and that one of them should be username.
If you look carefully, the event_type field is of type text and has a sibling keyword sub-field. I've modified the query to use the keyword sub-field, and because of that I use a Term Query.
Try this out and let me know if it helps!

Elasticsearch wildcard query with spaces

I'm trying to do a wildcard query with spaces. It easily matches individual terms but not the field value as a whole.
I've read the documentation, which says the field needs to be not_analyzed, but with that setting the query returns nothing.
This is the mapping with which it works on a per-term basis:
{
"denshop" : {
"mappings" : {
"products" : {
"properties" : {
"code" : {
"type" : "string"
},
"id" : {
"type" : "long"
},
"name" : {
"type" : "string"
},
"price" : {
"type" : "long"
},
"url" : {
"type" : "string"
}
}
}
}
}
}
This is the mapping with which the exact same query returns nothing:
{
"denshop" : {
"mappings" : {
"products" : {
"properties" : {
"code" : {
"type" : "string"
},
"id" : {
"type" : "long"
},
"name" : {
"type" : "string",
"index" : "not_analyzed"
},
"price" : {
"type" : "long"
},
"url" : {
"type" : "string"
}
}
}
}
}
}
The query is here:
curl -XPOST http://127.0.0.1:9200/denshop/products/_search?pretty -d '{"query":{"wildcard":{"name":"*test*"}}}'
Response with the not_analyzed property:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Response without not_analyzed:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [ {
...
EDIT: Adding requested info
Here is the list of documents:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [ {
"_index" : "denshop",
"_type" : "products",
"_id" : "3L1",
"_score" : 1.0,
"_source" : {
"id" : 3,
"name" : "Testovací produkt 2",
"code" : "",
"price" : 500,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt-2/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "4L1",
"_score" : 1.0,
"_source" : {
"id" : 4,
"name" : "Testovací produkt 3",
"code" : "",
"price" : 666,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt-3/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "2L1",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "Testovací produkt",
"code" : "",
"price" : 500,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "5L1",
"_score" : 1.0,
"_source" : {
"id" : 5,
"name" : "Testovací produkt 4",
"code" : "",
"price" : 666,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt-4/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "6L1",
"_score" : 1.0,
"_source" : {
"id" : 6,
"name" : "Testovací produkt 5",
"code" : "",
"price" : 666,
"url" : "http://www.denshop.lh/tricka-tilka-tuniky/testovaci-produkt-5/"
}
} ]
}
}
Without not_analyzed, this query returns hits:
curl -XPOST http://127.0.0.1:9200/denshop/products/_search?pretty -d '{"query":{"wildcard":{"name":"*testovací*"}}}'
But this one does not (notice the space before the asterisk):
curl -XPOST http://127.0.0.1:9200/denshop/products/_search?pretty -d '{"query":{"wildcard":{"name":"*testovací *"}}}'
When I add not_analyzed to the mapping, I get no hits no matter what I put in the wildcard query.
Add a custom analyzer that lowercases the text while keeping the whole value as a single token. Then lowercase the search text in your client application before passing it to the wildcard query, because wildcard patterns are not analyzed.
To keep the original analysis chain as well, I've added a sub-field to your name field that uses the custom analyzer.
PUT /denshop
{
"settings": {
"analysis": {
"analyzer": {
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"products": {
"properties": {
"name": {
"type": "string",
"fields": {
"lowercase": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
And the query will work on the sub-field:
GET /denshop/products/_search
{
"query": {
"wildcard": {
"name.lowercase": "*testovací *"
}
}
}
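The client-side lowercasing step could look like the following. It is a minimal Python sketch that only builds the request body: build_wildcard_query is a hypothetical helper, the field name name.lowercase comes from the mapping above, and sending the request is left to whatever HTTP client you use.

```python
import json


def build_wildcard_query(user_input: str, field: str = "name.lowercase") -> dict:
    """Build a wildcard query body with the pattern lowercased on the client.

    Wildcard characters typed by the user (* and ?) are passed through as-is.
    """
    pattern = f"*{user_input.lower()}*"
    return {"query": {"wildcard": {field: pattern}}}


# The trailing space is kept, so the pattern spans two words of the value.
body = build_wildcard_query("Testovací ")
print(json.dumps(body, ensure_ascii=False))
```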

Why isn't my elastic search query returning the text analyzed by english analyzer?

I have an index named test_blocks
{
"test_blocks" : {
"aliases" : { },
"mappings" : {
"block" : {
"dynamic" : "false",
"properties" : {
"content" : {
"type" : "string",
"fields" : {
"content_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"id" : {
"type" : "long"
},
"title" : {
"type" : "string",
"fields" : {
"title_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"user_id" : {
"type" : "long"
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1438642440687",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"version" : {
"created" : "1070099"
},
"uuid" : "45vkIigXSCyvHN6g-w5kkg"
}
},
"warmers" : { }
}
}
When I search for killing, a word in the content, the results come back as expected.
http://localhost:9200/test_blocks/_search?q=killing&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.07431685,
"hits" : [ {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "218",
"_score" : 0.07431685,
"_source":{"block":{"id":218,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":82}}
}, {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "219",
"_score" : 0.07431685,
"_source":{"block":{"id":219,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":83}}
} ]
}
}
However, given that I have an english analyzer on the content field (content_en), I would have expected the same document to come back for the query kill. It doesn't. I get 0 hits.
http://localhost:9200/test_blocks/_search?q=kill&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
My understanding from this analyze request is that "killing" gets reduced to "kill":
http://localhost:9200/_analyze?analyzer=english&text=killing
{
"tokens" : [ {
"token" : "kill",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
So why doesn't the query "kill" match that document? Are my mappings incorrect, or is my search incorrect?
I am using Elasticsearch v1.7.0.
You need to use fuzzy search (some introduction available here):
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"match": {
"title": {
"query": "kill",
"fuzziness": 2,
"prefix_length": 1
}
}
}
}'
UPD. Since the content_en field holds the content as processed by the stemmer, it makes sense to query the stemmed sub-fields directly:
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "kill",
"fields": ["block.title", "block.title.title_en"]
}
}
}'
Queries such as http://localhost:9200/_search?q=kill end up searching across the _all field.
The _all field uses the default analyzer which, unless overridden, is the standard analyzer and not the english analyzer.
To make the above query work, you would need to set the english analyzer on the _all field and re-index.
Example:
{
"mappings": {
"block": {
"_all" : {"analyzer" : "english"}
}
}
}
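Why the analyzer on the searched field matters can be shown with a toy model. The Python below is only a sketch: standard_terms approximates the standard analyzer, and toy_stem is a deliberately naive stand-in for the stemming done by the english analyzer, not its real algorithm.

```python
import re


def standard_terms(text: str) -> set:
    """Approximate the standard analyzer: lowercase words, no stemming."""
    return {word.lower() for word in re.findall(r"[A-Za-z]+", text)}


def toy_stem(word: str) -> str:
    """Toy rule only: strip a trailing 'ing' from longer words."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def english_like_terms(text: str) -> set:
    """Same tokens as above, passed through the toy stemmer."""
    return {toy_stem(word) for word in standard_terms(text)}


content = "a mysterious red and yellow being killing his mother"

# Without stemming, only the exact word is indexed.
print("killing" in standard_terms(content))
print("kill" in standard_terms(content))

# With stemming on both the indexed text and the query, "kill" matches.
print(toy_stem("kill") in english_like_terms(content))
```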
I would also point out that the mapping in the OP doesn't seem consistent with the document structure. As @EugZol pointed out, the content is within a block object, so the mapping should be something along these lines:
{
"mappings": {
"block": {
"properties": {
"block": {
"properties": {
"content": {
"type": "string",
"analyzer": "standard",
"fields": {
"content_en": {
"type": "string",
"analyzer": "english"
}
}
},
"id": {
"type": "long"
},
"title": {
"type": "string",
"analyzer": "standard",
"fields": {
"title_en": {
"type": "string",
"analyzer": "english"
}
}
},
"user_id": {
"type": "long"
}
}
}
}
}
}
}

How to get Elasticsearch boolean match working for multiple fields

I need some expert guidance on trying to get a bool match working. I'd like the query to only return a successful search result if both 'message' matches 'Failed password for', and 'path' matches '/var/log/secure'.
This is my query:
curl -s -XGET 'http://localhost:9200/logstash-2015.05.07/syslog/_search?pretty=true' -d '{
"filter" : { "range" : { "@timestamp" : { "gte" : "now-1h" } } },
"query" : {
"bool" : {
"must" : [
{ "match_phrase" : { "message" : "Failed password for" } },
{ "match_phrase" : { "path" : "/var/log/secure" } }
]
}
}
} '
Here is the start of the output from the search:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 46,
"max_score" : 13.308596,
"hits" : [ {
"_index" : "logstash-2015.05.07",
"_type" : "syslog",
"_id" : "AU0wzLEqqCKq_IPSp_8k",
"_score" : 13.308596,
"_source":{"message":"May 7 16:53:50 s_local@logstash-02 sshd[17970]: Failed password for fred from 172.28.111.200 port 43487 ssh2","@version":"1","@timestamp":"2015-05-07T16:53:50.554-07:00","type":"syslog","host":"logstash-02","path":"/var/log/secure"}
}, ...
The problem is if I change '/var/log/secure' to just 'var' say, and run the query, I still get a result, just with a lower score. I understood the bool...must construct meant both match terms here would need to be successful. What I'm after is no result if 'path' doesn't exactly match '/var/log/secure'...
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 46,
"max_score" : 10.354593,
"hits" : [ {
"_index" : "logstash-2015.05.07",
"_type" : "syslog",
"_id" : "AU0wzLEqqCKq_IPSp_8k",
"_score" : 10.354593,
"_source":{"message":"May 7 16:53:50 s_local@logstash-02 sshd[17970]: Failed password for fred from 172.28.111.200 port 43487 ssh2","@version":"1","@timestamp":"2015-05-07T16:53:50.554-07:00","type":"syslog","host":"logstash-02","path":"/var/log/secure"}
},...
I checked the mappings to confirm that these fields are not analyzed:
curl -X GET 'http://localhost:9200/logstash-2015.05.07/_mapping?pretty=true'
I think these fields are not analyzed, so I believe the search will not be analyzed either (based on some training documentation I read recently from Elasticsearch). Here is a snippet of the _mapping output for this index.
....
"message" : {
"type" : "string",
"norms" : {
"enabled" : false
},
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed",
"ignore_above" : 256
}
}
},
"path" : {
"type" : "string",
"norms" : {
"enabled" : false
},
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed",
"ignore_above" : 256
}
}
},
....
Where am I going wrong, or what am I misunderstanding here?
As you note, exact matching requires the "not_analyzed" version of a field. According to the mapping in the OP, however, the not_analyzed versions are the sub-fields message.raw and path.raw, not message and path themselves.
Example:
{
"filter" : { "range" : { "@timestamp" : { "gte" : "now-1h" } } },
"query" : {
"bool" : {
"must" : [
{ "match_phrase" : { "message" : "Failed password for" } },
{ "match_phrase" : { "path.raw" : "/var/log/secure" } }
]
}
}
}
The link alongside gives more insight into multi-fields.
To expand further:
The mapping in the OP for path is as follows:
"path" : {
"type" : "string",
"norms" : {
"enabled" : false
},
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed",
"ignore_above" : 256
}
}
}
This specifies that the path field uses the default analyzer and that path.raw is not analyzed.
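The effect of querying the analyzed path field versus path.raw can be sketched in Python. This is an approximation and not Elasticsearch code: analyze, phrase_matches and raw_matches are hypothetical helpers, and the regular expression only roughly imitates how the default analyzer tokenizes a path.

```python
import re


def analyze(text: str) -> list:
    """Rough stand-in for the default analyzer: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(field_value: str, phrase: str) -> bool:
    """match_phrase on an analyzed field: query tokens must appear consecutively."""
    tokens, wanted = analyze(field_value), analyze(phrase)
    if not wanted:
        return False
    last_start = len(tokens) - len(wanted)
    return any(tokens[i:i + len(wanted)] == wanted for i in range(last_start + 1))


def raw_matches(field_value: str, query: str) -> bool:
    """not_analyzed field: the whole value is one term and must match exactly."""
    return field_value == query


path = "/var/log/secure"

print(analyze(path))                         # tokens of the analyzed field
print(phrase_matches(path, "var"))           # analyzed field still matches "var"
print(raw_matches(path, "var"))              # raw field does not
print(raw_matches(path, "/var/log/secure"))  # raw field matches the exact value
```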
If you want the path field itself to be not analyzed, rather than the raw sub-field, the mapping would be along these lines:
"path" : {
"type" : "string",
"index" : "not_analyzed",
"norms" : {
"enabled" : false
},
"fields" : {
"raw" : {
"type" : "string",
"analyzer" : <whatever analyzer you want>,
"ignore_above" : 256
}
}
}
