Keyword is tokenized and exact match does not work - elasticsearch

I have a field named id that looks like this:
ventures.something.123
Its mapping:
{
  "id": {
    "fields": {
      "keyword": {
        "ignore_above": 256,
        "type": "keyword"
      }
    },
    "type": "text"
  }
}
My understanding is that a keyword only allows for EXACT matching - which is what I want.
However, the analyzer tells me it's tokenized:
> http http://localhost:9200/my_index/_analyze field=id text='ventures.house.1137'
{
  "tokens": [
    {
      "end_offset": 14,
      "position": 0,
      "start_offset": 0,
      "token": "ventures.house",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 19,
      "position": 1,
      "start_offset": 15,
      "token": "1137",
      "type": "<NUM>"
    }
  ]
}
...and a search for an id indeed returns ALL ids that start with ventures.house.
Why is that, and how can I get exact matching?
It's ES 5.2.

From https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2
not_analyzed:
Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it.
{
  "tag": {
    "type": "string",
    "index": "not_analyzed"
  }
}

I misread the mapping; it looks like my elasticsearch-dsl library does not create a keyword field directly, but adds it as a sub-field.
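Given that sub-field mapping, exact matching should work by pointing a term query at the keyword sub-field. A minimal sketch (index and field names taken from the mapping and _analyze call above):

```json
GET my_index/_search
{
  "query": {
    "term": {
      "id.keyword": "ventures.house.1137"
    }
  }
}
```

Note the use of term rather than match, so the query string itself is not passed through an analyzer.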

Have you tried defining the field 'id' as keyword?
In that case it does not get analyzed but is stored as is.
If I understand your question correctly, this is what you want.
{
  "id": {
    "type": "keyword"
  }
}
See https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html
I hope this helped. Christian

Related

ElasticSearch inconsistent wildcard search

I have a strange issue with my wildcard search. I've created an index with the following mapping:
I have the following document there:
When I'm performing the following query, I'm getting the document:
{
  "query": {
    "wildcard": { "email": "*asdasd*" }
  },
  "size": "10",
  "from": 0
}
But when I'm doing the next request, I'm not getting anything:
{
  "query": {
    "wildcard": { "email": "*one-v*" }
  },
  "size": "10",
  "from": 0
}
Can you please explain the reason for it?
Thank you
Elasticsearch uses the standard analyzer if no analyzer is specified. Assuming that the email field is of text type, "asdasd@one-v.co.il" will get tokenized into
{
  "tokens": [
    {
      "token": "asdasd",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "one",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "v.co.il",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
Now, when you are doing a wildcard query on the email field, it will search against the tokens created above. Since there is no token that matches one-v, you are getting empty results for the second query.
It is better to use a keyword field for wildcard queries. If you have not explicitly defined any index mapping, you need to add .keyword to the email field (notice the ".keyword" after the email field). The keyword sub-field stores the whole value as a single un-analyzed term instead of passing it through the standard analyzer.
Modify your query as shown below
{
  "query": {
    "wildcard": {
      "email.keyword": "*one-v*"
    }
  }
}
The search result will be:
"hits": [
  {
    "_index": "67688032",
    "_type": "_doc",
    "_id": "1",
    "_score": 1.0,
    "_source": {
      "email": "asdasd@one-v.co.il"
    }
  }
]
Otherwise, you need to change the data type of the email field from text to keyword.
This has to do with how text fields are indexed. By default the standard analyzer is used.
This is an example from the documentation which fits your case too :
The text "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." is broken into the terms
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
As you can see, Brown-Foxes is not a single token. The same goes for one-v: it will be broken into one and v.
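You can confirm this tokenization yourself with the _analyze API (a sketch; the standard analyzer is assumed, as in the answers above):

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "asdasd@one-v.co.il"
}
```

The response lists asdasd, one, and v.co.il as separate tokens; none of them contains the substring one-v, which is why the second wildcard query finds nothing.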

Elasticsearch : Problem with querying document where "." is included in field

I have an index where some entries are like
{
  "name": " Stefan Drumm"
}
...
{
  "name": "Dr. med. Elisabeth Bauer"
}
The mapping of the name field is
{
  "name": {
    "type": "text",
    "analyzer": "index_name_analyzer",
    "search_analyzer": "search_cross_fields_analyzer"
  }
}
When I use the below query
GET my_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "Stefan Drumm", "operator": "AND" } } }
      ],
      "boost": 1.0
    }
  },
  "min_score": 0.0
}
It returns the first document.
But when I try to get the second document using the query below
GET my_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "Dr. med. Elisabeth Bauer", "operator": "AND" } } }
      ],
      "boost": 1.0
    }
  },
  "min_score": 0.0
}
it is not returning anything.
Things I can't do:
- change the index
- use the term query
- change the operator to 'OR', because in that case it will return multiple entries, which I don't want
What am I doing wrong, and how can I achieve this by modifying the query?
You have configured different analyzers for indexing and searching (index_name_analyzer and search_cross_fields_analyzer). If these analyzers process the input Dr. med. Elisabeth Bauer in an incompatible way, the search isn't going to match. This is described in more detail in Index and search analysis, as well as in Controlling Analysis.
You don't provide the definition of these two analyzers, so it's hard to guess from your question what they are doing. Depending on the analyzers, it may be possible to preprocess your query string (e.g. by removing .) before executing the search so that the search will match.
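If the search analyzer is the one discarding periods, one workaround is to strip them from the query string on the client before sending the search. A minimal sketch in Python (the exact characters to remove depend on your analyzer definitions, which aren't shown in the question):

```python
import re

def preprocess_query(query: str) -> str:
    """Replace periods with spaces and collapse runs of whitespace,
    so the query terms align with an analyzer that drops punctuation."""
    no_dots = query.replace(".", " ")
    return re.sub(r"\s+", " ", no_dots).strip()

print(preprocess_query("Dr. med. Elisabeth Bauer"))  # Dr med Elisabeth Bauer
```

The cleaned string would then be used as the match query's "query" value, keeping "operator": "AND".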
You can investigate how analysis affects your search by using the _analyze API, as described in Testing analyzers. For your example, the commands
GET my_index/_analyze
{
  "analyzer": "index_name_analyzer",
  "text": "Dr. med. Elisabeth Bauer"
}
and
GET my_index/_analyze
{
  "analyzer": "search_cross_fields_analyzer",
  "text": "Dr. med. Elisabeth Bauer"
}
should show you how the two analyzers configured for your index treat the target string, which might give you a clue about what's wrong. The response will be something like
{
  "tokens": [
    {
      "token": "dr",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "med",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "elisabeth",
      "start_offset": 9,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "bauer",
      "start_offset": 19,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
For the example output above, the analyzer has split the input into one token per word, lowercased each word, and discarded all punctuation.
My guess would be that index_name_analyzer preserves punctuation, while search_cross_fields_analyzer discards it, so that the tokens won't match. If this is the case, and you can't change the index configuration (as you state in your question), one other option would be to specify a different analyzer when running the query:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": {
"query": "Dr. med. Elisabeth Bauer",
"operator": "AND",
"analyzer": "index_name_analyzer"
}
}
}
],
"boost": 1
}
},
"min_score": 0
}
In the query above, the analyzer parameter has been set to override the search analysis to use the same analyzer (index_name_analyzer) as the one used when indexing. What analyzer might make sense to use depends on your setup. Ideally, you should configure the analyzers to align so that you don't have to override at search time, but it sounds like you are not living in an ideal world.

Elastic Query accepting only 4 characters

I am running a terms query in Elasticsearch version 7.2. When I have 4 characters in my query it works, but if I add or remove any characters it stops working.
Working query:
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "GEP_PN": ["6207"]
          }
        },
        {
          "match": {
            "GEP_MN.keyword": "SKF"
          }
        }
      ]
    }
  }
}
Result: (screenshot not included)
Query that is failing: (screenshot not included)
It's not failing; it's just not finding a result for your search term. Note that terms queries are not analyzed, as mentioned in the docs:
Returns documents that contain one or more exact terms in a provided field.
Please provide the mapping of your index. If it uses a text field and you are not using a custom analyzer, it will use the standard analyzer, which splits tokens on -; hence your terms query does not match the tokens present in the inverted index.
See the _analyze API output for your search term, which shows the probable root cause.
{
  "text": "6207-R"
}
Tokens
{
  "tokens": [
    {
      "token": "6207",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "r",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
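If the index was created with default (dynamic) mapping, GEP_PN will also have an un-analyzed keyword sub-field, and pointing the terms query at it is one possible fix. A sketch, assuming the default .keyword sub-field exists:

```json
{
  "query": {
    "bool": {
      "must": [
        { "terms": { "GEP_PN.keyword": ["6207-R"] } },
        { "match": { "GEP_MN.keyword": "SKF" } }
      ]
    }
  }
}
```

Against the keyword sub-field the stored term is the original string "6207-R", so the exact-term lookup matches regardless of the hyphen.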

Match Substring email address at specific location in ELK

I am trying to find data matching emails in a message field in the ELK Kibana Discover section. I am getting results using:
@message:"abc@email.com"
However, the results contain some other messages where the email should not be matched, and I have been unable to build a solution for this.
The results are (data has been sanitized for security reasons):
@message:[INF] [2020-07-07 12:54:51.105] [PID-1] : [abcdefg] [JID-5c] [data] LIST_LOOKUP: abc@email.com | User List from Profiles | name | user_name @id:355502086986714
@message:[INF] [2020-07-07 12:38:36.755] [PID-2] : [abcdefg] [JID-ed2] [data] LIST_LOOKUP: abc@email.com | User List from Profiles | name | user_name @id:355501869671304
@message:[INF] [2020-07-07 12:19:48.141] [PID-3] [abc@email.com] : [c5] [data] Completed 200 OK in 11ms @id:355501617979964834
@message:[INF] [2020-07-07 11:19:48.930] [PID-5] [abc@email.com] : [542] [data] Completed 200 OK in 9ms @id:35550081535
while I want it to be:
@message:[INF] [2020-07-07 12:19:48.141] [PID-3] [abc@email.com] : [c5] [data] Completed 200 OK in 11ms @id:355501617979964834
@message:[INF] [2020-07-07 11:19:48.930] [PID-5] [abc@email.com] : [542] [data] Completed 200 OK in 9ms @id:35550081535
I've tried @message: "[PID-*] [abc@email.com]", @message: "\[PID-*\] \[abc@email.com\] \:", @message: "[abc@email.com]", @message: *abc@email.com*, and some more similar searches, to no success.
Please let me know what I am missing here and how to make efficient subtext searches in ELK Kibana using Discover and KQL/Lucene.
Here is the mapping for my index (I am getting data from CloudWatch logs):
{
  "cwl-*": {
    "mappings": {
      "properties": {
        "@id": {
          "type": "string"
        },
        "@log_stream": {
          "type": "string"
        },
        "@log_group": {
          "type": "string"
        },
        "@message": {
          "type": "string"
        },
        "@owner": {
          "type": "string"
        },
        "@timestamp": {
          "type": "date"
        }
      }
    }
  }
}
As @Gibbs already mentioned, the cause is that all your data contains the string abc@email.com, and seeing your mapping confirms it: you are using a string field without an explicit analyzer, so it falls back to the default standard analyzer.
Instead, you should map the field that receives the mail id with a custom analyzer that uses the UAX URL email tokenizer, which does not split the email address.
Here is an example of how to create this analyzer.
Mapping with the custom email analyzer:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "email_analyzer"
      }
    }
  }
}
Analyze API request:
POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
  "analyzer": "email_analyzer",
  "text": "abc@email.com"
}
Response:
{
  "tokens": [
    {
      "token": "abc@email.com",
      "start_offset": 0,
      "end_offset": 13,
      "type": "<EMAIL>",
      "position": 0
    }
  ]
}
All of your results contain abc@email.com, so this is expected.
[abc@email.com] is tokenized as
{
  "tokens": [
    {
      "token": "abc",
      "start_offset": 1,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "email.com",
      "start_offset": 5,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
If you have a dedicated email field, you can make use of it; otherwise you need to alter your mapping for that field.
If this doesn't answer your question, can you add the mapping for that field, obtained via http://host:port/indexName/_mapping?

Elasticsearch Query String Query with # symbol and wildcards

I defined a custom analyzer that I was surprised isn't built-in.
"analyzer": {
  "keyword_lowercase": {
    "type": "custom",
    "filter": [
      "lowercase"
    ],
    "tokenizer": "keyword"
  }
}
Then my mapping for this field is:
"email": {
  "type": "string",
  "analyzer": "keyword_lowercase"
}
This works great. (http://.../_analyze?field=email&text=me@example.com) ->
"tokens": [
  {
    "token": "me@example.com",
    "start_offset": 0,
    "end_offset": 16,
    "type": "word",
    "position": 1
  }
]
Finding by that keyword works great: http://.../_search?q=me@example.com yields results.
The problem is trying to incorporate wildcards anywhere in the query string query. http://.../_search?q=*me@example.com yields no results; I would expect results containing emails such as "me@example.com" and "some@example.com".
It looks like Elasticsearch performs the search with the default analyzer, which doesn't make sense. Shouldn't it perform the search with each field's own analyzer?
I.e. http://.../_search?q=email:*me@example.com returns results because I am telling it which field (and therefore which analyzer) to use.
Can Elasticsearch not do this?
See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Set analyze_wildcard to true, as it is false by default.
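For example, the failing search could be sent through the query DSL with that flag enabled (a sketch; the field and value are taken from the question above):

```json
GET /_search
{
  "query": {
    "query_string": {
      "query": "email:*me@example.com",
      "analyze_wildcard": true
    }
  }
}
```

With analyze_wildcard set to true, the wildcard term is run through the field's analyzer (here keyword_lowercase) instead of being matched verbatim against the default analysis.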
