Elastic Query accepting only 4 characters - elasticsearch

I am running a terms query in Elasticsearch 7.2. When the term in my query has 4 characters it works, but if I add or remove any characters it stops matching.
Working query:
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "GEP_PN": ["6207"]
          }
        },
        {
          "match": {
            "GEP_MN.keyword": "SKF"
          }
        }
      ]
    }
  }
}
Result: the expected document is returned.
The same query fails when the term is "6207-R".

It's not failing; it's just not finding a result for your search term. Note that a terms query is not analyzed, as mentioned in the docs:
Returns documents that contain one or more exact terms in a provided field.
Please provide the mapping of your index. If the field is a text field and you are not using a custom analyzer, the standard analyzer applies; it splits tokens on -, so your terms query does not match the tokens present in the inverted index.
See the Analyze API output for your search term, which shows the probable root cause:
GET /_analyze
{
  "text": "6207-R"
}
Tokens
{
  "tokens": [
    {
      "token": "6207",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "r",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
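Assuming the index uses the default dynamic mapping (so an unanalyzed GEP_PN.keyword sub-field exists), and assuming the failing term was "6207-R" as in the analyze example above, a minimal sketch of a query that matches the hyphenated value exactly:
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "GEP_PN.keyword": ["6207-R"]
          }
        },
        {
          "match": {
            "GEP_MN.keyword": "SKF"
          }
        }
      ]
    }
  }
}
Because terms is not analyzed and the keyword sub-field stores the original value verbatim, "6207-R" matches regardless of how many characters it has.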

Related

Adding Elasticsearch sort returns incorrect results?

My queries are successfully returning the exact results that I am looking for.
{"size": 100,"from": 0, "query": {"bool": {"must": [{"bool":{"should":[{"match":{"ProcessId":"from-cn"}}]}}]}}}
This returns only items with ProcessId "from-cn"
However, when I add a sort query like this:
{"size": 100,"from": 0,"sort": [{"CreatedTimeStamp": {"order": "desc"}}], "query": {"bool": {"must": [{"bool":{"should":[{"match":{"ProcessId":"from-cn"}}]}}]}}}
This now returns all the "from-cn" documents, but it also returns several other results that do NOT have ProcessId "from-cn".
I know it is the sort that is causing the issue, because when I remove the sort it works perfectly.
Why is this happening here? How can I fix it?
Try this query instead. What does it yield?
{
  "size": 100,
  "from": 0,
  "sort": [
    {
      "CreatedTimeStamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "ProcessId": "from-cn"
          }
        }
      ]
    }
  }
}
The match query performs a full-text search.
It means that it analyzes the provided text, producing tokens that will be used when doing the actual matching against the document field.
Unless you defined a custom search analyzer for the ProcessId field, Elasticsearch will use the standard analyzer here.
You can verify what tokens it produces for the "from-cn" text using the Analyze API, in this case:
POST http://localhost:9200/_analyze
{
  "analyzer": "standard",
  "text": "from-cn"
}
The response:
{
  "tokens": [
    {
      "token": "from",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cn",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
You can see that it produces two tokens: "from" and "cn". So documents containing only one of them will also match the query. In your case, I believe, they simply fell outside the first 100 results that you requested, so you don't see them when searching without a custom sort.
When you don't use custom sorting, documents are sorted by score, and the documents that are more relevant to the query are higher on the list. In your case, documents matching both tokens will have a higher score than those matching only one. But with custom sorting you don't rely on the score anymore, so less relevant documents can rank higher.
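To see this mechanism in action, you can keep _score as the primary sort key and use the timestamp only as a tiebreaker; this fragment is just an illustration of how sort keys interact, not the fix:
"sort": [
  "_score",
  {
    "CreatedTimeStamp": {
      "order": "desc"
    }
  }
]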
Solution:
If you want to match the contents of the field exactly, define that field as non-analyzed in your mapping (e.g. using the keyword type instead of text) and use a query that doesn't analyze the provided text (e.g. a term query instead of match).
Recreate the index with the ProcessId field as keyword.
PUT http://localhost:9200/my-index
{
  "mappings": {
    "properties": {
      "ProcessId": {
        "type": "keyword"
      },
      ... other fields
    }
  }
}
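If the existing data lives in another index, the _reindex API can copy it into the new one; a sketch, with my-old-index as a hypothetical name for the source index:
POST http://localhost:9200/_reindex
{
  "source": {
    "index": "my-old-index"
  },
  "dest": {
    "index": "my-index"
  }
}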
After reindexing the data, use that field for searching with a term query.
{
  "size": 100,
  "from": 0,
  "sort": [
    {
      "CreatedTimeStamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "ProcessId": "from-cn"
    }
  }
}
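Alternatively, if the index was created with dynamic mapping, the ProcessId text field may already have a .keyword sub-field, which gives you an exact match without reindexing; a sketch under that assumption:
{
  "size": 100,
  "from": 0,
  "sort": [
    {
      "CreatedTimeStamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "ProcessId.keyword": "from-cn"
    }
  }
}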

ElasticSearch inconsistent wildcard search

I have a strange issue with my wildcard search. My index contains a document whose email field is "asdasd@one-v.co.il".
When I perform the following query, I get the document back:
{
  "query": {
    "wildcard": { "email": "*asdasd*" }
  },
  "size": 10,
  "from": 0
}
But when I'm doing the next request, I'm not getting anything:
{
  "query": {
    "wildcard": { "email": "*one-v*" }
  },
  "size": 10,
  "from": 0
}
Can you please explain the reason for it?
Thank you
Elasticsearch uses the standard analyzer if no analyzer is specified. Assuming the email field is of text type, "asdasd@one-v.co.il" will get tokenized into
{
  "tokens": [
    {
      "token": "asdasd",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "one",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "v.co.il",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
Now, when you run a wildcard query on the email field, it searches against the tokens created above. Since there is no token that matches one-v, you get empty results for the second query.
It is better to use a keyword field for wildcard queries. If you have not explicitly defined any index mapping, you can add .keyword to the email field. This uses the keyword analyzer instead of the standard analyzer (notice the ".keyword" after the email field).
Modify your query as shown below:
{
  "query": {
    "wildcard": {
      "email.keyword": "*one-v*"
    }
  }
}
The search result will be:
"hits": [
{
"_index": "67688032",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"email": "asdasd#one-v.co.il"
}
}
]
Otherwise, you need to change the data type of the email field from text to keyword.
This has to do with how text fields are stored. By default, the standard analyzer is used.
This example from the documentation fits your case too:
The text "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." is broken into the terms
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
As you can see, Brown-Foxes is not a single token. The same goes for one-v: it will be broken into one and v.
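You can confirm this with the Analyze API:
GET /_analyze
{
  "analyzer": "standard",
  "text": "one-v"
}
which returns the two tokens one and v.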

Spring Data JPA IN clause returning more than expected values when an element of the list contains a hyphen

While fetching records using an IN clause, the query below is returning more values than expected.
List<Object> findAllByCameraIdIn(List<String> cameraIds);
I have records associated with two cameras in my Elasticsearch database: [uk05-smoking-shelter-carpark, uk05-stairway-in]
If List<String> cameraIds = ["uk05-smoking-shelter-carpark"], it also gives values associated with the camera uk05-stairway-in (both cameras). Any idea/suggestion why this is happening?
Even though I'm making a db call to filter the records, the expected result should have been only the 7 records corresponding to uk05-smoking-shelter-carpark, but it is giving me results for uk05-stairway-in as well.
My findings
When I replaced the - with _ for a few records, i.e. uk05-smoking-shelter-carpark with uk05_smoking_shelter_carpark in the cameraId, the query works fine.
I believe the query starts searching for all the records with the given value, but once it encounters -, it ignores all the letters after the -. Any suggestion or insight into why it behaves like this?
Elasticsearch uses the standard analyzer if no analyzer is specified. Assuming the cameraId field is of text type, uk05-smoking-shelter-carpark will get tokenized into
{
  "tokens": [
    {
      "token": "uk05",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "smoking",
      "start_offset": 5,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "shelter",
      "start_offset": 13,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "carpark",
      "start_offset": 21,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
So a search for "uk05-smoking-shelter-carpark" will match all the documents that contain any of the tokens shown above.
If you want to return only the documents that exactly match the search query, you need to change the data type of cameraId to the keyword type,
OR, if you have not explicitly defined any mapping, you can add .keyword to the cameraId field. This uses the keyword analyzer instead of the standard analyzer (notice the ".keyword" after the cameraId field).
It is better to use a term query if you are searching for an exact term match.
Search Query using match query
{
  "query": {
    "match": {
      "cameraId.keyword": "uk05_smoking_shelter_carpark"
    }
  }
}
Search Query using term query
{
  "query": {
    "term": {
      "cameraId.keyword": "uk05_smoking_shelter_carpark"
    }
  }
}
When you replace - with _, i.e. "uk05_smoking_shelter_carpark", this will get tokenized into
GET /_analyze
{
  "analyzer": "standard",
  "text": "uk05_smoking_shelter_carpark"
}
The token generated will be
{
  "tokens": [
    {
      "token": "uk05_smoking_shelter_carpark",
      "start_offset": 0,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
In this case, the search query will only return the documents that match uk05_smoking_shelter_carpark
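Because keyword fields are not analyzed, the original hyphenated id also works for an exact match once you query the .keyword sub-field; a sketch:
{
  "query": {
    "term": {
      "cameraId.keyword": "uk05-smoking-shelter-carpark"
    }
  }
}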

Elasticsearch : Problem with querying document where "." is included in field

I have an index where some entries are like
{
  "name": "Stefan Drumm"
}
...
{
  "name": "Dr. med. Elisabeth Bauer"
}
The mapping of the name field is
{
  "name": {
    "type": "text",
    "analyzer": "index_name_analyzer",
    "search_analyzer": "search_cross_fields_analyzer"
  }
}
When I use the below query
GET my_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "Stefan Drumm",
              "operator": "AND"
            }
          }
        }
      ],
      "boost": 1.0
    }
  },
  "min_score": 0.0
}
It returns the first document.
But when I try to get the second document using the query below
GET my_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "Dr. med. Elisabeth Bauer",
              "operator": "AND"
            }
          }
        }
      ],
      "boost": 1.0
    }
  },
  "min_score": 0.0
}
it is not returning anything.
Things I can't do:
- change the index
- use the term query
- change the operator to 'OR', because in that case it would return multiple entries, which I don't want
What am I doing wrong, and how can I achieve this by modifying the query?
You have configured different analyzers for indexing and searching (index_name_analyzer and search_cross_fields_analyzer). If these analyzers process the input Dr. med. Elisabeth Bauer in an incompatible way, the search isn't going to match. This is described in more detail in Index and search analysis, as well as in Controlling Analysis.
You don't provide the definition of these two analyzers, so it's hard to guess from your question what they are doing. Depending on the analyzers, it may be possible to preprocess your query string (e.g. by removing .) before executing the search so that the search will match.
You can investigate how analysis affects your search by using the _analyze API, as described in Testing analyzers. For your example, the commands
GET my_index/_analyze
{
  "analyzer": "index_name_analyzer",
  "text": "Dr. med. Elisabeth Bauer"
}
and
GET my_index/_analyze
{
  "analyzer": "search_cross_fields_analyzer",
  "text": "Dr. med. Elisabeth Bauer"
}
should show you how the two analyzers configured for your index treat the target string, which might give you a clue about what's wrong. The response will be something like
{
  "tokens": [
    {
      "token": "dr",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "med",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "elisabeth",
      "start_offset": 9,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "bauer",
      "start_offset": 19,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
For the example output above, the analyzer has split the input into one token per word, lowercased each word, and discarded all punctuation.
My guess would be that index_name_analyzer preserves punctuation, while search_cross_fields_analyzer discards it, so that the tokens won't match. If this is the case, and you can't change the index configuration (as you state in your question), one other option would be to specify a different analyzer when running the query:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "Dr. med. Elisabeth Bauer",
              "operator": "AND",
              "analyzer": "index_name_analyzer"
            }
          }
        }
      ],
      "boost": 1
    }
  },
  "min_score": 0
}
In the query above, the analyzer parameter has been set to override the search analysis to use the same analyzer (index_name_analyzer) as the one used when indexing. What analyzer might make sense to use depends on your setup. Ideally, you should configure the analyzers to align so that you don't have to override at search time, but it sounds like you are not living in an ideal world.
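For completeness, if the index could be changed, one way to avoid the mismatch is to give the field a single analyzer for both sides, since Elasticsearch uses the index analyzer at search time when no search_analyzer is set. A minimal sketch with a hypothetical my_name_analyzer (the real analyzer definitions are not shown in the question, so this is only illustrative):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_name_analyzer"
      }
    }
  }
}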

Elasticsearch wildcard character is not matching numbers

I am searching an Elasticsearch index using the following query string:
curl -XGET 'http://localhost:9200/index/type/_search' -d '{
  "query": {
    "query_string": {
      "default_field": "keyword",
      "query": "file*.tif"
    }
  }
}'
The schema for the keyword field is as follows:
"keyword" : {"type" : "string", "store" : "yes", "index" : "analyzed" }
The problem with the above query is that it doesn't retrieve results for keywords like file001.tif, while file001_copy.tif is retrieved. A match query retrieves results correctly. Is this a limitation of query_string, or am I missing something?
You can see your problem by analyzing the strings that you're indexing:
curl "localhost:9200/_analyze" -d "file001.tif" | python -mjson.tool
{
  "tokens": [
    {
      "end_offset": 7,
      "position": 1,
      "start_offset": 0,
      "token": "file001",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 11,
      "position": 2,
      "start_offset": 8,
      "token": "tif",
      "type": "<ALPHANUM>"
    }
  ]
}
curl "localhost:9200/_analyze" -d "file001_copy.tif" | python -mjson.tool
{
  "tokens": [
    {
      "end_offset": 16,
      "position": 1,
      "start_offset": 0,
      "token": "file001_copy.tif",
      "type": "<ALPHANUM>"
    }
  ]
}
The standard analyzer splits file001.tif into the tokens file001 and tif,
but it leaves file001_copy.tif as a single token. So when you search for file*.tif, it only hits file001_copy.tif, because that's the only token that fits your criteria (a single token containing 'file' followed by 0 or more characters and then 'tif').
You probably want to use a whitespace or keyword analyzer in tandem with a lowercase filter to make it work the way you want, as in the sketch below.
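A sketch of such a mapping in current Elasticsearch syntax, with files as a hypothetical index name; the keyword tokenizer keeps the whole value as one token, and the lowercase filter keeps matching case-insensitive:
PUT files
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "analyzer": "lowercase_keyword"
      }
    }
  }
}
With this analyzer, file001.tif is indexed as the single token file001.tif, so the wildcard pattern file*.tif matches it.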
