Elasticsearch search result relevance issue

Why does the match query return less relevant results first? I have an index field named normalized. Its mapping is:
"normalized": {
  "type": "text",
  "analyzer": "autocomplete"
}
settings for this field are:
"analysis": {
  "filter": {
    "autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": "1",
      "max_gram": "20"
    }
  },
  "analyzer": {
    "autocomplete": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "asciifolding",
        "autocomplete_filter"
      ]
    }
  }
}
As far as I know, this produces lowercased, ASCII-folded edge n-gram tokens, e.g. MOUSE = m, mo, mou, mous, mouse.
The problem is that a request like:
{
'query': {
'bool': {
'must': {
'match': {
'normalized': 'simag'
}
}
}
}
}
returns results like
"siman siman service"
"mgr simona simunkova simiki"
"Siman - SIMANS"
"simunek simunek a simunek"
.....
But SIMAG, which contains all the letters of the search term, is nowhere near the top.
How can I make the most relevant results be the words that contain all the letters of the query, ranked ahead of tokens that contain only some of them?
Hope somebody understands what I need.
Thanks.
PS: I am not sure but what about this query:
{
'query': {
'bool': {
'should': [
{'term': {'normalized': 'simag'}},
{'match': {'normalized': 'simag'}}
]
}
}
}
Does it make sense compared to the previous query?

Please note that the match query is analyzed, meaning the same analyzer that was used at index time for the field is also applied to the query string at search time.
In your case, you applied the autocomplete analyzer to your normalized field, and as you mentioned, it generates the tokens below for MOUSE:
MOUSE = m, mo, mou, mous, mouse
Similarly, if you search for mouse using a match query on the same field, it searches for the query strings m, mo, mou, mous, mouse. Hence documents containing words like mousee or mouser also match, because at index time those words produced tokens that match the tokens generated from the search term.
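You can verify this yourself with the _analyze API (the index name my_index is a placeholder):
GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "mouse"
}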
Read more about the match query on the Elastic site: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html. The first line there explains your search results:
match queries accept text/numerics/dates, analyzes them, and constructs a query
If you want to dig deeper into how your search query matches the documents and how their scores are computed, use the explain API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
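For example, a minimal explain call might look like this (the index name my_index and document ID 1 are placeholders):
GET my_index/_explain/1
{
  "query": {
    "match": {
      "normalized": "simag"
    }
  }
}
As for ranking words that contain all the letters first: one common approach, sketched here rather than taken from the answer above, is to keep the edge n-gram analyzer at index time but use a plain analyzer at search time. The query simag then stays a single token and only matches documents whose indexed n-grams contain it, i.e. words starting with simag:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "autocomplete_filter"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "normalized": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}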

Related

Atlas Search Index partial match

I have a test collection with these two documents:
{ _id: ObjectId("636ce11889a00c51cac27779"), sku: 'kw-lids-0009' }
{ _id: ObjectId("636ce14b89a00c51cac2777a"), sku: 'kw-fs66-gre' }
I've created a search index with this definition:
{
"analyzer": "lucene.standard",
"searchAnalyzer": "lucene.standard",
"mappings": {
"dynamic": false,
"fields": {
"sku": {
"type": "string"
}
}
}
}
If I run this aggregation:
[{
$search: {
index: 'test',
text: {
query: 'kw-fs',
path: 'sku'
}
}
}]
Why do I get 2 results? I only expected the one with sku: 'kw-fs66-gre' 😬
During indexing, the standard analyzer breaks the string "kw-lids-0009" into 3 tokens [kw][lids][0009], and similarly tokenizes "kw-fs66-gre" as [kw][fs66][gre]. When you query for "kw-fs", the same analyzer tokenizes the query as [kw][fs], and so Lucene matches both documents, as both have the [kw] token in the index.
To get the behavior you're looking for, you should index the sku field as type autocomplete and use the autocomplete operator in your $search stage instead of text.
You're still getting 2 results because of the tokenization, i.e., you're still matching on [kw] in two documents. If you search for "fs66", you'll get a single match only. Results are scored based on relevance; they are not filtered. You can add {$project: {score: { $meta: "searchScore" }}} to your pipeline and see the difference in score between the matching documents.
If you are looking to get exact matches only, you can look at using the keyword analyzer, or a custom analyzer that strips the dashes, so you deal with a single token per field and not 3.
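A sketch of what the autocomplete approach might look like, reusing the index and pipeline from the question (the edgeGram settings below are assumptions to tune for your data):
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "sku": {
        "type": "autocomplete",
        "tokenization": "edgeGram",
        "minGrams": 2,
        "maxGrams": 15
      }
    }
  }
}
[{
  $search: {
    index: 'test',
    autocomplete: {
      query: 'kw-fs',
      path: 'sku'
    }
  }
}]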

Elastic query bool must match issue

Below is the query part of an Elastic GET API call run via the command line inside an OpenShift pod. I get all the matching elements as well as non-matching ones in the fetch of 2000 documents. How can I limit the results to only the matching elements?
I want to specifically get only documents with {"kubernetes.container_name":"xyz"}.
Any suggestions will be appreciated.
-d ' {\"query\": { \"bool\" :{\"must\" :{\"match\" :{\"kubernetes.container_name\":\"xyz\"}},\"filter\" : {\"range\": {\"#timestamp\": {\"gte\": \"now-2m\",\"lt\": \"now-1m\"}}}}},\"_source\":[\"#timestamp\",\"message\",\"kubernetes.container_name\"],\"size\":2000}'"
For exact matches there are two things you would need to do:
Make use of Term Queries
Ensure that the field is of the keyword datatype.
The text datatype goes through an analysis phase.
For example, if your data is This is a beautiful day, then during ingestion the text datatype breaks the sentence into tokens, lowercases them [this, is, a, beautiful, day] and adds them to the inverted index. This happens via the Standard Analyzer, which is the default analyzer applied to text fields.
So when you query, the analyzer is applied again at query time, and Elasticsearch searches for documents in which those words are present. As a result you see documents appearing even without an exact match.
In order to do an exact match, you need to use a keyword field, as it does not go through the analysis phase.
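You can see this tokenization for yourself with the _analyze API (just a quick check, not part of the fix itself):
GET _analyze
{
  "analyzer": "standard",
  "text": "This is a beautiful day"
}
This returns the tokens [this, is, a, beautiful, day].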
What I'd suggest is to create a keyword sibling field for the text field you have, in the manner below, and then re-ingest all the data:
Mapping:
PUT my_sample_index
{
"mappings": {
"properties": {
"kubernetes":{
"type": "object",
"properties": {
"container_name": {
"type": "text",
"fields":{ <--- Note this
"keyword":{ <--- This is container_name.keyword field
"type": "keyword"
}
}
}
}
}
}
}
}
Note that I'm assuming you are making use of object type.
Request Query:
POST my_sample_index/_search
{
"query":{
"bool": {
"must": [
{
"term": {
"kubernetes.container_name.keyword": {
"value": "xyz"
}
}
}
]
}
}
}
Hope this helps!
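Applied back to the original command, the payload might then look like this (a sketch; it assumes the container_name.keyword sub-field above has been created and the data re-ingested):
-d '{
  "query": {
    "bool": {
      "must": { "term": { "kubernetes.container_name.keyword": "xyz" } },
      "filter": { "range": { "@timestamp": { "gte": "now-2m", "lt": "now-1m" } } }
    }
  },
  "_source": ["@timestamp", "message", "kubernetes.container_name"],
  "size": 2000
}'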

Elasticsearch find missing word in phrase

How can I use Elasticsearch to find the missing word in a phrase? For example, I want to find all documents which contain the pattern make * great again. I tried using a wildcard query, but it returned no results:
{
"fields": [
"file_name",
"mime_type",
"id",
"sha1",
"added_at",
"content.title",
"content.keywords",
"content.author"
],
"highlight": {
"encoder": "html",
"fields": {
"content.content": {
"number_of_fragments": 5
}
},
"order": "score",
"tags_schema": "styled"
},
"query": {
"wildcard": {
"content.content": "make * great again"
}
}
}
If I put in a word and use a match_phrase query I get results, so I know I have data which matches the pattern.
Which type of query should I use? Or do I need to add some kind of custom analyzer to the field?
Wildcard queries operate on terms, so if you use one on an analyzed field, it will actually try to match every term in that field separately. In your case, you can create a not_analyzed sub-field (such as content.content.raw) and run the wildcard query on that. Or just map the actual field as not analyzed, if you don't need to query it in other ways.
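On current Elasticsearch versions, where not_analyzed strings are expressed as keyword fields, a sketch of that sub-field approach could look like this (index and field names taken from the question; keep in mind keyword values have length limits, so this suits short content fields best):
PUT your_index
{
  "mappings": {
    "properties": {
      "content": {
        "properties": {
          "content": {
            "type": "text",
            "fields": {
              "raw": { "type": "keyword" }
            }
          }
        }
      }
    }
  }
}
GET your_index/_search
{
  "query": {
    "wildcard": {
      "content.content.raw": "*make * great again*"
    }
  }
}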

How to match parts of a word in elasticsearch?

How can I match parts of a word to the parent word? For example: I need to match "eese" or "heese" to the word "cheese".
The best way to achieve this is using an edgeNGram token filter combined with two reverse token filters. So, first you need to define a custom analyzer called reverse_analyzer in your index settings like below. Then you can see that I've declared a string field called your_field with a sub-field called suffix which has our custom analyzer defined.
PUT your_index
{
"settings": {
"analysis": {
"analyzer": {
"reverse_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase", "reverse", "substring", "reverse"]
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 10
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"your_field": {
"type": "string",
"fields": {
"suffix": {
"type": "string",
"analyzer": "reverse_analyzer"
}
}
}
}
}
}
}
Then you can index a test document with "cheese" inside, like this:
PUT your_index/your_type/1
{"your_field": "cheese"}
When this document is indexed, the your_field.suffix field will contain the following tokens:
e
se
ese
eese
heese
cheese
Under the hood what is happening when indexing cheese is the following:
The keyword tokenizer will emit a single token => cheese
The lowercase token filter will put the token in lowercase => cheese
The reverse token filter will reverse the token => eseehc
The substring token filter will produce different tokens of length 1 to 10 => e, es, ese, esee, eseeh, eseehc
Finally, the second reverse token filter will reverse again all tokens => e, se, ese, eese, heese, cheese
Those are all the tokens that will be indexed.
So we can finally search for eese (or any suffix of cheese) in that sub-field and find our match:
POST your_index/_search
{
"query": {
"match": {
"your_field.suffix": "eese"
}
}
}
=> Yields the document we've just indexed above.
You can do it in two ways:
If you need it to happen only for some searches, you can simply pass
*eese* or *heese*
in the search box, i.e. add a * at the beginning and end of your search word. If you need it for every search, wrap the query string, e.g. in Ruby:
"*#{params[:query]}*"
This will match against your parent word and return the result.
There are multiple ways to do this:
The analyzer approach - here you use an ngram tokenizer to generate sub-tokens of all the words. Hence for the word "cheese" -> ["chee", "hees", "eese", "cheese"] and all kinds of substrings would be generated. The index size goes up, but search speed is optimized.
The wildcard query approach - in this approach, a scan happens over the inverted index (see the sketch after this list). This does not occupy additional index space, but it takes more time at search time.
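A minimal wildcard query for the second approach might look like this (your_field is a placeholder; the leading * is what makes it expensive, since it forces a scan over the terms):
POST your_index/_search
{
  "query": {
    "wildcard": {
      "your_field": "*eese*"
    }
  }
}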

problems with phrase matching in elasticsearch

I'm trying to perform Phrase matching using elasticsearch.
Here is what I'm trying to accomplish:
data -
1: { "test": { "title": "text1 text2" } }
2: { "test": { "title": "text3 text4" } }
3: { "test": { "title": "text5" } }
4: { "test": { "title": "text6" } }
Search terms:
If I look up "text0 text1 text2 text3" - it should return #1 (the full title "text1 text2" appears in the query, in order)
If I look up "text6 text5 text4 text3" - it should return #4 and #3, but not #2, since "text3 text4" does not appear in the same order.
Here is what I've tried:
setting the index_analyzer to keyword and the search_analyzer to standard
also creating custom tokens
But none of my solutions lets me match a substring of the search query against a keyword in the document.
If anyone has written similar queries, could you share how the mappings are configured and what kind of query is used?
What I see here is this: you want your search to match on any tokens sent from the query, and if those tokens do match, the match must be exact against the title.
This means that indexing your title field as keyword would get you that mandatory exact match. However, the standard analyzer at search time would never match titles containing spaces, as you'd have the index token ["text1 text2"] and the search tokens ["text1", "text2"]. You can't use a phrase match with any slop value either, or your token-order requirement will be ignored.
So, what you really need is to generate keyword tokens during indexing, but generate shingles whenever you search. Shingles maintain order, and if one of them matches, consider it a go. I would set the filter to not output unigrams, but do allow unigrams if no shingles can be formed. This means that if you have just one word, it will output that token, but if it can combine your search words into shingled tokens of various lengths, it will not emit single-word tokens.
PUT your_index
{
  "settings":
  {
"analysis": {
"filter": {
"my_shingle": {
"type": "shingle",
"max_shingle_size": 50,
"output_unigrams": false
}
},
"analyzer": {
"my_shingler": {
"filter": [
"lowercase",
"asciifolding",
"my_shingle"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
}
}
}
Then you just want to set your type mapping to use the keyword analyzer for index and the `my_shingler` analyzer for search.
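On current Elasticsearch versions that mapping could be sketched like this (the field name title comes from the question; the index_analyzer/search_analyzer settings of the answer's era map to analyzer/search_analyzer today):
PUT your_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "keyword",
      "search_analyzer": "my_shingler"
    }
  }
}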
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html
