Elasticsearch - multi_match together with short queries - elasticsearch

I have query like this (I've removed sorting part because it doesn't matter):
GET _search
{
"query": {
"multi_match": {
"query": "somethi",
"fields": [ "title", "content"],
"fuzziness" : "AUTO",
"prefix_length" : 0
}
}
}
When running this I'm getting results like this:
"hits": [
{
"_index": "test_index",
"_type": "article",
"_id": "2",
"_score": 0.083934024,
"_source": {
"title": "Matching something abc",
"content": "This is a piece of content",
"categories": [
{
"name": "B",
"weight": 4
}
]
},
"sort": [
4,
0.083934024,
"article#2"
]
},
{
"_index": "test_index",
"_type": "article",
"_id": "3",
"_score": 0.18436861,
"_source": {
"title": "Matching something abc",
"content": "This is a piece of content containing something",
"categories": [
{
"name": "C",
"weight": 3
}
]
},
"sort": [
3,
0.18436861,
"article#3"
]
},
...
So far this is what I expect. However, I noticed that if I remove one letter from the query (someth instead of somethi), Elasticsearch returns no results at all.
This seems strange to me. multi_match appears to do partial matching, but it somehow requires a minimum number of characters. The same happens at the start of the word: with omethin I get results, but with omethi I get none.
Is there a setting for the minimum number of characters in a query, or do I need to rewrite my query to achieve what I want? I would like to run a match on multiple fields (title and content in the query above) that allows partial matching together with fuzziness.

You get this behaviour because you have set "fuzziness": "AUTO", which means that for a term longer than 5 characters at most two edits are allowed. In general, the fuzziness parameter tells Elasticsearch to find all terms within a maximum number of changes, where a change is the insertion, deletion or substitution of a single character. With fuzziness it is not possible to allow more than two changes. somethi is two edits away from something, so it matches; someth is three edits away, so it does not.
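To make the edit-distance argument concrete, here is a minimal Python sketch. It is not Elasticsearch code: it uses plain Levenshtein distance and the default AUTO thresholds (terms of 0-2 characters must match exactly, 3-5 characters allow one edit, longer terms allow two). The function names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Edits allowed by "fuzziness": "AUTO" with the default thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_matches(query_term: str, indexed_term: str) -> bool:
    return levenshtein(query_term, indexed_term) <= auto_fuzziness(query_term)


for q in ["somethi", "someth", "omethin", "omethi"]:
    print(q, levenshtein(q, "something"), fuzzy_matches(q, "something"))
```

The two queries that return hits (somethi, omethin) are exactly the ones within two edits of something; the other two need three edits.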
If you need partial matching, you could configure your index with an Edge NGram analyzer and apply it to your title and content fields. You can easily test how it works:
Create an index with the following settings:
PUT http://127.0.0.1:9200/test
{
"settings": {
"analysis": {
"analyzer": {
"edge_ngram_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
And run this query:
curl -X POST \
'http://127.0.0.1:9200/test/_analyze?pretty=true' \
-H 'Content-Type: application/json' \
-d '{
"analyzer" : "edge_ngram_analyzer",
"text" : ["something"]
}'
As a result you'll get:
{
"tokens": [
{
"token": "so",
...
},
{
"token": "som",
...
},
{
"token": "some",
...
},
{
"token": "somet",
...
},
{
"token": "someth",
...
},
{
"token": "somethi",
...
},
{
"token": "somethin",
...
},
{
"token": "something",
...
}
]
}
These are the tokens that edge_ngram_analyzer produces, and they are what your search terms are matched against. With min_gram and max_gram you can configure the minimum and maximum length of a gram.
If you also need to handle cases like omething (a missing letter at the beginning), try the same with an NGram analyzer.
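The difference between the two tokenizers can be sketched in a few lines of Python. This only imitates the token generation for a single word (the helper names are hypothetical) and uses the min_gram 2 / max_gram 10 values from the settings above.

```python
def edge_ngrams(word: str, min_gram: int, max_gram: int) -> list[str]:
    """Prefixes of the word, as an edge_ngram tokenizer would emit them."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


def ngrams(word: str, min_gram: int, max_gram: int) -> list[str]:
    """All substrings of the word whose length is within the given bounds."""
    grams = []
    for start in range(len(word)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(word):
                grams.append(word[start:start + n])
    return grams


edge = edge_ngrams("something", 2, 10)
full = ngrams("something", 2, 10)

print(edge)
print("someth" in edge, "omethi" in edge)  # prefix is covered, inner part is not
print("omethi" in full)                    # the n-gram variant covers inner parts too
```

The n-gram variant produces many more tokens per word, so the index grows accordingly.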

Related

Elastic returns unexpected result from Search using edge_ngram

I am working out how to store my data in Elasticsearch. First I tried fuzzy matching, and while that worked okay, I did not get the results I expected. Afterwards I tried the ngram and then the edge_ngram tokenizer. The edge_ngram tokenizer looked like it works like an autocomplete, which is exactly what I needed. But it still gives unexpected results. I configured min 1 and max 5 to get all results starting with the first letter I search for. While this works, I still get those results as I continue typing.
Example: I have a name field, with documents named The New York Times and The Guardian. When I search for T, both are returned as expected. But the same happens when I search for TT, TTT and so on.
It does not matter whether I execute the search in Kibana or from my application (which uses MultiMatch on all fields). Kibana even shows me that it matched the single letter T.
So what did I miss, and how can I get autocomplete-like results without getting too many of them?
When defining your index mapping, you need to specify search_analyzer as standard. If no search_analyzer is defined explicitly, Elasticsearch uses the field's analyzer at search time as well, so the query text is also split into edge n-grams.
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index Data:
{
"name":"The Guardian"
}
{
"name":"The New York Times"
}
Search Query:
{
"query": {
"match": {
"name": "T"
}
}
}
Search Result:
"hits": [
{
"_index": "69027911",
"_type": "_doc",
"_id": "1",
"_score": 0.23092544,
"_source": {
"name": "The New York Times"
}
},
{
"_index": "69027911",
"_type": "_doc",
"_id": "2",
"_score": 0.20824991,
"_source": {
"name": "The Guardian"
}
}
]
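Why TT matched without search_analyzer can be illustrated with a small Python sketch. It only approximates the two analyzers (edge n-grams of 1 to 5 letters per word, and a lowercasing word splitter standing in for standard); the function names are hypothetical.

```python
import re


def edge_ngram_analyze(text: str, min_gram: int = 1, max_gram: int = 5) -> set[str]:
    """Approximates the autocomplete analyzer: lowercase, letters only, edge n-grams."""
    tokens = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.add(word[:n])
    return tokens


def standard_analyze(text: str) -> set[str]:
    """Rough stand-in for the standard analyzer: lowercase whole words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(doc: str, query: str, search_analyzer) -> bool:
    """A match query hits when any query token is among the indexed tokens."""
    return bool(edge_ngram_analyze(doc) & search_analyzer(query))


doc = "The Guardian"
print(edge_ngram_analyze("TT"))                  # query is split into t and tt
print(matches(doc, "TT", edge_ngram_analyze))    # t alone is enough to match
print(matches(doc, "TT", standard_analyze))      # tt is not an indexed token
print(matches(doc, "T", standard_analyze))       # t is still found
```

With the same analyzer on both sides, every longer query still contains the single-letter gram t, which is what keeps the documents matching.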

How to get exact match phrase more than one

Below is the query to get the exact match
GET courses/_search
{
"query": {
"term" : {
"name.keyword": "Anthropology 230"
}
}
}
I need to find both Anthropology 230 and Anthropology 250.
How can I get the exact match for more than one value?
You can try match, match_phrase or match_phrase_prefix.
Using match,
GET courses/_search
{
"query": {
"match" : {
"name" : "Anthropology 230"
}
},
"_source": "name"
}
Using match_phrase,
GET courses/_search
{
"query": {
"match_phrase" : {
"name" : "Anthropology"
}
},
"_source": "name"
}
OR using regexp,
GET courses/_search
{
"query": {
"regexp" : {
"name" : "Anthropology [0-9]{3}"
}
},
"_source": "name"
}
The mistake is that you are using a term query on a keyword field. Neither of them is analyzed, so Elasticsearch looks for the exact search string in the inverted index.
What you should do instead is query a text field, which you will have anyway if you have not defined your own mapping. I am assuming that is the case, because your query uses .keyword, a sub-field that is created automatically when you don't define a mapping.
Now you can use the match query below. It is analyzed with the standard analyzer, which splits the text on whitespace, so the tokens anthropology and 230 (or 250) are generated for your two sample docs.
A simple and efficient query that returns both docs:
{
"query": {
"match" : {
"name" : "Anthropology 230"
}
}
}
And search result
"hits": [
{
"_index": "matchterm",
"_type": "_doc",
"_id": "1",
"_score": 0.8754687,
"_source": {
"name": "Anthropology 230"
}
},
{
"_index": "matchterm",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"name": "Anthropology 250"
}
}
]
The above query matched both docs because it created two tokens, anthropology and 230, and anthropology is present in both documents.
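The difference between the two query types can be imitated in plain Python. This is only a sketch: the standard analyzer is approximated by lowercasing and splitting on whitespace, and the Biology 101 document is a made-up extra to show a non-match.

```python
def standard_tokens(text: str) -> list[str]:
    """Rough stand-in for the standard analyzer."""
    return text.lower().split()


def term_query(docs: list[str], value: str) -> list[str]:
    """term on a keyword field: the whole stored string must be identical."""
    return [doc for doc in docs if doc == value]


def match_query(docs: list[str], text: str) -> list[str]:
    """match on a text field: any shared token is enough (default OR operator)."""
    query_tokens = set(standard_tokens(text))
    return [doc for doc in docs if query_tokens & set(standard_tokens(doc))]


docs = ["Anthropology 230", "Anthropology 250", "Biology 101"]  # last one is hypothetical

print(term_query(docs, "Anthropology 230"))
print(match_query(docs, "Anthropology 230"))
```

The term lookup returns only the identical string, while the match lookup returns every document that shares the token anthropology.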
You should definitely read about the analysis process, and you can also try the analyze API to see the tokens generated for any text.
Analyze API output for your text
POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
"analyzer": "standard",
"text": "Anthropology 250"
}
{
"tokens": [
{
"token": "anthropology",
"start_offset": 0,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "250",
"start_offset": 13,
"end_offset": 16,
"type": "<NUM>",
"position": 1
}
]
}
Assuming you may have more 'Anthropology nnn' items, this should do what you need (the clauses go in should, because a single name value can never satisfy both term queries at once):
"query":{
"bool":{
"should":[
{"term": {"name.keyword":"Anthropology 230"}},
{"term": {"name.keyword":"Anthropology 250"}}
]
}
}

ngram matching gives same score to less relevant documents

I am searching for Bob Smith in my Elasticsearch index. The results Bob Smith and Bobbi Smith both come back in the response with the same score. I want Bob Smith to have a higher score so that it appears first in my result set. Why are the scores equal?
Here is my query
{
"query": {
"query_string": {
"query": "Bob Smith",
"fields": [
"text_field"
]
}
}
}
Below are my index's settings. I am using the ngram token filter described here: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
{
"contacts_5test": {
"aliases": {},
"mappings": {
"properties": {
"text_field": {
"type": "text",
"term_vector": "yes",
"analyzer": "ngram_filter_analyzer"
}
}
},
"settings": {
"index": {
"number_of_shards": "1",
"provided_name": "contacts_5test",
"creation_date": "1588987227997",
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": "4",
"max_gram": "4"
}
},
"analyzer": {
"ngram_filter_analyzer": {
"filter": [
"lowercase",
"ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "HqOXu9bNRwCHSeK39WWlxw",
"version": {
"created": "7060199"
}
}
}
}
}
Here are the results from my query...
"hits": [
{
"_index": "contacts_5test",
"_type": "_doc",
"_id": "1",
"_score": 0.69795835,
"_source": {
"text_field": "Bob Smith"
}
},
{
"_index": "contacts_5test",
"_type": "_doc",
"_id": "2",
"_score": 0.69795835,
"_source": {
"text_field": "Bobbi Smith"
}
}
]
If I instead search for Bobbi Smith, elastic returns both documents, but with a higher score for Bobbi Smith. This makes more sense.
I was able to reproduce your issue. The reason is your ngram_filter, which doesn't produce any token for bob: the standard tokenizer creates the token bob, but it is then dropped by ngram_filter because it is shorter than the min_gram of 4.
I also tried lowering min_gram to 3. That does create the token, but then both bob and bobbi produce the same bob token, so the score for both documents is still the same.
When you search for Bobbi Smith, on the other hand, the tokens generated from bobbi are present in only one document, hence that document gets the higher score.
Note: please use the analyze API and the explain API to inspect the tokens generated and how they are matched. This will help you understand the issue and my explanation in detail.
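The token loss is easy to see in a small Python imitation of the analyzer from the question (whitespace split and lowercase standing in for the standard tokenizer, followed by an n-gram filter with min_gram and max_gram of 4). The helper names are hypothetical.

```python
def ngram_filter(tokens: list[str], min_gram: int = 4, max_gram: int = 4) -> list[str]:
    """Replaces each token by its n-grams; tokens shorter than min_gram vanish."""
    grams = []
    for token in tokens:
        for n in range(min_gram, max_gram + 1):
            for start in range(len(token) - n + 1):
                grams.append(token[start:start + n])
    return grams


def analyze(text: str) -> list[str]:
    return ngram_filter(text.lower().split())


print(analyze("Bob Smith"))    # bob is too short, only the smith grams remain
print(analyze("Bobbi Smith"))

# How many grams of each query are found in each document.
for query in ["Bob Smith", "Bobbi Smith"]:
    for doc in ["Bob Smith", "Bobbi Smith"]:
        shared = set(analyze(query)) & set(analyze(doc))
        print(query, "->", doc, len(shared))
```

The query Bob Smith shares the same two grams with both documents, while Bobbi Smith shares four grams with one document and only two with the other.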

Cross Field Search with Multiple Complete and Incomplete Phrases in Each Field

Example data:
PUT /test/test/1
{
"text1":"cats meow",
"text2":"12345",
"text3":"toy"
}
PUT /test/test/2
{
"text1":"dog bark",
"text2":"98765",
"text3":"toy"
}
And an example query:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats toy",
"type" : "cross_fields"
}
}
}
Returns the cat hit first and then the dog, which is what I want.
BUT when you query cat toy, both the cat and dog have the same relevance score. I want to be able to take into consideration the prefix of that word (and maybe a few other words inside that field), and run cross_fields.
So if I search:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "phrase_prefix"
}
}
}
or
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats",
"type" : "phrase_prefix"
}
}
}
I should get the cat/ID 1, but I did not.
I found that using cross_fields achieves multi-word phrases, but not multi-incomplete phrases. And phrase_prefix achieves incomplete phrases, but not multiple incomplete phrases...
Sifting through the documentation really isn't helping me discover how to combine these two.
Yeah, I had to apply an analyzer...
The analyzer has to be applied to the fields when creating the index, before you add any data. I couldn't find an easier way to do this after the data has been added.
The solution I found is to explode every word into all of its prefixes so cross_fields can do its magic. You can learn more about the use of edge-ngram here.
So instead of cross_fields searching just the word cats, it now searches c, ca, cat and cats, and likewise for every following word. The text1 field will look like this to Elasticsearch: c ca cat cats m me meo meow.
~~~
Here are the steps to make the above question example work:
First you create and name the analyzer. To learn a bit more what the filter's values mean, I recommend you take a look at this.
PUT /test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Then I attached this analyzer to each field.
I changed the text1 to match the field I was applying this to.
PUT /test/_mapping/test
{
"test": {
"properties": {
"text1": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
}
I ran GET /test/_mapping to be sure everything worked.
Then to add the data:
POST /test/test/_bulk
{ "index": { "_id": 1 }}
{ "text1": "cats meow", "text2": "12345", "text3": "toy" }
{ "index": { "_id": 2 }}
{ "text1": "dog bark", "text2": "98765", "text3": "toy" }
And the search!
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "cross_fields"
}
}
}
Which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.70778143,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.70778143,
"_source": {
"text1": "cats meow",
"text2": "12345",
"text3": "toy"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.1278426,
"_source": {
"text1": "dog bark",
"text2": "98765",
"text3": "toy"
}
}
]
}
}
This creates contrast between the two when you search cat toy, whereas before the score was the same. Now the cat hit has a higher score, as it should. This is achieved by indexing every prefix (up to 20 characters in this case) of each word and then letting cross_fields judge how relevant the data is.
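A rough Python sketch of why the two documents now differ. It assumes every field is indexed with the autocomplete analyzer and that the query is split into whole words only; real cross_fields scoring is more involved, and the function names here are invented for illustration.

```python
def autocomplete(text: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Lowercase the words and emit every prefix, like the edge_ngram filter."""
    tokens = []
    for word in text.lower().split():
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


def matched_terms(doc: dict[str, str], query: str) -> list[str]:
    """Query words found in any field of the document (term-centric view)."""
    indexed = set()
    for value in doc.values():
        indexed.update(autocomplete(value))
    return [term for term in query.lower().split() if term in indexed]


cat = {"text1": "cats meow", "text2": "12345", "text3": "toy"}
dog = {"text1": "dog bark", "text2": "98765", "text3": "toy"}

print(" ".join(autocomplete(cat["text1"])))
print(matched_terms(cat, "cat toy"))
print(matched_terms(dog, "cat toy"))
```

The cat document covers both query words, the dog document only toy, which is the contrast the scores above reflect.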

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas". On searching, both documents get the same score. I want the document in which casa appears earlier (document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. I am also using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
"query": {
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
}
}
}
}
I'm assuming the values you gave are in the name field, and that this field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
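The must/should interplay can be mimicked with a toy scoring function in Python. The score constants are arbitrary, and lowercasing the stored value before the prefix check is an assumption made for this sketch; it is not how Elasticsearch computes scores.

```python
from typing import Optional


def bool_score(name: str, query: str,
               must_score: float = 1.0, should_boost: float = 1.0) -> Optional[float]:
    """Toy model: must decides inclusion, should only adds to the score."""
    words = name.lower().split()
    # must clause: some word has to start with the query (edge n-gram style match)
    if not any(word.startswith(query) for word in words):
        return None  # document is excluded
    score = must_score
    # should clause: the whole field value starts with the query
    if name.lower().startswith(query):
        score += should_boost
    return score


for name in ["Casa Road", "Jalan Casa", "Other Street"]:
    print(name, bool_score(name, "cas"))
```

Other Street is a made-up third document that shows the exclusion case.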
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and, also, the number of terms from the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that is using term_vector set to with_positions, and edgeNGram and a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
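The formula inside the script, (wordCount - position) / wordCount, can be checked by hand with a short Python sketch. It assumes whitespace-separated words and zero-based positions, and only reproduces the script's contribution, not the term query score it is added to.

```python
def position_score(text: str, prefix: str) -> float:
    """(word_count - position) / word_count for the first word starting with prefix."""
    words = text.lower().split()
    for position, word in enumerate(words):
        if word.startswith(prefix):
            return (len(words) - position) / len(words)
    return 0.0


print(position_score("Casa Road", "cas"))   # match at position 0 of 2 words
print(position_score("Jalan Casa", "cas"))  # match at position 1 of 2 words
```

The two values differ by 0.5, which is the same gap as between the two _score values above; that is consistent with boost_mode sum adding the script result to an otherwise equal query score.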
