elasticsearch: How to rank first appearing words or phrases higher - elasticsearch

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents have same scores. I want the one with casa appearing earlier (i.e. document 1 here) and to rank first in my query output.
I am using an edgeNGram Analyzer. Also I am using aggregations so I cannot use the normal sorting that happens after querying.

You can use the Bool Query to boost the items that start with the search query:
{
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
},
}
}
I'm assuming the values you gave is in the name field, and that that field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.

This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and, also, the number of terms from the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that is using term_vector set to with_positions, and edgeNGram and a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}

Related

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't want weighted username searches yet, so I would expect it wouldn't be to hard to find resources on how do this. But in the end, I came across NGrams and lot of outdated Elasticsearch tutorials and I completely lost track on the best practice on how to do this.
This is now my setup, but it is really bad because it matches so much unrelated usernames:
{
"settings": {
"index" : {
"max_ngram_diff": "11"
},
"analysis": {
"analyzer": {
"username_analyzer": {
"tokenizer": "username_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"username_tokenizer": {
"type": "ngram",
"min_gram": "1",
"max_gram": "12"
}
}
}
},
"mappings": {
"properties": {
"_all" : { "enabled" : false },
"username": {
"type": "text",
"analyzer": "username_analyzer"
}
}
}
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user db and users should be able to search for eachother, nothing to fancy.
If you want to search for exact usernames, then you can use the term query
Term query returns documents that contain an exact term in a provided field. If you have not defined any explicit index mapping, then you need to add .keyword to the field. This uses the keyword analyzer instead of the standard analyzer.
There is no need to use an n-gram tokenizer if you want to search for the exact term.
Adding a working example with index data, index mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Index Data:
{
"username": "Jack"
}
{
"username": "John"
}
Search Query:
{
"query": {
"term": {
"username.keyword": "Jack"
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
Edit 1:
To match for similar terms, you can use the fuzziness parameter along with the match query
{
"query": {
"match": {
"username": {
"query": "someting",
"fuzziness":"auto"
}
}
}
}
Search Result will be
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]

How to find word 'food2u' by search 'food' in Elasticsearch?

I am a rookie who just started learning elasticsearch,And I want to find word like 'food2u' by search keyword 'food'.But I can only get the results like 'Food Repo','Give Food' etc. The field's Mapping is 'text' and this is my query
GET api/_search
{"query": {
"match": {
"Name": {
"query": "food"
}
}
},
"_source":{
"includes":["Name"]
}
}
You are getting the results like 'Food Repo','Give Food', as the text field uses a standard analyzer if no analyzer is specified. Food Repo gets tokenized into food and repo. Similarly Give Food gets tokenized into give and food.
But food2u gets tokenized into food2u. Since there is no matching token ("food"), you will not get the food2u document.
You need to use edge_ngram tokenizer to do a partial text match.
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 4,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"name":"food2u"
}
Search Query:
{
"query": {
"match": {
"name": "food"
}
}
}
Search Result:
"hits": [
{
"_index": "67552800",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "food2u"
}
}
]
If you don't want to change the mapping, you can even use a wildcard query to return the matching documents
{
"query": {
"wildcard": {
"Name": {
"value": "food*"
}
}
}
}
OR you can even use query_string with wildcard
{
"query": {
"query_string": {
"query": "food*",
"fields": [
"Name"
]
}
}
}

How to do partial search and get relevant score in Elasticsearch

I am new to Elasticsearch, trying to do some search.
I have names of objects like :
Homework
work
jobroles
jobs
I am using wildcard query, but its returning score of 1.0 for each docs.
I want score based on how well it matched. Ex
Ex. If I type
work
score of work > homework
Its a good question and directly you can't get the exact match on top, what you need is ngram analyzer which provides the partial matches and another field which stores the exact tokens in lowercase(text field with standard analyzer will solve it).
I've reproduced your issue and solved it using above mentioned approach, Please refer my blog on autocomplete and my this SO answer for in-depth read of various autocomplete/partial searches and why/what/how part of it.
Working example
Create index mapping
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
},
"index.max_ngram_diff" : 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"title_lowercase" :{
"type" : "text"
}
}
}
}
Index your sample docs
{
"title" : "Homework",
"title_lowercase" : "Homework"
}
{
"title" : "work",
"title_lowercase" : "work"
}
Search query
{
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "work"
}
}
},
{
"match": {
"title_lowercase": {
"query": "work"
}
}
}
]
}
}
}
And expected result
"hits": [
{
"_index": "internaledge",
"_type": "_doc",
"_id": "1",
"_score": 0.9926754, /note score of `work` is much higher than`homework`
"_source": {
"title": "work",
"title_lowercase": "work"
}
},
{
"_index": "internaledge",
"_type": "_doc",
"_id": "2",
"_score": 0.2995283,
"_source": {
"title": "Homework",
"title_lowercase": "Homework"
}
}
]

Elasticsearch: why exact match has lower score than partial match

my question
I search the word form, but the exact match word form is not the fisrt in result. Is there any way to solve this problem?
my search query
{
"query": {
"match": {
"word": "form"
}
}
}
result
word score
--------------------------
formulation 10.864353
formaldehyde 10.864353
formless 10.864353
formal 10.84412
formerly 10.84412
forma 10.84412
formation 10.574185
formula 10.574185
formulate 10.574185
format 10.574185
formally 10.574185
form 10.254687
former 10.254687
formidable 10.254687
formality 10.254687
formative 10.254687
ill-formed 10.054999
in form 10.035862
pro forma 9.492243
POST my_index/_analyze
The word form in search has only one token form.
In index, form tokens are ["f", "fo", "for", "form"]; formulation tokens are ["f", "fo", ..., "formulatio", "formulation"].
my config
filter
"edgengram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
analyzer
"analyzer": {
"abc_vocab_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"keyword_repeat",
"lowercase",
"asciifolding",
"edgengram_filter",
"unique"
]
},
"abc_vocab_search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"keyword_repeat",
"lowercase",
"asciifolding",
"unique"
]
}
}
mapping
"word": {
"type": "text",
"analyzer": "abc_vocab_analyzer",
"search_analyzer": "abc_vocab_search_analyzer"
}
You get the result in the way you see because you've implemented edge-ngram filter and that form is a sub-string of the words similar to it. Basically in inverted index it would also store the document ids that contains formulation, formal etc.
Therefore, your relevancy also gets computed in that way. You can refer to this link and I'd specifically suggest you to go through sections Default Similarity and BM25. Although the present default similarity is BM25, that link would help you understand how scoring works.
You would need to create another sibling field which you can apply in a should clause. You can go ahead and create keyword sub-field with Term Query but you need to be careful about case-sensitivity.
Instead, as mentioned by #Val, you can create a sibling of text field with standard analyzer.
Mapping:
{
"word":{
"type": "text",
"analyzer": "abc_vocab_analyzer",
"search_analyzer": "abc_vocab_search_analyzer"
"fields":{
"standard":{
"type": "text"
}
}
}
}
Query:
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"word": "form"
}
}
],
"should": [ <---- Note this
{
"match": {
"word.standard": "form"
}
}
]
}
}
}
Let me know if this helps!
Because your type for this field is text which means ES will do full-text search analysis on this field. And ES search process is kind of finding results most similar to the word you have given. To accurately search word "form", change your search method to match_phrase Furthermore, you could also read articles below to learn more about different ES search methods:
https://www.cnblogs.com/yjf512/p/4897294.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
Looks like some issue in your custom analyzer, I created my custom autocomplete analyzer, which uses edge_ngram and lowercase filter and it works fine for me for your query and returns me exact match on top and this is how Elasticsearch works, exact matches always have more score., So no need to explicitly create another field and boost that, As Elasticsearch by default boosts the exact match on tokens match.
Index def
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index few doc
{
"title" : "formless"
}
{
"title" : "form"
}
{
"title" : "formulation"
}
Search query on title field as provided in the question
{
"query": {
"match": {
"title": "form"
}
}
}
Search result with exact match having highest score
"hits": [
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "1",
"_score": 0.16410133,
"_source": {
"title": "form"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "2",
"_score": 0.16410133,
"_source": {
"title": "formulation"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "3",
"_score": 0.16410133,
"_source": {
"title": "formaldehyde"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "4",
"_score": 0.16410133,
"_source": {
"title": "formless"
}
}
]

Return only exact matches (substrings) in full text search (elasticsearch)

I have an index in elasticsearch with a 'title' field (analyzed string field). If I have the following documents indexed:
{title: "Joe Dirt"}
{title: "Meet Joe Black"}
{title: "Tomorrow Never Dies"}
and the search query is "I want to watch the movie Joe Dirt tomorrow"
I want to find results where the full title matches as a substring of the search query. If I use a straight match query, all of these documents will be returned because they all match one of the words. I really just want to return "Joe Dirt" because the title is an exact match substring of the search query.
Is that possible in elasticsearch?
Thanks!
One way to achieve this is as follows :
1) while indexing index title using keyword tokenizer
2) While searching use shingle token-filter to extract substring from the query string and match against the title
Example:
Index Settings
put test
{
"settings": {
"analysis": {
"analyzer": {
"substring": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"substring"
]
},
"exact": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
},
"filter": {
"substring": {
"type":"shingle",
"output_unigrams" : true
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"analyzer": "exact"
}
}
}
}
}
}
}
Index Documents
put test/movie/1
{"title": "Joe Dirt"}
put test/movie/2
{"title": "Meet Joe Black"}
put test/movie/3
{"title": "Tomorrow Never Dies"}
Query
post test/_search
{
"query": {
"match": {
"title.raw" : {
"analyzer": "substring",
"query": "Joe Dirt tomorrow"
}
}
}
}
Result :
"hits": {
"total": 1,
"max_score": 0.015511602,
"hits": [
{
"_index": "test",
"_type": "movie",
"_id": "1",
"_score": 0.015511602,
"_source": {
"title": "Joe Dirt"
}
}
]
}

Resources