I've been using Elasticsearch for a little while with the goal of building a search engine, and I'm interested in manually changing the IDF (inverse document frequency) of each term to match the values one can measure from the Google Books unigrams.
In order to do that I plan on doing the following:
1) Use only 1 shard (so IDFs are not computed for every shard and they are "global")
2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index
curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -d '{
"fields" : ["content"],
"offsets" : true,
"term_statistics" : true
}'
3) Use the Google Books unigram model to "rescale" the ttf for every term.
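To make step 3 concrete, here is a minimal sketch of how a per-term boost could be derived as the ratio between a reference IDF and the IDF computed from the index. It assumes the BM25 form of IDF used by recent Lucene versions (older versions defaulted to classic TF/IDF, whose IDF formula differs), and the function names and the way reference document frequencies are obtained from the unigram data are hypothetical.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF in the BM25 form: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def term_boost(index_df: int, index_docs: int, ref_df: int, ref_docs: int) -> float:
    """Factor that rescales the index IDF of a term to the reference IDF.

    index_df / index_docs come from the index (e.g. term statistics),
    ref_df / ref_docs from the reference corpus (an assumption: however
    you choose to derive them from the unigram counts).
    """
    return bm25_idf(ref_df, ref_docs) / bm25_idf(index_df, index_docs)


# Hypothetical numbers: the term is common in the index but rare in the reference corpus.
print(term_boost(index_df=500, index_docs=1000, ref_df=10, ref_docs=1000))
```

A factor above 1 means the term is rarer in the reference corpus than in the index, so it would need to be boosted.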
The problem is that, once I've found the "boost" factors I have to use for every term, how can I use this in a query?
For instance, let's consider this example:
"query": {
  "bool": {
    "should": [
      {
        "match": {
          "title": {
            "query": "cat",
            "boost": 2
          }
        }
      },
      {
        "match": {
          "content": {
            "query": "cat",
            "boost": 2
          }
        }
      }
    ]
  }
}
Does that mean that the IDF of the term "cat" is going to be boosted, i.e. multiplied by a factor of 2?
Also, what happens if I search for a sentence instead of a single word? Would that mean that the IDF of each word is going to be boosted by 2?
I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and of t.getBoost(), but it still seems a little confusing.
Boost matters when a query has multiple clauses, for example:
{
"bool":{
"should":[
{
"match":{
"clause1":{
"query":"query1",
"boost":3
}
}
},
{
"match":{
"clause2":{
"query":"query2",
"boost":2
}
}
},
{
"match":{
"clause3":{
"query":"query1",
"boost":1
}
}
}
]
}
}
In the query above, clause1 is three times as important as clause3, and clause2 is twice as important as clause3. The scores are not simply multiplied by 3 and 2, though, because scores are normalized when they are calculated.
Also, if your query has only a single clause, boosting it is not useful.
A usage scenario for boost:
You have a set of page documents with title and content fields.
You want to search the title and content for some terms, and you consider the title more important than the content. In that case you can give the title clause a higher boost than the content clause. If your query hits one document through its title field and another through its content field, the boost helps rank the title hit above the content hit.
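As a rough sketch of that scenario, the snippet below builds such a request body with one match clause per field and a different boost on each. The helper name and the concrete boost values are only illustrative.

```python
import json


def boosted_match_query(text, field_boosts):
    """Build a bool/should query with one boosted match clause per field."""
    should = [
        {"match": {field: {"query": text, "boost": boost}}}
        for field, boost in field_boosts.items()
    ]
    return {"query": {"bool": {"should": should}}}


# Title hits count more than content hits (the values are arbitrary examples).
body = boosted_match_query("cat", {"title": 2, "content": 1})
print(json.dumps(body, indent=2))
```

Only the ratio between the boosts matters here, which is why boosting a single clause on its own changes nothing about the ranking.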
I've created several Elasticsearch indexes for different types of information in our system. They are mainly used individually, to look for elements in one particular index. However, we have a general search on our home page, where the user can search across all indexes. For example, the following search is used:
curl -XGET 'localhost:9200/my-index-%2A/_doc/_search?pretty' -H 'Content-Type: application/json' -d'
{
"size":25,
"query":{
"bool":{
"must":[
{
"term":{"languageCode":"de"}
},
{
"bool":{
"should":[
{
"simple_query_string":{
"query":"search-term",
"fields":[
"title_*.language^50",
"description_*.language^10",
"content_*.language^1"
]
}
}
]
}
}
]
}
}
}'
This search uses many indexes through a wildcard (my-index-*/_doc/_search). It works correctly, but my problem is that I want one of the indexes to produce lower scores than the others. Is there any way to give less weight to one index in a multi-index query?
Yes, you can indeed apply an index boost:
GET /_search
{
  "indices_boost": [
    { "my-index-do-not-want": 0.5 }
  ]
}
In addition, depending on your use case, you might want to consider turning on dfs_query_then_fetch for that query, as mentioned in "elasticsearch scoring on multiple indexes".
That way, your scores should be more comparable between indexes.
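Putting both suggestions together, a request could be assembled roughly like this. It is only a sketch: the helper name is made up, and the query part is reduced to a bare simple_query_string for brevity.

```python
def multi_index_search(text, index_boosts):
    """Return (body, url_params) for a search across several indexes.

    index_boosts maps an index name to its boost; values below 1
    make hits from that index score lower.
    """
    body = {
        "indices_boost": [{name: boost} for name, boost in index_boosts.items()],
        "query": {"simple_query_string": {"query": text}},
    }
    # search_type is passed as a URL parameter, not in the body.
    params = {"search_type": "dfs_query_then_fetch"}
    return body, params


body, params = multi_index_search("search-term", {"my-index-do-not-want": 0.5})
print(body, params)
```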
I have a requirement for custom scoring on names. To keep it simple, let's say that if I search for 'Smith' against names in the index, the logic should be:
if input = exact 'Smith' then score = 100%
else
if input = phonetic match then
score = <depending upon fuzziness match of input with name>%
end if
end if;
I'm able to search documents with a fuzziness of 1, but I don't know how to give each hit a custom score depending on how fuzzy the match is. Thanks!
Update:
I went through a post that had the same requirement as mine, and it mentioned that the person solved it by using native scripts. My question still remains: how do I actually get a score based on the similarity distance, so that it can be used in the native scripts?
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."
You can implement this search logic using the function_score query (docs here).
Here is a possible example:
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "input": "Smith"
        }
      },
      "boost": "5",
      "functions": [
        {
          "filter": { "match": { "input.keyword": "Smith" } },
          "random_score": {},
          "weight": 23
        }
      ]
    }
  }
}
In this example we have a mapping with the input field indexed both as text and as keyword (input.keyword is for the exact match). Documents that match the term "Smith" exactly are re-scored higher than the other documents matched by the first query (in the example it is a plain match, but in your case it would be the query with fuzziness).
You can control the re-scoring effect by tuning the weight parameter.
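A variant of the same idea, sketched with a fuzzy base query and a plain weight function (without random_score), could be built like this. The helper name and the default weight are assumptions; the field names follow the example above. Note that this only lifts exact matches above fuzzy ones and does not grade the score by edit distance.

```python
def exact_first_fuzzy_query(field, text, exact_weight=10.0):
    """Fuzzy match on `field`, with exact matches on `field`.keyword weighted up."""
    return {
        "query": {
            "function_score": {
                # Base query: also matches names within edit distance 1.
                "query": {"match": {field: {"query": text, "fuzziness": 1}}},
                "functions": [
                    {
                        # Applies only to documents whose keyword value is exactly `text`.
                        "filter": {"term": {field + ".keyword": text}},
                        "weight": exact_weight,
                    }
                ],
                # The base score is multiplied by the weight for exact matches.
                "boost_mode": "multiply",
            }
        }
    }


body = exact_first_fuzzy_query("input", "Smith")
print(body)
```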
I want to remove the documents with the lowest relevancy from the results of a match query. Is there any way to do this other than by using the score?
Use case:
Suppose we have:
index: office
doctype: employee
post(field): Account officer, account manager, accountant, chief acc etc which are different documents.
Now I search "account" in a match query against all the docs in the "post" field.
Let's say the "chief acc" value of the "post" field in the docs above is the 'least relevant' one.
I want to exclude such barely relevant matches from the search results list.
I tried using the score of the results, but I don't think that is feasible. Is there any other way to achieve this besides the score?
Yes you can do this by having a filtered query inside your query:
POST _search
{
  "query": {
    "filtered": {
      "filter": {
        "not": {
          "term": {
            "post": "chief acc"
          }
        }
      }
    }
  }
}
If you're using ES 5.0, the filtered query and the not filter are no longer available, so you have to use a must_not clause inside a bool query instead:
"must_not": {
  "term": { "post": "chief acc" }
}
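For completeness, a full request body for the bool form might look like the sketch below. The helper name is hypothetical, and the term clause only excludes the document if "post" is indexed so that the whole value "chief acc" is a single term (for example a keyword or not_analyzed field), which is an assumption about the mapping.

```python
def match_excluding(field, text, excluded_value):
    """Match `text` on `field`, but drop documents whose `field` equals `excluded_value`."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: text}},
                "must_not": {"term": {field: excluded_value}},
            }
        }
    }


body = match_excluding("post", "account", "chief acc")
print(body)
```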
Maybe you could have a look at this SO as well. Hope it helps!
I'm running a multi_match (with most_fields and "fuzziness": "AUTO") query for "Rob", but I get a result with "Ron" before "Rob".
If I remove the fuzziness, it shows Rob only, not Ron. However, I do want to use the fuzziness, I just expect all results that are exact match to be more relevant and to be shown first. It's not happening.
Investigating the 'explain' output shows that the IDF of 'Ron' is a bit higher.
Back to my question: is it possible to configure some 'boost' or 'score' for the fuzziness element?
OK, I ended up with the following, based on what was suggested here:
https://medium.com/#oysterpail/fuzzy-queries-ae47b66b325c#.a4uxw5z0b
Their solution uses a bool query with should clauses. I can't do that, because I need this part of the query to be a must (I use the should part for relevancy), and a bool query with several must clauses is effectively an AND. However, must combined with or did the trick:
{
"query":{
"bool":{
"must":{
"or":[
{
"multi_match":{
"query":"rob",
"fields":[
"username",
"firstName",
"lastName"
],
"type":"most_fields",
"fuzziness":"AUTO"
}
},
{
"multi_match":{
"query":"rob",
"fields":[
"username",
"firstName",
"lastName"
],
"type":"most_fields"
}
}
]
}
}
}
}
This way, results coming from the fuzzy part match only the first part of the query, whereas exact-match results match both parts and therefore show up first.
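As far as I know, the or query was deprecated and later removed, so on newer versions the same effect would need a nested bool with should clauses inside the must. A minimal sketch of building that body (the helper name is made up):

```python
def fuzzy_with_exact_bonus(text, fields):
    """Require a fuzzy or exact match; exact matches hit both clauses and score higher."""
    base = {"query": text, "fields": list(fields), "type": "most_fields"}
    fuzzy = {"multi_match": dict(base, fuzziness="AUTO")}
    exact = {"multi_match": dict(base)}
    return {
        "query": {
            "bool": {
                "must": {
                    # At least one of the two clauses has to match (logical OR).
                    "bool": {"should": [fuzzy, exact], "minimum_should_match": 1}
                }
            }
        }
    }


body = fuzzy_with_exact_bonus("rob", ["username", "firstName", "lastName"])
print(body)
```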
Quite an old question, but I'll answer to help others looking at it now.
The reason you are getting 'Ron' before 'Rob' is the TF/IDF algorithm. In your dataset the word 'Rob' occurs more often than 'Ron', so the algorithm gives 'Rob' a lower score.
If you just want to search for names then you can use a different scoring algorithm or similarity. In your case a 'boolean' similarity should work.
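As a sketch, the similarity is set per field in the mapping. The snippet below builds only the properties fragment; the helper name is hypothetical, the field names are taken from the question, and whether the "boolean" similarity is available depends on your Elasticsearch version.

```python
def name_field_properties(fields):
    """Mapping fragment that scores the given text fields with the boolean similarity."""
    return {
        "properties": {
            field: {"type": "text", "similarity": "boolean"} for field in fields
        }
    }


properties = name_field_properties(["username", "firstName", "lastName"])
print(properties)
```

With this similarity a term either matches or it does not, so how rare 'Ron' is compared to 'Rob' no longer influences the score.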
I am very new to Elasticsearch and I have to perform the following query:
GET book-lists/book-list/_search
{
"query":{
"filtered":{
"filter":{
"bool":{
"must":[
{
"term":{
"title":"Sociology"
}
},
{
"term":{
"idOwner":"17xxxxxxxxxxxx45"
}
}
]
}
}
}
}
}
According to the Elasticsearch API, it is equivalent to pseudo-SQL:
SELECT document
FROM book-lists
WHERE title = "Sociology"
AND idOwner = 17xxxxxxxxxxxx45
The problem is that my document looks like this:
{
"_index":"book-lists",
"_type":"book-list",
"_id":"AVBRSvHIXb7carZwcePS",
"_version":1,
"_score":1,
"_source":{
"title":"Sociology",
"books":[
{
"title":"The Tipping Point: How Little Things Can Make a Big Difference",
"isRead":true,
"summary":"lorem ipsum",
"rating":3.5
}
],
"numberViews":0,
"idOwner":"17xxxxxxxxxxxx45"
}
}
And the Elasticsearch query above doesn't return anything.
Whereas, this query returns the document above:
GET book-lists/book-list/_search
{
"query":{
"filtered":{
"filter":{
"bool":{
"must":[
{
"term":{
"numberViews":"0"
}
},
{
"term":{
"idOwner":"17xxxxxxxxxxxx45"
}
}
]
}
}
}
}
}
This makes me suspect that the fact that two fields share the name "title" has something to do with it.
Is there a way to fix this without having to rename any of the fields? Or am I missing something else?
Thanks to anyone trying to help.
Your problem is described in the documentation.
I suspect that you don't have any explicit mapping on your index, which means Elasticsearch will use dynamic mapping.
For string fields, it will pass the string through the standard analyzer which lowercases it (among other things). This is why your query doesn't work.
Your options are:
1) Specify an explicit mapping on the field so that it isn't analyzed before storing in the index (index: not_analyzed).
2) Clean your term query before sending it to Elasticsearch (in this specific query lowercasing will work, but note that the standard analyzer also does other things like remove stop words, so depending on the title you may still have issues).
3) Use a different query type (e.g., query_string instead of term, which will analyze the query before running it).
Looking at the sort of data you are storing you probably need to specify an explicit not_analyzed mapping.
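A sketch of what such a mapping could look like when creating the index is shown below. It assumes the old string type with index: not_analyzed, as in the versions this question is about (newer versions use the keyword type instead), and takes the index type name from the question. The helper name is made up.

```python
def not_analyzed_title_mapping(doc_type="book-list"):
    """Index-creation body that stores the top-level title as a single exact term."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # Only the top-level title; the nested books.title is a separate field.
                    "title": {"type": "string", "index": "not_analyzed"}
                }
            }
        }
    }


body = not_analyzed_title_mapping()
print(body)
```

The mapping of an existing field generally cannot be changed in place, so the documents would have to be reindexed after creating the index with this mapping.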
For option three your query would look something like this:
{
"query":{
"filtered":{
"filter":{
"bool":{
"must":[
{
"query_string":{
"fields": ["title"],
"analyzer": "standard",
"query": "Sociology"
}
},
{
"term":{
"idOwner":"17xxxxxxxxxxxx45"
}
}
]
}
}
}
}
}
Note that the query_string query has special syntax (e.g., OR and AND are not treated as literals) which means you have to be careful what you give it. For this reason explicit mapping with a term filter is probably more appropriate for your use case.
I have described this issue in this blog.
The issue is caused by the default tokenization in Elasticsearch.
In the same post, I have outlined two solutions.
One is to enable the not_analyzed flag on the required field, and the other is to use the keyword tokenizer.
To expand on solarissmoke's solution, while the contents of that field will be passed through the standard analyzer, your query will not. If you refer to the Elasticsearch documentation on the term query, you will see that term queries are not analyzed.
The match query is probably more appropriate for your case. By default, the text you query for will be analyzed in the same way as the contents of the title field. The query_string query brings a lot more to the table, and you should review the documentation if you plan on using it.
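To illustrate the difference, here is a toy simulation in plain Python. It does not use Elasticsearch at all, and the analyzer is only a very rough stand-in for the standard analyzer (split into words, lowercase); all names are made up.

```python
import re


def toy_analyze(text):
    """Rough stand-in for the standard analyzer: split into words and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# What ends up in the index for the title "Sociology".
indexed_terms = set(toy_analyze("Sociology"))


def term_lookup(term):
    """Like a term query: the input is compared as-is, without analysis."""
    return term in indexed_terms


def match_lookup(text):
    """Like a match query: the input is analyzed first, then compared."""
    return any(token in indexed_terms for token in toy_analyze(text))


print(term_lookup("Sociology"))   # False: the index holds "sociology"
print(term_lookup("sociology"))   # True
print(match_lookup("Sociology"))  # True
```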
So again pretty much what you had with the small tweak:
GET book-lists/book-list/_search
{
"query":{
"filtered":{
"filter":{
"bool":{
"must":[
{
"match":{
"title":"Sociology"
}
},
{
"term":{
"idOwner":"17xxxxxxxxxxxx45"
}
}
]
}
}
}
}
}
It is important to note that all of these approaches are still very different from the SQL query you described. That applies to passing a lowercased version of the terms to the term query (a hack, and not a good idea given what solarissmoke described about the other features of the standard analyzer, like the stop filter), to using the query_string query, and to using the match query:
SELECT document
FROM book-lists
WHERE title = "Sociology"
AND idOwner = 17xxxxxxxxxxxx45
With those Elasticsearch queries, you can match records where idOwner is the same but title is something like "Another Sociology Title", which is different from what you would expect with that SQL. Here is some great material from the documentation, plus another Stack Overflow post, that elaborates on what is going on, where term queries and filters are appropriate, and how to get exact matches:
Elasticsearch : Finding Exact Values
Stackoverflow : Exact (not substring) matching in Elasticsearch