Elasticsearch: Constant score applied within match query, but after search terms have been analysed?

Imagine I have some documents with the following values in a text field called name:
Document1: abc xyz group
Document2: group x/group y
Document3: group 1, group 2, group 3, group 4
Now imagine I'm sending a simple match query to ES for the term 'group':
{
"query": {
"match": {
"name": "group"
}
}
}
My desired outcome would be that all 3 documents would return with the same score, no matter how often the term appears, where it appears, etc.
Now, I already know that I can do this by wrapping my match with a constant_score, like so:
{
"query": {
"constant_score": {
"filter": {
"match": {
"name": "group"
}
},
"boost": 1
}
}
}
BUT, say I now want to query using the search term abc group. In this case, what I want to happen is that Document2 and Document3 will return the same score (matches group), but Document1 to have a better score as it matches both abc and group.
With a constant_score wrapping my match query, documents that contain any of the terms return the same score (i.e. Document1, Document2 and Document3 all return the same score for abc group). If I remove the constant_score, then Document3 has the best score, presumably because it contains more matches with the search text (group appears 4 times).
It seems as though I need a way of applying the constant_score after the match query has analyzed my search text, so that a query for abc group effectively becomes two constant_score queries: one for abc and one for group.
Does anyone know of a way to achieve this?

I've managed to solve this by utilising Elasticsearch's unique token filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html
I've added that to my name field in the index mappings, and it looks to be retrieving the desired results without having to worry about constant_score.
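For illustration, the index settings for such a setup might look roughly like the sketch below. The analyzer name is hypothetical, the tokenizer and lowercase filter are assumptions about the rest of the analysis chain, and the typeless mappings layout assumes a recent Elasticsearch version; only the unique token filter and the name field come from the description above.

```python
import json

# Hypothetical analyzer name, chosen only for this sketch.
ANALYZER_NAME = "name_unique_analyzer"

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                ANALYZER_NAME: {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase first so "Group" and "group" collapse into one token,
                    # then drop repeated tokens so each term is indexed once per value
                    "filter": ["lowercase", "unique"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": ANALYZER_NAME}
        }
    },
}

# The printed JSON is what would be sent as the body of an index-creation request.
print(json.dumps(index_body, indent=2))
```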
Note, however, that all this does is stop term frequencies from having any effect on the _score; other factors (such as field length) still affect the results. It is therefore not equivalent to the post-analysis constant_score I hypothesized in the question, but it will suffice for my current requirements.
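For anyone who does need the behaviour hypothesized in the question, one possible approach is to split the search text on the client and send one constant_score clause per term inside a bool/should. This is only a sketch: the helper name is made up, and lowercasing plus whitespace splitting is an assumed stand-in for whatever analyzer the field really uses.

```python
import json


def per_term_constant_score(field, text, boost=1.0):
    """Build a bool/should query with one constant_score clause per distinct term."""
    # Assumption: lowercase + whitespace split approximates the field's analyzer.
    terms = list(dict.fromkeys(text.lower().split()))
    clauses = [
        {"constant_score": {"filter": {"match": {field: term}}, "boost": boost}}
        for term in terms
    ]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


body = per_term_constant_score("name", "abc group")
print(json.dumps(body, indent=2))
```

Because a bool query adds up the scores of its matching should clauses, a document matching both abc and group would score 2 and a document matching only group would score 1, regardless of how often the term occurs or how long the field is.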

Related

scoring of Term vs. Terms query different

I am retrieving documents by filtering and using a term query to apply a score.
The query should match all animals having a specified color - the more colors are matched, the higher the score of a doc. Strange thing is, term and terms query result in a different scoring.
{
"query": {
"bool": {
"should": [
{"terms": {"color": ["brown","darkbrown"] } }
]
}
}
}
should be the same as using
{"term": {"color": {"value": "brown"} } },
{"term": {"color": {"value": "darkbrown"} } }
Query no. 1 gives me the exact same score for a document whether 1 or 2 terms are matched. The latter of course returns a higher score, if more colors are matched.
According to the coordination factor, the returned score should be higher if more terms are matched. Therefore these two queries should result in the same score. Or is it because term queries do not analyze the search term?
My field is indexed as text. Strings are indexed as an "array" of strings, e.g. "brown","darkbrown"
Difference between term vs terms query:
The term query returns documents that contain an exact term in a provided field.
The terms query is the same as the term query, except you can search for multiple values.
Warning: Avoid using the term query for text fields.
As far as this part of your question is concerned:
or is because term queries do not analyze the search term?
Yes, it is because the term query does not analyze the search term. It only matches the exact term as given.
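If the goal is a score that grows with the number of matched colors, which is what the question observed with separate term clauses, the list of values can be expanded into one term clause each. A minimal sketch, with a hypothetical helper name:

```python
import json


def terms_as_should_clauses(field, values):
    """Expand a list of values into one term clause per value inside bool/should."""
    return {
        "query": {
            "bool": {
                "should": [{"term": {field: {"value": value}}} for value in values]
            }
        }
    }


body = terms_as_should_clauses("color", ["brown", "darkbrown"])
print(json.dumps(body, indent=2))
```

Each matching clause contributes its own score, so a document with both colors ranks above a document with only one.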

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an Elasticsearch query using fuzzy matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For instance, this query returns a user record as expected:
{
"from": 0,
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"firstName": {
"query": "sVxGBCkPYZ",
"boost": 30
}
}
}
],
"should": [
]
}
},
"fields": [
"id",
"firstName"
]
}
However, replacing the match element with the fuzzy query below fails to return any records:
{
"fuzzy": {
"firstName": {
"value": "sVxGBCkPYZ",
"fuzziness": 2,
"boost": 30,
"min_similarity": 0.3
}
}
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference, this is the ES version I'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}
The fuzzy query fails because fuzzy searches are term-level queries, meaning the query string is not analysed, while the indexed data (which I assume is of type text with the standard analyzer) would have been converted to svxgbckpyz in the inverted index.
Instead, you can implement fuzziness with a match query, as below:
POST testindex/_search
{
"query":{
"match":{
"firstName":{
"query":"sVxGBCkPYZ",
"fuzziness":"AUTO"
}
}
}
}
You can change the value from AUTO to 0, 1 or 2 depending on your use case; 2 is the maximum edit distance Elasticsearch accepts.
The exact match you mentioned also works because the query string is analysed, which converts the input to lower case, the form that is available in the inverted index.
As for how the fuzzy query (that you've mentioned) works behind the scenes, the linked reference explains it as follows:
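The difference between the two lookups can be illustrated with a toy model. This is not how Elasticsearch is implemented; the analyze function is an assumed stand-in for the standard analyzer (lowercase and split on whitespace), and the set plays the role of the inverted index.

```python
def analyze(text):
    """Toy stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Toy inverted index: the set of terms produced at index time.
indexed_terms = set(analyze("sVxGBCkPYZ"))


def match_lookup(query):
    """Analyzed lookup (like match): the query goes through the same analyzer."""
    return any(term in indexed_terms for term in analyze(query))


def term_level_lookup(query):
    """Term-level lookup: the query string is compared with the index as-is."""
    return query in indexed_terms


print(match_lookup("sVxGBCkPYZ"))       # True: "svxgbckpyz" is in the index
print(term_level_lookup("sVxGBCkPYZ"))  # False: the mixed-case string is not
```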
The fuzzy query works by taking the original term and building a
Levenshtein automaton—like a big graph representing all the strings
that are within the specified edit distance of the original string.
The fuzzy query then uses the automaton to step efficiently through
all of the terms in the term dictionary to see if they match. Once it
has collected all of the matching terms that exist in the term
dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy
query with an edit distance of 2 can match a very large number of
terms and perform very badly.
Note this statement in particular: "representing all the strings that are within the specified edit distance of the original string".
For example, some of the words at distance 1 from life would be aife, bife, cife, dife, ..., lifz.
So in your case, the fuzzy query's automaton cannot reach the term svxgbckpyz from the input string sVxGBCkPYZ, because the distance between them is 7 (remember that the distance between A and a is 1). That is beyond what AUTO allows, and it cannot be configured either, since Elasticsearch caps fuzziness at an edit distance of 2.
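The distance of 7 can be checked with a plain Levenshtein implementation. This is the textbook dynamic-programming version, written here only to verify the numbers; it is not the automaton Lucene builds.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein("life", "lifz"))              # 1
print(levenshtein("sVxGBCkPYZ", "svxgbckpyz"))  # 7
```

Lowercasing the input before sending it to the fuzzy query would bring the distance down to 0, which is effectively what the match query with fuzziness does through analysis.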
Adding one more LINK for more info. Hope it helps!

What is the difference between must and filter in Query DSL in elasticsearch?

I am new to Elasticsearch and I am confused about the difference between must and filter. I want to perform an AND operation between my terms, so I did this:
POST /xyz/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"city": "city1"
}
},
{
"term": {
"saleType": "sale_type1"
}
}
]
}
}
}
which gave me the required results matching both the terms, and on using filter like this
POST /xyz/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"city": "city1"
}
}
],
"filter": {
"term": {
"saleType": "sale_type1"
}
}
}
}
}
I get the same result, so when should I use must and when should I use filter? What is the difference?
must contributes to the score. In filter, the score of the query is ignored.
In both must and filter, the clause (query) must appear in matching documents. This is why you get the same results.
You may check this link
Score
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document.
To learn how the score is calculated, refer to this link
must returns a score for every matching document. This score helps you rank the matching documents, and compare the relative relevance between documents (using the magnitude of the score of each document).
With this, one can say how many times more relevant Doc 1 is than Doc 2, or that Docs 1 to 7 are of much higher relevance than Doc 8 onwards.
For how the relative score is determined, you can refer to the references below.
Briefly, it is related to the number of term occurrences in the document, the document length, and the average number of term occurrences in your database index.
filter doesn't return a score. All one can say is, all matching documents are of relevance. But it won't help in evaluating if one is more relevant than the other. You can think of filter as a must with only 2 scores: zero or non-zero, and where all zero-scored documents are dropped.
filter is helpful if you just want to whitelist or blacklist documents, for example all documents belonging to the topic "pets".
In summary, there are 3 points that will help you in deciding when to use what:
must is your only choice when comparing/ranking documents by relevance
filter excludes all documents that don't match
filter is a lot faster because Elasticsearch doesn't need to compute the relevance score
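As a rough illustration of the points above, a small helper (hypothetical name and signature) can make the choice explicit: clauses that should influence ranking go into must, and pure yes/no conditions go into filter.

```python
import json


def bool_query(scored=None, filtered=None):
    """Build a bool query: `scored` clauses go in must, `filtered` ones in filter."""
    bool_body = {}
    if scored:
        bool_body["must"] = list(scored)
    if filtered:
        bool_body["filter"] = list(filtered)
    return {"query": {"bool": bool_body}}


# Both conditions from the question are exact yes/no checks,
# so neither needs to contribute to the score.
body = bool_query(filtered=[
    {"term": {"city": "city1"}},
    {"term": {"saleType": "sale_type1"}},
])
print(json.dumps(body, indent=2))
```

With only filter clauses, every hit comes back with a score of 0, which is fine when the results are sorted by something other than relevance.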
References:
Query vs Filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
Computation of Relevance: https://www.infoq.com/articles/similarity-scoring-elasticsearch/

Boosting the relevance score based on the unique keyword found

I am in a scenario where I need to give more relevance to the document in Index if it has a unique keyword. Let me provide a scenario.
Let's say I need to search for the term znkdref unsuccessfull. The results will contain documents that have znkdref, unsuccessfull, or both. I want documents containing both znkdref and unsuccessfull to have the highest relevance, documents containing only znkdref to have lower relevance, and documents containing only unsuccessfull to have the lowest relevance.
Is there a way to achieve this? I would be glad to get any help.
You want to use Query Time Boosting, in particular Prioritized Clauses.
In short you need to extract the keywords that you want boosted and build a query that boosts the parts that you want.
{
"query": {
"bool": {
"should": [{
"match": {
"content": {
"query": "znkdref",
"boost": 2
}
}
},
{
"match": {
"content": {
"query": "unsuccessfull"
}
}
}]
}
}
}
Update based on comment:
If you want to know why a document got the score that it did (maybe to identify "keywords") then you can pass in "explain" as a query parameter or set it in the root POST payload. The result will now have document frequency counts and sub scores.
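A sketch of how such a request body could be assembled, with the boosts supplied per keyword and explain switched on in the payload. The helper name is made up, and the boost values are just the ones from the example above.

```python
import json


def boosted_keywords_query(field, boosts, explain=False):
    """Build one match clause per keyword, each carrying its own boost."""
    should = [
        {"match": {field: {"query": keyword, "boost": boost}}}
        for keyword, boost in boosts.items()
    ]
    return {"explain": explain, "query": {"bool": {"should": should}}}


body = boosted_keywords_query(
    "content", {"znkdref": 2, "unsuccessfull": 1}, explain=True
)
print(json.dumps(body, indent=2))
```

Since the should clauses are summed, a document matching both keywords outranks one matching only znkdref, which in turn tends to outrank one matching only unsuccessfull.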
Do you mean "znkdref" is a unique keyword, for example a special name of something? If so:
Documents that match the whole query string "znkdref unsuccessfull" will, of course, generally have the highest relevance score.
Documents that contain "znkdref" will usually have a higher relevance score than documents that contain "unsuccessfull", because the TF-IDF score of "znkdref" is bigger than the TF-IDF score of "unsuccessfull".
The relevance score function is described at https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html
I hope that my answer is helpful for you.

How to sort elastic search results by score + boost + field?

Given an index of books that have a title, an author, and a description, I'd like the resulting search results to be sorted this way:
all books that match the title sorted by downloads (a numeric value)
all books that match on author sorted by downloads
all books that match on description sorted by downloads
I use the search query below, but the problem is that each entry has a different score thus making sorting by downloads irrelevant.
e.g. when the search term is 'sorting' - title: 'sorting in elastic search' will score higher than title: 'postgresql sorting is awesome' (because of the word position).
query = QueryBuilders.multiMatchQuery(queryString, "title^16", "author^8", "description^4")
elasticClient.prepareSearch(Index)
.setTypes(Book)
.setQuery(query)
.addSort(SortBuilders.scoreSort())
.addSort(SortBuilders.fieldSort("downloads").order(SortOrder.DESC))
How do I construct my query so that I could get the desired book sorting?
I use standard analysers and I need the search query to be analysed. I will also have to handle multi-word search query strings.
Thx.
What you need here is a way to compute a score based on three weighted fields and a numeric field. Sorting by _score first and downloads second only uses downloads to break ties between equal scores, so whenever the text scores differ, downloads has no effect on the order.
Hence a better approach would be to multiply downloads by the score obtained from the match.
So I would recommend a function_score query:
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "sorting",
"fields": [
"title^16",
"author^8",
"description^4"
]
}
},
"functions": [
{
"field_value_factor": {
"field": "downloads"
}
}
],
"boost_mode": "multiply"
}
}
}
This will compute the score based on all three fields and then multiply that score by the value in the downloads field to get the final score. The multiply boost_mode decides how the value computed by the functions is combined with the score computed by the query.
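To make the effect of multiply concrete, here is a small sketch that builds the same request body and mimics the arithmetic for a plain field_value_factor (no factor or modifier set). The helper names and the sample numbers are invented for illustration.

```python
import json


def function_score_body(query_text, fields, factor_field):
    """Build a function_score query that multiplies the text score by a field."""
    return {
        "query": {
            "function_score": {
                "query": {"multi_match": {"query": query_text, "fields": fields}},
                "functions": [{"field_value_factor": {"field": factor_field}}],
                "boost_mode": "multiply",
            }
        }
    }


def multiplied_score(query_score, downloads):
    """Mimic boost_mode "multiply" with an unmodified field_value_factor."""
    return query_score * downloads


body = function_score_body(
    "sorting", ["title^16", "author^8", "description^4"], "downloads"
)
print(json.dumps(body, indent=2))

# Hypothetical numbers: a weaker text match with many downloads can overtake
# a stronger text match with few downloads.
print(multiplied_score(2.0, 10))   # 20.0
print(multiplied_score(1.5, 100))  # 150.0
```

The second pair of numbers shows the trade-off: with multiplication, a very popular book can move ahead of a better text match, so the strict title, then author, then description ordering from the question is not guaranteed.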
