Repeated values in Elasticsearch array and query scoring - elasticsearch

I have two documents with a field country which can contain repeated values, e.g.
Doc1:
country: [US, US, GB, US]
Doc2:
country: [US, GB]
I need a query that when looking for country:US will assign a higher score to Doc1 than Doc2 since US appears multiple times in the country field of Doc1, while it will assign the same score to the two documents when looking for country:GB as it appears the same number of times in both documents. Is this something achievable with Elasticsearch?

If you run a simple match query for US:
GET countryindex/_search
{
"query": {
"match": {
"country": "US"
}
}
}
A match query scores a document higher the more often the term occurs, so [US, US, GB, US] scores higher than [US, GB].
If you search for GB, [US, GB] scores higher than [US, US, GB, US]: both contain the term once, and the shorter field gets the higher score.
If you want the same score when the number of matches is the same, disable field-length normalization by setting norms: false in your mapping:
{
"properties": {
"country": {
"type": "text",
"norms": false
}
}
}
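To see why both observations above hold, here is a minimal Python sketch of the term-frequency part of BM25, the default similarity in recent Elasticsearch versions. It is a simplified model, not Lucene's actual implementation: constant factors and IDF are left out because they are the same for both documents, field length is taken as the number of array values, and disabling norms is modelled as b = 0 (an assumption). The names bm25_tf_part and score are made up for this illustration.

```python
K1 = 1.2   # default BM25 k1 in Elasticsearch
B = 0.75   # default BM25 b in Elasticsearch

docs = {"Doc1": ["US", "US", "GB", "US"], "Doc2": ["US", "GB"]}
avg_len = sum(len(values) for values in docs.values()) / len(docs)


def bm25_tf_part(tf, doc_len, avg_len, norms=True):
    """Term-frequency part of BM25 (IDF and constant factors omitted)."""
    b = B if norms else 0.0  # assumption: no norms behaves like b = 0
    length_norm = 1.0 - b + b * doc_len / avg_len
    return tf / (tf + K1 * length_norm)


def score(doc, term, norms=True):
    values = docs[doc]
    return bm25_tf_part(values.count(term), len(values), avg_len, norms)


for term in ("US", "GB"):
    for norms in (True, False):
        print(term, "norms" if norms else "no norms",
              {doc: round(score(doc, term, norms), 3) for doc in docs})
```

In this model Doc1 outranks Doc2 for US in both settings, while for GB the two documents only tie once norms are turned off.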

Related

ElasticSearch: How to search only the N docs for large scale index?

I have a large number of docs (about 100M) stored in a single index. When I group by a single field, the query may eat up all the CPU on my ES server (most of the time, fewer than 100 results are returned).
Is it possible to limit the query scope (i.e., only search 1M docs) for a single query?
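Use from and size to cap the number of hits returned. Note that this limits the result window, not the number of documents the query examines, and a from + size above index.max_result_window (10,000 by default) is rejected: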
GET /_search
{
"from": 0,
"size": 1000000,
"query": {
"match": {
"city": "New york"
}
}
}
More information is available in the documentation.
To add to the other answer, you can also look at the sampler aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/7.15/search-aggregations-bucket-sampler-aggregation.html
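As a rough sketch of the sampler approach, the Python snippet below builds a search body that runs a terms aggregation (the "group by") only on a per-shard sample of the matching documents. The function name sampled_terms_body, the aggregation names, and the field city.keyword are hypothetical; sampler and its shard_size parameter are part of the aggregation linked above.

```python
import json


def sampled_terms_body(query, group_field, shard_size):
    """Build a search body that groups only a per-shard sample of the matches."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "query": query,
        "aggs": {
            "sample": {
                # keep at most shard_size top-scoring documents per shard
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    # the "group by" runs on the sampled documents only
                    "groups": {"terms": {"field": group_field}}
                },
            }
        },
    }


# city.keyword is an assumed keyword sub-field, used here for illustration
body = sampled_terms_body({"match": {"city": "New york"}}, "city.keyword", 10000)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a GET /<index>/_search request; shard_size applies per shard, so the total sample grows with the number of shards.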

Elasticsearch boost but only one occurrence of term per field

I'm currently sending the following query to ElasticSearch:
{
"size": 100,
"query": {
"function_score": {
"query": {
"simple_query_string": {
"query": "term1",
"fields": ["field1^10", "field2^5"]
}
}
}
}
}
Now imagine I have two documents.
Document1 contains one occurrence of "term1" on field1
Document2 contains three occurrences of "term1" on field2
What I get: Elastic returns Document2 above Document1
What I want: Document1 above Document2.
To achieve this, Elastic should not multiply by the number of occurrences of "term1"; it should only consider whether the term appears. What should I change in my query?
There seem to be two options to stop Elastic from giving more weight based on the number of occurrences of a term.
The first is to map the fields so that term frequency (TF) is disabled: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html#tfidf
The second is to use the Constant Score Query: https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html
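One way the second option could look, as a sketch rather than a drop-in replacement for the original simple_query_string: each field gets its own constant_score clause whose boost is the field weight, and dis_max keeps the highest-scoring clause. A document matching field1 then scores 10 regardless of how many times the term occurs, and a document matching only field2 scores 5. The helper name presence_only_query is made up, and using a match query inside the filter is an assumption about how the term should be analysed.

```python
import json


def presence_only_query(term, field_boosts):
    """Score each field by presence only: one constant_score clause per field."""
    clauses = [
        {
            "constant_score": {
                "filter": {"match": {field: term}},  # matching ignores TF/IDF here
                "boost": boost,                      # fixed score for this field
            }
        }
        for field, boost in field_boosts.items()
    ]
    # dis_max takes the score of the best matching clause
    return {"size": 100, "query": {"dis_max": {"queries": clauses}}}


body = presence_only_query("term1", {"field1": 10, "field2": 5})
print(json.dumps(body, indent=2))
```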

How to filter results based on frequency of repeating terms in an array in elasticsearch

I have an array field with a lot of keywords, and I need to sort the documents based on how many times a particular keyword is repeated in those arrays.
For example, if my field name is "nationality" and document 1 consists of the following:
doc1
nationality :
["US","UK","Australia","India","US","US"]
and for doc2
nationality:
["US","UK","US","US","US","China"]
I want only those documents to be shown where the term "US" occurs more than 3 times. That would make only doc2 to be shown. How to do this?
You can implement this with a script filter (the filtered query and the _index script variable shown here come from older Elasticsearch versions):
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "_index['nationality']['US'].tf() > 3"
}
}
}
}
}
In this script, the "nationality" field is checked for the term "US", and tf() returns its term frequency. Only the documents with a term frequency greater than three are shown in the results. You can learn more about the filter operations here
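The condition the script expresses can be checked outside Elasticsearch with a few lines of Python. This is only an illustration of the "more than 3 occurrences" rule on the two example documents, not a replacement for the server-side filter, and the helper name with_term_more_than is made up.

```python
docs = {
    "doc1": ["US", "UK", "Australia", "India", "US", "US"],
    "doc2": ["US", "UK", "US", "US", "US", "China"],
}


def with_term_more_than(docs, term, threshold):
    """Return the ids of documents whose array holds `term` more than `threshold` times."""
    return [doc_id for doc_id, values in docs.items() if values.count(term) > threshold]


matching = with_term_more_than(docs, "US", 3)
print(matching)  # doc1 holds "US" three times, doc2 four times
```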

How to sort elastic search results by score + boost + field?

Given an index of books that have a title, an author, and a description, I'd like the resulting search results to be sorted this way:
all books that match the title sorted by downloads (a numeric value)
all books that match on author sorted by downloads
all books that match on description sorted by downloads
I use the search query below, but the problem is that each entry has a different score thus making sorting by downloads irrelevant.
e.g. when the search term is 'sorting' - title: 'sorting in elastic search' will score higher than title: 'postgresql sorting is awesome' (because of the word position).
query = QueryBuilders.multiMatchQuery(queryString, "title^16", "author^8", "description^4")
elasticClient.prepareSearch(Index)
.setTypes(Book)
.setQuery(query)
.addSort(SortBuilders.scoreSort())
.addSort(SortBuilders.fieldSort("downloads").order(SortOrder.DESC))
How do I construct my query so that I could get the desired book sorting?
I use standard analysers and I need the search query to be analysed; I will also have to handle multi-word search query strings.
Thx.
What you need here is a way to compute a single score from the three weighted fields and a numeric field. A two-key sort orders by score first and uses downloads only to break ties, so whenever the scores differ, downloads has no effect.
Hence a better approach would be to multiply downloads by the score obtained from the match.
So I would recommend a function score query:
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "sorting",
"fields": [
"title^16",
"author^8",
"description^4"
]
}
},
"functions": [
{
"field_value_factor": {
"field": "downloads"
}
}
],
"boost_mode": "multiply"
}
}
}
This computes the score based on all three fields and then multiplies it by the value of the downloads field to get the final score. The multiply boost_mode determines how the value computed by the functions is combined with the score computed by the query.
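The effect of boost_mode multiply can be sketched in Python. The query scores and download counts below are invented numbers for illustration, and field_value_factor is modelled with its defaults (no modifier, factor 1), so the function value is simply the downloads field.

```python
def final_score(query_score, downloads):
    """boost_mode "multiply": query score times the field_value_factor value."""
    return query_score * downloads


# hypothetical query scores and download counts, for illustration only
books = [
    {"title": "sorting in elastic search", "query_score": 2.0, "downloads": 10},
    {"title": "postgresql sorting is awesome", "query_score": 1.5, "downloads": 100},
]

ranked = sorted(
    books,
    key=lambda book: final_score(book["query_score"], book["downloads"]),
    reverse=True,
)
for book in ranked:
    print(book["title"], final_score(book["query_score"], book["downloads"]))
```

With these numbers the second book ranks first, because its larger download count outweighs its slightly lower text score.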

Constant Score Query elasticsearch boosting

My understanding of Constant Score Query in elasticsearch is that boost factor would be assigned as score for every matching query. The documentation says:
A query that wraps a filter or another query and simply returns a constant score equal to the query boost for every document in the filter.
However when I send this query:
"query": {
"constant_score": {
"filter": {
"term": {
"source": "BBC"
}
},
"boost": 3
}
},
"fields": ["title", "source"]
all the matching documents are given a score of 1?! I cannot figure out what I am doing wrong, and had also tried with query instead of filter in constant_score.
Scores are only meant to be relative to all other scores in a given result set, so a result set where everything has the score of 3 is the same as a result set where everything has the score of 1.
Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries. - Elasticsearch Guide
Either the constant score is being ignored because it's not being combined with another query, or it's being normalized. As @keety said, check the output of explain to see exactly what's going on.
A constant score query gives an equal score to every matching document, irrespective of scoring factors like TF and IDF. Use it when you don't care how well a document matched, only whether it matched, and you still want a score, unlike with a filter.
If you want the score to be literally 3 for all matching documents of a particular query, use a function score query, something like:
"query": {
"function_score": {
"functions": [
{
"filter": { "term": { "source": "BBC" } },
"weight": 3
}
]
}
...
}
