Elasticsearch - search across multiple indices with conditional decay function - elasticsearch

I'm trying to search across multiple indices with one query, but only apply the gaussian decay function to a field that exists on one of the indices.
I'm running this through elasticsearch-api gem, and that portion works just fine.
Here's the query I'm running in Marvel:
GET episodes,shows,keywords/_search?explain
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "AWESOME SAUCE",
"type": "most_fields",
"fields": [ "title", "summary", "show_title"]
}
},
"functions": [
{ "boost_factor": 2 },
{
"gauss": {
"published_at": {
"scale": "4w"
}
}
}
],
"score_mode": "multiply"
}
},
"highlight": {
"pre_tags": ["<span class='highlight'>"],
"post_tags": ["</span>"],
"fields": {
"summary": {},
"title": {},
"description": {}
}
}
}
The query works great for the episodes index because it has the published_at field for the gauss func to work its magic. However, when run across all indices, it fails for shows and keywords (still succeeds for episodes).
Is it possible to run a conditional gaussian decay function if the published_at field exists or on the single episodes index?
I'm willing to explore alternatives (i.e. run separate queries for each index and then merge the results), but thought a single query would be the best in terms of performance.
Thanks!

You can add a filter so that the Gaussian decay function is applied only to a subset of documents:
{
"filter": {
"exists": {
"field": "published_at"
}
},
"gauss": {
"published_at": {
"scale": "4w"
}
}
}
For docs that don't have the field you can return a score of 0:
{
"filter": {
"missing": {
"field": "published_at"
}
},
"script_score": {
"script": "0"
}
}
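To see how the two filtered functions slot into the original query, here is a minimal sketch that assembles the whole request body as a Python dict. The helper name `decay_only_where_field_exists` and its parameters are made up for illustration; the field names and values are taken from the question, and the `missing` filter is used exactly as written in the answer above.

```python
from typing import Any, Dict


def decay_only_where_field_exists(
    text: str, date_field: str = "published_at", scale: str = "4w"
) -> Dict[str, Any]:
    """Build the function_score body from the question, with both functions filtered.

    Hypothetical helper: it only builds a dict and does not talk to a cluster.
    """
    return {
        "query": {
            "function_score": {
                "query": {
                    "multi_match": {
                        "query": text,
                        "type": "most_fields",
                        "fields": ["title", "summary", "show_title"],
                    }
                },
                "functions": [
                    # Unfiltered boost, as in the question.
                    {"boost_factor": 2},
                    # Decay is only evaluated for docs that have the date field.
                    {
                        "filter": {"exists": {"field": date_field}},
                        "gauss": {date_field: {"scale": scale}},
                    },
                    # Docs without the field get 0 from this function, as in the answer.
                    {
                        "filter": {"missing": {"field": date_field}},
                        "script_score": {"script": "0"},
                    },
                ],
                "score_mode": "multiply",
            }
        }
    }


body = decay_only_where_field_exists("AWESOME SAUCE")
```

Because `score_mode` is `multiply`, a function that returns 0 zeroes out the combined function score for documents without `published_at`; drop the third function if those documents should keep their text score.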

In the newer elasticsearch versions you have to use the script score query. The function score query is getting deprecated.

Related

Elasticsearch find documents based on result of a main query

I want to search for documents based on a field from the result of a main query.
For example, let's say my documents contain only two fields:
userId
geopoint
I need a query that returns the document for a specific userId along with the documents of users around its geopoint.
I haven't found a way to do this in one query, so for now I am making two queries (one to retrieve the user's document and one to retrieve the users around its geopoint).
Thanks
UPDATE 1
The first query:
GET users/_search
{
"query": {
"term": {
"userId": "10250000075114"
}
}
}
Then I make the second query for users around it
GET users/_search
{
"query": {
"function_score": {
"query": {
"bool": {
"must_not": {
"term": {
"userId": "10250000075114"
}
}
}
},
"functions": [
{
"gauss": {
"rank": {
"origin": "0.8",
"offset": "0.05",
"scale": "0.1"
}
}
},
{
"gauss": {
"startPoint": {
"origin": "32.547484,34.95457",
"offset": "5km",
"scale": "10km"
}
}
},
{
"script_score": {
"script": "_score"
}
}
]
}
}
}
Where the startPoint in the second query is the startPoint result of the first.
What you are looking for is a sub-query (as found in an RDBMS), but sub-queries are not available in Elasticsearch.
However, you can filter on your user IDs and then find the users around only those users; please refer to the bool query documentation for more info and examples.
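Since there is no sub-query, the two-step flow from the question stays on the client side. A rough sketch of that flow, assuming the first query returns a `_source` containing `startPoint` (the function names here are hypothetical, and the `rank` decay and `script_score` functions from the question are left out for brevity):

```python
from typing import Any, Dict


def user_lookup_query(user_id: str) -> Dict[str, Any]:
    """First query: fetch the document of one user."""
    return {"query": {"term": {"userId": user_id}}}


def nearby_users_query(
    user_source: Dict[str, Any],
    user_id: str,
    offset: str = "5km",
    scale: str = "10km",
) -> Dict[str, Any]:
    """Second query: score other users by distance from the first user's startPoint."""
    origin = user_source["startPoint"]  # value taken from the first query's hit
    return {
        "query": {
            "function_score": {
                # Exclude the user we started from.
                "query": {"bool": {"must_not": {"term": {"userId": user_id}}}},
                "functions": [
                    {
                        "gauss": {
                            "startPoint": {
                                "origin": origin,
                                "offset": offset,
                                "scale": scale,
                            }
                        }
                    }
                ],
            }
        }
    }


# Example with a hand-written _source standing in for the first query's hit.
first_hit_source = {"userId": "10250000075114", "startPoint": "32.547484,34.95457"}
second_body = nearby_users_query(first_hit_source, "10250000075114")
```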

ElasticSearch - score boosting using scripting

We have a specific use-case for our ElasticSearch instance: we store documents which contain proper names, dates of birth, addresses, ID numbers, and other related info.
We use a name-matching plugin which overrides the default scoring of ES and assigns a relevancy score between 0 and 1 based on how closely the name matches.
What we need to do is boost that score by a certain amount if other fields match. I have started to read up on ES scripting to achieve this. I need assistance on the script part of the query. Right now, our query looks like this:
{
"size":100,
"query":{
"bool":{
"should":[
{"match":{"Name":"John Smith"}}
]
}
},
"rescore":{
"window_size":100,
"query":{
"rescore_query":{
"function_score":{
"doc_score":{
"fields":{
"Name":{"query_value":"John Smith"},
"DOB":{
"function":{
"function_score":{
"script_score":{
"script":{
"lang":"painless",
"params":{
"query_value":"01-01-1999"
},
"inline":"if **<HERE'S WHERE I NEED ASSISTANCE>**"
}
}
}
}
}
}
}
}
},
"query_weight":0.0,
"rescore_query_weight":1.0
}
}
The Name field will always be required in a query and is the basis for the score, which is returned in the default _score field; for ease of demonstration, we'll just add one additional field, DOB, which if matched, should boost the score by 0.1. I believe I'm looking for something along the lines of if(query_value == doc['DOB'].value add 0.1 to _score), or something along these lines.
So, what would be the correct syntax to be entered into the inline row to achieve this? Or, if the query requires other syntax revision, please advise.
EDIT #1 - it's important to highlight that our DOB field is a text field, not a date field.
Splitting to a separate answer as this solves the problem differently (i.e. - by using script_score as OP proposed instead of trying to rewrite away from scripts).
Assuming the same mapping and data as the previous answer, a scripted version of the query might look like the following:
POST /employee/_search
{
"size": 100,
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
]
}
},
"rescore": {
"window_size": 100,
"query": {
"rescore_query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
]
}
},
"functions": [
{
"script_score": {
"script": {
"source": "double boost = 0.0; if (params['_source']['State'] == 'FL') { boost += 0.1; } if (params['_source']['DOB'] == '1965-05-24') { boost += 0.3; } return boost;",
"lang": "painless"
}
}
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
}
Two notes about the script:
The script uses params['_source'][field_name] to access the document, which is the only way to get access to text fields. This is significantly slower as it requires accessing documents directly on disk, though this penalty might not be too bad in the context of a rescore. You could instead use doc[field_name].value if the field were an aggregatable type, such as keyword, date, or something numeric.
DOB here is compared directly to a string. This is possible because we're using the _source field, and the JSON for the documents has the dates specified as strings. This is somewhat brittle, but will likely do the trick.
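To make the script's effect concrete, here is a plain Python mirror of the Painless logic, run against the three sample documents from this thread. It is only an illustration of the arithmetic (the function name `script_boost` is made up) and says nothing about how Elasticsearch executes the script.

```python
from typing import Any, Dict


def script_boost(source: Dict[str, Any]) -> float:
    """Mirror of the Painless script: string comparisons against _source values."""
    boost = 0.0
    if source.get("State") == "FL":
        boost += 0.1
    if source.get("DOB") == "1965-05-24":
        boost += 0.3
    return boost


# Sample documents from the thread.
docs = [
    {"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"},
    {"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"},
    {"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"},
]
boosts = {doc["Name"]: script_boost(doc) for doc in docs}
```

The DOB check only matches the exact string stored in `_source`, so a document indexed with a different date format for the same day would get no boost; that is the brittleness mentioned above.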
Assuming static weights per additional field, you can accomplish this without using scripting (though you may need to use script_score for any more complex weighting). To solve your issue of directly adding to a document's original score, your rescoring query will need to be a function score query that:
Composes queries for additional fields in a should clause for the function score's main query (i.e. - will only produce scores for documents matching at least one additional field)
Uses one function per additional field, with the filter set to select documents with some value for that field, and a weight to specify how much the score should increase (or some other scoring function if desired)
Mapping (as template)
Adding a State and DOB field for sake of example (making sure multiple additional fields contribute to the score correctly)
PUT _template/employee_template
{
"index_patterns": ["employee"],
"settings": {
"number_of_shards": 1
},
"mappings": {
"_doc": {
"properties": {
"Name": {
"type": "text"
},
"State": {
"type": "keyword"
},
"DOB": {
"type": "date"
}
}
}
}
}
Sample data
POST /employee/_doc/_bulk
{"index":{}}
{"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"}
{"index":{}}
{"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"}
{"index":{}}
{"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"}
Query
EDIT: Updated the query to include the original query in the new function score in an attempt to compensate for custom scoring plugins.
A few notes about the query below:
Setting the rescorer's score_mode: max is effectively a replace here, since the newly computed function score should always be greater than or equal to the original score
query_weight and rescore_query_weight are both set to 1 so that the two scores are compared on the same scale during the score_mode: max comparison
In the function_score query:
score_mode: sum will add together all the scores from functions
boost_mode: sum will add the sum of the functions to the score of the query
POST /employee/_search
{
"size": 100,
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
]
}
},
"rescore": {
"window_size": 100,
"query": {
"rescore_query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
],
"filter": {
"bool": {
"should": [
{
"term": {
"State": "CA"
}
},
{
"range": {
"DOB": {
"lte": "1968-01-01"
}
}
}
]
}
}
}
},
"functions": [
{
"filter": {
"term": {
"State": "CA"
}
},
"weight": 0.1
},
{
"filter": {
"range": {
"DOB": {
"lte": "1968-01-01"
}
}
},
"weight": 0.3
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"score_mode": "max",
"query_weight": 1,
"rescore_query_weight": 1
}
}
}
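As a worked example of how the pieces combine, the sketch below reproduces the arithmetic in plain Python for the sample documents. It assumes, for illustration only, that the Name score inside the function_score equals the original query score and that it is 0.8 for every document; the function names are hypothetical.

```python
import math
from typing import Optional


def function_score(name_score: float, state: str, dob: str) -> Optional[float]:
    """score_mode=sum over matching weights, then boost_mode=sum with the query score.

    Returns None when neither filter matches, because the bool filter in the
    rescore query then excludes the document from rescoring.
    """
    weights = []
    if state == "CA":
        weights.append(0.1)
    if dob <= "1968-01-01":  # ISO dates compare correctly as strings
        weights.append(0.3)
    if not weights:
        return None
    return name_score + sum(weights)


def rescored(original: float, fs: Optional[float],
             query_weight: float = 1.0, rescore_query_weight: float = 1.0) -> float:
    """Rescorer with score_mode=max; non-matching docs keep the weighted original."""
    if fs is None:
        return query_weight * original
    return max(query_weight * original, rescore_query_weight * fs)


original = 0.8  # assumed name-match score, not a real cluster result
smith = rescored(original, function_score(original, "NY", "1970-01-01"))
reilly = rescored(original, function_score(original, "CA", "1965-05-24"))
ferrell = rescored(original, function_score(original, "FL", "1967-07-16"))
```

Under those assumptions John Smith keeps 0.8, Will Ferrell ends up around 1.1 (DOB only), and John C. Reilly around 1.2 (State and DOB).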

Elasticsearch apply conditions in query on basis of results count

Is there any way in Elasticsearch to get the following type of outcome?
"Apply the first condition; if no results are found, apply the next condition, and so on."
I am aware of the basics of ES queries. I know this can be done by querying again and again based on the results, but I want to do it in a single query for the sake of time and efficiency.
Here is my current query:
GET _search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"geo_bounding_box": {
"location": {
"top_left": {
"lat": 28.6143519,
"lon": -81.50773
},
"bottom_right": {
"lat": 28.3479859,
"lon": -81.22977
}
}
}
}
]
}
}
}
},
"size": 10,
"from": 0,
"sort": {
"search_score": {
"order": "desc"
}
}
}
Now what I want is: if this query returns zero results, it should search again with a larger set of lat/lon bounds. I can do this by re-querying Elasticsearch, but that would be inefficient.
Is this possible in Elasticsearch?
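For reference, the re-querying fallback mentioned above could be sketched as follows. This is the multi-request approach, not a single-query solution. `search` is a hypothetical callable that sends a request body to Elasticsearch and returns a list of hits; it is injected so the loop can be tried without a cluster. The query shape is copied from the question.

```python
from typing import Any, Callable, Dict, List

Box = Dict[str, Dict[str, float]]


def build_query(box: Box) -> Dict[str, Any]:
    """Same shape as the query in the question, with the bounding box swapped in."""
    return {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {"must": [{"geo_bounding_box": {"location": box}}]}
                }
            }
        },
        "size": 10,
        "from": 0,
        "sort": {"search_score": {"order": "desc"}},
    }


def first_non_empty(
    boxes: List[Box], search: Callable[[Dict[str, Any]], List[Any]]
) -> List[Any]:
    """Try each box in order (smallest first) and stop at the first one with hits."""
    for box in boxes:
        hits = search(build_query(box))
        if hits:
            return hits
    return []
```

Each fallback costs one extra round trip, so the list of boxes is best kept short.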

How can we use exists query in tandem with the search query?

I have a scenario in Elasticsearch where my indexed docs look like this:
{"id":1,"name":"xyz", "address": "xyz123"}
{"id":1,"name":"xyz", "address": "xyz123"}
{"id":1,"name":"xyz", "address": "xyz123", "note": "imp"}
Here the requirement is to run a term match query and score the results by relevance, which is straightforward. The additional aspect is that any document in the search results that has a note field should be given higher relevance. How can we achieve this with a DSL query? Using exists we can check which docs contain a note, but how do we integrate that with the match query? I have tried a lot of ways, but none worked.
With ES 5, you could boost your exists query to give a higher score to documents with a note field. For example,
{
"query": {
"bool": {
"must": {
"match": {
"name": {
"query": "your term"
}
}
},
"should": {
"exists": {
"field": "note",
"boost": 4
}
}
}
}
}
With ES 2, you could try a boosted filtered subset
{
"query": {
"function_score": {
"query": {
"match": { "name": "your term" }
},
"functions": [
{
"filter": { "exists" : { "field" : "note" }},
"weight": 4
}
],
"score_mode": "sum"
}
}
}
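If you build the request in application code, the function_score variant above can be wrapped in a small helper. This is only a sketch; `boost_if_field_exists` is a made-up name and it simply returns the same body shape as the JSON above.

```python
from typing import Any, Dict


def boost_if_field_exists(
    base_query: Dict[str, Any], field: str, weight: float
) -> Dict[str, Any]:
    """Wrap a query so that docs containing `field` get an extra weight function."""
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "functions": [
                    {"filter": {"exists": {"field": field}}, "weight": weight}
                ],
                "score_mode": "sum",
            }
        }
    }


body = boost_if_field_exists({"match": {"name": "your term"}}, "note", 4)
```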
I believe you are looking for the boosting query feature:
https://www.elastic.co/guide/en/elasticsearch/reference/5.1/query-dsl-boosting-query.html
{
"query": {
"boosting": {
"positive": {
<put your original query here>
},
"negative": {
"filtered": {
"filter": {
"exists": {
"field": "note"
}
}
}
},
"negative_boost": 4
}
}
}

Elasticsearch outputs the score of 1.0 for all results when searching for a single "starred" term

We are using Elasticsearch to search for the most relevant companies in a specific catalog. When we use the normal search term like lettering we get reasonable scores and can sort the results according to the score.
However, when we modify the search term before querying and make the "starred" version of it (e.g., *lettering*) to be able to search for substrings we get a score of 1.0 for every result. The search for substrings is a requirement in the project.
Any ideas on what could cause this relevance computation? The problem occurs only when a single term is used. We get comprehensible scores when we use two starred terms in combination (e.g., *lettering* *digital*).
EDIT 1:
Example mapping (YAML; other properties are mapped in the same way, except for boost, which is different for each property):
elasticSearchMapping:
type: object
include_in_all: true
enabled: true
properties:
'keywords':
type: string
include_in_all: true
boost: 50
Query:
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [{
"match_all": []
}, {
"query_string": {
"query": "*lettering*"
}
}]
}
},
"filter": {
"bool": {
"must": [{
"term": {
"__parentPath": "/sites/industrycatalog"
}
}, {
"terms": {
"__workspace": ["live"]
}
}, {
"term": {
"__dimensionCombinationHash": "d751713988987e9331980363e24189ce"
}
}, {
"term": {
"__typeAndSupertypes": "IndustryCatalog:Entry"
}
}],
"should": [],
"must_not": [{
"term": {
"_hidden": true
}
}, {
"range": {
"_hiddenBeforeDateTime": {
"gt": "now"
}
}
}, {
"range": {
"_hiddenAfterDateTime": {
"lt": "now"
}
}
}]
}
}
}
},
"fields": ["__path"],
"script_fields": {
"distance": {
"script": "doc['coordinates'].distanceInKm(51.75631079999999,14.332867899999997)"
}
},
"sort": [{
"customer.featureFlags.industrycatalog": {
"order": "asc"
}
}, {
"_geo_distance": {
"coordinates": {
"lat": "51.75631079999999",
"lon": "14.332867899999997"
},
"order": "asc",
"unit": "km",
"distance_type": "plane"
}
}],
"size": 999999
}
What you are running is a wildcard query. Wildcard queries are term-level queries, and by default a constant score is applied to them.
See the Lucene documentation: WildcardQuery extends MultiTermQuery.
You can also verify this with the explain API; you will see something like this:
"_explanation": {
"value": 1,
"description": "ConstantScore(company:lettering), product of:",
"details": [{
"value": 1,
"description": "boost"
}, {
"value": 1,
"description": "queryNorm"
}]
}
You can change this behavior with the rewrite parameter.
Try this (rewrite also works with the query string query):
{
"query": {
"wildcard": {
"company": {
"value": "digital*",
"rewrite": "scoring_boolean"
}
}
}
}
The rewrite parameter has various options for scoring; see which one fits your requirement.
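Since the question uses query_string rather than a wildcard query, here is a small sketch of the same idea applied there. The helper name is hypothetical; it only builds the request body, with `rewrite` set on the query_string clause.

```python
from typing import Any, Dict


def starred_query_string(term: str, rewrite: str = "scoring_boolean") -> Dict[str, Any]:
    """Build a query_string clause for *term* with an explicit rewrite method."""
    return {
        "query": {
            "query_string": {
                "query": f"*{term}*",
                "rewrite": rewrite,
            }
        }
    }


body = starred_query_string("lettering")
```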
EDIT 1: the reason you see scores other than 1 for *lettering* *digital* is queryNorm; you can again check this with the explain API. If you look closely, all documents matching both terms will share one score, and all documents matching a single term will share another.
P.S.: A leading wildcard is not recommended at all. You will get performance issues, since it has to check against every single term in the inverted index. You might want to look at the edge ngram or ngram filter instead.
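To illustrate why ngrams help with substring search, here is a plain Python sketch of what such filters emit for one token. It is not the Elasticsearch implementation, the output order is not meant to match it, and the function names and gram sizes are chosen for the example.

```python
from typing import List


def ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """All substrings of the token with length between min_gram and max_gram."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]


def edge_ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """Only the prefixes of the token, which is what an edge ngram filter keeps."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


trigrams = ngrams("lettering", 3, 3)
prefixes = edge_ngrams("digital", 2, 4)
```

With grams like these stored at index time, a search for a fragment such as "ett" becomes an ordinary term lookup instead of a scan over every term, at the cost of a larger index.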
Hope this helps!
