Get intermediate scores within 'function_score' in Elasticsearch

Here's my index data:
POST /blogs/1
{
"name" : "learn java",
"popularity" : 100
}
POST /blogs/2
{
"name" : "learn elasticsearch",
"popularity" : 10
}
My search query:
GET /blogs/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": "learn"
}
},
"script_score": {
"script": {
"source": "_score*(1+Math.log(1+doc['popularity'].value))"
}
}
}
}
}
which returns:
[
{
"_index": "blogs",
"_type": "1",
"_id": "AW5fxnperVbDy5wjSDBC",
"_score": 0.58024323,
"_source": {
"name": "learn elastic search",
"popularity": 100
}
},
{
"_index": "blogs",
"_type": "1",
"_id": "AW5fxqmL8cCMCxtBYOyC",
"_score": 0.43638366,
"_source": {
"name": "learn java",
"popularity": 10
}
}
]
Problem: I need to return an extra field in the results which would give me the raw score (just TF/IDF, which doesn't take popularity into account).
Things I have explored: script_fields (which don't give access to _score at fetch time).

The problem is in the way you are querying, which overwrites the _score variable. If you instead use sort, the _score isn't changed and both values can be pulled from the same query.
You can try querying this way:
{
"query": {
"match": {
"name": "learn"
}
},
"sort": [
{
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": "_score*(1+Math.log(1+doc['popularity'].value))"
},
"order": "desc"
}
},
"_score"
]
}
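With this approach, each hit's sort array carries both numbers: the first entry is the script value (the popularity-boosted score) and the second is the untouched _score. A hit would look roughly like this (the numbers are illustrative, not computed from the data above):
"hits": [
  {
    "_id": "...",
    "sort": [
      1.97,          <-- script value: _score * (1 + Math.log(1 + popularity))
      0.58           <-- raw _score from the match query
    ],
    "_source": {
      "name": "learn elasticsearch",
      "popularity": 10
    }
  }
]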

Related

How to search over all fields and return every document containing that search in elasticsearch?

I have a problem regarding searching in Elasticsearch.
I have an index with multiple documents with several fields. I want to be able to search over all the fields with a single query and have it return all the documents that contain the value specified in the query. I found that using simple_query_string worked well for this. However, it does not return consistent results. In my index I have documents with several fields that contain dates. For example:
"revisionDate" : "2008-01-01T00:00:00",
"projectSmirCreationDate" : "2008-07-01T00:00:00",
"changedDate" : "1971-01-01T00:00:00",
"dueDate" : "0001-01-01T00:00:00",
Those are just a few examples; however, when I query the index, for example:
GET new_document-20_v2/_search
{
"size": 1000,
"query": {
"simple_query_string" : {
"query": "2008"
}
}
}
It only returns two documents. This is a problem because I have many more documents than just two that contain the value "2008" in their fields.
I also have problems searching file names.
In my index there are fields that contain file names like this:
"fileName" : "testPDF.pdf",
"fileName" : "demo.pdf",
"fileName" : "demo.txt",
When I query:
GET new_document-20_v2/_search
{
"size": 1000,
"query": {
"simple_query_string" : {
"query": "demo"
}
}
}
I get no results.
But if I query:
GET new_document-20_v2/_search
{
"size": 1000,
"query": {
"simple_query_string" : {
"query": "demo.txt"
}
}
}
I get the proper result.
Is there a better way to search across all documents and fields than what I did? I want it to return all the documents matching the query, not just two or zero.
Any help would be greatly appreciated.
Elasticsearch uses the standard analyzer if no analyzer is specified. Since no analyzer is specified on "fileName", demo.txt gets tokenized to
{
"tokens": [
{
"token": "demo.txt",
"start_offset": 0,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0
}
]
}
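You can verify this tokenization yourself with the _analyze API:
GET _analyze
{
  "analyzer": "standard",
  "text": "demo.txt"
}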
Now, searching for demo will not give any result, but searching for demo.txt will.
You can instead use a wildcard query to search for documents that have demo in fileName:
{
"query": {
"wildcard": {
"fileName": {
"value": "demo*"
}
}
}
}
The search result will be:
"hits": [
{
"_index": "67303015",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"fileName": "demo.pdf"
}
},
{
"_index": "67303015",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"fileName": "demo.txt"
}
}
]
Since revisionDate, projectSmirCreationDate, changedDate, and dueDate are all of type date, you cannot do a partial search on these dates.
You can use multi-fields to add one more field (of text type) to each of the above fields. Modify your index mapping as shown below:
{
"mappings": {
"properties": {
"changedDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
},
"projectSmirCreationDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
},
"dueDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
},
"revisionDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
}
}
}
}
Index Data:
{
"revisionDate": "2008-02-01T00:00:00",
"projectSmirCreationDate": "2008-02-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
{
"revisionDate": "2008-01-01T00:00:00",
"projectSmirCreationDate": "2008-07-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
Search Query:
{
"query": {
"multi_match": {
"query": "2008"
}
}
}
Search Result:
"hits": [
{
"_index": "67303015",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"revisionDate": "2008-01-01T00:00:00",
"projectSmirCreationDate": "2008-07-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
},
{
"_index": "67303015",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"revisionDate": "2008-02-01T00:00:00",
"projectSmirCreationDate": "2008-02-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
}
]
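If you prefer not to rely on the default field list, you can also point the query explicitly at the new text subfields (a sketch based on the mapping above; the field names follow the .raw multi-fields defined there):
{
  "query": {
    "multi_match": {
      "query": "2008",
      "fields": [
        "revisionDate.raw",
        "projectSmirCreationDate.raw",
        "changedDate.raw",
        "dueDate.raw"
      ]
    }
  }
}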

Elasticsearch query filter combination issue

I'm trying to understand why the below Elasticsearch query does not work.
EDIT:
The fields mentioned in the query are from different indices. For example, the filter uses the classification field, which lives in a different index from the fields mentioned in the query string.
The expectation for the filter is that when the user searches specifically on the classification field (i.e. secret or protected), those values are displayed. Otherwise, if the user searches on any other field from a different index, for example firstname or person, no filter should be applied, since firstname and person are not part of the filter.
{
"query": {
"bool": {
"filter": {
"terms": {
"classification": [
"secret",
"protected"
]
}
},
"must": {
"query_string": {
"query": "*john*",
"fields": [
"classification",
"firstname",
"releasability",
"person"
]
}
}
}
}
}
The expected result is that john in the field person is returned. This works when no filter is applied, as in:
{
"query": {
"query_string": {
"query": "*john*",
"fields": [
"classification",
"firstname",
"releasability",
"person"
]
}
}
}
The purpose of the filter is only to restrict records when the said fields contain the values mentioned; otherwise it should work for all values.
Why is it not producing results for john, and only producing results for the classification values?
Adding a working example with sample index data and search query.
To learn more about the bool query, refer to the official documentation.
Index Data:
Index data in my_index index
{
"name":"John",
"title":"b"
}
{
"name":"Johns",
"title":"a"
}
Index data in my_index1 index
{
"classification":"protected"
}
{
"classification":"secret"
}
Search Query:
POST http://localhost:9200/_search
{
"query": {
"bool": {
"should": [
{
"bool": {
"filter": [
{
"terms": {
"classification": [
"secret",
"protected"
]
}
}
]
}
},
{
"bool": {
"must": [
{
"query_string": {
"query": "*john*",
"fields": [
"name",
"title"
]
}
}
]
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "John",
"title": "b"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"name": "Johns",
"title": "a"
}
},
{
"_index": "my_index1",
"_type": "_doc",
"_id": "1",
"_score": 0.0,
"_source": {
"classification": "secret"
}
},
{
"_index": "my_index1",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"classification": "protected"
}
}
]

ElasticSearch: Avg aggregation for datetime format

I am stuck on an Elasticsearch query using Python.
I have data such as:
{
"_index": "user_log",
"_type": "logs",
"_id": "gdUJpXIBAoADuwvHTK29",
"_score": 1,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "gtUJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-21 09:15:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g9UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-22 07:50:00"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g8UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-22 04:15:01"
}
}
Here, each user has working hours on different dates (the 21st and the 22nd). I want to take the average of each user's working hours.
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name"
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_hours"
}
}
}
}
}
}
This query is not working. How do I find the average working hours for each user across all dates? I also want to run this query using the Python Elasticsearch client.
Updated:
When I use the ingest pipeline as @Val mentioned, I get an error:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "compile error",
"processor_type" : "script",
"script_stack" : [
"\n def workDate = /\\s+/.split(ctx.working_h ...",
" ^---- HERE"
],
"script" : "\n def workDate = /\\s+/.split(ctx.working_hours);\n def workHours = /:/.split(workDate[1]);\n ctx.working_minutes = (Integer.parseInt(workHours[0]) * 60) + Integer.parseInt(workHours[1]);\n ",
"lang" : "painless",
"position" : {
"offset" : 24,
"start" : 0,
"end" : 49
}
}
.....
How can I solve it?
The problem is that your working_hours field is a point in time and does not denote a duration.
For this use case, it's best to store the working day and working hours in two separate fields and store the working hours in minutes.
So instead of having documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01",
}
Create documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_day": "2019-10-21",
"working_hours": "09:00:01",
"working_minutes": 540
}
Then you can use your query on the working_minutes field:
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name.keyword",
"order": {
"avg_hours": "desc"
}
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_minutes"
}
}
}
}
}
}
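Since the question also asks about running this from Python, here is a minimal sketch with the official Python client (the host, index name, and client setup are assumptions; adjust them for your cluster):
from elasticsearch import Elasticsearch

# assumed local cluster; change host/auth as needed
es = Elasticsearch(["http://localhost:9200"])

query = {
    "size": 0,
    "query": {"match_all": {}},
    "aggs": {
        "users": {
            "terms": {
                "field": "user_name.keyword",
                "order": {"avg_hours": "desc"}
            },
            "aggs": {
                "avg_hours": {"avg": {"field": "working_minutes"}}
            }
        }
    }
}

resp = es.search(index="user_log", body=query)
for bucket in resp["aggregations"]["users"]["buckets"]:
    # avg_hours is in minutes; divide by 60 to get hours
    print(bucket["key"], round(bucket["avg_hours"]["value"] / 60, 2))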
If it is not convenient to compute the working_minutes field in your client code, you can achieve the same thing using an ingest pipeline. Let's define the pipeline first:
PUT _ingest/pipeline/working-hours
{
"processors": [
{
"dissect": {
"field": "working_hours",
"pattern": "%{?date} %{tmp_hours}:%{tmp_minutes}:%{?seconds}"
}
},
{
"convert": {
"field": "tmp_hours",
"type": "integer"
}
},
{
"convert": {
"field": "tmp_minutes",
"type": "integer"
}
},
{
"script": {
"source": """
ctx.working_minutes = (ctx.tmp_hours * 60) + ctx.tmp_minutes;
"""
}
},
{
"remove": {
"field": [
"tmp_hours",
"tmp_minutes"
]
}
}
]
}
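You can test the pipeline before wiring it into your indexing code with the simulate API, for example with one of the documents from the question:
POST _ingest/pipeline/working-hours/_simulate
{
  "docs": [
    {
      "_source": {
        "user_name": "prathameshsalap#gmail.com",
        "working_hours": "2019-10-21 09:00:01"
      }
    }
  ]
}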
Then you need to update your Python client code to use the new pipeline, which will create the working_minutes field for you:
helpers.bulk(es, reader, index='user_log', doc_type='logs', pipeline='working-hours')

How to change the order of search results on Elastic Search?

I am getting results from the following Elasticsearch query:
"query": {
"bool": {
"should": [
{"match_phrase_prefix": {"title": keyword}},
{"match_phrase_prefix": {"second_title": keyword}}
]
}
}
The results are good, but I want to change the order so that results matching on title come to the top.
Any help would be appreciated!
I was able to reproduce the issue with sample data. My solution uses a query-time boost, as index-time boost has been deprecated since major version 5 of ES.
I've also created the sample data in such a way that, without the boost, both documents get the same score, so there is no guarantee that the one matching on title comes first in the search result; this should help you understand it better.
1. Index Mapping
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"second_title" :{
"type" :"text"
}
}
}
}
2. Index Sample docs
a)
{
"title": "opster",
"second_title" : "Dimitry"
}
b)
{
"title": "Dimitry",
"second_title" : "opster"
}
Search query
{
"query": {
"bool": {
"should": [
{
"match_phrase_prefix": {
"title": {
"query" : "dimitry",
"boost" : 2.0 <-- Notice the boost in `title` field
}
}
},
{
"match_phrase_prefix": {
"second_title": {
"query" : "dimitry"
}
}
}
]
}
}
}
Output
"hits": [
{
"_index": "60454337",
"_type": "_doc",
"_id": "1",
"_score": 1.3862944,
"_source": {
"title": "Dimitry", <-- Dimitry in title field has doube score
"second_title": "opster"
}
},
{
"_index": "60454337",
"_type": "_doc",
"_id": "2",
"_score": 0.6931472,
"_source": {
"title": "opster",
"second_title": "Dimitry"
}
}
]
Let me know if you have any doubts about it.

Boosting results based on selected types in elasticsearch

I have different types indexed in Elasticsearch.
But if I want to boost my results for some selected types, what should I do?
I could use a type filter in a boosting query, but the type filter only allows one type to be used in the filter. I need results to be boosted on the basis of multiple types.
Example:
I have Person, Event, and Location data indexed in Elasticsearch, where Person, Location, and Event are my types.
I am searching for the keyword 'London' in all types, but I want Person and Event type records to be boosted above Location.
How could I achieve the same?
One way of getting the desired functionality is to wrap your query inside a bool query and make use of the should clause in order to boost certain documents.
A small example:
POST test/person
{
"title": "london elise moore"
}
POST test/event
{
"title" : "london is a great city"
}
Without boost:
GET test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "london"
}
}
]
}
}
}
With the following response:
"hits": {
"total": 2,
"max_score": 0.2972674,
"hits": [
{
"_index": "test",
"_type": "person",
"_id": "AVVx621GYvUb9aQn6r5X",
"_score": 0.2972674,
"_source": {
"title": "london elise moore"
}
},
{
"_index": "test",
"_type": "event",
"_id": "AVVx63LrYvUb9aQn6r5Y",
"_score": 0.26010898,
"_source": {
"title": "london is a great city"
}
}
]
}
And now with the added should clause:
GET test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "london"
}
}
],
"should": [
{
"term": {
"_type": {
"value": "event",
"boost": 2
}
}
}
]
}
}
}
Which gives back the following response:
"hits": {
"total": 2,
"max_score": 1.0326607,
"hits": [
{
"_index": "test",
"_type": "event",
"_id": "AVVx63LrYvUb9aQn6r5Y",
"_score": 1.0326607,
"_source": {
"title": "london is a great city"
}
},
{
"_index": "test",
"_type": "person",
"_id": "AVVx621GYvUb9aQn6r5X",
"_score": 0.04235228,
"_source": {
"title": "london elise moore"
}
}
]
}
You could even leave out the extra boost in the should clause, because if the should clause matches it will boost the result anyway :)
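For illustration (this variant is not in the original answer), dropping the explicit boost would simply look like this:
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "london" } }
      ],
      "should": [
        { "term": { "_type": "event" } }
      ]
    }
  }
}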
Hope this helps!
I see two ways of doing that, but both use scripts.
1. Using sorting
POST c1_1/_search
{
"from": 0,
"size": 10,
"sort": [
{
"_script": {
"order": "desc",
"type": "number",
"script": "double boost = 1; if(doc['_type'].value == 'Person') { boost *= 2 }; if(doc['_type'].value == 'Event') { boost *= 3}; return _score * boost; ",
"params": {}
}
},
{
"_score": {}
}
],
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "*",
"default_operator": "and"
}
}
],
"minimum_should_match": "1"
}
}
}
2. Using function_score
POST c1_1/_search
{
"from": 0,
"size": 10,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "*",
"default_operator": "and"
}
}
],
"minimum_should_match": "1"
}
},
"script_score": {
"script": "_score * (doc['_type'].value == 'Person' || doc['_type'].value == 'Event'? 2 : 1)"
}
}
}
}
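If you'd rather avoid scripts altogether, a similar effect can usually be achieved with a filter function and a weight inside function_score; this is a sketch, not part of the original answers:
POST c1_1/_search
{
  "query": {
    "function_score": {
      "query": {
        "query_string": {
          "query": "*",
          "default_operator": "and"
        }
      },
      "functions": [
        {
          "filter": { "terms": { "_type": ["Person", "Event"] } },
          "weight": 2
        }
      ],
      "boost_mode": "multiply"
    }
  }
}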
