ElasticSearch - Boosting based on depth in a recursive structure

I am using Elasticsearch 2.4.4 (compatible with Spring Boot 1.5.2).
I have a document object which has the following structure:
{
  "id": 1,
  "title": "Doc title",
  // some more metadata
  "sections": [
    {
      "id": 2,
      "title": "Sec title 1",
      "sections": [...]
    },
    {
      "id": 3,
      "title": "Sec title 2",
      "sections": [...]
    }
  ]
}
Basically I want to make the titles in the document searchable (the document title, section titles, and subsection titles at any level), and I want to be able to score the documents based on the level at which they match in the tree hierarchy.
My initial thought was to use a structure like this:
{
  "titles": [
    { "title": "doc title", "depth": 0 },
    { "title": "sec title 1", "depth": 1 },
    { "title": "sec title 2", "depth": 1 },
    ...
  ]
}
I would like to rank the documents based on the depth at which there is a match (the higher the depth, the lower the score).
I know about basic boosting based on a field, but is there a way to do this in Elasticsearch?
OR
Is it possible to do it by changing the structure?

Yes, you can achieve this by indexing documents in your modified format (a flat array of objects) using a Nested datatype mapping and a Function Score Query inside of a Nested Query:
PUT someindex
{
"mappings": {"sometype":{"properties": {"titles":{"type": "nested"}}}}
}
POST someindex/sometype/0
{
"titles": [
{ "title": "doc title", "depth": 0 },
{ "title": "sec title 1", "depth": 1 },
{ "title": "sec title 2", "depth": 1 }
]
}
POST someindex/sometype/1
{
"titles": [
{ "title": "sec doc title", "depth": 0 }
]
}
GET someindex/sometype/_search
{
  "query": {
    "nested": {
      "path": "titles",
      "score_mode": "max",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "titles.title": "sec"
            }
          },
          "functions": [
            {
              "exp": {
                "titles.depth": {
                  "origin": 0,
                  "scale": 1
                }
              }
            }
          ]
        }
      }
    }
  }
}
In this example, the second document (indexed with ID 1) is scored higher because it has a title matching sec at depth 0, whereas the first document (ID 0) only has titles matching sec at depth 1.
The nested datatype and query ensure that the function_score associates each matching title with its own depth, and the exp decay function prioritizes titles with lower depth.
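For reference, exp is one of the standard decay functions: with its defaults (offset 0, decay 0.5) and scale 1, a match at depth 0 keeps its full score, a match at depth 1 is multiplied by 0.5, and a match at depth 2 by 0.25. If you want a different drop-off, you can spell out offset and decay explicitly; a minimal variant of the functions block from the query above (with illustrative values) would be:
"functions": [
  {
    "exp": {
      "titles.depth": {
        "origin": 0,
        "scale": 1,
        "offset": 0,
        "decay": 0.5
      }
    }
  }
]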

Related

Elasticsearch conditional sorting by different fields

Let's say that my business need is to sort results differently, based on some "external" parameter that I'm passing to the query.
Documents are more or less like:
{
"transfer_rate": 2000.00,
"some_collection": [
{ "transfer_rate": 1000.00, "identifier": 1, "campaign": 1 },
{ "transfer_rate": 500.00, "identifier": 2, "campaign": 2 },
{ "transfer_rate": 750.00, "identifier": 3, "campaign": 3 },
//...
]
},
{
"transfer_rate": 500.00,
"some_collection": [
{ "transfer_rate": 1000.00, "identifier": 4, "campaign": 1 },
{ "transfer_rate": 2000.00, "identifier": 5, "campaign": 2 },
{ "transfer_rate": 625.00, "identifier": 6, "campaign": 3 },
{ "transfer_rate": 225.00, "identifier": 7, "campaign": 1 },
//...
]
}
And now I have my "parameter"; let's say it's equal to 750.00.
Now, I would like to order this set of documents differently, depending on how the root's transfer_rate compares to the given param, as follows:
If doc['transfer_rate'] >= _param then sort by doc['transfer_rate'], else sort by the MIN of doc['some_collection'].transfer_rate.
I know that some document optimisations could be done, but I wasn't the one who invented this model, nor am I allowed to change it or re-index.
The tricky part about the nested objects is that they contain a property (in the given example it's campaign) that has to match a criterion, so basically:
When doc['transfer_rate'] is LT _param, order by the minimum value of doc['some_collection'].transfer_rate where campaign equals XYZ.
So, for the given example with the given parameter, documents like the first one should be ordered by doc['transfer_rate'], and documents like the second one should be ordered by the nested value.
Thanks for any advice / links / support.
This is going to be a pain if you cannot reindex the data.
I came up with this query:
GET /71095886/_search
{
  "query": {
    "nested": {
      "path": "some_collection",
      "query": {
        "match": {
          "some_collection.campaign": 1
        }
      }
    }
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": """
          if (doc['transfer_rate'].value >= params.factor) {
            return doc['transfer_rate'].value;
          } else {
            def min = 10000;
            for (item in doc['some_collection']) {
              if (item['transfer_rate'] < min) {
                min = item['transfer_rate'];
              }
            }
            return min;
          }
        """,
        "params": {
          "factor": 2000
        }
      },
      "order": "asc"
    }
  }
}
But it won't work because of the nested objects and how they are stored in Elastic (actually Lucene, but let's not go down that road... yet).
If you add "nested_path": "some_collection" in _script, you won't have access to the root-level transfer_rate anymore (because it is stored in different Lucene documents).
Maybe one thing you can look into is runtime fields.
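As a rough, untested sketch of that direction (assuming a version that supports runtime_mappings in the search request, 7.11 or later): runtime field scripts can read the original document via params._source (slower than doc values), which sidesteps the nested limitation, and the search can then sort on that field. The effective_rate name and the parameter values below are made up for illustration.
GET /71095886/_search
{
  "runtime_mappings": {
    "effective_rate": {
      "type": "double",
      "script": {
        "source": """
          // Sketch only: read values straight from _source so the nested
          // objects are visible to the script; slower than doc values.
          double root = (double) params._source['transfer_rate'];
          if (root >= params.factor) {
            emit(root);
          } else {
            double min = 1e12; // arbitrary large starting value
            for (def item : params._source['some_collection']) {
              double rate = (double) item['transfer_rate'];
              if (item['campaign'] == params.campaign && rate < min) {
                min = rate;
              }
            }
            emit(min);
          }
        """,
        "params": { "factor": 750.0, "campaign": 1 }
      }
    }
  },
  "query": {
    "nested": {
      "path": "some_collection",
      "query": { "match": { "some_collection.campaign": 1 } }
    }
  },
  "sort": [
    { "effective_rate": "asc" }
  ]
}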

Custom ordering on elastic search

I'm executing a simple query which returns items matched by companyId.
In addition to only showing clients matching a specific company, I also want records matching a certain location to appear at the top. So if I somehow pass through a pseudo sort: "location=Johannesburg", it would return the data below, and items which match the specific location would appear on top, followed by items with other locations.
Data:
{
"clientId" : 1,
"clientName" : "Name1",
"companyId" : 8,
"location" : "Cape Town"
},
{
"clientId" : 2,
"clientName" : "Name2",
"companyId" : 8,
"location" : "Johannesburg"
}
Query:
{
"query": {
"match": {
"companyId": "8"
}
},
"size": 10,
"_source": {
"includes": [
"firstName",
"companyId",
"location"
]
}
}
Is something like this possible in Elasticsearch, and if so, what is the name of this concept? (I'm not sure what to even Google for to solve this problem.)
It can be done in different ways.
The simplest (if you go only with text matching) is to use a bool query with a should clause.
The bool query takes a more-matches-is-better approach, so the score from each matching must or should clause will be added together to provide the final _score for each document. (from the docs)
Example:
{"query":
"bool": {
"must": [
"match": {
"companyId": "8"
}
],
"should": [
"match": {
"location": "Johannesburg"
}
]
}
}
}
A more complex solution is to store geo points in location and use a distance_feature query, for example.
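A hedged sketch of that geo variant (it assumes a hypothetical geo_point field called location_point holding each client's coordinates, and distance_feature needs Elasticsearch 7.1+); documents closer to the given origin get a bigger score boost:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "companyId": "8" } }
      ],
      "should": [
        {
          "distance_feature": {
            "field": "location_point",
            "origin": [ 28.0473, -26.2041 ],
            "pivot": "50km"
          }
        }
      ]
    }
  }
}
The origin above is roughly Johannesburg as [lon, lat], and pivot controls how quickly the boost falls off with distance.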

Get the number of appearances of a particular term in an elasticsearch field

I have an elasticsearch index (posts) with the following mappings:
{
"id": "integer",
"title": "text",
"description": "text"
}
I want to simply find the number of occurrences of a particular term inside the description field for a single particular document (I have the document id and the term to find).
E.g. I have a post like this: {id: 123, title: "some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id/post id for this post and just want to find how many times the word "city" appears in the description for this particular post (the result should be 2 in this case).
I can't seem to find the way to do this search; I don't want the occurrences across ALL the documents, just for a single document and inside one of its fields. Please suggest a query for this. Thanks.
Elasticsearch Version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "fields": {
          "simple_analyzer": {
            "type": "text",
            "fielddata": true,
            "analyzer": "simple"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
"id": 123,
"title": "some title",
"description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_agg": {
"terms": {
"field": "description.simple_analyzer",
"size": 20
}
}
}
}
Yielding:
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "city",
"doc_count" : 1
},
{
"key" : "description",
"doc_count" : 1
},
...
]
}
}
Now, as you can see, the simple analyzer split the string into words and lowercased them, but the terms aggregation counts documents, not term occurrences, so the duplicate city in your string still shows up with a doc_count of 1, whichever analyzer you pick. With that being said,
It's advisable to do these word counts before you index!
You would split your string by whitespace and index the per-word counts instead of just a long string.
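One rough sketch of that pre-index approach uses an ingest pipeline with a Painless script processor (the pipeline name and the description_word_counts target field are made up for illustration; splitOnToken avoids the regex setting mentioned below):
PUT _ingest/pipeline/description_word_counts
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // Split the description on spaces and store per-word counts
          // in a separate object field on the document.
          if (ctx.description == null) return;
          Map counts = new HashMap();
          for (String w : ctx.description.splitOnToken(' ')) {
            if (w.length() == 0) continue;
            counts[w] = counts.containsKey(w) ? counts[w] + 1 : 1;
          }
          ctx.description_word_counts = counts;
        """
      }
    }
  ]
}
Documents indexed with ?pipeline=description_word_counts then carry the counts, so reading the count of city for one post becomes a plain GET by id rather than an aggregation.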
This is also possible at search time, albeit it's very expensive, does not scale well, and you need to have script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_script": {
"scripted_metric": {
"params": {
"word_of_interest": ""
},
"init_script": "state.map = [:];",
"map_script": """
if (!doc.containsKey('description')) return;
def split_by_whitespace = / /.split(doc['description.keyword'].value);
for (def word : split_by_whitespace) {
if (params['word_of_interest'] !== "" && params['word_of_interest'] != word) {
return;
}
if (state.map.containsKey(word)) {
state.map[word] += 1;
return;
}
state.map[word] = 1;
}
""",
"combine_script": "return state.map;",
"reduce_script": "return states;"
}
}
}
}
yielding
...
"aggregations" : {
"terms_script" : {
"value" : [
{
"occurrences" : 1,
"post" : 1,
"city" : 2, <------
"LA," : 1,
"of" : 1,
"this" : 1,
"description" : 1,
"is" : 1,
"has" : 1,
"my" : 1,
"two" : 1,
"word" : 1
}
]
}
}
...

Elasticsearch boosting score for values in array

I am trying to implement scoring of documents based on certain values stored in an array via Elasticsearch. For example, my documents contain an array object like this:
Document 1:
{
  "id": "test",
  "marks": [
    { "classtype": "x1", "value": 90 }
  ]
}
Document 2:
{
  "id": "test2",
  "marks": [
    { "classtype": "x1", "value": 50 },
    { "classtype": "x2", "value": 60 }
  ]
}
I want my output scores to be boosted by choosing a boosting factor on the basis of "classtype", but applied to "value".
The equivalent code would be:
var boostingfactor = {
"x1" : 1,
"x2" : 10
}
var smartscore = 0;
marks.forEach(function(mark){
return smartscore += mark.value * boostingfactor[mark.classtype];
});
return smartscore;
I have tried an Elasticsearch query on integer values, but I'm not sure if the same can be done for values present in an array. I also tried writing scripts in Elasticsearch's Painless language, but couldn't find the right way to filter values based on classtype.
POST /student/_search
{
"query": {
"function_score": {
"script_score" : {
"script" : {
"params": {
"x1": 1,
"x2": 10
},
"source": "params[doc['marks.classtype']] * marks.value"
}
}
}
}
}
The expected result is a score of 90 (90*1) for sample document 1 and 650 (50*1 + 60*10) for document 2, but the above query fails with this exception:
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"params[doc['marks.classtype'].value]",
" ^---- HERE"
],
"script": "params[doc['marks.classtype'].value]",
"lang": "painless"
}
Is it possible to accomplish this by modifying the script?
Elasticsearch version: 7.1.0
I was able to read through the array values using the following script:
"script_score" : {
"script" : {
"params": {
"x1": 5,
"x2": 10
},
"source": "double sum = 0.0; for (item in params._source.marks) { sum += item.value * params[item.classtype]; } return sum;"
}
}
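For completeness, here is a sketch of the full request that fragment would sit in, wrapping it in a function_score query; the match_all query and boost_mode are illustrative additions so the final score is just the script's value:
POST /student/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost_mode": "replace",
      "script_score": {
        "script": {
          "lang": "painless",
          "params": { "x1": 5, "x2": 10 },
          "source": "double sum = 0.0; for (item in params._source.marks) { sum += item.value * params[item.classtype]; } return sum;"
        }
      }
    }
  }
}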

Elastic Search get top grouped sums with additional filters (Elasticsearch version 5.3)

This is my Mapping :
{
"settings" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
},
"mappings" :{
"cpt_logs_mapping" : {
"properties" : {
"channel_id" : {"type":"integer","store":"yes","index":"not_analyzed"},
"playing_date" : {"type":"string","store":"yes","index":"not_analyzed"},
"country_code" : {"type":"text","store":"yes","index":"analyzed"},
"playtime_in_sec" : {"type":"integer","store":"yes","index":"not_analyzed"},
"channel_name" : {"type":"text","store":"yes","index":"analyzed"},
"device_report_tag" : {"type":"text","store":"yes","index":"analyzed"}
}
}
}
}
I want to query the index similar to the way I do using the following MySQL query :
SELECT
channel_name,
SUM(`playtime_in_sec`) as playtime_in_sec
FROM
channel_play_times_bar_chart
WHERE
country_code = 'country' AND
device_report_tag = 'device' AND
channel_name = 'channel' AND
playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my Query DSL looks like this:
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "channel_id",
        "size": 30,
        "order": {
          "sum_agg": "desc"
        }
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "playtime_in_sec"
          }
        }
      }
    }
  }
}
QUESTION 1
Although the Query DSL I have written does return the top 30 channel_ids w.r.t. playtime, I am confused about how to also add the other filters within the search, i.e. country_code, device_report_tag & playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields, unlike the MySQL result set, which returns the channel_name and playtime_in_sec columns. In other words, I want to aggregate on the channel_id field, but the result set should also return the corresponding channel_name of each group.
NOTE: Performance here is a top priority, as this is supposed to run behind a graph generator querying millions of docs or more.
TEST DATA
hits: [
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 1453,
playtime_in_sec: 35,
device_report_tag: "mydev",
channel_report_tag: "Sony Six",
country_code: "SE",
#timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 145,
playtime_in_sec: 25,
device_report_tag: "mydev",
channel_report_tag: "Star Movies",
country_code: "US",
#timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 12,
playtime_in_sec: 15,
device_report_tag: "mydev",
channel_report_tag: "HBO",
country_code: "PK",
#timestamp: "2017-08-12",
}
}
]
QUESTION 1:
Are you looking to add a filter/query to the example above? If so you can simply add a "query" node to the query document:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "country_code": ["pk", "us", "se"] } },
        { "range": { "#timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
      ]
    }
  },
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30
      },
      "aggs": {
        "ch_report_tag_agg": {
          "terms": {
            "field": "channel_report_tag.keyword"
          },
          "aggs": {
            "sum_agg": {
              "sum": {
                "field": "playtime_in_sec"
              }
            }
          }
        }
      }
    }
  }
}
You can use all normal Elasticsearch queries/filters to pre-filter your search before you start aggregating. (Regarding performance, Elasticsearch will apply any filters/queries before starting to aggregate, so any filtering you can do here will help a lot.)
QUESTION 2:
Off the top of my head I would suggest one of two solutions (unless I'm completely misunderstanding the question):
Add aggs levels for the fields you want in the output, in the order you want to drill down. (You can nest aggs within aggs quite deeply without issues and get the bonus of a count on each level.)
Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] } (see the sketch below).
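A minimal sketch of option 2, reusing the field names from the test data; each channel_id bucket carries one top_hits hit just to surface the corresponding channel_report_tag alongside the summed playtime:
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": { "field": "ChID", "size": 30, "order": { "sum_agg": "desc" } },
      "aggs": {
        "channel_name_hit": {
          "top_hits": {
            "size": 1,
            "_source": { "includes": [ "channel_report_tag" ] }
          }
        },
        "sum_agg": {
          "sum": { "field": "playtime_in_sec" }
        }
      }
    }
  }
}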
Can you provide a few records of test data?
Also, it is useful to know which version of ElasticSearch you're running as the syntax and behaviour change a lot between major versions.
