ElasticSearch comparative range results

Hi, I would like to index objects that look like this:
{
  uuid: "123",
  clauses: [{ order: 1, uuid: "345" }, { order: 2, uuid: "567" }, { order: 3, uuid: "789" }]
}
Is there a way to write a query that matches all objects containing clauses with uuid: "345" and uuid: "789", but only when the order of the second is at most two greater than the order of the first?
So the above example would match, but the next one wouldn't:
{
  uuid: "999",
  clauses: [{ order: 1, uuid: "345" }, { order: 2, uuid: "567" }, { order: 3, uuid: "777" }, { order: 4, uuid: "789" }]
}
The reason is that the "789" clause has order 4, which is more than 2 greater than the "345" clause's order of 1.
Any help is appreciated!
Thanks,
Michail

One way to achieve this involves using a script filter.
The script I'm using is the following:
def idxs = [];
for (int i = 0; i < doc['clauses.uuid'].values.size(); i++) {
  if (matches.contains(doc['clauses.uuid'].values[i])) {
    idxs << i
  }
};
def orders = idxs.collect { doc['clauses.order'].values[it] };
return orders[1] - orders[0] <= 2
Basically, what I'm doing is first collecting all indices of the clauses whose uuid is in the matches array (i.e. 345 and 789).
Then, with the indices I got, I gather all order values at those positions. Finally, I check that the second order minus the first order is not bigger than 2.
POST your_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "clauses.uuid": "345"
          }
        },
        {
          "term": {
            "clauses.uuid": "789"
          }
        },
        {
          "script": {
            "script": "def idxs = []; for (int i = 0; i < doc['clauses.uuid'].values.size(); i++) {if (matches.contains(doc['clauses.uuid'].values[i])){idxs << i}}; def orders = idxs.collect{doc['clauses.order'].values[it]}; return orders[1] - orders[0] <= 2",
            "params": {
              "matches": [
                "345",
                "789"
              ]
            }
          }
        }
      ]
    }
  }
}
That will return only the first document and not the second.
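Note: the script above is written in the legacy Groovy scripting language. On newer Elasticsearch versions where Painless is the default, a roughly equivalent (untested) script filter might look like the sketch below. Like the original, it assumes the doc values of clauses.uuid and clauses.order stay positionally aligned, which Elasticsearch does not strictly guarantee for multi-valued fields, since doc values come back sorted:
{
  "script": {
    "script": {
      "source": "def idxs = []; for (int i = 0; i < doc['clauses.uuid'].size(); i++) { if (params.matches.contains(doc['clauses.uuid'][i])) { idxs.add(i); } } return idxs.size() == 2 && doc['clauses.order'][idxs[1]] - doc['clauses.order'][idxs[0]] <= 2;",
      "params": {
        "matches": ["345", "789"]
      }
    }
  }
}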

Related

Elasticsearch random_score pushes documents towards the end of results

Here's the logic I am trying to accomplish:
I am using Elasticsearch to display top-selling products and to randomly insert newly created products into the results, using the function_score query DSL.
The issue I am facing is that I am using the random_score function for newly created products, and the query does insert new products up to page 2 or 3, but then all the remaining newly created products are pushed towards the end of the search results.
Here's the logic written for function_score:
function_score: {
  query: query,
  functions: [
    {
      filter: [
        { terms: { product_type: 'sponsored' } },
        { range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } }
      ],
      random_score: {
        seed: Time.current.to_i / (60 * 10), # new seed every 10 minutes
        field: '_seq_no'
      },
      weight: 0.975
    },
    {
      filter: { range: { live_at: { lt: 'CURRENT_DATE - 1.MONTH' } } },
      linear: {
        weighted_sales_rate: {
          decay: 0.9,
          origin: 0.5520974289580515,
          scale: 0.5520974289580515
        }
      },
      weight: 1
    }
  ],
  score_mode: 'sum',
  boost_mode: 'replace'
}
And then I am sorting based on {"_score" => { "order" => "desc" } }
Let's say 100 sponsored products were created in the last month. The above Elasticsearch query displays 8-10 random products (3 to 4 per page) as I scroll through 2 or 3 pages, but the other 90-92 products are displayed in the last few pages of the result. This is because the score calculated by random_score for those 90-92 products is lower than the score calculated by the linear decay function.
Kindly suggest how I can modify this query so that I continue to see newly created products as I navigate through pages, instead of having them pushed towards the end of the results.
[UPDATE]
I tried adding a gauss decay function to this query (so that I can somehow modify the score of the products appearing towards the end of the results), like below:
{
  filter: [
    { terms: { product_type: 'sponsored' } },
    { range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } },
    { range: { _score: { lt: 0.9 } } }
  ],
  gauss: {
    views_per_age_and_sales: {
      origin: 1563.77,
      scale: 1563.77,
      decay: 0.95
    }
  },
  weight: 0.95
}
But this too is not working (which makes sense: a filter inside function_score is an ordinary query run against the index, and _score is not an indexed field, so a range filter on it cannot match anything).
Links I have referred to:
https://intellipaat.com/community/12391/how-to-get-3-random-search-results-in-elasticserch-query
Query to get random n items from top 100 items in Elastic Search
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-function-score-query.html
I am not sure if this is the best solution, but I was able to accomplish this by wrapping the original query in a script_score query, plus adding a new indexed attribute called sort_by_views_per_year. Here's how the solution looks:
Link I referred to: https://github.com/elastic/elasticsearch/issues/7783
attribute(:sort_by_views_per_year) do
  object.live_age&.positive? ? object.views_per_year.to_f / object.live_age : 0.0
end
Then while querying ElasticSearch:
def search
  # ...preparation of query...
  query = original_query(query)
  query = rearrange_low_scoring_docs(query)
  sort = apply_sort
  Product.search(query: query, sort: sort)
end
I have not changed anything in original_query (i.e. it still applies random_score to products created within the last month, and the linear decay function to older ones).
def rearrange_low_scoring_docs(query)
  {
    function_score: {
      query: query,
      functions: [
        {
          script_score: {
            script: "if (_score.doubleValue() < 0.9) {return 0.9;} else {return _score;}"
          }
        }
      ],
      # score_mode: 'sum',
      boost_mode: 'replace'
    }
  }
end
Then finally my sorting looks like this:
def apply_sort
  [
    { '_score' => { 'order' => 'desc' } },
    { 'sort_by_views_per_year' => { 'order' => 'desc' } }
  ]
end
It would be very helpful if Elasticsearch's random_score query DSL supported attributes like max_doc_to_include and min_score, so that I could use it like:
{
  filter: [
    { terms: { product_type: 'sponsored' } },
    { range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } }
  ],
  random_score: {
    seed: 123456,
    field: '_seq_no',
    max_doc_to_include: 10,
    min_score: 0.9
  },
  weight: 0.975
}

Elasticsearch merging documents in response

I have data in 3 indices and want to generate an invoice report using information from the other indices. For example, the following are sample documents from each index.
Users index
{
  "_id": "userId1",
  "name": "John"
}
Invoice index
{
  "_id": "invoiceId1",
  "userId": "userId1",
  "cost": "10000",
  "startdate": "",
  "enddate": ""
}
Orders index
{
  "_id": "orderId1",
  "userId": "userId1",
  "productName": "Mobile"
}
I want to generate an invoice report combining information from these three indices, as follows:
{
  "_id": "invoiceId1",
  "userName": "John",
  "productName": "Mobile",
  "cost": "10000",
  "startdate": "",
  "enddate": ""
}
How can I write an Elasticsearch query that returns a response combining information from documents in other indices?
You cannot do query-time joins in Elasticsearch and will need to denormalize your data in order to efficiently retrieve and group it.
Having said that, you could:
leverage the multi-target syntax and query multiple indices at once
use an OR query on the id and userId -- since either of those is referenced at least once in any of your docs
and then trivially join your data through a map/reduce tool called scripted metric aggregations
Quick side note: you won't be able to use the _id keyword inside your docs because it's reserved.
Assuming your docs and indices are structured as follows:
POST users_index/_doc
{"id":"userId1","name":"John"}
POST invoices_index/_doc
{"id":"invoiceId1","userId":"userId1","cost":"10000","startdate":"","enddate":""}
POST orders_index/_doc
{"id":"orderId1","userId":"userId1","productName":"Mobile"}
Here's what the scripted metric aggregation could look like:
POST users_index,invoices_index,orders_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id.keyword": {
              "value": "userId1"
            }
          }
        },
        {
          "term": {
            "userId.keyword": {
              "value": "userId1"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "group_by_invoiceId": {
      "scripted_metric": {
        "init_script": "state.users = []; state.invoices = []; state.orders = []",
        "map_script": """
          def source = params._source;
          if (source.containsKey("name")) {
            // we're dealing with the users index
            state.users.add(source);
          } else if (source.containsKey("cost")) {
            // we're dealing with the invoices index
            state.invoices.add(source);
          } else if (source.containsKey("productName")) {
            // we're dealing with the orders index
            state.orders.add(source);
          }
        """,
        "combine_script": """
          def non_empty_state = [:];
          for (entry in state.entrySet()) {
            if (entry != null && entry.getValue().length > 0) {
              non_empty_state[entry.getKey()] = entry.getValue();
            }
          }
          return non_empty_state;
        """,
        "reduce_script": """
          def final_invoices = [];
          def all_users = [];
          def all_invoices = [];
          def all_orders = [];
          // flatten all resources
          for (state in states) {
            for (kind_entry in state.entrySet()) {
              def map_kind = kind_entry.getKey();
              if (map_kind == "users") {
                all_users.addAll(kind_entry.getValue());
              } else if (map_kind == "invoices") {
                all_invoices.addAll(kind_entry.getValue());
              } else if (map_kind == "orders") {
                all_orders.addAll(kind_entry.getValue());
              }
            }
          }
          // iterate the invoices and enrich them
          for (invoice_entry in all_invoices) {
            def invoiceId = invoice_entry.id;
            def userId = invoice_entry.userId;
            def userName = all_users.stream().filter(u -> u.id == userId).findFirst().get().name;
            def productName = all_orders.stream().filter(o -> o.userId == userId).findFirst().get().productName;
            def cost = invoice_entry.cost;
            def startdate = invoice_entry.startdate;
            def enddate = invoice_entry.enddate;
            final_invoices.add([
              'id': invoiceId,
              'userName': userName,
              'productName': productName,
              'cost': cost,
              'startdate': startdate,
              'enddate': enddate
            ]);
          }
          return final_invoices;
        """
      }
    }
  }
}
which would return:
{
  ...
  "aggregations" : {
    "group_by_invoiceId" : {
      "value" : [
        {
          "cost" : "10000",
          "enddate" : "",
          "id" : "invoiceId1",
          "userName" : "John",
          "startdate" : "",
          "productName" : "Mobile"
        }
      ]
    }
  }
}
Summing up, there are workarounds to achieve query-time joins. At the same time, scripts like this shouldn't be used in production because they can be extremely slow.
Instead, this aggregation should be emulated outside of Elasticsearch, after the query resolves and returns the index-specific hits.
BTW, I set size: 0 to return just the aggregation results, so increase this parameter if you want to get some actual hits.
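To illustrate the suggestion to join outside of Elasticsearch, here is a minimal client-side sketch in JavaScript (the function name is illustrative; it assumes each index's hits were already fetched into plain arrays):
// Join the per-index hits in application code instead of inside Elasticsearch.
function buildInvoiceReport(users, invoices, orders) {
  const usersById = new Map(users.map(u => [u.id, u]));
  const ordersByUserId = new Map(orders.map(o => [o.userId, o]));
  return invoices.map(invoice => ({
    id: invoice.id,
    userName: (usersById.get(invoice.userId) || {}).name,
    productName: (ordersByUserId.get(invoice.userId) || {}).productName,
    cost: invoice.cost,
    startdate: invoice.startdate,
    enddate: invoice.enddate
  }));
}
// Usage with the sample documents from above:
const report = buildInvoiceReport(
  [{ id: "userId1", name: "John" }],
  [{ id: "invoiceId1", userId: "userId1", cost: "10000", startdate: "", enddate: "" }],
  [{ id: "orderId1", userId: "userId1", productName: "Mobile" }]
);
console.log(report); // [{ id: "invoiceId1", userName: "John", productName: "Mobile", cost: "10000", startdate: "", enddate: "" }]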

adding a function to loop through a nested collection

I need to search recursively through a big nested JSON collection that has unique keys. The collection contains key-value pairs or nested arrays which contain keys. Keys can be anywhere in the object, and their values can be numbers or strings.
Please note: key values are unique if they are not in an array. If they are in an array, the key repeats per item in the array. For example:
"WebData": {
WA1: 3, //not in array so unique
WA3: 2, so unique
WA3: "NEO",
WebGroup : [
{ Web1: 1, //duplicate Web1
Web2: 2
},
{ Web1: 2, //duplicate Web2
Web2: 2
}]
}
What I want:
I will pass an array of keys, in different variations. For example:
Not in arrays: I pass keys and get back either their values or their sum:
function(["WA1", "WA3", "RAE1"], "notsum")
If I pass
["WA1", "WA3", "RAE1"]
and the operation is not "sum", it should return an array of their values from the collection:
[3, 2, 1]
If I pass the same keys but the operation is "sum":
function(["WA1", "WA3", "RAE1"], "sum")
it should return the sum from the collection:
return 6
If in an array: if the values to search for are in an array (meaning they repeat), it should again return either the sum or the individual values. For example, for
["WEB1", "Web2"]
it could return either
[7, 1] // totals of 3+4 and 0+1, see the full example below
or
[[3, 4], [0, 1]] // because the values repeat inside an array, just collect them
I need to do this in an elegant way.
Full JSON example:
{
  version: "1.0",
  submission: "editing",
  "WebData": {
    WA1: 3,
    WA3: 2,
    WAX: "NEO",
    WebGroup: [
      { Web1: 3,
        Web2: 0
      },
      { Web1: 4,
        Web2: 1
      }
    ]
  },
  "NonWebData": {
    NWA1: 3,
    NWA2: "INP",
    NWA3: 2
  },
  "FormInputs": {
    FM11: 3,
    FM12: 1,
    FM13: 2
  },
  "RawData": {
    "RawOverview": {
      "RAE1": 1,
      "RAE2": 1
    },
    "RawGroups": [
      {
        "name": "A1",
        "id": "1",
        "data": {
          "AD1": "period",
          "AD2": 2,
          "AD3": 2,
          "transfers": [
            { "type": "in", "TT1": 1, "TT2": 2 },
            { "type": "out", "TT1": 1, "TT2": 2 }
          ]
        }
      },
      {
        "name": "A2",
        "id": "2",
        "data": {
          "AD1": "period",
          "AD2": 2,
          "AD3": 2,
          "transfers": [
            { "type": "in", "TT1": 1, "TT2": 2 },
            { "type": "out", "TT1": 1, "TT2": 2 }
          ]
        }
      }
    ]
  },
  "Other": {
    O1: 1,
    O2: 2,
    O3: "hello"
  },
  "AddedBy": "name",
  "AddedDate": "11/02/2019"
}
I have not been able to write a function that does this; my code simply searches this structure by looping through it, which I am sure is not the correct way.
My code is not elegant, and I am using somewhat repetitive functions. This is just one snippet, which finds keys at one level. I want only 1 or 2 functions to do all of this:
function Search(paramKey, formDataArray) {
  var varParams = [];
  // walk the first level of the collection
  for (var key in formDataArray) {
    if (formDataArray.hasOwnProperty(key)) {
      var val = formDataArray[key];
      // check the keys one level down
      for (var ikey in val) {
        if (val.hasOwnProperty(ikey)) {
          if (ikey == paramKey)
            varParams.push(val[ikey]);
        }
      }
    }
  }
  return varParams;
}
One more test case if in an array: return only a single array of values, without adding. (Update: I achieved this by adding the following part to the code.)
notsumsingle: function (target, key, value) {
  if (target[key] === undefined) {
    target[key] = value;
    return;
  }
  target.push(value);
},
"groupData": [
{
"A1G1": 1,
"A1G2": 22,
"AIG3": 4,
"AIG4": "Rob"
},
{
"A1G1": 1,
"A1G2": 41,
"AIG3": 3,
"AIG4": "John"
},
{
"A1G1": 1,
"A1G2": 3,
"AIG3": 1,
"AIG4": "Andy"
}
],
perform(["AIG2",""AIG4"], "notsum")
It is returning me
[
[
22,
41,
3
]
],
[
[
"",
"Ron",
"Andy"
]
]
Instead, can I add one more variation, "SingleArray", alongside "sum" and "notsum", and get the result as a single array?
[22, 41, 3]
["", "Ron", "Andy"]
4th: I asked whether the function could be intelligent enough to pick up the sum of arrays or the sum of individual fields automatically. For example, in your example you have used "sum" and "total" to identify that:
console.log(perform(["WA1", "WA3", "RAE1"], "total")); // 6
console.log(perform(["Web1", "Web2"], "sum")); // [7, 1]
Could the function just use "sum" and return a single value or an array depending on whether it finds an array, i.e. return [7, 1] in the array case and 6 otherwise?
5th: I found an issue in the code when the JSON collection contains the following:
perform(["RAE1"], "notsum") // [[1, 1]]
perform(["RAE1"], "sum") // 2
It returns [1, 1] or 2, although there is only one RAE1 defined. Please note it is not inside an array, so it should not be wrapped into a nested [[]] array; it should just be the plain key value.
"RawData": {
  "RawOverview": {
    "RAE1": 1,
    "RAE2": 1
  }
}
For making it easier, and to keep the same interface for getting sums, individual values, and a grand total without any array, you could introduce another operation string, total, for getting the sum of all keys.
This approach uses an object that maps each given key to an index in the result, plus a function that either adds a value to an array at that index or stores the value at that index directly, matching the keys array given to the function.
For iterating the object, you take the key/value pairs and recurse until no more objects are found.
As a result, you get an array, or the total sum of all items.
BTW, the keys of an object are case sensitive; for example, 'WEB1' does not match 'Web1'.
function perform(keys, operation) {
  function visit(object) {
    Object
      .entries(object)
      .forEach(([k, v]) => {
        // a requested key: store/accumulate its value at its reserved index
        if (k in indices) return fn(result, indices[k], v);
        // otherwise recurse into nested objects and arrays
        if (v && typeof v === 'object') visit(v);
      });
  }

  var result = [],
    // map each requested key to its position in the result array
    indices = Object.assign({}, ...keys.map((k, i) => ({ [k]: i }))),
    fn = {
      // collect values; switch to an array when a key repeats
      notsum: function (target, key, value) {
        if (target[key] === undefined) {
          target[key] = value;
          return;
        }
        if (!Array.isArray(target[key])) {
          target[key] = [target[key]];
        }
        target[key].push(value);
      },
      // accumulate a numeric sum per key
      sum: function (target, key, value) {
        target[key] = (target[key] || 0) + value;
      }
    }[operation === 'total' ? 'sum' : operation];

  visit(data);
  return operation === 'total'
    ? result.reduce((a, b) => a + b)
    : result;
}
var data = { version: "1.0", submission: "editing", WebData: { WA1: 3, WA3: 2, WAX: "NEO", WebGroup: [{ Web1: 3, Web2: 0 }, { Web1: 4, Web2: 1 }] }, NonWebData: { NWA1: 3, NWA2: "INP", NWA3: 2 }, FormInputs: { FM11: 3, FM12: 1, FM13: 2 }, RawData: { RawOverview: { RAE1: 1, RAE2: 1 }, RawGroups: [{ name: "A1", id: "1", data: { AD1: 'period', AD2: 2, AD3: 2, transfers: [{ type: "in", TT1: 1, TT2: 2 }, { type: "out", TT1: 1, TT2: 2 }] } }, { name: "A2", id: "2", data: { AD1: 'period', AD2: 2, AD3: 2, transfers: [{ type: "in", TT1: 1, TT2: 2 }, { type: "out", TT1: 1, TT2: 2 }] } }] }, Other: { O1: 1, O2: 2, O3: "hello" }, AddedBy: "name", AddedDate: "11/02/2019" };
console.log(perform(["WA1", "WA3", "RAE1"], "notsum")); // [3, 2, 1]
console.log(perform(["WA1", "WA3", "RAE1"], "total")); // 6
console.log(perform(["Web1", "Web2"], "sum")); // [7, 1]
console.log(perform(["Web1", "Web2"], "notsum")); // [[3, 4], [0, 1]]

ElasticSearch apply limit on bucket results

I am in a situation where I have applied a limit to the Elasticsearch results, but it's not working for me. I have gone through the ES guide; below is my code:
module Invoices
  class RestaurantBuilder < Base
    def query(options = {})
      buckets = {}
      aggregations = {
        orders_count: { sum: { field: :orders_count } },
        orders_tip: { sum: { field: :orders_tip } },
        orders_tax: { sum: { field: :orders_tax } },
        monthly_fee: { sum: { field: :monthly_fee } },
        gateway_fee: { sum: { field: :gateway_fee } },
        service_fee: { sum: { field: :service_fee } },
        total_due: { sum: { field: :total_due } },
        total: { sum: { field: :total } }
      }
      buckets_for_restaurant_invoices buckets, aggregations, options[:restaurant_id]
      filters = []
      filters << time_filter(options)
      query = {
        query: { bool: { filter: filters } },
        aggregations: buckets,
        from: 0,
        size: 5
      }
      query
    end

    def buckets_for_restaurant_invoices(buckets, aggregations, restaurant_id)
      restaurant_ids(restaurant_id).each do |id|
        buckets[id] = {
          filter: { term: { restaurant_id: id } },
          aggregations: aggregations
        }
      end
    end

    def restaurant_ids(restaurant_id)
      if restaurant_id
        [restaurant_id]
      else
        ::Restaurant.all.pluck :id
      end
    end
  end
end
The restaurant_ids function returns approx 5.5k restaurants, so in this case I get the error: "circuit_breaking_exception","reason":"[request] Data too large, data for [] would be [622777920/593.9mb], which is larger than the limit of [622775500/593.9mb]". That's why I want to apply some limit, so that I only get a few hundred records at a time.
Could anyone tell me where I am going wrong?
The way to limit the amount of data to avoid this error is to configure the indices.breaker.request.limit.
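For reference, this circuit breaker setting is dynamic and can be updated through the cluster settings API; a sketch (the 70% value is purely illustrative, the default being 60% of the JVM heap):
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "70%"
  }
}
Keep in mind that raising the limit only buys headroom; with ~5.5k filter buckets in a single request, it may be safer to batch restaurant_ids into smaller chunks across several queries.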

Elasticsearch custom sorting / adding filter clauses scores

I have this simple document set:
{
  id: 1,
  book_ids: [2, 3],
  collection_ids: ['a', 'b']
},
{
  id: 2,
  book_ids: [1, 2]
}
If I run this filter query, it will match both documents:
{
  bool: {
    filter: [
      {
        bool: {
          should: [
            {
              bool: {
                must_not: {
                  exists: {
                    field: 'book_ids'
                  }
                }
              }
            },
            {
              bool: {
                filter: {
                  term: {
                    book_ids: 2
                  }
                }
              }
            }
          ]
        }
      },
      {
        bool: {
          should: [
            {
              bool: {
                must_not: {
                  exists: {
                    field: 'collection_ids'
                  }
                }
              }
            },
            {
              bool: {
                filter: {
                  term: {
                    collection_ids: 'a'
                  }
                }
              }
            }
          ]
        }
      }
    ]
  }
}
The thing is, I want to sort these documents, and I would like the first one (id: 1) to be returned first, because it matched both the book_ids value and the collection_ids value provided.
A simple sort clause like this one is not working:
['book_ids', 'collection_ids']
because it will return document 2 first, due to the first value of its book_ids array.
Edit: this is a simplified example of the problem I am facing, which has N such clauses in the should clause. Moreover, there is an order between the clauses, as I tried to reflect in the sort snippet: results matching the first clause (book_ids) should appear before results matching the second clause (collection_ids). I am really looking for some kind of SQL sort operation where only the matching value of the field array is taken into account. A viable option might be to assign decreasing constant_scores to each term clause, according to the expected sort order, and have ES sum these sub-scores to compute the final score. But I cannot figure out how to do it, or whether it is even possible.
Bonus question:
is there any way for ElasticSearch to return some kind of new document with only the matching values? Here is what I would expect as a response to the above filter query:
{
  id: 1,
  book_ids: [2],
  collection_ids: ['a']
},
{
  id: 2,
  book_ids: [2]
}
I think you're right about the constant score idea. I think you can do it like this:
{
  query: {
    bool: {
      must: [
        {
          bool: {
            should: [
              {
                bool: {
                  must_not: {
                    exists: {
                      field: 'book_ids'
                    }
                  }
                }
              },
              {
                constant_score: {
                  filter: {
                    term: {
                      book_ids: 2
                    }
                  },
                  boost: 100
                }
              }
            ]
          }
        },
        {
          bool: {
            should: [
              {
                bool: {
                  must_not: {
                    exists: {
                      field: 'collection_ids'
                    }
                  }
                }
              },
              {
                constant_score: {
                  filter: {
                    term: {
                      collection_ids: 'a'
                    }
                  },
                  boost: 50
                }
              }
            ]
          }
        }
      ]
    }
  }
}
I think the only thing you were missing with constant_score was that the top-level query needs to be must, not filter. (There's no scoring for filters; all the scores are 0.)
An alternative would be to put the filter inside a function_score query (but leave it as a filter), and then compute the score as you want (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html)
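A sketch of that alternative (the original bool stays in the filter context so it contributes no score, and the function weights, which are illustrative here, produce the ordering; the filter array is the same one from the question, elided for brevity):
{
  query: {
    function_score: {
      query: { bool: { filter: [ /* the two should clauses from the question */ ] } },
      functions: [
        {
          filter: { term: { book_ids: 2 } },
          weight: 100
        },
        {
          filter: { term: { collection_ids: 'a' } },
          weight: 50
        }
      ],
      score_mode: 'sum',
      boost_mode: 'replace'
    }
  }
}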
As to the bonus question, it's possible if you use a script field to filter out and emit the matching values as a new field (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html), but it's not possible in a straightforward way. It's probably easier, and makes more sense, to do that filtering after you receive the result, unless you have very long lists in your values.
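For example, a script field along these lines could emit only the matching values (a hedged sketch in Painless; the (int) cast works around numeric doc values coming back as longs, and the field name is illustrative):
{
  "query": { "match_all": {} },
  "script_fields": {
    "matching_book_ids": {
      "script": {
        "source": "def out = []; for (v in doc['book_ids']) { if (params.wanted.contains((int) v)) { out.add(v); } } return out;",
        "params": { "wanted": [2] }
      }
    }
  }
}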
