I am in a situation where I have applied a limit to the Elasticsearch results, but it's not working for me. I have gone through the ES guide; below is my code:
module Invoices
  class RestaurantBuilder < Base
    def query(options = {})
      buckets = {}
      aggregations = {
        orders_count: { sum: { field: :orders_count } },
        orders_tip: { sum: { field: :orders_tip } },
        orders_tax: { sum: { field: :orders_tax } },
        monthly_fee: { sum: { field: :monthly_fee } },
        gateway_fee: { sum: { field: :gateway_fee } },
        service_fee: { sum: { field: :service_fee } },
        total_due: { sum: { field: :total_due } },
        total: { sum: { field: :total } }
      }
      buckets_for_restaurant_invoices buckets, aggregations, options[:restaurant_id]

      filters = []
      filters << time_filter(options)

      query = {
        query: { bool: { filter: filters } },
        aggregations: buckets,
        from: 0,
        size: 5
      }
      query
    end

    def buckets_for_restaurant_invoices(buckets, aggregations, restaurant_id)
      restaurant_ids(restaurant_id).each do |id|
        buckets[id] = {
          filter: { term: { restaurant_id: id } },
          aggregations: aggregations
        }
      end
    end

    def restaurant_ids(restaurant_id)
      if restaurant_id
        [restaurant_id]
      else
        ::Restaurant.all.pluck :id
      end
    end
  end
end
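For illustration, with just two hypothetical restaurant ids (1 and 2) the builder generates a request body roughly like this (time filter omitted, and only one of the eight sum aggregations shown per bucket):

{
  "query": { "bool": { "filter": [...] } },
  "from": 0,
  "size": 5,
  "aggregations": {
    "1": {
      "filter": { "term": { "restaurant_id": 1 } },
      "aggregations": { "orders_count": { "sum": { "field": "orders_count" } } }
    },
    "2": {
      "filter": { "term": { "restaurant_id": 2 } },
      "aggregations": { "orders_count": { "sum": { "field": "orders_count" } } }
    }
  }
}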
The restaurant_ids function returns approximately 5.5k restaurant ids, so the real request carries thousands of these buckets, and I get the error: "circuit_breaking_exception","reason":"[request] Data too large, data for [] would be [622777920/593.9mb], which is larger than the limit of [622775500/593.9mb]". That's why I want to apply some limit, so that I can fetch only a few hundred records at a time.
Could anyone guide me on where I am going wrong?
The way to limit the amount of data and avoid this error is to configure indices.breaker.request.limit. Note that the from and size in your query only cap the search hits that come back; they do nothing to shrink the aggregation tree, which is what trips the request circuit breaker here.
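indices.breaker.request.limit is a dynamic cluster setting, so, assuming raising the ceiling is acceptable for your cluster, something like this should work (70% is only an example value; the default is 60% of the JVM heap):

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "70%"
  }
}

Alternatively, batching the restaurant ids on the application side (a few hundred per request, merging the per-batch aggregation results afterwards) would keep each individual request under the breaker limit without touching cluster settings.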
Here's the logic I am trying to accomplish:
I am using Elasticsearch to display top-selling products and to randomly insert newly created products into the results, using the function_score query DSL.
The issue I am facing is that I use random_score for newly created products, and the query does insert new products up to page 2 or 3, but then all the other newly created products are pushed towards the end of the search results.
Here's the logic written for function_score:
function_score: {
query: query,
functions: [
{
filter: [
{ terms: { product_type: 'sponsored' } },
{ range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } }
],
random_score: {
seed: Time.current.to_i / (60 * 10), # new seed every 10 minutes
field: '_seq_no'
},
weight: 0.975
},
{
filter: { range: { live_at: { lt: 'CURRENT_DATE - 1.MONTH' } } },
linear: {
weighted_sales_rate: {
decay: 0.9,
origin: 0.5520974289580515,
scale: 0.5520974289580515
}
},
weight: 1
}
],
score_mode: 'sum',
boost_mode: 'replace'
}
And then I am sorting based on {"_score" => { "order" => "desc" } }
Let's say there are 100 sponsored products created in the last month. The above Elasticsearch query displays 8-10 random products (3 to 4 per page) as I scroll through the first 2 or 3 pages, but then the other 90-92 products are displayed in the last few pages of the result. This is because the score calculated by random_score for those 90-92 products is lower than the score calculated by the linear decay function.
Kindly suggest how I can modify this query so that I continue to see newly created products as I navigate through pages, instead of having them pushed towards the end of the results.
[UPDATE]
I tried adding a gauss decay function to this query (so that I can somehow modify the score of the products appearing towards the end of the results), like below:
{
filter: [
{ terms: { product_type: 'sponsored' } },
{ range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } },
{ range: { "_score" => { lt: 0.9 } } }
],
gauss: {
views_per_age_and_sales: {
origin: 1563.77,
scale: 1563.77,
decay: 0.95
}
},
weight: 0.95
}
But this too is not working, presumably because _score is not an indexed field and so cannot be used inside a range filter.
Links I have referred to:
https://intellipaat.com/community/12391/how-to-get-3-random-search-results-in-elasticserch-query
Query to get random n items from top 100 items in Elastic Search
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-function-score-query.html
I am not sure if this is the best solution, but I was able to accomplish this by wrapping the original query in a script_score query, plus adding a new indexed attribute called sort_by_views_per_year. Here's what the solution looks like:
Link I referred to: https://github.com/elastic/elasticsearch/issues/7783
attribute(:sort_by_views_per_year) do
object.live_age&.positive? ? object.views_per_year.to_f / object.live_age : 0.0
end
Then, while querying Elasticsearch:
def search
#...preparation of query...#
query = original_query(query)
query = rearrange_low_scoring_docs(query)
sort = apply_sort opts[:sort]
Product.search(query: query, sort: sort)
end
I have not changed anything in original_query (i.e. it still applies random_score to products created within the last month and the linear decay function to older ones).
def rearrange_low_scoring_docs query
{
function_score: {
query: query,
functions: [
{
script_score: {
script: "if (_score.doubleValue() < 0.9) {return 0.9;} else {return _score;}"
}
}
],
#score_mode: 'sum',
boost_mode: 'replace'
}
}
end
With every document that scored below 0.9 clamped to exactly 0.9, those products now tie on _score, so the secondary sort key decides their relative order. Then finally my sorting looks like this:
def apply_sort(_sort_option = nil)
[
{ '_score' => { 'order' => 'desc' } },
{ 'sort_by_views_per_year' => { 'order' => 'desc' } }
]
end
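For clarity, the final request body that this Ruby produces looks roughly like the following (a sketch; the inner original function_score query is abbreviated):

{
  "query": {
    "function_score": {
      "query": { ...the original function_score query... },
      "functions": [
        {
          "script_score": {
            "script": "if (_score.doubleValue() < 0.9) {return 0.9;} else {return _score;}"
          }
        }
      ],
      "boost_mode": "replace"
    }
  },
  "sort": [
    { "_score": { "order": "desc" } },
    { "sort_by_views_per_year": { "order": "desc" } }
  ]
}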
It would be very helpful if the Elasticsearch random_score query DSL started supporting something like max_doc_to_include and min_score attributes, so that I could use it like:
{
filter: [
{ terms: { product_type: 'sponsored' } },
{ range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } }
],
random_score: {
seed: 123456,
field: '_seq_no',
max_doc_to_include: 10,
min_score: 0.9
},
weight: 0.975
},
For a geo_distance query I'm using a constant value for distance. I need to make it dynamic: I want to pass each matched record's radius value as the distance.
Here's the code:
let searchRadius = '12KM'
query: {
bool: {
must: {
match: {
companyName: {
query: req.text
}
}
},
filter: {
geo_distance: {
distance: searchRadius, // here I want to pass doc['radius']
location: {
lat: parseFloat(req.lat),
lon: parseFloat(req.lon)
}
}
},
}
}
For each record I have a different radius value, and I want to use doc['radius'] instead of the constant searchRadius value.
I could run two queries and then iterate over the values, but that's not optimal. Can anyone suggest how I can pass each record's value to the geo_distance filter?
I have resolved it based on this answer.
Here's the code:
query: {
bool: {
must: [
{
match: {
companyName: {
query: req.text
}
}
},
{
script: {
script: {
params: {
lat: parseFloat(req.lat),
lon: parseFloat(req.lon)
},
source: "doc['location'].arcDistance(params.lat, params.lon) / 1000 < doc['searchRadius'].value",
lang: "painless"
}
}
}
]
}
},
This uses a script query; for more details:
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-script-query.html
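One caveat worth noting: script queries execute against every candidate document, so on a large index this can be slow. A common mitigation (a sketch, assuming you know an upper bound on any document's radius, here a hypothetical 100km) is to keep a coarse geo_distance filter as a cheap pre-filter and let the script do the exact per-record check:

query: {
  bool: {
    filter: {
      geo_distance: {
        distance: '100km', // hypothetical upper bound on doc['searchRadius']
        location: {
          lat: parseFloat(req.lat),
          lon: parseFloat(req.lon)
        }
      }
    },
    must: [
      // ...the companyName match and script query shown above...
    ]
  }
}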
So basically I have a collection that looks like this (other fields omitted):
[{
user: mail1#test.com
},
{
user: mail1#test.com
},
{
user: mail1#test.com
},
{
user: mail2#test.com
},
{
user: mail2#test.com
},
{
user: mail3#test.com
}
]
I'm looking for a way to query MongoDB in order to get the top 10 active users (those with the most records in the DB). Is there an easy way to get this, perhaps just using the interface?
Perhaps a simple group aggregation will give you the needed result?
db.Users.aggregate([
  {
    $group: {
      _id: "$user",
      count: { $sum: 1 }
    }
  },
  {
    $sort: { count: -1 }
  },
  {
    $limit: 10
  },
  {
    $project: {
      user: "$_id",
      count: 1,  // keep the count; $project drops fields that are not listed
      _id: 0
    }
  }
])
There is something called $sortByCount for aggregation.
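In the shell, the same top-10 can be expressed more compactly ($sortByCount is shorthand for the $group + $sort pair above):

db.Users.aggregate([
  { $sortByCount: "$user" },
  { $limit: 10 }
])

With Spring Data MongoDB it looks like this: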
List<UserCount> getTop10UserCount() {
    // newAggregation, sortByCount, limit and project are static imports
    // from org.springframework.data.mongodb.core.aggregation.Aggregation
    return mongoTemplate.aggregate(
        newAggregation(
            User.class,
            sortByCount("user"),
            limit(10),
            project("_id", "count")
        ),
        UserCount.class
    ).getMappedResults(); // aggregate() returns AggregationResults, not a List
}
static class UserCount {
String _id;
Integer count;
// constructors, getters or setters..
}
I have just started to learn about Elasticsearch and am facing a problem with group aggregation. I have a data set in Elasticsearch like:
[{
srcIP : "10.0.11.12",
dstIP : "19.67.78.91",
totalMB : "0.25"
},{
srcIP : "10.45.11.62",
dstIP : "19.67.78.91",
totalMB : "0.50"
},{
srcIP : "13.67.52.91",
dstIP : "10.0.11.12",
totalMB : "0.75"
},{
srcIP : "10.23.64.12",
dstIP : "10.45.11.62",
totalMB : "0.25"
}]
I just want to group the data by srcIP and sum the field totalMB, with one more thing: for each srcIP bucket, I also want to match that IP against the dstIP field and sum totalMB for those documents as well.
Output should be like this:
buckets : [{
    key : "10.0.11.12",
    total_MB_SrcIp : {
        value : "0.25"
    },
    total_MB_dstIP : {
        value : "0.75"
    }
},
{
    key : "10.45.11.62",
    total_MB_SrcIp : {
        value : "0.50"
    },
    total_MB_dstIP : {
        value : "0.25"
    }
}]
I have done a normal aggregation on one key, but couldn't get to the final query for my problem.
Query :
GET /index*/_search
{
  "size": 0,
  "aggs": {
    "group_by_srcIP": {
      "terms": {
        "field": "srcIP",
        "size": 100,
        "order": {
          "total_MB_SrcIp": "desc"
        }
      },
      "aggs": {
        "total_MB_SrcIp": {
          "sum": {
            "field": "totalMB"
          }
        }
      }
    }
  }
}
Hope you understand my problem on the basis of the sample output.
Thanks in advance.
As per my understanding, you need a sum aggregation on one field (totalMB) with respect to distinct values in two other fields (srcIP, dstIP).
AFAIK, Elasticsearch is not that good at aggregating over the values of multiple fields, unless you combine those fields at ingestion time or on the application side. (I may be wrong here, though.)
I gave it a try and got the required output using a scripted_metric aggregation. (Please read about it if you don't know what it is or how it works.)
I experimented with a Painless script to do the following in the aggregation:
pick srcIP, dstIP & totalMB from each doc
populate a cross-mapping like IP -> { (src : totalMB), (dst : totalMB) } in a map
return this map as the result of the aggregation
Here is the actual search query with aggregation:
GET /testIndex/testType/_search
{
"size": 0,
"aggs": {
"ip-addr": {
"scripted_metric": {
"init_script": "params._agg.addrs = []",
"map_script": "def lst = []; lst.add(doc.srcIP.value); lst.add(doc.dstIP.value); lst.add(doc.totalMB.value); params._agg.addrs.add(lst);",
"combine_script": "Map ipMap = new HashMap(); for(entry in params._agg.addrs) { def srcIp = entry.get(0); def dstIp = entry.get(1); def mbs = entry.get(2); if(ipMap.containsKey(srcIp)) {def srcMbSum = mbs + ipMap.get(srcIp).get('srcMB'); ipMap.get(srcIp).put('srcMB',srcMbSum); } else {Map types = new HashMap(); types.put('srcMB', mbs); types.put('dstMB', 0.0); ipMap.put(srcIp, types); } if(ipMap.containsKey(dstIp)) {def dstMbSum = mbs + ipMap.get(dstIp).get('dstMB'); ipMap.get(dstIp).put('dstMB',dstMbSum); } else {Map types = new HashMap(); types.put('srcMB', 0.0); types.put('dstMB', mbs); ipMap.put(dstIp, types); } } return ipMap;",
"reduce_script": "Map resultMap = new HashMap(); for(ipMap in params._aggs) {for(entry in ipMap.entrySet()) {def ip = entry.getKey(); def srcDestMap = entry.getValue(); if(resultMap.containsKey(ip)) {Map types = new HashMap(); types.put('srcMB', srcDestMap.get('srcMB') + resultMap.get(ip).get('srcMB')); types.put('dstMB', srcDestMap.get('dstMB') + resultMap.get(ip).get('dstMB')); resultMap.put(ip, types); } else {resultMap.put(ip, srcDestMap); } } } return resultMap;"
}
}
}
}
Here are experiment details:
Index mapping:
GET testIndex/_mapping
{
"testIndex": {
"mappings": {
"testType": {
"dynamic": "true",
"_all": {
"enabled": false
},
"properties": {
"dstIP": {
"type": "ip"
},
"srcIP": {
"type": "ip"
},
"totalMB": {
"type": "double"
}
}
}
}
}
}
Sample input:
POST testIndex/testType
{
"srcIP" : "10.0.11.12",
"dstIP" : "19.67.78.91",
"totalMB" : "0.25"
}
POST testIndex/testType
{
"srcIP" : "10.45.11.62",
"dstIP" : "19.67.78.91",
"totalMB" : "0.50"
}
POST testIndex/testType
{
"srcIP" : "13.67.52.91",
"dstIP" : "10.0.11.12",
"totalMB" : "0.75"
}
POST testIndex/testType
{
"srcIP" : "10.23.64.12",
"dstIP" : "10.45.11.62",
"totalMB" : "0.25"
}
Query output:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"ip-addr": {
"value": {
"13.67.52.91": {
"srcMB": 0.75,
"dstMB": 0
},
"10.23.64.12": {
"srcMB": 0.25,
"dstMB": 0
},
"10.45.11.62": {
"srcMB": 0.5,
"dstMB": 0.25
},
"19.67.78.91": {
"srcMB": 0,
"dstMB": 0.75
},
"10.0.11.12": {
"srcMB": 0.25,
"dstMB": 0.75
}
}
}
}
}
Here is the readable version of the query, for better understanding:
"scripted_metric": {
"init_script": "params._agg.addrs = []",
"map_script": """
def lst = [];
lst.add(doc.srcIP.value);
lst.add(doc.dstIP.value);
lst.add(doc.totalMB.value);
params._agg.addrs.add(lst);
""",
"combine_script": """
Map ipMap = new HashMap();
for(entry in params._agg.addrs) {
def srcIp = entry.get(0);
def dstIp = entry.get(1);
def mbs = entry.get(2);
if(ipMap.containsKey(srcIp)) {
def srcMbSum = mbs + ipMap.get(srcIp).get('srcMB');
ipMap.get(srcIp).put('srcMB',srcMbSum);
} else {
Map types = new HashMap();
types.put('srcMB', mbs);
types.put('dstMB', 0.0);
ipMap.put(srcIp, types);
}
if(ipMap.containsKey(dstIp)) {
def dstMbSum = mbs + ipMap.get(dstIp).get('dstMB');
ipMap.get(dstIp).put('dstMB',dstMbSum);
} else {
Map types = new HashMap();
types.put('srcMB', 0.0);
types.put('dstMB', mbs);
ipMap.put(dstIp, types);
}
}
return ipMap;
""",
"reduce_script": """
Map resultMap = new HashMap();
for(ipMap in params._aggs) {
for(entry in ipMap.entrySet()) {
def ip = entry.getKey();
def srcDestMap = entry.getValue();
if(resultMap.containsKey(ip)) {
Map types = new HashMap();
types.put('srcMB', srcDestMap.get('srcMB') + resultMap.get(ip).get('srcMB'));
types.put('dstMB', srcDestMap.get('dstMB') + resultMap.get(ip).get('dstMB'));
resultMap.put(ip, types);
} else {
resultMap.put(ip, srcDestMap);
}
}
}
return resultMap;
"""
}
However, prior to using this in depth, I would suggest testing it on some sample data to check that it works; scripted metric aggregations can have a considerable impact on query performance.
One more thing: to get the key strings you expect in the aggregation result, replace all occurrences of 'srcMB' & 'dstMB' in the scripts with 'total_MB_SrcIp' & 'total_MB_dstIP' as per your need.
Hope this helps you or someone else.
FYI, I tested this on ES v5.6.11.
I have this simple document set (other fields omitted):
{
id : 1,
book_ids : [2,3],
collection_ids : ['a','b']
},
{
id : 2,
book_ids : [1,2]
}
If I run this filter query, it will match both documents:
{
bool: {
filter: [
{
bool: {
should: [
{
bool: {
must_not: {
exists: {
field: 'book_ids'
}
}
}
},
{
bool: {
filter: {
term: {
book_ids: 2
}
}
}
}
]
}
},
{
bool: {
should: [
{
bool: {
must_not: {
exists: {
field: 'collection_ids'
}
}
}
},
{
bool: {
filter: {
term: {
collection_ids: 'a'
}
}
}
}
]
}
}
]
}
}
The thing is, I want to sort these documents, and I would like the first one (id: 1) to be returned first, because it matched both the book_ids value and the collection_ids value provided.
A simple sort clause like this one is not working:
[
'book_ids',
'collection_ids'
]
because it will return document 2 first, due to the first value of its book_ids array.
Edit: this is a simplified example of the problem I am facing, which has N such clauses in the should clause. Moreover, there is an order between the clauses, as I tried to reflect in the sort snippet: results matching the first clause (book_ids) should appear before results matching the second clause (collection_ids). I am really looking for some kind of SQL-style sort operation where I would only take into account the matching value of the field array. A viable option might be to assign decreasing constant_scores to each term clause, according to the expected sort order, so that ES would sum these sub-scores to compute the final score. But I cannot figure out how to do it, or whether it is even possible.
Bonus question:
is there any way for Elasticsearch to return some kind of new document with only the matching values? Here is what I would expect as a response to the above filter query:
{
id : 1,
book_ids : [2],
collection_ids : ['a']
},
{
id : 2,
book_ids : [2]
}
I think you're right about the constant score idea. I think you can do it like this:
{
query: {
bool: {
must: [
{
bool: {
should: [
{
bool: {
must_not: {
exists: {
field: 'book_ids'
}
}
}
},
{
constant_score: {
filter: {
term: {
book_ids: 2
}
},
boost: 100
}
}
]
}
},
{
bool: {
should: [
{
bool: {
must_not: {
exists: {
field: 'collection_ids'
}
}
}
},
{
constant_score: {
filter: {
term: {
collection_ids: 'a'
}
},
boost: 50
}
}
]
}
}
]
}
}
}
I think the only thing you were missing with constant_score was that the top-level query needs to be must, not filter. (There's no scoring for filters; all the scores are 0.)
An alternative would be to put the filter inside a function_score query (but leave it as a filter), and then compute the score as you want (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html)
As to the bonus question, it's possible if you use script fields to filter and add a new field like you want (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html), but it's not possible in a straightforward way. It's probably easier, and makes more sense, to do that filtering after you receive the results, unless you have very long lists in your values.
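For completeness, a rough sketch of the script-fields idea (hypothetical: params.target carries the book id you filtered on, and the same pattern would be repeated for collection_ids):

{
  query: { ...the filter query from above... },
  script_fields: {
    matching_book_ids: {
      script: {
        lang: 'painless',
        source: "def matches = []; for (id in doc['book_ids']) { if (id == params.target) { matches.add(id); } } return matches;",
        params: { target: 2 }
      }
    }
  }
}

Note that script fields come back in each hit's fields section rather than in _source, so the client still has to stitch the response into the shape shown in the question.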