ElasticSearch Aggregations: subtracting aggregations based upon match - elasticsearch

Using a simple albeit somewhat artificial example, let's say that I have several inventory docs stored in ElasticSearch where every document represents either the purchase or the sale of an item:
[
{item_id: "foobar", type: "cost", value: 12.34, timestamp:149382734621},
{item_id: "bizbaz", type: "sale", value: 45.12, timestamp:149383464621},
{item_id: "foobar", type: "sale", value: 32.74, timestamp:149384824621},
{item_id: "foobar", type: "cost", value: 12.34, timestamp:149387435621},
{item_id: "bizbaz", type: "sale", value: 45.12, timestamp:149388434621},
{item_id: "bizbaz", type: "cost", value: 41.23, timestamp:149389424621},
{item_id: "foobar", type: "sale", value: 32.74, timestamp:149389914621},
{item_id: "waahoo", type: "sale", value: 11.23, timestamp:149389914621},
...
]
And for a specified time range I want to calculate the current profit for each item. So for example I would want to return:
foobar_profit = sum(value of all documents item_id="foobar" and type="sale")
-sum(value of all documents item_id="foobar" and type="cost")
bizbaz_profit = sum(value of all documents item_id="bizbaz" and type="sale")
-sum(value of all documents item_id="bizbaz" and type="cost")
...
There are two aspects that I don't yet understand how to achieve.
1. I know how to aggregate over terms, which would let me sum the value of all "foobar" items regardless of type. But I don't know how to sum over all documents that match on two fields. For instance, I want to aggregate the above data set on the compound key (item_id, type). The data set above would then yield the aggregations:
(foobar,cost)->24.68
(foobar,sale)->65.48
(bizbaz,cost)->41.23
(bizbaz,sale)->90.24
(waahoo,sale)->11.23
2. Presuming I can do #1, I will have aggregations like foobar_cost and foobar_sale. But I don't know how to combine two aggregations so that in this case foobar_profit = foobar_sale - foobar_cost. So the above aggregations would become
foobar_profit->40.8
bizbaz_profit->49.01
waahoo_profit->11.23
Some final notes:
In the example above, I only list 3 item_ids, but consider that there will be thousands of item_ids, so I can't do special-case queries per item_id.
Also, for a particular item, the cost and sale items will come in at different times, so we can't put the cost and sale price in the same document and diff the fields.
I could send back all the data and do the last step of the aggregation client side, but this might be a ton of data. I really need to do it on the server side if possible, so that I can sort the results by profit and return the top N.

You can do this by nesting aggregations inside one another (sub-aggregations). See here for a working example: https://gist.github.com/mattweber/71033b1bf2ebed1afd8e
I use a MatchAll Query in this example but you can replace that with a RangeQuery or whatever you need.
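For readers who cannot open the gist, one possible shape of such a request is sketched below in Python as a plain dict. This is not the gist's content. It assumes that item_id and type are indexed as exact-value (keyword) fields, that timestamp holds epoch milliseconds, and that the cluster supports the bucket_script and bucket_sort pipeline aggregations. The function name build_profit_query and the aggregation names are made up for illustration.

```python
import json


def build_profit_query(start_ms, end_ms, top_n=10, max_items=10000):
    """Build a search body that computes per-item profit (sales minus costs).

    Hypothetical helper: the aggregation names ("by_item", "sales", "costs",
    "profit", "top_profit") are illustrative, not taken from the linked gist.
    """
    def summed(doc_type):
        # Filter sub-aggregation: keep one document type, then sum its value.
        return {
            "filter": {"term": {"type": doc_type}},
            "aggs": {"total": {"sum": {"field": "value"}}},
        }

    return {
        "size": 0,  # only aggregation results are needed, no hits
        "query": {"range": {"timestamp": {"gte": start_ms, "lte": end_ms}}},
        "aggs": {
            "by_item": {
                # One bucket per item_id (assumed to be a keyword field).
                "terms": {"field": "item_id", "size": max_items},
                "aggs": {
                    "sales": summed("sale"),
                    "costs": summed("cost"),
                    # Pipeline aggregation: subtract the two sums per bucket.
                    "profit": {
                        "bucket_script": {
                            "buckets_path": {
                                "sales": "sales>total",
                                "costs": "costs>total",
                            },
                            "script": "params.sales - params.costs",
                        }
                    },
                    # Sort the item buckets by profit and keep the top N.
                    "top_profit": {
                        "bucket_sort": {
                            "sort": [{"profit": {"order": "desc"}}],
                            "size": top_n,
                        }
                    },
                },
            }
        },
    }


body = build_profit_query(149382734621, 149389914621, top_n=5)
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a search request. The sort only ranks the buckets that the terms aggregation collected, so max_items would need to cover the number of distinct item_ids for the top N to be exact.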

Related

Is there a way to create a runtime field in Elasticsearch that is equal to a 'Value'/'Sum of Value across index'?

I have a task to show the percent of value a set of filtered documents represents vs the entire value represented across a whole year. For example:
[{
name: 'Foo',
value: 12,
year: 2021
},
{
name: 'Bar',
value: 2,
year: 2021
},
{
name: 'Car',
value: 10,
year: 2021
},
{
name: 'Lar',
value: 4,
year: 2022
}]
I'd like to create a runtime field that would equal .5 for 'Foo' (12/(12+2+10)), .42 for 'Car' (10/(12+2+10)) and 1 for 'Lar' (4/4). Is this possible? Is there a better way to achieve this result? The ultimate goal is that if someone creates a query that returns 'Foo' and 'Car' they could sum the runtime field to get .92 (.5+.42) and that such a result could be used in a Kibana Lens visualization.
I've tried creating queries that return the above results, and that is easy enough, but those queries aren't usable inside Kibana which also has global filters to account for. That's why I thought a calculated field that represents the ratio of a document's value in relation to the sum of all documents' values would be useful.
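To make the target numbers concrete, the arithmetic can be sketched in plain Python. This is not an Elasticsearch runtime field; it only shows the per-year ratio that the question describes, computed outside the cluster (for example before indexing). The field name share and the function name add_year_share are hypothetical.

```python
from collections import defaultdict

docs = [
    {"name": "Foo", "value": 12, "year": 2021},
    {"name": "Bar", "value": 2, "year": 2021},
    {"name": "Car", "value": 10, "year": 2021},
    {"name": "Lar", "value": 4, "year": 2022},
]


def add_year_share(documents):
    """Return copies of the documents with their share of the year's total value."""
    totals = defaultdict(float)
    for doc in documents:
        totals[doc["year"]] += doc["value"]
    # "share" is a hypothetical field name: value / sum of values in the same year.
    return [{**doc, "share": doc["value"] / totals[doc["year"]]} for doc in documents]


shares = {doc["name"]: doc["share"] for doc in add_year_share(docs)}
print({name: round(share, 2) for name, share in shares.items()})
```

Summing the share values of 'Foo' and 'Car' then gives roughly .92, matching the example in the question.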

Elasticsearch performance impact on choosing mapping structure for index

I am receiving data in a format like,
{
name:"index_name",
status: "good",
datapoints: [{
paramType: "ABC",
batch: [{
time:"timestamp1<epoch in sec>",
value: "123"
},{
time:"timestamp2<epoch in sec>",
value: "123"
}]
},
{
paramType: "XYZ",
batch: [{
time:"timestamp1<epoch in sec>",
value: "123"
},{
time:"timestamp2<epoch in sec>",
value: "124"
}]
}]
}
I would like to store the data into elasticsearch in such a way that I can query based on a timerange, status or paramType.
As mentioned here, I can define datapoints or batch as a nested data type, which allows each object inside the array to be indexed and queried on its own.
Another way I can think of is to split the structure into separate documents, e.g.
{
name : "index_name",
status: "good",
paramType:"ABC",
time:"timestamp<epoch in sec>",
value: "123"
}
Which one will be the more efficient way?
If I choose the second way, I know there may be ~1000 elements in the batch array and 10-15 paramType entries, which means ~15k documents will be generated. Does that mean 15k*5 fields (= 75k key-value pairs) will be repeated in the index?
This explains the advantages and disadvantages of using nested, but provides no performance-related stats. In my case, there won't be any updates to the inner objects, so I'm not sure which one will be better. Also, I have two levels of nested objects, so I would like to know how to query for data within a time range if I use nested.
A flat structure will perform better than a nested one. Nested queries are slower than term queries. Also, at index time a single nested document is internally represented as a bunch of documents; they are just indexed in the same block.
As long as your requirements are met, the second option works better.
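A minimal sketch of the second option, written in Python: it flattens one incoming payload into one document per batch entry and builds a filter for a time range. The function names flatten and time_range_query are hypothetical, and the query assumes that paramType is mapped as a keyword field and that time is mapped as a date or numeric field holding epoch seconds.

```python
def flatten(payload):
    """Yield one flat document per (paramType, batch entry) pair."""
    for datapoint in payload["datapoints"]:
        for entry in datapoint["batch"]:
            yield {
                "name": payload["name"],
                "status": payload["status"],
                "paramType": datapoint["paramType"],
                "time": entry["time"],
                "value": entry["value"],
            }


def time_range_query(param_type, start, end):
    """Build a search body for one paramType within [start, end] (epoch seconds)."""
    return {
        "query": {
            "bool": {
                # Filter context: no scoring is needed for exact matches and ranges.
                "filter": [
                    {"term": {"paramType": param_type}},
                    {"range": {"time": {"gte": start, "lte": end}}},
                ]
            }
        }
    }


# Sample payload with made-up timestamps, shaped like the one in the question.
sample = {
    "name": "index_name",
    "status": "good",
    "datapoints": [
        {"paramType": "ABC", "batch": [{"time": 1000, "value": "123"},
                                       {"time": 1060, "value": "123"}]},
        {"paramType": "XYZ", "batch": [{"time": 1000, "value": "123"},
                                       {"time": 1060, "value": "124"}]},
    ],
}

flat_docs = list(flatten(sample))
print(len(flat_docs), flat_docs[0])
print(time_range_query("ABC", 1000, 1030))
```

With the flat layout, a time range becomes an ordinary range filter and no nested query is involved.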

ES reversed filtering

Pardon the title, not sure how better to describe the problem.
Anyway, I have a table with entries(id, name, age, ..dynamic fields) and filter_groups(id, filters[]).
Each filter group has a list of filters of the form {filter, field, value} that is used client-side to filter an HTML table of entries.
Examples:
[{ filter: 'less_than', field: 'age', value: 10 },
{ filter: 'is', field: 'name', value: "john doe" }]
If I wanted to fetch all the entries matching a particular filter group, it seems fairly straightforward to construct the query and send it to ES using the filters.
However, if you reverse the situation and given an entry want to fetch all the filter_groups whose filters match the entry, how would you go about doing this?
Thanks

Elasticsearch array scoring

I'm using elasticsearch to search multiple array fields in my type, which looks something like
t1 = { field1: ["foo", "bar"],
field2: ["foo", "foo", "foo", "foo"],
field3: ["foo", "foo", "foo", "foo", "foo", "foo"]
}
And then I'm using a multi_match query to get matches, something along the lines of
multi_match: { query: "foo",
fields: "field*"
}
When computing the score of t1, elasticsearch adds the score of queries in field1, field2 and field3 which is what I want. However, they are not contributing equally, field3 contributes to the score the most since "foo" occurs multiple times there.
I want now to compute the score within each array field by not adding up the score of all array entries, but by just taking the maximum of them. In my example, all fields contained would have the same score then since they all have one exact match.
This question was already asked on the elasticsearch forum, but has not been answered so far.
I've been stumped on this myself, it really seems like there should be a simple, builtin way to just specify max instead of sum.
Not sure if this is exactly what you're going for, because you lose the match score on any particular item in the array. So you're not getting max of the match score of the best particular item, just a boolean value if anything matches. If it's something more nuanced (say a person's full name, where you want a better match for first and last vs just one or the other) this may not be acceptable because you're throwing out your scores.
If it is acceptable, this workaround seems to work:
{function_score: {
query: {bool: {should: [
{term: {field1: 'foo'}},
{term: {field2: 'foo'}},
{term: {field3: 'foo'}},
]}},
functions: [
{filter: {term: {field1: 'foo'}}, weight: 1},
{filter: {term: {field2: 'foo'}}, weight: 1},
{filter: {term: {field3: 'foo'}}, weight: 1},
],
score_mode: 'sum',
boost_mode: 'replace',
}}
We need the "query" part to give us the results to further filter, even though we discard the score. This seems like it should really be a filter, but just wrapping this same thing in the filtered query doesn't work. There may be a better option here.
Then, the weight functions just basically give a 1 if there's a match on that field and 0 otherwise. The score_mode tells it to sum those weights, so in your case they all match so we get 3. The boost_mode tells how to combine with the original query, "replace" tells it to ignore the original query score (which has the problem you mentioned that multiple matches in an array are being summed). So, the total score of this query is 3 because there are 3 matches.
It seems more complicated to me, but in my relatively limited testing I haven't noticed performance issues or anything. I'd love to see a better answer if someone more familiar with elasticsearch has one.
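The workaround above can be generated for any list of fields. The following is a minimal Python sketch of that idea, not an official API: max_per_field_query builds the same function_score body as a dict, and expected_score emulates locally what the resulting score should be (one point per field that contains the term). Both function names are made up for illustration.

```python
def max_per_field_query(term, fields):
    """Build a function_score query that scores 1 per matching field."""
    clauses = [{"term": {field: term}} for field in fields]
    return {
        "function_score": {
            # The should clauses select the documents; their scores are discarded.
            "query": {"bool": {"should": clauses}},
            # Each function contributes a weight of 1 when its field matches.
            "functions": [{"filter": clause, "weight": 1} for clause in clauses],
            "score_mode": "sum",      # add up the weights of the matching functions
            "boost_mode": "replace",  # ignore the original query score
        }
    }


def expected_score(doc, term, fields):
    """Local emulation: count the fields whose array contains the term."""
    return sum(1 for field in fields if term in doc.get(field, []))


t1 = {
    "field1": ["foo", "bar"],
    "field2": ["foo", "foo", "foo", "foo"],
    "field3": ["foo", "foo", "foo", "foo", "foo", "foo"],
}
fields = ["field1", "field2", "field3"]

query = max_per_field_query("foo", fields)
print(query)
print(expected_score(t1, "foo", fields))
```

For t1 every field contains "foo", so the emulated score is 3 regardless of how often the term repeats inside each array.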

How do you calculate the average of all entries in a field from a specific collection in Mongo using Ruby

Given the following data:
{
_id: ObjectId("51659dc99d62eedc1a000001"),
type: "image_search",
branch: "qa_media_discovery_feelobot",
time_elapsed: 19000,
test: "1365613930 All Media",
search_term: null,
env: "delta",
date: ISODate("2013-04-10T17:13:45.751Z")
}
I would like to run a command like:
avg_image_search_time = @coll.find("type" => "image_search").avg(:time_elapsed)
How would I accomplish this?
I understand the documentation on this is kind of difficult to follow.
avg_image_search_time = @coll.aggregate([ {"$group" => {"_id"=>"$type", "avg"=> {"$avg"=>"$time_elapsed"}}}, {"$match" => {"_id"=>"image_search"}} ]).first['avg']
To break this down:
We group the documents by the type field and return the $avg of time_elapsed for each type. We name the resulting average avg. Then, of those groups, we keep only the ones where the group _id matches image_search. Finally, since aggregate returns a list of results, we take the first result (there should only be one) and grab the avg field that we named.
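As a variation, the $match stage can be placed before $group so that only image_search documents are grouped at all. The sketch below is in Python rather than Ruby and is only an illustration: the pipeline is a plain list of dicts, and emulate is a tiny hypothetical in-memory stand-in for these two stages so the example runs without a server. With a driver, the pipeline list would be passed to the collection's aggregate method instead. Apart from the 19000 value, the sample documents are made up.

```python
pipeline = [
    # Filter first, so only image_search documents reach the grouping stage.
    {"$match": {"type": "image_search"}},
    {"$group": {"_id": "$type", "avg": {"$avg": "$time_elapsed"}}},
]


def emulate(pipeline, documents):
    """In-memory stand-in for the two stages above (equality $match, $avg $group)."""
    match = pipeline[0]["$match"]
    group = pipeline[1]["$group"]
    key_field = group["_id"].lstrip("$")
    avg_field = group["avg"]["$avg"].lstrip("$")
    buckets = {}
    for doc in documents:
        if all(doc.get(key) == value for key, value in match.items()):
            buckets.setdefault(doc[key_field], []).append(doc[avg_field])
    return [{"_id": key, "avg": sum(vals) / len(vals)} for key, vals in buckets.items()]


sample = [
    {"type": "image_search", "time_elapsed": 19000},
    {"type": "image_search", "time_elapsed": 21000},
    {"type": "video_search", "time_elapsed": 5000},
]

result = emulate(pipeline, sample)
print(result[0]["avg"])
```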
Use the mongodb aggregation framework http://docs.mongodb.org/manual/core/aggregation/
