Elasticsearch arithmetic and nested aggregation - elasticsearch

I have this kind of object in my Elasticsearch index:
"myobject": {
"type": "blah",
"events": [
{
"code": "code1"
"date": "2016-08-03 18:00:00"
},
{
"code": "code2"
"date": "2016-08-03 20:00:00"
}
]
}
I'd like to compute the average time spent between events with code "code1" and events with code "code2". Basically, for each object I need to subtract the date of the "code1" event from the date of the "code2" event, and then compute the average of those differences.
Thanks for your help!

Plan B is definitely MUCH better. Anything you can do at indexing time, you should do. If you know you'll need that date difference, then you should compute it at indexing time and store it in another field.
You should definitely not worry about storing redundant data; Elasticsearch doesn't really care. Your cluster will be much better off storing a few more fields than running heavy scripting on every query. Your users will appreciate it, too, as they won't have to wait ages for an answer as your data grows.
So store this instead (time_spent is the number of milliseconds between the second and the first event):
"myobject": {
"type": "blah",
"time_spent": 7200000,
"events": [
{
"code": "code1"
"date": "2016-08-03 18:00:00"
},
{
"code": "code2"
"date": "2016-08-03 20:00:00"
}
]
}
Then you'll be able to run a simple aggregation query like this:
{
  "size": 0,
  "aggs": {
    "avg_duration": {
      "avg": {
        "field": "time_spent"
      }
    }
  }
}
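The indexing-time computation itself is trivial; a minimal sketch in Python, assuming the "yyyy-MM-dd HH:mm:ss" date format shown above and exactly one event per code:

```python
from datetime import datetime

DATE_FMT = "%Y-%m-%d %H:%M:%S"

def time_spent_ms(events):
    """Milliseconds between the 'code1' and 'code2' events of one object."""
    dates = {e["code"]: datetime.strptime(e["date"], DATE_FMT) for e in events}
    delta = dates["code2"] - dates["code1"]
    return int(delta.total_seconds() * 1000)

events = [
    {"code": "code1", "date": "2016-08-03 18:00:00"},
    {"code": "code2", "date": "2016-08-03 20:00:00"},
]
print(time_spent_ms(events))  # 7200000
```

Run this on each object before indexing and store the result in the time_spent field.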

Related

Elasticsearch - How to down sample large datasets

I'm trying to figure out how to query a large dataset so I could put it up on a js line chart.
The index has millions of documents and I want to be able to show the entire series even if it's zoomed out.
The mapping kinda looks like this:
{
  "counter": {
    "type": "long" // used as kind of a sequential ID
  },
  "deposits": {
    "type": "nested",
    "properties": {
      "depositA": { "type": "long" },
      "depositB": { "type": "long" }
    }
  }
}
I want to show a line chart where the X axis is the counter values and the Y axis is the sum of the depositA and depositB values.
The dataset has about 7M docs, so I'm thinking that if I could get ES to return the average of every 7 rows, I could trim that down to 1M points for my chart and still have something that looks sensible. Possibly even take it down to 100k points?
The problem is I don't really know where to start and I'm just very new to ES.
I tried histogram aggregations but it doesn't seem to be what I'm looking for.
POST /data/_search?size=0
{
  "aggs": {
    "counters": {
      "histogram": {
        "field": "counter",
        "interval": 50
      }
    }
  }
}
While this buckets the counter field into intervals of 50, it only gives me a doc count per bucket (which I guess is just how histograms work?). I would like to get the average value of depositA+depositB across the items in each bucket, along with the counter keys, if possible.
I'm really over my head here honestly but would love to learn.
If anyone could point me to any helpful information that would be very much appreciated.
In fact, the histogram aggregation is the correct way to go; you just need to add a sub-aggregation to it. Here is an example for you:
POST indexa/_bulk
{"index": {"_id": "1"}}
{"counter": 1, "deposits": {"depositA": 10, "depositB": 15}}
{"index": {"_id": "2"}}
{"counter": 2, "deposits": {"depositA": 12, "depositB": 17}}
{"index": {"_id": "3"}}
{"counter": 3, "deposits": {"depositA": 16, "depositB": 16}}
{"index": {"_id": "4"}}
{"counter": 4, "deposits": {"depositA": 18, "depositB": 18}}
POST indexa/_search
{
  "size": 0,
  "aggs": {
    "range": {
      "histogram": {
        "field": "counter",
        "interval": 2
      },
      "aggs": {
        "nested": {
          "nested": {
            "path": "deposits"
          },
          "aggs": {
            "scripts": {
              "avg": {
                "script": {
                  "lang": "painless",
                  "source": "return doc['deposits.depositA'].value + doc['deposits.depositB'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}
I think this will work for you. I put an avg aggregation as a sub-aggregation under the histogram, and wrapped it in a nested aggregation because your deposits field is nested-typed.
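For reference, here is what that histogram + avg pipeline computes, sketched client-side in Python (bucket keys follow the histogram rule key = floor(counter / interval) * interval):

```python
from collections import defaultdict

def downsample(docs, interval):
    """Bucket docs by 'counter' and average depositA+depositB per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        key = (doc["counter"] // interval) * interval
        buckets[key].append(doc["deposits"]["depositA"] + doc["deposits"]["depositB"])
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}

docs = [
    {"counter": 1, "deposits": {"depositA": 10, "depositB": 15}},
    {"counter": 2, "deposits": {"depositA": 12, "depositB": 17}},
    {"counter": 3, "deposits": {"depositA": 16, "depositB": 16}},
    {"counter": 4, "deposits": {"depositA": 18, "depositB": 18}},
]
print(downsample(docs, 2))  # {0: 25.0, 2: 30.5, 4: 36.0}
```

With interval set to 7 on the real index, the 7M docs would come back as roughly 1M averaged points.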

Transform in Elasticsearch not updating aggregated data

I am working on a scenario that aggregates daily data per user. The data is processed in real time and stored in Elasticsearch, and now I want to use an Elasticsearch feature to aggregate the data in real time. I've read about Transforms in Elasticsearch and found that this is what we need.
The problem is that when the source index is updated, the destination index, which is supposed to hold the calculated aggregations, is not updated. This is the case I have tested:
source_index data model:
{
  "my_datetime": "2021-06-26T08:50:59",
  "client_no": "1",
  "my_date": "2021-06-26",
  "amount": 1000
}
and the transform I defined:
PUT _transform/my_transform
{
  "source": {
    "index": "dest_index"
  },
  "pivot": {
    "group_by": {
      "client_no": {
        "terms": {
          "field": "client_no"
        }
      },
      "my_date": {
        "terms": {
          "field": "my_date"
        }
      }
    },
    "aggregations": {
      "sum_amount": {
        "sum": {
          "field": "amount"
        }
      },
      "count_amount": {
        "value_count": {
          "field": "amount"
        }
      }
    }
  },
  "description": "total amount sum per client",
  "dest": {
    "index": "my_analytic"
  },
  "frequency": "60s",
  "sync": {
    "time": {
      "field": "my_datetime",
      "delay": "10s"
    }
  }
}
Now when I add another document or update existing documents in the source index, the destination index is not updated and does not take the new documents into account.
Also note that the Elasticsearch version I use is 7.13.
I also changed the date field to a timestamp (epoch format like 1624740659000) but still have the same problem.
What am I doing wrong here?
Could it be that your "my_datetime" is further in the past than the "delay": "10s" (plus the time of "frequency": "60s")?
The docs for sync.field note:
In general, it’s a good idea to use a field that contains the ingest timestamp. If you use a different field, you might need to set the delay such that it accounts for data transmission delays.
You might just need a higher delay.
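A rough sketch of the timing involved (the real checkpointing logic is more involved; this only illustrates why a sync-field value that lags real ingest time can be missed). The checkpoint times below are hypothetical:

```python
from datetime import datetime, timedelta

def visible_window(last_checkpoint, now, delay):
    """Simplified sync-field range scanned at the next transform checkpoint."""
    return last_checkpoint, now - delay

# Hypothetical: the transform last checkpointed at 08:59:00.
now = datetime(2021, 6, 26, 9, 0, 0)
lower, upper = visible_window(datetime(2021, 6, 26, 8, 59, 0), now,
                              timedelta(seconds=10))

# A document ingested now, but carrying a much older my_datetime,
# falls below the window and is never revisited:
doc_time = datetime(2021, 6, 26, 8, 50, 59)
print(lower <= doc_time <= upper)  # False
```

This is why the docs recommend an ingest timestamp for sync.time.field, or a delay large enough to cover the gap between the field's value and the moment the document actually becomes searchable.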

Elasticsearch stats aggregation group by date on timeseries

I'm having some trouble getting a query working. I want to aggregate a weather station's timeseries data in Elasticsearch. I have a value (double) for each day of the year. I would like a query that gives me the sum, min, and max of my value field, grouped by month.
My document has a stationid field and a timeseries object array:
PUT /stations/rainfall/2
{
  "stationid": "5678",
  "timeseries": [
    {
      "value": 91.3,
      "date": "2016-05-01"
    },
    {
      "value": 82.2,
      "date": "2016-05-02"
    },
    {
      "value": 74.3,
      "date": "2016-06-01"
    },
    {
      "value": 34.3,
      "date": "2016-06-02"
    }
  ]
}
So I am hoping to be able to query by stationid "5678" (or doc index 2)
and see: stationid: 5678, monthlystats: [ month: 5, avg: x, sum: y, max: z ]
Many thanks in advance for any help. Also happy to take any advice on my document structure too.
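For illustration only, here is the monthly rollup being asked for, sketched client-side in Python; inside Elasticsearch this shape corresponds to a date_histogram with a stats sub-aggregation (wrapped in a nested aggregation if timeseries is mapped as nested):

```python
from collections import defaultdict

def monthly_stats(timeseries):
    """Group daily values by month and compute sum/min/max/avg."""
    by_month = defaultdict(list)
    for point in timeseries:
        month = int(point["date"].split("-")[1])
        by_month[month].append(point["value"])
    return {
        m: {"sum": sum(v), "min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for m, v in sorted(by_month.items())
    }

timeseries = [
    {"value": 91.3, "date": "2016-05-01"},
    {"value": 82.2, "date": "2016-05-02"},
    {"value": 74.3, "date": "2016-06-01"},
    {"value": 34.3, "date": "2016-06-02"},
]
stats = monthly_stats(timeseries)
print(round(stats[5]["sum"], 1))  # 173.5
```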

Is it possible to sort elasticsearch query by whether or not the field meets the condition?

I've got an Elasticsearch document with a mapping like this:
{
  "date_added": {
    "type": "date",
    "format": "dateOptionalTime"
  },
  "expires": {
    "type": "date",
    "format": "dateOptionalTime"
  }
}
What I want to do is query with double sort:
1) newest to oldest (that's an easy sort, no problem here)
2) show documents yet to expire on top (so documents where "expires" is greater than now)
In the response I want the documents sorted in two parts: yet-to-expire newest to oldest, then expired newest to oldest.
I struggle to create a sort that achieves the second part. Can I sort by the result of a range filter? Maybe I can create some property-like boolean field that changes depending on the "expires" field and later use it for sorting?
This looks like an excellent use case for function_score. You can use a function score function to boost documents whose expiry date is later than now:
{
  "query": {
    "function_score": {
      "functions": [
        {
          "filter": {
            "range": {
              "expires": {
                "gte": "now"
              }
            }
          },
          "weight": 1000
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}
You can read more about it in the function_score documentation.
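The intended two-part ordering can be sanity-checked client-side; a sketch of the equivalent sort key, using hypothetical sample documents and a fixed "now" for reproducibility:

```python
from datetime import datetime

now = datetime(2016, 8, 3)

def sort_key(doc):
    # Expired docs (expires < now) sort after active ones;
    # within each group, newest date_added first.
    expired = datetime.fromisoformat(doc["expires"]) < now
    return (expired, -datetime.fromisoformat(doc["date_added"]).timestamp())

docs = [
    {"id": 1, "date_added": "2016-07-01", "expires": "2016-07-15"},  # expired
    {"id": 2, "date_added": "2016-06-01", "expires": "2016-09-01"},  # active
    {"id": 3, "date_added": "2016-07-20", "expires": "2016-09-01"},  # active
]
ordered = sorted(docs, key=sort_key)
print([d["id"] for d in ordered])  # [3, 2, 1]
```

With the function_score approach above, the large weight on unexpired documents produces the same grouping, and a secondary sort on date_added handles the newest-to-oldest order within each group.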

Elasticsearch - how to do field collapsing and get Distinct results? (actual records, not just counters)

In relational db our data looks like this:
Company -> Department -> Office
Elasticsearch version of the same data (flattened):
{
  "officeID": 123,
  "officeName": "office 1",
  "state": "CA",
  "department": {
    "departmentID": 456,
    "departmentName": "Department 1",
    "company": {
      "companyID": 789,
      "companyName": "Company 1"
    }
  }
},
{
  "officeID": 124,
  "officeName": "office 2",
  "state": "CA",
  "department": {
    "departmentID": 456,
    "departmentName": "Department 1",
    "company": {
      "companyID": 789,
      "companyName": "Company 1"
    }
  }
}
We need to find department (or company) by providing office information (such as state).
For example, since all I need is the department info, I can specify it like this (we are using NEST):
searchDescriptor = searchDescriptor.Source(x => x.Include("department"));
and get all departments with qualifying offices.
The problem is - I am getting multiple "department" records with the same id (one for each office).
We are using paging and sorting.
Would it be possible to get paged and sorted Distinct results?
I have spent a few days trying to find an answer (exploring options like facets, aggregations, top_hits, etc.), but so far the only working option I see is a manual one: get the results from Elasticsearch, group the data manually, and pass it to the client. The problem with this approach is obvious: every time I grab the next portion, I'll have to fetch X extra records in case some of them are duplicates; since I don't know X in advance (and the number of such records could be huge), I'm forced either to fetch a lot of unnecessary data on every search or to hit the search engine several times until I get the required number of records.
So far I was unable to achieve my goal using aggregations: all I get is a document count, but I want the actual data. When I try to use top_hits, I do get data, but they really are top hits (sorted by the number of offices per department, ignoring the sorting I specified in the query). Here is an example of the code I tried:
searchDescriptor = searchDescriptor.Aggregations(a => a
    .Terms("myunique", t => t
        .Field("department.departmentID")
        .Size(10)
        .Aggregations(x => x
            .TopHits("mytophits", y => y
                .Source(true)
                .Size(1)
                .Sort(k => k.OnField("department.departmentName").Ascending())
            )
        )
    )
);
Does anyone know whether Elasticsearch can perform a Distinct-style operation and return unique records?
Update:
I can get results using top_hits (see below), but in this case I won't be able to use paging (it looks like the Elasticsearch aggregations feature doesn't support paging), so I am back to square one...
{
  "from": 0,
  "size": 33,
  "explain": false,
  "sort": [
    {
      "departmentID": {
        "order": "asc"
      }
    }
  ],
  "_source": {
    "include": [
      "department"
    ]
  },
  "aggs": {
    "myunique": {
      "terms": {
        "field": "department.departmentID",
        "order": {
          "mytopscore": "desc"
        }
      },
      "aggs": {
        "mytophits": {
          "top_hits": {
            "size": 5,
            "_source": {
              "include": [
                "department.departmentID"
              ]
            }
          }
        },
        "mytopscore": {
          "max": {
            "script": "_score"
          }
        }
      }
    }
  },
  "query": {
    "wildcard": { "officeName": "some office*" }
  }
}
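The manual grouping described above (keep the first hit per departmentID while preserving the query's sort order, then page) can be sketched like this; note that on later Elasticsearch versions the collapse parameter or the composite aggregation can do this deduplication server-side with paging support:

```python
def distinct_departments(hits, limit, offset=0):
    """Keep the first hit per departmentID, preserving hit order, then page."""
    seen = set()
    unique = []
    for hit in hits:
        dep_id = hit["department"]["departmentID"]
        if dep_id not in seen:
            seen.add(dep_id)
            unique.append(hit["department"])
    return unique[offset:offset + limit]

# Hypothetical hits, already sorted the way the query requested:
hits = [
    {"officeID": 123, "department": {"departmentID": 456, "departmentName": "Department 1"}},
    {"officeID": 124, "department": {"departmentID": 456, "departmentName": "Department 1"}},
    {"officeID": 200, "department": {"departmentID": 457, "departmentName": "Department 2"}},
]
page = distinct_departments(hits, limit=10)
print([d["departmentID"] for d in page])  # [456, 457]
```

The drawback is exactly the one described in the question: to fill a page of unique departments you may need to over-fetch an unknown number of raw hits.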
