Transform in Elasticsearch not updating aggregated data

I am working on a scenario to aggregate daily data per user. The data is processed in real time and stored in Elasticsearch. Now I want to use an Elasticsearch feature to aggregate the data in real time. I've read about Transforms in Elasticsearch and found that this is exactly the case we need.
The problem is that when the source index is updated, the destination index, which is supposed to hold the calculated aggregations, is not updated. This is the case I have tested:
source_index data model:
{
  "my_datetime": "2021-06-26T08:50:59",
  "client_no": "1",
  "my_date": "2021-06-26",
  "amount": 1000
}
and the transform I defined:
PUT _transform/my_transform
{
  "source": {
    "index": "dest_index"
  },
  "pivot": {
    "group_by": {
      "client_no": {
        "terms": {
          "field": "client_no"
        }
      },
      "my_date": {
        "terms": {
          "field": "my_date"
        }
      }
    },
    "aggregations": {
      "sum_amount": {
        "sum": {
          "field": "amount"
        }
      },
      "count_amount": {
        "value_count": {
          "field": "amount"
        }
      }
    }
  },
  "description": "total amount sum per client",
  "dest": {
    "index": "my_analytic"
  },
  "frequency": "60s",
  "sync": {
    "time": {
      "field": "my_datetime",
      "delay": "10s"
    }
  }
}
Now when I add another document or update existing documents in the source index, the destination index is not updated and the new documents are not taken into account.
Also note that the Elasticsearch version I'm using is 7.13.
I also changed the date field to a timestamp (epoch millis, e.g. 1624740659000), but I still have the same problem.
What am I doing wrong here?

Could it be that your "my_datetime" is further in the past than the "delay" of "10s" (plus the "frequency" of "60s")?
The docs for sync.time.field note:
In general, it’s a good idea to use a field that contains the ingest timestamp. If you use a different field, you might need to set the delay such that it accounts for data transmission delays.
You might just need a higher delay.
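If that's the issue, here is a minimal sketch of raising the delay via the update transform API (the "120s" value is purely illustrative; size it to how stale "my_datetime" can be relative to ingest time, and you may need to stop and restart the transform for the new sync settings to take effect):
POST _transform/my_transform/_update
{
  "sync": {
    "time": {
      "field": "my_datetime",
      "delay": "120s"
    }
  }
}
Alternatively, sync on an ingest timestamp added by an ingest pipeline, as the quoted docs suggest.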

Related

Elasticsearch Entity-centric indexing with Transforms

We're working with Elasticsearch for the first time and we are currently deciding on what would be the best solution for our problem at hand.
We are receiving event-based logs (in JSON form) from our applications directly into an Elasticsearch index. These logs are highly interconnected (they share a common unique ID), and therefore we need to convert/aggregate them in an entity-centric fashion.
Each event usually has a status change in the target field. There are more statuses than just start/end. Each document has more data, which can be used to create more than just one entity-centric index.
{
  "uniqueID": "ain123in145512kn",
  "name": "Bob",
  "target": {
    "eventStart": { "timestamp": "2020-06-01T13:50:55.000Z" }
  }
}
{
  "uniqueID": "ain123in145512kn",
  "name": "Bob",
  "target": {
    "eventStop": { "timestamp": "2021-06-01T13:50:55.000Z" }
  }
}
We were already able to join these documents using Python or Logstash. We basically created an index that contains the following documents:
{
  "uniqueID": "ain123in145512kn",
  "name": "Bob",
  "target": {
    "eventStart": { "timestamp": "2020-06-01T13:50:55.000Z" },
    "eventStop": { "timestamp": "2021-06-01T13:50:55.000Z" },
    "time_dif_Start_Stop": "xxxx"
  }
}
We assigned every event a document ID equal to its uniqueID, which made the documents merge automatically on update. The next step simply calculated the difference between the eventStart and eventStop timestamps.
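For reference, that merge step looks roughly like the following on a 7.x cluster (the index name is illustrative; partial updates merge object fields such as target, so eventStart and eventStop end up in one document):
POST site_events/_update/ain123in145512kn
{
  "doc": {
    "name": "Bob",
    "target": {
      "eventStop": { "timestamp": "2021-06-01T13:50:55.000Z" }
    }
  },
  "doc_as_upsert": true
}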
We have certain requirements for our pipeline, so we would prefer that the data never has to leave Elasticsearch. We are therefore wondering whether this can be done with any of the tools that already exist in the ELK stack or are hosted in Elastic Cloud. We tried using Transforms, but we were only able to calculate aggregated fields in a new index. Is it also possible to merge/update all the documents into a single one with this tool or any other? That would be ideal for us, as it runs on a schedule and we would not need any external tools to modify documents.
Any other suggestions or help would also be greatly appreciated.
A transform sounds good. I tried the following quick example:
PUT test/_doc/1
{
  "uniqueID": "one",
  "eventStart": {
    "timestamp": "2020-06-01T13:50:55.000Z"
  }
}

PUT test/_doc/2
{
  "uniqueID": "one",
  "eventStop": {
    "timestamp": "2020-06-01T13:53:55.000Z"
  }
}

PUT test/_doc/3
{
  "uniqueID": "one",
  "eventStop": {
    "timestamp": "2020-06-01T13:54:55.000Z"
  }
}

PUT test/_doc/4
{
  "uniqueID": "other",
  "eventStop": {
    "timestamp": "2020-06-01T13:54:55.000Z"
  }
}
GET test/_mapping

POST _transform/_preview
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_transformed"
  },
  "pivot": {
    "group_by": {
      "id": {
        "terms": {
          "field": "uniqueID.keyword"
        }
      }
    },
    "aggregations": {
      "event_count": {
        "value_count": {
          "field": "_id"
        }
      },
      "start": {
        "min": {
          "field": "eventStart.timestamp"
        }
      },
      "stop": {
        "max": {
          "field": "eventStop.timestamp"
        }
      },
      "duration": {
        "bucket_script": {
          "buckets_path": {
            "start": "start.value",
            "stop": "stop.value"
          },
          "script": """
            return (params.stop - params.start) / 1000; // in seconds (initially in ms)
          """
        }
      }
    }
  }
}
This generates the following result; the aggregation and calculation look correct:
[
  {
    "duration" : 240.0,
    "stop" : "2020-06-01T13:54:55.000Z",
    "event_count" : 3,
    "start" : "2020-06-01T13:50:55.000Z",
    "id" : "one"
  },
  {
    "stop" : "2020-06-01T13:54:55.000Z",
    "event_count" : 1,
    "start" : null,
    "id" : "other"
  }
]
PS: I've turned the answer into a blog post that dives a little deeper into the general topic :)

Pagination with specific search type on ElasticSearch

We are currently using Elasticsearch 6.7 and have a huge amount of data, which makes some requests take too much time.
To avoid this problem, we want to set up pagination in our requests to Elasticsearch. The problem is that I can't apply any of the pagination methods proposed by ES to the different requests that already exist.
For example, this request contains different aggregations and a query:
https://github.com/trackit/trackit/blob/master/usageReports/lambda/es_request_constructor.go#L61-L75
In addition, the results are sorted after the information is collected.
I tried to set up the search_after method as well as a form of pagination using from & size.
Scroll doesn't work with aggregations, and composite aggregation doesn't accept a query.
So, is there any good way to do pagination in Elasticsearch combined with other request types, and how would it be done with the example above?
composite aggregation doesn't accept query
It does accept a query. In the example below, the results are filtered based on play_name. The aggregation is only applied to the result of the query, and it can be paginated using the after option.
{
  "query": {
    "term": {
      "play_name": "A Winters Tale"
    }
  },
  "size": 0,
  "aggs": {
    "speaker": {
      "composite": {
        "after": {
          "product": "FLORIZEL"
        },
        "sources": [
          {
            "product": {
              "terms": {
                "field": "speaker"
              }
            }
          }
        ]
      },
      "aggs": {
        "speech_number": {
          "terms": {
            "field": "speech_number"
          },
          "aggs": {
            "line_id": {
              "terms": {
                "field": "line_id"
              }
            }
          }
        }
      }
    }
  }
}
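To actually page through the buckets, take the after_key object that each response returns for the composite aggregation (available since 6.3, so it works on 6.7) and pass it as the after value of the next request. This is a sketch of the relevant part of the response, with an illustrative value:
"aggregations": {
  "speaker": {
    "after_key": {
      "product": "PERDITA"
    },
    "buckets": [ ... ]
  }
}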

Elasticsearch: need average per week of some value

I have simple data of the form
sales, date_of_sale
What I need is the average per week, i.e. sum(sales) / number of weeks.
Please help.
What I have so far is:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sales",
        "interval": "week"
      }
    },
    "TotalSales": {
      "sum": {
        "field": "sales"
      }
    },
    "myValue": {
      "bucket_script": {
        "buckets_path": {
          "myGP": "TotalSales",
          "myCount": "WeekAggergation._bucket_count"
        },
        "script": "params.myGP/params.myCount"
      }
    }
  }
}
I get the error:
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
I think this may help:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sale",
        "interval": "week",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "TotalSales": {
          "sum": {
            "field": "sales"
          }
        },
        "AvgSales": {
          "avg": {
            "field": "sales"
          }
        }
      }
    },
    "avg_all_weekly_sales": {
      "avg_bucket": {
        "buckets_path": "WeekAggergation>TotalSales"
      }
    }
  }
}
Note the TotalSales aggregation is now a nested aggregation under the weekly histogram aggregation (I believe there was a typo in the code provided: the simple schema indicated the field name date_of_sale, while the aggregation used the plural form date_of_sales). This gives you a total of all sales in each weekly bucket.
Additionally, AvgSales provides a similar nested aggregation under the weekly histogram aggregation so you can see the average of all sales specific to that week.
Finally, the pipeline aggregation avg_all_weekly_sales will give the average of weekly sales based on the TotalSales bucket and the number of non-empty buckets - if you want to include empty buckets, add the gap_policy parameter like so:
...
"avg_all_weekly_sales": {
  "avg_bucket": {
    "buckets_path": "WeekAggergation>TotalSales",
    "gap_policy": "insert_zeros"
  }
}
...
(see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html).
This pipeline aggregation may or may not be what you're actually looking for, so please check the math to ensure the result is what you expect, but it should provide the correct output based on the original script.

Elasticsearch: get top nested doc per month without top level duplicates

I have some time-based, nested data from which I would like to get the biggest changes, positive and negative, in plugins per month. I work with Elasticsearch 5.3 (and Kibana 5.3).
A document is structured as follows:
{
  "_id": "xxx",
  "#timestamp": 1508244365987,
  "siteURL": "www.foo.bar",
  "plugins": [
    {
      "name": "foo",
      "version": "3.1.4"
    },
    {
      "name": "baz",
      "version": "13.37"
    }
  ]
}
However, per id (siteURL) I have multiple entries per month, and I would like to use only the latest per time bucket, to avoid unfair weighting.
I tried to solve this by using the following aggregation:
{
  "aggs": {
    "normal_dates": {
      "date_range": {
        "field": "#timestamp",
        "ranges": [
          {
            "from": "now-1y/d",
            "to": "now"
          }
        ]
      },
      "aggs": {
        "date_histo": {
          "date_histogram": {
            "field": "#timestamp",
            "interval": "month"
          },
          "aggs": {
            "top_sites": {
              "terms": {
                "field": "siteURL.keyword",
                "size": 50000
              },
              "aggs": {
                "top_plugin_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "#timestamp": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "plugins.name"
                      ]
                    },
                    "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Now I get, per month, the latest document per site and its plugins. Next I would like to turn the data inside out and get the plugins present per month with a count of their occurrences. Then I would use a serial_diff to compare months.
However, I don't know how to go from my aggregation to the serial diff, i.e. turn the data inside out.
Any help would be most welcome
PS: extra kudos if I can get it in a Kibana 5.3 table...
It turns out it is not possible to aggregate further on the result of a top_hits aggregation.
I ended up loading the results of the posted query into Python and used Python for further processing and visualization.

Aggregation over "LastUpdated" property or _timestamp

My Elasticsearch mapping looks roughly like this:
{
  "myIndex": {
    "mappings": {
      "myType": {
        "_timestamp": {
          "enabled": true,
          "store": true
        },
        "properties": {
          "LastUpdated": {
            "type": "date",
            "format": "dateOptionalTime"
          }
          /* lots of other properties */
        }
      }
    }
  }
}
So, _timestamp is enabled, and there's also a LastUpdated property on every document. LastUpdated can have a different value than _timestamp: sometimes documents get updated physically (e.g. updates to denormalized data), which updates _timestamp, but LastUpdated remains unchanged because the document hasn't actually been "updated" from a business perspective.
Also, there are many documents without a LastUpdated value (mostly old data).
What I'd like to do is run an aggregation which counts the number of documents per calendar day (kindly ignore the fact that the dates need to be midnight-aligned, please). For every document, use LastUpdated if it's there, otherwise use _timestamp.
Here's what I've tried:
{
  "aggregations": {
    "counts": {
      "terms": {
        "script": "doc.LastUpdated == empty ? doc._timestamp : doc.LastUpdated"
      }
    }
  }
}
The bucketization appears to work to some extent, but the keys in the result look weird:
buckets: [
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#7ba1f463,
    doc_count: 300544
  },
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#5a298acb,
    doc_count: 257222
  },
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#6e451b5e,
    doc_count: 101117
  },
  ...
]
What's the proper way to run this aggregation and get meaningful keys (i.e. timestamps) in the result?
I've tested this and made a Groovy script for you:
POST index/type/_search
{
  "aggs": {
    "counts": {
      "terms": {
        "script": "ts=doc['_timestamp'].getValue();v=doc['LastUpdated'].getValue();rv=v?:ts;rv",
        "lang": "groovy"
      }
    }
  }
}
This returns the required result.
Hope this helps!! Thanks!!
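If you would rather have day-aligned buckets directly, the same fallback script might also work as the value source of a date_histogram. This is an untested sketch along the same lines, using the same field names and Groovy as above:
POST index/type/_search
{
  "aggs": {
    "counts_per_day": {
      "date_histogram": {
        "interval": "day",
        "script": "ts=doc['_timestamp'].getValue();v=doc['LastUpdated'].getValue();rv=v?:ts;rv",
        "lang": "groovy"
      }
    }
  }
}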
