Elasticsearch: get top nested doc per month without top level duplicates - elasticsearch

I have some time-based, nested data from which I would like to get the biggest changes, positive and negative, in plugin usage per month. I work with Elasticsearch 5.3 (and Kibana 5.3).
A document is structured as follows:
{
_id: "xxx",
#timestamp: 1508244365987,
siteURL: "www.foo.bar",
plugins: [
{
name: "foo",
version: "3.1.4"
},
{
name: "baz",
version: "13.37"
}
]
}
However, per id (siteURL), I have multiple entries per month, and I would like to use only the latest one per time bucket to avoid unfair weighting.
I tried to solve this by using the following aggregation:
{
"aggs": {
"normal_dates": {
"date_range": {
"field": "#timestamp",
"ranges": [
{
"from": "now-1y/d",
"to": "now"
}
]
},
"aggs": {
"date_histo": {
"date_histogram": {
"field": "#timestamp",
"interval": "month"
},
"aggs": {
"top_sites": {
"terms": {
"field": "siteURL.keyword",
"size": 50000
},
"aggs": {
"top_plugin_hits": {
"top_hits": {
"sort": [
{
"#timestamp": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"plugins.name"
]
},
"size": 1
}
}
}
}
}
}
}
}
}
}
Now I get, per month, the latest document of each site and its plugins. Next I would like to turn the data inside out and get the plugins present per month together with a count of their occurrences. Then I would use a serial_diff to compare months.
However, I don't know how to go from my aggregation to the serial_diff, i.e. how to turn the data inside out.
Any help would be most welcome.
PS: extra kudos if I can get it in a Kibana 5.3 table...

It turns out that it is not possible to aggregate further on the results of a top_hits aggregation: top_hits is a metric aggregation and does not accept sub-aggregations.
I ended up loading the results of the posted query into Python and used Python for further processing and visualization.
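The post-processing step could look roughly like the sketch below. It is not the script that was actually used. It assumes the response has the shape produced by the aggregation in the question (normal_dates > date_histo > top_sites > top_plugin_hits) and that the histogram buckets arrive in ascending date order. The function names plugin_counts_per_month and month_over_month_diff are made up for illustration.

```python
from collections import Counter


def plugin_counts_per_month(response):
    """Count, per month, how many sites list each plugin in their latest document."""
    counts = {}
    for date_range in response["aggregations"]["normal_dates"]["buckets"]:
        for month in date_range["date_histo"]["buckets"]:
            counter = Counter()
            for site in month["top_sites"]["buckets"]:
                hits = site["top_plugin_hits"]["hits"]["hits"]
                if not hits:
                    continue
                plugins = hits[0]["_source"].get("plugins", [])
                # Count each plugin once per site, even if it is listed twice.
                counter.update({plugin["name"] for plugin in plugins})
            counts[month["key_as_string"]] = counter
    return counts


def month_over_month_diff(counts):
    """Return {month: {plugin: change versus the previous month}}.

    Relies on the insertion order of `counts`, i.e. on the histogram
    buckets having been read in ascending date order.
    """
    diffs = {}
    previous = None
    for month, current in counts.items():
        if previous is not None:
            names = set(current) | set(previous)
            # A Counter returns 0 for plugins that are absent in a month.
            diffs[month] = {name: current[name] - previous[name] for name in names}
        previous = current
    return diffs
```

Sorting each month's differences by value then gives the biggest positive and negative changes, which is what the serial_diff was meant to provide.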

Related

Transforming in elasticsearch not update aggregated data

I am working on a scenario to aggregate daily data per user. The data is processed in real time and stored in Elasticsearch. Now I want to use an Elasticsearch feature to aggregate the data in real time. I've read about transforms in Elasticsearch and found that they fit our case.
The problem is that when the source index is updated, the destination index, which is supposed to hold the aggregation, is not updated. This is the case I have tested:
source_index data model:
{
"my_datetime": "2021-06-26T08:50:59",
"client_no": "1",
"my_date": "2021-06-26",
"amount": 1000
}
and the transform I defined:
PUT _transform/my_transform
{
"source": {
"index": "dest_index"
},
"pivot": {
"group_by": {
"client_no": {
"terms": {
"field": "client_no"
}
},
"my_date": {
"terms": {
"field": "my_date"
}
}
},
"aggregations": {
"sum_amount": {
"sum": {
"field": "amount"
}
},
"count_amount": {
"value_count": {
"field": "amount"
}
}
}
},
"description": "total amount sum per client",
"dest": {
"index": "my_analytic"
},
"frequency": "60s",
"sync": {
"time": {
"field": "my_datetime",
"delay": "10s"
}
}
}
Now when I add another document or update existing documents in the source index, the destination index is not updated and does not take the new documents into account.
Also note that the Elasticsearch version I use is 7.13.
I also changed the date field to a timestamp (epoch format, like 1624740659000), but I still have the same problem.
What am I doing wrong here?
Could it be that your "my_datetime" is further in the past than the "delay": "10s" (plus the time of "frequency": "60s")?
The docs for sync.time.field note:
In general, it’s a good idea to use a field that contains the ingest timestamp. If you use a different field, you might need to set the delay such that it accounts for data transmission delays.
You might just need a higher delay.
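To see why an old my_datetime value can be skipped, here is a simplified mental model in Python. It is not Elasticsearch's actual implementation. It assumes that each checkpoint only considers documents whose sync field lies between the previous checkpoint and now minus the delay. The function name is_picked_up and the concrete times are made up for illustration.

```python
from datetime import datetime, timedelta


def is_picked_up(doc_time, previous_checkpoint, now, delay):
    """Simplified model: a checkpoint only looks at documents whose sync
    field falls between the previous checkpoint and (now - delay)."""
    upper_bound = now - delay
    return previous_checkpoint <= doc_time < upper_bound


# Hypothetical clock values; only the 10s delay and 60s frequency come from the question.
now = datetime(2021, 6, 27, 12, 0, 0)
delay = timedelta(seconds=10)
previous_checkpoint = now - timedelta(seconds=60) - delay

# The document from the question, indexed about a day after its my_datetime value.
old_doc = datetime(2021, 6, 26, 8, 50, 59)
# A document whose my_datetime is close to the time it was indexed.
fresh_doc = now - timedelta(seconds=30)

print(is_picked_up(old_doc, previous_checkpoint, now, delay))    # False
print(is_picked_up(fresh_doc, previous_checkpoint, now, delay))  # True
```

Under this model, a document is only noticed if its sync field is recent at indexing time, which is why an ingest timestamp is the recommended choice.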

How to get newest data from Elasticsearch based on a date field

I implemented a scheduled script that injects data into my Elasticsearch. The script doesn't check whether the data already exists in Elasticsearch, so it inserts duplicates. What I want is to get all events that have the latest value of the timestamp field (insertion dateTime).
Note: I don't have an id or a unique field that can help me group by it and set size to 1 to get the latest.
So can you give some other options?
You could aggregate by the latest available timestamp and get the top, potentially duplicate docs like so:
GET index/_search
{
"size": 0,
"aggs": {
"latest": {
"terms": {
"field": "timestamp",
"order": {
"_key": "desc"
},
"size": 1
},
"aggs": {
"latest_docs": {
"top_hits": {
"size": 100
}
}
}
}
}
}
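Since the returned documents may still contain duplicates, they could be collapsed on the client. The sketch below assumes the response shape of the query above (latest > latest_docs) and, because no unique id is available, treats two hits with identical _source as the same event. The function name latest_unique_docs is made up for illustration.

```python
import json


def latest_unique_docs(response):
    """Return the _source of each distinct document in the newest timestamp bucket."""
    buckets = response["aggregations"]["latest"]["buckets"]
    if not buckets:
        return []
    seen = set()
    unique = []
    for hit in buckets[0]["latest_docs"]["hits"]["hits"]:
        # No unique id is available, so the serialized _source acts as the key.
        fingerprint = json.dumps(hit["_source"], sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(hit["_source"])
    return unique
```

Note that top_hits returns at most the requested 100 documents, so a larger final batch would need a different retrieval strategy.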

Pagination with specific search type on ElasticSearch

We are currently using Elasticsearch 6.7 and have a huge amount of data, which makes some requests take too much time.
To avoid this problem, we want to set up pagination in our searches against Elasticsearch. The problem is that I can't apply any of the pagination methods proposed by ES to the requests that already exist.
For example, this request contains different aggregations and a query:
https://github.com/trackit/trackit/blob/master/usageReports/lambda/es_request_constructor.go#L61-L75
In addition, the results are sorted after the information is collected.
I tried to set up the Search After method as well as a form of pagination using from & size.
Scroll doesn't work with aggregations, and composite aggregation doesn't accept a query.
So, is there a good way to do pagination in Elasticsearch combined with other request types, and how would it be done with the example above?
"composite aggregation doesn't accept query"
It does accept a query. In the example below, the results are filtered on play_name. The aggregation is applied only to the documents matching the query, and it can be paginated using the after option.
{
"query": {
"term": {
"play_name": "A Winters Tale"
}
},
"size": 0,
"aggs": {
"speaker": {
"composite": {
"after": {
"product": "FLORIZEL"
},
"sources": [
{
"product": {
"terms": {
"field": "speaker"
}
}
}
]
},
"aggs": {
"speech_number": {
"terms": {
"field": "speech_number"
},
"aggs": {
"line_id": {
"terms": {
"field": "line_id"
}
}
}
}
}
}
}
}
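Paging through all buckets means feeding the after_key of each response back into the after option of the next request. A minimal Python sketch of that loop follows. It assumes the composite aggregation sits at the top level of aggs. The search parameter is a hypothetical stand-in for whatever client call sends the body and returns the parsed response, and the function name iterate_composite_buckets is made up for illustration.

```python
import copy


def iterate_composite_buckets(search, body, agg_name):
    """Yield every bucket of a composite aggregation, page by page.

    `search` is any callable that takes a request body (dict) and returns
    the parsed response (dict); it stands in for a real client call.
    """
    body = copy.deepcopy(body)  # leave the caller's request untouched
    composite = body["aggs"][agg_name]["composite"]
    composite.pop("after", None)  # start from the first page
    while True:
        response = search(body)
        result = response["aggregations"][agg_name]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if not result["buckets"] or after_key is None:
            break
        # Ask for the page that follows the last bucket just received.
        composite["after"] = after_key
```

The query part of the body is sent unchanged with every page, so the filter on play_name applies throughout.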

elasticsearch Need average per week of some value

I have simple data with two fields:
sales, date_of_sales
What I need is the average per week, i.e. sum(sales) / number of weeks.
Please help.
What I have so far is:
{
"size": 0,
"aggs": {
"WeekAggergation": {
"date_histogram": {
"field": "date_of_sales",
"interval": "week"
}
},
"TotalSales": {
"sum": {
"field": "sales"
}
},
"myValue": {
"bucket_script": {
"buckets_path": {
"myGP": "TotalSales",
"myCount": "WeekAggergation._bucket_count"
},
"script": "params.myGP/params.myCount"
}
}
}
}
I get the error
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
I think this may help:
{
"size": 0,
"aggs": {
"WeekAggergation": {
"date_histogram": {
"field": "date_of_sale",
"interval": "week",
"format": "yyyy-MM-dd"
},
"aggs": {
"TotalSales": {
"sum": {
"field": "sales"
}
},
"AvgSales": {
"avg": {
"field": "sales"
}
}
}
},
"avg_all_weekly_sales": {
"avg_bucket": {
"buckets_path": "WeekAggergation>TotalSales"
}
}
}
}
Note that the TotalSales aggregation is now nested under the weekly histogram aggregation, which gives you the total of all sales in each weekly bucket. Check the field name against your mapping: this query uses date_of_sale, whereas the question's schema and query both use date_of_sales.
Additionally, AvgSales provides a similar nested aggregation under the weekly histogram aggregation so you can see the average of all sales specific to that week.
Finally, the pipeline aggregation avg_all_weekly_sales will give the average of weekly sales based on the TotalSales bucket and the number of non-empty buckets - if you want to include empty buckets, add the gap_policy parameter like so:
...
"avg_all_weekly_sales": {
"avg_bucket": {
"buckets_path": "WeekAggergation>TotalSales",
"gap_policy": "insert_zeros"
}
}
...
(see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html).
This pipeline aggregation may or may not be what you're actually looking for, so please check the math to ensure the result is what you expect, but it should provide the correct output based on the original script.
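To check the math by hand, the arithmetic behind the two gap policies can be mirrored in a few lines of Python. This only illustrates the calculation and is not Elasticsearch code. The function name avg_bucket and the sample numbers are made up, with None marking a week that contains no documents.

```python
def avg_bucket(weekly_totals, gap_policy="skip"):
    """Average the weekly totals; None marks a week with no documents."""
    if gap_policy == "insert_zeros":
        # Empty weeks count as 0 and stay in the denominator.
        values = [0 if total is None else total for total in weekly_totals]
    elif gap_policy == "skip":
        # Empty weeks are left out of both the sum and the denominator.
        values = [total for total in weekly_totals if total is not None]
    else:
        raise ValueError(f"unsupported gap_policy: {gap_policy}")
    return sum(values) / len(values) if values else None


weekly_totals = [100, None, 200, 300]  # four weeks, the second one empty
print(avg_bucket(weekly_totals))                  # 200.0
print(avg_bucket(weekly_totals, "insert_zeros"))  # 150.0
```

The second result corresponds to sum(sales) divided by the number of weeks in the range, which is what the question asks for.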

How to determine ETL health status from time series logs using Elasticsearch?

TL;DR: what is the Elasticsearch equivalent to this Postgres query?
SELECT latest_pipeline_logs.* FROM (
SELECT pipeline_logs.*,
rank() OVER (
PARTITION BY pipeline_name
ORDER BY updated_at DESC
)
FROM pipeline_logs
) latest_pipeline_logs WHERE RANK = 1
I have hundreds of ETL pipelines with logs that are dumped into Elasticsearch. They are each executed independently at different intervals. I would like to derive a simple health status for each of my ETL pipelines using Elasticsearch aggregations.
Every pipeline logs its state on execution. My current thought process is to determine the health of each pipeline based on the two most important states that occur: succeeded and failed.
I know I can make an aggregation query and group by each pipeline with a sub-aggregation for statuses. For example, something along the lines of this:
{
...
"aggs": {
"pipelines": {
"terms": {
"field": "pipeline_name"
},
"aggs": {
"states": {
"terms": {
"field": "pipeline_state"
}
}
}
}
}
}
The problem with the above example is I could get several states because of the time series data-set, such as this:
{
"key": "some-pipeline-name",
"buckets": [
{
"key": "succeeded",
"doc_count": 123
},
{
"key": "failed",
"doc_count": 567
}
]
}
I could theoretically filter the results based on the date the pipeline is executed, but because some pipelines run every other month or so, I don't think this is an option.
The end state is to drive a simple dashboard using an Elasticsearch result set that looks something like this:
[
{
"key": "some-pipeline-name",
"latest-status": "succeeded"
},
{
"key": "some-other-pipeline",
"latest-status": "failed"
}
]
One thing to note is in this use case the historical data is not important. The dashboard will simply convey the latest state for each pipeline.
How would you go about achieving this with Elasticsearch?
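If you're only interested in the latest status for each pipeline, you could use top_hits as a sub-aggregation of a terms aggregation and sort the hits by time in descending order (the example below assumes the time field is named timestamp):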
{
"size": 0,
"aggs": {
"pipeline": {
"terms": {
"field": "pipeline_name",
"size": 1000
},
"aggs": {
"top_hits_status": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"pipeline_state"
]
}
}
}
}
}
}
}
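The response can then be flattened on the client into the list the dashboard needs. The sketch below assumes the response shape of the query above (pipeline > top_hits_status). The function name latest_status_per_pipeline is made up for illustration.

```python
def latest_status_per_pipeline(response):
    """Flatten the terms/top_hits response into [{"key": ..., "latest-status": ...}]."""
    rows = []
    for bucket in response["aggregations"]["pipeline"]["buckets"]:
        hits = bucket["top_hits_status"]["hits"]["hits"]
        if not hits:
            continue
        # size is 1 and the sort is descending, so the first hit is the newest log entry.
        rows.append({
            "key": bucket["key"],
            "latest-status": hits[0]["_source"]["pipeline_state"],
        })
    return rows
```

With more than 1000 distinct pipelines, the size of the terms aggregation would have to be raised for the list to be complete.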

Resources