I am working with Johns Hopkins University CSSE COVID19 data, published on their GitHub. Some of the metrics they publish in their daily US reports are sum aggregations. I would like to perform basic math against the values within a given field so that I can get a daily tally.
JHU publishes their data daily, so let's assume that the numbers reported reflect a 24-hour period.
Example: In the State of New York, I can see the following values for Last_Update and Recovered, where Recovered is a rolling sum of all cases where people have recovered from infection:
Ideally, I would like to create a new field (be it a scripted field, or a new field that is generated via a Logstash Filter processor) called RecoveredToday, where the field value reflects the difference between today's Recovered aggregation and yesterday's Recovered aggregation.
In the above data set, RecoveredToday is calculated from the value of Recovered on 2020-08-05 minus the value of Recovered on 2020-08-04.
73326 - 73279 = 47
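The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not something Kibana or Logstash runs: the function name `daily_deltas` and the dictionary layout are assumptions, and only the two Recovered values quoted above are real data.

```python
from typing import Dict


def daily_deltas(cumulative: Dict[str, int]) -> Dict[str, int]:
    """Turn a cumulative series keyed by ISO date into per-day differences."""
    deltas = {}
    previous = None
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    for day in sorted(cumulative):
        if previous is not None:
            deltas[day] = cumulative[day] - cumulative[previous]
        previous = day
    return deltas


# Recovered values for New York taken from the example above.
recovered = {"2020-08-04": 73279, "2020-08-05": 73326}
print(daily_deltas(recovered))  # {'2020-08-05': 47}
```

The first day in the series has no predecessor, so it gets no delta. If a daily report is missing, the next delta covers more than one day.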
With respect to using a Scripted Field in Kibana, according to this blog article, Scripted Fields can only analyze fields within one given document at a time, and cannot perform calculations against a field across multiple documents.
I do see user @agung-darmanto solved a similar problem on StackOverflow, but the solution calls out specific dates rather than performing rolling calculations. It's also unclear from the code snippet if the results are being inserted into a new field that can subsequently be used to build visualizations.
The approach to use Logstash ruby processing on the fly also presents a problem. Logstash, as far as I know, cannot access an already ingested document ... and if it can, it's probably a pretty ugly superpower to wield.
Goal: There are other fields provided in the JHU CSSE data which are also pre-aggregated. I would like to produce visualizations that reflect trends such as:
Number of new cases per day
Number of new hospitalizations per day
Number of new deaths per day
Using the data they provide, I can build visualizations that will plateau, and that plateau reflects a reduction of incidences. I'm trying to produce visualizations that show ZERO.


Is it more efficient to query multiple ElasticSearch indices at once or one big index?

I have an ElasticSearch cluster and my system handles events coming from an API.
Each event is a document stored in an index and I create a new index per source (the company calling the API). Sources come and go, so I have new sources every week and most sources become inactive after a few weeks. Each source sends between 100k and 10M new events every day.
Right now my indices are named api-events-sourcename
The documents contain a datetime field and most of my queries look like "fetch the data for that source between those dates".
I frequently use Kibana and I have configured a filter that matches all my indices (api-events-*) at once, and I then add terms to filter a specific source and specific days.
My requests can be slow at times and they tend to slow down the ingestion of new data.
Given that workflow, would I see any performance benefit from creating an index per source and per day, instead of the index per source only that I use today?
Are there other easy tricks to avoid putting too much strain on the cluster?
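To make the per-source, per-day scheme concrete, here is a minimal Python sketch of how a date-range query could address only the indices it needs instead of api-events-*. The `YYYY.MM.DD` suffix, the function name `daily_indices` and the source name `acme` are assumptions for illustration. Nothing here measures whether this is actually faster.

```python
from datetime import date, timedelta
from typing import List


def daily_indices(source: str, start: date, end: date) -> List[str]:
    """List hypothetical per-source, per-day index names for an inclusive date range."""
    days = (end - start).days
    return [
        f"api-events-{source}-{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]


# "acme" is a made-up source name.
names = daily_indices("acme", date(2020, 8, 4), date(2020, 8, 6))
# A search request can target several indices as a comma-separated list.
print(",".join(names))
```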

Chart a divergence between two time fields in the same elasticsearch index with timelion

So I have an elastic search index with lots of data and I have found an issue with some of the data that I would like to visualise. Some items in the index, matched under the itm.description field as, say, FOO, have two timestamp entries called itm.timestamp and itm.jmsTimestamp.
These two fields have started to diverge quite considerably when they were very close a few days ago. Some ActiveMQ processing is going on between the two so that seems like the likely cause but I would like to visualise when this started and what the drift is over the last few days in Kibana using timelion.
So clearly this query is not particularly helpful as it produces two flat lines. What do I need to produce a graph that displays the drift between the two timestamps using timelion? Is it possible to display the two timestamps overlaid or would graphing the drift be more useful?
Have you tried using the subtract-expression?
Something like this:
.es(index=myindex*, q='itm.description:FOO', timefield='#itm.timestamp', metric=max:'#itm.timestamp').subtract(.es(index=myindex*, q='itm.description:FOO', timefield='#itm.timestamp', metric=max:'#itm.jmsTimestamp'))
You cannot use the direct values of a field: Timelion requires an aggregation, because it renders one value per bucket. In my example, I used the max aggregation.
Another solution would be to introduce a new field that you populate at indexing time with the delta of the two timestamp fields. Again, you would need to use an aggregation for a value to be displayed on the y-axis.
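The second suggestion could look roughly like this if the documents pass through Python before indexing. It is a sketch under assumptions: the timestamps are taken to be ISO 8601 strings nested under an `itm` object, and `driftMillis` is a hypothetical field name that does not exist in the original index.

```python
from datetime import datetime, timedelta


def add_drift(doc: dict) -> dict:
    """Return a copy of doc with the delta between its two timestamps added."""
    ts = datetime.fromisoformat(doc["itm"]["timestamp"])
    jms = datetime.fromisoformat(doc["itm"]["jmsTimestamp"])
    # Copy the nested object so the caller's document is left untouched.
    enriched = {**doc, "itm": dict(doc["itm"])}
    # driftMillis is a hypothetical field: timestamp minus jmsTimestamp, in ms.
    enriched["itm"]["driftMillis"] = (ts - jms) // timedelta(milliseconds=1)
    return enriched


# Made-up sample document; only the field names come from the question.
sample = {
    "itm": {
        "description": "FOO",
        "timestamp": "2020-08-05T10:00:02.500+00:00",
        "jmsTimestamp": "2020-08-05T10:00:00+00:00",
    }
}
print(add_drift(sample)["itm"]["driftMillis"])  # 2500
```

With such a field in place, the Timelion expression needs only a single .es() call with an aggregation such as max or avg on the new field.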

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that is producing metrics on a daily basis. I would like to put those metrics into ES so I could do some advanced querying/sorting. I also only care about the most recent data that's in there. The system producing the data could also be late.
Currently I can think of two options:
I can have one index with a date column that contains the date that the metric was created. I am unsure, however, of how to write the query so that if multiple days' worth of data are in the index I filter it to just the most recent set.
I could also try and split the data up into different indexes (recent and past) and have some sort of process that migrates data from the recent index to the past index. I think the challenge with this would be having downtime where the data is being moved and/or added into the recent.
A common approach to solving this problem with elastic search would be to store data in a form that allows historic querying, then again in a second form that allows querying the most recent data. For example if your metric update looked like:
Then it can be indexed into our current values index using a composite key constructed from the document (obviously, for this to work you'd need to be able to construct a composite key from your document!). For example, your identity for this document might be the type and name concatenated. You then leverage the upsert API to allow you to write your updates to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
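As a rough sketch of the upsert described above, the following Python builds the request path and body without sending anything. The field names `type`, `name` and `value` are assumptions about the metric document. `doc_as_upsert` is the update API option that indexes the partial document when the ID does not exist yet.

```python
from typing import Tuple


def build_upsert(index: str, metric: dict) -> Tuple[str, dict]:
    """Build the path and body of an update request that upserts one metric."""
    # Composite key: type and name concatenated, as described above.
    doc_id = f"{metric['type']}-{metric['name']}"
    path = f"{index}/_update/{doc_id}"
    # With doc_as_upsert, `doc` is indexed as a new document if the ID is unknown.
    body = {"doc": metric, "doc_as_upsert": True}
    return path, body


# Hypothetical metric; the field names are illustrative only.
metric = {"type": "OperationsPerSecond", "name": "Questions", "value": 12.5}
path, body = build_upsert("current_metrics", metric)
print(path)  # current_metrics/_update/OperationsPerSecond-Questions
```

If the type or name can contain characters that are not URL-safe, the ID would need to be encoded before it is put into the path.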
To store your historic data, you change your primary key strategy. It would probably be most straightforward to use the index API and get elastic to generate a primary key for you.
POST all_metrics/_doc/
This API will create a new document for every request made to it. So as long as you have something in your data that you can use in an elastic range query, such as a field like createdDate with a value that looks like a date time, then you should be able to query historic data.
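A historic query along those lines might be built like this. It is a sketch that only constructs the search body, assuming the documents carry a `createdDate` field mapped as a date. The body would be sent to the search endpoint of the all_metrics index.

```python
def history_query(start: str, end: str) -> dict:
    """Build a search body for metrics created in the half-open range [start, end)."""
    return {
        "query": {"range": {"createdDate": {"gte": start, "lt": end}}},
        # Oldest first, so the results read as a time series.
        "sort": [{"createdDate": {"order": "asc"}}],
    }


# Example: everything created during the first week of August 2020.
body = history_query("2020-08-01", "2020-08-08")
print(body["query"]["range"]["createdDate"])
```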
The main thing is, don't worry about duplicating your data for different purposes, elastic does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization and is a pretty common technique in data warehousing and big data.

Update dataset with ElasticSearch Aggregation result

I'd like to automate a feature creation process for a large dataset with elastic search.
I'd like to know if it is possible to create a new field in my dataset that will be the result of an aggregation.
I'm currently working on logs from a network and want to implement the moving average (the mean of a field over the past x days) of the field "bytes_in".
After spending time reading the doc and example, I wasn't able to do so ...
You have two possibilities:
By using the Rollup API you can create a job that will allow you to summarize data on the go and store it in a dedicated index.
A detailed example can be found in this blog article.
By using the Data Frame Transform API, you can pivot your data into a new entity-centric index, aggregate your data in various ways and store the results in a dedicated index.
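For reference, the feature that either kind of job would have to produce is a trailing mean over daily values. Here is a minimal Python sketch of that calculation. It assumes one bytes_in value per day in chronological order and a window that includes the current day, and the function name `moving_average` is illustrative.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Mean of each value and up to window - 1 values before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            # The oldest value is about to be evicted by append().
            total -= recent[0]
        recent.append(value)
        total += value
        averages.append(total / len(recent))
    return averages


# Made-up daily bytes_in totals.
print(moving_average([10, 20, 30, 40], window=3))  # [10.0, 15.0, 20.0, 30.0]
```

Until the window has filled, the average is taken over the values seen so far.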

How to add calculations to an Elastic Search database?

I'm using Elastic Search to index large amounts of sensor data for analytics purposes. The table has 4 million + rows and growing fast - expecting 40 million within the next year. This makes Elastic Search seem like a natural fit, especially with tools such as Kibana to easily display the data.
Elastic Search seems great, however there are some more complex calculations that have to be performed as well. One such calculation is for our "average user time", where we take two data points (timestamp of item picked up and timestamp of item placed back), subtract them from each other and do an average of all these for one specific customer over a specific timeframe. The SQL query would look something like "select * from events where event_type = 'object picked up' or event_type = 'object placed back down'" then take all these events and get diffs on all their timestamps, add them all together then divide by count.
These types of calculations to my understanding are not the type of thing that Elastic Search is meant to do. I've had people recommend Hadoop but that could take a long time to get set up and we can use a fast language like Go or Node/JavaScript to batch process things and add them to the DB periodically... but what is the right way to do this? Allowing for future scalability and working nicely with Elastic Search.
Our setup is: Rails, AngularJS, Elastic Search, Heroku, Postgres.
Maybe you could try scripted metric aggregations. Combined with filters, they can give you a more or less workable solution to your problem.
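For reference, the "average user time" calculation that a scripted metric or an external batch job would have to reproduce can be sketched in Python as below. It assumes the events have already been filtered to one customer and one timeframe, that each event is a dict with `event_type` and a `datetime` under `timestamp`, and that every pick-up is paired with the next place-back. All names are illustrative.

```python
from datetime import datetime
from typing import List, Optional


def average_user_time(events: List[dict]) -> Optional[float]:
    """Average seconds between each pick-up and the following place-back."""
    durations = []
    picked_up_at = None
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] == "object picked up":
            picked_up_at = event["timestamp"]
        elif event["event_type"] == "object placed back down" and picked_up_at is not None:
            durations.append((event["timestamp"] - picked_up_at).total_seconds())
            picked_up_at = None
    # No complete pair means there is nothing to average.
    return sum(durations) / len(durations) if durations else None


# Made-up events: one 30-second and one 90-second interaction.
events = [
    {"event_type": "object picked up", "timestamp": datetime(2020, 8, 5, 10, 0, 0)},
    {"event_type": "object placed back down", "timestamp": datetime(2020, 8, 5, 10, 0, 30)},
    {"event_type": "object picked up", "timestamp": datetime(2020, 8, 5, 11, 0, 0)},
    {"event_type": "object placed back down", "timestamp": datetime(2020, 8, 5, 11, 1, 30)},
]
print(average_user_time(events))  # 60.0
```

A place-back without a preceding pick-up is ignored, and a pick-up that is never placed back contributes nothing.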
