Kibana not identifying field as time-based - elasticsearch

I'm using the Java API to index data into Elasticsearch and generate graphs in Kibana.
I have a field named "Event_TS" that holds values of type long (the time at which the event was created, in milliseconds). I was able to generate Date Histograms using it.
(I'm getting the JSON document from a separate method.)
But when I finally reindexed the whole data set, Kibana stopped identifying "Event_TS" as time-based, and hence I can't generate Date Histograms anymore. How do I resolve this?
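For what it's worth, Kibana only offers a field for Date Histograms when Elasticsearch maps it as a date; if the reindex let dynamic mapping detect "Event_TS" as a plain long, the time-based option disappears. A minimal sketch of an explicit mapping to apply before reindexing, with a hypothetical index name (on pre-7.x the properties block sits under the mapping type name):

PUT events_v2
{
  "mappings": {
    "properties": {
      "Event_TS": {
        "type": "date",
        "format": "epoch_millis"
      }
    }
  }
}

After reindexing into an index mapped this way, refresh the index pattern in Kibana so it picks up the corrected field type.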

Related

Elasticsearch / Kibana: Subtraction across pre-aggregated time-series data

I am working with the Johns Hopkins University CSSE COVID-19 data, published on their GitHub. Some of the metrics they publish in their daily US reports are pre-aggregated running sums. I would like to perform basic math against the values within a given field so that I can get a daily tally.
JHU publishes their data daily, so let's assume that the numbers reported reflect a 24-hour period.
Example: In the State of New York, I can see the following values for Last_Update and Recovered, where Recovered is a rolling sum of all cases where people have recovered from infection:
Last_Update,Recovered
2020-08-05,73326
2020-08-04,73279
2020-08-03,73222
2020-08-02,73134
2020-08-01,73055
2020-07-31,72973
Ideally, I would like to create a new field (be it a scripted field, or a new field that is generated via a Logstash Filter processor) called RecoveredToday, where the field value reflects the difference between today's Recovered aggregation and yesterday's Recovered aggregation.
Last_Update,Recovered,RecoveredToday
2020-08-05,73326,47
2020-08-04,73279,57
2020-08-03,73222,88
2020-08-02,73134,79
2020-08-01,73055,82
2020-07-31,72973,...
In the above data set, RecoveredToday is calculated from the value of Recovered on 2020-08-05 minus the value of Recovered on 2020-08-04.
73326 - 73279 = 47
With respect to using a Scripted Field in Kibana, according to this blog article, Scripted Fields can only analyze fields within one given document at a time, and cannot perform calculations against a field across multiple documents.
I do see user #agung-darmanto solved a similar problem on StackOverflow, but the solution calls out specific dates rather than performing rolling calculations. It's also unclear from the code snippet if the results are being inserted into a new field that can subsequently be used to build visualizations.
The approach to use Logstash ruby processing on the fly also presents a problem. Logstash, as far as I know, cannot access an already ingested document ... and if it can, it's probably a pretty ugly superpower to wield.
Goal: There are other fields provided in the JHU CSSE data which are also pre-aggregated. I would like to produce visualizations that reflect trends such as:
Number of new cases per day
Number of new hospitalizations per day
Number of new deaths per day
Using the data they provide, I can build visualizations that plateau, where the plateau reflects a reduction in new incidences. I'm trying to produce visualizations that actually show ZERO.
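One approach worth noting: Elasticsearch's derivative pipeline aggregation computes exactly this kind of day-over-day difference inside a date histogram, without precomputing a new field. A sketch, assuming one document per state per day; the index name covid_us and the Province_State filter field are assumptions, and on 6.x you would use interval instead of calendar_interval:

POST covid_us/_search
{
  "size": 0,
  "query": { "term": { "Province_State": "New York" } },
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "Last_Update",
        "calendar_interval": "day"
      },
      "aggs": {
        "recovered_total": { "max": { "field": "Recovered" } },
        "RecoveredToday": { "derivative": { "buckets_path": "recovered_total" } }
      }
    }
  }
}

The derivative of the rolling sum is the daily count, so a day with no new recoveries shows as zero rather than a plateau. This yields the differences as aggregation results rather than a stored field, so it suits visualizations that run their own aggregations (e.g. TSVB, which exposes a derivative option) rather than scripted-field use.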

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that is producing metrics on a daily basis. I would like to put those metrics into ES so I can do some advanced querying/sorting. I also only care about the most recent data that's in there. The system producing the data could also be late.
Currently I can think of two options:
I can have one index with a date field that contains the date the metric was created. I am unsure, however, how to write the query so that if multiple days' worth of data are in the index, I filter it down to just the most recent set.
I could also try to split the data into different indexes (recent and past) and have some process that migrates data from the recent index to the past index. I think the challenge with this would be downtime while the data is being moved and/or added into the recent index.
Thoughts?
A common approach to solving this problem with Elasticsearch is to store the data in a form that allows historic querying, and again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
Then it can be indexed into a current-values index under a composite key constructed from the document (obviously, for this to work you need to be able to construct a composite key from your document!). For example, the identity for this document might be the type and name concatenated. You then leverage the update API with doc_as_upsert to write every update to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
  "doc": {
    "type": "OperationsPerSecond",
    "name": "Questions",
    "value": 10
  },
  "doc_as_upsert": true
}
Every time you call this API with the same composite key, it will update the existing document rather than create a new one. This gives you an index that contains only a single document per metric you are monitoring, and you can query that index to get your most recent values.
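Fetching the latest value for one metric is then a plain GET by that composite ID (matching the hypothetical key used above):

GET current_metrics/_doc/OperationsPerSecond-Questions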
To store your historic data, you change your primary-key strategy: it is probably most straightforward to use the index API and let Elasticsearch generate an ID for you.
POST all_metrics/_doc/
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
This API creates a new document for every request made to it. So as long as your data contains something you can use in an Elasticsearch range query, such as a createdDate field holding a date-time value, you will be able to query the historic data.
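A sketch of such a historic query, assuming the createdDate field mentioned above; the seven-day window is arbitrary:

POST all_metrics/_search
{
  "query": {
    "range": {
      "createdDate": {
        "gte": "now-7d/d",
        "lt": "now/d"
      }
    }
  }
}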
The main thing is: don't worry about duplicating your data for different purposes; Elasticsearch does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization, and it is a pretty common technique in data warehousing and big data.

Kibana - can I add a monitor on a scripted field?

In Kibana (Elasticsearch v6.8) I'm storing documents containing a date field and a LaunchTime field, and I have a scripted field uptime as their difference (in minutes):
(doc['date'].value.millis - doc['LaunchTime'].value.millis) / 1000 / 60
I'm trying to create a monitor (under Alerting) on the max value of this field over the index, but the field 'uptime' doesn't show up in the list of fields I can run a max query on. Its type is number, and in visualizations I can do max/min etc. displays of this field.
Is this a limitation of Kibana alerting - that I can't use a scripted field? Or is there some way I can make it available to use?
I'm afraid it is a limitation of Kibana's scripted fields. See this post on the same subject, which refers to the scripted fields official documentation. I believe watchers are handled by ES itself, while scripted fields are handled by Kibana (they can be used in Discover and visualizations because Kibana is handling those too).
But have no fear! You already have the script for the calculation, so you could add it to Logstash to write the field into your actual documents when you index them. That would enable you to use it for watchers AND would probably reduce the load at runtime, since the value is only calculated once, at ingest time. Then you can run an update-by-query with the script to add the field to your existing documents.
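A sketch of that backfill, assuming an index named my-index (hypothetical) and that both fields are stored as ISO-8601 timestamps with a zone; in an update-by-query script the values arrive as source strings, so they need parsing rather than the doc['...'] accessor:

POST my-index/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.uptime = (ZonedDateTime.parse(ctx._source.date).toInstant().toEpochMilli() - ZonedDateTime.parse(ctx._source.LaunchTime).toInstant().toEpochMilli()) / 1000 / 60"
  }
}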
If you don't use Logstash, you could look into ES's ingest pipelines, which have been available since 5.0, though it's a rather more advanced subject.

Ways to only process new (indexed after last run) data in Elasticsearch?

Is there a way to get the date and time at which an Elasticsearch document was written?
I am running ES queries via Spark and would prefer NOT to look through all documents that I have already processed. Instead I would like to read only the documents that were ingested between the last time the program ran and now.
What is the best, most efficient way to do this?
I have looked at:
updating each document to add a field holding an array of booleans recording whether each analytic has looked at it. The negative is waiting for the update to occur.
an index-per-time-frame method, which would break the current indexes down into smaller ones, e.g. by hour. The negative I see is the number of open file descriptors.
??
Elasticsearch version 5.6
I posted the question on the Elasticsearch discussion board, and it appears that using an ingest pipeline is the best option.
I am running ES queries via Spark and would prefer NOT to look through all documents that I have already processed. Instead I would like to read only the documents that were ingested between the last time the program ran and now.
A workaround could be :
While inserting data into Elasticsearch using Logstash, Logstash appends an @timestamp key to each document, which represents the time (in UTC) at which the document was created; alternatively, we can use an ingest pipeline.
After that we can query based on the timestamp.
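A sketch of the ingest-pipeline variant, using the set processor and the _ingest.timestamp metadata field that Elasticsearch exposes (the pipeline and field names here are hypothetical):

PUT _ingest/pipeline/add-ingest-time
{
  "description": "stamp each document with the time it was ingested",
  "processors": [
    {
      "set": {
        "field": "ingested_at",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

Index documents with ?pipeline=add-ingest-time appended to the index or bulk request, and the Spark job can then run a range query on ingested_at bounded by the last run time.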
For more on this, please have a look at:
Mapping changes
There is no way to ask ES to insert a timestamp at index time
Elasticsearch doesn't have such functionality.
You need to save a date manually with each document; you will then be able to search by date range.

Aggregation by ID on Elasticsearch or by timestamp with unsupervised clustering

I have data log entries stored in Elasticsearch, each with its own timestamp. I now have a dashboard that can show aggregations by day/week using a Date Histogram aggregation.
Now I want to get the data in chunks (log entries are written several times per transaction, spanning up to several minutes) by analyzing the "clusters" of logs according to their timestamps, to identify whether they belong to the same "transaction". Is it possible for Elasticsearch to automatically find the meaningful buckets and aggregate the data accordingly?
Another approach I'm trying is to group the data by transaction ID; however, there's a warning that to do this I need to enable fielddata, which will use a significant amount of memory. Any suggestions?
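One note on the second approach: the fielddata warning appears when aggregating on a text field. If the index used dynamic mapping, each string field also got a keyword sub-field, which can be aggregated without enabling fielddata. A sketch, with hypothetical index and field names:

POST logs/_search
{
  "size": 0,
  "aggs": {
    "by_transaction": {
      "terms": {
        "field": "transaction_id.keyword",
        "size": 100
      }
    }
  }
}

Each bucket then groups the log entries of one transaction, and a min/max or top_hits sub-aggregation can recover each transaction's time span.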