Aggregation by ID in Elasticsearch, or by timestamp with unsupervised clustering

I have data log entries stored in Elasticsearch, each with its own timestamp. I now have a dashboard that can get the aggregation by day / week using the Date Histogram aggregation.
Now I want to get the data in chunks (data logs are written several times per transaction, spanning up to several minutes) by analyzing the "cluster" of logs according to their timestamps, to identify whether they belong to the same "transaction". Would it be possible for Elasticsearch to automatically analyze the meaningful buckets and aggregate the data accordingly?
Another approach I'm trying is to group the data by transaction ID; however, there's a warning that to do this I need to enable fielddata, which will use a significant amount of memory. Any suggestions?
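Elasticsearch has no built-in "find the natural gaps" aggregation, but the grouping described above can be sketched client-side as a single pass over sorted timestamps: start a new transaction whenever the gap to the previous log line exceeds a threshold. This is only a sketch with hypothetical data and a hypothetical 5-minute threshold, not anything the question's index actually contains:

```python
from datetime import datetime, timedelta

def cluster_by_gap(timestamps, max_gap=timedelta(minutes=5)):
    """Group sorted timestamps into 'transactions': a new cluster starts
    whenever the gap to the previous log line exceeds max_gap."""
    clusters = []
    for ts in sorted(timestamps):
        if clusters and ts - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ts)  # close enough: same transaction
        else:
            clusters.append([ts])    # gap too large: new transaction
    return clusters

# Hypothetical log timestamps: three lines close together, one much later.
logs = [datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 2),
        datetime(2023, 1, 1, 10, 3), datetime(2023, 1, 1, 11, 30)]
print(len(cluster_by_gap(logs)))  # → 2
```

On the fielddata warning: that usually appears when aggregating on an analyzed `text` field; a terms aggregation on a `keyword` field (or a `.keyword` sub-field, if the default dynamic mapping created one) avoids fielddata entirely.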

Related

MongoDB Laravel search query taking too much time

I have 400,000+ records stored in MongoDB with regular indexes, but when I fire an update or search query through Laravel Eloquent it takes too much time to fetch the particular records.
In the where condition we use indexed columns only.
We are using an Atlas M10 cluster instance with multiple replicas.
If anyone has any idea about this, please share it with us.
(Attached screenshots: replication lag graph, profiler data, and the indexes in my schema.)

Is it more efficient to query multiple Elasticsearch indices at once or one big index?

I have an ElasticSearch cluster and my system handles events coming from an API.
Each event is a document stored in an index, and I create a new index per source (the company calling the API). Sources come and go, so I have new sources every week, and most sources become inactive after a few weeks. Each source sends between 100k and 10M new events every day.
Right now my indices are named api-events-sourcename
The documents contain a datetime field, and most of my queries look like "fetch the data for that source between those dates".
I frequently use Kibana, and I have configured a filter that matches all my indices (api-events-*) at once; I then add terms to filter on a specific source and specific days.
My requests can be slow at times, and they tend to slow down the ingestion of new data.
Given that workflow, should I see any performance benefit from creating an index per source and per day, instead of the index per source only that I use today?
Are there other easy tricks to avoid putting too much strain on the cluster?
Thanks!
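The query shape described above can be sketched as a plain filter-context search body (here as a Python dict, as the elasticsearch-py client would accept it). The source name `acme`, the `@timestamp` field name, and the dates are hypothetical:

```python
# "Fetch the data for that source between those dates."
# Addressing the concrete per-source index lets Elasticsearch skip the
# other sources' indices entirely; with per-day indices, the date range
# could additionally prune whole indices before searching them.
index_pattern = "api-events-acme"   # or "api-events-*" plus a source filter
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "2023-01-01", "lt": "2023-01-08"}}}
            ]
        }
    }
}
# With a live client this would run as:
# es.search(index=index_pattern, body=query)
```

Using `filter` (rather than `must`) keeps the clauses score-free and cacheable, which matters for repeated Kibana dashboards.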

Elasticsearch / Kibana: Subtraction across pre-aggregated time-series data

I am working with the Johns Hopkins University CSSE COVID-19 data, published on their GitHub. Some of the metrics they publish in their daily US reports are sum aggregations. I would like to perform basic math against the values within a given field so that I can get a daily tally.
JHU publishes their data daily, so let's assume that the numbers reported reflect a 24-hour period.
Example: In the State of New York, I can see the following values for Last_Update and Recovered, where Recovered is a rolling sum of all cases where people have recovered from infection:
Last_Update,Recovered
2020-08-05,73326
2020-08-04,73279
2020-08-03,73222
2020-08-02,73134
2020-08-01,73055
2020-07-31,72973
Ideally, I would like to create a new field (be it a scripted field, or a new field that is generated via a Logstash Filter processor) called RecoveredToday, where the field value reflects the difference between today's Recovered aggregation and yesterday's Recovered aggregation.
Last_Update,Recovered,RecoveredToday
2020-08-05,73326,47
2020-08-04,73279,57
2020-08-03,73222,88
2020-08-02,73134,79
2020-08-01,73055,82
2020-07-31,72973,...
In the above data set, RecoveredToday is calculated from the value of Recovered on 2020-08-05 minus the value of Recovered on 2020-08-04.
73326 - 73279 = 47
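Applied across the whole sample series, that subtraction is just a pairwise difference between consecutive days. A plain Python sketch, independent of Kibana, using only the numbers quoted above:

```python
# Rolling totals from the sample above (date -> Recovered).
recovered = {
    "2020-08-05": 73326,
    "2020-08-04": 73279,
    "2020-08-03": 73222,
    "2020-08-02": 73134,
    "2020-08-01": 73055,
    "2020-07-31": 72973,
}
dates = sorted(recovered)                      # oldest -> newest
daily = {d2: recovered[d2] - recovered[d1]     # today minus yesterday
         for d1, d2 in zip(dates, dates[1:])}
print(daily["2020-08-05"])  # → 47
```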
With respect to using a Scripted Field in Kibana, according to this blog article, Scripted Fields can only analyze fields within one given document at a time, and cannot perform calculations against a field across multiple documents.
I do see user #agung-darmanto solved a similar problem on StackOverflow, but the solution calls out specific dates rather than performing rolling calculations. It's also unclear from the code snippet if the results are being inserted into a new field that can subsequently be used to build visualizations.
The approach to use Logstash ruby processing on the fly also presents a problem. Logstash, as far as I know, cannot access an already ingested document ... and if it can, it's probably a pretty ugly superpower to wield.
Goal: There are other fields provided in the JHU CSSE data which are also pre-aggregated. I would like to produce visualizations that reflect trends such as:
Number of new cases per day
Number of new hospitalizations per day
Number of new deaths per day
Using the data they provide, I can build visualizations that will plateau, and that plateau reflects a reduction of incidences. I'm trying to produce visualizations that show ZERO.
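One server-side route the question doesn't mention is Elasticsearch's `derivative` pipeline aggregation, which computes exactly this bucket-to-bucket difference at query time inside a `date_histogram` (no new stored field, but the result can drive visualizations). A sketch of the aggregation body as a Python dict, with field names taken from the CSV columns above:

```python
# date_histogram over Last_Update; each daily bucket takes the max of the
# cumulative Recovered total, and the derivative agg emits the difference
# from the previous bucket, i.e. "RecoveredToday".
aggs = {
    "per_day": {
        "date_histogram": {"field": "Last_Update", "calendar_interval": "1d"},
        "aggs": {
            "recovered_total": {"max": {"field": "Recovered"}},
            "recovered_today": {"derivative": {"buckets_path": "recovered_total"}},
        },
    }
}
# With a live client: es.search(index="covid-us", body={"size": 0, "aggs": aggs})
# (index name hypothetical)
```

If a persisted field is required instead, computing the difference at ingest time (as in the pure-subtraction approach) remains the alternative.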

Update dataset with ElasticSearch aggregation result

I'd like to automate a feature-creation process for a large dataset with Elasticsearch.
I'd like to know if it is possible to create a new field in my dataset that will be the result of an aggregation.
I'm currently working on logs from a network and want to implement the moving average (the mean of a field over the past x days) of the field "bytes_in".
After spending time reading the docs and examples, I wasn't able to do so.
You have two possibilities:
By using the Rollup API you can create a job that will allow you to summarize data on the go and store it in a dedicated index.
A detailed example can be found in this blog article.
By using the Data Frame Transform API, you can pivot your data into a new entity-centric index, aggregate your data in various ways and store the results in a dedicated index.
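For the specific moving-average case, there is also the `moving_fn` pipeline aggregation (available in recent Elasticsearch versions), which computes the average over the past N buckets at query time rather than storing it. A sketch as a Python dict, assuming hypothetical `@timestamp` and `bytes_in` field names:

```python
# Daily average of bytes_in, plus a 7-bucket (7-day) moving average of
# those daily values, computed by Elasticsearch at query time.
aggs = {
    "per_day": {
        "date_histogram": {"field": "@timestamp", "calendar_interval": "1d"},
        "aggs": {
            "daily_bytes": {"avg": {"field": "bytes_in"}},
            "bytes_moving_avg": {
                "moving_fn": {
                    "buckets_path": "daily_bytes",
                    "window": 7,  # look back over the past 7 buckets
                    "script": "MovingFunctions.unweightedAvg(values)",
                }
            },
        },
    }
}
```

Unlike the Rollup and Transform routes above, this does not materialize a new field in the index; if the feature must be stored, a Transform (or ingest-time computation) is still the way to persist it.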

MongoDB Vs Oracle for Real time search

I am building an application where I am tracking user activity changes and showing the activity logs to the users. Here are a few points:
Insert 100 million records per day.
These records need to be indexed and available in search results immediately (within a few seconds).
Users can filter records on any of the 10 fields that are exposed.
I think neither Mongo nor Oracle will accomplish what you need. I would recommend offloading the search component from your primary data store, maybe to something like ElasticSearch:
http://www.elasticsearch.org/
My recommendation is ElasticSearch, as your primary use case is "filter" (facets in ElasticSearch) and search. It is written to scale out (otherwise Lucene is also good) and with big data in mind.
100 million records a day sounds like you would need a rapidly growing server farm to store the data. I am not familiar with how Oracle would distribute these data, but with MongoDB, you would need to shard your data based on the fields that your search queries are using (including the 10 fields for filtering). If you search only by shard key, MongoDB is intelligent enough to only hit the machines that contain the correct shard, so it would be like querying a small database on one machine to get what you need back. In addition, if the shard keys can fit into the memory of each machine in your cluster, and are indexed with MongoDB's btree indexing, then your queries would be pretty instant.
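As a concrete illustration of the MongoDB side of that advice, a compound index covering an equality filter plus the usual sort order might be specified like this in pymongo notation (the field names `user_id` and `created_at` are hypothetical, not from the question):

```python
# pymongo-style index spec: 1 = ascending, -1 = descending
# (pymongo.ASCENDING / pymongo.DESCENDING).
# Equality field first, then the field used for sorting, per the usual
# compound-index ordering guidance.
activity_index = [("user_id", 1), ("created_at", -1)]
# With a live client this would be applied as:
# db.activity_logs.create_index(activity_index)
```

If the collection is sharded, picking a shard key that appears in these filters is what lets the router target a single shard instead of scatter-gathering across the cluster.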
