Algorithms for establishing baselines from time series data

In my app I collect a lot of metrics: hardware/native system metrics (such as CPU load, available memory, swap memory, network IO in terms of packets and bytes sent/received, etc.), JVM metrics (garbage collections, heap size, thread utilization, etc.), and app-level metrics (instrumentation that only has meaning to my app, e.g. # of orders per minute, etc.).
Throughout the week, month, and year I see trends/patterns in these metrics. For instance, when cron jobs all kick off at midnight, I see CPU and disk thrashing as reports are generated.
I'm looking for a way to assess/evaluate metrics as healthy/normal vs. unhealthy/abnormal that takes these patterns into consideration. For instance, if CPU spikes around midnight (+/- 5 minutes) each night, that should be considered "normal" and not set off alerts. But if CPU pins during a "low tide" in the day, say between 11:00 AM and noon, that should definitely raise some red flags.
I have the ability to store my metrics in a time-series database, if that helps kickstart this analytical process at all, but I don't have the foggiest clue as to what algorithms, methods, and strategies I could leverage to establish these cyclical "baselines" that act as a function of time. Obviously, such a system would need to be pre-seeded or even trained with historical data that was mapped to normal/abnormal values (which is why I'm leaning towards a time-series DB as the underlying store), but this is new territory for me and I don't even know what to begin Googling to get back meaningful/relevant/educated solution candidates in the search results. Any ideas?
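To illustrate the kind of time-dependent baseline I have in mind, here's a rough sketch I put together (pandas is used purely as an illustration; the file and column names are made up):

    import pandas as pd

    # Historical metric samples: a DatetimeIndex plus a "value" column
    # (file/column names are hypothetical)
    df = pd.read_csv("cpu_load.csv", index_col="timestamp", parse_dates=True)

    # Baseline = mean/std per (day-of-week, hour-of-day) bucket
    baseline = df.groupby([df.index.dayofweek, df.index.hour])["value"].agg(["mean", "std"])

    def is_abnormal(ts, value, sigmas=3.0):
        """Flag a sample deviating more than `sigmas` from its bucket's baseline."""
        mean, std = baseline.loc[(ts.dayofweek, ts.hour)]
        return abs(value - mean) > sigmas * std

Something along these lines would treat the midnight cron spike as normal (that bucket has a high mean) while flagging the same CPU level at 11:00 AM.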

You could label each metric observation (CPU load, available memory, swap memory, network IO), together with its day and time, as good or bad.
Build a dataset for a given time frame containing the metric values and their good/bad labels. Train a model using 70% of the data, including the good/bad answers.
Then test the trained model on the remaining 30% of the data, withholding the answers, to see whether you get the expected results (good, bad) from the model. Any classification algorithm could be used.
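For illustration, a minimal sketch of that 70/30 workflow with scikit-learn (the file name and feature columns are hypothetical):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical labeled dataset: one row per observation, "label" is good/bad
    data = pd.read_csv("labeled_metrics.csv")
    X = data[["cpu_load", "free_mem", "swap_used", "net_io", "day_of_week", "hour"]]
    y = data["label"]

    # 70% train / 30% test, as described above
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)

    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

Including time-of-day and day-of-week as features is what lets the classifier learn the cyclical patterns the question describes.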

How does Anomaly detection in Elasticsearch work?

I have a question about the Anomaly Detection module provided by the Elastic Stack. As I understand machine learning, the more data fed to the model, the better it will learn, provided the data is sound. Now I want to use the Anomaly Detection module in Kibana. I did some testing and some reading, and found that it is best to have at least 3 weeks of data, or 20 buckets' worth. Now let's say we receive about 40 million records a day. It will take a long time for the model to train on even one day's data, so if we use about 3 weeks' worth, that will put a lot of pressure on the node. But if I feed the model less data and reduce the bucket span, it will make my model more sensitive. So what is my best bet here? How can I make the most of the Anomaly Detection module?
Just FYI: I do have a dedicated machine learning node equipped with more than enough memory, but it still takes a long time to process one day's worth of records, so my concern is that it will take a very long time to process 3 weeks' worth of data.
So my question is: if we train the model on a large amount of data over a short period (say 1 week), versus a large amount of data over a slightly longer period (say 3 weeks), will these two models detect anomalies with the same accuracy?
If you have a dedicated ML node with ample memory, I don't see what the problem could be. Common sense has it that the more data you have, the better the model can learn, and the more accurate your predictions will be. Also, seasonality might not be well captured with just one week of data. If you have the data and are not using it out of fear that it will take "some time" to analyze it, what's the point of gathering it in the first place?
It is true that it will take "some time" to build the model initially, but afterwards the ML process runs at an interval based on your chosen bucket span (configurable) and processes only the new documents that arrived in the meantime; that part is really fast. Regarding sensitivity, your mileage may vary: it depends not only on the amount of data you feed in, but also on the size of the bucket span you choose.
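For reference, the bucket span is set when the anomaly detection job is created. A minimal sketch against the ML REST API (the job name and field names are made up, and the endpoint path varies by stack version):

    import requests

    job = {
        "analysis_config": {
            "bucket_span": "15m",  # the knob discussed above
            "detectors": [{"function": "mean", "field_name": "responsetime"}],
        },
        "data_description": {"time_field": "@timestamp"},
    }

    resp = requests.put(
        "http://localhost:9200/_ml/anomaly_detectors/my-metric-job",
        json=job,
        auth=("elastic", "changeme"),  # placeholder credentials
    )
    print(resp.json())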

Frequent Updates on Apache Ignite

I hope someone experienced with Apache Ignite can help guide my team towards the right answer for a new setup.
Overall Setup
Data is continuously generated from many distributed sensors and streamed into our database. Each sensor may deliver many updates every second, but generally generates <10 updates/sec.
The daily data volume is approx. 50 million records per site.
Data Description
Each record consists of the following values:
Sensor ID
Point ID
Timestamp
Proximity
where Sensor ID is our ID of the sensor, Point ID identifies some point on the site, and Proximity is a proximity measurement from the sensor to the point.
Each second there is approx. 1000 such new records. A record is never updated.
Query Workload
Queries are fairly complex with significant (and dynamic) look-back in time. A query may require data from several sensors in one site, but the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but frequently it is necessary to query over many days.
Generally, we therefore have a write-once query-many scenario.
Initial Strategy
If we load the data into primitive integer arrays in, e.g., Java, the space consumption for a week approaches 5 GB. Because that is "peanuts" on today's platforms, we intend to load all data onto all nodes in the Ignite cluster/distributed cache; in other words, use a replicated cache.
However, the continuous updates keep puzzling me. If I update the entire cache, I imagine quite substantial amounts of data need to be transferred across the network every second.
Creating chunks for, say, each minute or hour is not necessarily going to work (well) either, as each sensor can be temporarily offline, which will make it deliver stale data at some later point in time.
My question is therefore how to efficiently handle this stream of updates, while maintaining a consistent view of the data for the last 7-10 days.
My current, local implementation chunks the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated chunk. This works well on a single machine but is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation yet, so I have not been able to test this.
Ideally, each node in the Ignite cluster would maintain its own copy of all data within the last X days and apply the small update workload continuously.
So my question is, how would fellow Igniters approach this problem?
It sounds like you want to scale the load across multiple servers, but that's not possible with replicated caches, because each update will always update all nodes, and the more nodes you have, the more network traffic you will get. I think you should use partitioned caches instead and try adding nodes until the system is capable of handling the load.
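As a sketch of what that looks like (using the pyignite thin client purely as an illustration; the cache name and backup count are made up):

    from pyignite import Client
    from pyignite.datatypes.cache_config import CacheMode
    from pyignite.datatypes.prop_codes import PROP_BACKUPS_NUMBER, PROP_CACHE_MODE, PROP_NAME

    client = Client()
    client.connect("127.0.0.1", 10800)

    # Partitioned cache: each entry lives on one primary node (plus backups),
    # so an update touches a bounded number of nodes instead of all of them.
    readings = client.get_or_create_cache({
        PROP_NAME: "sensor_readings",
        PROP_CACHE_MODE: CacheMode.PARTITIONED,
        PROP_BACKUPS_NUMBER: 1,
    })

    readings.put((42, 7, 1500000000), 3.14)  # (sensor_id, point_id, ts) -> proximity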

High-speed time series queries in PostgreSQL: hardware considerations

Just have some general questions on hardware choices for querying lots of low-to-medium density sensor time series as fast as possible.
System Overview
The data consists of multiple time series of approx 50-100K data points from various sensors at different locations at different times. ~15 columns of data in the main table, but some columns have long array values in them. I'm interfacing with C#/python to send queries to a local database and then work on the returned data.
The Problem and Some Thoughts
As it stands, the logging server is on average hardware (budget 4 TB HDDs, basic quad-core, 8 GB RAM) and, depending on the amount of data requested in the query, it takes forever (re: "annoying amounts of waiting") to return an entire time series. I have thousands of independent time series and I want to compare multiple ones against each other. Returning a single time series of 50K rows with SELECT * can take 30+ seconds, but as little as 100 ms when selecting only a few columns. A lot of different queries are used, so I don't have the luxury of caching for repeated queries.
So what I was thinking is, instead of working directly against the online server, make an offline copy of the database onto more dedicated hardware that can be used for faster analysis. I don't need all of the data at any one time (i.e. I can pick a location and time range, copy that offline, and work on that).
The table design is very basic. The most frequently queried table has this primary key:
PRIMARY KEY (location_id, time_logged, sensor_id)
With a simple query like:
SELECT * FROM table
WHERE location_id = 2154321 AND sensor_id = 254;
This is most common, and will be used along with some additional WHERE conditions.
There are thousands of distinct location_id values and potentially dozens of sensor_id values for each location_id.
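For what it's worth, I've been inspecting plans from Python like this (a sketch; the connection string, table name, and index name are made up). I've also wondered whether an index ordered as (location_id, sensor_id, time_logged) would help, since the primary key puts time_logged between the two columns my WHERE clause filters on:

    import psycopg2

    conn = psycopg2.connect("dbname=sensors user=postgres")  # placeholder DSN
    cur = conn.cursor()

    # Candidate index with the WHERE columns ahead of the time column
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_loc_sensor_time "
        "ON readings (location_id, sensor_id, time_logged)"
    )
    conn.commit()

    # Inspect where the time actually goes
    cur.execute(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "SELECT * FROM readings WHERE location_id = %s AND sensor_id = %s",
        (2154321, 254),
    )
    for (line,) in cur.fetchall():
        print(line)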
The Questions
(with regard to the above table/pk/query setup)
How much does more RAM help with faster queries, i.e. 64 GB vs 8 GB?
How much do a faster CPU / more cores help with faster queries? What kind of CPU (quad/6/12+ core) would provide the best speedup? Are there diminishing returns?
How would one set up hard disk drives to help with faster queries (RAID cluster with SSD or mechanical)?
Would getting a couple or even 4+ cheap mechanical hard disk drives provide a significant speedup in RAID?
I've read about columnar stores and how they can be useful for time series (https://www.citusdata.com/blog/76-postgresql-columnar-store-for-analytics). Can anyone shed any insight on this, and is it worth setting up?
Will increasing the planner statistics value help significantly?
Any general recommendations / first steps to get the best kind of query speedup? Another dedicated PC with multiple cores and lots of RAM? A NAS? A dedicated PC with multi-drive RAID?
I'm relatively new to working with databases, so I don't really know what to expect in terms of performance; any pointers would be helpful.
Thanks!

Performance issues when pushing data at a constant rate to Elasticsearch on multiple indexes at the same time

We are experiencing some performance issues, or anomalies, on an Elasticsearch cluster, specifically in a system we are currently building.
The requirements:
We need to capture data for many of our customers, who will query and report on it on a near real-time basis. All the documents received have the same format with the same properties and a flat structure (all fields are of primitive types, no nested objects). We want to keep each customer's information separate from the others.
Frequency of data received and queried:
We receive data for each customer at a fluctuating rate of 200 to 700 documents per second – with the peak being in the middle of the day.
Queries will mostly be aggregations over around 12 million documents per customer: histograms/percentiles to show patterns over time, and the occasional raw document retrieval to find out what happened at a particular point in time. We are aiming to serve 50 to 100 customers at varying insert rates, from the smallest at 20 docs/sec up to the largest peaking at 1,000 docs/sec for some minutes.
How are we storing the data:
Each customer has one index per day. For example, if we have 5 customers, there will be a total of 35 indexes for the whole week. The reason we break it up per day is that it is mostly the latest two days that get queried, with occasional queries on the remaining ones. We also do it that way so we can delete older indexes independently per customer (some may want to keep 7 days' worth of data, some 14).
How we are inserting:
We are sending data in batches of 10 to 2,000 documents every second. One raw document is around 900 bytes.
Environment
AWS c3.large, 3 nodes
All indexes are created with 10 shards and 2 replicas for the test purposes
Both Elasticsearch 1.3.2 and 1.4.1
What we have noticed:
If I push data to one index only, response time starts at 80 to 100 ms per batch when the insert rate is around 100 documents per second. I can ramp up to 1,600 docs/sec before batches slow to close to 1 sec each, and when I increase the rate to close to 1,700, it hits a wall at some point because of concurrent insertions and the time spirals to 4 or 5 seconds. That said, if I reduce the rate of inserts, Elasticsearch recovers nicely. CPU usage increases as the rate increases.
If I push to 2 indexes concurrently, I can reach a total of 1,100 docs/sec, and CPU goes up to 93% at around 900 documents per second.
If I push to 3 indexes concurrently, I can reach a total of 150, and CPU goes up to 95 to 97%. I tried it many times. The interesting thing is that response time is around 109 ms at that point. I can increase the load to 900 and response time will still be around 400 to 600 ms, but CPU stays up.
Question:
Looking at our requirements and findings above, is this design suitable for what's being asked? Are there any tests I can do to find out more? Are there any settings I need to check (and change)?
I've been hosting thousands of Elasticsearch clusters on AWS over at https://bonsai.io for the last few years, and have had many a capacity planning conversation that sounds like this.
First off, it sounds to me like you have a pretty good cluster design and test rig going here. My first intuition here is that you are legitimately approaching the limits of your c3.large instances, and will want to bump up to a c3.xlarge (or bigger) fairly soon.
An index per tenant per day could be reasonable if you have relatively few tenants. You may consider an index per day for all tenants, using filters to focus your searches on specific tenants. And unless there are obvious cost savings to discarding old data, filters should suffice to enforce data retention windows as well.
The primary benefit of segmenting your indices per tenant would be to move your tenants between different Elasticsearch clusters. This could help if you have some tenants with wildly larger usage than others. Or to reduce the potential for Elasticsearch's cluster state management to be a single point of failure for all tenants.
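To sketch the shared-index approach (the field and index names are made up; the filtered query shown is the 1.x syntax):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # All tenants share one daily index; a term filter scopes each search to a tenant.
    result = es.search(
        index="events-2015.01.01",
        body={
            "query": {
                "filtered": {
                    "query": {"match_all": {}},
                    "filter": {"term": {"tenant_id": "acme"}},
                }
            }
        },
    )
    print(result["hits"]["total"])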
A few other things to keep in mind that may help explain the performance variance you're seeing.
Most importantly here, indexing is incredibly CPU bottlenecked. This makes sense, because Elasticsearch and Lucene are fundamentally just really fancy string parsers, and you're sending piles of strings. (Piles are a legitimate unit of measurement here, right?) Your primary bottleneck is going to be the number and speed of your CPU cores.
In order to take the best advantage of your CPU resources while indexing, you should consider the number of primary shards you're using. I'd recommend starting with three primary shards to distribute the CPU load evenly across the three nodes in your cluster.
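For example, with the official Python client (the index name and settings shown are illustrative):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Three primaries: one per node, so indexing load spreads evenly.
    es.indices.create(
        index="acme-2015.01.01",
        body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
    )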
For production, you'll almost certainly end up on larger servers. The goal is for your total CPU load at peak indexing to stay under 50%, so you have some additional headroom for processing your searches. Aggregations are also fairly CPU-hungry. The extra performance headroom is also helpful for gracefully handling any other unforeseen circumstances.
You mention pushing to multiple indices concurrently. I would avoid concurrency when bulk updating into Elasticsearch, in favor of batch updating with the Bulk API. You can bulk load documents for multiple indices with the cluster-level /_bulk endpoint. Let Elasticsearch manage the concurrency internally without adding to the overhead of parsing more HTTP connections.
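A minimal sketch of that with the official Python client's bulk helper (the document sources and index names are made up; note that one call can carry actions for several indices):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")

    # Each action names its own index, so one bulk request can cover many tenants.
    actions = [
        {"_index": "acme-2015.01.01", "_type": "event", "_source": {"msg": "order placed"}},
        {"_index": "globex-2015.01.01", "_type": "event", "_source": {"msg": "order shipped"}},
    ]
    success, errors = bulk(es, actions)
    print(success, errors)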
That's just a quick introduction to the subject of performance benchmarking. The Elasticsearch docs have a good article on Hardware which may also help you plan your cluster size.

What are the most relevant performance parameters/measures for a web application?

We are re-implementing (yes, from scratch) a web application which is currently in production. We have decided to start doing some performance tests on the new app, to get some early information about its capabilities.
As the old application is currently in production and performs well, we would like to extract some performance parameters from it, and then use these parameters as a reference or baseline goal for the performance of the new application.
Which do you think are the most relevant performance parameters we should be obtaining from the current production application?
Thanks!
Determine what pages are used the most.
Measure a latency histogram for the total time it takes to answer each request. Don't just measure the mean; measure a histogram.
From the histogram you can see what percentage of requests fall into each latency range. You can choose your key performance indicators by taking the values at the 50th and 95th percentiles. This will tell you the median latency and a worst-case bound that 95% of requests stay under.
Those two numbers alone will bring you great confidence regarding the experience your users will have.
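Computing those two indicators from raw timings is a one-liner; a small sketch (the latency values are hypothetical, e.g. pulled from an access log):

    import numpy as np

    # Hypothetical per-request latencies, in milliseconds
    latencies_ms = np.array([12.0, 15.3, 9.8, 220.4, 18.1, 35.7, 14.2])

    p50, p95 = np.percentile(latencies_ms, [50, 95])
    print(f"median (p50): {p50:.1f} ms, p95: {p95:.1f} ms")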
Throughput does not matter to users, but it does matter for capacity planning.
I also recommend that you track the performance values over time and review them twice a year.
Just in case you need an HTTP client, there is weighttp, a multi-threaded client written by the guys from Lighttpd.
It has the same syntax used by ApacheBench, but weighttp lets you use several client worker threads (AB is single-threaded so it cannot saturate a modern SMP Web server).
The answer of "usr" is valid, but you can as well record the minnimum, average and maximum latencies (that's useful to see in which range they play). Here is a public-domain C program to automate all this on a given concurrency range.
Disclamer: I am involved in the development of this project.
