Forecasting resource utilization - algorithm

I want to create a system to forecast certain resource utilization, for example CPU utilization. I have data on CPU utilization for each day. How can I predict its usage for some future period, say the next 2 days? I know that time series analysis can help, but I fail to understand how to accommodate other factors associated with the CPU utilization, since a time series only has time on the x-axis and utilization on the y-axis.

Check these out; I think they can help you a lot, or at least give you a starting point. The author deals with a similar problem (forecasting hard disk space requirements):
http://lpenz.github.com/articles/df0pred-1/index.html
http://lpenz.github.com/articles/df0pred-2/index.html
http://lpenz.github.com/articles/df0pred-3/index.html

I deduce that you have multiple time series, and that you want to put this extra information to work (as opposed to a univariate model using CPU utilization alone).
For a univariate model, you can use arima(), and find a suitable order for this model using auto.arima() in the forecast package. Predictions can be made using predict() on the fitted arima object.
For a multivariate model, you can consider a vector auto-regressive (VAR) model. Check the VAR() function in the vars package.
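A rough sketch of both routes in R, assuming cpu is a numeric vector of daily CPU utilization and extra holds your other daily series (both names, the weekly frequency and the lag order are placeholders to adapt to your data):

library(forecast)
library(vars)

# Univariate: let auto.arima() pick the ARIMA order, then forecast 2 days ahead
fit_uni <- auto.arima(ts(cpu, frequency = 7))   # frequency = 7 assumes weekly seasonality
forecast(fit_uni, h = 2)

# Multivariate: a vector autoregression over CPU plus the extra series
y       <- cbind(cpu = cpu, extra)
fit_var <- VAR(y, p = 1, type = "const")        # lag order p = 1 chosen only for illustration
predict(fit_var, n.ahead = 2)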

Related

How does Anomaly detection in Elasticsearch work?

I have a question about the Anomaly Detection module provided by the Elastic Stack. As I understand machine learning, the more data you feed to the model, the better it learns, provided the data is proper. Now I want to use the Anomaly Detection module in Kibana. I did some testing and reading and found that it is better to have at least 3 weeks of data, or 20 buckets' worth. Now let's say we receive about 40 million records a day. Training the model on a single day already takes a long time, and 3 weeks' worth of data at that rate will put a lot of pressure on the node. But if I feed the model less data and reduce the bucket span, it will make the model more sensitive. So what is my best bet here? How can I make the most out of the Anomaly Detection module?
Just FYI: I do have a dedicated machine learning node equipped with more than enough memory, but it still takes a long time to process one day's records, so my concern is that processing 3 weeks' worth of data will take far longer.
So my question is: if we train the model on a large amount of data covering a short period, say 1 week, versus a large amount of data covering a slightly longer period, say 3 weeks, will these two models detect anomalies with the same accuracy?
If you have a dedicated ML node with ample memory, I don't see what the problem could be. Common sense has it that the more data you have, the better the model can learn, and the more accurate your predictions will be. Also, seasonality might not be well captured with just one week of data. If you have the data but are not using it out of fear that it will take "some time" to analyze, what's the point of gathering it in the first place?
It is true that it will take "some time" to build the model initially, but afterwards the ML process runs regularly, at an interval determined by your chosen bucket span (which is configurable), and only processes the new documents that arrived in the meantime; that part is really fast. Regarding sensitivity, your mileage may vary, but it does not depend only on the amount of data you feed in; it also depends on the size of the bucket span you choose.

AnyLogic: Measuring time spent in service

In the following model (see image), the graph visualizes the service block's utilization. However, this utilization represents the average number of agents being processed.
I would like to find out the amount of time the service block is delaying agents during the model's total run time. This would provide me with a more accurate representation of the capacity utilization. Is this possible?
You can use a dataset or a statistics element (found in the Analysis palette), or even a collection, and add values like this:
On enter delay:
agent.enterTime = time();
On exit (or On at exit):
data.add(time() - agent.enterTime);
Of course this requires you to add a variable called enterTime (of type double) to your agent type.

Algorithms for establishing baselines from time series data

In my app I collect a lot of metrics: hardware/native system metrics (such as CPU load, available memory, swap memory, network IO in terms of packets and bytes sent/received, etc.), JVM metrics (garbage collections, heap size, thread utilization, etc.), and app-level metrics (instrumentation that only has meaning to my app, e.g. # orders per minute, etc.).
Throughout the week, month, year I see trends/patterns in these metrics. For instance when cron jobs all kick off at midnight I see CPU and disk thrashing as reports are being generated, etc.
I'm looking for a way to assess/evaluate metrics as healthy/normal vs unhealthy/abnormal but that takes these patterns into consideration. For instance, if CPU spikes around (+/- 5 minutes) midnight each night, that should be considered "normal" and not set off alerts. But if CPU pins during a "low tide" in the day, say between 11:00 AM and noon, that should definitely cause some red flags to trigger.
I have the ability to store my metrics in a time-series database, if that helps kickstart this analytical process at all, but I don't have the foggiest clue as to what algorithms, methods and strategies I could leverage to establish these cyclical "baselines" that act as a function of time. Obviously, such a system would need to be pre-seeded or even trained with historical data that was mapped to normal/abnormal values (which is why I'm leaning towards a time-series DB as the underlying store), but this is new territory for me and I don't even know where to begin Googling so as to get back meaningful/relevant/educated solution candidates in the search results. Any ideas?
You could label each metric (CPU load, available memory, swap memory, network IO), together with the day and time, as good or bad.
Come up with a data set for a given time frame containing the metric values and whether they are good or bad. Train a model using 70% of the data, with the good/bad answers included.
Then test the trained model on the other 30% of the data, without the answers, to see whether it predicts the right labels (good/bad). You could use any classification algorithm, as sketched below.
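A rough sketch of that workflow in R, assuming a data frame metrics with the metric columns plus hour and weekday, and a hand-labelled status column (1 = abnormal, 0 = normal); all of the names here are hypothetical:

set.seed(42)
idx   <- sample(nrow(metrics), size = floor(0.7 * nrow(metrics)))   # 70% for training
train <- metrics[idx, ]
test  <- metrics[-idx, ]

# Logistic regression as one possible classifier; the hour and weekday columns
# let the model learn the time-of-day patterns described in the question
fit <- glm(status ~ cpu + mem + swap + net_io + hour + weekday,
           data = train, family = binomial)

# Score the held-out 30% and compare against the known labels
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
mean(pred == test$status)                                            # rough accuracy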

Would communication and interconnection have any impact on a computation-bound application on multiple nodes?

I have a computation-bound application. I have executed it on multiple nodes (4 nodes, 8 nodes), and I'm wondering whether communication between the nodes could have any effect on the run time. If so, how is that possible? As far as I have found, a computation-bound application depends only on the computing capability of the system.
Also, can I consider the number of CPUs in my system as its computing capability?
Any help would be appreciated.
Update:
In order to see whether the application is memory-bound or compute-bound, I ran it on one node using different numbers of cores. For that application (NPB-LU), the run time decreased linearly as I increased the number of cores, so I concluded the application could be compute-bound (I didn't have another way to figure it out).
Then I predicted the run time of the application with a model that accounts for latency (in my case, message time) at different connection levels, such as inter-socket and inter-node. There are some differences in the predicted times obtained with the different latency levels, even though the application seemed to be computation-bound.
(Symbols used in the model: n = grid size, p = number of cores, m = total Mop/s, f = Mop/s per core.)
Imagine you have a horse that is drinking water, let's say 1 liter per minute.
In order to give water to the horse, you have a well from which you can take it. Imagine you can pump up to 1.5 liters per minute.
In this situation your water consumption is horse-bound.
Then it turns out that you have two horses drinking the same amount of water: 1 liter each per minute. Now your water consumption is no longer horse-bound but well-bound.
Your application's behavior can change depending on the environment. In order to determine what is happening to your application, I recommend profiling it. You have a lot of alternatives, such as gprof, perf, PAPI and many others, to better observe your application's behaviour.
Then you can experimentally determine very interesting metrics like instructions per clock cycle (IPC), which can give you a better understanding of the behaviour of your app.

Variation of the job scheduling problem

I'm doing some administration work for an aviation transport company. They build aircraft containers and such here. One of the things they want me to code is an order optimization script that the guys on the floor can use to get the most out of the given material. To give a simple overview: say we order a certain number of beams that are 10 m per unit. We need beam chunks of 5x 6 m, 10x 3.5 m and 4x 3 m, which are acquired by cutting the 10 m beams into smaller parts. What would be the minimum number of 10 m beams we need to order?
There are some parallels with the multiprocessor job scheduling problem (one beam is a processor, each chunk a job), although that focuses on minimizing the time required to perform all jobs instead of minimizing the number of processors needed to perform all jobs within a pre-set time. The multiprocessor job scheduling problem is NP-complete, but I wonder if my variation of the problem is too. Does anybody know similar problems and methods for solving them?
This problem is exactly: http://en.wikipedia.org/wiki/Cutting_stock_problem (more generally, http://en.wikipedia.org/wiki/Bin_packing_problem). You can use any old ILP solver. I like http://lpsolve.sourceforge.net/5.5/; it's quite friendly to use.
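As a rough sketch of that formulation in R, using the lpSolve package (one solver among many; the pattern matrix enumerates the maximal ways of cutting one 10 m beam into 6 m / 3.5 m / 3 m pieces, and the demands are the numbers from the question):

library(lpSolve)

# Columns = cutting patterns, rows = piece types (6 m, 3.5 m, 3 m)
patterns <- matrix(c(1, 1, 0,    # 6 + 3.5        = 9.5 m
                     1, 0, 1,    # 6 + 3          = 9.0 m
                     0, 2, 1,    # 3.5 + 3.5 + 3  = 10.0 m
                     0, 1, 2,    # 3.5 + 3 + 3    = 9.5 m
                     0, 0, 3),   # 3 + 3 + 3      = 9.0 m
                   nrow = 3)
demand <- c(5, 10, 4)            # required pieces of 6 m, 3.5 m and 3 m

sol <- lp(direction    = "min",
          objective.in = rep(1, ncol(patterns)),  # every pattern consumes one beam
          const.mat    = patterns,
          const.dir    = rep(">=", 3),
          const.rhs    = demand,
          all.int      = TRUE)

sol$objval    # minimum number of 10 m beams to order
sol$solution  # how many beams to cut with each pattern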
