H2O autoML does not stop after default time budget

H2O autoML does not stop after default time budget - h2o

Have your H2O ever exceeded the default training time budget (1 hour) and you had to stop it? what is your interpretation of this problem?
I am very familiar with this solution and this is happening for the very first time with a multiclass dataset of 20k instances, 40 features and 103 classes. However, surprisingly, when I lower the time budget to 15 mn (for example), h2O returns a model and makes predictions.
Thank you,
Yassine

Related

Algorithm / data structure for rate of change calculation with limited memory

Certain sensors are to trigger a signal based on the rate of change of the value rather than a threshold.
For instance, heat detectors in fire alarms are supposed to trigger an alarm quicker if the rate of temperature rise is higher: A temperature rise of 1K/min should trigger an alarm after 30 minutes, a rise of 5K/min after 5 minutes and a rise of 30K/min after 30 seconds.
 
I am wondering how this is implemented in embedded systems, where resources are scares. Is there a clever data structure to minimize the data stored?
 
The naive approach would be to measure the temperature every 5 seconds or so and keep the data for 30 minutes. On these data one can calculate change rates over arbitrary time windows. But this requires a lot of memory.
 
I thought about small windows (e.g. 10 seconds) for which min and max are stored, but this would not save much memory.

From a mathematical point of view, the examples you have described can be greatly simplified:
1K/min for 30 mins equals a total change of 30K
5K/min for 5 mins equals a total change of 25K
Obviously there is some adjustment to be made because you have picked round numbers for the example, but it sounds like what you care about is having a single threshold for the total change. This makes sense because taking the integral of a differential results in just a delta.
However, if we disregard the numeric example and just focus on your original question then here are some answers:
First, it has already been mentioned in the comments that one byte every five seconds for half an hour is really not very much memory at all for almost any modern microcontroller, as long as you are able to keep your main RAM turned on between samples, which you usually can.
If however you need to discard the contents of RAM between samples to preserve battery life, then a simpler method is just to calculate one differential at a time.
In your example you want to have a much higher sample rate (every 5 seconds) than the time you wish to calculate the delta over (eg: 30 mins). You can reduce your storage needs to a single data point if you make your sample rate equal to your delta period. The single previous value could be stored in a small battery retained memory (eg: backup registers on STM32).
Obviously if you choose this approach you will have to compromise between accuracy and latency, but maybe 30 seconds would be a suitable timebase for your temperature alarm example.
You can also set several thresholds of K/sec, and then allocate counters to count how many consecutive times the each threshold has been exceeded. This requires only one extra integer per threshold.

In signal processing terms, the procedure you want to perform is:
Apply a low-pass filter to smooth quick variations in the temperature
Take the derivative of its output
The cut-off frequency of the filter would be set according to the time frame. There are 2 ways to do this.
You could apply a FIR (finite impulse response) filter, which is a weighted moving average over the time frame of interest. Naively, this requires a lot of memory, but it's not bad if you do a multi-stage decimation first to reduce your sample rate. It ends up being a little complicated, but you have fine control over the response.
You could apply in IIR (Infinite impulse response) filter, which utilizes feedback of the output. The exponential moving average is the simplest example of this. These filters require far less memory -- only a few samples' worth, but your control over the precise shape of the response is limited. A classic example like the Butterworth filter would probably be great for your application, though.

I have got an requirement for load testing . i was given both avg. and peak TPH. also avg and peak user load

I have to do performance testing of an ecommerce application where i got the details needed like Avg TPH and peak TPH . also Avg User and Peak User.
for e.g., an average of 1000 orders/hour, the peak of 3000 orders/hour during the holiday season, expected to grow to 6000 orders/hour next holiday season.
I was afraid which value to be considered for current users and TPH for performing load test for an hour.
also what load will be preferable foe stress testing and scalability testing.
It would be a great helpful not only for the test point of view but also will help me in understanding the conceptually which would help me in a great deal down the lane.

This is a high business risk endeavor. Get it wrong and your ledger doesn't go from red to black on the day after thanksgiving, plus you have a high probability of winding up with a bad public relations event on Twitter. Add to that greater than 40% of people who hit a website failure will not return.
That being said, do your skills match the risk to the business. If not, the best thing to do is to advise your management to acquire a higher skilled team. Then you should shadow them in all of their actions.
I think it helps to have some numbers here. There are roughly 35 days in this year's holiday shopping season. This translates to 840 hours.
#$25 avg sale, you are looking at revenue of $21 million
#$50 avg sale, ...42 Million
#100 avg sale, ...84 Million
Numbers based upon the average of 1000 sales per hour over 840 hours.
Every hour of downtime at peak costs you
#$25 avg sale, ...$75K
#$50 avg sale, ...$150K
#$100 avg sale, ...$300K
Numbers based upon 3000 orders per hour at peak. If you have downtime then greater than 40% of people will not return based upon latest studies. And you have the Twitter affect where people complain loudly and draw off potential site visitors.
I would advise you to bring in a team. Act fast, the really good engineers are quickly being snapped up for Holiday work. These are not numbers to take lightly nor is it work to press someone into that hasn't done it before.
If you are seriously in need and your marketing department knows exactly how much increased conversion they get from a faster website, then I can find someone for you. They will do the work upfront at no charge, but they will charge a 12 month residual based upon the decrease in response time and the increased conversion that results

Normally Performance Testing technique is not limited to only one scenario, you need to run different performance test types to assess various aspects of your application.
Load Testing - which goal is to check how does your application behave under anticipated load, in your case it would be simulating 1000 orders per hour.
Stress Testing - putting the application under test under maximum anticipated load (in your case 3000 TPH). Another approach is gradually increasing the load until response time starts exceeding acceptable thresholds or errors start occurring (whatever comes the first) or up to 6000 TPH if you don't plan to scale up. This way you will be able to identify the bottleneck and determine what will be the component which fails which could be in
lack of hardware power
problems with database
inefficient algorithms used in your application
You can also consider executing a Soak Test - putting your application under prolonged load, this way you will be able to catch the majority of memory leaks

How can I reduce computational time in a OPL script using CPLEX?

I'm an engineering student, new user of CPLEX and OPL. I modelled an electric vehicle scheduling problem, using OPL in CPLEX.
It takes around 20min to give me an optimal solution for an instance of 4 service trips, 2 depots and 2 charging stations.
I'm currently trying to run a real example of 100 service trips, 1 depot and 1 charging station, but is taking me forever to get an answer (it is running for the last 17h).
Any suggestions on how to speed up the process?

You could set a time limit and then instead of an optimal solution you'll get a solution after the time limit. You could also try to see with the profiler where time is spent.
A few links to help you:
Performance Tuning Using CPLEX Optimization Studio IDE
MIP Tuning
CPO Introduction

Best algorithm for threshold identitication

Assume I have huge set of data about a system idle time.
Day 1 - 5 mins
Day 2 - 3 mins
Day 3 - 7 mins
...
Day 'n' - 'k' mins
We can assume that even though the idletime is random, the pattern repeats.
Using this as a training data, is it possible for me to identify the idle time behavior of the system. With that, can a abnormality be predicted
Which algorithm would best suit for this purpose
I tried to fit in regression, but it can just answer me " What is the expected idle time today "
But what I want to do is. When the idle time goes away from the pattern, it has to be detected.
Edit:
Or does it make sense to predict for the current day only. i.e Today the expected idle time is 'x' mins. Tomorrow it may differ

I would try a Fourier Transformation and have a look if your system behaves in a periodic way (this would mean there are some peaks in the frequency domain).
Than get rid of the frequencies with low values and use the rest to predict the system behavior in the future.
If the real behavior differs a lot from the prediction that is what you want to detect.
wikipedia: Fast Fourier Transformation

forecasting resource utilization

I want to create a system to forecast certain resource utilization; for example, CPU utilization. I have data of CPU utilization for each day. How can I predict its usage for next future time, say 2 days? I know that time series analysis can help but I fail to understand how to accommodate other factors associated with the CPU utilization as time series analysis is only time on x-axis and utilization on y-axis.

Check this out, i think it can help you a lot or at least help you start with something. He deals with a similar problem (forecasting of hard disk space requirements)
http://lpenz.github.com/articles/df0pred-1/index.html
http://lpenz.github.com/articles/df0pred-2/index.html
http://lpenz.github.com/articles/df0pred-3/index.html

I deduce that you have multiple time series, and that you want to put this extra information at work (as opposed to a univariate model solely with cpu utilization).
For a univariate model, you can check with arima(), and find a suitable order for this model using auto.arima() in package forecast. Predictions can be made using predict(), on the arima object.
For a multivariate model, you can consider a vector auto-regressive model. Check for function VAR() in package vars.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio