A fast solution to obtain the best ARIMA model in R (function `auto.arima`) - performance

I have a time series of 2775 elements:
I would like to find the best ARIMA model using the auto.arima function from the forecast package:
fit <- auto.arima(Netherlands, stepwise = FALSE, approximation = FALSE)
But I have a big problem: RStudio has been running for an hour and a half without producing a result. (I run this R code on a Windows machine with a 2.80GHz Intel(R) Core(TM) i7 CPU and 16.0 GB RAM.) I suspect this is due to the length of the time series. Could parallelization be a solution? (I don't know how to apply it.)
In any case, are there any suggestions for speeding up this code? Thanks!

The forecast package has many of its functions built with parallel processing in mind. One of the arguments of the auto.arima() function is 'parallel'.
According to the package documentation, "If [parallel = ] TRUE and stepwise = FALSE, then the specification search is done in parallel. This can give a significant speedup on multicore machines."
If parallel = TRUE, it automatically selects how many 'cores' to use. On a laptop or desktop this is often the number of physical cores * 2; for example, I have 4 cores, each with 2 hardware threads, which gives 8 logical 'cores'. If you want to set the number of cores manually, also use the num.cores argument.
I'd recommend checking out the e-book written by Hyndman all about the package. It is like a time-series forecasting bible.


Setting Hugging Face dataloader_num_workers for multi-GPU training

Should the HuggingFace transformers TrainingArguments dataloader_num_workers argument be set per GPU, or as a total across GPUs? And does the answer change depending on whether the training is running in DataParallel or DistributedDataParallel mode?
For example if I have a machine with 4 GPUs and 48 CPUs (running only this training task), would there be any expected value in setting dataloader_num_workers greater than 12 (48 / 4)? Or would they all start contending over the same resources?
As I understand it, when running in DDP mode (with torch.distributed.launch or similar), one training process manages each device, but in the default DP mode one lead process manages everything. So maybe the answer to this is 12 for DDP but ~47 for DP?
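One way to reason about the numbers in the question, as a hedged sketch: under DDP each training process builds its own DataLoader, so dataloader_num_workers applies per process, while under DP there is a single process and a single DataLoader. The helper below, suggest_num_workers, is hypothetical (it is not part of transformers) and only does the CPU-budget arithmetic, reserving one CPU for each training process. Treat the result as an upper bound to start tuning from, not a recommended value.

```python
def suggest_num_workers(total_cpus: int, num_gpus: int, mode: str) -> int:
    """Hypothetical helper: upper bound for dataloader_num_workers.

    Assumes one training process per GPU in "ddp" mode and a single
    training process in "dp" mode, and reserves one CPU per training process.
    """
    if mode == "ddp":
        n_procs = num_gpus
    elif mode == "dp":
        n_procs = 1
    else:
        raise ValueError("mode must be 'ddp' or 'dp'")
    # CPUs left after the training processes, shared evenly between them
    return max(0, (total_cpus - n_procs) // n_procs)


print(suggest_num_workers(48, 4, "ddp"))  # 11
print(suggest_num_workers(48, 4, "dp"))   # 47
```

Going above these values would oversubscribe the 48 CPUs in the example, so the workers would be expected to contend for the same cores.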

What is the best way to do a very large and repetitive task?

I need to get the duration of 18000+ audio files. Using the audioread library, each file takes about 300ms, i.e. at least 25~30 minutes of processing.
Using a Queue and Process system that uses all available cores of my processor, I can lower the average processing time per file to 70ms, but it will still take 21 minutes. How can I improve this? I would like to be able to read all the files in 5 minutes at most. Note that nothing else competes for the machine; it will only run my software, so I can consume all the resources.
Code of the function that reads the duration:
def get_duration(q, au):
    while not q.empty():
        index = q.get()
        with audioread.audio_open(au[index]['absolute_path']) as f:
            au[index]['duration'] = f.duration * 1000
Code to create the Processes:
for i in range(os.cpu_count()):
    pr = Process(target=get_duration, args=(queue, audios, ))
    pr.daemon = True
In my code there is only one Queue shared by several Process instances, and I use a Manager to edit the objects.
I would look into using a compiled solution to speed up your script. Python's threading still leaves room for improvement.
Try a compiled language.
If you still want Python, spread the work across as many threads as your system supports, using threading.Thread.
Also see this wiki for info on efficient loops in Python. The gist of the story is that for loops incur overhead.
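A minimal sketch of the "spread the work across workers" idea, using the standard library's concurrent.futures instead of a hand-rolled Queue. It assumes the files are WAV, so the stdlib wave module can compute the duration from the header alone; for other formats the body of duration_ms would be swapped for the audioread call from the question. The names duration_ms and all_durations and the workers default are illustrative, not part of the original code.

```python
import wave
from concurrent.futures import ThreadPoolExecutor


def duration_ms(path):
    """Return the duration of one WAV file in milliseconds (header only)."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate() * 1000


def all_durations(paths, workers=8):
    """Map duration_ms over all paths with a pool of worker threads.

    Results come back in the same order as `paths`, so no shared
    Manager object is needed to collect them.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(duration_ms, paths))
```

Usage would look like durations = all_durations([a['absolute_path'] for a in audios]). Whether threads or processes are faster here depends on how much of the per-file time is I/O versus decoding, so it is worth timing both on a sample of the files.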

H2O - Not seeing much speed-up after moving to powerful machine

I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations and in each loop calls H2ODeepLearningEstimator() 4 times and associated predict() and model_performance(). I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set 80,000 with 122 features (all float) with 20% for validation (10-fold CV). test set 20,000. Doing binary classification.
Machine 1: Windows 7, 4 core, Xeon, each core 3.5GHz, Memory 32 GB
Takes about 24 hours to complete
Machine 2: CentOS 7, 20 core, Xeon, each core 2.0GHz, Memory 128 GB
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
If training time is the main cost, and you have enough memory, then the speed-up will be roughly proportional to cores times core speed. So, you might have expected a (20 × 2.0) / (4 × 3.5) = 40/14 ≈ 2.86 speed-up (i.e. your 24hrs coming down to the 8-10 hour range).
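The cores-times-clock estimate above can be written out as a short calculation. This is only the idealized arithmetic, assuming training scales linearly with cores × GHz; expected_speedup is an illustrative name, and the machine specs and run times are the ones given in the question.

```python
def expected_speedup(old_cores, old_ghz, new_cores, new_ghz):
    """Idealized speed-up, assuming throughput scales with cores * clock."""
    return (new_cores * new_ghz) / (old_cores * old_ghz)


ideal = expected_speedup(4, 3.5, 20, 2.0)  # 40 / 14
observed = 24 / 17                         # hours on machine 1 / machine 2

print(round(ideal, 2))       # 2.86
print(round(observed, 2))    # 1.41
print(round(24 / ideal, 1))  # 8.4 hours if scaling were ideal
```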
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
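As a rough illustration of the seeded 80%/10%/10% split described above, here is a library-free sketch that splits row indices rather than an H2O frame. split_80_10_10 is a hypothetical helper; in practice you would use whatever splitting facility your framework provides and pass it a seed in the same way.

```python
import random


def split_80_10_10(n_rows, seed):
    """Shuffle row indices with a fixed seed and cut them 80/10/10."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # same seed -> same split
    n_train = n_rows * 8 // 10
    n_valid = n_rows // 10
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_80_10_10(80000, seed=1)
print(len(train), len(valid), len(test))  # 64000 8000 8000
```

Rebuilding the model with seed=2, seed=3, and so on gives different splits of the same rows, which is the check suggested for confirming that the metrics are stable.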

Massive memory usage when defining regression model

I'm having problems with memory when running a Poisson regression model. With the data loaded in and ready for the model, Python is using about 650 MB of memory. As soon as I create the model,
import theano.tensor as t
with pm.Model() as poisson_model:
    # priors for coefficients
    coeffs = pm.Uniform('coeffs', -10, 10, shape=(1, predictors.shape[1]))
    r = t.exp(pm.sum(coeffs*predictors.values, 1))
    obs = pm.Poisson('obs', r, observed=targets)
the memory usage shoots up to 3 GB. There are only 350 data points of 8-bit integers, so I have no idea what is using this amount of memory.
After playing around a bit, I've found that adding anything to the model puts it up to 3 GB in memory, even something as simple as
with pm.Model() as poisson_model:
    test = pm.Uniform('test', -1, 1)
Any suggestions as to what is happening or how I can look deeper? I'm using a new iMac, Python 2.7, and the latest version of PyMC3. Thanks.
I've tried replicating this on my system (Macbook Air, Py 2.7) but get ~80MB of memory usage. I would try a couple of things:
Clear the theano cache:
theano-cache clear
Try updating Theano
Re-install PyMC from the master branch
These are all guesses, as I cannot replicate the issue, so I'm hoping one of these will do the trick.

Performance penalty of persistent variables in MATLAB

Recently I profiled some MATLAB code and I was shocked to see the following in a heavily used function:
5.76 198694 58 persistent CONSTANTS;
3.44 198694 59 if isempty(CONSTANTS) % initialize CONSTANTS
In other words, MATLAB spent about 9 seconds, over 198694 function calls, declaring the persistent CONSTANTS and checking if it has been initialized. That represents 13% of the total time spent in that function.
Do persistent variables really carry that much of a performance penalty in MATLAB? Or are we doing something terribly wrong here?
@Andrew I tried your sample script and I am very, very perplexed by the output:
time calls line
6 function has_persistent
6.48 200000 7 persistent CONSTANTS
1.91 200000 8 if isempty(CONSTANTS)
10 end
I tried the bench() command and it showed my machine in the middle range of the sample machines. I am running 64-bit Ubuntu on an Intel(R) Core(TM) i7 CPU with 4GB RAM.
That's the standard way of using persistent variables in Matlab. You're doing what you're supposed to. There will be noticeable overhead for it, but your timings do seem surprisingly high.
Here's a similar test I ran in 32-bit Matlab R2009b on a 3.0 GHz Intel Core 2 QX9650 machine under Windows XP x64. Similar results on other machines and versions. About 5x faster than your timings.
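function call_has_persistent
for i = 1:200000
    has_persistent();
end

function has_persistent
persistent CONSTANTS
if isempty(CONSTANTS)
    CONSTANTS = 1;  % placeholder initializer; the original value is not shown
end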
0.89 200000 7 persistent CONSTANTS
0.25 200000 8 if isempty(CONSTANTS)
What Matlab version, OS, and CPU are you running on? What does CONSTANTS get initialized with? Does Matlab's bench() output seem reasonable for your machine?
Your timings do seem high. There may be a bug or config issue there to fix. But if you really want to make Matlab code fast, the standard advice is to "vectorize" it: restructure the code so that it makes fewer function calls on larger input arrays, and use Matlab's built-in vectorized functions instead of loops or control structures, so that, where possible, you avoid making 200,000 calls to the function in the first place. Matlab has relatively high overhead per function or method call (see "Is MATLAB OOP slow or am I doing something wrong?" for some numbers), so you can often get more mileage by refactoring to eliminate function calls than by making the individual function calls faster.
It may be worth benchmarking some other basic Matlab operations on your machine, to see if it's just "persistent" that seems slow. Also try profiling just this little call_has_persistent test script in isolation to see if the context of your function makes a difference.
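To compare the profiler figures on an equal footing, it can help to convert them to a per-call cost. The short calculation below only uses the times and call counts quoted in this thread (the two profiled lines added together); per_call_us is an illustrative helper name.

```python
def per_call_us(total_seconds, calls):
    """Average cost of one call, in microseconds."""
    return total_seconds / calls * 1e6


# 'persistent' line + 'isempty' line, as reported by the profiler
original_fn = per_call_us(5.76 + 3.44, 198694)   # heavily used function
asker_test = per_call_us(6.48 + 1.91, 200000)    # asker's run of the test
answer_test = per_call_us(0.89 + 0.25, 200000)   # answerer's run of the test

for name, value in [("original", original_fn),
                    ("asker test", asker_test),
                    ("answer test", answer_test)]:
    print("%-12s %.2f us per call" % (name, value))
```

This puts the asker's machine at roughly 42 to 46 microseconds per call for the two lines, against roughly 6 microseconds on the answerer's machine.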
