ElasticNet extremely slow - performance

I am running an ElasticNet model using sklearn. My dataset has 70k observations and 20 features. I want to test different parameter combinations, using the following code:
import numpy as np
from sklearn.linear_model import ElasticNet

alpha_plot, l1_ratio_plot = np.linspace(min_xlim, max_xlim, 50), np.linspace(0, 1, 10)
alpha_grid, l1_ratio_grid = np.meshgrid(alpha_plot, l1_ratio_plot)
l1_ratio_alpha_grid = np.array([l1_ratio_grid.ravel(), alpha_grid.ravel()]).T

model_coefficients_analysis = []
for i in l1_ratio_alpha_grid:
    model_analysis = ElasticNet(alpha=i[1], l1_ratio=i[0], fit_intercept=True, max_iter=10000).fit(self.features_train_std, self.labels_train)
    model_coefficients_analysis.append(model_analysis.coef_)
I am aware that this can be done with GridSearchCV, but it doesn't do the job for me because I need to store the coefficients for every combination of parameters tested. The current code snippet is exceptionally slow: it takes roughly 10 minutes for each of the 50*10 iterations. Is there a way to speed up the process? For example, GridSearchCV has an n_jobs parameter that can be set to -1 to parallelize the work, but here I cannot find an equivalent.

It takes roughly 10 minutes for each of the 50*10 iterations
That seems very high, but you also have rather large data; I can't fit a randomized dataset of that size in memory in Colab (where I usually run examples for answers here). You might not be able to shrink the first fit time very much, but maybe you can reduce the subsequent fit times by using warm-starting.
If you set warm_start=True and reuse the same model object for each iteration, the coefficients are kept and used as the starting point for the solver in the next fit:
model_analysis = ElasticNet(fit_intercept=True, max_iter=10000, warm_start=True)
model_coefficients_analysis = []
for i in l1_ratio_alpha_grid:
    model_analysis.set_params(alpha=i[1], l1_ratio=i[0])
    model_analysis.fit(self.features_train_std, self.labels_train)  # starts from the previous solution
    model_coefficients_analysis.append(model_analysis.coef_)
You might consider using ElasticNetCV, since it uses warm-starting internally, and it provides some other niceties. You can use a PredefinedSplit if adding k-fold cross-validation is too much of an added expense, but I believe the n_jobs parameter is only useful in splitting up jobs across hyperparameters and folds, so using more cores might mitigate the issues with k-fold (but then you'll also have k times as many coefficients).
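If the MSE path over the grid is enough for your analysis (ElasticNetCV only retains the coefficients of the best combination, not of every fit), a rough sketch using the question's alpha_plot, l1_ratio_plot and training arrays might look like this:
from sklearn.linear_model import ElasticNetCV

model_cv = ElasticNetCV(
    alphas=alpha_plot,         # the 50 alpha values from the question's grid
    l1_ratio=l1_ratio_plot,    # the 10 l1_ratio values to try
    cv=5,                      # or a PredefinedSplit to avoid full k-fold cost
    n_jobs=-1,                 # parallelize across l1_ratios and folds
    max_iter=10000,
)
model_cv.fit(self.features_train_std, self.labels_train)
print(model_cv.alpha_, model_cv.l1_ratio_)   # best combination found
print(model_cv.mse_path_.shape)              # (n_l1_ratios, n_alphas, n_folds)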
Your large max_iter is a bit worrying; do you get nonconvergence? From your independent variable name it seems like you're scaling, but if not, that's the place to start: fast (and maybe correct) convergence depends on features with similar scales. You might also consider increasing the convergence tolerance tol. I have no experience with the selection parameter, but the docstring suggests that changing it to "random" may speed up convergence.

Related

Speed Up Gensim's Word2vec for a Massive Dataset

I'm trying to build a Word2vec (or FastText) model using Gensim on a massive dataset composed of 1000 files, each containing ~210,000 sentences, and each sentence containing ~1000 words. The training was done on a machine with 185 GB of RAM and 36 cores.
I validated that
gensim.models.word2vec.FAST_VERSION == 1
First, I've tried the following:
files = gensim.models.word2vec.PathLineSentences('path/to/files')
model = gensim.models.word2vec.Word2Vec(files, workers=-1)
But after 13 hours I decided it had been running for too long and stopped it.
Then I tried building the vocabulary based on a single file, and train based on all 1000 files as follows:
files = os.listdir('path/to/files')
model = gensim.models.word2vec.Word2Vec(min_count=1, workers=-1)
model.build_vocab(corpus_file=files[0])
for file in files:
    model.train(corpus_file=file, total_words=model.corpus_total_words, epochs=1)
But I checked a sample of word vectors before and after the training, and there was no change, which means no actual training was done.
I could use some advice on how to run this quickly and successfully. Thanks!
Update #1:
Here is the code to check vector updates:
file = 'path/to/single/gziped/file'
total_words = 197264406 # number of words in 'file'
total_examples = 209718 # number of records in 'file'
model = gensim.models.word2vec.Word2Vec(iter=5, workers=12)
model.build_vocab(corpus_file=file)
wv_before = model.wv['9995']
model.train(corpus_file=file, total_words=total_words, total_examples=total_examples, epochs=5)
wv_after = model.wv['9995']
The vectors wv_before and wv_after are exactly the same.
There's no facility in gensim's Word2Vec to accept a negative value for workers. (Where'd you get the idea that would be meaningful?)
So, it's quite possible that's breaking something, perhaps preventing any training from even being attempted.
Was there sensible logging output (at level INFO) suggesting that training was progressing in your trial runs, either against the PathLineSentences or your second attempt? Did utilities like top show busy threads? Did the output suggest a particular rate of progress & let you project-out a likely finishing time?
I'd suggest using a positive workers value and watching INFO-level logging to get a better idea what's happening.
Unfortunately, even with 36 cores, using a corpus iterable sequence (like PathLineSentences) puts gensim Word2Vec in a mode where you'll likely get maximum throughput with a workers value in the 8-16 range, using far fewer than all your threads. But it will do the right thing on a corpus of any size, even one being assembled on the fly by the iterable sequence.
Using the corpus_file mode can saturate far more cores, but you should still specify the actual number of worker threads to use – in your case, workers=36 – and it is designed to work from a single file containing all the data.
Your code which attempts to train() many times with corpus_file has lots of problems, and I can't think of a way to adapt corpus_file mode to work on your many files. Some of the problems include:
you're only building the vocabulary from the 1st file, which means any words appearing only in other files will be unknown and ignored, and any of the word-frequency-driven parts of the Word2Vec algorithm will be working with unrepresentative frequencies
the model builds its estimate of the expected corpus size (eg: model.corpus_total_words) from the build_vocab() step, so every train() will behave as if that size is the total corpus size, in its progress-reporting and in its management of the internal alpha learning-rate decay. So those logs will be wrong, and the alpha will be mismanaged in a fresh decay on each train(), resulting in a nonsensical sawtooth of the learning rate up and down across all the files.
you're only iterating over each file's contents once, which isn't typical. (It might be reasonable in a giant 210-billion word corpus, though, if every file's text is equally and randomly representative of the domain. In that case, the full corpus once might be as good as iterating over a corpus that's 1/5th the size 5 times. But it'd be a problem if some words/patterns-of-usage are all clumped in certain files - the best training interleaves contrasting examples throughout each epoch and all epochs.)
min_count=1 is almost always unwise with this algorithm, and especially so in large corpora of typical natural-language word frequencies. Rare words, and especially those appearing only once or a few times, make the model gigantic but those words won't get good word-vectors, and keeping them in acts like noise interfering with the improvement of other more-common words.
I recommend:
Try the corpus iterable sequence mode, with logging and a sensible workers value, to at least get an accurate read of how long it might take; a rough sketch follows this list. (The longest step will be the initial vocabulary scan, which is essentially single-threaded and must visit all the data. But you can .save() the model after that step, to then later re-.load() it, tinker with settings, and try different train() approaches without repeating the slow vocabulary survey.)
Try aggressively-higher values of min_count (discarding more rare words for a smaller model & faster training). Perhaps also try aggressively-smaller values of sample (like 1e-05, 1e-06, etc) to discard a larger fraction of the most-frequent words, for faster training that also often improves overall word-vector quality (by spending relatively more effort on less-frequent words).
If it's still too slow, consider whether a smaller subsample of your corpus might be enough.
Consider the corpus_file method if you can roll much or all of your data into the single file it requires.
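A minimal sketch of that first recommendation, assuming gensim 3.x as in the question, the same 'path/to/files' directory, and placeholder values for workers, min_count and sample that you would tune:
import logging
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

files = gensim.models.word2vec.PathLineSentences('path/to/files')
model = gensim.models.word2vec.Word2Vec(workers=12, min_count=50, sample=1e-05)
model.build_vocab(files)                    # slow, single-threaded scan over all data
model.save('w2v_vocab_only.model')          # checkpoint so the scan never has to be repeated
model.train(files, total_examples=model.corpus_count, epochs=5)
model.save('w2v_trained.model')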

Measure parallel speedup in randomized-algorithms

I have a randomized program with sequential and parallel variants. The nature of the program is that its run-time varies drastically depending on its "luck": it regularly takes values between 1 sec and 2 min in a seemingly geometric-distribution-ish pattern. The parallel variants show similar behaviour with different numbers.
What is a "good" way to measure parallel speedup in this case?
I could just use the mean/median of the measured values as a representative for "the run-time".
How would I justify such an approach, and is there a (statistically/mathematically) better way to calculate the speedup?
EDIT: Thanks to user3666197, who noted some very important technical details necessary for getting good data.
I have done that homework and want to clarify my question.
I made my benchmark process as reliable as I could get it:
the benchmarks are run with seeds, with which the results are reproducible.
every configuration is repeated multiple times (~400 times) with different seeds inside a script
My question remains: How to approach the calculation of the speedup for this program.
What I have done:
The mean sequential run-time is about 8.38, while the median is 4.8, which is a big difference. For 2 threads, the mean run-time is 4.36 while the median is 2.42.
If I divide sequential by parallel I get speedups of 1.92 (means) and 1.992 (medians).
For 4 threads it is similar: the means give a run-time of 2.25 and a speedup of 3.72; the medians give a run-time of 1.12 and a speedup of 4.3 (superlinear).
Similar numbers hold for 8 threads.
I have tried to visualize the data in different ways (see Plots).
The histogram shows the distribution of run-times for the various thread counts, as does the boxplot to its right. Some speedup is clearly visible.
If I pair the measurements by seed, I get pairs of times: sequential and parallel.
One of my first ideas was to calculate the speedup as the slope of a regression line; however, the regression line does not seem to "summarize" the data properly and has limited value. In the bottom-right plot, only the points for 4 threads are shown.
I would recommend you compute the speedup based on the arithmetic mean of the runtimes of a sufficiently large set of measurements. Make sure to communicate clearly what the numbers represent. It can be difficult to ensure that you have a large enough set of measurements to compute a proper mean with a certain confidence, especially since your samples are not normally distributed. Include your findings about the distribution and confidence. Make sure to summarize the runtimes first, before computing the speedup.
There is an excellent paper by Torsten Hoefler and Roberto Belli which covers your issues in detail. Particularly Sections 2.1.1 and 3.
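A rough sketch of that calculation in Python, assuming seq_times and par_times are arrays of your ~400 paired measurements (the bootstrap here is just one way of attaching a confidence interval to a ratio of means):
import numpy as np

def speedup_with_ci(seq_times, par_times, n_boot=10000, seed=0):
    # Summarize run-times first, then divide the means to get the speedup.
    speedup = np.mean(seq_times) / np.mean(par_times)
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        s = rng.choice(seq_times, size=len(seq_times), replace=True)
        p = rng.choice(par_times, size=len(par_times), replace=True)
        boot.append(np.mean(s) / np.mean(p))
    lo, hi = np.percentile(boot, [2.5, 97.5])   # 95% bootstrap confidence interval
    return speedup, lo, hi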
How to measure a parallel-speedup versus a pure-[SERIAL] code?
Always be both quantitative and systematic.
That means at least:
1) use all systematic-steps for controlled test-repeatability
2) compare apples to apples, incl. the controlled seed-setup for randomizers
3) best, generate all test-batteries as scripted, auto-repeatable experiments
4) record the performance ( overall and local-sections timing ) in tests' UUID#-distinguishable logs
5) collect rather 1E+3 ~ 1E+4 sized populations of test-runs, not just a few units of individual trials
Given your solution is already implemented in both a pure-[SERIAL] code-execution fashion and in some other [CONCURRENT] or even [PARALLEL], the most exact step is to compare the end-to-end test durations.
It is quite common to use monotonic-clocks, having better than ~ [us] resolution in [TIME]-domain.
For more details about the internals, it is best to look at the re-formulated Amdahl's Law and the criticism of the initial, unconstrained-resources formulation of parallel speedup.
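For the end-to-end timing itself, a minimal Python sketch using a monotonic high-resolution clock (run_experiment is a placeholder for your own code):
import time

def timed_run(run_experiment, *args):
    t0 = time.perf_counter_ns()          # monotonic, sub-microsecond resolution
    result = run_experiment(*args)
    t1 = time.perf_counter_ns()
    return result, (t1 - t0) / 1e9       # elapsed seconds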

How many simulations do I need to do?

Hello, my problem is more related to the validation of a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to run to justify my results? I have already read about some statistical approaches and my colleagues have suggested some nice mathematical methods, but I also want to know from people who work with computational models what kind of statistical test or mathematical method they use for this.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2, then 125 is a reasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 range is not useful if you want to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics textbook. Basically, if you do N runs, then a 95% confidence interval for the result is the average of your N runs plus/minus 1.96 x the standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
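A sketch of that rule in code, where run_model is a hypothetical function that performs one simulation (with its own random seed) and returns the output of interest, and half_width is the precision you want on the mean:
import numpy as np

def runs_until_precise(run_model, half_width=0.1, min_runs=10, max_runs=10000):
    # Keep adding runs until the 95% confidence interval is within +/- half_width.
    results = [run_model() for _ in range(min_runs)]
    while True:
        r = np.asarray(results, dtype=float)
        ci_half = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
        if ci_half <= half_width or len(results) >= max_runs:
            break
        results.append(run_model())
    return float(np.mean(results)), ci_half, len(results)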
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the book by Hastie and Tibshirani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and the bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the commands in R to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to check that the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, you can sometimes partition the parameter space with regard to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and the mean of the result, respectively. It is explained in detail in this paper: Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide rigorous analysis methods and refer to other papers which may be relevant to your question and your research.

Comparison of Sorting Algorithms using running time in terms of seconds

I have devised a test to compare the running times of my sorting algorithm with insertion sort, bubble sort, quick sort, selection sort, and shell sort. I based my test on the one described on this website http://warp.povusers.org/SortComparison/index.html, but I modified my test a bit.
I set up a test manager server program which generates the data and sends it to the clients that run the different algorithms, so they all sort the same data and there is no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort one given dataset at sizes of 100,000 and 1,000,000.
So I changed the number of runs per test case for those two data sizes: my original run count for 100,000 was 500, which I reduced to 15, and for 1,000,000 it was 100, which I reduced to 3.
Now my professor doubts the credibility of the results because I've reduced the run counts that much, but from what I've observed the running time for sorting a specific data distribution varies only by a small percentage, which is why I believe that even with the reduced counts I can still approximate the average runtime for each test case of each algorithm.
My question now is: is my assumption wrong? Does the machine at times show significant running-time changes (>50%)? For example, when sorting the same data over and over, if the first run takes 0.3 milliseconds, could a later run differ as much as taking 1.5 seconds? Because from my observations, the running times don't vary much for the same type of test distribution (e.g. completely random, completely sorted, completely reversed).
What you are looking for is a way to measure the error in your experiments. My favorite book on the subject is Error Analysis by Taylor, and Chapter 4 has what you need, which I'll summarize here.
You need to calculate the standard error of the mean, or SDOM. First calculate the mean and standard deviation (the formulas are on Wikipedia and quite simple). The SDOM is the standard deviation divided by the square root of the number of measurements. Assuming your timings have a Normal distribution (which they should), twice the value of the SDOM is a very common way to specify the +/- error.
For example, let's say you run a sorting algorithm 5 times and get the following numbers: 5, 6, 7, 4, 5. Then the mean is 5.4 and the standard deviation is 1.1. Therefore the SDOM is 1.1/sqrt(5) = 0.5, so 2*SDOM = 1. Now you can say that the algorithm run time was 5.4 ± 1. Your professor can determine whether this is an acceptable measurement error. Notice that as you take more readings the SDOM, i.e. the plus-or-minus error, goes down in inverse proportion to the square root of N. The 2*SDOM interval has a 95% probability (confidence) that the true value lies within it, which is the accepted standard.
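A quick sketch that reproduces those numbers (using the sample standard deviation):
import numpy as np

times = np.array([5, 6, 7, 4, 5])
mean = times.mean()                       # 5.4
std = times.std(ddof=1)                   # ~1.14
sdom = std / np.sqrt(len(times))          # ~0.51
print(f"{mean:.1f} +/- {2 * sdom:.1f}")   # 5.4 +/- 1.0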
Also, you most likely want to measure performance using CPU time instead of a simple timer. Modern CPUs are too complex, with various cache levels and pipeline optimizations, and you might end up with a less accurate measurement if you use a wall-clock timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?
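For example, in Python you could record both and compare them (a rough sketch; the linked answer covers doing the same in C++ on Linux/Windows):
import time

def time_sort(sort_func, data):
    wall0, cpu0 = time.perf_counter(), time.process_time()
    sort_func(list(data))                 # copy so every run sorts the same input
    wall1, cpu1 = time.perf_counter(), time.process_time()
    return wall1 - wall0, cpu1 - cpu0     # wall-clock seconds vs CPU seconds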
It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way: it takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to finish faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample share a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).
Your professor is right; however, he has not given the details, some of which I mention here:
Sampling issue: it's true that you generate some random numbers and feed them to your sorting methods, but with only a few test cases you are indeed biased, because almost all random number generators are biased to some extent (especially towards the state of the machine or the time of the run), so you should use more test cases to be more confident about the randomness.
Machine state: even if you provide perfect data (fully representative of a uniform distribution), the performance of devices like computers may vary under different conditions, so you should repeat the measurements enough times to smooth out these effects.
Note: in advanced technical reports you should provide a confidence coefficient for the answers you give, derived from statistical analysis and proven step by step; but if you don't need to be that exact, simply increase these:
The size of the data
The number of tests

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
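For concreteness, both estimators as small Python helpers (the names and units are illustrative; size_done and total_size in bytes, times from time.time()):
import time

def etc_overall(start_time, total_size, size_done):
    # Method 1: assume the average speed since the start continues.
    now = time.time()
    if size_done <= 0:
        return None                       # no progress yet, no estimate
    return now + (now - start_time) * (total_size - size_done) / size_done

def etc_recent(recent_bytes, recent_seconds, total_size, size_done):
    # Method 2: assume the speed over the last n seconds continues.
    if recent_bytes <= 0:
        return None
    return time.time() + recent_seconds / recent_bytes * (total_size - size_done)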
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of the prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
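In Python rather than R, fitting those weights could look like the sketch below, assuming you have logged, for a set of past checkpoints, both predictions and the actual remaining time (all array names here are hypothetical):
import numpy as np

def fit_weights(pred_overall, pred_recent, actual_remaining):
    # Least squares for: actual ~ w1 * method1 + w2 * method2 + bias
    X = np.column_stack([pred_overall, pred_recent, np.ones(len(actual_remaining))])
    (w1, w2, bias), *_ = np.linalg.lstsq(X, actual_remaining, rcond=None)
    return w1, w2, bias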
An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
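A minimal one-dimensional sketch of that idea, modelling the copy speed as a slowly drifting value; q (process noise) and r (measurement noise) are tuning knobs, so treat this as an illustration rather than a tuned filter:
class SpeedKalman:
    def __init__(self, initial_speed=0.0, q=1e-3, r=1.0):
        self.x = initial_speed    # current speed estimate
        self.p = 1.0              # variance of that estimate
        self.q = q                # how fast the true speed is allowed to drift
        self.r = r                # how noisy the raw per-interval measurements are

    def update(self, measured_speed):
        self.p += self.q                    # predict: speed roughly constant, uncertainty grows
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (measured_speed - self.x)
        self.p *= (1.0 - k)
        return self.x                       # smoothed speed estimate
Each time a new bytes-per-second measurement arrives, feed it through update() and compute the ETC from the smoothed speed; with a large r relative to q the filter behaves like a long averaging window, and with a small r it tracks changes quickly.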
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and it is hard to distinguish meaningful ETCs from meaningless ones, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Apart from statistical approaches, one simple way to get a good estimate of the current speed while smoothing out noise and spikes is to take a weighted approach.
You already experimented with a sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, give more weight to more recent measurements, since they are more indicative of the evolution (a bit like a derivative).
Example: suppose you keep the last 10 windows (most recent x0, least recent x9), where each xi is the amount copied during window i; then you could compute the speed as a weighted average (the weights 10 + 9 + ... + 1 sum to 55):
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)
When you have a good assessment of the likely speed, you are close to getting a good estimated time.
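A sketch of that weighted window (the class and attribute names are illustrative; each sample is the number of bytes copied during one window of window_time seconds):
from collections import deque

class WeightedSpeed:
    def __init__(self, n_windows=10, window_time=1.0):
        self.window_time = window_time
        self.samples = deque(maxlen=n_windows)    # bytes copied in each recent window

    def add(self, bytes_in_window):
        self.samples.appendleft(bytes_in_window)  # most recent first (x0)

    def speed(self):
        if not self.samples:
            return 0.0
        weights = range(len(self.samples), 0, -1)             # 10, 9, ..., 1
        weighted = sum(w * x for w, x in zip(weights, self.samples))
        return weighted / (sum(weights) * self.window_time)   # bytes per second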
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have shown that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should be conservative in the estimates presented (reserving time for a potential slow-down), at least at first.
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, a cubic factor produces a nice speed-up toward the completion time. Other choices could use a normalized exponential such as (e^x - 1)/(e - 1), etc...
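A tiny sketch of that tweak, reading "the cubic function" as factor(x) = x**3 (which satisfies the constraints above):
def presented_completion(real_completion, factor=lambda x: x ** 3):
    # factor must map [0, 1] into [0, 1], with factor(x) <= x and factor(1) == 1.
    return real_completion * factor(real_completion)

print(presented_completion(0.4))   # ~0.026: understated early on
print(presented_completion(1.0))   # 1.0: no understatement at the finish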
