I want to measure the time taken to build a model. But when I run the same algorithm multiple times, I get a different build time on each run. Why does it differ? Why isn't the time taken to build the model constant? And if the time fluctuates even for the same algorithm, how can I compare the time taken by two different algorithms?
double start_time = System.nanoTime() / Math.pow(10, 9);
KMeansModel clusters = KMeans.train(rdd.rdd(), numClusters, numIterations, init_mode, 30);
System.out.println("time taken to build model: " + (System.nanoTime() / Math.pow(10, 9) - start_time));
You can measure model training time as an average over multiple runs: e.g. run the training 100 times and divide the total time by 100. This should give more consistent values between batches of runs, assuming the algorithm doesn't use much randomness during training.
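As a minimal sketch of that averaged timing in Python (train_model below is just a placeholder for whatever training call you are benchmarking, e.g. the KMeans.train call above):

import time

def time_training(train_model, n_runs=100):
    # run the training n_runs times and return the average wall-clock time per run
    start = time.perf_counter()
    for _ in range(n_runs):
        train_model()  # placeholder: your actual training call goes here
    total = time.perf_counter() - start
    return total / n_runs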
If you're using Spark, it might be useful to enable INFO-level logging to see per-run stats. For KMeans, you should see, among others:
logInfo(f"Initialization with $initializationMode took $initTimeInSeconds%.3f seconds.")
logInfo(f"Iterations took $iterationTimeInSeconds%.3f seconds.")
logInfo(s"KMeans converged in $iteration iterations.")
I am running an Elastic Net model using sklearn. My dataset has 70k observations and 20 features. I want to test different parameters and use the following code:
import numpy as np
from sklearn.linear_model import ElasticNet

alpha_plot, l1_ratio_plot = np.linspace(min_xlim, max_xlim, 50), np.linspace(0, 1, 10)
alpha_grid, l1_ratio_grid = np.meshgrid(alpha_plot, l1_ratio_plot)
l1_ratio_alpha_grid = np.array([l1_ratio_grid.ravel(), alpha_grid.ravel()]).T
model_coefficients_analysis = []
for i in l1_ratio_alpha_grid:
    model_analysis = ElasticNet(alpha=i[1], l1_ratio=i[0], fit_intercept=True, max_iter=10000).fit(self.features_train_std, self.labels_train)
    model_coefficients_analysis.append(model_analysis.coef_)
I am aware that this can be done with GridSearchCV, but it doesn't do the job for me because I need to store the coefficients for every combination of parameters tested. The current code snippet is exceptionally slow: it takes roughly 10 minutes for each of the 50*10 iterations. Is there a way to speed up the process? For example, GridSearchCV has an n_jobs parameter which can be set to -1 to speed things up, but here I don't seem to find it.
It takes roughly 10 minutes for each of the 50*10 iterations
That seems very high, but you also have rather large data; I can't even fit a randomized dataset of that size in memory in Colab (where I usually run examples for answers here). You might not be able to shrink the first fit time very much, but maybe you can reduce the subsequent fit times by using warm starting.
If you set warm_start=True and reuse the same model object for each iteration, the coefficients from one fit are kept as the starting point for the solver in the next iteration:
model_analysis = ElasticNet(warm_start=True, fit_intercept=True, max_iter=10000)
for i in l1_ratio_alpha_grid:
    model_analysis.set_params(alpha=i[1], l1_ratio=i[0])
    model_analysis.fit(self.features_train_std, self.labels_train)
    model_coefficients_analysis.append(model_analysis.coef_)
You might consider using ElasticNetCV, since it uses warm-starting internally, and it provides some other niceties. You can use a PredefinedSplit if adding k-fold cross-validation is too much of an added expense, but I believe the n_jobs parameter is only useful in splitting up jobs across hyperparameters and folds, so using more cores might mitigate the issues with k-fold (but then you'll also have k times as many coefficients).
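A rough sketch of that ElasticNetCV route, with a PredefinedSplit standing in for a single train/validation split (variable names like features_train_std, labels_train, min_xlim, max_xlim are borrowed from the question; note that ElasticNetCV keeps the MSE path and the winning alpha/l1_ratio, not the coefficients of every combination):

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import PredefinedSplit

# single train/validation split instead of k-fold: -1 keeps a row in training, 0 puts it in the validation fold
test_fold = np.full(len(labels_train), -1)
test_fold[-(len(labels_train) // 5):] = 0  # e.g. hold out the last 20% for validation
cv = PredefinedSplit(test_fold)

model = ElasticNetCV(
    l1_ratio=np.linspace(0.01, 1, 10),          # avoid an l1_ratio of exactly 0 here
    alphas=np.linspace(min_xlim, max_xlim, 50),
    cv=cv, n_jobs=-1, max_iter=10000,
)
model.fit(features_train_std, labels_train)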
Your large max_iter is a bit worrying; do you get nonconvergence? From your independent variable name it seems like you're scaling, but if not that's the place to start: fast (and maybe correct) convergence depends on features with similar scales. You might also consider increasing the convergence criterion tol. I have no experience with the selection parameter, but the docstring suggests changing it to random may speed up convergence?
There's a table at the bottom of https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu that talks about images per second and step time, in the context of performance in deep learning.
How does one compute images/second and step time?
When training in Tensorflow using the Estimator API, there's a number reported as global_step/sec. Is this the same thing? If so, does that take into account the time required to process the input pipeline, or just the time it takes to do the forward pass through the model?
global_step/sec reports how many steps per second your TensorFlow model is executing. A step is usually one minibatch, so the inverse of global_step/sec is your step time, and batch_size * global_step/sec is your number of images per second.
Because these numbers are throughput numbers computed in the steady state of your model training, they include the input pipeline (i.e. if your input pipeline cannot produce X minibatches per second, then you necessarily run at less than X global_step/sec).
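As a quick worked example of those two conversions (the numbers here are made up):

batch_size = 32            # images per minibatch (assumed)
global_step_per_sec = 5.0  # as reported by the Estimator logs

step_time = 1.0 / global_step_per_sec              # 0.2 seconds per step
images_per_sec = batch_size * global_step_per_sec  # 160 images per second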
I'm trying to understand a paper I'm reading regarding genetic algorithms.
They run a GA there with the parameters they presented.
Some of the parameters are:
stop criterion - 50 generations.
Runs per configuration - 30.
After they presented the parameters they said that they executed the algorithm 20 times.
I don't understand two things:
1. What does the second parameter mean? That every configuration runs until it reaches 30 generations?
2. When they execute the algorithm 20 times, does it mean they run until they reach 20 generations, or that they executed 20 configurations?
Thanks.
Question 1
Usually in GAs, parameters such as mutation rates, selection rates, etc. need to be customised for the problem at hand. To get a reasonably good measure of what is a good or bad parameter configuration, each configuration is repeated for a number of runs, after which the mean and standard deviation of the best solution found in each of these runs are computed. Note that these runs have nothing to do with the number of generations or any specific parameter of the GA; repeating runs is simply a way to reduce noise and increase reliability when comparing and choosing among different parameter configurations.
On the other hand, "number of runs" can itself be a parameter of study, but it's quite obvious that increasing the total number of runs will increase the chance of finding a really good solution (say, a large deviation from mean GA performance). When studying such a case, it's important that the total number of generations (effectively the total number of function evaluations) over all runs is the same between different parameter settings (a small sketch after the example configurations below illustrates this bookkeeping).
As an example, consider the following parameter configuration:
numberOfRuns = 30
numberOfGenerations = 50
==> a total of 1500 generations analysed
studied versus the configuration:
numberOfRuns = 50
numberOfGenerations = 30
==> a total of 1500 generations analysed
A study like this could examine whether it is favourable to have more generations in each GA run (first scenario), or whether it's better to have fewer generations but more runs (second scenario: probably favourable for a GA that converges quickly to local optima but has a large standard deviation).
The following parameter configuration, however, would not be meaningful to benchmark against the two above, since it has a larger total number of generations:
numberOfRuns = 50
numberOfGenerations = 50
==> a total of 2500 generations analysed
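As a small sketch of that bookkeeping (run_ga below is a hypothetical function standing in for one complete GA run that returns the best fitness it found; it is not from the paper):

import statistics

def evaluate_configuration(run_ga, number_of_runs=30, number_of_generations=50):
    # repeat independent GA runs and summarise the best solution found in each run
    best_per_run = [run_ga(generations=number_of_generations) for _ in range(number_of_runs)]
    return {
        "mean_best": statistics.mean(best_per_run),
        "std_best": statistics.stdev(best_per_run),
        "total_generations": number_of_runs * number_of_generations,  # the budget to keep fixed when comparing settings
    }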
Question 2
If they say they executed the algorithm 20 times, that would generally be equivalent to 20 runs as described above. Hence, 20 runs, each running over 50 generations.
I have devised a test to compare the running times of my sorting algorithm with insertion sort, bubble sort, quick sort, selection sort, and shell sort. I have based my test on the one done on this website http://warp.povusers.org/SortComparison/index.html, but I modified my test a bit.
I set up a test-manager server program which generates the data and sends it to the clients that run the different algorithms, so they all sort the same data and there is no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort one given dataset at sizes of 100,000 and 1,000,000.
So I changed the number of runs per test case for those two data sizes. My original number of runs for 100,000 was 500, but I reduced it to 15; for 1,000,000 it was 100, and I reduced it to 3.
Now my professor doubts the credibility of the results because I've reduced the number of runs that much, but from what I've observed, the running time for sorting a specific data distribution varies only by a small percentage. That's why I think that even with far fewer runs I'd still be able to approximate the average runtime for that specific test case of that algorithm.
My question now is: is my assumption wrong? Does the machine at times produce significant running-time changes (>50%)? For example, when sorting the same data over and over, if a first run takes 0.3 milliseconds, could a second run differ as much as taking 1.5 seconds? From my observation, the running times don't vary much given the same type of test distribution (e.g. completely random, completely sorted, completely reversed).
What you are looking for is a way to measure the error in your experiments. My favorite book on the subject is Error Analysis by Taylor; Chapter 4 has what you need, which I'll summarize here.
You need to calculate the standard error of the mean, or SDOM. First calculate the mean and standard deviation (the formulas are on Wikipedia and quite simple). Your SDOM is the standard deviation divided by the square root of the number of measurements. Assuming your timings have a normal distribution (which they should), twice the value of the SDOM is a very common way to specify the +/- error.
For example, let's say you run a sorting algorithm 5 times and get the following numbers: 5, 6, 7, 4, 5. Then the mean is 5.4 and the standard deviation is 1.1. Therefore the SDOM is 1.1/sqrt(5) = 0.5, so 2*SDOM = 1. Now you can say that the algorithm's run time was 5.4 ± 1. Your professor can determine if this is an acceptable measurement error. Notice that as you take more readings, your SDOM, i.e. the plus-or-minus error, goes down in inverse proportion to the square root of N. A two-SDOM interval has a 95% probability (confidence) that the true value lies within it, which is the accepted standard.
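That worked example can be reproduced with a few lines (a sketch; the timings list is the made-up one from above):

import math
import statistics

timings = [5, 6, 7, 4, 5]             # your measured run times
mean = statistics.mean(timings)       # 5.4
std = statistics.stdev(timings)       # sample standard deviation, ~1.14
sdom = std / math.sqrt(len(timings))  # standard error of the mean, ~0.51
print(f"run time = {mean:.1f} +/- {2 * sdom:.1f}")  # 5.4 +/- 1.0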
Also, you most likely want to measure performance using CPU time instead of a simple wall-clock timer. Modern CPUs are complex, with various cache levels and pipeline optimizations, and you might end up with a less accurate measurement if you use a plain timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?
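In Python, for example, the two clocks can be read side by side (a sketch; sorted() here is just a stand-in for the sorting algorithm under test):

import time

data = list(range(100000, 0, -1))  # example input: a reversed sequence

cpu_start = time.process_time()    # CPU time consumed by this process only
wall_start = time.perf_counter()   # wall-clock time
result = sorted(data)              # stand-in for the algorithm under test
cpu_elapsed = time.process_time() - cpu_start
wall_elapsed = time.perf_counter() - wall_start
print(f"CPU: {cpu_elapsed:.4f} s, wall: {wall_elapsed:.4f} s")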
It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way. It takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to complete it faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample have a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).
Your professor is right; however, he hasn't provided the details, so I mention some of them here:
Sampling issue: It's right that you generate some random numbers and feed them to your sorting methods, but with only a few test cases you are indeed biased, because almost all random functions are biased to some extent (especially to the state of the machine or the time at that moment), so you should use more and more test cases to be more confident about the randomness.
Machine state: Even supposing you've provided perfect data (fully representative of a uniform distribution), the performance of electro-mechanical devices like computers may vary in different situations, so you should repeat the tests a considerable number of times to smooth out the effects of these phenomena.
Note: In advanced technical reports, you should provide a confidence coefficient for the answers you give, derived from statistical analysis and proven step by step, but if you don't need to be that exact, simply increase these:
The size of the data
The number of tests
I am reading some papers on simulation and performance modeling. The Y axis in some figures is labeled "Seconds per Simulation Day". I am not sure what it actually means. It spans from 0 through 20 and 40 up to 120.
Another label is "Simulation years per day". I guess it means the guest OS inside the simulation environment thinks several years have passed while only a day has passed in the real world? But since simulation should slow down execution, it would seem more reasonable to me that only several hours pass inside the simulation environment while a day passes in the real world.
Thanks.
Without seeing the paper, I assume they are trying to compare the CPU time it takes to get to some physical time in a simulation.
So "Seconds per Simulation Day" is likely the walltime it took to get 24 hours in the simulation.
Likewise, "Simulation Years per Day" is physical time of simulation/real life day.
Of course, without seeing the paper it's impossible to know for sure. It's also possible they are looking at CPU-seconds or CPU-days, which would be nCPUs*walltime.
Simulations typically run in discrete time units, called time steps. If you'd like to simulate a certain process that spans a certain amount of time in the simulation, you have to perform a certain number of time steps. If the length of a time step is fixed, the number of steps is just the simulated time divided by the length of the time step. Calculations in each time step take a certain amount of time, and the total run time for the simulation equals the number of time steps times the time it takes to perform one time step:
(1) total_time = (simulation_time / timestep_length) * run_time_per_timestep
Now several benchmark quantities can be obtained by fixing different parameters on the left-hand side. E.g. if you fix simulation_time = 1 day, then total_time gives you the total run time needed to simulate one day, i.e.
(2) seconds_per_sim_day = (1 day / timestep_length) * run_time_per_timestep
Large values of seconds_per_sim_day could mean:
it takes too much time to compute a single time step, i.e. run_time_per_timestep is too high -> the computation algorithm should be optimised for speed;
the time step is too short -> search for better algorithms that can accept larger time steps and still produce (almost) the same result.
On the other hand, if you solve (1) for simulation_time and fix total_time = 1 day, you get the number of time steps that can be performed per day times the length of the time step, or the total simulation time that can be achieved per day of computation:
(3) simulation_time_per_day = (1 day / run_time_per_timestep) * timestep_length
Now one can observe that:
larger time steps lead to larger values of simulation_time_per_day, i.e. longer simulations can be computed;
if it takes too much time to compute a time step, the value of simulation_time_per_day would go down.
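As a concrete made-up example of formulas (2) and (3), suppose each time step advances the simulation by 30 seconds and takes 0.05 wall-clock seconds to compute:

timestep_length = 30.0        # simulated seconds covered by one time step (assumed)
run_time_per_timestep = 0.05  # wall-clock seconds needed to compute one time step (assumed)

# formula (2): wall-clock seconds needed to simulate one day (86400 simulated seconds)
seconds_per_sim_day = (86400 / timestep_length) * run_time_per_timestep      # 144.0

# formula (3): simulated seconds achieved per day (86400 s) of computation
simulation_time_per_day = (86400 / run_time_per_timestep) * timestep_length  # 51,840,000 s, i.e. 600 simulated days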
Usually those figures could be used when making decisions about buying CPU time at some computing centre. For example, you would like to simulate 100 years, then just divide that by the amount of simulation years per day and you get how many compute days you would have to pay (or wait) for. Larger values of simulation_time_per_day definitely benefit you in this case. If, on the other hand, you only have 10 compute days at your disposal, then you can compute how long of a simulation could be computed and make some decisions, e.g. more short simulations but with many different parameters vs. less but longer simulations with some parameters that you have predicted to be the optimal ones.
In real life things are much more complicated. Usually computing each time step takes a different amount of time (although there are cases where every time step takes exactly the same amount of time), and it strongly depends on the simulation size, configuration, etc. That's why standardised tests exist and usually some averaged value is reported.
Just to summarise: given that all test parameters are kept equal,
faster computers would give less "seconds per simulation day" and more "simulation years per day"
slower computers would give more "seconds per simulation day" and less "simulation years per day"
By the way, both quantities are reciprocal and related by this simple equation:
simulation_years_per_day = 236.55 / seconds_per_simulation_day
(that is, "simulation years per day" equals 86400 divided by "seconds per simulation day", which gives you the simulation days per day, and then divided by 365.25 to convert the result into years)
So it doesn't really matter whether "simulation years per day" or "seconds per simulation day" is presented. One just has to choose the representation which most clearly shows how much better the newer system is than the previous/older/existing one :)