Using Gekko Optimization, why is my model builder so much slower than my solver?

I am working on a fairly large MINLP with about 270,000 variables and equations, 5,000 of which are binary. Using Gekko with the APOPT solver, I can solve the problem in about 868 seconds (less than 15 minutes). However, when I solve it on a supercomputer (for the increased memory), it takes around 27 hours to produce the results.
It seems to be spending all of its time creating the model. From what I have read about APOPT, it works best when the number of degrees of freedom is less than 2,000 (mine is about 3,500). However, I also read that it is the only mixed-integer solver available with Gekko. Is that right?
I'm curious whether this is the case or whether there are other options within Gekko (I would prefer to code in Python). In practice, I will need to run this code many times with different uploaded Excel sheets, so any way to save the model construction for future runs would also be helpful.

That is an impressive MINLP problem size. To determine how to make the pre-processing faster, you'll need to collect some additional information about where the time is used by setting DIAGLEVEL>=1.
m.options.DIAGLEVEL = 1
This produces a report of how long it takes for each of the steps. Here is an example MINLP problem (see #10).
from gekko import GEKKO
m = GEKKO() # Initialize gekko
m.options.SOLVER=1 # APOPT is an MINLP solver
m.options.DIAGLEVEL = 1
# optional solver settings with APOPT
m.solver_options = ['minlp_maximum_iterations 500', \
# minlp iterations with integer solution
'minlp_max_iter_with_int_sol 10', \
# treat minlp as nlp
'minlp_as_nlp 0', \
# nlp sub-problem max iterations
'nlp_maximum_iterations 50', \
# 1 = depth first, 2 = breadth first
'minlp_branch_method 1', \
# maximum deviation from whole number
'minlp_integer_tol 0.05', \
# convergence tolerance
'minlp_gap_tol 0.01']
# Initialize variables
x1 = m.Var(value=1,lb=1,ub=5)
x2 = m.Var(value=5,lb=1,ub=5)
# Integer constraints for x3 and x4
x3 = m.Var(value=5,lb=1,ub=5,integer=True)
x4 = m.Var(value=1,lb=1,ub=5,integer=True)
# Equations
m.Equation(x1*x2*x3*x4>=25)
m.Equation(x1**2+x2**2+x3**2+x4**2==40)
m.Obj(x1*x4*(x1+x2+x3)+x3) # Objective
m.solve(disp=True) # Solve
This produces the following timing results:
Timer # 1 0.03/ 1 = 0.03 Total system time
Timer # 2 0.02/ 1 = 0.02 Total solve time
Timer # 3 0.00/ 42 = 0.00 Objective Calc: apm_p
Timer # 4 0.00/ 29 = 0.00 Objective Grad: apm_g
Timer # 5 0.00/ 42 = 0.00 Constraint Calc: apm_c
Timer # 6 0.00/ 0 = 0.00 Sparsity: apm_s
Timer # 7 0.00/ 0 = 0.00 1st Deriv #1: apm_a1
Timer # 8 0.00/ 29 = 0.00 1st Deriv #2: apm_a2
Timer # 9 0.00/ 1 = 0.00 Custom Init: apm_custom_init
Timer # 10 0.00/ 1 = 0.00 Mode: apm_node_res::case 0
Timer # 11 0.00/ 1 = 0.00 Mode: apm_node_res::case 1
Timer # 12 0.00/ 1 = 0.00 Mode: apm_node_res::case 2
Timer # 13 0.00/ 1 = 0.00 Mode: apm_node_res::case 3
Timer # 14 0.00/ 89 = 0.00 Mode: apm_node_res::case 4
Timer # 15 0.00/ 58 = 0.00 Mode: apm_node_res::case 5
Timer # 16 0.00/ 0 = 0.00 Mode: apm_node_res::case 6
Timer # 17 0.00/ 29 = 0.00 Base 1st Deriv: apm_jacobian
Timer # 18 0.00/ 29 = 0.00 Base 1st Deriv: apm_condensed_jacobian
Timer # 19 0.00/ 1 = 0.00 Non-zeros: apm_nnz
Timer # 20 0.00/ 0 = 0.00 Count: Division by zero
Timer # 21 0.00/ 0 = 0.00 Count: Argument of LOG10 negative
Timer # 22 0.00/ 0 = 0.00 Count: Argument of LOG negative
Timer # 23 0.00/ 0 = 0.00 Count: Argument of SQRT negative
Timer # 24 0.00/ 0 = 0.00 Count: Argument of ASIN illegal
Timer # 25 0.00/ 0 = 0.00 Count: Argument of ACOS illegal
Timer # 26 0.00/ 1 = 0.00 Extract sparsity: apm_sparsity
Timer # 27 0.00/ 13 = 0.00 Variable ordering: apm_var_order
Timer # 28 0.00/ 1 = 0.00 Condensed sparsity
Timer # 29 0.00/ 0 = 0.00 Hessian Non-zeros
Timer # 30 0.00/ 1 = 0.00 Differentials
Timer # 31 0.00/ 0 = 0.00 Hessian Calculation
Timer # 32 0.00/ 0 = 0.00 Extract Hessian
Timer # 33 0.00/ 1 = 0.00 Base 1st Deriv: apm_jac_order
Timer # 34 0.01/ 1 = 0.01 Solver Setup
Timer # 35 0.00/ 1 = 0.00 Solver Solution
Timer # 36 0.00/ 53 = 0.00 Number of Variables
Timer # 37 0.00/ 35 = 0.00 Number of Equations
Timer # 38 0.01/ 14 = 0.00 File Read/Write
Timer # 39 0.00/ 0 = 0.00 Dynamic Init A
Timer # 40 0.00/ 0 = 0.00 Dynamic Init B
Timer # 41 0.00/ 0 = 0.00 Dynamic Init C
Timer # 42 0.00/ 1 = 0.00 Init: Read APM File
Timer # 43 0.00/ 1 = 0.00 Init: Parse Constants
Timer # 44 0.00/ 1 = 0.00 Init: Model Sizing
Timer # 45 0.00/ 1 = 0.00 Init: Allocate Memory
Timer # 46 0.00/ 1 = 0.00 Init: Parse Model
Timer # 47 0.00/ 1 = 0.00 Init: Check for Duplicates
Timer # 48 0.00/ 1 = 0.00 Init: Compile Equations
Timer # 49 0.00/ 1 = 0.00 Init: Check Uninitialized
Timer # 50 -0.00/ 13 = -0.00 Evaluate Expression Once
Timer # 51 0.00/ 0 = 0.00 Sensitivity Analysis: LU Factorization
Timer # 52 0.00/ 0 = 0.00 Sensitivity Analysis: Gauss Elimination
Timer # 53 0.00/ 0 = 0.00 Sensitivity Analysis: Total Time
APOPT stores the problem instance between NLP runs, so it can quickly re-evaluate the problem with different constraints as it performs branch and bound. APOPT uses a warm-start feature to rapidly evaluate the constrained NLP sub-problems. However, this warm-start feature isn't available to the Gekko user. There are other solvers available with Gekko (one of which could be configured for MINLP), but they require a commercial license. There are also free MINLP solvers, such as Couenne and Bonmin from COIN-OR, but they aren't supported yet. If you determine that APOPT pre-processing is the problem and you'd like to try another solver, you can add a feature request for Gekko. Here is the optimization result that shows the timing for each iteration.
----------------------------------------------
Steady State Optimization with APOPT Solver
----------------------------------------------
Iter: 1 I: 0 Tm: 0.00 NLPi: 7 Dpth: 0 Lvs: 3 Obj: 1.70E+01 Gap: NaN
--Integer Solution: 1.75E+01 Lowest Leaf: 1.70E+01 Gap: 3.00E-02
Iter: 2 I: 0 Tm: 0.00 NLPi: 5 Dpth: 1 Lvs: 2 Obj: 1.75E+01 Gap: 3.00E-02
Iter: 3 I: 0 Tm: 0.00 NLPi: 6 Dpth: 1 Lvs: 2 Obj: 1.75E+01 Gap: 3.00E-02
--Integer Solution: 1.75E+01 Lowest Leaf: 1.70E+01 Gap: 3.00E-02
Iter: 4 I: 0 Tm: 0.00 NLPi: 6 Dpth: 2 Lvs: 1 Obj: 2.59E+01 Gap: 3.00E-02
Iter: 5 I: 0 Tm: 0.00 NLPi: 5 Dpth: 1 Lvs: 0 Obj: 2.15E+01 Gap: 3.00E-02
No additional trial points, returning the best integer solution
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 1.649999999790452E-002 sec
Objective : 17.5322673012512
Successful solution
---------------------------------------------------
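The timer report above covers the work done by the solver back end. If most of the time seems to go into creating the model, as the question suggests, it can help to time the Python side separately as well. Below is a minimal standard-library sketch. `build_model` and `solve_model` are hypothetical placeholders: swap in your own Gekko model-building code (the `m.Var` / `m.Equation` / `m.Obj` calls) and your `m.solve()` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings, name):
    """Record the wall-clock seconds spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


# Hypothetical stand-ins: replace with your Gekko model construction and m.solve().
def build_model(n):
    return [i * 0.5 for i in range(n)]


def solve_model(model):
    return sum(model)


timings = {}
with stage_timer(timings, "build"):
    model = build_model(100_000)
with stage_timer(timings, "solve"):
    result = solve_model(model)

for stage, seconds in timings.items():
    print(f"{stage:>6}: {seconds:.3f} s")
```

If "build" dominates, the Python-side tips below are the place to start. If "solve" dominates, compare its total with the DIAGLEVEL timers.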
Here are a few things to try to diagnose or improve your solution time:
Try the IPOPT solver for a non-integer solution. Does it still take 27 hours to complete with this solver? If IPOPT is much faster, that may indicate that APOPT pre-processing is where the time goes.
Replace Gekko constants and parameters with Python floats where possible. This reduces the amount of model processing time.
Use built-in Gekko functions such as m.sum() instead of the Python sum function. This generally improves model processing performance.
Do automatic model reduction with m.options.REDUCE=3, or manual model reduction with Intermediate variables (m.Intermediate).
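Here is one plausible reason a built-in sum helps, shown with a toy class rather than Gekko's real objects. Suppose every `+` on a symbolic variable creates a new expression whose text contains both operands. Then Python's `sum()` rebuilds an ever-longer expression at each step, so the total text created grows quadratically with the number of terms. A single n-ary sum builds the text once. `Expr` and `nary_sum` are hypothetical names used only for this illustration.

```python
class Expr:
    """Toy symbolic expression that stores its text (NOT Gekko's actual class)."""

    chars_built = 0  # running total of characters allocated for expression text

    def __init__(self, text):
        self.text = text
        Expr.chars_built += len(text)

    def __add__(self, other):
        other_text = other.text if isinstance(other, Expr) else repr(other)
        return Expr(f"({self.text})+({other_text})")

    def __radd__(self, other):
        # Python's built-in sum() starts from 0, so its first step is 0 + Expr
        return Expr(f"({other!r})+({self.text})")


def nary_sum(terms):
    """Build one flat expression in a single pass (the idea behind an m.sum-style helper)."""
    return Expr("+".join(t.text for t in terms))


n = 2000
terms = [Expr(f"x{i}") for i in range(n)]

Expr.chars_built = 0
nested = sum(terms)   # pairwise: every step copies all the text built so far
chars_nested = Expr.chars_built

Expr.chars_built = 0
flat = nary_sum(terms)  # one string, built once
chars_flat = Expr.chars_built

print(chars_nested, chars_flat)
```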

Related

Accuracy for Random Forest Algorithm is 0.0

I'm doing a machine learning project in a Jupyter notebook. I'm using a Random Forest with GridSearchCV; the execution works fine, but I get Accuracy = 0.0.
When I tried a Decision Tree, the accuracy was 99.99.
How do I solve this issue?
Input
#Training the RandomForest Algorithm
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
rfc=RandomForestClassifier(random_state=42)
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth' : [5, 10, 20],
'min_samples_leaf': [1, 2, 3, 4, 5, 10, 20]
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_
rfc1=RandomForestClassifier(random_state=42, n_estimators= 50, max_depth=5, criterion='gini')
rfc1.fit(X_train, y_train)
Which gives an output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=50, n_jobs=1, oob_score=False, random_state=42,
verbose=0, warm_start=False)
INPUT:
pred=rfc1.predict(X_test)
print("Accuracy for Random Forest on CV data: ",accuracy_score(y_test,pred))
OUTPUT:
Accuracy for Random Forest on CV data: 0.0
INPUT :
'''
Compute confusion matrix and print classification report.
'''
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
# score the model
Ntest = len(y_test)
Ntestpos = len([val for val in y_test if val])
NullAcc = float(Ntest-Ntestpos)/Ntest
print("Mean accuracy on Training set: %s" %rfc1.score(X_train, y_train))
print("Mean accuracy on Test set: %s" %rfc1.score(X_test, y_test))
print("Null accuracy on Test set: %s" %NullAcc)
print(" ")
y_pred = rfc1.predict(X_test)
f1_score(y_test, y_pred, average='weighted')
y_true, y_pred = y_test, rfc1.predict(X_test)
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\ntn=%6d fp=%6d\nfn=%6d tp=%6d" %(cm[0][0],cm[0][1],cm[1][0],cm[1][1]))
print("\nDetailed classification report: \n%s" %classification_report(y_true, y_pred))
OUTPUT:
Mean accuracy on Training set: 1.0
Mean accuracy on Test set: 0.0
Null accuracy on Test set: 0.0
with this warning:
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Confusion matrix:
tn= 0 fp= 0
fn=1745395 tp= 0
Detailed classification report:
precision recall f1-score support
0 0.00 0.00 0.00 0
1 0.00 0.00 0.00 1745395
2 0.00 0.00 0.00 143264
3 0.00 0.00 0.00 75044
4 0.00 0.00 0.00 46700
5 0.00 0.00 0.00 31568
6 0.00 0.00 0.00 22966
7 0.00 0.00 0.00 16903
8 0.00 0.00 0.00 13188
9 0.00 0.00 0.00 10160
.
.
.
119 0.00 0.00 0.00 2
123 0.00 0.00 0.00 2
124 0.00 0.00 0.00 1
141 0.00 0.00 0.00 1
165 0.00 0.00 0.00 1
avg / total 0.00 0.00 0.00 2148603

Julia pmap speed - parallel processing - dynamic programming

I am trying to speed up filling in a matrix for a dynamic programming problem in Julia (v0.6.0), and I can't seem to get much extra speed from using pmap. This is related to this question I posted almost a year ago: Filling a matrix using parallel processing in Julia. I was able to speed up serial processing with some great help then, and I'm now trying to get extra speed from parallel processing tools in Julia.
For the serial processing case, I was using a 3-dimensional matrix (essentially a set of equally-sized matrices, indexed by the 1st-dimension) and iterating over the 1st-dimension. I wanted to give pmap a try, though, to more efficiently iterate over the set of matrices.
Here is the code setup. To use pmap with the v_iter function below, I converted the three dimensional matrix into a dictionary object, with the dictionary keys equal to the index values in the 1st dimension (v_dict in the code below, with gcc equal to the 1st-dimension size). The v_iter function takes other dictionary objects (E_opt_dict and gridpoint_m_dict below) as additional inputs:
function v_iter(a,b,c)
diff_v = 1
while diff_v>convcrit
diff_v = -Inf
#These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
exp_v = zeros(Float64,gkpc,1)
A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
for j=2:gz
temp=Array{Float64}(gkpc,1)
A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
exp_v=hcat(exp_v,temp)
end
#This tries to find the optimal value of v
for h=1:gm
for j=1:gz
oldv = a[h,j]
newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
a[h,j] = newv
diff_v = max(diff_v, oldv-newv, newv-oldv)
end
end
end
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp = gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
E_opt_dict=Dict(i => E_opt[i,:,:] for i=1:gcc)
gridpoint_m_dict=Dict(i => gridpoint_m[i,:,:] for i=1:gcc)
For parallel processing, I executed the following commands:
wp = CachingPool(workers())
addprocs(3)
pmap(wp,v_iter,values(v_dict),values(E_opt_dict),values(gridpoint_m_dict))
...which produced this performance:
135.626417 seconds (3.29 G allocations: 57.152 GiB, 3.74% gc time)
I then tried serial processing instead:
for i=1:gcc
v_iter(v_dict[i],E_opt_dict[i],gridpoint_m_dict[i])
end
...and received better performance.
128.263852 seconds (3.29 G allocations: 57.101 GiB, 4.53% gc time)
This also gives me about the same performance as running v_iter on the original 3-dimensional objects:
v=zeros(Float64,gcc,gm,gz)
for i=1:gcc
v_iter(v[i,:,:],E_opt[i,:,:],gridpoint_m[i,:,:])
end
I know that parallel processing involves setup time, but when I increase the value of gcc, I still get about equal processing time for serial and parallel. This seems like a good candidate for parallel processing, since there is no need for messaging between the workers! But I can't seem to make it work efficiently.
You create the CachingPool before adding the worker processes. As a result, the caching pool you pass to pmap tells it to use just a single worker.
You can check this by inspecting wp.workers; you will see something like Set([1]).
Hence it should be:
addprocs(3)
wp = CachingPool(workers())
You could also consider starting Julia with the -p command-line parameter, e.g. julia -p 3, and then you can skip the addprocs(3) command.
On top of that, your for and pmap loops are not equivalent. The Julia Dict object is a hash map and, as in other languages, does not guarantee any element order. In your for loop you are guaranteed to get matching i-th elements, whereas values(...) need not return the values in the original order (and the order can differ for each of the three variables in the pmap call).
Since the keys of your Dicts are just the numbers 1 to gcc, you should simply use arrays instead. You can use comprehensions, very similar to Python's. For example, instead of
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
use
v_dict_a = [zeros(Float64,gm,gz) for i=1:gcc]
Hope that helps.
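The same pairing hazard can be illustrated in Python. Python dicts do preserve insertion order (guaranteed since 3.7), but zipping `.values()` still misaligns items whenever the mappings were filled in different orders. Here the reversed fill of `e_opt_dict` is an artificial choice made to expose the problem. The names mirror the question but hold placeholder strings.

```python
gcc = 5
# Three mappings keyed by the same indices 1..gcc, as in the question.
v_dict = {i: f"v{i}" for i in range(1, gcc + 1)}
e_opt_dict = {i: f"E{i}" for i in range(gcc, 0, -1)}  # same keys, filled in reverse order
grid_dict = {i: f"g{i}" for i in range(1, gcc + 1)}

# Fragile: zips whatever order each mapping happens to iterate in.
fragile = list(zip(v_dict.values(), e_opt_dict.values(), grid_dict.values()))

# Robust: drive the iteration from the keys so the i-th items always line up.
robust = [(v_dict[i], e_opt_dict[i], grid_dict[i]) for i in range(1, gcc + 1)]

# Simplest: the keys are just 1..gcc, so plain lists built by comprehension suffice.
v_a = [f"v{i}" for i in range(1, gcc + 1)]
e_opt_a = [f"E{i}" for i in range(1, gcc + 1)]
grid_a = [f"g{i}" for i in range(1, gcc + 1)]
by_position = list(zip(v_a, e_opt_a, grid_a))

print(fragile[0], robust[0])
```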
Based on Przemyslaw Szufel's helpful advice, here is the code that correctly runs in parallel. Running it once gave a substantial improvement in running time:
77.728264 seconds (181.20 k allocations: 12.548 MiB)
In addition to reordering the wp command and using the array comprehension Przemyslaw recommended, I also recast v_iter as an anonymous function, to avoid having to sprinkle @everywhere around the code to feed functions and data to the workers.
I also added return a to the v_iter function, and set v_a below equal to the output of pmap, since you cannot pass by reference to a remote object.
addprocs(3)
v_iter = function(a,b,c)
diff_v = 1
while diff_v>convcrit
diff_v = -Inf
#These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
exp_v = zeros(Float64,gkpc,1)
A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
for j=2:gz
temp=Array{Float64}(gkpc,1)
A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
exp_v=hcat(exp_v,temp)
end
#This tries to find the optimal value of v
for h=1:gm
for j=1:gz
oldv = a[h,j]
newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
a[h,j] = newv
diff_v = max(diff_v, oldv-newv, newv-oldv)
end
end
end
return a
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp =gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_a=[zeros(Float64,gm,gz) for i=1:gcc]
E_opt_a=[E_opt[i,:,:] for i=1:gcc]
gridpoint_m_a=[gridpoint_m[i,:,:] for i=1:gcc]
wp = CachingPool(workers())
v_a = pmap(wp,v_iter,v_a,E_opt_a,gridpoint_m_a)
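For readers coming from Python, `concurrent.futures` offers a close analogue of this pattern. Like pmap, `Executor.map` returns results in input order, and each worker returns its array instead of mutating the caller's. This sketch uses a placeholder update rule, not the Bellman step above, and `v_iter_sketch` is a hypothetical name. It uses `ThreadPoolExecutor` so it runs anywhere. A `ProcessPoolExecutor`, which is closer to Julia's worker processes, needs module-level functions and, on platforms that spawn processes, an `if __name__ == "__main__":` guard.

```python
from concurrent.futures import ThreadPoolExecutor


def v_iter_sketch(a, b, c):
    """Hypothetical stand-in for v_iter: updates its own copy of `a` and returns it."""
    a = list(a)  # a process-based worker would receive a copy anyway
    for h in range(len(a)):
        a[h] = 0.5 * b[h] + c[h]  # placeholder update, not the real value-function step
    return a


v_a = [[0.0] * 4 for _ in range(5)]
e_opt_a = [[10.0] * 4 for _ in range(5)]
grid_a = [[float(i)] * 4 for i in range(5)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Like pmap, Executor.map yields results in the order of its inputs,
    # so reassigning v_a collects the returned arrays.
    v_a = list(pool.map(v_iter_sketch, v_a, e_opt_a, grid_a))

print(v_a[2])
```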

Why is the Python implementation of Miller-Rabin so much faster than the Ruby one?

For one of my classes, I recently came across both a Ruby and a Python implementation of the Miller-Rabin algorithm for counting the primes between 20 and 29000. I am curious why, even though they are seemingly the same implementation, the Python code runs so much faster. I have read that Python is typically faster than Ruby, but is this much of a speed difference to be expected?
miller_rabin.rb
def miller_rabin(m,k)
t = (m-1)/2;
s = 1;
while(t%2==0)
t/=2
s+=1
end
for r in (0...k)
b = 0
b = rand(m) while b==0
prime = false
y = (b**t) % m
if(y ==1)
prime = true
end
for i in (0...s)
if y == (m-1)
prime = true
break
else
y = (y*y) % m
end
end
if not prime
return false
end
end
return true
end
count = 0
for j in (20..29000)
if(j%2==1 and miller_rabin(j,2))
count+=1
end
end
puts count
miller_rabin.py:
import math
import random
def miller_rabin(m, k):
s=1
t = (m-1)/2
while t%2 == 0:
t /= 2
s += 1
for r in range(0,k):
rand_num = random.randint(1,m-1)
y = pow(rand_num, t, m)
prime = False
if (y == 1):
prime = True
for i in range(0,s):
if (y == m-1):
prime = True
break
else:
y = (y*y)%m
if not prime:
return False
return True
count = 0
for j in range(20,29001):
if j%2==1 and miller_rabin(j,2):
count+=1
print count
When I measure the execution time of each using Measure-Command in Windows Powershell, I get the following:
Python 2.7:
Ticks: 4874403
Total Milliseconds: 487.4403
Ruby 1.9.3:
Ticks: 682232430
Total Milliseconds: 68223.243
I would appreciate any insight anyone can give me into why there is such a huge difference.
In Ruby you are using (a ** b) % c to calculate the modular exponentiation. In Python, you are using the much more efficient three-argument pow call, whose docstring explicitly states:
With three arguments, equivalent to (x**y) % z, but may be more
efficient (e.g. for longs).
Whether you want to count the lack of such a built-in operator against Ruby is a matter of opinion. On the one hand, if Ruby doesn't provide one, you might say that it's that much slower. On the other hand, you're not really testing the same thing algorithmically, so some would say that the comparison is not fair.
A quick search reveals that there are implementations of modular exponentiation for Ruby.
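To see why the three-argument form is so much cheaper, here is a sketch of square-and-multiply modular exponentiation, the standard technique for this problem. CPython's `pow(b, e, m)` is built on the same idea, though its actual implementation is more elaborate. Reducing modulo m after every multiplication keeps each intermediate below m squared, whereas `(b**t) % m` first builds the full power. `mod_pow` is a hypothetical name. Newer Ruby versions (2.5 and later) also provide `Integer#pow(b, m)` for the same purpose.

```python
def mod_pow(base, exp, mod):
    """Right-to-left binary (square-and-multiply) modular exponentiation."""
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:  # lowest bit set: multiply this power of base into the result
            result = (result * base) % mod
        base = (base * base) % mod  # square for the next bit; stays below mod
        exp >>= 1
    return result % mod  # final % handles mod == 1 with exp == 0


print(mod_pow(3, 7249, 28997), pow(3, 7249, 28997))
```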
I think these profile results should answer your question:
%self total self wait child calls name
96.81 43.05 43.05 0.00 0.00 17651 Fixnum#**
1.98 0.88 0.88 0.00 0.00 17584 Bignum#%
0.22 44.43 0.10 0.00 44.33 14490 Object#miller_rabin
0.11 0.05 0.05 0.00 0.00 32142 <Class::Range>#allocate
0.11 0.06 0.05 0.00 0.02 17658 Kernel#rand
0.08 44.47 0.04 0.00 44.43 32142 *Range#each
0.04 0.02 0.02 0.00 0.00 17658 Kernel#respond_to_missing?
0.00 44.47 0.00 0.00 44.47 1 Kernel#load
0.00 44.47 0.00 0.00 44.47 2 Global#[No method]
0.00 0.00 0.00 0.00 0.00 2 IO#write
0.00 0.00 0.00 0.00 0.00 1 Kernel#puts
0.00 0.00 0.00 0.00 0.00 1 IO#puts
0.00 0.00 0.00 0.00 0.00 2 IO#set_encoding
0.00 0.00 0.00 0.00 0.00 1 Fixnum#to_s
0.00 0.00 0.00 0.00 0.00 1 Module#method_added
It looks like Ruby's ** operator is slow compared to Python's.
It looks like (b**t) is often too big to fit in a Fixnum, so you end up using Bignum (arbitrary-precision) arithmetic, which is much slower.
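You can check the size claim quickly. The sketch below is in Python to match the other added examples. It decomposes m - 1 exactly as the question's code does and reports how many bits b**t needs. The choices m = 28997 and base m - 2 are arbitrary values from the question's range. On 64-bit builds, a Ruby Fixnum holds roughly 62 bits, so anything this large spills into Bignum.

```python
m = 28997  # an odd number from the question's range (arbitrary choice)
t, s = (m - 1) // 2, 1
while t % 2 == 0:  # same decomposition as the question's code
    t //= 2
    s += 1

b = m - 2  # a large base, as rand(m) often produces
naive = b ** t  # what (b**t) % m computes before reducing
print("bits in b**t:", naive.bit_length())
print("bits in pow(b, t, m):", pow(b, t, m).bit_length())
```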

What do large times spent in Thread#initialize and Thread#join mean in JRuby profiling?

I'm trying to profile an application using JRuby's built-in profiler.
Most of the time is taken in ClassIsOfInterest.method_that_is_of_interest, which in turn spends most of its time in Thread#initialize and Thread#join:
total self children calls method
----------------------------------------------------------------
31.36 0.02 31.35 4525 Array#each
31.06 0.00 31.06 2 Test::Unit::RunCount.run_once
31.06 0.00 31.06 1 Test::Unit::RunCount.run
31.06 0.00 31.06 1 MiniTest::Unit#run
31.06 0.00 31.05 1 MiniTest::Unit#_run
31.01 0.00 31.01 2219 Kernel.send
31.00 0.00 31.00 1 MiniTest::Unit#run_tests
31.00 0.00 31.00 1 MiniTest::Unit#_run_anything
30.99 0.00 30.99 1 Test::Unit::Runner#_run_suites
30.99 0.00 30.99 5 MiniTest::Unit#_run_suite
30.99 0.00 30.98 21629 Array#map
30.98 0.00 30.98 1 Test::Unit::TestCase#run
30.98 0.00 30.98 1 MiniTest::Unit::TestCase#run
30.98 0.00 30.98 659 BasicObject#__send__
30.98 0.00 30.98 1 MyTestClass#my_test_method
30.80 0.00 30.80 18 Enumerable.each_with_index
30.77 0.00 30.77 15 MyTestHelper.generate_call_parser_based_on_barcoded_sequence
30.26 0.00 30.25 4943 Class#new_proxy
26.13 0.00 26.13 15 MyProductionClass1#my_production_method1
<snip boring methods with zero self time>
24.27 0.00 24.27 15 ClassIsOfInterest.method_that_is_of_interest
13.71 0.01 13.71 541 Enumerable.map
13.48 0.86 12.63 30 Range#each
12.62 0.22 12.41 450 Thread.new
12.41 12.41 0.00 450 Thread#initialize
10.78 10.78 0.00 450 Thread#join
4.03 0.12 3.91 539 Kernel.require
3.34 0.00 3.34 248 Kernel.require
2.49 0.00 2.49 15 MyTestFixture.create_fixture
<snip boring methods with small total times>
Each invocation of ClassIsOfInterest.method_that_is_of_interest creates 30 threads, which is probably overkill, but I assume it shouldn't degrade performance that much. When I created only three threads per invocation, I got:
23.16 0.00 23.15 15 ClassIsOfInterest.method_that_is_of_interest
22.73 22.73 0.00 45 Thread#join
4.18 0.08 4.10 539 Kernel.require
3.56 0.00 3.56 248 Kernel.require
2.78 0.00 2.78 15 MyTestFixture.create_fixture
Do large time values for Thread#initialize (in the first profile) and Thread#join indicate that the code responsible for threading is taking a while, or merely that the code that is executed within the thread is taking a while?
The reason you see Thread#join is that your main thread is spending lots of time waiting for the other threads to finish. Most of the time spent in method_that_is_of_interest is spent blocking on Thread#join because it's not doing any other work. I wouldn't worry too much about it; the profile is just saying that one of your threads is blocking on what other threads are doing. A better performance measurement in this case is the total running time: run the code with different numbers of threads and see where the sweet spot is.
The reason Thread.new/Thread#initialize shows up is that threads are expensive objects to create. If you're calling this method often and it creates all those threads every time, I suggest you look into Java's Executors API. Create a thread pool with Executors once (when your application starts up) and submit all the tasks to the pool instead of creating new threads (you can use ExecutorCompletionService to wait for all tasks to complete, or just call #get on the FutureTask instances you get when you submit your tasks).
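Here is a Python sketch of that measurement: total running time for the same workload across different thread counts. It is an analogue, not the JRuby code. `time.sleep` stands in for the real work because CPU-bound pure-Python threads are serialized by the GIL and would not scale the same way. `run_with_threads` is a hypothetical helper.

```python
import threading
import time


def run_with_threads(n_threads, n_tasks, task_seconds=0.01):
    """Split n_tasks sleep-based tasks across n_threads fresh threads; return elapsed seconds."""
    per_thread = [n_tasks // n_threads + (1 if i < n_tasks % n_threads else 0)
                  for i in range(n_threads)]

    def worker(count):
        for _ in range(count):
            time.sleep(task_seconds)  # stand-in for real (I/O-like) work

    start = time.perf_counter()
    threads = [threading.Thread(target=worker, args=(c,)) for c in per_thread]
    for th in threads:
        th.start()
    for th in threads:
        th.join()  # the main thread blocks here, like Thread#join in the profile
    return time.perf_counter() - start


for n in (1, 3, 10, 30):
    print(n, round(run_with_threads(n, n_tasks=30), 3))
```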
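Python's `concurrent.futures` follows the same design as Java's Executors: create the pool once, then submit tasks to it. In the sketch below, `square` is a hypothetical task and `method_that_is_of_interest` borrows the name from the question only for illustration. `as_completed` plays roughly the role of ExecutorCompletionService, and `Future.result()` corresponds to calling get on a Java Future.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Create the pool once (e.g. at application start-up) and reuse it for every call.
POOL = ThreadPoolExecutor(max_workers=4)


def square(x):
    """Hypothetical unit of work."""
    return x * x


def method_that_is_of_interest(items):
    """Submit one task per item to the shared pool instead of starting new threads."""
    futures = [POOL.submit(square, x) for x in items]
    # as_completed yields each future as it finishes, in completion order.
    return sorted(f.result() for f in as_completed(futures))


print(method_that_is_of_interest(range(5)))
```

Call `POOL.shutdown()` when the application exits. Every call in between reuses the same worker threads instead of paying thread-creation cost.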

Why is it slower to prespecify type in a data.frame?

I was preallocating a big data.frame to fill in later, which I normally do with NA's like this:
n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
and I wondered if it would make things any faster later if I specified data types up front, so I tested
f1 <- function() {
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
a$c2 <- 1:n
a$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
f2 <- function() {
b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
b$c2 <- 1:n
b$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
> system.time(f1())
user system elapsed
0.219 0.042 0.260
> system.time(f2())
user system elapsed
1.018 0.052 1.072
So it was actually much slower! I tried again with a factor column too, and the difference wasn't closer to 2x than 4x, but I'm curious about why this is slower, and wonder if it is ever appropriate to initialize with data types rather than NA's.
--
Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:
f3 <- function() {
c <- data.frame(c1 = 1:n)
c$c2 <- 1:n
c$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
print(system.time(f1()))
# user system elapsed
# 0.194 0.023 0.216
print(system.time(f2()))
# user system elapsed
# 0.336 0.037 0.374
print(system.time(f3()))
# user system elapsed
# 0.057 0.007 0.063
The reason for this is that when you preassign, a column of length n is created, e.g.:
str(data.frame(x=1:2, y = character(2)))
## 'data.frame': 2 obs. of 2 variables:
## $ x: int 1 2
## $ y: Factor w/ 1 level "": 1 1
Note that the character column has been converted to a factor, which is slower than setting stringsAsFactors = F.
David Robinson's answer is correct, but I will add some profiling here to show how to investigate why some things are slower than you might expect.
Profiling shows what is being called, which can give a clue as to why some calls are slower than others:
library(profr)
profr(f1())
## Read 9 items
## f level time start end leaf source
## 8 f1 1 0.16 0.00 0.16 FALSE <NA>
## 9 data.frame 2 0.04 0.00 0.04 TRUE base
## 10 $<- 2 0.02 0.04 0.06 FALSE base
## 11 sample 2 0.04 0.06 0.10 TRUE base
## 12 $<- 2 0.06 0.10 0.16 FALSE base
## 13 $<-.data.frame 3 0.12 0.04 0.16 TRUE base
profr(f2())
## Read 15 items
## f level time start end leaf source
## 8 f2 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.frame 2 0.12 0.00 0.12 TRUE base
## 10 : 2 0.02 0.12 0.14 TRUE base
## 11 $<- 2 0.02 0.18 0.20 FALSE base
## 12 sample 2 0.02 0.20 0.22 TRUE base
## 13 $<- 2 0.06 0.22 0.28 FALSE base
## 14 as.data.frame 3 0.08 0.04 0.12 FALSE base
## 15 $<-.data.frame 3 0.10 0.18 0.28 TRUE base
## 16 as.data.frame.character 4 0.08 0.04 0.12 FALSE base
## 17 factor 5 0.08 0.04 0.12 FALSE base
## 18 unique 6 0.06 0.04 0.10 FALSE base
## 19 match 6 0.02 0.10 0.12 TRUE base
## 20 unique.default 7 0.06 0.04 0.10 TRUE base
profr(f3())
## Read 4 items
## f level time start end leaf source
## 8 f3 1 0.06 0.00 0.06 FALSE <NA>
## 9 $<- 2 0.02 0.00 0.02 FALSE base
## 10 sample 2 0.04 0.02 0.06 TRUE base
## 11 $<-.data.frame 3 0.02 0.00 0.02 TRUE base
Clearly f2() is slower than f1() because there are a lot of character-to-factor conversions, re-creation of levels, etc.
For efficient use of memory, I would suggest the data.table package. It avoids internal copying of objects as much as possible:
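To make the cost of those conversions concrete, here is a rough Python sketch of what turning a character vector into a factor involves. It mirrors the unique and match calls in the f2() profile: one pass to collect the distinct levels, then a second pass to map every value to its level index. This is an illustration, not R's implementation. In particular, R sorts levels using the locale's collation, while Python's `sorted` uses code-point order. `to_factor` is a hypothetical name.

```python
def to_factor(values):
    """Encode values as (levels, 1-based codes), like R's factor(): unique, sort, then match."""
    levels = sorted(set(values))  # "unique" pass plus a sort of the distinct values
    index = {level: i + 1 for i, level in enumerate(levels)}
    codes = [index[v] for v in values]  # "match" pass over every element
    return levels, codes


levels, codes = to_factor(["b", "a", "c", "a", ""])
print(levels, codes)
```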
library(data.table)
f4 <- function(){
f <- data.table(c1 = 1:n)
f[,c2:=1L:n]
f[,c3:=sample(LETTERS, size= n, replace = TRUE)]
}
system.time(f1())
## user system elapsed
## 0.15 0.02 0.18
system.time(f2())
## user system elapsed
## 0.19 0.00 0.19
system.time(f3())
## user system elapsed
## 0.09 0.00 0.09
system.time(f4())
## user system elapsed
## 0.04 0.00 0.04
Note that with data.table you can add two columns at once (and by reference):
# Thanks to Thell for pointing this out.
f[,`:=`(c('c2','c3'), list(1L:n, sample(LETTERS,n, T))), with = F]
EDIT: functions that return the required object (well picked up, Dwin):
n= 1e7
f1 <- function() {
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
a$c2 <- 1:n
a$c3 <- sample(LETTERS, size = n, replace = TRUE)
a
}
f2 <- function() {
b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
b$c2 <- 1:n
b$c3 <- sample(LETTERS, size = n, replace = TRUE)
b
}
f3 <- function() {
c <- data.frame(c1 = 1:n)
c$c2 <- 1:n
c$c3 <- sample(LETTERS, size = n, replace = TRUE)
c
}
f4 <- function() {
f <- data.table(c1 = 1:n)
f[, `:=`(c2, 1L:n)]
f[, `:=`(c3, sample(LETTERS, size = n, replace = TRUE))]
}
system.time(f1())
## user system elapsed
## 1.62 0.34 2.13
system.time(f2())
## user system elapsed
## 2.14 0.66 2.79
system.time(f3())
## user system elapsed
## 0.78 0.25 1.03
system.time(f4())
## user system elapsed
## 0.37 0.08 0.46
profr(f1())
## Read 105 items
## f level time start end leaf source
## 8 f1 1 2.08 0.00 2.08 FALSE <NA>
## 9 data.frame 2 0.66 0.00 0.66 FALSE base
## 10 : 2 0.02 0.66 0.68 TRUE base
## 11 $<- 2 0.32 0.84 1.16 FALSE base
## 12 sample 2 0.40 1.16 1.56 TRUE base
## 13 $<- 2 0.32 1.76 2.08 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.12 0.54 0.66 TRUE base
## 17 $<-.data.frame 3 1.24 0.84 2.08 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f2())
## Read 145 items
## f level time start end leaf source
## 8 f2 1 2.88 0.00 2.88 FALSE <NA>
## 9 data.frame 2 1.40 0.00 1.40 FALSE base
## 10 : 2 0.04 1.40 1.44 TRUE base
## 11 $<- 2 0.36 1.64 2.00 FALSE base
## 12 sample 2 0.40 2.00 2.40 TRUE base
## 13 $<- 2 0.36 2.52 2.88 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 numeric 3 0.06 0.02 0.08 TRUE base
## 16 character 3 0.04 0.08 0.12 TRUE base
## 17 as.data.frame 3 1.06 0.12 1.18 FALSE base
## 18 unlist 3 0.20 1.20 1.40 TRUE base
## 19 $<-.data.frame 3 1.24 1.64 2.88 TRUE base
## 20 as.data.frame.integer 4 0.04 0.12 0.16 TRUE base
## 21 as.data.frame.numeric 4 0.16 0.18 0.34 TRUE base
## 22 as.data.frame.character 4 0.78 0.40 1.18 FALSE base
## 23 factor 5 0.74 0.40 1.14 FALSE base
## 24 as.data.frame.vector 5 0.04 1.14 1.18 TRUE base
## 25 unique 6 0.38 0.40 0.78 FALSE base
## 26 match 6 0.32 0.78 1.10 TRUE base
## 27 unique.default 7 0.38 0.40 0.78 TRUE base
profr(f3())
## Read 37 items
## f level time start end leaf source
## 8 f3 1 0.72 0.00 0.72 FALSE <NA>
## 9 data.frame 2 0.10 0.00 0.10 FALSE base
## 10 : 2 0.02 0.10 0.12 TRUE base
## 11 $<- 2 0.08 0.14 0.22 FALSE base
## 12 sample 2 0.26 0.22 0.48 TRUE base
## 13 $<- 2 0.16 0.56 0.72 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.02 0.08 0.10 TRUE base
## 17 $<-.data.frame 3 0.58 0.14 0.72 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f4())
## Read 15 items
## f level time start end leaf source
## 8 f4 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.table 2 0.02 0.00 0.02 FALSE data.table
## 10 [ 2 0.26 0.02 0.28 FALSE base
## 11 : 3 0.02 0.00 0.02 TRUE base
## 12 [.data.table 3 0.26 0.02 0.28 FALSE <NA>
## 13 eval 4 0.26 0.02 0.28 FALSE base
## 14 eval 5 0.26 0.02 0.28 FALSE base
## 15 : 6 0.02 0.02 0.04 TRUE base
## 16 sample 6 0.24 0.04 0.28 TRUE base
