ruby-prof says Ruby increment operator (+=) takes 25 seconds - ruby

I'm trying to profile some Ruby code I wrote using the ruby-prof gem, and I see that basic operations like i += 1 (listed as Fixnum#+ in the table below) take over 24 seconds to run (in this particular test, the operation is performed 2,199,978 times). Is this normal?
Thread 582936
 %Total   %Self    Total     Self   Wait    Child            Calls  Name
                  203.93    81.72   0.00   122.21    100001/100001  InputFile#parse
 46.96%  18.82%   203.93    81.72   0.00   122.21           100001  InputFile#split_on_semicolon
                   24.59    24.59   0.00     0.00  2199978/3200094  Fixnum#+
                   16.02    16.02   0.00     0.00    100001/399998  String#split
                   14.72    14.72   0.00     0.00    999990/999991  String#[]
                   13.12    13.12   0.00     0.00  1199988/1199990  Fixnum#<
                   10.97    10.97   0.00     0.00   999990/2239978  String#empty?
                   10.49    10.49   0.00     0.00  1199988/1199988  String#<<
                    9.75     9.75   0.00     0.00  1199988/1200074  Array#[]
                    7.77     7.77   0.00     0.00    999990/999990  String#eql?
                    6.76     6.76   0.00     0.00    599994/599994  Fixnum#-
                    4.62     4.62   0.00     0.00    599994/599994  Array#delete_at
                    1.25     1.25   0.00     0.00   100001/1339989  Kernel#nil?
                    1.14     1.14   0.00     0.00    100001/300003  Array#size
                    1.01     1.01   0.00     0.00    100001/300002  Fixnum#>

Your results don't say += takes 25 seconds. They say that 2199978 calls to + took 24.59 seconds, which comes to 89.5 calls per ms. That's a bit slow, but probably only because it's being profiled. I don't see anything unusual in that.
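The arithmetic behind that figure can be checked in a few lines. The sketch below is in Python purely for illustration (the question itself is about Ruby). The variable names are made up here, and the two inputs are copied from the Fixnum#+ row of the profile.

```python
# Figures copied from the Fixnum#+ row of the ruby-prof output above.
calls = 2_199_978        # calls made from InputFile#split_on_semicolon
self_seconds = 24.59     # self time attributed to those calls

calls_per_ms = calls / (self_seconds * 1000)
microseconds_per_call = self_seconds * 1_000_000 / calls

print(f"{calls_per_ms:.1f} calls per ms")                  # 89.5 calls per ms
print(f"{microseconds_per_call:.1f} microseconds per call")  # 11.2 microseconds per call
```

Roughly 11 microseconds per call is a plausible cost for a method call that the profiler has to intercept and time, which fits the explanation that the measurement overhead dominates.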

Related

Accuracy for Random Forest Algorithm is 0.0

I'm doing a machine learning project in a Jupyter notebook. I'm using Random Forest with GridSearchCV. The code runs without errors, but I get an accuracy of 0.0.
When I tried a Decision Tree, the accuracy was 99.99.
How do I solve this issue?
Input
# Training the RandomForest algorithm
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

rfc = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_leaf': [1, 2, 3, 4, 5, 10, 20]
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_

# Note: these hyperparameters are hard-coded, not read from CV_rfc.best_params_
rfc1 = RandomForestClassifier(random_state=42, n_estimators=50, max_depth=5, criterion='gini')
rfc1.fit(X_train, y_train)
Which gives an output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=50, n_jobs=1, oob_score=False, random_state=42,
verbose=0, warm_start=False)
INPUT:
from sklearn.metrics import accuracy_score
pred = rfc1.predict(X_test)
print("Accuracy for Random Forest on CV data: ", accuracy_score(y_test, pred))
OUTPUT:
Accuracy for Random Forest on CV data: 0.0
INPUT :
'''
Compute confusion matrix and print classification report.
'''
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
# score the model
Ntest = len(y_test)
Ntestpos = len([val for val in y_test if val])
NullAcc = float(Ntest-Ntestpos)/Ntest
print("Mean accuracy on Training set: %s" %rfc1.score(X_train, y_train))
print("Mean accuracy on Test set: %s" %rfc1.score(X_test, y_test))
print("Null accuracy on Test set: %s" %NullAcc)
print(" ")
y_pred = rfc1.predict(X_test)
f1_score(y_test, y_pred, average='weighted')
y_true, y_pred = y_test, rfc1.predict(X_test)
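cm = confusion_matrix(y_true, y_pred)
# Note: the tn/fp/fn/tp labels below are only meaningful for a binary problem;
# with more than two classes, cm has one row and one column per label.
print("Confusion matrix:\ntn=%6d fp=%6d\nfn=%6d tp=%6d" %(cm[0][0],cm[0][1],cm[1][0],cm[1][1]))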
print("\nDetailed classification report: \n%s" %classification_report(y_true, y_pred))
OUTPUT:
Mean accuracy on Training set: 1.0
Mean accuracy on Test set: 0.0
Null accuracy on Test set: 0.0
together with this warning:
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Confusion matrix:
tn=     0 fp=     0
fn=1745395 tp=     0
Detailed classification report:
            precision    recall  f1-score   support
          0      0.00      0.00      0.00         0
          1      0.00      0.00      0.00   1745395
          2      0.00      0.00      0.00    143264
          3      0.00      0.00      0.00     75044
          4      0.00      0.00      0.00     46700
          5      0.00      0.00      0.00     31568
          6      0.00      0.00      0.00     22966
          7      0.00      0.00      0.00     16903
          8      0.00      0.00      0.00     13188
          9      0.00      0.00      0.00     10160
.
.
.
        119      0.00      0.00      0.00         2
        123      0.00      0.00      0.00         2
        124      0.00      0.00      0.00         1
        141      0.00      0.00      0.00         1
        165      0.00      0.00      0.00         1
avg / total      0.00      0.00      0.00   2148603
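No answer was recorded for this question. One thing the output above does show is that every sample whose true label is 1 lands in the confusion-matrix column for label 0, a label that never occurs in y_test (its support is 0). A quick way to investigate that kind of mismatch is to compare the label sets directly. The following is only a diagnostic sketch with a hypothetical helper name (label_overlap) and toy data. It does not claim to identify the actual cause.

```python
from collections import Counter

def label_overlap(y_true, y_pred):
    """Summarise which labels appear in the truth, the predictions, or both."""
    true_labels, pred_labels = set(y_true), set(y_pred)
    return {
        "only_in_true": sorted(true_labels - pred_labels),
        "only_in_pred": sorted(pred_labels - true_labels),
        "shared": sorted(true_labels & pred_labels),
        "pred_counts": Counter(y_pred),
    }

# Toy data mimicking the symptom: the model predicts 0, which never occurs in y_true.
y_true = [1, 1, 2, 3, 1]
y_pred = [0, 0, 0, 0, 0]

report = label_overlap(y_true, y_pred)
print(report["only_in_pred"])  # labels predicted but absent from the truth: [0]
print(report["shared"])        # []; with no shared labels, accuracy must be 0.0
```

With the real data, y_test and rfc1.predict(X_test) would take the place of the toy lists. Running the same comparison between y_train and y_test would show whether the two splits use the same label values at all.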

Julia pmap speed - parallel processing - dynamic programming

I am trying to speed up filling in a matrix for a dynamic programming problem in Julia (v0.6.0), and I can't seem to get much extra speed from using pmap. This is related to this question I posted almost a year ago: Filling a matrix using parallel processing in Julia. I was able to speed up serial processing with some great help then, and I'm now trying to get extra speed from parallel processing tools in Julia.
For the serial processing case, I was using a 3-dimensional matrix (essentially a set of equally-sized matrices, indexed by the 1st-dimension) and iterating over the 1st-dimension. I wanted to give pmap a try, though, to more efficiently iterate over the set of matrices.
Here is the code setup. To use pmap with the v_iter function below, I converted the three dimensional matrix into a dictionary object, with the dictionary keys equal to the index values in the 1st dimension (v_dict in the code below, with gcc equal to the 1st-dimension size). The v_iter function takes other dictionary objects (E_opt_dict and gridpoint_m_dict below) as additional inputs:
function v_iter(a,b,c)
    diff_v = 1
    while diff_v>convcrit
        diff_v = -Inf
        #These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
        exp_v = zeros(Float64,gkpc,1)
        A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
        for j=2:gz
            temp=Array{Float64}(gkpc,1)
            A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
            exp_v=hcat(exp_v,temp)
        end
        #This tries to find the optimal value of v
        for h=1:gm
            for j=1:gz
                oldv = a[h,j]
                newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
                a[h,j] = newv
                diff_v = max(diff_v, oldv-newv, newv-oldv)
            end
        end
    end
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp = gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
E_opt_dict=Dict(i => E_opt[i,:,:] for i=1:gcc)
gridpoint_m_dict=Dict(i => gridpoint_m[i,:,:] for i=1:gcc)
For parallel processing, I executed the following commands:
wp = CachingPool(workers())
addprocs(3)
pmap(wp,v_iter,values(v_dict),values(E_opt_dict),values(gridpoint_m_dict))
...which produced this performance:
135.626417 seconds (3.29 G allocations: 57.152 GiB, 3.74% gc time)
I then tried serial processing instead:
for i=1:gcc
    v_iter(v_dict[i],E_opt_dict[i],gridpoint_m_dict[i])
end
...and got slightly better performance:
128.263852 seconds (3.29 G allocations: 57.101 GiB, 4.53% gc time)
This also gives me about the same performance as running v_iter on the original 3-dimensional objects:
v=zeros(Float64,gcc,gm,gz)
for i=1:gcc
    v_iter(v[i,:,:],E_opt[i,:,:],gridpoint_m[i,:,:])
end
I know that parallel processing involves setup time, but when I increase the value of gcc, I still get about equal processing time for serial and parallel. This seems like a good candidate for parallel processing, since there is no need for messaging between the workers! But I can't seem to make it work efficiently.
You create the CachingPool before adding the worker processes. Hence your caching pool passed to pmap tells it to use just a single worker.
You can check this by inspecting wp.workers; you will see something like Set([1]).
Hence it should be:
addprocs(3)
wp = CachingPool(workers())
You could also start Julia with the -p command-line parameter, e.g. julia -p 3, and then skip the addprocs(3) command.
On top of that, your for and pmap loops are not equivalent. The Julia Dict object is a hash map and, like hash maps in many other languages, does not guarantee any element order. Hence in your for loop you are guaranteed to get matching i-th elements, while with values the order does not need to match the original order (and it can differ for each of the three arguments in the pmap call).
Since the keys of your Dicts are just the numbers from 1 to gcc, you should simply use arrays instead. You can use generators, very similar to Python. For example, instead of
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
use
v_dict_a = [zeros(Float64,gm,gz) for i=1:gcc]
Hope that helps.
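The ordering point can be illustrated in Python, keeping one difference in mind: Python dicts preserve insertion order, whereas a Julia Dict makes no such promise. The sketch below therefore builds three dictionaries in different insertion orders to imitate the problem, then shows that indexing by key keeps the three inputs aligned. All names and values are illustrative.

```python
# Three keyed collections, deliberately inserted in different orders to imitate
# a hash map whose iteration order is not guaranteed.
v = {1: "v1", 2: "v2", 3: "v3"}
e_opt = {3: "e3", 1: "e1", 2: "e2"}
gridpoint = {2: "g2", 3: "g3", 1: "g1"}

# Zipping the values relies on iteration order and mixes up the triples.
misaligned = list(zip(v.values(), e_opt.values(), gridpoint.values()))

# Indexing by key keeps each triple consistent, as the serial for loop does.
aligned = [(v[i], e_opt[i], gridpoint[i]) for i in sorted(v)]

print(misaligned[0])  # ('v1', 'e3', 'g2')
print(aligned[0])     # ('v1', 'e1', 'g1')
```

Storing the per-index matrices in plain arrays, as suggested above, removes the issue entirely because position i always refers to the same slice.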
Based on @Przemyslaw Szufeul's helpful advice, I've placed below the code that properly executes parallel processing. After running it once, I achieved a substantial improvement in running time:
77.728264 seconds (181.20 k allocations: 12.548 MiB)
In addition to reordering the wp command and using the generator Przemyslaw recommended, I also recast v_iter as an anonymous function, to avoid having to sprinkle @everywhere around the code to feed functions and data to the workers.
I also added return a to the v_iter function, and set v_a below equal to the output of pmap, since you cannot pass by reference to a remote object.
addprocs(3)
v_iter = function(a,b,c)
    diff_v = 1
    while diff_v>convcrit
        diff_v = -Inf
        #These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
        exp_v = zeros(Float64,gkpc,1)
        A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
        for j=2:gz
            temp=Array{Float64}(gkpc,1)
            A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
            exp_v=hcat(exp_v,temp)
        end
        #This tries to find the optimal value of v
        for h=1:gm
            for j=1:gz
                oldv = a[h,j]
                newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
                a[h,j] = newv
                diff_v = max(diff_v, oldv-newv, newv-oldv)
            end
        end
    end
    return a
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp =gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_a=[zeros(Float64,gm,gz) for i=1:gcc]
E_opt_a=[E_opt[i,:,:] for i=1:gcc]
gridpoint_m_a=[gridpoint_m[i,:,:] for i=1:gcc]
wp = CachingPool(workers())
v_a = pmap(wp,v_iter,v_a,E_opt_a,gridpoint_m_a)

gprof on both OpenMP and without OpenMP codes produces different flat profile

After successfully adding OpenMP to my code, I am trying to check how much it has improved performance, but gprof gives me a totally different flat profile. Below is my main program, which calls all the subroutines.
program main
  use my_module
  call inputf   !to read inputs from a file
  ! call echo   !to check if the inputs are read in correctly, but is muted
  call allocv   !to allocate dimension to all array variable
  call bathyf   !to read in the computational domain
  call inicon   !to setup initial conditions
  call comput   !computation from iteration 1 to n
  call deallv   !to deallocate all array variables
end program main
The following are the cpu_time and OMP_GET_WTIME() results for the serial and parallel codes. The OpenMP parallel region is within subroutine comput.
!serial code
CPU time elapsed = 260.5080 seconds.
!parallel code
CPU time elapsed = 153.3600 seconds.
OMP time elapsed = 49.3521 seconds.
And the following are the flat profiles for the serial and parallel codes.
!Serial code
Flat profile:
Each sample counts as 0.01 seconds.
  %    cumulative     self              self    total
  time    seconds  seconds    calls   s/call   s/call  name
 96.26     227.63   227.63        1   227.63   236.45  comput_
  3.60     236.13     8.50     2001     0.00     0.00  update_
  0.08     236.32     0.19     2000     0.00     0.00  openbc_
  0.05     236.45     0.13       41     0.00     0.00  output_
  0.01     236.47     0.02        1     0.02     0.02  bathyf_
  0.01     236.49     0.02        1     0.02     0.03  inicon_
  0.00     236.50     0.01        1     0.01     0.01  opwmax_
  0.00     236.50     0.00     1001     0.00     0.00  timser_
  0.00     236.50     0.00        2     0.00     0.00  timestamp_
  0.00     236.50     0.00        1     0.00     0.00  allocv_
  0.00     236.50     0.00        1     0.00     0.00  deallv_
  0.00     236.50     0.00        1     0.00     0.00  inputf_
!Parallel code
Flat profile:
Each sample counts as 0.01 seconds.
  %    cumulative     self              self    total
  time    seconds  seconds    calls   s/call   s/call  name
 95.52      84.90    84.90                             openbc_
  1.68      86.39     1.49     2001     0.74     0.74  update_
  0.10      86.48     0.09       41     2.20     2.20  output_
  0.00      86.48     0.00     1001     0.00     0.00  timser_
  0.00      86.48     0.00        2     0.00     0.00  timestamp_
  0.00      86.48     0.00        1     0.00     0.00  allocv_
  0.00      86.48     0.00        1     0.00     0.00  bathyf_
  0.00      86.48     0.00        1     0.00     0.00  deallv_
  0.00      86.48     0.00        1     0.00     2.20  inicon_
  0.00      86.48     0.00        1     0.00     0.00  inputf_
  0.00      86.48     0.00        1     0.00     0.00  comput_
  0.00      86.48     0.00        1     0.00     0.00  opwmax_
Subroutines update, openbc, output and timser are called within subroutine comput. As you can see, subroutine comput is supposed to account for most of the runtime, but the flat profile of the parallel code shows otherwise. Please let me know if you need any other information.
gprof is poorly suited for analysis of parallel programs as it doesn't understand the intricacies of OpenMP. You should instead use something like a combination of Score-P and Cube. The former is an instrumentation framework while the latter is a visualisation tool for hierarchical performance data. Both are open-source projects. On the commercial front, Intel VTune Amplifier could be used.
This article says:
One problem with gprof under certain kernels (such as Linux) is that it doesn’t behave correctly with multithreaded applications. It actually only profiles the main thread, which is quite useless.
The article also provides a work-around, but since you don't create your threads manually, but instead use OpenMP (which creates the threads transparently), you will have to modify it to make it work for you.
You could also choose a profiler that is able to work with parallel programs instead.
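Since gprof's per-routine numbers cannot be trusted here, the timings the question already reports give a more direct view of the gain. The sketch below is a back-of-the-envelope calculation in Python, not a profiling method. It assumes the serial run's CPU time is close to its wall-clock time, and that cpu_time in the OpenMP run sums CPU time over all threads (which depends on the compiler and runtime).

```python
# Timings copied from the question, in seconds.
serial_cpu = 260.5080     # CPU time of the serial run (assumed close to its wall time)
parallel_cpu = 153.3600   # CPU time of the OpenMP run (assumed summed over threads)
parallel_wall = 49.3521   # OMP_GET_WTIME() elapsed time of the OpenMP run

speedup = serial_cpu / parallel_wall
avg_busy_threads = parallel_cpu / parallel_wall

print(f"speedup over serial: {speedup:.2f}x")
print(f"average busy threads: {avg_busy_threads:.2f}")
```

Under those assumptions the OpenMP version finishes a little over five times faster in wall-clock terms, with about three threads busy on average.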

Ruby-prof with graph printer and sorting by self puts out total percentages higher than 100%

If I run
ruby-prof -p graph -s self aggregate.rb > graph.txt
the first few lines of my graph.txt will look like:
Total Time: 40.092432
 %total   %self   total    self   wait   child           calls  Name
--------------------------------------------------------------------------------
                   5.16    5.16   0.00    0.00     98304/98304  Object#totalDurationFromFile
100.00% 100.00%    5.16    5.16   0.00    0.00           98304  IO#read
--------------------------------------------------------------------------------
                   4.91    4.91   0.00    0.00     98304/98304  <Class::IO>#new
 95.17%  95.17%    4.91    4.91   0.00    0.00           98304  File#initialize
--------------------------------------------------------------------------------
                   0.37    0.19   0.00    0.17     32768/32769  Hash#each
                  28.89    4.67   0.00   24.22         1/32769  Object#readFiles
566.81%  94.24%   29.26    4.86   0.00   24.39           32769  Array#collect
                  14.71    1.98   0.00   12.73     98304/98304  Object#totalDurationFromFile
                   9.11    0.64   0.00    8.48    98304/131072  Class#new
                   0.39    0.39   0.00    0.00    98304/196609  <Class::File>#basename
                   0.00    0.17   0.00    0.00   98304/1202331  Object#main
--------------------------------------------------------------------------------
                   3.76    3.35   0.00    0.42   524288/524288  Module#class_eval
 72.94%  64.85%    3.76    3.35   0.00    0.42          524288  Module#define_method
                   0.42    0.42   0.00    0.00   524288/524288  BasicObject#singleton_method_added
I don't think that this is specific to my script aggregate.rb. Therefore, I am leaving the source code out for the sake of brevity.
The question is: why are there percentages higher than 100% in the %total column? Is sorting by self not allowed with the graph printer? Is this a bug, or did I overlook something? Help greatly appreciated.
Thanks!
Have you checked whether this change on GitHub resolves the issue? Apparently, your gem version is out of date or does not include that change (the change would also increase the number of decimal places to three).
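The printed values are at least consistent with the percentages having been computed against the total of the first method listed (5.16 s for IO#read) rather than against the overall Total Time (40.09 s). That is an inference from the numbers in graph.txt, not something confirmed from the ruby-prof source. The Python sketch below just redoes the division both ways. The dictionary and variable names are illustrative.

```python
total_time = 40.092432      # "Total Time" line of graph.txt
first_method_total = 5.16   # total of IO#read, the first method after sorting by self

# Totals copied from the method rows of graph.txt.
method_totals = {
    "IO#read": 5.16,
    "File#initialize": 4.91,
    "Array#collect": 29.26,
    "Module#define_method": 3.76,
}

shares = {}
for name, total in method_totals.items():
    vs_first = 100 * total / first_method_total
    vs_overall = 100 * total / total_time
    shares[name] = (vs_first, vs_overall)
    print(f"{name:22s} {vs_first:7.2f}% of first entry, {vs_overall:6.2f}% of total time")
```

The first column lands close to the reported 100.00%, 95.17%, 566.81% and 72.94% (small differences are expected because the printed totals are rounded), while the second column stays below 100% for every method.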

What do large times spent in Thread#initialize and Thread#join mean in JRuby profiling?

I'm trying to profile an application using JRuby's built-in profiler.
Most of the time is taken in ClassIsOfInterest.method_that_is_of_interest, which in turn has most of its time taken in Thread#initialize and Thread#join:
total self children calls method
----------------------------------------------------------------
31.36 0.02 31.35 4525 Array#each
31.06 0.00 31.06 2 Test::Unit::RunCount.run_once
31.06 0.00 31.06 1 Test::Unit::RunCount.run
31.06 0.00 31.06 1 MiniTest::Unit#run
31.06 0.00 31.05 1 MiniTest::Unit#_run
31.01 0.00 31.01 2219 Kernel.send
31.00 0.00 31.00 1 MiniTest::Unit#run_tests
31.00 0.00 31.00 1 MiniTest::Unit#_run_anything
30.99 0.00 30.99 1 Test::Unit::Runner#_run_suites
30.99 0.00 30.99 5 MiniTest::Unit#_run_suite
30.99 0.00 30.98 21629 Array#map
30.98 0.00 30.98 1 Test::Unit::TestCase#run
30.98 0.00 30.98 1 MiniTest::Unit::TestCase#run
30.98 0.00 30.98 659 BasicObject#__send__
30.98 0.00 30.98 1 MyTestClass#my_test_method
30.80 0.00 30.80 18 Enumerable.each_with_index
30.77 0.00 30.77 15 MyTestHelper.generate_call_parser_based_on_barcoded_sequence
30.26 0.00 30.25 4943 Class#new_proxy
26.13 0.00 26.13 15 MyProductionClass1#my_production_method1
<snip boring methods with zero self time>
24.27 0.00 24.27 15 ClassIsOfInterest.method_that_is_of_interest
13.71 0.01 13.71 541 Enumerable.map
13.48 0.86 12.63 30 Range#each
12.62 0.22 12.41 450 Thread.new
12.41 12.41 0.00 450 Thread#initialize
10.78 10.78 0.00 450 Thread#join
4.03 0.12 3.91 539 Kernel.require
3.34 0.00 3.34 248 Kernel.require
2.49 0.00 2.49 15 MyTestFixture.create_fixture
<snip boring methods with small total times>
Each invocation of ClassIsOfInterest.method_that_is_of_interest is creating 30 threads, which is probably overkill, but I assume it shouldn't degrade performance that much. When I only had three threads created per invocation, I got
23.16 0.00 23.15 15 ClassIsOfInterest.method_that_is_of_interest
22.73 22.73 0.00 45 Thread#join
4.18 0.08 4.10 539 Kernel.require
3.56 0.00 3.56 248 Kernel.require
2.78 0.00 2.78 15 MyTestFixture.create_fixture
Do large time values for Thread#initialize (in the first profile) and Thread#join indicate that the code responsible for threading is taking a while, or merely that the code that is executed within the thread is taking a while?
The reason you see Thread#join is that your main thread is spending lots of time waiting for the other threads to finish. Most of the time spent in method_that_is_of_interest is spent blocking on Thread#join because it's not doing any other work. I wouldn't worry too much about it -- the profile is just saying that one of your threads is blocking on what other threads are doing. A better performance measurement in this case is the total running time: run the code with different numbers of threads and see where the sweet spot is.
The reason why Thread.new/Thread#initialize shows up is that threads are expensive objects to create. If you're calling this method often and it creates all those threads every time, I suggest you look into Java's Executors API. Create a thread pool with Executors once (when your application starts up) and submit all the tasks to the pool instead of creating new threads (you can use ExecutorCompletionService to wait for all tasks to complete, or just call #get on the FutureTask instances you get when you submit your tasks).
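The pooling idea is not specific to Java. As a rough analogue (not the Executors API the answer refers to), the sketch below uses Python's concurrent.futures.ThreadPoolExecutor: the pool is created once and reused across calls instead of starting fresh threads each time. The function names, the pool size and the squaring workload are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Create the pool once, e.g. at application start-up, and reuse it afterwards.
POOL = ThreadPoolExecutor(max_workers=4)

def work(item):
    """Placeholder for the per-task work done inside each thread."""
    return item * item

def method_that_is_of_interest(items):
    """Submit every task to the shared pool and wait for all the results."""
    futures = [POOL.submit(work, item) for item in items]
    # result() blocks until the task has finished, much like joining a thread.
    return [future.result() for future in futures]

results = method_that_is_of_interest(range(5))
print(results)  # [0, 1, 4, 9, 16]
POOL.shutdown()
```

Time spent waiting in result() would still show up in a profile, just as Thread#join does, but the per-call cost of creating threads goes away.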
