Why is PyTorch inference faster when preloading all mini-batches to a list?

While benchmarking different data loaders I noticed some peculiar behavior with the PyTorch built-in DataLoader. I am running the code below on a CPU-only machine with the MNIST dataset.
It seems that a simple forward pass through my model is much faster when the mini-batches are preloaded into a list rather than fetched during iteration:
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T
from torch.profiler import profile, record_function, ProfilerActivity

mnist_dataset = torchvision.datasets.MNIST(root=".", train=True, transform=T.ToTensor(), download=True)
loader = torch.utils.data.DataLoader(dataset=mnist_dataset, batch_size=128,
                                     shuffle=False, pin_memory=False, num_workers=4)

model = nn.Sequential(nn.Flatten(), nn.Linear(28*28, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Linear(256, 10))
model.train()

# Case 1: mini-batches fetched from the DataLoader during iteration
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        for (images_iter, labels_iter) in loader:
            outputs_iter = model(images_iter)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Case 2: mini-batches preloaded into a list first
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        train_list = [sample for sample in loader]
        for (images_iter, labels_iter) in train_list:
            outputs_iter = model(images_iter)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
The most interesting subset of the Torch profiler output is:
Name                Self CPU %    Self CPU    CPU total %    CPU total    CPU time avg    # of Calls
aten::batch_norm         0.02%   644.000us          4.57%    134.217ms       286.177us           469
Self CPU time total: 2.937s

Name                Self CPU %    Self CPU    CPU total %    CPU total    CPU time avg    # of Calls
aten::batch_norm        70.48%      6.888s         70.62%       6.902s        14.717ms           469
Self CPU time total: 9.773s
It seems that aten::batch_norm (batch normalization) takes significantly more time in the case where the samples are not preloaded into a list, but I can't figure out why, since it should be the same operation.
The above was tested on a 4-core CPU with Python 3.8.
If anything, the version that preloads into a list should be slightly slower overall due to the overhead of creating the list.

With
torch==1.10.2+cu102
torchvision==0.11.3+cu102
I had the following results:
Self CPU time total: 2.475s
Self CPU time total: 2.800s
Try to reproduce this code again using different library versions.
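If it helps, here is a minimal sketch of a version-independent way to repeat the comparison without the profiler, using plain wall-clock timing. It assumes the loader and model objects defined in the question's code and is not meant as a definitive benchmark:

import time

def time_epoch(batches):
    # One forward pass over every mini-batch; returns elapsed wall-clock seconds.
    start = time.perf_counter()
    for images, labels in batches:
        model(images)
    return time.perf_counter() - start

# Case 1: mini-batches fetched from the DataLoader while iterating.
t_iter = time_epoch(loader)

# Case 2: mini-batches preloaded into a Python list first.
preloaded = list(loader)
t_list = time_epoch(preloaded)

print(f"iterating DataLoader: {t_iter:.2f}s  preloaded list: {t_list:.2f}s")

If both variants take roughly the same time under this measurement, the gap seen earlier is more likely coming from the profiler or the library version than from the forward pass itself.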

Related

Finding the number of context switches

In order to measure the number of context switches for a multi-threaded application, I followed two methods: 1) with perf sched and 2) with the information in /proc/pid/status. The difference is quite large, though. The steps I took are:
1- Using the perf command, the number of switches is 7848.
$ sudo perf stat -e sched:sched_switch,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions ./mm_double_omp 4
Using 4 threads
PID = 395944

 Performance counter stats for './mm_double_omp 4':

             7,601      sched:sched_switch        #    0.044 K/sec
        173,377.19 msec task-clock                #    3.973 CPUs utilized
             7,601      context-switches          #    0.044 K/sec
                 2      cpu-migrations            #    0.000 K/sec
            24,780      page-faults               #    0.143 K/sec
   164,393,781,352      cycles                    #    0.948 GHz
    69,723,515,498      instructions              #    0.42  insn per cycle

      43.636463582 seconds time elapsed
     173.244505000 seconds user
       0.123880000 seconds sys
Please note that sched:sched_switch and context-switches are the same. If I only use sched:sched_switch, the number is still on the order of 7,000.
2- I modified the code to copy the /proc/pid/status file twice: at the beginning and at the end of the program.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    char cmdbuf[256];
    int pid_num = getpid();
    printf("PID = %d\n", pid_num);
    snprintf(cmdbuf, sizeof(cmdbuf), "sudo cp /proc/%d/status %s", pid_num, "start.txt");
    system(cmdbuf);
    // DO the actual multi-threaded work here
    snprintf(cmdbuf, sizeof(cmdbuf), "sudo cp /proc/%d/status %s", pid_num, "finish.txt");
    system(cmdbuf);
    return 0;
}
After the execution I see:
$ tail -n2 start.txt
voluntary_ctxt_switches: 2
nonvoluntary_ctxt_switches: 0
$ tail -n2 finish.txt
voluntary_ctxt_switches: 5
nonvoluntary_ctxt_switches: 573
So, there are fewer than 600 context switches, which is far less than the perf result. My questions are:
Does the perf instrumentation itself affect the measurement? If yes, then it has a large overhead.
Is the meaning of "context switch" the same in both methods?
Which one is more reliable, then?
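For what it's worth, here is a minimal sketch (in Python rather than the question's C, and Unix-only) of reading the same kind of per-process counters that the second method relies on, via getrusage. Note that these counters may not aggregate across threads in exactly the same way as the per-task values shown in /proc/pid/status:

import resource

def ctxt_switches():
    # ru_nvcsw / ru_nivcsw are the voluntary / involuntary context switches
    # accumulated so far by the calling process (RUSAGE_SELF).
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_nvcsw, ru.ru_nivcsw

vol_before, invol_before = ctxt_switches()
# ... run the work being measured here ...
vol_after, invol_after = ctxt_switches()

print("voluntary:", vol_after - vol_before,
      "nonvoluntary:", invol_after - invol_before)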

Is it possible to vectorize annotation for matplotlib?

As part of a large QC benchmark I am creating a large number (approx. 100K) of scatter plots in a single PDF using the PdfPages backend (see the code further down).
The issue I am having is that the plotting takes too much time; see the output from a custom profiling/debugging effort:
Checkpoint1: Predictions done in 1.110076904296875 millis
Checkpoint2: df created and correlations calculated in 3.108978271484375 millis
Checkpoint3: plotting and accumulating done in 231.31990432739258 millis
Cycle completed in 0.23553895950317383 secs
----------------------
Checkpoint1: Predictions done in 3.718852996826172 millis
Checkpoint2: df created and correlations calculated in 2.353191375732422 millis
Checkpoint3: plotting and accumulating done in 155.93385696411133 millis
Cycle completed in 0.16200590133666992 secs
----------------------
Checkpoint1: Predictions done in 2.920866012573242 millis
Checkpoint2: df created and correlations calculated in 1.995086669921875 millis
Checkpoint3: plotting and accumulating done in 161.8819236755371 millis
Cycle completed in 0.16679787635803223 secs
The time for plotting increases 2-3x if I annotate the points, which is necessary for the use case. As you can see below, I have tried both itertuples() and apply(); switching to apply() did not give a significant change in the times as far as I can see.
import matplotlib.pyplot as plt

def annotate(row, ax):
    ax.annotate(row.name, (row.exp, row.model),
                xytext=(10, 20), textcoords='offset points',
                arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                family='sans-serif', fontsize=8, color='darkslategrey')

def plot2File(df, file, seq, z, p, s):
    """ Plot predictions vs experimental """
    plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
    ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
    df.apply(annotate, ax=ax, axis=1)
    # for row in df.itertuples():
    #     ax.annotate(row.Index, (row.exp, row.model),
    #                 xytext=(10, 20), textcoords='offset points',
    #                 arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
    #                 family='sans-serif', fontsize=8, color='darkslategrey')
    plt.savefig(file, bbox_inches='tight', format='pdf')
    plt.close()
Given the nice explanation by Jeff on a question regarding iterrows(), I was wondering if it would be possible to vectorize the annotation process. Or should I ditch using a data frame altogether?
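For reference, here is a minimal sketch of the loop-over-plain-arrays variant, assuming the DataFrame index holds the labels and that the per-row pandas overhead (rather than matplotlib itself) is the part one hopes to shave off; ax.annotate still has to be called once per point, since I am not aware of a batch annotation API in matplotlib:

# Sketch: pull the columns out of the DataFrame once and loop over plain
# arrays, so the per-point cost is only the ax.annotate call itself.
# Assumes `df` and `ax` as in the question's plot2File().
labels = df.index.to_numpy()
xs = df['exp'].to_numpy()
ys = df['model'].to_numpy()

for label, x, y in zip(labels, xs, ys):
    ax.annotate(label, (x, y),
                xytext=(10, 20), textcoords='offset points',
                arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                family='sans-serif', fontsize=8, color='darkslategrey')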

Optimizing Groovy Performance

I'm working on Groovy code performance optimization. I've used jvisualvm to connect to the running application and gather CPU samples. The samples say that org.codehaus.groovy.reflection.CachedMethod.invoke takes the most CPU time. I don't see any other application methods in the samples.
What is the right way to dig into CachedMethod.invoke and understand which code lines really cause the performance penalties?
Thanks.
UPD:
I do use indy; it didn't help me.
I didn't try to introduce @CompileStatic since I want to find my bottlenecks before rewriting Groovy to Java.
My problem is a bit similar to this thread: Call site caching faster than invokedynamic?
I have code that dynamically composes a Groovy script. The script template looks like this:
def evaluateExpression(Map context){
    def user = context.user
    %s
}
where %s is replaced with
user.attr1 == '1' || user.attr2 == '2' || user.attr3 == '3'
There is a set of replacements (20 in total) taken from databases.
The code gets the replacements from the DB, creates a GroovyScript, and evaluates it.
I suppose the bottleneck is in the script execution. What is the right way to fix it?
So, I've tried various things:
groovy-indy: doesn't work.
groovy-indy with some code "optimization": doesn't work. BTW, I started to play around with try/catch and as a result made my "hotspot" run 4 times faster. I'm not good at JVM internals, but the internet says try/catch prevents optimizations; I took that as ground truth. I need to dig deeper to understand how it really works.
I gave up, turned off invokedynamic, and rewrote my "hottest" code with @CompileStatic. It took about 3-4 hours and my code now runs 100 times faster.
Here are the initial metrics with "invokedynamic support":
count = 83043
mean rate = 395.52 calls/second
1-minute rate = 555.30 calls/second
5-minute rate = 217.78 calls/second
15-minute rate = 82.92 calls/second
min = 0.29 milliseconds
max = 12.98 milliseconds
mean = 1.59 milliseconds
stddev = 1.08 milliseconds
median = 1.39 milliseconds
75% <= 2.46 milliseconds
95% <= 3.14 milliseconds
98% <= 3.44 milliseconds
99% <= 3.76 milliseconds
99.9% <= 12.19 milliseconds
Here are the @CompileStatic metrics with indy turned off. BTW, there is no reason to use @CompileStatic if "indy" is turned on.
count = 139724
mean rate = 8950.43 calls/second
1-minute rate = 2011.54 calls/second
5-minute rate = 426.96 calls/second
15-minute rate = 143.76 calls/second
min = 0.02 milliseconds
max = 24.18 milliseconds
mean = 0.08 milliseconds
stddev = 0.72 milliseconds
median = 0.06 milliseconds
75% <= 0.08 milliseconds
95% <= 0.11 milliseconds
98% <= 0.15 milliseconds
99% <= 0.20 milliseconds
99.9% <= 1.27 milliseconds

Julia parallel computing in IPython Jupyter

I'm preparing a small presentation in IPython where I want to show how easy it is to do parallel operations in Julia.
It's basically a Monte Carlo pi calculation described here.
The problem is that I can't make it work in parallel inside an IPython (Jupyter) Notebook; it only uses one core.
I started Julia as: julia -p 4
If I define the functions inside the REPL and run it there, it works fine.
@everywhere function compute_pi(N::Int)
    """
    Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
    Returns estimate of pi
    """
    n_landed_in_circle = 0
    for i = 1:N
        x = rand() * 2 - 1    # uniformly distributed number on x-axis
        y = rand() * 2 - 1    # uniformly distributed number on y-axis
        r2 = x*x + y*y        # radius squared, in radial coordinates
        if r2 < 1.0
            n_landed_in_circle += 1
        end
    end
    return n_landed_in_circle / N * 4.0
end

function parallel_pi_computation(N::Int; ncores::Int=4)
    """
    Compute pi in parallel, over ncores cores, with a Monte Carlo simulation throwing N total darts
    """
    # compute sum of pi's estimated among all cores in parallel
    sum_of_pis = @parallel (+) for i=1:ncores
        compute_pi(int(N/ncores))
    end
    return sum_of_pis / ncores    # average value
end

julia> @time parallel_pi_computation(int(1e9))
elapsed time: 2.702617652 seconds (93400 bytes allocated)
3.1416044160000003
But when I do:
using IJulia
notebook()
and try to do the same thing inside the notebook, it only uses 1 core:
In [5]: @time parallel_pi_computation(int(10e8))
        elapsed time: 10.277870808 seconds (219188 bytes allocated)
Out[5]: 3.141679988
So, why isn't Jupyter using all the cores? What can I do to make it work?
Thanks.
Using addprocs(4) as the first command in your notebook should provide four workers for doing parallel operations from within your notebook.
One way to solve this is to create a kernel that always uses 4 cores. For that, some manual work is required. I assume that you are on a Unix machine.
In the folder ~/.ipython/kernels/julia-0.x, you will find the following kernel.json file:
{
    "display_name": "Julia 0.3.9",
    "argv": [
        "/usr/local/Cellar/julia/0.3.9_1/bin/julia",
        "-i",
        "-F",
        "/Users/ch/.julia/v0.3/IJulia/src/kernel.jl",
        "{connection_file}"
    ],
    "language": "julia"
}
If you copy the whole folder cp -r julia-0.x julia-0.x-p4, and modify the newly copied kernel.json file:
{
    "display_name": "Julia 0.3.9 p4",
    "argv": [
        "/usr/local/Cellar/julia/0.3.9_1/bin/julia",
        "-p",
        "4",
        "-i",
        "-F",
        "/Users/ch/.julia/v0.3/IJulia/src/kernel.jl",
        "{connection_file}"
    ],
    "language": "julia"
}
The paths will probably be different for you. Note that I only gave the kernel a new name and added the command line argument -p 4.
You should see a new kernel named Julia 0.3.9 p4 which should always use 4 cores.
Also note that this kernel file will not get updated when you update IJulia, so you have to update it manually whenever you update julia or IJulia.
You can add new kernels using this command:
using IJulia
# for 4 cores
installkernel("Julia_4_threads", env=Dict("JULIA_NUM_THREADS"=>"4"))
# or for 8 cores
installkernel("Julia_8_threads", env=Dict("JULIA_NUM_THREADS"=>"8"))
After restarting VS Code, these options will appear in your kernel selection menu.

How to judge the trade-off between a Lua closure and a Lua coroutine? (when both of them can perform the same task)

PS: setting aside the code complexity of the closure implementation of the same task.
The memory overhead for a closure will be less than for a coroutine (unless you've got a lot of "upvalues" in the closure, and none in the coroutine). Also the time overhead for invoking the closure is negligible, whereas there is some small overhead for invoking the coroutine. From what I've seen, Lua does a pretty good job with coroutine switches, but if performance matters and you have the option not to use a coroutine, you should explore that option.
If you want to do benchmarks yourself, for this or anything else in Lua:
Use collectgarbage("collect"); collectgarbage("count") to report the size of all non-garbage memory in kilobytes. (You may want to call "collect" a few times, not just once.) Do that before and after creating something (a closure, a coroutine) to know how much memory it consumes.
Use os.clock() to time things.
See also Programming in Lua on profiling.
See also:
https://gist.github.com/LiXizhi/911069b7e7f98db76d295dc7d1c5e34a
-- Testing coroutine overhead in LuaJIT 2.1 with NPL runtime
--[[
Starting function test...
memory(KB): 0.35546875
Functions: 500000
Elapsed time: 0 s
Starting coroutine test...
memory(KB): 13781.81640625
Coroutines: 500000
Elapsed time: 0.191 s
Starting single coroutine test...
memory(KB): 0.4453125
Coroutines: 500000
Elapsed time: 0.02800000000002
conclusions:
1. memory overhead: 0.26KB per coroutine
2. yield/resume pair overhead: 0.0004 ms
if you have 1000 objects each is calling yield/resume at 60FPS, then the time overhead is 0.2*1000/500000*60*1000 = 24ms
and if you do not reuse coroutine, then memory overhead is 1000*60*0.26 = 15.6MB/sec
]]
local total = 500000
local start, stop

function loopy(n)
    n = n + 1
    return n
end

print "Starting function test..."
collectgarbage("collect"); collectgarbage("collect"); collectgarbage("collect");
local beforeCount = collectgarbage("count")
start = os.clock()
for i = 1, total do
    loopy(i)
end
stop = os.clock()
print("memory(KB):", collectgarbage("count") - beforeCount)
print("Functions:", total)
print("Elapsed time:", stop - start, " s")

print "Starting coroutine test..."
collectgarbage("collect"); collectgarbage("collect"); collectgarbage("collect");
local beforeCount = collectgarbage("count")
start = os.clock()
for i = 1, total do
    co = coroutine.create(loopy)
    coroutine.resume(co, i)
end
stop = os.clock()
print("memory(KB):", collectgarbage("count") - beforeCount)
print("Coroutines:", total)
print("Elapsed time:", stop - start, " s")

print "Starting single coroutine test..."
collectgarbage("collect"); collectgarbage("collect"); collectgarbage("collect");
local beforeCount = collectgarbage("count")
start = os.clock()
co = coroutine.create(function()
    for i = 1, total do
        loopy(i)
        coroutine.yield()
    end
end)
for i = 1, total do
    coroutine.resume(co, i)
end
stop = os.clock()
print("memory(KB):", collectgarbage("count") - beforeCount)
print("Coroutines:", total)
print("Elapsed time:", stop - start, " s")
