Given a network (take OpenPCDet as an example) that runs on distributed GPUs, how can I find out which module costs the most time during training?
I don't want to time each module manually with
torch.cuda.synchronize()
start = time.time()
result = model(input)
torch.cuda.synchronize()
end = time.time()
And with torch.autograd.profiler.profile(enabled=True) as prof: only shows me the CPU time (which doesn't seem right either; I am not sure I used it properly):
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name                            Self CPU %     Self CPU  CPU total %    CPU total CPU time avg   # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::randperm                      28.77%     77.040us       61.28%    164.094us     82.047us            2
aten::random_                       22.25%     59.584us       22.25%     59.584us     29.792us            2
aten::empty                         21.79%     58.333us       21.79%     58.333us     19.444us            3
aten::item                          10.91%     29.225us       21.22%     56.807us     28.403us            2
aten::_local_scalar_dense           10.30%     27.582us       10.30%     27.582us     13.791us            2
aten::scalar_tensor                  4.19%     11.222us        4.19%     11.222us     11.222us            1
aten::resize_                        0.94%      2.507us        0.94%      2.507us      1.253us            2
aten::is_floating_point              0.85%      2.268us        0.85%      2.268us      1.134us            2
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 267.761us
Answers that only work on a single GPU are also welcome.
In that case, you'd have to enable CUDA on your model and input (.cuda() or .to(device)) and on the profiler
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:, as described in the docs.
If you want to profile the training performance, it's also important to call loss.backward() inside the profiler context/with block, as the backward pass performance might differ from the forward pass by quite a bit.
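For example, a minimal sketch of such a profiling run (the tiny model, random batch, and loss below are only placeholders for your own training step):
import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda")
model = torch.nn.Linear(128, 10).to(device)        # stand-in for your real network
images = torch.randn(32, 128, device=device)       # stand-in for one training batch
labels = torch.randint(0, 10, (32,), device=device)
criterion = torch.nn.CrossEntropyLoss()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    outputs = model(images)                        # forward pass
    loss = criterion(outputs, labels)
    loss.backward()                                # backward pass inside the profiled region

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))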
PS: I also find it a bit easier to read the profiler output as a Pandas DataFrame:
import pandas as pd

df = pd.DataFrame({e.key: e.__dict__ for e in prof.key_averages()}).T
df[['count', 'cpu_time_total', 'cuda_time_total']].sort_values(['cuda_time_total', 'cpu_time_total'], ascending=False)
Related
While benchmarking different dataloaders, I noticed some peculiar behavior with the PyTorch built-in DataLoader. I am running the code below on a CPU-only machine with the MNIST dataset.
It seems that a simple forward pass in my model is much faster when mini-batches are preloaded to a list rather than fetched during iteration:
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T
from torch.profiler import profile, record_function, ProfilerActivity
mnist_dataset = torchvision.datasets.MNIST(root=".", train=True, transform=T.ToTensor(), download=True)
loader = torch.utils.data.DataLoader(dataset=mnist_dataset, batch_size=128, shuffle=False, pin_memory=False, num_workers=4)
model = nn.Sequential(nn.Flatten(), nn.Linear(28*28, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Linear(256, 10))
model.train()
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        for (images_iter, labels_iter) in loader:
            outputs_iter = model(images_iter)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        train_list = [sample for sample in loader]
        for (images_iter, labels_iter) in train_list:
            outputs_iter = model(images_iter)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
The most interesting subset of the Torch profiler output is:
Name                            Self CPU %     Self CPU  CPU total %    CPU total CPU time avg   # of Calls
aten::batch_norm                     0.02%    644.000us        4.57%    134.217ms    286.177us          469
Self CPU time total: 2.937s

Name                            Self CPU %     Self CPU  CPU total %    CPU total CPU time avg   # of Calls
aten::batch_norm                    70.48%       6.888s       70.62%       6.902s     14.717ms          469
Self CPU time total: 9.773s
It seems like aten::batch_norm (batch normalization) takes significantly more time in the case where samples are not preloaded into a list, but I can't figure out why, since it should be the same operation.
The above was tested on a 4-core CPU with Python 3.8.
If anything, the version that preloads to a list should be slightly slower overall, due to the overhead of creating the list.
With
torch=1.10.2+cu102
torchvision=0.11.3+cu102
I had the following results:
Self CPU time total: 2.475s
Self CPU time total: 2.800s
Try to reproduce this again using different library versions.
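If you do, it helps to record the exact versions next to the timings; a trivial sketch (just the standard version strings, nothing specific to this benchmark):
import torch
import torchvision

# Print the installed versions so benchmark numbers can be compared across environments.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)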
I have a very long list with hits from a HMMer search in the following form:
Query: Alvin_0001|ID:9263667| [L=454]
Description: chromosomal replication initiator protein DnaA [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
7.5e-150 497.8 0.2 9e-150 497.5 0.2 1.0 1 COG0593
8e-11 40.6 0.5 1.5e-10 39.7 0.5 1.6 1 COG1484
4.5e-07 28.1 0.2 6e-07 27.7 0.2 1.1 1 COG1373
2.5e-05 22.3 0.1 3.4e-05 21.8 0.1 1.4 1 COG1485
Query: Alvin_0005|ID:9265207| [L=334]
Description: hypothetical protein [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
------ inclusion threshold ------
0.018 13.4 12.9 0.068 11.5 3.6 2.2 2 COG3247
0.024 13.1 9.0 0.053 12.0 9.0 1.5 1 COG2246
0.046 12.4 7.3 0.049 12.4 5.3 1.8 1 COG2020
Query: Alvin_0004|ID:9265206| [L=154]
Description: hypothetical protein [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
[No hits detected that satisfy reporting thresholds]
This file contains a lot of information that I am not interested in, so I need a script that only outputs certain values: the line with Query: and the first COG#### in the Model column.
So the expected output would be (a tab-delimited file would be best):
Query: Alvin_0001|ID:9263667| [L=454] COG0593
Query: Alvin_0005|ID:9265207| [L=334] COG3247
Query: Alvin_0004|ID:9265206| [L=154]
Note that in the last line, no COG was found.
Now the file structure is a bit too complicated for me to use a simple grep or awk command:
In the first block, the 1st and the 6th lines would be the targets (awk '/Query: /{nr[NR]; nr[NR+6]}; NR in nr')
In the second block, it is the 1st and the 7th line
and in the third, there is only the line with Query.
So what would now be a good approach to parse this file?
Short awk solution:
awk '/^Query:/{ if(q) print q; q=$0 }q && $9~/^COG.{4}$/{ printf("%s\t%s\n",q,$9); q="" }
END{ if(q) print q }' file
The output:
Query: Alvin_0001|ID:9263667| [L=454] COG0593
Query: Alvin_0005|ID:9265207| [L=334] COG3247
Query: Alvin_0004|ID:9265206| [L=154]
Details:
/^Query:/{ if(q) print q; q=$0 } - captures the "Query" line (and first prints the previous query if it never got a COG hit)
q && $9~/^COG.{4}$/ - captures the first "Model" field value for that query (taking only the first hit is ensured by resetting q="" afterwards)
$ cat tst.awk
BEGIN { OFS="\t" }
/^Query/ { qry=$0 }
$1 ~ /^[0-9]/ { if (qry!="") print qry, $9; qry="" }
/\[No hits/ { print qry }
$ awk -f tst.awk file
Query: Alvin_0001|ID:9263667| [L=454] COG0593
Query: Alvin_0005|ID:9265207| [L=334] COG3247
Query: Alvin_0004|ID:9265206| [L=154]
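For comparison, the same logic sketched in Python (assuming the report is in a file named file, as in the awk invocations above, and that the Model name is the 9th whitespace-separated field of a hit line):
import re

query = None
with open("file") as fh:
    for line in fh:
        if line.startswith("Query:"):
            if query is not None:            # previous query never got a COG hit
                print(query)
            query = line.rstrip()
        elif query is not None:
            fields = line.split()
            if len(fields) >= 9 and re.fullmatch(r"COG.{4}", fields[8]):
                print(query + "\t" + fields[8])
                query = None                 # keep only the first Model hit per query
if query is not None:
    print(query)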
I'm trying to extract filenames from RAR packages in a directory. I'm using 7z, which returns a multi-line string, and I would like to search the output for "mkv", "avi", or "srt" files.
Here's my code:
ROOT_DIR = "/users/ken/extract"

# Check each directory for RAR packages
# Returns a hash of directories with filenames from the RARs
def checkdirs()
  pkgdirs = {}
  Dir.foreach(ROOT_DIR) do |d|
    if !Dir.glob("#{ROOT_DIR}/#{d}/*.rar").empty?
      rarlist = `7z l #{ROOT_DIR}/#{d}/*.rar`
      puts rarlist # Multi-line output from 7z l
      puts rarlist.scan('*.mkv').first
      pkgdirs[d] = 'filename'
    end
  end
  pkgdirs
end
I can get the 7z output but I can't figure out how to search the output for my strings. How can I search the output and return the matching lines?
This is an example of the 7z output:
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,8 CPUs x64)
Scanning the drive for archives:
1 file, 15000000 bytes (15 MiB)
Listing archive: Gotham.S03E19.HDTV.x264-KILLERS/gotham.s03e19.hdtv.x264-killers.rar
--
Path = Gotham.S03E19.HDTV.x264-KILLERS/gotham.s03e19.hdtv.x264-killers.rar
Type = Rar
Physical Size = 15000000
Total Physical Size = 285988640
Characteristics = Volume FirstVolume VolCRC
Solid = -
Blocks = 1
Multivolume = +
Volume Index = 0
Volumes = 20
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2017-05-23 02:30:52 ..... 285986500 285986500 Gotham.S03E19.HDTV.x264-KILLERS.mkv
------------------- ----- ------------ ------------ ------------------------
2017-05-23 02:30:52 285986500 285986500 1 files
------------------- ----- ------------ ------------ ------------------------
2017-05-23 02:30:52 285986500 285986500 1 files
Archives: 1
Volumes: 20
Total archives size: 285988640
I expect this output:
2017-05-23 02:30:52 ..... 285986500 285986500 Gotham.S03E19.HDTV.x264-KILLERS.mkv
You can use this:
puts rarlist.scan(/^.*\.mkv/)
The regex will match from the beginning of lines.
To match .mkv, .avi, or .srt, you can use:
rarlist.scan(/(^.*\.(mkv|avi|srt))/) {|a,_| puts a}
The solution is much simpler than what you're making it.
Starting with:
TARGET_EXTENSIONS = %w[mkv avi srt]
TARGET_EXTENSION_RE = /\.(?:#{ Regexp.union(TARGET_EXTENSIONS).source})/
# => /\.(?:mkv|avi|srt)/
output = <<EOT
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,8 CPUs x64)
Scanning the drive for archives:
1 file, 15000000 bytes (15 MiB)
Listing archive: Gotham.S03E19.HDTV.x264-KILLERS/gotham.s03e19.hdtv.x264-killers.rar
--
Path = Gotham.S03E19.HDTV.x264-KILLERS/gotham.s03e19.hdtv.x264-killers.rar
Type = Rar
Physical Size = 15000000
Total Physical Size = 285988640
Characteristics = Volume FirstVolume VolCRC
Solid = -
Blocks = 1
Multivolume = +
Volume Index = 0
Volumes = 20
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2017-05-23 02:30:52 ..... 285986500 285986500 Gotham.S03E19.HDTV.x264-KILLERS.mkv
------------------- ----- ------------ ------------ ------------------------
2017-05-23 02:30:52 285986500 285986500 1 files
------------------- ----- ------------ ------------ ------------------------
2017-05-23 02:30:52 285986500 285986500 1 files
Archives: 1
Volumes: 20
Total archives size: 285988640
EOT
All it takes is to iterate over the lines in the output and puts the matches:
puts output.lines.grep(TARGET_EXTENSION_RE)
Which would output:
2017-05-23 02:30:52 ..... 285986500 285986500 Gotham.S03E19.HDTV.x264-KILLERS.mkv
The above is a basic solution, but there are things that could be done to speed up the code, depending on the output being received:
TARGET_EXTENSIONS = %w[mkv avi srt].map { |e| '.' << e } # => [".mkv", ".avi", ".srt"]
puts output.split(/\r?\n/).select { |l| l.end_with?(*TARGET_EXTENSIONS) }
I'd have to run benchmarks, but that should be faster, since regular expressions can drastically slow code if not written correctly.
You could try:
TARGET_EXTENSION_RE = /\.(?:#{ Regexp.union(TARGET_EXTENSIONS).source})$/
# => /\.(?:mkv|avi|srt)$/
puts output.split(/\r?\n/).grep(TARGET_EXTENSION_RE)
as anchored patterns are much faster than unanchored.
If the 7z archives generate huge listings (in the MB range), it'd be better to iterate over the input line by line to avoid scalability issues. In the above example, output.lines is akin to slurping the output. See "Why is "slurping" a file not a good practice?" for more information.
I am using CPLEX in C++.
After googling, I found out which parameters need to be set to keep CPLEX from printing to the terminal, and I use them like this:
IloCplex cplex(model);
std::ofstream logfile("cplex.log");
cplex.setOut(logfile);
cplex.setWarning(logfile);
cplex.setError(logfile);
cplex.setParam(IloCplex::MIPInterval, 1000);//Controls the frequency of node logging when MIPDISPLAY is set higher than 1.
cplex.setParam(IloCplex::MIPDisplay, 0);//MIP node log display information-No display until optimal solution has been found
cplex.setParam(IloCplex::SimDisplay, 0);//No iteration messages until solution
cplex.setParam(IloCplex::BarDisplay, 0);//No progress information
cplex.setParam(IloCplex::NetDisplay, 0);//Network logging display indicator
if ( !cplex.solve() ) {
....
}
but CPLEX still prints things like:
Warning: Bound infeasibility column 'x11'.
Presolve time = 0.00 sec. (0.00 ticks)
Root node processing (before b&c):
Real time = 0.00 sec. (0.01 ticks)
Parallel b&c, 4 threads:
Real time = 0.00 sec. (0.00 ticks)
Sync time (average) = 0.00 sec.
Wait time (average) = 0.00 sec.
------------
Total (root+branch&cut) = 0.00 sec. (0.01 ticks)
Is there any way to avoid printing them?
Use the setOut method from the IloAlgorithm class (IloCplex inherits from IloAlgorithm). You can pass a null output stream as the parameter and prevent the messages from being logged to the screen.
This is what works in C++ according to the CPLEX parameters documentation:
cplex.setOut(env.getNullStream());
cplex.setWarning(env.getNullStream());
cplex.setError(env.getNullStream());
What would be more efficient:
For i = 0 to 2
    if x[i] == y[i] then do something

// or

if x[0] == y[0] do something
if x[1] == y[1] do something
Assume I am only doing it twice. Also, ignore the readability.
What would be more efficient:
I think there will be absolutely no difference between the two, as your for loop only runs from 0 to 2, so you may prefer whichever is more readable to you. However, if your for loop were huge (i.e., the index were very large), then I would recommend using the for loop, as it would be more readable.
Also, ignore the readability.
I would not recommend that, as it is always advisable for programmers to write code that is readable both for oneself and for others.
It depends. That's almost always the answer to these things.
There will be several effects at play here.
Firstly, having a loop (i.e., an actual loop in the machine code; the high-level source is really irrelevant except insofar as it influences the machine code, and a loop of 2 iterations has an extremely high chance of being unrolled by a compiler) clearly executes more branches in total, and while a correctly predicted branch usually has no latency, branches typically do have a limited throughput.
Secondly, having two distinct branches means that distinct branch prediction histories can be attached to them. That can improve their predictability, particularly if the patterns, taken separately, both fit in a branch history buffer, while taken together the aggregate pattern is too long to fit. That's very machine dependent, does not happen on all microarchitectures, and is rare in any case, since it requires predictable patterns of behaviour of a carefully balanced "long enough but not too long" length.
Thirdly, unrolling that loop likely leads to more code (unless of course the loop overhead is more than the loop body). That puts more pressure on the code cache and the decoders. This effect, unlike the first two, favours the loop.
Lastly, all of these effects are small. In the presence of just about anything else (such as cache misses), they're likely to completely disappear in the noise.
This isn't exactly the same question, but I had been wondering what the performance difference is between doing more work within the same loop and iterating over the set twice. I assumed it would be faster to loop once, but I wasn't sure by how much, or whether optimization would even them out at all.
I finally did a simple test, and as you would expect, doing more work in the same loop is a bit faster.
The test was pretty simple and was done in Swift 3.1 and run with optimization ON:
let itr = 10000000
let passes = 10

print("Running \(itr) iterations through \(passes) passes")

for run in 0..<passes {
    print("---------")
    var time = CFAbsoluteTimeGetCurrent()
    var val1 = 0
    for i in 0..<itr {
        val1 += i
    }
    for i in 0..<itr {
        val1 += i
    }
    let t1 = CFAbsoluteTimeGetCurrent() - time
    print("\(run).1 - \(val1) -- \(t1)")

    time = CFAbsoluteTimeGetCurrent()
    var val2 = 0
    for i in 0..<itr {
        val2 += i
        val2 += i
    }
    let t2 = CFAbsoluteTimeGetCurrent() - time
    print("\(run).2 - \(val2) -- \(t2)")
}
And the results:
Running 10000000 iterations through 10 passes
---------
0.1 - 99999990000000 -- 0.127476990222931
0.2 - 99999990000000 -- 0.0763950347900391
---------
1.1 - 99999990000000 -- 0.121748030185699
1.2 - 99999990000000 -- 0.0743749737739563
---------
2.1 - 99999990000000 -- 0.123345971107483
2.2 - 99999990000000 -- 0.0756909847259521
---------
3.1 - 99999990000000 -- 0.11965000629425
3.2 - 99999990000000 -- 0.0711749792098999
---------
4.1 - 99999990000000 -- 0.117263972759247
4.2 - 99999990000000 -- 0.0712859630584717
---------
5.1 - 99999990000000 -- 0.116972029209137
5.2 - 99999990000000 -- 0.0708900094032288
---------
6.1 - 99999990000000 -- 0.121819019317627
6.2 - 99999990000000 -- 0.0748890042304993
---------
7.1 - 99999990000000 -- 0.124098002910614
7.2 - 99999990000000 -- 0.0734890103340149
---------
8.1 - 99999990000000 -- 0.122666001319885
8.2 - 99999990000000 -- 0.07710200548172
---------
9.1 - 99999990000000 -- 0.121197044849396
9.2 - 99999990000000 -- 0.0715969800949097