Finding the number of context switches - performance

In order to measure the number of context switches of a multi-threaded application, I followed two methods: 1) with perf sched and 2) with the information in /proc/pid/status. The difference between them is quite large, though. The steps I took are:
1- Using the perf command, the number of switches is 7,848 (7,601 in the run shown below):
$ sudo perf stat -e sched:sched_switch,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions ./mm_double_omp 4
Using 4 threads
PID = 395944
Performance counter stats for './mm_double_omp 4':
7,601 sched:sched_switch # 0.044 K/sec
173,377.19 msec task-clock # 3.973 CPUs utilized
7,601 context-switches # 0.044 K/sec
2 cpu-migrations # 0.000 K/sec
24,780 page-faults # 0.143 K/sec
164,393,781,352 cycles # 0.948 GHz
69,723,515,498 instructions # 0.42 insn per cycle
43.636463582 seconds time elapsed
173.244505000 seconds user
0.123880000 seconds sys
Please note that sched:sched_switch and context-switches report the same value. If I only use sched:sched_switch, the number is still on the order of 7,000.
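Since the question mentions perf sched, one way to see how those switches break down per task is the following sketch (sub-command names and output columns may differ between perf versions):
$ sudo perf sched record -- ./mm_double_omp 4
$ sudo perf sched latency    # per-task summary that includes a "Switches" column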
2- I modified the code to copy the /proc/pid/status file twice: once at the beginning and once at the end of the program.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    char cmdbuf[256];
    int pid_num = getpid();
    printf("PID = %d\n", pid_num);
    // snapshot the status file at the start
    snprintf(cmdbuf, sizeof(cmdbuf), "sudo cp /proc/%d/status %s", pid_num, "start.txt");
    system(cmdbuf);
    // ... the actual multi-threaded work goes here ...
    // snapshot the status file again at the end
    snprintf(cmdbuf, sizeof(cmdbuf), "sudo cp /proc/%d/status %s", pid_num, "finish.txt");
    system(cmdbuf);
    return 0;
}
After the execution I see:
$ tail -n2 start.txt
voluntary_ctxt_switches: 2
nonvoluntary_ctxt_switches: 0
$ tail -n2 finish.txt
voluntary_ctxt_switches: 5
nonvoluntary_ctxt_switches: 573
So there are fewer than 600 context switches, which is far fewer than the perf result. My questions are:
Does the perf instrumentation itself affect the measurement? If so, it has a large overhead.
Is the meaning of a context switch the same in both methods?
Which one is more reliable, then?
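One check worth making before comparing the two numbers (a sketch of mine, not part of the original program): as far as I know, /proc/pid/status reports the counters of the main thread only, while per-thread counters live under /proc/pid/task/tid/status. A small standalone C helper that sums them for a given PID could look roughly like this:
/* sketch only: sum per-thread context-switch counters for a given PID */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    char path[256], line[256];
    long n, vol = 0, nonvol = 0;
    snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
    DIR *dir = opendir(path);
    if (!dir) { perror(path); return 1; }
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "/proc/%s/task/%s/status", argv[1], de->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "voluntary_ctxt_switches: %ld", &n) == 1)
                vol += n;
            else if (sscanf(line, "nonvoluntary_ctxt_switches: %ld", &n) == 1)
                nonvol += n;
        }
        fclose(f);
    }
    closedir(dir);
    printf("voluntary: %ld nonvoluntary: %ld total: %ld\n", vol, nonvol, vol + nonvol);
    return 0;
}
Run it against the PID the program prints while the worker threads are still alive (for example right before main returns, or from another shell).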

Related

Why is PyTorch inference faster when preloading all mini-batches to list?

While benchmarking different dataloaders I noticed some peculiar behavior with the PyTorch built-in dataloader. I am running the code below on a CPU-only machine with the MNIST dataset.
It seems that a simple forward pass in my model is much faster when mini-batches are preloaded to a list rather than fetched during iteration:
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T
from torch.profiler import profile, record_function, ProfilerActivity

mnist_dataset = torchvision.datasets.MNIST(root=".", train=True, transform=T.ToTensor(), download=True)
loader = torch.utils.data.DataLoader(dataset=mnist_dataset, batch_size=128, shuffle=False, pin_memory=False, num_workers=4)

model = nn.Sequential(nn.Flatten(), nn.Linear(28*28, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Linear(256, 10))
model.train()

# case 1: mini-batches are fetched from the DataLoader during iteration
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        for (images_iter, labels_iter) in loader:
            outputs_iter = model(images_iter)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# case 2: all mini-batches are preloaded into a list first
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        train_list = [sample for sample in loader]
        for (images_iter, labels_iter) in train_list:
            outputs_iter = model(images_iter)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
The most interesting subset of the Torch profiler output is:
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
aten::batch_norm 0.02% 644.000us 4.57% 134.217ms 286.177us 469
Self CPU time total: 2.937s
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
aten::batch_norm 70.48% 6.888s 70.62% 6.902s 14.717ms 469
Self CPU time total: 9.773s
Seems like aten::batch_norm (batch normalization) is taking significantly more time in the case where samples are not preloaded to a list, but I can't figure out why, since it should be the same operation.
The above was tested on a 4-core CPU with Python 3.8.
If anything, the version that preloads to a list should be slightly slower overall, due to the overhead of creating the list.
With
torch=1.10.2+cu102
torchvision=0.11.3+cu102
I had the following results:
Self CPU time total: 2.475s
Self CPU time total: 2.800s
Try to reproduce this code again using different library versions.
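A minimal timing harness for that kind of reproduction (my sketch, not the answerer's code; the torch.set_num_threads call is my own suggestion for ruling out thread oversubscription on a 4-core machine, not something from the answer):
import time
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T

print("torch", torch.__version__, "torchvision", torchvision.__version__)

# assumption: 4 DataLoader workers plus PyTorch's intra-op threads may
# oversubscribe a 4-core CPU, so pinning the thread count is worth testing
torch.set_num_threads(2)

mnist = torchvision.datasets.MNIST(root=".", train=True, transform=T.ToTensor(), download=True)
loader = torch.utils.data.DataLoader(mnist, batch_size=128, shuffle=False, num_workers=4)

model = nn.Sequential(nn.Flatten(), nn.Linear(28*28, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Linear(256, 10))
model.train()

def time_pass(batches):
    # wall-clock time for one forward pass over an iterable of (images, labels)
    start = time.perf_counter()
    for images, labels in batches:
        model(images)
    return time.perf_counter() - start

print("iterating loader :", time_pass(loader))
print("preloaded list   :", time_pass(list(loader)))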

AWK loops test on local vm

I'm trying to understand this difference of about 5 seconds between the 1,000-loop and 10,000-loop runs of the script below, which is extracted from a gawk article. I would appreciate your help.
#!/usr/bin/gawk -f
# how long does it take to do a few loops?
BEGIN {
    LOOPS=100;
    # do the test twice
    start=systime();
    for (i=0;i<LOOPS;i++) {
    }
    end = systime();
    # calculate how long it takes to do a dummy test
    do_nothing = end-start;
    # now do the test again with the *IMPORTANT* code inside
    start=systime();
    for (i=0;i<LOOPS;i++) {
        # How long does this take?
        while ("date" | getline) {
            date = $0;
        }
        close("date");
    }
    end = systime();
    newtime = (end - start) - do_nothing;
    if (newtime <= 0) {
        printf("%d loops were not enough to test, increase it\n", LOOPS);
        exit;
    } else {
        printf("%d loops took %6.4f seconds to execute\n", LOOPS, newtime);
        printf("That's %10.8f seconds per loop\n", (newtime)/LOOPS);
        # since the clock has an accuracy of +/- one second, what is the error
        printf("accuracy of this measurement = %6.2f%%\n", (1/(newtime))*100);
    }
    exit;
}
[root@krislasnetwork ~]# ./testSpeed.sh
1000 loops took 2.0000 seconds to execute
That's 0.00200000 seconds per loop
accuracy of this measurement = 50.00%
[root@krislasnetwork ~]# vi testSpeed.sh # changing loops from 1000 to 10000
[root@krislasnetwork ~]# ./testSpeed.sh
10000 loops took 15.0000 seconds to execute
That's 0.00150000 seconds per loop
accuracy of this measurement = 6.67%
It appears that each loop during the 10,000 loops is executed faster than each loop during the 1,000. Can you help me understand this behavior?
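Note that systime() only has one-second resolution, which is what the script's own accuracy printout is accounting for. If your gawk was built with the bundled dynamic "time" extension (an assumption; not every build ships it), a sub-second variant of the timing loop could look roughly like this:
#!/usr/bin/gawk -f
# sketch only: requires gawk's dynamic "time" extension
@load "time"
BEGIN {
    LOOPS = 1000;
    start = gettimeofday();              # fractional seconds since the epoch
    for (i = 0; i < LOOPS; i++) {
        while ("date" | getline) {
            date = $0;
        }
        close("date");
    }
    elapsed = gettimeofday() - start;
    printf("%d loops took %.4f seconds (%.6f s/loop)\n",
           LOOPS, elapsed, elapsed / LOOPS);
}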

fast loading of large hash table in Perl

I have about 30 text files with the structure
wordleft1|wordright1
wordleft2|wordright2
wordleft3|wordright3
...
The total size of the files is about 1 GB with about 32 million lines of word combinations.
I tried a few approaches to load them as fast as possible and store the combinations within a hash
$hash{$wordleft} = $wordright
Opening the files one by one and reading them line by line takes about 42 seconds. I then store the hash with the Storable module:
store \%hash, $filename
Loading the data again
$hashref = retrieve $filename
reduces the time to about 28 seconds. I use a fast SSD drive and a fast CPU and have enough RAM to hold all the data (it takes about 7 GB).
I'm searching for a faster way to load this data into RAM (I can't keep it there for a few reasons).
You could try Dan Bernstein's CDB file format via a tied hash, which requires minimal code changes. You may need to install CDB_File. On my laptop, the cdb file opens very quickly and I can do about 200-250k lookups per second. Here is an example script to create/use/benchmark a cdb:
test_cdb.pl
#!/usr/bin/env perl
use warnings;
use strict;
use Benchmark qw(:all);
use CDB_File 'create';
use Time::HiRes qw( gettimeofday tv_interval );

scalar @ARGV or die "usage: $0 number_of_keys seconds_to_benchmark\n";
my ($size)    = $ARGV[0] || 1000;
my ($seconds) = $ARGV[1] || 10;
my $t0;
tic();

# Create CDB
my ($file, %data);
%data = map { $_ => 'something' } (1..$size);
print "Created $size element hash in memory\n";
toc();

$file = 'data.cdb';
create %data, $file, "$file.$$";
my $bytes = -s $file;
print "Created data.cdb [ $size keys and values, $bytes bytes]\n";
toc();

# Read from CDB
my $c = tie my %h, 'CDB_File', 'data.cdb' or die "tie failed: $!\n";
print "Opened data.cdb as a tied hash.\n";
toc();

timethese( -1 * $seconds, {
    'Pick Random Key'    => sub { int rand $size },
    'Fetch Random Value' => sub { $h{ int rand $size }; },
});

tic();
print "Fetching Every Value\n";
for (0..$size) {
    no warnings; # Useless use of hash element
    $h{ $_ };
}
toc();

sub tic {
    $t0 = [gettimeofday];
}

sub toc {
    my $t1 = [gettimeofday];
    my $elapsed = tv_interval ( $t0, $t1 );
    $t0 = $t1;
    print "==> took $elapsed seconds\n";
}
Output ( 1 million keys, tested over 10 seconds )
./test_cdb.pl 1000000 10
Created 1000000 element hash in memory
==> took 2.882813 seconds
Created data.cdb [ 1000000 keys and values, 38890944 bytes]
==> took 2.333624 seconds
Opened data.cdb as a tied hash.
==> took 0.00015 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 10 wallclock secs (10.46 usr + 0.01 sys = 10.47 CPU) @ 236984.72/s (n=2481230)
Pick Random Key: 9 wallclock secs (10.11 usr + 0.02 sys = 10.13 CPU) @ 3117208.98/s (n=31577327)
Fetching Every Value
==> took 3.514183 seconds
Output ( 10 million keys, tested over 10 seconds )
./test_cdb.pl 10000000 10
Created 10000000 element hash in memory
==> took 44.72331 seconds
Created data.cdb [ 10000000 keys and values, 398890945 bytes]
==> took 25.729652 seconds
Opened data.cdb as a tied hash.
==> took 0.000222 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 14 wallclock secs ( 9.65 usr + 0.35 sys = 10.00 CPU) @ 209811.20/s (n=2098112)
Pick Random Key: 12 wallclock secs (10.40 usr + 0.02 sys = 10.42 CPU) @ 2865335.22/s (n=29856793)
Fetching Every Value
==> took 38.274356 seconds
It sounds like you do have a good use case for wanting an in-memory Perl hash.
For faster storing/retrieving, I would recommend Sereal (Sereal::Encoder/Sereal::Decoder). If your disk storage is slow, you may even want to enable Snappy compression.
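A minimal sketch of that suggestion (my code, not the answerer's; the file name is made up, and compression options are left to the Sereal::Encoder documentation):
#!/usr/bin/env perl
# sketch only: store and reload the word hash with Sereal
use strict;
use warnings;
use Sereal::Encoder;
use Sereal::Decoder;

my %hash = (wordleft1 => 'wordright1', wordleft2 => 'wordright2');

# encode the whole hash into one blob; Snappy (or other) compression can be
# enabled via the constructor options described in the Sereal::Encoder docs
my $blob = Sereal::Encoder->new->encode(\%hash);

open my $out, '>:raw', 'words.srl' or die $!;
print {$out} $blob;
close $out;

# decode it back into a hash reference
open my $in, '<:raw', 'words.srl' or die $!;
my $data = do { local $/; <$in> };
close $in;

my $hashref = Sereal::Decoder->new->decode($data);
print $hashref->{wordleft1}, "\n";   # wordright1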

Rolling list over unequal times in XTS

I have stock data at the tick level and would like to create a rolling list of all ticks for the previous 10 seconds. The code below works, but takes a very long time for large amounts of data. I'd like to vectorize this process or otherwise make it faster, but I'm not coming up with anything. Any suggestions or nudges in the right direction would be appreciated.
library(quantmod)
set.seed(150)
# Create five minutes of xts example data at .1 second intervals
mins <- 5
ticks <- mins * 60 * 10 + 1
times <- xts(runif(seq_len(ticks),1,100), order.by=seq(as.POSIXct("1973-03-17 09:00:00"),
as.POSIXct("1973-03-17 09:05:00"), length = ticks))
# Randomly remove some ticks to create unequal intervals
times <- times[runif(seq_along(times))>.3]
# Number of seconds to look back
lookback <- 10
dist.list <- list(rep(NA, nrow(times)))
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
> user system elapsed
6.12 0.00 5.85
You should check out the window function; it will make your subselection of dates a lot easier. The following code uses lapply to do the work of the for loop.
# Your code
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
# user system elapsed
# 10.09 0.00 10.11
# My code
system.time(dist.list<-lapply(index(times),
function(x) window(times,start=x-lookback-1,end=x))
)
# user system elapsed
# 3.02 0.00 3.03
So, it takes about a third of the original time.
But, if you really want to speed things up, and you are willing to forgo millisecond accuracy (which I think your original method implicitly does), you could just run the loop on unique date-hour-second combinations, because they will all return the same time window. This should speed things up roughly twenty or thirty times:
dat.time=unique(as.POSIXct(as.character(index(times)))) # Cheesy method to drop the ms.
system.time(dist.list.2<-lapply(dat.time,function(x) window(times,start=x-lookback-1,end=x)))
# user system elapsed
# 0.37 0.00 0.39

Is it faster to give Perl's print a list or a concatenated string?

option A:
print $fh $hr->{'something'}, "|", $hr->{'somethingelse'}, "\n";
option B:
print $fh $hr->{'something'} . "|" . $hr->{'somethingelse'} . "\n";
The answer is simple: it doesn't matter. As many folks have pointed out, this is not going to be your program's bottleneck. Optimizing this to even happen instantly is unlikely to have any effect on your performance. You must profile first; otherwise you are just guessing and wasting your time.
If we are going to waste time on it, let's at least do it right. Below is the code to do a realistic benchmark. It actually does the print and sends the benchmarking information to STDERR. You run it as perl benchmark.plx > /dev/null to keep the output from flooding your screen.
Here's 5 million iterations writing to STDOUT. By using both timethese() and cmpthese() we get all the benchmarking data.
$ perl ~/tmp/bench.plx 5000000 > /dev/null
Benchmark: timing 5000000 iterations of concat, list...
concat: 3 wallclock secs ( 3.84 usr + 0.12 sys = 3.96 CPU) @ 1262626.26/s (n=5000000)
list: 4 wallclock secs ( 3.57 usr + 0.12 sys = 3.69 CPU) @ 1355013.55/s (n=5000000)
Rate concat list
concat 1262626/s -- -7%
list 1355014/s 7% --
And here's 5 million writing to a temp file
$ perl ~/tmp/bench.plx 5000000
Benchmark: timing 5000000 iterations of concat, list...
concat: 6 wallclock secs ( 3.94 usr + 1.05 sys = 4.99 CPU) @ 1002004.01/s (n=5000000)
list: 7 wallclock secs ( 3.64 usr + 1.06 sys = 4.70 CPU) @ 1063829.79/s (n=5000000)
Rate concat list
concat 1002004/s -- -6%
list 1063830/s 6% --
Note the extra wallclock and sys time underscoring how what you're printing to matters as much as what you're printing.
The list version is about 5% faster (note this is counter to Pavel's logic underlining the futility of trying to just think this stuff through). You said you're doing tens of thousands of these? Let's see... 100k takes 146ms of wallclock time on my laptop (which has crappy I/O) so the best you can do here is to shave off about 7ms. Congratulations. If you spent even a minute thinking about this it will take you 40k iterations of that code before you've made up that time. This is not to mention the opportunity cost, in that minute you could have been optimizing something far more important.
Now, somebody's going to say "now that we know which way is faster we should write it the fast way and save that time in every program we write making the whole exercise worthwhile!" No. It will still add up to an insignificant portion of your program's run time, far less than the 5% you get measuring a single statement. Second, logic like that causes you to prioritize micro-optimizations over maintainability.
Oh, and it's different in 5.8.8 than in 5.10.0.
$ perl5.8.8 ~/tmp/bench.plx 5000000 > /dev/null
Benchmark: timing 5000000 iterations of concat, list...
concat: 3 wallclock secs ( 3.69 usr + 0.04 sys = 3.73 CPU) @ 1340482.57/s (n=5000000)
list: 5 wallclock secs ( 3.97 usr + 0.06 sys = 4.03 CPU) @ 1240694.79/s (n=5000000)
Rate list concat
list 1240695/s -- -7%
concat 1340483/s 8% --
It might even change depending on which Perl I/O layer you're using and on the operating system. So the whole exercise is futile.
Micro-optimization is a fool's game. Always profile first and look to optimizing your algorithm. Devel::NYTProf is an excellent profiler.
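(For reference, a typical Devel::NYTProf run looks roughly like this, assuming the module is installed; the script name is just a placeholder:)
$ perl -d:NYTProf yourscript.pl
$ nytprofhtml    # writes an HTML report under ./nytprof/
The benchmark script mentioned above (bench.plx) follows.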
#!/usr/bin/perl -w
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

#open my $fh, ">", "/tmp/test.out" or die $!;
#open my $fh, ">", "/dev/null" or die $!;
my $fh = *STDOUT;

my $hash = {
    foo => "something and stuff",
    bar => "and some other stuff"
};

select *STDERR;

my $r = timethese(shift || -3, {
    list => sub {
        print $fh $hash->{foo}, "|", $hash->{bar};
    },
    concat => sub {
        print $fh $hash->{foo}. "|". $hash->{bar};
    },
});
cmpthese($r);
Unless you are executing millions of these statements, the performance difference will not matter. I really suggest concentrating on performance problems where they do exist - and the only way to find that out is to profile your application.
Premature optimization is something that Joel and Jeff had a podcast on, and whined about, for years. It's just a waste of time to try to optimize something until you KNOW that it's slow.
Perl is a high-level language, and as such the statements you see in the source code don't map directly to what the computer is actually going to do. You might find that a particular implementation of perl makes one thing faster than the other, but that's no guarantee that another implementation might take away the advantage (although they try not to make things slower).
If you're worried about I/O speed, there are a lot more interesting and useful things to tweak before you start worrying about commas and periods. See, for instance, the discussion under Perl write speed mystery.
UPDATE:
I just ran my own test.
1,000,000 iterations of each version each took < 1 second.
10 million iterations of each version took an average of 2.35 seconds for the list version vs. 2.1 seconds for the string-concat version.
Have you actually tried profiling this? Only takes a few seconds.
On my machine, it appears that B is faster. However, you should really have a look at Pareto Analysis. You've already wasted far, far more time thinking about this question than you'd ever save in any program run. For problems as trivial as this (character substitution!), you should wait to care until you actually have a problem.
Of the three options, I would probably choose string interpolation first and switch to commas for expressions that cannot be interpolated. This, humorously enough, means that my default choice is the slowest of the bunch, but given that they are all so close to each other in speed and that disk speed is probably going to be slower than anything else, I don't believe changing the method has any real performance benefits.
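For reference, the interpolated form of the example from the question (my sketch, not from the original post) would be:
print $fh "$hr->{something}|$hr->{somethingelse}\n";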
As others have said, write the code, then profile the code, then examine the algorithms and data structures you have chosen that are in the slow parts of the code, and, finally, look at the implementation of the algorithms and data structures. Anything else is foolish micro-optimizing that wastes more time than it saves.
You may also want to read perldoc perlperf
Rate string concat comma
string 803887/s -- -0% -7%
concat 803888/s 0% -- -7%
comma 865570/s 8% 8% --
#!/usr/bin/perl
use strict;
use warnings;
use Carp;
use List::Util qw/first/;
use Benchmark;

sub benchmark {
    my $subs = shift;
    my ($k, $sub) = each %$subs;
    my $value = $sub->();
    croak "bad" if first { $value ne $_->() and print "$value\n", $_->(), "\n" } values %$subs;
    Benchmark::cmpthese -1, $subs;
}

sub fake_print {
    # this, plus actually writing the output to the screen, is what print does
    no warnings;
    my $output = join $,, @_;
    return $output;
}

my ($x, $y) = ("a", "b");

benchmark {
    comma  => sub { return fake_print $x, "|", $y, "\n" },
    concat => sub { return fake_print $x . "|" . $y . "\n" },
    string => sub { return fake_print "$x|$y\n" },
};

Resources