Averaging runtimes for performance analysis

So modern computers and OSes are complicated, and there is a lot going on that makes runtimes hard to predict and hard to reproduce: schedulers, branch predictors, caches, prefetchers, and so on. I don't understand these mechanisms in detail, but I thought I understood the implication: running a program once isn't enough.
Luckily, perf stat provides a --repeat option and even gives you rudimentary statistics. So to test this, I took this program:
#include <stdio.h>

int main(int argc, char *argv[])
{
    puts("Hello, World!");
    return 0;
}
compiled it with gcc -O2 hello.c -o hello, and ran it with perf stat -r 100 ./hello. This gives me nice output like this:
0,00043149 +- 0,00000688 seconds time elapsed ( +- 1,59% )
However, if I now run this whole thing again a couple of times, the average runtime can be far away from the previous run:
0,00043149 +- 0,00000688 seconds time elapsed ( +- 1,59% )
0,00043222 +- 0,00000657 seconds time elapsed ( +- 1,52% )
0,00041690 +- 0,00000612 seconds time elapsed ( +- 1,47% )
0,00045048 +- 0,00000832 seconds time elapsed ( +- 1,85% )
0,0005051 +- 0,0000232 seconds time elapsed ( +- 4,60% )
0,00043595 +- 0,00000676 seconds time elapsed ( +- 1,55% )
0,0004271 +- 0,0000168 seconds time elapsed ( +- 3,94% )
0,00043166 +- 0,00000604 seconds time elapsed ( +- 1,40% )
0,0010521 +- 0,0000548 seconds time elapsed ( +- 5,21% )
0,00042799 +- 0,00000714 seconds time elapsed ( +- 1,67% )
Here the relative deviation of the averages is 37%, largely caused by the second-to-last outlier. But even if I discount that run, it's still 5.5%, much larger than the deviation reported for any single run.
So what is happening here? Why doesn't averaging work (in this case)? What should I be doing?
Edit: This also happens when the frequency scaling is disabled (sudo cpupower frequency-set --governor performance), but the outliers seem less frequent.
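For reference, runs like the ten above can be collected with a simple shell loop (a sketch; perf stat prints its summary to stderr, hence the redirect):
for i in $(seq 1 10); do
    perf stat -r 100 ./hello 2>&1 | grep 'time elapsed'
done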

Related

What part does priority play in round robin scheduling?

I am trying to solve the following homework problem for an operating systems class:
The following processes are being scheduled using a preemptive, round robin scheduling algorithm. Each process is assigned a numerical priority, with a higher number indicating a higher relative priority.
In addition to the processes listed below, the system also has an idle task (which consumes no CPU resources and is identified as Pidle ). This task has priority 0 and is scheduled whenever the system has no other available processes to run.
The length of a time quantum is 10 units.
If a process is preempted by a higher-priority process, the preempted process is placed at the end of the queue.
+--+--------+----------+-------+---------+
| | Thread | Priority | Burst | Arrival |
+--+--------+----------+-------+---------+
| | P1 | 40 | 15 | 0 |
| | P2 | 30 | 25 | 25 |
| | P3 | 30 | 20 | 30 |
| | P4 | 35 | 15 | 50 |
| | P5 | 5 | 15 | 100 |
| | P6 | 10 | 10 | 105 |
+--+--------+----------+-------+---------+
a. Show the scheduling order of the processes using a Gantt chart.
b. What is the turnaround time for each process?
c. What is the waiting time for each process?
d. What is the CPU utilization rate?
My question is --- what role does priority play when we're considering that this uses the round robin algorithm? I have been thinking about it a lot, and what I have come up with is that priority only makes sense if it is considered at the time of a process's arrival, to decide whether it should preempt the currently running process. The reason I have concluded this is that if priority were checked at every context switch, the process with the highest priority would always run indefinitely and the other processes would starve. That goes against the idea of round robin ensuring that no process executes for longer than one time quantum, and that after a process executes it goes to the end of the queue.
Using this logic I have worked out the problem as such:
Could you please advise me whether I'm on the right track about the role priority plays in this situation and whether I'm approaching it the right way?
I think you are on the wrong track. Round robin controls the run order within a priority. It is as if each priority has its own queue with its own round-robin scheduler. Only when a given priority's queue is empty are the lower-priority queues considered; eventually, the scheduler reaches idle.
If you didn't process it this way, how would you prevent idle from eventually being scheduled despite actual work being ready to go?
Applied to the table above: P2 and P3 (both priority 30) round-robin with each other in 10-unit quanta, while P4 (priority 35) preempts whichever of them is running when it arrives at time 50.
Most high-priority processes are reactive, that is, they execute for a short burst in response to an event, so for the most part they are not on a run/ready queue.
In code:
void Next() {
    /* Scan from the highest priority queue down; resume the first ready process. */
    for (int i = PRIO_HI; i >= PRIO_LO; i--) {
        Proc *p;
        if ((p = prioq[i].head) != NULL) {
            Resume(p);
            /* NOTREACHED */
        }
    }
    panic("Idle not on runq!");
}

void Stop() {
    /* Current process blocks: remove it from its queue and pick the next one. */
    unlink(prioq + curp->prio, curp);
    Next();
}

void Start(Proc *p) {
    /* Make p runnable: give it a fresh quantum and queue it at the back. */
    p->countdown = p->reload;
    append(prioq + p->prio, p);
    Next();
}

void Tick() {
    /* Quantum expired: rotate the current process to the back of its queue. */
    if (--(curp->countdown) == 0) {
        unlink(prioq + curp->prio, curp);
        Start(curp);
    }
}

Apache Pig: FLATTEN and parallel execution of reducers

I have implemented an Apache Pig script. When I execute the script, one specific step results in many mappers but only one reducer. Because of this condition (many mappers, one reducer), the Hadoop cluster is almost idle while the single reducer executes. In order to make better use of the cluster's resources, I would like to have many reducers running in parallel as well.
Even if I set the parallelism in the Pig script using the SET DEFAULT_PARALLEL command, I still end up with only one reducer.
The code causing the problem is the following:
SET DEFAULT_PARALLEL 5;
inputData = LOAD 'input_data.txt' AS (group_name:chararray, item:int);
inputDataGrouped = GROUP inputData BY (group_name);
-- The GeneratePairsUDF generates a bag containing pairs of integers, e.g. {(1, 5), (1, 8), ..., (8, 5)}
pairs = FOREACH inputDataGrouped GENERATE GeneratePairsUDF(inputData.item) AS pairs_bag;
pairsFlat = FOREACH pairs GENERATE FLATTEN(pairs_bag) AS (item1:int, item2:int);
The 'inputData' and 'inputDataGrouped' aliases are computed in the mappers; 'pairs' and 'pairsFlat' are computed in the reducer.
If I change the script by removing the line with the FLATTEN command (pairsFlat = FOREACH pairs GENERATE FLATTEN(pairs_bag) AS (item1:int, item2:int);), then the execution uses 5 reducers (and thus runs in parallel).
It seems that the FLATTEN command is the problem and prevents multiple reducers from being created.
How can I achieve the same result as FLATTEN while having the script execute in parallel (with many reducers)?
Edit:
EXPLAIN plan when having two FOREACH (as above):
Map Plan
inputDataGrouped: Local Rearrange[tuple]{chararray}(false) - scope-32
| |
| Project[chararray][0] - scope-33
|
|---inputData: New For Each(false,false)[bag] - scope-29
| |
| Cast[chararray] - scope-24
| |
| |---Project[bytearray][0] - scope-23
| |
| Cast[int] - scope-27
| |
| |---Project[bytearray][1] - scope-26
|
|---inputData: Load(file:///input_data.txt:org.apache.pig.builtin.PigStorage) - scope-22--------
Reduce Plan
pairsFlat: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-42
|
|---pairsFlat: New For Each(true)[bag] - scope-41
| |
| Project[bag][0] - scope-39
|
|---pairs: New For Each(false)[bag] - scope-38
| |
| POUserFunc(GeneratePairsUDF)[bag] - scope-36
| |
| |---Project[bag][1] - scope-35
| |
| |---Project[bag][1] - scope-34
|
|---inputDataGrouped: Package[tuple]{chararray} - scope-31--------
Global sort: false
EXPLAIN plan when having only one FOREACH with FLATTEN wrapping the UDF:
Map Plan
inputDataGrouped: Local Rearrange[tuple]{chararray}(false) - scope-29
| |
| Project[chararray][0] - scope-30
|
|---inputData: New For Each(false,false)[bag] - scope-26
| |
| Cast[chararray] - scope-21
| |
| |---Project[bytearray][0] - scope-20
| |
| Cast[int] - scope-24
| |
| |---Project[bytearray][1] - scope-23
|
|---inputData: Load(file:///input_data.txt:org.apache.pig.builtin.PigStorage) - scope-19--------
Reduce Plan
pairs: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---pairs: New For Each(true)[bag] - scope-35
| |
| POUserFunc(GeneratePairsUDF)[bag] - scope-33
| |
| |---Project[bag][1] - scope-32
| |
| |---Project[bag][1] - scope-31
|
|---inputDataGrouped: Package[tuple]{chararray} - scope-28--------
Global sort: false
There is no guarantee that Pig uses the DEFAULT_PARALLEL value for every step in the script. Try PARALLEL on the specific join/group step that you feel is taking the time (in your case, the GROUP step):
inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 67;
If it still does not work, then you might have to look at your data for a skew issue.
I think there is skew in the data. Only a small number of mappers are producing exponentially large output. Look at the distribution of keys in your data; for example, the data may contain a few groups with a very large number of records.
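A quick way to check this, reusing the aliases from the question (a sketch; COUNT is the standard Pig builtin):
keyCounts = FOREACH inputDataGrouped GENERATE group AS group_name, COUNT(inputData) AS n;
keyCountsSorted = ORDER keyCounts BY n DESC;
DUMP keyCountsSorted;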
I tried "set default parallel" and "PARALLEL 100" but no luck. Pig still uses 1 reducer.
It turned out I have to generate a random number from 1 to 100 for each record and group these records by that random number.
We are wasting time on grouping, but it is much faster for me because now I can use more reducers.
Here is the code (SUBMITTER is my own UDF):
tmpRecord = FOREACH record GENERATE (int)(RANDOM()*100.0) as rnd, data;
groupTmpRecord = GROUP tmpRecord BY rnd;
result = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));
To answer your question, we must first know how many reducers Pig enforces to accomplish the Global Rearrange process, because as per my understanding, a GENERATE / projection should not require a single reducer. I cannot say the same thing about FLATTEN. However, we know from common sense that during a flatten the aim is to un-nest the tuples from bags (and vice versa), and to do that, all the tuples belonging to a bag should definitely be available in the same reducer. I might be wrong, but can anyone add something here to get this user an answer, please?

Ruby subtracting two times giving incorrect answer

I am trying to time how long a method takes to execute, so I record the start time and then at the end subtract it from the current time which should give me the difference in seconds. I get back 123 seconds when it actually took over 10 minutes to run.
def perform_cluster_analysis
  start = Time.now
  # A whole lot of tasks performed here
  puts 'time taken: '
  puts(Time.now - start)
end
The output I get is:
time taken:
123.395808311
But when timed with a stopwatch it actually took over 10 minutes, so why am I getting back 123 seconds instead of roughly 600 (10 minutes)?
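For what it's worth, one way to rule out wall-clock adjustments (NTP syncs, manual clock changes) while timing is to use Ruby's monotonic clock instead of Time.now; a minimal sketch of the same method with that substitution:
def perform_cluster_analysis
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  # A whole lot of tasks performed here
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  puts "time taken: #{elapsed} seconds"
end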

Matlab structure dismal performance when used with objects

Here is a simple class with two properties: PStruct is a property that will contain a structure.
classdef anobj < handle
    properties
        PStruct
        PNum = 1;
    end
    methods
        function obj = anobj()
        end
    end
end
Here is a script filling the structure in an object with 1’s (pretty fast):
clear all
a = anobj(); % an object
b = anobj(); % another object for future use
ntrials = 10; niterations = 1000;
a.PStruct(ntrials,niterations).field1 = 0; % 'initialize' the struct array
for t = 1:ntrials
    tic;
    for i = 1:niterations
        a.PStruct(t,i).field1 = 1; % store data
    end
    toc;
end
yielding:
Elapsed time is 0.001008 seconds.
Elapsed time is 0.000967 seconds.
Elapsed time is 0.000972 seconds.
Elapsed time is 0.001206 seconds.
Elapsed time is 0.000992 seconds.
Elapsed time is 0.000981 seconds.
Elapsed time is 0.000975 seconds.
Elapsed time is 0.001072 seconds.
Elapsed time is 0.000951 seconds.
Elapsed time is 0.000994 seconds.
When I instead use a property of another object (also equal to 1), changing the line within the loops to:
a.PStruct(t,i).field1=b.PNum; % store data
I get:
Elapsed time is 0.112418 seconds.
Elapsed time is 0.107359 seconds.
Elapsed time is 0.118347 seconds.
Elapsed time is 0.127111 seconds.
Elapsed time is 0.138606 seconds.
Elapsed time is 0.152675 seconds.
Elapsed time is 0.162610 seconds.
Elapsed time is 0.172921 seconds.
Elapsed time is 0.184254 seconds.
Elapsed time is 0.190802 seconds.
Not only is performance orders of magnitude slower, but there is also a very clear trend (verified more generally) of slowing down with each trial. I don't get it. Furthermore, if I instead use a standalone uninitialized struct array that is not an object property (this line replaces the one within the loops):
PStruct(t,i).field1 = b.PNum; % store data
I get OK performance with no trend:
Elapsed time is 0.007143 seconds.
Elapsed time is 0.004208 seconds.
Elapsed time is 0.004312 seconds.
Elapsed time is 0.004382 seconds.
Elapsed time is 0.004302 seconds.
Elapsed time is 0.004545 seconds.
Elapsed time is 0.004499 seconds.
Elapsed time is 0.005840 seconds.
Elapsed time is 0.004210 seconds.
Elapsed time is 0.004177 seconds.
There is some weird interaction between struct arrays and objects. Does anybody know what is happening and how to fix this? Thanks.
Very strange.
I found that if you do either of the following, the code returns to normal speed:
c = b.PNum;
a.PStruct(t,i).field1 = c; % store data
or
a.PStruct(t,i).field1 = int32(b.PNum); % store data
but if you use double, the code is still slow:
a.PStruct(t,i).field1 = double(b.PNum); % store data
and if you use both 'fast' methods at the same time:
c = b.PNum;
a.PStruct(t,i).field1 = c; % store data
a.PStruct(t,i).field1 = int32(b.PNum); % store data
the slow speed returns.
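Putting the first workaround in the context of the original benchmark, the trial loop becomes (a sketch; here the property read is also hoisted out of the inner loop):
for t = 1:ntrials
    tic;
    c = b.PNum; % read the object property once per trial, outside the hot loop
    for i = 1:niterations
        a.PStruct(t,i).field1 = c; % store the plain local copy
    end
    toc;
end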

Getting surprising elapsed time in windows and linux

I have written a function which is platform independent and works nicely on Windows as well as Linux. I wanted to check the execution time of that function. I am using QueryPerformanceCounter to calculate the execution time on Windows and gettimeofday on Linux.
The problem is that on Windows the execution time is 60 milliseconds while on Linux it shows 4 ms. That's a huge difference between them. Can anybody suggest what might have gone wrong? Or, if anybody knows other APIs better suited than these for calculating elapsed time, please let me know.
Here is the code I have written using gettimeofday:
#include <sys/time.h>
#include <iostream>
using namespace std;

int main()
{
    timeval start_time;
    timeval end_time;
    gettimeofday(&start_time, NULL);
    function_invoke(........);
    gettimeofday(&end_time, NULL);
    timeval res;
    // timersub(a, b, res) computes a - b, so the later time must come first
    timersub(&end_time, &start_time, &res);
    cout << "function_invoke took seconds = " << res.tv_sec << endl;
    cout << "function_invoke took microsec = " << res.tv_usec << endl;
    return 0;
}
OUTPUT :
function_invoke took seconds = 0
function_invoke took microsec = 4673 ( 4.673 milliseconds )
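The Windows-side code is not shown in the question; for comparison, here is a minimal sketch of how QueryPerformanceCounter is typically used for this kind of measurement (function_invoke's arguments are elided here just as in the original):
#include <windows.h>
#include <iostream>
using namespace std;

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq); // ticks per second
    QueryPerformanceCounter(&t0);
    function_invoke(........);
    QueryPerformanceCounter(&t1);
    double ms = (t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart;
    cout << "function_invoke took milliseconds = " << ms << endl;
    return 0;
}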
