How can I calculate the estimated completion time of both processes? - algorithm

A certain computer system runs in a multi-programming environment using a non-preemptive
algorithm. In this system, two processes A and B are stored in the process queue,
and A has a higher priority than B. The table below shows estimated execution time for each
process; for example, process A uses CPU, I/O, and then CPU sequentially for 30, 60, and 30
milliseconds respectively. Which of the following is the estimated time in milliseconds
to complete both A and B? Here, the multi-processing overhead of OS is negligibly
small. In addition, both CPU and I/O operations can be executed concurrently, but I/O
operations for A and B cannot be performed in parallel.
Unit: milliseconds

Process   CPU   I/O   CPU
A          30    60    30
B          45    45    --
Please help me. I need to explain this in front of the class tomorrow, but I can't seem to get the idea of it.

A has the highest priority, but since the system is non-preemptive, this is only a tiebreaker when both processes need a resource at the same time.
At t=0, A gets the CPU for 30 ms, B waits as it needs the CPU.
At t=30, A releases the CPU, B gets the CPU for 45 ms, while A gets the I/O for 60 ms.
At t=75, the CPU sits idle as B is waiting for A to finish I/O, and A is not ready to use the CPU.
At t=90, A releases I/O and gets the CPU for another 30 ms, while B gets the I/O for 45 ms.
At t=120, A releases the CPU and is finished.
At t=135, B releases I/O and is finished.
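The timeline above can be reproduced with a small simulation. This is only a sketch under my own assumptions: the function name `simulate` and the greedy rule (always start the burst that can begin earliest, with ties going to the higher-priority process) are my reading of the problem statement, not something given in the question.

```python
def simulate(processes):
    """Schedule bursts on exclusive CPU and I/O resources without preemption.

    `processes` is a list of (name, bursts) in priority order, highest first.
    Each burst is a (resource, duration_ms) pair.
    """
    bursts_of = dict(processes)
    free_at = {"CPU": 0, "IO": 0}                    # when each resource is next free
    ready_at = {name: 0 for name, _ in processes}    # when each process can start its next burst
    next_burst = {name: 0 for name, _ in processes}  # index of each process's next burst
    finish = {}
    timeline = []

    while len(finish) < len(processes):
        best = None
        for priority, (name, bursts) in enumerate(processes):
            if name in finish:
                continue
            resource, duration = bursts[next_burst[name]]
            start = max(ready_at[name], free_at[resource])
            # Earliest start wins; priority only breaks ties (assumed rule).
            candidate = (start, priority, name, resource, duration)
            if best is None or candidate < best:
                best = candidate
        start, _, name, resource, duration = best
        end = start + duration
        timeline.append((name, resource, start, end))
        free_at[resource] = end
        ready_at[name] = end
        next_burst[name] += 1
        if next_burst[name] == len(bursts_of[name]):
            finish[name] = end
    return timeline, finish


processes = [
    ("A", [("CPU", 30), ("IO", 60), ("CPU", 30)]),  # higher priority
    ("B", [("CPU", 45), ("IO", 45)]),
]
timeline, finish = simulate(processes)
for name, resource, start, end in timeline:
    print(f"{name} {resource:3} {start:3} -> {end:3}")
print("both finished at", max(finish.values()), "ms")  # 135
```

With the table's numbers, A finishes at 120 ms and B at 135 ms, which matches the step-by-step walk-through.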

It takes the longest path:
Non-preemptive multitasking, or cooperative multitasking, means that the processes share a resource such as CPU time among themselves. In the worst case they need the longest time at each step to finish their tasks.
CPU:
B = 45 is longer than A=30
45 +
I/O
A = 60 and B = 45
45 + 60
CPU again:
A = 30
45 + 60 + 30 = 135

I will explain in brief; please elaborate for your classroom discussion.
The answer is 135.
While process A waits for its I/O task, the CPU time is given to process B. So the total time to complete processes A and B is
Process A CPU + Process A I/O (with Process B CPU running during it) + Process B I/O
30 + 60 + 45 = 135 ms

Related

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded the first half of the process, in the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(n // (processes**2) + 1) instead, I get times of around 0.355 ms and the CPU load is much better distributed. It just has some small dips down to ca. 75%, but stays high for a much longer part of the run before it goes down to 25%.
Is there an even better formula to calculate the chunk size, or an otherwise better method to use the CPU most effectively? Please help me improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into details, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as a multiprocessing.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So, extremely pessimistically, let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
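To make that arithmetic concrete, here is a minimal sketch. The helper name `chunksize_for` and its parameters are made up for illustration; the 100 μs IPC cost and the 1% overhead target are just the pessimistic estimates from above, not measured values.

```python
import math


def chunksize_for(task_us, ipc_us=100, max_overhead_percent=1):
    """Smallest chunksize that keeps the assumed IPC cost under the overhead target.

    task_us: average processing time of one task, in microseconds.
    ipc_us: assumed worst-case cost of one IPC request, in microseconds.
    max_overhead_percent: acceptable IPC overhead relative to chunk processing time.
    """
    # A chunk must run at least this long for one IPC request to stay under the target.
    target_chunk_us = ipc_us * 100 / max_overhead_percent
    return max(1, math.ceil(target_chunk_us / task_us))


print(chunksize_for(1))    # 1 us per task -> 10000
print(chunksize_for(355))  # 0.355 ms per task, as in the question -> 29
```

The result is only a starting point for measurements; pass it as the chunksize argument of imap_unordered and compare timings against neighbouring values.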
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing time to completion. I suppose in most cases a delay of 10 ms is not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting the workers' capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again targeting a processing time of ~10 ms seems safe (assuming you don't mind a startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.

Parallel-ForkManager, DBI. Faster than before forking, but still too slow

I have a very simple task of updating a database.
my $pm = Parallel::ForkManager->new(15);
for my $line (@lines) {
    my $pid = $pm->start and next;    # parent moves on to the next line
    my $dbh2 = $dbh->clone();         # each child uses its own connection
    my $sth2 = $dbh2->prepare("update db1 set field1=? where field2 =?");
    my ($field1, $field2) = very_slow_subroutine();
    $sth2->execute($field1, $field2);
    $pm->finish;
}
$pm->wait_all_children;
I could just use $dbh2->do, but I doubt that is the reason for the slowness.
What is interesting is that it starts these 15 processes (or however many I specify) very quickly, but right after that it slows down drastically. It is still noticeably faster than without forking, but I would expect more...
Edit:
The very_slow_subroutine is a sub which gets an answer from a web service. The service can take anywhere from a fraction of a second to several seconds (on timeout) to answer. I have to ask tens of thousands of times, which is why I would like to fork.
And if this matters -- I am on Linux.
Parallel::ForkManager doesn't magically make things faster, it just lets you run your code multiple times at the same time. In order to get the benefit out of it, you have to design your code for parallelism.
Think of it this way. It takes you 10 minutes to get to the store, shop, load your car, come back, and unload it. You need to get 5 loads. You alone can do it in 50 minutes. That is working in serial. 10 minutes * 5 trips one after the other = 50 minutes.
Let's say you get four friends to help. You all start off for the store at the same time. There's still 5 trips, and they still take 10 minutes, but because you did it in parallel the total time is only 10 minutes.
But it will never take less than 10 minutes, no matter how many trips you have to make or how many friends you get to help. That is why the process starts up fast, everybody gets into their cars and drives off to the store, but then nothing happens for a while because it still takes 10 minutes for everyone to do their job.
Same thing here. Your loop body takes X time to run. If you iterate through it Y times, it will take X * Y real world human time to run. If you run it in parallel Y times, ideally it will take just X time to run. Each parallel worker must still execute the full body of the loop taking X time.
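As a rough sketch of that reasoning (in Python rather than Perl, and ignoring fork and scheduling overhead entirely, which is an idealization), the best-case wall time is the number of "rounds" of workers times X. The function name `ideal_wall_time` is made up for illustration.

```python
import math


def ideal_wall_time(iterations, time_per_iteration, workers):
    """Best-case wall time when `workers` run loop bodies in parallel.

    Assumes every iteration takes the same time and that starting
    and scheduling workers costs nothing.
    """
    rounds = math.ceil(iterations / workers)  # each worker still runs a full loop body
    return rounds * time_per_iteration


print(ideal_wall_time(5, 10, 1))   # alone: 50 minutes
print(ideal_wall_time(5, 10, 5))   # you plus four friends: 10 minutes
print(ideal_wall_time(5, 10, 50))  # more helpers than trips: still 10 minutes
```

No number of workers pushes the result below the time of a single iteration, which is the point of the store analogy.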
In order to speed things up further, you have to break up the big bottleneck of very_slow_subroutine and make that work in parallel. Your SQL is so simple that is where you should focus your efforts at optimization and parallelism.
Let's say the store is really close, it's only a 1 minute drive (this is your SQL UPDATE), but shopping, loading and unloading takes 9 minutes (this is very_slow_subroutine). What if instead you have 5 cars and 15 friends. You load 3 people into each car. Driving to and from the store will take the same time, but now three people are working together to do the shopping, loading and unloading taking only 4 minutes. Now each trip takes 5 minutes instead of 10.
This represents redesigning very_slow_subroutine to do its work in parallel. If it's just a big loop, you can put more workers on that loop. If it's a series of slow operations, you will have to redesign it to take advantage of parallel execution.
If you use too many workers you can clog up the system; it depends on what the bottleneck is. If it's CPU bound and you have 2 CPU cores, you'll probably see performance gains up to 3 to 5 workers ((cores * 2)+1 is a good rule of thumb), and after that performance will drop off as the CPU spends more time switching between processes than doing work. If the bottleneck is IO, or an external service as is often the case with database and network calls, you can see great efficiencies throwing many workers at the problem. While one process is waiting around for a disk or network operation, the others can be using your CPU.
Whether parallelism can help depends on where your bottleneck is. If your CPU with 4 cores is the bottleneck, forking 4 processes might cause things to complete in about 1/4th of the time in the best-case scenario, but spawning 15 processes is not going to improve things much more.
If, more likely, your bottleneck is in I/O, starting 15 processes that compete for the same I/O is not going to help much, although in cases where you have tons of memory to use as file cache, some improvement might be possible.
To explore the limits on your system, consider the following program:
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

run(@ARGV);

sub run {
    # Number of parallel workers; defaults to 2 when no argument is given
    my $count = @_ ? $_[0] : 2;
    my $pm = Parallel::ForkManager->new($count);
    for (1 .. 20) {
        $pm->start and next;
        sleep 1;
        $pm->finish;
    }
    $pm->wait_all_children;
}
My ancient laptop has a single CPU with 2 cores. Let's see what I get:
TimeThis : Command Line : perl sleeper.pl 1
TimeThis : Elapsed Time : 00:00:20.735
TimeThis : Command Line : perl sleeper.pl 2
TimeThis : Elapsed Time : 00:00:06.578
TimeThis : Command Line : perl sleeper.pl 4
TimeThis : Elapsed Time : 00:00:04.578
TimeThis : Command Line : perl sleeper.pl 8
TimeThis : Elapsed Time : 00:00:03.546
TimeThis : Command Line : perl sleeper.pl 16
TimeThis : Elapsed Time : 00:00:02.562
TimeThis : Command Line : perl sleeper.pl 20
TimeThis : Elapsed Time : 00:00:02.563
So, running with a maximum of 20 processes gives me a total run time of just over 2.5 seconds for sleeping one second 20 times.
On the other hand, with just one process, sleeping one second 20 times took just over 20 seconds. That is a huge improvement, but it also indicates a management overhead of more than 150% when you have 20 processes each sleeping for one second.
This is in the nature of parallel programming. There are a lot of formal treatments out there on what you can expect, but Amdahl's Law is required reading.

Pipeline processor vs. Single-cycle processor

I have to compare the speed of execution of the following code (see picture) using DLX-pipeline and single-cycle processor.
Given:
an instruction in the single-cycle model takes 800 ps
a stage in the pipeline model takes 200 ps (based on MA)
My approach was as follows.
CPU time = CPI * CC * IC
Single-cycle:
CPU time = 1 * 800 ps * 10 instr. = 8000 ps.
Pipeline:
CPI = 21 cycles / 10 instr. = 2.1 cycles per instruction
CPU time = 2.1 * 200 ps * 10 = 4200 ps.
CPU time single-cycle / CPU time pipeline = 8000/4200 = 1.9, so the pipelined code runs 1.9 times faster.
But I was told that I have to work with clock cycles and not with time -- "It doesn't matter how much time a CC takes".
I don't see how to make a comparison otherwise. Could you please help me?
Your analysis is indeed correct, but I guess your professor is looking for an explanation like this:
Suppose the single cycle processor also has the stages that you have mentioned, namely IF, ID, EX, MA and WB and that the instruction spends roughly the same time in each stage as compared to the pipelined processor version. Now you can draw a pipeline diagram for this single cycle processor, and see that it would take 50 cycles on a single cycle processor (which can work on 1 instruction at a time) compared to the 19 cycles on a pipelined processor.
Again, I prefer the way you have analyzed it (as the single cycle processor wouldn't really have each of those stages in a different clock cycle, it would just have a very long clock cycle to cover all the stages). Also, you've not mentioned whether this is a stalling-only MIPS pipeline (for which your answer is correct) or if this is a bypassed-MIPS pipeline. If this is the latter, you can shave off a few more cycles and get it down to 15 cycles.
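For reference, both comparisons can be written out as a short calculation. This is just a sketch using the numbers quoted in the question (21 pipeline cycles) and in this answer (50 vs. 19 cycles); the helper name `cpu_time_ps` is made up, and CPI * IC is folded into a single total cycle count.

```python
def cpu_time_ps(total_cycles, cycle_time_ps):
    """CPU time = CPI * CC * IC, with CPI * IC given as the total cycle count."""
    return total_cycles * cycle_time_ps


# Time-based comparison from the question: 10 instructions.
single_cycle = cpu_time_ps(10, 800)   # 1 cycle per instruction, 800 ps per cycle
pipelined = cpu_time_ps(21, 200)      # 21 cycles in total, 200 ps per cycle
time_speedup = single_cycle / pipelined

# Cycle-based comparison from this answer: same 200 ps stage time on both machines.
cycle_speedup = 50 / 19

print(single_cycle, pipelined)        # 8000 4200
print(round(time_speedup, 1))         # 1.9
print(round(cycle_speedup, 2))        # 2.63
```

The two ratios differ because the cycle-based view charges the single-cycle machine 5 x 200 ps = 1000 ps per instruction instead of the 800 ps given in the question.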

How is CPU time measured on Windows?

I am currently creating a program which identifies processes which are hung/out-of-control, and using an entire CPU core. The program then terminates them, so the CPU usage can be kept under control.
However, I have run into a problem: When I execute the 'tasklist' command on Windows, it outputs this:
Image Name: Blockland.exe
PID: 4880
Session Name: Console
Session#: 6
Mem Usage: 127,544 K
Status: Running
User Name: [removed]\[removed]
CPU Time: 0:00:22
Window Title: C:\HammerHost\Blockland\Blockland.exe
So I know that the line which says "CPU Time" is an indication of the total time, in seconds, used by the program ever since it started.
But let's suppose there are 4 CPU cores on the system. Does this mean that it used up 22 seconds of one core, and therefore used 5.5 seconds on the entire CPU in total? Or does this mean that the process used up 22 seconds on the entire CPU?
It's the total CPU time across all cores. So, if the task used 10 seconds on one core and then 15 seconds later on a different core it would report 25 seconds. If it used 5 seconds on all four cores simultaneously, it would report 20 seconds.
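Here is a small sketch of how that total adds up, plus one way it could be used to spot a process that is hogging a core. The function names and the two-sample approach are my own assumptions for illustration, not something tasklist provides.

```python
def total_cpu_seconds(per_core_seconds):
    """CPU time as reported: the sum of the time spent on every core."""
    return sum(per_core_seconds)


def cores_in_use(cpu_seconds_delta, wall_seconds_delta):
    """Average number of cores kept busy between two CPU Time samples (assumed approach)."""
    return cpu_seconds_delta / wall_seconds_delta


print(total_cpu_seconds([10, 15]))      # 10 s on one core, then 15 s on another -> 25
print(total_cpu_seconds([5, 5, 5, 5]))  # 5 s on all four cores at once -> 20
print(cores_in_use(22, 22))             # 22 s of CPU time in 22 s of wall time -> 1.0
```

A value near 1.0 from `cores_in_use` over a sampling interval would mean the process kept roughly one full core busy during that interval.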

ISR - Maximum Data Rate

The interrupt service routine (ISR) for a device transfers 4 bytes of data from the
device on each device interrupt. On each interrupt, the ISR executes 90 instructions
with each instruction taking 2 clock cycles to execute. The CPU takes 20 clock cycles
to respond to an interrupt request before the ISR starts to execute instructions.
Calculate the maximum data rate, in bits per second, that can be input from this
device, if the CPU clock frequency is 100MHz.
Any help on how to solve will be appreciated.
What I'm thinking: 90 instructions x 2 cycles = 180 cycles
plus the 20-cycle response delay = 200 cycles per interrupt
At 100 MHz there are 100 million cycles per second, so 100 million / 200 = 500,000 interrupts per second, each transferring 4 bytes
so 2 million bytes per second, or 16 million bits per second
I think it's right, but I'm not 100% sure. Can anyone confirm?
Cheers
Your calculation looks good to me. If you want an "engineering answer" then I'd add a 10% margin. Something like: "Theoretical max data rate is 16 Mbit/s. Using a 10% margin, no more than 14.4 Mbit/s."
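The same calculation as a short sketch, in case you want to vary the numbers. The function and parameter names are made up for illustration, and the 10% margin is only the suggestion from this answer, not part of the exercise.

```python
def max_data_rate_bps(clock_hz, instructions, cycles_per_instruction,
                      response_cycles, bytes_per_interrupt):
    """Upper bound on input data rate when every interrupt is serviced back to back."""
    cycles_per_interrupt = instructions * cycles_per_instruction + response_cycles
    interrupts_per_second = clock_hz / cycles_per_interrupt
    return interrupts_per_second * bytes_per_interrupt * 8  # bytes -> bits


rate = max_data_rate_bps(
    clock_hz=100_000_000,       # 100 MHz
    instructions=90,
    cycles_per_instruction=2,
    response_cycles=20,
    bytes_per_interrupt=4,
)
print(rate)        # 16000000.0 bits per second
print(rate * 0.9)  # with a 10% margin, about 14.4 million bits per second
```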
