solving computer performance by amdal's low speedup factor - performance

a program runs 100s. multiyply instructions are 80% of the program.There is need to make the program 2 times faster. how much should the multiply instruction be speedup achieve to reach the overall speedup level ?
pls help me to solve this question by amdal's low....

Hint:
The running time is 80 s for multiplies and 20 s for the rest.
You need to bring this down to 30 / 20 s.

Related

Calculating which compiler is faster in terms of cycling

I just have a simple question, a bit silly, but I just need some clarification for an upcoming exam so I don't make a stupid mistake. I am currently taking a class in computer organization and design and am learning about execution time, CPI, clock cycles, etc.
For a problem, I have to calculate the amount of cycles for 2 compilers and find out which one is faster and by how much given the number of instructions and the cycles for each instruction. My main problem is figuring how much faster the faster compiler is.
For example lets say their are two compilers:
Compiler 1 has 3 load instructions, 4 store instructions, and 5 add
instructions.
Compiler 2 has 5 load instructions, 4 store instructions, and 3 add
instructions
A load instruction takes 2 cycles, a store instruction takes 3 cycles and a add instruction takes 1 cycle
So what I would do this add up to the instructions (3+4+5) and (5+4+3) which both equal to 12 instructions.
I'd then calculate the cycles by multiplying the number of instructions by the cycles and adding them all together like this
Compiler 1: (3*2)+(4*3)+(5*1) = 23 cycles
Compiler 2: (5*2)+(4*3)+(3*1) = 25 cycles
So obviously compiler 1 is faster because it requires less cycles. To find out how much faster compiler 1 is against compiler 2 would I just divide the ratio of the cycles?
My calculation was 23/25 = 0.92, so compiler 1 is 0.92 times faster than compiler 2 (92% faster).
A classmate of mine was discussing this with me and claims that it would be 25/23 which would mean it is 1.08 times faster.
I know I can also calculate this by dividing the cycles by the instructions like:
23 cycles/12 instructions = 1.91
25 cycles/12 instructions = 2.08
and then 1.91/2.08 = 0.92 which is the same as the above answer.
I'm not sure which way would be correct.
I was also wondering if the amount of instructions are difference for the second compiler, let's say 15 instructions. Would calculating the ratio of the cycles be sufficient enough?
Or would I have to divide the cycles with the instructions (cycles/instructions) but put 15 instructions for both?
(ex. 23/15 and 25/15?) and then divide the quotients of both to get
the times faster? I also get the same number(0.92) in that case.
Thank you for any clarification.
The first compiler would be 1.08 times the speed of the second compiler, which is 8% faster (because 1.0 + 0.08 = 1.08).
Probably both calculations are innacurate, with modern/multi-core processors a compiler that generates more instruction may actually produce faster code.

H2O - Not seeing much speed-up after moving to powerful machine

I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations and in each loop calls H2ODeepLearningEstimator() 4 times and associated predict() and model_performance(). I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set 80,000 with 122 features (all float) with 20% for validation (10-fold CV). test set 20,000. Doing binary classification.
Machine 1: Windows 7, 4 core, Xeon, each core 3.5GHz, Memory 32 GB
Takes about 24 hours to complete
Machine 2: CentOS 7, 20 core, Xeon, each core 2.0GHz, Memory 128 GB
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed up will be proportional to cores times core-speed. So, you might have expected a 40/14 = 2.85 speed-up (i.e. your 24hrs coming down to the 8-10 hour range).
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.

Calculating CPU Utiliization for Round Robin Algorithm

I have been stuck on this for the past day. Im not sure how to calculate cpu utilization percentage for processes using round robin algorithm.
Let say we have these datas with time quantum of 1. Job Letter followed by arrival and burst time. How would i go about calculating the cpu utilization? I believe the formula is
total burst time / (total burst time + idle time). I know idle time means when the cpu are not busy but not sure how to really calculate it the processes. If anyone can walk me through it, it is greatly appreciated
A 2 6
B 3 1
C 5 9
D 6 7
E 7 10
Well,The formula is correct but in order to know the total-time you need to know the idle-time of CPU and you know when your CPU becomes idle? During the context-swtich it becomes idlt and it depends on short-term-scheduler how much time it take to assign the next proccess to CPU.
In 10-100 milliseconds of time quantua , context swtich time is arround 10 microseconds which is very small factor , now you can guess the context-switch time with time quantum of 1 millisecond. It will be ignoreable but it also results in too many context-switches.

inherent parallelism for a program

Hi I have a question regarding inherent parallelism.
Let's say we have a sequential program which takes 20 seconds to complete execution. Suppose the execution time consists of 2 seconds of setup time at the beginning and 2 seconds of finalization time at the end of the execution, and the remaining work can be parallelized. How do we calculate the inherent parallelism of this program?
How do you define "inherent parallelism"? I've not heard the term. We can talk about "possible speedup".
OP said "remaining work can be parallelized"... to what degree?
Can it run with infinite parallelism? If this were possible (it isn't practical), then the total runtime would be 4 seconds with a speedup of 20/4 --> 5.
If the remaining work can be run on N processors perfectly in parallel,
then the total runtime would be 4+16/N. The ratio of that to 20 seconds is 20/(4+16/N) which can have pretty much any degree of speedup from 1 (no speedup) to 5 (he the limit case) depending on the value of N.

What is the performance of 10 processors capable of 200 MFLOPs running code which is 10% sequential and 90% parallelelizable?

simple problem from Wilkinson and Allen's Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Working through the exercises at the end of the first chapter and want to make sure that I'm on the right track. The full question is:
1-11 A multiprocessor consists of 10 processors, each capable of a peak execution rate of 200 MFLOPs (millions of floating point operations per second). What is the performance of the system as measured in MFLOPs when 10% of the code is sequential and 90% is parallelizable?
I assume the question wants me to find the number of operations per second of a serial processor which would take the same amount of time to run the program as the multiprocessor.
I think I'm right in thinking that 10% of the program is run at 200 MFLOPs, and 90% is run at 2,000 MFLOPs, and that I can average these speeds to find the performance of the multiprocessor in MFLOPs:
1/10 * 200 + 9/10 * 2000 = 1820 MFLOPs
So when running a program which is 10% serial and 90% parallelizable the performance of the multiprocessor is 1820 MFLOPs.
Is my approach correct?
ps: I understand that this isn't exactly how this would work in reality because it's far more complex, but I would like to know if I'm grasping the concepts.
Your calculation would be fine if 90% of the time, all 10 processors were fully utilized, and 10% of the time, just 1 processor was in use. However, I don't think that is a reasonable interpretation of the problem. I think it is more reasonable to assume that if a single processor were used, 10% of its computations would be on the sequential part, and 90% of its computations would be on the parallelizable part.
One possibility is that the sequential part and parallelizable parts can be run in parallel. Then one processor could run the sequential part, and the other 9 processors could do the parallelizable part. All processors would be fully used, and the result would be 2000 MFLOPS.
Another possibility is that the sequential part needs to be run first, and then the parallelizable part. If a single processor needed 1 hour to do the first part, and 9 hours to do the second, then it would take 10 processors 1 + 0.9 = 1.9 hours total, for an average of about (1*200 + 0.9*2000)/1.9 ~ 1053 MFLOPS.

Resources