Raspberry Pi vs PC (comparing algorithm speedup) - parallel-processing

I've been trying to solve this question:
If 75% of a code can be distributed over a cluster, and you had to pick one of two options (40 Raspberry Pis, or 2 PCs each with 4 times the performance of a Raspberry Pi), which option would you pick?
This is likely to be solved by calculating the speedup of each option.
Considering Amdahl's law, the maximum speedup is S <= 1/((1 - f) + f/n), where f is the fraction of the code that can be run in parallel and n is the number of processors.
My solution:
Max Raspberry Pi speedup = 1/((1 - 0.75) + 0.75/40) ≈ 3.72
Now, if the statement "with each PC having four times the performance of a Raspberry Pi" means that each PC is equal to 4 Raspberry Pis, then I'd say n = 8.
Thus, max PC speedup = 1/((1 - 0.75) + 0.75/8) ≈ 2.91
I would pick the 40-node Raspberry Pi cluster over the 2 PCs, but I am not quite sure whether I computed the PCs' speedup correctly. Is it safe to assume that each PC is equivalent to 4 Raspberry Pis (4 processors)?
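For a quick sanity check, here is a minimal Python sketch of the comparison above, assuming f = 0.75 is the parallel fraction and that the 2-PC option really behaves like 8 Raspberry-Pi-equivalent processors (that assumption is exactly what the question asks about):

# Minimal sketch of the Amdahl's law comparison (my illustration, not a given solution).
def amdahl_speedup(f, n):
    """Upper bound on speedup when a fraction f of the work runs on n processors."""
    return 1.0 / ((1.0 - f) + f / n)

f = 0.75
print("40 Raspberry Pis:         ", round(amdahl_speedup(f, 40), 2))  # ~3.72
print("2 PCs as 8 Pi-equivalents:", round(amdahl_speedup(f, 8), 2))   # ~2.91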

Related

Industrial high-load capable single-board computer advice

We are looking for a single-board computer to handle heavy CPU load (24 hours a day) with good connectivity. It will run deep-learning models and also process some sensor data. WiFi, Bluetooth, 1x LAN, 4G GSM; minimum 1 GB RAM, 2 GB preferred.
We came across:
Raspberry Pi 3 B+, quad-core 1.4 GHz
Banana Pi M3, octa-core 1.8 GHz CPU
ASUS Tinker Board, quad-core 1.8 GHz
and more that we haven't listed here.
Main requirements: 24-hour heavy load (90% on all CPUs), higher clock speed, connectivity, and sensor availability.
Raspberry Pi 3 B: when we start the program it uses 95% of the cores, heats up to 84-85 °C, and stays there. I am not sure how long it can hold that temperature.
I need experienced advice on which single-board computer could be a good solution. It could be ARM or Intel.
Budget range is around 100 USD.
Thanks a lot.

Different speedups on different machines for same program

I have a Dell laptop with a 5th-gen i7 processor (1000 MHz), 4 logical cores and 2 physical cores, and 16 GB RAM. I have taken a course on High Performance Computing, for which I have to draw speedup graphs.
Compared to a desktop machine (800 MHz, 5th-gen i5, 8 GB RAM) with 4 physical and 4 logical cores, for the same program my laptop takes ~3 seconds while the desktop takes around 12 seconds. Ideally, since the laptop's clock is only 1.25 times faster, the time on my laptop should have been around 9 to 10 seconds.
This might not have been a problem if I had got a similar speedup on both. But on my laptop, using 4 threads, the speedup is nearly 1.3, while for the same number of threads the speedup on my desktop is nearly 3.5. If my laptop were faster, that should also have shown for the parallel program, but the parallel version was only ~1.3 times faster. What could be the reason?
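For reference, a minimal sketch of how such a speedup number is usually measured (my own illustration with a dummy CPU-bound task, not the course program): speedup = T(1 worker) / T(4 workers), measured separately on each machine.

import time
from multiprocessing import Pool

def burn(n):
    # dummy CPU-bound work
    s = 0
    for i in range(n):
        s += i * i
    return s

def run(workers, chunks=8, n=2_000_000):
    start = time.perf_counter()
    with Pool(workers) as pool:
        pool.map(burn, [n] * chunks)
    return time.perf_counter() - start

if __name__ == "__main__":
    t1, t4 = run(1), run(4)
    print(f"T(1)={t1:.2f}s  T(4)={t4:.2f}s  speedup={t1 / t4:.2f}")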

Compute betweenness centrality on a (1M,3M)-graph using graph-tool

I am trying to compute betweenness centrality on a graph with 1M nodes and 3M edges. I am using graph-tool and the following lines of code:
from graph_tool.all import *
g = load_graph("youtube.graphml")
# betweenness() returns two property maps: vertex betweenness and edge betweenness
vertex_bc, edge_bc = betweenness(g)
On its performance comparison page, it is reported that computing betweenness on a (40k, 300k) directed graph takes graph-tool about 4 minutes: https://graph-tool.skewed.de/performance
Since graph-tool uses Brandes' algorithm, which has O(VE) complexity, I was expecting an approximate running time of:
(1M/40k) * (3M/300k) * 4 min = 25 * 10 * 4 min = 1000 min ≈ 17 h
I found this calculation consistent with the following Stack Overflow post, where for a (2M, 5M) graph a user gave an approximate running time of 6 months using NetworkX, which is about 180x slower than graph-tool. Hence:
6 months ≈ 180 days (NetworkX) ≈ 1 day (graph-tool)
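The same back-of-the-envelope estimate as a tiny Python sketch (pure arithmetic, using only the numbers quoted above):

V, E = 1_000_000, 3_000_000         # my graph
V_ref, E_ref = 40_000, 300_000      # graph-tool benchmark graph
t_ref = 4                           # minutes for the benchmark graph
estimate = (V / V_ref) * (E / E_ref) * t_ref   # O(VE) scaling
print(f"~{estimate:.0f} minutes, ~{estimate / 60:.0f} hours")  # ~1000 minutes, ~17 hours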
The point is that my program has been running on a 4-core machine for 2 days now, so I am starting to wonder whether my reasoning makes any sense.
Moreover, graph-tool's benchmarks are performed on a directed graph, for which Brandes' algorithm has a complexity of O(VE + V(V+E)logV). Given this, shouldn't the expected running time be even smaller than estimated above? And more importantly, is it feasible at all to compute betweenness centrality on a (1M, 3M) network using graph-tool and a 4-core machine?
I am using an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz.
Eventually the algorithm stopped: it took slightly more than 2 days (a couple of hours more). I was not able to get the result: during the computation the UI crashed, and I could not reconnect to the IPython notebook kernel that was running the program (which, however, kept running). I am going to update this post as soon as I get the actual result.
UPDATE
I recomputed the betweenness using a 16-core machine, and it took about 18 hours.
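As a side note, graph-tool parallelizes betweenness with OpenMP when the build supports it. A hedged sketch (assuming your graph-tool version exposes openmp_enabled and openmp_set_num_threads; check the module docs for your release):

import graph_tool
import graph_tool.centrality

if graph_tool.openmp_enabled():                  # only if the build has OpenMP support
    graph_tool.openmp_set_num_threads(16)        # e.g. all cores of the 16-core machine

g = graph_tool.load_graph("youtube.graphml")
vbc, ebc = graph_tool.centrality.betweenness(g)  # (vertex, edge) betweenness property maps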

Matlab matrix multiplication speed

I was wondering how MATLAB can multiply two matrices so fast. When multiplying two NxN matrices, N^3 multiplications are performed. Even with the Strassen algorithm it takes about N^2.8 multiplications, which is still a large number. I ran the following test program:
a = rand(2160);
b = rand(2160);
tic;a*b;toc
2160 was used because 2160^3 ≈ 10^10 (so a*b is about 10^10 multiplications).
I got:
Elapsed time is 1.164289 seconds.
(I'm running on a 2.4 GHz notebook, and no threading occurs)
which means my computer did ~10^10 operations in a little more than 1 second.
How can this be?
It's a combination of several things:
Matlab does indeed multi-thread.
The core is heavily optimized with vector instructions.
Here are the numbers on my machine: Core i7 920 @ 3.5 GHz (4 cores)
>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.
Task Manager shows 4 cores of CPU usage.
Now for some math:
Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12
Max multiplies in 52.6 secs =
(3.5 GHz) * (4 cores) * (2 mul/cycle via SSE) * (52.6 secs) = 1.47 * 10^12
So MATLAB is achieving about 1/1.47 ≈ 68% of the maximum possible CPU throughput.
I see nothing out of the ordinary.
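The same arithmetic as a small Python sketch (just reproducing the numbers above):

n = 10_000
multiplies = n ** 3                            # ~1e12 multiplies for a 10000x10000 matmul
clock_hz, cores, mul_per_cycle = 3.5e9, 4, 2   # Core i7 920, SSE: 2 mul/cycle
elapsed = 52.6                                 # measured seconds
peak = clock_hz * cores * mul_per_cycle * elapsed   # ~1.47e12 possible multiplies
print(f"efficiency ~ {multiplies / peak:.0%}")      # ~68%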
To check whether or not you are using multi-threading in MATLAB, use this command:
maxNumCompThreads(n)
This sets the number of computational threads to n. I have a Core i7-2620M, which has a maximum frequency of 2.7 GHz, but it also has a turbo mode at 3.4 GHz. The CPU has two cores. Let's see:
A = rand(5000);
B = rand(5000);
maxNumCompThreads(1);
tic; C=A*B; toc
Elapsed time is 10.167093 seconds.
maxNumCompThreads(2);
tic; C=A*B; toc
Elapsed time is 5.864663 seconds.
So there is multi-threading.
Let's look at the single-threaded result. A*B executes approximately 5000^3 multiplications and additions, so the performance of the single-threaded code is
5000^3 * 2 / 10.17 ≈ 24.6 GFLOP/s
Now for the CPU: at 3.4 GHz, Sandy Bridge can do a maximum of 8 FLOPs per cycle with AVX:
3.4 [Gcycles/second] * 8 [FLOPs/cycle] = 27.2 GFLOP/s peak performance
So single-core performance is around 90% of peak, which is to be expected for this problem.
You really need to look closely at the capabilities of your CPU to get accurate performance estimates.
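For comparison (my addition, not part of the original answers), the same kind of measurement can be done in Python: NumPy hands A @ B to an optimized, normally multi-threaded BLAS, so you can convert the elapsed time into achieved GFLOP/s in the same way.

import time
import numpy as np

N = 5000
A = np.random.rand(N, N)
B = np.random.rand(N, N)
start = time.perf_counter()
C = A @ B                          # delegated to the BLAS NumPy was built against
elapsed = time.perf_counter() - start
flops = 2 * N ** 3                 # N^3 multiplies + N^3 additions
print(f"{elapsed:.2f} s, {flops / elapsed / 1e9:.1f} GFLOP/s")

If you want to repeat the one-thread versus two-thread comparison, limiting the BLAS thread count (for example via the OMP_NUM_THREADS environment variable, or the threadpoolctl package) plays the role of maxNumCompThreads.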

FPGA timing question

I am new to FPGA programming and I have a question regarding performance in terms of overall execution time.
I have read that latency is calculated in terms of cycle time; hence, overall execution time = latency * cycle time.
I want to optimize the time needed to process the data, so I would be measuring the overall execution time.
Let's say I have a calculation a = b * c * d.
If I make it calculate in two cycles (result1 = b * c, then a = result1 * d), the overall execution time would be a latency of 2 * cycle time (which is determined by the delay of the multiplication operation, say value X) = 2X.
If I make the calculation in one cycle (a = b * c * d), the overall execution time would be a latency of 1 * cycle time (say value 2X, since it has twice the delay because of two multiplications instead of one) = 2X.
So it seems that, when optimizing for execution time, if I focus only on decreasing the latency, the cycle time increases, and vice versa. Is there a case where both the latency and the cycle time could be decreased, causing the execution time to decrease? When should I focus on optimizing the latency, and when on the cycle time?
Also, when I am programming in C++, it seems that when I want to optimize the code I would optimize the latency (the cycles needed for the execution). However, for FPGA programming, optimizing the latency alone is not adequate, since the cycle time would increase. Hence, I should focus on optimizing the execution time (latency * cycle time). Am I correct in this, if I would like to increase the speed of the program?
I hope that someone can help me with this. Thanks in advance.
I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.
With your example, processing 10 items doing a = b * c * d in one cycle (one cycle = 2t) would take 20t. However, doing it in two 1t cycles, processing 10 items would take 11t.
Hope that helps.
Edit: added timing.
Calculation in one 2t cycle, 10 calculations:

Time    0  2  2  2  2  2  2  2  2  2  2   = 20t
Input      1  2  3  4  5  6  7  8  9 10
Output     1  2  3  4  5  6  7  8  9 10

Calculation in two 1t cycles, pipelined, 10 calculations:

Time    0  1  1  1  1  1  1  1  1  1  1  1   = 11t
Input      1  2  3  4  5  6  7  8  9 10
Stage1     1  2  3  4  5  6  7  8  9 10
Output        1  2  3  4  5  6  7  8  9 10
Latency for both solutions is 2t: one 2t cycle for the first, and two 1t cycles for the second. However, the throughput of the second solution is twice as high: once the latency is accounted for, you get a new answer every 1t cycle.
So if you had a complex calculation that required, say, five 1t cycles, the latency would be 5t, but you would still get a new result every 1t.
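As a small illustration of that cycle counting (my sketch, using the same 2t / 1t numbers):

def single_cycle_total(n, cycle_time=2):
    # one result per cycle, each cycle takes 2t
    return n * cycle_time

def pipelined_total(n, stages=2, cycle_time=1):
    # first result after 'stages' cycles, then one new result per cycle
    return (stages + (n - 1)) * cycle_time

print(single_cycle_total(10))   # 20 -> 20t
print(pipelined_total(10))      # 11 -> 11t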
You need another term in addition to latency and cycle time, which is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get it out every cycle, your throughput can be doubled compared with "do it all in one cycle".
Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.
If you pipeline it (which is the technical term for splitting up the calculation into multiple cycles), you can do it in 2 lots of 20 ns + a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, but it makes the sums easy). So now one item takes 2x20 + 10 = 50 ns => 20M items/sec. Worse!
But, if you can make the 2 stages independent of each other (in your case, not sharing the multiplier), you can push new data into the pipeline every 25 + a bit ns. This "a bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in every 35 ns, or nearly 30M items/sec, which is better than you started with.
In real life the 10 ns will be much less, often hundreds of ps, so the gains are much larger.
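The same nanosecond arithmetic as a quick sketch (numbers taken from the paragraph above):

one_cycle     = 40e-9               # 40 ns per item, all in one cycle
not_pipelined = 2 * 20e-9 + 10e-9   # 50 ns per item if the two stages run back to back
pipelined     = 35e-9               # new item every 25 ns + up to 10 ns of overhead
for label, t in [("one 40 ns cycle", one_cycle),
                 ("two cycles, not pipelined", not_pipelined),
                 ("two cycles, pipelined", pipelined)]:
    print(f"{label}: {1.0 / t / 1e6:.1f} M items/sec")   # 25.0, 20.0, 28.6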
George accurately described the meaning of latency (which does not necessarily relate to computation time). It seems you want to optimize your design for speed. This is very complex and requires much experience. The total runtime is
execution_time = (latency + (N * computation_cycles) ) * cycle_time
where N is the number of calculations you want to perform. If you are developing for acceleration, you should only compute on large data sets, i.e. N is big. Usually you then don't have requirements on latency (real-time applications can be different). The determining factors are then the cycle_time and the computation_cycles, and here it is really hard to optimize, because there is a trade-off: the cycle_time is determined by the critical path of your design, which gets longer the fewer registers you place on it, and the longer it gets, the bigger the cycle_time. But the more registers you have, the higher your computation_cycles (each register increases the number of required cycles by one).
Maybe I should add that the latency is usually the number of computation_cycles (it's the first computation that determines the latency), but in theory this can be different.
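A tiny sketch of that formula for a large batch, using the single 2t-cycle design and the two-stage 1t pipeline from the earlier answer (my illustration; for big N the latency term barely matters):

def execution_time(latency, computation_cycles, cycle_time, n):
    return (latency + n * computation_cycles) * cycle_time

N = 1_000_000
print(execution_time(1, 1, 2.0, N))   # ~2,000,002 t : one result per 2t cycle
print(execution_time(2, 1, 1.0, N))   # ~1,000,002 t : one result per 1t cycle, 2-cycle latency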
