Fortran MPI slowdown of highly parallel task - parallel-processing

I have been trying to diagnose why my Fortran code will not scale well, and have reduced the program to a very simple test case that still scales poorly. The test case is below. I create an array, split it evenly between processors, and then do some operation on it (in this case, just scaling it by weights).
I am not passing any information between processes, so it seems to me that this should scale well. However, as I run with an increasing number of processors and normalize by the number of elements each processor operates on, I see very poor scaling:
2 Processors:
For Rank: 0 Average time for 100 iterations was 6.3680603710015505E-009 with 25525500 points per loop
For Rank: 1 Average time for 100 iterations was 6.3611264474244413E-009 with 25576551 points per loop
3 Processors:
For Rank: 2 Average time for 100 iterations was 8.0085945661011481E-009 with 17102085 points per loop
For Rank: 0 Average time for 100 iterations was 8.2051102639337855E-009 with 16999983 points per loop
For Rank: 1 Average time for 100 iterations was 8.2249291072820462E-009 with 16999983 points per loop
4 Processors:
For Rank: 0 Average time for 100 iterations was 1.0044801473036765E-008 with 12762750 points per loop
For Rank: 3 Average time for 100 iterations was 1.0046922454937459E-008 with 12813801 points per loop
For Rank: 1 Average time for 100 iterations was 1.0178132064014425E-008 with 12762750 points per loop
For Rank: 2 Average time for 100 iterations was 1.0260574719398254E-008 with 12762750 points per loop
6 Processors:
For Rank: 1 Average time for 100 iterations was 1.5841797042924197E-008 with 8525517 points per loop
For Rank: 4 Average time for 100 iterations was 1.5990067816415119E-008 with 8525517 points per loop
For Rank: 0 Average time for 100 iterations was 1.6105490894647526E-008 with 8474466 points per loop
For Rank: 3 Average time for 100 iterations was 1.6141289610460415E-008 with 8474466 points per loop
For Rank: 5 Average time for 100 iterations was 1.5936059738580745E-008 with 8576568 points per loop
For Rank: 2 Average time for 100 iterations was 1.6052278119907569E-008 with 8525517 points per loop
I am running on a Mac Pro 8-core desktop with 64 GB of RAM, so it shouldn't be constrained by system resources, and without any actual message passing I don't know why it should progressively run slower as more cores are utilized. Am I missing something obvious that would cause this? I am using GCC 5.1.0 and Open MPI 1.6.5 (EDIT: compiling with the -O3 flag). Any help would be appreciated. Thanks!
Code:
PROGRAM MAIN
  use mpi
  implicit none
  real*8, allocatable :: mx(:,:,:)
  real*8, allocatable :: xfeq(:,:,:,:)
  integer :: rank, iter, nte
  integer :: top, bottom, xmin, xmax, zmin, zmax, q, ymax, ymin
  integer :: num_procs, error

  call MPI_Init(error)                                  ! initialize MPI
  call MPI_Comm_size(MPI_COMM_WORLD, num_procs, error)  ! get the number of processes
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, error)       ! get the individual process ID

  q    = 7
  xmin = 0
  ymin = 0
  zmin = 0
  ymax = 1000
  xmax = 1000
  zmax = 50
  nte  = 100

  ! split the y-range evenly between ranks; the last rank takes the remainder
  top    = rank*ymax/num_procs
  bottom = (rank+1)*ymax/num_procs - 1
  if (rank+1 == num_procs) bottom = ymax

  allocate(mx(zmin:zmax, xmin:xmax, top:bottom))
  allocate(xfeq(0:q-1, zmin:zmax, xmin:xmax, top:bottom))

  do iter = 1, nte
    mx = 1
    call compfeq(top, bottom, xmin, xmax, zmin, zmax, q, rank, iter, nte, xfeq, mx)
  end do

  ! clean up and exit MPI
  call MPI_Finalize(error)

contains

  subroutine compfeq(top, bottom, xmin, xmax, zmin, zmax, q, rank, iter, nte, xfeq, mx)
    implicit none
    integer :: i, j, l, top, bottom, xmin, xmax, zmin, zmax, q, rank, iter, nte
    real*8 :: xfeq(0:q-1, zmin:zmax, xmin:xmax, top:bottom)
    real*8 :: mx(zmin:zmax, xmin:xmax, top:bottom)
    real*8 :: weight(0:q-1)
    real*8 :: time_start, time_stop, time_col = 0  ! initialization implies SAVE
    integer :: count

    count = 0
    weight(0)     = 0.25
    weight(1:q-1) = 0.125

    call cpu_time(time_start)
    do j = top, bottom
      do i = xmin, xmax
        do l = zmin, zmax
          xfeq(:,l,i,j) = weight*mx(l,i,j)
          count = count + 1
        end do
      end do
    end do
    call cpu_time(time_stop)

    time_col = time_col + (time_stop - time_start)/count
    if (iter == nte) print *, "For Rank: ", rank, "Average time for ", nte, "iterations was", &
        time_col/(iter+nte), "with ", count, "points per loop"
  end subroutine

END PROGRAM
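For reference, the per-rank bounds computed above can be sanity-checked outside Fortran. This small Python sketch (function name mine, not from the program) mirrors the top/bottom integer arithmetic and confirms that the ranks tile 0..ymax without gaps or overlaps:

```python
def block_bounds(rank, num_procs, ymax):
    # mirrors the Fortran integer arithmetic:
    #   top    = rank*ymax/num_procs
    #   bottom = (rank+1)*ymax/num_procs - 1
    top = rank * ymax // num_procs
    bottom = (rank + 1) * ymax // num_procs - 1
    if rank + 1 == num_procs:   # last rank absorbs the leftover row
        bottom = ymax
    return top, bottom

bounds = [block_bounds(r, 3, 1000) for r in range(3)]
print(bounds)  # [(0, 332), (333, 665), (666, 1000)]
```

Multiplying each slab height by the (xmax+1)*(zmax+1) = 51051 points per row reproduces the "points per loop" counts printed in the 3-processor run above.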

Related

Ruby travel time

I'm new to Ruby and I'm having a hard time figuring out how to calculate a random travel time until the distance passes 1000 miles. So far I can't figure out why it doesn't output the results; it just sits there after the user input. Help please.
def travel_time
  printf "Pick a vehicle: \n"
  printf "1. Bicycle \n"
  printf "2. Car \n"
  printf "3. Jet Plane \n"
  puts "Choose 1-3: \n"
  vehicle = gets.chomp
  case vehicle
  when "1"
    # Bicycle: 5-15 miles per hour
    time = 0
    distance = 0
    until distance > 1000 do
      speed = Random.rand(5...15)
      distance = speed * 1
      time = time + 1
    end
    puts "The number of hours it took to travel 1000 miles was #{time} hours"
  when "2"
    # Car: 20-70 miles per hour
    time = 0
    distance = 0
    until distance > 1000 do
      speed = Random.rand(20...70)
      distance = speed * 1
      time = time + 1
    end
    puts "The number of hours it took to travel 1000 miles was #{time} hours"
  when "3"
    # Jet Plane: 400-600 miles per hour
    time = 0
    distance = 0
    until distance > 1000 do
      speed = Random.rand(400...600)
      distance = speed * 1
      time = time + 1
    end
    puts "The number of hours it took to travel 1000 miles was #{time} hours"
  end
end

travel_time
You have an infinite loop here:
until distance > 1000 do
  speed = Random.rand(5...15)
  distance = speed * 1
  time = time + 1
end
This loop will never end because the biggest value distance can reach is 14 (the range 5...15 excludes 15). I think you want to add to distance, not replace it, so try using += instead of =:
until distance > 1000 do
  speed = Random.rand(5...15)
  distance += speed * 1
  time = time + 1
end
The same goes for the loops in the other cases.
How can I save the maximum speed and return it in a statement like I
have done?
One way to do it would be to add another variable (e.g. max_speed) and assign the current speed to it whenever it is greater than max_speed:
time = 0
distance = 0
max_speed = 0
until distance > 1000 do
  speed = Random.rand(5...15)
  max_speed = speed if speed > max_speed
  distance += speed * 1
  time = time + 1
end
puts "The maximum speed was #{max_speed} miles per hour"
Another way would be to use an array (although I prefer the first option):
speed = []
until distance > 1000 do
  speed << Random.rand(5...15)
  distance += speed.last * 1
  time = time + 1
end
puts "The maximum speed was #{speed.max} miles per hour"
You're not actually accumulating the distance, so distance is never going to get past 1000:
until distance > 1000 do
  speed = Random.rand(5...15)
  distance = speed * 1 # will never be more than 14
  time = time + 1
end
You probably want
distance += speed * 1 # not sure why you're multiplying by 1, though
As a style comment: don't use case statements for control flow like if/then statements. Just use them to set values, and move everything else out; this can eliminate a lot of redundant code. Example:
time = 0
distance = 0
until distance > 1000 do
  speed = case vehicle
          when "1" then Random.rand(5...15)    # Bicycle
          when "2" then Random.rand(20...70)   # Car
          when "3" then Random.rand(400...600) # Plane
          end
  distance += speed * 1
  time = time + 1
end
puts "The number of hours it took to travel 1000 miles was #{time} hours"
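With the += fix in place, the loop is guaranteed to terminate. A quick Python port of the bicycle case (hypothetical names; randrange stands in for Ruby's exclusive 5...15 range) illustrates the corrected accumulation:

```python
import random

def travel_hours(lo, hi, goal=1000):
    # each simulated hour adds a random speed in [lo, hi) to the distance
    distance = hours = 0
    while distance <= goal:          # i.e. "until distance > goal"
        distance += random.randrange(lo, hi)
        hours += 1
    return hours

random.seed(0)           # deterministic run for the example
h = travel_hours(5, 15)  # bicycle: on the order of 100 hours
print(h)
```

Since each hour adds between 5 and 14 miles, the result is always between 72 and 201 hours for the bicycle.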

A variant of the Knapsack algorithm

I have a list of items, a, b, c,..., each of which has a weight and a value.
The 'ordinary' Knapsack algorithm will find the selection of items that maximises the value of the selected items, whilst ensuring that the weight is below a given constraint.
The problem I have is slightly different. I wish to minimise the value (easy enough by using the reciprocal of the value), whilst ensuring that the weight is at least the value of the given constraint, not less than or equal to the constraint.
I have tried re-routing the idea through the ordinary Knapsack algorithm, but this can't be done. I was hoping there is another combinatorial algorithm that I am not aware of that does this.
In the German wiki it's formalized as:
finite set of objects U
w: U -> R   (weight function)
v: U -> R   (value function)
B in R      (constraint right-hand side)
Find a subset K of U subject to:
    sum( w(u) | u in K ) <= B
such that:
    sum( v(u) | u in K ) is maximized
So there is no restriction like nonnegativity.
Just use negative weights, negative values and a negative B.
The basic concept is:
sum( w(u) | u in K ) <= B
<->
-sum( w(u) | u in K ) >= -B
So in your case:
classic constraint:  x0 + x1 <= B  |  3 + 7 <= 12   Y |  3 + 10 <= 12   N
becomes:            -x0 - x1 <= -B | -3 - 7 <= -12  N | -3 - 10 <= -12  Y
So for a given implementation it depends on the software if this is allowed. In terms of the optimization-problem, there is no problem. The integer-programming formulation for your case is as natural as the classic one (and bounded).
Python Demo based on Integer-Programming
Code
import numpy as np
import scipy.sparse as sp
from cylp.cy import CyClpSimplex
np.random.seed(1)
""" INSTANCE """
weight = np.random.randint(50, size = 5)
value = np.random.randint(50, size = 5)
capacity = 50
""" SOLVE """
n = weight.shape[0]
model = CyClpSimplex()
x = model.addVariable('x', n, isInt=True)
model.objective = value # MODIFICATION: default = minimize!
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int) # assumes existence
print("INSTANCE")
print(" weights: ", weight)
print(" values: ", value)
print(" capacity: ", capacity)
print("Solution")
print(x_sol)
print("sum weight: ", x_sol.dot(weight))
print("value: ", x_sol.dot(value))
Small remarks
This code is just a demo using a somewhat low-level library; other tools are available which might be better suited (e.g. on Windows: PuLP)
it's the classic integer-programming formulation from the wiki, modified as mentioned above
it will scale very well as the underlying solver is pretty good
as written, it's solving the 0-1 knapsack (only variable bounds would need to be changed)
Small look at the core-code:
# create model
model = CyClpSimplex()
# create one variable for each how-often-do-i-pick-this-item decision
# variable needs to be integer (or binary for 0-1 knapsack)
x = model.addVariable('x', n, isInt=True)
# the objective value of our IP: a linear-function
# cylp only needs the coefficients of this function: c0*x0 + c1*x1 + c2*x2...
# we only need our value vector
model.objective = value # MODIFICATION: default = minimize!
# WARNING: typically one should always use variable-bounds
# (cylp problems...)
# workaround: express bounds lower_bound <= var <= upper_bound as two constraints
# a constraint is an affine-expression
# sp.eye creates a sparse-diagonal with 1's
# example: sp.eye(3) * x >= 5
# 1 0 0 -> 1 * x0 + 0 * x1 + 0 * x2 >= 5
# 0 1 0 -> 0 * x0 + 1 * x1 + 0 * x2 >= 5
# 0 0 1 -> 0 * x0 + 0 * x1 + 1 * x2 >= 5
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
# cylp somewhat outdated: need numpy's matrix class
# apart from that it's just the weight-constraint as defined at wiki
# same affine-expression as above (but only a row-vector-like matrix)
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
# internal conversion of type needed to treat it as an IP (or else it would be an LP)
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
# type-casting
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int)
Output
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Jan 15 2018
command line - ICbcModel -solve -quit (default strategy 1)
Continuous objective value is 4.88372 - 0.00 seconds
Cgl0004I processed model has 1 rows, 4 columns (4 integer (4 of which binary)) and 4 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 5
Cbc0038I Before mini branch and bound, 4 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.00 seconds)
Cbc0038I After 0.00 seconds - Feasibility pump exiting with objective of 5 - took 0.00 seconds
Cbc0012I Integer solution of 5 found by feasibility pump after 0 iterations and 0 nodes (0.00 seconds)
Cbc0001I Search completed - best objective 5, took 0 iterations and 0 nodes (0.00 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from 5 to 5
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: 5.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
INSTANCE
weights: [37 43 12 8 9]
values: [11 5 15 0 16]
capacity: 50
Solution
[0 1 0 1 0]
sum weight: 51
value: 5
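As a sanity check on the covering formulation (minimize value subject to total weight >= capacity), a brute-force Python sketch over all subsets reproduces the optimum of 5 for the instance above (feasible only for tiny n, unlike the IP solver):

```python
from itertools import combinations

def min_value_cover(weights, values, capacity):
    # enumerate all subsets; keep the cheapest one whose weight covers capacity
    n = len(weights)
    best = None
    for r in range(n + 1):
        for s in combinations(range(n), r):
            if sum(weights[i] for i in s) >= capacity:
                v = sum(values[i] for i in s)
                if best is None or v < best:
                    best = v
    return best

print(min_value_cover([37, 43, 12, 8, 9], [11, 5, 15, 0, 16], 50))  # 5
```

The optimum picks items 1 and 3 (weight 43 + 8 = 51 >= 50, value 5 + 0 = 5), matching the solver's [0 1 0 1 0].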

PRAM CREW algorithm for counting odd numbers

So I try to solve the following task:
Develop a CREW PRAM algorithm for counting the odd numbers in a sequence of integers x_1, x_2, ..., x_n.
n is the number of processors - the complexity should be O(log n) and log_2 n is a natural number
My solution so far:
Input: A:={x_1,x_2,...,x_n} Output:=oddCount
begin
1. global_read(A(n),a)
2. if(a mod 2 != 0) then
oddCount += 1
The problem is that, due to CREW, I am not allowed to use multiple write instructions at the same time. oddCount += 1 reads oddCount and then writes oddCount + 1, so there would be multiple concurrent writes.
Do I have to do something like this
Input: A:={x_1,x_2,...,x_n} Output:=oddCount
begin
1. global_read(A(n),a)
2. if(a mod 2 != 0) then
global_write(1, B(n))
3. if(n = A.length - 1) then
for i = 0 to B.length do
oddCount += B(i)
So first each process determines whether its number is odd or even, and the last process calculates the sum? But how does this affect the complexity, and is there a better solution?
Thanks to libik I came to this solution (n starts at 0):
Input: A:={x_1,x_2,...,x_n} Output:=A(0):=number off odd numbers
begin
1. if(A(n) mod 2 != 0) then
A(n) = 1
else
A(n) = 0
2. for i = 1 to log_2(n) do
if (n*(2^i)+2^(i-1) < A.length)
A(n*(2^i)) += A(n*(2^i) + (2^(i-1)))
end
i = 1 --> A(n * 2): 0 2 4 6 8 10 ... A(n*2 + 2^0): 1 3 5 7 ...
i = 2 --> A(n * 4): 0 4 8 12 16 ... A(n*4 + 2^1): 2 6 10 14 18 ...
i = 3 --> A(n * 8): 0 8 16 24 32 ... A(n*8 + 2^2): 4 12 20 28 36 ...
So the first if is the 1st step, and the for represents log_2(n)-1 further steps, so overall there are log_2(n) steps. The solution should end up in A(0).
Your solution is O(n), as there is a for cycle that has to go through all the numbers (which means you don't utilize multiple processors at all).
CREW means you cannot write into the same cell concurrently (in your example, cell = processor memory), but you can write into multiple cells at once.
So how to do it as fast as possible?
At initialization, every processor starts with 1 or 0 (whether it holds an odd number or not).
In the first round, just sum the neighbours: x_2 with x_1, then x_4 with x_3, etc.
This is done in O(1), as every second processor p_x looks at processor p_(x+1) in parallel and adds its 0 or 1.
Then processors p1, p3, p5, p7, ... hold partial sums. Do this again, but now p1 looks at p3, p5 looks at p7, and in general p_x looks at p_(x+2).
Then you have partial sums only in processors p1, p5, p9, etc.
Repeat the process. Every step halves the number of active processors, so you need log_2(n) steps.
If this were a real-life example, the cost of synchronization would often be counted as well. Basically, after each step all processors have to synchronize so they know they can start the next step (you run the described code on every processor, but how do you know you may already add the number from processor p_x? Only after p_x has finished its work).
You need either some kind of "clock" or explicit synchronization.
In that case the final complexity would be log(n)*k, where k is the cost of one synchronization.
That cost depends on the machine, or on the definition. One way to notify processors that you have finished is essentially the same scheme as the one described here for counting the odd numbers; then it would also cost k = log(n), which would result in log^2(n).
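The halving scheme can be simulated sequentially in Python (a sketch: each list cell stands in for one processor's memory, and each pass of the while loop corresponds to one of the log_2(n) parallel rounds):

```python
def crew_odd_count(xs):
    # step 1: every "processor" writes 1 if its number is odd, else 0
    cells = [x % 2 for x in xs]
    # then log2(n) rounds: each active cell pulls in a neighbour's partial sum;
    # the stride doubles every round, halving the number of active cells
    stride = 1
    while stride < len(cells):
        for i in range(0, len(cells), 2 * stride):
            if i + stride < len(cells):
                cells[i] += cells[i + stride]
        stride *= 2
    return cells[0]   # the result accumulates in A(0)

print(crew_odd_count([3, 8, 5, 6, 7, 2, 1, 4]))  # 4
```

Within one round no two processors write the same cell, which is exactly the CREW restriction.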

Running time/time complexity for while loop with square root

This question looks relatively simple, but I can't seem to find the running time in terms of n.
Here is the problem:
j = n;
while(j >= 2) {
j = j^(1/2)
}
I don't really need the total running time, I just need to know how to calculate the number of times the second and third lines are hit (they should be the same). I'd also like to know if there is a formula for finding this. I can see that the above is equivalent to:
for (j = n; j >= 2; j = j^(1/2))
Please note that the type of operation doesn't matter, each time a line is executed, it counts as 1 time unit. So line 1 would just be 1 time unit, line 2 would be:
0 time units if n were 1,
1 time unit if n were 2,
2 time units if n were 4,
3 time units if n were 16, etc.
Thanks in advance to anyone who offers help! It is very much appreciated!
Work backwards to get the number of time units for line 2:
                                    time
n              n      log_2(n)      units
1              1      0             0
2              2      1             1
4              4      2             2
16             16     4             3
16^2           256    8             4
(16^2)^2       65536  16            5
((16^2)^2)^2   ...    32            6
In other words, for the number of time units t, n is 2^(2^(t-1)) except for the case t = 0 in which case n = 1.
To reverse this, you have
t = 0 when n < 2
t = log2(log2(n)) + 1 when n >= 2
where log2(x) is known as the binary logarithm of x.
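The closed form can be checked against a direct simulation of the loop (a Python sketch; the function name is mine):

```python
import math

def loop_steps(n):
    # count how many times j = sqrt(j) executes before j drops below 2
    t, j = 0, n
    while j >= 2:
        j = j ** 0.5
        t += 1
    return t

for n in [2, 4, 16, 256, 65536]:
    assert loop_steps(n) == int(math.log2(math.log2(n))) + 1
print("formula matches simulation")
```

Each squaring of n adds exactly one iteration, which is why the iteration count grows as log2(log2(n)).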

Optimizing a program and calculating % of total execution time improved

So I was told to ask this on here instead of StackExchange:
If I have a program P, which runs on a 2 GHz machine M in 30 seconds, and it is optimized by replacing all instances of 'raise to the power 4' with 3 multiply instructions, the optimized program will be P'. The CPI of multiplication is 2 and the CPI of power is 12. If 10^9 such operations are optimized, what is the percentage of total execution time improved?
Here is what I've deduced so far.
For P, we have:
time (30s)
CPI: 12
Frequency (2GHz)
For P', we have:
CPI (6) [2*3]
Frequency (2GHz)
So I need to figure out how to calculate the time of P' in order to compare the times, but I have no idea how to achieve this. Could someone please help me out?
Program P, which runs on a 2 GHz machine M in 30 seconds, is optimized by replacing all instances of 'raise to the power 4' with 3 multiply instructions. This optimized program will be P'. The CPI of multiplication is 2 and the CPI of power is 12. If there are 10^9 such operations optimized,
From this information we can compute the time needed to execute all POWER4 ("raise to the power 4") instructions: we have the total count of such instructions (all POWER4 were replaced; the count is 10^9, or 1 G). Every POWER4 instruction needs 12 clock cycles (CPI = cycles per instruction), so all POWER4 instructions were executed in 1 G * 12 = 12 G cycles.
A 2 GHz machine has 2 G cycles per second, and there are 30 seconds of execution, so the total execution of program P is 2 G * 30 = 60 G cycles (60 * 10^9). We can conclude that program P contains other instructions too. We don't know which instructions, how many executions they have, and there is no information about their mean CPI. But we know the time needed to execute the other instructions: 60 G - 12 G = 48 G cycles (total program running time minus POWER4 running time; true for simple processors). There are X executed instructions with mean CPI Y, so X*Y = 48 G.
So, total cycles executed for the program P is
Freq * seconds = POWER4_count * POWER4_CPI + OTHER_count * OTHER_mean_CPI
2G * 30 = 1G * 12 + X*Y
Or total running time for P:
30s = (1G * 12 + X*Y) / 2GHz
what is the percent of total execution time improved?
After replacing 1 G POWER4 operations with 3 times as many MUL instructions (multiply), we have 3 G MUL operations, and the cycles needed for them is CPI * count, where MUL CPI is 2: 2 * 3 G = 6 G cycles. The X*Y part of P' is unchanged, and we can solve the problem.
P' time in seconds = ( MUL_count * MUL_CPI + OTHER_count * OTHER_mean_CPI ) / Frequency
P' time = (3G*2 + X*Y) / 2GHz
The improvement is not as big as might be expected, because the POWER4 instructions in P take only part of the running time: 12 G out of 60 G cycles. The optimization reduced that 12 G to 6 G without touching the remaining 48 G cycles; halving only part of the time does not halve the total time.
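Plugging the given numbers in (a sketch of the arithmetic above) gives P' = 27 s, i.e. a 10% improvement:

```python
freq = 2 * 10**9                               # 2 GHz -> cycles per second
total_cycles = freq * 30                       # P runs 30 s -> 60 G cycles
power4_cycles = 10**9 * 12                     # 1 G POWER4 ops at CPI 12 -> 12 G
other_cycles = total_cycles - power4_cycles    # the X*Y part -> 48 G
mul_cycles = 3 * 10**9 * 2                     # 3 G MUL ops at CPI 2 -> 6 G
new_time = (other_cycles + mul_cycles) / freq  # P' running time in seconds
print(new_time)                                # 27.0
print((30 - new_time) * 100 / 30)              # 10.0 (percent improved)
```

The 6 G cycles saved out of 60 G total is exactly the 10% figure.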
