MPI : Wondering about process sequence and for-loops - for-loop

I am trying to build something with MPI, so since i am not very familiar with it, i started with some arrays and printing stuff. I noticed that a plain C command (not an MPI one) works simultaneously on every process, i.e. printing something like that :
printf("Process No.%d",rank);
Them i noticed that the numbers of the processes got all scrambled and because the right sequence of the processes would fit me, i tried using a for-loop like that :
for(rank=0; rank<processes; rank++) printf("Process No.%d",rank);
And that started a third world war in my computer. Lots of strange errors in a strange format that i couldn't understand and that made me suspicious. How is it possible since an if-loop stating a ranks value , like the master rank:
if(rank==0) printf("Process No.%d",rank);
cant use a for-loop for the same reason. Well, that is my first question.
My second question is about an other for-loop i used, that it got ignored.
printf("PROCESS --------------->**%d**\n",id);
for (i = 0; i < PARTS; ++i){
printf("Array No.%d\n", i+1);
for (j = 0; j < MAXWORDS; ++j)
printf("%d, ",0);
printf("\n\n");
}
I run that for-loop and every process printed only the first line:
$ mpiexec -n 6 `pwd`/test
PROCESS --------------->**0**
PROCESS --------------->**1**
PROCESS --------------->**3**
PROCESS --------------->**2**
PROCESS --------------->**4**
PROCESS --------------->**5**
And not the following amount of zeros (there was an array there at first that i removed cause i was trying to figure out why it didn't get printed).
So, why is it about MPI and for-loops that don't get along?
--edit 1: grammar
--edit 2: Code paste
It is not the same as above, but same problem in the the last for-loop with fprintf.
This is a paste zone, sorry for that, i couldn't deal with the code system here
--edit 3: fixed
Well i finally figured it out. For first i have to say that the fprintf function when used inside MPI is a mess. Apparently there is a kind of overlap while every process writes in a text file. I tested it with the printf function and it worked. The second thing i was doing is, i was calling the MPI_Scatter function from inside root:
if(rank==root) MPI_Scatter();
..which only scatters the data inside the process and not the others.
Now that i have fixed those two issues, the program works as it should, apart a minor problem when i printf the my_list arrays. It seems like every array has a random number of inputs, but when i tested using a counter for every array, it's only the data that is printed like this. Tried using fflush(stdout); but it returned me an error.
usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: final link failed: `Input/output error collect2: ld returned 1 exit status`

MPI in and of itself does not have a problem with for loops. However, just like with any other software, you should always remember that it will work the way you code it, not the way you intend it. It appears that you are having two distinct issues, both of which are only tangentially related to MPI.
The first issue is that the variable PARTS is defined in such a way that it depends on another variable, procs, which is not initialized at before-hand. This means that the value of PARTS is undefined, and probably ends up causing a divide-by-zero as often as not. PARTS should actually be set after line 44, when procs is initialized.
The second issue is with the loop for(i = 0; i = LISTS; i++) labeled /*Here is the problem*/. First of all, the test condition of the loop always sets i to the value of LISTS, regardless of the initial value of 0 and the increment at the end of the loop. Perhaps it was intended to be i < LISTS? Secondly, LISTS is initialized in a way that depends on PARTS, which depends on procs, before that variable is initialized. As with PARTS, LISTS must be initialized after the statement MPI_Comm_size(MPI_COMM_WORLD, &procs); on line 44.
Please be more careful when you write your loops. Also, make sure that you initialize variables correctly. I highly recommend using print statements (for small programs) or the debugger to make sure your variables are being set to the expected values.

Related

Missing print-out for MPI root process, after its handling data reading alone

I'm writing a project that firstly designates the root process to read a large data file and do some calculations, and secondly broadcast the calculated results to all other processes. Here is my code: (1) it reads random numbers from a txt file with nsample=30000 (2) generate dens_ent matrix by some rule (3) broadcast to other processes. Btw, I'm using OpenMPI with gfortran.
IF (myid==0) THEN
OPEN(UNIT=8,FILE='rnseed_ent20.txt')
DO i=1,n_sample
DO j=1,3
READ(8,*) rn(i,j)
END DO
END DO
CLOSE(8)
END IF
dens_ent=0.0d0
DO i=1,n_sample
IF (myid==0) THEN
!Random draws of productivity and savings
rn_zb=MC_JOINT_SAMPLE((/-0.1d0,mu_b0/),var,rn(i,1:2))
iz=minloc(abs(log(zgrid)-rn_zb(1)),dim=1)
ib=minloc(abs(log(bgrid(1:nb/2))-rn_zb(2)),dim=1) !Find the closest saving grid
CALL SUB2IND(j,(/nb,nm,nk,nxi,nz/),(/ib,1,1,1,iz/))
DO iixi=1,nxi
DO iiz=1,nz
CALL SUB2IND(jj,(/nb,nm,nk,nxi,nz/),(/policybmk_2_statebmk_index(j,:),iixi,iiz/))
dens_ent(jj)=dens_ent(jj)+1.0d0/real(nxi)*markovian(iz,iiz)*merge(1.0d0,0.0d0,vent(j) .GE. -bgrid(ib)+ce)
!Density only recorded if the value of entry is greater than b0+ce
END DO
END DO
END IF
END DO
PRINT *, 'dingdongdingdong',myid
IF (myid==0) dens_ent=dens_ent/real(n_sample)*Mpo
IF (myid==0) PRINT *, 'sum_density by joint normal distribution',sum(dens_ent)
PRINT *, 'BLBLALALALALALA',myid
CALL MPI_BCAST(dens_ent,N,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
Problem arises:
(1) IF (myid==0) PRINT *, 'sum_density by joint normal distribution',sum(dens_ent) seems not executed, as there is no print out.
(2) I then verify this by adding PRINT *, 'BLBLALALALALALA',myid etc messages. Again no print out for root process myid=0.
It seems like root process is not working? How can this be true? I'm quite confused. Is it because I'm not using MPI_BARRIER before PRINT *, 'dingdongdingdong',myid?
Is it possible that you miss the following statement just at the very beginning of your code?
CALL MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
IF (ierr /= MPI_SUCCESS) THEN
STOP "MPI_COMM_RANK failed!"
END IF
The MPI_COMM_RANK returns into myid (if succeeds) the identifier of the process within the MPI_COMM_WORLD communicator (i.e a value within 0 and NP, where NP is the total number of MPI ranks).
Thanks for contributions from #cw21 #Harald and #Hristo Iliev.
The failure lies in unit numbering. One reference says:
unit number : This must be present and takes any integer type. Note this ‘number’ identifies the
file and must be unique so if you have more than one file open then you must specify a different
unit number for each file. Avoid using 0,5 or 6 as these UNITs are typically picked to be used by
Fortran as follows.
– Standard Error = 0 : Used to print error messages to the screen.
– Standard In = 5 : Used to read in data from the keyboard.
– Standard Out = 6 : Used to print general output to the screen.
So I changed all numbering i into 1i, not working; then changed into 10i. It starts to work. Mysteriously, as correctly pointed out by #Hristo Iliev, as long as the numbering is not 0,5,6, the code should behave properly. I cannot explain to myself why 1i not working. But anyhow, the root process is now printing out results.

MPI rank is changed after MPI_SENDRECV call [duplicate]

This question already has an answer here:
MPI_Recv overwrites parts of memory it should not access
(1 answer)
Closed 3 years ago.
I have some Fortran code that I'm parallelizing with MPI which is doing truly bizarre things. First, there's a variable nstartg that I broadcast from the boss process to all the workers:
call mpi_bcast(nstartg,1,mpi_integer,0,mpi_comm_world,ierr)
The variable nstartg is never altered again in the program. Later on, I have the boss process send eproc elements of an array edge to the workers:
if (me==0) then
do n=1,ntasks-1
(determine the starting point estart and the number eproc
of values to send)
call mpi_send(edge(estart),eproc,mpi_integer,n,n,mpi_comm_world,ierr)
enddo
endif
with a matching receive statement if me is non-zero. (I've left out some other code for readability; there's a good reason I'm not using scatterv.)
Here's where things get weird: the variable nstartg gets altered to n instead of keeping its actual value. For example, on process 1, after the mpi_recv, nstartg = 1, and on process 2 it's equal to 2, and so forth. Moreover, if I change the code above to
call mpi_send(edge(estart),eproc,mpi_integer,n,n+1234567,mpi_comm_world,ierr)
and change the tag accordingly in the matching call to mpi_recv, then on process 1, nstartg = 1234568; on process 2, nstartg = 1234569, etc.
What on earth is going on? All I've changed is the tag that mpi_send/recv are using to identify the message; provided the tags are unique so that the messages don't get mixed up, this shouldn't change anything, and yet it's altering a totally unrelated variable.
On the boss process, nstartg is unaltered, so I can fix this by broadcasting it again, but that's hardly a real solution. Finally, I should mention that compiling and running this code using electric fence hasn't picked up any buffer overflows, nor did -fbounds-check throw anything at me.
The most probable cause is that you pass an INTEGER scalar as the actual status argument to MPI_RECV when it should be really declared as an array with an implementation-specific size, available as the MPI_STATUS_SIZE constant:
INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status
or
INTEGER status(MPI_STATUS_SIZE)
The message tag is written to one of the status fields by the receive operation (its implementation-specific index is available as the MPI_TAG constant and the field value can be accessed as status(MPI_TAG)) and if your status is simply a scalar INTEGER, then several other local variables would get overwritten. In your case it simply happens so that nstartg falls just above status in the stack.
If you do not care about the receive status, you can pass the special constant MPI_STATUS_IGNORE instead.

Parallel execution of program within C++/CLI

I'm writing a windows forms program (C++/CLI) that calls an executable program multiple times within a large 'for' loop. I want to do the calls to the executable in parallel since it takes up to a minute to run once.
The key part of the windows forms code is the large for loop (actually 2 loops):
for (int a=0; a<1000; a++){
for (int b=0; b<100; b++){
int run = a*100 + b;
char startstr[50], configstr[50]; strcpy(startstr, "solver.exe");
sprintf(configstr, " %d %d %d", run, a, b);
strcat(startstr, configstr);
CreateProcessA(NULL, startstr,......) ;
}
}
The integers "run", "a" and "b" are used by the solver.exe program.
"Run" is used to write a unique output text file from each program run.
"a" and "b" are numbers used to read specific input text files. These are not unique to each run.
I'm not waiting after each call to "CreateProcess" as I want these to execute in parallel.
Currently my code runs and appears to work correctly. However, it swans a huge number of instances of the solver.exe program at once causing my computer to become very slow until everything finishes.
My question is, how can I create a queue that limits the number of concurrent processes (for example to the number of physical cores on the machine) so that they don't all try to run at the same time? Memory may also be an issue when the for loops are set larger.
A secondary question is, could potential concurrent file reads by different instances of solver.exe create a problem? (I can fix this but don't want to if I don't need to.)
I'm familiar with openmp and C but this is my first attempt at running parallel processes in a windows forms program.
Thanks
I've managed to do what I want using the OpenMP function "parallel for" to run the outer loop in parallel and the function omp_set_num_threads() to set the number of concurrent processes. As suggested, the concurrent file reads haven't caused any problems on my system.

std::copy runtime_error when working with uint16_t's

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
lock_guard<mutex> depth_data_lock(depth_mutex);
uint16_t* depth = static_cast<uint16_t*>(_depth);
std::copy(depth, depth + depthBufferSize(), depth_buffer.begin());/// the error
new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!
The way std::copy works is that you provide start and end points of your input sequence and the location to begin copying to. The end point that you're providing is off the end of your sequence, because your depthBufferSize function is giving an offset in bytes, rather than the number of elements in your sequence.
If you remove the multiply by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.

Fortran issue with multiple cpus

I'm running some code written in fortan. It is made of several subroutines and I share variables among them using global variables specified in a module.
The problem occurs when using multiple cpus. In one subroutine the code should update a value of a local variable by the value of a global variable. It so happens that in some random passes though the subroutine the code does not update the variables when I run it using multiple cpus. However if I pause it and make it go up to force the code to pass in the piece of code that updates the variable it works! Magic! I've then implemented a loop that checks if the variable was updated and tries to go back using (GOTO's) in the code to make it update the variables.... but for 2 tries it still sometimes do not update the variables. If I run the code with only one core then it works fine.... Any ideas??
Thanks
Piece of code:
Subroutine1() !Where the variable A0 should be updated
nTries = 0
777 IF (nItems.NE.0) THEN
DO J = 1,nItems
IF (nint(mDATA(J,3)).EQ.nint(XCOORD+U1NE0)
& .AND. nint(mDATA(J,4)).EQ.nint(YCOORD+U2NE0) .AND.
2 nint(mDATA(J,5)).EQ.nint(ZCOORD+U3NE0)) THEN
A0 = mDATA(J,1)
JNODE = mDATA(J,2)
EXIT
ELSE
A0 = A02
ENDIF
ENDDO
IF (A0.EQ.ZERO) THEN !If the variable was not updated
IF (nTries.LE.2) THEN
nTries = nTries + 1
GOTO 777
ENDIF
write(6,*) "ZERO A0", IELEM, JTYPE
A0 = MAXT
ENDIF
I don't exactly know how Abaqus interacts with your FORTRAN subroutines, nor is it clear from the above code what is going wrong, but what you're running into seems to be a classical example of a "race condition," which what you're calling "one core going ahead of the other."
A general comment is that GOTOs and global variables are extremely dangerous in that they make programs very hard to reason about. These problems compound once you start parallelizing. If Abaqus is doing some kind of "black box" computation that it is responsible for parallelizing, you (as a user who is only preprocessing and postprocessing the data) should be insulated from this. However, from the above, it sounds like you're doing some stuff that is interleaved with the Abaqus parallel computation. In that case, you need to make sure everything you're doing is thread-safe. Among many other things, you absolutely need to make sure you're not writing to any global variables.
Another comment is that your checking of A0 is basically a lock called a "spinlock." This is one way of making things thread-safe, but locks have pitfalls of their own. If Abaqus doesn't give you a way to synchronize all of the threads and guarantee that it's done with its job, some sort of lock like this may be the way to go.

Resources