Fortran issue with multiple cpus - parallel-processing

I'm running some code written in fortan. It is made of several subroutines and I share variables among them using global variables specified in a module.
The problem occurs when using multiple cpus. In one subroutine the code should update a value of a local variable by the value of a global variable. It so happens that in some random passes though the subroutine the code does not update the variables when I run it using multiple cpus. However if I pause it and make it go up to force the code to pass in the piece of code that updates the variable it works! Magic! I've then implemented a loop that checks if the variable was updated and tries to go back using (GOTO's) in the code to make it update the variables.... but for 2 tries it still sometimes do not update the variables. If I run the code with only one core then it works fine.... Any ideas??
Thanks
Piece of code:
Subroutine1() !Where the variable A0 should be updated
nTries = 0
777 IF (nItems.NE.0) THEN
DO J = 1,nItems
IF (nint(mDATA(J,3)).EQ.nint(XCOORD+U1NE0)
& .AND. nint(mDATA(J,4)).EQ.nint(YCOORD+U2NE0) .AND.
2 nint(mDATA(J,5)).EQ.nint(ZCOORD+U3NE0)) THEN
A0 = mDATA(J,1)
JNODE = mDATA(J,2)
EXIT
ELSE
A0 = A02
ENDIF
ENDDO
IF (A0.EQ.ZERO) THEN !If the variable was not updated
IF (nTries.LE.2) THEN
nTries = nTries + 1
GOTO 777
ENDIF
write(6,*) "ZERO A0", IELEM, JTYPE
A0 = MAXT
ENDIF

I don't exactly know how Abaqus interacts with your FORTRAN subroutines, nor is it clear from the above code what is going wrong, but what you're running into seems to be a classical example of a "race condition," which what you're calling "one core going ahead of the other."
A general comment is that GOTOs and global variables are extremely dangerous in that they make programs very hard to reason about. These problems compound once you start parallelizing. If Abaqus is doing some kind of "black box" computation that it is responsible for parallelizing, you (as a user who is only preprocessing and postprocessing the data) should be insulated from this. However, from the above, it sounds like you're doing some stuff that is interleaved with the Abaqus parallel computation. In that case, you need to make sure everything you're doing is thread-safe. Among many other things, you absolutely need to make sure you're not writing to any global variables.
Another comment is that your checking of A0 is basically a lock called a "spinlock." This is one way of making things thread-safe, but locks have pitfalls of their own. If Abaqus doesn't give you a way to synchronize all of the threads and guarantee that it's done with its job, some sort of lock like this may be the way to go.

Related

Issue with OpenMP thread execution and threadprivate variable

I have used 8 threads for 8 loops. I have used 'print' to see how the parallel code works. The 0 thread creates problems!I have showed in the attached diagram (please check the attached link below) how the parallel works. I have used threadprivate but it turned out that thread 0 can not get any private threadsafe variables.
I have tried with modules as well and got same results!
Any idea why the code acts this way? I would appreciate any help or suggestion. Thanks!
!$OMP PARALLEL DO
do nb=m3+1, m3a, 2
60 icall=nb
65 iad=idint(a(icall))
if(iad.eq.0) goto 100
call ford(a(iad),servo)
if(.not.dflag) goto 80
atemp=dble(nemc)
nemc=iad
a(icall)=a(iad+6)
a(iad+6) = atemp
dflag=.false.
goto 65
80 icall=iad+6
goto 65
100 continue
end do
!$OMP END PARALLEL DO
subroutine FORD(i,j)
dimension zl(3),zg(3)
common /ellip/ b1,c1,f1,g1,h1,d1,
. b2,c2,f2,g2,h2,p2,q2,r2,d2
common /root/ root1,root2
!$OMP threadprivate (/ellip/,/root/)
CALL CONDACT(genflg,lapflg)
return
end subroutine
SUBROUTINE CONDACT(genflg,lapflg)
common /ellip/ b1,c1,f1,g1,h1,d1,b2,c2,f2,g2,h2,p2,q2,r2,d2
!$OMP threadprivate (/ellip/)
RETURN
END
Looking at just the first few lines you have major problems.
do nb=m3+1, m3a, 2
This part is fine, each thread will have a private copy of nb properly initialized.
60 icall=nb
This is a problem. icall is shared and each thread will write its private value of nb into the shared. Threads run concurrently and the order and timing is non-determanistic so the value of icall in each thread cannot be known ahead of time.
65 iad=idint(a(icall))
Now we use icall to calculate a value to store in the shared variable iad. What are the problems? The value of icall may not be the same as in the previous line if another thread wrote to it between this thread's execution. The value of iad is being clobbered by each thread.
if(iad.eq.0) goto 100
call ford(a(iad),servo)
These lines have the same problems as above. The value of iad may not be the same as above and it may not be the same between these two lines depending on the execution of the other threads.
if(.not.dflag) goto 80
The variable dflag has not been initialized at this point.
To fix these problems you need to declare icall and iad as private with
!$omp parallel do private(icall,iad)
You should also initialize dflag before you use it.
These first errors are probably responsible for a large chunk of your problem but may not fix everything. You have architected very complex (hard to maintain) thread interaction and your code is full of bad practices (implicit variables, liberal use of goto) which make this code hard to follow.

Is there a way to remove "getKey"'s input lag?

I've recently decided to try ti-basic programming, and while I was playing with getKey; I noticed that it had a 1s~ input lag after the first input. Is this built into the calculator, or can this be changed?
I recognize that "Quick Key" code above ;) (I'm the original author and very glad to see it spread around!).
Anyway, here is my low-level knowledge of the subject:
The operating system uses what is known as an interrupt in order to handle reading the keyboard, link port, USB port, and the run indicator among other things. The interrupt is just software code, nothing hardware implemented. So it is hardwired into the OS not the calculator.
The gist of the code TI uses is that once it reads that a key press occurred, it resets a counter to 50 and decrements it so long as the user holds down the key. Once the counter reaches zero, it tells getKey to recognize it as a new keypress and then it resets the counter to 10. This cause the initial delay to be longer than subsequent delays.
The TI-OS allows third party "hooks" to jump in and modify the getkey process and I used such a hook in another more complicated program (Speedy Keys). However, this hook is never called during BASIC program execution except at a Pause or Menu( command, where it isn't too helpful.
Instead what we can do is setup a parser hook that modifies the getkey counters. Alternatively, you can use the QuickKey code above, or you can use Hybrid BASIC which requires you to download a third-party App. A few of these apps (BatLib [by me], Celtic 3, DoorsCS7, and xLIB) offer a very fast getKey alternative as well as many other powerful functions.
The following is the code for setting up the parser hook. It works very well in my tests! See notes below:
#include "ti83plus.inc" ; ~~This column is the stuff for manually
_EnableParserHook = 5026h ; creating the code on calc. ~~
.db $BB,$6D ;AsmPrgm
.org $9D95 ;
ld hl,hookcode ;21A89D
ld de,appbackupscreen ;117298
ld bc,hookend-hookcode ;010A00
ldir ;EDB0
ld hl,appbackupscreen ;217298
ld a,l ;7D
bcall(_EnableParserHook);EF2650
ret ;C9
hookcode: ;
.db 83h ;83
push af ;F5
ld a,1 ;3E01
ld (8442h),a ;324284
pop af ;F1
cp a ;BF
ret ;C9
hookend: ;
Notes: other apps or programs may use parser hooks. Using this program will disable those hooks and you will need to reinstall them. This is pretty easy.
Finally, if you manually putting this on your calculator, use the right column code. Here is an animated .gif showing how to make such a program:
You will need to run the program once either on the homescreen or at the start of your main program. After this, all getKeys will have no delay.
I figured out this myself too when I was experimenting with my Ti-84 during the summer. This lag cannot be changed. This is built into the calculator. I think this is because of how the microchip used in ti-84 is a Intel Zilog Z80 microprocessor which was made in 1984.
This is unfortunately simply the inefficiency of the calculator. TI-basic is a fairly high-level language and meant to be easy to use and is thus not very efficient or fast. Especially with respect to input and output, i.e. printing messages and getting input.
Quick Key
:AsmPrgm3A3F84EF8C47EFBF4AC9
This is a getKey routine that makes all keys repeat, not just arrows and there is no delay between repeats. The key codes are different, so you might need to experiment.

MPI : Wondering about process sequence and for-loops

I am trying to build something with MPI, so since i am not very familiar with it, i started with some arrays and printing stuff. I noticed that a plain C command (not an MPI one) works simultaneously on every process, i.e. printing something like that :
printf("Process No.%d",rank);
Them i noticed that the numbers of the processes got all scrambled and because the right sequence of the processes would fit me, i tried using a for-loop like that :
for(rank=0; rank<processes; rank++) printf("Process No.%d",rank);
And that started a third world war in my computer. Lots of strange errors in a strange format that i couldn't understand and that made me suspicious. How is it possible since an if-loop stating a ranks value , like the master rank:
if(rank==0) printf("Process No.%d",rank);
cant use a for-loop for the same reason. Well, that is my first question.
My second question is about an other for-loop i used, that it got ignored.
printf("PROCESS --------------->**%d**\n",id);
for (i = 0; i < PARTS; ++i){
printf("Array No.%d\n", i+1);
for (j = 0; j < MAXWORDS; ++j)
printf("%d, ",0);
printf("\n\n");
}
I run that for-loop and every process printed only the first line:
$ mpiexec -n 6 `pwd`/test
PROCESS --------------->**0**
PROCESS --------------->**1**
PROCESS --------------->**3**
PROCESS --------------->**2**
PROCESS --------------->**4**
PROCESS --------------->**5**
And not the following amount of zeros (there was an array there at first that i removed cause i was trying to figure out why it didn't get printed).
So, why is it about MPI and for-loops that don't get along?
--edit 1: grammar
--edit 2: Code paste
It is not the same as above, but same problem in the the last for-loop with fprintf.
This is a paste zone, sorry for that, i couldn't deal with the code system here
--edit 3: fixed
Well i finally figured it out. For first i have to say that the fprintf function when used inside MPI is a mess. Apparently there is a kind of overlap while every process writes in a text file. I tested it with the printf function and it worked. The second thing i was doing is, i was calling the MPI_Scatter function from inside root:
if(rank==root) MPI_Scatter();
..which only scatters the data inside the process and not the others.
Now that i have fixed those two issues, the program works as it should, apart a minor problem when i printf the my_list arrays. It seems like every array has a random number of inputs, but when i tested using a counter for every array, it's only the data that is printed like this. Tried using fflush(stdout); but it returned me an error.
usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: final link failed: `Input/output error collect2: ld returned 1 exit status`
MPI in and of itself does not have a problem with for loops. However, just like with any other software, you should always remember that it will work the way you code it, not the way you intend it. It appears that you are having two distinct issues, both of which are only tangentially related to MPI.
The first issue is that the variable PARTS is defined in such a way that it depends on another variable, procs, which is not initialized at before-hand. This means that the value of PARTS is undefined, and probably ends up causing a divide-by-zero as often as not. PARTS should actually be set after line 44, when procs is initialized.
The second issue is with the loop for(i = 0; i = LISTS; i++) labeled /*Here is the problem*/. First of all, the test condition of the loop always sets i to the value of LISTS, regardless of the initial value of 0 and the increment at the end of the loop. Perhaps it was intended to be i < LISTS? Secondly, LISTS is initialized in a way that depends on PARTS, which depends on procs, before that variable is initialized. As with PARTS, LISTS must be initialized after the statement MPI_Comm_size(MPI_COMM_WORLD, &procs); on line 44.
Please be more careful when you write your loops. Also, make sure that you initialize variables correctly. I highly recommend using print statements (for small programs) or the debugger to make sure your variables are being set to the expected values.

Set breakpoint on variable value change

I'm just wondering if is it possible to set breakpoint on change of variable value (in any programming language and tool) ?
For example, I want to say: "Stop anywhere, when value of variable 'a' will be changed".
I know that there is ability to set condition breakpoint and to stop execution when a variable have some specific value, but I didn't hear about observing variable changes.
If it is not possible, why ?
In my experience you can achieve this with a "memory breakpoint" or "memory watch point". For example gdb does it like this: Can I set a breakpoint on 'memory access' in GDB?
As far as I've seen with write watchpoints, the break actually triggers when a is written to, regardless of whether the new value is equal to the old value. So if by "changed" you really mean "changed" then there are fewer examples out there. Possibly even none, I'm not sure, although I don't suppose it would be technically difficult to implement change-only watchpoints, assuming that you were implementing write watchpoints.
For some languages it makes a difference what kind of variable a is. For example, in C or C++ variables can be "lifted" into registers when optimization is enabled, in which case hardware memory watchpoints on the address of the variable will not necessarily catch every change.
There's also a limitation with variables on the stack, that if your function exits but the watchpoint is still set, then it could catch access to the same address, now in use for a different variable in a different function. If your function is called again later (or recursively), it's not necessarily starting from the same stack position, and if not then your watchpoint would fail to catch access to the "same" variable at a different location.
"Stop when a particular condition is true at a particular line of code" is in my experience called a "conditional breakpoint". It generally uses a different mechanism --
the debugger will most likely put a breakpoint instruction at that line of code. Each time it triggers the debugger will check the condition and continue execution if it's false.
Some processors support hardware breakpoints which will break when an address is read or written. For example, if I have a 4 byte variable at address 0x10005060, then I can set a hardware breakpoint like this (using windbg): ba w4 0x10005060. The processor will break if any of the 4 bytes are written. The following command instructs the processor to break when any of those 4 bytes a read or written: ba r4 0x10005060.

Self-restarting MathKernel - is it possible in Mathematica?

This question comes from the recent question "Correct way to cap Mathematica memory use?"
I wonder, is it possible to programmatically restart MathKernel keeping the current FrontEnd process connected to new MathKernel process and evaluating some code in new MathKernel session? I mean a "transparent" restart which allows a user to continue working with the FrontEnd while having new fresh MathKernel process with some code from the previous kernel evaluated/evaluating in it?
The motivation for the question is to have a way to automatize restarting of MathKernel when it takes too much memory without breaking the computation. In other words, the computation should be automatically continued in new MathKernel process without interaction with the user (but keeping the ability for user to interact with the Mathematica as it was originally). The details on what code should be evaluated in new kernel are of course specific for each computational task. I am looking for a general solution how to automatically continue the computation.
From a comment by Arnoud Buzing yesterday, on Stack Exchange Mathematica chat, quoting entirely:
In a notebook, if you have multiple cells you can put Quit in a cell by itself and set this option:
SetOptions[$FrontEnd, "ClearEvaluationQueueOnKernelQuit" -> False]
Then if you have a cell above it and below it and select all three and evaluate, the kernel will Quit but the frontend evaluation queue will continue (and restart the kernel for the last cell).
-- Arnoud Buzing
The following approach runs one kernel to open a front-end with its own kernel, which is then closed and reopened, renewing the second kernel.
This file is the MathKernel input, C:\Temp\test4.m
Needs["JLink`"];
$FrontEndLaunchCommand="Mathematica.exe";
UseFrontEnd[
nb = NotebookOpen["C:\\Temp\\run.nb"];
SelectionMove[nb, Next, Cell];
SelectionEvaluate[nb];
];
Pause[8];
CloseFrontEnd[];
Pause[1];
UseFrontEnd[
nb = NotebookOpen["C:\\Temp\\run.nb"];
Do[SelectionMove[nb, Next, Cell],{12}];
SelectionEvaluate[nb];
];
Pause[8];
CloseFrontEnd[];
Print["Completed"]
The demo notebook, C:\Temp\run.nb contains two cells:
x1 = 0;
Module[{},
While[x1 < 1000000,
If[Mod[x1, 100000] == 0, Print["x1=" <> ToString[x1]]]; x1++];
NotebookSave[EvaluationNotebook[]];
NotebookClose[EvaluationNotebook[]]]
Print[x1]
x1 = 0;
Module[{},
While[x1 < 1000000,
If[Mod[x1, 100000] == 0, Print["x1=" <> ToString[x1]]]; x1++];
NotebookSave[EvaluationNotebook[]];
NotebookClose[EvaluationNotebook[]]]
The initial kernel opens a front-end and runs the first cell, then it quits the front-end, reopens it and runs the second cell.
The whole thing can be run either by pasting (in one go) the MathKernel input into a kernel session, or it can be run from a batch file, e.g. C:\Temp\RunTest2.bat
#echo off
setlocal
PATH = C:\Program Files\Wolfram Research\Mathematica\8.0\;%PATH%
echo Launching MathKernel %TIME%
start MathKernel -noprompt -initfile "C:\Temp\test4.m"
ping localhost -n 30 > nul
echo Terminating MathKernel %TIME%
taskkill /F /FI "IMAGENAME eq MathKernel.exe" > nul
endlocal
It's a little elaborate to set up, and in its current form it depends on knowing how long to wait before closing and restarting the second kernel.
Perhaps the parallel computation machinery could be used for this? Here is a crude set-up that illustrates the idea:
Needs["SubKernels`LocalKernels`"]
doSomeWork[input_] := {$KernelID, Length[input], RandomReal[]}
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, DistributeDefinitions[resultSoFar]
; Quiet[ParallelEvaluate[doSomeWork[resultSoFar], subkernel]] /.
{ $Failed :> (Print#"Ouch!"; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
This is an over-elaborate setup to generate a list of 1,000 triples of numbers. getTheJobDone runs a loop that continues until the result list contains the desired number of elements. Each iteration of the loop is evaluated in a subkernel. If the subkernel evaluation fails, the subkernel is relaunched. Otherwise, its return value is added to the result list.
To try this out, evaluate:
getTheJobDone[]
To demonstrate the recovery mechanism, open the Parallel Kernel Status window and kill the subkernel from time-to-time. getTheJobDone will feel the pain and print Ouch! whenever the subkernel dies. However, the overall job continues and the final result is returned.
The error-handling here is very crude and would likely need to be bolstered in a real application. Also, I have not investigated whether really serious error conditions in the subkernels (like running out of memory) would have an adverse effect on the main kernel. If so, then perhaps subkernels could kill themselves if MemoryInUse[] exceeded a predetermined threshold.
Update - Isolating the Main Kernel From Subkernel Crashes
While playing around with this framework, I discovered that any use of shared variables between the main kernel and subkernel rendered Mathematica unstable should the subkernel crash. This includes the use of DistributeDefinitions[resultSoFar] as shown above, and also explicit shared variables using SetSharedVariable.
To work around this problem, I transmitted the resultSoFar through a file. This eliminated the synchronization between the two kernels with the net result that the main kernel remained blissfully unaware of a subkernel crash. It also had the nice side-effect of retaining the intermediate results in the event of a main kernel crash as well. Of course, it also makes the subkernel calls quite a bit slower. But that might not be a problem if each call to the subkernel performs a significant amount of work.
Here are the revised definitions:
Needs["SubKernels`LocalKernels`"]
doSomeWork[] := {$KernelID, Length[Get[$resultFile]], RandomReal[]}
$resultFile = "/some/place/results.dat";
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, Put[resultSoFar, $resultFile]
; Quiet[ParallelEvaluate[doSomeWork[], subkernel]] /.
{ $Failed :> (Print#"Ouch!"; CloseKernels[subkernel]; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
I have a similar requirement when I run a CUDAFunction for a long loop and CUDALink ran out of memory (similar here: https://mathematica.stackexchange.com/questions/31412/cudalink-ran-out-of-available-memory). There's no improvement on the memory leak even with the latest Mathematica 10.4 version. I figure out a workaround here and hope that you may find it's useful. The idea is that you use a bash script to call a Mathematica program (run in batch mode) multiple times with passing parameters from the bash script. Here is the detail instruction and demo (This is for Window OS):
To use bash-script in Win_OS you need to install cygwin (https://cygwin.com/install.html).
Convert your mathematica notebook to package (.m) to be able to use in script mode. If you save your notebook using "Save as.." all the command will be converted to comments (this was noted by Wolfram Research), so it's better that you create a package (File->New-Package), then copy and paste your commands to that.
Write the bash script using Vi editor (instead of Notepad or gedit for window) to avoid the problem of "\r" (http://www.linuxquestions.org/questions/programming-9/shell-scripts-in-windows-cygwin-607659/).
Here is a demo of the test.m file
str=$CommandLine;
len=Length[str];
Do[
If[str[[i]]=="-start",
start=ToExpression[str[[i+1]]];
Pause[start];
Print["Done in ",start," second"];
];
,{i,2,len-1}];
This mathematica code read the parameter from a commandline and use it for calculation.
Here is the bash script (script.sh) to run test.m many times with different parameters.
#c:\cygwin64\bin\bash
for ((i=2;i<10;i+=2))
do
math -script test.m -start $i
done
In the cygwin terminal type "chmod a+x script.sh" to enable the script then you can run it by typing "./script.sh".
You can programmatically terminate the kernel using Exit[]. The front end (notebook) will automatically start a new kernel when you next try to evaluate an expression.
Preserving "some code from the previous kernel" is going to be more difficult. You have to decide what you want to preserve. If you think you want to preserve everything, then there's no point in restarting the kernel. If you know what definitions you want to save, you can use DumpSave to write them to a file before terminating the kernel, and then use << to load that file into the new kernel.
On the other hand, if you know what definitions are taking up too much memory, you can use Unset, Clear, ClearAll, or Remove to remove those definitions. You can also set $HistoryLength to something smaller than Infinity (the default) if that's where your memory is going.
Sounds like a job for CleanSlate.
<< Utilities`CleanSlate`;
CleanSlate[]
From: http://library.wolfram.com/infocenter/TechNotes/4718/
"CleanSlate, tries to do everything possible to return the kernel to the state it was in when the CleanSlate.m package was initially loaded."

Resources