This question comes from the recent question "Correct way to cap Mathematica memory use?"
I wonder, is it possible to programmatically restart the MathKernel, keeping the current FrontEnd process connected to the new MathKernel process and evaluating some code in the new MathKernel session? I mean a "transparent" restart that allows the user to continue working with the FrontEnd while getting a fresh MathKernel process with some code from the previous kernel evaluated (or evaluating) in it?
The motivation for the question is to have a way to automate restarting the MathKernel when it takes too much memory, without breaking the computation. In other words, the computation should be continued automatically in the new MathKernel process without user interaction (while keeping the user's ability to interact with Mathematica as before). The details of what code should be evaluated in the new kernel are of course specific to each computational task; I am looking for a general way to continue the computation automatically.
From a comment by Arnoud Buzing yesterday on Stack Exchange Mathematica chat, quoted in full:
In a notebook, if you have multiple cells you can put Quit in a cell by itself and set this option:
SetOptions[$FrontEnd, "ClearEvaluationQueueOnKernelQuit" -> False]
Then if you have a cell above it and below it and select all three and evaluate, the kernel will Quit but the front end's evaluation queue will continue (and restart the kernel for the last cell).
-- Arnoud Buzing
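A minimal sketch of this three-cell arrangement (the cell contents are illustrative; $ProcessID lets you verify that the last cell really runs in a new kernel):

SetOptions[$FrontEnd, "ClearEvaluationQueueOnKernelQuit" -> False]

(* cell 1 *)
Print["old kernel: ", $ProcessID]

(* cell 2 *)
Quit

(* cell 3, evaluated by the freshly launched kernel *)
Print["new kernel: ", $ProcessID]

Select all three cells and evaluate them together: the two Print outputs should show different process IDs.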
The following approach runs one kernel that opens a front end with its own (second) kernel; the front end is then closed and reopened, renewing the second kernel.
This file is the MathKernel input, C:\Temp\test4.m:
Needs["JLink`"];
$FrontEndLaunchCommand="Mathematica.exe";
UseFrontEnd[
nb = NotebookOpen["C:\\Temp\\run.nb"];
SelectionMove[nb, Next, Cell];
SelectionEvaluate[nb];
];
Pause[8];
CloseFrontEnd[];
Pause[1];
UseFrontEnd[
nb = NotebookOpen["C:\\Temp\\run.nb"];
Do[SelectionMove[nb, Next, Cell],{12}];
SelectionEvaluate[nb];
];
Pause[8];
CloseFrontEnd[];
Print["Completed"]
The demo notebook, C:\Temp\run.nb, contains two cells:
x1 = 0;
Module[{},
  While[x1 < 1000000,
    If[Mod[x1, 100000] == 0, Print["x1=" <> ToString[x1]]]; x1++];
  NotebookSave[EvaluationNotebook[]];
  NotebookClose[EvaluationNotebook[]]]
Print[x1]

x1 = 0;
Module[{},
  While[x1 < 1000000,
    If[Mod[x1, 100000] == 0, Print["x1=" <> ToString[x1]]]; x1++];
  NotebookSave[EvaluationNotebook[]];
  NotebookClose[EvaluationNotebook[]]]
The initial kernel opens a front-end and runs the first cell, then it quits the front-end, reopens it and runs the second cell.
The whole thing can be run either by pasting the MathKernel input (in one go) into a kernel session, or from a batch file, e.g. C:\Temp\RunTest2.bat:
@echo off
setlocal
set PATH=C:\Program Files\Wolfram Research\Mathematica\8.0\;%PATH%
echo Launching MathKernel %TIME%
start MathKernel -noprompt -initfile "C:\Temp\test4.m"
rem crude 30-second delay before cleaning up
ping localhost -n 30 > nul
echo Terminating MathKernel %TIME%
taskkill /F /FI "IMAGENAME eq MathKernel.exe" > nul
endlocal
It's a little elaborate to set up, and in its current form it depends on knowing how long to wait before closing and restarting the second kernel.
Perhaps the parallel computation machinery could be used for this? Here is a crude set-up that illustrates the idea:
Needs["SubKernels`LocalKernels`"]
doSomeWork[input_] := {$KernelID, Length[input], RandomReal[]}
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, DistributeDefinitions[resultSoFar]
; Quiet[ParallelEvaluate[doSomeWork[resultSoFar], subkernel]] /.
{ $Failed :> (Print@"Ouch!"; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
This is an over-elaborate setup to generate a list of 1,000 triples of numbers. getTheJobDone runs a loop that continues until the result list contains the desired number of elements. Each iteration of the loop is evaluated in a subkernel. If the subkernel evaluation fails, the subkernel is relaunched. Otherwise, its return value is added to the result list.
To try this out, evaluate:
getTheJobDone[]
To demonstrate the recovery mechanism, open the Parallel Kernel Status window and kill the subkernel from time to time. getTheJobDone will feel the pain and print Ouch! whenever the subkernel dies. However, the overall job continues and the final result is returned.
The error-handling here is very crude and would likely need to be bolstered in a real application. Also, I have not investigated whether really serious error conditions in the subkernels (like running out of memory) would have an adverse effect on the main kernel. If so, then perhaps subkernels could kill themselves if MemoryInUse[] exceeded a predetermined threshold.
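For instance, here is a minimal sketch of that idea, assuming the toy doSomeWork from above and an arbitrary 200 MB threshold; define it before calling getTheJobDone[] so that DistributeDefinitions passes it to the subkernel. When the subkernel exits, ParallelEvaluate should return $Failed and the existing recovery branch relaunches it:

doSomeWork[input_] := (
  If[MemoryInUse[] > 200*^6, Exit[]];  (* hypothetical threshold; Exit[] ends the subkernel *)
  {$KernelID, Length[input], RandomReal[]}
)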
Update - Isolating the Main Kernel From Subkernel Crashes
While playing around with this framework, I discovered that any use of shared variables between the main kernel and subkernel rendered Mathematica unstable should the subkernel crash. This includes the use of DistributeDefinitions[resultSoFar] as shown above, and also explicit shared variables using SetSharedVariable.
To work around this problem, I transmitted the resultSoFar through a file. This eliminated the synchronization between the two kernels with the net result that the main kernel remained blissfully unaware of a subkernel crash. It also had the nice side-effect of retaining the intermediate results in the event of a main kernel crash as well. Of course, it also makes the subkernel calls quite a bit slower. But that might not be a problem if each call to the subkernel performs a significant amount of work.
Here are the revised definitions:
Needs["SubKernels`LocalKernels`"]
doSomeWork[] := {$KernelID, Length[Get[$resultFile]], RandomReal[]}
$resultFile = "/some/place/results.dat";
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, Put[resultSoFar, $resultFile]
; Quiet[ParallelEvaluate[doSomeWork[], subkernel]] /.
{ $Failed :> (Print@"Ouch!"; CloseKernels[subkernel]; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
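A side benefit of the file hand-off noted above: if the main kernel itself crashes, the partial results survive on disk and can be reloaded in a fresh session (a sketch, assuming the same $resultFile):

resultSoFar = Get[$resultFile]  (* recover the intermediate results after a main-kernel crash *)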
I have a similar requirement when I run a CUDAFunction in a long loop and CUDALink runs out of memory (similar to: https://mathematica.stackexchange.com/questions/31412/cudalink-ran-out-of-available-memory). There is no improvement in the memory leak even with the latest Mathematica 10.4. I figured out a workaround and hope you find it useful. The idea is to use a bash script to call a Mathematica program (run in batch mode) multiple times, passing parameters from the bash script. Here are detailed instructions and a demo (for Windows):
To use a bash script on Windows you need to install Cygwin (https://cygwin.com/install.html).
Convert your Mathematica notebook to a package (.m) so that it can be used in script mode. If you save your notebook using "Save As...", all the commands will be converted to comments (this was noted by Wolfram Research), so it is better to create a package (File -> New -> Package) and copy and paste your commands into it.
Write the bash script using the vi editor (rather than Notepad or gedit on Windows) to avoid the problem of stray "\r" line endings (http://www.linuxquestions.org/questions/programming-9/shell-scripts-in-windows-cygwin-607659/).
Here is a demo test.m file:
str = $CommandLine;
len = Length[str];
Do[
  If[str[[i]] == "-start",
    start = ToExpression[str[[i + 1]]];
    Pause[start];
    Print["Done in ", start, " second"];
  ],
  {i, 2, len - 1}];
This Mathematica code reads the parameter from the command line and uses it in the calculation.
Here is the bash script (script.sh) that runs test.m several times with different parameters:
#!c:\cygwin64\bin\bash
for ((i=2; i<10; i+=2))
do
  math -script test.m -start $i
done
In the Cygwin terminal, type "chmod a+x script.sh" to make the script executable; then run it by typing "./script.sh".
You can programmatically terminate the kernel using Exit[]. The front end (notebook) will automatically start a new kernel the next time you try to evaluate an expression.
Preserving "some code from the previous kernel" is going to be more difficult. You have to decide what you want to preserve. If you think you want to preserve everything, then there's no point in restarting the kernel. If you know what definitions you want to save, you can use DumpSave to write them to a file before terminating the kernel, and then use << to load that file into the new kernel.
On the other hand, if you know what definitions are taking up too much memory, you can use Unset, Clear, ClearAll, or Remove to remove those definitions. You can also set $HistoryLength to something smaller than Infinity (the default) if that's where your memory is going.
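A minimal sketch of that save/restore cycle (the file name and the saved symbols are illustrative):

$HistoryLength = 10;  (* keep less In/Out history than the default Infinity *)
DumpSave["C:\\Temp\\state.mx", {bigData, importantDefs}];  (* hypothetical symbols to preserve *)
Exit[]

(* first input in the new kernel: *)
<< "C:\\Temp\\state.mx"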
Sounds like a job for CleanSlate.
<< Utilities`CleanSlate`;
CleanSlate[]
From: http://library.wolfram.com/infocenter/TechNotes/4718/
"CleanSlate, tries to do everything possible to return the kernel to the state it was in when the CleanSlate.m package was initially loaded."
Related
I've implemented this code in my .gdbinit file to make gdb commands that require a stopped target (such as x, set, etc.) work:
define hook-x
  if $pince_debugging_mode == 0
    interrupt
  end
end

define hook-stop
  if $pince_debugging_mode == 0
    c &
  end
end
The purpose of the $pince_debugging_mode variable is to tell gdb whether the program is interrupting the target for debugging purposes or not.
I have a concern about signal concurrency with this design. Let's say we put a breakpoint on an address, and while waiting for it to trigger we want to check some addresses with the x command. After the x command executes, hook-stop runs, because the x command stopped the thread. Now suppose the breakpoint is reached while hook-stop is executing: hook-stop isn't aware of that, $pince_debugging_mode still equals 0, so it executes the c & command and the target continues. The stop at the breakpoint is thus lost to this concurrency problem.
This problem has never occurred yet, but I'm afraid of the odds of it occurring; even if they are very low, I don't want to take the risk. What can I do to avoid this potential problem?
Note: defining a hookpost-x is problematic because whenever the x command throws an exception, the hookpost won't be executed, so it would be impossible to continue after an exception-throwing x command.
I'm writing a Windows Forms program (C++/CLI) that calls an executable multiple times within a large for loop. I want the calls to the executable to run in parallel, since a single run takes up to a minute.
The key part of the Windows Forms code is the large for loop (actually 2 loops):
for (int a = 0; a < 1000; a++) {
    for (int b = 0; b < 100; b++) {
        int run = a*100 + b;
        char startstr[50], configstr[50];
        strcpy(startstr, "solver.exe");
        sprintf(configstr, " %d %d %d", run, a, b);
        strcat(startstr, configstr);
        CreateProcessA(NULL, startstr, ......);
    }
}
The integers "run", "a" and "b" are used by the solver.exe program.
"Run" is used to write a unique output text file from each program run.
"a" and "b" are numbers used to read specific input text files. These are not unique to each run.
I'm not waiting after each call to "CreateProcess" as I want these to execute in parallel.
Currently my code runs and appears to work correctly. However, it spawns a huge number of instances of solver.exe at once, causing my computer to become very slow until everything finishes.
My question is, how can I create a queue that limits the number of concurrent processes (for example to the number of physical cores on the machine) so that they don't all try to run at the same time? Memory may also be an issue when the for loops are set larger.
A secondary question is, could potential concurrent file reads by different instances of solver.exe create a problem? (I can fix this but don't want to if I don't need to.)
I'm familiar with openmp and C but this is my first attempt at running parallel processes in a windows forms program.
Thanks
I've managed to do what I want using the OpenMP "parallel for" directive to run the outer loop in parallel, and omp_set_num_threads() to set the number of concurrent threads. As suggested, the concurrent file reads haven't caused any problems on my system.
I'm running some code written in Fortran. It is made up of several subroutines, and I share variables among them using global variables specified in a module.
The problem occurs when using multiple CPUs. In one subroutine the code should update the value of a local variable with the value of a global variable. It so happens that on some random passes through the subroutine the code does not update the variable when I run it on multiple CPUs. However, if I pause it and step back to force the code through the piece that updates the variable, it works! Magic! I've therefore implemented a loop that checks whether the variable was updated and uses GOTOs to jump back and retry, but even with 2 retries it still sometimes fails to update the variable. If I run the code on only one core, it works fine. Any ideas?
Thanks
Piece of code:
Subroutine1() !Where the variable A0 should be updated
nTries = 0
777 IF (nItems.NE.0) THEN
DO J = 1,nItems
IF (nint(mDATA(J,3)).EQ.nint(XCOORD+U1NE0)
& .AND. nint(mDATA(J,4)).EQ.nint(YCOORD+U2NE0) .AND.
2 nint(mDATA(J,5)).EQ.nint(ZCOORD+U3NE0)) THEN
A0 = mDATA(J,1)
JNODE = mDATA(J,2)
EXIT
ELSE
A0 = A02
ENDIF
ENDDO
IF (A0.EQ.ZERO) THEN !If the variable was not updated
IF (nTries.LE.2) THEN
nTries = nTries + 1
GOTO 777
ENDIF
write(6,*) "ZERO A0", IELEM, JTYPE
A0 = MAXT
ENDIF
I don't know exactly how Abaqus interacts with your Fortran subroutines, nor is it clear from the above code what is going wrong, but what you're running into seems to be a classic example of a "race condition," which is what you're calling "one core going ahead of the other."
A general comment is that GOTOs and global variables are extremely dangerous in that they make programs very hard to reason about. These problems compound once you start parallelizing. If Abaqus is doing some kind of "black box" computation that it is responsible for parallelizing, you (as a user who is only preprocessing and postprocessing the data) should be insulated from this. However, from the above, it sounds like you're doing some stuff that is interleaved with the Abaqus parallel computation. In that case, you need to make sure everything you're doing is thread-safe. Among many other things, you absolutely need to make sure you're not writing to any global variables.
Another comment is that your checking of A0 is basically a lock called a "spinlock." This is one way of making things thread-safe, but locks have pitfalls of their own. If Abaqus doesn't give you a way to synchronize all of the threads and guarantee that it's done with its job, some sort of lock like this may be the way to go.
Running MATLAB R2010b on Windows 7 Enterprise.
In my MATLAB scripts, I save a bunch of results to a Word file and then, at the end, close the document and quit Word. The code I use is:
WordFname = ['BatInfoDoc' sprintf('%0.3f',now) '.doc']; % serialnumbered filenames
WordFile = fullfile(pwd,WordFname);
WordApp = actxserver('Word.Application');
WordDoc = WordApp.Documents.Add;
WordDoc.SaveAs2(WordFile);
....
WordApp.Selection.TypeText([title2 title3 title4 title5 title6]);
WordApp.Selection.TypeParagraph;
then finally at the end of the script
WordDoc.Close;
WordApp.Quit;
The problem I have is that during development I often crash the MATLAB script and wind up leaving orphaned WINWORD.EXE processes, each of which keeps a lock on the file it had been writing.
Up until now I have been using Task Manager to kill these processes one at a time by hand. Having been developing all morning, I find myself with around 20 files I can't delete because they are locked by about 11 orphaned WINWORD.EXE processes!
My question(s):
1) Is there an elegant way to handle the file writing, saving, and closing so that I don't lock up files and processes when my script crashes before it reaches the part where I close the file and quit Word?
2) Is there an elegant way to identify the bad processes and delete them from within a MATLAB script? That is, can I code my MATLAB so it cleans up after itself?
ADDED A FEW MINUTES LATER:
By the way, I would prefer NOT to enclose all my code in a big try-catch and then close the windows after I've caught my error. The problem with this is that I like to go into debug mode on error, and caught errors don't bring me into debug mode.
Straight after you create Wordapp, use c = onCleanup(@()Wordapp.Quit). When your function exits, either naturally or with a crash, c will be deleted and its function will execute, quitting Word. If this is part of a script rather than a function, you can manually delete c to quit.
Also, while developing/debugging, I would set Wordapp.Visible to true so you can manually close Word if necessary. Set it back to false for production.
Use a handle class to delete them automatically.
classdef SafeWord < handle
    properties(Access=public)
        WordApp;
    end
    methods(Access=public)
        function this = SafeWord(WordApp)
            this.WordApp = WordApp;
        end
        function delete(this)
            % Called automatically when the object is destroyed
            this.WordApp.Quit;
        end
    end
end
And the use case:
sw = SafeWord(actxserver('Word.Application'));
WordApplication = sw.WordApp;
% Do something here
% When sw ends its lifecycle, its delete method is called.
Here is a related question.
LinkClose[link] "does not necessarily terminate the program at the other end
of the connection" as it is said in the Documentation. Is there a way to kill the
process of the slave kernel securely?
EDIT:
In reality I need a function in Mathematica that returns only when the process of the slave kernel has already been killed and its memory has already been released. Neither LinkInterrupt[link, 1] nor LinkClose[link] waits for the slave kernel to exit. At the moment the only such function seems to be the killProc[procID] function I show in one of the answers on this page. But is there a built-in analog?
At the moment I know only one method to kill the MathKernel process securely. This method uses NETLink, seems to work only under Windows, and requires Microsoft .NET 2 or later to be installed.
killProc[processID_] := If[$OperatingSystem === "Windows",
  Needs["NETLink`"];
  Symbol["LoadNETType"]["System.Diagnostics.Process"];
  With[{procID = processID},
    killProc[procID_] := (
      proc = Process`GetProcessById[procID];
      proc@Kill[]
    );
  ];
  killProc[processID]
];
(*Killing the current MathKernel process*)
killProc[$ProcessID]
Any suggestions or improvements will be appreciated.
Edit:
The more correct method:
Needs["NETLink`"];
LoadNETType["System.Diagnostics.Process"];
$kern = LinkLaunch[First[$CommandLine] <> " -mathlink -noinit"];
LinkRead[$kern];
LinkWrite[$kern, Unevaluated[$ProcessID]];
$kernProcessID = First@LinkRead[$kern];
$kernProcess = Process`GetProcessById[$kernProcessID];
AbortProtect[
  If[! ($kernProcess@Refresh[]; $kernProcess@HasExited),
    $kernProcess@Kill[]; $kernProcess@WaitForExit[];
    $kernProcess@Close[]];
  LinkClose[$kern]]
Edit 2:
Even more correct method:
Needs["NETLink`"];
LoadNETType["System.Diagnostics.Process"];
$kern = LinkLaunch[First[$CommandLine] <> " -mathlink -noinit"];
LinkRead[$kern];
LinkWrite[$kern, Unevaluated[$ProcessID]];
$kernProcessID = First@LinkRead[$kern];
$kernProcess = Process`GetProcessById[$kernProcessID];
krnKill := AbortProtect[
  If[TrueQ[MemberQ[Links[], $kern]], LinkClose[$kern]];
  If[TrueQ[MemberQ[LoadedNETObjects[], $kernProcess]],
    If[! TrueQ[$kernProcess@WaitForExit[100]],
      Quiet@$kernProcess@Kill[]; $kernProcess@WaitForExit[]];
    $kernProcess@Close[]; ReleaseNETObject[$kernProcess];
  ]
];
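A hedged usage sketch with the definitions above: exchange something with the slave kernel over the link, then tear everything down with krnKill.

LinkWrite[$kern, Unevaluated[2 + 2]];
First@LinkRead[$kern]
krnKill  (* closes the link, kills the process if it has not exited, releases the .NET object *)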
Todd Gayley has answered my question in the newsgroup. The solution is to send the slave kernel an MLTerminateMessage. From top-level code:
LinkInterrupt[link, 1]  (* an undocumented form that lets you pick the message type *)
In C:
MLPutMessage(link, MLTerminateMessage);
In Java using J/Link:
link.terminateKernel();
In .NET using .NET/Link:
link.TerminateKernel();
EDIT:
I have discovered that in standard cases, when using LinkInterrupt[link, 1], my operating system (Windows 2000 at the moment) releases physical memory only 0.05-0.1 seconds after the execution of LinkInterrupt[link, 1], whereas with LinkClose[link] it releases physical memory within 0.01-0.03 seconds (both values include the time spent executing the command itself). The time intervals were measured with SessionTime[] under equal conditions and are steadily reproducible.