Run a SAS macro in parallel - parallel-processing

So I have a macro similar to this, with the objective of calculating information value:
%macro iv_calc(x,event,varlist);
data main_table;
set &x(keep=&event &varlist);
run;
/**** Steps to compute IV ****/
%mend iv_calc;
x is the name of the dataset, event is the name of the dependent variable, and varlist holds the names of all the independent variables as a macro variable.
The number of variables in varlist is unknown and could vary from 100 to 2000+. As a result, the macro takes a very long time to run. I'm new to this, so I would like to understand whether there is a way to split varlist into 2 and run the same macro in parallel on each half (keeping event in both, because it is needed to compute information value), so as to reduce the runtime. My first thought was to resort to a shell script, but the number of variables is unknown, and there lies the problem. Any help will be greatly appreciated. Thanks a lot.

Managing parallel execution in SAS is rather inconvenient and involves SAS MP Connect / SAS Grid (signon/rsubmit).
Parallel execution in the shell is much easier, for example:
echo "param1 param2 param3" | tr ' ' '\n' | xargs -I{} -P 2 ./run-sas.sh {}
-P 2 sets the maximum number of processes that run in parallel. I covered passing parameters to a child SAS session in a recent answer.
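Because the number of variables is not known in advance, the list can be split programmatically before each piece is handed to a parallel worker. The following is a minimal Python sketch and not part of the original answer: split_varlist is a hypothetical helper name, the variable names are made up, and the script that would receive each chunk (run-sas.sh above) is not shown.

```python
def split_varlist(varlist, n_chunks):
    """Split a space-separated variable list into n_chunks nearly equal parts."""
    names = varlist.split()
    # Never create more chunks than there are variables (and at least one).
    n_chunks = max(1, min(n_chunks, len(names) or 1))
    size, extra = divmod(len(names), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks take one additional variable each.
        end = start + size + (1 if k < extra else 0)
        chunks.append(" ".join(names[start:end]))
        start = end
    return chunks


# Hypothetical variable names, for illustration only.
chunks = split_varlist("v1 v2 v3 v4 v5", 2)
for chunk in chunks:
    print(chunk)  # one chunk per line, ready to be piped to xargs -P
```

Each printed line would become one parameter set for a separate SAS session; the event variable has to be passed to every session alongside its chunk.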

Related

SAS macro inside data step runs parallelized? Is It possible?

I have the following table process_table
table       index
TABLE_001   1
TABLE_002   2
TABLE_003   3
TABLE_004   4
I also have a macro that creates what I call type_a tables from the rows of process_table.
For example, the input TABLE_001 generates TABLE_001_A.
%macro create_table_type_a(table_name);
proc sql;
create table temp.&table_name._A as
select
/*some process*/
from &table_name;
quit;
%mend create_table_type_a;
And then I run
data _null_;
set process_table;
call execute('%create_table_type_a('||table||')');
run;
Well, I have two questions.
1 - Does SAS process the macro calls sequentially, one row after another, or in parallel? I couldn't find the answer on the internet.
2 - If they are not parallelized, is it possible to parallelize them using the same strategy? The tables to be processed are huge, and I don't know how to parallelize the process in SAS.
Thanks.
Good question.
No. The macros run sequentially, meaning that %create_table_type_a(TABLE_001) runs before %create_table_type_a(TABLE_002), and so on. This is because CALL EXECUTE merely stacks the code generated by the macro calls during the data step and executes it after the data step has finished.
It is possible, but probably advanced. Reeza's question of 'how huge' is pretty relevant before moving on to advanced solutions for running macros in parallel.
You could spawn separate SAS processes for each macro call (within its own program), then wait for both to finish before proceeding.
Example
%MACRO SPAWN(PGM,JOBID) ;
systask command "/path/to/sasexe /path/to/programs/&PGM" status=job_&JOBID taskname="job_&JOBID" ;
%MEND ;
/* Run jobs asynchronously */
%SPAWN(Program1.sas,pgm1) ;
%SPAWN(Program2.sas,pgm2) ;
/* Wait for both to finish */
waitfor _ALL_ job_pgm1 job_pgm2 ;
/* ... continue processing ... */
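The spawn-then-wait pattern above is not specific to SAS. As a rough analogue only, here is a minimal Python sketch of the same idea: start each job without blocking, then wait for all of them before continuing. The child commands are stand-in Python one-liners rather than SAS programs, and the names spawn and run_all are hypothetical.

```python
import subprocess
import sys


def spawn(code):
    """Start a child process and return immediately, without waiting for it."""
    return subprocess.Popen([sys.executable, "-c", code])


def run_all(snippets):
    """Launch every job first, then block until all of them have finished."""
    jobs = [spawn(code) for code in snippets]
    # Popen.wait() returns the exit status of each child.
    return [job.wait() for job in jobs]


# Stand-in jobs; in the SAS setting each would be a separate program file.
exit_codes = run_all(["print('job 1 done')", "print('job 2 done')"])
```

Checking the collected exit codes before continuing plays the role of inspecting the status variables after waitfor.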

Missing print-out for MPI root process, after its handling data reading alone

I'm writing a program that first has the root process read a large data file and do some calculations, and then broadcasts the calculated results to all other processes. Here is my code: (1) it reads random numbers from a txt file with n_sample=30000, (2) generates the dens_ent matrix by some rule, and (3) broadcasts it to the other processes. Btw, I'm using OpenMPI with gfortran.
IF (myid==0) THEN
OPEN(UNIT=8,FILE='rnseed_ent20.txt')
DO i=1,n_sample
DO j=1,3
READ(8,*) rn(i,j)
END DO
END DO
CLOSE(8)
END IF
dens_ent=0.0d0
DO i=1,n_sample
IF (myid==0) THEN
!Random draws of productivity and savings
rn_zb=MC_JOINT_SAMPLE((/-0.1d0,mu_b0/),var,rn(i,1:2))
iz=minloc(abs(log(zgrid)-rn_zb(1)),dim=1)
ib=minloc(abs(log(bgrid(1:nb/2))-rn_zb(2)),dim=1) !Find the closest saving grid
CALL SUB2IND(j,(/nb,nm,nk,nxi,nz/),(/ib,1,1,1,iz/))
DO iixi=1,nxi
DO iiz=1,nz
CALL SUB2IND(jj,(/nb,nm,nk,nxi,nz/),(/policybmk_2_statebmk_index(j,:),iixi,iiz/))
dens_ent(jj)=dens_ent(jj)+1.0d0/real(nxi)*markovian(iz,iiz)*merge(1.0d0,0.0d0,vent(j) .GE. -bgrid(ib)+ce)
!Density only recorded if the value of entry is greater than b0+ce
END DO
END DO
END IF
END DO
PRINT *, 'dingdongdingdong',myid
IF (myid==0) dens_ent=dens_ent/real(n_sample)*Mpo
IF (myid==0) PRINT *, 'sum_density by joint normal distribution',sum(dens_ent)
PRINT *, 'BLBLALALALALALA',myid
CALL MPI_BCAST(dens_ent,N,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
The problem:
(1) IF (myid==0) PRINT *, 'sum_density by joint normal distribution',sum(dens_ent) does not seem to be executed, as there is no printout.
(2) I then verified this by adding messages such as PRINT *, 'BLBLALALALALALA',myid. Again, there is no printout for the root process myid=0.
It seems like the root process is not working. How can this be? I'm quite confused. Is it because I'm not using MPI_BARRIER before PRINT *, 'dingdongdingdong',myid?
Is it possible that you are missing the following statement at the very beginning of your code?
CALL MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
IF (ierr /= MPI_SUCCESS) THEN
STOP "MPI_COMM_RANK failed!"
END IF
MPI_COMM_RANK returns in myid (if it succeeds) the identifier of the process within the MPI_COMM_WORLD communicator (i.e. a value between 0 and NP-1, where NP is the total number of MPI ranks).
Thanks for the contributions from @cw21, @Harald and @Hristo Iliev.
The failure lies in unit numbering. One reference says:
unit number : This must be present and takes any integer type. Note this ‘number’ identifies the
file and must be unique so if you have more than one file open then you must specify a different
unit number for each file. Avoid using 0,5 or 6 as these UNITs are typically picked to be used by
Fortran as follows.
– Standard Error = 0 : Used to print error messages to the screen.
– Standard In = 5 : Used to read in data from the keyboard.
– Standard Out = 6 : Used to print general output to the screen.
So I changed every unit number i into 1i, which did not work; then I changed them into 10i, and it started to work. Mysteriously, as @Hristo Iliev correctly pointed out, the code should behave properly as long as the unit numbers are not 0, 5 or 6. I cannot explain why 1i did not work. But anyhow, the root process is now printing results.

Is it better to declare a local inside or outside a loop? [duplicate]

I am used to doing this:
do
local a
for i=1,1000000 do
a = <some expression>
<...> --do something with a
end
end
instead of
for i=1,1000000 do
local a = <some expression>
<...> --do something with a
end
My reasoning is that creating a local variable 1000000 times is less efficient than creating it just once and reusing it on each iteration.
My question is: is this true, or is there another technical detail I am missing? I am asking because I don't see anyone doing this, but I am not sure whether that is because the advantage is too small or because it is in fact worse. By better I mean using less memory and running faster.
Like any performance question, measure first.
On a Unix system you can use time:
time lua -e 'local a; for i=1,100000000 do a = i * 3 end'
time lua -e 'for i=1,100000000 do local a = i * 3 end'
the output:
real 0m2.320s
user 0m2.315s
sys 0m0.004s
real 0m2.247s
user 0m2.246s
sys 0m0.000s
The more local version appears to be a small percentage faster in Lua, since it does not initialize a to nil. However, that is not the reason to use it: use the most local scope because it is more readable (this is good style in all languages: see this question asked for C, Java, and C#).
If you are reusing a table instead of creating it in the loop then there is likely a more significant performance difference. In any case, measure and favour readability whenever you can.
I think there's some confusion about the way compilers deal with variables. From a high-level kind of human perspective, it feels natural to think of defining and destroying a variable to have some kind of "cost" associated with it.
Yet that's not necessarily the case to the optimizing compiler. The variables you create in a high-level language are more like temporary "handles" into memory. The compiler looks at those variables and then translates it into an intermediate representation (something closer to the machine) and figures out where to store everything, predominantly with the goal of allocating registers (the most immediate form of memory for the CPU to use). Then it translates the IR into machine code where the idea of a "variable" doesn't even exist, only places to store data (registers, cache, dram, disk).
This process includes reusing the same registers for multiple variables provided that they do not interfere with each other (provided that they are not needed simultaneously: not "live" at the same time).
Put another way, with code like:
local a = <some expression>
The resulting assembly could be something like:
load gp_register, <result from expression>
... or it may already have the result from some expression in a register, and the variable ends up disappearing completely (just using that same register for it).
... which means there's no "cost" to the existence of the variable. It just translates directly to a register which is always available. There's no "cost" to "creating a register", as registers are always there.
When you start creating variables at a broader (less local) scope, contrary to what you think, you may actually slow down the code. When you do this superficially, you're kind of fighting against the compiler's register allocation, and making it harder for the compiler to figure out what registers to allocate for what. In that case, the compiler might spill more variables into the stack which is less efficient and actually has a cost attached. A smart compiler may still emit equally-efficient code, but you could actually make things slower. Helping the compiler here often means more local variables used in smaller scopes where you have the best chance for efficiency.
In assembly code, reusing the same registers whenever you can is efficient to avoid stack spills. In high-level languages with variables, it's kind of the opposite. Reducing the scope of variables helps the compiler figure out which registers it can reuse because using a more local scope for variables helps inform the compiler which variables aren't live simultaneously.
Now there are exceptions when you start involving user-defined constructor and destructor logic in languages like C++ where reusing an object might prevent redundant construction and destruction of an object that can be reused. But that doesn't apply in a language like Lua, where all variables are basically plain old data (or handles into garbage-collected data or userdata).
The only case where you might see an improvement from using fewer local variables is if that somehow reduces work for the garbage collector. But that's not going to be the case if you simply re-assign to the same variable. To do that, you would have to reuse whole tables or userdata (without re-assigning). Put another way, reusing the same fields of a table without recreating a whole new one might help in some cases, but reusing the variable used to reference the table is very unlikely to help and could actually hinder performance.
All local variables are "created" at compile (load) time and are simply indexes into the locals block of the function's activation record. Each time you define a local, that block grows by 1. Each time a do..end or other lexical block ends, it shrinks back. The peak value is used as the total size:
function ()
local a -- current:1, peak:1
do
local x -- current:2, peak:2
local y -- current:3, peak:3
end
-- current:1, peak:3
do
local z -- current:2, peak:3
end
end
The above function has 3 local slots (determined at load, not at runtime).
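The current/peak bookkeeping in the comments above can be reproduced with a few lines of code. This is a minimal Python sketch of the counting rule only, not of how luac is actually implemented; peak_slots is a hypothetical name, and the nested-list encoding (a string is a local declaration, a list is a do..end block) is an assumption made for illustration.

```python
def peak_slots(block, current=0):
    """Return the peak number of live local slots for a nested block structure.

    A string stands for one `local` declaration; a nested list stands for a
    do..end block whose locals are released when the block ends.
    """
    peak = current
    for item in block:
        if isinstance(item, list):
            # The inner block starts from the current count and gives its
            # slots back afterwards, so only the peak is carried over.
            peak = max(peak, peak_slots(item, current))
        else:
            current += 1
            peak = max(peak, current)
    return peak


# Same shape as the function above: local a; do x, y end; do z end
body = ["a", ["x", "y"], ["z"]]
print(peak_slots(body))  # 3
```

Because the second block starts again from the count left by a, z reuses a slot that x occupied earlier, which is why the total stays at 3.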
Regarding your case, there is no difference in locals block size, and moreover, luac/5.1 generates equal listings (only indexes change):
$ luac -l -
local a; for i=1,100000000 do a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7fee6b600000)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 1 -1 ; 1
2 [1] LOADK 2 -2 ; 100000000
3 [1] LOADK 3 -1 ; 1
4 [1] FORPREP 1 1 ; to 6
5 [1] MUL 0 4 -3 ; - 3 // [0] is a
6 [1] FORLOOP 1 -2 ; to 5
7 [1] RETURN 0 1
vs
$ luac -l -
for i=1,100000000 do local a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7f8302d00020)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 0 -1 ; 1
2 [1] LOADK 1 -2 ; 100000000
3 [1] LOADK 2 -1 ; 1
4 [1] FORPREP 0 1 ; to 6
5 [1] MUL 4 3 -3 ; - 3 // [4] is a
6 [1] FORLOOP 0 -2 ; to 5
7 [1] RETURN 0 1
// [n]-comments are mine.
First, note this: defining the variable inside the loop ensures that the next iteration cannot see the value stored in the previous one. Defining it before the for-loop makes it possible to carry a variable through multiple iterations, like any other variable not defined within the loop.
Further, to answer your question: Yes, it is less efficient, because it re-initializes the variable. If the Lua compiler or JIT has good pattern recognition, it might just reset the variable, but I can neither confirm nor deny that.

How can I debug a Fortran READ/WRITE statement with an implicit DO loop?

The Fortran program I am working on encounters a runtime error when processing an input file.
At line 182 of file ../SOURCE_FILE.f90 (unit = 1, file = 'INPUT_FILE.1')
Fortran runtime error: Bad value during integer read
Looking to line 182 I see a READ statement with an implicit/implied DO loop:
182: READ(IT4, 310 )((IPPRM2(IP,I),IP=1,NP),I=1,16) ! read 6 integers
183: READ(IT4, 320 )((PPARM2(IP,I),IP=1,NP),I=1,14) ! read 5 reals
Format statement:
310 FORMAT(1X,6I12)
When I reach this code in the debugger NP has a value of 2. I has a value of 6, and IP has a value of 67. I think I and IP should be reinitialized in the loop.
My problem is that when I try to step through in the debugger once I get to the READ statement it seems to execute and then throw the error. I'm not sure how to follow it as it reads. I tried stepping into the function, but it seems like that may be a difficult route to take since I am unfamiliar with the gfortran library. The input file looks OK, I think it should be read just fine. This makes me think this READ statement isn't looping as intended.
I am completely new to Fortran and implicit DO loops like this, but from what I can gather line 182 should read in 6 integers according to the format string #310. However, when I arrive NP has a value of 2 which makes me think it will only try to read 2 integers 16 times.
How can I debug this read statement to examine the values read into IPPARM as they are read from the file? Will I have to step through the Fortran library?
Any tips that can clear up my confusion regarding these implicit loops would be appreciated!
Thanks!
NOTE: I'm using gfortran/gcc and gdb on Linux.
Is there any reason you need specific formatting on the read? I would use READ(IT4, *) where feasible...
Later versions of gfortran support the unlimited format item (see http://fortranwiki.org/fortran/show/Fortran+2008+status)
Then it may be helpful to specify
310 FORMAT(*(1X,6I12))
Or for older compilers
310 FORMAT(1000(1X,6I12))
The variables IP and I are loop indices, so they are reinitialized by the loop. With NP=2 the first statement reads a total of 32 integers: the implied DO loops determine the list of items to read, and the format determines how they are read. With "1X,6I12" they are read as 6 integers per line of the input file. When the first 6 of the requested 32 integers have been read from a line/record, Fortran considers that line/record complete and advances to the next record.
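To see which array element each value from the file lands in, it can help to enumerate the item list by hand. The following is a minimal Python sketch of the ordering only, assuming NP=2 as reported in the question; the function names are hypothetical and nothing here calls Fortran.

```python
def implied_do_items(n_inner, n_outer):
    """List (IP, I) pairs in the order ((X(IP,I),IP=1,NP),I=1,n_outer) visits them.

    The innermost index (IP) varies fastest, as in the Fortran implied DO.
    """
    return [(ip, i) for i in range(1, n_outer + 1) for ip in range(1, n_inner + 1)]


def split_into_records(items, per_record):
    """Group the item list into records of at most per_record items each."""
    return [items[k:k + per_record] for k in range(0, len(items), per_record)]


items = implied_do_items(2, 16)          # NP=2, I=1..16
records = split_into_records(items, 6)   # 6I12 gives 6 integers per record

print(len(items), len(records))
for number, record in enumerate(records, start=1):
    print(number, record)
```

With these numbers the 32 items spread over six input lines, the last of which supplies only two values, so the input file needs at least six correctly formatted lines for line 182 alone.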
With a format of "1X,6I12" the integers must be precisely arranged in the file. There should be a single blank, then the integers should each be right-justified in fields of 12 columns. If they get out of alignment you could get the wrong value read or a runtime error.
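One way to check an input line outside the debugger is to cut it into the same fixed-width fields that the format describes. This is a minimal Python sketch of the "1X,6I12" layout, not a full emulation of Fortran formatted input; read_6i12 is a hypothetical name, and treating an all-blank field as zero is the only Fortran input rule it imitates.

```python
def read_6i12(line):
    """Parse one record laid out as 1X,6I12: skip 1 column, then six 12-column integers."""
    values = []
    for k in range(6):
        field = line[1 + 12 * k: 1 + 12 * (k + 1)]
        # An all-blank (or missing) field is read as zero; anything that is
        # not an integer raises ValueError, much like "Bad value during integer read".
        values.append(int(field) if field.strip() else 0)
    return values


# A correctly aligned record: one blank, then six right-justified 12-column fields.
good_line = " " + "".join(f"{v:12d}" for v in (1, 2, 3, 4, 5, 6))
print(read_6i12(good_line))  # [1, 2, 3, 4, 5, 6]
```

Running each suspect line of the input file through such a check shows quickly whether a stray character or a shifted column is what the runtime error is complaining about.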

Self-restarting MathKernel - is it possible in Mathematica?

This question comes from the recent question "Correct way to cap Mathematica memory use?"
I wonder, is it possible to programmatically restart MathKernel keeping the current FrontEnd process connected to new MathKernel process and evaluating some code in new MathKernel session? I mean a "transparent" restart which allows a user to continue working with the FrontEnd while having new fresh MathKernel process with some code from the previous kernel evaluated/evaluating in it?
The motivation for the question is to have a way to automatize restarting of MathKernel when it takes too much memory without breaking the computation. In other words, the computation should be automatically continued in new MathKernel process without interaction with the user (but keeping the ability for user to interact with the Mathematica as it was originally). The details on what code should be evaluated in new kernel are of course specific for each computational task. I am looking for a general solution how to automatically continue the computation.
From a comment by Arnoud Buzing yesterday, on Stack Exchange Mathematica chat, quoting entirely:
In a notebook, if you have multiple cells you can put Quit in a cell by itself and set this option:
SetOptions[$FrontEnd, "ClearEvaluationQueueOnKernelQuit" -> False]
Then if you have a cell above it and below it and select all three and evaluate, the kernel will Quit but the frontend evaluation queue will continue (and restart the kernel for the last cell).
-- Arnoud Buzing
The following approach runs one kernel to open a front-end with its own kernel, which is then closed and reopened, renewing the second kernel.
This file is the MathKernel input, C:\Temp\test4.m
Needs["JLink`"];
$FrontEndLaunchCommand="Mathematica.exe";
UseFrontEnd[
nb = NotebookOpen["C:\\Temp\\run.nb"];
SelectionMove[nb, Next, Cell];
SelectionEvaluate[nb];
];
Pause[8];
CloseFrontEnd[];
Pause[1];
UseFrontEnd[
nb = NotebookOpen["C:\\Temp\\run.nb"];
Do[SelectionMove[nb, Next, Cell],{12}];
SelectionEvaluate[nb];
];
Pause[8];
CloseFrontEnd[];
Print["Completed"]
The demo notebook, C:\Temp\run.nb contains two cells:
x1 = 0;
Module[{},
While[x1 < 1000000,
If[Mod[x1, 100000] == 0, Print["x1=" <> ToString[x1]]]; x1++];
NotebookSave[EvaluationNotebook[]];
NotebookClose[EvaluationNotebook[]]]
Print[x1]
x1 = 0;
Module[{},
While[x1 < 1000000,
If[Mod[x1, 100000] == 0, Print["x1=" <> ToString[x1]]]; x1++];
NotebookSave[EvaluationNotebook[]];
NotebookClose[EvaluationNotebook[]]]
The initial kernel opens a front-end and runs the first cell, then it quits the front-end, reopens it and runs the second cell.
The whole thing can be run either by pasting (in one go) the MathKernel input into a kernel session, or it can be run from a batch file, e.g. C:\Temp\RunTest2.bat
@echo off
setlocal
PATH = C:\Program Files\Wolfram Research\Mathematica\8.0\;%PATH%
echo Launching MathKernel %TIME%
start MathKernel -noprompt -initfile "C:\Temp\test4.m"
ping localhost -n 30 > nul
echo Terminating MathKernel %TIME%
taskkill /F /FI "IMAGENAME eq MathKernel.exe" > nul
endlocal
It's a little elaborate to set up, and in its current form it depends on knowing how long to wait before closing and restarting the second kernel.
Perhaps the parallel computation machinery could be used for this? Here is a crude set-up that illustrates the idea:
Needs["SubKernels`LocalKernels`"]
doSomeWork[input_] := {$KernelID, Length[input], RandomReal[]}
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, DistributeDefinitions[resultSoFar]
; Quiet[ParallelEvaluate[doSomeWork[resultSoFar], subkernel]] /.
{ $Failed :> (Print@"Ouch!"; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
This is an over-elaborate setup to generate a list of 1,000 triples of numbers. getTheJobDone runs a loop that continues until the result list contains the desired number of elements. Each iteration of the loop is evaluated in a subkernel. If the subkernel evaluation fails, the subkernel is relaunched. Otherwise, its return value is added to the result list.
To try this out, evaluate:
getTheJobDone[]
To demonstrate the recovery mechanism, open the Parallel Kernel Status window and kill the subkernel from time-to-time. getTheJobDone will feel the pain and print Ouch! whenever the subkernel dies. However, the overall job continues and the final result is returned.
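The control flow of getTheJobDone (call the worker, relaunch it on failure, otherwise append the result) can be shown independently of Mathematica. The following is a minimal Python sketch under stated assumptions: FlakyWorker is a hypothetical stand-in for a subkernel that fails on every third call, and no real process is launched or killed.

```python
class FlakyWorker:
    """Hypothetical stand-in for a subkernel: raises on every third call."""

    def __init__(self):
        self.calls = 0

    def do_some_work(self, result_so_far):
        self.calls += 1
        if self.calls % 3 == 0:
            raise RuntimeError("worker died")
        return len(result_so_far)


def get_the_job_done(target, make_worker=FlakyWorker):
    """Collect `target` results, replacing the worker whenever it fails."""
    worker = make_worker()
    results, restarts = [], 0
    while len(results) < target:
        try:
            results.append(worker.do_some_work(results))
        except RuntimeError:
            # Corresponds to the $Failed branch: report and relaunch.
            print("Ouch!")
            worker = make_worker()
            restarts += 1
    return results, restarts


results, restarts = get_the_job_done(5)
print(results, restarts)
```

A failed call adds nothing to the result list, so the loop simply retries that step with the fresh worker, which is the same behaviour as the While loop above.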
The error-handling here is very crude and would likely need to be bolstered in a real application. Also, I have not investigated whether really serious error conditions in the subkernels (like running out of memory) would have an adverse effect on the main kernel. If so, then perhaps subkernels could kill themselves if MemoryInUse[] exceeded a predetermined threshold.
Update - Isolating the Main Kernel From Subkernel Crashes
While playing around with this framework, I discovered that any use of shared variables between the main kernel and subkernel rendered Mathematica unstable should the subkernel crash. This includes the use of DistributeDefinitions[resultSoFar] as shown above, and also explicit shared variables using SetSharedVariable.
To work around this problem, I transmitted the resultSoFar through a file. This eliminated the synchronization between the two kernels with the net result that the main kernel remained blissfully unaware of a subkernel crash. It also had the nice side-effect of retaining the intermediate results in the event of a main kernel crash as well. Of course, it also makes the subkernel calls quite a bit slower. But that might not be a problem if each call to the subkernel performs a significant amount of work.
Here are the revised definitions:
Needs["SubKernels`LocalKernels`"]
doSomeWork[] := {$KernelID, Length[Get[$resultFile]], RandomReal[]}
$resultFile = "/some/place/results.dat";
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, Put[resultSoFar, $resultFile]
; Quiet[ParallelEvaluate[doSomeWork[], subkernel]] /.
{ $Failed :> (Print@"Ouch!"; CloseKernels[subkernel]; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
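The file-based hand-off used in the revised definitions can be sketched in a language-neutral way. This minimal Python sketch assumes JSON as the file format and a temporary directory as the location, neither of which comes from the original answer; put and get are hypothetical helpers loosely named after Put and Get, and the write-then-rename step is an extra precaution that the Mathematica code does not take.

```python
import json
import os
import tempfile


def put(results, path):
    """Write results to a temporary file, then rename it over the target."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(results, fh)
    os.replace(tmp_path, path)  # readers never see a half-written file


def get(path):
    """Read back whatever the supervisor last stored."""
    with open(path) as fh:
        return json.load(fh)


def do_some_work(path):
    # The worker sees only the file, never the supervisor's variables.
    return len(get(path))


with tempfile.TemporaryDirectory() as workdir:
    result_file = os.path.join(workdir, "results.json")
    result_so_far = []
    while len(result_so_far) < 3:
        put(result_so_far, result_file)
        result_so_far.append(do_some_work(result_file))

print(result_so_far)
```

Since the only thing shared between the two sides is the file, a crash on either side leaves the other untouched and the last stored results stay on disk.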
I had a similar requirement when running a CUDAFunction in a long loop, where CUDALink ran out of memory (similar to https://mathematica.stackexchange.com/questions/31412/cudalink-ran-out-of-available-memory). The memory leak is not fixed even in the latest Mathematica version, 10.4. I figured out a workaround and hope you find it useful. The idea is to use a bash script that calls a Mathematica program (run in batch mode) multiple times, passing parameters from the bash script. Here are the detailed instructions and a demo (this is for Windows):
To use a bash script on Windows you need to install Cygwin (https://cygwin.com/install.html).
Convert your Mathematica notebook to a package (.m) so that it can be used in script mode. If you save your notebook using "Save as..", all the commands will be converted to comments (this was noted by Wolfram Research), so it is better to create a package (File -> New -> Package) and then copy and paste your commands into it.
Write the bash script using the Vi editor (instead of Notepad or gedit for Windows) to avoid the "\r" problem (http://www.linuxquestions.org/questions/programming-9/shell-scripts-in-windows-cygwin-607659/).
Here is a demo of the test.m file
str=$CommandLine;
len=Length[str];
Do[
If[str[[i]]=="-start",
start=ToExpression[str[[i+1]]];
Pause[start];
Print["Done in ",start," second"];
];
,{i,2,len-1}];
This Mathematica code reads the parameter from the command line and uses it in the calculation.
Here is the bash script (script.sh) to run test.m many times with different parameters.
#c:\cygwin64\bin\bash
for ((i=2;i<10;i+=2))
do
math -script test.m -start $i
done
In the cygwin terminal type "chmod a+x script.sh" to enable the script then you can run it by typing "./script.sh".
You can programmatically terminate the kernel using Exit[]. The front end (notebook) will automatically start a new kernel when you next try to evaluate an expression.
Preserving "some code from the previous kernel" is going to be more difficult. You have to decide what you want to preserve. If you think you want to preserve everything, then there's no point in restarting the kernel. If you know what definitions you want to save, you can use DumpSave to write them to a file before terminating the kernel, and then use << to load that file into the new kernel.
On the other hand, if you know what definitions are taking up too much memory, you can use Unset, Clear, ClearAll, or Remove to remove those definitions. You can also set $HistoryLength to something smaller than Infinity (the default) if that's where your memory is going.
Sounds like a job for CleanSlate.
<< Utilities`CleanSlate`;
CleanSlate[]
From: http://library.wolfram.com/infocenter/TechNotes/4718/
"CleanSlate, tries to do everything possible to return the kernel to the state it was in when the CleanSlate.m package was initially loaded."

Resources