SAS macro inside data step runs parallelized? Is it possible? - parallel-processing

I have the following table, process_table:

table        index
TABLE_001    1
TABLE_002    2
TABLE_003    3
TABLE_004    4
And a macro that creates what I call type_a tables, using the rows of process_table.
So, for example, an input of TABLE_001 will generate TABLE_001_A.
%macro create_table_type_a(table_name);
    proc sql;
        create table temp.&table_name._A as
        select
            /* some process */
        from &table_name;
    quit;
%mend create_table_type_a;
And then I run:
data _null_;
    set process_table;
    call execute(cats('%create_table_type_a(', table, ')'));
run;
Well, I have two questions.
1 - Does SAS process the macro calls sequentially, one after the other, or are they parallelized? I couldn't find the answer on the internet.
2 - If it is not parallelized, is it possible to do so using the same strategy? The tables to be processed are huge, and I don't know how to parallelize the process in SAS.
Thanks.

Good question.
No. The macros run sequentially, meaning %create_table_type_a(TABLE_001) finishes before %create_table_type_a(TABLE_002) starts, and so on. This is because CALL EXECUTE merely stacks the macro calls during the data step and executes them after the data step has finished.
It is possible, but probably advanced. Reeza's question of 'how huge?' is pretty relevant before moving into advanced solutions for running macros in parallel.
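If you do go down that road, here is a minimal sketch of the MP Connect approach, assuming SAS/CONNECT is licensed, that create_table_type_a is available to the child sessions (e.g. via a shared autocall library), and that the source tables sit in a library such as temp that the children can see (each child gets its own WORK):
options sascmd="!sascmd";       /* start children with the parent's own command */
signon task1 inheritlib=(temp);
rsubmit task1 wait=no;
    /* %nrstr stops the macro call from resolving in the parent session */
    %nrstr(%create_table_type_a(TABLE_001))
endrsubmit;
signon task2 inheritlib=(temp);
rsubmit task2 wait=no;
    %nrstr(%create_table_type_a(TABLE_002))
endrsubmit;
waitfor _all_ task1 task2;      /* block until both children finish */
signoff _all_;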

You could spawn a separate SAS process for each macro call (each within its own program), then wait for all of them to finish before proceeding.
Example
%MACRO SPAWN(PGM,JOBID) ;
systask command "/path/to/sasexe /path/to/programs/&PGM" status=job_&JOBID taskname="job_&JOBID" ;
%MEND ;
/* Run jobs asynchronously */
%SPAWN(Program1.sas,pgm1) ;
%SPAWN(Program2.sas,pgm2) ;
/* Wait for both to finish */
waitfor _ALL_ job_pgm1 job_pgm2 ;
/* ... continue processing ... */
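Once the WAITFOR returns, the STATUS= macro variables hold each task's completion status (0 means success), so you can check for failures before continuing, for example:
%put job_pgm1 ended with status &job_pgm1;
%put job_pgm2 ended with status &job_pgm2;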

Related

Run a SAS macro in parallel

So I have a macro similar to this, with the objective of calculating information value:
%macro iv_calc(x, event, varlist);
    data main_table;
        set &x(keep=&event &varlist);
    run;
    /**** Steps to compute IV ****/
%mend;
x is the name of the dataset, event is the dependent variable name, and varlist holds the names of all the independent variables as a space-separated macro variable.
The number of variables in varlist is unknown and could vary from 100 to 2000+. As a result, the macro takes a very long time to run. I'm new to this, so my request is to understand whether there's a way for me to split varlist in two and run the same macro in parallel (because event is needed to compute information value), so as to reduce the runtime. My first thought was to resort to a shell script, but the number of variables is unknown, and therein lies the problem. Any tiny help will be greatly appreciated. Thanks a lot.
Managing parallel execution in SAS is rather inconvenient and involves SAS MP Connect / SAS Grid (signon/rsubmit).
Parallel execution in the shell is much easier, for example:
echo "param1 param2 param3" | tr ' ' '\n' | xargs -i{} -P 2 ./run-sas.sh {}
-P 2 specifies the number of parallel processes. I covered passing parameters to a child SAS session in a recent answer.
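The run-sas.sh wrapper can be as small as this (a sketch; iv_driver.sas is a hypothetical program name, and the parameter comes back inside it as &SYSPARM):
#!/bin/sh
# Launch one batch SAS session, handing the first argument to the
# program through -sysparm; the program reads it back via &SYSPARM
# (or the SYSPARM() function in a data step).
sas /path/to/iv_driver.sas -sysparm "$1" -log "iv_$1.log"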

How to check if select is going to block or not during run-time

The Linux man page for select is as follows:
int select(int nfds, fd_set *readfds, fd_set *writefds,
fd_set *exceptfds, struct timeval *timeout);
It says
select() and pselect() allow a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some
class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g.,
read(2) without blocking, or a sufficiently small write(2)).
Also, select returns when none of the descriptors becomes ready within the timeout parameter; if the timeout pointer is NULL, it behaves as a blocking call, and if the timeout is zero it returns immediately.
I am debugging (instrumenting) a server by attaching an instrumentation tool, like a Pin tool, to it. In this tool, I want to know whether my select call is going to block or not. Is there any way, at run time, by looking at memory, registers, or the select arguments, that I can tell whether it's going to block?
PS. One thing that comes to mind is to issue one more select call in the analysis routine, before the actual select executes, passing the same argument values; if that select returns 0, no descriptor is ready yet, otherwise a descriptor is ready. So by putting one more select before the actual select, I am trying to guess whether the actual select is going to block. Not sure if this is the right way?
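For the probe to return immediately it needs a zero timeout rather than the original one, and its answer is advisory at best, since readiness can change between the probe and the real call. A minimal C sketch of the idea (assuming all three sets are non-NULL; pass empty sets otherwise):
#include <sys/select.h>

/* Poll whether select() would find a ready descriptor right now.
   A zero timeout makes select() return immediately.  select()
   modifies its fd_set arguments, so we work on copies and leave
   the caller's sets intact for the real call.  Returns 1 if the
   real select() would block, 0 if at least one fd is ready (or
   on error, which the real call will then report itself). */
int select_would_block(int nfds, fd_set *rd, fd_set *wr, fd_set *ex)
{
    fd_set r = *rd, w = *wr, e = *ex;   /* probe must not clobber the originals */
    struct timeval zero = { 0, 0 };     /* do not wait at all */

    return select(nfds, &r, &w, &e, &zero) == 0;
}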

Parallel execution of program within C++/CLI

I'm writing a Windows Forms program (C++/CLI) that calls an executable multiple times within a large 'for' loop. I want the calls to the executable to run in parallel, since a single run takes up to a minute.
The key part of the Windows Forms code is the large for loop (actually 2 loops):
for (int a=0; a<1000; a++){
    for (int b=0; b<100; b++){
        int run = a*100 + b;
        char startstr[50], configstr[50];
        strcpy(startstr, "solver.exe");
        sprintf(configstr, " %d %d %d", run, a, b);
        strcat(startstr, configstr);
        CreateProcessA(NULL, startstr,......) ;
    }
}
The integers "run", "a" and "b" are used by the solver.exe program.
"Run" is used to write a unique output text file from each program run.
"a" and "b" are numbers used to read specific input text files. These are not unique to each run.
I'm not waiting after each call to "CreateProcess" as I want these to execute in parallel.
Currently my code runs and appears to work correctly. However, it spawns a huge number of instances of the solver.exe program at once, causing my computer to become very slow until everything finishes.
My question is, how can I create a queue that limits the number of concurrent processes (for example to the number of physical cores on the machine) so that they don't all try to run at the same time? Memory may also be an issue when the for loops are set larger.
A secondary question is, could potential concurrent file reads by different instances of solver.exe create a problem? (I can fix this but don't want to if I don't need to.)
I'm familiar with openmp and C but this is my first attempt at running parallel processes in a windows forms program.
Thanks
I've managed to do what I want using the OpenMP "parallel for" directive to run the outer loop in parallel, and omp_set_num_threads() to set the number of concurrent processes. As suggested, the concurrent file reads haven't caused any problems on my system.
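Roughly along these lines (a sketch rather than the poster's exact code; the loop bounds and solver.exe arguments come from the question):
#include <windows.h>
#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(4);   // e.g. the number of physical cores

    // Each OpenMP thread launches solver.exe and then blocks until that
    // instance exits, so at most num_threads copies run simultaneously.
    #pragma omp parallel for
    for (int a = 0; a < 1000; a++) {
        for (int b = 0; b < 100; b++) {
            int run = a * 100 + b;
            char cmd[64];
            sprintf(cmd, "solver.exe %d %d %d", run, a, b);

            STARTUPINFOA si = { sizeof(si) };
            PROCESS_INFORMATION pi;
            if (CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0,
                               NULL, NULL, &si, &pi)) {
                WaitForSingleObject(pi.hProcess, INFINITE);  // throttle here
                CloseHandle(pi.hProcess);
                CloseHandle(pi.hThread);
            }
        }
    }
    return 0;
}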

COBOL logic for de-normalized file to Normalized table

How do I load a de-normalized file into a normalized table? I'm new to COBOL; any suggestions on the requirement below? Thanks.
Inbound file: FileA.DAT
ABC01
ABC2014/01/01
FDE987
FDE2012/01/06
DEE6759
DEE2014/12/12
QQQ444
QQQ2004/10/12
RRR678
RRR2001/09/01
Table : TypeDB
TY_CD Varchar(03)
SEQ_NUM CHAR(10)
END_DT DATE
I have to write a COBOL program to load the table : TypeDB
The output should be:
TY_CD SEQ_NUM END_DT
ABC 01 2014/01/01
FDE 987 2012/01/06
DEE 6759 2014/12/12
QQQ 444 2004/10/12
RRR 678 2001/09/01
Below is my pseudo-code-ish attempt:
Perform Until F1 IS EOF
    Read F1
    MOVE F1-REC to WH1-REC
    Read F1
    MOVE F1-REC to WH2-REC
    IF WH1-TY-CD = WH2-TY-CD
        move WH1-TY-CD to TY-CD
        move WH1-CD to SEQ_NUM
        move WH2-DT to END-DT
    END-IF
END-PERFORM
This is not working. Anything better, instead of two reads inside the PERFORM?
I'd definitely go with reading in pairs, like you have. It is clearer, to me, than having "flags" to say what is going on.
I suspect you've overwritten your first record with the second without realising it.
A simple way around that, for a beginner, is to use READ ... INTO ... to get your two different layouts. As you become more experienced, you'll perhaps save the data you need from the first record, and just use the second record from the FD area.
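For example (a sketch; F1-EOF is assumed to be an 88-level condition name on the file's status field):
READ F1 INTO WH1-REC
    AT END SET F1-EOF TO TRUE
END-READ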
Here's some pseudo-code. It is the same as yours, but it uses a "priming read". This time the priming read reads two records. No problem.
By testing the FILE STATUS field as indicated, the paired structure of the file is verified. Checking the key ensures that the pairs are always for the same "key" as well. All built-in and hidden away from your actual logic (which in this case is not much anyway).
PrimingRead
FileLoop Until EOF
    ProcessPair
    ReadData
EndFileLoop

ProcessPair
    Do the processing from Layout1 and Layout2

PrimingRead
    ReadData
    Crash with non-zero file-status

ReadData
    ReadRec1
    ReadRec2
    If Rec2-key not equal to Rec1-key, crash

ReadRec1
    Read Into Layout1
    Crash with non-zero file-status

ReadRec2
    Read Into Layout2
    Crash with file-status other than zero or 10
While we are at it, we can apply this solution from Valdis Grinbergs as well (see https://stackoverflow.com/a/28744236/1927206).
PrimingRead
FileLoop Until EOF
    ProcessPair
    ReadPairStructure
EndFileLoop

ProcessPair
    Do the processing from Layout1 and Layout2

PrimingRead
    ReadPairStructure
    Crash with non-zero file-status

ReadPairStructure
    ReadRec1
    ReadSecondOfPair

ReadSecondOfPair
    ReadRec2
    If Rec2-key not equal to Rec1-key, crash

ReadRec1
    Read Into Layout1
    Crash with non-zero file-status

ReadRec2
    Read Into Layout2
    Crash with file-status other than zero or 10
Because the structure of the file is very simple, either will do. With fixed-number groups of records, I'd go for the read-a-group-at-a-time approach; with a more complex structure, the second, "sideways" one.
Either method clearly reflects the structure of the file, and when you do that in your program, you aid the understanding of the program for human readers (which may be you some time in the future).

Performance Standalone Procedure vs Packaged Procedure in Oracle

What is the difference in performance between a standalone procedure and a packaged procedure? Which is better performance-wise, and why? Is there any difference in how the two are executed?
Tom says:
Always use a package. Never use a standalone procedure except for demos, tests, and standalone utilities (that call nothing and are called by nothing).
There you can also find a very good discussion about their performance. Just search for "performance" on that page.
If still seriously in doubt, you can always test yourself which one is faster. You'll certainly learn something new by doing so.
My take on your question: while it's true that calling packaged procedures/functions seems to be slower in certain situations than calling standalone procedures/functions, the advantages offered by the additional features available when using packages far outweigh the performance loss. So, just as Tom puts it, use packages.
The link: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:7452431376537
Test code (20 million calls; runstats_pkg is a package I wrote based on the runstats package by Tom Kyte):
CREATE OR REPLACE PACKAGE testperf AS
FUNCTION pow(i INT) RETURN INT;
END;
/
CREATE OR REPLACE PACKAGE BODY testperf AS
FUNCTION pow(i int) RETURN INT AS
BEGIN
RETURN i * i;
END;
END;
/
CREATE OR REPLACE FUNCTION powperf(i INT) RETURN INT AS
BEGIN
RETURN i * i;
END;
/
DECLARE
I INT;
S INT DEFAULT 0;
BEGIN
runstats_pkg.start1;
FOR I IN 1 .. 20000000 LOOP
s := s + (powperf(i) / i);
END LOOP;
runstats_pkg.stop1;
dbms_output.put_line(s);
s := 0;
runstats_pkg.start2;
FOR I IN 1 .. 20000000 LOOP
s := s + (testperf.pow(i) / i);
END LOOP;
runstats_pkg.stop2;
dbms_output.put_line(s);
runstats_pkg.show;
END;
Results (Oracle XE):
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
2,491 2,439 -52 102.13%
Run1 ran in 2304 hsecs
Run2 ran in 2364 hsecs
run 1 ran in 97.46% of the time
Results (Oracle 11g R1, different machine):
Run1 latches total versus runs -- difference and pct
Run1 Run2 Diff Pct
2,990 3,056 66 97.84%
Run1 ran in 2071 hsecs
Run2 ran in 2069 hsecs
run 1 ran in 100.1% of the time
So, there you go. Really not much of a difference.
Want data for something more complex that also involves SQL DML? You gotta test it yourself.
There isn't a performance difference, except that packages can have state while standalone procedures and functions cannot.
The use of packages is more about ordering and grouping of code. You could see them as an alternative to namespaces.
The primary reason to use packages is that they break the dependency chain. For instance, if you have two stand-alone procedures, procedure A calling procedure B, and you recompile procedure B, you will also need to recompile procedure A. This gets quite complicated as the number of procedures and functions increases.
If you move these two into different packages, you will not need to recompile them as long as the specifications do not change.
There should be no difference between the two.
A major use of packages is to group a set of similar/associated functions and procedures.
The other answers here are all good (e.g. packages have state, they separate interface from implementation, etc).
Another difference is when procedures or packages are wrapped - it is simple to unwrap a procedure, but not a package body.
