Looking for a more efficient way to pull data from multiple datasets in SAS - performance

I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
  set in.state1_2017
      in.state1_2018
      in.state1_2019
      in.state1_2020
      in.state2_2017
      in.state2_2018
      in.state2_2019
      in.state2_2020;
  array prcode{*} I10_PR1-I10_PR25;
  do i=1 to 25;
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
  end;
  if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried adding (KEEP=) or (DROP=) dataset options to each dataset in the SET statement to limit the variables being read, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables anyway.
I've also experimented a bit with hash tables but it's too much to store in memory so that didn't seem to solve the issue. This also isn't a MERGE issue which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing thru the same datasets over and over but not benefitting from that. Thanks for any help!

I happened to have a 1GB dataset on my computer. I tried several times, and it takes SAS no more than 25 seconds to SET the dataset 8 times. I think the SET statement is too simple and basic to offer much room for improvement.
I think the issue may be in the DO loop. Your program runs the loop 25 times for each record and may assign cohort more than once, which is not necessary. You can change it like this:
do i=1 to 25 until(cohort=1);
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.

First, parallelization will help immensely here. Instead of running one job that works through one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count (you don't want more than one job per CPU). If your server has 32 cores, then you can easily run all the jobs you need here - one per state, say - and then, after they're done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing; it basically uses RSUBMITs to submit code to your own machine. You can also do this by using XCMD to literally launch SAS sessions - pass the state as a parameter to the SAS program, run 18 of them, have each one write its results to a known location tagged with the state name or number, and then have your main program collect them.
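As a rough illustration of that second approach, here is a small driver script (Python here, with hypothetical program and file names) that launches one batch SAS session per state via the -sysparm option; each SAS program would read &SYSPARM to decide which state's datasets to process:

import subprocess

# hypothetical list of states; launching them all at once is fine as long as
# the number of jobs does not exceed the number of cores
states = ["state1", "state2", "state3"]    # add the remaining states here

procs = []
for st in states:
    cmd = ["sas", "pull_cohort.sas",       # hypothetical SAS program name
           "-sysparm", st,                 # the program reads this via &SYSPARM
           "-log", f"pull_{st}.log"]
    procs.append(subprocess.Popen(cmd))    # launch without waiting

for p in procs:                            # wait for every session to finish
    p.wait()

Afterwards the collecting step can simply SET the per-state outputs back together.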
Second, you can optimize the DO loop further - in addition to the suggestion above, you may be able to optimize using pointers. SAS stores character array variables in adjacent spots in memory (assuming they all come from the same place) - see Paul Dorfman's paper "From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools" for more details. On page 10 he shows the method I describe here: you use PEEKC to get the concatenated values and then INDEXW to find the thing you want.
data want;
  set have;
  array prcode{*} $8 I10_PR1-I10_PR25;
  /* read all 25 codes as one 200-byte string (25 x $8) and search it */
  found = (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ0ZZ')) or
          (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ4ZZ'));
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit the loop once you run into an empty procedure code. Usually these things don't go all the way to 25, at least in my experience - they're left-filled, so I10_PR1 is always filled, then some of them - say, 5 or 10 - are filled, and I10_PR11 onward are empty; once you hit an empty one, you're done for that row. So leaving the loop not just when you hit what you're looking for, but also when you hit an empty code, saves you a lot of processing time.

You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to exit the loop as soon as the criterion is met, to avoid unnecessary work.
do i=1 to 25;
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
    output; * cohort criteria met, so output the row;
    leave;  * exit the loop immediately;
  end;
end;

Related

Is there a different way to create variables that don't terminate after the program ends?

Right now, I am writing values to files to make them persist. But I'm curious whether there's a simpler way to create variables that don't terminate when the program ends.
I find Redis invaluable for persisting data like this. It is a quick and lightweight installation and allows you to store many types of data:
strings, including complete JSONs and binary data like JPEG/PNG/TIFF images - also with TTL (Time-to-Live) so data can be expired when no longer needed
numbers, including atomic integers, floats
lists/queues/stacks
hashes (like Python dictionaries)
sets, and sorted (ordered) sets
streams, bitfields, geospatial data and the more esoteric HyperLogLogs
PUB/SUB is also possible, where one or more machines/processes publish items and multiple consumers, who have subscribed to that topic, receive the published items.
It can also perform very fast operations on your data for you, like set intersections and unions, getting lengths of lists, moving items between lists, atomically adding/subtracting from numbers and so on.
You can also use it to pass data between processes, sub-processes, shell scripts, parent and child, child and parent (!) scripts and so on.
In addition to all that, it is networked, so you can set variables on one computer and read/alter them from another - very simply. For example, you can PUSH jobs to a queue, potentially from multiple machines, and run workers on multiple machines that wait for jobs on the queue, process them and return results to another list.
There is a discussion of the things you can store here.
Example: Store a string, then retrieve it:
redis-cli SET name fred
name=$(redis-cli GET name)
Example: Increment views of page 2 by 10, and then retrieve from different machine on network:
redis-cli INCRBY views:page:2 10
views=$(redis-cli -h 192.168.0.10 GET views:page:2)
Example: Push a value onto a list:
redis-cli LPUSH shoppingList bananas
Example: Blocking wait for next item in list - use RPOP for non-blocking:
item=$(redis-cli BRPOP shoppingList 0)
Also, there are bindings for Python, C/C++, Java, Ruby, PHP etc. So you can "inject" dummy/test data into, or extract debug data from a running Python program using the redis-cli tool even on a different computer.
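As a minimal sketch of the same operations from Python, using the redis-py package and assuming a Redis server on localhost with default settings:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("name", "fred")                      # like: redis-cli SET name fred
print(r.get("name"))                       # -> "fred"
r.incrby("views:page:2", 10)               # like: redis-cli INCRBY views:page:2 10
r.lpush("shoppingList", "bananas")         # push a value onto a list
key, item = r.brpop("shoppingList", timeout=0)  # blocking wait for the next item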
Use environment variables to store your data.
ABC="abc"; export ABC
The other question is how to make environment variables persist after a reboot.
Depending on your shell, you may have a different file in which to persist the variables.
If using bash, run this command containing the variable's last value before rebooting:
echo 'export ABD="hello"' >> $HOME/.bashrc
I think this is a good time to be using an SQL database. It's more scalable and functional than having a file full of "persistent variables".
It may require a little more setup, and I admit it isn't "simpler" per se, but it will probably be worth it in the long run. You will be able to do more with your variables, and that may make your future scripts simpler.
I recommend going to YouTube and finding a simple tutorial on how to set up a local MySQL or MSSQL instance. There is a guy, Mike Dane, who makes really beginner-friendly tutorials. Try searching "GiraffeAcademy SQL Beginner" and see if that helps you.
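To make the idea concrete without standing up a server, here is a hedged sketch using Python's built-in sqlite3 module rather than MySQL/MSSQL (the database file and table names are made up); the same pattern carries over to a full SQL server:

import sqlite3

conn = sqlite3.connect("variables.db")   # this file survives reboots
conn.execute("CREATE TABLE IF NOT EXISTS vars (name TEXT PRIMARY KEY, value TEXT)")

# store (or update) a "persistent variable"
conn.execute("INSERT OR REPLACE INTO vars (name, value) VALUES (?, ?)", ("ABC", "abc"))
conn.commit()

# read it back later, even from a different script or after a reboot
row = conn.execute("SELECT value FROM vars WHERE name = ?", ("ABC",)).fetchone()
print(row[0] if row else None)
conn.close()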

What is good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations over approx. 270,000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the Python script for this on a high-powered computer with 64 cores. Since Python natively only uses one processor, I want to use multiprocessing, which is currently working. However, it does not seem very fast; even when starting 64 processes it uses roughly 10% of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which every 10 seconds gets all items from the result queue and writes them to the result file. This function is running in a single multiprocessing.Process process like so:
fileWriteProcess = Process(target=fileWriter,args=(resultQueue,resultFileName)).start()
A ray trace function with a unique ID which does the following:
Get an index idx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from resultDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
This speeds up the whole simulation quite a bit (10 days down to 1½ days), but as said earlier the CPU is running at around 10% capacity and the GPU at around 25% capacity. The computer has 512 GB of RAM, which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that it is not synchronizing, as the results are written slightly out of order even though the input EPW file is sorted.
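For reference, a stripped-down sketch of the layout described above looks roughly like this (the names and the placeholder work inside the worker are hypothetical; the real worker calls radObj.gendaylit, makeOct, moduleAnalysis and analysis as listed):

import multiprocessing as mp
from queue import Empty

def file_writer(result_queue, result_file_name):
    # append finished result lines to the CSV until a None sentinel arrives
    with open(result_file_name, "a") as f:
        while True:
            line = result_queue.get()
            if line is None:
                break
            f.write(line + "\n")
            f.flush()

def ray_trace_worker(name, index_queue, result_queue):
    # pull indices until the queue is drained; 'name' would become the octfile prefix
    while True:
        try:
            idx = index_queue.get(timeout=5)
        except Empty:
            break
        # ... gendaylit(idx), makeOct(prefix=name), moduleAnalysis, analysis ...
        result_queue.put(f"{idx},front_result,back_result")  # placeholder CSV line

if __name__ == "__main__":
    index_queue, result_queue = mp.Queue(), mp.Queue()
    for idx in range(1000):              # really: the dates missing from the result file
        index_queue.put(idx)
    writer = mp.Process(target=file_writer, args=(result_queue, "results.csv"))
    writer.start()
    workers = [mp.Process(target=ray_trace_worker,
                          args=(f"proc{i}", index_queue, result_queue))
               for i in range(64)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    result_queue.put(None)               # tell the writer it can stop
    writer.join()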
My question is whether there is a better way to do this that might make it run faster. I can see in the source code that a boolean "hpc" is used when instantiating some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.

Loops within loops in PsychoPy Builder

I am new to programming, and need some help with an experiment I'm constructing in PsychoPy builder. I have made something that works, but I can tell it's inelegant and there must be a better way.
I want to conduct 24 trials. Each trial will show 7 unique images, then 1 image which may or may not be from the 7, and users are then asked to enter y/n if they have seen the image before.
In my current code I have created 24 separate input files, each containing a list of the unique objects. I have created a loop which shows the seven objects in succession. I have then created separate routines for the pre-trial fixation screen (constant for all 24 trials) and the response (probe image and correct answer manually programmed). The code works, but it is very long, and if I wanted to change something in the fixation or the probe/response, I would need to change each of the 24 trials individually.
How can I instead get Builder to create a loop which contains the fixation screen (constant), the trial loop (picking the next seven unique objects; they are named sequentially from 1-168), and then a probe/response which is unique to each trial? I have the probe/response details in an input file as follows, where Probe refers to a number between 1 and 7 referencing the position of the image in the sequence shown in the trial.
TrialNumber Probe CorrAns
1 4 0
2 3 1
3 4 0
4 5 1
5 4 1
...
I hope my question makes sense and I would be grateful for any assistance.
Thanks PsychoPy Beginner.
Yes, you're correct that there is a (much) more efficient way to do this.
First, start with your conditions file (i.e. .csv or .xlsx). You only need one of them. It should have 24 rows (one per trial). It will need 8 columns: 7 to specify the unique images in a trial and an eighth to specify the repeated one.
Second, you need a loop to control the trials. This is connected to the conditions file and encompasses all of the routines (the pre-trial fixation and the image routine).
Third, you need a second, inner loop, nested within the outer one. This encompasses only the image routine. i.e. the fixation routine will run 24 times (once per outer loop), and the image routine will run 7×24 times (i.e. 7 times per trial). The inner loop is not connected to a conditions file and is simply set to run 7 times.
So note that you no longer have 24 separate routines in Builder but only two (the fixation and image routines). Instead of duplicating routines, you repeat them via the loops.
In the image component's image field, you can construct the image name to use on each presentation. e.g. let's say the 8 columns in your conditions file are labelled 'image0', 'image1', etc. Then in the image field, put something like this:
$'image' + str(yourInnerLoopName.thisN)
i.e. on the first iteration within each trial, the image filename would come from column image0, the second from image1, and so on.
I don't know how you are handling the responses, but you will also probably need a ninth column in the conditions file that indicates what the correct response is. The keyboard component can access that to judge whether the response is correct or not.
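For illustration only (the column headings and filenames below are made up, not a required naming scheme), the first rows of such a conditions file could look like:

image0,image1,image2,image3,image4,image5,image6,probeImage,corrAns
obj001.png,obj002.png,obj003.png,obj004.png,obj005.png,obj006.png,obj007.png,obj004.png,y
obj008.png,obj009.png,obj010.png,obj011.png,obj012.png,obj013.png,obj014.png,obj150.png,n

Here the probe in the first row repeats one of the seven trial images (correct answer 'y'), while in the second row it does not (correct answer 'n').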

How to use RRDTool/Cacti to count "user activities" in apache access logs?

Goal
I wish to use RRDTool to count logical "user activity" from our web application's apache/tomcat access logs.
Specifically we want to count, for a period, occurrences of several url patterns.
Example
We have two applications (call them 'foo' and 'bar')
These url's interest us. They indicate when users 'did interesting stuff'.
/foo/hop
/foo/skip
/foo/jump
/bar/crawl
/bar/walk
/bar/run
Basically we want to know for a given interval (10 minutes, hour, day, etc.) how many users: hopped, skipped, jumped, crawled, walked, etc.
Reference/Starting point
This article on importing access logs into RRDTool seemed like a helpful starting point.
http://neidetcher.com/programming/2014/05/13/just-enough-rrdtool.html
However, to clarify: that example uses the access log directly, whereas we want to group a handful of URLs into 'buckets' and count the 'number in each bucket'.
Some Scripting Required..
I could do this with bash, grep, and wc - iterating through the patterns and sending output to an 'intermediate results' text file - but I believe RRDTool could do this with minimal 'outside coding'; I'm just unclear on the details.
Some points
I mention 'two applications' because we actually serve them up from separate servers with different log file formats. I'd like to get them into the same RRA file.
Eventually I'd like to report this in cacti; initially however, I wanted to understand RRDTool details
Open to doing any coding, but would like to keep it as efficient as possible--both administratively and computer-resources. (By administratively, I mean: easy to monitor new instances)
I am very new to RRDTool and am RTM'ing (and walking through the tutorial). I'm used to relational databases, spreadsheets, etc., and don't have my mind around all the nuances of the RRA format.
Thanks in advance!
You could set up a separate RRD file with an ABSOLUTE-type data source for each address you want to track.
Then you tail the log file, and whenever you see one of the interesting URLs rush by, you call:
rrdtool update url-xyz.rrd N:1
The ABSOLUTE data source type is like a counter, but it gets reset every time it is read. Your counter will just count to one, but that should not be a problem.
In the example above I am using N: and not the timestamp from the access log. You could also use the log timestamp if you are not doing this in real time ... but beware that you cannot update the same RRD file twice with the same timestamp. N: uses millisecond timestamps internally and thus probably avoids this problem.
On the other hand it may make more sense to accumulate matching log entries with the same timestamp and only update rrdtool with that number once the timestamp on the logfile changes.
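As a hedged sketch of that tail-and-update idea (the file names, data-source definition, and URL patterns here are illustrative, not a recommended layout), a small Python watcher shelling out to rrdtool could look like this:

import os
import subprocess

patterns = ["/foo/hop", "/foo/skip", "/foo/jump",
            "/bar/crawl", "/bar/walk", "/bar/run"]

def rrd_name(pattern):
    return "url" + pattern.replace("/", "-") + ".rrd"

# one-time setup: one RRD per pattern, ABSOLUTE data source, 10-minute step
for p in patterns:
    rrd = rrd_name(p)
    if not os.path.exists(rrd):
        subprocess.run(["rrdtool", "create", rrd, "--step", "600",
                        "DS:hits:ABSOLUTE:1200:0:U",
                        "RRA:AVERAGE:0.5:1:4320"], check=True)

# follow the access log and bump the matching counter as entries arrive
tail = subprocess.Popen(["tail", "-F", "/var/log/apache2/access.log"],
                        stdout=subprocess.PIPE, text=True)
for line in tail.stdout:
    for p in patterns:
        if p in line:
            subprocess.run(["rrdtool", "update", rrd_name(p), "N:1"])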

possible locking issue with a Talend job

I'm parsing data from one table and writing it back to another one. Input are characteristics, written as text. Output is a boolean field that needs to be updated. For example a characteristic would be "has 4 wheel drive" and I want to set a boolean has_4weeldrive to true.
I'm going through all the characteristics that belong to a car and set it to true if found, else to null. The filter after the tmap_1 filters the rows for which the attribute is true, and then updates that in a table. I want to do that for all different characteristics (around 10).
If I do it for one characteristic, the job runs fine; as soon as I have more than one, it only loads one record and waits indefinitely. I can of course make 10 jobs and it will run, but then I need to touch all the characteristics 10 times, which doesn't feel right. Is this a locking issue? Is there a better way to do this? Target and source DB is PostgreSQL, if that makes a difference.
Shared connections could cause problems like this.
Also make sure you're committing after each update. Talend uses one thread for execution (except in the enterprise version), so multiple shared outputs could cause problems.
Setting the commit to 1 should eliminate the problem.
