I need to generate 10 million values (Bernoulli and Poisson) for my end-of-year project, and since the .csv file I use as a support is limited to 32K values per column, generating 10M values would take about 300 variables. While that's not impossible to build by hand (what would it take, 20 minutes?), that surely can't be the only way to proceed. Is there any way I could generate 300 variables at once?
dataset close all.
*this is to create a new dataset.
DATA LIST /NEWVAR 1.
BEGIN DATA
1
END DATA.
*this will create 300 variables and 32000 lines.
vector d(300).
LOOP varx=1 TO 32000.
loop #i = 1 to 300.
compute d(#i) = uniform(1). /* use whatever generation method you want here, e.g. rv.bernoulli(p) or rv.poisson(mean).
end loop.
XSAVE OUTFILE='somepath\temp.sav' .
END LOOP.
EXECUTE.
get file ='somepath\temp.sav'.
I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
set in.state1_2017
in.state1_2018
in.state1_2019
in.state1_2020
in.state2_2017
in.state2_2018
in.state2_2019
in.state2_2020;
array prcode{*} I10_PR1-I10_PR25;
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) statements for each dataset included in the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables, but it's too much data to store in memory, so that didn't seem to solve the issue. This also isn't a MERGE issue, which is where hash tables seem to excel.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot, and it seems really inefficient to constantly be processing through the same datasets over and over without benefiting from that. Thanks for any help!
I happened to have a 1GB dataset on my computer; I tried several times, and it takes SAS no more than 25 seconds to SET the dataset 8 times. The SET statement is too simple and basic to offer much room for improvement.
I think the issue is in the DO loop. Your program runs the DO loop 25 times for each record and may assign cohort more than once, which is unnecessary. You can change it like this:
do i=1 to 25 until(cohort=1);
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running one job that works through one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count (you don't want more than one job per CPU). If your server has 32 cores, then you can easily run all the jobs you need here - one per state, say - and then, once they're done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing; it basically uses rsubmits to submit code to your own machine. You can also do this by using xcmd to literally launch SAS sessions - add a state parameter to the SAS program, run 18 of them, have each one output its results to a known location tagged with the state name or number, and then have your program collect them. A rough sketch of that wrapper approach, in Python, is shown below.
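This is only a minimal sketch, assuming a Unix-style batch sas executable on the PATH and a driver program pull_cohort.sas that reads the state from &SYSPARM (both names are placeholders, not something your site necessarily has):
import subprocess
from concurrent.futures import ThreadPoolExecutor

states = [f"state{i}" for i in range(1, 19)]   # 18 states

def run_state(state):
    # Each SAS session filters one state's datasets and writes its own
    # qualifying_<state> dataset for a later combine step.
    subprocess.run(["sas", "-sysin", "pull_cohort.sas", "-sysparm", state],
                   check=True)

# Threads are fine here: each one just waits on an external SAS process.
with ThreadPoolExecutor(max_workers=18) as pool:
    list(pool.map(run_state, states))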
Second, you can optimize the DO loop further - in addition to the suggestions above, you may be able to use pointers. SAS stores character array variables in adjacent spots in memory (assuming they all come from the same place) - see Paul Dorfman's paper From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools for more details. On page 10 he shows the method I describe here: you use PEEKC to get the concatenated values and then INDEXW to find the code you want.
data want;
set have;
array prcode{*} $8 I10_PR1-I10_PR25;
* 200 = 25 variables x 8 bytes each, so PEEKC reads the whole array as one string;
found = (^^ indexw (peekc (addr(prcode[1]), 200), '0DTJ0ZZ')) or
        (^^ indexw (peekc (addr(prcode[1]), 200), '0DTJ4ZZ'));
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit it once you run into an empty procedure code. In my experience these things usually don't go all the way to 25 - they're left-filled, so I10_PR1 is always filled, then some of them - say, 5 or 10 - are filled, and I10_PR11 onward are empty. If you hit an empty one, you're done for that row. So leaving not just when you hit what you're looking for, but also when you hit an empty value, saves you a lot of processing time.
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to exit the loop as soon as the criterion is met, to avoid wasting resources.
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
output; * cohort criteria met so output the row;
leave; * exit the loop immediately;
end;
end;
For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270 000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the Python script for this on a high-power computer with 64 cores. Since Python natively only uses one processor, I want to use multiprocessing, which is currently working. However, it does not seem very fast: even when starting 64 processes, it uses roughly 10% of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which every 10 seconds gets all items from the result queue and writes them to the result file. This function is running in a single multiprocessing.Process process like so:
fileWriteProcess = Process(target=fileWriter, args=(resultQueue, resultFileName))
fileWriteProcess.start()
A ray trace function with a unique ID which does the following:
Get an index idx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from the returned dictionaries and put them in the resultQueue as a single line of comma-separated values.
This all works; the worker processes are created and started in a for loop (a stripped-down sketch of the layout is shown below).
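The sketch elides the actual bifacial_radiance calls; all the names here are placeholders matching the description above, not library API:
import multiprocessing as mp
import queue
import time

def fileWriter(resultQueue, resultFileName):
    # Every 10 seconds, drain whatever results have arrived and append them
    # to the CSV so the workers never block on file I/O.
    while True:
        time.sleep(10)
        lines = []
        try:
            while True:
                lines.append(resultQueue.get_nowait())
        except queue.Empty:
            pass
        if lines:
            with open(resultFileName, "a") as f:
                f.write("\n".join(lines) + "\n")

def rayTraceWorker(name, indexQueue, resultQueue):
    # Each worker pulls indices until the queue runs dry and saves its octfile
    # under its own prefix so the processes don't overwrite each other on disk.
    while True:
        try:
            idx = indexQueue.get_nowait()
        except queue.Empty:
            return
        # ... gendaylit(idx), makeOct(prefix=name), moduleAnalysis, analysis ...
        resultQueue.put(f"{idx},front_result,back_result")

if __name__ == "__main__":
    indexQueue, resultQueue = mp.Queue(), mp.Queue()
    # fill indexQueue here with the dates/indices missing from the result file
    writer = mp.Process(target=fileWriter, args=(resultQueue, "results.csv"))
    writer.start()
    workers = [mp.Process(target=rayTraceWorker, args=(f"proc{i}", indexQueue, resultQueue))
               for i in range(64)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    writer.terminate()  # the real script should tell the writer to finish (e.g. via a sentinel) instead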
This speeds up the whole simulation quite a bit (from 10 days down to about 1½ days), but as said earlier the CPU is running at around 10% capacity and the GPU at around 25% capacity. The computer has 512 GB of RAM, which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that the processes are not synchronized, as the results are written slightly out of order even though the input EPW file is sorted.
My question is whether there is a better way to do this that might make it run faster. I can see in the source code that a boolean "hpc" is used to initialize some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.
I am new to programming, and need some help with an experiment I'm constructing in PsychoPy builder. I have made something that works, but I can tell it's inelegant and there must be a better way.
I want to conduct 24 trials. Each trial will show 7 unique images, then 1 image which may or may not be from the 7, and users are then asked to enter y/n if they have seen the image before.
In my current code I have created 24 separate input files, each containing a list of the unique objects. I have created a loop which shows the seven objects in succession. I have then created separate routines for the pre-trial fixation screen (constant for all 24 trials) and the response (probe image and correct answer manually programmed). The code works, but it is very long, and if I wanted to change something in the fixation or the probe/response, I would need to change each of the 24 trials individually.
How can I instead get Builder to create a loop which contains the fixation screen (constant), the trial loop (picking the next seven unique objects - they are named sequentially from 1 to 168), and then a probe/response which is unique to each trial? I have these in an input file as follows; Probe refers to a number between 1 and 7 which gives the position of the image in the sequence shown in the trial.
TrialNumber Probe CorrAns
1 4 0
2 3 1
3 4 0
4 5 1
5 4 1
...
I hope my question makes sense and I would be grateful for any assistance.
Thanks PsychoPy Beginner.
Yes, you're correct that there is a (much) more efficient way to do this.
First, start with your conditions file (i.e. .csv or .xlsx). You only need one of them. It should have 24 rows (one per trial). It will need 8 columns: 7 to specify the unique images in a trial and an eighth to specify the repeated one.
Second, you need a loop to control the trials. This is connected to the conditions file and encompasses all of the routines (the pre-trial fixation and the image routine).
Third, you need a second, inner loop, nested within the outer one. This encompasses only the image routine. i.e. the fixation routine will run 24 times (once per outer loop), and the image routine will run 7×24 times (i.e. 7 times per trial). The inner loop is not connected to a conditions file and is simply set to run 7 times.
So note that you no longer have 24 separate routines in Builder but only two (the fixation and image routines). Instead of duplicating routines, you repeat them via the loops.
In the image field of the image stimulus, you can construct the image name to use on each presentation. e.g. let's say the 8 columns in your conditions file are labelled 'image0', 'image1', and so on. Then in the image field, put something like this:
$'image' + str(yourInnerLoopName.thisN)
i.e. on the first iteration within each trial, the image filename would come from column image0, the second from image1, and so on.
I don't know how you are handling the responses, but you will also probably need a ninth column in the conditions file that indicates what the correct response is. The keyboard component can access that to judge whether the response is correct or not.
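A conditions file along those lines might look like the sketch below, where image0-image6 are the seven trial images, image7 is the probe shown in the response routine, and corrAns holds the expected key (the file names and the image7/corrAns column names are just illustrative):
image0,image1,image2,image3,image4,image5,image6,image7,corrAns
img001.png,img007.png,img012.png,img033.png,img045.png,img101.png,img150.png,img033.png,y
img002.png,img018.png,img021.png,img040.png,img066.png,img109.png,img160.png,img099.png,n
(one row per trial, 24 rows in total)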
Is there a way to control the number of records to be stored in a part file?
Thanks.
Not easily (if at all). The number of part files in the output is determined by the parallelism of the script, and the data is split up non-deterministically into those part files. The only way I can think of is something like:
A = FOREACH output GENERATE 1 AS num ;
B = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS totaloutputlines ;
-- Then store both output and B
Then, from within a Python wrapper, use totaloutputlines to set the parallelism of the script the wrapper is running, so that PAR = (the count stored in B) / (number of lines you want per file). That will hopefully give you approximate control over the number of records per part file; a rough sketch of such a wrapper follows.
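This sketch assumes B was stored to a single part file and that the Pig script reads the parallelism from a $PAR parameter (the file names and the parameter name are assumptions):
import math
import subprocess

LINES_PER_PART = 10000   # however many records you want per part file

# B was stored as a single record holding the total output line count.
with open("b_output/part-r-00000") as f:
    total_lines = int(f.read().strip())

par = max(1, math.ceil(total_lines / LINES_PER_PART))

# The Pig script can then pick the value up, e.g. with: SET default_parallel $PAR;
subprocess.run(["pig", "-param", f"PAR={par}", "store_output.pig"], check=True)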
Maybe you can get something close to what you want with MultiStorage by splitting the output into one file per value of the field you use.
I want to paginate my web application's pages. But the usual "first 10 rows on the first page, rows 10-20 on the second" approach is not a wise choice for me, because my data is all text and varies a lot in size: one row can be only 10 bytes long while another is 100 KB. So I want to paginate my application's pages by their size. For example, when a page hits 300 KB, I want HAML to notice it and flush the page.
The question is: is there a variable or function I can use to check the generated page's size while the page is being generated?
While you're pulling the data from the database, just keep a running count of how much data you're outputting:
count = 0
some_collection.each do |item|
  count += item.some_text_field.length + item.some_other_text_field.length
  break if count > 300_000   # stop once roughly 300 KB has been accumulated
  # Proceed as usual
end
Note that the above counts the number of characters, not the number of bytes. If you want to count the latter you can call bytesize on each string instead of length.