I have a data set that contains several numbered columns, Code1, Code2, ... Code12. I also have a set of values for those codes %myCodes = ('ABC','DEF','GHI',etc.).
What I want to do is filter so that I include only rows where at least one of the Code1 through Code12 columns contains a value from myCodes. I realize that I could just do a very long OR condition [e.g. (Code1 in &myCodes or Code2 in &myCodes or ...)], but I was wondering if there is a more efficient way to do it.
I think I would concatenate Code1-Code12 into a single variable and then use a regular expression:
length long_text $32767;
long_text = catx('~', of Code:);
found = prxmatch('/(abc|def|ghi)/oi', long_text);
Obviously, this requires reworking your list of codes into a regex, but I think it is pretty efficient, as long as you don't need to identify which variable(s) contain the code!
You could put the columns that you want to test into a macro variable and loop over them all.
%let cols = col1 col2 col3 col4;
%let myCodes = ('ABC','DEF','GHI');

%macro condition;
    %scan(&cols, 1) IN &myCodes
    %do i = 2 %to %sysfunc(countw(&cols));
        OR %scan(&cols, &i) IN &myCodes
    %end;
%mend;

data want;
    set have;
    where %condition;
run;
If you look at %condition, it has all of your filters:
%put %condition;
col1 IN ('ABC','DEF','GHI') OR col2 IN ('ABC','DEF','GHI') OR col3 IN ('ABC','DEF','GHI') OR col4 IN ('ABC','DEF','GHI')
It really depends on what you mean by efficient.
In general you will have to test all of the CODEx variables to be sure that none of them contain a code of interest. So your proposed long expression might be the most efficient in terms of performance.
You could use a code generator (such as a SAS macro) to help you generate the repetitive code. That would be more efficient for the programmer even if it has no impact on the actual execution time.
You could use an ARRAY to allow you to loop over the set of CODEx variables. That does offer an opportunity to stop once at least one match is found, which might be more efficient.
array code[12];
do index=1 to dim(code) until(found);
    found = code[index] in &mycodes;
end;
if found;
But there is extra work required to implement the looping, so for the cases that do not match it might actually take longer. Plus, you could not use that in a WHERE statement.
Depending on the codes, it might be better to loop over the codes in &MYCODES and test whether they appear in any of the CODEx variables instead. You will need to know how many codes are in &MYCODES for this to work.
array code[12];
array mycodes[3] $3 _temporary_ &mycodes;
do index=1 to dim(mycodes) until(found);
    found = mycodes[index] in code;
end;
if found;
I want to find out the best way to perform a group-by in SAS so I can perform some benchmarks. The simplest two ways I can think of are PROC SQL and PROC MEANS. Here is the example in PROC SQL:
proc sql noprint; /* took 6 mins */
    create table summ as
    select id,
           sum(val) as val_sum
    from randint
    group by id;
quit;
I think there are ways to make this run faster:
- use the SASFILE statement to load the data into memory first
- create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to PROC SQL or PROC MEANS, so if there are faster ways I would love to know about them!
My setup code is below:
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
    function RandBetween(min, max);
        return (min + floor((1 + max - min) * rand("uniform")));
    endsub;
run;

options cmplib=work.myfunc;

data RandInt;
    do i = 1 to 250000000;
        id = RandBetween(1, 2500000);
        val = rand("uniform");
        output;
    end;
    drop i;
run;
My SAS comparison macro is below:
%macro sasbench(dosql = N); %macro _; %mend;
    %if &dosql. = Y %then %do;
        proc sql noprint; /* took 6 mins */
            create table summ as
            select id,
                   sum(val) as val_sum
            from randint
            group by id;
        quit;
    %end;

    proc means data=randint sum noprint;
        var val;
        class id;
        output out=summmeans(drop=_type_ _freq_) sum= / autoname;
    run;
%mend;
%sasbench();
/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/
proc datasets lib=work;
    modify randint;
    index create id / nomiss;
run;
quit;
%sasbench();
SASFILE is only a benefit if the entire data set can fit within the session's RAM limits and if the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs or different techniques on the same SASFILE.
An index on id would help if the data is unsorted by id. When the data set is presorted by id, the id column metadata will have its SORTEDBY flag set, which a procedure can use for its own internal optimization; however, there is no guarantee. As for indexes, use option MSGLEVEL=I to get informational messages in the log about index selection during processing.
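For example, a minimal sketch of checking whether a step actually uses the index created above:
options msglevel=i;    /* ask for informational index messages in the log */

data subset;
    set randint;
    where id = 12345;  /* the log will note whether the id index was used */
run;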
The fastest way is direct addressing, but it requires enough RAM to hold an array indexed by the largest id value (2,500,000 in your data):
array ids(2500000) _temporary_;
ids(id) + val;
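Fleshed out for the question's data, it might look like the following sketch (id and val come from the setup code above, and the array is sized to the largest possible id of 2,500,000):
data summ_direct(keep=id total);
    array ids(2500000) _temporary_;   /* one accumulator slot per id value */
    do until (eof);
        set randint end=eof;
        ids(id) + val;                /* temporary arrays are automatically retained */
    end;
    do id = 1 to dim(ids);            /* one output row per id that occurred */
        if not missing(ids(id)) then do;
            total = ids(id);
            output;
        end;
    end;
    stop;
run;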
The next fastest way is probably hand-coded, array-based hashing; search the SAS conference proceedings for papers by Paul Dorfman.
The next fastest hash way is probably the hash component object with key suminc.
The DATA step below was edited to align with the comments:
data demo_data;
    do rownum = 1 to 1000;
        id = ceil(100*ranuni(123));     * NOTE: 100 different groups, disordered;
        value = ceil(1000*ranuni(123)); * NOTE: want to sum value over group, for demonstration values are integers from 1..1000;
        output;
    end;
run;
data _null_;
    if 0 then set demo_data(keep=id value); %* prep pdv ;
    length total 8;                         %* prep keysum variable ;
    call missing (total);                   %* prevent warnings ;

    %* ordered ensures keys will be sorted ascending upon output ;
    declare hash ids (ordered:'a', suminc:'value', keysum:'total');
    ids.defineKey('id');
    *ids.defineData('id'); %* omitting defineData implicitly adds only the keys as data, and only data + keysum variables are written by output ;
    ids.defineDone();

    * read all records and touch each hash key in order to perform tacit total+value summation;
    do until (end);
        set demo_data end=end;
        if ids.find() ne 0 then ids.add();
    end;

    ids.output(dataset:'sum_value_over_id'); * save the summation of each key combination;
    stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
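For instance, a minimal sketch of the count variant, using a constant helper variable (the name one is made up for illustration):
data _null_;
    if 0 then set demo_data(keep=id);    /* prep pdv */
    one = 1;                             /* constant suminc variable */
    length n 8;
    call missing(n);                     /* prevent warnings */
    declare hash ids (ordered:'a', suminc:'one', keysum:'n');
    ids.defineKey('id');
    ids.defineDone();
    do until (end);
        set demo_data end=end;
        if ids.find() ne 0 then ids.add();
    end;
    ids.output(dataset:'count_over_id'); /* n holds each id's row count */
    stop;
run;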
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
However, if value is known to always be a natural number, and the group size is known to be < 10^k for some k, you could numerically encode the count by using a suminc of value + 10^-k and numerically decode the count by processing the output data with count = (total - int(total)) * 10^k.
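For example, with k = 3 (groups of fewer than 1,000 rows) and a suminc variable built as value + 0.001, a sketch of the decode step:
data decoded;
    set sum_value_over_id;
    count     = round((total - int(total)) * 1000); /* fractional part encodes the count */
    value_sum = int(total);                         /* integer part is the true sum */
run;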
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
    by id;
run;

data sum_value_over_id_v2(keep=id total);
    do until (last.id);
        set foo;
        by id;
        total = sum(total, value);
    end;
run;
You will likely find that I/O is the largest component of performance.
The best answer varies dramatically by the application. In your example, PROC SQL (at least on my machine) significantly outperforms PROC MEANS, but there are plenty of cases where it will not. It's able to here because, more than likely, it's building hash tables behind the scenes, which are quite fast; a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would probably have to have it in memory to begin with, though; just reading it into memory for this purpose alone wouldn't really help, since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest or close to it in simple cases, both because it's multithreaded (as opposed to the data step, which is single-threaded) and because it has a fast hash table backend.
PROC MEANS is also usually going to be competitive. The case you show in the example is almost a worst case for it, since it has a huge number of class variable levels, so it may be creating a temporary table on disk. It's also multithreaded. Reduce the class variable levels to 2,500 instead of 2,500,000 and PROC MEANS comes out a bit faster than PROC SQL (but within the margin of error).
Data step accumulation, either in a hash table or a DoW loop, will sometimes outperform both of the above and sometimes not, again depending on the data; here it does outperform slightly. The code for data step accumulation tends to be a bit more complex, which is why I'd usually discourage it unless the savings are substantial (having more code to maintain is worse, typically): PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth going this route, especially if the surrounding logic is already in a data step. Of course, the hash table method is limited to results that fit in memory, though usually that's manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance, and, when possible, to stay consistent with your other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed unless you're doing more complicated things than you present above; summation is actually more work than I/O in many cases. Don't overcomplicate it: programmer hours and difficulty of QA should trump raw performance unless you're talking several hours' difference. And if you are, then just run tests on your actual use case and see what works best.
If you assume the data is sorted, then this is another solution:
data sum_value_over_id_v2(keep=id total);
    set a.randint(keep=id val);
    by id;
    total + val;
    if last.id then do;
        output;
        total = 0;
    end;
run;
I'm using a PROC SURVEYSELECT statement to get random numbers from a set of integers. SAS then returns the sampled integers, but in ascending order, and I need them to remain in random order. How would I either randomly mix the output from the SURVEYSELECT statement, or just get the statement not to sort? I can't seem to find any option that lets the statement output in the order that it randomly selects.
Here's the code:
proc surveyselect data=data noprint
        method=srs
        n=numOfSamps
        seed=123
        out=outputSet;
run;
As always, thanks in advance!
If you just want to randomly sort your final stratified sample, you can use ranuni() and PROC SORT.
data data;
set data;
rn = ranuni(12345);
run;
proc sort data = data; by rn; run;
To generate a random permutation of n integers drawn from 1 to x, you can use PROC PLAN. I don't know how that would fit with your SRS, but then you don't tell the whole story, do you?
proc plan;
factors x=10 of 100 random / noprint;
output out=x10;
run;
quit;
This is the default behavior of SURVEYSELECT if you have just a single SRS.
data have;
    call streaminit(7);
    do _order = 1 to 1e6;
        intnum = floor(rand('Uniform')*1e9);
        output;
    end;
run;
proc surveyselect data=have noprint
        method=srs
        n=10000
        seed=123
        out=outputSet;
run;
As you can see from the included _ORDER variable, the output is not sorted in any way; it still retains the initial ordering (and there's nothing special about that variable).
Now, if you are doing stratified SRS, which is implied by your use of a dataset name for n= in the above code (but you leave the details of that out), the output will get sorted by the strata, so you need to include an _ORDER variable or something similar and then re-sort by that variable to return to your original order.
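A sketch of that workflow; the strata variable stratum and the sample size are assumptions, since the strata details aren't shown:
data have2;
    set have;
    _order = _n_;           * remember the original row position;
run;

proc sort data=have2;       * stratified selection needs the input sorted by strata;
    by stratum;
run;

proc surveyselect data=have2 noprint
        method=srs
        n=100
        seed=123
        out=outputSet;
    strata stratum;
run;

proc sort data=outputSet;   * restore the original (random) order;
    by _order;
run;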
I am, out of necessity, using the SQLite3 shell tool to maintain a small database. I'm using the -header and -ascii flags, although this applies, as far as I can tell, to any of the output choices. I'm looking for a way to avoid ambiguity over the type of any one value returned. Consider the following:
Create Table `things` (`number` Integer, `string` Text, `binary` Blob);
Insert Into `things` (`number`,`string`,`binary`) Values (4,'4',X'34');
Select * From `things`;
This returns (using caret notation):
number^_string^_binary^^4^_4^_4^^
As is evident, there is no way to infer the type of any of the '4' characters from the response alone, as none of them have distinguishing delimiters.
Is there any way to coerce the inclusion of type metadata into the response?
I'd like to avoid:
Altering query statements to also include types as that would be obfuscatory and would be superfluous in the event I did switch interfaces;
Prefixing TEXT and BLOB values prior to insert as this would have to be uniform for all TEXT and BLOB interaction (in saying that, this is still my preferred choice should it come to that).
What I'm looking for is a switch of some kind that indicates type as part of SQLite's response, e.g.:
number^_string^_binary^^4^_'4'^_X'4'^^
number^_string^_binary^^4^_text:4^_blob:4^^
Or some variation thereof. Fundamental to this is that the response alone contains enough information to discern the type and value of each element of that response (much in the same way sqlite3_column_type() allows in the SQLite library API).
Update: I've refined this question since the first answer by @mike-sherrill-cat-recall to clarify expectations.
In SQLite, it doesn't always make sense to echo the data type of a column. SQLite doesn't have column-wise data types in the traditional sense. You can use typeof(X) in SQL to show the "datatype of the expression X".
sqlite> create table test (n integer, d decimal(8, 2));
sqlite> insert into test (n, d) values (8, 3.14);
sqlite> insert into test (n, d) values ('wibble', 'wibble');
Inserting text into an integer column succeeds.
sqlite> select n, typeof(n), d, typeof(d) from test;
n typeof(n) d typeof(d)
---------- ---------- ---------- ----------
8 integer 3.14 real
wibble text wibble text
You can concatenate anything you like, even producing caret notation, but it's kind of clumsy.
sqlite> select '(' || typeof(n) || ')^_' || n as caret_n from test;
caret_n
-------------------------
(integer)^_8
(text)^_wibble
See the SQLite Core Functions documentation.
The shell always converts printed values to strings. (That's what "print" means.)
If you don't want to add separate output columns for the types, you could use the quote function to output all values according to SQL syntax rules:
sqlite> with v(x) as (values (null), (1), (2.3), ('hello'), (x'00')) select quote(x) from v;
NULL
1
2.3
'hello'
X'00'
I need some help tracing a "data type mismatch" error in Visual FoxPro 6.0, raised when I issue a command like this: INSERT INTO tmpcur FROM MEMVAR.
tmpcur is a cursor with a large number of columns, and it is really hard to trace which one has the mismatched data type causing the insertion problem.
It is pretty difficult to trace the insertion of each record into VFP tables one by one, unlike with the MSSQL profiler.
I'd appreciate it if someone could help. Thanks.
This should help you. I have a temp cursor created with some bogus field/column names, testing for the types character, integer, double, currency, date, and datetime. To reproduce your scenario, I take the memory variable bbbb, which should be double (or numeric at the least), and change it to a string.
I then HOLD whatever error-trapping routine MAY be in effect and set my own (I don't think TRY/CATCH existed in VFP6; it may have, but I just don't remember), using ON ERROR to set a variable to true. I default the flag to false, try the insert, then check the flag. If the flag IS set, I go into a loop over each column in the given table/alias (in my example it is "C_Tmp", so replace with your table/alias). It checks each memory variable, and if its data type differs from the table structure, it dumps the column name and the table/memory values for you to review.
You could send this output to a log file or something.
Now, another consideration: some type differences are completely valid and common for implicit conversion. For example, character and memo fields can both accept strings, and integer, double, float, and currency can all work with generic numeric values.
So, if you encounter these differences, we can go one level further and look for comparable types; let me know and we can adjust as needed.
At least this should give you a huge jump on your insert issue.
CREATE CURSOR C_tmp ( cccc c(10), iiii i, bbbb b(2), ccyyyy y, ddd d, tttt t )
SCATTER MEMVAR MEMO

* Force a type mismatch for demonstration
m.bbbb = "wrong data type, was double with 2 decimal"

* Preserve any existing error handler, then trap failures into a flag
lcHoldError = ON("ERROR")
ON ERROR lFailInsert = .t.
lFailInsert = .f.

INSERT INTO C_Tmp FROM MEMVAR

IF lFailInsert
   * Compare each column's type against the matching memory variable
   FOR lnI = 1 TO FCOUNT( "C_Tmp" )
      lcTmp = FIELD( lnI, "C_Tmp" )
      IF NOT TYPE( "C_Tmp." + lcTmp ) == TYPE( "m.&lcTmp" )
         ? "Invalid " + lcTmp, C_Tmp.&lcTmp, m.&lcTmp
      ENDIF
   ENDFOR
ENDIF

* Restore the original error handler
ON ERROR &lcHoldError
I wonder if there is a way to unduplicate records WITHOUT sorting? Sometimes I want to keep the original order and just remove duplicated records.
Is it possible?
BTW, below are the approaches I know of for unduplicating records, both of which involve sorting in the end:
1.
proc sql;
    create table yourdata_nodupe as
    select distinct *
    from abc;
quit;
2.
proc sort data=YOURDATA nodupkey;
by var1 var2 var3 var4 var5;
run;
You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.
Here is an example using the input data set sashelp.cars. The original data is in alphabetical order by Make, so you can see that the output data set nodupes maintains that same order.
data nodupes (drop=rc);
    length Make $13;
    declare hash found_keys();
    found_keys.definekey('Make');
    found_keys.definedone();
    do while (not done);
        set sashelp.cars end=done;
        rc = found_keys.check();
        if rc ^= 0 then do;   /* key not seen before */
            rc = found_keys.add();
            output;
        end;
    end;
    stop;
run;

proc print data=nodupes; run;
/* Give each record in the original dataset a row number */
data with_id;
    set mydata;
    _id = _n_;
run;

/* Remove dupes */
proc sort data=with_id nodupkey;
    by var1 var2 var3;
run;

/* Sort back into original order */
proc sort data=with_id;
    by _id;
run;
I think the short answer is no, there isn't, at least not a way that wouldn't have a much bigger performance hit than a method based on sorting.
There may be specific cases where this is possible (a dataset where all variables are indexed? a relatively small dataset that you could reasonably load into memory and work with there?), but this wouldn't give you a general method.
Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.
Depending on the number of variables in your data set, the following might be practical; note that it only removes consecutive duplicate records:
data abc_nodup;
    set abc;
    retain _var1 _var2 _var3 _var4;
    if _n_ eq 1 then output;
    else do;
        if (var1 eq _var1) and (var2 eq _var2) and
           (var3 eq _var3) and (var4 eq _var4)
        then delete;
        else output;
    end;
    _var1 = var1;
    _var2 = var2;
    _var3 = var3;
    _var4 = var4;
    drop _var:;
run;
Please refer to SAS Usage Note 37581, "How can I eliminate duplicate observations from a large data set without sorting?" (http://support.sas.com/kb/37/581.html), which shows how PROC SUMMARY can be used to remove duplicates more efficiently, without sorting.
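As a rough sketch of how PROC SUMMARY can find first occurrences without sorting (key names var1-var3 are placeholders; this is the general idea, not the note's exact code): tag each row with its position, let CLASS processing, which requires no presorted input, find the first row per key, then restore the original order.
data tagged;
    set abc;
    _row = _n_;                * remember the original position;
run;

proc summary data=tagged nway;
    class var1 var2 var3;      * dedup key, CLASS needs no presorting;
    output out=firsts(drop=_type_ _freq_) min(_row)=_row;
run;

proc sort data=firsts;         * restore the original encounter order;
    by _row;
run;
The result holds each distinct key combination in first-encounter order; to keep the non-key variables as well, you could merge back to the original data by _row.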
The two examples given in the original post are not equivalent.
- distinct in PROC SQL removes only rows that are fully identical.
- nodupkey in PROC SORT removes any row whose key variables match a previous row's (even if the other variables differ); you need the option noduprecs instead to remove fully identical rows (see the sketch below).
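A minimal sketch of the noduprecs variant:
proc sort data=yourdata noduprecs;
    by var1;    * adjacent fully identical rows are removed after sorting;
run;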
If you are only looking for records having common key variables, another solution would be to create a dataset with only the key variable(s), find out which ones are duplicated, and then apply a format to the original data to flag the duplicate records. If more than one key variable is present in the dataset, you would need to create a new variable containing the concatenation of all the key variable values, converted to character if needed.
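A rough sketch of that flagging idea (assuming a single character key var1; the format name $dupflag and the intermediate dataset names are made up for illustration):
/* find keys that occur more than once */
proc freq data=abc noprint;
    tables var1 / out=keycounts(where=(count > 1));
run;

/* build a character format marking those keys as duplicates */
data dupfmt;
    set keycounts end=last;
    retain fmtname '$dupflag' type 'C';
    start = var1;
    label = 'DUP';
    output;
    if last then do;    * catch-all entry for every other key;
        hlo = 'O';
        label = '';
        output;
    end;
run;

proc format cntlin=dupfmt;
run;

/* flag rows in their original order, without sorting abc itself */
data abc_flagged;
    set abc;
    dup = (put(var1, $dupflag.) = 'DUP');
run;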
This is the fastest way I can think of, and it requires no sorting:
data output_data_name;
    set input_data_name (
        sortedby = person_id stay
        keep =
            person_id
            stay
            ... more variables ...);
    by person_id stay;
    if first.stay > 0 then output;
run;
data output;
    set yourdata;
    by var notsorted;
    if first.var then output;
run;
This will not sort the data, but it will remove duplicates within each consecutive group of identical var values.