Sorting Logic - Clarification - sorting

I have a question about sorting logic. To explain, I am providing a scenario here.
I have 3 datasets and want to remove duplicates between the datasets. The final result should NOT be a single combined dataset; instead, the result datasets should be separate but without duplicates, like
TEST1, TEST2, TEST3. Each of the datasets contains duplicates. After removing the duplicates, the datasets should still be TEST1, TEST2, TEST3, but without any duplicates between the 3 datasets.
Logic Used:

data final;
    set test1 test2 test3 indsname=dsn;
    memnm = dsn;
run;

proc sort data=final nodupkey;
    by var var2;
run;

data test1 test2 test3;
    set final;
    if memnm = 'test1' then output test1;
    if memnm = 'test2' then output test2;
    if memnm = 'test3' then output test3;
run;
I want to know whether the order of the rows of the datasets (test1, test2, test3) will still be preserved in the final dataset even after the sort procedure completes. Since I am ordering the datasets while SETting them into the final dataset, will that order be changed during the sort procedure or not?
Note: the order of the datasets (test1, test2, test3) will NOT be changed in the SET statement.
Please provide a suggestion on this.
As far as I have tested this code, I have not seen any order change, but I really want to confirm it. If someone has any idea or a document related to the ordering logic of the sort step, it would be very helpful.
Thanks in advance

The sort procedure will sort the rows in final by var and var2, and therefore your result datasets test1, test2 and test3 will each also be sorted by var and var2. If you want to preserve the order of your test datasets as it was before your program ran, you could, for instance, store the value of _N_ in a variable in your final dataset and sort on it after splitting your data based on memnm.
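For instance, here is a minimal sketch of that approach (it assumes the datasets are in the WORK library, since INDSNAME= returns a two-level name such as WORK.TEST1; the _seq variable is introduced purely to remember the original position):

data final;
    set test1 test2 test3 indsname=dsn;
    memnm = dsn;   /* e.g. WORK.TEST1 */
    _seq  = _n_;   /* original row position across the concatenated data */
run;

proc sort data=final nodupkey;
    by var var2;
run;

data test1 test2 test3;
    set final;
    if memnm = 'WORK.TEST1' then output test1;
    else if memnm = 'WORK.TEST2' then output test2;
    else if memnm = 'WORK.TEST3' then output test3;
run;

/* restore the original order within each split dataset */
proc sort data=test1; by _seq; run;
proc sort data=test2; by _seq; run;
proc sort data=test3; by _seq; run;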
Note, though, that if your datasets contain variables other than var and var2, you are removing duplicate keys (var, var2), not duplicate records. proc sort nodupkey will keep the first record it encounters with a given key, in the order that the records appear in final, and discard the others regardless of the values of any variables other than var and var2.
For instance if you had test1:
var var2 var3
---------------
one two foo
and test2:
var var2 var3
---------------
one two bar
After your proc sort, you would have table final:
var var2 var3
---------------
one two foo
and the bar record would be gone.
To remove duplicate records rather than duplicate keys, you can use the option noduprecs instead of nodupkey.
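For example (reusing the key variables from the question; note that noduprecs only removes duplicate records that end up adjacent after sorting by the BY variables):

proc sort data=final noduprecs;
    by var var2;
run;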

PROC SORT (with its default EQUALS option) will preserve the relative order of observations within the BY groups.
If you want to combine three datasets that are already sorted, you can use the BY statement in combination with the SET statement. So if you wanted to interleave three datasets, keep only the first observation found in each BY group, and also record which dataset contributed that observation, you could use code like this:
data final;
    set test1 test2 test3 indsname=dsn;
    by var var2;
    if first.var2;
    memnm = dsn;
run;

Related

SAS - Determine if group of columns contains any value of a set

I have a data set that contains several numbered columns, Code1, Code2, ... Code12. I also have a set of values for those codes %myCodes = ('ABC','DEF','GHI',etc.).
What I want to do is filter so that I include only rows where at least one of the Code1 through Code12 columns contains a value from myCodes. I realize that I could just do a very long OR condition [e.g. (Code1 in &myCodes or Code2 in &myCodes or ...)], but I was wondering if there is a more efficient way to do it.
I think I would concatenate Code1-Code12 into a single variable and then use a regular expression:
length long_text $32767;
long_text = catx('~', of Code1-Code12);
found = prxmatch('/(abc|def|ghi)/oi', long_text);
Obviously this requires reworking your list of codes into the regular expression, but I think it is pretty efficient, as long as you don't need to identify which variable(s) contain the code!
You could put your columns that you want to select within a macro variable and loop over them all.
%let cols = col1 col2 col3 col4;
%let myCodes = ('ABC','DEF','GHI');

%macro condition;
    %scan(&cols, 1) IN &myCodes
    %do i = 2 %to %sysfunc(countw(&cols));
        OR %scan(&cols, &i) IN &myCodes
    %end;
%mend;

data want;
    set have;
    where %condition;
run;
If you look at %condition, it has all of your filters:
%put %condition;
col1 IN ('ABC','DEF','GHI') OR col2 IN ('ABC','DEF','GHI') OR col3 IN ('ABC','DEF','GHI') OR col4 IN ('ABC','DEF','GHI')
It really depends on what you mean by efficient.
In general you will have to test all of the CODEx variables to be sure that none of them contains a code of interest, so your proposed long expression might be the most efficient in terms of performance.
You could use a code generator (such as a SAS macro) to help you generate the repetitive code. That would be more efficient for the programmer, even if it has no impact on the actual execution time.
You could use an ARRAY to allow you to loop over the set of CODEx variables. That offers an opportunity to stop once at least one match is found, which might be more efficient.
array code[12];
do index = 1 to dim(code) until (found);
    found = code[index] in &mycodes; /* &mycodes resolves to the quoted, parenthesized list, e.g. ('ABC','DEF','GHI') */
end;
if found;
But there is extra work required to implement the looping, so for the cases that do not match it might actually take longer. Plus, you could not use that in a WHERE statement.
Depending on the codes, it might be better to loop over the codes in &MYCODES and test whether they appear in any of the CODEx variables instead. You will need to know how many codes are in &MYCODES for this to work.
array code[12];
array mycodes[3] $3 _temporary_ &mycodes; /* dimension and length must match the code list */
do index = 1 to dim(mycodes) until (found);
    found = mycodes[index] in code;
end;
if found;

SAS proc surveyselect don't sort output

I'm using PROC SURVEYSELECT to get random numbers from a set of integers. SAS returns the sampled integers, but in ascending order, and I need them to remain in random order. How would I either randomly mix the output from SURVEYSELECT, or just get the procedure not to sort? I can't seem to find any option that makes the procedure output the values in the order it randomly selects them.
Here's the code:
proc surveyselect data=data noprint
        method=srs
        n=numOfSamps
        seed=123
        out=outputSet;
run;
As always, thanks in advance!
If you just want to randomly sort your final stratified sample, you can use ranuni() and proc sort.
data data;
set data;
rn = ranuni(12345);
run;
proc sort data = data; by rn; run;
To generate a random selection of n integers from 1 to x, in random order, you can use PROC PLAN. I don't know how that would fit with your SRS, but then you haven't told the whole story, have you?
proc plan;
factors x=10 of 100 random / noprint;
output out=x10;
run;
quit;
This is the default behavior of SURVEYSELECT if you have just a single (unstratified) SRS.
data have;
    call streaminit(7);
    do _order = 1 to 1e6;
        intnum = floor(rand('Uniform')*1e9);
        output;
    end;
run;

proc surveyselect data=have noprint
        method=srs
        n=10000
        seed=123
        out=outputSet;
run;
As you can see from the included _ORDER variable, it's not sorted in any way - it still retains the initial ordering (and there's nothing special about that variable).
Now, if you are doing a stratified SRS, which is implied by your use of a dataset name for n= in your code (though you leave the details of that out), the output will get sorted, and you need to include the _ORDER variable or something similar and then re-sort by that variable to return to your original order.
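For example, a minimal sketch of that re-sort, assuming the outputSet sample from the code above (which carries the _ORDER variable from the input data):

proc sort data=outputSet;
    by _order;
run;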

generate a different number of columns based on input number

Suppose I have some XML data that has an unknown number of sub-nodes. Is there a method that allows me to pass the number of sub-nodes into the program as a parameter and have it process them? The current code is something like this:
SourceXML = LOAD '$input' using org.apache.pig.piggybank.storage.XMLLoader('$TopNode') as test:chararray;
test2 = LIMIT SourceXML 3;
test3 = FOREACH test2 GENERATE REGEX_EXTRACT(test,'<$tag1>(.*)</$tag1>',1),
REGEX_EXTRACT(test,'<$tag2>(.*)</$tag2>',1);
dump test3;
However, I may not know in advance how many simple elements there are in the target data (how many $tag# parameters there are). I am hoping to use a .txt file containing parameters that looks something like this:
input=/inputpath/lowerlevelsofpath
numberSimpleElements=3
tag1=tag1name
tag2=tag2name
tag3=tag3name
with a REGEX_EXTRACT being done for each tag listed in the parameter file.
Any ideas on how to accomplish this?
You could do the following:
1. Split the text by some regex, so that each row now has a value.
2. Generate a (tag, value) pair for each row.
3. Do a join between (tag, value) and the list of tags.

Run string as Code

How can I call an existing VB6 function and pass parameters, or execute a statement that uses some defined objects, dynamically? E.g.
Private Const KONST = 123.45
Private Function First()
Dim var1 As String
Dim var2 As Date
Dim var3 As Integer
...
var3 = Second(var1) 'LINE 1
...
var2 = var2 + IIf(var3 > KONST, 1, -1) 'LINE 2
...
var2 = var2 * KONST 'LINE 3
...
End Function
Private Function Second(ByVal str As String) As Integer
Second = CInt(str)
End Function
At line 1: the name of the function Second could be dynamic, while still using var1 and returning a value.
At line 2: the whole IIf expression should be dynamic, using var3 and KONST.
At line 3: the whole var2 * KONST expression should be dynamic, i.e. here I may write var2 + KONST, or var3 / KONST, or var3 + 222, or 1 + 2, or myCollection.Item("item_Key").
All such dynamic configuration will be in a config file.
Edit
I am trying to make the grid layout and data population dynamic. By grid layout I mean the number of columns, their titles, order, format, etc. By population I mean loading data into the grid; in doing so, we sometimes resolve a database value against some of our Enums, we apply some logic to the data before showing it, the value of one column is based on the value of another column, etc. Up to some extent this could be achieved via database views, but to keep all such logic in a central location we do these things in source code. Therefore I need some way to call my VB6 code dynamically and define the call (function name, parameters, enums, types, statement) in a config file.
Well, you could use CallByName (see http://support.microsoft.com/kb/186143 for one of many easily found examples) to invoke methods and properties on an object dynamically.
But I think you want entire composed statements to be executed dynamically. For that you can use the Script control (as in VBScript). See http://support.microsoft.com/kb/184740 for a sample. In particular, it has an Eval function that evaluates arbitrary expressions.

How to remove duplicated records\observations WITHOUT sorting in SAS?

I wonder if there is a way to de-duplicate records WITHOUT sorting. Sometimes I want to keep the original order and just remove the duplicated records.
Is it possible?
BTW, below are the approaches I know for de-duplicating records, both of which end up sorting the data:
1.
proc sql;
create table yourdata_nodupe as
select distinct *
From abc;
quit;
2.
proc sort data=YOURDATA nodupkey;
by var1 var2 var3 var4 var5;
run;
You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.
Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.
data nodupes (drop=rc);
    length Make $13;
    declare hash found_keys();
    found_keys.definekey('Make');
    found_keys.definedone();
    do while (not done);
        set sashelp.cars end=done;
        rc = found_keys.check();
        if rc ^= 0 then do;
            rc = found_keys.add();
            output;
        end;
    end;
    stop;
run;

proc print data=nodupes;
run;
/* Give each record in the original dataset a row number */
data with_id;
    set mydata;
    _id = _n_;
run;

/* Remove dupes */
proc sort data=with_id nodupkey;
    by var1 var2 var3;
run;

/* Sort back into original order */
proc sort data=with_id;
    by _id;
run;
I think the short answer is no, there isn't, at least not a way that wouldn't have a much bigger performance hit than a method based on sorting.
There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?) but this wouldn't help you with a general method.
Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.
Depending on the number of variables in your data set, the following might be practical (note that it only removes duplicates that are adjacent in the data):
data abc_nodup;
    set abc;
    retain _var1 _var2 _var3 _var4;
    if _n_ eq 1 then output;
    else do;
        if (var1 eq _var1) and (var2 eq _var2) and
           (var3 eq _var3) and (var4 eq _var4)
            then delete;
        else output;
    end;
    _var1 = var1;
    _var2 = var2;
    _var3 = var3;
    _var4 = var4;
    drop _var:;
run;
Please refer to SAS Usage Note 37581, "How can I eliminate duplicate observations from a large data set without sorting?", http://support.sas.com/kb/37/581.html . The note shows how PROC SUMMARY can be used to remove duplicates more efficiently, without a sort step.
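I have not reproduced the note's exact code here, but the general shape of a PROC SUMMARY de-duplication is roughly the following (dataset and variable names are placeholders; note that the output comes back ordered by the class variables, not in the original row order):

proc summary data=have nway missing;
    class var1 var2 var3;                   /* the de-duplication keys */
    output out=nodup (drop=_type_ _freq_);  /* one row per distinct key combination */
run;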
The two examples given in the original post are not equivalent.
distinct in proc sql only removes rows which are fully identical.
nodupkey in proc sort removes any row whose key variables match those of a row already kept (even if the other variables differ). You need the option noduprecs to remove fully identical rows.
If you are only looking for records having common key variables, another solution I can think of is to create a dataset with only the key variable(s), find out which key values are duplicated, and then apply a format to the original data to flag the duplicate records. If more than one key variable is present in the dataset, you would need to create a new variable containing the concatenation of all the key variable values, converted to character if needed.
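A rough sketch of that flagging idea, assuming a single character key variable var1 in a dataset have (both names are placeholders), and assuming at least one duplicate key exists so the CNTLIN= dataset is not empty:

/* build a character format that maps duplicated key values to 'DUP' */
proc sql;
    create table dupkeys as
    select var1      as start,
           'DUP'     as label,
           'dupflag' as fmtname,
           'C'       as type
    from have
    group by var1
    having count(*) > 1;
quit;

proc format cntlin=dupkeys;
run;

/* flag records whose key value occurs more than once                  */
/* (keys not in the format are returned as-is, truncated to the format */
/* default length, so a key that itself starts with 'DUP' would need a */
/* different label)                                                    */
data flagged;
    set have;
    dup_flag = (put(var1, $dupflag.) = 'DUP');
run;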
This is the fastest way I can think of. It requires no sorting, but it does assume the data is already grouped by the key variables, as asserted by the sortedby= option:
data output_data_name;
    set input_data_name (
        sortedby = person_id stay
        keep =
            person_id
            stay
            ... more variables ...);
    by person_id stay;
    if first.stay > 0 then output;
run;
data output;
    set yourdata;
    by var notsorted;
    if first.var then output;
run;
This will not sort the data, but note that it only removes duplicates within each group of consecutive identical values of var; duplicate values that are not adjacent are kept.
