SAS: Proc sort nodupkeys error - sorting

I have two data sets :
"mothers" - 5,512 observations where the variable "MOTHER" = 1
"all_women" - 2,336,750 observations where the variable "MOTHER" = 0
I combine the two as follows:
data combined;
set mothers all_women;
Since the mothers are already in the dataset all_women, I want to delete the repeated entries, with the condition that I keep the observations where "MOTHER"=1.
I tried the following:
proc sort data=combined; by ID DESCENDING MOTHER; run;
proc sort data=combined nodupkeys; by ID; run;
Yet I lose some of the mothers: I am left with only 5,458 observations where "MOTHER"=1. What have I done to introduce this error?

Instead of using NODUPKEY, use FIRST./LAST. processing.
proc sort data=combined;
by ID DESCENDING MOTHER;
run;
data want;
set combined;
by ID descending mother;
if not (first.ID) and (mother=0) then delete;
run;
That keeps all mother=1 rows, keeps one record for any ID that has only mother=0 rows, and drops the mother=0 rows for any ID that also has a mother=1 row.

Have you checked whether there were any duplicate IDs in the mothers dataset? The second proc sort would have eliminated those rows.
You can check like so:
proc sort data = mothers nodupkey out = mothers_dedup dupout = mothers_dups;
by ID;
run;
If mothers_dups contains more than 0 observations, this might account for the problem.
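To see which IDs are repeated and how often, a PROC SQL query along these lines may help. It is only a sketch and assumes ID is the sole key variable in mothers; n_rows is an illustrative column name. Note that 5,512 - 5,458 = 54, so if the surplus rows reported here (n_rows minus 1, summed over the listed IDs) come to 54, duplicates in mothers explain the whole discrepancy.

```sas
/* List every ID that appears more than once in MOTHERS */
proc sql;
  select ID, count(*) as n_rows
    from mothers
    group by ID
    having count(*) > 1
    order by n_rows desc;
quit;
```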

Related

Impact of Order on Proc Surveyselect w/ Seed

I'm following basic instructions to replicate a Proc Surveyselect step. The instructions include the seed used, but it seems that results can still vary depending on the sort order (and the instructions don't say anything about sorting). I've been reviewing the online documentation, and all of the information I can find about sorting and Proc Surveyselect applies when strata are specified. Does sort order affect the results of Proc Surveyselect when a seed is specified (and no strata statement is included)?
proc surveyselect data = dataset groups=10 seed=123456 out=dataset_out; run;
Let's test it out. The answer is yes, but what better way to learn than to test it yourself?
data class;
set sashelp.class;
run;
/* Create 10 random groups from class */
proc surveyselect
data = class
groups = 10
seed = 123456
out = class_sample;
run;
/* Resort the data */
proc sort data=class;
by age;
run;
/* Create 10 random groups from the sorted data */
proc surveyselect
data = class
groups = 10
seed = 123456
out = class_sample_sorted;
run;
/* Resort both datasets back by name */
proc sort data=class_sample;
by name;
run;
proc sort data=class_sample_sorted;
by name;
run;
/* Compare the difference between assigned groups */
proc compare base = class_sample
compare = class_sample_sorted
;
id name;
var groupid;
run;
The results from proc compare show that dataset order does indeed matter and will give different sampling results.
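If the goal is to reproduce someone else's groups, one option is to put the input into a known order before sampling, so that the seed always meets the rows in the same sequence. A minimal sketch, assuming name uniquely identifies each row (true for sashelp.class, but an assumption for any other data); class_by_name and class_groups are illustrative names:

```sas
/* Fix the row order first; NAME is assumed to be a unique key */
proc sort data=class out=class_by_name;
  by name;
run;

/* Same seed + same row order should give the same group assignment */
proc surveyselect
  data   = class_by_name
  groups = 10
  seed   = 123456
  out    = class_groups;
run;
```

This only helps if the original run used the same order, so the instructions being replicated would still need to state it.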

Why is my nested Lua table printing out of order?

Beginner Lua question - I'm just learning Lua, and I wrote some code: a nested table to create something like a table with rows and columns.
However, when I iterate through the table using pairs(), it doesn't output in the same order I put it in. I put it in as Serial, Service Days, Connected, and it's coming out as Service Days, Serial, Connected. I am at a loss to figure out why. I intentionally created the three rows in different ways, since I'm just learning and trying to get comfortable with the different ways of dealing with Lua tables...
The code:
myTable = {}
myTable["headerRow"] = {
Serial = "Serial",
ServDays = "Service Days",
Connected = "Connected" }
myTable[1] = {
Serial = "B9FX",
ServDays = 7,
Connected = true }
myTable[2] = {}
myTable[2]["Serial"] = "2SHA"
myTable[2]["ServDays"] = 3
myTable[2]["Connected"] = true
for k, v in pairs(myTable) do
for k2, v2 in pairs(v) do
io.write(tostring(v2),",")
end
io.write("\n") --End the row
end
The result:
c:\lua>lua53 primer.lua
7,B9FX,true,
3,2SHA,true,
Service Days,Serial,Connected,
pairs uses the next function. Hence the order of traversal in a generic for loop using the pairs iterator is unspecified.
From the Lua reference manual:
https://www.lua.org/manual/5.3/manual.html#pdf-next
The order in which the indices are enumerated is not specified, even
for numeric indices. (To traverse a table in numerical order, use a
numerical for.)
The behavior of next is undefined if, during the traversal, you assign
any value to a non-existent field in the table. You may however modify
existing fields. In particular, you may clear existing fields.
If you do something like this:
myTable[2] = {}
myTable[2]["Serial"] = "2SHA"
myTable[2]["ServDays"] = 3
myTable[2]["Connected"] = true
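Lua will not remember the order in which you assigned values to table keys. It only maps keys to values. If you need a fixed column order, keep the key names in a separate array-style table (for example {"Serial", "ServDays", "Connected"}) and walk that list with a numerical for, looking up each key in the row.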

Creating portfolios depending on 2 variables with SAS

I am new to SAS and would like to form portfolios from the intersection of 2 variables in my spreadsheet.
Basically, I have an Excel file called 'Up' with variables such as month, company, BM, and market cap (USD).
For each month, I would like to sort my data by size (descending) and then by BM (descending). I would like to create 4 size portfolios according to P25, P50 and P75, with the first size portfolio being above P75 (for each month), and so on. Then, within each size portfolio, I would like to create 4 new portfolios based on 'BM', again using P25, P50 and P75.
Could someone help me by showing the SAS code and how to add the result to my existing 'Up' file (the sheet is also named 'up')?
So I agree with the comment, this is not asked well. However, it is a common problem to solve and somewhat fun. So here goes:
First I'm going to just make up some data. Google search how to read Excel in SAS. It's easy.
1000 companies with a random SIZE and BM value.
data companies(drop=c);
format company $12.;
do c=1 to 1000;
company = catt("C_",put(c,z4.));
size = ceil(100*ranuni(1));
BM = ceil(100*ranuni(1));
output;
end;
run;
So I'm assuming you just want equal amounts in these 4 groups. You don't want to estimate percentiles based on a distribution or KDE. For this, PROC RANK works well.
proc rank data=companies out=companies descending groups=4;
var size;
ranks p_size;
run;
We now have a variable P_SIZE that takes the values 0, 1, 2, 3 based on the descending order of SIZE (0 is the group with the largest values).
Sort the portfolios by that P_SIZE value.
proc sort data=companies;
by p_size;
run;
Now run PROC RANK again, this time using a BY statement with P_SIZE, ranking on BM, and creating P_SIZE_BM.
proc rank data=companies out=companies descending groups=4;
var bm;
by p_size;
ranks p_size_bm;
run;
P_SIZE_BM now contains values 0,1,2,3 for EACH value of P_SIZE.
Sort the data and see how it comes out:
proc sort data=companies;
by p_size p_size_bm;
run;
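The question asked for portfolios within each month, which the made-up data above does not have. A sketch of the same two-step ranking with month added as a BY variable, assuming the imported data set is called up and holds variables named month, size, and bm (all of these names are assumptions about the spreadsheet; up_size and up_portfolios are illustrative output names):

```sas
/* BY-group processing needs the data sorted by the BY variables */
proc sort data=up;
  by month;
run;

/* Step 1: size quartiles within each month (0 = largest) */
proc rank data=up out=up_size descending groups=4;
  by month;
  var size;
  ranks p_size;
run;

proc sort data=up_size;
  by month p_size;
run;

/* Step 2: BM quartiles within each month and size portfolio */
proc rank data=up_size out=up_portfolios descending groups=4;
  by month p_size;
  var bm;
  ranks p_size_bm;
run;
```

Each month then has up to 16 portfolios, identified by the pair (p_size, p_size_bm).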

Selecting top 10 observations for each data type (SAS)

I am trying to select the top 10 exposures for each class of business out of a large data set.
Below is an example of the dataset.
[Image: dataset example]
If I needed only the overall top 10 exposures, I would simply sort by exposure descending (as I have done) and use the OBS=10 data set option.
However I require the top 10 for each LOB.
Do you know how I could do this in SAS?
Thanks!
I would create a counting dummy variable, counting the number of exposures per lines of business and then delete any observation for which the dummy variable exceeds 10.
This can be done in a single datastep (given that the data is properly sorted) by (ab-)using that SAS code runs top to bottom.
proc sort data = have out=temp; by lob descending exposure; run;
data want(drop=countlob);
set temp;
by lob;
/* Restart the counter at the first row of each LOB */
if first.lob then countlob = 0;
/* Sum statement: increments countlob and retains it across rows */
countlob + 1;
if countlob > 10 then delete;
run;
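An alternative sketch uses PROC RANK instead of a counter. Unlike the data step above, it keeps every row tied at the cutoff, so a LOB can return more than 10 rows; whether that is desirable depends on the use case. The names ranked, want_ranked, and exposure_rank are illustrative:

```sas
proc sort data=have out=temp;
  by lob;
run;

/* Rank exposures within each LOB; 1 = largest, ties share the lowest rank */
proc rank data=temp out=ranked descending ties=low;
  by lob;
  var exposure;
  ranks exposure_rank;
run;

data want_ranked;
  set ranked;
  /* Missing exposures get a missing rank, which the lower bound excludes */
  where 1 <= exposure_rank <= 10;
run;
```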

How to sort by defined order in sas, not alphabetically?

Say I have a dataset for a supermarket, with a product category, price product name etc. I want to sort by the category but with a defined order as opposed to alphabetically.
For instance, say the categories are canned, dairy, meat, and vegetable, and I want to sort by when they may expire (I know it's unlikely we'd have that information, so just play along please). This means I want to sort in this order: Dairy, Meat, Vegetable, Canned.
I wrote a macro with this signature:
key_sort(ds=, keys='canned, dairy, meat, vegetable', field =category, sort_by=)
This parses the keys so that they can be put in a macro loop, then I use that macro loop to write out a select statement like so:
Select(&field. ) ;
%do i=1 %to &number_of_keys. ;
%let current_key= %scan(&keys., &I., &delim.) ;
When(&current_key. ) &field._key=&i. ;
%end;
End;
I then sort by the &field._key
Is this the best method to take? Can this be done more succinctly or efficiently?
If you have a separate dataset that contains the ordering, you can use it as the input to PROC FORMAT (the CNTLIN= option). If there are only a few values, it is simplest to write the format by hand:
proc format;
value $EXPIRES
'canned'= 4
'dairy' = 1
'meat' = 2
'vegetable' = 3
other = 5
;
run;
proc sql;
CREATE TABLE output_set AS
SELECT * FROM foods ORDER BY put(produce_type, $expires.);
quit;
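For the case where the ordering lives in its own dataset, the format can be built from it with CNTLIN=. This is a sketch only: category_order, sort_rank, and fmt_input are hypothetical names, and foods and category stand in for the real table and column.

```sas
/* Hypothetical lookup table holding the desired order */
data category_order;
  length category $12;
  input category $ sort_rank;
  datalines;
dairy 1
meat 2
vegetable 3
canned 4
;
run;

/* Reshape into the layout PROC FORMAT expects from CNTLIN= */
data fmt_input;
  set category_order;
  retain fmtname '$EXPIRES' type 'C';
  start = category;
  /* Zero-padded so the labels sort correctly as text */
  label = put(sort_rank, z2.);
  keep fmtname type start label;
run;

proc format cntlin=fmt_input;
run;

proc sql;
  create table output_set as
  select * from foods
  order by put(category, $expires.);
quit;
```

Values with no entry in the lookup table are left unformatted, so they sort by their own text; matching is also case-sensitive, so 'Dairy' and 'dairy' are different keys.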
