Use if first. statement on a single variable in SAS - sorting

Hello: I have a question.
I have a sas dataset like this:
data a;
input id $ a b ;
cards;
ddd 12 1
ddd 22 1
ddd 44 2
ddd 50 1
ddd 52 1
ddd 88 2
;run;
and I would like to use if first. to flag the observations like this:
data a;
input id $ a b flag $;
cards;
ddd 12 1 Y
ddd 22 1
ddd 44 2 Y
ddd 50 1 Y
ddd 52 1
ddd 88 2 Y
;run;
To do that, I sorted the dataset by id, a, b and tried to use if first.b to create the flag, but it flags every observation with Y. I think the reason might be that I sort by a before b. But to keep the dataset in this order, I have to sort it by a, b. So my question is: how can I keep the order and still use first.b to create the flag?
Thanks.

I'm assuming you're using set with by a b; in combination with first.b. The reason first.b doesn't work in this case is that first.b is true for the first value of b inside each a group, and here there is only one b within each a.
This alternative should work: it retains the previous value of b and checks it on each observation.
data flagged (drop=prev_b);
set a;
retain prev_b;
if b ne prev_b then flag='Y';
output;
prev_b=b;
run;

You just need to use the NOTSORTED option on the BY statement so that SAS will set the FIRST. and LAST. flags as you want them.
data want ;
set a ;
by id b notsorted;
flag = first.b ;
run;
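
For anyone who wants to check the logic outside SAS, the same idea can be sketched in Python. This is only an illustration, assuming the rows are already in the order you want to keep; the names rows and flag_run_starts are made up for this example. itertools.groupby only groups adjacent equal keys, which is roughly what a BY statement with NOTSORTED does.

```python
from itertools import groupby

# Sample rows (id, a, b), in the order shown in the question.
rows = [
    ("ddd", 12, 1),
    ("ddd", 22, 1),
    ("ddd", 44, 2),
    ("ddd", 50, 1),
    ("ddd", 52, 1),
    ("ddd", 88, 2),
]


def flag_run_starts(rows):
    """Return one flag per row: 'Y' on the first row of each run of
    consecutive rows that share the same (id, b), '' otherwise."""
    flags = []
    # groupby merges only *adjacent* equal keys, so no sorting is needed.
    for _, run in groupby(rows, key=lambda r: (r[0], r[2])):
        run_length = len(list(run))
        flags.extend(["Y"] + [""] * (run_length - 1))
    return flags


flags = flag_run_starts(rows)
for row, flag in zip(rows, flags):
    print(*row, flag)
```

Because the key is (id, b), a new id also starts a new run, just as first.b turns on whenever an earlier BY variable changes.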


bash: add column if row name is repeated

I have a file with several variables in rows and values of these variables in columns. Some rows are repeated and only contain data for some of the columns (e.g. in the example below, the second time "A" appears, it only contains data in columns S1 and S2).
Example:
Variable S1 S2 S3
A 3 5 6
B 4 5 6
A some_string another_string
C 2 5 6
What I want is to add another (or several) columns that contain the data from the repeated row
Output example:
Variable S1 S2 S3 new_column1 new_column2
A 3 5 6 some_string another_string
B 4 5 6
C 2 5 6
I am thinking that something like the code below could get me there, but it's still erroneous and I'm not sure if it is even possible to do in bash?
My code would only be able to create ONE new column and I don't know how I can add the data to that new column.
I found those pieces of code in another question that was similar, but not quite what I want, so I would appreciate any help!
awk 'NR==1{$5="new_column";print;next} seen[$1]++ {$5=$2}' file
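
The question asks for bash, but the merging logic may be easier to see in a short Python sketch first. It assumes whitespace-separated fields, a header on the first line, and that the first occurrence of each row name is complete, so that a repeated name only carries extra values to append. The function name merge_repeated and the generated new_columnN headers are illustrative, not taken from any existing tool.

```python
def merge_repeated(lines):
    """Append the fields of repeated row names to their first occurrence."""
    header, *body = [line.split() for line in lines if line.strip()]
    merged = {}  # row name -> list of fields (dicts keep insertion order)
    for name, *fields in body:
        if name in merged:
            merged[name].extend(fields)  # repeated name: append its values
        else:
            merged[name] = list(fields)  # first time this name is seen
    # Number of extra columns needed beyond the original value columns.
    base = len(header) - 1
    extra = max([len(f) - base for f in merged.values()] + [0])
    header = header + [f"new_column{i}" for i in range(1, extra + 1)]
    return [" ".join(header)] + [" ".join([n] + f) for n, f in merged.items()]


example = [
    "Variable S1 S2 S3",
    "A 3 5 6",
    "B 4 5 6",
    "A some_string another_string",
    "C 2 5 6",
]
result = merge_repeated(example)
print("\n".join(result))
```

The number of new columns is worked out from the longest merged row, so it is not limited to a single extra column.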

Why do tabulate or summarize not take into account missing values when implemented inside a program?

As an illustrative example, suppose this is your dataset:
cat sex age
1 1 13
1 0 14
1 1 .
2 1 23
2 1 45
2 1 15
If you want to create a table of frequencies between cat and sex, you tabulate these two variables and you get the following result:
tab cat sex
| sex
cat | 0 1 | Total
-----------+----------------------+----------
1 | 1 2 | 3
2 | 0 3 | 3
-----------+----------------------+----------
Total | 1 5 | 6
I am writing a Stata program where the three variables are involved, i.e. cat, sex and age. Getting the matrix of frequencies for the first two variables is just an intermediate step that I need for further computation.
cap program drop myexample
program def myexample, rclass byable(recall) sortpreserve
version 14
syntax varlist [aweight iweight fweight] [if] [in] [ , AGgregate ]
args var1 var2 var3
tempname F
marksample touse
set more off
if "`aggregate'" == "" {
local var1: word 1 of `varlist'
local var2: word 2 of `varlist'
local var3: word 3 of `varlist'
qui: tab `var1' `var2' [`weight' `exp'] if `touse', matcell(`F') label matcol(`var2')
mat list `F'
}
end
However, when I run:
myexample cat sex age
I get this result which is not what I expected:
__000001[2,2]
c1 c2
r1 1 1
r2 0 3
That is, given that age contains a missing value, even if it is not directly involved in the tabulation, the program ignores the missing value and does not take into account that observation. I need to get the result of the first tabulation. I have tried using summarize instead, but the same problem arises. When implemented inside the program, missing values are not counted.
You are complaining about behaviour which you built into your own program. The responsibility and the explanation are in your hands.
The effect of
marksample touse
followed by calling up a command with the qualifier
if `touse'
is to ignore missing values. marksample by default marks as "to use" those observations in which all variables specified have non-missing values; the other observations are marked as to be ignored. It also takes account of any if or in qualifiers and any zero weights.
It's also true, as @Noobie explains, that omitting missing values from a tabulation is the default for tabulate in any case.
So, to get the result you want you'd need to modify your marksample call to
marksample touse, novarlist
and to call up tabulate with the missing option (if it's compulsory) or to allow users to specify a missing option which you then pass to tabulate.
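
To see the effect of the marker variable in isolation, here is a rough Python sketch of the same logic. It is not Stata and only imitates the one rule discussed here (dropping observations with a missing value in any listed variable); if, in and weights are ignored, and the function names mark_sample and crosstab are invented for this illustration.

```python
from collections import Counter

# The example data; None stands in for Stata's missing value (.).
data = [
    {"cat": 1, "sex": 1, "age": 13},
    {"cat": 1, "sex": 0, "age": 14},
    {"cat": 1, "sex": 1, "age": None},
    {"cat": 2, "sex": 1, "age": 23},
    {"cat": 2, "sex": 1, "age": 45},
    {"cat": 2, "sex": 1, "age": 15},
]


def mark_sample(rows, varlist, novarlist=False):
    """Return a 0/1 marker per row, loosely imitating marksample."""
    if novarlist:
        # Missing values in varlist no longer knock an observation out.
        return [1 for _ in rows]
    return [int(all(row[v] is not None for v in varlist)) for row in rows]


def crosstab(rows, touse, row_var, col_var):
    """Count (row_var, col_var) pairs over the rows marked as to use."""
    return Counter(
        (r[row_var], r[col_var]) for r, use in zip(rows, touse) if use
    )


varlist = ["cat", "sex", "age"]
default_counts = crosstab(data, mark_sample(data, varlist), "cat", "sex")
novarlist_counts = crosstab(
    data, mark_sample(data, varlist, novarlist=True), "cat", "sex"
)
print(dict(default_counts))
print(dict(novarlist_counts))
```

In this example cat and sex have no missing values, so keeping all observations is enough to reproduce the first table; if they did, the tabulation itself would still need to be told to count them.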
You also ask about summarize. By design that command ignores missing values. I don't know what you would expect summarize to do about them. It could report a count of missing values. If you want that, several other commands will oblige, such as codebook or missings (Stata Journal). You can always include a report on missings in your program, such as using count to count the missings and display the result.
I understand your program to be very much work in progress, so won't comment on details you don't ask about.
This is caused by marksample. Rule 5 in help mark states
The marker variable is set to 0 in observations for which any of the
numeric variables in varlist contain a numeric missing value.
You should use the novarlist option. According to the help file,
novarlist is for use with marksample. It specifies that missing values
among variables in varlist not cause the marker variable to be set to 0.
If I understand correctly, you want tab to include missing values? If so, you just have to ask for it:
tab myvar1 myvar2, mi
From the documentation:
missing : treat missing values like other values

Alpha numeric sorting in Crystal Report

I'm trying to sort a string field in crystal report that contains numbers and letters
I have:
21B
1
10
11B
33A
11
200
120C
11A
50
120A
1B
and I'd like to sort it like this: by the numeric part first, then by the letter
1
1B
10
11
11A
11B
21B
33A
50
120A
120C
200
I've tried
if length({Table.field}) = 1 then
"0" + {Table.field})
else if NumericText(right({Table.field}, 1)
then {Table.field}
else "0" + {Table.field}
but it doesn't give me the result I'm looking for
Try the following.
Create a formula @Sort containing:
val({Table.field})
Place the formula in the section where you placed the fields and suppress it. Then sort the records by that formula.
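
One caveat, assuming val() reads only the leading digits: values such as 1 and 1B then get the same number, so their relative order depends on a secondary sort on the field itself. The two-part key can be sketched in Python to check it against the expected order; this is an illustration of the idea, not Crystal syntax, and sort_key is a made-up name.

```python
import re

values = ["21B", "1", "10", "11B", "33A", "11", "200", "120C",
          "11A", "50", "120A", "1B"]


def sort_key(value):
    """Split a value into (leading number, remaining text)."""
    match = re.match(r"(\d*)(.*)", value)
    digits, suffix = match.group(1), match.group(2)
    # Values with no leading digits are treated as 0 (an assumption).
    return (int(digits) if digits else 0, suffix)


ordered = sorted(values, key=sort_key)
print(ordered)
```

In the report, the equivalent would be to sort on the numeric formula first and on the original string field second.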

Encode a string variable in non-alphanumeric order

I want to encode a string variable in such a way that assigned numerical codes respect the original order of string values (as shown when using browse). Why? I need encoded variable labels to get the correct variable names when using reshape wide.
Suppose var is a string variable with no labels:
var label(var)
"zoo" none
"abc" none
If you start with:
encode var, gen(var2)
the labels are 1="abc" 2="zoo" as can be seen with
label li
But I want the labels sorted as they come, as shown in browse for an unchanged order of variables later.
I didn't find an encode option in which the labels are added in the order I see when using browse.
My best idea is to do it by hand:
ssc install labutil
labvalch var, f(1 2) t(2 1)
This is nice, but I have >50 list entries.
Other approach: When using reshape use another order, but I don't think that works.
reshape wide x, i(id) j(var)
I only found
ssc install labutil
labmask regioncode, values(region)
as some alternative to encode but I'm not able to cope with strings using labmask.
First off, it's a rule in Stata that string variables can't have value labels. Only numeric variables can have value labels. In essence, what you want as value labels are already in your string variable as string values. So, the nub of the problem is that you need to create a numeric variable with values in the right order.
Let's solve the problem in its easiest form: string values occur once and once only. So
gen long order = _n
labmask order, values(var)
then solves the problem, as the numeric values 1, 2, ... are linked with the string values zoo, abc, whatever, which become value labels. Incidentally, a better reference for labmask, one of mine, is
http://www.stata-journal.com/sjpdf.html?articlenum=gr0034
Now let's make it more complicated. String values might occur once or more times, but we want the numeric variable to respect first occurrence in the data.
gen long order1 = _n
egen order2 = min(order1), by(var)
egen order = group(order2)
labmask order, values(var)
Here's how that works.
gen long order1 = _n
puts the observation numbers 1, 2, whatever in a new variable.
egen order2 = min(order1), by(var)
finds the first occurrence of each distinct value of var.
egen order = group(order2)
maps those numbers to 1, 2, whatever.
labmask order, values(var)
links the numeric values of order and the string values of var, which become its value labels.
Here is an example of how that works in practice.
. l, sep(0)
+---------------------------------+
| var order1 order2 order |
|---------------------------------|
1. | zoo 1 1 zoo |
2. | abc 2 2 abc |
3. | zoo 3 1 zoo |
4. | abc 4 2 abc |
5. | new 5 5 new |
6. | newer 6 6 newer |
+---------------------------------+
. l, nola sep(0)
+---------------------------------+
| var order1 order2 order |
|---------------------------------|
1. | zoo 1 1 1 |
2. | abc 2 2 2 |
3. | zoo 3 1 1 |
4. | abc 4 2 2 |
5. | new 5 5 3 |
6. | newer 6 6 4 |
+---------------------------------+
You would drop order1 order2 once you have got the right answer.
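
If it helps to see the mapping on its own, here is a small Python sketch of the same first-occurrence coding. It is not Stata, and the function name encode_by_first_occurrence is invented for this example; it uses the values from the listing above.

```python
def encode_by_first_occurrence(values):
    """Assign codes 1, 2, ... in order of first appearance.

    Returns the list of codes and a code -> label mapping, which plays
    the role of the value labels attached by labmask.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1  # next unused code
        encoded.append(codes[value])
    labels = {code: value for value, code in codes.items()}
    return encoded, labels


values = ["zoo", "abc", "zoo", "abc", "new", "newer"]
encoded, labels = encode_by_first_occurrence(values)
print(encoded)
print(labels)
```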
See also sencode for another solution. (search sencode to find references and download locations.)
The user-written command sencode (super encode) by Roger Newson, available from SSC (run ssc describe sencode), can be used for what you want. Instead of assigning numerical codes based on the alphanumeric order of the string variable, it can assign them in the order in which the values appear in the original dataset.
clear all
set more off
*------- example data ---------
input str10 var
abc
zoo
zoo
zoo
elephant
elephant
abc
abc
elephant
zoo
end
*------- encode ---------------
encode var, generate(var2)
sencode var, generate(var3)
list, separator(0)
list, separator(0) nolabel
The variable var3 is in the desired form. Contrast that with var2.
I'm not sure if there's an elegant solution, because I think that levelsof orders strings alphabetically.
As long as your list is unique this should work.
clear
input str3 myVar
"zoo"
"abc"
"def"
end
* for reshape
generate iVar = 1
generate jVar = _n
* reshape to wide
reshape wide myVar, i(iVar) j(jVar)
list
* create label
local i = 0
foreach v of varlist myVar* {
local ++i
local myVarName = `v'
label define myLabel `i' "`myVarName'", add
}
* reshape back to long
reshape long myVar, i(iVar) j(myVarEncoded)
* assign label
label value myVarEncoded myLabel

MDX Make a set of all cousins

Can anyone help me solve the following MDX-related problem?
I need to aggregate a value over a specific set of members.
This set consists of the current member and all its cousins (members at the same relative position under their parents as my current member) from the "uncles" that precede its parent.
Example :
AAA BBB CCC DDD EEE
123 123 123 123 123
If my current member is C3, my result set would be C3 + B3 + A3
Thanks in advance to the champ' that will find the solution to this !
You can use the Cousin function.
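
To make the intended set concrete, here is a rough Python sketch of the selection logic. It is not MDX; it assumes the layout suggested by the example (parents A to E, each with children 1 to 3), and the names hierarchy and cousins_up_to are invented for the illustration.

```python
# Assumed hierarchy: parents A..E, each with children <parent>1..<parent>3.
hierarchy = {p: [f"{p}{i}" for i in (1, 2, 3)] for p in "ABCDE"}


def cousins_up_to(member, hierarchy):
    """Return the member's cousins under all preceding parents,
    followed by the member itself."""
    parents = list(hierarchy)
    for parent_index, parent in enumerate(parents):
        if member in hierarchy[parent]:
            position = hierarchy[parent].index(member)
            break
    else:
        raise ValueError(f"unknown member: {member}")
    # Same relative position under each parent up to and including ours.
    return [
        hierarchy[p][position]
        for p in parents[: parent_index + 1]
        if position < len(hierarchy[p])
    ]


result = cousins_up_to("C3", hierarchy)
print(result)
```

In MDX terms, each element of that list is what a cousin lookup would return for one of the preceding parents, and the aggregate is then taken over the resulting set.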
