Referencing macro values by index - for-loop

I defined the macros below as levels of the variables id, var1 and var2:
levelsof id, local(id_lev) sep(,)
levelsof var1, local(var1_lev) sep(,)
levelsof var2, local(var2_lev) sep(,)
I'd like to be able to reference the level values stored in these macros by their index during foreach and forval loops. I'm learning how to use macros, so I'm not sure if this is possible.
When I try to access a single element of any of the above macros, every element of the macro is displayed. For example, if I display the first element of id_lev, all the elements are displayed run together as a single element (and the last element is reported as an invalid name, which I don't understand):
. di `id_lev'[1]
0524062407240824092601260226032604 invalid name
r(198);
Furthermore, if I attempt to refer to elements of any of the macros in a loop (examples of what I've tried given below), I receive the error that the third value of the list of levels is an invalid number.
foreach i of numlist 1/10 {
whatever `var1'[i] `var2'[i], gen(newvar)
}
forval i = 1/10 {
local var1_ `: word `i' of `var1''
local var2_ `: word `i' of `var2''
whatever `var1_' `var2_', gen(newvar)
}
Is it not possible to reference elements of a macro by its index?
Or am I referencing the index values incorrectly?
Update 1:
I've gotten everything to work (thank you), save for adapting the forval loop given in William's answer to my loops above in which I am trying to access the macros of two variables at the same index value.
Specifically, I want to call on the first, second, ..., last elements of var1 and var2 simultaneously so that I can use the elements in a loop to produce a new variable. How can I adapt the forval loop suggested by William to accomplish this?
Update 2:
I was able to adapt the code given by William below to create the functioning loop:
levelsof id, clean local(id_lev)
macro list _id_lev
local nid_lev : word count `id_lev'
levelsof var1, local(var1_lev)
macro list _var1_lev
local nvar1_lev : word count `var1_lev'
levelsof var2, local(var2_lev)
macro list _var2_lev
local nvar2_lev : word count `var2_lev'
forval i = 1/`nid_lev' {
local id : word `i' of `id_lev'
macro list _id
local v1 : word `i' of `var1_lev'
macro list _v1
local v2 : word `i' of `var2_lev'
macro list _v2
whatever `v1' `v2', gen(newvar)
}

You will benefit, as I mentioned in my closing remark on your previous question, from close study of section 18.3 of the Stata User's Guide PDF.
sysuse auto, clear
tab rep78, missing
levelsof rep78, missing local(replvl)
macro list _replvl
local numlvl : word count `replvl'
macro list _numlvl
forval i = 1/`numlvl' {
local level : word `i' of `replvl'
macro list _level
display `level'+1000
}
yields
. sysuse auto, clear
(1978 Automobile Data)
. tab rep78, missing
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00
. levelsof rep78, missing local(replvl)
1 2 3 4 5 .
. macro list _replvl
_replvl: 1 2 3 4 5 .
. local numlvl : word count `replvl'
. macro list _numlvl
_numlvl: 6
. forval i = 1/`numlvl' {
2. local level : word `i' of `replvl'
3. macro list _level
4. display `level'+1000
5. }
_level: 1
1001
_level: 2
1002
_level: 3
1003
_level: 4
1004
_level: 5
1005
_level: .
.

Related

spss search for value in dataset

I'd like to find any cases of a value (e.g., 0) in any cell in an SPSS database. What syntax would accomplish this?
(I came across a python script but don't have that option.)
It is still not very clear how you want to select those cases, but the syntax below will list in the output any cases that have at least one "0" in any of the variables var1, var2 or var3. I am assuming CaseID is the case identifier variable.
TEMPORARY.
SELECT IF ANY(0,var1,var2,var3).
LIST CaseID var1 var2 var3.
You can use as many variables as you want in the ANY function, and also on the LIST command.
The following syntax will create a list of the appearances of 0 within your data, in a separate file.
First, create some fake data to demonstrate on:
data list list/ID (a6) test1 to test6 (6f2).
begin data
ID_001 2 3 2 3 0 3
ID_002 3 4 0 4 3 4
ID_003 0 4 2 4 2 4
ID_004 7 0 1 2 8 3
ID_005 5 5 5 0 5 5
ID_006 4 5 4 5 4 0
end data.
dataset name origData.
Now to create the list:
dataset copy ForList.
dataset activate ForList. /* the list will be created from a copy of the data.
varstocases /make vals from test1 to test6/index testNum(vals).
select if vals=0.
You can use the list in the new file, or put it in the output window:
list ID testNum.

Why do tabulate or summarize not take into account missing values when implemented inside a program?

As an illustrative example, suppose this is your dataset:
cat sex age
1 1 13
1 0 14
1 1 .
2 1 23
2 1 45
2 1 15
If you want to create a table of frequencies between cat and sex, you tabulate these two variables and you get the following result:
tab cat sex
           |          sex
       cat |         0          1 |     Total
-----------+----------------------+----------
         1 |         1          2 |         3
         2 |         0          3 |         3
-----------+----------------------+----------
     Total |         1          5 |         6
I am writing a Stata program where the three variables are involved, i.e. cat, sex and age. Getting the matrix of frequencies for the first two variables is just an intermediate step that I need for further computation.
cap program drop myexample
program def myexample, rclass byable(recall) sortpreserve
version 14
syntax varlist [aweight iweight fweight] [if] [in] [ , AGgregate ]
args var1 var2 var3
tempname F
marksample touse
set more off
if "`aggregate'" == "" {
local var1: word 1 of `varlist'
local var2: word 2 of `varlist'
local var3: word 3 of `varlist'
qui: tab `var1' `var2' [`weight' `exp'] if `touse', matcell(`F') label matcol(`var2')
mat list `F'
}
end
However, when I run:
myexample cat sex age
I get this result which is not what I expected:
__000001[2,2]
c1 c2
r1 1 1
r2 0 3
That is, given that age contains a missing value, even if it is not directly involved in the tabulation, the program ignores the missing value and does not take into account that observation. I need to get the result of the first tabulation. I have tried using summarize instead, but the same problem arises. When implemented inside the program, missing values are not counted.
You are complaining about behaviour which you built into your own program. The responsibility and the explanation are in your hands.
The effect of
marksample touse
followed by calling up a command with the qualifier
if `touse'
is to ignore missing values. marksample by default marks as "to use" those observations in which all variables specified have non-missing values; the other observations are marked as to be ignored. It also takes account of any if or in qualifiers and any zero weights.
It's also true, as @Noobie explains, that omitting missing values from a tabulation is the default for tabulate in any case.
So, to get the result you want you'd need to modify your marksample call to
marksample touse, novarlist
and to call up tabulate with the missing option (if it's compulsory) or to allow users to specify a missing option which you then pass to tabulate.
You also ask about summarize. By design that command ignores missing values. I don't know what you would expect summarize to do about them. It could report a count of missing values. If you want that, several other commands will oblige, such as codebook or missings (Stata Journal). You can always include a report on missings in your program, such as using count to count the missings and display the result.
I understand your program to be very much work in progress, so won't comment on details you don't ask about.
This is caused by marksample. Rule 5 in help mark states
The marker variable is set to 0 in observations for which any of the
numeric variables in varlist contain a numeric missing value.
You should use the novarlist option. According to the help file,
novarlist is for use with marksample. It specifies that missing values
among variables in varlist not cause the marker variable to be set to 0.
If I understand correctly, you want tab to include missing values? If so, you just have to ask for it:
tab myvar1 myvar2, mi
From the documentation:
missing : treat missing values like other values

Drop all obs of group if condition is met

Suppose I have the following panel data (I didn't include a time variable, for simplicity):
clear
input id var
1 .
1 0
1 0
1 .
2 .
2 .
2 .
2 .
3 1
3 .
3 .
3 0
end
I would like to delete all groups that have all missing data in their group, that is, I want my data to be like:
id var
1 .
1 0
1 0
1 .
3 1
3 .
3 .
3 0
I tried doing a gen todrop = var[_N], but for some reason, for some groups it doesn't work. Any thoughts? I thought about sorting id var, then doing a cascade replace, but I'm sure there is a better way to do this.
In general, you can verify whether all observations hold the same value by checking first and last observations in each panel, after appropriate sorting. The same principle applies here. I'll use the missing() function:
clear
set more off
input id myvar
1 .
1 0
1 0
1 .
2 .
2 .
2 .
2 .
3 1
3 .
3 .
3 0
end
bysort id (myvar) : gen todrop = missing(myvar[1]) & missing(myvar[_N])
list, sepby(id)
In this case, just checking the first one also works. If it's missing, all others are.
See help by.
Roberto has provided a solution that is, however, case-specific and might lead to the wrong outcome.
In fact, suppose you have an observation as follows:
id myvar
2 .
2 1
2 .
Using Roberto's code, you would remove this group, while in the question you need to remove only if all observations are missing.
Therefore I suggest you use a different approach, as follows:
levelsof id, local(groups) // collects the distinct values of id (no need to egen if you don't really have to)
foreach iter of local groups {
mdesc myvar if id == `iter' // mdesc is user-written; put double quotes around `iter' if id is a string
drop if id == `iter' & r(percent) == 100 // r(percent) is stored after mdesc
}
Roberto's code definitely works. So does the code below; its only contribution is that the original sort order of the observations is kept, in case you want that.
egen todrop2 = min(missing(myvar)), by(id)

Encode a string variable in non-alphanumeric order

I want to encode a string variable in such a way that assigned numerical codes respect the original order of string values (as shown when using browse). Why? I need encoded variable labels to get the correct variable names when using reshape wide.
Suppose var is a string variable with no labels:
var label(var)
"zoo" none
"abc" none
If you start with:
encode var, gen(var2)
the labels are 1="abc" 2="zoo" as can be seen with
label li
But I want the labels sorted as they come, as shown in browse for an unchanged order of variables later.
I didn't find an encode option in which the labels are added in the order I see when using browse.
My best idea is to do it by hand:
ssc install labutil
labvalch var, f(1 2) t(2 1)
This is nice, but I have >50 list entries.
Other approach: When using reshape use another order, but I don't think that works.
reshape wide x, i(id) j(var)
I only found
ssc install labutil
labmask regioncode, values(region)
as some alternative to encode but I'm not able to cope with strings using labmask.
First off, it's a rule in Stata that string variables can't have value labels. Only numeric variables can have value labels. In essence, what you want as value labels are already in your string variable as string values. So, the nub of the problem is that you need to create a numeric variable with values in the right order.
Let's solve the problem in its easiest form: string values occur once and once only. So
gen long order = _n
labmask order, values(var)
then solves the problem, as the numeric values 1, 2, ... are linked with the string values zoo, abc, whatever, which become value labels. Incidentally, a better reference for labmask, one of mine, is
http://www.stata-journal.com/sjpdf.html?articlenum=gr0034
Now let's make it more complicated. String values might occur once or more times, but we want the numeric variable to respect first occurrence in the data.
gen long order1 = _n
egen order2 = min(order1), by(var)
egen order = group(order2)
labmask order, values(var)
Here's how that works.
gen long order1 = _n
puts the observation numbers 1, 2, whatever in a new variable.
egen order2 = min(order1), by(var)
finds the first occurrence of each distinct value of var.
egen order = group(order2)
maps those numbers to 1, 2, whatever.
labmask order, values(var)
links the numeric values of order and the string values of var, which become its value labels.
Here is an example of how that works in practice.
. l, sep(0)
     +---------------------------------+
     |   var   order1   order2   order |
     |---------------------------------|
  1. |   zoo        1        1     zoo |
  2. |   abc        2        2     abc |
  3. |   zoo        3        1     zoo |
  4. |   abc        4        2     abc |
  5. |   new        5        5     new |
  6. | newer        6        6   newer |
     +---------------------------------+
. l, nola sep(0)
     +---------------------------------+
     |   var   order1   order2   order |
     |---------------------------------|
  1. |   zoo        1        1       1 |
  2. |   abc        2        2       2 |
  3. |   zoo        3        1       1 |
  4. |   abc        4        2       2 |
  5. |   new        5        5       3 |
  6. | newer        6        6       4 |
     +---------------------------------+
You would drop order1 order2 once you have got the right answer.
See also sencode for another solution. (search sencode to find references and download locations.)
The user-written command sencode (super encode) by Roger Newson, available from SSC (run ssc describe sencode), can be used for what you want. Instead of assigning numerical codes based on the alphanumeric order of the string variable, it can assign them in the order in which the values appear in the original dataset.
clear all
set more off
*------- example data ---------
input str10 var
abc
zoo
zoo
zoo
elephant
elephant
abc
abc
elephant
zoo
end
*------- encode ---------------
encode var, generate(var2)
sencode var, generate(var3)
list, separator(0)
list, separator(0) nolabel
The variable var3 is in the desired form. Contrast that with var2.
I'm not sure if there's an elegant solution, because I think that levelsof orders strings alphabetically.
As long as your list is unique this should work.
clear
input str3 myVar
"zoo"
"abc"
"def"
end
* for reshape
generate iVar = 1
generate jVar = _n
* reshape to wide
reshape wide myVar, i(iVar) j(jVar)
list
* create label
local i = 0
foreach v of varlist myVar* {
local ++i
local myVarName = `v'
label define myLabel `i' "`myVarName'", add
}
* reshape back to long
reshape long myVar, i(iVar) j(myVarEncoded)
* assign label
label value myVarEncoded myLabel

Group and Count an Array of Structs

Ruby noob here!
I have an array of structs that look like this
Token = Struct.new(:token, :ordinal)
So an array of these would look like this, in tabular form:
Token | Ordinal
---------------
C | 2
CC | 3
C | 5
And I want to group by the "token" (i.e. the left hand column) of the struct and get a count, but also preserve the "ordinal" element. So the above would look like this
Token | Merged Ordinal | Count
------------------------------
C | 2, 5 | 2
CC | 3 | 1
Notice that the last column is a count of the grouped tokens and the middle column merges the "ordinal". The first column ("Token") can contain a variable number of characters, and I want to group on these.
I have tried various methods, using group_by (I can get the count, but not the middle column), inject, iterating (does not seem very functional) but I just can't get it right, partly because I don't have a good grasp of Ruby and the available operations / functions.
I have also had a good look around SO, but I am not getting very far.
Any help, pointers would be much appreciated!
Use Enumerable#group_by to do the grouping for you and use the resulting hash to get what you want with map or similar.
structs.group_by(&:token).map do |token, with_same_token|
[token, with_same_token.map(&:ordinal), with_same_token.size]
end
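To make the group_by suggestion above concrete, here is a minimal, self-contained sketch using the three rows from the question's table. The variable names tokens, group and summary are only illustrative and are not from the original post; the printing loop is just one way to show the result.

```ruby
Token = Struct.new(:token, :ordinal)

# Sample data taken from the table in the question.
tokens = [
  Token.new("C", 2),
  Token.new("CC", 3),
  Token.new("C", 5)
]

# Group the structs by their token, then build one row per group:
# the token, the ordinals merged into an array, and the group size.
summary = tokens.group_by(&:token).map do |token, group|
  [token, group.map(&:ordinal), group.size]
end

# Print the rows in the same layout as the desired table.
summary.each do |token, ordinals, count|
  puts "#{token} | #{ordinals.join(', ')} | #{count}"
end
```

Because group_by returns a hash keyed by token, the groups appear in order of each token's first occurrence, and the ordinals within a group keep their original order.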
