Extracting the value labels of a categorical variable


I have a categorical variable comprised of 12 levels with numerical values from 1 to 12.
Each of these numerical values is assigned a label. For example, 1 = heart, 2 = brain, 3 = liver and so on. What I would like to do is extract the labels (heart, brain, liver) and place them into a local macro. Is this possible?
I have tried lots of different commands such as describe and codebook.
I have also tried the following:
levelsof var, local(diseases)
The above code gets the levels of the categorical variable var and stores them in the local macro diseases. However this only outputs the numerical values, that is 1,2,3,4, not the labels.

Below is a flexible solution relying on macro extended functions:
sysuse auto, clear
levelsof foreign, local(levels)
local lab : value label foreign
foreach l of local levels {
    local all `all' `: label `lab' `l''
}
display "`all'"
Domestic Foreign
If you also want to keep the numerical values change the loop as follows:
foreach l of local levels {
    local all `all' `l' `: label `lab' `l''
}
display "`all'"
0 Domestic 1 Foreign

The decode command is also helpful for this issue:
decode var, generate(labvar)
levelsof labvar, local(diseases) clean

Related

Performance: Replacing Series values with keys from a Dictionary in Python

I have a data series that contains various names of the same organizations. I want to harmonize these names to a given standard using a mapping dictionary. I am currently using a nested for loop to iterate through each series element, and if the element is within the dictionary's values, I update the series value with the dictionary key.
# For example, corporation_series is:
0 'Corp1'
1 'Corp-1'
2 'Corp 1'
3 'Corp2'
4 'Corp--2'
dtype: object
# Dictionary is:
mapping_dict = {
'Corporation_1': ['Corp1', 'Corp-1', 'Corp 1'],
'Corporation_2': ['Corp2', 'Corp--2'],
}
# I use this logic to replace the values in the series
for index, value in corporation_series.items():
    for key, variants in mapping_dict.items():
        if value in variants:
            corporation_series = corporation_series.replace(value, key)
So, if the series has a value of 'Corp1' and it exists in the dictionary's values, the logic replaces it with the corresponding key. However, this is an extremely expensive method. Could someone recommend a better way of doing this operation? Much appreciated.

I found a solution using the pandas Series.map method. In order to use .map, I had to invert my dictionary:
# Inverted dict: each variant maps to its standard name
inverted_dict = {
    'Corp1': 'Corporation_1',
    'Corp-1': 'Corporation_1',
    'Corp 1': 'Corporation_1',
    'Corp2': 'Corporation_2',
    'Corp--2': 'Corporation_2',
}
# use .map
corporation_series = corporation_series.map(inverted_dict)
Instead of 5 minutes of processing, it took around 5 s. While this works, I am sure there are better solutions out there. Any suggestions would be most welcome.
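The inversion described above does not have to be typed by hand. The following is a minimal sketch, in plain Python, of building the variant-to-standard dictionary from the original mapping_dict with a comprehension. The names inverted and names are illustrative only, and a plain list stands in for the series.

```python
# Original mapping: standard name -> list of known variants.
mapping_dict = {
    "Corporation_1": ["Corp1", "Corp-1", "Corp 1"],
    "Corporation_2": ["Corp2", "Corp--2"],
}

# Invert it once: variant -> standard name (one entry per variant).
inverted = {
    variant: standard
    for standard, variants in mapping_dict.items()
    for variant in variants
}

# Plain-Python stand-in for the series values (illustrative data).
names = ["Corp1", "Corp-1", "Corp 1", "Corp2", "Corp--2", "Unknown"]

# Each lookup is a single dictionary access; unmapped names are kept as-is.
harmonized = [inverted.get(name, name) for name in names]
print(harmonized)
```

With pandas, the same dictionary could be passed as corporation_series.map(inverted). Values that are not keys of the dictionary become NaN there, so chaining .fillna(corporation_series) is one way to keep them unchanged.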

gnuplot : variable paths to data file in a for loop

I would like to plot multiple curves on the same graph using a for loop. Each data file (named stat_coupe) is located in a different folder (fwal055wal055/rep16/ and fwal055wal055_c2/rep20/). fwal055wal055 and fwal055wal055_c2 are the names of the simulations. First, I need to get a previous result, a single number (Utau), from other files (named file_fwal055wal055 and file_fwal055wal055_c2). This is done successfully with awk. The result depends on the file: Utaufwal055wal055=10.5 and Utaufwal055wal055_c2=12.2.
Then I need to divide the 1st column of the file stat_coupe in the path fwal055wal055/rep16/ by the value of Utaufwal055wal055, and do the same for the file stat_coupe in the path fwal055wal055_c2/rep20/ with the value of Utaufwal055wal055_c2. Moreover, each plot should have a specific format which depends on the type of simulation run (fwal055wal055 or fwal055wal055_c2).
The problem presented here is reduced to 2 simulations (fwal055wal055 and fwal055wal055_c2) and 1 plot, but I have about 20 simulations and 15 different graphs to plot, which is why I would like to use a for loop.
To summarize, at each iteration I have:
a specific format,
a specific path,
a specific value of Utau
I want to specify the right format, path and value of Utau at each iteration of the for loop. The solution I propose below successfully obtains the value of Utau for each simulation, but the code @path_.i and @format_.i does not work.
#!/bin/bash
for elem in fwal055wal055 fwal055wal055_c2;
do
Utau[${elem}]=$(awk 'FNR==5{print $1}' file_$elem)
done
gnuplot -persist <<-EOFMarker
format_fwal055wal055='pt 1 ps 1.0 lc 0 title "WALE"'
format_fwal055wal055_c2='pt 2 ps 1.0 lc 0 title "WALE c2"'
path_fwal055wal055='"fwal055wal055/rep16/stat_coupe"'
path_fwal055wal055_c2='"fwal055wal055_c2/rep20/stat_coupe"'
list="fwal055wal055 fwal055wal055_c2"
plot for [i in list] @path_.i u 1:(\$2/${Utau[${i}]}) @format_.i
EOFMarker
I would like to obtain something equivalent to:
plot @path_fwal055wal055 u 1:(\$2/${Utau[${i}]}) @format_fwal055wal055,\
@path_fwal055wal055_c2 u 1:(\$2/${Utau[${i}]}) @format_fwal055wal055_c2
Can someone help me to solve this issue ?
Thank you very much,
Martin

Check help sprintf, help words and help word.
I would create two strings with the same number of items and then combine them with sprintf(). From gnuplot 5.2 on you could also do it with arrays.
# Version 1
PATHS = '"fwal055wal055/rep16/stat_coupe" "fwal055wal055_c2/rep20/stat_coupe"'
FILES = "fwal055wal055 fwal055wal055_c2"
plot for [i=1:words(FILES)] sprintf("%s_%s",word(PATHS,i),word(FILES,i)) u 1:2
or you could define a function for your filenames to keep the plot command short and readable.
# Version 2
PATHS = '"rep16/stat_coupe" "rep20/stat_coupe"'
FILES = "fwal055wal055 fwal055wal055_c2"
myFilename(i) = sprintf("%s/%s_%s",word(FILES,i),word(PATHS,i),word(FILES,i))
plot for [i=1:words(FILES)] myFilename(i) u 1:2
Addition (after some clarifications...)
If I understand your question now correctly, the following code should do the job.
To extract the UTAUS, you run a separate loop before plotting and store the extracted values in a string. During plotting you get these values back via word(UTAUS,i). Since you perform the mathematical operation column(2)/word(UTAUS,i), gnuplot will interpret them as numbers. Check help words, help word, help sprintf, help every.
Code:
### extract and normalize in a loop with individual files and directories
reset session
FILES = 'fwal055wal055 fwal055wal055_c2'
DIRS = 'rep16 rep20'
TITLES = '"WALE" "WALE c2"' # if you have spaces you need to put it into double quotes
UTAUS = ''
# define functions for better readability
myExtractionFile(i) = sprintf("file_%s",word(FILES,i))
myDataFile(i) = sprintf("%s/%s/stat_coupe",word(FILES,i),word(DIRS,i))
myTitle(i) = word(TITLES,i)
# define point or line appearance. Add more if you have more files
set style line 1 pt 1 ps 1.0 lc 0
set style line 2 pt 2 ps 1.0 lc 1
# extract the UTAUs
do for [i=1:words(FILES)] {
    set table $Dummy
    plot myExtractionFile(i) u (utau=$1) every ::4::4 w table # extract value row 5, column 1 (not counting header lines)
    unset table
    UTAUS = UTAUS.sprintf(" %g",utau) # append the extracted value as string
}
plot for [i=1:words(FILES)] myDataFile(i) u 1:(column(2)/word(UTAUS,i)) ls i title myTitle(i)
### end of code

Change All Value Labels to Numerics in SPSS

I need to change all the value labels of all the variables in my SPSS file so that each label is the value itself.
I first tried:
Value Labels ALL.
EXECUTE.
This removes the value labels, but also removes the value entirely. I need each value to have a label of some sort because I am converting the file, and when there are no values defined it turns the value into a numeric. Therefore, I need all the value labels changed into numbers so that each value's label is just the value: if value = 1 then label = 1.
Any ideas on how to do this across all my variables?
Thanks in advance!!

Here is a solution to get you started:
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
begin program.
import spss, spssaux, spssdata
spss.Submit("set mprint on.")
vd = spssaux.VariableDict(variableType="numeric")
for v in vd:
    # collect the distinct values observed for this variable
    allvalues = list(set(item[0] for item in spssdata.Spssdata(v.VariableName, names=False).fetchall()))
    if allvalues:
        # build one VALUE LABELS command that labels each value with itself
        cmd = "value labels " + v.VariableName + "\n".join([" %(i)s '%(i)s'" % locals() for i in allvalues if i is not None]) + "."
        spss.Submit(cmd)
spss.Submit("set mprint off.")
end program.
You may want to read this to understand the behaviour of fetchall in reading date variables (or simply exclude date variables from having their values labelled also, if they cause no problems?)
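The string-building step in the loop above can be tried outside SPSS. The sketch below is a hypothetical helper (value_labels_command is not part of the spss modules) that builds the same kind of VALUE LABELS command from a list of values. It assumes numeric values arrive as Python floats, so it prints whole numbers without the trailing .0; that formatting choice is an assumption about what the labels should look like, not something the answer specifies.

```python
def value_labels_command(var_name, values):
    """Build a VALUE LABELS command that labels each value with itself.

    Hypothetical helper for illustration; returns None if there is
    nothing to label.
    """
    pairs = []
    for v in sorted(x for x in set(values) if x is not None):
        # Assumption: whole-number floats such as 1.0 should be shown as 1.
        text = str(int(v)) if float(v).is_integer() else str(v)
        pairs.append(" %s '%s'" % (text, text))
    if not pairs:
        return None
    return "value labels " + var_name + "\n".join(pairs) + "."


# Illustrative values, including a duplicate and a missing value.
print(value_labels_command("jobcat", [1.0, 2.0, None, 3.0, 1.0]))
```

Inside the SPSS program block, the returned string could be handed to spss.Submit in place of cmd.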

Why do tabulate or summarize not take into account missing values when implemented inside a program?

As an illustrative example, suppose this is your dataset:
cat  sex  age
  1    1   13
  1    0   14
  1    1    .
  2    1   23
  2    1   45
  2    1   15
If you want to create a table of frequencies between cat and sex, you tabulate these two variables and you get the following result:
tab cat sex
           |          sex
       cat |         0          1 |     Total
-----------+----------------------+----------
         1 |         1          2 |         3
         2 |         0          3 |         3
-----------+----------------------+----------
     Total |         1          5 |         6
I am writing a Stata program where the three variables are involved, i.e. cat, sex and age. Getting the matrix of frequencies for the first two variables is just an intermediate step that I need for further computation.
cap program drop myexample
program def myexample, rclass byable(recall) sortpreserve
    version 14
    syntax varlist [aweight iweight fweight] [if] [in] [ , AGgregate ]
    args var1 var2 var3
    tempname F
    marksample touse
    set more off
    if "`aggregate'" == "" {
        local var1: word 1 of `varlist'
        local var2: word 2 of `varlist'
        local var3: word 3 of `varlist'
        qui: tab `var1' `var2' [`weight' `exp'] if `touse', matcell(`F') label matcol(`var2')
        mat list `F'
    }
end
However, when I run:
myexample cat sex age
I get this result which is not what I expected:
__000001[2,2]
c1 c2
r1 1 1
r2 0 3
That is, given that age contains a missing value, even if it is not directly involved in the tabulation, the program ignores the missing value and does not take into account that observation. I need to get the result of the first tabulation. I have tried using summarize instead, but the same problem arises. When implemented inside the program, missing values are not counted.

You are complaining about behaviour which you built into your own program. The responsibility and the explanation are in your hands.
The effect of
marksample touse
followed by calling up a command with the qualifier
if `touse'
is to ignore missing values. marksample by default marks as "to use" those observations in which all variables specified have non-missing values; the other observations are marked as to be ignored. It also takes account of any if or in qualifiers and any zero weights.
It's also true, as @Noobie explains, that omitting missing values from a tabulation is the default for tabulate in any case.
So, to get the result you want you'd need to modify your marksample call to
marksample touse, novarlist
and to call up tabulate with the missing option (if it's compulsory) or to allow users to specify a missing option which you then pass to tabulate.
You also ask about summarize. By design that command ignores missing values. I don't know what you would expect summarize to do about them. It could report a count of missing values. If you want that, several other commands will oblige, such as codebook or missings (Stata Journal). You can always include a report on missings in your program, such as using count to count the missings and display the result.
I understand your program to be very much work in progress, so won't comment on details you don't ask about.
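The marking rule described above can be illustrated outside Stata. The following is only a sketch in plain Python of the logic, not of how marksample is implemented: mark_sample and crosstab are hypothetical names, None stands in for Stata's missing value, and if/in qualifiers and weights are ignored.

```python
# The example dataset; None stands for Stata's missing value (.).
rows = [
    {"cat": 1, "sex": 1, "age": 13},
    {"cat": 1, "sex": 0, "age": 14},
    {"cat": 1, "sex": 1, "age": None},
    {"cat": 2, "sex": 1, "age": 23},
    {"cat": 2, "sex": 1, "age": 45},
    {"cat": 2, "sex": 1, "age": 15},
]


def mark_sample(rows, varlist, novarlist=False):
    """Return a 0/1 marker per row, loosely mimicking marksample."""
    if novarlist:
        # Missing values in varlist do not exclude the observation.
        return [1 for _ in rows]
    # Default: a row is used only if every variable in varlist is non-missing.
    return [int(all(r[v] is not None for v in varlist)) for r in rows]


def crosstab(rows, touse, row_var, col_var):
    """Frequency matrix of row_var by col_var over the marked rows."""
    used = [r for r, keep in zip(rows, touse) if keep]
    row_levels = sorted({r[row_var] for r in rows})
    col_levels = sorted({r[col_var] for r in rows})
    return [[sum(1 for r in used if r[row_var] == a and r[col_var] == b)
             for b in col_levels] for a in row_levels]


varlist = ["cat", "sex", "age"]
default = crosstab(rows, mark_sample(rows, varlist), "cat", "sex")
relaxed = crosstab(rows, mark_sample(rows, varlist, novarlist=True), "cat", "sex")
print(default)  # [[1, 1], [0, 3]]
print(relaxed)  # [[1, 2], [0, 3]]
```

The first matrix reproduces the unexpected output in the question, because the observation with missing age is marked out even though age is not tabulated. The second matches the plain tab cat sex result. In this example cat and sex have no missing values, so the missing option of tabulate does not come into play.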

This is caused by marksample. Rule 5 in help mark states
The marker variable is set to 0 in observations for which any of the
numeric variables in varlist contain a numeric missing value.
You should use the novarlist option. According to the help file,
novarlist is for use with marksample. It specifies that missing values
among variables in varlist not cause the marker variable to be set to 0.

If I understand correctly, you want tab to include missing values? If so, you just have to ask for it:
tab myvar1 myvar2, mi
from the documentation
missing : treat missing values like other values

creating a "for" loop in Stata that assigns a different label to multiple variables

I'm currently using Stata 13.1 to examine a long list of float variables (e.g., A1 - A60). Each of these variables represents the frequency of a different medical symptom (e.g., "Insomnia", "Anxiety", "Nausea"). I'd like to add labels to each variable to make data analysis a bit easier, but would prefer something more elegant than:
label var A1 "Insomnia"
label var A2 "Anxiety"
.
.
.
label var A60 "Nausea"
Any suggestions are very much appreciated!

First, you need to store the labels somewhere. You can use a local macro for that. Below is an example with variables that follow a naming pattern (as your example does).
clear
set more off
*----- example data -----
gen A1 = .
gen A2 = .
gen A3 = .
*----- what you want -----
local mylabels "Insomnia Anxiety Nausea"
local n: word count `mylabels'
forvalues i = 1/`n' {
    label variable A`i' `:word `i' of `mylabels''
}
describe
The looping over parallel lists technique is from: http://www.stata.com/support/faqs/programming/looping-over-parallel-lists/.
See also help macro and help extended_fcn.
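If the 60 labels already exist as a list outside Stata (for instance, copied from a spreadsheet column), another option is to generate the label variable lines and paste them into a do-file. This is only a sketch of that idea: label_commands is a hypothetical helper, and it assumes the variables are numbered consecutively from 1 with a common prefix, as in the question.

```python
def label_commands(prefix, labels):
    """Return Stata 'label variable' lines for prefix1, prefix2, ...

    Hypothetical helper; assumes labels contain no double quotes.
    """
    lines = []
    for i, text in enumerate(labels, start=1):
        # Quote the label so multi-word labels stay in one piece.
        lines.append('label variable %s%d "%s"' % (prefix, i, text))
    return lines


# Illustrative labels taken from the question.
labels = ["Insomnia", "Anxiety", "Nausea"]
for line in label_commands("A", labels):
    print(line)
```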
