I have a question about data set preparation. In a survey, the same people were asked about a number of variables at two measurement points. This resulted in a dataset in long format, i.e. the information from each participant is stored in two rows, each row representing that person's data at the respective measurement point (see example). Each participant has an individual participation code, so the same participation code indicates that the data come from the same person.
code  time  risk_perception
DB6M  1     6
DB6M  2     4
TH4D  1     2
TH4D  2     3
Now I would like to create a new variable "risk_perception.complete", which shows whether the information for each participant is complete. A person may have given no information at either measurement point, or only at one of the two, so that values are missing (NAs). In the new variable I would like to check and code this information for each person: if the person has one or more NAs, a 0 should be coded; if the person has no NAs, a 1 (see example).
code  time  risk_perception  risk_perception.complete
DB6M  1     6                1
DB6M  2     4                1
TH4D  1     2                1
TH4D  2     3                1
SU6H  1     NA               0
SU6H  2     3                0
VG9S  1     NA               0
VG9S  2     NA               0
Can anyone tell me the best way to program this?
Here is a reproducible example:
data <- data.frame(
  code = c("AH6M","AH6M","BD7M","BD7M","SH9L","SH9L"),
  time = c(1,2,1,2,1,2),
  risk = c(6,7,NA,3,NA,NA))
Thank you in advance and best regards!
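One way to code the flag per person is base R's ave(); a minimal sketch using the reproducible data above (the column name risk.complete is chosen to match the risk column):

data$risk.complete <- ave(data$risk, data$code,
  FUN = function(x) as.integer(!anyNA(x)))  # 1 if the person has no NAs, else 0
data
#   code time risk risk.complete
# 1 AH6M    1    6             1
# 2 AH6M    2    7             1
# 3 BD7M    1   NA             0
# 4 BD7M    2    3             0
# 5 SH9L    1   NA             0
# 6 SH9L    2   NA             0

ave() applies the function within each code group and recycles the scalar result across that person's rows.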
I have a Google Sheet in which I have to calculate a moving average per 'ID' that averages the last 3 periods.
Any idea how to do it?
I include an example with the desired results (column "Mean Average (last 3)").
Regards!
ID value Mean Average (last 3)
1 12 12,00
1 19 12,00
1 19 15,50
1 18 16,67
1 13 18,67
2 11 11,00
2 18 11,00
2 15 14,50
2 17 14,67
2 11 16,67
3 11 11,00
3 16 11,00
3 10 13,50
3 11 12,33
I've got an answer that may work for you. Assuming that your sample data are in the range A4:C (see my sample sheet), try the following formula in column D, in the same row as your data headers.
={"Mean Avg";ArrayFormula(
IF(ROW(A4:A18)<ROW(A$4)+2,
C$4,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-1,0))),
B4:B19,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-2,0))),
B3:B18,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-3,0))),
(B2:B17+B3:B18)/2,
(B1:B16+B2:B17+B3:B18)/3)))))}
The first IF checks whether it is one of the first two data rows, to force the initial values.
The next IF checks whether the ID differs from the one in the row above and, if so, starts a new average using just the current row's value. The next IF checks whether this is the second row of an ID series (NOT EQual to the ID two rows up) and, if so, also uses the single value from the row above.
The next IF checks up three rows, and if the IDs are different, it averages the values from the two rows above.
Otherwise, this is the fourth data row in a series with the same ID, and the formula takes the values from the three rows above, and averages them.
Due to the offsets, it seems quite sensitive to ranges, so it may need some tuning if you move it.
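If the array version proves too fragile, a per-row alternative may be easier to maintain. A sketch, assuming headers in row 3, data from row 4 down, and rows grouped by ID: put this in D4 and fill it down.

=IF(COUNTIF(A$3:A3,A4)=0, B4, AVERAGE(OFFSET(B4, -MIN(3,COUNTIF(A$3:A3,A4)), 0, MIN(3,COUNTIF(A$3:A3,A4)), 1)))

COUNTIF counts how many earlier rows share the current ID, so the formula averages up to the last 3 previous values of the same ID, or returns the row's own value on the first row of an ID.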
Let me know if this helps.
I'd like to find any cases of a value (e.g., 0) in any cell in an SPSS database. What syntax would accomplish this?
(I came across a Python script, but I don't have that option.)
It is still not very clear how you want to select those cases, but the syntax below will list in the output any cases that have at least one "0" in any of the variables var1, var2, or var3. I am assuming CaseID is the case identifier variable.
TEMPORARY.
SELECT IF ANY(0,var1,var2,var3).
LIST CaseID var1 var2 var3.
You can use as many variables as you want in the ANY function, and also in the LIST command.
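If you would rather have a permanent flag variable than a temporary listing, ANY can also be used inside COMPUTE; a minimal sketch with the same variables (the name has_zero is mine):

COMPUTE has_zero = ANY(0,var1,var2,var3).
EXECUTE.

has_zero will be 1 for cases with at least one 0 and 0 otherwise (note that missing values among the variables can yield a missing result).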
The following syntax will create a list of the appearances of 0 within your data, in a separate file.
First, create some fake data to demonstrate on:
data list list/ID (a6) test1 to test6 (6f2).
begin data
ID_001 2 3 2 3 0 3
ID_002 3 4 0 4 3 4
ID_003 0 4 2 4 2 4
ID_004 7 0 1 2 8 3
ID_005 5 5 5 0 5 5
ID_006 4 5 4 5 4 0
end data.
dataset name origData.
Now to create the list:
dataset copy ForList.
dataset activate ForList. /* the list will be created from a copy of the data.
varstocases /make vals from test1 to test6/index testNum(vals).
select if vals=0.
You can use the list in the new file, or put it in the output window:
list ID testNum.
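When you are done with the listing, you can switch back to the original file, which was named origData above:

dataset activate origData.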
I have this kind of structure (ID and Event):
ID Event X
A 0 0
A 0 0
A 1 1
A 0 1
B 0 0
B 1 1
B 0 1
B 1 2
B 0 2
B 0 2
B 1 3
And I would like to create X, but I can't use loops, as the dataset is huge. I would appreciate any suggestions.
Edit: I tried various bysort combinations of ID and Event without luck; now I'm working with this approach:
gen Spell=Event
replace Spell=2 if Spell[_n-1]==1 & Spell[_n+1]==0 & ID[_n]==ID[_n-1]
but it's not going to work, since I can't discriminate between the second and the third (or later) event appearing in the dataset.
Solved
gen X=Event[_n]
replace X=X[_n]+X[_n-1] if _n>1 & ID[_n]==ID[_n-1]
Datasets like this need a time or other sequence variable. You should certainly create one if you don't have one:
sort ID, stable
by ID : gen t = _n
What you want is then just
bysort ID (t) : gen wanted = sum(Event)
which is cleaner and clearer than what you have. In Stata
help sum()
search by
search spell
to see relevant help files and expository articles.
(You are aware of this approach, but as you don't show precisely what you tried, we can't comment on what was wrong.)
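For concreteness, a minimal sketch that reproduces the question's example from scratch (the input block is mine, not the poster's):

clear
input str1 ID byte Event
"A" 0
"A" 0
"A" 1
"A" 0
"B" 0
"B" 1
"B" 0
"B" 1
"B" 0
"B" 0
"B" 1
end
sort ID, stable
by ID : gen t = _n
bysort ID (t) : gen wanted = sum(Event)
list ID Event wanted, sepby(ID)

wanted then reproduces the X column shown in the question.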
My problem is the following. I have a BIG file with many rows containing ordered numbers (repetitions are possible):
1
1.5
3
3.5
6
6
...
1504054
1504056
I would like to print all pairs of row numbers whose values differ by less than a given threshold thr. Say, for instance, thr=2.01; I want
0 1
0 2
1 2
1 3
2 3
4 5
...
N-1 N
I wrote something in Python, but the file is huge, and I think I need a smarter way to do this in bash.
Actually, the complete data structure also has a second column containing a string:
1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN
and, if easy to do, I would like each output row to contain the pair of linked strings, separated by "|":
s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN
Thanks for your help; I am not too familiar with bash.
In any language you can write a program implementing the following logic (here fleshed out as runnable Python reading from standard input):
import sys

thr = 2.01                      # threshold from the question
kept_rows = []                  # earlier rows still within thr of the current one
for line in sys.stdin:
    row = line.split()          # row[0] = number, row[1:] = optional label
    if not row:
        continue
    new_kept_rows = []
    for kr in kept_rows:
        if abs(float(kr[0]) - float(row[0])) <= thr:
            print("".join(kr[1:]) + "|" + "".join(row[1:]))
            new_kept_rows.append(kr)
    kept_rows = new_kept_rows
    kept_rows.append(row)       # the current row may pair with later rows
This program keeps only the few rows that could still match the condition; all others are freed from memory, so the memory footprint should remain small even for big files.
I would use the awk language because I'm comfortable with it, but Python fits too (the code above is runnable Python).
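Since the first column is sorted, a one-pass sliding window also works; here is a sketch in awk (the file name pairs.awk is my choice, and I assume two whitespace-separated columns):

# pairs.awk -- print "labelA|labelB" for every pair of rows whose
# numeric first fields differ by less than thr; input must be sorted.
BEGIN { lo = 1 }
{
    val[NR] = $1; lab[NR] = $2
    # slide the window start past rows that are now too far behind
    while (lo < NR && val[NR] - val[lo] >= thr) {
        delete val[lo]; delete lab[lo]
        lo++
    }
    for (i = lo; i < NR; i++)
        print lab[i] "|" lab[NR]
}

Run it as: awk -v thr=2.01 -f pairs.awk data.txt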
I have a panel data set, but not all individuals are present for all periods. When I run my xtreg, I see that there are between 1 and 4 observations per group, with a mean of 1.9. I'd like to include only those with 4 observations. Is there any way I can do this easily?
I understand that you want to include in your regression only those groups for which there are exactly 4 observations. If this is the case, then one solution is to count the number of observations per group and condition the regression using if:
clear all
set more off
webuse nlswork
xtset idcode
list idcode year in 1/50, sepby(idcode)
bysort idcode: gen counter = _N
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
c.tenure#c.tenure 2.race not_smsa south if counter == 12, be
In this example the regression is conditioned on groups with 12 observations. The xtreg command gives (among other things):
Number of obs = 1881
Number of groups = 158
which you can compare with the result of running the regression without the if:
Number of obs = 28091
Number of groups = 4697
As commented by @NickCox, if you don't mind losing observations, you can drop or keep the (un)desired groups:
bysort idcode: drop if _N != 4
or
bysort idcode: keep if _N == 4
followed by an unconditional xtreg (i.e. with no if).
Notice that both approaches count missing values too, so you may need to account for that.
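For instance, to count only the observations that are non-missing on the response in the example above (a sketch; the name counter2 is mine):

bysort idcode: egen counter2 = count(ln_w)

and then use counter2 in the if condition instead.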
On the other hand, you might want to think about why you want to discard that data in your analysis.