Variable recording event incrementally (no loops) - matrix

I have this kind of structure (ID and Event) :
ID Event X
A 0 0
A 0 0
A 1 1
A 0 1
B 0 0
B 1 1
B 0 1
B 1 2
B 0 2
B 0 2
B 1 3
I would like to create X, a running count of events within each ID, but I can't use any loops because the dataset is huge. I would appreciate any suggestions.
Edit: I tried several variants of bysort ID and Event without luck. I'm now working with this approach:
gen Spell=Event
replace Spell=2 if Spell[_n-1]==1 & Spell[_n+1]==0 & ID[_n]==ID[_n-1]
but it's not going to work, since I can't distinguish between the second, third, or later events in the dataset.
Solved
gen X=Event[_n]
replace X=X[_n]+X[_n-1] if _n>1 & ID[_n]==ID[_n-1]

Datasets like this need a time or other sequence variable. You should certainly create one if you don't have one:
sort ID, stable
by ID : gen t = _n
What you want is then just
bysort ID (t) : gen wanted = sum(Event)
which is cleaner and clearer than what you have. In Stata, type
help sum()
search by
search spell
to see relevant help files and expository articles.
(You are aware of this approach, but as you don't show precisely what you tried, we can't comment on what was wrong.)
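For readers outside Stata, the same idea (a cumulative sum that restarts for each ID) can be sketched in plain Python. This is only an illustration: the function name running_event_count is made up here, and the rows are assumed to be already in sequence order within each ID, which is what the t variable guarantees above.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Rows mirror the question's data as (ID, Event), assumed already in sequence order.
rows = [("A", 0), ("A", 0), ("A", 1), ("A", 0),
        ("B", 0), ("B", 1), ("B", 0), ("B", 1), ("B", 0), ("B", 0), ("B", 1)]

def running_event_count(rows):
    """Return the cumulative sum of Event, restarted for each ID (hypothetical helper)."""
    wanted = []
    for _, group in groupby(rows, key=itemgetter(0)):
        # accumulate() yields the running total within one ID's block of rows
        wanted.extend(accumulate(event for _, event in group))
    return wanted

X = running_event_count(rows)
```

No explicit per-row loop with state is needed; the grouping does the reset, much as by ID: does in Stata.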

Related

Create values of new data frame variable based on other column values

I have a question about data set preparation. In a survey, the same people were asked about a number of different variables at two points of measurement. This resulted in a dataset in long format, i.e. information from each participant is stored in two rows. Each row represents the data of this person at the respective time of measurement (see example). Individuals have individual participation codes. The same participation code thus indicates that the data is from the same person.
code time risk_perception
DB6M 1 6
DB6M 2 4
TH4D 1 2
TH4D 2 3
Now I would like to create a new variable "risk_perception.complete", which shows whether the information for each participant is complete. A person may have given no information at both measurement times, or at only one of the two, so that values are missing (NAs). In the new variable I would like to check and code this for each person: if the person has one or more NAs, a 0 should be coded; if the person has no NAs, a 1 (see example).
code time risk_perception risk_perception.complete
DB6M 1 6 1
DB6M 2 4 1
TH4D 1 2 1
TH4D 2 3 1
SU6H 1 NA 0
SU6H 2 3 0
VG9S 1 NA 0
VG9S 2 NA 0
Can anyone tell me the best way to program this?
Here is reproducible example:
data <- data.frame(
code = c("AH6M","AH6M","BD7M","BD7M","SH9L","SH9L"),
time = c(1,2,1,2,1,2),
risk = c(6,7,NA,3,NA,NA))
Thank you in advance and best regards!
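The rule described above (0 for every row of a participant who has at least one NA, otherwise 1) can be sketched as follows. This is plain Python rather than R, purely to illustrate the logic: None stands in for NA, and the function name add_complete_flag and the column name risk.complete are assumptions made for this example.

```python
# Mirrors the question's reproducible example; None stands in for R's NA.
data = [
    {"code": "AH6M", "time": 1, "risk": 6},
    {"code": "AH6M", "time": 2, "risk": 7},
    {"code": "BD7M", "time": 1, "risk": None},
    {"code": "BD7M", "time": 2, "risk": 3},
    {"code": "SH9L", "time": 1, "risk": None},
    {"code": "SH9L", "time": 2, "risk": None},
]

def add_complete_flag(rows, key="code", value="risk", flag="risk.complete"):
    """Return copies of the rows with a 0/1 flag per participant (hypothetical helper)."""
    # Participants with at least one missing value
    incomplete = {r[key] for r in rows if r[value] is None}
    return [{**r, flag: 0 if r[key] in incomplete else 1} for r in rows]

result = add_complete_flag(data)
```

The flag is decided once per participant code and then written to all of that participant's rows.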

What is the best approach for large scale Paths and Funnels Analysis?

We have a big dataset of user actions on our internal apps. I am trying to create an algorithm for Paths & Funnels analytics that takes a start and end point as parameters for Paths, and a defined sequence of actions for Funnels. What is the best algorithm to program this with large data? The output should be just counts of users for each specific sequence of actions (see the Path and Count example below).
Format of the file to scan:
UserID Action TS
1 A 06/04/2022
1 B 06/04/2022
1 C 06/04/2022
1 D 06/04/2022
2 G 06/04/2022
2 H 06/04/2022
2 K 06/04/2022
Algorithm input parameters:
For Path : User statistics on the start point A and end point F
For Funnel: User statistics on the defined steps A->B->C->D
Expected output:
Path Count
A->B->C->D 385
G->H->K 89
where A,B,C,D,... are nodes for user actions or pages.
This should be easy using Python for a smaller set, but the issue is, I am worried about performance, as I am dealing with millions of records like this. Please help!
Assuming that the input rows are sorted by user and then by timestamp, so that
...
1 A ts
1 B ts
...
in the input data means user 1 went A -> B, the algorithm is
CREATE new table paths_users_followed
CREATE new path
LOOP over data input rows
    ADD action in row to path
    IF row is last row OR user in row differs from user in row+1
        ADD user, path to paths_users_followed
        CREATE new path
ENDLOOP
LOOP P over input of "path statistics"
    COUNT occurrences of P in paths_users_followed
This can be most easily and efficiently implemented using a high performance database engine - I would use SQLite.
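For a quick prototype before moving to a database, the pseudocode can be sketched in Python. This is a minimal sketch under stated assumptions: rows are (UserID, Action, TS) tuples already sorted by user and then time, and the helper names paths_followed and count_paths are invented for this example. It counts exact full paths only, as the pseudocode does.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (UserID, Action, TS), assumed sorted by user, then time.
rows = [
    (1, "A", "06/04/2022"), (1, "B", "06/04/2022"),
    (1, "C", "06/04/2022"), (1, "D", "06/04/2022"),
    (2, "G", "06/04/2022"), (2, "H", "06/04/2022"), (2, "K", "06/04/2022"),
]

def paths_followed(rows):
    """Collapse each user's consecutive rows into one path tuple (hypothetical helper)."""
    return {user: tuple(action for _, action, _ in group)
            for user, group in groupby(rows, key=itemgetter(0))}

def count_paths(rows):
    """Count how many users followed each distinct full path."""
    return Counter(paths_followed(rows).values())

counts = count_paths(rows)
funnel = ("A", "B", "C", "D")
users_on_funnel = counts[funnel]
```

Because groupby streams over the sorted rows once, memory use depends on the number of users and distinct paths rather than on the number of raw records.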

Bash: find all pair of lines such that the difference of their first field is less than a threshold

My problem is the following. I have a BIG file with many rows containing numbers in ascending order (repetitions are possible)
1
1.5
3
3.5
6
6
...
1504054
1504056
I would like to print all pairs of (0-based) row numbers whose values differ by less than a given threshold thr. For instance, with thr=2.01, I want
0 1
0 2
1 2
1 3
2 3
4 5
...
N-1 N
I wrote a thing in python but the file is huge and I think I need a smart way to do this in bash.
Actually, in the complete data structure there exists also a second column containing a string:
1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN
and, if easy to do, I would like to write in each row the pair of linked strings, possibly separated by "|":
s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN
Thanks for your help, I am not too familiar with bash
In any language you can write a program implementing this pseudo code:
kept_rows = []
while read line:
    row = line.split(sep)
    new_kept_rows = []
    for kr in kept_rows:
        # keep only earlier rows that are still within the threshold
        if abs(float(row[0]) - float(kr[0])) < thr:
            print "".join(kr[1:]) + "|" + "".join(row[1:])
            new_kept_rows.append(kr)
    # the current row may pair with rows that come later
    new_kept_rows.append(row)
    kept_rows = new_kept_rows
This program only keeps the few lines which could still match the condition. Because the input is sorted, a row that is too far from the current one can never match a later row, so it is freed from memory. The memory footprint should therefore remain small even for big files.
I would use awk because I'm comfortable with it. But Python would fit too (the pseudo code I give is very close to Python).
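A runnable Python version of that sliding-window idea might look like the sketch below. It assumes the rows are sorted in ascending order of the first field and that fields are whitespace-separated; the function name close_pairs is made up for this example.

```python
from collections import deque

def close_pairs(lines, thr, sep=None):
    """Yield (i, j, label_i, label_j) for rows i < j whose first fields differ by less than thr.

    Assumes the rows are sorted in ascending order of the first field.
    """
    window = deque()  # earlier rows still within thr of the current value
    for j, line in enumerate(lines):
        fields = line.split(sep)
        value, label = float(fields[0]), "".join(fields[1:])
        # Input is sorted, so rows that are too far behind can never match again.
        while window and value - window[0][1] >= thr:
            window.popleft()
        for i, _, other in window:
            yield i, j, other, label
        window.append((j, value, label))

sample = ["1 s0", "1.5 s1", "3 s2", "3.5 s3", "6 s4", "6 s5"]
pairs = list(close_pairs(sample, thr=2.01))
linked = [f"{a}|{b}" for _, _, a, b in pairs]
```

For a real file, passing the open file object as lines would stream it line by line, so only the current window is held in memory.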

Why do tabulate or summarize not take into account missing values when implemented inside a program?

As an illustrative example, suppose this is your dataset:
cat sex age
1 1 13
1 0 14
1 1 .
2 1 23
2 1 45
2 1 15
If you want to create a table of frequencies between cat and sex, you tabulate these two variables and you get the following result:
tab cat sex
           |          sex
       cat |         0          1 |     Total
-----------+----------------------+----------
         1 |         1          2 |         3
         2 |         0          3 |         3
-----------+----------------------+----------
     Total |         1          5 |         6
I am writing a Stata program where the three variables are involved, i.e. cat, sex and age. Getting the matrix of frequencies for the first two variables is just an intermediate step that I need for further computation.
cap program drop myexample
program def myexample, rclass byable(recall) sortpreserve
    version 14
    syntax varlist [aweight iweight fweight] [if] [in] [ , AGgregate ]
    args var1 var2 var3
    tempname F
    marksample touse
    set more off
    if "`aggregate'" == "" {
        local var1: word 1 of `varlist'
        local var2: word 2 of `varlist'
        local var3: word 3 of `varlist'
        qui: tab `var1' `var2' [`weight' `exp'] if `touse', matcell(`F') label matcol(`var2')
        mat list `F'
    }
end
However, when I run:
myexample cat sex age
I get this result which is not what I expected:
__000001[2,2]
c1 c2
r1 1 1
r2 0 3
That is, given that age contains a missing value, even if it is not directly involved in the tabulation, the program ignores the missing value and does not take into account that observation. I need to get the result of the first tabulation. I have tried using summarize instead, but the same problem arises. When implemented inside the program, missing values are not counted.
You are complaining about behaviour which you built into your own program. The responsibility and the explanation are in your hands.
The effect of
marksample touse
followed by calling up a command with the qualifier
if `touse'
is to ignore missing values. marksample by default marks as "to use" those observations in which all variables specified have non-missing values; the other observations are marked as to be ignored. It also takes account of any if or in qualifiers and any zero weights.
It's also true, as @Noobie explains, that omitting missing values from a tabulation is default for tabulate in any case.
So, to get the result you want you'd need to modify your marksample call to
marksample touse, novarlist
and to call up tabulate with the missing option (if it's compulsory) or to allow users to specify a missing option which you then pass to tabulate.
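To make the effect concrete, here is a rough Python analogy (not Stata code) of what the marker variable does to the question's data. The functions mark_sample and crosstab are invented for this illustration and only mimic the missing-value rule; if, in, and weights are ignored.

```python
from collections import Counter

# The question's dataset; None stands in for Stata's numeric missing value (.).
data = [
    {"cat": 1, "sex": 1, "age": 13},
    {"cat": 1, "sex": 0, "age": 14},
    {"cat": 1, "sex": 1, "age": None},
    {"cat": 2, "sex": 1, "age": 23},
    {"cat": 2, "sex": 1, "age": 45},
    {"cat": 2, "sex": 1, "age": 15},
]

def mark_sample(rows, varlist, novarlist=False):
    """Return a 0/1 marker per row, loosely mimicking marksample (hypothetical helper)."""
    if novarlist:
        return [1] * len(rows)
    # Default rule: a missing value in ANY listed variable excludes the row
    return [int(all(r[v] is not None for v in varlist)) for r in rows]

def crosstab(rows, touse, row_var, col_var):
    """Count (row_var, col_var) combinations among marked rows only."""
    return Counter((r[row_var], r[col_var]) for r, use in zip(rows, touse) if use)

varlist = ["cat", "sex", "age"]
default = crosstab(data, mark_sample(data, varlist), "cat", "sex")
with_novarlist = crosstab(data, mark_sample(data, varlist, novarlist=True), "cat", "sex")
```

With the default rule, the observation whose age is missing drops out of the cat by sex counts even though age is not tabulated; with the novarlist analogue it stays in.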
You also ask about summarize. By design that command ignores missing values. I don't know what you would expect summarize to do about them. It could report a count of missing values. If you want that, several other commands will oblige, such as codebook or missings (Stata Journal). You can always include a report on missings in your program, such as using count to count the missings and display the result.
I understand your program to be very much work in progress, so won't comment on details you don't ask about.
This is caused by marksample. Rule 5 in help mark states
The marker variable is set to 0 in observations for which any of the
numeric variables in varlist contain a numeric missing value.
You should use the novarlist option. According to the help file,
novarlist is for use with marksample. It specifies that missing values
among variables in varlist not cause the marker variable to be set to 0.
If I understand correctly, you want tab to include missing values? If so, you just have to ask for it:
tab myvar1 myvar2, mi
From the documentation:
missing : treat missing values like other values

Drop all obs of group if condition is met

suppose I have the following panel data (didn't include time var for simplicity)
clear
input id var
1 .
1 0
1 0
1 .
2 .
2 .
2 .
2 .
3 1
3 .
3 .
3 0
end
I would like to drop every group in which all observations are missing, that is, I want my data to be like:
id var
1 .
1 0
1 0
1 .
3 1
3 .
3 .
3 0
I tried doing a gen todrop = var[_N], but for some reason, for some groups it doesn't work. Any thoughts? I thought about sorting id var, then doing a cascade replace, but I'm sure there is a better way to do this.
In general, you can verify whether all observations hold the same value by checking first and last observations in each panel, after appropriate sorting. The same principle applies here. I'll use the missing() function:
clear
set more off
input id myvar
1 .
1 0
1 0
1 .
2 .
2 .
2 .
2 .
3 1
3 .
3 .
3 0
end
bysort id (myvar) : gen todrop = missing(myvar[1]) & missing(myvar[_N])
list, sepby(id)
In this case, just checking the first one also works: numeric missing values sort last, so if the first value is missing, all the others are.
See help by.
Roberto has provided a solution which is however case specific and might lead to wrong outcome.
In fact, suppose you have an observation as follows:
id myvar
2 .
2 1
2 .
Using Roberto's code, you would remove this group, while in the question you need to remove only if all observations are missing.
Therefore I suggest you use a different approach, as follows:
levelsof id, local(groups) // collects the distinct values of id (no need to egen if you don't really have to)
foreach iter of local groups {
    mdesc myvar if id == `iter' // mdesc is user-written; write "`iter'" in double quotes if id is a string
    drop if id == `iter' & r(percent) == 100 // r(percent) is stored after mdesc
}
Roberto's code definitely works. So does the code below; its only contribution is that it keeps the original sort order of the observations, in case you want that.
egen todrop2 = min(missing(myvar)), by(id)
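The underlying rule (drop a group only when every one of its values is missing, without re-sorting) can be illustrated outside Stata with a short Python sketch. None stands in for Stata's missing value, and the function name drop_all_missing_groups is made up for this example.

```python
# The question's data as (id, var) pairs; None stands in for Stata's missing value (.).
rows = [(1, None), (1, 0), (1, 0), (1, None),
        (2, None), (2, None), (2, None), (2, None),
        (3, 1), (3, None), (3, None), (3, 0)]

def drop_all_missing_groups(rows):
    """Keep only groups with at least one non-missing value; order is preserved."""
    has_value = {gid for gid, value in rows if value is not None}
    return [(gid, value) for gid, value in rows if gid in has_value]

kept = drop_all_missing_groups(rows)
```

A group such as (2, .), (2, 1), (2, .) would be kept, since one non-missing value is enough.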

Resources