Generation of a counter variable for episodes in panel data in Stata [duplicate]

This question already has an answer here:
Calculating consecutive ones
(1 answer)
Closed 1 year ago.
I am trying to generate a counter variable that describes the duration of a temporal episode in panel data.
I am using long format data that looks something like this:
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
I want to generate a variable like aim1 that starts with a value of 1 when var1==1, and counts up one unit with each subsequent observation per ID where var1 is still equal to 1. For each observation where var1!=1, aim1 should contain missing values.
I already tried using rangestat (count) to solve the problem; however, the variable it creates does not restart the count with each episode:
ssc install rangestat
gen var2=1 if var1==1
rangestat (count) aim2=var2, interval(time -7 0) by(id)

Here are two ways to do it: (1) from first principles (but see this paper for more), and (2) using tsspell from SSC.
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
bysort id (time) : gen wanted = 1 if var1 == 1 & var1[_n-1] != 1   // first observation of each episode
by id: replace wanted = wanted[_n-1] + 1 if var1 == 1 & missing(wanted)   // count up within the episode
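To check that the first-principles result reproduces the target variable aim1 included in the example data:
assert wanted == aim1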
tsset id time
ssc inst tsspell
tsspell, cond(var1 == 1)
list, sepby(id _spell)
+---------------------------------------------------------+
| id time var1 aim1 wanted _seq _spell _end |
|---------------------------------------------------------|
1. | 1 1 0 . . 0 0 0 |
2. | 1 2 0 . . 0 0 0 |
|---------------------------------------------------------|
3. | 1 3 1 1 1 1 1 0 |
4. | 1 4 1 2 2 2 1 1 |
|---------------------------------------------------------|
5. | 1 5 0 . . 0 0 0 |
6. | 1 6 0 . . 0 0 0 |
7. | 1 7 0 . . 0 0 0 |
|---------------------------------------------------------|
8. | 2 1 0 . . 0 0 0 |
|---------------------------------------------------------|
9. | 2 2 1 1 1 1 1 0 |
10. | 2 3 1 2 2 2 1 0 |
11. | 2 4 1 3 3 3 1 1 |
|---------------------------------------------------------|
12. | 2 5 0 . . 0 0 0 |
|---------------------------------------------------------|
13. | 2 6 1 1 1 1 2 0 |
14. | 2 7 1 2 2 2 2 1 |
+---------------------------------------------------------+
The tsspell approach is very close to what you ask for, except that (a) its counter _seq is by default 0 when out of spell, but replace _seq = . if _seq == 0 gets what you ask for; and (b) its auxiliary variables (by default _spell and _end) are useful in many problems. You must install tsspell with ssc install tsspell before you can use it.
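For completeness, here is the small follow-up for point (a), turning _seq into the missing-when-out-of-spell counter asked for, and checking it against aim1 in the example data:
replace _seq = . if _seq == 0
assert _seq == aim1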

Related

How to record properties of other variables in Stata

I have to generate variables entry_1, entry_2, and entry_3, which will take the value 1 if id i had entry=1 in that particular month.
Example.
id month entry entry_1 entry_2 entry_3
1 1 1 1 0 0
1 2 0 0 0 0
1 3 0 0 1 1
1 4 0 0 0 0
2 1 0 1 0 0
2 2 0 0 0 0
2 3 1 0 1 1
2 4 0 0 0 0
3 1 0 1 0 0
3 2 0 0 0 0
3 3 1 0 1 1
3 4 0 0 0 0
Would anyone be so kind as to propose an idea of how to implement a loop to do this?
I am thinking of something like this:
forvalues i=1(1)3 {
gen entry`i'=0
replace entry`i'=1 if on that particular month id=`i' had entry=1
}
You could do something like this (although your data don't quite look right for the question you're asking):
forvalues i = 1/3 {
    gen entry_`i' = id == `i' & entry == 1
}
This generates a dummy variable entry_i for each i in the forvalues loop where entry_i = 1 if id is i and entry is 1, and 0 otherwise.
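As a quick check (a suggestion, not from the thread), the result can be inspected with the same list idiom used in the next answer:
list id month entry entry_?, sepby(id)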
The code can be simplified down to at most one loop.
clear
input id month entry entry_1 entry_2 entry_3
1 1 1 1 0 0
1 2 0 0 0 0
1 3 0 0 1 1
1 4 0 0 0 0
2 1 0 1 0 0
2 2 0 0 0 0
2 3 1 0 1 1
2 4 0 0 0 0
3 1 0 1 0 0
3 2 0 0 0 0
3 3 1 0 1 1
3 4 0 0 0 0
end
forval j = 1/4 {
    egen entry`j' = total(entry & id == `j'), by(month)   // positive in every row of a month in which id `j' had entry == 1
}
list id month entry entry?, sepby(id)
+--------------------------------------------------------+
| id month entry entry1 entry2 entry3 entry4 |
|--------------------------------------------------------|
1. | 1 1 1 1 0 0 0 |
2. | 1 2 0 0 0 0 0 |
3. | 1 3 0 0 1 1 0 |
4. | 1 4 0 0 0 0 0 |
|--------------------------------------------------------|
5. | 2 1 0 1 0 0 0 |
6. | 2 2 0 0 0 0 0 |
7. | 2 3 1 0 1 1 0 |
8. | 2 4 0 0 0 0 0 |
|--------------------------------------------------------|
9. | 3 1 0 1 0 0 0 |
10. | 3 2 0 0 0 0 0 |
11. | 3 3 1 0 1 1 0 |
12. | 3 4 0 0 0 0 0 |
+--------------------------------------------------------+

UNIX - Count occurrences of character per line between two fields and add new column with result

I have a PLINK ped file that looks like this:
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I am interested in counting how many times the string "2" occurs between the 7th field (included) and the last field. Then, if the number of occurrences is:
0: add 1 (being absent) to the new last field
1: add 2 (being present) to the new last field
2: add 2 (being present) to the new last field
The output would be:
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I know that to count the occurrence of a string in every field I can use:
grep -n -o "2" file1 | sort -n | uniq -c | cut -d : -f 1
And that I can merge the 2 results using:
paste -d' ' file1 file2 > file3
But I don't know how to count the occurrences between two fields.
Thank you in advance for helping me!
You can use awk to work with column- and row-based data:
awk '{c=0; for(i=7; i<=NF; i++) if ($i==2) c++; print $0, (c==0 ? 1 : 2)}' file
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Perl to the rescue:
perl -ape 's/$/" " . (1 + !! grep 2 == $_, @F[6 .. $#F])/e'
-p reads the input line by line and prints the result
-a splits each input line on whitespace into the @F array
grep in scalar context returns the count; !! (double negation) changes it to 0 or 1, and adding 1 turns that into 1 or 2 as requested
s/// substitutes $ (end of line) with the result of the code in the replacement part (that's what /e does)
You could use awk:
awk '{s=0;for(i=7;i<=NF;i++) if($i==2) s+=1; s=s==0?1:2; print $0, s;}' data.txt
Explanations:
The instructions between the {} are executed on each line of the file.
NF is the number of fields in the line. They are numbered 1 to NF and you can access them with the $n notation.
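For readability, here is the same logic as the one-liner above, expanded into a commented script (functionally identical):
awk '
{
    s = 0
    for (i = 7; i <= NF; i++)   # scan from the 7th field to the last
        if ($i == 2) s++        # count occurrences of 2
    s = (s == 0 ? 1 : 2)        # absent -> 1, present -> 2
    print $0, s                 # append the result as a new last field
}' data.txt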

Swift 2: How to parse variable-spaced text

I have the following data (a portion shown below) and I'm trying to determine the best way to parse this into an array where each line would be parsed separately. The problem I'm running into is that each "column" is separated by a varying number of spaces.
I have tried using .componentsSeparatedByString(" ") but that doesn't give me a consistent number of items in the array. I thought of using whitespace, but some team names have 2 words in them and some have 3.
A sample of the text follows:
1 New England Patriots = 28.69 5 0 0 20.34( 13) 1 0 0 | 3 0 0 | 28.68 1 | 28.95 1 | 28.66 2
2 Green Bay Packers = 27.97 6 0 0 17.80( 28) 1 0 0 | 1 0 0 | 27.47 2 | 28.73 2 | 29.01 1
3 Denver Broncos = 26.02 6 0 0 19.02( 23) 0 0 0 | 2 0 0 | 25.21 5 | 27.25 3 | 27.98 3
4 Cincinnati Bengals = 25.96 6 0 0 19.91( 18) 1 0 0 | 3 0 0 | 25.71 4 | 26.38 4 | 26.36 4
5 Arizona Cardinals = 25.01 4 2 0 18.05( 27) 0 1 0 | 0 1 0 | 26.47 3 | 24.17 6 | 23.37 7
6 Pittsburgh Steelers = 24.87 4 2 0 21.17( 10) 1 1 0 | 1 2 0 | 25.17 6 | 24.53 5 | 24.39 5
7 Seattle Seahawks = 24.04 2 4 0 20.92( 12) 0 2 0 | 0 3 0 | 24.47 7 | 23.29 7 | 23.37 6
8 Philadelphia Eagles = 23.87 3 3 0 20.02( 17) 1 1 0 | 2 2 0 | 24.28 8 | 23.01 8 | 23.23 8
9 New York Jets = 22.95 4 1 0 18.41( 25) 0 1 0 | 0 1 0 | 23.83 9 | 22.77 10 | 21.69 11
10 Atlanta Falcons = 22.18 5 1 0 19.31( 21) 1 0 0 | 3 0 0 | 22.36 10 | 22.33 11 | 21.86 10
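No answer is recorded for this question in this digest, so the following is only a hedged sketch (Swift 2 idioms; the variable line is an assumption holding one row of the sample). split drops empty slices by default, so runs of spaces collapse into single separators, and the = token marks the end of the team name:
// A sketch, not a tested solution: split on runs of spaces, then rejoin
// the words before "=" as the team name.
let line = "1 New England Patriots = 28.69 5 0 0 20.34( 13) 1 0 0 | 3 0 0 | 28.68 1 | 28.95 1 | 28.66 2"
let tokens = line.characters.split(" ").map(String.init)   // empty slices dropped: repeated spaces collapse
if let eq = tokens.indexOf("=") {
    let rank = tokens[0]                                   // "1"
    let team = tokens[1..<eq].joinWithSeparator(" ")       // "New England Patriots"
    let numbers = Array(tokens[(eq + 1)..<tokens.count])   // remaining columns, still as strings
    print(rank, team, numbers)
}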

Ranking aggregated values in panel data

I have an unbalanced panel data set with daily data similar to this, for n countries:
quarter  date  id  trade  trade_quarterly  rank   i
      1     1   1      1                2     1  10
      1     2   1      1                2     1  17
      1     1   2      1                1     2  12
      2     1   1      0                1     1   5
      2     2   1      1                1     1   9
      2     1   2      0                1     1  14
      2     2   2      1                1     1   8
      2     2   3      0                0     3   6
The first 4 columns are given.
Since I am interested in the information in i, I would now like to keep only the 2 most traded ids for each quarter. I aggregated quarterly trades with
bysort quarter id: egen trade_quarterly = sum(trade)
to get column 5.
To calculate column 6, I tried using
bysort quarter id : egen xx = rank(trade_quarterly), option
which does not appear to produce the correct solution.
(Note that since the aggregated values are repeated within ids, ranking with rank(xx) and the field option would produce a wrong rank for the following id.)
The last line of syntax
bysort quarter id : egen xx =rank(trade_quarterly), option
is not legal, as the literal text option is itself not an option. More generally, egen, rank() cannot help here with your present data structure.
But consider this: it is just a matter of a collapse to sums (totals) and then keeping only the largest two (the last two after sorting) within each cross-combination:
clear
input quarter date id trade
1 1 1 1
1 2 1 1
1 1 2 1
2 1 1 0
2 2 1 1
2 1 2 0
2 2 2 1
2 2 3 0
end
collapse (sum) trade, by(quarter id)
bysort quarter (trade) : keep if (_N - _n) < 2   // after sorting, the last two per quarter are the two largest
list, sepby(id quarter)
+----------------------+
| quarter id trade |
|----------------------|
1. | 1 2 1 |
|----------------------|
2. | 1 1 2 |
|----------------------|
3. | 2 1 1 |
|----------------------|
4. | 2 2 1 |
+----------------------+
If you don't want to collapse, then an extra technique is to tag each id-quarter pair just once when ranking.
clear
input quarter date id trade
1 1 1 1
1 2 1 1
1 1 2 1
2 1 1 0
2 2 1 1
2 1 2 0
2 2 2 1
2 2 3 0
end
egen sum = total(trade), by(quarter id)
egen tag = tag(quarter id)
bysort tag quarter (sum) : gen tokeep = tag & (_N - _n) < 2   // among tagged rows, flag the two largest sums per quarter
bysort quarter id (tokeep) : replace tokeep = tokeep[_N]      // spread the flag to all rows of that id-quarter pair
list if tokeep, sepby(quarter)
+--------------------------------------------------+
| quarter date id trade sum tag tokeep |
|--------------------------------------------------|
1. | 1 2 1 1 2 0 1 |
2. | 1 1 1 1 2 1 1 |
3. | 1 1 2 1 1 1 1 |
|--------------------------------------------------|
4. | 2 2 1 1 1 0 1 |
5. | 2 1 1 0 1 1 1 |
6. | 2 2 2 1 1 0 1 |
7. | 2 1 2 0 1 1 1 |
+--------------------------------------------------+
Note, in agreement with @William Lisowski's comment, that the largest two may not be uniquely identifiable in the presence of ties.
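If a deterministic choice is needed despite ties, one possibility (a sketch, not from the thread) is to break ties by id after the collapse:
gsort quarter -trade id       // largest totals first within quarter; id breaks ties
by quarter : keep if _n <= 2  // keep the top two per quarter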

Filling in gaps with awk or anything

I have a list such as below, where the 1st column is the position and the other columns aren't important for this question.
1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
I want to fill in the gaps such that the list is continuous and it reads
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
I am familiar with awk and shell scripts, but whatever way it can be done is fine with me.
Thanks for any help.
This one-liner may work for you:
awk '$1>++p{for(;p<$1;p++)print p" 0 0 0 0 0"}1' file
with your example:
kent$ echo '1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5'|awk '$1>++p{for(;p<$1;p++)print p" 0 0 0 0 0"}1'
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
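For readers less used to dense awk, here is the same one-liner expanded with comments (functionally identical):
awk '
$1 > ++p {                  # a gap: the expected position p lags behind $1
    for (; p < $1; p++)     # emit each missing position
        print p" 0 0 0 0 0"
}
1                           # print the current input line unchanged
' file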
You can use the following awk one-liner:
awk '{b=a;a=$1;while(a>(b++)+1){print b" 0 0 0 0 0"}}1' input.file
Tested with here-doc input:
awk '{b=a;a=$1;while(a>(b++)+1){print b" 0 0 0 0 0"}}1' <<EOF
1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
EOF
the output is as follows:
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
Explanation:
On every input line, b is first set to the previous value of a, and a is then set to the value of the first column. Because of the order in which b and a are updated, b can be used in a while loop that runs as long as b < a-1 and inserts the missing lines, filled up with zeros. The 1 at the end of the script finally prints the input line.
This is only for fun:
join -a2 FILE <(seq -f "%g 0 0 0 0 0" $(tail -1 FILE | cut -d' ' -f1)) | cut -d' ' -f -6
produces:
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
Here is another way:
awk '{x=$1-b;while(x-->1){print ++b" 0 0 0 0 0"};b=$1}1' file
Test:
$ cat file
1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
$ awk '{x=$1-b;while(x-->1){print ++b" 0 0 0 0 0"};b=$1}1' file
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
