Ranking aggregated values in panel data - panel

I have an unbalanced panel data set with daily data similar to this, for n countries:
quarter date id trade trade_quarterly rank i
1 1 1 1 2 1 10
1 2 1 1 2 1 17
1 1 2 1 1 2 12
2 1 1 0 1 1 5
2 2 1 1 1 1 9
2 1 2 0 1 1 14
2 2 2 1 1 1 8
2 2 3 0 0 3 6
Given are the first 4 columns.
Interested in information i, I would now like to keep only the 2 most traded ids for each quarter. I aggregated quarterly trades with
bysort quarter id: egen trade_quarterly =sum(trade)
to get column 5.
To calculate column 6, I tried using
bysort quarter id : egen xx =rank(trade_quarterly), "option"
which does not appear to produce the correct solution.
(Note that, since the values are aggregated within ids, ranking with rank(xx), field would produce a wrong rank for the following id.)

The last line of syntax
bysort quarter id : egen xx =rank(trade_quarterly), option
is not legal, as the literal text option is itself not an option. More generally, egen, rank() cannot help here with your present data structure.
But consider this, just a matter of a collapse to sums (totals) and then keeping only the largest two (the last two after sorting) within cross-combinations:
clear
input quarter date id trade
1 1 1 1 2
1 2 1 1 2
1 1 2 1 1
2 1 1 0 1
2 2 1 1 1
2 1 2 0 1
2 2 2 1 1
2 2 3 0 0
end
collapse (sum) trade, by(quarter id)
bysort quarter (trade) : keep if (_N - _n) < 2
list, sepby(id quarter)
+----------------------+
| quarter id trade |
|----------------------|
1. | 1 2 1 |
|----------------------|
2. | 1 1 2 |
|----------------------|
3. | 2 1 1 |
|----------------------|
4. | 2 2 1 |
+----------------------+
If you don't want to collapse, then an extra technique is to tag each id-quarter pair just once when ranking.
clear
input quarter date id trade
1 1 1 1 2
1 2 1 1 2
1 1 2 1 1
2 1 1 0 1
2 2 1 1 1
2 1 2 0 1
2 2 2 1 1
2 2 3 0 0
end
egen sum = total(trade), by(quarter id)
egen tag = tag(quarter id)
bysort tag quarter (sum) : gen tokeep = tag & (_N - _n) < 2
bysort quarter id (tokeep) : replace tokeep = tokeep[_N]
list if tokeep, sepby(quarter)
+--------------------------------------------------+
| quarter date id trade sum tag tokeep |
|--------------------------------------------------|
1. | 1 2 1 1 2 0 1 |
2. | 1 1 1 1 2 1 1 |
3. | 1 1 2 1 1 1 1 |
|--------------------------------------------------|
4. | 2 2 1 1 1 0 1 |
5. | 2 1 1 0 1 1 1 |
6. | 2 2 2 1 1 0 1 |
7. | 2 1 2 0 1 1 1 |
+--------------------------------------------------+
Note, in agreement with @William Lisowski's comment, that the largest two may not be uniquely identifiable in the presence of ties.

Related

Generation of a counter variable for episodes in panel data in stata [duplicate]

I am trying to generate a counter variable that describes the duration of a temporal episode in panel data.
I am using long format data that looks something like this:
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
I want to generate a variable like aim1 that starts with a value of 1 when var1==1, and counts up one unit with each subsequent observation per ID where var1 is still equal to 1. For each observation where var1!=1, aim1 should contain missing values.
I already tried using rangestat (count) to solve the problem, however the created variable does not restart the count with each episode:
ssc install rangestat
gen var2=1 if var1==1
rangestat (count) aim2=var2, interval(time -7 0) by (id)
Here are two ways to do it: (1) from first principles, but see this paper for more and (2) using tsspell from SSC.
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
bysort id (time) : gen wanted = 1 if var1 == 1 & var1[_n-1] != 1
by id: replace wanted = wanted[_n-1] + 1 if var1 == 1 & missing(wanted)
tsset id time
ssc inst tsspell
tsspell, cond(var1 == 1)
list, sepby(id _spell)
+---------------------------------------------------------+
| id time var1 aim1 wanted _seq _spell _end |
|---------------------------------------------------------|
1. | 1 1 0 . . 0 0 0 |
2. | 1 2 0 . . 0 0 0 |
|---------------------------------------------------------|
3. | 1 3 1 1 1 1 1 0 |
4. | 1 4 1 2 2 2 1 1 |
|---------------------------------------------------------|
5. | 1 5 0 . . 0 0 0 |
6. | 1 6 0 . . 0 0 0 |
7. | 1 7 0 . . 0 0 0 |
|---------------------------------------------------------|
8. | 2 1 0 . . 0 0 0 |
|---------------------------------------------------------|
9. | 2 2 1 1 1 1 1 0 |
10. | 2 3 1 2 2 2 1 0 |
11. | 2 4 1 3 3 3 1 1 |
|---------------------------------------------------------|
12. | 2 5 0 . . 0 0 0 |
|---------------------------------------------------------|
13. | 2 6 1 1 1 1 2 0 |
14. | 2 7 1 2 2 2 2 1 |
+---------------------------------------------------------+
The approach of tsspell is very close to what you ask for, except that (a) its counter (by default _seq) is 0 when out of spell, but replace _seq = . if _seq == 0 gets what you ask for, and (b) its auxiliary variables (by default _spell and _end) are useful in many problems. You must install tsspell with ssc install tsspell before you can use it.

How to subtract every nth from (n+3)th line in awk?

I have 4-column data files which have approximately 100 lines. I'd like to subtract every nth from the (n+3)th line and print the values in a new column ($5). The data in each column do not follow a regular pattern.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can I do this in awk?
Thank you
For this I think the easiest thing is to read through the file twice. The first time (the NR==FNR block) we save all the 4th column values in an array indexed by the line number. The next block is executed for the second pass and creates a 5th column with the desired calculation (checking first to make sure that we wouldn't go past the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
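You can do this using tac + awk + tac. Reversing the file with tac turns the look-ahead to line n+3 into a look-back to line NR-3, so a single pass is enough, and the second tac restores the original order: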
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .
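tac is part of GNU coreutils and may be missing on some systems, so here is a minimal sketch of the same idea in a single awk process, assuming the file is small enough to hold in memory (about 100 lines in the question). The function name lead_diff and the offset parameter k are illustrative names, not part of either answer, and the numeric test reuses the same non-negative-integer pattern as the command above.

```shell
# Hypothetical helper: append (column 4, k lines below) - (column 4 on this line).
# lead_diff and k are illustrative names introduced for this sketch.
lead_diff() {
    awk -v k="${1:-3}" '
        { line[NR] = $0; val[NR] = $4 }          # remember every line and its 4th field
        END {
            for (n = 1; n <= NR; n++) {
                cur_ok = (val[n] ~ /^[0-9]+$/)   # is this line numeric in column 4?
                if (cur_ok && n + k <= NR && val[n + k] ~ /^[0-9]+$/)
                    print line[n], val[n + k] - val[n]
                else if (cur_ok)
                    print line[n], "?"           # no usable value k lines below
                else
                    print line[n]                # placeholder rows are left untouched
            }
        }'
}

lead_diff 3 < input
```

Unlike the tac version, this sketch also prints ? on the last k numeric lines when the file has no trailing placeholder rows.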

UNIX - Count occurrences of character per line between two fields and add new column with result

I have a PLINK ped file that looks like this:
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I am interested in counting how many times the string "2" occurs between the 7th field (included) and the last field. Then, if the number of occurrences is:
0: add 1 (being absent) to the new last field
1: add 2 (being present) to the new last field
2: add 2 (being present) to the new last field
The output would be:
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I know that to count the occurrence of a string in every field I can use:
grep -n -o "2" file1 | sort -n | uniq -c | cut -d : -f 1
And that I can merge the 2 results using:
paste -d' ' file1 file2 > file3
But I don't know how to count the occurrences between two fields.
Thank you in advance for helping me!
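You can use awk to loop over the fields from the 7th onward and count the matches. The if (c<2) c++ step then maps a count of 0 to 1 and a count of 1 to 2, while a count of 2 stays 2: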
awk '{c=0; for(i=7; i<=NF; i++) if ($i==2) c++; if (c<2) c++; print $0, c}' file
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Perl to the rescue:
perl -ape 's/$/" " . (1 + !! grep 2 == $_, @F[6 .. $#F])/e'
-p reads the input line by line and prints the result
-a splits each input line on whitespace into the @F array
grep in scalar context returns the count, by !! (double negation) we change it to 0 or 1, and by adding 1 we make it into 1 and 2 as requested
s/// substitutes $ (end of line) with the result of the code in the replacement part (that's what /e does)
You could use awk:
awk '{s=0;for(i=7;i<=NF;i++) if($i==2) s+=1; s=s==0?1:2; print $0, s;}' data.txt
Explanations:
The instructions between the {} are executed on each line of the file.
NF is the number of fields in the line. They are numbered 1 to NF and you can access them with the $n notation.
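As a small illustration of NF and the $n notation, the loop above can be wrapped in a shell function that takes the starting field as an argument. This is only a sketch: the function name flag_twos, the parameter name start, and the two short sample lines are made up for the example.

```shell
# Hypothetical wrapper around the counting loop; flag_twos and start are illustrative names.
flag_twos() {
    awk -v start="${1:-7}" '{
        n = 0
        for (i = start; i <= NF; i++)   # fields are numbered 1..NF
            if ($i == 2) n++            # $i is the value of field number i
        print $0, (n > 0 ? 2 : 1)       # 2 = present, 1 = absent
    }'
}

printf '%s\n' 'ID1 ID1 0 0 2 1 1 1' 'ID2 ID2 0 0 2 2 1 2' | flag_twos 7
# prints:
# ID1 ID1 0 0 2 1 1 1 1
# ID2 ID2 0 0 2 2 1 2 2
```

Passing the start field with -v keeps the awk program itself unchanged if the genotype columns begin somewhere other than field 7.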

swift2 How to parse variable spaced text

I have the following data (a portion shown below) and I'm trying to determine the best way to parse this into an array where each line would be parsed separately. The problem I'm running into is that each "column" is separated by a varying number of spaces.
I have tried using .componentsSeparatedBySpaces(" ") but that doesn't give me a consistent number of items in the array. I thought of using whiteSpace but some team names have 2 words in them and some have 3.
A sample of the text follows:
1 New England Patriots = 28.69 5 0 0 20.34( 13) 1 0 0 | 3 0 0 | 28.68 1 | 28.95 1 | 28.66 2
2 Green Bay Packers = 27.97 6 0 0 17.80( 28) 1 0 0 | 1 0 0 | 27.47 2 | 28.73 2 | 29.01 1
3 Denver Broncos = 26.02 6 0 0 19.02( 23) 0 0 0 | 2 0 0 | 25.21 5 | 27.25 3 | 27.98 3
4 Cincinnati Bengals = 25.96 6 0 0 19.91( 18) 1 0 0 | 3 0 0 | 25.71 4 | 26.38 4 | 26.36 4
5 Arizona Cardinals = 25.01 4 2 0 18.05( 27) 0 1 0 | 0 1 0 | 26.47 3 | 24.17 6 | 23.37 7
6 Pittsburgh Steelers = 24.87 4 2 0 21.17( 10) 1 1 0 | 1 2 0 | 25.17 6 | 24.53 5 | 24.39 5
7 Seattle Seahawks = 24.04 2 4 0 20.92( 12) 0 2 0 | 0 3 0 | 24.47 7 | 23.29 7 | 23.37 6
8 Philadelphia Eagles = 23.87 3 3 0 20.02( 17) 1 1 0 | 2 2 0 | 24.28 8 | 23.01 8 | 23.23 8
9 New York Jets = 22.95 4 1 0 18.41( 25) 0 1 0 | 0 1 0 | 23.83 9 | 22.77 10 | 21.69 11
10 Atlanta Falcons = 22.18 5 1 0 19.31( 21) 1 0 0 | 3 0 0 | 22.36 10 | 22.33 11 | 21.86 10

Stata: need help creating a binary variable from panel data

I have a dataset in which a household id (hhid) and a member id (mid) identify a unique person. I have results from two separate surveys taken a year apart (surveyYear). I also have data on whether or not the individual was enrolled in school at the time.
I want a binary variable which signifies if the individual in question dropped out of school between the surveys (i.e. 1 if dropped and 0 if still in school)
I have a decent understanding of Stata but this coding challenge seems a little beyond me because I am not sure how to compare the in-school status of the later id with the earlier id and then propagate that result into a binary column.
Here is an example of what I need
Previously:
+----------------------------------+
| hhid mid survey~r inschool |
|----------------------------------|
1. | 1 2 3 1 |
2. | 1 2 4 1 |
3. | 1 3 3 1 |
4. | 1 3 4 1 |
5. | 2 1 3 1 |
6. | 2 1 4 0 |
7. | 2 2 3 0 |
8. | 2 2 4 0 |
+----------------------------------+
After:
+--------------------------------------------+
| hhid mid survey~r inschool dropped |
|--------------------------------------------|
1. | 1 2 3 1 0 |
2. | 1 2 4 1 0 |
3. | 1 3 3 1 0 |
4. | 1 3 4 1 0 |
5. | 2 1 3 1 1 |
6. | 2 1 4 0 1 |
7. | 2 2 3 0 0 |
8. | 2 2 4 0 0 |
+--------------------------------------------+
bysort hhid mid (surveyYear) : gen dropped = inschool[1] == 1 & inschool[2] == 0
The commentary is longer than the code:
Within blocks of observations with the same hhid and mid, sort by surveyYear.
You want students who were inschool in year 3 but not in year 4. So, inschool is 1 in the first observation and 0 in the second.
Here subscripting [1] and [2] refers to order within blocks of observations defined by the by: statement.
If further detail is needed see e.g. this article. Note that contrary to one tag, no loop is needed (or, if you wish, that the loop over possibilities is built in to the by: framework).
