How to create industry-year averages of all variables from firm-year data using Stata? - panel

I have a panel dataset in the following format:
Firm Year Industry Sales Profit Export intensity R&D
1 2000 1 x x x x
2 2000 1 x x x x
3 2000 2 x x x x
4 2000 2 x x x x
1 2001 1 x x x x
2 2001 1 x x x x
3 2001 2 x x x x
4 2001 2 x x x x
1 2002 1 x x x x
2 2002 1 x x x x
3 2002 2 x x x x
4 2002 2 x x x x
1 2003 1 x x x x
2 2003 1 x x x x
3 2003 2 x x x x
4 2003 2 x x x x
I want to create the industry average per year for all variables. The real dataset has 2,000 firms x 10 years of observations and 25 industries.

If you want to maintain your data structure, the easiest way is probably to combine egen's by() option with a loop:
foreach v of varlist Sales Profit Export RD {
egen IndAvg`v' = mean(`v') , by(Industry Year)
}
E.g.,
clear all
input Firm Year Industry Sales Profit Export RD
1 2000 1 831 135 196 30
2 2000 1 44 847 885 780
3 2000 2 818 112 859 306
4 2000 2 777 700 903 858
1 2001 1 491 563 325 324
2 2001 1 411 468 927 720
3 2001 2 731 872 170 556
4 2001 2 587 273 833 656
1 2002 1 155 558 497 427
2 2002 1 210 853 792 575
3 2002 2 279 282 969 549
4 2002 2 683 176 902 538
1 2003 1 805 475 479 599
2 2003 1 226 178 37 225
3 2003 2 129 693 746 652
4 2003 2 347 509 406 102
end
foreach v of varlist Sales Profit Export RD {
egen IndAvg`v' = mean(`v') , by(Industry Year)
}
sort Industry Year Firm
li , sepby(Industry)
However, you may also want to look into collapse, which replaces the dataset with one observation per Industry-Year group:
collapse (mean) Sales Profit Export RD , by(Industry Year)
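For readers working in Python rather than Stata, here is a rough pandas sketch of both approaches (the DataFrame below just re-creates the year-2000 rows of the example; groupby().transform keeps the firm-year structure like egen, while groupby().mean collapses it):
import pandas as pd

# Re-create the year-2000 rows of the example above
df = pd.DataFrame({
    "Firm":     [1, 2, 3, 4],
    "Year":     [2000, 2000, 2000, 2000],
    "Industry": [1, 1, 2, 2],
    "Sales":    [831, 44, 818, 777],
    "Profit":   [135, 847, 112, 700],
    "Export":   [196, 885, 859, 903],
    "RD":       [30, 780, 306, 858],
})
cols = ["Sales", "Profit", "Export", "RD"]

# Like egen ..., by(Industry Year): keep every firm-year row and
# attach the industry-year mean of each variable
for v in cols:
    df["IndAvg" + v] = df.groupby(["Industry", "Year"])[v].transform("mean")

# Like collapse (mean) ..., by(Industry Year): one row per industry-year
ind_avg = df.groupby(["Industry", "Year"], as_index=False)[cols].mean()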

Related

How can I get the maximum sum of values in an N x M matrix, where each row and each column can be chosen only once?

Example input:
3 4
10 20 30
40 10 30
20 10 0
5 15 5
Example output:
1 3
2 1
4 2
85
30 + 40 + 15 = 85
I need to get the result for a matrix with thousands of rows and columns.
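This is the classic (rectangular) assignment problem. A hedged sketch, assuming Python with SciPy is an acceptable tool: scipy.optimize.linear_sum_assignment implements a Hungarian-style solver that handles rectangular matrices and copes with thousands of rows and columns (roughly O(n^3) for a square n x n matrix).
import numpy as np
from scipy.optimize import linear_sum_assignment

m = np.array([[10, 20, 30],
              [40, 10, 30],
              [20, 10,  0],
              [ 5, 15,  5]])

# maximize=True selects one cell per row/column pair with the largest total
rows, cols = linear_sum_assignment(m, maximize=True)
for r, c in zip(rows, cols):
    print(r + 1, c + 1)          # 1-based (row, column) pairs: 1 3 / 2 1 / 4 2
print(m[rows, cols].sum())       # 85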

How can I get an identification number for each group?

I want the following form:
stnd_y person_id recu_day date sick_sym Admission
2002 100 20020929 02-09-29 A 1
2002 100 20020929 02-09-29 B 1
2002 100 20020929 02-09-29 D 1
2002 100 20020930 02-09-30 B 2
2002 100 20020930 02-09-30 E 2
2002 100 20021002 02-10-02 X 3
2002 100 20021002 02-10-02 W 3
2002 101 20020927 02-09-27 S 1
2002 101 20020927 02-09-27 O 1
2002 101 20020928 02-09-28 C 2
2002 102 20021001 02-10-01 F 1
2002 103 20021003 02-10-03 G 1
2002 104 20021108 02-11-08 H 1
2002 104 20021108 02-11-08 A 1
2002 104 20021112 02-11-12 B 2
proc sort data=a out=a1;
    by person_id recu_fr_dt;
run;

data a3;
    set a1;
    by person_id recu_fr_dt;
    if first.person_id then adm+1;
run;
The above code gives the following result, which is not what I intended:
stnd_y person_id recu_day date sick_sym Admission
2002 100 20020929 02-09-29 A 1
2002 100 20020929 02-09-29 B 2
2002 100 20020929 02-09-29 D 3
2002 100 20020930 02-09-30 B 4
2002 100 20020930 02-09-30 E 5
2002 100 20021002 02-10-02 X 6
2002 100 20021002 02-10-02 W 7
2002 101 20020927 02-09-27 S 1
2002 101 20020927 02-09-27 O 2
2002 101 20020928 02-09-28 C 3
2002 102 20021001 02-10-01 F 1
2002 103 20021003 02-10-03 G 1
2002 104 20021108 02-11-08 H 1
2002 104 20021108 02-11-08 A 2
2002 104 20021112 02-11-12 B 3
I also tried the following with SAS:
proc sort data=old out=new;
    by person_id recu_day;
run;

data new1;
    set new;
    retain admission 0;
    by person_id recu_day;
    if recu_day^=lag(recu_day) and(or) person_id^=lag(person_id) then
        admission+1;
run;
And,
data new1;
    set new;
    by person_id recu_day;
    retain adm 0;
    if first.person_id and(or) first.recu_day then admission=admission+1;
run;
But those do not work. How can I solve this? Thank you! :D
Please visit http://stackoverflow.com/questions/46076468/how-can-i-get-the-identification-number-with-each-groups/
Here's a modification to my answer to your previous question.
This time, it adds 1 to the adm variable each time the day changes for a given person_id. The retain statement ensures that the current value is carried forward to all subsequent rows where person_id and recu_day are the same.
data have;
input stnd_y person_id recu_day date :yymmdd8. sick_sym $ Admission;
datalines;
2002 100 20020929 02-09-29 A 1
2002 100 20020929 02-09-29 B 1
2002 100 20020929 02-09-29 D 1
2002 100 20020930 02-09-30 B 2
2002 100 20020930 02-09-30 E 2
2002 100 20021002 02-10-02 X 3
2002 100 20021002 02-10-02 W 3
2002 101 20020927 02-09-27 S 1
2002 101 20020927 02-09-27 O 1
2002 101 20020928 02-09-28 C 2
2002 102 20021001 02-10-01 F 1
2002 103 20021003 02-10-03 G 1
2002 104 20021108 02-11-08 H 1
2002 104 20021108 02-11-08 A 1
2002 104 20021112 02-11-12 B 2
;
run;
data want;
    set have;
    by person_id recu_day;
    retain adm;
    if first.person_id then adm=0;
    if first.recu_day then adm+1;
run;
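As a cross-check, the same logic in pandas (a sketch; the data below are the first two person_ids from the example) is just a dense rank of recu_day within person_id:
import pandas as pd

# First few rows of the example data
df = pd.DataFrame({
    "person_id": [100, 100, 100, 100, 100, 100, 100, 101, 101, 101],
    "recu_day":  [20020929, 20020929, 20020929, 20020930, 20020930,
                  20021002, 20021002, 20020927, 20020927, 20020928],
})

# Dense rank: 1 for a person's first distinct day, 2 for the second, ...
# -- the same counter as adm in the SAS step above
df["adm"] = (df.groupby("person_id")["recu_day"]
               .rank(method="dense")
               .astype(int))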

Calculate within correlation in panel data long form

We have a simple panel data set in long form, which has the following structure:
i t x
1 Aug-2011 282
2 Aug-2011 -220
1 Sep-2011 334
2 Sep-2011 126
1 Sep-2012 -573
2 Sep-2012 305
1 Nov-2013 335
2 Nov-2013 205
3 Nov-2013 485
I would like to get the cross-correlation between each pair of i's across the time variable t.
This would be possible by converting the data to wide format. Unfortunately, that approach is not feasible due to the large number of i and t values in the real dataset.
Is it possible to do something like this fictional command:
by (tabulate t): corr x
You can easily calculate the correlations of a single variable such as x across panel groups using the reshape option of the community-contributed command pwcorrf:
ssc install pwcorrf
For illustration, consider (a slightly simplified version of) your toy example:
clear
input i t x
1 2011 282
2 2011 -220
1 2012 334
2 2012 126
1 2013 -573
2 2013 305
1 2014 335
2 2014 205
3 2014 485
end
xtset i t
panel variable: i (unbalanced)
time variable: t, 2011 to 2014
delta: 1 unit
pwcorrf x, reshape
Variable(s): x
Panel var: i
corrMatrix[3,3]
             1           2           3
1            1           0           0
2   -.54223207           1           0
3            .           .           1
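For comparison, the equivalent computation in pandas (a sketch using the same toy data; pivot does the wide conversion internally, and pairs with too few overlapping periods come out as missing, like the dots above):
import pandas as pd

df = pd.DataFrame({
    "i": [1, 2, 1, 2, 1, 2, 1, 2, 3],
    "t": [2011, 2011, 2012, 2012, 2013, 2013, 2014, 2014, 2014],
    "x": [282, -220, 334, 126, -573, 305, 335, 205, 485],
})

wide = df.pivot(index="t", columns="i", values="x")  # one column per panel i
print(wide.corr())  # corr(1, 2) is about -0.542; pairs involving i == 3 are NaN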

print string n times, where n = (field in first line) - 1

I have a file like this:
X 7 1 3
X 8 1 4
X 9 1 6
X 13 2 8
X 20 6 11
Y 13 2 8
Y 19 6 10
Y 20 6 11
Basically, if we call column 2 of the first line 'n', I want to add a string at the top n-1 times so that the output is:
X 1 0 0
X 2 0 0
X 3 0 0
X 4 0 0
X 5 0 0
X 6 0 0
X 7 1 3
X 8 1 4
X 9 1 6
X 13 2 8
X 20 6 11
Y 13 2 8
Y 19 6 10
Y 20 6 11
Note that column 1 of line 1 varies. Is there a way to do this in awk?
awk to the rescue!
$ awk 'NR==1{for(i=1;i<$2;i++) print $1, i, 0, 0} 1' file | column -t
On the first line only (NR==1), the loop prints $1, i, 0, 0 for each i from 1 up to $2-1; the bare pattern 1 is always true, so every input line is then printed unchanged. Piping through column -t aligns the columns:
X 1 0 0
X 2 0 0
X 3 0 0
X 4 0 0
X 5 0 0
X 6 0 0
X 7 1 3
X 8 1 4
X 9 1 6
X 13 2 8
X 20 6 11
Y 13 2 8
Y 19 6 10
Y 20 6 11

How to calculate classification error rate

Alright, this question is pretty hard, so I am going to give you an example.
The left numbers are my algorithm's classifications and the right numbers are the original class numbers:
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89
Here my algorithm merged 2 different classes into 1: as you can see, it merged classes 86 and 89 into one class. What would the error be in the above example?
Or here is another example:
203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7
In the above example, the left numbers are my algorithm's classifications and the right numbers are the original class ids. As can be seen, it misclassified 3 products (I am classifying identical commercial products). What would the error rate be in this example, and how would you calculate it?
We have finished the classification but we are not able to find the correct way to calculate the success rate. :D
Here's a longish example: a real confusion matrix with 10 input classes "0" - "9" (handwritten digits) and 10 output clusters labelled A - J.
Confusion matrix for 5620 optdigits:
True 0 - 9 down, clusters A - J across
-----------------------------------------------------
A B C D E F G H I J
-----------------------------------------------------
0: 2 4 1 546 1
1: 71 249 11 1 6 228 5
2: 13 5 64 1 13 1 460
3: 29 2 507 20 5 9
4: 33 483 4 38 5 3 2
5: 1 1 2 58 3 480 13
6: 2 1 2 294 1 1 257
7: 1 5 1 546 6 7
8: 415 15 2 5 3 12 13 87 2
9: 46 72 2 357 35 1 47 2
----------------------------------------------------
580 383 496 1002 307 670 549 557 810 266 estimates in each cluster
y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580 383 496 1002 307 670 549 557 810 266]
For example, cluster A has 580 data points, 415 of which are "8"s;
cluster B has 383 data points, 249 of which are "1"s; and so on.
The problem is that the output classes are scrambled, permuted;
they correspond in this order, with counts:
A B C D E F G H I J
8 1 4 3 6 7 0 5 2 6
415 249 483 507 294 546 546 480 460 257
One could say that the "success rate" is
75 % = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620
but this throws away useful information: here, that E and J both say "6", and that no cluster says "9".
So, add up the biggest numbers in each column of the confusion matrix and divide by the total.
But how should one count overlapping / missing clusters, like the 2 "6"s and no "9"s here?
I don't know of a commonly agreed-upon way (I doubt that the Hungarian algorithm is used in practice).
Bottom line: don't throw away information; look at the whole confusion matrix.
NB: such a "success rate" will be optimistic for new data!
It's customary to split the data into, say, 2/3 "training set" and 1/3 "test set", train e.g. k-means on the 2/3 alone, then measure the confusion / success rate on the test set; this is generally worse than on the training set alone.
Much more can be said; see e.g. Cross-validation.
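If you do want an automatic one-to-one matching of clusters to classes, the Hungarian algorithm mentioned above is easy to try; a minimal sketch with SciPy on a small made-up contingency table (not the optdigits data):
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy contingency table: conf[i, j] = points of true class i in cluster j
conf = np.array([[5,  1, 0],
                 [1, 10, 2],
                 [0,  2, 8]])

# Best one-to-one class -> cluster assignment, maximizing matched points
rows, cols = linear_sum_assignment(conf, maximize=True)
print({int(r): int(c) for r, c in zip(rows, cols)})  # {0: 0, 1: 1, 2: 2}
print(conf[rows, cols].sum() / conf.sum())           # matched fraction ~ 0.79
Unlike summing the per-column maxima, this forces each cluster to answer for exactly one class, so duplicated clusters (the two "6"s) and missing classes (the "9") show up as errors.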
You have to define the error criteria if you want to evaluate the performance of an algorithm, so I'm not sure exactly what you're asking. In some clustering and machine learning algorithms you define the error metric and the algorithm minimizes it.
Take a look at https://en.wikipedia.org/wiki/Confusion_matrix to get some ideas.
You have to define an error metric to measure this yourself. In your case, a simple method would be to find a properties mapping for your products:
p = properties(id)
where id is the product id, and p is likely a vector with one entry per property. Then you can define the error function e (or distance) between two products as
e = d(p1, p2)
Of course, each property must be mapped to a number for this function. This error function can then be used in the classification algorithm and for learning.
In your second example, it seems that you treat the pair (203 7) as a successful classification, so I think you already have a metric in mind. Being more specific might get you a better answer.
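A hypothetical sketch of such a properties mapping and distance function (the property values below are made up; real ones would come from your product data):
import numpy as np

# Made-up numeric property vectors, keyed by the product ids in the question
PROPS = {7: [1.0, 0.20], 16: [0.9, 0.30], 17: [0.1, 0.80], 203: [1.0, 0.25]}

def properties(product_id):
    return np.array(PROPS[product_id])

def d(p1, p2):
    # Euclidean distance as one possible error function e = d(p1, p2)
    return np.linalg.norm(p1 - p2)

print(d(properties(203), properties(7)))  # small value -> similar products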
Classification Error Rate (CER) is 1 - Purity (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
(This is @john-colby's code.)
Or
CER <- function(clusters, classes) {
  1 - sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
