How to create means in panel data for specific years?

I need help with a particular issue in Stata. I have a panel dataset by id and year, from 1996 to 2018.
The panel combines world countries and regions, with yearly observations of area cultivated for 7 different crops.
I would like to create means around the years 2000, 2010 and 2018, so that mean(year2000) = mean of (1999, 2000, 2001), mean(year2010) = mean of (2009, 2010, 2011) and mean(year2018) = mean of (2016, 2017, 2018), for each of my 7 crops.
The problem becomes more complicated when I need to combine some countries into sub-regions: say I need the sub-region RUS1 = Russia + Ukraine. How can I create another variable that holds, for each year, the total of crop1 area cultivated in Russia plus crop1 area cultivated in Ukraine? That is, another variable that shows these sums for each year, using the above means.
I've tried with by id year: egen area_rus1=total(area) if area=="Russia" & area=="Ukraine"
but nothing works.
Because the names in area are strings, I used encode area, gen(area2), and Stata automatically generates a number for each name.
In order to create a panel dataset I've used gen id=area2+itemcode
The panel data looks like this after sort year
Please be aware that the period is 1996-2018. The example above shows only year 1996.

This didn't get much of a response, for several reasons:
You didn't show very much code.
You didn't show data in a form that is especially useful. An image can't be copied and pasted easily into someone's Stata to allow experiment. In fact your image shows variables that are irrelevant and variables that are different versions of each other and so is much more complicated than we need.
You escalated the question to ask the most complicated version of what you want to know.
There is a problem you should have explained better. area is string and so totals can't be calculated at all and area2 is just arbitrary integers so totals can be calculated but don't make sense. "nothing works" is not informative as a problem report. The only totals that make sense to me are totals of value.
You need to simplify your problem first and then build up.
The essence seems to be as follows:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 country str6 item float year str1 region float value
"A" "barley" 1999 "X" 1
"B" "barley" 1999 "X" 2
"C" "barley" 1999 "Y" 3
"A" "barley" 2000 "X" 4
"B" "barley" 2000 "X" 5
"C" "barley" 2000 "Y" 6
"A" "barley" 2001 "X" 7
"B" "barley" 2001 "X" 8
"C" "barley" 2001 "Y" 9
end
* means by countries: similar variables for other periods
egen mean_9901_c = mean(cond(inrange(year, 1999, 2001), value, .)), by(country item)
* aggregation to regions, but ensure that you don't double count
egen value_region = total(value), by(region item year)
egen tag = tag(region item year)
* means by regions: similar variables for other periods
egen mean_9901_r = mean(cond(tag == 1 & inrange(year, 1999, 2001), value_region, .)), by(region item)
list, sepby(year)
     +---------------------------------------------------------------------------------+
     | country     item   year   region   value   mean_9~c   value_~n   tag   mean_9~r |
     |---------------------------------------------------------------------------------|
  1. |       A   barley   1999        X       1          4          3     1          9 |
  2. |       B   barley   1999        X       2          5          3     0          9 |
  3. |       C   barley   1999        Y       3          6          3     1          6 |
     |---------------------------------------------------------------------------------|
  4. |       A   barley   2000        X       4          4          9     1          9 |
  5. |       B   barley   2000        X       5          5          9     0          9 |
  6. |       C   barley   2000        Y       6          6          6     1          6 |
     |---------------------------------------------------------------------------------|
  7. |       A   barley   2001        X       7          4         15     1          9 |
  8. |       B   barley   2001        X       8          5         15     0          9 |
  9. |       C   barley   2001        Y       9          6          9     1          6 |
     +---------------------------------------------------------------------------------+
The example shows just one item, but the code should work for several.
The example shows fake data for just three years, but means for other periods can be constructed similarly.
Results are repeated for all observations to which they apply. To see or use results just once, use if. For example the means over 1999 to 2001 are shown for each of those years (and others) but if year == 1999 would be a way to see results just once.
See also help collapse, help egen for its tag() function and this paper.
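As a cross-check of the arithmetic outside Stata, here is a minimal Python sketch of the same two steps on the same fake data. The function name window_means and the tuple layout are illustrative assumptions, not part of the Stata code above. It builds one total per region, item and year first, so each region-year counts once in the regional mean, which is the job that tag() does in the Stata version.

```python
from collections import defaultdict
from statistics import mean

# Same fake data as the Stata example: (country, item, year, region, value)
rows = [
    ("A", "barley", 1999, "X", 1), ("B", "barley", 1999, "X", 2), ("C", "barley", 1999, "Y", 3),
    ("A", "barley", 2000, "X", 4), ("B", "barley", 2000, "X", 5), ("C", "barley", 2000, "Y", 6),
    ("A", "barley", 2001, "X", 7), ("B", "barley", 2001, "X", 8), ("C", "barley", 2001, "Y", 9),
]

def window_means(rows, first, last):
    """Return (country means, region means) of value over years first..last."""
    by_country = defaultdict(list)
    region_year_total = defaultdict(float)
    for country, item, year, region, value in rows:
        if first <= year <= last:
            by_country[(country, item)].append(value)
            # one running total per region, item and year
            region_year_total[(region, item, year)] += value
    by_region = defaultdict(list)
    for (region, item, _year), total in region_year_total.items():
        # each region-year total enters the regional mean exactly once
        by_region[(region, item)].append(total)
    country_means = {key: mean(vals) for key, vals in by_country.items()}
    region_means = {key: mean(vals) for key, vals in by_region.items()}
    return country_means, region_means

country_means, region_means = window_means(rows, 1999, 2001)
print(country_means)
print(region_means)
```

Calling window_means again with other year limits gives the means for other periods.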
What was wrong with your code
Your problems start with
if area=="Russia" & area=="Ukraine"
which selects observations for which it is true that area is both "Russia" and "Ukraine" in the same observation, which is impossible. You need the | (or) operator there, not the & operator, or to approach the problem in another way.
The by prefix is wrong too. Using by id year: enforces separate calculations for each combination of id and year, and so makes it impossible to combine observations with different identifiers.

Related

gnuplot time axis from two different columns

I'm trying to plot some data from a four-column file. The first column is the record number, the second is the year, the third is the month and the last is the temperature value. I would like my x axis to take its date from the second and third columns.
The text file looks like this:
1 1990 2 265.78945923
2 1990 3 260.53842163
3 1990 4 265.00366211
4 1990 5 277.61206055
5 1990 6 284.72595215
6 1990 7 291.54879761
7 1990 8 293.61392212
8 1990 9 288.47149658
9 1990 10 284.55172729
12 1991 1 285.98762388
13 1991 2 283.47484293
I'm using a code like this:
set xdata time
set timefmt '%Y %m'
plot 'datafile' u 2:4
But it doesn't work. I would like to have the year and the month on my x axis.
All help appreciated! Thanks

Dropping people in Stata from a panel based on their situation in multiple years

I have an unbalanced panel of 7 years with every person interviewed 4 times and I want to drop all the people that reported that they were unemployed/inactive in all 4 periods. However, I do not want to drop the observations of the people that may have been out of the labour market for 1, 2 or 3 out of the 4 periods they were interviewed. How do I tell Stata to drop people based on their situation in multiple years (t to t-3)? When I do drop if ecostatus>3, for example, Stata drops observations that I need, i.e. the people that were inactive for less than the full period of the survey.
// create some example data
clear
input id t unemp
1 1 1
1 2 1
1 3 1
1 4 1
2 1 1
2 2 0
2 3 1
2 4 1
end
// create the total number of unemployment spells
bys id : egen totunemp = total(unemp)
// display the data
sort id t
list, sepby(id)
// keep those observations with at least one
// employment spell
keep if totunemp < 4
// display the data
list

How to randomize across categories holding the mean equal?

I am looking for some conceptual input, detached from any specific platform/software, on the following problem:
Let R be an Nx2 matrix with the first column denoting the object ID and the second column the category (e.g. from 1 to 10).
ID | Category
1 | 1
2 | 1
3 | 1
4 | 2
5 | 2
6 | 3
7 | 3
8 | 3
9 | 3
. | .
. | .
Further, assume we have a matrix C which assigns a number to each category, e.g.:
Category | Number
1 | 0.5
2 | 0.2
3 | 0.9
. | .
. | .
So for each object in matrix R a number can be mapped according to matrix C (e.g. for ID=1 with category=1, the number according to matrix C is 0.5).
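To make the mapping concrete, here is a minimal Python sketch. It uses only the rows shown in the two tables above (the elided rows are left out), and the names mapped_numbers and overall_average are illustrative assumptions. The overall average it computes is the quantity that the randomization is meant to keep constant.

```python
from statistics import mean

# Only the rows shown above; the elided rows are not guessed at.
R = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3)]  # (ID, category)
C = {1: 0.5, 2: 0.2, 3: 0.9}  # category -> number

def mapped_numbers(R, C):
    """Map each object ID to the number of its category."""
    return {obj_id: C[category] for obj_id, category in R}

def overall_average(R, C):
    """Average of the mapped numbers over all objects."""
    return mean(C[category] for _obj_id, category in R)

numbers = mapped_numbers(R, C)
average = overall_average(R, C)
print(numbers[1], average)
```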
The goal now is to create an algorithm which randomizes the objects across a pre-specified category range while holding constant the overall average of the number column (the number mapped to each object's category).
E.g. assume that the category range is defined as 2, meaning that each object from category 1 can either stay in category 1, be shifted randomly to category 2, or even move up to category 3. Similarly, an object from category 3 with a selected category range of 1 can either be moved down to category 2, stay in category 3 or move up to category 4. If an object is shifted to another category, it gets assigned a new number according to matrix C, which changes the overall average of the number column.
However, all swaps have to be executed on a purely random basis, with the additional constraint that the average of the number column after the randomization is equal to the one at the beginning.
Any input would be greatly appreciated.

Relative quality of sorted array

I have 2 sorting algorithms that produce different results (I sort items by relevance). Both return the same items, but in a different order. I know that the first algorithm gives better results than the second. I want a relative value (from 0 to 1) that means "the first N values of array2 have 0.73 of the quality of the first N values of array1" (I compare the first elements because the user sees them without taking any action).
The first thing that comes to mind is to use the sum of differences between positions in array1 and array2.
For example:
array1: 1 2 3 4 | 5 6 7 8 9
array2: 8 6 2 3 | 7 4 1 5 9 - positions in array1
array2*: 5 5 2 3 | (positions greater than 4 are replaced with 5, to keep the relative value in the range 0..1)
I want to compare the first 4 elements:
S = 1 + 2 + 3 + 4 = 10 - sum of the reference positions, which is also the maximum deviation
D = |1 - 5| + |2 - 5| + |3 - 2| + |4 - 3| = 9 - this is the absolute deviation
To calculate relative quality I use the following formula: (S - D)/S = 0.1.
Are there any standard algorithms for this? What are the disadvantages of this algorithm?
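The formula from the question can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the function name and argument layout are assumptions.

```python
def relative_quality(positions_in_reference, n):
    """(S - D) / S for the first n items.

    positions_in_reference[i] is the 1-based position in the reference
    ordering (array1) of the item found at position i + 1 of array2.
    """
    # positions beyond the first n are capped at n + 1
    capped = [min(p, n + 1) for p in positions_in_reference[:n]]
    s = sum(range(1, n + 1))  # sum of reference positions 1..n
    d = sum(abs(i - p) for i, p in enumerate(capped, start=1))  # absolute deviation
    return (s - d) / s

array2_positions = [8, 6, 2, 3, 7, 4, 1, 5, 9]
quality = relative_quality(array2_positions, 4)
print(quality)
```

With the example positions and n = 4 this reproduces S = 10 and D = 9 from the question.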
What you are looking for is probably DCG [Discounted Cumulative Gain] and nDCG [normalized DCG], which are used to evaluate rankings by relevance.
This assumes one list [let it be list2] is the baseline, the "absolute truth", and list1 should be as close to it as possible.
The idea is that the first element being out of order matters more than the 10th element being out of order.
The solution is described in more detail, with an example, in my answer in this post [sorry for the self-advertising, it just seems to fit well here]. The basic idea is to evaluate:
DCG(list1)/DCG(list2)
where the relevance of each element is derived from its position i in list2 itself, for example: rel_i = 1/log(1+i)
Notes:
Of course DCG can be calculated on only the relevant first n elements and not on the entire list.
This solution yields a result of 1 if list1 == list2.
This solution assumes that only where elements appear matters, and not their numerical value. It completely disregards the numerical value of the elements.
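A minimal Python sketch of the ratio DCG(list1)/DCG(list2) follows. It assumes the common log2(position + 1) discount for DCG and takes rel_i = 1/log(1+i) with the natural logarithm; neither choice is fixed by the text above, and the function names are illustrative. The example lists reuse the data from the question, with array1 as the baseline.

```python
import math

def relevance_from_baseline(baseline):
    """rel_i = 1 / log(1 + i), with i the 1-based position in the baseline."""
    return {item: 1.0 / math.log(1 + i) for i, item in enumerate(baseline, start=1)}

def dcg(ranking, relevance, n):
    """DCG over the first n positions, using a log2(position + 1) discount."""
    return sum(
        relevance.get(item, 0.0) / math.log2(position + 1)
        for position, item in enumerate(ranking[:n], start=1)
    )

def relative_quality(candidate, baseline, n):
    """DCG(candidate) / DCG(baseline), both over the first n positions."""
    relevance = relevance_from_baseline(baseline)
    return dcg(candidate, relevance, n) / dcg(baseline, relevance, n)

baseline = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # the list treated as "absolute truth"
candidate = [8, 6, 2, 3, 7, 4, 1, 5, 9]  # the list being judged
quality = relative_quality(candidate, baseline, 4)
print(quality)
```

Because the baseline lists its items in order of decreasing relevance, no other ordering can score a higher DCG, so the ratio stays between 0 and 1.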

User submitted rankings

I was looking to have members submit their top-10 list of something, or their top 10 rankings, then have some algorithm combine the results. Is there something out there like that?
Thanks!
Ahhhh, that's open-ended alright. Let's consider a simple case where only two people vote:
1 ALPHA
2 BRAVO
3 CHARLIE
1 ALPHA
2 DELTA
3 BRAVO
We can't go purely by count... ALPHA should obviously win, though it has the same votes as BRAVO. Yet, we must avoid a case where just a few first place votes dominate a massive amount of 10th place votes. To do this, I suggest the following:
$score = log($num_of_answers - $rank + 2)
First place would then be worth just a bit over one point, and tenth place would get .3 points. That logarithmic scaling prevents ridiculous dominance, yet still gives weight to rankings. From those example votes (and assuming they were the top 3 of a list of 10), you would get:
ALPHA: 2.08
BRAVO: 1.95
DELTA: 1.00
CHARLIE: .95
Why? Well, that's subjective. I feel out of a very long list that 4,000 10th place votes is worth more than 1,000 1st place votes. You may scale it differently by changing the base of your log (natural, 2, etc.), or choose a different system.
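The scoring rule can be sketched in Python as follows. The base-10 logarithm is an assumption, chosen because it matches "a bit over one point" for first place and about .3 points for tenth place in a list of 10; the function name log_scores is illustrative.

```python
import math
from collections import defaultdict

def log_scores(ballots, num_of_answers):
    """Sum log10(num_of_answers - rank + 2) over every ballot an item appears on."""
    scores = defaultdict(float)
    for ballot in ballots:
        for rank, item in enumerate(ballot, start=1):
            scores[item] += math.log10(num_of_answers - rank + 2)
    return dict(scores)

# The two example votes, treated as the top 3 of a list of 10
ballots = [
    ["ALPHA", "BRAVO", "CHARLIE"],
    ["ALPHA", "DELTA", "BRAVO"],
]
scores = log_scores(ballots, 10)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Swapping math.log10 for math.log or math.log2 changes how strongly high ranks are rewarded relative to low ones.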
You could just add up, for each item, the rank positions given to it by every user and then sort the items by that total, lowest first.
i.e.:
A = (a,b,c)
B = (a,c,b)
C = (b,a,c)
D = (c,b,a)
E = (a,c,b)
F = (c,a,b)
a = 1 + 1 + 2 + 3 + 1 + 2 = 10
b = 2 + 3 + 1 + 2 + 3 + 3 = 14
c = 3 + 2 + 3 + 1 + 2 + 1 = 12
Thus,
a
c
b
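The rank-sum approach above can be sketched in Python like this. It assumes every ballot ranks every item; with partial ballots an unranked item would end up with an unfairly low total. The function name and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict

def rank_sum_order(ballots):
    """Return (totals, order): summed rank positions and items sorted by them."""
    totals = defaultdict(int)
    for ballot in ballots:
        for position, item in enumerate(ballot, start=1):
            totals[item] += position
    # lowest total first; ties broken alphabetically
    order = sorted(totals, key=lambda item: (totals[item], item))
    return dict(totals), order

ballots = [
    ("a", "b", "c"),  # A
    ("a", "c", "b"),  # B
    ("b", "a", "c"),  # C
    ("c", "b", "a"),  # D
    ("a", "c", "b"),  # E
    ("c", "a", "b"),  # F
]
totals, order = rank_sum_order(ballots)
print(totals, order)
```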
I think you could solve this problem by using a max flow algorithm, to create an aggregate ranking, assuming the following:
Each unique item from the list of items is a node in a graph. E.g. if there are 10 things to vote on, there are 10 nodes.
An edge goes from node a to node b if a is immediately before b in a single user-submitted ranking.
The last node created from a single user-submitted ranking will have an edge pointing to the sink.
The first node created from a single user-submitted ranking will have an incoming edge from the source.
This should get you an aggregated top-10 list.
