Is there a way to generate lags in a panel without collapsing the data?

I have a dataset that looks like this
State Year Policy other_variables
a 2000 0 18
a 2000 0 19
.
.
.
a 2001 1 86
a 2001 1 23
The policy value is constant within each state and year, but it differs across states and across years. The other_variables differ for each observation.
I want to generate lags of the policy value for each state. However, I cannot use xtset state year and then the L operator, because there are repeated values within each state-year combination. I know that collapsing the dataset, generating the lag variables and then merging back into the dataset would work. Is there an easier way to do this?

This may help:
clear
input str1 State Year Policy
a 2000 0
a 2000 0
a 2001 1
a 2001 1
end
bysort State (Year) : gen diff = Policy - Policy[_n-1] if Year == Year[_n-1] + 1
by State Year: replace diff = diff[_n-1] if missing(diff)
list, sepby(State Year)
     +------------------------------+
     | State   Year   Policy   diff |
     |------------------------------|
  1. |     a   2000        0      . |
  2. |     a   2000        0      . |
     |------------------------------|
  3. |     a   2001        1      1 |
  4. |     a   2001        1      1 |
     +------------------------------+
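The code above yields the year-on-year difference in Policy. If the lagged value itself is wanted, the same "look at the previous year within the state" idea applies. As a rough illustration only (written in Python rather than Stata, with hypothetical variable names), a minimal sketch that attaches last year's Policy to every row without collapsing the data:

```python
# Hypothetical rows mirroring the example data: (State, Year, Policy).
rows = [
    ("a", 2000, 0),
    ("a", 2000, 0),
    ("a", 2001, 1),
    ("a", 2001, 1),
]

# Policy is constant within each State-Year pair, so one entry per pair is enough.
policy_by_state_year = {(state, year): policy for state, year, policy in rows}

# Attach the previous year's Policy to every row.
# None plays the role of a missing value when there is no previous year.
lagged = [
    (state, year, policy, policy_by_state_year.get((state, year - 1)))
    for state, year, policy in rows
]

for row in lagged:
    print(row)
```

Every original row is kept; only a lookup table with one entry per state-year is built on the side, which is what the collapse-and-merge route does implicitly.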


Finding tuples if it only exists in all occurrences of a constraint

Database (all entries are integers):
ID | BUDGET
1 | 20
8 | 20
10 | 20
5 | 4
9 | 4
10 | 4
1 | 11
9 | 11
Suppose my constraint is having a budget of >= 10.
I would want to return ID of 1 only in this case. How do I go about it?
I've tried taking the cross product of itself after selecting budget >= 10 and returning if id1 = id2 and budget1 <> budget2 but that does not work in the case where there's only 1 budget that is >= 10. (EG below)
ID | BUDGET
1 | 20
8 | 20
10 | 20
1 | 4
5 | 4
9 | 4
10 | 4
9 | 4
If I were to do what I did for the first example, nothing will be returned as budget1 <> budget2 will result in an empty table.
EDIT1: I can only use relational algebra to solve the problem, so SQL's exist, where and count keywords can't be used.
Edit2: Only project, select, rename, set difference, set union, left join, right join, full inner join, natural joins, set intersection and cross product allowed
The question is not completely clear to me. If you want to return all the IDs for which there is a budget of at least 10 and no budget below 10, the expression is simply the following:
π(ID)(σ(BUDGET>=10)(R)) - π(ID)(σ(BUDGET<10)(R))
If, on the other hand, you want all the IDs that are paired with every budget in the relation that is at least 10, then we must use the ÷ operator:
R ÷ π(BUDGET)(σ(BUDGET>=10)(R))
From your comment, the second case is the correct one. Let’s see how to compute the division from its definition (applied to two generic relations R(A) and S(B)):
R ÷ S = πA-B(R) - πA-B((πA-B(R) x S) - R)
where R is the original relation, and
S = π(BUDGET)(σ(BUDGET>=10)(R)),
that is:
BUDGET
------
20
11
Starting from the inner expression:
πA-B(R) is equal to πID(R) =
ID
--
1
5
8
9
10
then (πA-B(R) x S) is:
ID BUDGET
---------
1 20
1 11
5 20
5 11
8 20
8 11
9 20
9 11
10 20
10 11
then ((πA-B(R) x S) - R) is:
ID BUDGET
---------
5 20
5 11
8 11
9 20
10 11
then πA-B((πA-B(R) x S) - R) is:
ID
--
5
8
9
10
and, finally, subtracting this relation from πA-B(R) we obtain the result:
ID
--
1
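To check the derivation mechanically, the division formula can be replayed with ordinary set operations. This is only a sketch in Python (not relational algebra itself); the function name divide and the tuple layout (ID, BUDGET) are assumptions made for illustration:

```python
# Relation R from the first example in the question, as (ID, BUDGET) tuples.
R = {(1, 20), (8, 20), (10, 20), (5, 4), (9, 4), (10, 4), (1, 11), (9, 11)}


def divide(r, s):
    """Return the IDs in r that are paired with every budget in s."""
    ids = {i for i, _ in r}                       # projection of R on ID
    all_pairs = {(i, b) for i in ids for b in s}  # cross product with S
    missing = all_pairs - r                       # pairs that R does not contain
    disqualified = {i for i, _ in missing}        # projection of those on ID
    return ids - disqualified                     # final set difference


# S: the budgets that satisfy the constraint BUDGET >= 10.
S = {b for _, b in R if b >= 10}
result = divide(R, S)

print(sorted(S))       # budgets every returned ID must have
print(sorted(result))  # IDs that have all of them
```

Only projection, selection, cross product and set difference are used, which matches the operators the question allows.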

How to create means in panel data for specific years?

I need help with a particular issue in Stata. I have a panel dataset, identified by id and year, running from 1996 to 2018.
The panel combines world countries and regions, with yearly observations of area cultivated for 7 different crops.
I would like to create means around the years 2000, 2010 and 2018, so that mean(year2000) = mean of (1999, 2000, 2001), mean(year2010) = mean of (2009, 2010, 2011) and mean(year2018) = mean of (2016, 2017, 2018), for each of my 7 crops.
The problem gets more complicated when I need to combine some countries to form sub-regions: say I need the sub-region RUS1 = Russia + Ukraine. How can I create another variable that holds, for each year, the total of crop1 area cultivated in Russia plus crop1 area cultivated in Ukraine? That is, another variable that shows these sums for each year, to which the above means can then be applied.
I've tried with by id year: egen area_rus1=total(area) if area=="Russia" & area=="Ukraine"
but nothing works.
Because the names in area are strings, I used encode (area), gen (area2), and Stata automatically generates a number for each.
To create a panel dataset I used gen id=area2+itemcode
The panel data looks like this after sort year
Please be aware that the period is 1996-2018. The example above shows only year 1996.
This didn't get much of a response, for several reasons:
You didn't show very much code.
You didn't show data in a form that is especially useful. An image can't be copied and pasted easily into someone's Stata to allow experiment. In fact your image shows variables that are irrelevant and variables that are different versions of each other and so is much more complicated than we need.
You escalated the question to ask the most complicated version of what you want to know.
There is a problem you should have explained better. area is string and so totals can't be calculated at all and area2 is just arbitrary integers so totals can be calculated but don't make sense. "nothing works" is not informative as a problem report. The only totals that make sense to me are totals of value.
You need to simplify your problem first and then build up.
The essence seems to be as follows:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 country str6 item float year str1 region float value
"A" "barley" 1999 "X" 1
"B" "barley" 1999 "X" 2
"C" "barley" 1999 "Y" 3
"A" "barley" 2000 "X" 4
"B" "barley" 2000 "X" 5
"C" "barley" 2000 "Y" 6
"A" "barley" 2001 "X" 7
"B" "barley" 2001 "X" 8
"C" "barley" 2001 "Y" 9
end
* means by countries: similar variables for other periods
egen mean_9901_c = mean(cond(inrange(year, 1999, 2001), value, .)), by(country item)
* aggregation to regions, but ensure that you don't double count
egen value_region = total(value), by(region item year)
egen tag = tag(region item year)
* means by regions: similar variables for other periods
egen mean_9901_r = mean(cond(tag == 1 & inrange(year, 1999, 2001), value_region, .)), by(region item)
list, sepby(year)
     +---------------------------------------------------------------------------------+
     | country     item   year   region   value   mean_9~c   value_~n   tag   mean_9~r |
     |---------------------------------------------------------------------------------|
  1. |       A   barley   1999        X       1          4          3     1          9 |
  2. |       B   barley   1999        X       2          5          3     0          9 |
  3. |       C   barley   1999        Y       3          6          3     1          6 |
     |---------------------------------------------------------------------------------|
  4. |       A   barley   2000        X       4          4          9     1          9 |
  5. |       B   barley   2000        X       5          5          9     0          9 |
  6. |       C   barley   2000        Y       6          6          6     1          6 |
     |---------------------------------------------------------------------------------|
  7. |       A   barley   2001        X       7          4         15     1          9 |
  8. |       B   barley   2001        X       8          5         15     0          9 |
  9. |       C   barley   2001        Y       9          6          9     1          6 |
     +---------------------------------------------------------------------------------+
The example shows just one item, but the code should work for several.
The example shows fake data for just three years, but means for other periods can be constructed similarly.
Results are repeated for all observations to which they apply. To see or use results just once, use if. For example the means over 1999 to 2001 are shown for each of those years (and others) but if year == 1999 would be a way to see results just once.
See also help collapse, help egen for its tag() function and this paper.
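For readers who want to cross-check the numbers in the listing, the two steps (window means per country, then region totals per year averaged once per region-year) can be replayed outside Stata. The following is a minimal sketch in Python using the same fake data; the variable names are hypothetical and it is not a substitute for the egen code:

```python
from collections import defaultdict
from statistics import mean

# Same fake data as the Stata example: (country, item, year, region, value).
rows = [
    ("A", "barley", 1999, "X", 1), ("B", "barley", 1999, "X", 2), ("C", "barley", 1999, "Y", 3),
    ("A", "barley", 2000, "X", 4), ("B", "barley", 2000, "X", 5), ("C", "barley", 2000, "Y", 6),
    ("A", "barley", 2001, "X", 7), ("B", "barley", 2001, "X", 8), ("C", "barley", 2001, "Y", 9),
]

# Country means over 1999-2001, one list of values per (country, item).
values_by_country = defaultdict(list)
for country, item, year, region, value in rows:
    if 1999 <= year <= 2001:
        values_by_country[(country, item)].append(value)
mean_9901_c = {key: mean(vals) for key, vals in values_by_country.items()}

# Region totals per year: summing first avoids double counting later.
value_region = defaultdict(int)
for country, item, year, region, value in rows:
    value_region[(region, item, year)] += value

# Region means over 1999-2001, using each region-year total exactly once.
totals_by_region = defaultdict(list)
for (region, item, year), total in value_region.items():
    if 1999 <= year <= 2001:
        totals_by_region[(region, item)].append(total)
mean_9901_r = {key: mean(vals) for key, vals in totals_by_region.items()}

print(mean_9901_c)
print(mean_9901_r)
```

Keying the totals by (region, item, year) does the job that tag() does in the Stata code: each region-year contributes one value to the regional mean.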
What was wrong with your code
Your problems start with
if area=="Russia" & area=="Ukraine"
which selects observations for which it is true that area is both "Russia" and "Ukraine" in the same observation, which is impossible. You need the | (or) operator there, not the & operator, or to approach the problem in another way.
The prefix id is wrong too. Using by id: enforces separate calculations for different values of id and is going to make the combinations of identifiers impossible.

Using Arrays to Calculate Previous and Next Values

Is there a way I can use ClickHouse (arrays?) to calculate sequential values that depend on previously calculated values?
For example:
On day 1, I start with 0, consume 5 and add 100, ending up with 0 - 5 + 100 = 95.
Day 2 starts with what I ended up with on day 1, which is 95. I consume 10 and add 5, ending up with 95 - 10 + 5 = 90 (which will be the start for day 3).
Given
ConsumeArray [5,10,25]
AddArray [100,5,10]
Calculate EndingPosition (= StartingPosition for the next day)
                                                 | Day1  Day2  Day3
--------------------------------------------------------------------------------
StartingPosition (a) = previous EndingPosition   |    0    95    90   (calculate)
Consumed (b)                                     |    5    10    25
Added (c)                                        |  100     5    10
EndingPosition (d) = a - b + c                   |   95    90    75   (calculate)
Just finish all the add/consume operations first and then do an accumulation.
WITH [5,10,25] as ConsumeArray,
[100,5,10] as AddArray
SELECT
arrayCumSum(arrayMap((c, a) -> a - c, ConsumeArray, AddArray));
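The reason this works is that each ending position is just the running total of the daily net changes (added minus consumed), so no value needs the previous result fed back in explicitly. A minimal sketch of the same computation in Python, with hypothetical variable names, reproduces the table from the question:

```python
from itertools import accumulate

consume = [5, 10, 25]
add = [100, 5, 10]

# Net change per day: what was added minus what was consumed.
net = [a - c for c, a in zip(consume, add)]

# Ending position per day is the cumulative sum of the net changes.
ending = list(accumulate(net))

# Each day starts where the previous day ended; day 1 starts at 0.
starting = [0] + ending[:-1]

print(starting)
print(ending)
```

The starting positions are the ending positions shifted by one day, so only the cumulative sum has to be computed.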

How to convert a positive duration into a negative in Google Spreadsheet

I have to deal with time and duration in Google Spreadsheet App and I have to calculate with negative duration.
Problem:
--------------------------------------------------------
Begin | End | Duration | calculated in negative (for some reasons)
--------------------------------------------------------
08:00 | 14:00 | 06:00 | no
10:00 | 15:00 | 05:00 | yes
If the columns 'Begin' and 'End' are formatted as "Time", the difference can easily be calculated in the duration column. However, converting the duration value into a negative one with a simple formula like (end-begin)*(-1) does not seem to be supported.
First solution:
With the following formula I achieved one goal:
[duration = end - begin]
((HOUR(duration)*60) + MINUTE(duration))*(-1)
I had to convert the duration into minutes, multiply with -1 to convert the number into negative. But this leads to a strange behavior:
--------------------------------------------------------
Begin | End | Duration | calculated in negative (for some reasons)
--------------------------------------------------------
08:00 | 14:00 | 06:00 | no
10:00 | 15:00 | -7200:00:00 | yes
So I tried dividing it by 24, 60 and 3600, but nothing seemed to fit, until I used the magic number 1440.
This number is exactly 24 times 60.
Final solution:
[duration = end - begin]
(((HOUR(duration)*60) + MINUTE(duration))*(-1))/1440
My questions are:
Does anyone know why to use the number 1440?
Is there another way to solve this problem?
Google Sheets treats dates and times as serial numbers (the same as Excel does):
today() is 42 458;
tomorrow = today() + 1 = 42 459;
each day counts as one.
A time is a fraction of a day, a number between 0 and 1. One day holds 24 hours, and each hour holds 60 minutes. Therefore one day contains:
in minutes: 24 * 60 = 1440;
in seconds: 24 * 60 * 60 = 86 400.
Dividing a number of minutes by 1440 converts it back into a fraction of a day, which is what a duration cell expects.
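The arithmetic can be traced by hand. The sketch below (Python, not a Sheets formula; all names are hypothetical) models a time as a fraction of a day and shows both why the undivided result displayed as -7200:00:00 and why dividing by 1440 repairs it:

```python
MINUTES_PER_DAY = 24 * 60  # 1440

# Times as fractions of a day, the way the spreadsheet stores them.
begin = 10 / 24  # 10:00
end = 15 / 24    # 15:00
duration = end - begin

# Whole minutes in the duration, then negated (the first formula in the question).
minutes = round(duration * MINUTES_PER_DAY)
negative_minutes = -minutes

# Shown in a duration cell, that number is read as DAYS, not minutes,
# so it is displayed as this many hours.
hours_if_read_as_days = negative_minutes * 24

# Dividing by 1440 turns minutes back into a fraction of a day.
negative_duration = negative_minutes / MINUTES_PER_DAY

print(minutes, hours_if_read_as_days, negative_duration)
```

So -300 interpreted as days is -7200 hours, which is the strange value in the question, while -300 / 1440 is minus five hours expressed as a fraction of a day.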

Store values from a variable and reuse them

This is a question that could help me to solve another, still unsolved question I posted. Basically I need to condition a dataset in Stata, and I thought of a procedure that would first store certain values of a variable in a sort of matrix and then compare the values of another variable with those stored in the matrix. A simple example could be the following:
obs id act1 act2 year act1year
1 1 0 1 2000 0
2 1 1 0 2001 2001
3 1 0 1 2004 0
4 2 1 0 2001 2001
5 2 1 0 2002 2002
6 2 0 1 2004 0
The code should save in the matrix, by(id), the values of act1year different from 0 (for group 1, this is 2001). Then, for observations for which act2 is 1 (obs i = 1, 3), it should check whether the stored value falls in the range [year(i)-2, year(i)]. In this case the range does not contain the value stored in the matrix, so the observation will be dropped. For group id 2 the code should store [2001, 2002] and then check whether the range [year(6)-2, year(6)] contains any of the values stored in the matrix.
I hope my question is clear enough! Apologies for not posting any attempt, but I really have no idea how to do this.
Both this question and the previous discussion are difficult for me to understand, so let me suggest the following as a starting point to a solution that identifies observations for which either (a) act1 occurs or (b) act2 occurs no more than 2 years after the most recent act1 occurrence.
clear
input id act1 act2 year
1 0 1 2000
1 1 0 2001
1 0 1 2004
2 1 0 2001
2 1 0 2002
2 0 1 2004
end
generate a1yr = 0
replace a1yr = year if act1==1
generate act1r = -act1
bysort id (year act1r): replace a1yr=a1yr[_n-1] if a1yr==0 & _n>1
generate tokeep = 0
replace tokeep = 1 if act1==1
replace tokeep = 1 if act2==1 & year-a1yr<=2
list, clean noobs
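The logic of the code above is a carry-forward: within each id, remember the year of the most recent act1 and compare each act2 year against it. As an illustration only, here is a minimal Python sketch of that same rule on the same data; the function name flag_rows is hypothetical, and None stands in for the 0 that the Stata code uses when no act1 has occurred yet:

```python
# Same data as the Stata example: (id, act1, act2, year).
rows = [
    (1, 0, 1, 2000),
    (1, 1, 0, 2001),
    (1, 0, 1, 2004),
    (2, 1, 0, 2001),
    (2, 1, 0, 2002),
    (2, 0, 1, 2004),
]


def flag_rows(data):
    """Append tokeep: 1 for act1 rows, or act2 rows within 2 years of the latest act1."""
    # Sort by id and year, with act1 rows first inside a year (mirrors act1r = -act1).
    ordered = sorted(data, key=lambda r: (r[0], r[3], -r[1]))
    last_act1_year = {}  # most recent act1 year seen so far, per id
    flagged = []
    for id_, act1, act2, year in ordered:
        if act1 == 1:
            last_act1_year[id_] = year
        recent = last_act1_year.get(id_)
        keep = act1 == 1 or (act2 == 1 and recent is not None and year - recent <= 2)
        flagged.append((id_, act1, act2, year, int(keep)))
    return flagged


for row in flag_rows(rows):
    print(row)
```

On this data the two act2 observations for id 1 are not flagged, while the act2 observation for id 2 is, because 2004 is within two years of 2002.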
Looking at the previous discussion, as it now stands, suggests substituting the following data into the code above and seeing if the code then meets the needs of that discussion.
input obsno id act1 act2 year
1 1 1 0 2000
2 1 0 1 2001
3 1 0 1 2002
4 1 0 1 2002
5 1 0 1 2003
6 2 1 0 2000
7 2 1 0 2001
8 2 0 1 2002
9 2 0 1 2002
10 2 0 1 2003
end
