Shuffle One Variable Within Group - random

This question is an extension of the excellent answer provided by Robert Picard here: How to Randomly Assign to Groups of Different Sizes
We have this dataset, which is the same as in the previous question, but adds the year variable:
sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
gen year=20
replace year=30 if _n>15
replace year=40 if _n>35
If I just wanted to re-randomly assign reg's across all observations (without regard to group), I could implement the answer to the previous post:
tempfile orig
save `orig'
keep reg
rename reg reg_new
set seed 234
gen double u = runiform()
sort u reg_new
merge 1:1 _n using `orig', nogen
How would the code be modified so that reg is shuffled, but only within year? For example, there are 15 observations where year==20. These observations should be shuffled separately than the other years.

Shuffling one variable doesn't require any file choreography. This can probably be shortened:
sysuse auto, clear
set seed 2803
gen double shuffle = runiform()
* example 1
sort shuffle
gen long which = _n
sort mpg
gen mpg_new = mpg[which]
list which mpg*
* example 2
bysort foreign (shuffle) : gen long which2 = _n
bysort foreign (mpg) : gen mpg2 = mpg[which2]
list which2 mpg mpg2, sepby(foreign)
All that said, I think sample does this, so long as you specify a sample size equal to the number of observations in the dataset. It's overkill because you get all the variables.
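For readers outside Stata, the within-group shuffle idea can be sketched in Python with pandas. This is a toy illustration with made-up data (the column names group and value are hypothetical), not the original dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2803)  # seed for reproducibility

# Toy data standing in for the Stata dataset: a grouping variable
# and a value to be shuffled only within its own group.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1, 2, 3, 10, 20],
})

# Permute 'value' independently within each level of 'group',
# mirroring the bysort group (shuffle) approach above.
df["value_shuffled"] = (
    df.groupby("group")["value"]
      .transform(lambda s: rng.permutation(s.to_numpy()))
)
```

Within each group, the shuffled column is a permutation of the original values; no value ever crosses a group boundary.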

Related

How to export from data using esttab or estout

Say that I have these data:
sysuse auto2, clear
gen name = substr(make, 1,3)
drop if missing(rep78)
gen n = 1
collapse (mean) mpg (sum) n, by(name)
replace name = "a b: c" if _n==1
I would like to export them to an .rtf (.tex, etc.) file directly from the data using esttab or estout. Is this possible? The key reason I want to do this is that I want to preserve the spaces in the row names. It would also be nice to have the option of comma separators for thousands.
One partial approach is to save the data to a matrix and then export the matrix using esttab, but can I avoid this extra step?
mkmat mpg n, matrix(mat) rownames(name)
esttab matrix(mat)
A problem with this is that it replaces the spaces in the row names with underscores. Another problem is that if any of the row names (from the name variable) contain a colon (:), this creates a category in the output. Is there another solution? Either to export directly from the data, or possibly to somehow save the data as an estimation result?
Instead of using collapse, you can calculate means and counts directly with estpost tabstat, statistics(mean count) by(). You can then use esttab to export the results.
If you really want to create a dataset first, you can still use estpost tabstat. This appears to work for your dataset:
estpost tabstat mpg n, by(name) nototal
esttab, cells("mpg n") varlabels(`e(labels)') noobs nonumber nomtitle
If you want to have "a b: c" on top again you can use the order option of esttab.

How to Randomly Assign to Groups of Different Sizes

Say I have a dataset and I want to assign observations to different groups, the size of groups determined by the data. For example, suppose that this is the data:
sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
*Create global with regions
global region NE NCntrl South West
*Count the number in each region
bys reg (pop): gen reg_N=_N
tab reg
There are four reg groups, all of different sizes. Now, I want to randomly assign observations to the four groups. This is accomplished below by generating a random number and then assigning observations to one of the groups based on the random number.
*Generate random number
set seed 1
gen random = runiform()
sort random
*Assign observations to number based on random sorting
egen reg_rand = seq(), from(1) to(4)
*Map number to region
gen reg_new = ""
global count 1
foreach i in $region {
    replace reg_new = "`i'" if reg_rand==$count
    global count = $count + 1
}
bys reg_new: gen reg_new_N = _N
tab reg_new
This is not what I want, though. Instead of using the egen function seq(), which creates groups of equal sizes (assuming N divided by the number of groups is a whole number), I would like to randomly assign based on the size of the original groups. In this case, that is equivalent to reg_N. For example, there would be 12 observations with a reg_new value of NCntrl.
I might have one solution similar to https://stats.idre.ucla.edu/stata/faq/how-can-i-randomly-assign-observations-to-groups-in-stata/. The idea would be to save the results of tab reg into a macro or matrix, and then use a loop and replace to cycle through the observations, which are sorted by a random number. Assume that there are many, many more groups than the four in this toy example. Is there a more reasonable way to accomplish this?
It looks like you want to shuffle around the values stored in a group variable across observations. You can do this by reducing the data to the group variable, sorting on a variable that contains random values and then using an unmatched merge to associate the random group identifiers to the original observations.
Assuming that the data example is stored in a file called "data_example.dta" and is currently loaded into memory, this would look like:
set seed 234
keep reg
rename reg reg_new
gen double u = runiform()
sort u reg_new
merge 1:1 _n using "data_example.dta", nogen
tab reg reg_new
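The logic of the unmatched merge is to detach the group labels, sort them randomly, and reattach them. That amounts to permuting the label vector. A minimal Python sketch with hypothetical group sizes (the counts below are made up for illustration) shows that a permutation preserves each group's size exactly:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(234)

# Hypothetical region labels with unequal group sizes, standing in
# for the census data's reg variable.
reg = np.array(["NE"] * 9 + ["NCntrl"] * 12 + ["South"] * 16 + ["West"] * 13)

# A random permutation reassigns labels to observations while keeping
# each group's size as before, the same effect as sorting the detached
# labels on runiform() and merging them back by observation number.
reg_new = rng.permutation(reg)

size_before = Counter(reg)
size_after = Counter(reg_new)
```

Comparing size_before and size_after confirms the group sizes are untouched; only which observation carries which label changes.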

Include only complete groups in panel regression using Stata

I have a panel dataset, but not all individuals are present for all periods. When I run my xtreg, I see that there are between 1 and 4 observations per group, with a mean of 1.9. I'd like to include only those with 4 observations. Is there any way I can do this easily?
I understand that you want to include in your regression only those groups for which there are exactly 4 observations. If this is the case, then one solution is to count the number of observations per group and condition the regression using if:
clear all
set more off
webuse nlswork
xtset idcode
list idcode year in 1/50, sepby(idcode)
bysort idcode: gen counter = _N
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
c.tenure#c.tenure 2.race not_smsa south if counter == 12, be
In this example the regression is conditioned to groups with 12 observations. The xtreg command gives (among other things):
Number of obs = 1881
Number of groups = 158
which you can compare with the result of running the regression without the if:
Number of obs = 28091
Number of groups = 4697
As commented by @NickCox, if you don't mind losing observations you can drop or keep (un)desired groups:
bysort idcode: drop if _N != 4
or
bysort idcode: keep if _N == 4
followed by an unconditional xtreg (i.e. with no if).
Notice that both approaches count missings, so you may need to account for that.
On the other hand, you might want to think about why you want to discard that data in your analysis.
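The keep-if-complete step can be illustrated outside Stata as well. Here is a rough pandas equivalent of bysort idcode: keep if _N == 4, using a tiny made-up panel (the data and column names are hypothetical):

```python
import pandas as pd

# Toy panel: idcode 1 and 3 are observed 4 times, idcode 2 only twice.
panel = pd.DataFrame({
    "idcode": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "year":   [80, 81, 82, 83, 80, 81, 80, 81, 82, 83],
})

# Keep only individuals observed exactly 4 times, the analogue of
# Stata's  bysort idcode: keep if _N == 4
complete = panel.groupby("idcode").filter(lambda g: len(g) == 4)
```

As in the Stata version, this counts rows rather than non-missing values, so missings would need separate handling.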

Stata -- extract regression output for 3500 regressions run in a loop

I am using a forval loop to run 3,500 regressions, one for each group, and then need to summarize the results. Typically, when I use loops to run regressions, I use estimates store followed by estout. Below is sample code, but I believe estimates store is limited to 300 stored results. I would very much appreciate it if someone could let me know how to automate the process for 3,500 regressions.
Sample code:
forval j = 1/3500 {
    regress y x if group == `j'
    eststo m`j', title(Model `j')
}
estout m* using "Results.csv", cells(b t) ///
legend label varlabels(_cons constant) ///
stats(r2 df_r N, fmt(3 0 1) label(R-sqr dfres N)) replace
Here's an example using statsby, where I run a regression of price on mpg for each of the 5 groups defined by the rep78 variable and store the results in a Stata dataset called my_regs:
sysuse auto, clear
statsby _b _se, by(rep78) saving(my_regs): reg price mpg
use my_regs.dta
If you prefer, you can omit the saving() option and then your dataset will be replaced in memory by the regression results, so you won't need to open the file directly with use.
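The per-group-regression idea behind statsby also translates directly to other tools. Below is a hedged Python sketch using synthetic data (the groups, slope of 2, and noise level are all invented for illustration) that fits one OLS line per group and collects the coefficients into a small results table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic data: three groups, each with its own sample of x,
# and y generated with a common slope of 2 plus small noise.
df = pd.DataFrame({
    "group": np.repeat([1, 2, 3], 20),
    "x": rng.normal(size=60),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=60)

# One OLS fit per group, collected into a results table with one row
# per group, the same shape of output that statsby _b produces.
results = pd.DataFrame(
    {g: np.polyfit(sub["x"], sub["y"], 1) for g, sub in df.groupby("group")},
    index=["slope", "intercept"],
).T
```

Each row of results holds the estimated slope and intercept for one group; with 3,500 groups the loop scales the same way, just with more rows.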

calculate standard deviation of daily data within a year

In Matlab, I have a vector of 20 years of daily data (X) and a vector of the relevant dates (DATES). In order to find the mean value of the daily data per year, I use the following script:
A = fints(DATES,X); %convert to financial time series
B = toannual(A,'CalcMethod', 'SimpAvg'); %calculate average value per year
C = fts2mat(B); %Convert fts object to vector
C is a vector of 20 values, showing the average value of the daily data for each of the 20 years. So far, so good. Now I am trying to do the same thing, but instead of calculating mean values annually, I need to calculate the standard deviation annually. It seems there is no such option with the toannual function.
Any ideas on how to do this?
Thank you in advance!
I'm assuming that X is the financial data and that it is evenly distributed across the years. You'll have to modify this if that isn't the case. Just to clarify, by evenly distributed I mean that if there are 20 years and X has 200 values, each year has 10 of them.
You should be able to do something like this:
num_years = length(C);
span_size = length(X)/num_years;
std_dev = zeros(num_years,1); % preallocate the result vector
for n = 0:num_years-1
    std_dev(n+1,1) = std(X(1+(n*span_size):(n+1)*span_size));
end
The idea is that you simply pass the data for the given year (the day-to-day values) into MATLAB's standard deviation function, which returns the std-dev for that year. std_dev should be a column vector that corresponds 1:1 with your C vector of yearly averages.
Alternatively, if DATES holds the year of each observation, you can pick out each year's values directly:
unique_Dates = unique(DATES); % should return a vector of 20 elements, since you have 20 years
std_dev = zeros(size(unique_Dates)); % preallocate the standard deviation vector
for n = 1:length(unique_Dates)
    std_dev(n) = std(X(DATES==unique_Dates(n)));
end
Now, this assumes that your DATES matrix can be passed to the unique function and that it will return the expected list of dates. If the dates are in numeric form, I know this will work; I'm just concerned about them being in string form.
If they are in string form, you can look at using regexp to parse the information and replace matching dates with a numeric identifier, then use the above code. Or you can take the basic idea behind this and adapt it to whatever works best for you!
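The same group-by-year-and-take-std logic is easy to mirror in Python with NumPy. This sketch uses synthetic data (5 invented "years" of 10 values each), not the asker's series; note ddof=1, which matches MATLAB's default n-1 normalization for std:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the daily series: 5 "years" of 10 values each.
years = np.repeat(np.arange(2000, 2005), 10)
x = rng.normal(size=years.size)

unique_years = np.unique(years)
# Standard deviation of the observations belonging to each year,
# mirroring std(X(DATES == unique_Dates(n))) in the MATLAB loop.
std_dev = np.array([x[years == y].std(ddof=1) for y in unique_years])
```

std_dev ends up with one entry per year, aligned with unique_years, just as the MATLAB std_dev aligns with unique_Dates.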
