Include only complete groups in panel regression using Stata

I have a panel dataset, but not all individuals are present for all periods. When I run xtreg, I see that there are between 1 and 4 observations per group, with a mean of 1.9. I'd like to include only those with 4 observations. Is there any way I can do this easily?

I understand that you want to include in your regression only those groups for which there are exactly 4 observations. If this is the case, then one solution is to count the number of observations per group and condition the regression using if:
clear all
set more off
webuse nlswork
xtset idcode
list idcode year in 1/50, sepby(idcode)
bysort idcode: gen counter = _N
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
c.tenure#c.tenure 2.race not_smsa south if counter == 12, be
In this example the regression is restricted to groups with 12 observations. The xtreg command gives (among other things):
Number of obs = 1881
Number of groups = 158
which you can compare with the result of running the regression without the if:
Number of obs = 28091
Number of groups = 4697
As commented by @NickCox, if you don't mind losing observations you can drop or keep the (un)desired groups:
bysort idcode: drop if _N != 4
or
bysort idcode: keep if _N == 4
followed by an unconditional xtreg (i.e. with no if).
Notice that both approaches count observations with missing values, so you may need to account for that.
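For instance, one way to count only non-missing values is to use egen's count() function in place of _N. A minimal sketch with the same nlswork example (nobs is an arbitrary name; ln_wage is the full name of the ln_w used above):
* count non-missing values of the dependent variable within each panel
bysort idcode: egen nobs = count(ln_wage)
* note: regressors with missing values can still shrink the estimation sample
xtreg ln_wage grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
c.tenure#c.tenure 2.race not_smsa south if nobs == 12, be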
On the other hand, you might want to think about why you want to discard that data in your analysis.

Related

Algorithm or Test Method to generate test case for Keno game

Keno game rules: Keno is a lottery-like game that draws a random combination of 20 numbers ranging from 1 to 80. The player may choose a number game to play (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or 15). The payout depends on the number game and the number of matches.
I understand the difficulty of generating a complete set of test cases to cover all possible combinations, not to mention the possibility of matching the random game result. I initially applied the Random Combination testing method, but later found it hard to achieve high coverage of all possible cases (roughly 10%). So far I have come across Pure Random Combinatorial, CATS, AETG, and K-combination, but none is ideal for the Keno game.
For now, the inputs are num_game_size and numSelected[num_game_size], and the outputs are result[20], matchedNum[], matched_num_size, and payout. Of course, there are more inputs: continuous_game_toplay_size and bet_amount.
I'm looking for suggestions on any testing method or algorithm that achieves high coverage of purely random, large-combination test cases when executed for a month or two. My objective is to test combinations of selected numbers and their payouts for each different number of matches when the result is purely randomly generated. For instance:
/* Assume the result is pure random generated */
/* Match 0 */
num_game_size = 2
numSelected[2] = {1,72}
result[20] = {2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21}
matchedNum[] = {}
matched_num_size = 0
payout = 0
/* Match 1 */
num_game_size = 2
numSelected[2] = {1,72}
result[20] = {1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21}
matchedNum[] = {1}
matched_num_size = 1
payout = 1
/* Match 2 */
num_game_size = 2
numSelected[2] = {1,72}
result[20] = {1,72,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21}
matchedNum[] = {1,72}
matched_num_size = 2
payout = 5
The total number of possibilities is C(80,2) * C(80,20) = 3160 * 3535316142212174320 ≈ 1.117159900939047e+22. That is, for each combination of two numbers within the range 1 to 80, there are C(80,20) possible results. It would probably take a few years to cover every possibility (including the 1, 3, 4, 5, 6, 7, 8, 9, 10, and 15 number games) when the result is purely randomly generated (quantum RNG).
PS: Most test methods I found consider only the random or the combination problem in isolation, and require a tremendous amount of time to complete test-case generation. I'm trying to create a program to help me win the Keno game IRL.

Shuffle One Variable Within Group

This question is an extension of the excellent answer provided by Robert Picard here: How to Randomly Assign to Groups of Different Sizes
We have this dataset, which is the same as in the previous question, but adds the year variable:
sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
gen year=20
replace year=30 if _n>15
replace year=40 if _n>35
If I just wanted to re-randomly assign the values of reg across all observations (without regard to group), I could implement the answer to the previous post:
tempfile orig
save `orig'
keep reg
rename reg reg_new
set seed 234
gen double u = runiform()
sort u reg_new
merge 1:1 _n using `orig', nogen
How would the code be modified so that reg is shuffled, but only within year? For example, there are 15 observations where year==20. These observations should be shuffled separately from the other years.
Shuffling one variable doesn't require any file choreography. This can probably be shortened:
sysuse auto, clear
set seed 2803
gen double shuffle = runiform()
* example 1
sort shuffle
gen long which = _n
sort mpg
gen mpg_new = mpg[which]
list which mpg*
* example 2
bysort foreign (shuffle) : gen long which2 = _n
bysort foreign (mpg) : gen mpg2 = mpg[which2]
list which2 mpg mpg2, sepby(foreign)
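Applied to the census/year data in the question, the same device shuffles reg within year. A minimal sketch along the lines of example 2, assuming the question's setup code has been run (the names which and reg_new are arbitrary):
set seed 234
gen double shuffle = runiform()
* rank observations in random order within each year
bysort year (shuffle) : gen long which = _n
* take the reg value at that random within-year rank
bysort year (reg) : gen reg_new = reg[which]
list year reg reg_new, sepby(year)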
All that said, I think sample does this, so long as you specify a sample size equal to the number of observations in the dataset. It's overkill, though, because it permutes entire observations (all the variables), not just one.

How to Randomly Assign to Groups of Different Sizes

Say I have a dataset and I want to assign observations to different groups, with the size of each group determined by the data. For example, suppose that this is the data:
sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
*Create global with regions
global region NE NCntrl South West
*Count the number in each region
bys reg (pop): gen reg_N=_N
tab reg
There are four reg groups, all of different sizes. Now, I want to randomly assign observations to the four groups. This is accomplished below by generating a random number and then assigning observations to one of the groups based on the random number.
*Generate random number
set seed 1
gen random = runiform()
sort random
*Assign observations to number based on random sorting
egen reg_rand = seq(), from(1) to(4)
*Map number to region
gen reg_new = ""
global count 1
foreach i in $region {
    replace reg_new = "`i'" if reg_rand==$count
    global count = $count + 1
}
bys reg_new: gen reg_new_N = _N
tab reg_new
This is not what I want, though. Instead of using the egen seq() function, which creates groups of equal size (assuming N divided by the number of groups is a whole number), I would like to randomly assign based on the sizes of the original groups. In this case, that is equivalent to reg_N. For example, there would be 12 observations with a reg_new value of NCntrl.
I might have one solution similar to https://stats.idre.ucla.edu/stata/faq/how-can-i-randomly-assign-observations-to-groups-in-stata/. The idea would be to save the results of tab reg into a macro or matrix, and then use a loop and replace to cycle through the observations, which are sorted by a random number. Assume that there are many, many more groups than the four in this toy example. Is there a more reasonable way to accomplish this?
It looks like you want to shuffle around the values stored in a group variable across observations. You can do this by reducing the data to the group variable, sorting on a variable that contains random values and then using an unmatched merge to associate the random group identifiers to the original observations.
Assuming that the data example is stored in a file called "data_example.dta" and is currently loaded into memory, this would look like:
set seed 234
keep reg
rename reg reg_new
gen double u = runiform()
sort u reg_new
merge 1:1 _n using "data_example.dta", nogen
tab reg reg_new

Poor h2o GBM Classification Performance in a balanced binomial response

In a fairly balanced binomial classification problem, I am observing an unusual level of error in h2o.gbm classification when determining class 0, on the training set itself. It is from a competition which is already over, so the interest is only in understanding what is going wrong.
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
        0        1        Error     Rate
0       147857   234035   0.612830  =234035/381892
1       44782    271661   0.141517  =44782/316443
Totals  192639   505696   0.399260  =278817/698335
Any expert suggestions on how to treat the data and reduce the error are welcome.
The following approaches were tried, but the error did not decrease.
Approach 1: Selecting top 5 important variables via h2o.varimp(gbm)
Approach 2: Converting negative normalized values to 0 and positive values to 1.
#Data Definition
# Variable Definition
#Independent Variables
# ID Unique ID for each observation
# Timestamp Unique value representing one day
# Stock_ID Unique ID representing one stock
# Volume Normalized values of volume traded of given stock ID on that timestamp
# Three_Day_Moving_Average Normalized values of three days moving average of Closing price for given stock ID (Including Current day)
# Five_Day_Moving_Average Normalized values of five days moving average of Closing price for given stock ID (Including Current day)
# Ten_Day_Moving_Average Normalized values of ten days moving average of Closing price for given stock ID (Including Current day)
# Twenty_Day_Moving_Average Normalized values of twenty days moving average of Closing price for given stock ID (Including Current day)
# True_Range Normalized values of true range for given stock ID
# Average_True_Range Normalized values of average true range for given stock ID
# Positive_Directional_Movement Normalized values of positive directional movement for given stock ID
# Negative_Directional_Movement Normalized values of negative directional movement for given stock ID
#Dependent Response Variable
# Outcome Binary outcome variable representing whether price for one particular stock at the tomorrow’s market close is higher(1) or lower(0) compared to the price at today’s market close
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)
summary(train)
#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create empty response Outcome in test
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)
rm(train,test)
summary(combi.imp)
#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)
summary(combi.imp)
#Brute Force NA treatment by taking only complete cases without NA.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]
library(h2o)
y<-c("Outcome")
features=names(train.complete)[!names(train.complete) %in% c("Outcome")]
h2o.shutdown(prompt=F)
#Adjust memory size based on your system.
h2o.init(nthreads = -1,max_mem_size = "5g")
train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])
#Models
gbmF_model_1 = h2o.gbm( x=features,
y = y,
training_frame =train.hex,
seed=1234
)
h2o.performance(gbmF_model_1)
You've only trained a single GBM with the default parameters, so it doesn't look like you've put enough effort into tuning your model. I'd recommend a random grid search over GBM hyper-parameters using the h2o.grid() function.
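A minimal sketch of what that could look like here, reusing train.hex, features and y from the code above (the hyper-parameter lists and grid_id are illustrative choices, not tuned values):
# candidate values to sample from (illustrative ranges)
hyper_params <- list(max_depth = c(3, 5, 7, 9),
                     learn_rate = c(0.01, 0.05, 0.1),
                     sample_rate = c(0.7, 0.8, 1.0),
                     col_sample_rate = c(0.7, 0.8, 1.0),
                     ntrees = c(100, 200, 500))
# try 20 random combinations rather than the full Cartesian grid
search_criteria <- list(strategy = "RandomDiscrete", max_models = 20, seed = 1234)
gbm_grid <- h2o.grid(algorithm = "gbm", grid_id = "gbm_grid_1",
                     x = features, y = y,
                     training_frame = train.hex,
                     nfolds = 5,
                     hyper_params = hyper_params,
                     search_criteria = search_criteria,
                     seed = 1234)
# rank the models by cross-validated AUC
print(h2o.getGrid("gbm_grid_1", sort_by = "auc", decreasing = TRUE))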

Split test groups based on GUID

Users in the system are identified by GUID, and with a new feature, I want to divide users into two groups - test and control.
Is there an easy way to split users into one of the two groups with a 50/50 chance, based on their GUID?
e.g. if the nth character's ASCII code is odd -> test group, otherwise control group.
What about 70/30, or other ratios?
The reason I want to classify users based on GUID is that later I can easily tell which users are in which group and compare the performance of the two groups, without having to keep track of the group assignment - I simply need to calculate it again.
As Derek Li notes, the GUID's bits might be based on a timestamp, so you shouldn't use them directly.
The safest solution is to hash the GUID using a hash function like MurmurHash. This will produce a random number (but the same random number every time for any given GUID) which you can then use to do the split.
For example, you could do a 30/70 split like this:
function isInTestGroup(user) {
    var hash = murmurHash(user.guid);
    return (hash % 100) < 30;
}
If some character in the GUID has a 1 in 16 chance of being each of the characters "0123456789ABCDEF", then perhaps you could base placement on that character.
Say the last character of the GUID, call it c, has a 1/16 chance of being any hex digit:
for a 50/50 distribution -> c <= 7 for group 1, c > 7 for group 2
for 70/30 -> c <= A for group 1, c > A for group 2 (11 of the 16 values, so really about 69/31)
etc...
