Structure of the random effects in glmmLasso - random

I want to perform model selection among ~150 fixed-effect and 7 random-effect variables, on a set of 360 observations. I decided to use the Lasso procedure for mixed models, with the glmmLasso package. I did a lot of research to find examples of comparable models, without success. Here is a sample of my data:
> str(RHI_12)
'data.frame': 350 obs. of 164 variables:
$ RHI_counts_12 : int 0 14 1 3 2 2 2 0 0 1 ...
$ Site : Factor w/ 6 levels "14_metzerlen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Location : Factor w/ 30 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Dist_roost : num 0.985 0.88 0.908 0.888 0.89 ...
$ Natural_light : num -0.194 -0.194 -0.194 -0.194 -0.194 ...
$ Mean_wind : num 0.836 0.836 0.836 0.836 0.836 ...
$ Mean_temp : num -0.427 -0.427 -0.427 -0.427 -0.427 ...
$ Day : num -0.993 -0.993 -0.993 -0.993 -0.993 ...
$ Artificial_light: num -0.2016 -0.2016 0.0772 -0.2016 -0.2016 ...
$ WBdi : num 1.14 1.14 1.14 1.14 1.14 ...
$ WCdi : num 1.49 1.49 1.49 1.49 1.47 ...
... (many more fixed-effect variables)
The response variable is counts (RHI_counts_12).
My question is about the structure of the random-effect variables in the model.
I have 2 categorical random-effect variables ("Site" and "Location"; "Location" is nested in "Site") and 5 numerical random-effect variables. I have structured my model like this (using only a sample of the fixed-effect variables):
lasso1<-glmmLasso(RHI_counts_12 ~ Artificial_light+WBdi+WCdi+BUdi+FOdi+TIdi, list(Site=~1,Location=~1+Dist_roost+Natural_light+Mean_wind+Mean_temp+Day),
lambda = 500,family = poisson(link = log), data = RHI_12)
I am not at all convinced that this is the right way to structure the random effects, given that the 2 categorical random effects are nested. I want a model with Location nested in Site, and I do not think that this is what I get. Here is my output for the random effects (in this output, "Loc" stands for Location, "siteName" for Site):
Random Effects:
StdDev:
[[1]]
siteName
siteName 1.180514
[[2]]
Loc Loc:Dist_roost Loc:Natural_light Loc:Mean_wind
Loc 1.15105859 -0.66317669 -0.35354821 -0.10805268
Loc:Dist_roost -0.66317669 1.42601945 0.46004662 -0.42795987
Loc:Natural_light -0.35354821 0.46004662 0.49532786 -0.15485395
Loc:Mean_wind -0.10805268 -0.42795987 -0.15485395 0.76175417
Loc:Mean_temp 0.02677276 0.03961902 -0.01431360 -0.03649499
Loc:Day 0.03756960 -0.02081360 0.02520654 -0.12082652
Loc:Mean_temp Loc:Day
Loc 0.02677276 0.03756960
Loc:Dist_roost 0.03961902 -0.02081360
Loc:Natural_light -0.01431360 0.02520654
Loc:Mean_wind -0.03649499 -0.12082652
Loc:Mean_temp 0.36923939 -0.08311209
Loc:Day -0.08311209 0.56876662
Do you think that it is right? I was not able to build this model with "Location" nested in "Site" (all the other random factors would also be nested in "Site"). I have tried many different ways without success.
Thank you very much in advance for reading and for any advice on the structure of random effects in glmmLasso! :-)
Thomas

Related

Invalid syntax loop in Stata

I'm trying to run a for loop to make a balance table in Stata (comparing the demographics of my dataset with national-level statistics)
For this, I'm prepping my dataset and attempting to calculate the percentages/averages for some key demographics.
preserve
rename unearnedinc_wins95 unearninc_wins95
foreach var of varlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019 { //continuous or binary; to put categorical vars use kwallis test
dis "for variable `var':"
tabstat `var'
summ `var'
local `var'_samplemean=r(mean)
}
clear
set obs 11
gen var=""
gen sample=.
gen F=.
gen pvalue=.
replace var="% Female" if _n==1
replace var="Age" if _n==2
replace var="% Non-white" if _n==3
replace var="HH size" if _n==4
replace var="% Parent" if _n==5
replace var="% Employed" if _n==6
replace var="Savings stock ($)" if _n==7
replace var="Debt stock ($)" if _n==8
replace var="Earned income last mo. ($)" if _n==9
replace var="Unearned income last mo. ($)" if _n==10
replace var="% Under FPL 2019" if _n==11
foreach col of varlist sample {
replace `col'=100*round(`fem_`col'mean', 0.01) if _n==1
replace `col'=round(`age_`col'mean') if _n==2
replace `col'=100*round(`nonwhite_`col'mean', 0.01) if _n==3
replace `col'=round(`hhsize_`col'mean', 0.1) if _n==4
replace `col'=100*round(`parent_`col'mean', 0.01) if _n==5
replace `col'=100*round(`employed_`col'mean', 0.01) if _n==6
replace `col'=round(`savings_wins95_`col'mean') if _n==7
replace `col'=round(`debt_wins95_`col'mean') if _n==8
replace `col'=round(`earnedinc_wins95_`col'mean') if _n==9
replace `col'=round(`unearninc_wins95_`col'mean') if _n==10
replace `col'=100*round(`underfpl2019_`col'mean', 0.01) if _n==11
}
I'm trying to run the code above, but in the second half (the loop after clearing the dataset) I keep getting an 'invalid syntax' error. For context, the first loop (before clearing the dataset) stores the average value of each variable in a local macro (`var'_samplemean). Can someone help me fix this loop?
My sample data:
clear
input byte fem float(age nonwhite) byte(hhsize parent) float employed double(savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95) float underfpl2019
1 35 1 6 1 1 0 2500 0 0 0
0 40 0 4 1 1 0 10000 1043 0 0
0 40 0 4 1 1 0 20000 2400 0 0
0 40 0 4 1 1 .24 20000 2000 0 0
0 40 0 4 1 1 10 . 2600 0 0
Thanks!
Thanks for sharing the snippet of data. Apart from the fact that the variable unearninc_wins95 has already been renamed in your sample data, the code runs fine for me without returning an error.
That being said, the columns for your F-statistics and p-values are empty once the loop at the bottom of your code completes. As far as I can see there is no local/varlist called sample which you're attempting to call with the line foreach col of varlist sample{. This could be because you haven't included it in your code, in which case please do, or it could be because you haven't created the local/varlist sample, in which case this could well be the source of your error message.
Taking a step back, there are more efficient ways of achieving what I think you're after. For example, you can get (part of) what you want using the package stat2data (if you don't have it installed already, run ssc install stat2data from the command prompt). You can then run the following code:
stat2data fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019, saving("~/yourstats.dta") stat(count mean)
*which returns:
preserve
use "~/yourstats.dta", clear
. list, sep(11)
+----------------------------+
| _name sN smean |
|----------------------------|
1. | fem 5 .2 |
2. | age 5 39 |
3. | nonwhite 5 .2 |
4. | hhsize 5 4.4 |
5. | parent 5 1 |
6. | employed 5 1 |
7. | savings_wins 5 2.048 |
8. | debt_wins95 4 13125 |
9. | earnedinc_wi 5 1608.6 |
10. | unearninc_wi 5 0 |
11. | underfpl2019 5 0 |
+----------------------------+
restore
This is missing the empty F-statistic and p-value variables you created in your code above, but you can always add them in the same way you have with gen F=. and gen pvalue=.. The presence of these variables though indicates you want to run some tests at some point and then fill the cells with values from them. I'd offer advice on how to do this but it's not obvious to me from your code what you want to test. If you can clarify this I will try and edit this answer to include that.
This doesn't answer your question directly; as others gently point out the question is hard to answer without a reproducible example. But I have several small comments on your code which are better presented in this form.
Assuming that all the variables needed are indeed present in the dataset, I would recommend something more like this:
local myvarlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019
local desc `" "% Female" "Age" "% Non-white" "HH size" "% Parent" "% Employed" "Savings stock ($)" "Debt stock ($)" "Earned income last mo. ($)" "Unearned income last mo. ($)" "% Under FPL 2019" "'
gen variable = ""
gen mean = ""
local i = 1
foreach var of local myvarlist {
summ `var', meanonly
local this : word `i' of `desc'
replace variable = "`this'" in `i'
if inlist(`i', 1, 3, 5, 6, 11) {
replace mean = strofreal(100 * r(mean), "%2.0f") in `i'
}
else if `i' == 4 {
replace mean = strofreal(r(mean), "%2.1f") in `i'
}
else replace mean = strofreal(r(mean), "%2.0f") in `i'
local ++i
}
This has not been tested.
Points arising include:
Using in is preferable for what you want over testing the observation number with if.
round() is treacherous for rounding to so many decimal places. Most of the time you will get what you want, but occasionally you will get bizarre results arising from the fact that Stata works in binary, like any equivalent program. It is safer to treat rounding as a problem in string manipulation and use display formats as offering precisely what you want.
If the text you want to show is just the variable label for each variable, this code could be simplified further.
The code hints at intent to show other stuff, which is easily done compatibly with this design.
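The point about round() is not specific to Stata. As a minimal sketch, here is the same effect in Python (used purely for illustration; the value 0.57 is a made-up proportion, not taken from the question's data). Scaling a binary double does not always give the "obvious" decimal result, while formatting the number as a string does.

```python
# Illustration only: binary floating point is shared by Stata, Python and
# most other numerical software, so the same effect shows up everywhere.
from decimal import Decimal

share = 0.57            # hypothetical proportion, e.g. a sample mean
scaled = share * 100    # analogous to 100*round(x, 0.01) in the question

print(repr(scaled))     # not exactly 57: 0.57 has no exact binary form
print(Decimal(share))   # the value that is actually stored for 0.57

# Treat rounding as string formatting instead,
# in the spirit of strofreal(x, "%2.0f") in the Stata code above.
label = format(scaled, ".0f")
print(label)
```

The formatted string is what ends up in the table, so the stray binary digits never reach the reader.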

Sorting data with gnuplot

Sometimes it might be required to sort data. Unfortunately, gnuplot (as far as I know) doesn't offer this possibility. Of course, you can use external tools like awk, Perl, Python, etc. However, for maximum platform independence and avoiding the installation of additional programs and related complications, and also for curiosity, I was interested whether gnuplot can sort somehow nevertheless.
I will be grateful for comments on improvements, limitations.
Does anybody have ideas how to sort alphanumerical data with gnuplot only?
### Sorting with gnuplot
reset session
# generate some random example data
N = 10
set samples N
RandomNo(n) = sprintf("%.02f",rand(0)*n)
set table $Data
plot '+' u (RandomNo(10)):(RandomNo(10)):(RandomNo(10)) w table
unset table
print $Data
# Settings for sorting
ColNo = 2 # ColumnNo for sorting
stats $Data nooutput # get the number of rows if data is from file
RowCount = STATS_records # with the example data above, of course RowCount=N
# create the sortkey and put it into an array
array SortKey[RowCount]
set table $Dummy
plot $Data u (SortKey[$0+1] = sprintf("%.06f%02d",column(ColNo),$0+1)) w table
unset table
# print $Dummy
# get lines as whole into array
set datafile separator "\n"
array DataSeq[RowCount]
set table $Dummy2
plot $Data u (SortKey[$0+1]):(DataSeq[$0+1] = stringcolumn(1)) with table
unset table
print $Dummy2
set datafile separator whitespace
# do the actual sorting with 'smooth unique'
set table $Dummy3
plot $Dummy2 u 1:0 smooth unique
unset table
# print $Dummy3
# extract the sorted sortkeys
set table $Dummy4
plot $Dummy3 u (SortKey[$0+1]=$2) with table
unset table
# print $Dummy4
# create the table with sorted lines
set table $DataSorted
plot $Data u (DataSeq[SortKey[$0+1]+1]) with table
unset table
print $DataSorted
### end of code
Output (three datablocks in order: the unsorted data, the intermediate data with sort keys, and the data sorted by the second column):
5.24 6.68 3.09
1.64 1.27 9.82
6.44 9.23 7.03
8.14 8.87 3.82
4.27 5.98 0.93
7.96 3.64 6.15
6.21 6.28 6.17
1.52 3.17 3.58
4.24 2.16 8.99
8.73 6.54 1.13
6.68000001 5.24 6.68 3.09
1.27000002 1.64 1.27 9.82
9.23000003 6.44 9.23 7.03
8.87000004 8.14 8.87 3.82
5.98000005 4.27 5.98 0.93
3.64000006 7.96 3.64 6.15
6.28000007 6.21 6.28 6.17
3.17000008 1.52 3.17 3.58
2.16000009 4.24 2.16 8.99
6.54000010 8.73 6.54 1.13
1.64 1.27 9.82
4.24 2.16 8.99
1.52 3.17 3.58
7.96 3.64 6.15
4.27 5.98 0.93
6.21 6.28 6.17
8.73 6.54 1.13
5.24 6.68 3.09
8.14 8.87 3.82
6.44 9.23 7.03
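The idea behind the script (build a sort key that carries the row number, sort the keys, then use the recovered row numbers to fetch the complete lines) can be sketched in a general-purpose language. The following Python version is only an illustration of that idea on the example data above; it is not a replacement for the gnuplot-only solution, and the variable names are chosen for this sketch.

```python
# Decorate-sort-undecorate on the example data, sorting by the second column.
rows = [
    "5.24 6.68 3.09",
    "1.64 1.27 9.82",
    "6.44 9.23 7.03",
    "8.14 8.87 3.82",
    "4.27 5.98 0.93",
    "7.96 3.64 6.15",
    "6.21 6.28 6.17",
    "1.52 3.17 3.58",
    "4.24 2.16 8.99",
    "8.73 6.54 1.13",
]
col_no = 2  # 1-based column number, like ColNo in the gnuplot script

# Build (value, original row index) keys. The index plays the role of the
# two digits appended to the sort key and keeps every key unique.
keys = [(float(line.split()[col_no - 1]), i) for i, line in enumerate(rows)]

# Sorting the keys corresponds to the 'smooth unique' step in gnuplot.
keys.sort()

# Use the recovered indices to emit the complete lines in sorted order.
sorted_rows = [rows[i] for _, i in keys]
print("\n".join(sorted_rows))
```

The printed lines should match the third datablock of the gnuplot output.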
For curiosity, I wanted to know whether an alphanumerical sort could be implemented with gnuplot code only.
This avoids the need for external tools and ensures maximum platform compatibility.
I haven't heard yet about an external tool which could assist gnuplot and which works under Windows and Linux and MacOS.
I am happy to take comments and suggestions about bugs, simplifications, improvements, performance comparisons, and limits.
For alphanumerical sort, the first stage is alphanumerical string comparison, which to my knowledge does not exist in gnuplot directly. So, the first part Compare.plt is about comparison of strings.
### compare function for strings
# Compare.plt
# function cmp(a,b,cs) returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
reset session
ASCII = ' !"' . "#$%&'()*+,-./0123456789:;<=>?@".\
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\`".\
"abcdefghijklmnopqrstuvwxyz{|}~"
ord(c) = strstrt(ASCII,c)>0 ? strstrt(ASCII,c)+31 : 0
# comparing char: case-sensitive
cmpcharcs(c1,c2) = sgn(ord(c1)-ord(c2))
# comparing char: case-insensitive
cmpcharci(c1,c2) = sgn(( cmpcharci_o1=ord(c1), ((cmpcharci_o1>96) && (cmpcharci_o1<123)) ?\
cmpcharci_o1-32 : cmpcharci_o1) - \
( cmpcharci_o2=ord(c2), ((cmpcharci_o2>96) && (cmpcharci_o2<123)) ?\
cmpcharci_o2-32 : cmpcharci_o2) )
# function cmp returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
cmp(a,b,cs) = ((cmp_r=0, cmp_flag=0, cmp_maxlen=strlen(a)>strlen(b) ? strlen(a) : strlen(b)),\
(sum[cmp_i=1:cmp_maxlen] \
((cmp_flag==0 && (cmp_c1 = substr(a,cmp_i,cmp_i), cmp_c2 = substr(b,cmp_i,cmp_i), \
(cmp_r = (cs==0 ? cmpcharci(cmp_c1,cmp_c2) : cmpcharcs(cmp_c1,cmp_c2) ) )!=0 ? \
(cmp_flag=1, cmp_r) : 0)), 1 )), cmp_r)
cmpsymb(a,b,cs) = (cmpsymb_r = cmp(a,b,cs))<0 ? "<" : cmpsymb_r>0 ? ">" : "="
### end of code
Example:
### example compare strings
load "Compare.plt"
a="Alligator"
b="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Zebra"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
### end of code
Result:
-1: Alligator < Tiger
0: Tiger = Tiger
1: Zebra > Tiger
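Because the gnuplot version has to pack everything into a single expression, the logic of cmp(a,b,cs) is easier to follow in a conventional language. Below is a Python sketch of the same character-by-character comparison. The helper names ord_printable and fold are invented for this sketch, and treating positions past the end of the shorter string as code 0 is an assumption of the sketch, not a statement about how gnuplot's substr() and strstrt() behave.

```python
# Sketch of the three-way string comparison; not the gnuplot code itself.

def ord_printable(c):
    """Code of a printable ASCII character, 0 for anything else (or for '')."""
    return ord(c) if len(c) == 1 and 32 <= ord(c) <= 126 else 0

def fold(code):
    """Map lowercase codes (97..122) to uppercase, as cmpcharci does."""
    return code - 32 if 96 < code < 123 else code

def cmp(a, b, cs):
    """Return -1 if a<b, 0 if a==b, +1 if a>b; cs=0 ignores case."""
    for i in range(max(len(a), len(b))):
        # Past the end of the shorter string the slice is '' (code 0 here),
        # so a prefix sorts before the longer string.
        o1 = ord_printable(a[i:i + 1])
        o2 = ord_printable(b[i:i + 1])
        if cs == 0:
            o1, o2 = fold(o1), fold(o2)
        if o1 != o2:
            return -1 if o1 < o2 else 1   # first differing character decides
    return 0

for a in ("Alligator", "Tiger", "Zebra"):
    print(cmp(a, "Tiger", 0), a)
```

The loop exits at the first differing character, which is what the cmp_flag variable achieves inside the gnuplot sum.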
The second part makes use of the comparison for sorting.
### alpha-numerical sort with gnuplot
reset session
load "Compare.plt"
$Data <<EOD
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
EOD
stats $Data u 0 nooutput
RowCount = STATS_records
ColSort = 3
array Key[RowCount]
array Index[RowCount]
set table $Dummy
plot $Data u (Key[$0+1]=stringcolumn(ColSort),Index[$0+1]=$0+1) w table
unset table
# Bubblesort
do for [n=RowCount:2:-1] {
do for [i=1:n-1] {
if ( cmp(Key[i],Key[i+1],0) > 0) {
tmp=Key[i]; Key[i]=Key[i+1]; Key[i+1]=tmp
tmp2=Index[i]; Index[i]=Index[i+1]; Index[i+1]=tmp2
}
}
}
set datafile separator "\n"
set table $Dummy # and reuse Key-array
plot $Data u (Key[$0+1]=stringcolumn(1)) with table
unset table
set datafile separator whitespace
set table $DataSorted
plot $Data u (Key[Index[$0+1]]) with table
unset table
print $DataSorted
set grid xtics,ytics
plot [-0.5:RowCount-0.5][0:1.1] $DataSorted u 0:2:xtic(3) w lp lt 7 lc rgb "red"
### end of code
Input:
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
Output:
2 0.456 Apple
5 0.654 Banana
7 0.111 Lemon
1 0.123 Orange
3 0.789 Peach
4 0.987 Pineapple
6 0.321 Raspberry
and the output graph (not shown): column 2 plotted against the row index, with the names from column 3 as x-tic labels.
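The sorting step in the script swaps two arrays in parallel: the keys and the original row numbers. As a rough illustration, here is the same bubble sort in Python on the fruit names from the example. Python's built-in string comparison on lowercased keys stands in for cmp(...,0), which is a simplification made for this sketch.

```python
# Bubble sort that carries an index array along, as in the gnuplot script.
keys = ["Orange", "Apple", "Peach", "Pineapple", "Banana", "Raspberry", "Lemon"]
index = list(range(1, len(keys) + 1))   # 1-based row numbers, like Index[]

# After each outer pass the largest remaining key has moved to position n-1.
for n in range(len(keys), 1, -1):
    for i in range(n - 1):
        if keys[i].lower() > keys[i + 1].lower():   # case-insensitive compare
            keys[i], keys[i + 1] = keys[i + 1], keys[i]
            index[i], index[i + 1] = index[i + 1], index[i]

print(index)   # original row numbers in sorted order
print(keys)
```

The index array is what lets the script print the complete original lines in sorted order afterwards.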

How to fetch two associated Database values Using Rails 3

Hi, I have two tables in my database. The first table is given below.
Table name-
t_hcsy_details
class name in model-
class THcsyDetails < ActiveRecord::Base
end
The values in the table are given below.
HCSY_Details_ID  HCSY_ID  HCSY_Fund_Type_ID  Amount
1                2        1                  1125
2                2        2                  390
3                2        3                  285
4                2        4                  100
5                2        5                  60
6                2        6                  40
My second table is given below.
Table Name:
t_hcsy_fund_type_master
class in model:
class THcsyFundTypeMaster < ActiveRecord::Base
end
Table values are given below.
HCSY_Fund_Type_ID  Fund_Type_Code  Fund_Type_Name  Amount
1                  1               woods           1125
2                  2               Burning         390
3                  3               goods           285
4                  4               brahmin         100
5                  5               swd             60
6                  6               Photo           40
I know only the HCSY_ID value (i.e. 2) from the first table, but I need Fund_Type_Name and Amount from the second table. As you can see, one HCSY_ID has 6 different records, and I need the Fund_Type_Name and Amount for all of them. Please help me resolve this by creating objects for the two classes shown above.
You haven't specified any relationships setup, so it would be easier to split this in two queries:
# you already have hcsy_id
fund_type_ids = THcsyDetails.where(hcsy_id: hcsy_id).pluck(:hcsy_fund_type_id)
fund_types = THcsyFundTypeMaster.where(id: fund_type_ids)
fund_types.group(:fund_type_name).sum(:amount)
If you had proper associations set up, the above would simplify to:
THcsyDetails.
joins(association_name). # THcsyFundTypeMaster
where(hcsy_id: hcsy_id).
group("#{t = THcsyFundTypeMaster.table_name}.fund_type_name").
sum("#{t}.amount")
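To make the underlying query concrete, here is a small self-contained sketch using Python's built-in sqlite3 module with the two tables from the question. It only illustrates the kind of join the association-based query amounts to; the SQL that ActiveRecord actually generates may differ, and the in-memory database is an assumption made so the example can run on its own.

```python
import sqlite3

# In-memory database with the two tables and rows from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t_hcsy_details (
    HCSY_Details_ID INTEGER PRIMARY KEY,
    HCSY_ID INTEGER, HCSY_Fund_Type_ID INTEGER, Amount INTEGER);
CREATE TABLE t_hcsy_fund_type_master (
    HCSY_Fund_Type_ID INTEGER PRIMARY KEY,
    Fund_Type_Code INTEGER, Fund_Type_Name TEXT, Amount INTEGER);
""")
con.executemany(
    "INSERT INTO t_hcsy_details VALUES (?, ?, ?, ?)",
    [(1, 2, 1, 1125), (2, 2, 2, 390), (3, 2, 3, 285),
     (4, 2, 4, 100), (5, 2, 5, 60), (6, 2, 6, 40)])
con.executemany(
    "INSERT INTO t_hcsy_fund_type_master VALUES (?, ?, ?, ?)",
    [(1, 1, "woods", 1125), (2, 2, "Burning", 390), (3, 3, "goods", 285),
     (4, 4, "brahmin", 100), (5, 5, "swd", 60), (6, 6, "Photo", 40)])

# Join details to the fund type master and sum the amount per fund type
# for one known HCSY_ID.
rows = con.execute("""
    SELECT m.Fund_Type_Name, SUM(m.Amount)
    FROM t_hcsy_details AS d
    JOIN t_hcsy_fund_type_master AS m
      ON m.HCSY_Fund_Type_ID = d.HCSY_Fund_Type_ID
    WHERE d.HCSY_ID = ?
    GROUP BY m.HCSY_Fund_Type_ID, m.Fund_Type_Name
    ORDER BY m.HCSY_Fund_Type_ID
""", (2,)).fetchall()

for name, amount in rows:
    print(name, amount)
```

Each returned row pairs a Fund_Type_Name with its summed Amount for the given HCSY_ID.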

WINBUGS : adding time and product fixed effects in a hierarchical data

I am working on hierarchical panel data using WinBUGS. Assume data on school performance: logs is the dependent variable, with independent variables logp and rank. All schools are divided into three categories (cat), and I need a beta coefficient for each category (hence the HLM). I want to account for time-specific and school-specific effects in the model. One way would be to add dummy variables to the list of variables under mu[i], but that would get messy because the number of schools runs up to 60. I am sure there must be a better way to handle that.
My data looks like the following:
school time logs logp cat rank
1 1 4.2 8.9 1 1
1 2 4.2 8.1 1 2
1 3 3.5 9.2 1 1
2 1 4.1 7.5 1 2
2 2 4.5 6.5 1 2
3 1 5.1 6.6 2 4
3 2 6.2 6.8 3 7
#logs = log(score)
#logp = log(average hours of inputs)
#rank - rank of school
#cat = section red, section blue, section white in school (hierarchies)
My WinBUGS code is given below.
model {
  # N observations
  for (i in 1:n) {
    logs[i] ~ dnorm(mu[i], tau)
    mu[i] <- bcons + bprice*(logp[i])
             + brank[cat[i]]*(rank[i])
  }
  # C categories
  for (c in 1:C) {
    brank[c] ~ dnorm(beta, taub)
  }
  # priors
  bcons ~ dnorm(0,1.0E-6)
  bprice ~ dnorm(0,1.0E-6)
  bad ~ dnorm(0,1.0E-6)   # declared but not used in mu[i] above
  beta ~ dnorm(0,1.0E-6)
  tau ~ dgamma(0.001,0.001)
  taub ~ dgamma(0.001,0.001)
}
As you can see in the data sample above, I have multiple observations per school over time. How can I modify the code to account for time-specific and school-specific fixed effects? I have used Stata in the past, where the fe, be and i.time options take care of fixed effects in panel data. But here I am lost.

faster way to create variable that aggregates a column by id [duplicate]

Is there a faster way to do this? I guess this is unnecessarily slow and that a task like this can be accomplished with base functions.
df <- ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)))
I'm quite new to R. I have looked at by(), aggregate() and tapply(), but didn't get them to work at all or in the way I wanted. Rather than returning a shorter vector, I want to attach the sum to the original dataframe. What is the best way to do this?
Edit: Here is a speed comparison of the answers applied to my data.
> # My original solution
> system.time( ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc))) )
user system elapsed
14.405 0.000 14.479
> # Paul Hiemstra
> system.time( ddply(df, "id", transform, perc.total = sum(cand.perc)) )
user system elapsed
15.973 0.000 15.992
> # Richie Cotton
> system.time( with(df, tapply(df$cand.perc, df$id, sum))[df$id] )
user system elapsed
0.048 0.000 0.048
> # John
> system.time( with(df, ave(cand.perc, id, FUN = sum)) )
user system elapsed
0.032 0.000 0.030
> # Christoph_J
> system.time( df[ , list(perc.total = sum(cand.perc)), by="id"][df])
user system elapsed
0.028 0.000 0.028
Since you are quite new to R and speed is apparently an issue for you, I recommend the data.table package, which is really fast. One way to solve your problem in one line is as follows:
library(data.table)
DT <- data.table(ID = rep(c(1:3), each=3),
cand.perc = 1:9,
key="ID")
DT <- DT[ , perc.total := sum(cand.perc), by = ID]
DT
      ID cand.perc perc.total
 [1,]  1         1          6
 [2,]  1         2          6
 [3,]  1         3          6
 [4,]  2         4         15
 [5,]  2         5         15
 [6,]  2         6         15
 [7,]  3         7         24
 [8,]  3         8         24
 [9,]  3         9         24
Disclaimer: I'm not a data.table expert (yet ;-), so there might be faster ways to do that. Check out the package site to get you started if you are interested in using the package: http://datatable.r-forge.r-project.org/
For any kind of aggregation where you want a resulting vector of the same length as the input vector, with the group statistic replicated across each group, ave is what you want.
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
Use tapply to get the group stats, then add them back into your dataset afterwards.
Reproducible example:
means_by_wool <- with(warpbreaks, tapply(breaks, wool, mean))
warpbreaks$means.by.wool <- means_by_wool[warpbreaks$wool]
Untested solution for your scenario:
sum_by_id <- with(df, tapply(cand.perc, id, sum))
df$perc.total <- sum_by_id[df$id]
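Both the ave and the tapply answers follow the same two steps: compute one sum per group, then look that sum up again for every row. As a language-neutral illustration, here is that idea in plain Python. The ids and values are made-up toy data (the same as in the data.table example above), not the asker's data frame.

```python
from collections import defaultdict

# Toy data: one id and one cand.perc value per row.
ids = [1, 1, 1, 2, 2, 2, 3, 3, 3]
cand_perc = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Step 1: one sum per group (what tapply(cand.perc, id, sum) returns).
sum_by_id = defaultdict(int)
for group, value in zip(ids, cand_perc):
    sum_by_id[group] += value

# Step 2: look the group sum up for every row (what sum_by_id[df$id] does),
# giving a vector as long as the original data.
perc_total = [sum_by_id[group] for group in ids]
print(perc_total)
```

Each pass over the data is linear in the number of rows, which is consistent with these approaches being much faster in the timings above than splitting the data frame once per id.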
ilprincipe, if none of the above fits your needs, you could try transposing your data:
dft=t(df)
then use aggregate:
dfta=aggregate(dft,by=list(rownames(dft)),FUN=sum)
Next, restore the row names:
rownames(dfta)=dfta[,1]
dfta=dfta[,2:ncol(dfta)]
Transpose back to the original orientation:
df2=t(dfta)
and bind to the original data:
newdf=cbind(df,df2)
Why are you using cbind(x, ...)? The output of ddply will be appended automatically. This should work:
ddply(df, "id", transform, perc.total = sum(cand.perc))
Getting rid of the superfluous cbind should speed things up.
You can also load up your favorite foreach backend and try the .parallel=TRUE argument for ddply.

Resources