I am running a structural equation model with the lavaan.survey package to account for a complex survey design.
I have three latent and two manifest exogenous variables, and one manifest endogenous variable. All variables are ordinal.
I run sem with the "DWLS" estimator, followed by the same estimator with the lavaan.survey function. This gives me strange results, with large standard errors and p-values close to 1.
I don't follow the two-step estimation (with and without the survey procedure) employed in lavaan.survey. Do I need "DWLS" in both estimation steps, or can I use robust maximum likelihood for the final estimation?
Are you referring to the R package or to Amos?
As you know, Amos is the best software for SEM.
Statistical Consultant
"The lasso method requires initial standardization of the regressors,
so that the penalization scheme is fair to all regressors. For
categorical regressors, one codes the regressor with dummy variables
and then standardizes the dummy variables" (p. 394).
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model.
Statistics in medicine, 16(4), 385-395. http://statweb.stanford.edu/~tibs/lasso/fulltext.pdf
Similar to package ‘glmnet,’ the h2o.glm function includes a ‘standardize’ parameter that is true by default. However, if predictors are stored as factors within the input H2OFrame, H2O does not appear to standardize the automatically encoded factor variables (i.e., the resultant dummy or one-hot vectors). I've confirmed this experimentally, but references to this decision also show up in the source code:
For instance, method denormalizeBeta (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L359) includes the comment "denormalize only the numeric coefs (categoricals are not normalized)." It also looks like means (variable _normSub) and standard deviations (inverse of variable _normMul) are only calculated for the numerical variables, and not the categorical variables, in the setTransform method (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L599).
In contrast, package 'glmnet' seems to expect categorical variables to be dummy-coded prior to fitting a model, using a function like model.matrix. The dummy variables are then standardized along with the continuous variables. It seems like the only way to avoid this would be to pre-standardize the continuous predictors only, concatenate them with the dummy variables, and then run glmnet with standardize=FALSE.
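The workaround described above (standardize only the continuous predictors, then attach the dummy columns unchanged) can be sketched in a few lines. This is a minimal illustration in plain Python, not glmnet or H2O code; the column names and toy values are hypothetical, and the population SD (divisor n) is an assumption made for the sketch.

```python
import statistics

def standardize(column):
    """Center a numeric column and scale it to unit population SD."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(value - mean) / sd for value in column]

# Hypothetical toy data: one continuous predictor and one 0/1 dummy.
age = [23.0, 35.0, 41.0, 52.0, 60.0, 29.0]
is_smoker = [0, 1, 0, 0, 1, 0]

# Standardize the continuous column only; leave the dummy untouched.
age_z = standardize(age)

# Rows of the design matrix that would be passed to a penalized fit
# with the library's own standardization switched off.
design = [[a, d] for a, d in zip(age_z, is_smoker)]
```

The resulting matrix would then be handed to the fitting routine with its built-in standardization disabled (standardize=FALSE in glmnet).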
Statistical Considerations:
For a dummy variable or one-hot vector, the mean is the proportion p of TRUE values, and the SD is determined by that proportion: σ = sqrt(p(1 − p)). The SD reaches its maximum when the proportions of TRUE and FALSE values are equal (i.e., σ = 0.5 at p = 0.5), in which case the sample SD (s) approaches 0.5 as n → ∞. Thus, if continuous predictors are standardized to have SD = 1, but dummy variables are left unstandardized, the continuous predictors will have at least twice the SD of the dummy predictors, and more than twice the SD for imbalanced dummy variables.
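As a quick check of these numbers: a 0/1 variable with a proportion p of TRUE values has population SD sqrt(p(1 − p)). The short sketch below tabulates that SD, and how many times larger an SD of 1 is, for a few illustrative proportions (the values of p are arbitrary choices).

```python
import math

def dummy_sd(p):
    """Population SD of a 0/1 variable whose proportion of ones is p."""
    return math.sqrt(p * (1.0 - p))

# SD of the dummy, and the ratio of a unit-SD continuous predictor to it.
for p in (0.5, 0.3, 0.1, 0.01):
    sd = dummy_sd(p)
    print(f"p={p:<5} sd={sd:.4f} unit-SD ratio={1.0 / sd:.2f}")
```

The ratio is exactly 2 for a balanced dummy and grows as the classes become more imbalanced.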
It seems like this could be a problem for regularization (LASSO, ridge, elastic net), because the scale/variance of predictors is expected to be equal so that the regularization penalty (λ) applies evenly across predictors. If two predictors A and B have the same standardized effect size, but A has a smaller SD than B, A will necessarily have a larger unstandardized coefficient than B. Because the penalty is applied to the size of the coefficient, it will then fall more heavily on A than on B if both are left unstandardized. In a regularized regression with a mixture of standardized continuous predictors and unstandardized categorical predictors, it seems like this could lead to systematic over-penalization of categorical predictors.
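The scale argument can be made concrete with an unpenalized fit: shrinking a predictor's SD by some factor inflates its coefficient by the same factor, so a penalty on the coefficient's magnitude grows accordingly. The sketch below uses a hand-written one-variable least-squares slope and made-up noiseless data; it illustrates the scaling only and is not a lasso fit.

```python
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical predictor B and a noiseless response.
x_b = [-1.5, -0.5, 0.0, 0.5, 1.5]
y = [2.0 * value for value in x_b]

# Predictor A carries the same information at half the SD.
x_a = [value / 2.0 for value in x_b]

slope_b = ols_slope(x_b, y)
slope_a = ols_slope(x_a, y)

# An L1 penalty charges lambda * |coefficient|, so A is charged twice as much.
penalty_ratio = abs(slope_a) / abs(slope_b)
```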
A commonly expressed concern is that standardizing dummy variables removes their normal interpretation. To avoid this issue, while still placing continuous and categorical predictors on an equal footing, Gelman (2008) suggested standardizing continuous predictors by dividing by 2 SD, rather than 1, resulting in standardized predictors with SD = 0.5. However, it seems like this would still be biased for class-imbalanced dummy variables, for which the SD might be substantially less than 0.5.
Gelman, A. (2008). Scaling regression inputs by dividing by two
standard deviations. Statistics in medicine, 27(15), 2865-2873.
Is H2O's approach of not standardizing one-hot vectors for regularized regression correct? Could this lead to a bias toward over-penalizing dummy variables / one-hot vectors? Or has Tibshirani (1997)'s recommendation since been revised for some reason?
Personally, I would rather keep the binary features untouched and apply MinMaxScaler (scaling to the range 0 to 1) to the numeric features instead of standardizing them. This puts the numeric features on a standard deviation scale similar to that of the binary features.
My question is: can I generate a random number in Uppaal?
I would like to generate a number from a range of values. Moreover, I would like to generate not just integers but double values as well.
For example: a double in [7.25, 18.3]
I found this question, which discusses the same problem, and I tried its suggestion.
However, I got this error: syntax error unexpected T_SELECT.
It doesn't work. I'm pretty new to the Uppaal world, and I would appreciate any help you can provide.
This is a common and misunderstood question in Uppaal.
Simple answer:
double val; // declaration
val = random(18.3-7.25)+7.25; // use in update, works in SMC (Uppaal v4.1)
Verbose answer:
Uppaal supports symbolic analysis as well as statistical analysis, and the treatment and possibilities are radically different. So one has to decide first what kind of analysis is needed. Usually one starts with simple symbolic analysis and then augments the model with stochastic features; sometimes stochastic behavior also needs to be checked symbolically.
In symbolic analysis (queries A[], A<>, E<>, E[] etc), random is synonymous with non-deterministic, i.e. if the model contains some "random" behavior, then verification should check all of the alternatives anyway. Therefore such behavior is modelled as non-deterministic choices between edges. It is easy to set up a set of edges over an integer range by using a select statement on the edge, where a temporary variable is declared and its value can be used in guards, synchronization and update. Symbolic analysis supports only integer data types (no floating point types like double) and continuous ranges over clocks (specified by constraints in guards and invariants).
Statistical analysis (via Monte-Carlo simulations, queries like Pr[...](<> p), E[...](max: var), simulate, etc) supports double types and floating point functions like sin, cos, sqrt, random(MAX) (uniform distribution over [0, MAX)), random_normal(mean, dev) etc. in addition to int data types. Clock variables can also be treated as floating point type, except that their derivative is set to 1 by default (can be changed in the invariants which allow ODEs -- ordinary differential equations).
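The "simple answer" relies on a generic scale-and-shift trick: a uniform draw on [0, MAX) becomes a draw on [lo, hi) by taking MAX = hi - lo and adding lo. The sketch below shows the same arithmetic in Python rather than Uppaal, purely to illustrate the idea; the helper name and the seed are hypothetical.

```python
import random

def uniform_in_range(lo, hi, rng):
    """Map a uniform draw on [0, 1) onto the interval from lo to hi."""
    # Scale by the interval width, then shift by the lower bound.
    return rng.random() * (hi - lo) + lo

rng = random.Random(42)  # fixed seed so the run can be repeated
samples = [uniform_in_range(7.25, 18.3, rng) for _ in range(1000)]
```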
It is possible to create models with floating point operations (including random) and still apply symbolic analysis provided that the floating point variables do not influence/constrain the model behavior, and act merely as a cost function over the state space. Here are systematic rules to achieve this:
a) the clocks used in ODEs must be declared of hybrid clock type.
b) hybrid clock and double type variables cannot appear in guard and invariant constraints. Only ODEs are allowed over the hybrid clocks in the invariant.
Suppose we have the set of floating point numbers with an "m"-bit mantissa and "e" bits for the exponent. Suppose moreover that we want to approximate a function "f".
From the theory we know that usually a "range reduced function" is used, and from that function we derive the value of the global function.
For example, let x = (sx, ex, mx) (sign, exponent and mantissa); then
log2(x) = ex + log2(1.mx) so basically the range reduced function is "log2(1.mx)".
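The decomposition above can be tried out directly in double precision. The sketch below uses Python's math.frexp to split x into exponent and significand and, as a stand-in for a custom approximation, calls math.log2 on the reduced argument; the function name is hypothetical and the generic (e, m) format from the question is not modelled.

```python
import math

def log2_range_reduced(x):
    """Compute log2(x) as ex + log2(1.mx) for a positive finite x."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    # frexp gives x = frac * 2**exp with 0.5 <= frac < 1.
    frac, exp = math.frexp(x)
    significand = 2.0 * frac   # the "1.mx" part, in [1, 2)
    exponent = exp - 1         # the "ex" part
    # Stand-in for the range-reduced approximation of log2 on [1, 2).
    reduced = math.log2(significand)
    return exponent + reduced
```

Since ex is an exact integer, the absolute error of the reduced function passes through unchanged apart from the rounding of the final addition; how many ulps that represents depends on the magnitude of the result.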
So far I have implemented reciprocal, square root, log2 and exp2, and recently I've started to work on the trigonometric functions. But I was wondering whether, given a global error bound (an ulp error bound in particular), it is possible to derive an error bound for the range reduced function. Is there any study of this kind of problem? Taking log2(x) as an example, I would like to be able to say...
"OK, I want log2(x) with k ulp error; to achieve this in our floating point system, we need to approximate log2(1.mx) with p ulp error."
Remember that, as I said, we are working with floating point numbers but the format is generic, so it could be the classic F32, but also, for example, e = 10, m = 8, and so on.
I can't find any reference that presents this kind of study. The reference I have (Muller's book) doesn't treat the topic in this way, so I am looking for a paper or something similar. Do you know of any reference?
I'm also trying to derive such a bound myself, but it is not easy...
There is a description of current practice, along with a proposed improvement and an error analysis, at https://hal.inria.fr/ensl-00086904/document. The description of current practice appears consistent with the overview at https://docs.oracle.com/cd/E37069_01/html/E39019/z4000ac119729.html, which is consistent with my memory of the most talked about problem being the mod pi range reduction of trigonometric functions.
I think IEEE floating point was a big step forwards just because it standardized things at a time when there were a variety of computer architectures, so lowering the risks of porting code between them, but the accuracy requirements implied by this may have been overkill: for many problems the constraint on the accuracy of the output is the accuracy of the input data, not the accuracy of the calculation of intermediate values.
Following most estimation commands in Stata (e.g. reg, logit, probit, etc.) one may access the estimates using the _b[ParameterName] syntax (or the synonymous _coef[ParameterName]). For example:
regress y x
followed by
di _b[x]
will display the estimate of the coefficient of x. di _b[_cons] will display the estimate of the intercept (assuming the regress command was successful), etc.
But if I use the nonlinear least squares command nl I (seemingly) have to do something slightly different. Now (leaving aside that for this example model there is absolutely no need to use a NLLS regression):
nl (y = {_cons} + {x}*x)
followed by (notice the forward slash)
di _b[/x]
will display the estimate of the coefficient of x.
Why does accessing parameter estimates following nl require a different syntax? Are there subtleties to be aware of?
"leaving aside that for this example model there is absolutely no need to use a NLLS regression": I think that's what you can't do here....
The question is about why the syntax is as it is. That's a matter of logic and a matter of history. Why a particular syntax was chosen is ultimately a question for the programmers at StataCorp who chose it. Here is one limited take on your question.
The main syntax for regression-type models grows out of a syntax designed for linear regression models in which by default the parameters include an intercept, as you know.
The original syntax for nonlinear regression models (in the sense of being estimated by nonlinear least-squares) matches a need to estimate a bundle of parameters specified by the user, which need not include an intercept at all.
Otherwise put, there is no question of an intercept being a natural default; no parameterisation is a natural default and each model estimated by nl is sui generis.
A helpful feature is that users can choose the names they find natural for the parameters, within the constraints of what counts as a legal name in Stata, say alpha, beta, gamma, a, b, c, etc. If you choose _cons for the intercept in nl that is a legal name but otherwise not special and just your choice; nl won't take it as a signal that it should flip into using regress conventions.
The syntax you cite is part of what was made possible by a major redesign of nl but it is consistent with the original philosophy.
That the syntax is different because it needs to be may not be the answer you seek, but I guess you'll get a fuller answer only from StataCorp; developers do hang out on Statalist, but they don't make themselves visible here.
I'm not sure StackOverflow is the right place to ask this question, because this question is half-programming and half-mathematics. And also really sorry if my question is stupid ^_^
I'm studying Monte Carlo simulations using the "Monte Carlo Methods" book. One of the first things I must learn about is random number generators. The basic algorithm of an RNG is:
1. Initialize: Draw the seed S0 from the distribution µ on S. Set t = 1.
2. Transition: Set St = f(St−1).
3. Output: Set Ut = g(St).
4. Repeat: Set t = t+ 1 and return to Step 2.
(µ is a probability distribution on the finite set of states S, the input is S0, and the random number we desire is the output Ut)
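Steps 1 to 4 can be written out as a small generator. The sketch below picks one concrete choice of f and g, a multiplicative congruential generator with the Park-Miller constants; that choice is an assumption made for illustration and is not taken from the book.

```python
class CongruentialGenerator:
    """Generator with explicit transition f and output g."""

    MODULUS = 2**31 - 1   # states are the integers 1 .. MODULUS - 1
    MULTIPLIER = 16807    # Park-Miller "minimal standard" multiplier

    def __init__(self, seed):
        # Step 1: initialize the state S0 from the seed (0 is not a valid state).
        self.state = seed % self.MODULUS or 1

    def next_uniform(self):
        # Step 2: transition, S_t = f(S_{t-1}).
        self.state = (self.MULTIPLIER * self.state) % self.MODULUS
        # Step 3: output, U_t = g(S_t), a number strictly between 0 and 1.
        return self.state / self.MODULUS

# Step 4: repeat as many times as the caller needs numbers.
generator = CongruentialGenerator(seed=12345)
first_five = [generator.next_uniform() for _ in range(5)]
```

The loop has no stopping rule of its own: each pass yields one number, and the caller stops when it has drawn as many as it needs.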
It is not hard to understand, but my problem is that I don't see where the randomness comes from; it seems to lie only in the number of repetitions. How do we decide when to stop the RNG loop? All the example RNG implementations I have read loop 100 times, and they return the same values for a given seed. It is not random at all >_<
Can someone explain what I'm missing here? Any help will be appreciated. Thanks everyone
You can't get a true sequence of random numbers on a computer, without specialized hardware. (Such specialized hardware performs the equivalent of an initial roll of the dice using physics to provide the randomness. Electronic ones often use the electronic noise of specialized diodes at constant temperatures; others use radioactive decay events.)
Without that specialized hardware, what you can generate are pseudorandom numbers which, as you've observed, always generate the same sequence of numbers for the same initial seed. For simple applications, you can often get away with generating an initial seed from the time of invocation, which is effectively random.
And when I say "simple applications," I am excluding cryptography. (Not just that, but especially that.)
Sometimes when you are trying to debug a simulation, you actually want a reproducible stream of "random" numbers, so you might deliberately set a stream to start with a specific seed.
For instance, in the answer to "Creating a facet_wrap plot with ggplot2 with different annotations in each plot", rcs starts by creating a reproducible set of data using the R code
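The two seeding strategies mentioned here, a fixed seed for debugging and a time-based seed for casual use, look like this with Python's standard random module. The seed value 2024 and the helper name are arbitrary choices for the sketch.

```python
import random
import time

def draw(seed, count=5):
    """Return `count` pseudorandom numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

# Fixed seed: every run, and every call, reproduces the same stream.
debug_run_a = draw(2024)
debug_run_b = draw(2024)

# Time-based seed: differs between invocations; not suitable for cryptography.
casual_run = draw(time.time_ns())
```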
df <- data.frame(x=rnorm(300), y=rnorm(300), cl=gl(3,100)) # create test data
before going on to demonstrate how to answer the actual question.