How to correct standard errors in a multinomial logit using IV

I am trying to estimate a multinomial logit model using an instrumental variable. I didn't find any preexisting package, so I tried to estimate it with a two-stage approach.
First, I estimated the first stage as an OLS regression on the instrument:
tsls1 <- lm(d ~ x + z)
Then I took the fitted values:
d.hat <- fitted.values(tsls1)
With those, I used the multinom function from the nnet package:
tsls2 <- multinom(y ~ x + d.hat)
The problem is that the standard errors are wrong. I was wondering how I could correct them, or whether there is an easier way.
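One standard remedy, sketched here rather than taken from the thread: bootstrap the two stages together, so the reported standard errors account for the sampling error in d.hat (plugging in a generated regressor makes the naive second-stage errors too small). The data frame dat and the column names y, x, d, z are assumed, and y should be a factor.

library(nnet)
library(boot)

two_stage <- function(data, idx) {
  db <- data[idx, ]                            # resample rows with replacement
  stage1 <- lm(d ~ x + z, data = db)           # first stage: OLS on the instrument
  db$d.hat <- fitted.values(stage1)
  stage2 <- multinom(y ~ x + d.hat, data = db, trace = FALSE)
  as.vector(coef(stage2))                      # flatten the coefficient matrix
}

set.seed(1)
bs <- boot(dat, two_stage, R = 500)
apply(bs$t, 2, sd)                             # bootstrap standard errors

Each replicate re-runs both stages, which is what propagates the first-stage uncertainty into the second-stage standard errors.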

Related

Package for Multivariate Multinomial Logit

I would like to jointly estimate 3 variables. Two of them are categorical and the other one is binary, so I thought about a "multivariate multinomial logit model". I found a lot of theory about it (for example, Agresti 2007, Ch. 9, or Bel and Paap 2014), but I cannot find a package for R. Is there a built-in function or package I can use? I can switch to a bivariate multinomial logit if needed.
Thank you very much for your help in this matter!
There are several packages that might interest you for a multinomial logit model. They are mlogit, mnlogit, antitrust, and nnet.
mlogit: This is the most direct multinomial logit package currently available. It provides sample data, tools to estimate multinomial logit models, and additional useful functions such as mlogit.optim to optimize specific parameters of multinomial logit functions.
mnlogit: This package is similar to mlogit, but it does not provide as many additional functions. It may be faster for the actual estimation process though.
antitrust: This package can estimate merger effects under logit (or nested logit) demand. This does not directly provide the multinomial logit coefficients, but it is very good at solving for the bottom line HHI and price effects of a merger.
nnet: This is a package for general multinomial log-linear models, and it can also estimate multinomial logit models.
Hope one of these packages helps for your purposes!
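For a flavor of the basic single-equation workflow, here is a minimal mlogit sketch using the package's bundled Fishing data (a sketch only; a joint multivariate model in the spirit of Bel and Paap would still require custom likelihood code):

library(mlogit)
data("Fishing", package = "mlogit")
# mlogit wants one row per alternative ("long" format);
# recent mlogit versions use dfidx() in place of mlogit.data()
Fish <- mlogit.data(Fishing, varying = 2:9, shape = "wide", choice = "mode")
fit <- mlogit(mode ~ price + catch, data = Fish)
summary(fit)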

How to implement Breusch-Godfrey test for a regression with ARIMA errors in R

I'm fitting a regression with ARIMA errors with the fable package, and as mentioned in my previous question, the Breusch-Godfrey test is not available there.
The regression part of the model has two pairs of Fourier terms to account for yearly seasonality and several exogenous regressors. The residuals are modeled with a seasonal ARIMA(2,0,0)(1,0,0)[7] model. My goal is to check for autocorrelation in residuals.
I can use the Ljung-Box test but according to this thread and textbook sources there it will not be valid in presence of lags of the dependent variable.
I'm also afraid I will lose my model specification by switching packages/libraries. An alternative might be to use Arima from the forecast package, which retains the model specification, and then use bgtest from the lmtest package. But I can't figure out how to do this.
According to this R forum, the Breusch-Godfrey test for an ARIMA model can be done by fitting a simple regression of the residuals from the fitted model on a constant and then running bgtest. But that discussion only concerns a simple AR(1) model with no exogenous regressors.
Is this the right way to do it? I'm concerned because the BG test requires an auxiliary regression on the regressors and lagged residuals up to order p. How would bgtest know the X variables in this case, since they are not stored in the residuals object, which is a simple vector?
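For what it's worth, here is a sketch of that workaround; df, y, and xreg_mat are placeholder names, not from the thread:

library(forecast)
library(lmtest)

# regression with seasonal ARIMA(2,0,0)(1,0,0)[7] errors; the Fourier terms
# and exogenous regressors are assumed to be pre-built into xreg_mat
fit <- Arima(df$y, order = c(2, 0, 0),
             seasonal = list(order = c(1, 0, 0), period = 7),
             xreg = xreg_mat)

# auxiliary regression of the residuals on a constant only;
# bgtest adds the lagged residuals itself
aux <- lm(residuals(fit) ~ 1)
bgtest(aux, order = 7)

Note that this treats the residuals as a plain series: the original X variables never enter the auxiliary regression, which is exactly the concern raised above, so the result is at best an approximation to a full Breusch-Godfrey test.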

Gensim LDA: coherence values not reproducible between runs

I used this code, https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/, to find topic coherence for a dataset. When I ran it with the same number of topics, I got new values after each run. For example, for 10 topics I got the following values over two runs:
First run (number of topics = 10):
Coherence Score CV_1: 0.31230269562327095
Coherence Score UMASS_1: -3.3065236823786064
Second run (number of topics = 10):
Coherence Score CV_2: 0.277016662550274
Coherence Score UMASS_2: -3.6146150653617743
What is the reason? Given this instability, how can we trust this library? The highest coherence value changed as well.
TL;DR: coherence is not "stable" (i.e., reproducible between runs) in this case because of fundamental LDA properties. You can make LDA reproducible by setting random seeds and PYTHONHASHSEED=0. You can take other steps to improve your results.
Long Version:
This is not a bug, it's a feature.
It is less a question of trust in the library than of understanding the methods involved. The scikit-learn library also has an LDA implementation, and it will likewise give you different results on each run. By its very nature, LDA is a generative probabilistic method: simplifying a little, each time you use it, many Dirichlet distributions are generated, followed by inference steps. Both the inference and the distribution generation depend on random number generators, so each fitted model is slightly different, and calculating the coherence of these models will give you different results every time.
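For concreteness, standard LDA (independent of the library) draws, for each document d, topic k, and word position n,

\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad \phi_k \sim \mathrm{Dirichlet}(\eta), \quad z_{d,n} \sim \mathrm{Categorical}(\theta_d), \quad w_{d,n} \sim \mathrm{Categorical}(\phi_{z_{d,n}}),

and the inference step (Gibbs sampling or variational Bayes) starts from a random initialization, so two runs generally end in different local optima of the same objective - hence different topics and different coherence scores.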
But that doesn't mean the library is worthless. It is a very powerful library that is used by many companies (Amazon and Cisco, for example) and academics (NIH, countless researchers) - to quote from gensim's About page:
By now, Gensim is—to my knowledge—the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.
If that is what you want, gensim is the way to go - certainly not the only way to go (tmtoolkit or sklearn also have LDA) but a pretty good choice of paths. That being said, there are ways to ensure reproducibility between model runs.
Gensim Reproducibility
Set PYTHONHASHSEED=0
From the Python documentation: "On Python 3.3 and greater, hash randomization is turned on by default."
Use random_state in your model specification
Afaik, all of the gensim methods have a way of specifying the random seed to be used. Choose any number you like, but avoid the default value of zero ("off"), and use the same number for each rerun - this ensures that the same input into the random number generators always results in the same output (see the gensim ldamodel documentation).
Use ldamodel.save() and ldamodel.load() for model persistence
This is also a very useful, timesaving step that keeps you from having to re-run your models every time you start (very important for long-running models).
Optimize your models and data
This doesn't technically make your models perfectly reproducible, but even without the random seed settings, you will see your model perform better (at the cost of computation time) if you increase iterations or passes. Preprocessing also makes a big difference and is an art unto itself - do you choose to lemmatize or stem, and why? This can all have important effects on the outputs and your interpretations.
Caveat: you must use one core only
Multicore methods (LdaMulticore and the distributed versions) are never 100% reproducible, because of the way the operating system handles multiprocessing.

PyMC: Hidden Markov Models

How suitable is PyMC in its currently available versions for modelling continuous emission HMMs?
I am interested in having a framework where I can easily explore model variations, without having to update E- and M-step, and dynamic programming recursions for every change I make to the model.
More specific questions are:
When modelling an HMM in PyMC, can I solve the 'typical' tasks one would like to address - i.e., besides parameter estimation, also infer the most likely state sequence (as usually done with the Viterbi algorithm), or solve a smoothing problem?
Compared to an implementation with Expectation Maximization, I would expect a sampling-based approach to be slower. If that gives me more flexibility on the model-building side, that is fine. I would imagine using PyMC for prototyping models. I am wondering, though, whether I can expect PyMC inference for models with > 10k observations to finish in any reasonable amount of time.
Would you recommend starting out with PyMC2 or PyMC3 for model building? I know that the inference engine changed between the versions, so I especially wonder what type of sampler might be more suited.
If you think PyMC is not a good choice for my use case, that definitely helps as an answer as well.

Automatic probability densities

I have found automatic differentiation to be extremely useful when writing mathematical software. I now have to work with random variables and functions of the random variables, and it seems to me that an approach similar to automatic differentiation could be used for this, too.
The idea is to start with a basic random vector with a given multivariate distribution and then work with the implied probability distributions of functions of its components. You would define operators that automatically combine two probability distributions appropriately when you add, multiply, or divide two random variables, and that transform a distribution appropriately when you apply a scalar function such as exponentiation. You could then compose these to build any function you need of the original random variables and automatically have the corresponding probability distribution available.
Does this sound feasible? If not, why not? If so, and since it's not a particularly original thought, could someone point me to an existing implementation, preferably in C?
There has been a lot of work on probabilistic programming. One issue is that as your distribution gets more complicated, you start needing more complex techniques to sample from it.
There are a number of ways this is done. Probabilistic graphical models give one vocabulary for expressing these models, and you can then sample from them using various Metropolis-Hastings-style methods. Here is a crash course.
Another approach is probabilistic programming, where the model is expressed directly in an embedded domain-specific language. Oleg Kiselyov's HANSEI is an example of this approach. Once the program is written, the system can inspect the tree of decisions and expand it out by a form of importance sampling to gain the most information possible at each step.
You may also want to read "Nonstandard Interpretations of Probabilistic Programs for Efficient Inference" by Wingate et al., which describes one way to use extra information about the derivative of your distribution to accelerate Metropolis-Hastings-style sampling techniques. I personally use automatic differentiation to calculate those derivatives, which brings the topic back to automatic-differentiation. ;)
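As a crude illustration of the original idea (my sketch, not part of the answer above): represent each random variable by a large vector of Monte Carlo samples, and ordinary elementwise arithmetic then propagates the implied distributions automatically, with the density of any derived quantity recovered by kernel estimation. In R:

set.seed(42)
n <- 1e5
x <- rnorm(n, mean = 1, sd = 0.5)      # X ~ N(1, 0.25)
y <- rgamma(n, shape = 2, rate = 1)    # Y ~ Gamma(2, 1)
z <- exp(x) / (1 + y)                  # any function of X and Y
plot(density(z))                       # kernel estimate of the implied density
quantile(z, c(0.025, 0.975))           # interval estimates come for free

An exact, closed-form version with operators acting directly on densities is much harder, which is why the probabilistic-programming systems above fall back on sampling as the expressions grow.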
