Which statistic to use when testing for multicollinearity in Python?

I've been reading a lot about multicollinearity but am still unsure whether to use the Durbin-Watson score, the eigenvalues or the variance inflation factor. I only have three independent variables and the eigenvalues are:
1.81768828 0.95241948 0.22989225
As I understand it, only values too close to zero indicate multicollinearity. I wasn't sure whether the last one (0.22) counts as "close to zero", but when checking its eigenvector, the result is:
-0.53977799 -0.44013805 0.71757802
and each of them would indicate collinearity, as they are NOT close to zero (this time it's the other way around). Am I correct up to this point?
The Durbin-Watson score is 1.93 (calculated through the summary() function from statsmodels with an added intercept). This does NOT show strong multicollinearity, right?
As nobody gives clear cutoff values, I am a bit confused as to which values count as "close to zero" and which do not.
Should I calculate the VIF as well, just to be extra sure?
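If it helps, this is how I would compute the VIF with statsmodels (a sketch with random placeholder data standing in for my real predictors):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # placeholder data standing in for the three independent variables
    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.random((100, 3)))  # intercept included, as in the model

    # one VIF per predictor (column 0 is the constant, so skip it)
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
    print(vifs)  # common rules of thumb flag VIF > 5 or > 10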
Any help is much appreciated!

Related

App Inventor app does not accept answers in divisions

I am trying to make an app that has the four basic mathematical operations: addition, subtraction, multiplication, and division.
Each operation has a series of exercises, and each correct exercise adds a point to a score counter.
I should clarify that both the exercises and the answers are selected at random; they are not questions and answers stored in a list.
Everything is ready and finished, but I have a problem with the division and it is as follows.
If, for example, the result of the division has exactly two decimals, the score counter takes as correct the answer that is selected. But if the result of the division has more than two decimals, the score counter does not take the answer as correct.
Example:
10/8 = 1.25 No more decimals, so the score counter takes it as the correct answer
9/7 = 1.28571428571 This answer has many decimals, so the score counter does not take it as the correct answer
The problem is not in rounding up the figures or in formatting the number of decimals. The problem is that for some reason, answers with more than two decimal numbers are not taken as correct.
No matter whether I round the result to an integer or set only 2 decimals for each result, the score counter does not show the result as correct.
For example, if I take the division 9/7 = 1.28571428571 and set only 2 decimals for the result, leaving it as 1.28, the score counter does not take this result as correct.
Even if I round the result to 1, the same problem occurs.
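For illustration, the same behaviour is easy to reproduce outside App Inventor; a minimal Python sketch (Python only to illustrate, since the app is built from blocks; the guess here is that the comparison still happens against the unrounded value):

    q = 9 / 7                    # what the app computes internally
    shown = round(q, 2)          # the two-decimal version that gets displayed

    print(q == shown)            # False: 1.2857142857142858 != 1.29
    print(10 / 8 == 1.25)        # True only because 1.25 is exact in binary

    # a robust check compares with a tolerance instead of exact equality
    eps = 0.005                  # half of the last displayed decimal place
    print(abs(q - shown) < eps)  # True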
How can this be fixed?
Many thanks to anyone who can help me find a solution.
P.S.: I'm not a programmer, just an amateur who is only starting out, so please keep the answers suitable for a layman like me. Thanks in advance.
Here are the blocks

How do I predict how some formula will behave with integers?

I am making some software that needs to work with integers.
I also need to apply some formula to those integers repeatedly over time (for example, doing x /= z several times in a row, an indefinite number of times).
All the tools, algorithms, and formulas I could think of or find either don't work with integers at all or work as approximations at best.
Take the repeated x /= z as an example: you can theoretically calculate what x will be after the 10th time by doing x = x/(z^10), but that will be wrong if the result is fractional; you can use floor(x/(z^10)), but the result will STILL be wrong.
The plotting software I found also either doesn't support integers at all or has floor()/ceil() functions at best, and the result would still run into the problem from the previous paragraph.
So how do I do it?
Here's something to get you going for the iteration of x /= z. Assume x >= 0 and z > 0, write x = a*z^2 + b*z + c with 0 <= b < z and 0 <= c < z, and let / denote integer division. Then:

    x/z     = (a*z^2 + b*z + c)/z   = a*z + b + c/z      = a*z + b
    (x/z)/z = (a*z + b)/z           = a + b/z            = a
    x/(z^2) = (a*z^2 + b*z + c)/z^2 = a + (b*z + c)/z^2  = a

since all three terms c/z, b/z and (b*z + c)/z^2 are 0 with regard to integer division. So dividing twice by z is the same as dividing once by z^2, and by induction, applying x /= z a total of k times yields exactly floor(x/z^k).
Now if x or z are negative, you can try and see whether this still holds; I did not invest the time to make the necessary case distinctions, but they should be fairly analogous.
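A brute-force check is cheap to script; a minimal sketch in Python (note that Python's // floors toward negative infinity, while C-style division truncates toward zero, so the negative cases below are specific to floor semantics):

    def iterated(x, z, k):
        # apply integer division x //= z a total of k times
        for _ in range(k):
            x //= z
        return x

    # compare against the closed form floor(x / z**k); with floor division
    # the identity holds for negative x as well, while truncating division
    # would need its own case analysis
    for x in range(-500, 500):
        for z in range(2, 16):
            for k in range(1, 5):
                assert iterated(x, z, k) == x // z**k

    # z < 0 breaks the identity even with floor division:
    # 7 // -2 // -2 == 2, but 7 // 4 == 1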
As Karoly Horvath mentions in a comment, without a clear specification of the kinds of functions for which you would like a shortcut to replace iterative evaluation, it isn't possible to help you: there are uncountably many functions over the integers, and the same approach won't work for all of them.

metafor: non-positive sampling variances

I am trying to learn meta-regression using the metafor package. In running one of the mixed regression models, I received an error indicating
"There are outcomes with non-positive sampling variances."
I am at a loss as to how to proceed with this error. I understand that certain model statistics (e.g., I^2 and QE) cannot be computed due to the presence of non-positive sampling variances. However, I am not sure whether these results can be interpreted the same way as they would be otherwise. I also tried using other estimators and/or the unweighted option; the error still persists.
Any suggestions would be much appreciated.
First of all, to clarify: You are getting a warning, not an error.
Aside from that, I can't think of many situations where it is reasonable to assume that the sampling variance is really equal to 0 in a particular study. I would first question whether this really makes sense. This is why the rma() function is generating this warning message -- to make the user aware of this situation and question whether this really is intended/reasonable.
But suppose that we really want to go through with this, then you have to use an estimator for tau^2 that can handle this (e.g., method="REML" -- which is actually the default). If the estimate of tau^2 ends up equal to 0 as well, then the model cannot be fitted at all (due to division by zero -- and then you get an error). If you do end up with a positive estimate of tau^2, then the results should be okay (but things like the Q-test, I^2, or H^2 cannot be computed then).
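To see where the division by zero comes from, consider the random-effects weights w_i = 1/(v_i + tau^2); a schematic illustration in Python (made-up variances, not metafor's internals):

    # made-up sampling variances; the first study's variance is zero
    vi = [0.0, 0.12, 0.08]

    def re_weights(vi, tau2):
        # random-effects inverse-variance weights: w_i = 1 / (v_i + tau^2)
        return [1.0 / (v + tau2) for v in vi]

    print(re_weights(vi, tau2=0.05))  # fine: v_i = 0 is absorbed by tau^2 > 0
    print(re_weights(vi, tau2=0.0))   # ZeroDivisionError -- the model cannot be fitted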

Expectation Maximization Reestimation

Typically, the re-estimation iterative procedure stops when lambda.bar - lambda is less than some epsilon value.
How exactly does one determine this epsilon value? I often only see it written as the general epsilon symbol in papers, and never the actual value used, which I assume would change depending on the data.
So, for instance, if the lambda value of my first iteration was 5*10^-22, the second was 1.3*10^-15, the third was 8.45*10^-15, and the fourth was 1.65*10^-14, etc., how would I determine when the algorithm needed no more iterations?
Moreover, what if I were to apply the same algorithm to a different dataset? Would I need to change my epsilon definitions?
Sorry for the long question. Pretty puzzled by it... :)
"how would I determine when the algorithm needed no more iteratons?"
When you get a "good-enough" result within a reasonable amount of time. ;-)
"Moreover, what if I were to apply the same alogrithm to a different datset? would I need to
change my epsilon definitions?"
Yes, most probably.
If you can afford it, you can just let it iterate until the updated value <= the old value (it could be < due to floating point error). I would be inclined to go with this until I ran out of patience or cpu budget.
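For what it's worth, a generic sketch of such a stopping rule in Python (step and loglik are placeholders for your model's update and scoring functions; comparing log-likelihoods also avoids the underflow that raw 10^-22-scale likelihoods invite):

    def run_em(step, loglik, params, rel_tol=1e-6, max_iter=1000):
        # step(params): one E-step + M-step; loglik(params): current fit
        old = loglik(params)
        for _ in range(max_iter):
            params = step(params)
            new = loglik(params)
            # a relative tolerance adapts to the scale of the data, which is
            # why a single absolute epsilon rarely carries over between data sets
            if new <= old or (new - old) < rel_tol * abs(old):
                break
            old = new
        return params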

Does Kernel::srand have a maximum input value?

I'm trying to seed a random number generator with the output of a hash. Currently I'm computing a SHA-1 hash, converting it to a giant integer, and feeding it to srand to initialize the RNG. This is so that I can get a predictable set of random numbers for an infinite set of Cartesian coordinates (I'm hashing the coordinates).
I'm wondering whether Kernel::srand actually has a maximum value that it'll take, after which it truncates it in some way. The docs don't really make this obvious - they just say "a number".
I'll try to figure it out myself, but I'm assuming somebody out there has run into this already.
Knowing what programmers are like, it probably just calls libc's srand(). Either way, it's probably limited to 2^32-1, 2^31-1, 2^16-1, or 2^15-1.
There's also a danger that the value is clipped when cast from a biginteger to a C int/long, instead of only taking the low-order bits.
An easy test is to seed with 1 and take the first output. Then, seed with 2**i+1 for i in [1..64] or so, take the first output of each, and compare. If you get a match for some i=n and all greater i, then it's probably doing arithmetic modulo 2**n.
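That test is easy to script against any seedable PRNG; a sketch of the idea in Python (Ruby's srand would be exercised the same way):

    import random

    def first_output(seed):
        random.seed(seed)               # CPython feeds the whole big integer
        return random.getrandbits(32)   # into the Mersenne Twister state

    base = first_output(1)
    # if 2**i + 1 collides with seed 1 for every i >= n, the seed is
    # probably being reduced modulo 2**n (CPython prints nothing here)
    for i in range(1, 65):
        if first_output(2**i + 1) == base:
            print("possible truncation at 2**%d" % i)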
Note that the random number generator is almost certainly limited to 32 or 48 bits of entropy anyway, so there's little point seeding it with a huge value, and an attacker can reasonably easily predict future outputs given past outputs (and an "attacker" could simply be a player on a public nethack server).
EDIT: So I was wrong.
According to the docs for Kernel::rand(),
Ruby currently uses a modified Mersenne Twister with a period of 2**19937-1.
This means it's not just a call to libc's rand(). The Mersenne Twister is statistically superior (but not cryptographically secure). But anyway.
Testing using Kernel::srand(0); Kernel::sprintf("%x", Kernel::rand(2**32)) for various output sizes (2**16, 2**32, 2**36, 2**60, 2**64, 2**32+1, 2**35, 2**34+1), a few things are evident (the masking scheme is sketched in Python after the list):
It figures out how many bits it needs (number of bits in max-1).
It generates output in groups of 32 bits, most-significant-bits-first, and drops the top bits (i.e. 0x[r0][r1][r2][r3][r4] with the top bits masked off).
If it's not less than max, it does some sort of retry. It's not obvious what this is from the output.
If it is less than max, it outputs the result.
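The mask-and-retry scheme above is the standard way to get unbiased bounded output from a bitwise generator; sketched in Python:

    import random

    def bounded_rand(max_value):
        bits = (max_value - 1).bit_length()  # just enough bits for max-1
        while True:
            r = random.getrandbits(bits)     # top bits beyond that are masked off
            if r < max_value:                # retry keeps the result unbiased
                return r

    print(bounded_rand(2**34 + 1))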
I'm not sure why 2**32+1 and 2**64+1 are special (they produce the same output from Kernel::rand(2**1024), so they probably have the exact same state); I haven't found another collision.
The good news is that it doesn't simply clip to some arbitrary maximum (i.e. passing in huge numbers isn't equivalent to passing in 2**31-1), which is the most obvious thing that can go wrong. Kernel::srand() also returns the previous seed, which appears to be 128-bit, so it seems likely to be safe to pass in something large.
EDIT 2: Of course, there's no guarantee that the output will be reproducible between different Ruby versions (the docs merely say what it "currently uses"; apparently this was initially committed in 2002). Java has several portable deterministic PRNGs (SecureRandom.getInstance("SHA1PRNG","SUN"), albeit slow); I'm not aware of something similar for Ruby.
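One portable alternative is to build the stream directly from the hash; a counter-mode sketch in Python (the construction in general, not any particular library's API; Ruby's Digest::SHA1 could do the same):

    import hashlib

    def sha1_stream(seed_bytes):
        # hash(seed || counter) yields a stream that is reproducible
        # across versions and trivially portable between languages
        counter = 0
        while True:
            block = hashlib.sha1(seed_bytes + counter.to_bytes(8, "big")).digest()
            yield int.from_bytes(block[:4], "big")  # 32 bits per hash
            counter += 1

    gen = sha1_stream(b"x=12,y=-7")  # e.g. the coordinates being hashed
    print([next(gen) for _ in range(3)])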
