Is multicollinearity an issue if I do not intend to interprest the coefficient if the varaible? - interpretation

I am creating a model to check whether the threat of entry affects prices in the same way in two different samples. I will therefore only interpret the coefficients of the time dummies, which will hopefully be significant. In the model I will also add other parameters to control for systematic differences in characteristics between the two groups, to make them more comparable. Therefore, I do not care about whether these coefficients are significant, they just have to be controls. Do I have to care about multicollinearity among those controls? or even if the information they carry is for the most part similar (corr.coef 0.77), the extra is still worth keeping to better isolate the effect expressed by the time dummies?

Related

XGBOOST/lLightgbm over-fitting despite no indication in cross-validation test scores?

We aim to identify predictors that may influence the risk of a relatively rare outcome.
We are using a semi-large clinical dataset, with data on nearly 200,000 patients.
The outcome of interest is binary (i.e. yes/no), and quite rare (~ 5% of the patients).
We have a large set of nearly 1,200 mostly dichotomized possible predictors.
Our objective is not to create a prediction model, but rather to use the boosted trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, that may have some influence on the outcome), so we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is somewhat possible that none of the possible predictors we are considering have any influence on the risk of developing the condition, so if we were aiming to develop a prediction model it would have likely been a rather bad one. For this work, we use the R implementation of XGBoost/lightgbm.
We have been having difficulties tuning the models. Specifically when running cross validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see figure below for nrounds=600,000 from xgboost). This is observed even when increasing the learning rate (eta), or when adding some regularization parameters (e.g. max_delta_step, lamda, alpha, gamma, even at high values for these).
As expected, the CV test score is always lower than the train score, but continuous to improve without ever showing a clear sign of over fitting. This is true regardless of the evaluation metrics that is used (example below is for logloss, but the same is observed for auc/aucpr/error rate, etc.). Relatedly, the same phenomenon is also observed when using a grid search to find the optimal value of tree depth (max_depth). CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of over fitting.
Note that owing to the rare outcome, we use a stratified CV approach. Moreover, the same is observed when a train/test split is used instead of CV.
Are there situations in which over fitting happens despite continuous improvements in the CV-test (or test split) scores? If so, why is that and how would one choose the optimal values for the hyper parameters?
Relatedly, again, the idea is not to create a prediction model (since it would be a rather bad one, owing that we don’t know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees is not the optimal method for this, are there others to come to mind? Again, part of the reason we chose to use boosted trees was to enable the identification of higher (i.e. more than 2) order interactions, which cannot be easily assessed using more conventional methods (including lasso/elastic net, etc.).
welcome to Stackoverflow!
In the absence of some code and representative data it is not easy to make other than general suggestions.
Your descriptive statistics step may give some pointers to a starting model.
What does existing theory (if it exists!) suggest about the cause of the medical condition?
Is there a male/female difference or old/young age difference that could help get your foot in the door?
Your medical data has similarities to the fraud detection problem where one is trying to predict rare events usually much rarer than your cases.
It may pay you to check out the use of xgboost/lightgbm in the fraud detection literature.

How to calculate vectors with word2vec

Assume I have a number of pictures. Let’s say 10 pictures which are annotated by 50 people each.
So Pic 1 might be „beach, vacation, relax, sand, sun…“ I now trained word2vec with a domain specific content. I have the vectors of each word and can represent them. But what I want now, is to create ONE final vector representing each picture. So one vector with represents the 50 annotations (beach, vacation, relax, sand, sun…)
Let’s assume each vector is represented with 100 dimensions – do I just add the first dimension (the 100 dimensions) of all 50 vectors, than the 2nd dimension of all 50 vectors… etc.
I am very thankful for any comments that might help me!
I tried this, but I am not sure if this is the right way to do it.
I also tried doc2vec but I guess this is problematic as the word order of the annotations is irrelevant – but relevant for doc2vec.???
A few thoughts:
A list of annotations isn't quite like natural-language narrative text – in either the relative frequencies of tokens, or the importance of neighboring-tokens. So you may want to try out a extra-wide range of training parameters. For example, using a giant window (far larger than each of your texts) could essentially negate the (possibly-arbitrary) ordering of the annotations, putting every word in every other word's context. (That'd increase training time, but might help in other ways.) Also, look into the newly-tunable ns_exponent parameter - the paper referenced from the gensim docs suggests values very different from the default may help in certain recommendation contexts.
That said, the most-simple way to combine many vectors into one is to average them all together. Whether that works well for your purposes, you'd have to test. (It unavoidably loses the information of a larger set of independent vectors, but if other aspects of your modeling are strong enough – enough training data, enough dimensions – the really important shared aspects may be retained in the summary vector.)
(You can see some code for averaging word-vectors in another recent answer.)
You could also try Doc2Vec. It is no more ordering-dependent than Word2Vec – some modes use a window-sized context where neighboring words influence each other. (But other modes don't, or, as mentioned above, an oversized window can essentially make neighboring-distances less-relevant).

Uncertainty versus randomness

I would like to know the difference between uncertainty and randomness in mathematical fashion. I tried to find it but I get confused , as some people said they are the same? But can any one provide me logical reasoning behind it. If they are not same then please explain it why?
Don't get too hung up on it.
People use different words in different situations.
It's not so much that they have different meanings, as that their meanings are situation-dependent.
Randomness is just a fuzzy general term meaning something is random.
In statistics, uncertainty is used to mean that some property of a distribution, such as its mean, is itself unknown but can be given a distribution.
For example, suppose you want to know the average weight of all people.
You could find it out exactly if you could go around to all people, get their weight, add it all up, and divide by the number of people.
But that's too hard to do, so suppose you just pick 10 people at random and get their average weight, and pretend it's the same as the average of everybody.
That's called the sample mean, but you know it isn't accurate.
It has what is called a standard error, meaning it has uncertainty.
In fact, if you were to do that experiment many times over with different people, you would get a different sample mean every time, and those sample means would themselves form a bell-shaped distribution, the standard deviation of which would be called the standard error, representing its uncertainty.
In general, if you increased the number of people you look at by a factor of 100, you can reduce the standard error, the uncertainty, by a factor of 10.
I bet you can tell that people who take polls for a living care about this stuff very much.
EDIT for the downvoter: In case the downvote is because this doesn't look like a stackoverflow question or answer,
I've made a point of advocating the random pausing method of profiling.
Profiling in large part is perceived to be about measuring (statistically) the time that programming constructs are responsible for.
Often people are inhibited from using that method because they are afraid the results have too much uncertainty.
This post gets very specific about what that uncertainty actually is.
It shows that the bogey-man fear of uncertainty has the effect of preventing people from finding really substantial speedups in their code.
So naivete' about statistics is definitely a serious programming problem.
My view looks at a scenario using three different coloured balls:
I love some of the answers given here. My own view, based on my current research, is that these are two distinct terms. Uncertainty refers to not knowing in advance which ball could be selected when a person, for instance, is given a chance to select one ball from three different coloured balls.
This remains true when each ball has an equal chance of being selected i.e. equal probabilities. However, things soon get complex when each ball has it's own distinct probability. Chances are that the one with the highest probability will be selected. This seems especially true in algorithm development which would almost always select the highest probability compromising the meaning of randomness.
Having said all of this - I believe these concepts remain confusing which has just made me realise the time I need to dedicate on clearly distinguishing between the two to make sure my current research is not confusing. My own predicament is that I need to work on stochastic vs deterministic views. Based on the current view stochastic would be more uncertain than random whereas deterministic would be more probability based i.e. knowing for certain that the highest probability would be chosen; but this seems very far from the truth.
It seems as if uncertainty holds until just before a ball is selected/touched and soon looses its meaning as soon as the ball is picked which should result to its probability being revised. I personally think the terms have theoretical differences which perhaps allows them to be used interchangeably.
Uncertainty in math and science typically means there are a lack of facts, or the facts are unobtainable. Weather forecasting is a great example of uncertainty.
Randomness has many definitions. Commonly it's used in probability / statistics as a measure or quantification of uncertainty. So in my weather example, a 30% chance of rain is a measure of uncertainty. The more general definition (which also applies to math / science) is unpredictable, or lack of order.
There is definitely a fuzzy distinction between the two.
According to the Bayesian interpretation of probability, uncertainty and randomness are just two names for the same thing.
If an experiment is random, then it is uncertain to you. If something is uncertain to you, then it has the randomness property.

Method for runtime comparison of two programs' objects

I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
For each program, A and B: create a list of the objects created throughout execution, which may be indexed in a very simple manner, such as a001, a002, a003, a004, ... and similarly for B (b001, ...).
Let Na = # of unique object names encountered in A, similarly for Nb and # of objects in B.
Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries will record a value for each object at each trigger (i.e. for each row, defined next).
For each assignment in A, the simplest approach is to capture the hash value of all of the Na items; of course, one can use LOCF (last observation carried forward) for those items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B.
Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value will allow one to identify the sequences of values.
Find discrepancies in the objects between A and B based on when the sequences of hash values diverge for any objects with divergent sequences.
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, I believe that numerical precision may cause hash values to diverge, though the impact is insignificant if the discrepancies are approximately at the machine tolerance level.
First: What is a name for such types of testing methods and concepts? An answer need not necessarily be the method above, but reflects the class of methods for comparing objects from two (or more) different programs.
Second: What are standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects - after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing the execution traces; it mentions "code comparison", which is related to my interest, though I'm concerned with the data (i.e. objects) than with the actual code that produces the objects. I've just skimmed it, but will review it more carefully for methodology. More importantly, this suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area of security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and monitoring of remote systems (e.g. space probes) where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and whatnot. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code & data agreement, but methods for assessing fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easily found, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of data flow programming appears to be related, and thus debugging of data flow programs may be helpful. This paper from 1981 gives several useful high level ideas. Although it's hard to translate these to immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values in the intermediate processing or in the output (not just changes in execution, if one were to examine control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.

How to detect anomalous resource consumption reliably?

This question is about a whole class of similar problems, but I'll ask it as a concrete example.
I have a server with a file system whose contents fluctuate. I need to monitor the available space on this file system to ensure that it doesn't fill up. For the sake of argument, let's suppose that if it fills up, the server goes down.
It doesn't really matter what it is -- it might, for example, be a queue of "work".
During "normal" operation, the available space varies within "normal" limits, but there may be pathologies:
Some other (possibly external)
component that adds work may run out
of control
Some component that removes work seizes up, but remains undetected
The statistical characteristics of the process are basically unknown.
What I'm looking for is an algorithm that takes, as input, timed periodic measurements of the available space (alternative suggestions for input are welcome), and produces as output, an alarm when things are "abnormal" and the file system is "likely to fill up". It is obviously important to avoid false negatives, but almost as important to avoid false positives, to avoid numbing the brain of the sysadmin who gets the alarm.
I appreciate that there are alternative solutions like throwing more storage space at the underlying problem, but I have actually experienced instances where 1000 times wasn't enough.
Algorithms which consider stored historical measurements are fine, although on-the-fly algorithms which minimise the amount of historic data are preferred.
I have accepted Frank's answer, and am now going back to the drawing-board to study his references in depth.
There are three cases, I think, of interest, not in order:
The "Harrods' Sale has just started" scenario: a peak of activity that at one-second resolution is "off the dial", but doesn't represent a real danger of resource depletion;
The "Global Warming" scenario: needing to plan for (relatively) stable growth; and
The "Google is sending me an unsolicited copy of The Index" scenario: this will deplete all my resources in relatively short order unless I do something to stop it.
It's the last one that's (I think) most interesting, and challenging, from a sysadmin's point of view..
If it is actually related to a queue of work, then queueing theory may be the best route to an answer.
For the general case you could perhaps attempt a (multiple?) linear regression on the historical data, to detect if there is a statistically significant rising trend in the resource usage that is likely to lead to problems if it continues (you may also be able to predict how long it must continue to lead to problems with this technique - just set a threshold for 'problem' and use the slope of the trend to determine how long it will take). You would have to play around with this and with the variables you collect though, to see if there is any statistically significant relationship that you can discover in the first place.
Although it covers a completely different topic (global warming), I've found tamino's blog (tamino.wordpress.com) to be a very good resource on statistical analysis of data that is full of knowns and unknowns. For example, see this post.
edit: as per my comment I think the problem is somewhat analogous to the GW problem. You have short term bursts of activity which average out to zero, and long term trends superimposed that you are interested in. Also there is probably more than one long term trend, and it changes from time to time. Tamino describes a technique which may be suitable for this, but unfortunately I cannot find the post I'm thinking of. It involves sliding regressions along the data (imagine multiple lines fitted to noisy data), and letting the data pick the inflection points. If you could do this then you could perhaps identify a significant change in the trend. Unfortunately it may only be identifiable after the fact, as you may need to accumulate a lot of data to get significance. But it might still be in time to head off resource depletion. At least it may give you a robust way to determine what kind of safety margin and resources in reserve you need in future.

Resources