Correlation/multicollinearity between multiple dependent variables

I am in doubt about the right approach for analyzing multiple dependent variables that seem to be correlated. So far I have found only regression-type answers, which assume one dependent variable and multiple independent variables. The problem here is that I have multiple variables that are all correlated and depend on each other, and I want to find the relationship between them.
Is it wise to still conduct some type of regression (LASSO, ridge, partial least squares) when I am uncertain which variable would be the independent one, or is there some other approach I have missed in my research?
I also read about canonical correlation analysis, but that is clearly between two sets of variables - which is not the case here.
Thank you in advance for all of your help!
I tried/studied LASSO, partial least squares, principal component analysis, and canonical correlation analysis, but am unsure about the right approach.
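For what it's worth, one symmetric way to inspect variables that all depend on each other, without nominating any of them as the response, is to look at the correlation matrix together with a principal component analysis of the whole set. A minimal sketch in Python; the simulated data and the choice of four variables are purely illustrative:

```python
# Hypothetical example: four mutually correlated variables driven by a shared factor.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                       # common driver
X = latent @ np.ones((1, 4)) + 0.3 * rng.normal(size=(200, 4))

print(np.corrcoef(X, rowvar=False).round(2))             # pairwise correlation structure

pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(2))            # one dominant component => strong shared structure
print(pca.components_[0].round(2))                       # loadings: how each variable enters that component
```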

Related

Will non-linear regression algorithms perform better if trained with normally distributed target values?

After finding out about the many transformations that can be applied to the target values (the y column) of a data set, such as Box-Cox transformations, I learned that linear regression models need to be trained with normally distributed target values in order to be efficient (https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va).
I'd like to know if the same applies to non-linear regression algorithms. So far I've seen people on Kaggle use a log transformation to mitigate heteroskedasticity when using xgboost, but they never mention whether it is also done to obtain normally distributed target values.
I've tried to do some research and found in Andrew Ng's lecture notes (http://cs229.stanford.edu/notes/cs229-notes1.pdf), on page 11, that the least squares cost function, used by many algorithms both linear and non-linear, is derived by assuming a normal distribution of the errors. I believe that if the errors should be normally distributed, then the target values should be as well.
If this is true, then all regression algorithms using a least squares cost function should work better with normally distributed target values.
Since xgboost uses a least squares cost function for node splitting (http://cilvr.cs.nyu.edu/diglib/lsml/lecture03-trees-boosting.pdf, slide 13), maybe this algorithm would work better if I transformed the target values using a Box-Cox transformation before training the model and then applied the inverse Box-Cox transformation to the output in order to get the predicted values.
Will this theoretically speaking give better results?
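To make the proposed pipeline concrete, here is a minimal sketch of the transform-train-invert procedure the question describes, assuming scipy and xgboost are available; the synthetic data are only for illustration, and whether this helps at all is exactly what the answers below question:

```python
# Sketch: Box-Cox the target, train on the transformed target, invert on predictions.
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.exp(X[:, 0] + 0.1 * rng.normal(size=500))   # positive, right-skewed target

y_bc, lmbda = boxcox(y)                            # Box-Cox needs y > 0; lambda is estimated

model = XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X, y_bc)                                 # train on the transformed scale

y_pred = inv_boxcox(model.predict(X), lmbda)       # predictions back on the original scale
```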
Your conjecture, "I believe that if the errors should be normally distributed, then the target values should be as well," is wrong, so your question does not really have an answer: it rests on a false premise.
There is no assumption at all that the target variable itself is normal.
Transforming the target variable does not make the errors normally distributed; in fact, it may ruin normality.
I have no idea what this is supposed to mean: "linear regression models need to be trained with normally distributed target values in order to be efficient." Efficient in what way?
Linear regression models are global models. They simply fit a surface to the overall data. The operations are matrix operations, so the time to "train" the model depends only on the size of the data. The distribution of the target has nothing to do with model-building performance, and it has nothing to do with model-scoring performance either.
Because targets are generally not normally distributed, I would certainly hope that such a distribution is not required for a machine learning algorithm to work effectively.
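To make the "matrix operations" point concrete, a plain least-squares fit is a single linear-algebra solve whose cost depends on the number of rows and columns, not on how the target values are distributed. A minimal sketch (the simulated data are arbitrary):

```python
# Ordinary least squares via the normal equations: beta = (X'X)^{-1} X'y.
# The cost is roughly O(n * p^2), whatever the distribution of y.
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    return np.linalg.solve(X.T @ X, X.T @ y)   # solve, rather than forming an explicit inverse

rng = np.random.default_rng(1)
n, p = 1000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.exponential(size=n)                    # deliberately non-normal target
beta = ols_fit(X, y)                           # fits exactly as fast as with a normal target
```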

How to judge performance of algorithms for Text Clustering?

I am using the K-Means algorithm for text clustering, with initial seeding by K-Means++.
I try to make the algorithm more efficient with some changes, such as changing the stop-word dictionary and increasing max_no_of_random_iterations.
I get different results. How do I compare them? I could not apply the idea of a confusion matrix here: the output is not a value or tag assigned to each document; a document simply goes into a set, and all that matters is the relative quality of the clustering, i.e. which sets are produced.
So is there some standard way of scoring the quality of such an output?
If a confusion matrix is the answer, please explain how to apply it.
Thanks.
You could decide in advance how to measure the quality of the clusters, for example by counting how many are empty, or with statistics like the within-cluster sum of squares.
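As one concrete internal measure, scikit-learn's KMeans exposes the within-cluster sum of squares as inertia_, and a silhouette score can be computed on the same vectorised corpus; these numbers are only comparable between runs when the vectorisation is held fixed. A small sketch with a placeholder corpus:

```python
# Compare two K-Means runs on the same TF-IDF matrix using within-cluster
# sum of squares (inertia_) and silhouette score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = ["some example document", "another example text",
        "clustering of documents", "text clustering example"]   # placeholder corpus

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

for max_iter in (100, 300):                                     # e.g. vary the iteration cap
    km = KMeans(n_clusters=2, init="k-means++", max_iter=max_iter,
                n_init=10, random_state=0).fit(X)
    print(max_iter, round(km.inertia_, 3), round(silhouette_score(X, km.labels_), 3))
```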
This paper says:
"... three distinctive approaches to cluster validity are possible. The first approach relies on external criteria that investigate the existence of some predefined structure in clustered data set. The second approach makes use of internal criteria and the clustering results are evaluated by quantities describing the data set such as proximity matrix etc. Approaches based on internal and external criteria make use of statistical tests and their disadvantage is high computational cost. The third approach makes use of relative criteria and relies on finding the best clustering scheme that meets certain assumptions and requires predefined input parameters values"
Since clustering is unsupervised, you are asking for something difficult. I suggest researching how people cluster using genetic algorithms and seeing what fitness criteria they use.

Automatic probability densities

I have found automatic differentiation to be extremely useful when writing mathematical software. I now have to work with random variables and functions of the random variables, and it seems to me that an approach similar to automatic differentiation could be used for this, too.
The idea is to start with a basic random vector with a given multivariate distribution and then work with the implied probability distributions of functions of its components. One would define operators that automatically combine two probability distributions appropriately when you add, multiply, or divide two random variables, and that transform a distribution appropriately when you apply a scalar function such as exponentiation. You could then combine these to build any function you need of the original random variables and automatically have the corresponding probability distribution available.
Does this sound feasible? If not, why not? If so, and since it's not a particularly original thought, could someone point me to an existing implementation, preferably in C?
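For a sense of why this is feasible in principle, here is a rough Monte-Carlo sketch of the idea (in Python rather than the C asked about): each random variable carries samples, and the arithmetic operators are overloaded much as forward-mode automatic differentiation overloads them for derivatives. It propagates samples rather than closed-form densities, which is also where exact implementations get hard; the class and the example distribution are invented for illustration:

```python
# Toy "automatic distribution" type: operators combine samples element-wise,
# so dependencies between quantities built from the same source are preserved.
import numpy as np

class RV:
    def __init__(self, samples):
        self.samples = np.asarray(samples, dtype=float)

    def _s(self, other):
        return other.samples if isinstance(other, RV) else other

    def __add__(self, other):
        return RV(self.samples + self._s(other))

    def __mul__(self, other):
        return RV(self.samples * self._s(other))

    def __truediv__(self, other):
        return RV(self.samples / self._s(other))

    def apply(self, fn):
        return RV(fn(self.samples))

rng = np.random.default_rng(0)
base = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.5], [0.5, 2.0]], size=100_000)
x, y = RV(base[:, 0]), RV(base[:, 1])

z = (x * y + x.apply(np.exp)) / 2.0        # implied distribution of a derived quantity
print(z.samples.mean(), z.samples.std())
```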
There has been a lot of work on probabilistic programming. One issue is that as your distribution gets more complicated you start needing more complex techniques to sample from it.
There are a number of ways this is done. Probabilistic graphical models gives one vocabulary for expressing these models, and you can then sample from them using various Metropolis-Hastings-style methods. Here is a crash course.
Another model is probabilistic programming, which can be done directly through an embedded domain-specific language. Oleg Kiselyov's HANSEI is an example of this approach. Once the program is written, the system can inspect the tree of decisions and expand it by a form of importance sampling to gain the most information possible at each step.
You may also want to read "Nonstandard Interpretations of Probabilistic Programs for Efficient Inference" by Wingate et al., which describes one way to use extra information about the derivative of your distribution to accelerate Metropolis-Hastings-style sampling techniques. I personally use automatic differentiation to calculate those derivatives, which brings the topic back to automatic differentiation. ;)
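To make the "Metropolis-Hastings-style" sampling mentioned above concrete, here is a minimal (derivative-free) random-walk Metropolis sketch for an unnormalised log-density; the target density is a made-up example:

```python
# Minimal random-walk Metropolis sampler for an unnormalised log-density.
import numpy as np

def metropolis(log_p, x0, n_steps=10_000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    samples = np.empty((n_steps, x.size))
    for i in range(n_steps):
        prop = x + step * rng.normal(size=x.size)   # symmetric proposal
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:    # accept with prob min(1, p(prop)/p(x))
            x, lp = prop, lp_prop
        samples[i] = x
    return samples

# Example target: an unnormalised, correlated 2-D Gaussian.
log_p = lambda v: -0.5 * (v[0] ** 2 + (v[1] - v[0]) ** 2)
draws = metropolis(log_p, x0=[0.0, 0.0])
print(draws.mean(axis=0))
```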

algorithm to combine data for linear fit?

I'm not sure if this is the best place to ask this, but you guys have been helpful with plenty of my CS homework in the past so I figure I'll give it a shot.
I'm looking for an algorithm to blindly combine several dependent variables into an index that produces the best linear fit with an external variable. Basically, it would combine the dependent variables using different mathematical operators, including or excluding each one, and so on, until an index is developed that best correlates with my external variable.
Has anyone seen/heard of something like this before? Even if you could point me in the right direction or to the right place to ask, I would appreciate it. Thanks.
Sounds like you're trying to do multivariate linear regression or multiple regression. The simplest (read: least accurate) method is to individually compute the linear regression line for each of the component variables and then take a weighted average of the lines. Beyond that I am afraid I will be of little help.
This appears to be linear regression with multiple explanatory variables. Since the implication is that you are taking a computational approach, you could do something as simple as fitting a linear model to your data for every possible combination of the explanatory variables you have (whether you want to include interaction effects is your choice), choosing a goodness-of-fit measure (R^2 being just one example), and using it to rank the models; a brute-force sketch of this appears after the references below. The quality of a model is also somewhat subjective in many fields: you could reject a model containing 15 variables if it only moderately improves the fit over a far simpler model containing just 3 variables. If you have not read it already, I don't doubt that you will find many useful suggestions in the following text:
Draper, N.R. and Smith, H. (1998). Applied Regression Analysis. Wiley Series in Probability and Statistics.
You might also try searching for the LASSO method of model selection.
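As promised above, here is a rough brute-force sketch of the all-subsets idea, assuming a numeric predictor matrix and using R^2 to rank the fits; with many variables this explodes combinatorially, which is one reason penalised methods like the LASSO exist:

```python
# Fit an OLS model for every non-empty subset of candidate predictors
# and rank the subsets by R^2 against the external variable.
from itertools import combinations
import numpy as np

def r_squared(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # candidate component variables
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=200)    # the external variable

results = []
for k in range(1, X.shape[1] + 1):
    for cols in combinations(range(X.shape[1]), k):
        results.append((r_squared(X[:, list(cols)], y), cols))

for r2, cols in sorted(results, reverse=True)[:5]:  # five best-fitting subsets
    print(cols, round(r2, 3))
```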
The thing you're asking for is essentially the entirety of regression analysis.
This is what linear regression does, and it is a good portion of what "machine learning" does (machine learning is basically a name for more complicated regression and classification algorithms). There are hundreds or thousands of different approaches with various tradeoffs, but the basic ones frequently work quite well.
If you want to learn more, the Coursera course on machine learning is a great place to get a deeper understanding of this.

Method for runtime comparison of two programs' objects

I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
1. For each program, A and B, create a list of the objects created throughout execution, indexed in a very simple manner, such as a001, a002, a003, a004, ... and similarly for B (b001, ...).
2. Let Na = the number of unique object names encountered in A, and similarly Nb for the number of objects in B.
3. Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries record a value for each object at each trigger (i.e. for each row, defined next).
4. For each assignment in A, the simplest approach is to capture the hash value of all Na items; of course, one can use LOCF (last observation carried forward) for items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B.
5. Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value together allow one to identify the sequences of values.
6. For any objects with divergent sequences, find the discrepancies between A and B based on where the sequences of hash values diverge.
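A rough sketch of steps 4-6 in Python, assuming for simplicity that corresponding objects have already been matched by name (the order-plus-hash matching of step 5 is the harder part and is not shown); the function names are illustrative:

```python
# Record a per-object trace of value hashes during execution, then report
# where two traces first diverge.
import hashlib

def value_hash(value):
    """Hash an object's current value via its repr (adequate for simple, atomic data)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()

def record(trace, snapshot):
    """Append one row (object name -> hash) to a program's trace; call after each assignment."""
    trace.append({name: value_hash(v) for name, v in snapshot.items()})

def first_divergence(trace_a, trace_b):
    """Return the first (row index, object name) at which the matched traces disagree."""
    for i, (row_a, row_b) in enumerate(zip(trace_a, trace_b)):
        for name in sorted(set(row_a) & set(row_b)):
            if row_a[name] != row_b[name]:
                return i, name
    return None

# Usage: inside each program, call record(trace, {"x": x, "y": y, ...}) after
# every assignment of interest, then run first_divergence(trace_a, trace_b).
```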
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, I believe that numerical precision may cause hash values to diverge even when the impact is insignificant, i.e. when the discrepancies are approximately at the machine tolerance level.
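One pragmatic workaround for that precision problem is to quantise numeric values before hashing, so that discrepancies at roughly machine-tolerance level hash identically; the number of digits kept is an arbitrary tuning choice here:

```python
# Hash floats after rounding so machine-precision noise does not cause
# spurious divergences; `digits` trades sensitivity against tolerance.
import hashlib

def tolerant_hash(value, digits=9):
    if isinstance(value, float):
        value = round(value, digits)
    elif isinstance(value, (list, tuple)):
        value = [round(v, digits) if isinstance(v, float) else v for v in value]
    return hashlib.sha256(repr(value).encode()).hexdigest()

assert tolerant_hash(0.1 + 0.2) == tolerant_hash(0.3)   # equal once rounded
```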
First: What is a name for such types of testing methods and concepts? An answer need not necessarily be the method above, but reflects the class of methods for comparing objects from two (or more) different programs.
Second: What standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects - after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing execution traces; it mentions "code comparison", which is related to my interest, though I'm more concerned with the data (i.e. the objects) than with the actual code that produces them. I've just skimmed it, but will review it more carefully for methodology. More importantly, it suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area of security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and monitoring of remote systems (e.g. space probes) where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and whatnot. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code & data agreement, but methods for assessing fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easily found, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of data flow programming appears to be related, and thus debugging of data flow programs may be helpful. This paper from 1981 gives several useful high level ideas. Although it's hard to translate these to immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values in the intermediate processing or in the output (not just changes in execution, if one were to examine control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.
