How does corr() work on a deeper level? Because some correlations seem wrong in my examples - correlation

I know all the basics of correlation and also that it ranges from -1 to 1 etc.
The Problem I have is, that the correlation does not make sense in some cases. That means, that there is a positive correlation, even though one answer agrees to something (in my case high on the likert scale from 1 to 5) and the other one low (1s and 2s on the likert scale).
How does this add up?
I will provide an example.
The two questions that are correlated in this picture have a correlation of 0.34, but the answer patterns are quite diverse, how does that work? The picture can be seen below
Correllation Table

Related

Method to choose overall winner across multiple categories [migrated]

I have four numeric variables. All of them are measures of soil quality. Higher the variable, higher the quality. The range for all of them is different:
Var1 from 1 to 10
Var2 from 1000 to 2000
Var3 from 150 to 300
Var4 from 0 to 5
I need to combine four variables into single soil quality score which will successfully rank order.
My idea is very simple. Standardize all four variables, sum them up and whatever you get is the score which should rank-order. Do you see any problem with applying this approach. Is there any other (better) approach that you would recommend?
Thanks
Edit:
Thanks guys. A lot of discussion went into "domain expertise"... Agriculture stuff... Whereas I expected more stats-talk. In terms of technique that I will be using... It will probably be simple z-score summation + logistic regression as an experiment. Because vast majority of samples has poor quality 90% I'm going to combine 3 quality categories into one and basically have binary problem (somequality vs no-quality). I kill two birds with one stone. I increase my sample in terms of event rate and I make a use of experts by getting them to clasify my samples. Expert classified samples will then be used to fit log-reg model to maximize level of concordance / discordance with the experts.... How does that sound to you?
The proposed approach may give a reasonable result, but only by accident. At this distance--that is, taking the question at face value, with the meanings of the variables disguised--some problems are apparent:
It is not even evident that each variable is positively related to "quality." For example, what if a 10 for 'Var1' means the "quality" is worse than the quality when Var1 is 1? Then adding it to the sum is about as wrong a thing as one can do; it needs to be subtracted.
Standardization implies that "quality" depends on the data set itself. Thus the definition will change with different data sets or with additions and deletions to these data. This can make the "quality" into an arbitrary, transient, non-objective construct and preclude comparisons between datasets.
There is no definition of "quality". What is it supposed to mean? Ability to block migration of contaminated water? Ability to support organic processes? Ability to promote certain chemical reactions? Soils good for one of these purposes may be especially poor for others.
The problem as stated has no purpose: why does "quality" need to be ranked? What will the ranking be used for--input to more analysis, selecting the "best" soil, deciding a scientific hypothesis, developing a theory, promoting a product?
The consequences of the ranking are not apparent. If the ranking is incorrect or inferior, what will happen? Will the world be hungrier, the environment more contaminated, scientists more misled, gardeners more disappointed?
Why should a linear combination of variables be appropriate? Why shouldn't they be multiplied or exponentiated or combined as a posynomial or something even more esoteric?
Raw soil quality measures are commonly re-expressed. For example, log permeability is usually more useful than the permeability itself and log hydrogen ion activity (pH) is much more useful than the activity. What are the appropriate re-expressions of the variables for determining "quality"?
One would hope that soils science would answer most of these questions and indicate what the appropriate combination of the variables might be for any objective sense of "quality." If not, then you face a multi-attribute valuation problem. The Wikipedia article lists dozens of methods for addressing this. IMHO, most of them are inappropriate for addressing a scientific question. One of the few with a solid theory and potential applicability to empirical matters is Keeney & Raiffa's multiple attribute valuation theory (MAVT). It requires you to be able to determine, for any two specific combinations of the variables, which of the two should rank higher. A structured sequence of such comparisons reveals (a) appropriate ways to re-express the values; (b) whether or not a linear combination of the re-expressed values will produce the correct ranking; and (c) if a linear combination is possible, it will let you compute the coefficients. In short, MAVT provides algorithms for solving your problem provided you already know how to compare specific cases.
Anyone looked at Russell G. Congalton 'Review of Assessing the Accuracy of Classifications of Remotely Sensed Data' 1990 ?. It describes a technique known as error matrix for varing matrices, also a term he uses called ' Normalizing data' , whereby one gets all the different vectors and 'normalizes' or sets them to equal from 0 to 1. You basically change all vectors to equal ranges from 0 to 1.
One other thing you did not discuss is the scale of the measurements. V1 and V5 looks like they are of rank order and the other seem not. So standardization may be skewing the score. So you may be better transforming all of the variables into ranks, and determining a weighting for each variable, since it is highly unlikely that they have the same weight. Equal weighting is more of a "no nothing" default. You might want to do some correlation or regression analysis to come up with some a priori weights.
I had a similar problem recently and though I add my approach to the nice answers. I think in order to find a simple way to determine which variable leads to the best ranking. One could transform your problem to a gridsearch approach:
Basically use a combined score for the ranking which is composed as such:
Finel_score = Var1 * A + Var2 * B + Var3 * C ....
Then you can compute the final score with different values for A,B,C (sklearn gridsearch could be used) ... and compare the resulting ranking to an expected ranking (some ground truth is needed to determine the goodness of you ranking). The best parameters result in the weights of your individual variables.
Following up on Ralph Winters' answer, you might use PCA (principal component analysis) on the matrix of suitably standardized scores. This will give you a "natural" weight vector that you can use to combine future scores.
Do this also after all scores have been transformed into ranks. If the results are very similar, you have good reasons to continue with either method. If there are discrepancies, this will lead to interesting questions and a better understanding.

Importance of covariance

If I have calculated coefficient of correlation, I already have the idea of covariance. But I have seen many data scientist calculates the covariance after it. If I have coefficient of correlation with me, I can say that data is positively or negatively correlated with the strength, while covariance give the same thing without the strength. Then what is the importance of covariance if I have coefficient of correlation.
Please suggest, apologies if my question is of low importance.
The correlation and covariance are stricty related, indeed rho=cov(x,y)/(sigma_x*sigma_y)
However, the units of a covariance are hard to interpret. For example, if we wanted to know the covariance between wages paid to employees and the number of employees at a firm, it can be shown that by converting the wages from dollars to cents, we would increase the covariance by a factor of 100. This is odd given that the underlying relationship shouldn't be different if we are talking about dollars or cents. Another way to express this is:
Cov(a*X,Y)=a*Cov(X,Y)
The correlation is always bounded between -1 and 1 and easier to be interpreted
I tend to see correlation versus covariance as an opposition between a quick dry mathematical relation overview and a more raw relation analysis. Imagine yourself joining a project in a field you know approximately nothing about:
if a team member gives you a correlation coefficient for two key variables/indicators associated with the project, you will be able to extract all the information out of this coefficient immediately, without knowing the samples respective scales
if he gives you a covariance, you will probably want to have a look at the data to appreciate exactly what it implies
Covariance is easily understood when the samples being compared live on a similar scale/have a similar nature, since the value you'll be considering will not try to compare two completely different things with an intuitively absurd compromise in nature/scale (remember that to compute the covariance you use the products of two things possibly very different with (x-mean(x))(y-mean(y))). Correlation being standardized, issues associated with varying scales and nature in the data are simply absent of your indicator, leading to an "easier interpretation" feeling.
One should therefore realize that while correlation can make it easier to understand a mathematical relationship, it obfuscates the actual nature of the data you're playing with. Looking at both can't hurt you to appreciate what's going on with your samples, and that's probably why you'll like considering both. If you aren't convinced you can also read this related stats.stackexchange question.
In case you wonder why would you want to keep close to the nature and scales of your data while trying to highlight relations between samples, a good example would be the efforts deployed in AI to extract useful features in images to feed models: you want to emphasize discriminatory descriptions of data, without filtering out other potentially interesting information with a standardization. See for example this paper that uses covariance matrices to build a dictionary on images.

Uncertainty versus randomness

I would like to know the difference between uncertainty and randomness in mathematical fashion. I tried to find it but I get confused , as some people said they are the same? But can any one provide me logical reasoning behind it. If they are not same then please explain it why?
Don't get too hung up on it.
People use different words in different situations.
It's not so much that they have different meanings, as that their meanings are situation-dependent.
Randomness is just a fuzzy general term meaning something is random.
In statistics, uncertainty is used to mean that some property of a distribution, such as its mean, is itself unknown but can be given a distribution.
For example, suppose you want to know the average weight of all people.
You could find it out exactly if you could go around to all people, get their weight, add it all up, and divide by the number of people.
But that's too hard to do, so suppose you just pick 10 people at random and get their average weight, and pretend it's the same as the average of everybody.
That's called the sample mean, but you know it isn't accurate.
It has what is called a standard error, meaning it has uncertainty.
In fact, if you were to do that experiment many times over with different people, you would get a different sample mean every time, and those sample means would themselves form a bell-shaped distribution, the standard deviation of which would be called the standard error, representing its uncertainty.
In general, if you increased the number of people you look at by a factor of 100, you can reduce the standard error, the uncertainty, by a factor of 10.
I bet you can tell that people who take polls for a living care about this stuff very much.
EDIT for the downvoter: In case the downvote is because this doesn't look like a stackoverflow question or answer,
I've made a point of advocating the random pausing method of profiling.
Profiling in large part is perceived to be about measuring (statistically) the time that programming constructs are responsible for.
Often people are inhibited from using that method because they are afraid the results have too much uncertainty.
This post gets very specific about what that uncertainty actually is.
It shows that the bogey-man fear of uncertainty has the effect of preventing people from finding really substantial speedups in their code.
So naivete' about statistics is definitely a serious programming problem.
My view looks at a scenario using three different coloured balls:
I love some of the answers given here. My own view, based on my current research, is that these are two distinct terms. Uncertainty refers to not knowing in advance which ball could be selected when a person, for instance, is given a chance to select one ball from three different coloured balls.
This remains true when each ball has an equal chance of being selected i.e. equal probabilities. However, things soon get complex when each ball has it's own distinct probability. Chances are that the one with the highest probability will be selected. This seems especially true in algorithm development which would almost always select the highest probability compromising the meaning of randomness.
Having said all of this - I believe these concepts remain confusing which has just made me realise the time I need to dedicate on clearly distinguishing between the two to make sure my current research is not confusing. My own predicament is that I need to work on stochastic vs deterministic views. Based on the current view stochastic would be more uncertain than random whereas deterministic would be more probability based i.e. knowing for certain that the highest probability would be chosen; but this seems very far from the truth.
It seems as if uncertainty holds until just before a ball is selected/touched and soon looses its meaning as soon as the ball is picked which should result to its probability being revised. I personally think the terms have theoretical differences which perhaps allows them to be used interchangeably.
Uncertainty in math and science typically means there are a lack of facts, or the facts are unobtainable. Weather forecasting is a great example of uncertainty.
Randomness has many definitions. Commonly it's used in probability / statistics as a measure or quantification of uncertainty. So in my weather example, a 30% chance of rain is a measure of uncertainty. The more general definition (which also applies to math / science) is unpredictable, or lack of order.
There is definitely a fuzzy distinction between the two.
According to the Bayesian interpretation of probability, uncertainty and randomness are just two names for the same thing.
If an experiment is random, then it is uncertain to you. If something is uncertain to you, then it has the randomness property.

Multiple parameter optimization with a stochastic element

I am looking for a method to find the best parameters for a simulation. It's about break-shots in billiards / pool. A shot is defined by 7 parameters, I can simulate the shot and then rate the outcome and I would like to compute the best parameters.
I have found the following link here:
Multiple parameter optimization with lots of local minima
suggesting 4 kinds of algorithms. In the pool simulator I am using, the shots are altered by a little random value each time it is simulated. If I simulate the same shot twice, the outcome will be different. So I am looking for an algorithm like the ones in the link above, only with the addition of a stochastical element, optimizing for the 7 parameters that will on average yield the best parameters, i.e. a break shot that most likely will be a success. My initial idea was simulating the shot 100 or 1000 times and just take the average as rating for the algorithms above, but I still feel like there is a better way. Does anyone have an idea?
The 7 parameters are continuous but within different ranges (one from 0 to 10, another from 0.0 to 0.028575 and so on).
Thank you
At least for some of the algorithms, simulating the same shot repeatedly might not be neccessary. As long as your alternatives have some form of momentum, like in the swarm simulation approach, you can let that be affected by the outcome of each individual simulation. In that case, a single unlucky simulation would slow the movement in parameter space only slightly, whereas a serious loss of quality should be enough to stop and reverse the movement. Thos algorithms which don't use momentum might be tweaked to have momentum. If not, then repeated simulation seems the best approach. Unless you can get your hands on the internals of the simulator, and rate the shot as a whole without having to simulate it over and over again.
You can use the algorithms you mentioned in your non-deterministic scenario with independent stochastic runs. Your idea with repeated simulations is good, you can read more about how many repeats you might have to consider for your simulations (unfortunately, there is no trivial answer). If you are not so much into maths, and the runs go fast, do 1.000 repeats, then 10.000 repeats, and see if the results differ largely. If yes, you have to collect more samples, if not, you are probably on the safe side (the central limit theorem states that the results converge).
Further, do not just consider the average! Make sure to look into the standard deviation for each algorithm's results; you might want to use box plots to compare their quartiles. If you rely on the average only, you could pick an algorithm that produces very varying results, sometimes excellent, sometimes terrible in performance.
I don't know what language you are using, but if you use Java, I am maintaining a tool that could simplify your "monte carlo" style experiments.

How to rank a million images with a crowdsourced sort

I'd like to rank a collection of landscape images by making a game whereby site visitors can rate them, in order to find out which images people find the most appealing.
What would be a good method of doing that?
Hot-or-Not style? I.e. show a single image, ask the user to rank it from 1-10. As I see it, this allows me to average the scores, and I would just need to ensure that I get an even distribution of votes across all the images. Fairly simple to implement.
Pick A-or-B? I.e. show two images, ask user to pick the better one. This is appealing as there is no numerical ranking, it's just a comparison. But how would I implement it? My first thought was to do it as a quicksort, with the comparison operations being provided by humans, and once completed, simply repeat the sort ad-infinitum.
How would you do it?
If you need numbers, I'm talking about one million images, on a site with 20,000 daily visits. I'd imagine a small proportion might play the game, for the sake of argument, lets say I can generate 2,000 human sort operations a day! It's a non-profit website, and the terminally curious will find it through my profile :)
As others have said, ranking 1-10 does not work that well because people have different levels.
The problem with the Pick A-or-B method is that its not guaranteed for the system to be transitive (A can beat B, but B beats C, and C beats A). Having nontransitive comparison operators breaks sorting algorithms. With quicksort, against this example, the letters not chosen as the pivot will be incorrectly ranked against each other.
At any given time, you want an absolute ranking of all the pictures (even if some/all of them are tied). You also want your ranking not to change unless someone votes.
I would use the Pick A-or-B (or tie) method, but determine ranking similar to the Elo ratings system which is used for rankings in 2 player games (originally chess):
The Elo player-rating
system compares players’ match records
against their opponents’ match records
and determines the probability of the
player winning the matchup. This
probability factor determines how many
points a players’ rating goes up or
down based on the results of each
match. When a player defeats an
opponent with a higher rating, the
player’s rating goes up more than if
he or she defeated a player with a
lower rating (since players should
defeat opponents who have lower
ratings).
The Elo System:
All new players start out with a base rating of 1600
WinProbability = 1/(10^(( Opponent’s Current Rating–Player’s Current Rating)/400) + 1)
ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a draw.
Player’s New Rating = Player’s Old Rating + (K-Value * (ScoringPt–Player’s Win Probability))
Replace "players" with pictures and you have a simple way of adjusting both pictures' rating based on a formula. You can then perform a ranking using those numeric scores. (K-Value here is the "Level" of the tournament. It's 8-16 for small local tournaments and 24-32 for larger invitationals/regionals. You can just use a constant like 20).
With this method, you only need to keep one number for each picture which is a lot less memory intensive than keeping the individual ranks of each picture to each other picture.
EDIT: Added a little more meat based on comments.
Most naive approaches to the problem have some serious issues. The worst is how bash.org and qdb.us displays quotes - users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the total net score. This suffers from a horrible time bias - older quotes have accumulated huge numbers of positive votes via simple longevity even if they're only marginally humorous. This algorithm might make sense if jokes got funnier as they got older but - trust me - they don't.
There are various attempts to fix this - looking at the number of positive votes per time period, weighting more recent votes, implementing a decay system for older votes, calculating the ratio of positive to negative votes, etc. Most suffer from other flaws.
The best solution - I think - is the one that the websites The Funniest The Cutest, The Fairest, and Best Thing use - a modified Condorcet voting system:
The system gives each one a number based on, out of the things that it has faced, what percentage of them it usually beats. So each one gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Also, things are barred from the top list until they've been compared to a reasonable percentage of the set.
If there's a Condorcet winner in the set, this method will find it. Since that's unlikely, given the statistical nature, it finds the one that's the "closest" to being a Condorcet winner.
For more information on implementing such systems the Wikipedia page on Ranked Pairs should be helpful.
The algorithm requires people to compare two objects (your Pick-A-or-B option), but frankly, that's a good thing. I believe it's very well accepted in decision theory that humans are vastly better at comparing two objects than they are at abstract ranking. Millions of years of evolution make us good at picking the best apple off the tree, but terrible at deciding how closely the apple we picked hews to the true Platonic Form of appleness. (This is, by the way, why the Analytic Hierarchy Process is so nifty...but that's getting a bit off topic.)
One final point to make is that SO uses an algorithm to find the best answers which is very similar to bash.org's algorithm to find the best quote. It works well here, but fails terribly there - in large part because an old, highly rated, but now outdated answer here is likely to be edited. bash.org doesn't allow editing, and it's not clear how you'd even go about editing decade-old jokes about now-dated internet memes even if you could... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)
I know this question is quite old but I thought I'd contribute
I'd look at the TrueSkill system developed at Microsoft Research. It's like ELO but has a much faster convergence time (looks exponential compared to linear), so you get more out of each vote. It is, however, more complex mathematically.
http://en.wikipedia.org/wiki/TrueSkill
I don't like the Hot-or-Not style. Different people would pick different numbers even if they all liked the image exactly the same. Also I hate rating things out of 10, I never know which number to choose.
Pick A-or-B is much simpler and funner. You get to see two images, and comparisons are made between the images on the site.
These equations from Wikipedia makes it simpler/more effective to calculate Elo ratings, the algorithm for images A and B would be simple:
Get Ne, mA, mB and ratings RA,RB from your database.
Calculate KA ,KB, QA, QB by using the number of comparisons performed (Ne) and the number of times that image was compared (m) and current ratings :
Calculate EA and EB.
Score the winner's S : the winner as 1, loser as 0, and if you have a draw as 0.5,
Calculate the new ratings for both using:
Update the new ratings RA,RB and counts mA,mB in the database.
You may want to go with a combination.
First phase:
Hot-or-not style (although I would go with a 3 option vote: Sucks, Meh/OK. Cool!)
Once you've sorted the set into the 3 buckets, then I would select two images from the same bucket and go with the "Which is nicer"
You could then use an English Soccer system of promotion and demotion to move the top few "Sucks" into the Meh/OK region, in order to refine the edge cases.
Ranking 1-10 won't work, everyone has different levels. Someone who always gives 3-7 ratings would have his rankings eclipsed by people who always give 1 or 10.
a-or-b is more workable.
Wow, I'm late in the game.
I like the ELO system very much so, but like Owen says it seems to me that you'd be slow building up any significant results.
I believe humans have much greater capacity than just comparing two images, but you want to keep interactions to the bare minimum.
So how about you show n images (n being any number you can visibly display on a screen, this may be 10, 20, 30 depending on user's preference maybe) and get them to pick which they think is best in that lot. Now back to ELO. You need to modify you ratings system, but keep the same spirit. You have in fact compared one image to n-1 others. So you do your ELO rating n-1 times, but you should divide the change of rating by n-1 to match (so that results with different values of n are coherent with one another).
You're done. You've now got the best of all worlds. A simple rating system working with many images in one click.
If you prefer using the Pick A or B strategy I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf
Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013,
February). Pairwise ranking aggregation in a crowdsourced setting. In
Proceedings of the sixth ACM international conference on Web search
and data mining (pp. 193-202). ACM.
The paper tells about the Crowd-BT model which extends the famous Bradley-Terry pairwise comparison model into crowdsource setting. It also gives an adaptive learning algorithm to enhance the time and space efficiency of the model. You can find a Matlab implementation of the algorithm on Github (but I'm not sure if it works).
The defunct web site whatsbetter.com used an Elo style method. You can read about the method in their FAQ on the Internet Archive.
Pick A-or-B its the simplest and less prone to bias, however at each human interaction it gives you substantially less information. I think because of the bias reduction, Pick is superior and in the limit it provides you with the same information.
A very simple scoring scheme is to have a count for each picture. When someone gives a positive comparison increment the count, when someone gives a negative comparison, decrement the count.
Sorting a 1-million integer list is very quick and will take less than a second on a modern computer.
That said, the problem is rather ill-posed - It will take you 50 days to show each image only once.
I bet though you are more interested in the most highly ranked images? So, you probably want to bias your image retrieval by predicted rank - so you are more likely to show images that have already achieved a few positive comparisons. This way you will more quickly just start showing 'interesting' images.
I like the quick-sort option but I'd make a few tweeks:
Keep the "comparison" results in a DB and then average them.
Get more than one comparison per view by giving the user 4-6 images and having them sort them.
Select what images to display by running qsort and recording and trimming anything that you don't have enough data on. Then when you have enough items recorded, spit out a page.
The other fun option would be to use the crowd to teach a neural-net.

Resources