How to find correlation between two values - correlation

I have a table with two columns, emailid and keyword, and I am pivoting it (into a kind of matrix) in SQL so that the columns are the distinct keywords and the rows are the distinct users. The value at [emailid][keyword] is 1 if the user searched for the keyword and null if not. I am trying to find the correlation between keywords, i.e. if two users have searched for the same keyword then there is a correlation between those two keywords. How can I achieve this?

You should replace the null values with 0 to begin. You may then want to explore correlation techniques such as Pearson and Spearman.
This is a page on Pearson Correlation: http://learntech.uwe.ac.uk/da/Default.aspx?pageid=1442
from scipy.stats import pearsonr

a = [1.0001345, 0.000656]
b = [1.00001345, 0.000656]
print(pearsonr(a, b)[0])
This prints 1.0, which means a perfect positive correlation. The Pearson coefficient ranges from -1.0 (strongest negative correlation) to 1.0 (strongest positive correlation); a value of 0 means no linear correlation between the two quantities.
More information can be found at:
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html
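To apply this to your actual problem, you can also build the user-by-keyword matrix and correlate the keyword columns directly in pandas. This is a minimal sketch, assuming the two-column table from the question is available as a DataFrame; the example data is made up.

import pandas as pd

# two-column table from the question: one row per (user, keyword) search
searches = pd.DataFrame({
    "emailid": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@x.com"],
    "keyword": ["spark",   "python",  "spark",   "python",  "pig"],
})

# pivot: users as rows, keywords as columns, 1 if the user searched it, else 0
matrix = pd.crosstab(searches["emailid"], searches["keyword"]).clip(upper=1)

# pairwise Pearson correlation between every pair of keyword columns
keyword_corr = matrix.corr(method="pearson")
print(keyword_corr)

Each off-diagonal entry of keyword_corr is the correlation between two keywords across users; values near 1 mean the same users tend to search for both.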

Related

Return column headers (columns B onwards) based on a text value in Column A and number value in other columns - in a Google spreadsheet

I have a matrix: 1,172 words down column A, and the same 1,172 words across row 1. Each word is cross-referenced with all the other words to give a similarity score (this is already done).
In another sheet, I want to look up a word and return all the words with which it has a certain similarity score, in this case greater than or equal to 0.33. I attach a MWE, in which I give an idea of the answer I am looking for by looking it up manually.
I think it's some sort of reverse lookup. Instead of finding the value corresponding to a particular row and a particular column, it finds the column based on a value in the main sheet and the row. I'm just really stuck at this point and would massively appreciate some help. Thanks! MWE here
If your words on the second sheet are in the same order then:
=IFERROR(TEXTJOIN(", ",,FILTER(Scores!B$1:W$1,(Scores!B2:W2>=0.33)*((Scores!B2:W2<1)))),"-")
Drag down.
Explanation:
Filter the values from row 1 according to the similarity score condition, using FILTER; the extra <1 test excludes the word's own self-similarity score of exactly 1.
Concatenate the filtered values using TEXTJOIN.

How to add a filter on a measure in Power BI?

I have a Power BI report with some filters implemented on columns. Now I have to add a new filter based on a measure. The data in that column is a positive integer, a negative integer, or 0.
What I'm trying to achieve: there should be a filter with three default values (positive integers, negative integers, and 0).
When I select positive, it should show only the records with positive integer values, and so on for the other two cases.
Problem: I am creating a measure from a measure but not getting the exact data from it.
The second thing I tried was creating positive and negative measures; I get the correct data when I use them in a table visual, but not in a slicer.
Measures are not meant to be used as slicers. A best practice in Power BI is that if you need to filter by certain criteria, you should create a new column in the data that reflects the values you want to filter by.
If you have a very large data set, I would do this in the Query Editor using a conditional column.
If size isn't a factor, then make a calculated column like so:
Answer =
SWITCH (
    TRUE (),
    'Table'[Values] > 0, "Positive",
    'Table'[Values] < 0, "Negative",
    "Zero"
)
Now insert the column into the filter and you should be able to switch easily between the values you need.

Understanding Spark CosineSimilarity output

I am using the Spark 1.6 cosine similarity (DIMSUM) algorithm.
Referring: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
Here is what I am doing.
Input:
The text of 50k documents, with their IDs, in a DataFrame.
Processing :
Tokenized the texts
Generated vectors using word2Vec
Generated RowMatrix
Used columnSimilarities method with threshold (DIMSUM)
Output:
Got a coordinate matrix
On printing out the entries of this coordinate matrix, I get output in a format like this:
MatrixEntry(133,185,0.04106425850610451)
I do not understand what the numbers 133 and 185 are. My guess was that these are document IDs/sequence numbers, but I am not sure. Can anyone please help?
Apologies if this question is very trivial.
MatrixEntry(i, j, value) represents the similarity between the i-th and j-th columns, so
MatrixEntry(133,185,0.04106425850610451)
is the similarity between the 133rd and 185th columns. These values correspond to terms, not documents.
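If you want to see this on a small example, here is a minimal PySpark sketch. It assumes a Spark version whose Python API exposes columnSimilarities, and the vectors below are made up; it shows that i and j in each MatrixEntry index columns of the RowMatrix, not rows.

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext.getOrCreate()

# each row is one vector (e.g. a document embedding); 3 columns/dimensions here
rows = sc.parallelize([
    Vectors.dense([1.0, 0.0, 2.0]),
    Vectors.dense([0.0, 3.0, 4.0]),
    Vectors.dense([5.0, 6.0, 0.0]),
])
mat = RowMatrix(rows)

# columnSimilarities compares COLUMNS of the matrix; the threshold enables DIMSUM
sims = mat.columnSimilarities(0.1)
for entry in sims.entries.collect():
    # entry.i and entry.j are column indices, entry.value is the cosine similarity
    print(entry.i, entry.j, entry.value)

If you are after document-to-document similarities, the documents would need to end up as the columns of the matrix rather than its rows.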

How to work with Grouped Responses (Event/Trial) in StatsModels Logistic Regression

I'm starting with StatsModels, coming from Minitab, and I can't find the option to do a binary logistic regression with the response in event/trial format.
Here's a very simple example of what I'm saying:
I have the data grouped by the explanatory variables, with the number of events (the number of ones in the binary response) in one column and the number of trials (the number of zeros and ones) in another.
Do you know how I can tell this to StatsModels?
Thanks a lot!!
Logit and Probit are only defined for binary (Bernoulli) events, 0 or 1. (Under the quasi-likelihood interpretation the response can take any value in the interval [0, 1].)
However, GLM with the Binomial family can be used for either binary Bernoulli data or binomial counts.
See the description of endog (the statsmodels term for the response or dependent variable) in
http://www.statsmodels.org/dev/generated/statsmodels.genmod.generalized_linear_model.GLM.html
"Binomial family models accept a 2d array with two columns. If supplied, each observation is expected to be [success, failure]."
An example is here:
http://www.statsmodels.org/dev/examples/notebooks/generated/glm.html
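As a minimal sketch of that event/trial setup (the predictor values and counts below are made up for illustration):

import numpy as np
import statsmodels.api as sm

# one row per group: a predictor, the number of events, and the number of trials
x      = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([3, 10, 18, 30])
trials = np.array([50, 50, 50, 50])

# the Binomial family accepts a two-column endog: [successes, failures]
endog = np.column_stack([events, trials - events])
exog = sm.add_constant(x)

result = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(result.summary())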

Stratified sampling in Pig?

Does anyone have an idea of how to do stratified sampling in Pig?
(Wikipedia)
For the moment, I do something like:
relation2 = SAMPLE relation1 0.05;
but my dataset contains a label column where some values occur rarely (0.5%, for example), and I would like my random down-sampling not to lose all of them.
Thanks a lot.
You could implement your own sampling by using RANDOM() and then filtering out rows with values below, say, 0.95. If you want to stratify this sampling, you could compute what fraction of your rows contains each label value and then scale the random threshold accordingly, so that different labels are sampled at different rates; a rough sketch of that idea follows.
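Here is the scaled-rate idea sketched in Python/pandas for clarity rather than in Pig; the labels, sizes, and the target of 50 rows per label are made-up assumptions. In Pig the same logic would be a GROUP/COUNT to get per-label totals, a JOIN to attach a per-label rate to each row, and a FILTER comparing RANDOM() to that rate.

import numpy as np
import pandas as pd

# made-up data: a very common label and a rare one
df = pd.DataFrame({
    "label": np.random.choice(["common", "rare"], size=10_000, p=[0.995, 0.005]),
    "value": np.random.randn(10_000),
})

target_per_label = 50                                  # desired rows per label
counts = df["label"].value_counts()                    # rows per label
rates = (target_per_label / counts).clip(upper=1.0)    # per-label sampling rate

# keep a row when a uniform random draw falls below its label's rate
keep = np.random.rand(len(df)) < df["label"].map(rates).to_numpy()
sample = df[keep]
print(sample["label"].value_counts())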
