Pearson Correlation from multiple rows - Oracle

I want to calculate the Pearson correlation between two arrays.
The function CORR only accepts two columns, which have to be in a table. In my procedure I select multiple rows of numbers from two different sets, and I want to calculate the correlation from them.
EDIT:
CORR is an Oracle function that calculates the Pearson correlation between two columns. Here is the problem: I want to calculate the correlation between two arrays, so that it tells me, for example, that array1 is 50% similar to array2.

You can simply calculate the average of the pairwise correlations:
SELECT
  (ABS(corr1) + ABS(corr2) + ABS(corr3)) / 3 AS avg_corr
FROM (
  SELECT
    CORR(a.col1, b.col1) AS corr1,
    CORR(a.col2, b.col2) AS corr2,
    CORR(a.col3, b.col3) AS corr3
  FROM table1 a
  JOIN table2 b ON a.id = b.id
)
or use a more complex but more adequate generalization of the Pearson correlation (there is no built-in function in Oracle for this).
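As a point of reference outside the database, here is a minimal sketch of the plain Pearson formula in Python; the two arrays and their values are made up for illustration:

import math

def pearson(xs, ys):
    # plain Pearson correlation of two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

array1 = [1.0, 2.0, 3.0, 4.0]   # made-up sample data
array2 = [1.2, 1.9, 3.4, 3.8]
print(pearson(array1, array2))

A result of about 0.5 would correspond to the "50% similar" reading the question asks for.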

Related

How to simulate BigQuery's quantiles in Hive

I want to simulate BigQuery's QUANTILES function in Hive.
Data set: 1,2,3,4
BigQuery's query will return the value 2:
select nth(2, quantiles(col1, 3))
But in Hive:
select percentile(col1, 0.5)
I get 2.5.
Note: for an odd number of records I get the same result from both.
Are there any adequate Hive UDFs?
I guess what you are looking for is the percentile_approx UDF.
This page gives you the list of all built-in UDFs in Hive.
percentile_approx(DOUBLE col, p [, B])
Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.
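To see where the 2 versus 2.5 discrepancy comes from, here is a sketch of the two conventions in Python (a simplification; BigQuery's exact bucketing rule may differ):

def percentile_interpolated(data, p):
    # Hive's percentile(): linear interpolation between neighboring ranks
    s = sorted(data)
    k = (len(s) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def percentile_element(data, p):
    # quantiles()-style: always returns an actual element of the data
    s = sorted(data)
    return s[int((len(s) - 1) * p)]

data = [1, 2, 3, 4]
print(percentile_interpolated(data, 0.5))  # 2.5, like Hive's percentile
print(percentile_element(data, 0.5))       # 2, like BigQuery's quantiles

For an odd number of records the rank (len(s) - 1) * p is a whole number, so both conventions agree, which matches the note in the question.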

A function to reproduce the same number for each input

say you have some data consisting of 2 columns and 1 billion rows, like:
0,0
1,0
2,3
3,2
etc
I want to create a function that always returns the value in column 2 when given the corresponding input from column 1, so that it maps column-1 values to column-2 values exactly as they appear in the data.
Column 1 is sequential from 0 to 1E9 (one billion)
Column 2 can ONLY be {0,1,2,3}
I don't want to just store the data in an array; I want code that can calculate this map.
Any ideas?
Thanks in advance
If the keys are dense, a 1D array should be fine, where weights[key] = weight.
Otherwise, if the keys are sparse, a lookup structure such as a dictionary would work.
Not sure if you also needed help with the random part, but a cumulative sum and a rand(sum(weights)) draw will select randomly with a bias toward numbers with larger weights.
Edited for clarity: weights is the array.
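A minimal sketch of both ideas in Python; weights is the array from the answer above, its values are made up, and rand(sum(weights)) becomes random.random() * total:

import bisect
import random

# dense keys: column 1 is 0..n-1, so a plain list is the whole map
weights = [0, 0, 3, 2]              # weights[key] = column-2 value

def lookup(key):
    return weights[key]

# weighted random selection via the cumulative sum
cumulative = []
total = 0
for w in weights:
    total += w
    cumulative.append(total)

def pick_biased():
    r = random.random() * total     # uniform draw in [0, total)
    return bisect.bisect_right(cumulative, r)

print(lookup(2))       # 3
print(pick_biased())   # 2 or 3, biased toward the larger weight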
Assuming #munch1324 is correct, and the problem is:
Given a collection of 1000 data points, dynamically generate a function that matches the data set.
then yes, I think it is possible. However, if your goal is for the function to be a more compact representation of the data collection, then I think you are out of luck.
Here are two possibilities:
Piecewise-defined function
int foo(int x)
{
    if (x == 0) return 0;
    if (x == 1) return 0;
    if (x == 2) return 3;
    if (x == 3) return 2;
    ...
}
Polynomial interpolation
N data points can be fit exactly by a polynomial of degree N-1.
Given the collection of 1000 data points, use your favorite method to solve for the 1000 coefficients of a degree-999 polynomial.
Your resulting function would then be:
double c[1000]; /* the 1000 polynomial coefficients solved for from the data collection */
...
double foo(double x)
{
    /* Horner's rule for c[999]*x^999 + ... + c[1]*x + c[0] */
    double result = c[999];
    for (int i = 998; i >= 0; i--)
        result = result * x + c[i];
    return result;
}
This has obvious issues, because you have 1000 coefficients to store, and evaluating a degree-999 polynomial at large x values runs into severe numerical problems.
If you are looking for something a little more advanced, the Lagrange polynomial will give you the polynomial of least degree that fits all of your data points.
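For illustration, here is a sketch of the interpolation idea in Python/NumPy, using the four sample points from the question (scipy.interpolate.lagrange would serve equally well):

import numpy as np

# the four sample points from the question: column 1 -> column 2
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.0, 0.0, 3.0, 2.0])

# a degree N-1 polynomial passes exactly through N points
coeffs = np.polyfit(xs, ys, len(xs) - 1)

def foo(x):
    # evaluate the fitted polynomial and round back to the integer label
    return int(round(np.polyval(coeffs, x)))

print([foo(x) for x in range(4)])   # [0, 0, 3, 2]

At 1000 points, let alone a billion, the same approach collapses numerically, which is exactly the caveat above.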

choosing row m from two matrices at random

I have two m*n matrices, A and P. I want to randomly choose the same 3 rows from both matrices, e.g. rows m, m+1, m+2 from both. I want to be able to make the calculation U=A-P on the selected subset (i.e. Asub-Psub), rather than before the selection. So far I have only been able to select rows from one matrix, without being able to match them to the other. The code I use for this is:
A=[0,1,1,3,2,4,4,5;0,2,1,1,3,3,5,5;0,3,1,1,4,4,2,5;0,1,1,1,2,2,5,5]
P=[0,0,0,0,0,0,0,0;0,0,0,0,0,0,0,0;0,0,0,0,0,0,0,0;0,0,0,0,0,0,0,0]
U=A-P
k = randperm(size(U,1));
Usub = U(k(1:3),:);
I would first create a function that returns a submatrix of just three rows, taking the index of the first of the three rows as an argument. Then I'd do something like this:
m = number of rows;
randomRow = rand() % (m - 2);   // leave room for 3 consecutive rows
U = A.sub(randomRow) - P.sub(randomRow);
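The question is in MATLAB, but the idea translates directly; here is a sketch in Python/NumPy with the matrices copied from the question (in MATLAB itself, randi(size(A,1)-2) would give the random start row):

import numpy as np

A = np.array([[0, 1, 1, 3, 2, 4, 4, 5],
              [0, 2, 1, 1, 3, 3, 5, 5],
              [0, 3, 1, 1, 4, 4, 2, 5],
              [0, 1, 1, 1, 2, 2, 5, 5]])
P = np.zeros_like(A)

# pick a random start row, leaving room for 3 consecutive rows
m = np.random.randint(0, A.shape[0] - 2)
Usub = A[m:m+3, :] - P[m:m+3, :]   # subtraction happens after the selection
print(m, Usub)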

Complexity of: One matrix is row/col permutation of another matrix

Given two m x n matrices A and B whose elements belong to a set S.
Problem: Can the rows and columns of A be permuted to give B?
What is the complexity of algorithms to solve this problem?
Determinants partially help (when m=n): a necessary condition is that det(A) = +/- det(B).
Also allow A to contain "don't cares" that match any element of B.
Also, if S is finite allow permutations of elements of A.
This is not homework - it is related to the solved 17x17 puzzle.
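As a quick illustration of the determinant condition, a sketch in Python/NumPy with a made-up 2x2 matrix:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = A[::-1, :]   # B is A with its two rows swapped

# every single row or column swap flips the sign of the determinant,
# so a chain of them can only change it by a factor of +/-1
print(np.linalg.det(A), np.linalg.det(B))   # approximately -2.0 and 2.0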
See below example of permuting rows and columns of a matrix:
Observe the start matrix and the end matrix. All elements in a row or column are retained; it's just that their order has changed. Also, the change in relative positions is uniform across rows and columns.
E.g., follow the element 1 in the start and end matrices. Its row contains the elements 12, 3 and 14 along with it, and its column contains 5, 9 and 2. This is maintained across the transformations.
Based on this fact, I am putting forward this basic algorithm to decide, for a given matrix A, whether its rows and columns can be permuted to give matrix B (a sketch follows the steps).
1. For each row in A, sort all elements in the row. Do the same for B.
2. Sort all rows of A (and of B) by their contents, i.e. if row1 is {5,7,16,18} and row2 is {2,4,13,15}, then put row2 above row1.
3. Compare the resultant matrices A' and B'.
4. If they are equal, do (1) and (2) again, but on the columns of the ORIGINAL matrices A and B instead of the rows.
5. Now compare the resultant matrices A'' and B''.
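Here is the sketch of steps 1-5 in Python, assuming plain numeric matrices represented as lists of lists:

def canonical_rows(M):
    # steps 1-2: sort within each row, then sort the rows themselves
    return sorted(sorted(row) for row in M)

def canonical_cols(M):
    # step 4: the same, applied to the columns of the ORIGINAL matrix
    cols = list(map(list, zip(*M)))
    return sorted(sorted(col) for col in cols)

def maybe_permutation_of(A, B):
    # steps 3 and 5: compare both canonical forms
    return canonical_rows(A) == canonical_rows(B) and \
           canonical_cols(A) == canonical_cols(B)

A = [[1, 12], [5, 3]]
B = [[12, 1], [3, 5]]   # columns of A swapped
print(maybe_permutation_of(A, B))   # True

Note that matching canonical forms is a necessary condition, not a proof: it does not by itself exhibit a single pair of row and column permutations taking A to B.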

Leads to find tables correlation

Have these two tables:
TableA
ID Opt1 Opt2 Type
1 A Z 10
2 B Y 20
3 C Z 30
4 C K 40
and
TableB
ID Opt1 Type
1 Z 57
2 Z 99
3 X 3000
4 Z 3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of applying Apriori in some way, but it doesn't seem too practical. What do you guys say?
Thanks.
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In pseudocode, a naive implementation might look like:
1. for each column c1 in table1
2.   for each column c2 in table2
3.     if approximately_isomorphic(c1, c2) then
4.       emit (c1, c2)
approximately_isomorphic(c1, c2)
1. pairs = empty set
2. for i = 1 to min(|c1|, |c2|) do
3.   add (c1[i], c2[i]) to pairs
4. if |pairs| - unique_count(c1) < error_margin then return true
5. else return false
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, collect the set of distinct (c1, c2) pairs of corresponding elements. If the set contains exactly as many pairs as the first column has unique elements, then every value of the first column maps to a single value of the second and you have a perfect isomorphism; if you have a few more, you have a near isomorphism; if you have many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
Example on your input:
ID & anything: perfect isomorphism, since all values of ID are unique
Opt1 & ID: 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt1 & Opt1: ditto above
Opt1 & Type: 3 mappings and 3 unique values; a perfect isomorphism
Opt2 & ID: 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt2 & Opt2: ditto above
Opt2 & Type: ditto above
Type & anything: perfect isomorphism, since all values of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.
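A direct Python rendering of the pseudocode above, checked against two column pairs from the example tables (the error margin is an arbitrary choice):

def approximately_isomorphic(c1, c2, error_margin=2):
    # count distinct (c1, c2) pairs versus unique values of c1
    pairs = set(zip(c1, c2))
    return len(pairs) - len(set(c1)) < error_margin

opt1_a = ['A', 'B', 'C', 'C']     # TableA.Opt1
type_b = [57, 99, 3000, 3000]     # TableB.Type
id_b = [1, 2, 3, 4]               # TableB.ID

print(approximately_isomorphic(opt1_a, type_b))                # True: 3 pairs, 3 unique values
print(approximately_isomorphic(opt1_a, id_b, error_margin=1))  # False: 4 pairs, 3 unique values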
