Leads to find correlations between tables - algorithm

I have these two tables:
TableA
ID  Opt1  Opt2  Type
1   A     Z     10
2   B     Y     20
3   C     Z     30
4   C     K     40
and
TableB
ID  Opt1  Type
1   Z     57
2   Z     99
3   X     3000
4   Z     3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of Apriori in some way, but it doesn't seem too practical. What do you say?
Thanks.

It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/

In pseudocode, a naive implementation might look like:
for each column c1 in table1
    for each column c2 in table2
        if approximately_isomorphic(c1, c2) then
            emit (c1, c2)

approximately_isomorphic(c1, c2)
    hmap = hash()
    for i = 1 to min(|c1|, |c2|) do
        hmap[(c1[i], c2[i])] = true      -- count distinct linkings between the columns
    if |hmap| - unique_count(c1) < error_margin then return true
    else return false
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, construct a hash map linking corresponding elements of the two columns. If the hash map contains the same number of linkings as unique elements of the first column, then you have a perfect isomorphism; if you have a few more, you have a near isomorphism; if you have many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
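For concreteness, here is a minimal Python sketch of that check (my own illustration; the column lists and the error_margin value are assumptions, not part of the answer above):
def approximately_isomorphic(c1, c2, error_margin=2):
    """True if the columns are (nearly) isomorphic: the number of distinct
    linkings is close to the number of unique values in c1."""
    n = min(len(c1), len(c2))
    links = {(c1[i], c2[i]) for i in range(n)}      # distinct element linkings
    return len(links) - len(set(c1[:n])) < error_margin

# Columns from TableA and TableB in the question.
table1 = {'ID': [1, 2, 3, 4], 'Opt1': ['A', 'B', 'C', 'C'],
          'Opt2': ['Z', 'Y', 'Z', 'K'], 'Type': [10, 20, 30, 40]}
table2 = {'ID': [1, 2, 3, 4], 'Opt1': ['Z', 'Z', 'X', 'Z'],
          'Type': [57, 99, 3000, 3000]}

for name1, c1 in table1.items():
    for name2, c2 in table2.items():
        if approximately_isomorphic(c1, c2):
            print(name1, '~', name2)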
Example on your input:
ID & anything   : perfect isomorphism, since all values of ID are unique
Opt1 & ID       : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt1 & Opt1     : ditto above
Opt1 & Type     : 3 mappings and 3 unique values; perfect isomorphism
Opt2 & ID       : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt2 & Opt1     : ditto above
Opt2 & Type     : ditto above
Type & anything : perfect isomorphism, since all values of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.

Related

Indexing Strategy in Oracle

I have a table with 2 million rows.
The NDV (number of distinct values) in the columns is as follows:
A - 3
B - 60
D - 150
E - 600,000
The most frequently updated columns are A & B ( NDV = 3 for both ).
Assuming every query will have either column D or column E in the WHERE clause, which of the following would be the best set of indexes for SELECT statements:
D
D,E,A
E,A
A,E
Not really enough information to give a definitive assessment, but some things to consider:
You're unlikely to get a skip scan benefit, so if you want snappy response from predicates with leading E or leading D, that will be two indexes: one leading with D, and one leading with E.
If A/B are updated frequently (although that's a generic term), you might choose to leave them out of the index definition in order to reduce index maintenance overhead.

Pandas pivot table Nested Sorting

Given this data frame and pivot table:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['x', 'y', 'z', 'x', 'y', 'z'],
                   'B': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'C': [7, 5, 3, 4, 1, 6]})
df
A B C
0 x one 7
1 y one 5
2 z one 3
3 x two 4
4 y two 1
5 z two 6
table = pd.pivot_table(df, index=['A', 'B'],aggfunc=np.sum)
table
A B
x one 7
two 4
y one 5
two 1
z one 3
two 6
Name: C, dtype: int64
I want to sort the pivot table such that the order of 'A' is z, x, y and, within each value of 'A', 'B' is ordered by the descending values of column 'C'.
Like this:
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
Name: C, dtype: int64
Thanks in advance!
I don't believe there is an easy way to accomplish your objective. The following solution first sorts your table in descending order based on the values of column C. It then concatenates each slice based on your desired order.
order = ['z', 'x', 'y']
table = table.reset_index().sort_values('C', ascending=False)
>>> pd.concat([table.loc[table.A == val, :].set_index(['A', 'B']) for val in order])
C
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
If you can read in column A as categorical data, then it becomes much more straightforward. Setting your categories as list('zxy') and specifying ordered=True uses your custom ordering.
You can read in your data using something similar to:
'A':pd.Categorical(['x','y','z','x','y','z'], list('zxy'), ordered=True)
Alternatively, you can read in the data as you currently are, then use astype to convert A to categorical:
df['A'] = df['A'].astype('category', categories=list('zxy'), ordered=True)
Once A is categorical, you can pivot the same as before, and then sort with:
table = table.sort_values(ascending=False).sortlevel(0, sort_remaining=False)
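Putting the categorical approach together, here is a self-contained sketch (my adaptation, using the newer pd.CategoricalDtype and sort_index spelling in case sortlevel is unavailable in your pandas version); it mirrors the sortlevel trick above:
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z', 'x', 'y', 'z'],
                   'B': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'C': [7, 5, 3, 4, 1, 6]})

# Make A an ordered categorical so that z < x < y defines the custom order.
df['A'] = df['A'].astype(pd.CategoricalDtype(list('zxy'), ordered=True))

# Pivot as before (aggfunc='sum' avoids needing numpy here).
table = df.pivot_table(index=['A', 'B'], values='C', aggfunc='sum')['C']

# Sort by value descending, then re-sort level A only; this keeps the
# descending order of C within each value of A.
result = table.sort_values(ascending=False).sort_index(level=0, sort_remaining=False)
print(result)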
Solution
custom_order = ['z', 'x', 'y']
kwargs = dict(axis=0, level=0, drop_level=False)
new_table = pd.concat(
[table.xs(idx_v, **kwargs).sort_values(ascending=False) for idx_v in custom_order]
)
Alternate one liner
pd.concat([table.xs(i, drop_level=0).sort_values(ascending=0) for i in list('zxy')])
Explanation
custom_order is your desired order.
kwargs is a convenient way to improve readability (in my opinion). Key elements to note: axis=0 and level=0 might be important for you if you want to take this further; however, they are also the default values and can be left out.
drop_level=False is the key argument here; it is necessary to keep the idx_v we are taking an xs of, so that pd.concat puts it all together the way we'd like.
I use a list comprehension in almost the exact same manner as Alexander within the pd.concat call.
Demonstration
print(new_table)
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
Name: C, dtype: int64

Efficient database lookup based on input where not all digits are significant

I would like to do a database lookup based on a 10 digit numeric value where only the first n digits are significant. Assume that there is no way in advance to determine n by looking at the value.
For example, I receive the value 5432154321. The corresponding entry (if it exists) might have key 54 or 543215 or any value based on n being somewhere between 1 and 10 inclusive.
Is there any efficient approach to matching on such a string short of simply trying all 10 possibilities?
Some background
The value is from a barcode scan. The barcodes are EAN13 restricted circulation numbers so they have the following structure:
02[1234567890]C
where C is a check sum. The 10 digits in between the 02 and the check sum consist of an item identifier followed by an item measure. There might be a check digit after the item identifier.
Since I can't depend on the data adhering to any single standard, I would like to be able to define, on an ad-hoc basis, how particular barcodes are structured, which means that the portion of the 10-digit number that I extract can be any length between 1 and 10.
Just a few ideas here:
1) Maybe store these numbers in reversed form in your DB. If you have N = 54321, you store it as N = 12345 in the DB. Say N is the name of the column you store it in. When you read K = 5432154321, reverse it too; you get K1 = 1234512345. Now check the DB column N (whose value is, let's say, P): it matches if K1 % 10^s == P, where s = floor(log10(P)) + 1. Note: floor(log10(P)) + 1 is the count of digits of a number P > 0. You may also store this value precomputed in the DB, so that you don't need to compute it each time.
2) As idea 1) is kind of hacky (but maybe the best of the three ideas here), maybe you just store the numbers in a string column and check them with the LIKE operator. But this is trivial, and you have probably considered it already.
3) Or... you store the numbers reversed, but you also store all their residues mod 10^k for k = 1...10 in columns col1, col2, ..., col10. Then you can compare numbers almost directly; the check would be something like N % 10 == col1 or N % 100 == col2 or ... or N % 10^10 == col10. Still not very elegant, though (and I'm not quite sure it applies to your case).
I decided to check my idea 1), so here is an example (I did it in SQL Server):
insert into numbers
(number, cnt_dig)
values
(1234, 1 + floor(log10(1234)))
insert into numbers
(number, cnt_dig)
values
(51234, 1 + floor(log10(51234)))
insert into numbers
(number, cnt_dig)
values
(7812334, 1 + floor(log10(7812334)))
select * From numbers
/*
Now we have this in our table:
id number cnt_dig
4 1234 4
5 51234 5
6 7812334 7
*/
-- Note that the actual numbers stored here
-- are the reversed ones: 4321, 43215, 4332187.
-- So far so good.
-- Now we read say K = 433218799 on the input
-- We reverse it and we get K1 = 997812334
declare @K1 bigint
set @K1 = 997812334
select * From numbers
where
@K1 % power(10, cnt_dig) = number
-- So from the last 3 queries,
-- we get this row:
-- id number cnt_dig
-- 6 7812334 7
--
-- meaning we have a match
-- i.e. the actual number 433218799
-- was matched successfully with the
-- actual number (from the DB) 4332187.
So this idea 1) doesn't seem that bad after all.
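For what it's worth, the same matching condition can be sketched outside the database, e.g. in Python (my illustration of idea 1), separate from the SQL above):
def reverse_digits(n):
    return int(str(n)[::-1])

# Keys stored in reversed form together with their digit count,
# mirroring the (number, cnt_dig) columns above.
stored = [1234, 51234, 7812334]          # reversed forms of 4321, 43215, 4332187
table = [(n, len(str(n))) for n in stored]

def lookup(scanned):
    k1 = reverse_digits(scanned)
    # A stored key matches when the reversed scan, truncated to the key's
    # own digit count, equals the stored (reversed) key.
    return [n for n, cnt in table if k1 % 10 ** cnt == n]

print(lookup(433218799))                 # -> [7812334], i.e. the original key 4332187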

How can I do joins given a threshold in Hadoop using PIG

Let's say I have a dataset with following schema:
ItemName (String) , Length (long)
I need to find items that are duplicates based on their length. That's pretty easy to do in PIG:
raw_data = LOAD ... dataset
grouped = GROUP raw_data BY length;
items = FOREACH grouped GENERATE COUNT(raw_data) AS count, raw_data.name;
dups = FILTER items BY count > 1;
STORE dups ....
The above finds exact duplicates. Given the set below:
a, 100
b, 105
c, 100
It will output 2, (a,c)
Now I need to find duplicates using a threshold. For example, a threshold of 5 would mean matching items whose lengths are within +/- 5 of each other. So the output should look like:
3, (a,b,c)
Any ideas how I can go about doing this?
It is almost like I want PIG to use a UDF as its comparator when it is comparing records during its join...
I think the only way to do what you want is to load the data into two tables and do a cartesian join of the data set onto itself, so that each value can be compared to each other value.
Pseudo-code:
r1 = load dataset
r2 = load dataset
rcross = cross r1, r2
rcross is a cartesian product that will allow you to check the difference in length between each pair.
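In plain Python terms (my illustration; the actual join would of course be the CROSS above), the pairwise check looks like this:
from itertools import product

items = [('a', 100), ('b', 105), ('c', 100)]
threshold = 5

# Cartesian product of the dataset with itself; keep each unordered pair
# whose lengths differ by at most the threshold.
dups = [(n1, n2) for (n1, l1), (n2, l2) in product(items, items)
        if n1 < n2 and abs(l1 - l2) <= threshold]
print(dups)   # [('a', 'b'), ('a', 'c'), ('b', 'c')]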
I was solving a similar problem once and came up with a crazy and dirty solution.
It is based on the following lemma:
If |a - b| < r, then there exists an integer x with 0 <= x < r such that
floor((a+x)/r) = floor((b+x)/r)
(from here on I mean integer division and omit the floor() function, i.e. 5/2 = 2)
The lemma is fairly obvious, so I won't prove it here.
Based on it, you can do the following join:
RESULT = JOIN A BY A.len / r, B BY B.len / r
and get some, but not all, of the pairs satisfying |A.len - B.len| < r.
But doing this r times:
RESULT0 = JOIN A BY A.len / r, B BY B.len / r
RESULT1 = JOIN A BY (A.len+1) / r, B BY (B.len+1) / r
...
RESULT{r-1} = JOIN A BY (A.len+r-1) / r, B BY (B.len+r-1) / r
you will get all the needed pairs. Of course you will also get more rows than you need, but as I said, it's a dirty solution (i.e. it's not optimal, but it works).
The other big disadvantage is that the JOINs have to be written out dynamically, and there will be many of them for a large r.
Still, it works if you know r and it is rather small (like r = 6 in your case).
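To illustrate (not Pig, just a small Python sketch of the shifted-bucket idea, assuming r = 6 and the lengths from the question):
from collections import defaultdict

items = [('a', 100), ('b', 105), ('c', 100)]
r = 6   # a threshold of 5 means |len1 - len2| <= 5, i.e. < 6

pairs = set()
for x in range(r):                        # emulate the r shifted joins
    buckets = defaultdict(list)
    for name, length in items:
        buckets[(length + x) // r].append(name)
    for names in buckets.values():        # items sharing a bucket are join candidates
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                pairs.add(tuple(sorted((names[i], names[j]))))

print(sorted(pairs))                      # [('a', 'b'), ('a', 'c'), ('b', 'c')]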
Hope it helps

Is it possible to create a clause that mixes the clauses it contains?

I have to compute a value by working a list through various clauses, but I don't know the best combination of the clauses, because each clause removes items from the initial list and the next one works on the remaining list.
Is it possible to create a single clause that finds the best combination of the clauses?
calculate(List, Value) :-
    calculate_value1(List, _, Value1),
    calculate_value2(List, _, Value2),
    calculate_value3(List, _, Value3),
    max_list([Value1, Value2, Value3], Value).   % max_list/2 from library(lists)

calculate_value1(List, Rest, Value1) :-
    funcA(List, Rest1, ValueA),
    funcB(Rest1, Rest2, ValueB),
    funcC(Rest2, Rest, ValueC),
    Value1 is ValueA + ValueB + ValueC.

calculate_value2(List, Rest, Value2) :-
    funcB(List, Rest1, ValueB),
    funcA(Rest1, Rest2, ValueA),
    funcC(Rest2, Rest, ValueC),
    Value2 is ValueA + ValueB + ValueC.

calculate_value3(List, Rest, Value3) :-
    funcC(List, Rest1, ValueC),
    funcB(Rest1, Rest2, ValueB),
    funcA(Rest2, Rest, ValueA),
    Value3 is ValueA + ValueB + ValueC.
Thank you.
I have to compare two lists and find the best match between them. I run various clauses over the lists: some identify elements that are identical between the first and the second list (a 100% match), others check whether an element of one list is the sum of elements of the other. I also check whether elements are close neighbours; for instance, 110 is closer to 100 than to 150. The data is not only numeric, though.
Right now I have several separate clauses: equals(), which identifies the elements that are equal between the two lists; sum(), which identifies items related by a sum; multiply(); etc.
Each clause takes a list as input and returns the items that met its criterion (sum, multiplication, etc.), the percentage found, and the list of remaining elements that is fed as input to the next clause.
Done this way, however, the program is procedural, because I first compute the equal elements, then the sums, and so on.
I would like to create a dynamic program that can identify the best percentage regardless of the order in which the clauses are applied.
I hope this is clearer.
