How to compute a natural join?

Table R (A, C) contains the following entries:
A C
3 3
6 4
2 3
3 5
7 1
Table S (B, C, D) contains the following entries:
B C D
5 1 6
1 5 8
4 3 9
Calculate the natural join of R and S. Which rows would be in the result? Each resulting row has the schema (A, B, C, D).
Please help!!!

Got the answer by looking at this. So your answer should be: {(3,4,3,9),(2,4,3,9),(3,1,5,8),(7,5,1,6)}
A B C D
3 4 3 9
2 4 3 9
3 1 5 8
7 5 1 6
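If you want to check the result mechanically, here is a minimal Python sketch of the join, assuming the rows of R and S are stored as plain tuples (the names R, S and natural_join_on_c are made up for this illustration). A natural join keeps every pair of rows that agree on the shared attribute C:

```python
from collections import defaultdict

# Rows of R(A, C) and S(B, C, D), copied from the question.
R = [(3, 3), (6, 4), (2, 3), (3, 5), (7, 1)]
S = [(5, 1, 6), (1, 5, 8), (4, 3, 9)]


def natural_join_on_c(r_rows, s_rows):
    """Join R(A, C) with S(B, C, D) on the shared attribute C."""
    # Index S by its C value so each R row only looks at matching S rows.
    s_by_c = defaultdict(list)
    for b, c, d in s_rows:
        s_by_c[c].append((b, d))

    result = []
    for a, c in r_rows:
        for b, d in s_by_c.get(c, []):
            result.append((a, b, c, d))  # output schema (A, B, C, D)
    return result


joined = natural_join_on_c(R, S)
for row in joined:
    print(row)
```

The row (6, 4) of R drops out because no row of S has C = 4.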

Pandas pivot table Nested Sorting Part 3

Episode 3:
In part 2, we retained the hierarchical nature of the indices while sorting within the right-most level. In part 1, we applied a custom sort to the left-most index level while sorting the values within the right-most index.
Now, I'd like to combine both methods.
Given the following data frame and resultant pivot table:
import pandas as pd
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
df
A B C D
0 a x a 7
1 a y b 5
2 a z a 3
3 a x b 4
4 a y a 1
5 b z b 6
6 b x a 5
7 b y b 3
8 b z a 1
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I would like to specify a custom order of 'B'.
This seems to work:
df['B']=df['B'].astype('category')
df['B']=df['B'].cat.set_categories(['z','x','y'])
Next, I'd like the pivot table to keep the order for 'B' specified above while sorting the values of 'D' in descending order within each category of 'B'.
Like this:
       D
A B C
a z a  3
  x a  7
    b  4
  y b  5
    a  1
b z b  6
    a  1
  x a  5
  y b  3
Thanks in advance!
UPDATE: using pivot_table()
In [79]: df.pivot_table(index=['A','B','C'], aggfunc='sum').reset_index().sort_values(['A','B','D'], ascending=[1,1,0]).set_index(['A','B','C'])
Out[79]:
       D
A B C
a x a  7
    b  4
  y b  5
    a  1
  z a  3
b x a  5
  y b  3
  z b  6
    a  1
Is that what you want?
In [64]: df.sort_values(['A','B','D'], ascending=[1,1,0]).set_index(['A','B','C'])
Out[64]:
       D
A B C
a z a  3
  x a  7
    b  4
  y b  5
    a  1
b z b  6
    a  1
  x a  5
  y b  3

Pandas Sort Order of Columns by Row

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A' : ['1','2','3','7'],
'B' : [7,6,5,4],
'C' : [5,6,7,1],
'D' : [1,9,9,8]})
df=df.set_index('A')
df
B C D
A
1 7 5 1
2 6 6 9
3 5 7 9
7 4 1 8
I want to sort the columns in descending order by the values in the bottom row, like this:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Thanks in advance!
Easiest way is to take the transpose, then sort_values, then transpose back.
df.T.sort_values('7', ascending=False).T
or
df.T.sort_values(df.index[-1], ascending=False).T
Gives:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Testing
my solution
%%timeit
df.T.sort_values(df.index[-1], ascending=False).T
1000 loops, best of 3: 444 µs per loop
alternative solution
%%timeit
df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]
1000 loops, best of 3: 525 µs per loop
You can use sort_values with axis=1, sorting by the index label of your last row (ascending=False gives the descending order you asked for):
df.sort_values(by=df.index[-1],axis=1,ascending=False,inplace=True)
Here is a variation that does not involve transposition:
df = df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]

Hive - add column with number of distinct values in groups

Suppose I have the following data.
number group
1 a
1 a
3 a
4 a
4 a
5 c
6 b
6 b
6 b
7 b
8 b
9 b
10 b
14 b
15 b
I would like to group the data by group and add a further column which says how many distinct values of number each group has.
My desired output would look as follows:
number group dist_number
1 a 3
1 a 3
3 a 3
4 a 3
4 a 3
5 c 1
6 b 7
6 b 7
6 b 7
7 b 7
8 b 7
9 b 7
10 b 7
14 b 7
15 b 7
What I have tried is:
> select *, count(distinct number) over(partition by group) from numbers;
11 11
As one sees, this aggregates globally and calculates the number of distinct values independently from the group.
One thing I could do is to use group by as follows:
hive> select group, count(distinct number) from numbers group by group;
a 3
b 7
c 1
And then join on group.
But I thought maybe there is an easier solution to this, e.g., using the over(partition by group) method?
You definitely want to use windowing functions here. I'm not exactly sure how you got 11 11 from the query you tried; I'm 99% sure that if you try to count(distinct _) in Hive with an over/partition it will complain. To get around this you can use collect_set() to get an array of the distinct elements in the partition, and then you can use size() to count the elements.
Query:
select *
, size(num_arr) dist_num
from (
select *
, collect_set(num) over (partition by grp) num_arr
from db.tbl ) x
Output:
4 a [4,3,1] 3
4 a [4,3,1] 3
3 a [4,3,1] 3
1 a [4,3,1] 3
1 a [4,3,1] 3
15 b [15,14,10,9,8,7,6] 7
14 b [15,14,10,9,8,7,6] 7
10 b [15,14,10,9,8,7,6] 7
9 b [15,14,10,9,8,7,6] 7
8 b [15,14,10,9,8,7,6] 7
7 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
5 c [5] 1
I included the arrays in the output so you could see what was happening; obviously you can discard them in your query. As a note, doing a self-join here is really a disaster with regard to performance (and it's more lines of code).
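To make the logic of the query easier to follow outside Hive, here is a rough Python sketch of the same two steps, assuming the table is just a list of (number, group) tuples (all variable names are made up for this illustration). The set per group plays the role of collect_set() over the partition, and len() plays the role of size():

```python
from collections import defaultdict

# (number, group) rows from the question.
rows = [(1, "a"), (1, "a"), (3, "a"), (4, "a"), (4, "a"), (5, "c"),
        (6, "b"), (6, "b"), (6, "b"), (7, "b"), (8, "b"), (9, "b"),
        (10, "b"), (14, "b"), (15, "b")]

# Rough equivalent of collect_set(num) over (partition by grp).
distinct_by_group = defaultdict(set)
for number, group in rows:
    distinct_by_group[group].add(number)

# Rough equivalent of size(num_arr): attach the set size to every original row.
with_counts = [(number, group, len(distinct_by_group[group]))
               for number, group in rows]

for row in with_counts:
    print(row)
```

Every input row is kept, which is the difference from a plain group by that collapses each group to a single row.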
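As per your requirement, this may work (note that COUNT(group1) counts every row in the partition, so duplicate values of number are not collapsed):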
select number,group1,COUNT(group1) OVER (PARTITION BY group1) as dist_num from table1;

Special Case of Natural Join

What is the result of a natural join if two relations have exactly the same attributes? That is, suppose
A B A B
1 2 7 8
3 4 9 10
5 6 1 2
Would the result just be
A B
1 2
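When the schemas are identical, every attribute is a join attribute, so two rows combine only if they are equal in all columns, and the natural join reduces to the set intersection. A minimal Python sketch, assuming each relation is a list of dicts keyed by attribute name (the helper natural_join and the names first and second are hypothetical, written only for this illustration):

```python
def natural_join(left, right):
    """Natural join of two relations given as lists of attribute->value dicts."""
    if not left or not right:
        return []
    # Attributes that appear in both relations are the join attributes.
    shared = set(left[0]) & set(right[0])
    result = []
    for l_row in left:
        for r_row in right:
            if all(l_row[k] == r_row[k] for k in shared):
                merged = {**l_row, **r_row}
                if merged not in result:  # relations are sets: no duplicates
                    result.append(merged)
    return result


first = [{"A": 1, "B": 2}, {"A": 3, "B": 4}, {"A": 5, "B": 6}]
second = [{"A": 7, "B": 8}, {"A": 9, "B": 10}, {"A": 1, "B": 2}]

print(natural_join(first, second))
```

Only the row (1, 2) appears in both relations, so it is the only row in the result.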

How to sort dataframe in R with specified column order preservation?

Let's say I have a data.frame
x <- data.frame(a = c('A','A','A','A','A', 'C','C','C','C', 'B','B','B'),
b = c('a','c','a','a','c', 'd', 'e','e','d', 'b','b','b'),
c = c( 7, 3, 2, 4, 5, 3, 1, 1, 5, 5, 2, 3),
stringsAsFactors = FALSE)
> x
a b c
1 A a 7
2 A c 3
3 A a 2
4 A a 4
5 A c 5
6 C d 3
7 C e 1
8 C e 1
9 C d 5
10 B b 5
11 B b 2
12 B b 3
I would like to sort x by columns b and c while keeping the order of a as before. x[order(x$b, x$c),] breaks the order of column a. This is what I want:
a b c
3 A a 2
4 A a 4
1 A a 7
2 A c 3
5 A c 5
6 C d 3
9 C d 5
7 C e 1
8 C e 1
11 B b 2
12 B b 3
10 B b 5
Is there a quick way of doing it?
Currently I run a "for" loop and sort each subset; I'm sure there must be a better way.
Thank you!
Ilya
If column "a" is ordered already, then it's this simple:
> x[order(x$a,x$b, x$c),]
a b c
3 A a 2
4 A a 4
1 A a 7
2 A c 3
5 A c 5
6 B d 3
9 B d 5
7 B e 1
8 B e 1
11 C b 2
12 C b 3
10 C b 5
If column a isn't ordered (but is grouped), create a new factor with the levels of x$a and use that.
Thank you Spacedman! Your recommendation works well.
x$a <- factor(x$a, levels = unique(x$a), ordered = TRUE)
x[order(x$a,x$b, x$c),]
Following Gavin's comment
x$a <- factor(x$a, levels = unique(x$a))
x[order(x$a,x$b, x$c),]
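For comparison, the same idea (rank the values of a by first appearance, then sort by that rank, b and c) can be sketched in plain Python. This is only an illustration of the technique, not a translation of the R code, and the names rows, first_seen and ordered are made up:

```python
# Rows (a, b, c) from the question, in their original order.
rows = [("A", "a", 7), ("A", "c", 3), ("A", "a", 2), ("A", "a", 4),
        ("A", "c", 5), ("C", "d", 3), ("C", "e", 1), ("C", "e", 1),
        ("C", "d", 5), ("B", "b", 5), ("B", "b", 2), ("B", "b", 3)]

# Rank each value of a by first appearance, similar in spirit to
# levels = unique(x$a).
first_seen = {}
for a, _, _ in rows:
    first_seen.setdefault(a, len(first_seen))

# Sort by that rank, then by b and c. sorted() is stable, so fully tied
# rows keep their original relative order.
ordered = sorted(rows, key=lambda r: (first_seen[r[0]], r[1], r[2]))

for row in ordered:
    print(row)
```

The groups come out in the order A, C, B, as in the desired output of the question.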
require(doBy)
orderBy(~ a + b + c, data=x)
