Hive - add column with number of distinct values in groups - hadoop

Suppose I have the following data.
number group
1 a
1 a
3 a
4 a
4 a
5 c
6 b
6 b
6 b
7 b
8 b
9 b
10 b
14 b
15 b
I would like to group the data by group and add a further column which say how many distinct values of number each group has.
My desired output would look as follows:
number group dist_number
1 a 3
1 a 3
3 a 3
4 a 3
4 a 3
5 c 1
6 b 9
6 b 9
6 b 9
7 b 9
8 b 9
9 b 9
10 b 9
14 b 9
15 b 9
What I have tried is:
> select *, count(distinct number) over(partition by group) from numbers;
11 11
As one sees, this aggregates globally and calculates the number of distinct values independently from the group.
One thing I could do is to use group by as follows:
hive> select *, count(distinct number) from numbers group by group;
a 3
b 7
c 1
And then join over group
But I thought maybe there is a more easy solution to this, e.g., using the over(partition by group) method?

You definitely want to use windowing functions here. I'm not exactly sure how you got 11 11 from the query your tried; I'm 99% sure if you try to count(distinct _) in Hive with an over/partition it will complain. To get around this you can use collect_set() to get an array of the distinct elements in the partition and then you can use size() to count the elements.
Query:
select *
, size(num_arr) dist_num
from (
select *
, collect_set(num) over (partition by grp) num_arr
from db.tbl ) x
Output:
4 a [4,3,1] 3
4 a [4,3,1] 3
3 a [4,3,1] 3
1 a [4,3,1] 3
1 a [4,3,1] 3
15 b [15,14,10,9,8,7,6] 7
14 b [15,14,10,9,8,7,6] 7
10 b [15,14,10,9,8,7,6] 7
9 b [15,14,10,9,8,7,6] 7
8 b [15,14,10,9,8,7,6] 7
7 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
5 c [5] 1
I included in the arrays in the output so you could see what was happening, obviously you can discard them in your query. As as note, doing a self-join here is really a disaster with regards to performance (and it's more lines of code).

As per your requirement,this may work:
select number,group1,COUNT(group1) OVER (PARTITION BY group1) as dist_num from table1;

Related

Pandas Sort Order of Columns by Row

Given the following data frame:
df = pd.DataFrame({'A' : ['1','2','3','7'],
'B' : [7,6,5,4],
'C' : [5,6,7,1],
'D' : [1,9,9,8]})
df=df.set_index('A')
df
B C D
A
1 7 5 1
2 6 6 9
3 5 7 9
7 4 1 8
I want to sort the order of the columns descendingly on the bottom row like this:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Thanks in advance!
Easiest way is to take the transpose, then sort_values, then transpose back.
df.T.sort_values('7', ascending=False).T
or
df.T.sort_values(df.index[-1], ascending=False).T
Gives:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Testing
my solution
%%timeit
df.T.sort_values(df.index[-1], ascending=False).T
1000 loops, best of 3: 444 µs per loop
alternative solution
%%timeit
df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]
1000 loops, best of 3: 525 µs per loop
You can use sort_values (by the index position of your row) with axis=1:
df.sort_values(by=df.index[-1],axis=1,inplace=True)
Here is a variation that does not involve transposition:
df = df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]

How to compute a natural join??? 5

Table R (A, C) contains the following entries:
A C
3 3
6 4
2 3
3 5
7 1
Table S (B, C, D) following
B C D
5 1 6
1 5 8
4 3 9
Calculate the natural join of R and S. Which of the lines would be the result? Each resulting string has the following schema (A, B, C, D).
Please help!!!
Got the answer by looking at this. So your answer should be: {(3,4,3,9),(2,4,3,9),(3,1,5,8),(7,5,1,6)}
A B C D
3 4 3 9
2 4 3 9
3 1 5 8
7 5 1 6

Special Case of Natural Join

What is the the result of natural join if two relations have exactly the same attributes? That is suppose
A B A B
1 2 7 8
3 4 9 10
5 6 1 2
Would the result just be
A B
1 2

How to Use Where with Many to Many Comparison

I have problem with LINQ Query in following scenario:
I have Activity and ActivityTeacher Two Table and List of Some Teachers.
Activity Table
ActivityID Date Class
1 4/4/2012 1
2 4/5/2013 2
3 4/6/2013 5
4 5/6/2013 2
5 5/16/2013 1
6 5/20/2013 8
7 5/21/2013 7
8 6/22/2013 6
9 8/10/2013 5
10 8/12/2013 4
ActivityTeacher Table
ActivityID TeacherID
1 2
1 3
1 4
2 6
3 6
3 6
3 4
2 5
4 2
4 3
4 6
5 8
5 7
5 6
6 6
6 7
6 9
6 10
6 1
6 2
7 2
7 8
7 9
7 10
8 3
8 4
8 6
8 7
9 10
9 3
9 2
10 1
10 2
List of Teachers={2,3,4}
Now I want to select records from Activity based on List of Teachers={2,3,4}
without using foreach loop.
The Activity entity should have a Teachers navigation property you can utilize:
context.Activities
.Where(x => listOfTeachers.Contains(x.Teachers.Select(t => t.TeacherId)));
If listOfTeachers contains the three IDs 2, 3, 4, this query should translate to SQL that is similar to the following:
select a.*
from Activity a
inner join ActivityTeacher at
on a.activityid = at.activityid
where at.teacherid in (2, 3, 4);

Want to generate o/p as Below in Oracle

I need an o/p as below.
1,1
2,1
2,2
3,1
3,2
3,3
4,1
4,2
4,3
4,4
... and so on.
I tried to write the query as below. But throwing error. SIngle row subquery returns more than one row.
with test1 as(
SELECT LEVEL n
FROM DUAL
CONNECT BY LEVEL <59)
select n,(
SELECT LEVEL n
FROM DUAL
CONNECT BY LEVEL <n) from test1
Appreciate your help in solving the same.
Here is one of the methods how you could get the desired result:
SQL> with t1(col) as(
2 select level
3 from dual
4 connect by level <= 5
5 )
6 select a.col
7 , b.col
8 from t1 a
9 join t1 b
10 on a.col >= b.col
11 ;
COL COL
---------- ----------
1 1
2 1
2 2
3 1
3 2
3 3
4 1
4 2
4 3
4 4
5 1
5 2
5 3
5 4
5 5
15 rows selected

Resources