Hive Query array as field - hadoop

I have two Hive tables:
Client table:
id,name,salary
1 ,John, 10000
2 ,Melissa, 5000
Account table:
id,account_number,client_id
1 ,00920202, 1
2 ,00920203, 1
3 ,00920204, 1
4 ,00920205, 2
5 ,00920206, 2
I need a Hive query that returns these results:
id,name,salary,accounts
1 ,John, 10000, {00920202, 00920203, 00920204}
2 ,Melissa, 5000, {00920205, 00920206}
Thanks in advance

Use collect_list if you are sure the account numbers are unique; otherwise use collect_set, which eliminates duplicates.
select c.id,c.name,c.salary,collect_list(a.account_number) as all_accounts
from client c
join account a on a.client_id=c.id
group by c.id,c.name,c.salary
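To illustrate the difference between the two aggregates, here is a minimal Python sketch of what the grouping does with the sample Account rows. The function name `collect_accounts` and its `distinct` flag are hypothetical, chosen only for this illustration; Hive itself does not guarantee the element order that this sketch happens to preserve.

```python
from collections import defaultdict

# Sample rows from the Account table: (id, account_number, client_id).
accounts = [
    (1, "00920202", 1),
    (2, "00920203", 1),
    (3, "00920204", 1),
    (4, "00920205", 2),
    (5, "00920206", 2),
]


def collect_accounts(rows, distinct=False):
    """Group account numbers by client_id.

    distinct=False mimics collect_list (duplicates are kept);
    distinct=True mimics collect_set (duplicates are dropped).
    """
    grouped = defaultdict(list)
    for _id, account_number, client_id in rows:
        if distinct and account_number in grouped[client_id]:
            continue  # already collected for this client
        grouped[client_id].append(account_number)
    return dict(grouped)


if __name__ == "__main__":
    for client_id, numbers in collect_accounts(accounts).items():
        print(client_id, numbers)
```

With unique account numbers both modes give the same result; they only differ once the same number appears twice for one client.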

Related

Update an Oracle table using listagg statement on the same table

I have a table that contains one or more records for each item. Each item can contain multiple sub-items (boards) and so the Itemid is often replicated with each record showing the division category (a number) that the Item/sub-item combo resides in:
ItemId Board# Division
142585109 0 6
142585114 0 3
142585116 0 1
142585120 0 4
142585197 0 5
142585197 2 4
142585197 3 3
142585197 5 6
142585197 8 1
142585294 0 4
142585317 0 1
I want to update the table and aggregate all of the division values (as a comma separated string) in a new field in this table, something like:
ItemId Board# AggDivisions
142585109 0 6
142585114 0 3
142585116 0 1
142585120 0 4
142585197 0 1,3,4,5,6
142585294 0 4
142585317 0 1
I used a LISTAGG query to do the aggregation, which works correctly, but when I tried to incorporate it into an update query, I ended up with multiple duplicates in the aggregated field for each record.
Here is my update attempt:
update itemtable dd
set aggregateddivisions = (SELECT Listagg(division, ',') within GROUP (ORDER BY division)
FROM itemtable ev
WHERE ev.itemid = dd.itemid
)
where exists (select 1
from itemtable ev
where ev.itemid = dd.itemid
);
How can I update the table with the aggregated list of values from the same table without ending up with duplicates?
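One way to reason about the desired output is to de-duplicate the division values per item before joining them. The Python sketch below does this for the sample rows. It assumes the duplicates come from the same division appearing on more than one board of an item, which the question does not confirm; `aggregate_divisions` is a hypothetical helper name.

```python
from collections import defaultdict

# Sample rows: (item_id, board, division).
rows = [
    (142585109, 0, 6),
    (142585114, 0, 3),
    (142585116, 0, 1),
    (142585120, 0, 4),
    (142585197, 0, 5),
    (142585197, 2, 4),
    (142585197, 3, 3),
    (142585197, 5, 6),
    (142585197, 8, 1),
    (142585294, 0, 4),
    (142585317, 0, 1),
]


def aggregate_divisions(rows):
    """Return {item_id: 'd1,d2,...'} with distinct, sorted divisions."""
    divisions = defaultdict(set)
    for item_id, _board, division in rows:
        divisions[item_id].add(division)  # the set drops repeated values
    return {
        item_id: ",".join(str(d) for d in sorted(values))
        for item_id, values in divisions.items()
    }


if __name__ == "__main__":
    for item_id, agg in aggregate_divisions(rows).items():
        print(item_id, agg)
```

In SQL terms this corresponds to aggregating over the distinct (item, division) pairs rather than over every row.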

Hierarchical query get all children as rows

Data:
ID PARENT_ID
1 [null]
2 1
3 1
4 2
Desired result:
ID CHILD_AT_ANY_LEVEL
1 2
1 3
1 4
2 4
I've tried SYS_CONNECT_BY_PATH, but I don't understand how to convert its result into an "inline view" that I can JOIN with the main table.
select connect_by_root(id) id, id child_at_any_level
from table
where level <> 1
connect by prior id = parent_id;
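To show what the query above computes, the following Python sketch pairs every node with each of its descendants at any depth. It assumes the hierarchy contains no cycles; `descendant_pairs` is a hypothetical name used only here.

```python
# Sample rows: (id, parent_id); None stands for a NULL parent.
rows = [(1, None), (2, 1), (3, 1), (4, 2)]


def descendant_pairs(rows):
    """Return sorted (ancestor_id, descendant_id) pairs for all levels."""
    children = {}
    for node_id, parent_id in rows:
        if parent_id is not None:
            children.setdefault(parent_id, []).append(node_id)

    pairs = []
    for root, _parent in rows:
        # Walk down from each node, as CONNECT BY does from each root row.
        stack = list(children.get(root, []))
        while stack:
            child = stack.pop()
            pairs.append((root, child))
            stack.extend(children.get(child, []))
    return sorted(pairs)


if __name__ == "__main__":
    for ancestor, descendant in descendant_pairs(rows):
        print(ancestor, descendant)
```

Skipping the pair of a node with itself plays the role of the `level <> 1` filter in the query.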

Complex Networks in Hive - Optimization Code

I have a problem optimizing my Hive code.
I have a huge table as follows:
Customer_id Product_id Date Value
1 1 02/28 100.0
1 2 02/02 120.0
1 3 02/10 144.0
2 2 02/15 120.0
2 3 02/28 144.0
... ... ... ...
I want to create a complex network where I link the products through the buyers. The graph does not have to be directed and I have to count the number of links between them.
In the end I need this:
Product_x Product_y amount
1 2 1
1 3 1
2 3 2
Can anyone help me with this?
I need an optimized way to do this. A join of the table with itself is not the solution; I really need the most efficient approach.
CREATE TABLE X AS
SELECT
a.product_id as product_x,
b.product_id as product_y,
count(*) as amount
FROM table as a
JOIN table as b
ON a.customer_id = b.customer_id
WHERE a.product_id < b.product_id
GROUP BY a.product_id, b.product_id;
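The pair counting itself can be sketched in Python as follows: group products per customer, then count each unordered product pair once per customer. This is only an illustration of the logic, not a Hive optimization, and `count_product_links` is a hypothetical name. Unlike the row-based self-join, the sketch collapses repeated purchases of the same product by one customer into a single link.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Sample rows: (customer_id, product_id); date and value are not needed here.
purchases = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]


def count_product_links(purchases):
    """Return {(product_x, product_y): number of customers buying both}."""
    baskets = defaultdict(set)
    for customer_id, product_id in purchases:
        baskets[customer_id].add(product_id)

    links = Counter()
    for products in baskets.values():
        # sorted() guarantees product_x < product_y, so the graph is undirected.
        for product_x, product_y in combinations(sorted(products), 2):
            links[(product_x, product_y)] += 1
    return dict(links)


if __name__ == "__main__":
    for (x, y), amount in sorted(count_product_links(purchases).items()):
        print(x, y, amount)
```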

Select and sum multiple columns for statistic purposes with Laravel query

I have a table, scores, where I save users' scores. It looks like this:
table `scores`
id | points | user_id
1 5 1
2 2 1
3 4 1
4 1 3
5 10 2
I want to select each user, sum their points, and show the result as a ranking. The result from the data above should be
user_id | points
1 11
2 10
3 1
The query with which I came up is
$sumPoints = Scores::select( \DB::raw("sum(points) as numberOfPoints"), \DB::raw("count(id) as numberId"))->groupBy("user_id")->first();
The problem is ->first(): it returns only one result, which is what it is designed to do. If I try to use ->get() instead, I get an Undefined property error. How should I use this?
The query that works in phpMyAdmin:
SELECT count(id) as numberId, sum(points) as numberOfPoints FROM `points` GROUP BY `user_id`
You can use something like this
$sumPoints = Scores::select( \DB::raw("sum(points) as numberOfPoints"), \DB::raw("count(id) as numberId"))->groupBy("user_id")->get();
foreach($sumPoints as $point){
dd($point); //OR dd($point->numberOfPoints)
}
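For reference, the ranking the question asks for (total points per user, highest first) can be sketched in Python like this. The sketch only illustrates the grouping and ordering logic on the sample data; `rank_users` is a hypothetical name and is not part of Laravel.

```python
from collections import defaultdict

# Sample rows from the scores table: (id, points, user_id).
scores = [(1, 5, 1), (2, 2, 1), (3, 4, 1), (4, 1, 3), (5, 10, 2)]


def rank_users(scores):
    """Return [(user_id, total_points), ...] ordered by total, descending."""
    totals = defaultdict(int)
    for _id, points, user_id in scores:
        totals[user_id] += points
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for user_id, points in rank_users(scores):
        print(user_id, points)
```

Note that the result carries the user_id alongside the sum, which a ranking needs in order to say whose total each row is.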

SAS Sorting within group

I would like to sort this data by descending number of events and then from the latest date, while keeping the rows grouped by ID.
I have tried proc sql:
proc sql;
create table new as
select *
from old
group by ID
order by events desc, date desc;
quit;
The result I currently get is
ID Date Events
1 09/10/2015 3
1 27/06/2014 3
1 03/01/2014 3
2 09/11/2015 2
3 01/01/2015 2
2 16/10/2014 2
3 08/12/2013 2
4 08/10/2015 1
5 09/11/2014 1
6 02/02/2013 1
Although the dates and events are sorted in descending order, the IDs with multiple events are no longer grouped together.
Would it be possible to achieve the below in fewer steps?
ID Date Events
1 09/10/2015 3
1 27/06/2014 3
1 03/01/2014 3
3 01/01/2015 2
3 08/12/2013 2
2 09/11/2015 2
2 16/10/2014 2
4 08/10/2015 1
5 09/11/2014 1
6 02/02/2013 1
Thanks
It looks to me like you're trying to sort by descending event, then by either the earliest or latest date (I can't tell which one from your explanation), also descending, and then by id. In your proc sql query, you could try calculating the min or max of the Date variable, grouped by event and id, and then sort the result by descending event, the descending min/max of the date, and id.
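The suggestion above can be sketched in Python: compute the latest date per ID, then sort by descending events, descending latest date, ID, and finally descending date within each ID. This assumes the dates are in dd/mm/yyyy form and that "latest date" is the intended group key; `sort_within_groups` is a hypothetical name. With the sample data this ordering places ID 2 ahead of ID 3, because ID 2 has the more recent latest date, which differs from the desired listing in the question.

```python
from datetime import datetime

# Sample rows: (id, date as dd/mm/yyyy, events).
rows = [
    (1, "09/10/2015", 3), (1, "27/06/2014", 3), (1, "03/01/2014", 3),
    (2, "09/11/2015", 2), (3, "01/01/2015", 2), (2, "16/10/2014", 2),
    (3, "08/12/2013", 2), (4, "08/10/2015", 1), (5, "09/11/2014", 1),
    (6, "02/02/2013", 1),
]


def parse(date_text):
    """Parse a dd/mm/yyyy string into a datetime."""
    return datetime.strptime(date_text, "%d/%m/%Y")


def sort_within_groups(rows):
    """Sort by events desc, latest date per ID desc, ID, then date desc."""
    latest = {}
    for id_, date_text, _events in rows:
        day = parse(date_text)
        if id_ not in latest or day > latest[id_]:
            latest[id_] = day

    def key(row):
        id_, date_text, events = row
        return (
            -events,
            -latest[id_].toordinal(),   # group-level key keeps each ID together
            id_,
            -parse(date_text).toordinal(),
        )

    return sorted(rows, key=key)


if __name__ == "__main__":
    for id_, date_text, events in sort_within_groups(rows):
        print(id_, date_text, events)
```

Because every row of an ID shares the same group-level key, rows of one ID always end up adjacent.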
