I'm working in Hive and have a dataset like this:
client_id date nb_pts
1 2016-06-01 1
1 2016-06-02 3
1 2016-06-03 4
2 2016-06-01 2
2 2016-06-02 3
For each client, I need to output the difference between the current nb_pts and the previous nb_pts (taking the previous value as 0 for the client's first row).
So my output should be :
client_id date nb_pts nb_pts_per_row
1 2016-06-01 1 1 (1-0)
1 2016-06-02 3 2 (3-1)
1 2016-06-03 4 1 (4-3)
2 2016-06-01 2 2 (2-0)
2 2016-06-02 3 1 (3-2)
I've tried to use the LAG function in Hive:
SELECT client_id, date, nb_pts,
nb_pts - (LAG(nb_pts, 1, 0) OVER (PARTITION BY client_id ORDER BY date ROWS 1 PRECEDING)) as nb_pts_per_row
FROM MyTable
But the validation failed. It says:
Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Expecting left window frame boundary for function LAG((TOK_TABLE_OR_COL nb_pts), 1, 0) org.apache.hadoop.hive.ql.parse.WindowingSpec$WindowSpec#27a007cd as LAG_window_0 to be unbounded.
EDIT (SOLUTION):
It works without ROWS 1 PRECEDING:
SELECT client_id, date, nb_pts,
nb_pts - (LAG(nb_pts, 1, 0) OVER (PARTITION BY client_id ORDER BY date)) as nb_pts_per_row
FROM MyTable
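To make the semantics of the working query concrete, here is a minimal Python sketch that emulates nb_pts - LAG(nb_pts, 1, 0) per client on the sample rows above. Plain Python stands in for Hive here, and the function name diff_with_previous is hypothetical; it only illustrates what the window function computes.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (client_id, date, nb_pts)
rows = [
    (1, "2016-06-01", 1),
    (1, "2016-06-02", 3),
    (1, "2016-06-03", 4),
    (2, "2016-06-01", 2),
    (2, "2016-06-02", 3),
]


def diff_with_previous(rows, default=0):
    """Emulate nb_pts - LAG(nb_pts, 1, default)
    OVER (PARTITION BY client_id ORDER BY date)."""
    out = []
    # Sort by (client_id, date) so each partition is contiguous and ordered.
    # ISO dates (YYYY-MM-DD) sort correctly as strings.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for client_id, group in groupby(ordered, key=itemgetter(0)):
        previous = default  # LAG's third argument: used on a partition's first row
        for _, date, nb_pts in group:
            out.append((client_id, date, nb_pts, nb_pts - previous))
            previous = nb_pts
    return out


result = diff_with_previous(rows)
for row in result:
    print(row)
```

The last element of each printed tuple corresponds to nb_pts_per_row in the expected output.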
Related
I have the following records:
++++++++++++++++++++++
rid cid result timestamp
1 2 true t1
1 2 false t2
1 3 false t3
1 3 true t4
1 4 false t5
++++++++++++++++++++++
I need to aggregate them in two steps:
Step 1: for each unique (rid, cid) combination, get the latest record, i.e. the one with the most recent timestamp.
For example, I should get:
++++++++++results by latest timestamp for each cid and rid combination+++++++++++++
rid cid result timestamp
1 2 false t2
1 3 true t4
1 4 false t5
+++++++++++++++++++++++++
Step 2: once you have the most recent records from step 1, aggregate the results at the rid level and get the pass and fail counts for each rid.
++++++final output++++++++
rid: 1
result_false: 2
result_true: 1
+++++++++++++++++
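The two steps can be sketched in plain Python as follows. This is only an illustration of the logic, not a query for any particular database: the function names are hypothetical, and the symbolic timestamps t1..t5 are replaced by the integers 1..5 (an assumption) so that they can be compared.

```python
from collections import Counter

# Sample records: (rid, cid, result, timestamp).
# Assumption: t1..t5 are represented by the integers 1..5.
records = [
    (1, 2, True, 1),
    (1, 2, False, 2),
    (1, 3, False, 3),
    (1, 3, True, 4),
    (1, 4, False, 5),
]


def latest_per_pair(records):
    """Step 1: keep the record with the highest timestamp per (rid, cid)."""
    latest = {}
    for rid, cid, result, ts in records:
        key = (rid, cid)
        if key not in latest or ts > latest[key][1]:
            latest[key] = (result, ts)
    return latest


def count_results_per_rid(latest):
    """Step 2: count true/false results per rid over the latest records."""
    counts = {}
    for (rid, _cid), (result, _ts) in latest.items():
        counts.setdefault(rid, Counter())[result] += 1
    return counts


latest = latest_per_pair(records)
counts = count_results_per_rid(latest)
for rid, c in counts.items():
    print(f"rid: {rid}  result_false: {c[False]}  result_true: {c[True]}")
```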
Data:
ID PARENT_ID
1 [null]
2 1
3 1
4 2
Desired result:
ID CHILD_AT_ANY_LEVEL
1 2
1 3
1 4
2 4
I've tried SYS_CONNECT_BY_PATH, but I don't understand how to convert its result into an "inline view" that I can JOIN with the main table.
select connect_by_root(id) id, id child_at_any_level
from table
where level <> 1
connect by prior id = parent_id;
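For readers unfamiliar with hierarchical queries, the following Python sketch shows what the query above produces on the sample data: one (ancestor, descendant) pair for every level of the hierarchy. The function name is hypothetical, and the sketch assumes the parent links contain no cycles.

```python
# Sample data: (id, parent_id); None stands for [null].
edges = [(1, None), (2, 1), (3, 1), (4, 2)]


def descendants_at_any_level(edges):
    """Return sorted (ancestor_id, descendant_id) pairs for every level."""
    parent_of = {node: parent for node, parent in edges}
    pairs = []
    for node in parent_of:
        # Walk up from each node to the root, emitting one pair per ancestor.
        ancestor = parent_of[node]
        while ancestor is not None:
            pairs.append((ancestor, node))
            ancestor = parent_of.get(ancestor)
    return sorted(pairs)


pairs = descendants_at_any_level(edges)
for ancestor, child in pairs:
    print(ancestor, child)
```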
I'm trying to calculate a time difference between 2 rows and applied the solution from this SO question. However I get an exception:
> org.apache.hive.service.cli.HiveSQLException: Error while compiling
> statement: FAILED: SemanticException Failed to breakup Windowing
> invocations into Groups. At least 1 group must only depend on input
> columns. Also check for circular dependencies. Underlying error:
> Expecting left window frame boundary for function
> LAG((tok_table_or_col time), 1, 0) Window
> Spec=[PartitioningSpec=[partitionColumns=[(tok_table_or_col
> client_id)]orderColumns=[(tok_table_or_col time) ASC
> NULLS_FIRST]]window(type=ROWS, start=1 PRECEDING, end=currentRow)] as
> LAG_window_0 to be unbounded. Found : 1
HiveQL:
SELECT id, loc, LAG(time, 1, 0) OVER (PARTITION BY id, loc ORDER BY time ROWS 1 PRECEDING) - time AS response_time FROM mytable
How do I fix this? What is the issue?
EDIT:
Sample data:
id loc time
0 1 1414250523591
0 1 1414250523655
1 2 1414250523655
1 2 1414250523661
1 3 1414250523661
1 3 1414250523662
And what I want is the difference of time between rows with same id and loc (always pairs of 2).
EDIT2: I should also mention that I'm new to the Hadoop/Hive ecosystem.
As the error says, the window should be unbounded. So I removed the ROWS clause, and now the query at least runs, but the result is still wrong. To check what the LAG value actually is, I ran:
SELECT id, loc, LAG(time, 1) OVER (PARTITION BY id, loc ORDER BY time) AS lag_col FROM mytable
And I get this as output:
id loc lag_col
1 2 null
1 2 -1
1 3 null
1 3 -1
The null is clear because I removed the default value, but why -1? Are the large values in the time column leading to some kind of overflow? The column is defined as bigint, so the values should fit without a problem, but maybe there is a conversion to int during the query?
I just want to count duplicated date values in my table. My table looks like this:
VISIT:
ID_VISIT FK_PATIENT DATEA
0 1 20160425
1 2 20160425
2 3 20160426
I tried these:
SELECT VISIT.DATEA, COUNT(VISIT.DATEA) as numberOfDate FROM VISIT
SELECT VISIT.DATEA, COUNT(VISIT.DATEA) as numberOfDate FROM VISIT GROUP BY numberOfDate
but I only got this:
DATEA NUMBEROFDATE
20160502 1
20160430 1
20160503 1
20160501 1
20160429 1
20160425 1
20160425 1
20160425 1
20160428 1
20160504 1
but I want to get this:
DATEA NUMBEROFDATE
20160502 1
20160430 1
20160503 1
20160501 1
20160429 1
20160425 3
20160428 1
20160504 1
Group by the column you want to be unique. Aggregate functions like count() then apply to each group:
SELECT DATEA, COUNT(DATEA) as numberOfDate
FROM VISIT
GROUP BY DATEA
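As a quick way to check this, the sketch below runs the same GROUP BY query against the three sample rows from the question. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the asker's database, which is an assumption, and the ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE VISIT (ID_VISIT INTEGER, FK_PATIENT INTEGER, DATEA TEXT)"
)
conn.executemany(
    "INSERT INTO VISIT VALUES (?, ?, ?)",
    [(0, 1, "20160425"), (1, 2, "20160425"), (2, 3, "20160426")],
)

# Same query as above; ORDER BY is added only for a stable result order.
counts = conn.execute(
    "SELECT DATEA, COUNT(DATEA) AS numberOfDate "
    "FROM VISIT GROUP BY DATEA ORDER BY DATEA"
).fetchall()
conn.close()

for datea, number_of_date in counts:
    print(datea, number_of_date)
```

On these three rows, 20160425 is counted twice and 20160426 once.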
I would like to sort this data by descending number of events and then from the latest date, while keeping rows with the same ID grouped together.
I have tried proc sql:
proc sql;
create table new as
select *
from old
group by ID
order by events desc, date desc;
quit;
The result I currently get is
ID Date Events
1 09/10/2015 3
1 27/06/2014 3
1 03/01/2014 3
2 09/11/2015 2
3 01/01/2015 2
2 16/10/2014 2
3 08/12/2013 2
4 08/10/2015 1
5 09/11/2014 1
6 02/02/2013 1
Although the dates and events are sorted in descending order, the IDs with multiple events are no longer grouped together.
Would it be possible to achieve the below in fewer steps?
ID Date Events
1 09/10/2015 3
1 27/06/2014 3
1 03/01/2014 3
3 01/01/2015 2
3 08/12/2013 2
2 09/11/2015 2
2 16/10/2014 2
4 08/10/2015 1
5 09/11/2014 1
6 02/02/2013 1
Thanks
It looks to me like you're trying to sort by descending event, then by either the earliest or latest date (I can't tell which one from your explanation), also descending, and then by id. In your proc sql query, you could try calculating the min or max of the Date variable, grouped by event and id, and then sort the result by descending event, the descending min/max of the date, and id.
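The idea can be sketched in Python on the rows from the question. This is not SAS, only an illustration of the sort key: it assumes the dates are dd/mm/yyyy, and it picks the latest date per ID as the group key, which is one of the two options mentioned above. With that choice, ID 2 sorts ahead of ID 3, whereas the desired output in the question lists them the other way round, so the key may need adjusting.

```python
from datetime import datetime

# Rows from the question: (ID, Date as dd/mm/yyyy, Events)
rows = [
    (1, "09/10/2015", 3),
    (1, "27/06/2014", 3),
    (1, "03/01/2014", 3),
    (2, "09/11/2015", 2),
    (3, "01/01/2015", 2),
    (2, "16/10/2014", 2),
    (3, "08/12/2013", 2),
    (4, "08/10/2015", 1),
    (5, "09/11/2014", 1),
    (6, "02/02/2013", 1),
]


def parse(date_text):
    """Parse a dd/mm/yyyy string (assumed format) into a datetime."""
    return datetime.strptime(date_text, "%d/%m/%Y")


# Latest date per ID: the per-group value used for sorting (assumption).
latest = {}
for id_, date_text, _events in rows:
    d = parse(date_text)
    if id_ not in latest or d > latest[id_]:
        latest[id_] = d

# Sort by events desc, the ID's latest date desc, ID, then row date desc.
ordered = sorted(
    rows,
    key=lambda r: (
        -r[2],
        -latest[r[0]].toordinal(),
        r[0],
        -parse(r[1]).toordinal(),
    ),
)

for row in ordered:
    print(*row)
```

Because the ID is part of the sort key, rows with the same ID always stay adjacent.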