Hierarchical Update in Hive - hadoop

I have a Hive table as follows:
Table A
docid  corr_docid  header
100                a
101    100         b
102                c
105    101         d
106    102         e
107    106         f
108    107         g
109                h
Here corr_docid is the docid of the document that a row corrects; for example, docid 108 (corr_docid 107) corrects the document with docid 107.
Is it possible to create another table, Table B, that adds the docid of the latest correction in each chain as newdocid?
Table B
docid  corr_docid  header  newdocid
100                a       105
101    100         b       105
102                c       108
105    101         d       105
106    102         e       108
107    106         f       108
108    107         g       108
109                h       109
Is this possible in Hive?

You can try this plain SQL to get the desired result. It works only if you know the maximum hierarchy depth in advance, which is 4 here.
`select a.docid,
a.corr_docid,
a.header,
case when b.docid is null then a.docid
when c.docid is null then b.docid
when d.docid is null then c.docid
else d.docid
end newdocid
from Table_A a left join Table_A b on a.docid = b.corr_docid
left join Table_A c on b.docid = c.corr_docid
left join Table_A d on c.docid = d.corr_docid ;`
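If the depth is not known in advance, the same chain-following logic can be sketched outside Hive. The Python below is only an illustration, not Hive code: the function name `resolve_latest` is made up, and it assumes each document is corrected by at most one other document.

```python
def resolve_latest(rows):
    """Map every docid to the docid at the end of its correction chain.

    rows: list of (docid, corr_docid) pairs; corr_docid is None when the
    document corrects nothing. Assumes at most one correction per document.
    """
    # corrected_by[x] = docid of the document that corrects x
    corrected_by = {corr: doc for doc, corr in rows if corr is not None}
    latest = {}
    for doc, _ in rows:
        current = doc
        seen = {current}
        # Follow the chain until no further correction exists.
        while current in corrected_by:
            current = corrected_by[current]
            if current in seen:  # guard against cyclic data
                raise ValueError(f"cycle in correction chain at {current}")
            seen.add(current)
        latest[doc] = current
    return latest


# Table A from the question (header column omitted)
table_a = [(100, None), (101, 100), (102, None), (105, 101),
           (106, 102), (107, 106), (108, 107), (109, None)]
newdocid = resolve_latest(table_a)
```

Unlike the fixed chain of left joins, the loop keeps going for as many levels as the data contains.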

Related

Google sheet query 2 columns as search key and search

I have a problem with a Google Sheets function.
I have 2 tables. I want to look up each Date+User pair from table 1 as a key in table 2.
Example:
Table 1:
Date User Unit
2022/05/30 A 109
2022/05/30 B 119
2022/05/30 C 119
2022/05/29 D 109
2022/05/29 E 114
Table 2:
Date User Amount
2022/05/30 A 1
2022/05/30 B 2
2022/05/30 C 3
2022/05/30 D 41
2022/05/30 E 5
2022/05/29 D 6
2022/05/29 E 7
2022/05/29 F 81
2022/05/29 G 9
2022/05/29 A 101
2022/05/29 B 11
2022/05/29 C 121
2022/05/29 D 13
After the query, I would like the table to look like this:
Desired result
Date User Unit Amount
2022/05/30 A 109 1
2022/05/30 B 119 2
2022/05/30 C 119 3
2022/05/29 D 109 6
2022/05/29 E 114 7
Here is a sample Google Sheet:
https://docs.google.com/spreadsheets/d/1oxhWMVPt-GziG10agob-xbiNYfKrZVFK9ro0Pj7tn6Y/edit#gid=0
Could anyone help?
Many thanks.
Two options. The first pulls all matching combinations of Date and User:
=ARRAYFORMULA(
QUERY(
{E2:G,
IF(ISBLANK(E2:E),,
IFERROR(
VLOOKUP(
E2:E&"|"&F2:F,
{A2:A&"|"&B2:B,C2:C},
2,FALSE)))},
"select Col1, Col2, Col4, Col3
where Col4 is not null
label
Col1 'Date',
Col2 'User',
Col3 'Amount',
Col4 'Unit'"))
which returns
Date User Unit Amount
2022/05/30 A 109 1
2022/05/30 B 119 2
2022/05/30 C 119 3
2022/05/29 D 109 6
2022/05/29 E 114 7
2022/05/29 D 109 13
The second matches your desired output exactly, but omits the second D value for the 29th (13):
=ARRAYFORMULA(
QUERY(
{IFERROR(
VLOOKUP(
UNIQUE(E2:E&"|"&F2:F),
{E2:E&"|"&F2:F,E2:G},
{2,3,4},FALSE)),
IFERROR(
VLOOKUP(
UNIQUE(E2:E&"|"&F2:F),
{A2:A&"|"&B2:B,C2:C},
2,FALSE))},
"where Col4 is not null
format Col1 'yyyy/mm/dd'"))
Both have been added to your sheet. If either of these works for you, I can break it down.
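To show what the two formulas are doing, here is a rough Python sketch of the same composite-key lookup. It is not Sheets code; the variable names are made up, and it assumes each Date+User pair appears only once in table 1.

```python
# Table 1 (columns A:C in the formulas): (date, user, unit)
table1 = [
    ("2022/05/30", "A", 109), ("2022/05/30", "B", 119),
    ("2022/05/30", "C", 119), ("2022/05/29", "D", 109),
    ("2022/05/29", "E", 114),
]
# Table 2 (columns E:G in the formulas): (date, user, amount)
table2 = [
    ("2022/05/30", "A", 1), ("2022/05/30", "B", 2), ("2022/05/30", "C", 3),
    ("2022/05/30", "D", 41), ("2022/05/30", "E", 5), ("2022/05/29", "D", 6),
    ("2022/05/29", "E", 7), ("2022/05/29", "F", 81), ("2022/05/29", "G", 9),
    ("2022/05/29", "A", 101), ("2022/05/29", "B", 11),
    ("2022/05/29", "C", 121), ("2022/05/29", "D", 13),
]

# Composite key (date, user) -> unit, like the Date&"|"&User helper column
unit_by_key = {(d, u): unit for d, u, unit in table1}

# Option 1: every row of table 2 whose key exists in table 1
all_matches = [(d, u, unit_by_key[(d, u)], amount)
               for d, u, amount in table2 if (d, u) in unit_by_key]

# Option 2: only the first row of table 2 for each key
first_matches = []
seen = set()
for d, u, amount in table2:
    key = (d, u)
    if key in unit_by_key and key not in seen:
        seen.add(key)
        first_matches.append((d, u, unit_by_key[key], amount))
```

Option 1 keeps both D rows for the 29th (amounts 6 and 13); option 2 keeps only the first one.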

Sum data in one column in a specific order in Spotfire

Does anyone know how to create a calculated column (in Spotfire) that will sum data in order of increasing values contained within another column?
For example, what would the expression be to sum the data in [P] in increasing order of [K], for each [Well]?
Some example data:
Well Depth P K
A 85 0.191 108
A 85.5 0.192 102
A 87 0.17 49
A 88 0.184 47
A 89 0.192 50
B 298 0.215 177
B 298.5 0.2 177
B 300 0.017 105
B 301 0.23 200
You can use:
Sum([P]) OVER (intersect([Well],AllPrevious([K])))
This returns the cumulative sum of P per Well, in ascending order of K. Rows with the same K (the two B rows with K = 177) get the same total:
Well K P Cumulative Sum of P
A 47 0,184 0,184
A 49 0,17 0,354
A 50 0,192 0,546
A 102 0,192 0,738
A 108 0,191 0,929
B 105 0,017 0,017
B 177 0,215 0,432
B 177 0,2 0,432
B 200 0,23 0,662
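For anyone who wants to check the numbers, the same calculation can be sketched in plain Python. This is not Spotfire code, just the logic as I read it from the output above: for each row, add up every P in the same Well whose K is less than or equal to the row's K. The function name `cumulative_p` is made up.

```python
from collections import defaultdict


def cumulative_p(rows):
    """rows: list of (well, k, p). Returns a list of (well, k, p, cum_p)."""
    # Total P per (well, K), so rows that tie on K are added together
    p_by_well_k = defaultdict(float)
    for well, k, p in rows:
        p_by_well_k[(well, k)] += p

    result = []
    for well, k, p in sorted(rows, key=lambda r: (r[0], r[1])):
        # Sum every P in the same well with K <= this row's K
        cum = sum(v for (w, kk), v in p_by_well_k.items()
                  if w == well and kk <= k)
        result.append((well, k, p, round(cum, 3)))
    return result


# Example data from the question (Depth column omitted)
data = [
    ("A", 108, 0.191), ("A", 102, 0.192), ("A", 49, 0.17),
    ("A", 47, 0.184), ("A", 50, 0.192),
    ("B", 177, 0.215), ("B", 177, 0.2), ("B", 105, 0.017), ("B", 200, 0.23),
]
totals = cumulative_p(data)
```

Replacing `kk <= k` with `kk >= k` gives the descending-order variant.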
Edit, based on the OP's comment:
To get the cumulative sum in descending order of K, you can use:
Sum([P]) OVER (intersect([Well],AllNext([K])))

Computing lag in Hive by a variable

My input table looks like:
guest_id days
101 79
101 70
101 68
101 61
102 101
102 90
102 55
103 99
103 90
Note that days are in descending order within each guest_id.
Desired output table:
guest_id days days_diff
101 79 0
101 70 9
101 68 2
101 61 7
102 101 0
102 90 11
102 55 35
103 99 0
103 90 9
days_diff is the first-order difference within each guest_id (not across the whole days column).
You need to have a unique id column as well (otherwise Hive doesn't know the order of your rows).
Then you can self join each row to the previous one (a.id = b.id + 1) to get your differences. The left join keeps the first row, which has no previous row:
select a.guest_id,
a.days,
case when a.guest_id = b.guest_id then b.days-a.days else 0 end days_diff
from
input a
left join input b on a.id=b.id+1
Edit: As pointed out by Kunal in the comments, Hive does have a LAG window function, which requires a PARTITION BY ... ORDER BY clause. You still need something to order your table by; for example, if you have a date column, you would use it as follows (subtracting the current value from the lagged one gives the difference, and COALESCE turns the first row of each guest into 0):
SELECT guest_id,
days,
COALESCE(LAG(days, 1) OVER (PARTITION BY guest_id ORDER BY date) - days, 0) AS days_diff
FROM input;
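To make the expected numbers easy to verify, here is a small Python sketch of the same "previous row within the same guest" logic. It is not Hive code; the function name `days_diff` is made up, and it assumes the rows are already in the order shown in the question.

```python
def days_diff(rows):
    """rows: list of (guest_id, days), already in the desired order.

    Returns (guest_id, days, diff), where diff is the previous row's days
    minus the current days within the same guest, and 0 for the first row
    of each guest.
    """
    out = []
    prev_guest, prev_days = None, None
    for guest, days in rows:
        diff = prev_days - days if guest == prev_guest else 0
        out.append((guest, days, diff))
        prev_guest, prev_days = guest, days
    return out


table = [(101, 79), (101, 70), (101, 68), (101, 61),
         (102, 101), (102, 90), (102, 55),
         (103, 99), (103, 90)]
result = days_diff(table)
```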

Merging two files and ordering them

I want to merge two files in one and order them based on the values of the second column. The example is the following:
File 1:
+ 1.01 id 120
- 1.20 id 145
+ 2.15 id 411
(continues)
File 2:
r 0.21 id 4
r 1.78 id 85
r 102 id 850
(continues)
I want to merge them in one file but I would like to put them in ascending order based on the column 2 like this:
File 3:
r 0.21 id 4
+ 1.01 id 120
- 1.20 id 145
r 1.78 id 85
+ 2.15 id 411
r 102 id 850
How could I do this?
How about:
sort -k2n file1 file2
For example, with f1 and f2 as your files:
kent$ sort -k2n f1 f2
r 0.21 id 4
+ 1.01 id 120
- 1.20 id 145
r 1.78 id 85
+ 2.15 id 411
r 102 id 850
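If you ever need the same thing from a script, a rough Python equivalent could look like the sketch below. It assumes each input is already sorted by column 2, as in the example, and the function names are made up. For real files you could pass the open file objects instead of lists.

```python
import heapq


def second_column(line):
    """Numeric sort key: the second whitespace-separated field."""
    return float(line.split()[1])


def merge_by_second_column(lines1, lines2):
    """Merge two inputs that are each already sorted by column 2."""
    return list(heapq.merge(lines1, lines2, key=second_column))


file1 = ["+ 1.01 id 120", "- 1.20 id 145", "+ 2.15 id 411"]
file2 = ["r 0.21 id 4", "r 1.78 id 85", "r 102 id 850"]
merged = merge_by_second_column(file1, file2)
```

If the inputs are not pre-sorted, `sorted(file1 + file2, key=second_column)` does the job at the cost of a full sort.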

Getting the first occurrence of group of rows

I have a case where each transaction consists of several steps. I want to return the first occurrence of each transaction, for example:
Trn Loc step
111 0 1
111 0 2
111 0 3
222 3 1
222 3 2
333 5 1
333 5 2
333 5 3
and I want to get this result:
Trn Loc
111 0
222 3
333 5
I think this can be done with a partition (window) function, but I don't know how. Any help, please?
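select t.trn, t.loc
from (select trn, loc, ROW_NUMBER() OVER (PARTITION BY trn, loc ORDER BY step) as rnum
from table ) t
where t.rnum = 1
Alternatively, you can use RANK() instead of ROW_NUMBER() with the same syntax, provided step is unique within each transaction; rows that tie on the ORDER BY column all receive rank 1 and would all be returned.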
http://www.techonthenet.com/oracle/functions/rank.php
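In case the window-function logic is unclear, here is the same "keep the row with the lowest step per group" idea as a plain Python sketch. It is not SQL, and the function name `first_per_transaction` is made up.

```python
def first_per_transaction(rows):
    """rows: list of (trn, loc, step). Returns one (trn, loc) per group."""
    first = {}
    for trn, loc, step in rows:
        key = (trn, loc)
        # Keep the row with the smallest step, like rnum = 1 ordered by step
        if key not in first or step < first[key][2]:
            first[key] = (trn, loc, step)
    # Dicts preserve insertion order, so groups come out as first seen
    return [(trn, loc) for trn, loc, _ in first.values()]


rows = [(111, 0, 1), (111, 0, 2), (111, 0, 3),
        (222, 3, 1), (222, 3, 2),
        (333, 5, 1), (333, 5, 2), (333, 5, 3)]
result = first_per_transaction(rows)
```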
