Collect to a Map in Hive - hadoop

I have a Hive table such as
id | value
----------
A  | 1
A  | 2
B  | 3
A  | 4
B  | 5
Essentially, I want to mimic Python's defaultdict(list) and create a map with id as the keys and value as the values.
Query:
select COLLECT_TO_A_MAP(id, value)
from table
Output:
{A:[1,2,4], B:[3,5]}
I tried using Klout's CollectUDAF(), but it appears it will not append the values to an array; it just updates them. Any ideas?
EDIT:
Here is a more detailed description, so I can avoid answers suggesting that I try functions from the Hive documentation. Suppose I have a table
num | id | value
----------------
1   | A  | 1
1   | A  | 2
1   | B  | 3
2   | A  | 4
2   | B  | 5
2   | B  | 6
What I am looking for is a UDAF that produces this output
num | new_map
------------------------
1   | {A:[1,2], B:[3]}
2   | {A:[4], B:[5,6]}
for this query:
select num
,COLLECT_TO_A_MAP(id, value) as new_map
from table
group by num
There is a workaround to achieve this: it can be mimicked by using Klout's CollectUDAF() (referenced above) in a query such as
add jar ~/brickhouse/target/brickhouse-0.6.0.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select num
      ,collect(id_array, value_array) as new_map
from (
    select collect_list(id) as id_array
          ,collect_list(value) as value_array
          ,num
    from table
    group by num
) A
group by num
However, I would rather not write a nested query.
EDIT #2
(As referenced in my original question) I have already tried using Klout's CollectUDAF(), even in the form where you pass it two parameters and it creates a map. The output from that (if applied to the dataset in my 1st edit) is
1 {A:2, B:3}
2 {A:4, B:6}
As stated in my original question, it doesn't collect the values into an array; it just keeps the last one (or updates it).

Use the collect UDF in Brickhouse (http://github.com/klout/brickhouse).
It is exactly what you need. Brickhouse's 'collect' returns a list if one parameter is used, and a map if two parameters are used.
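For example, a sketch against the tables above, assuming the Brickhouse jar has already been added and the function registered as collect:
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

-- one parameter: collects the values into a list per group
select id, collect(value) from table group by id;

-- two parameters: collects key/value pairs into a map per group
select num, collect(id, value) from table group by num;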

The CollectUDAF in Brickhouse (http://github.com/klout/brickhouse) will get you there.
Regarding your comment in EDIT #2:
First collect the values to a list, then collect the (k, v) pairs to a map:
select num
      ,collectUDAF(id, values) as new_map
from (
    select num
          ,id
          ,collect_set(value) as values
    from tbl
    group by num, id
) as sub
group by num
will return
num | new_map
------------------------
1   | {A:[1,2], B:[3]}
2   | {A:[4], B:[5,6]}

If you don't care about the order in which the values appear, you could use the collect_set() UDAF that comes with Hive.
SELECT id, collect_set(value) FROM table GROUP BY id;
This should solve your issue.
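Note that collect_set() also removes duplicate values. If you need duplicates (and arrival order) preserved, Hive 0.13 and later also ship collect_list(), used the same way:
SELECT id, collect_list(value) FROM table GROUP BY id;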

Your current query groups by num in both the inner and outer query; you need to group by num and id in the inner query to accomplish what you're trying to do.
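A sketch of that fix applied to the nested query, collect being the Brickhouse function registered above:
select num
      ,collect(id, values) as new_map
from (
    select num
          ,id
          ,collect_list(value) as values
    from table
    group by num, id
) A
group by num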

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java#L55
See the Brickhouse UDAF: when more than one argument is passed, the MapCollectUDAFEvaluator is used.
add jar */brickhouse.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select collect(a, b)
from (
    select 1232123 a, 21 b
    union all
    select 123 a, 23 b
) a;

Result: {1232123:21, 123:23}

Related

Input list to Snowflake SQL udf

I have created a Snowflake SQL UDF that I call with the following code:
select *
from table(drill_top_down('12345','XXX')) order by depth, path;
If I need to run the query for multiple items, is it then possible to input a list or similar to the UDF, and then loop through my input list?
Or can I somehow call my function in a smarter way, so I can get the result from multiple inputs?
You can provide a Snowflake Array, Object or Variant with your argument sets nested within, and use that as input to the table function.
Adapting your example, using array_construct to provide two sets of arguments, the input would look something like:
select *
from table(drill_top_down(
    array_construct(
        array_construct('12345','XXX'),
        array_construct('67890','YYY')
    )::array));
Or, my preference, use parse_json, as I find it easier to read:
select *
from table(drill_top_down(parse_json('
    [ ["12345","XXX"],
      ["67890","YYY"] ]')::array));
You will need to adapt your Table Function to unpack the Argument Sets using a common-table-expression (CTE) to tabularise the input arguments and then unnest them with Lateral Flatten.
Here's a trivial example:
CREATE OR REPLACE FUNCTION array_concat ( arr array)
RETURNS TABLE ( concatenated_string varchar )
AS
$$
With a as (Select arr)
Select listagg(value)
From a, table(flatten(input => arr))
$$
;
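It is invoked the same way as the larger example below, e.g. (input values are illustrative):
select * from table(array_concat(parse_json('["a","b","c"]')::array));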
Here is a slightly more sophisticated example that performs an operation with each argument set, using row_number() to group them.
CREATE OR REPLACE FUNCTION array_calcs ( arg_list array)
RETURNS TABLE
( arg_id integer,
array_sz integer,
array_sum integer,
array_mean decimal(12,2) )
AS
$$
With
-- CTE containing the ARGS
arg_input as (select arg_list),
-- CTE un-nest (flatten) first level of args list to each args set
arg_sets as
(Select row_number() over (order by NULL desc) as arg_id, value as arg_set
From arg_input, lateral flatten(input => arg_list))
-- Do something with the Args. e.g. Perform some calculations with the Input arguments
Select arg_id , count(*) array_sz, sum(value)::integer array_sum, array_sum/array_sz::decimal(12,2) array_mean
From arg_sets, table(flatten(input => arg_set))
Where is_decimal( value ) or is_integer( value ) or is_double( value ) -- filter out non-numeric arguments i.e. validate inputs
Group By arg_id
$$;
This works if we provide the following input arguments
Select * from table(array_calcs(parse_json('[ [1],
[1,2],
[1,2,3],
[1,2,3,4],
["A","B"],
["A",1]
]')::array));
Producing the following:
ARG_ID | ARRAY_SZ | ARRAY_SUM | ARRAY_MEAN
1      | 1        | 1         | 1.0
2      | 2        | 3         | 1.5
3      | 3        | 6         | 2.0
4      | 4        | 10        | 2.5
6      | 1        | 1         | 1.0
But a word of caution: if your aim is to build your arguments directly from your data, rather than hard-coding them in the function call, you are more than likely to run into this issue:
Create or replace View V_array_calcs_input as
Select parse_json($1)::array arg_list
from (values ('[[1],[1,2],[1,2,3],[1,2,3,4],["A","B"],["A",1]]'));
Select *
from V_array_calcs_input,
table(array_calcs(arg_list));
SQL compilation error: Unsupported subquery type cannot be evaluated
A Stored Procedure or a JavaScript UDF/UDTF may be a better option for resolving this, if you can build the functional logic you need in either of those.
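For illustration, here is a minimal sketch of a JavaScript UDTF that unpacks the same nested argument sets and sums each one; the function name and the summing logic are mine, not from the original post:
CREATE OR REPLACE FUNCTION array_calcs_js(ARG_LIST array)
RETURNS TABLE (ARG_ID float, ARRAY_SUM float)
LANGUAGE JAVASCRIPT
AS $$
{
  processRow: function (row, rowWriter, context) {
    // ARG_LIST arrives as a native JavaScript array of argument sets
    var sets = row.ARG_LIST;
    for (var i = 0; i < sets.length; i++) {
      var sum = 0;
      for (var j = 0; j < sets[i].length; j++) {
        var v = Number(sets[i][j]);
        if (!isNaN(v)) { sum += v; }  // skip non-numeric arguments
      }
      rowWriter.writeRow({ARG_ID: i + 1, ARRAY_SUM: sum});
    }
  }
}
$$;

select * from table(array_calcs_js(parse_json('[[1],[1,2,3]]')::array));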

Oracle: How to check whether the given array exists in the table?

I want to check whether both of the values (4690, 4693) exist in the CONTEXTID column, without using functions, as the table contains more than a million records.
Table structure:
ID | CONTEXTID
4  | 4690
5  | 4690
6  | 4693
7  | 4693
8  | 4690
What about this query?
select
case when count(distinct CONTEXTID) = 2 then 'Y' else 'N' end as contains_4690_4693
from tab
where CONTEXTID in (4690, 4693)
It returns Y if both keys are in the table at least once, N otherwise.
If you just want to find out if they exist then
SELECT DISTINCT(CONTEXTID)
FROM SOME_TABLE
WHERE CONTEXTID IN (4690, 4693)
will do it. If CONTEXTID isn't indexed, though, the database will have to do a full table scan which will probably be slow.
Takeaway: add an index on CONTEXTID or live with the fact that it's going to be slow.
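For example (the index name is illustrative):
CREATE INDEX some_table_contextid_ix ON SOME_TABLE (CONTEXTID);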
If the "test" values are known when you write the query (as they very rarely are - even though all the solutions presented so far make the implicit assumption that they are), you could do something like this - which is probably the most efficient way, regardless of whether there is an index on the relevant column or not:
select case when exists
            ( select *
              from sys.odcinumberlist(4690, 4693)
              where column_value not in ( select contextid
                                          from the_table
                                          where contextid is not null
                                        )
            ) then 'Not all found' else 'All found' end as result
from dual
;
Note how I gave an array of input values to the query - I used the sys.odcinumberlist constructor. You will have to clarify how you plan to "input" an array of "test" values.

Is it possible to disaggregate in hive sql? [duplicate]

I'm trying to find a way to split a row in Hive into multiple rows based on a delimited column. For instance taking a result set:
ID1 Subs
1 1, 2
2 2, 3
And returning:
ID1 Subs
1 1
1 2
2 2
2 3
I've found some road signs at http://osdir.com/ml/hive-user-hadoop-apache/2009-09/msg00092.html; however, I wasn't able to find enough detail to point me in the direction of a solution, and I don't know how I would set up the transform function to return an object that would split the rows.
Try this wording
SELECT ID1, Sub
FROM tableName lateral view explode(split(Subs,',')) Subs AS Sub
SELECT ID1, new_Subs_clmn
FROM tableName lateral view explode(split(Subs,',')) Subs AS new_Subs_clmn;
I was initially confused by the names used, so I am sharing the above query in case it helps.
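One caveat: with values like "1, 2", split on ',' leaves a leading space on each piece, so it is worth wrapping the exploded column in trim() (a sketch using the same names as above):
SELECT ID1, trim(Sub) AS Sub
FROM tableName lateral view explode(split(Subs, ',')) Subs AS Sub;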

Duplicate removal from a table using Informatica

I have a scenario to be implemented in Informatica where I need to remove duplicate records from a table based on the PK, keeping the 1st occurrence of each PK value and removing the others (in case of duplicate PKs).
For example, if my source has 1,1,1,2,3,3,4,5,4, I want to see my target data as 1,2,3,4,5. I have to read data from the same table and load it back into the same table; no new table can be introduced. Please help me with your inputs.
Thanks in advance!
I suppose you want the first occurrence because there are other (data) columns in addition to the key you entered. Therefore you want
1,b
1,c
1,a
2,d
3,c
3,d
4,e
5,f
4,b
Turned into
1,b
2,d
3,c
4,e
5,f
??
In that case try this mapping layout:
SRC -> SQ -> SRT -> AGG -> TGT
SEQ /
Where the sorter is set to sort on the KEY, sequence_port (desc),
and the aggregator is set to group by the KEY; the sequence_port should not go out of the sorter.
Hope you can follow me :)
There are multiple ways to do this; the simplest would be to do it in the SQL override.
Assuming the example quoted above, the SQL would be like this. The general idea is to assign a row number within each primary key (so if you have 3 rows with the same PK, they get row numbers 1, 2, 3 before the counter resets for the next PK).
SQL:
select * from (
    select primary_key, column2,
           row_number() over (partition by primary_key order by primary_key) as distinct_key
    from source_table  -- your source table
) where distinct_key = 1
Before:
1,b
1,c
1,a
2,d
3,c
3,d
Intermediate query:
1,c,1
1,a,2
2,d,1
3,c,1
3,d,2
output:
1,c
2,d
3,d
I am able to achieve this by following the below steps.
1. Passing sorted data (keys are EMP_ID, MOBILE, DEPTID) to an expression.
2. Creating the following variable ports in the expression and getting the counts.
V_CURR_EMP_ID = EMP_ID
V_CURR_MOBILE = MOBILE
V_CURR_DEPTID = DEPTID
V_COUNT =
IIF(V_CURR_EMP_ID=V_PREV_EMP_ID AND V_CURR_MOBILE=V_PREV_MOBILE AND V_CURR_DEPTID=V_PREV_DEPTID ,V_COUNT+1,1)
V_PREV_EMP_ID = EMP_ID
V_PREV_MOBILE = MOBILE
V_PREV_DEPTID = DEPTID
O_COUNT =V_COUNT
3. In the next transformation, which is a filter, I take only the records with a count greater than 1 and delete them using an update strategy (DD_DELETE).
Here is the mapping flow.
SQ->SRTR->EXP->FIL->UPD->TGT
Also, when I tried to delete them using an aggregator, it deleted only the first occurrence of the duplicates, not all of them.
Thanks again for your inputs!

Oracle aggregate function to return a random value for a group?

The standard SQL aggregate function max() will return the highest value in a group; min() will return the lowest.
Is there an aggregate function in Oracle to return a random value from a group? Or some technique to achieve this?
E.g., given the table foo:
group_id | value
1        | 1
1        | 5
1        | 9
2        | 2
2        | 4
2        | 8
The SQL query
select group_id, max(value), min(value), some_aggregate_random_func(value)
from foo
group by group_id;
might produce:
group_id | max(value) | min(value) | some_aggregate_random_func(value)
1        | 9          | 1          | 1
2        | 8          | 2          | 4
with, obviously, the last column being any random value in that group.
You can try something like the following
select deptno, max(sal), min(sal), max(rand_sal)
from (
    select deptno, sal,
           first_value(sal) over (partition by deptno order by dbms_random.value) rand_sal
    from emp
)
group by deptno
/
The idea is to sort the values within each group in random order and pick the first. I can think of other ways, but none as efficient.
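Another technique that avoids the subquery entirely is Oracle's FIRST/LAST ("KEEP") aggregate, ordered by a random value; a sketch against the foo table from the question:
select group_id, max(value), min(value),
       min(value) keep (dense_rank first order by dbms_random.value) as rand_value
from foo
group by group_id;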
You might prepend a random string to the column you want to extract the random element from, and then select the min() element of the column and take out the prepended string.
select group_id, max(value), min(value), substr(min(random_value), 11)
from (select dbms_random.string('A', 10) || value random_value, foo.* from foo)
group by group_id
In this way you can avoid using the analytic function and specifying the grouping columns twice, which might be useful in a scenario where your query is very complicated, or you are just exploring the data and entering queries by hand with a lengthy and changing list of group by columns.
