Is it possible to disaggregate in hive sql? [duplicate] - hadoop

I'm trying to find a way to split a row in Hive into multiple rows based on a delimited column. For instance taking a result set:
ID1 Subs
1 1, 2
2 2, 3
And returning:
ID1 Subs
1 1
1 2
2 2
2 3
I've found some road signs at http://osdir.com/ml/hive-user-hadoop-apache/2009-09/msg00092.html, however I wasn't able enough detail to point me in the direction of a solution, and I don't know how I would set up the transform function to return an object that would split the rows.

Try this wording
SELECT ID1, Sub
FROM tableName lateral view explode(split(Subs,',')) Subs AS Sub

SELECT ID1, new_Subs_clmn
FROM tableName lateral view explode(split(Subs,',')) Subs AS new_Sub_clmn;
I was initially confused with the names used, sharing the above query thinking it would be of help.

Related

Count Length and then Count those records.

I am trying to create a view that displays size (char) of LastName and the total number of records whose last name has that size. So far I have:
SELECT LENGTH(LastName) AS Name_Size
FROM Table
ORDER BY Name_Size;
I need to add something like
COUNT(LENGTH(LastName)) AS Students
This is giving me an error. Do I need to add a GROUP BY command? I need the view:
Name_Size Students
3 11
4 24
5 42
SELECT LENGTH(LastName) as Name_Size, COUNT(*) as Students
FROM Table
GROUP BY Name_Size
ORDER BY Name_Size;
You may have to change the group by and order by to LENGTH(LastName) as not all SQL engines let you reference an alias from the select statement in a clause on that same statement.
HTH,
Eric

Comparing Similar Hive Tables

I have two hive tables (t1 and t2) that I would like to compare. The second table has 5 additional columns that are not in the first table. Other than the five disjoint fields, the two tables should be identical. I am trying to write a query to check this. Here is what I have so far:
SELECT * FROM t1
UNION ALL
select * from t2
GROUP BY some_value
HAVING count(*) == 2
If the tables are identical, this should return 0 records. However, since the second table contains 5 extra fields, I need to change the second select statement to reflect this. There are almost 60 column names so I would really hate to write it like this:
SELECT * FROM t1
UNION ALL
select field1, field2, field3,...,fieldn from t2
GROUP BY some_value
HAVING count(*) == 2
I have looked around and I know there is no select * EXCEPT syntax, but is there a way to do this query without having to explicity name each column that I want included in the final result?
You should have used UNION DISTINCT for the logic you are applying.
However, the number and names of columns returned by each select_statement have to be the same otherwise a schema error is thrown.
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq
To skip the 5 extra fields, you could use the "--ignore-columns" option.

Collect to a Map in Hive

I have a Hive table such as
id | value
-------------
A 1
A 2
B 3
A 4
B 5
Essentially, I want to mimic Python's defaultdict(list) and create a map with id as the keys and value as the values.
Query:
select COLLECT_TO_A_MAP(id, value)
from table
Output:
{A:[1,2,4], B:[3,5]}
I tried using klout's CollectUDAF() but it appears this will not append the values to an array, it will just update them. Any ideas?
EDIT:
Here is a more detailed description so I can avoid answers referencing that I try functions in the Hive documentation. Suppose I have a table
num |id |value
____________________
1 A 1
1 A 2
1 B 3
2 A 4
2 B 5
2 B 6
What I am looking for is for a UDAF that provides this output
num |new_map
________________________
1 {A:[1,2], B:[3]}
2 {A:[4], B:[5,6]}
To this query
select num
,COLLECT_TO_A_MAP(id, value) as new_map
from table
group by num
There is a workaround to achieve this. It can be mimicked by using Klout's (see above referenced UDAF) CollectUDAF() in a query such as
add jar '~/brickhouse/target/brickhouse-0.6.0.jar'
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select num
,collect(id_array, value_array) as new_map
from (
select collect_list(id) as id_array
,collect_list(value) as value_array
,num
from table
group by num
) A
group by num
However, I would rather not write a nested query.
EDIT #2
(As referenced in my original question) I have already tried using Klout's CollectUDAF(), even in the instance where you pass it two parameter and it creates a map. The output from that is (if applied to the dataset in my 1st edit)
1 {A:2, B:3}
2 {A:4, B:6}
As stated in my original question, it doesn't collect the values to an array it just collects the last one (or updates the array).
Use the collect UDF in Brickhouse (http://github.com/klout/brickhouse )
It is exactly what you need. Brickhouse's 'collect' returns a list if one parameter is used, and a map if two parameters are used.
the CollectUDAF in Brickhouse (http://github.com/klout/brickhouse ) will get you there.
regarding your comment EDIT #2:
first, collect the values to a list, then collect the k,v pairs to a map:
select
num,
collectUDAF(id, values) as new_map
from
(
SELECT
num,
id,
collect_set(value) as values
FROM
tbl
GROUP BY
num,
id
) as sub
GROUP BY
num
will return
num | new_map
________________________
1 {A:[1,2], B:[3]}
2 {A:[4], B:[5,6]}
If you don't care about the order in which the values appear, you could use the collect_set() UDAF that comes with Hive.
SELECT id, collect_set(value) FROM table GROUP BY id;
This should solve your issue.
Your current query groups by num in both the inner and outer query -- you need to group by id in the inner query to accomplish what you're trying to do.
https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java#L55
see brickhouse udaf,when args num larger than 1, MapCollectUDAFEvaluator would be used.
add jar */brickhouse.jar ;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select
collect(a,b)
from( select 1232123 a,21 b
union all select 123 a,23 b)a;
result:{1232123:21,123:23}

Convert Character to Number in Oracle

I have ColumnA in table. The data of each row is single character between A & H.
I want my select query to return 1 for 'A', 2 for B .... 8 for H.
My query always returns only one row. I can make a lookup table.
Anyone has better ideas to achieve the same ?
SELECT 1 + ASCII(columnA) - ASCII('A')
FROM table

Oracle aggregate function to return a random value for a group?

The standard SQL aggregate function max() will return the highest value in a group; min() will return the lowest.
Is there an aggregate function in Oracle to return a random value from a group? Or some technique to achieve this?
E.g., given the table foo:
group_id value
1 1
1 5
1 9
2 2
2 4
2 8
The SQL query
select group_id, max(value), min(value), some_aggregate_random_func(value)
from foo
group by group_id;
might produce:
group_id max(value), min(value), some_aggregate_random_func(value)
1 9 1 1
2 8 2 4
with, obviously, the last column being any random value in that group.
You can try something like the following
select deptno,max(sal),min(sal),max(rand_sal)
from(
select deptno,sal,first_value(sal)
over(partition by deptno order by dbms_random.value) rand_sal
from emp)
group by deptno
/
The idea is to sort the values within group in random order and pick the first.I can think of other ways but none so efficient.
You might prepend a random string to the column you want to extract the random element from, and then select the min() element of the column and take out the prepended string.
select group_id, max(value), min(value), substr(min(random_value),11)
from (select dbms_random.string('A', 10)||value random_value,foo.* from foo)
In this way you cand avoid using the aggregate function and specifying twice the group by, which might be useful in a scenario where your query is very complicated / or you are just exploring the data and are entering manually queries with a lengthy and changing list of group by columns.

Resources