Populate the columns of a Hive SQL query with results of another query - hadoop

Apache Hive (version 1.2.1000.2.6.5.0-292)
I have a table, A, that has a large number of columns. I'm trying to select only the columns I need from A, and the names of the columns I want live in a key-value pair table, B (example below). I can query B to get the column names I need, but I'm struggling to use the output of that query as the column list of the query against A. Is there a way to do this in one SQL query? I could write a Python program to generate the SQL, but I'd rather have it in just one query to keep things simple for the end user.
DDL for tables
create table A (
a1 string,
a2 string,
a3 string,
b1 string,
b2 string,
b3 string
)
create table B (
key string,
value string
)
Data in table B (key-value table). Note that the values in the value column cannot be inferred from the corresponding entry in the key column; I have written them as a1, a2, ... only for simplicity.
key,value
a,a1
a,a2
a,a3
b,b1
b,b2
b,b3
Query to get the correct columns: select value from B where key='a'
When you splice the results of this query into the query against table A, you should get this SQL statement:
select a1,a2,a3 from A
As you can see, we are trying to derive the column list used in the query against table A.
My first attempt doesn't work:
select
(select value from B where key='a')
from A
What's the right way to do this?
Thanks in advance!

You can generate the query and write it to a file. Once that's done, you can run it from an existing Hive HQL session using the source command.
Here are sample queries based on your example.
Create tables and dummy data:
CREATE EXTERNAL TABLE IF NOT EXISTS a_table(
a1 string,
a2 string,
a3 string,
b1 string,
b2 string,
b3 string)
LOCATION '/user/xyz/a_table';
insert into table a_table
VALUES ('a11', 'a12', 'a13','b11','b12','b13'), ('a21', 'a22', 'a23','b21','b22','b23');
CREATE EXTERNAL TABLE IF NOT EXISTS b_table (
key string,
value string
)
LOCATION '/user/xyz/b_table';
insert into table b_table
VALUES ('a', 'a1'), ('a','a2'),('a','a3'), ('b', 'b1'), ('b','b2'),('b','b3');
Validate the data in the tables:
select * from a_table;
OK
a11 a12 a13 b11 b12 b13
a21 a22 a23 b21 b22 b23
Time taken: 0.124 seconds, Fetched: 2 row(s)
select * from b_table;
OK
a a1
a a2
a a3
b b1
b b2
b b3
Time taken: 0.15 seconds, Fetched: 6 row(s)
This is the Hive HQL part that generates the statement for a given key and then uses source to run the generated query:
insert overwrite local directory '/home/xyz/temp_hql/out'
select concat_ws(" ", "select",concat_ws("," , collect_list(value)), "from a_table")
from b_table where key = 'a';
source /home/xyz/temp_hql/out/000000_0;
OK
a11 a12 a13
a21 a22 a23
insert overwrite local directory '/home/xyz/temp_hql/out'
select concat_ws(" ", "select",concat_ws("," , collect_list(value)), "from a_table")
from b_table where key = 'b';
source /home/xyz/temp_hql/out/000000_0;
OK
b11 b12 b13
b21 b22 b23
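For reference, each insert overwrite writes a single line of HiveQL into the output file. With the sample data above, the generated file for key 'a' would contain something like this (the column order depends on collect_list, which does not guarantee ordering):
select a1,a2,a3 from a_table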

Related

comparing values from two oracle table columns and getting starting value

I have an Oracle table with data as shown below.
column1 column2
A1 B1
B1 C1
C1 D1
I need to get the A1 value from D1. I have to implement this in a view. I need to traverse using D1 as input: get C1 from D1, get B1 from C1, and finally get A1 from B1.
Please help.
Not sure what kind of view you are looking to create; if you are thinking of passing a value as an input to the view, there is no such thing in Oracle as far as I know; you would need to use a cursor for that.
The following SELECT statement can be used as an inline view (a subquery), and you can turn the 'D1' value at the end of it into a bind variable if needed.
select column1
from test_data
where connect_by_isleaf = 1
connect by column2 = prior column1
start with column2 = 'D1'
;
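If you do want this packaged as a reusable view, here is a hedged sketch (the view name lineage_v is illustrative): drop the START WITH clause so that every row acts as a root, expose the starting value with the connect_by_root operator, and let callers filter on it.
-- sketch: view name lineage_v is illustrative, built over test_data from above
create or replace view lineage_v as
select connect_by_root column2 as start_value,
column1 as origin_value
from test_data
where connect_by_isleaf = 1
connect by column2 = prior column1;
-- usage: returns A1 for input D1
select origin_value from lineage_v where start_value = 'D1';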

Update Column of a Hive Table without using Sub query

This is a question about updating a new column in a Hive table. Since I think Hive does not allow updating a column of an existing table using subqueries, I wanted to ask what would be the best way to achieve the following update operation.
I have the following two example tables:
Table A:
KeyId ValId Val
W1 V1 10
W2 V2 20
Table B:
KeyId ValId Val
W1 V1 10
W1 V1 30
W1 V3 40
W1 V4 50
W2 V2 0
W2 V2 50
W2 V2 70
W2 V4 80
I want to create another column in Table A, let's say avgVal, that for the KeyId and ValId in each row of Table A holds the average of Val over the rows with the corresponding KeyId and ValId in Table B. Thus, my final output table should look like:
Updated Table A:
KeyId ValId Val avgVal
W1 V1 10 20
W2 V2 20 40
Please let me know if the question is not clear.
It seems you are trying to get aggregate values into table A from table B. In that case you cannot keep the "val" column in table A, because after aggregation, which val from table B would you expect to land in table A?
Assuming that was a genuine mistake and you remove the "val" column from table A, your insert statement for table A should look like this:
insert into table table_a select keyid,valid,avg(val) from table_b group by keyid,valid
You can use the query below to get the average of the data in Table_B corresponding to each row in table_A:
select t1.keyid, t1.valid, t1.val, temp.avgval from table_A t1 left join
(select keyid as k, valid as v, avg(val) as avgval from Table_B group by keyid, valid) temp
on temp.k = t1.keyid and t1.valid = temp.v;
You have to check whether table_A can be altered to add the new column; otherwise you can create another table and load the data into it.
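Since Hive cannot UPDATE a column from a subquery, a minimal sketch of that "other table" route (the target table name table_a_with_avg is illustrative) is to materialize the join result with CREATE TABLE AS:
-- sketch: table_a_with_avg is an illustrative name
create table table_a_with_avg as
select t1.keyid, t1.valid, t1.val, temp.avgval
from table_a t1
left join (select keyid, valid, avg(val) as avgval
from table_b group by keyid, valid) temp
on temp.keyid = t1.keyid and temp.valid = t1.valid;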

join with exact match value otherwise join with default value

I have a table a
c1 c2 c3 c4 value
all all all all 5
all david all Y 6
all all cd all 7
and table b
c1 c2 c3 c4
a peter cd N
b david all Y
c all cd N
I want to get the value from table a into table b; the desired result is like this:
c1 c2 c3 c4 Value
a david cd N 5
b david ab Y 6
c all cd N 7
That is, fall back to the default "all" row when no closer match is found.
Thanks a lot!
I see a couple of possibilities, assuming you have some errors in your example result data (C2 for value 5 and C3 for value 6 look suspect to me):
Use a CTE to union the results, map 'all' to null (e.g. with nullif), and use aggregation to get the max value.
Use a join on value and evaluate table a's value for each c column: if it's not 'all', use it; otherwise use b's value. (If a and b are both 'all', it doesn't matter which we use.) This may be problematic if the two values could differ, e.g. if value 6 had a Y in a and an N in b, but no such example exists in your data, so I'm trusting it doesn't happen. (And if it did, picking a's value when it's not 'all' is the more appropriate choice.)
As a CTE (Common Table Expression):
WITH cte as (
SELECT nullif(c1,'all') as c1
, nullif(c2,'all') as c2
, nullif(c3,'all') as c3
, nullif(c4,'all') as c4
, value
FROM A
UNION ALL
SELECT nullif(c1,'all') as c1
, nullif(c2,'all') as c2
, nullif(c3,'all') as c3
, nullif(c4,'all') as c4
, value
FROM b)
/* We have to coalesce the max: if it's null we need to replace it with 'all'.
We might be able to avoid mapping 'all' to null at all, provided every real
value of c1-c4 sorts greater than 'all'; mapping just seemed safer, at a hit
to performance. */
SELECT coalesce(max(c1),'all') as c1
,coalesce(max(c2),'all') as c2
,coalesce(max(c3),'all') as c3
,coalesce(max(c4),'all') as c4
,value
FROM cte
GROUP BY value
Using a join (simpler from a maintenance and perhaps performance standpoint)
SELECT case when A.C1 <> 'all' then A.C1 else B.c1 end as C1,
case when A.C2 <> 'all' then A.C2 else B.c2 end as C2,
case when A.C3 <> 'all' then A.C3 else B.c3 end as C3,
case when A.C4 <> 'all' then A.C4 else B.C4 end as C4,
A.value -- A.value = b.value, so it doesn't matter which we use.
FROM A
INNER JOIN B
on A.value = B.Value
Depending on existing indexes and data volume the first approach might be better than the second.

Is it possible to cast char to varchar while joining tables in Oracle?

Can I cast B1, a char(2), when joining it to A1, a varchar2(2)?
SELECT * FROM A
LEFT JOIN B
ON CAST(B.B1 AS VARCHAR2(2)) = A.A1
It results in no errors, but no data is displayed.
Is the above query possible?
You can cast it, but it isn't doing what you think, or what you seem to be relying on. Assuming you have a one-character value in the field you're joining on, you don't get a match, with or without the cast:
create table a (a1 varchar2(2));
create table b (b1 char(2));
insert into a values ('X');
insert into b values ('X');
select * from a left join b on b.b1 = a.a1;
A1 B1
-- --
X
select * from a left join b on cast(b.b1 as varchar2(2)) = a.a1;
A1 B1
-- --
X
The cast is changing the data type, but not the data; it is still blank-padded. The only difference is that the padding is now explicitly part of the value, rather than implicit as you'd see with a char. You can verify that the value is the same with the dump() function:
select dump(b.b1) dump_char,
dump(cast(b.b1 as varchar2(2))) dump_varchar2
from b;
DUMP_CHAR DUMP_VARCHAR2
-------------------- --------------------
Typ=96 Len=2: 88,32 Typ=1 Len=2: 88,32
So the type has changed, from 96 (char) to 1 (varchar2), but the value is the same. Compare that with your value in table A and you'll see they are not the same:
select dump(a.a1) dump_varchar2 from a;
DUMP_VARCHAR2
--------------------
Typ=1 Len=1: 88
Your cast B value still has the trailing space and the A value does not, so they don't match. You can remove that trailing space for comparison with trim() or rtrim():
select * from a left join b on rtrim(b.b1) = a.a1;
A1 B1
-- --
X X
There is an implicit conversion from char to varchar2 within the rtrim() call, so you could still cast that explicitly for clarity.
Note that this assumes you never have a trailing space in A. It may be safer to cast the other way:
select * from a left join b on b.b1 = cast(a.a1 as char(2));
A1 B1
-- --
X X
... but which side you cast/trim will also affect which indexes can be used.
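As a hedged illustration of that last point (the index name is made up): in Oracle, wrapping b.b1 in rtrim() prevents a plain index on b1 from being used for the join, but a function-based index on the trimmed expression keeps the predicate indexable.
-- sketch: index name b_b1_trim_ix is illustrative
create index b_b1_trim_ix on b (rtrim(b1));
-- the optimizer can now consider this index for: on rtrim(b.b1) = a.a1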
This worked for me:
SELECT * FROM A
LEFT JOIN B
ON TRIM(B.B1) = A.A1
I prefer this syntax to the cast for purely aesthetic reasons, although presumably other reasons exist.

How to merge data while loading it into Hive?

I'm trying to use Hive to analyze our logs, and I have a question.
Assume we have some data like this:
A 1
A 1
A 1
B 1
C 1
B 1
How can I make it look like this in a Hive table (order is not important, I just want to merge them)?
A 1
B 1
C 1
without pre-processing it with awk/sed or something like that?
Thanks!
Step 1: Create a Hive table for the input data set.
create table if not exists table1 (fld1 string, fld2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
(I assumed the field separator is \t; you can replace it with the actual separator.)
Step 2: Run the query below to get the merged data you are looking for:
create table table2 as select fld1,fld2 from table1 group by fld1,fld2 ;
I tried this with the input set below:
hive (default)> select * from table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
create table table4 as select fld1,fld2 from table1 group by fld1,fld2 ;
hive (default)> select * from table4;
OK
A 1
B 1
C 1
You can use an external table as well, but for simplicity I have used a managed table here.
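Equivalently, as a small aside (the table name table2_distinct is illustrative), SELECT DISTINCT produces the same deduplicated result as the GROUP BY above:
-- sketch: table2_distinct is an illustrative name
create table table2_distinct as select distinct fld1, fld2 from table1;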
One idea: you could create a table around the first file (called 'oldtable').
Then run something like this:
create table newtable as select field1, max(field2) as field2 from oldtable group by field1;
Not sure I have the syntax right, but the idea is to get unique values of the first field, and only one value of the second. Make sense?
For merging the data, we can also use UNION ALL, which combines the rows of two queries into one result (the column types on both sides must be compatible).
insert overwrite table test1
select * from (
select x.* from t1 x
union all
select y.* from t2 y
) merged;
Here we are merging the data of two tables (t1 and t2) into one single table, test1.
There's no way to pre-process the data while it's being loaded without using an external program. You could use a view if you'd like to keep the original data intact.
hive> SELECT * FROM table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
B 2 # Added to show it will group correctly with different values
hive> CREATE VIEW table2 (fld1, fld2) AS SELECT fld1, fld2 FROM table1 GROUP BY fld1, fld2;
hive> SELECT * FROM table2;
OK
A 1
B 1
B 2
C 1