How to load data into a couple of tables from a single file with different record structures in Hive? - hadoop

I have a single file with a structure like:
A 1 2 3
A 4 5 6
A 5 8 12
B abc cde
B and fae
B bsd oio
C 1
C 2
C 3
and would like to load the data into 3 simple tables (A (int, int, int), B (string, string), C (int)).
Is that possible, and how?
It's also fine for me if the tables are A (string, int, int, int) etc., with the first column of the file included in the table.

I'd go with option 1, as Praveen suggests. I'd create an external table of only a string, and use the FROM ( ... ) syntax to insert into multiple tables at once. I think something like the following would work:
create external table source_table( line string )
stored as textfile
location '/myfile';
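-- note: LOCATION must be an HDFS directory containing the file, not the file itself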
from ( select split( line , " " ) as col_array from source_table ) cols
insert overwrite table A select cast( col_array[1] as int ), cast( col_array[2] as int ), cast( col_array[3] as int ) where col_array[0] = 'A'
insert overwrite table B select col_array[1], col_array[2] where col_array[0] = 'B'
insert overwrite table C select cast( col_array[1] as int ) where col_array[0] = 'C';
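For reference, a minimal sketch of the three target tables, with column types taken from the question (the column names and the text storage format are assumptions):
create table A ( c1 int, c2 int, c3 int ) stored as textfile;
create table B ( c1 string, c2 string ) stored as textfile;
create table C ( c1 int ) stored as textfile;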

Option 1) Map the entire data to a Hive table and then use the insert overwrite table ... option to map the appropriate data to the target tables.
Option 2) Develop an MR program to split the file into multiple files and then map each file to a target table in Hive.

Related

ORA-30483: window functions are not allowed here in ODI Mapping

I am working on an ODI mapping where I am calculating MIN(ID) OVER (PARTITION BY device_num, sys_id) AS min_id in an expression component. I used another expression component to filter duplicates using ROW_NUMBER() OVER (PARTITION BY ID ORDER BY min_id), followed by a filter component with "rownum=1". This results in the "window functions are not allowed here" error.
I understand that I need to run the analytic function on top of the aggregate results. I am not sure how to achieve this in an ODI mapping (ODI 12c). Can anyone please guide me?
merge into (
select /*+ */ *
from target_base.tgt_table
where (1=1)
) TGT
using (
select /*+ */
RESULT2.ID_1 AS ID,
RESULT2.COL AS MIN_ID
from (
SELECT
RESULT1.ID AS ID ,
RESULT1.DEVICE__NUM AS DEVICE__NUM ,
RESULT1.SYS_ID AS SYS_ID ,
MIN(RESULT1.ID) OVER (PARTITION BY RESULT1.DEVICE__NUM ,RESULT1.SYS_ID) AS COL ,
ROW_NUMBER() OVER (PARTITION BY RESULT1.ID ORDER BY (MIN(RESULT1.ID) OVER (PARTITION BY RESULT1.DEVICE__NUM ,RESULT1.SYS_ID) AS COL) DESC ) AS COL_1
-- WINDOW FUNCTION ERROR,
FROM
(
select * from union_table
) RESULT1
)RESULT2
where (1=1)
and (RESULT2.COL_1 = 1)
) SRC
on (
and TGT.ID=SRC.ID )
when matched then update set
TGT.COMMON_ID = SRC.MIN_ID
, TGT.REC_UPDATE = SYSDATE
WHERE (
DECODE(TGT.COMMON_ID, SRC.COMMON_ID, 0, 1) > 0
)
UNION_TABLE has data as per below table
ID | device_num | sys_id
------------------------
1  | A          | 5
2  | B          | 15
3  | C          | 25
4  | D          | 35
5  | A          | 10
5  | A          | 5
6  | B          | 15
6  | B          | 20
7  | C          | 25
7  | C          | 30
8  | D          | 35
8  | D          | 40
Output expected: the ID of the row where row_number = 1 will be updated in the target.
[screenshot: ODI mapping]
This is a very complex use case to model in ODI, and the parser might not understand what you are trying to achieve.
My advice would be to write the difficult part of the query manually in SQL and use it as a source in ODI. Here is how to do it:
In the physical design of your mapping, click on your source table. In the property pane, go to the Extract Options. You can then paste your SQL as the value of the CUSTOM_TEMPLATE option.
Of course this hides the logic of the mapping a bit, so it shouldn't be used everywhere, but for complex use cases such as this one it is an easy way to get the job done. I personally always add a memo on mappings with custom SQL so other developers can quickly see it.
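As an illustration, a minimal sketch of such a hand-written source query, built from the generated SQL above (column names are taken from it; treat this as a starting point, not the exact ODI output). The analytic MIN is computed in an inner block first, and ROW_NUMBER() is applied on top of it, so no window function is nested inside another:
select id, min_id
from (
    select id,
           min_id,
           row_number() over (partition by id order by min_id) as rn
    from (
        select id,
               device_num,
               sys_id,
               min(id) over (partition by device_num, sys_id) as min_id
        from union_table
    )
)
where rn = 1;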
Try using the IKM Oracle Incremental Update on the target table instead of the IKM Oracle Merge:
Physical -> click target table -> Integration Knowledge Module -> Oracle Incremental Update

SAS join (or insert) little table to big table

I have a little problem. I have a big table and a few little tables, where the little tables contain a subset of the fields from the big table. How can I insert (or union) the tables so that where a field exists in both, the data is kept, and where a little table does not have a field from the big table, that field is set to null/0 in the big table?
Example:
data temp1;
infile DATALINES dsd missover;
input a b c d e f g;
CARDS;
1, 2, 3, 4,5,6
2, 3, , 5
3, 3
4,,3,2,3,
;
run;
data temp2;
infile DATALINES dsd missover;
input a c e g;
CARDS;
5, 2, 3, 4
6, 3, , 5
7, 3
;
run;
Is there an elegant method so that, if I insert temp2 into temp1, the fields missing from temp2 are set to null in temp1?
Thank you for help!
That is exactly what SAS does by default.
data want ;
set have1 have2;
run;
It will match the variables by name and any variables that do not exist (in either source) will have missing values.
For better performance when appending a small table to a large table, you should use PROC APPEND instead of a data step, to avoid making a new copy of the large dataset. That is more like an "insert". The FORCE option allows the datasets to be different. But since the new data is being added to the old dataset, any extra variables that appear only in HAVE2 will just be ignored and their values will be lost.
proc append base=have1 data=have2 force ;
run;
If you really did have to generate an actual INSERT statement (perhaps you are actually trying to generate SQL code to run in a foreign database) you might want to compare the metadata of the two datasets and find the common variables.
proc contents data=have1 out=cont1 noprint; run;
proc contents data=have2 out=cont2 noprint; run;
proc sql noprint;
select a.name into :varlist separated by ','
from cont2 a
inner join cont1 b
on upcase(a.name) = upcase(b.name)
;
...
insert into have1 (&varlist) select &varlist from have2 ;
It is not very clear to me what operation you intend to do, but some initial thoughts:
To compare columns between two datasets (and check whether a value exists in one of them), it is good practice to use an outer join. You can do joins via the MERGE statement in a data step, or more elegantly with PROC SQL; see the sketch below.
However, using either approach you will have to specify which rows in temp1 and temp2 shall be compared - you are typically joining on a column that is available in both tables.
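As an illustration, a minimal PROC SQL sketch of a full outer join on the shared column a (the join key and column list are assumptions based on the sample data; fields absent from temp2 simply come back as missing values):
proc sql;
    /* full join keeps rows from both tables; coalesce picks the non-missing key */
    create table want as
    select coalesce(t1.a, t2.a) as a,
           t1.b,
           coalesce(t1.c, t2.c) as c,
           t1.d,
           coalesce(t1.e, t2.e) as e,
           t1.f,
           coalesce(t1.g, t2.g) as g
    from temp1 t1
    full join temp2 t2
        on t1.a = t2.a;
quit;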
To help us resolve your issue, could you possibly provide the correct output of your desired operation when performed on temp1 and temp2? That would show what options you've explored and what needs to be fixed.
You should try PROC APPEND. It will be more efficient because you will not be reading your big table again and again, unlike in:
/*reads temp1 which is big table and temp2*/
data temp3;
set temp1 temp2;
run;
/* this does pretty much the same as the above code, but will not read
   your big table and so will be more efficient */
proc append base=temp1 data=temp2 force;
run;
More on PROC APPEND in the documentation: http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n19kwc3onglzh2n1l2k4e39edv3x.htm

Group records conditionally in pig

I have a file which has 2 columns, Column1 and Column2, holding records as below -
[screenshot: file in HDFS]
Record 1's A is the main record, and in record 2, Column2 holds the information linked with A; similarly for the information linked with B, C and D respectively. What I am looking for is to club this information together and get the following desired output.
The desired output looks like: [screenshot]
I can't make any modifications to the HDFS file; anything and everything must happen in the Hadoop environment only.
How can this be achieved? Any help!!
After loading the data:
A = LOAD '' AS (col1:chararray, col2:chararray);
-- Pig has no substr; SUBSTRING(str, start, stop) is the builtin. The indices here are a guess at the original intent.
B = FOREACH A GENERATE (col1 is null ? SUBSTRING(col2, 0, 1) : col1), col2;

Collect to a Map in Hive

I have a Hive table such as
id | value
----------
A  | 1
A  | 2
B  | 3
A  | 4
B  | 5
Essentially, I want to mimic Python's defaultdict(list) and create a map with id as the keys and value as the values.
Query:
select COLLECT_TO_A_MAP(id, value)
from table
Output:
{A:[1,2,4], B:[3,5]}
I tried using Klout's CollectUDAF() but it appears this will not append the values to an array, it will just update them. Any ideas?
EDIT:
Here is a more detailed description, so I can avoid answers that just point me at functions in the Hive documentation. Suppose I have a table
num | id | value
----------------
1   | A  | 1
1   | A  | 2
1   | B  | 3
2   | A  | 4
2   | B  | 5
2   | B  | 6
What I am looking for is a UDAF that provides this output
num | new_map
------------------
1   | {A:[1,2], B:[3]}
2   | {A:[4], B:[5,6]}
for this query:
select num
,COLLECT_TO_A_MAP(id, value) as new_map
from table
group by num
There is a workaround to achieve this. It can be mimicked by using Klout's CollectUDAF() (referenced above) in a query such as
add jar ~/brickhouse/target/brickhouse-0.6.0.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select num
,collect(id_array, value_array) as new_map
from (
select collect_list(id) as id_array
,collect_list(value) as value_array
,num
from table
group by num
) A
group by num
However, I would rather not write a nested query.
EDIT #2
(As referenced in my original question) I have already tried using Klout's CollectUDAF(), even in the instance where you pass it two parameters and it creates a map. The output from that (if applied to the dataset in my 1st edit) is
1 {A:2, B:3}
2 {A:4, B:6}
As stated in my original question, it doesn't collect the values into an array; it just collects the last one (or updates the array).
Use the collect UDAF in Brickhouse (http://github.com/klout/brickhouse).
It is exactly what you need. Brickhouse's 'collect' returns a list if one parameter is used, and a map if two parameters are used.
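For instance, a minimal sketch (the jar path and table names are assumptions):
add jar /path/to/brickhouse.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
-- one argument: an array of values per group
select id, collect(value) from your_table group by id;
-- two arguments: a key-to-value map per group
select num, collect(id, value) from your_table group by num;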
The CollectUDAF in Brickhouse (http://github.com/klout/brickhouse) will get you there.
Regarding your comment in EDIT #2:
First collect the values into a list, then collect the (key, value) pairs into a map:
select
    num,
    collectUDAF(id, vals) as new_map
from (
    select
        num,
        id,
        collect_set(value) as vals  -- 'values' is a reserved word in Hive, so alias it differently
    from tbl
    group by
        num,
        id
) as sub
group by
    num
will return
num | new_map
------------------
1   | {A:[1,2], B:[3]}
2   | {A:[4], B:[5,6]}
If you don't care about the order in which the values appear, you could use the collect_set() UDAF that comes with Hive.
SELECT id, collect_set(value) FROM table GROUP BY id;
This should solve your issue.
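Note that collect_set() also removes duplicate values. If duplicates matter, collect_list() (available since Hive 0.13.0) keeps them:
SELECT id, collect_list(value) FROM table GROUP BY id;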
Your current query groups by num in both the inner and outer query -- you need to group by id in the inner query to accomplish what you're trying to do.
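For illustration, a condensed sketch of that corrected nesting (assuming Brickhouse's collect is registered as in the snippets above):
select num,
       collect(id, vals) as new_map   -- two-arg collect: one map per num
from (
    select num, id, collect_list(value) as vals
    from tbl
    group by num, id
) per_id
group by num;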
See the Brickhouse UDAF source (https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java#L55): when more than one argument is passed, the MapCollectUDAFEvaluator is used.
add jar */brickhouse.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select collect(a, b)
from (
    select 1232123 a, 21 b
    union all
    select 123 a, 23 b
) a;
-- result: {1232123:21, 123:23}

how to merge data while loading them into hive?

I'm trying to use Hive to analyze our logs, and I have a question.
Assume we have some data like this:
A 1
A 1
A 1
B 1
C 1
B 1
How can I make it like this in a Hive table (order is not important, I just want to merge them)?
A 1
B 1
C 1
without pre-processing it with awk/sed or something like that?
Thanks!
Step 1: Create a Hive table for the input data set.
create table if not exists table1 (fld1 string, fld2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
(I assumed the field separator is \t; replace it with the actual separator.)
Step 2: Run the below to get the merged data you are looking for:
create table table2 as select fld1, fld2 from table1 group by fld1, fld2;
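Equivalently, since grouping by all selected columns simply deduplicates the rows, SELECT DISTINCT gives the same result:
create table table2 as select distinct fld1, fld2 from table1;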
I tried this with the below input set:
hive (default)> select * from table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
create table table4 as select fld1,fld2 from table1 group by fld1,fld2 ;
hive (default)> select * from table4;
OK
A 1
B 1
C 1
You can use an external table as well, but for simplicity I have used a managed table here.
One idea: you could create a table around the first file (called 'oldtable').
Then run something like this:
create table newtable as select field1, max(field2) from oldtable group by field1;
Not sure I have the syntax right, but the idea is to get the unique values of the first field, and only one value of the second. Make sense?
For merging the data we can also use UNION ALL; it also works when the two inputs have different (but compatible) datatypes.
insert overwrite table test1
select u.* from (
    select x.* from t1 x
    union all
    select y.* from t2 y
) u;
Here we are merging the data of two tables (t1 and t2) into one single table, test1. Note that UNION ALL keeps duplicates, so a GROUP BY or DISTINCT step like the ones above is still needed to collapse them.
There's no way to pre-process the data while it's being loaded without using an external program. You could use a view if you'd like to keep the original data intact.
hive> SELECT * FROM table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
B 2 # Added to show it will group correctly with different values
hive> CREATE VIEW table2 (fld1, fld2) AS SELECT fld1, fld2 FROM table1 GROUP BY fld1, fld2;
hive> SELECT * FROM table2;
OK
A 1
B 1
B 2
C 1
