Group records conditionally in pig - hadoop

I have a file with two columns, Column1 and Column2, holding records as below:
File in HDFS
In record 1, A is the main record, and Column2 of record 2 holds the information linked with A; the same goes for B, C and D respectively. What I am looking for is to club this information together and get the following desired output.
Desired output looks like
I can't make any modifications to the file in HDFS; everything has to happen within the Hadoop environment.
How can this be achieved? Any help is appreciated!

After loading the data,
A = load '' as (col1:chararray, col2:chararray);
-- Pig has no two-argument substr; SUBSTRING(str, start, stop) is the built-in equivalent
B = FOREACH A GENERATE (col1 is null ? SUBSTRING(col2, 1, (int)SIZE(col2)) : col1) AS key, col2;
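The screenshots of the input file and the desired output are not reproduced here, so the exact layout is an assumption, but once the key has been derived you can club the linked rows together by grouping on it, roughly like this (a sketch only, continuing from the relation B above):
-- Sketch only: assumes B has the schema (key:chararray, col2:chararray) as built above.
C = GROUP B BY key;                      -- one group per main record (A, B, C, D)
D = FOREACH C GENERATE group AS key,     -- the main record's key
                       B.col2 AS linked; -- bag of all Column2 values linked to it
DUMP D;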

Related

Parsing and replacing columns in a file

INPUT:
I have an input file in which the first 10 characters of each line represent 2 fields: the first 4 characters (field A) and the next 6 characters (field B). The file contains about 400K records.
I have a mapping table which contains about 25M rows and looks like:
Field A | Field B | SomeStringA | SomeStringB
1628    | 836791  | 1234        | 783901
afgd    | ahutwe  | 1278        | ashjkl
and so on.
Field A and Field B combined form the primary key for the table.
PROBLEM STATEMENT:
Replace:
Field A by SomeStringA
Field B by SomeStringB
in the input file. SomeStringA and SomeStringB are exactly the same width as Field A and B respectively.
Here's what I'm trying:
Approach 1:
Sort and Dump the mapping table into a file
spool dump_file
select * from mapping order by fieldA, fieldB;
spool off
exit;
Strip the input file and get the first 10 chars
cut -c1-10 input_file > input_file_stripped
Do something to find the lines that begin with the same string, and when they do, replace the first 10 characters in input_file with characters 10-20 of the matching line in the spooled file. This is where I'm stuck.
Approach 2:
Take the input file and get the first 10 chars
cut -c1-10 input_file > input_file_stripped
Use sqlldr and load into a temp_table.
Select matching records from the mapping table and spool
spool matching_records
select m.* from mapping m, temp t where m.fieldA=t.fieldA and m.fieldB=t.fieldB;
spool off
exit;
Now how do I replace these in the original file?
Given the high number of records to process, how can this be done, and done fast?
Notes:
Not a one-time activity; it has to be done daily, so scale is important
The mapping table is unlikely to change
I have Python, shell scripting and an Oracle database available. Any combination of these is fine.
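One way to unblock the "replace" step, given that Python is available: load the spooled mapping dump into a dictionary keyed on the first 10 characters, then stream the input file through it. This is only a sketch under assumptions: the file names are hypothetical, the dump is assumed to be fixed-width with FieldA+FieldB in characters 1-10 and SomeStringA+SomeStringB in characters 11-20, and the 25M-row mapping is assumed to fit in memory.
# Hypothetical sketch: file names and the fixed-width layout of the dump are assumptions.
mapping = {}
with open('dump_file') as dump:
    for line in dump:
        key = line[0:10]            # Field A (4 chars) + Field B (6 chars)
        mapping[key] = line[10:20]  # SomeStringA (4 chars) + SomeStringB (6 chars)

with open('input_file') as src, open('output_file', 'w') as dst:
    for line in src:
        key = line[0:10]
        # Replace the first 10 characters when a mapping exists; otherwise keep the line as-is.
        dst.write(mapping.get(key, key) + line[10:])
Because SomeStringA and SomeStringB are exactly the same width as the fields they replace, the rest of each line can be copied through untouched, and a single pass over the 400K-line input keeps the daily run fast.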

Load only particular field in PIG?

This is my file:
Col1, Col2, Col3, Col4, Col5
I need only Col2 and Col3.
Currently I'm doing this:
a = load 'input' as (Col1:chararray,
Col2:chararray,
Col3:chararray,
Col4:chararray);
b = foreach a generate Col2, Col3;
Is there a way to directly load only Col2 and Col3 instead of loading the whole input and then generating the required columns?
Your method of only GENERATEing the columns you want is an effective way to do just what you ask. Remember that all of your data is stored on HDFS, and you're not loading it all into memory when you start your script. You still will have to read those bytes off the disk even if you are not keeping them around for use in your processing, so there is no performance advantage to never loading that data. The advantage comes in never having to send it to a reducer, which you have accomplished with your method.
In cases where Pig can tell that a column won't be used, it will "prune" it immediately, essentially doing for you what you did with your b = foreach a generate Col2, Col3;. This won't happen, however, if you are using a UDF that might access other fields, because Pig doesn't look inside the UDF to see if they get used. For example, suppose Col3 is an int. If you have
b = group a by Col2;
c = foreach b generate group, SUM(a.Col3);
then Pig will automatically prune the 1st and 4th columns for you, since it can see they're never used. However, if you instead did
b = group a by Col2;
c = foreach b generate group, COUNT(a);
then Pig can't prune, because it doesn't see inside the COUNT UDF and doesn't know that the other fields won't be used. When in doubt of whether Pig will do this pruning, you can use the foreach/generate method you already have. And Pig should print a diagnostic message when you start your script listing all the columns it was able to prune out.
If instead your problem is that you don't want to have to provide a full schema when you're interested in just a few columns, you can skip the schema entirely and put it in the GENERATE:
a = load 'input';
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;

How to load data into a couple of tables from a single file with different record structures in Hive?

I have a single file with a structure like:
A 1 2 3
A 4 5 6
A 5 8 12
B abc cde
B and fae
B bsd oio
C 1
C 2
C 3
and would like to load the data into 3 simple tables (A(int, int, int), B(string, string), C(int)).
Is it possible and how?
It's also fine for me if the tables are A(string, int, int, int) etc., with the first column of the file included in the table.
I'd go with option 1 as Praveen suggests. I'd create an external table of only a string, and use the FROM ( ... ) syntax to insert into multiple tables at once. I think something like the following would work
create external table source_table( line string )
stored as textfile
location '/myfile';
from ( select split( line , " ") as col_array from source_table ) cols
insert overwrite table A select col_array[1], col_array[2], col_array[3] where col_array[0] = 'A'
insert overwrite table B select col_array[1], col_array[2] where col_array[0] = 'B'
insert overwrite table C select col_array[1] where col_array[0] = 'C';
Option 1) Map the entire data to a Hive table and then use the insert overwrite table .... option to map the appropriate data to the target tables.
Option 2) Develop a MR program to split the file into multiple files and then do the mapping of the files to the target tables in Hive.
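Option 2 would normally be a MapReduce job; purely as an illustration of the same splitting idea, here is a hypothetical non-MapReduce sketch in Python (the file names are assumptions) that writes one file per record type, which could then be mapped to the target Hive tables:
# Hypothetical sketch: split the source file on its first column into one file per target table.
outputs = {}
with open('myfile') as src:
    for line in src:
        record_type = line.split(' ', 1)[0]   # 'A', 'B' or 'C'
        if record_type not in outputs:
            outputs[record_type] = open('split_%s.txt' % record_type, 'w')
        outputs[record_type].write(line)
for out in outputs.values():
    out.close()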

complex Hive Query

Hi I have the following table:
ID | time
===========
5  | 200101
3  | 200102
2  | 200103
12 | 200101
16 | 200103
18 | 200106
Now I want to know how often a certain month of the year appears. I can't use a plain GROUP BY because that only counts the values which appear in the table, but I also want to get a 0 when a certain month does not appear. So the output should be something like this:
time   | count
==============
200101 | 2
200102 | 1
200103 | 1
200104 | 0
200105 | 0
200106 | 1
I hope it is clear what I mean. I would appreciate any help.
You can provide a year-month table containing all year and month information. I wrote a script for you to generate such a csv file:
#!/bin/bash
# year_month.sh
start_year=1970
end_year=2015
for year in $( seq ${start_year} ${end_year} ); do
    for month in $( seq 1 12 ); do
        # zero-pad the month so each line comes out as YYYYMM
        echo ${year}$( echo ${month} | awk '{printf("%02d\n", $1)}')
    done
done > year_month.csv
Save it in year_month.sh and run it. Then you will get a file year_month.csv containing the year and month from 1970 to 2015. You can change start_year and end_year to specify the year range.
Then, upload the year_month.csv file to HDFS. For example,
hadoop fs -mkdir /user/joe/year_month
hadoop fs -put year_month.csv /user/joe/year_month/
After that, you can load year_month.csv into Hive. For example,
create external table if not exists
year_month (time int)
location '/user/joe/year_month';
At last, you can join the new table with your table to get the final result. For example, assume your table is id_time:
from (select year_month.time as time, id_time.id as id
      from year_month
      left outer join id_time
      on year_month.time = id_time.time) temp
select time, count(id) as `count`
group by time;
Note: you may need to make tiny modifications (such as paths and types) to the above statements.

How to merge data while loading it into Hive?

I'm trying to use Hive to analyse our logs, and I have a question.
Assume we have some data like this:
A 1
A 1
A 1
B 1
C 1
B 1
How can I make it look like this in the Hive table (order is not important, I just want to merge them)?
A 1
B 1
C 1
without pre-process it with awk/sed or something like that?
Thanks!
Step 1: Create a Hive table for the input data set.
create table if not exists table1 (fld1 string, fld2 string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
(I assumed the field separator is \t; you can replace it with the actual separator.)
Step 2: Run the below to get the merged data you are looking for:
create table table2 as select fld1,fld2 from table1 group by fld1,fld2 ;
I tried this for the below input set:
hive (default)> select * from table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
create table table4 as select fld1,fld2 from table1 group by fld1,fld2 ;
hive (default)> select * from table4;
OK
A 1
B 1
C 1
You can use an external table as well, but for simplicity I have used a managed table here.
One idea: you could create a table around the first file (called 'oldtable').
Then run something like this....
create table newtable as select field1, max(field2) from oldtable group by field1;
Not sure I have the syntax right, but the idea is to get unique values of the first field, and only one of the second. Make sense?
For merging the data, we can also use UNION ALL; it can combine the rows of two tables even when their column datatypes differ, provided the types are compatible.
insert overwrite table test1
select * from (
    select x.* from t1 x
    union all
    select y.* from t2 y
) merged;
Here we are merging the data of two tables (t1 and t2) into one single table, test1.
There's no way to pre-process the data while it's being loaded without using an external program. You could use a view if you'd like to keep the original data intact.
hive> SELECT * FROM table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
B 2 # Added to show it will group correctly with different values
hive> CREATE VIEW table2 (fld1, fld2) AS SELECT fld1, fld2 FROM table1 GROUP BY fld1, fld2;
hive> SELECT * FROM table2;
OK
A 1
B 1
B 2
C 1
