PIG: How to exclude the first n lines while loading - hadoop

Is there a way to exclude the first n lines of a file while loading data in Pig?
I have a CSV file that I would like to load, but I have to ignore the first 3 lines.

One option is to use RANK and filter on the rank:
A = LOAD 'input' <schema>;
B = RANK A;
C = FILTER B BY $0 > 3;
D = FOREACH C GENERATE $1..;
DUMP D;
If you defined a schema in your LOAD statement, then use the field names instead of positional notation ($0, $1, etc.); it will be more readable.
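For example, with a named schema the same pipeline might look like this (a minimal sketch; the file name and field names are assumed). Note that RANK is only available in Pig 0.11 and later:
A = LOAD 'input.csv' USING PigStorage(',') AS (id:chararray, name:chararray);
B = RANK A;                        -- prepends a 1-based rank field named rank_A
C = FILTER B BY rank_A > 3;        -- drop the first 3 lines
D = FOREACH C GENERATE id, name;   -- project the original fields by name
DUMP D;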

Try the following code:
abt = LOAD 'act.psv' using PigStorage('|')
as (r1:chararray,r2:chararray);
r = rank abt;
n = filter r by ($0 > 3);
p = foreach n generate r1,r2;
dump p;

Pig Latin Remove Tuple in Data Bag

Here is my code leading up to my issue:
a = LOAD 'tellers' using TextLoader() AS line;
-- cast a to chararray
b = foreach a generate (chararray)line;
-- run through my UDF to create tuples
c = foreach b generate myudfs.TellerParser5(line); -- ({(20),(5),(5),(10),(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
d = foreach c generate flatten(number);
e = group d by number; -- {group: chararray,d: {(number: chararray)}}
f = foreach e generate group, COUNT(d); -- f: {group: chararray,long}
In databag f, I have an empty tuple (,1) that I'd like to filter out.
dump f;
(,1)
(1,97)
(5,49)
(10,87)
(20,24)
describe f;
f: {group: chararray,long}
I've tried this with no success (makes no change):
remove_tuple = filter f BY group is not null;
group is a Pig keyword; this should work once the field is projected under some other name.
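A minimal sketch of that idea (alias names here are illustrative): project group under a new name first, then filter on it:
g2 = foreach f generate group AS num, $1 AS cnt;
remove_tuple = filter g2 by num is not null;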
NULLs can be filtered by using != 'null' as a condition. I have taken the below as the input:
(,1)
(1,97)
(5,49)
(10,87)
(20,24)
Below is how we can filter the NULLs:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:long);
B = FILTER A BY a!='null';
DUMP B;
So for your script, the line will be something like:
remove_tuple = filter f BY group!='null';
Output:
(1,97)
(5,49)
(10,87)
(20,24)
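One caveat, as an assumption about the data rather than something verified against the original UDF: if the blank group in (,1) is an empty string rather than a true null, filtering on both conditions may be what is needed:
remove_tuple = filter f by group is not null AND group != '';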
I solved it by adding a step that casts the value to an int. Here are the steps:
e = foreach d generate (int)$0 AS number; -- this is the key added step
f = group e by number; -- {group: int,e: {(number: int)}}
g = foreach f generate group, COUNT(e); -- g: {group: int,long}
h = foreach f generate group, SUM(e);
i = filter g by $0 is not null;
dump i;
(1,97)
(5,49)
(10,87)
(20,24)

PIG replace for multiple columns

I have about 150 columns in total and want to search each one for \t and replace it with a space.
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';
This is producing only the word ALL as output.
Is there a better way to replace across all columns at once?
Thank you
Nivi
You could do this with a Python UDF. Say you had some data like this with tabs in it:
Data:
hi there friend,whats up,nothing much
yo yo yo,green eggs, ham
You could write this in Python
UDF:
@outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
    try:
        # replace every tab in every field with a space
        out = [tuple(map(lambda s: s.replace("\t", " "), x)) for x in columns]
        return out
    except:
        return [(None,)]
and then in Pig
Query:
REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray, col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;
Output:
(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)
Obviously you have more than three columns, but since you are grouping by all, the script should generalize to any number of columns.
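If you would rather stay in pure Pig, the direct alternative is one REPLACE per column, which is exactly what the UDF saves you from spelling out 150 times. A sketch for three columns (column names are assumed; adjust to your HCatalog schema):
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = FOREACH A GENERATE REPLACE(col0, '\\t', ' '), REPLACE(col1, '\\t', ' '), REPLACE(col2, '\\t', ' ');
STORE B INTO 'location';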

Files comparing field by field in PIG

I have two files like
File 1
id,sal,location,code
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
File 2
id,sal,location,code
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
When I compare file1 with file2, I need an output like
1000, sal
1001,code
Basically, it should tell me which field changed from the previous file, along with the id.
Can this be done in Pig?
You can solve this easily, but the challenging part is the output format you mentioned; it needs slightly complex logic.
I have handled most of the edge cases, but check against your input to make sure it works for all combinations.
file1:
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
file2:
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
PigScript:
A = LOAD 'file1' USING PigStorage(',') AS (id,sal,location,code);
B = LOAD 'file2' USING PigStorage(',') AS (id,sal,location,code);
C = JOIN A BY id,B BY id;
D = FOREACH C GENERATE A::id AS id,((A::sal == B::sal)?'':'sal') AS sal,
((A::location == B::location)?'':'location') AS location,
((A::code == B::code)?'':'code') AS code;
--Remove the common fields between two files
E = FILTER D BY NOT (sal=='' AND location=='' AND code=='');
--The below two lines format the output
F = FOREACH E GENERATE id,REPLACE(BagToString(TOBAG(sal,location,code),','),'(,,$|,$)','') As finalOutput;
G = FOREACH F GENERATE id,REPLACE(finalOutput,',,',',');
DUMP G;
Output:
(1000,sal)
(1001,code)
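Note, as a design caveat: the inner JOIN only compares ids that exist in both files, so rows added or removed between file1 and file2 are silently dropped. If those need to be reported too, a FULL OUTER JOIN on id with null checks would be required.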

Equivalent of Union_map in pig

I have been trying to find the union_map() equivalent in Pig. I know that the TOMAP function produces the MAP datatype.
But the requirement is to combine all the MAPs for a given id, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get similar output in Pig.
Download piggybank.jar from this link http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, add it to your classpath, and try the below approach. The RANK/GROUP/LIMIT steps keep only the last occurrence of each (ID, Key) pair, which is why ID1's K1 resolves to V7 rather than V1.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
sortByRank = ORDER B BY rank_A DESC;
top1 = LIMIT sortByRank 1;
GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
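As a side note, Pig has no built-in UNION_MAP, so this approach produces a tuple of single-entry maps per ID (as the brackets in the output show) rather than the one merged map Hive would return. If a single merged map is a hard requirement, a custom UDF would be needed to fold the bag of maps together.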

Count and find maximum number in Hadoop using pig

I have a table containing sample CDR data, in which column A holds the calling number and column B holds the called number.
I need to find who made the maximum number of calls (column A),
and also which number (column B) was called the most.
The table structure is like below:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889578226 has the most outgoing calls and 77382596 is the most-called number; I need to get output along those lines.
In Hive I run the query below:
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What might be the equivalent code for the above query in Pig?
Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray,phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
SortedList = ORDER C BY cnt DESC;
top = LIMIT SortedList 1;
GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
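Mapped onto your CDR table, the two direct aggregations from the question could also be written like this (a sketch; the file name, delimiter, and field names are assumed):
A = LOAD 'cdr_data' USING PigStorage(',') AS (calling:chararray, called:chararray);
-- caller with the most outgoing calls
B = GROUP A BY calling;
C = FOREACH B GENERATE group AS calling, COUNT(A) AS outgoing;
D = ORDER C BY outgoing DESC;
top_caller = LIMIT D 1;
DUMP top_caller;
-- most frequently called number
E = GROUP A BY called;
F = FOREACH E GENERATE group AS called, COUNT(A) AS incoming;
G = ORDER F BY incoming DESC;
top_called = LIMIT G 1;
DUMP top_called;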
