Handling duplicate records in PIG Latin - hadoop

If a file contains duplicate records, the first record for each account should go to a valid file and the remaining duplicates should be moved to an invalid file, using a Pig script.
The scenario is shown below.
Input:
Acc|Phone|Name
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
1234|234-123-0000|DEF
9999|123-456-1890|PQR
8734|456-879-1234|QWE
4567|369-258-0147|NNN
1234|987-654-3210|BLS
output: Two files
1. Valid rec:
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
9999|123-456-1890|PQR
8734|456-879-1234|QWE
2. Invalid rec:
1234|234-123-0000|DEF
4567|369-258-0147|NNN
1234|987-654-3210|BLS
The invalid records do not have to be in the same order. The output could also look like this.
Invalid rec:
1234|234-123-0000|DEF
1234|987-654-3210|BLS
4567|369-258-0147|NNN
Scenario 2:
Input:
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
1234|234-123-0000|DEF
9999|123-456-1890|PQR
8734|456-879-1234|QWE
4567|369-258-0147|NNN
1234|087-654-3210|BLS
1234|303-444-5555|XYZ
4567|122-555-1111|ABC
1234|134-123-0000|DEF
9999|123-456-1890|PQR
8734|456-879-1234|QWE
4567|069-258-0147|NNN
1234|086-654-3210|BLS
1234|033-444-5555|XYZ
4567|200-555-1111|ABC
1234|230-123-0000|DEF
9999|023-456-1890|PQR
8734|456-779-1234|QWE
4567|309-258-0147|NNN
1234|007-654-3210|BLS
Good Rec:
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
9999|123-456-1890|PQR
8734|456-879-1234|QWE
Can anyone please suggest an approach? I'm only able to get the first record.
Thanks.

Can you try this?
input.txt
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
1234|234-123-0000|DEF
9999|123-456-1890|PQR
8734|456-879-1234|QWE
4567|369-258-0147|NNN
1234|987-654-3210|BLS
PigScript:
A = LOAD 'input.txt' USING PigStorage('|') AS (Acc:chararray,Phone:chararray,Name:chararray);
B = RANK A;
C = GROUP B BY Acc;
D = FOREACH C {
    sortInAsc = ORDER B BY rank_A ASC;
    top1 = LIMIT sortInAsc 1;
    GENERATE top1 AS goodRecord, SUBTRACT(B,top1) AS badRecord;
}
--Flatten the good records
E = FOREACH D GENERATE FLATTEN(goodRecord);
--Get the required columns and skip the rank column(ie,$0)
F = FOREACH E GENERATE $1,$2,$3;
STORE F INTO 'goodrecord' USING PigStorage('|');
--Flatten the bad records
G = FOREACH D GENERATE FLATTEN(badRecord);
--Get the required columns and skip the rank column(ie,$0)
H = FOREACH G GENERATE $1,$2,$3;
STORE H INTO 'badrecord' USING PigStorage('|');
goodrecord Output1:
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
8734|456-879-1234|QWE
9999|123-456-1890|PQR
badrecord Output1:
1234|987-654-3210|BLS
1234|234-123-0000|DEF
4567|369-258-0147|NNN
Scenario2 goodrecord Output:
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
8734|456-879-1234|QWE
9999|123-456-1890|PQR
Scenario2 badrecord Output:
1234|033-444-5555|XYZ
1234|007-654-3210|BLS
1234|230-123-0000|DEF
1234|303-444-5555|XYZ
1234|234-123-0000|DEF
1234|134-123-0000|DEF
1234|086-654-3210|BLS
1234|087-654-3210|BLS
4567|369-258-0147|NNN
4567|309-258-0147|NNN
4567|122-555-1111|ABC
4567|069-258-0147|NNN
4567|200-555-1111|ABC
8734|456-879-1234|QWE
8734|456-779-1234|QWE
9999|123-456-1890|PQR
9999|023-456-1890|PQR
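The logic of the Pig script above (keep the first record seen for each Acc, send every later record with the same Acc to the bad output) can be checked on a small sample outside Hadoop. The following is only a minimal Python sketch of that rule, not a replacement for the Pig script; the function name split_duplicates and its parameters are hypothetical.

```python
def split_duplicates(lines, key_index=0, sep="|"):
    """Return (valid, invalid): the first record per key is valid, later ones are invalid."""
    seen = set()
    valid, invalid = [], []
    for line in lines:
        key = line.split(sep)[key_index]  # the Acc column in the sample data
        if key in seen:
            invalid.append(line)
        else:
            seen.add(key)
            valid.append(line)
    return valid, invalid


# Sample input from the question (scenario 1).
records = [
    "1234|333-444-5555|XYZ",
    "4567|222-555-1111|ABC",
    "1234|234-123-0000|DEF",
    "9999|123-456-1890|PQR",
    "8734|456-879-1234|QWE",
    "4567|369-258-0147|NNN",
    "1234|987-654-3210|BLS",
]
valid, invalid = split_duplicates(records)
```

Unlike the grouped Pig output, this sketch keeps both lists in input order, because it scans the records once and remembers which keys it has already seen.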


How to append columns dynamically to a .csv file?

Suppose we have the following csv file:
file1.csv
#groups id owner
abc id1 owner1
abc id2 owner1
bcx id1 owner2
cpa id3 owner1
The following script reads file1.csv, filters on the first column, #groups, and adds extra characters:
#!/bin/env python2
import re
import csv
print "enter Path to original file"
GROUPS = raw_input()
print "enter Path to modified file"
WORKING = raw_input()
def filter_lines(f):
    """this generator function uses a regular expression
    to include only lines that have a `abc` at the start
    and NO `gep` throughout the record
    """
    filter_regex = r'^abc(?!gep).*'
    for line in f:
        line = line.strip()
        m = re.match(filter_regex, line)
        if m:
            yield line
pat = re.compile(r'^(abc)(?!.*gep.*)') #insert gep in any abc records that don't have gep
#insert gep
variable1 = 0
with open(GROUPS, 'r') as f:
    with open(WORKING, 'w') as data:
        #next(f) # Skip over header in input file.
        #filter
        filter_generator = filter_lines(f)
        csv_reader = csv.reader(filter_generator)
        count = 0
        writer = csv.writer(data) #, quoting=csv.QUOTE_ALL
        for row in csv_reader:
            count += 1
            variable1 = (pat.sub('\\1gep_', row[0])) #modify all filtered records to include gep
            fields = [variable1]
            writer.writerow(fields)
print 'Filtered (abc at Start and NO gep) Rows Count = ' + str(count)
For example, abc would turn into abc_gep, and we would write that to another csv file, file2.csv.
So file2.csv now contains only:
abc_gep
abc_gep
Good.
Now I want to add the rest of the columns for the rows that match abc in file1.csv.
How could I do that?
I tried the following:
fields = [variable1,row[1],row[2]]
But this hardcodes the columns and is not dynamic. I am looking for something more like this:
fields = [variable1, row[i]]
Essentially, this is the result I'm seeking for file2.csv:
abc_gep id1 owner1
abc_gep id2 owner1
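One way to avoid hardcoding row[1], row[2] is list slicing: row[1:] is every column after the first, however many there are. The sketch below is written for Python 3 rather than the Python 2 of the script above, assumes the sample file is whitespace-separated, and uses the suffix _gep so that the result matches the desired abc_gep output (the original script substitutes '\\1gep_' instead). The function name rewrite_rows is hypothetical.

```python
import re

# Same idea as the pattern in the question: 'abc' at the start, no 'gep' later on.
PAT = re.compile(r'^(abc)(?!.*gep.*)')


def rewrite_rows(lines):
    """Yield [modified first column] + all remaining columns for matching rows."""
    for line in lines:
        row = line.split()  # assumption: columns are separated by whitespace
        if not row or not PAT.match(row[0]):
            continue  # skips the header, blank lines and non-abc groups
        first = PAT.sub(r'\1_gep', row[0])
        yield [first] + row[1:]  # row[1:] adapts to any number of columns


sample = [
    "#groups id owner",
    "abc id1 owner1",
    "abc id2 owner1",
    "bcx id1 owner2",
    "cpa id3 owner1",
]
result = list(rewrite_rows(sample))
```

Each yielded list can be passed straight to writer.writerow, so the same one-line change (fields = [variable1] + row[1:]) would apply inside the original loop.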

Handle thorn delimiter in pig

My source is a log file that uses "þ" as the delimiter. I am trying to read this file in Pig. These are the options I tried.
Option 1:
Using PigStorage("þ"). This doesn't work, as it can't handle Unicode characters.
Option 2:
I tried reading each line as a string and splitting it on "þ". This also doesn't work, because STRSPLIT left out the last field, which has "\n" at the end.
I can see multiple questions on the web about this, but I am unable to find a solution.
Kindly point me in the right direction.
Thorn Details :
http://www.fileformat.info/info/unicode/char/fe/index.htm
Is this the solution you are expecting?
input.txt:
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)'));
dump B;
Output:
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
Added 2nd option with different datatypes:
input.txt
helloþ1234þ1970-01-01T00:00:00.000+00:00þworld
helloþ4567þ1990-01-01T00:00:00.000+00:00þworld
helloþ8901þ2001-01-01T00:00:00.000+00:00þworld
helloþ9876þ2014-01-01T00:00:00.000+00:00þworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)')) as (f1:chararray,f2:long,f3:datetime,f4:chararray);
DUMP B;
DESCRIBE B;
Output:
(hello,1234,1970-01-01T00:00:00.000+00:00,world)
(hello,4567,1990-01-01T00:00:00.000+00:00,world)
(hello,8901,2001-01-01T00:00:00.000+00:00,world)
(hello,9876,2014-01-01T00:00:00.000+00:00,world)
B: {f1: chararray,f2: long,f3: datetime,f4: chararray}
Another thorn symbol A¾:
input.txt
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)A¾(.*)þ(.*)þ(.*)þ(.*)þ(.*)')) as (f1:long,f2:datetime,f3:datetime,f4:int,f5:double,f6:int);
DUMP B;
describe B;
Output:
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
B: {f1: long,f2: datetime,f3: datetime,f4: int,f5: double,f6: int}
This should work (replace the Unicode code point with the one that works for you; this one is for capital thorn):
A = LOAD 'input' USING
B = FOREACH A GENERATE STRSPLIT(f1, '\\u00DE', -1);
I don't see why the last field should be left out.
Somehow, this does not work:
A = LOAD 'input' USING PigStorage('\00DE');
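To sanity-check the expected fields for a sample line outside Pig, the split can be reproduced in a few lines of Python. This is only an illustrative sketch, not Pig code: it assumes the delimiter is lowercase thorn (U+00FE, as in the question's link) and strips the trailing newline before splitting so the last field is not lost or polluted. The name split_thorn is hypothetical.

```python
THORN = "\u00fe"  # lowercase thorn; use "\u00de" for capital thorn


def split_thorn(line, delimiter=THORN):
    """Split one log line on the thorn delimiter, keeping empty trailing fields."""
    # Remove only the line terminator, so a genuinely empty last field survives.
    return line.rstrip("\r\n").split(delimiter)


fields = split_thorn("hello\u00feworld\u00fehello\u00feworld\n")
```

Comparing this result with what STRSPLIT returns for the same line can help narrow down whether the problem is the delimiter encoding or the trailing newline.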

Why does the schema show "group" when I haven't done a "group"? (hadoop pig)

Here's an example from http://pig.apache.org/docs/r0.10.0/test.html#describe.
Why does the schema of A and B include "group"? I thought the schema would have "group" only after a GROUP command (as in C).
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY name matches 'J.+';
C = GROUP B BY name;
DESCRIBE A;
A: {group, B: (name: chararray,age: int,gpa: float}
DESCRIBE B;
B: {group, B: (name: chararray,age: int,gpa: float}
DESCRIBE C;
C: {group, chararry,B: (name: chararray,age: int,gpa: float}

Pig Latin - adding values from different bags?

I have one file max_rank.txt containing:
1,a
2,b
3,c
and second file max_rank_add.txt:
d
e
f
My expected result is:
1,a
2,b
3,c
4,d
5,e
6,f
So I want to generate a RANK for the second set of values, starting from a value greater than the max of the first set.
The beginning of the script probably looks like this:
existing = LOAD 'max_rank.txt' using PigStorage(',') AS (id: int, text : chararray);
new = LOAD 'max_rank_add.txt' using PigStorage() AS (text2 : chararray);
ordered = ORDER existing by id desc;
limited = LIMIT ordered 1;
new_rank = RANK new;
But I have a problem with the last, most important line, which adds the value from limited to rank_new from new_rank.
Can you please give any suggestions?
Regards
Pawel
I've found a solution.
Both scripts work:
rank_plus_max = foreach new_rank generate flatten(limited.$0 + rank_new), text2;
rank_plus_max = foreach new_rank generate limited.$0 + rank_new, text2;
This does NOT work:
rank_plus_max = foreach new_rank generate flatten(limited.$0) + flatten(rank_new);
2014-02-24 10:52:39,580 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 10, column 62> mismatched input '+' expecting SEMI_COLON
Details at logfile: /export/home/pig/pko/pig_1393234166538.log
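The arithmetic behind the working statements (new rank = max existing id + position in the new set) can be verified on the sample data with a short Python sketch. This only illustrates the intended result; it is not Pig code, and the function name continue_ranks is hypothetical.

```python
def continue_ranks(existing, new_texts):
    """Append new_texts to existing (rank, text) pairs, continuing after the max rank."""
    # default=0 covers the case where there are no existing rows yet.
    start = max((rank for rank, _ in existing), default=0)
    ranked_new = [(start + i, text) for i, text in enumerate(new_texts, start=1)]
    return list(existing) + ranked_new


existing = [(1, "a"), (2, "b"), (3, "c")]  # contents of max_rank.txt
combined = continue_ranks(existing, ["d", "e", "f"])  # contents of max_rank_add.txt
```

Here start plays the role of limited.$0 and the enumerate counter plays the role of rank_new.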

LASTMOVE loop is not working

My XML looks like this:
- <ItemMaster>
- <ItemMasterHeader>
+ <ItemID>
+ <ItemStatus>
+ <UserArea>
- <Classification Type="HOMOLOGATION CLASS">
- <Codes>
<Code>E</Code>
</Codes>
</Classification>
+ <Classification Type="LP">
+ <Classification>
- <Classification Type="BRAND">
- <Codes>
<Code>002</Code>
</Codes>
</Classification>
The full XML is here: http://www.speedyshare.com/MgCCA/download/ItemMaster-2.xml
I need to fetch the value of the Classification with attribute Type = "BRAND", but the code below only fetches the Classification with attribute Type = "HOMOLOGATION CLASS", which I don't want, since I am asking for "BRAND". I tried to apply LASTMOVE, but it doesn't work. Please tell me where I am going wrong.
I also have to fetch other values, such as the codes inside the Classification of type "LP".
DECLARE rResource REFERENCE TO InputRoot.XMLNSC.*:SyncItemMaster.*:DataArea.*:ItemMaster.*:ItemMasterHeader[1];
SET rowCnt = rowCnt+1;
DECLARE LineCount INTEGER 1;
WHILE LASTMOVE(rResource) = TRUE DO
SET OutputRoot.XMLNSC.root.row[rowCnt].product_Info.TyreBrandCd = THE (SELECT ITEM FIELDVALUE(T) FROM itemMaster.*:ItemMasterHeader[LineCount].*:Classification.*:Codes.*:Code AS T WHERE FIELDVALUE(itemMaster.*:ItemMasterHeader[LineCount].*:Classification.(XMLNSC.Attribute)Type) = 'BRAND');
SET LineCount = LineCount + 1;
MOVE rResource NEXTSIBLING REPEAT TYPE NAME;
END WHILE;
RETURN TRUE;
END;
Thanks
I tried the code suggested below.
Here are the trace logs:
2013-05-10 18:32:27.218385 7732 UserTrace BIP2537I: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': Executing statement ''SET temp = THE (SELECT T.Classification AS :Classification FROM myref AS T WHERE FIELDVALUE(T.Classification.(XMLNSC.Attribute)Type) = 'BRAND');'' at ('.WMB_9D1_PROD_SUB00_001.Main', '22.3').
2013-05-10 18:32:27.218393 7732 UserTrace BIP2538I: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': Evaluating expression ''THE (SELECT T.Classification AS :Classification FROM myref AS T WHERE FIELDVALUE(T.Classification.(XMLNSC.Attribute)Type) = 'BRAND')'' at ('.WMB_9D1_PROD_SUB00_001.Main', '22.14').
2013-05-10 18:32:27.218400 7732 UserTrace BIP2572W: Node: 'WMB_9D1_PROD_SUB00_001.9D1_PROD': ('.WMB_9D1_PROD_SUB00_001.Main', '22.14') : Finding one and only SELECT result.
2013-05-10 18:32:27.218427 7732 UserTrace BIP2539I: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': Evaluating expression ''myref'' at ('.WMB_9D1_PROD_SUB00_001.Main', '22.48'). This resolved to ''myref''. The result was ''ROW... Root Element Type=16777216 NameSpace='' Name='ItemMasterHeader' Value=NULL''.
2013-05-10 18:32:27.218437 7732 UserTrace BIP2539I: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': Evaluating expression ''XMLNSC.Attribute'' at ('.WMB_9D1_PROD_SUB00_001.Main', '22.94'). This resolved to ''XMLNSC.Attribute''. The result was ''1095266992384''.
2013-05-10 18:32:27.218446 7732 UserTrace BIP2540I: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': Finished evaluating expression ''FIELDVALUE(T.Classification.(XMLNSC.Attribute)Type)'' at ('.WMB_9D1_PROD_SUB00_001.Main', '22.65'). The result was '''HOMOLOGATION CLASS'''.
2013-05-10 18:32:27.218454 7732 UserTrace BIP2539I: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': Evaluating expression ''FIELDVALUE(T.Classification.(XMLNSC.Attribute)Type) = 'BRAND''' at ('.WMB_9D1_PROD_SUB00_001.Main', '22.117'). This resolved to '''HOMOLOGATION CLASS' = 'BRAND'''. The result was ''FALSE''.
2013-05-10 18:32:27.218461 7732 UserTrace BIP2569W: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': ('.WMB_9D1_PROD_SUB00_001.Main', '22.14') : WHERE clause evaluated to false or unknown. Iterating FROM clause.
2013-05-10 18:32:27.218469 7732 UserTrace BIP2570W: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': ('.WMB_9D1_PROD_SUB00_001.Main', '22.14') : There were no items in the FROM clause satisfying the WHERE clause.
2013-05-10 18:32:27.218503 7732 UserTrace BIP2567I: Node 'WMB_9D1_PROD_SUB00_001.9D1_PROD': Assigning NULL to ''temp'', thus deleting it.
Try this:
DECLARE temp ROW;
SET temp = THE (SELECT T.Classification FROM rResource AS T WHERE FIELDVALUE(T.Classification.(XMLNSC.Attribute)Type) = 'BRAND');
SET OutputRoot.XMLNSC.root.row[rowCnt].product_Info.TyreBrandCd = temp.code;
I'm not sure what kind of mapping you're looking for. Assuming that you want the (unique) 'Code' from the 'Classification' with the right attribute in each 'ItemMasterHeader' to appear in the output inside separate 'row' folders, here's the code:
CREATE PROCEDURE ExtractTyreCodes() BEGIN
    DECLARE rOutput REFERENCE TO OutputRoot;
    DECLARE rResource REFERENCE TO InputRoot.XMLNSC.*:SyncItemMaster.*:DataArea.*:ItemMaster;
    CREATE FIELD OutputRoot.XMLNSC.root AS rOutput;
    IF LASTMOVE(rResource) THEN
        SET rOutput.row[] = SELECT
            THE(SELECT C.*:Codes.*:Code AS TyreBrand
                FROM T.*:Classification[] AS C
                WHERE C.(XMLNSC.Attribute)Type = 'BRAND') AS product_Info
            FROM rResource.*:ItemMasterHeader[] AS T;
    END IF;
END;
Starting from this message:
<SyncItemMaster>
<DataArea>
<ItemMaster>
<ItemMasterHeader>
<ItemID/>
<ItemStatus/>
<UserArea/>
<Classification Type="HOMOLOGATION CLASS">
<Codes>
<Code>E</Code>
</Codes>
</Classification>
<Classification Type="LP"/>
<Classification/>
<Classification Type="BRAND">
<Codes>
<Code>002</Code>
</Codes>
</Classification>
</ItemMasterHeader>
<ItemMasterHeader>
<ItemID/>
<ItemStatus/>
<UserArea/>
<Classification Type="HOMOLOGATION CLASS">
<Codes>
<Code>F</Code>
</Codes>
</Classification>
<Classification Type="LP"/>
<Classification/>
<Classification Type="BRAND">
<Codes>
<Code>005</Code>
</Codes>
</Classification>
</ItemMasterHeader>
</ItemMaster>
</DataArea>
</SyncItemMaster>
You get this message:
<root>
<row>
<product_Info>
<TyreBrand>002</TyreBrand>
</product_Info>
</row>
<row>
<product_Info>
<TyreBrand>005</TyreBrand>
</product_Info>
</row>
</root>
This generates a 'row' folder for each 'ItemMasterHeader', puts a 'product_Info' folder inside each one, and inside that puts the code from the 'Classification' whose 'Type' attribute is 'BRAND'.
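To confirm which codes the selection should return for a given message before debugging the ESQL, the same rule (per ItemMasterHeader, take the Code of the Classification whose Type is BRAND) can be run in Python with the standard xml.etree.ElementTree module. This is only a sketch for cross-checking: it assumes the message has no XML namespaces (the ESQL uses *: wildcards, so the real message may have them), and the names brand_codes and SAMPLE are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the sample message above (no namespaces assumed).
SAMPLE = """
<SyncItemMaster><DataArea><ItemMaster>
  <ItemMasterHeader>
    <Classification Type="HOMOLOGATION CLASS"><Codes><Code>E</Code></Codes></Classification>
    <Classification Type="LP"/>
    <Classification/>
    <Classification Type="BRAND"><Codes><Code>002</Code></Codes></Classification>
  </ItemMasterHeader>
  <ItemMasterHeader>
    <Classification Type="HOMOLOGATION CLASS"><Codes><Code>F</Code></Codes></Classification>
    <Classification Type="BRAND"><Codes><Code>005</Code></Codes></Classification>
  </ItemMasterHeader>
</ItemMaster></DataArea></SyncItemMaster>
""".strip()


def brand_codes(xml_text, wanted_type="BRAND"):
    """Return one code (or None) per ItemMasterHeader for the wanted Type."""
    root = ET.fromstring(xml_text)
    codes = []
    for header in root.iter("ItemMasterHeader"):
        code = None
        # Check every Classification sibling, not just the first one.
        for classification in header.findall("Classification"):
            if classification.get("Type") == wanted_type:
                code = classification.findtext("Codes/Code")
                break
        codes.append(code)
    return codes


codes = brand_codes(SAMPLE)
```

The inner loop is the important part: the trace in the question shows the WHERE clause being evaluated only against the first Classification, whereas iterating over all Classification elements is what finds the BRAND entry.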
Hope this helps. Regards,
