I have a .TSV file containing data in HDFS and i am not able to load it into Pig.
The command i am using is "
A = load 'file_location' as (name:chararray, age:int, gpa:float);
B = foreach A generate (name, age);
DUMP B;
Error returned : Unable to find operator for alias A
If you do not specify the delimiter PIG uses default ',' as the delimiter for loading the file.Hence your load statement is failing.You have to explicitly specify the delimiter '\t'.
A = LOAD 'file_location' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
Do it like this
A = load 'path/of/file' using PigStorage('\t') AS (name:chararray,age:int,gpa:float);
B = foreach A generate name, age;
DUMP B;
ps: I don't think there is any fault with your commands.As tab (\t) is default delimiter for pig . I am getting the correct output with your commands .Can you please send me logs or screenshot of your terminal.
I have several CSV files in a HDFS folder which I load to a relation with:
source = LOAD '$data' USING PigStorage(','); --the $data is a passed as a parameter to the pig command.
When I dump it, the structure of the source relation is as follows: (note that the data is text qualified but I will deal with that using the REPLACE function)
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header which provides some information about the data set that follows it such as the provider of the data and the date range it covers.
So now, how can I transform the above structure and create a new relation like the following ?:
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Where each header tuple is followed by a bag of record tuples belonging to that header ?.
Unfortunately there is no common key field between the header and the detail rows, so I don't think cant use any JOIN operation. ?
I am quite new to Pig and Hadoop and this is one of the first concept projects that I am engaging in.
Hope my question is clear and look forward to some guidance here.
This should get you started.
Code:
Source = LOAD '$data' USING PigStorage(',','-tagFile');
A = SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE;
B = GROUP FileData BY $0;
C = GROUP FileHeaders BY $0;
D = JOIN B BY Group, C BY Group;
...
I'm having an interesting behaviour with PigStorage and its -tagPath option, where I do not know if I am doing something wrong (wrong schema definition?) or if this is a limitation/bug in Pig.
My file looks like this (the most basic, I was able to come up with):
A
B
Now I can load and subselect this file like this fine:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';') AS (char: chararray);
DUMP vals
one_column = FOREACH vals GENERATE char;
DUMP one_column
Results in:
(A)
(B)
(A)
(B)
However, when I try to fetch the filepath with -tagPath (I need it when I access a whole folder of data), the data gets loaded correctly into the first variable, but I cannot subselect a column from it.
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath')
AS (filepath: chararray, char: chararray);
DUMP vals
one_column = FOREACH vals GENERATE char;
DUMP one_column
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
However, when I first read the data without schema and then add a schema using FOREACH it works fine again:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath');
vals_n = FOREACH vals GENERATE (chararray)$0 AS filepath, (chararray)$1 AS char;
DUMP vals_n
one_column = FOREACH vals GENERATE char;
DUMP one_column
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(A)
(B)
So is there any way, I can use -tagPath and schema in the LOAD phase at the same time?
This happens, because pig tries to find out automatically which columns are being used in the script and only load those. When we use -tagFile or -tagPath, it seems this gets confused.
The solution is to run the pig script without this column detection:
pig -x mapreduce -t ColumnMapKeyPrune
Hi stackoverflow community;
i'm totally new to pig, i want to STORE the result in a text file and name it as i want. is it possible do this using STORE function.
My code:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO ‘myoutput’;
Thanks.
Yes you will be able to store your result in myoutput.txt and you can load the data into file with any delimiter you want using PigStorage.
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO ‘myoutput.txt’ using PigStorage(';');
Yes, it is possible. b will store every row into 25 different columns - $0 to S25.
I am trying to use PIG to read data from HDFS where the files contain rows that look like:
"key1"="value1", "key2"="value2", "key3"="value3"
"key1"="value10", "key3"="value30"
In a way the rows of the data are essentially dictionaries:
{"key1":"value1", "key2":"value2", "key3":"value3"}
{"key1":"value10", "key3":"value30"}
I can read and dump portion of this data easily enough with something like:
data = LOAD '/hdfslocation/weirdformat*' as PigStorage(',');
sampled = SAMPLE data 0.00001;
dump sampled;
My problem is that I can't parse it efficiently. I have tried to use
org.apache.pig.piggybank.storage.MyRegExLoader
but it seems extremely slow.
Could someone recommend a different approach?
Seems like one way is to use a python UDF.
This solution is heavily inspired from bag-to-tuple
In myudfs.py write:
#!/usr/bin/python
def FieldPairsGenerator(dataline):
for x in dataline.split(','):
k,v = x.split('=')
yield (k.strip().strip('"'),v.strip().strip('"'))
#outputSchema("foo:map[]")
def KVDataToDict(dataline):
return dict( kvp for kvp in FieldPairsGenerator(dataline) )
then write the following Pig script:
REGISTER 'myudfs.py' USING jython AS myfuncs;
data = LOAD 'whereyourdatais*.gz' AS (foo:chararray);
A = FOREACH data GENERATE myfuncs.KVDataToDict(foo);
A now has the data stored as a PigMap