Pig: Unable to Load BAG - hadoop

I have records in this format:
{(Larry Page),23,M}
{(Suman Dey),22,M}
{(Palani Pratap),25,M}
I am trying to LOAD the record using this:
records = LOAD '~/Documents/PigBag.txt' AS (details:BAG{name:tuple(fullname:chararray),age:int,gender:chararray});
But I am getting this error:
2015-02-04 20:09:41,556 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 101> mismatched input ',' expecting RIGHT_CURLY
Please advise.

It's not a bag since it's not made up of tuples. Try
load ... as (name:tuple(fullname:chararray), age:int, gender:chararray)
For some reason Pig wraps each output line in curly braces, which makes it look like a bag, but it is not one. If you saved this data using PigStorage, you can store it with the '-schema' option, which tells PigStorage to write a schema file (.pig_schema) alongside the data; you can look at that file to see the saved schema. PigStorage can also use it when loading, which saves you the AS clause.

Yes, LiMuBei's point is absolutely right. Your input is not in the right format. Pig expects a bag to hold a collection of tuples, but in your case it holds a tuple followed by plain fields. In this case Pig will retain the tuple and reject the fields (age and gender) during load.
But this problem can be solved with a different (somewhat hacky) approach:
1. Load each input line as chararray.
2. Remove the curly braces and parentheses from the input.
3. Use the STRSPLIT function to split the input into (name,age,sex) fields.
PigScript:
A = LOAD 'input' USING PigStorage AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[}{)(]+','')) AS (newline:chararray);
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',3)) AS (fullname:chararray,age:int,sex:chararray);
DUMP C;
Output:
(Larry Page,23,M)
(Suman Dey,22,M)
(Palani Pratap,25,M)
Now you can access all the fields using fullname,age,sex.
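To check the regular expression and the split limit outside Pig, the same two steps can be sketched in plain Python. This is only an illustration, not Pig code: the function name parse_record is made up for this example, and it assumes every line has exactly the three fields shown in the question.

```python
import re


def parse_record(line):
    """Mimic the REPLACE + STRSPLIT steps from the Pig script on one line."""
    # Same character class as the REPLACE call: drop braces and parentheses.
    cleaned = re.sub(r"[}{)(]+", "", line.strip())
    # STRSPLIT(newline, ',', 3) yields at most three pieces;
    # str.split with maxsplit=2 does the same.
    fullname, age, sex = cleaned.split(",", 2)
    return fullname, int(age), sex


if __name__ == "__main__":
    for raw in ["{(Larry Page),23,M}", "{(Suman Dey),22,M}", "{(Palani Pratap),25,M}"]:
        print(parse_record(raw))
```

A line with fewer than three comma-separated parts would raise a ValueError here, whereas Pig would typically produce nulls, so treat this only as a way to sanity-check the pattern.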

Related

Dump is not working

I am using IBM BigInsights.
When I execute the DUMP command in the Pig Grunt shell, I do not get any result.
Sample Input file:
s_no,name,DOB,mobile_no,email_id,country_code,sex,disease,age
11111,bbb1,12-10-1950,1234567890,bbb1#xxx.com,1111111111,M,Diabetes,78
11112,bbb2,12-10-1984,1234567890,bbb2#xxx.com,1111111111,F,PCOS,67
11113,bbb3,712/11/1940,1234567890,bbb3#xxx.com,1111111111,M,Fever,90
11114,bbb4,12-12-1950,1234567890,bbb4#xxx.com,1111111111,F,Cold,88
11115,bbb5,12/13/1960,1234567890,bbb5#xxx.com,1111111111,M,Blood Pressure,76
INFO [JobControl] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
My code is as follows:
A = LOAD 'healthcare_Sample_dataset1.csv' as(s_no:long,name:chararray,DOB:datetime,mobile_no:long,email_id:chararray,country_code:long,sex:chararray,disease:chararray,age:int);
B = FOREACH A GENERATE name;
C = LIMIT B 5;
DUMP C;
Kindly help me to resolve this.
Thanks and Regards!!!

From your script I can see that you are using a CSV file. If you are working with a CSV file, you should use CSVLoader() in your Pig script. Your script should look like this:
-- Register the piggybank jar, which contains the CSVLoader UDF
REGISTER piggybank.jar;
-- Define the UDF
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
-- Load data using CSVLoader
A = load '/user/biadmin/test/CBTTickets.csv' using CSVLoader AS (
Type:chararray,
Id:int,
Summary:chararray,
OwnedBy:chararray,
Status:chararray,
Priority:chararray,
Severity:chararray,
ModifiedDate:datetime,
PlannedFor:chararray,
TimeSpent:int);
B = FOREACH A GENERATE Type;
C = LIMIT B 5;
DUMP C;
Please provide your input data if this does not work for you.

You have not given the full path of healthcare_Sample_dataset1.csv; that is why DUMP is not working properly.
Load the data using the full path of that file, and then DUMP will work.

I think you need to load all fields as bytearray and then remove the first row (i.e. the header), because the header values don't match the data types you want to impose on those fields.
OR
remove the first row using a text editor and use your own code.
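The header-removal idea from the last answer can be sketched outside Pig in plain Python. This is an illustration only: the function name names_without_header is made up, and SAMPLE just repeats the header and the first two rows of the sample input above.

```python
import csv
import io

# Header plus the first two data rows from the question's sample file.
SAMPLE = """s_no,name,DOB,mobile_no,email_id,country_code,sex,disease,age
11111,bbb1,12-10-1950,1234567890,bbb1#xxx.com,1111111111,M,Diabetes,78
11112,bbb2,12-10-1984,1234567890,bbb2#xxx.com,1111111111,F,PCOS,67
"""


def names_without_header(text, limit=5):
    """Skip the header row, then project the 'name' column (like B and C)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # first row holds column names, not data
    name_idx = header.index("name")
    names = [row[name_idx] for row in reader if row]
    return names[:limit]


if __name__ == "__main__":
    print(names_without_header(SAMPLE))
```

The point is that the first row never reaches the typed fields: it is consumed as column names before any value is interpreted as a number or a date.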

DEFINE statement in Apache Pig

I parsed a JSON input:
--Load Json
loadJson = LOAD '$inputJson' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (json:map []);
'$inputJson' is the file containing the JSON structure.
Then I parsed the JSON to get some parameters for my Pig job:
--PARSING JSON
--Source : Is the input file I need to process in Pig job
a = FOREACH loadJson GENERATE json#'source' AS ParsedInput;
and I stored the result in "a". "a" contains the input path, i.e. /home/sree/foo.txt
Now I need to load that data into a bag. Normally I would use a plain load statement:
inputdata = LOAD '/home/sree/foo.txt';
Instead of this I have to do
inputdata = LOAD a;
This is what I am trying to achieve.
So far, what I have tried is to use define:
--Source
a = FOREACH loadJson GENERATE json#'source' AS ParsedInput;
-- define a global constant for storage
define myIn "a";
--Load data
inputdata = LOAD "$myIn" ;
dump data;
But it fails with: Unexpected internal error. Undefined parameter : a
How can I load that file?

As far as I know, Pig does not allow a relation to be used in a DEFINE statement.
See:
http://pig.apache.org/docs/r0.10.0/basic.html#define-udfs

How to split a text file which is having '\t' and ',' values in Pig

I want to convert a text file that has both tab- and comma-separated values into a fully comma-separated file in Pig. I am using Apache Pig version 0.11.1. I have tried the following code, and also tried FLATTEN and TOKENIZE, but I cannot turn it into a fully CSV file.
a = load '/home/mansoor/Documents/ip.txt' using PigStorage(',') as (key:chararray, val1:chararray, val2:chararray );
b = FOREACH a {
key= STRSPLIT(key,'\t');
GENERATE key;
}
Following is my text file input:
M12345 M123456,M234567,M987653
M23456 M23456,M123456,M234567
M34567 M234567,M765678,M987643
I need a fully comma-separated file like the following output:
M12345,M123456,M234567,M987653
M23456,M23456,M123456,M234567
M34567,M234567,M765678,M987643
How can I do this?

With Pig 0.13, simply using load without PigStorage loaded the CSV correctly.
a = load '/home/mansoor/Documents/ip.txt';
dump a;
gives me
(M12345,M123456,M234567,M987653)
(M23456,M23456,M123456,M234567)
(M34567,M234567,M765678,M987643 )
If that's not what you want, you might want to consider the REPLACE function.
Here is a quick and dirty way to produce a usable CSV:
a = load '/home/mansoor/Documents/ip.txt' using PigStorage('\n');
b = foreach a generate FLATTEN(REPLACE($0, '\t', ','));
store b into 'tmp.csv';
You can then use the CSV as intended:
c = load 'tmp.csv' using PigStorage(',') as (key:chararray, val1:chararray, val2:chararray, val3:chararray);
describe c;
gives c: {key: chararray,val1: chararray,val2: chararray,val3: chararray}

Try this:
a = load '/home/mansoor/Documents/ip.txt';
store a into '/home/mansoor/Documents/op' using PigStorage(',');
Now the file is fully converted into a CSV file.
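The conversion the answers above aim for is a plain tab-to-comma substitution on each line. A minimal Python sketch of it, outside Pig and with made-up function names, assuming the first separator on each input line is a tab:

```python
def tabs_to_commas(line):
    """Replace every tab with a comma, like REPLACE($0, '\\t', ',')."""
    return line.rstrip("\n").replace("\t", ",")


def convert(lines):
    """Apply the substitution to each input line."""
    return [tabs_to_commas(line) for line in lines]


if __name__ == "__main__":
    rows = [
        "M12345\tM123456,M234567,M987653",
        "M23456\tM23456,M123456,M234567",
    ]
    for row in convert(rows):
        print(row)
```

This can be useful for checking the expected output of the Pig job on a few sample lines before running it on the full file.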

Using MultiStorage to store records in separate files

I'm trying to store a set of records like these:
2342514224232 | some text here whatever
2342514224234| some more text here whatever
....
into separate files in the output folder like this:
output / 2342514224232
output / 2342514224234
The value of idstr should be the file name and the text should be inside the file. Here's my Pig code:
REGISTER /home/bytebiscuit/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
A = LOAD 'cleantweets.csv' using PigStorage(',') AS (idstr:chararray, createdat:chararray, text:chararray,followers:int,friends:int,language:chararray,city:chararray,country:chararray,lat:chararray,lon:chararray);
B = FOREACH A GENERATE idstr, text, language, country;
C = FILTER B BY (country == 'United States' OR country == 'United Kingdom') AND language == 'en';
texts = FOREACH C GENERATE idstr,text;
STORE texts INTO 'output/query_results_one' USING org.apache.pig.piggybank.storage.MultiStorage('output/query_results_one', '0');
Running this pig script gives me the following error:
<file pigquery1.pig, line 12, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.MultiStorage' with arguments '[output/query_results_one, idstr]'
Any help is really appreciated!

Try this option:
MultiStorage('output/query_results_one', '0', 'none', ',');

In case anybody stumbles across this post like I did: the problem for me was that my Pig script looked like this:
DEFINE MultiStorage org.apache.pig.piggybank.storage.MultiStorage();
...
STORE stuff INTO 's3:/...' USING MultiStorage('s3:/...','0','none',',');
The DEFINE statement incorrectly omitted the arguments. Dropping the DEFINE statement and using the fully qualified class name directly, as follows, fixed my problem:
STORE stuff INTO 's3:/...' USING org.apache.pig.piggybank.storage.MultiStorage('s3:/...','0','none',',');
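To make the role of the '0' argument concrete, here is a rough Python sketch of the grouping idea. It assumes, as the calls above suggest, that the second argument is the index of the field whose value selects the output location and that the fourth is the field delimiter. The function name split_by_field is made up, and the sketch only groups records in memory instead of writing files.

```python
from collections import defaultdict


def split_by_field(records, split_index=0, delimiter=","):
    """Group records by the value of one field, keyed like an output name."""
    groups = defaultdict(list)
    for fields in records:
        key = str(fields[split_index])
        # Each stored line is the record's fields joined by the delimiter.
        groups[key].append(delimiter.join(str(f) for f in fields))
    return dict(groups)


if __name__ == "__main__":
    texts = [
        ("2342514224232", "some text here whatever"),
        ("2342514224234", "some more text here whatever"),
    ]
    for name, lines in split_by_field(texts).items():
        print(name, lines)
```

With split_index=0, each distinct idstr ends up with its own group, which corresponds to the one-output-per-idstr layout the question asks for.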

Pass an array argument to custom pig loader

I wrote a LoadFunc that lets me select given keywords from a huge unstructured log file. How do I pass a tuple into my function as an argument?
Something like
A = load '/input/*' using MyLoader('keyword1','keyword2');
or
A = load '/input/*' using MyLoader( ('keyword1','keyword2') );
causes errors:
grunt> a = LOAD '/input/*' USING MyLoader( ('keyword1','keyword2') );
2012-08-28 19:44:04,331 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 3, column 36> mismatched input '(' expecting RIGHT_PAREN
Details at logfile: /home/hadoop/pig-0.10.0/pig_1346159261142.log

In practice, a Pig LoadFunc can only accept String parameters for its constructor. See http://mail-archives.apache.org/mod_mbox/pig-user/201302.mbox/%3CCAO8ATY27UOdcgSjdh19F=iHsnFEAwmzedWbsnZ66sNvcsjfgog#mail.gmail.com%3E.
For your purposes, I would pass the keywords as a single comma-separated String to your LoadFunc and then parse it within the LoadFunc's constructor.
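The parsing step that the constructor would perform can be sketched as follows. This is a language-neutral illustration written in Python; a real LoadFunc constructor would be Java, and the function name parse_keywords is made up for this example. It assumes keywords themselves never contain commas.

```python
def parse_keywords(arg):
    """Split one comma-separated string into a clean list of keywords."""
    # Trim whitespace around each piece and drop empty pieces.
    return [kw.strip() for kw in arg.split(",") if kw.strip()]


if __name__ == "__main__":
    print(parse_keywords("keyword1,keyword2"))
```

On the Pig side the call would then take a single string argument, for example MyLoader('keyword1,keyword2'), instead of a tuple.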
