DEFINE statement in Apache Pig - hadoop

I parsed json input
--Load Json
loadJson = LOAD '$inputJson' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (json:map []);
'$inputJson' is the file containing the JSON structure.
Then I parsed the JSON to get some parameters for my Pig job:
--PARSING JSON
--Source : Is the input file I need to process in Pig job
a = FOREACH loadJson GENERATE json#'source' AS ParsedInput;
I stored the result in "a", so "a" contains the input path, i.e. /home/sree/foo.txt.
Now I need to load that data into a bag. Normally I would use a plain LOAD statement:
inputdata = LOAD "/home/sree/foo.txt";
Instead of this, I have to do:
inputdata = LOAD a;
This is what I am trying to achieve.
So far I have tried using DEFINE:
--Source
a = FOREACH loadJson GENERATE json#'source' AS ParsedInput;
-- define a global constant for storage
define myIn "a";
--Load data
inputdata = LOAD "$myIn" ;
dump data;
But it shows: Unexpected internal error. Undefined parameter : a
How can I load that file?

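As far as I know, Pig does not allow a relation to be used in a DEFINE statement. DEFINE only creates aliases for UDFs, streaming commands, and macros, and a reference such as $myIn is resolved by parameter substitution before the script runs, so it cannot pick up a value that a relation computes at run time.
See: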
http://pig.apache.org/docs/r0.10.0/basic.html#define-udfs

Related

Dump is not working

I am using IBM BigInsights.
When I execute the DUMP command in Pig Grunt shell, I am not getting any result.
Sample Input file:
s_no,name,DOB,mobile_no,email_id,country_code,sex,disease,age
11111,bbb1,12-10-1950,1234567890,bbb1#xxx.com,1111111111,M,Diabetes,78
11112,bbb2,12-10-1984,1234567890,bbb2#xxx.com,1111111111,F,PCOS,67
11113,bbb3,712/11/1940,1234567890,bbb3#xxx.com,1111111111,M,Fever,90
11114,bbb4,12-12-1950,1234567890,bbb4#xxx.com,1111111111,F,Cold,88
11115,bbb5,12/13/1960,1234567890,bbb5#xxx.com,1111111111,M,Blood Pressure,76
INFO [JobControl] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
My code is as follows:
A = LOAD 'healthcare_Sample_dataset1.csv' as(s_no:long,name:chararray,DOB:datetime,mobile_no:long,email_id:chararray,country_code:long,sex:chararray,disease:chararray,age:int);
B = FOREACH A GENERATE name;
C = LIMIT B 5;
DUMP C;
Kindly help me to resolve this.
Thanks and Regards!!!
From your script I can see that you are loading a CSV file. When working with a CSV file you should use CSVLoader() in your Pig script. Your script should look like this:
--Register piggybank jar which contains UDF of CSVLoader
REGISTER piggybank.jar;
-- Define the UDF
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
--Load data using CSVLoader
A = load '/user/biadmin/test/CBTTickets.csv' using CSVLoader AS (
Type:chararray,
Id:int,
Summary:chararray,
OwnedBy:chararray,
Status:chararray,
Priority:chararray,
Severity:chararray,
ModifiedDate:datetime,
PlannedFor:chararray,
TimeSpent:int);
B = FOREACH A GENERATE Type;
C = LIMIT B 5;
DUMP C;
If it does not work for you, please provide your input data.
You have not given the full path of healthcare_Sample_dataset1.csv; that is why DUMP is not working properly.
Load the data using the full path of the file, and then DUMP will work.
I think you need to load all fields as bytearray and then remove the first row (i.e. the header), because the header values do not match the data types you want to impose on those fields.
OR
Remove the first row using a text editor and use your own code.
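The header problem described above can be illustrated outside Pig. This is a minimal Python sketch, not Pig code: the helper to_long is hypothetical, and it assumes that a value which cannot be cast to the declared type ends up as null (None here). It shows only that the header row does not fit a numeric column, while the data rows do.

```python
def to_long(field):
    """Mimic a typed load: return an int, or None when the text is not a number."""
    try:
        return int(field)
    except ValueError:
        return None


lines = [
    "s_no,name,DOB,mobile_no,email_id,country_code,sex,disease,age",
    "11111,bbb1,12-10-1950,1234567890,bbb1#xxx.com,1111111111,M,Diabetes,78",
]

# Cast the first column of every line, header included.
s_no_values = [to_long(line.split(",")[0]) for line in lines]

# Drop the header row before casting, as the answer suggests.
data_only = [to_long(line.split(",")[0]) for line in lines[1:]]

print(s_no_values)
print(data_only)
```

Note that the sketch splits on commas explicitly; the header row is the only line whose first column cannot be read as a number.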

Pig: Unable to Load BAG

I have a record in this format:
{(Larry Page),23,M}
{(Suman Dey),22,M}
{(Palani Pratap),25,M}
I am trying to LOAD the record using this:
records = LOAD '~/Documents/PigBag.txt' AS (details:BAG{name:tuple(fullname:chararray),age:int,gender:chararray});
But I am getting this error:
2015-02-04 20:09:41,556 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 101> mismatched input ',' expecting RIGHT_CURLY
Please advise.
It's not a bag since it's not made up of tuples. Try
load ... as (name:tuple(fullname:chararray), age:int, gender:chararray)
For some reason Pig wraps each output line in curly braces, which makes it look like a bag, but it is not one. If you saved this data using PigStorage, you can pass the '-schema' option, which tells PigStorage to write a schema file (.pig_schema) that you can inspect to see the saved schema. The same file is also used when loading with PigStorage, which saves you the AS clause.
Yes, LiMuBei's point is absolutely right. Your input is not in the right format. Pig always expects a bag to hold a collection of tuples, but in your case it is a collection of a tuple and plain fields. In this case Pig will retain the tuple and reject the fields (age and gender) during load.
But this problem can easily be solved with a different approach (a somewhat hacky solution):
1. Load each input line as a chararray.
2. Remove the curly braces and parentheses from the input.
3. Use the STRSPLIT function to split the input into (name,age,sex) fields.
PigScript:
A = LOAD 'input' USING PigStorage AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[}{)(]+','')) AS (newline:chararray);
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',3)) AS (fullname:chararray,age:int,sex:chararray);
DUMP C;
Output:
(Larry Page,23,M)
(Suman Dey,22,M)
(Palani Pratap,25,M)
Now you can access all the fields using fullname,age,sex.
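The three steps above can also be checked outside Pig. This is a minimal Python sketch of the same idea, not the Pig implementation: parse_record is a hypothetical helper, the regular expression is the one used in the REPLACE call, and a split into at most three pieces stands in for STRSPLIT with a limit of 3.

```python
import re


def parse_record(line):
    """Strip braces and parentheses, then split into (fullname, age, sex)."""
    # Same character class as the REPLACE call in the Pig script.
    cleaned = re.sub(r"[}{)(]+", "", line)
    # maxsplit=2 yields at most three pieces, like a split limit of 3.
    fullname, age, sex = cleaned.split(",", 2)
    return fullname, int(age), sex


records = [
    "{(Larry Page),23,M}",
    "{(Suman Dey),22,M}",
    "{(Palani Pratap),25,M}",
]

parsed = [parse_record(r) for r in records]
for row in parsed:
    print(row)
```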

How to split a text file which is having '\t' and ',' values in Pig

I want to convert a text file containing both tab-separated and comma-separated values into a fully comma-separated file in Pig. I am using Apache Pig version 0.11.1. I have tried the following code, and I have also tried FLATTEN and TOKENIZE, but I cannot turn it into a fully CSV file.
a = load '/home/mansoor/Documents/ip.txt' using PigStorage(',') as (key:chararray, val1:chararray, val2:chararray );
b = FOREACH a {
key= STRSPLIT(key,'\t');
GENERATE key;
}
Following is my text file input:
M12345 M123456,M234567,M987653
M23456 M23456,M123456,M234567
M34567 M234567,M765678,M987643
I need a fully comma-separated file like the following output:
M12345,M123456,M234567,M987653
M23456,M23456,M123456,M234567
M34567,M234567,M765678,M987643
How can I do this?
With Pig 0.13, simply using LOAD without PigStorage loaded the file correctly.
a = load '/home/mansoor/Documents/ip.txt';
dump a
gives me
(M12345,M123456,M234567,M987653)
(M23456,M23456,M123456,M234567)
(M34567,M234567,M765678,M987643 )
If that's not what you want, you might want to consider the REPLACE function.
Here is a quick and dirty solution to produce a usable CSV:
a = load '/home/mansoor/Documents/ip.txt' using PigStorage('\n');
b = foreach a generate FLATTEN(REPLACE($0, '\t', ','));
store b into 'tmp.csv';
You can then use the CSV as intended:
c = load 'tmp.csv' using PigStorage(',') as (key:chararray, val1:chararray, val2:chararray, val3:chararray);
describe c
gives c: {key: chararray,val1: chararray,val2: chararray, val3:chararray}
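The effect of the REPLACE step can be sketched in Python. This is only an illustration, not Pig code: to_csv_line is a hypothetical helper, and it assumes that the key and the value list in each input line are separated by a single tab, as the question states.

```python
def to_csv_line(line):
    """Turn the tab between the key and the value list into a comma."""
    return line.replace("\t", ",")


rows = [
    "M12345\tM123456,M234567,M987653",
    "M23456\tM23456,M123456,M234567",
    "M34567\tM234567,M765678,M987643",
]

converted = [to_csv_line(r) for r in rows]
for line in converted:
    print(line)
```

After the replacement every line has four comma-separated fields, which is why the second LOAD declares key, val1, val2 and val3.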
Try this,
a = load '/home/mansoor/Documents/ip.txt';
store a into '/home/mansoor/Documents/op' using PigStorage(',');
Now the file is fully converted into a CSV file.

using MultiStorage to store records in separate files

I'm trying to store a set of records like these:
2342514224232 | some text here whatever
2342514224234| some more text here whatever
....
into separate files in the output folder like this:
output / 2342514224232
output / 2342514224234
The value of idstr should be the file name, and the text should be inside the file. Here's my Pig code:
REGISTER /home/bytebiscuit/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
A = LOAD 'cleantweets.csv' using PigStorage(',') AS (idstr:chararray, createdat:chararray, text:chararray,followers:int,friends:int,language:chararray,city:chararray,country:chararray,lat:chararray,lon:chararray);
B = FOREACH A GENERATE idstr, text, language, country;
C = FILTER B BY (country == 'United States' OR country == 'United Kingdom') AND language == 'en';
texts = FOREACH C GENERATE idstr,text;
STORE texts INTO 'output/query_results_one' USING org.apache.pig.piggybank.storage.MultiStorage('output/query_results_one', '0');
Running this pig script gives me the following error:
<file pigquery1.pig, line 12, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.MultiStorage' with arguments '[output/query_results_one, idstr]'
Any help is really appreciated!
Try this option:
MultiStorage('output/query_results_one', '0', 'none', ',');
In case anybody stumbles across this post like I did, the problem for me was that my pig script looked like:
DEFINE MultiStorage org.apache.pig.piggybank.storage.MultiStorage();
...
STORE stuff INTO 's3:/...' USING MultiStorage('s3:/...','0','none',',');
The DEFINE statement was incorrect because it did not specify the inputs/outputs (it declared MultiStorage with no arguments). Forgoing the DEFINE statement and putting the following directly in the script fixed my problem:
STORE stuff INTO 's3:/...' USING org.apache.pig.piggybank.storage.MultiStorage('s3:/...','0','none',',');

PIG Loading a CSV - Map Type Error

We aim to leverage PIG for large-scale analysis of our server logs. I need to load a PIG map datatype from a file.
I tried running a sample PIG script with the following data.
A line in my CSV file, named 'test' (to be processed by PIG) looks like,
151364,[ref#R813,highway#secondary]
My PIG Script
a = LOAD 'test' using PigStorage(',') AS (id:INT, m:MAP[]);
DUMP a;
The idea is to load an int and the second element as a hashmap.
However, when I dump, the int field gets parsed correctly (and is printed in the dump), but the map field is not parsed, resulting in a parsing error.
Can someone please explain if I am missing something?
I think there is a delimiter-related problem (for example, the field delimiter is somehow affecting the parsing of the map field, or it is being confused with the map delimiter).
When this input data is used (notice I used semicolon as field-delimiter):
151364;[ref#R813,highway#secondary]
below is the output from my grunt shell:
grunt> a = LOAD '/tmp/temp2.txt' using PigStorage(';') AS (id:int, m:[]);
grunt> dump a;
...
(151364,[highway#secondary,ref#R813])
grunt> b = foreach a generate m#'ref';
grunt> dump b;
(R813)
At last, I figured out the problem. Just change the field delimiter from ',' to another character, say a pipe. The field delimiter was being confused with the ',' delimiter used inside the map :)
The string 151364,[ref#R813,highway#secondary] was getting parsed into:
field1: 151364 field2: [ref#R813 field3: highway#secondary]
Since '[ref#R813' is not a valid map field, there is a parse error.
I also looked into the source code of the PigStorage function and confirmed the parsing logic:
@Override
public Tuple getNext() throws IOException {
    for (int i = 0; i < len; i++) {
        // skipping some stuff
        if (buf[i] == fieldDel) { // if we find the delimiter
            readField(buf, start, i); // extract the field from the previous delimiter to the current one
            start = i + 1;
            fieldID++;
        }
    }
}
Thus, since PIG splits fields on every occurrence of the field delimiter, the commas inside the map are treated as field separators and the map is broken apart before it can be parsed.
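The splitting behaviour described above can be reproduced with a small Python sketch. It is a simplified stand-in for the loop in the excerpt, not the actual PigStorage code, and split_fields is a hypothetical helper: it cuts the line at every occurrence of the field delimiter without looking at brackets.

```python
def split_fields(line, field_delim):
    """Cut the line at every field delimiter, ignoring any brackets."""
    fields = []
    start = 0
    for i, ch in enumerate(line):
        if ch == field_delim:
            # Extract the field from the previous delimiter to the current one.
            fields.append(line[start:i])
            start = i + 1
    # The last field runs to the end of the line.
    fields.append(line[start:])
    return fields


with_comma = split_fields("151364,[ref#R813,highway#secondary]", ",")
with_semicolon = split_fields("151364;[ref#R813,highway#secondary]", ";")

print(with_comma)
print(with_semicolon)
```

With ',' as the field delimiter the map is cut into two broken pieces, while with ';' it survives as a single field that can then be parsed as a map.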

Resources