Pass an array argument to custom Pig loader

I wrote a LoadFunc that lets me select lines matching given keywords from a huge unstructured log file. How do I pass a tuple into my loader as an argument?
Something like
A = load '/input/*' using MyLoader('keyword1','keyword2');
or
A = load '/input/*' using MyLoader( ('keyword1','keyword2') );
cause errors:
grunt> a = LOAD '/input/*' USING MyLoader( ('keyword1','keyword2') );
2012-08-28 19:44:04,331 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 3, column 36> mismatched input '(' expecting RIGHT_PAREN
Details at logfile: /home/hadoop/pig-0.10.0/pig_1346159261142.log

In practice, a Pig LoadFunc can only accept String parameters for its constructor. See http://mail-archives.apache.org/mod_mbox/pig-user/201302.mbox/%3CCAO8ATY27UOdcgSjdh19F=iHsnFEAwmzedWbsnZ66sNvcsjfgog@mail.gmail.com%3E.
For your purposes, I would pass a CSV as a String to your LoadFunc and then parse it within the LoadFunc's constructor.
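The parsing step can be sketched as follows. This is only an illustration of the logic in Python; a real LoadFunc constructor would be written in Java. The function name `parse_keywords` and the decision to trim whitespace and drop empty entries are assumptions, not part of the answer above.

```python
def parse_keywords(arg):
    """Split one comma-separated constructor argument into a list of keywords."""
    # Trim surrounding whitespace and skip empty entries such as a trailing comma.
    return [part.strip() for part in arg.split(",") if part.strip()]


# The loader would then receive a single string, e.g. MyLoader('keyword1,keyword2').
print(parse_keywords("keyword1, keyword2"))
```

A keyword that itself contains a comma would need a different separator or an escaping scheme.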

Related

Pig: Unable to Load BAG

I have a record in this format:
{(Larry Page),23,M}
{(Suman Dey),22,M}
{(Palani Pratap),25,M}
I am trying to LOAD the record using this:
records = LOAD '~/Documents/PigBag.txt' AS (details:BAG{name:tuple(fullname:chararray),age:int,gender:chararray});
But I am getting this error:
2015-02-04 20:09:41,556 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 101> mismatched input ',' expecting RIGHT_CURLY
Please advise.
It's not a bag since it's not made up of tuples. Try
load ... as (name:tuple(fullname:chararray), age:int, gender:chararray)
For some reason Pig wraps each output line in curly braces, which makes it look like a bag, but it's not one. If you saved this data using PigStorage, you can pass it the '-schema' parameter, which tells PigStorage to write a schema file (.pig_schema) that you can inspect to see what the saved schema is. The same option can be used when loading with PigStorage to save you the AS clause.
Yes, LiMuBei's point is absolutely right. Your input is not in the right format. Pig expects a bag to hold a collection of tuples, but in your case it holds a tuple plus plain fields. In this case Pig will retain the tuple and reject the fields (age and gender) during load.
But this problem can be solved easily with a different (somewhat hacky) approach:
1. Load each input line as a chararray.
2. Remove the curly braces and parentheses from the input.
3. Use the STRSPLIT function to split the input into (fullname, age, sex) fields.
PigScript:
A = LOAD 'input' USING PigStorage AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[}{)(]+','')) AS (newline:chararray);
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',3)) AS (fullname:chararray,age:int,sex:chararray);
DUMP C;
Output:
(Larry Page,23,M)
(Suman Dey,22,M)
(Palani Pratap,25,M)
Now you can access all the fields using fullname,age,sex.
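To see what the REPLACE and STRSPLIT steps do to each line, here is a rough Python equivalent of the same string handling. It is a sketch for checking the logic outside Pig, not a substitute for the script; the function name `parse_line` is made up for this example.

```python
import re


def parse_line(line):
    # Step 2: strip curly braces and parentheses, like REPLACE(line,'[}{)(]+','').
    cleaned = re.sub(r"[}{)(]+", "", line)
    # Step 3: split into at most three parts, like STRSPLIT(newline,',',3).
    fullname, age, sex = cleaned.split(",", 2)
    return fullname, int(age), sex


for raw in ["{(Larry Page),23,M}", "{(Suman Dey),22,M}", "{(Palani Pratap),25,M}"]:
    print(parse_line(raw))
```

Note that this approach would also strip any parentheses or braces that appear inside a name.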

How to insert dummy map values in Pig

I am doing a conditional check for a null or empty bag. The bag contains multiple maps. Whenever 'info' is null or empty, I want to put a dummy map value into it, because in the next step I do a FLATTEN operation on 'info'.
I need this because a null or empty bag in FLATTEN removes the complete record from the data, which I don't want.
((info is null or IsEmpty(info)) ? {(['Unknown'#'unknown'])} : info) as info;
This gives me the following compilation error:
2014-09-02 06:20:37,978 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " ": "" at line 24, column 70.
Was expecting one of:
"cat" ...
"clear" ...
"fs" ...
"sh" ...
"cd" ...
"cp" ...
"copyFromLocal" ...
It seems there is a syntax error in the way the map is created. An easy way to create a map is the TOMAP function, which you can use as below:
((info is null or IsEmpty(info)) ? {(TOMAP('Unknown','unknown'))} : info) as info;
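The reason for the dummy value can be shown with a small Python simulation of the flatten step. This is only a model of the behaviour described in the question; it assumes, as the question states, that a null bag is dropped the same way as an empty one. The name `flatten_info` and the sample records are invented for illustration.

```python
def flatten_info(records, dummy=None):
    """Emit one (id, map) row per bag element, roughly like FLATTEN(info)."""
    rows = []
    for rec_id, info in records:
        # Substitute a one-element bag when info is null or empty and a dummy is given.
        if not info and dummy is not None:
            info = [dummy]
        # A null or empty bag yields no rows, so the record disappears.
        for item in info or []:
            rows.append((rec_id, item))
    return rows


records = [(1, [{"k": "v"}]), (2, []), (3, None)]
print(flatten_info(records))
print(flatten_info(records, dummy={"Unknown": "unknown"}))
```

Without the dummy, only record 1 survives; with it, records 2 and 3 each keep one placeholder row.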

DEFINE statement in Apache Pig

I parsed a JSON input file:
--Load Json
loadJson = LOAD '$inputJson' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (json:map []);
'$inputJson' is the file containing the JSON structure.
Then I parsed the JSON to get some parameters for my Pig job:
--PARSING JSON
--Source : Is the input file I need to process in Pig job
a = FOREACH loadJson GENERATE json#'source' AS ParsedInput;
I stored the result in "a", which contains the input path, i.e. /home/sree/foo.txt.
Now I need to load that data into a bag. Normally I would use a plain load statement:
inputdata = LOAD '/home/sree/foo.txt';
Instead of this I have to do
inputdata = LOAD a;
This is what I am trying to achieve.
So far I have tried using DEFINE:
--Source
a = FOREACH loadJson GENERATE json#'source' AS ParsedInput;
-- define a global constant for storage
define myIn "a";
--Load data
inputdata = LOAD "$myIn" ;
dump data;
But this fails with: Unexpected internal error. Undefined parameter : a
How can I load that file?
As far as I know, Pig does not allow a relation to be used in a DEFINE statement; DEFINE only assigns an alias to a UDF or a streaming command.
See:
http://pig.apache.org/docs/r0.10.0/basic.html#define-udfs

Using MultiStorage to store records in separate files

I'm trying to store a set of records like these:
2342514224232 | some text here whatever
2342514224234| some more text here whatever
....
into separate files in the output folder like this:
output / 2342514224232
output / 2342514224234
The value of idstr should be the file name, and the text should be inside the file. Here's my Pig code:
REGISTER /home/bytebiscuit/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
A = LOAD 'cleantweets.csv' using PigStorage(',') AS (idstr:chararray, createdat:chararray, text:chararray,followers:int,friends:int,language:chararray,city:chararray,country:chararray,lat:chararray,lon:chararray);
B = FOREACH A GENERATE idstr, text, language, country;
C = FILTER B BY (country == 'United States' OR country == 'United Kingdom') AND language == 'en';
texts = FOREACH C GENERATE idstr,text;
STORE texts INTO 'output/query_results_one' USING org.apache.pig.piggybank.storage.MultiStorage('output/query_results_one', '0');
Running this pig script gives me the following error:
<file pigquery1.pig, line 12, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.MultiStorage' with arguments '[output/query_results_one, idstr]'
Any help is really appreciated!
Try this option:
MultiStorage('output/query_results_one', '0', 'none', ',');
In case anybody stumbles across this post like I did, the problem for me was that my pig script looked like:
DEFINE MultiStorage org.apache.pig.piggybank.storage.MultiStorage();
...
STORE stuff INTO 's3:/...' USING MultiStorage('s3:/...','0','none',',');
The DEFINE statement was wrong because it did not specify the inputs/outputs. Dropping the DEFINE statement and writing the following directly fixed my problem:
STORE stuff INTO 's3:/...' USING org.apache.pig.piggybank.storage.MultiStorage('s3:/...','0','none',',');

PIG Loading a CSV - Map Type Error

We aim to use Pig for large-scale analysis of our server logs. I need to load a Pig map datatype from a file.
I tried running a sample Pig script with the following data.
A line in my CSV file, named 'test' (to be processed by Pig), looks like this:
151364,[ref#R813,highway#secondary]
My Pig script:
a = LOAD 'test' using PigStorage(',') AS (id:INT, m:MAP[]);
DUMP a;
The idea is to load the first element as an int and the second element as a map.
However, when I dump, the int field gets parsed correctly (and is printed in the dump), but the map field is not parsed, resulting in a parsing error.
Can someone please explain what I am missing?
I think there is a delimiter-related problem (the field delimiter is somehow affecting the parsing of the map field, or it is being confused with the map delimiter).
When this input data is used (notice I used semicolon as field-delimiter):
151364;[ref#R813,highway#secondary]
below is the output from my grunt shell:
grunt> a = LOAD '/tmp/temp2.txt' using PigStorage(';') AS (id:int, m:[]);
grunt> dump a;
...
(151364,[highway#secondary,ref#R813])
grunt> b = foreach a generate m#'ref';
grunt> dump b;
(R813)
At last, I figured out the problem. Just change the delimiter from ',' to another character, say a pipe. The field delimiter was being confused with the ',' delimiter used inside the map :)
The string 151364,[ref#R813,highway#secondary] was getting parsed into:
field1: 151364 field2: [ref#R813 field3: highway#secondary]
Since '[ref#R813' is not a valid map field, there is a parse error.
I also looked into the source code of the PigStorage function and confirmed the parsing logic:
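@Override
public Tuple getNext() throws IOException {
    // ... (abridged excerpt; code before the loop is not shown)
    for (int i = 0; i < len; i++) {
        // skipping some stuff
        if (buf[i] == fieldDel) {      // if we find the delimiter
            readField(buf, start, i);  // extract the field from the previous delimiter to the current one
            start = i + 1;
            fieldID++;
        }
    }
    // ... (abridged excerpt; code after the loop is not shown)
}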
Thus, since Pig splits the line on every occurrence of the field delimiter before interpreting any field, a comma inside the map is treated as a field boundary and the map is broken apart.
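The effect of the loop in the PigStorage excerpt can be reproduced with a plain string split. The following Python sketch is only a model of that behaviour (the helper name `split_fields` is invented here); it shows why the comma delimiter breaks the map while a semicolon does not.

```python
def split_fields(line, delimiter):
    # Cut at every occurrence of the field delimiter, with no awareness
    # of the [ ] brackets that surround a map.
    return line.split(delimiter)


# Comma as field delimiter: the map is torn into two invalid pieces.
print(split_fields("151364,[ref#R813,highway#secondary]", ","))
# Semicolon as field delimiter: the map stays in one field.
print(split_fields("151364;[ref#R813,highway#secondary]", ";"))
```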
