Multi-line JSON read using Apache PIG - hadoop

I have a JSON file and want to read it using Apache Pig.
I tried the regular JsonLoader, but it looks like JsonLoader only works with single-line JSON. Then I tried Elephant Bird, but I am still not able to see the results correctly. Can anyone please suggest a solution?
Input :
{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}
Note: I don't want to convert the input into a single line.
Script:
A = LOAD 'input' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
B = FOREACH A GENERATE FLATTEN($0#'employees');
Dump B;
Expected result should be :
([firstName#John,lastName#Doe])
([firstName#Anna,lastName#Smith])
([firstName#Peter,lastName#Jones])

As mentioned in the comments by siva, the answer is basically that you do need to change your input to a single line.
JsonLoader and the Elephant Bird loader only work with single-line JSON; they will not work with multi-line input. You need to convert your input to a single line before passing it to Pig. One workaround is to write a shell script that collapses the multi-line JSON to a single line (for example with 'sed') and then calls the Pig script. This link will show you how to call Pig through a shell script.
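A minimal sketch of that workaround (the names input.json, input_oneline.json and jsonread.pig are all hypothetical):
#!/bin/bash
# Collapse the multi-line JSON into a single line; tr simply deletes newlines,
# while sed could be used instead if more selective rewriting is needed.
tr -d '\n' < input.json > input_oneline.json

# Run the Pig script against the flattened file; this assumes the script
# loads '$input' rather than a hard-coded path.
pig -param input=input_oneline.json -f jsonread.pig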

Related

How to obtain variable values from a text file for use by a bash script?

I have a bash script, and I'd like it to read in the values of variables from a text file.
I'm thinking that in the text file where the values are stored, I'd use a different line for each variable / value pair and use the equals sign.
VARIABLE1NAME=VARIABLE1VALUE
VARIABLE2NAME=VARIABLE2VALUE
I'd then like my bash script to assign the value VARIABLE1VALUE to the variable VARIABLE1NAME, and the same for VARIABLE2VALUE / VARIABLE2NAME.
Since the syntax of the file is the syntax you would use in the script itself, the source command should do the trick:
source text-file-with-assignments.txt
Alternatively, you can use . instead of source, but in a case like this the full name is clearer.
The documentation can be found in the GNU Bash Reference Manual.
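A minimal sketch, assuming the assignments above are saved in a file named settings.txt (the name is hypothetical):
#!/bin/bash
# Read the assignments into the current shell; after this line each variable
# defined in the file behaves like any other shell variable.
source settings.txt
echo "VARIABLE1NAME is $VARIABLE1NAME"   # prints: VARIABLE1NAME is VARIABLE1VALUE
One caveat: the file is executed as shell code, so it should only contain trusted content.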

whitespace character in case of parameter substitution

I want to pass a filter statement to my Pig script using parameter substitution.
For that I have tried
exec -param flt='a1==1 AND a2=2' filterscript.pig
But sadly it is throwing an exception message
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: Local file 'AND' does not exist.
Pig version - 0.9.2
I have tried flt='\'a1==1 AND a2=2\'' and flt="a1==1 AND a2==2" as suggested by Pig users on the Apache forum, and I have also seen a similar post on SO.
Any help will be appreciated
I think you are using the passed parameter as the condition itself; if so, you will get an error like this. Instead, you can pass the values as separate parameters and form the condition string inside the Pig script.
exec -p p1=1 -p p2=2 filterscript.pig
Inside your filterscript.pig script you can then use these parameter values in the condition clause, for example:
a1 == $p1 AND a2 == $p2
If you run your script outside the grunt shell you can do the followings:
pig -param flt="a1\=\=1 AND a2\=\=2" -f filterscript.pig
where filterscript.pig is something like this:
A = load ...
...
B = filter A by $flt;
...
Note that the '=' is also escaped, otherwise the filter condition won't be evaluated as a boolean.
If you want to use the filter substitution within the grunt shell as you tried with exec,
then you'll encounter the whitespace problem. Since escaping the whitespace character doesn't work, you can create a parameter file as a workaround:
cat params.txt
flt="a1\=\=1 AND a2\=\=2"
Then issue:
exec -param_file params.txt filterscript.pig
Note: I use Pig 0.12

how to use mongoexport to get a particular format in .csv file?

I am new to MongoDB and excited about using it at my workplace. However, I have come across a situation where one of our clients has sent the data in a .bson file. I have got everything working on my machine. I want to use the mongoexport facility to export my data in CSV format. When I am using the following query
./mongoexport --db <dbname> --collection <collectionname> --csv --fields _id,field1,field2
I am getting the result in following format
ObjectID(4f6b42eb5e724242f60002ce),"[ { ""$oid"" : ""4f6b31295e72422cc5000001"" } ]",369008
However, I just want the values of the fields as comma-separated output like below:
4f6b42eb5e724242f60002ce,4f6b31295e72422cc5000001,369008
My question is: is there anything I can do in mongoexport to ignore certain characters?
Any pointer will be helpful.
No, mongoexport has no features like this. You'll need to use tools like sed and awk to post-process the file, or read the file and munge it in a scripting language like Python.
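As a rough sketch of that post-processing step (assuming the export was written to export.csv, a hypothetical name, and every row follows the same layout as the sample above), a sed one-liner can strip the wrappers:
# Remove the ObjectID(...) wrapper and the quoted { "$oid" : ... } wrapper,
# leaving only the raw ids and the plain numeric field.
sed -e 's/ObjectID(//' -e 's/)//' \
    -e 's/"\[ { ""\$oid"" : ""//' -e 's/"" } \]"//' export.csv > clean.csv
Adjust the patterns to your actual output; awk or a small Python script is the more robust choice when the layout varies from row to row.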
You should be able to add the following to your list of arguments:
--csv
You may also want to supply a path:
-o something.csv
...Though I don't think you could do this in 2012 when you first posted your question :-)
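Putting those flags together, the full command might look like this (a sketch; it assumes an older mongoexport release that still accepts --csv, whereas newer releases use --type=csv instead):
./mongoexport --db <dbname> --collection <collectionname> --csv --fields _id,field1,field2 -o something.csv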

Export data from mathematica script

I am writing a Mathematica script and running it in the Linux batch shell. The script produces a list of numbers as its result. I would like to write this list to a file as a single column, without the braces and commas. For this, I tried to use the Export command as
Export["file.txt", A1, "Table"]
but I get the error:
Export::infer: Cannot infer format of file test1.txt
I tried with other formats but I got the same error.
Could someone please tell me what is wrong and what I can do? Thanks in advance.
From what I understand you are trying to export the file in "Table" format; why don't you try something like this:
Export["file.txt", A1, "Text"]
This:
A1 = {1,2,3};
Export["test.tab", Transpose[{A1}], "Table"];
produces a single column without braces and commas.

Hadoop Pig: Passing Command Line Arguments

Is there a way to do this? E.g., pass the name of the file to be processed, etc.?
This showed up in another question, but you can indicate the input parameter on the command line and use that when you are loading, for example:
Command Line:
pig -f script.pig -param input=somefile.txt
script.pig:
raw = LOAD '$input' AS (...);
Note that if you are using the Amazon Web Services Elastic Map Reduce then the '$input' is what is passed to the script for any input you provide.
You can use either:
1. -param (-p) if there are only a few parameters
2. -param_file (-m) if there are a lot of parameters
Which approach to use depends on the nature of your command line argument values. I use -param when I am developing and testing my scripts. Once a Pig script is ready for batch processing or for running through crontab, I use -param_file so that if any change is required I can easily update the params.init file.
man pig will show you all available options.
-m, -param_file path to the parameter file
-p, -param key value pair of the form param=val
Here is sample code ...
students.txt (input data)
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
params.init (file to hold all parameters)
fileName='hdfs://horton/user/jgosalia/students.txt'
cityName='Chennai'
filter.pig
students = LOAD '$fileName' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
students = FILTER students BY city == '$cityName';
DUMP students;
OPT 1: Using params on command line (-param or -p) & Output
pig -param fileName='hdfs://horton/user/jgosalia/students.txt' -param cityName='Chennai' filter.pig
... Trimming the logs ...
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
OPT 2: Using params file on command line (-param_file or -m) & Output
pig -param_file params.init filter.pig
... Trimming the logs ...
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
NOTE: use absolute paths for file paths (both as parameter values and for the param file given to -param_file (-m)).
It is simple to pass in parameters to a PIG script.
First mark your variables in Pig using '$', for example $input_file. Then pass the parameters to your script using pig -param input_file='/path/to/data'
For more information look here: http://wiki.apache.org/pig/ParameterSubstitution
Yes.
You can pass parameters along with other command-line options using Pig's -param option.
--customparam.pig
--load hdfs/local fs data
original = load '$input' using PigStorage('$delimiter');
--filter a specific field value into another bag
filtered = foreach original generate $split;
--storing data into hdfs/local fs
store filtered into '$output';
pig -x local -f customparam.pig -param input=Pig.csv -param output=OUT/pig -param delimiter="," -param split='$1'
For more info: check this
