PIG - Passing multiple words as a parameter

PIG - Passing multiple words as a parameter - hadoop

In my PIG script I have the following:
REL = FILTER OLD_REL BY COL == '$filter';
If I pass $filter as a multi-word string word1 word2, PIG only filters against word1. It is as if word2 is chopped off.
This happens when I do it from the command line or call it from oozie.
I'm using PIG 0.11.0-cdh4.3.0

Add extra single quotes to the string:
-p filter="'word1 word2'"
If you ever run into this type of problem again, it is useful to use the -dryrun option, which produces a script (text file) with substituted parameters, without executing the script.

Related

Can I pass a json object as value for a cli flag in go?

I am using urfave/cli for my go program and I would like to have a cli flag that reads a json value like this one:
{"name":"foo","surname":"var"}
I am currently reading that variable as a cli.StringFlag which returns a string. Then, I was planning to json.Unmarshall it but it does not work. The problem is that the returned string by the cli library is like this:
[{name foo} {surname var}]
which is not a json anymore.
Is there a way to achieve this? Note that if it returned a simple map, that would work too

for Linux, try to pass the paramaters with shell escape
#!/bin/bash
echo "{\"name\":\"foo\",\"surname\":\"var\"}"
in go program, just marshal this string parameter

The issue is that the shell (bash, ksh, csh, zsh, ...) interprets
{"name":"foo","surname":"var"}
as a sequence of bareword and quoted word tokens:
Token Type
Value
bareword
{
quoted word
name
bareword
:
quoted word
foo
bareword
,
quoted word
surname
bareword
:
quoted word
var
bare word
}
As it happens, a comma (,) is a shell operator, used for arithmetic, and that essentially gets discarded (at least in zsh, what I use).
The whole is then spliced together to get
name:foo surname:var
You can see this in action by opening your shell and executing the command
echo {"name":"foo","surname":"var"}
If, however, you quote your JSON document with single quotes ('):
echo '{"name":"foo","surname":"var"}'
You'll get what you might expect:
{"name":"foo","surname":"var"}
Note, however, that this will fail if the text in your JSON document contains a literal apostrophe/single quote (', U+0027), so you'd want to replace all such occurrences within the JSON document with \, to escape them.

Big Query job in shell script

I'm trying to automate a Big Query job in shell script but I'm getting errors while trying to do this. I'm reading a local CSV file with two columns, reading line by line and updating the values, with the following script:
#!/bin/bash
IFS=","
while read f1 f2
do
echo "De $f1 para $f2"
bq query --use_legacy_sql=false "UPDATE agendas_usuarios.tb_usuarios SET cargo='${f2}' WHERE cargo='${f1}'"
done < cargos_ps.csv
But I'm getting a syntax error: Unclosed
string literal at [1:47].
I've seen something that Shell Script doesn't allow for single quotes inside double quotes, is that true? If so, what's the best way to do this job in shell? I really need to develop in another programming language?
My CSV reading is right, my echo before the bq query is showing correctly.

I'm not sure what the actual problem is (perhaps it's necessary to escape the quotes) but using query parameters will mean that you don't need to inject strings into the query directly and can hopefully avoid the issue you're seeing. You'd want something like this:
bq query --use_legacy_sql=false \
--parameter="cargo:STRING:${f2}" \
--parameter="target:STRING:${f1}" \
"UPDATE agendas_usuarios.tb_usuarios SET cargo=#cargo WHERE cargo=#target"

Multi-line JSON read using Apache PIG

I have a JSON file and want to read using Apache Pig.
I tried using the regular JSONLOADER, but looks like JSONLOADER works only with single line JSON. Then I tried with Elephant-Bird. But I am still not able to see the results correctly. Can any one please suggest a solution?
Input :
{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}
Note : I dont want to convert the input in to a single line.
Script:
A = LOAD 'input' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
B = FOREACH A GENERATE FLATTEN($0#'employees');
Dump B;
Expected result should be :
([firstName#John,lastName#Doe])
([firstName#Anna,lastName#Smith])
([firstName#Peter,lastName#Jones])

As mentioned in the comments by siva, the answer is basically that you do need to change your input to a single line.
JsonLoader or elephantbird loader will always works only with single
line . It will not work with multiline. You need to convert your input
to single line before passing to pig. One workaround would be write a
shell script and call the logic to replace multiline to single line
using 'SED' command and then call the pig script in the shell script.
This link will help you how to call pig thru shell script.

whitespace character in case of parameter substitution

I want to pass a filter statement with in my pig script using parameter substitution
For that I have tried
exec -param flt='a1==1 AND a2=2' filterscript.pig
But sadly it is throwing an exception message
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: Local file 'AND' does not exist.
Pig version - 0.9.2
I have tried flt='\'a1==1 AND a2=2\'' and flt="a1==1 AND a2==2" suggested by pig users in apache forum as well as seen a similar post in SO.
Any help will be appreciated

I think you are using the parameter passed as it is as a condition. If so you will get an error like this. Instead you can pass them as separate paarmeters and form the condition string inside the pig script.
exec -p p1=1 -p p2=2 filterscript.pig
Inside your filterscript.pig script you can use these parameter values in condition clauses. For example
a1==$p1 AND a2=$p2

If you run your script outside the grunt shell you can do the followings:
pig -param flt="a1\=\=1 AND a2\=\=2" -f filterscript.pig
where filterscript.pig is something like this:
A = load ...
...
B = filter A by $flt;
...
Note that the '=' is also escaped, otherwise the filter condition won't be evalued to boolean.
If you want to use the filter substitution within the grunt shell as you tried with exec,
then you'll encounter the whitespace problem. Since escaping the whitespace character doesn't work, as a workaround you can create a parameter file :
cat params.txt
flt="a1\=\=1 AND a2\=\=2"
Then issue:
exec -param_file params.txt filterscript.pig
Note: I use Pig 0.12

Hadoop Pig: Passing Command Line Arguments

Is there a way to do this? eg, pass the name of the file to be processed, etc?

This showed up in another question, but you can indicate the input parameter on the command line and use that when you are loading, for example:
Command Line:
pig -f script.pig -param input=somefile.txt
script.pig:
raw = LOAD '$input' AS (...);
Note that if you are using the Amazon Web Services Elastic Map Reduce then the '$input' is what is passed to the script for any input you provide.

You can use ...
1. if there are few parameters then use -param (-p)
2. if there are lot of parameters then use -param_file (-m)
You can use either approach depending on the nature of value of your command line arguments, I use -param when i am developing and testing my scripts. Once pig script is ready for batch processing or running thru crontab, I use -param_file so that if any change required, I can easily update the params.init file.
man pig will show you all available options.
-m, -param_file path to the parameter file
-p, -param key value pair of the form param=val
Here is sample code ...
students.txt (input data)
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
params.init (file to hold all parameters)
fileName='hdfs://horton/user/jgosalia/students.txt'
cityName='Chennai'
filter.pig
students = LOAD '$fileName' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
students = FILTER students BY city == '$cityName';
DUMP students;
OPT 1: Using params on command line (-param or -p) & Output
pig -param fileName='hdfs://horton/user/jgosalia/students.txt' -param cityName='Chennai' filter.pig
... Trimming the logs ...
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
OPT 2: Using params file on command line (-param_file or -m) & Output
pig -param_file params.init filter.pig
... Trimming the logs ...
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
NOTE: use absolute path for file paths (both as parameters and when giving param file path to -param_file (-m)).

It is simple to pass in parameters to a PIG script.
First mark your variables in pig using '$' for example $input_file. Then pass the parameters to you script using pig -params input_file='/path/to/data'
for more information look here: http://wiki.apache.org/pig/ParameterSubstitution

Yes.
You can pass parameters along commandline options using pig's param option.
--customparam.pig
--load hdfs/local fs data
original = load '$input' using PigStorage('$delimiter');
--filter a specific field value into another bag
filtered = foreach original generate $split;
--storing data into hdfs/local fs
store filtered into '$output';
pig -x local -f customparam.pig -param input=Pig.csv -param
output=OUT/pig -param delimiter="," -param split='$1'
For more info: check this

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

PIG - Passing multiple words as a parameter - hadoop

Add extra single quotes to the string: -p filter="'word1 word2'" If you ever run into this type of problem again, it is useful to use the -dryrun option, which produces a script (text file) with substituted parameters, without executing the script.

Related

Can I pass a json object as value for a cli flag in go?

Big Query job in shell script

Multi-line JSON read using Apache PIG

whitespace character in case of parameter substitution

Hadoop Pig: Passing Command Line Arguments

Categories

Resources