whitespace character in case of parameter substitution - hadoop

I want to pass a filter statement into my Pig script using parameter substitution.
For that I have tried:
exec -param flt='a1==1 AND a2=2' filterscript.pig
But sadly it throws this exception message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: Local file 'AND' does not exist.
Pig version - 0.9.2
I have also tried flt='\'a1==1 AND a2=2\'' and flt="a1==1 AND a2==2", as suggested by Pig users on the Apache forum and in a similar post on SO.
Any help will be appreciated

I think you are using the passed parameter as is as a condition. If so, you will get an error like this. Instead, you can pass the values as separate parameters and form the condition string inside the Pig script.
exec -p p1=1 -p p2=2 filterscript.pig
Inside your filterscript.pig script you can use these parameter values in condition clauses. For example:
a1==$p1 AND a2==$p2

If you run your script outside the grunt shell, you can do the following:
pig -param flt="a1\=\=1 AND a2\=\=2" -f filterscript.pig
where filterscript.pig is something like this:
A = load ...
...
B = filter A by $flt;
...
Note that the '=' is also escaped; otherwise the filter condition won't be evaluated as a boolean.
If you want to use the filter substitution within the grunt shell, as you tried with exec, you'll run into the whitespace problem. Since escaping the whitespace character doesn't work, you can create a parameter file as a workaround:
cat params.txt
flt="a1\=\=1 AND a2\=\=2"
Then issue:
exec -param_file params.txt filterscript.pig
Note: I use Pig 0.12


snakemake: Problem with --rerun-triggers flag and bash variable

I have a problem when I try to provide the --rerun-triggers flag as a bash variable.
My command is
snakemake $snakemake_extra -pr --snakefile Snakefile --configfile config.yaml -c 20 -n
and snakemake_extra is a bash variable defined as
snakemake_extra="--rerun-triggers {mtime,input,params}"
I get the following error:
snakemake: error: argument --rerun-triggers: invalid choice: '{mtime,input,params}' (choose from 'mtime', 'params', 'input', 'software-env', 'code')
The problem seems to be that snakemake(?) adds single-quotes before and after the {}.
When I insert the --rerun-triggers flag directly (without bash variable) it works fine. I need the bash variable however and can also not use a snakemake profile yaml.
Is there any possible workaround?
I am using snakemake version 7.12.1.
Thanx,
Carlus
Is there any possible workaround?
https://mywiki.wooledge.org/BashFAQ/050
Use an array:
snakemake_extra=( --rerun-triggers {mtime,input,params} )
... "${snakemake_extra[@]}" ...
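The difference between the string and the array can be seen without running snakemake. Here is a minimal bash sketch, where show_args is a hypothetical stand-in for the real command that simply prints each argument it receives:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for snakemake: prints every argument on its own line.
show_args() { printf '<%s>\n' "$@"; }

# Inside a quoted string the braces are stored literally. Brace expansion runs
# before variable expansion, so it never sees the contents of the variable.
as_string="--rerun-triggers {mtime,input,params}"
string_result=$(show_args $as_string)

# In the array assignment the braces are unquoted, so bash expands them
# into three separate elements at assignment time.
as_array=( --rerun-triggers {mtime,input,params} )
array_result=$(show_args "${as_array[@]}")

printf 'string:\n%s\narray:\n%s\n' "$string_result" "$array_result"
```

With the string, the command receives the single literal argument {mtime,input,params}, which matches the value quoted in the error message. With the array, it receives mtime, input and params as separate arguments.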

Big Query job in shell script

I'm trying to automate a Big Query job in shell script but I'm getting errors while trying to do this. I'm reading a local CSV file with two columns, reading line by line and updating the values, with the following script:
#!/bin/bash
IFS=","
while read f1 f2
do
echo "De $f1 para $f2"
bq query --use_legacy_sql=false "UPDATE agendas_usuarios.tb_usuarios SET cargo='${f2}' WHERE cargo='${f1}'"
done < cargos_ps.csv
But I'm getting a syntax error: Unclosed string literal at [1:47].
I've read that shell scripts don't allow single quotes inside double quotes. Is that true? If so, what's the best way to do this job in shell? Do I really need to use another programming language?
My CSV reading is right, my echo before the bq query is showing correctly.
I'm not sure what the actual problem is (perhaps it's necessary to escape the quotes) but using query parameters will mean that you don't need to inject strings into the query directly and can hopefully avoid the issue you're seeing. You'd want something like this:
bq query --use_legacy_sql=false \
--parameter="cargo:STRING:${f2}" \
--parameter="target:STRING:${f1}" \
"UPDATE agendas_usuarios.tb_usuarios SET cargo=@cargo WHERE cargo=@target"

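The parameterized approach can be tried out in a loop without touching BigQuery. This is a rough sketch only: build_bq_args is a hypothetical helper that prints the arguments instead of calling bq, the sample rows are made up, and stripping a trailing carriage return is an assumption about the CSV file, not a confirmed cause of the error.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the bq arguments for one CSV row, one per line,
# instead of calling bq, so the quoting can be inspected safely.
build_bq_args() {
  local from=$1 to=$2
  local args=(
    query --use_legacy_sql=false
    "--parameter=cargo:STRING:${to}"
    "--parameter=target:STRING:${from}"
    'UPDATE agendas_usuarios.tb_usuarios SET cargo=@cargo WHERE cargo=@target'
  )
  printf '%s\n' "${args[@]}"
}

# IFS is set only for the read command; -r keeps backslashes literal.
# The sample rows below are invented for illustration.
while IFS=, read -r f1 f2; do
  f2=${f2%$'\r'}   # assumption: drop a trailing carriage return from Windows line endings
  build_bq_args "$f1" "$f2"
done <<'CSV'
Analista Jr,Analista
Gerente d'Area,Gerente
CSV
```

Because the values travel as query parameters, a quote character inside a field no longer ends up inside the SQL text. To actually run the job, the printf line would be replaced with bq "${args[@]}".
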
Hadoop Hcatalog -How to pass key value pair

I have a create table script where the table name will be decided at runtime. How do I pass the value to sql script?
I'm trying something like this
hcat -e "create table ${D:TAB_NAME} (name string)" -DTAB_NAME=person
But I keep getting errors.
Can I get the correct syntax?
Try this:
hcat -e 'create table ${hiveconf:TAB_NAME} (name string);' -DTAB_NAME=person2
There are two things to note here:
In the shell, $ triggers variable expansion inside double quotes, so your ${D:TAB_NAME} is expanded to nothing before it is even passed to the hcat parser. Either escape the $ or use strong quoting with single quotes ('').
Use hiveconf instead of D for variable substitution, because hcat still uses Hive under the hood to parse commands.
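The quoting difference can be checked in bash without running hcat. A small sketch follows; the variable names double_quoted and single_quoted are made up for illustration, and it assumes D and TAB_NAME are not set in the shell:

```bash
#!/usr/bin/env bash
# Make sure the names used below are not set in this shell (assumption of the example).
unset D TAB_NAME

# Double quotes: bash expands ${D:TAB_NAME} itself (as a substring of the unset
# variable D), so the placeholder is gone before hcat would ever see it.
double_quoted="create table ${D:TAB_NAME} (name string)"

# Single quotes: the text is passed through untouched, placeholder included.
single_quoted='create table ${hiveconf:TAB_NAME} (name string);'

printf '%s\n' "$double_quoted" "$single_quoted"
```

The first line printed has an empty table name, while the second still contains the ${hiveconf:TAB_NAME} placeholder.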

PIG - Passing multiple words as a parameter

In my PIG script I have the following:
REL = FILTER OLD_REL BY COL == '$filter';
If I pass $filter as a multi-word string word1 word2, PIG only filters against word1. It is as if word2 is chopped off.
This happens when I do it from the command line or call it from oozie.
I'm using PIG 0.11.0-cdh4.3.0
Add extra single quotes to the string:
-p filter="'word1 word2'"
If you ever run into this type of problem again, it is useful to use the -dryrun option, which produces a script (text file) with substituted parameters, without executing the script.

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario-
Pig version used 0.70
Sample HDFS directory structure:
/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>
As you can see in the paths listed above, one of the directory names is a date stamp.
Problem: I want to load files from a date range say from 20100810 to 20100813.
I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);
The following works with hadoop:
hadoop fs -ls /user/training/test/{20100810..20100813}
But it fails when I try the same with LOAD inside the pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?
Error log follows:
Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
... 14 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
at org.apache.pig.PigServer.openIterator(PigServer.java:521)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
cheers
As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to simply use Pig parameters (which is a good way to make your script more reusable anyway):
shell:
pig -f script.pig -param input=/user/training/test/{20100810..20100812}
script.pig:
temp = LOAD '$input' USING SomeLoader() AS (...);
Pig processes your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented here. As you can see, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates the list for you. Building such a list from a date range should be a trivial task in the scripting language of your choice. For my application, I've gone with the generated-list route, and it's working fine (CDH3 distribution).
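The wrapper-script option could look roughly like this in bash. It is only an illustration, not the script this answer's author used: date_glob is a hypothetical helper name, and it assumes GNU date (for the -d option).

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a comma-separated brace list such as
# {20100810,20100811} covering every day from $1 to $2 inclusive (YYYYMMDD).
# Assumes GNU date; TZ=UTC avoids surprises around daylight-saving changes.
date_glob() {
  local day=$1 end=$2 list=
  while [ "$day" -le "$end" ]; do
    list+="${list:+,}${day}"                       # append with a comma separator
    day=$(TZ=UTC date -d "${day} +1 day" +%Y%m%d)  # step to the next calendar day
  done
  printf '{%s}\n' "$list"
}

date_glob 20100830 20100902
```

Unlike a plain numeric range, stepping by calendar day handles month boundaries, so no nonexistent dates are generated. The result would then be passed in double quotes, for example pig -f script.pig -param input="/user/training/test/$(date_glob 20100810 20100813)", with LOAD '$input' in the script.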
I ran across this answer when I was having trouble creating a file glob in a script and passing it as a parameter into a Pig script.
None of the current answers applied to my situation, but I did find a general answer that might be helpful here.
In my case, the shell expansion was happening and the result was passed into the script, which understandably caused problems for the Pig parser.
Simply surrounding the glob in double quotes protects it from being expanded by the shell and passes it as is into the command.
WON'T WORK:
$ pig -f my-pig-file.pig -p INPUTFILEMASK='/logs/file{01,02,06}.log' -p OTHERPARAM=6
WILL WORK:
$ pig -f my-pig-file.pig -p INPUTFILEMASK="/logs/file{01,02,06}.log" -p OTHERPARAM=6
I hope this saves someone some pain and agony.
So since this works:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader()
but this does not work:
temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader()
If you want a date range that spans, say, 300 days, passing a full list to LOAD is inelegant, to say the least. I came up with this, and it works.
Say you want to load data from 2012-10-08 to today, 2013-02-14. What you can do is
temp = LOAD '/user/training/test/{201210*,201211*,201212*,2013*}' USING SomeLoader()
then do a filter after that
filtered = FILTER temp BY (the_date>='2012-10-08')
temp = LOAD '/user/training/test/2010081*/*' USING SomeLoader() AS (...);
loads the data for 20100810~20100819.
temp = LOAD '/user/training/test/2010081{0,1,2}/*' USING SomeLoader() AS (...);
loads the data for 20100810~20100812.
If the variable is in the middle of the file path, concatenate the subfolder name or use '*' for all files.
I found that this problem is caused by the Linux shell. The shell expands
{20100810..20100812}
to
20100810 20100811 20100812,
so the command you actually run is
bin/hadoop fs -ls 20100810 20100811 20100812
The HDFS API, however, does not expand this expression for you.
Thanks to dave campbell.
Some of the answers above are wrong, even though they got some votes.
Following is my test result:
Works
pig -f test.pig -param input="/test_{20120713,20120714}.txt"
Cannot have space before or after "," in the expression
pig -f test.pig -param input="/test_201207*.txt"
pig -f test.pig -param input="/test_2012071?.txt"
pig -f test.pig -param input="/test_20120713.txt,/test_20120714.txt"
pig -f test.pig -param input=/test_20120713.txt,/test_20120714.txt
Cannot have space before or after "," in the expression
Doesn't Work
pig -f test.pig -param input="/test_{20120713..20120714}.txt"
pig -f test.pig -param input=/test_{20120713,20120714}.txt
pig -f test.pig -param input=/test_{20120713..20120714}.txt
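These results are consistent with what the shell hands to pig in each case, which can be inspected without Pig installed. A minimal bash sketch, where show_args is a hypothetical stand-in for the pig command that prints each argument it receives:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for pig: prints every argument on its own line.
show_args() { printf '<%s>\n' "$@"; }

# Quoted comma list: the braces arrive intact, leaving the expansion to Hadoop.
quoted_list=$(show_args input="/test_{20120713,20120714}.txt")

# Unquoted comma list: the shell expands it into two separate arguments,
# so only the first one is a complete name=value pair for -param.
unquoted_list=$(show_args input=/test_{20120713,20120714}.txt)

# Quoted range: the shell leaves '..' alone, and Hadoop globs do not support it.
quoted_range=$(show_args input="/test_{20120713..20120714}.txt")

printf '%s\n---\n%s\n---\n%s\n' "$quoted_list" "$unquoted_list" "$quoted_range"
```

Only the first form delivers a single argument containing a pattern that Hadoop's glob syntax understands.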
Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
Probably you don't. This can be done with a custom Load UDF, or you could rethink your directory structure (this will work well if your ranges are mostly static).
Additionally: Pig accepts parameters, so maybe this would help you (perhaps you could write a function that loads the data for one day and unions it into the resulting set, but I don't know if that's possible).
Edit: writing a simple Python or bash script that generates the list of dates (folders) is probably the easiest solution; you then just have to pass it to Pig, and this should work fine.
Adding to Romain's answer: if you just want to parameterize the date, the shell command looks like this:
pig -param input="$(echo {20100810..20100812} | tr ' ' ,)" -f script.pig
pig:
temp = LOAD '/user/training/test/{$input}' USING SomeLoader() AS (...);
Please note the quotes.
Pig supports HDFS glob patterns (globStatus), so I think Pig can handle the pattern
/user/training/test/{20100810,20100811,20100812}.
Could you paste the error logs?
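Here's a script I'm using to generate a list of dates and then pass this list to the Pig script as a parameter. Very tricky, but it works for me.
For example:
DT=20180101   # first date of the range (YYYYMMDD)
DAYS=6        # number of additional days to include (example value)
DT_LIST=''
for ((i=0; i<=DAYS; i++))
do
  # GNU date: add i days to the start date
  d=$(date +%Y%m%d -d "${DT} +$i days")
  DT_LIST=${DT_LIST}$d','
done
# strip the trailing comma
DT_LIST=${DT_LIST%,}
# double quotes so that the shell expands DT_LIST (single quotes would pass the text literally)
pig -p input_data="xxx/yyy/${DT_LIST}" script.pig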
