Pig parameter substitution - hadoop

I have a couple of questions about parameter substitution in Pig.
I am on Pig 0.10.
Can I access Unix environment variables in the grunt shell? In Hive we can do this via ${env:variable}.
I have a bunch of Pig scripts that are automated and run in batch mode. They use a number of parameters that I substitute from the command line (either -param or -param_file). When I need to enhance (or debug) a Pig script in grunt mode, I am left manually replacing the parameters with their values. Is there a better way of handling this situation?
Thanks for the help !

For the first question, Pig does not support reading environment variables. Is there a special requirement? You should be able to pass the values in as Pig command-line parameters.
For the second question, Pig currently does not support parameters in Grunt. You can check the issue and discussion in PIG-2122. Aniket Mokashi suggests the following workaround:
Store your script in a file (with $params included).
Start grunt interactively.
Type run -param a=b -param c=d myscript.pig
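Concretely, the workaround above can be sketched like this (the script name and the `input` parameter are illustrative, not from the original answer):

```shell
# Sketch of the PIG-2122 workaround: keep $params in the script file and let
# grunt's `run` command substitute them. Names here are illustrative.
cat > myscript.pig <<'EOF'
A = LOAD '$input' USING PigStorage(':');
DUMP A;
EOF

# Batch mode, as before (commented out: needs a Pig install):
#   pig -param input=/data/passwd myscript.pig
# Interactive mode: start grunt, then run the same script with the same parameters:
#   grunt> run -param input=/data/passwd myscript.pig
```

The same script file then serves both batch runs and interactive debugging sessions without manual find-and-replace.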

Related

Replace sqoop process output

Is there any way to hide the sqoop process output in the Unix shell?
For example, instead of that output, show some text like "sqoop processing".
Thanks
The way I deal with this for Pig scripts (which also tend to produce a lot of output and run for a long time) is as follows:
Rather than running
pig mypath/myscript.pig
I will run
nohup pig mypath/myscript.pig &
In your case that would mean something like
nohup oozie -job something &
This has the additional benefit that it will not stop your query if your SSH connection times out. If you do not use SSH at the moment, this may be an additional required step.
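A variant I often use also redirects the output to a log file, so the console stays quiet and the chatter is still available later. The `echo` below is only a stand-in for the real command:

```shell
# Detach the job and capture all its output in a log file instead of the terminal.
# The echo stands in for the real long-running command, e.g.
#   pig mypath/myscript.pig   or   oozie -job something
nohup sh -c 'echo "sqoop processing"' > job.log 2>&1 &
wait            # demo only; interactively you just get your prompt back
cat job.log     # later, follow progress with: tail -f job.log
```

The `2>&1` matters because Pig and sqoop write most of their progress chatter to stderr, not stdout.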

How to enter two commands in pig gruntshell without typing enter key?

A = LOAD '/pig/student.tsv' as (rollno:int, name:chararray, gpa:float);
DUMP A;
If I want to execute the first line, I have to type Enter key after the first line.
How can I make it as a single execution?
You can create a Pig script file to make it a single execution.
test.pig
A = LOAD '/pig/student.tsv' as (rollno:int, name:chararray, gpa:float);
DUMP A;
Now run the Pig script using the command below from pig/bin:
pig -f /path/test.pig
You need to create a Pig script (say, myscript.pig) containing those two lines. Then run the script using the command pig myscript.pig
Short answer: use a script, as suggested by Kumar.
Long answer: if you create a single-line script containing multiple statements, it will not be long before it becomes a nightmare to read and understand as your script grows. That said, once you use a script file, it won't matter whether you use one line or multiple lines.
So my suggestion is to use a well-indented script for learning/development/what-have-you.
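For completeness, Pig also has an -e (execute) option that takes the statements inline, so both statements can be issued in one shot without a script file. A sketch of the two single-execution alternatives:

```shell
# Put both statements in a script file so they run as one execution
# (the path and schema are the ones from the question).
cat > test.pig <<'EOF'
A = LOAD '/pig/student.tsv' AS (rollno:int, name:chararray, gpa:float);
DUMP A;
EOF
# Either of these runs everything in one go (commented out: requires a Pig install):
#   pig -f test.pig
#   pig -e "A = LOAD '/pig/student.tsv' AS (rollno:int, name:chararray, gpa:float); DUMP A;"
```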

Is there any way to run impala shell with sql script with parameters?

Is there any way to run impala shell with SQL script with parameters?
For example:
impala-shell -f /home/john/sql/load.sql /dir1/dir2/dir3/data_file
I got this error:
Error, could not parse arguments "-f /home/john/sql/load.sql /dir1/dir2/dir3/data_file"
This feature is available in CDH 5.7 / Impala 2.5 and higher.
The --var option lets you pass substitution variables to the statements that are executed by that impala-shell session, for example the statements in a script file processed by the -f option. You encode the substitution variable on the command line using the notation --var=variable_name=value. Within a SQL statement, you substitute the value by using the notation ${var:variable_name}.
See more details directly in the documentation : https://www.cloudera.com/documentation/enterprise/latest/topics/impala_set.html
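Putting that together for the example in the question, it could look like the sketch below (the variable name data_file and the table name are illustrative):

```shell
# Sketch of --var substitution (Impala 2.5 / CDH 5.7 and higher).
# The ${var:...} marker goes inside the SQL file; the value is supplied
# on the command line. Names here are illustrative.
cat > load.sql <<'EOF'
LOAD DATA INPATH '${var:data_file}' INTO TABLE my_table;
EOF

# Pass the value when invoking the script (commented out: needs a live Impala):
#   impala-shell -f load.sql --var=data_file=/dir1/dir2/dir3/data_file
```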
No, you can specify a file of sql statements with -f, but it does not take a file of parameters. See the impala-shell documentation for more details:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_impala_shell.html

pigrc feature available?

Is there a way to "automatically" set certain variables when I invoke the Pig grunt interactive shell? I understand that we could use the define/default commands, but that is manual. A use case would be setting various variables that point to different HDFS paths. I also understand that such an option can be used when calling a Pig file using
pig -param_file -f somefile.pig
But even if I use -param_file when invoking the Pig shell, it does not work (pig -param_file ).
What I am looking for is something like Hive's ".hiverc" file feature. Do we have one?
As per this JIRA, you already have it, but you need to be on Pig 0.11.0 (or later) for it to work.
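As a sketch of how that feature is used: Pig 0.11+ executes the statements in ~/.pigbootup when grunt starts (the location can be overridden with the pig.load.default.statements property). The `set` values below are just examples:

```shell
# Sketch (Pig 0.11+): grunt runs the statements in ~/.pigbootup at startup,
# analogous to Hive's .hiverc. The settings below are illustrative examples.
cat > ~/.pigbootup <<'EOF'
set default_parallel 10;
set job.name 'interactive-session';
EOF
```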

Pig in grunt mode

I have installed Cygwin, Hadoop and Pig on Windows. The configuration seems OK, as I can run Pig scripts in batch and embedded mode.
When I try to run pig in grunt mode, something strange happens. Let me explain.
I try to run a simple command like
grunt> A = load 'passwd' using PigStorage(':');
When I press Enter, nothing happens. The cursor moves to the next line and the grunt> prompt never appears again. It is as if I were typing in a text editor.
Has anything similar ever happened to you? Do you have any idea how I can solve this?
The behavior you are observing is expected. I will take the Pig tutorial as an example.
The following command does not result in any activity by pig.
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
But if you invoke a command that uses the data from the variable raw and triggers a map-reduce job, that is when you will see some action in your grunt shell. Something along the lines of the second command mentioned there:
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
Similarly, your command will not result in any action by itself; you have to use the data from variable A in a way that triggers a map-reduce job to see some action in the grunt shell:
grunt> A = load 'passwd' using PigStorage(':');
Pig only processes the commands when you use a command that creates output, namely DUMP (to the console) or STORE. You can also use DESCRIBE to get the structure of an alias and EXPLAIN to see the map-reduce plan.
So basically, DUMP A; will give you all the records in A.
Please try running it in the Windows command window:
C:\FAST\JDK64\1.6.0.31/bin/java -Xmx1000m -Dpig.log.dir=C:/cygwin/home/$USERNAME$/nubes/pig/logs -Dpig.log.file=pig.log -Dpig.home.dir=C:/cygwin/home/$USERNAME$/nubes/pig/ -classpath C:/cygwin/home/$USERNAME$/nubes/pig/conf;C;C:/FAST/JDK64/1.6.0.31/lib/tools.jar;C:/cygwin/home/$USERNAME$/nubes/pig/lib/jython-standalone-2.5.3.jar;C:/cygwin/home/$USERNAME$/nubes/pig/conf;C:/cygwin/home/$USERNAME$/nubes/hadoop/conf;C:/cygwin/home/$USERNAME$/nubes/pig/pig-0.11.1.jar org.apache.pig.Main -x local
Replace $USERNAME$ with your user id, and modify the classpath and conf path accordingly.
It works well in both local and map-reduce mode.
The Pig shell hangs in Cygwin, but a Pig script executes successfully from a script file, as below:
pig ./user/input.txt
For local mode:
pig -x local ./user/input.txt
I came across the same problem as you yesterday, and I spent a whole day finding out what was wrong with my Pig setup and fixing it. It turned out to be only because I had copied the Pig code from another source: curly ("bent") quotation marks are not recognized by the Pig command line, which only accepts straight quotation marks, so the input stream never ends.
My suggestion is to watch out for invalid characters in the code, especially when pasting code into the command line, which often causes unexpected failures.
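A quick way to catch this before pasting is to grep for the UTF-8 byte sequences of the four curly-quote characters. The file name below is illustrative, and the printf line just fabricates a bad line like the one that hung grunt:

```shell
# Curly quotes are the UTF-8 sequences E2 80 98/99 (single) and E2 80 9C/9D
# (double). Create a sample bad line, then flag any curly quote in the file.
printf "A = load \342\200\230passwd\342\200\231 using PigStorage(':');\n" > copied.pig
LC_ALL=C grep -n "$(printf '\342\200[\230\231\234\235]')" copied.pig \
  && echo "replace these with straight quotes before pasting into grunt"
```

LC_ALL=C makes grep match raw bytes, so the check works regardless of the terminal locale.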
