How do I log Pig Latin grunt shell commands that I write? - hadoop

I am new to Pig and Pig Latin. I'd like to log the commands that I write in the interactive grunt shell, so that I can piece together working Pig Latin scripts. Is this possible? Is there a file that stores the history of commands I've written, similar to my ".bash_history" file?
I'd like access to the ".grunt_history", if such a thing exists, or some way to turn on logging to a file.

The Pig history file is in ~/.pig_history. So, if your user home is /home/joe, the path is /home/joe/.pig_history.
However, you need to take care to locate the right user home directory. You can look it up in /etc/passwd, because some user home directories are not standard. For example, we use CDH4 and start Grunt with sudo -u hdfs pig. In that situation, the history file is /var/lib/hadoop-hdfs/.pig_history. Here is the source code.
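This is not Pig's own code, just a minimal sketch of the home-directory lookup described above. It reads the sixth colon-separated field (the home directory) from a passwd-format file and appends .pig_history. The function name pig_history_path and its optional second argument are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: print the expected Pig history path for a user.
# $1 = user name, $2 = passwd-format file (assumed default: /etc/passwd)
pig_history_path() {
    user="$1"
    passwd_file="${2:-/etc/passwd}"
    # Field 1 is the login name, field 6 is the home directory.
    home=$(awk -F: -v u="$user" '$1 == u { print $6; exit }' "$passwd_file")
    if [ -z "$home" ]; then
        echo "no passwd entry for $user" >&2
        return 1
    fi
    printf '%s/.pig_history\n' "$home"
}
```

Calling pig_history_path hdfs on a machine like the CDH4 one above should print the path under that user's home directory. Accounts that come from LDAP or another directory service may not appear in /etc/passwd at all, so treat a failed lookup as "unknown" rather than "no history".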

You're close - look for ~/.pig_history

Related

modify the Source code of hadoop command to add text during command execution

I'd like to see the source code for certain hadoop commands like -put and -ls. I want to be able to add additional information to the log outputs that are associated with running these commands. For example, I want to show the message "Hii user, your file is copying from local file system to hdfs" during the execution of the -get or copyFromLocal command.
I want to make the change in the core files, not in API files like CopyCommands.java (http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/CopyCommands.java?view=markup).
This type of message should be printed when the command executes.
Can anyone tell me which file I should change?
How can I find these?

Run shell script with Oozie: how to get the script path in HDFS inside the script

I am running a shell script with Oozie. First I uploaded the script to HDFS; the script should then forward its logs to a log file in the same directory where the script is stored in HDFS, meaning the generated log file should also be in HDFS.
Does anyone know how to achieve this?
...a log file in the same directory where this script is stored in HDFS...
The Oozie Shell action contains a <file> element with the HDFS path of the script. But the way it works is not what you seem to think:
Oozie asks YARN to allocate a container somewhere
Oozie asks YARN to download some files to the container's private local filesystem (in CWD), and especially the <file> stuff
Finally, Oozie asks YARN to run the local version of the script
Bottom line: the script that is executed has no way to know its original HDFS directory. The Action must pass that directory explicitly as a script argument, or as an env variable.
Assuming that you use an env variable, the obvious solution to run the archive part is something like
hdfs dfs -appendToFile ./MySession.log $LOG_ARCHIVE_DIR/archive.log
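A slightly more defensive version of that command can be sketched as a small function inside the script. It assumes, as above, that the Oozie action sets LOG_ARCHIVE_DIR in the script's environment. The function name archive_log is hypothetical.

```shell
# Sketch: append a local log file to archive.log in the HDFS directory
# named by LOG_ARCHIVE_DIR (assumed to be set by the Oozie action).
archive_log() {
    local_log="$1"
    if [ -z "${LOG_ARCHIVE_DIR:-}" ]; then
        # Fail loudly instead of appending to "/archive.log" by accident.
        echo "LOG_ARCHIVE_DIR is not set" >&2
        return 1
    fi
    hdfs dfs -appendToFile "$local_log" "$LOG_ARCHIVE_DIR/archive.log"
}

# Intended use, after the real work has written ./MySession.log:
#   archive_log ./MySession.log
```

Checking the variable first makes a misconfigured action fail with a readable message rather than with an HDFS permission error on an unexpected path.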

Running shell script from oozie through Hue

I am invoking a bash shell script using oozie editor in Hue.
I used the shell action in the workflow and tried the following options in the shell command:
Uploaded the shell script using 'choose a file'
Gave local directory path where shell script is present
Gave HDFS path where shell script is present
But all of these options gave the following error:
Cannot run program "sec_test_oozie.sh" (in directory "/data/hadoop/yarn/local/usercache/user/appcache/application_1399542362142_0086/container_1399542362142_0086_01_000002"): java.io.IOException: error=2, No such file or directory
How should I give the shell script execution command?
Where should the shell script file reside?
You need to add the file "sec_test_oozie.sh" in the Oozie shell step, under "Add files".
I think you are creating the file on a Windows machine, which adds extra line break characters. You need to convert the shell script file to Unix format. I also faced the same issue. Then I created the file on a Linux system and it started working. The error is misleading.
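If recreating the file on Linux is not convenient, the conversion to Unix format can be sketched with tr, which deletes the carriage-return characters that Windows editors add. The function name strip_cr is hypothetical. Note that this removes every CR in the file, not only those at line ends, which is usually what you want for a shell script.

```shell
# Sketch: write a copy of a script with Windows CR characters removed.
# $1 = input file, $2 = output file (must be a different file from $1)
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Intended use, before uploading the converted copy to HDFS:
#   strip_cr sec_test_oozie.sh sec_test_oozie.unix.sh
```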
I want to extend @SergioRG's answer. Oozie, at least with Cloudera's Hue interface, is very counterintuitive.
To run a script file, three conditions should be met:
the file is on the HDFS file system, in a folder accessible by Oozie
the file should be indicated in the shell command field
the file should be added with any other dependent file in the "Files+" part of the task card.
I wonder why they didn't add by default the script file you are calling.
Edit: please also check in advanced options (the gear in the left upper corner) if you need to set the path variable (eg. PATH=/usr/local/bin:/usr/bin).
Did you edit sec_test_oozie.sh with the Hue File Browser? Depending on your Hue version it might have corrupted it: hue-list
I encountered the same problem, and the problem was that the script echoed some irrelevant line while the workflow tried to parse it as a property line. Oozie gave a very irrelevant error message of java.io.IOException: error=2, No such file or directory which only added confusion.
You will need to use <file> to add your script.
If you used <capture-output/> then you must make sure that your script prints only "key=value" lines, like java properties, otherwise you will get the error you see java.io.IOException: error=2, No such file or directory with some path pointing to .../yarn/local/usercache/...
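One way to keep the captured output clean can be sketched as follows, on the assumption that <capture-output/> reads the script's standard output: send every diagnostic message to stderr and print only key=value lines on stdout. The function name and the keys status and row_count are illustrative and not from the original post.

```shell
# Sketch: emit only key=value lines on stdout; everything else goes to stderr.
report_properties() {
    echo "starting work" >&2      # diagnostics: stderr, not stdout
    echo "status=OK"              # property line on stdout
    echo "row_count=$1"           # property line on stdout
}

# Intended use at the end of the script:
#   report_properties "$rows"
```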
We had this issue on a test script. Basically, if you use an editor that adds weird characters or line endings to the file, it'll throw this error because the script cannot be used in the container.
Try using nano file.sh to see if any strange characters appear. Then push it back to hdfs with hdfs dfs -put file.sh /path/you/need
Removing the #!/bin/bash from my shell script helped me
"No such file or directory" means Oozie cannot locate the file. Please check the AddPath setting in the command.
In the edit node section, get the Oozie application HDFS path.
Upload the shell script to the Oozie application path in HDFS.
In the Oozie edit node step, under Shell command, specify the name of the shell script you uploaded.
Below that there is an AddPath option; use it to add files, and add the shell script that was uploaded to the HDFS path.

Can I set pig.temp.dir to /user/USERNAME/tmp/pig?

Hive can be configured with
hive.exec.scratchdir=/user/${user.name}/tmp/hive
Can I do something similar with Pig? I have tried modifying the pig.properties file, but nothing seems to work.
pig.temp.dir=/user/${user.name}/tmp/pig <- Doesn't work
pig.temp.dir=/user/`whoami`/tmp/pig <- Doesn't work
pig.temp.dir=/user/${user}/tmp/pig <- Doesn't work
pig.temp.dir=/user/${username}/tmp/pig <- Doesn't work
I could replace the pig command with an alias, but I am hoping to have the change enshrined in the configuration file.
pig -Dpig.temp.dir=/user/`whoami`/tmp/pig
Thanks!
UPDATE: We decided to use /tmp/ for the production system. The reason this was an issue at all is because we are running MapR which seems to try to put the temp directories into the user directory, and succeeds with Hive, but not with Pig.
You can also set the pig temp dir from within a Pig script as follows:
set pig.temp.dir /user/foo/tmp/pig;
For small outputs, I think using the /tmp directory is fine, but for large outputs, I'd recommend users write to their personal directories.
Not a configuration file solution, but you can bake this into the $PIG_HOME/bin/pig script:
PIG_OPTS="$PIG_OPTS -Dpig.temp.dir=/user/`whoami`/tmp/pig"
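The same idea can be sketched as a function to source before calling pig, for setups where editing $PIG_HOME/bin/pig is not an option. It uses id -un in place of the backquoted whoami (both print the current user name). The function name add_pig_temp_dir and its optional argument are hypothetical.

```shell
# Sketch: append a per-user temp dir setting to PIG_OPTS.
# $1 = user name (optional; defaults to the current user)
add_pig_temp_dir() {
    user="${1:-$(id -un)}"
    PIG_OPTS="${PIG_OPTS:-} -Dpig.temp.dir=/user/$user/tmp/pig"
    export PIG_OPTS
}

# Intended use:
#   add_pig_temp_dir
#   pig
```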

How can I run the wordCount example in Hadoop?

I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder which is the folder before "bin" so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed or pseudo-distributed), you have to make sure hadoop's bin and other misc parameters are in your path. In linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is running, type hadoop version and see that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started hadoop services with the start-all.sh command, you should be good to go:
In local (standalone) mode, Hadoop uses your local file system in place of HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - relative paths are implicitly resolved inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input file, you can use any file/s that contain text. I used some random files from the gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (which can hold one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in local (standalone) mode, there is no need to copy anything - just point it to the proper input and output dirs. Make sure the wc dir doesn't exist or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
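To avoid the crash on an existing output dir, the run can be wrapped in a check first. This is a minimal sketch: it relies on hadoop fs -test -d, which exits with status 0 when the path is an existing directory. The function name run_wordcount is hypothetical.

```shell
# Sketch: refuse to start the wordcount job if the output dir already exists.
# $1 = examples jar, $2 = HDFS input dir, $3 = HDFS output dir
run_wordcount() {
    jar="$1"; input="$2"; output="$3"
    # "hadoop fs -test -d" succeeds only for an existing directory.
    if hadoop fs -test -d "$output"; then
        echo "output dir $output already exists" >&2
        return 1
    fi
    hadoop jar "$jar" wordcount "$input" "$output"
}

# Intended use (put the examples jar path in $EXAMPLES_JAR first):
#   run_wordcount "$EXAMPLES_JAR" /user/hadoopuser/data/ /user/hadoopuser/output/wc
```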
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
