Can I set pig.temp.dir to /user/USERNAME/tmp/pig? - hadoop

Hive can be configured with
hive.exec.scratchdir=/user/${user.name}/tmp/hive
Can I do something similar with Pig? I have tried modifying the pig.properties file, but nothing seems to work.
pig.temp.dir=/user/${user.name}/tmp/pig <- Doesn't work
pig.temp.dir=/user/`whoami`/tmp/pig <- Doesn't work
pig.temp.dir=/user/${user}/tmp/pig <- Doesn't work
pig.temp.dir=/user/${username}/tmp/pig <- Doesn't work
I could replace the pig command with an alias, but I am hoping to have the change enshrined in the configuration file.
pig -Dpig.temp.dir=/user/`whoami`/tmp/pig
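For reference, the alias I have in mind would look something like this in ~/.bashrc (the path here is just an example):
alias pig='pig -Dpig.temp.dir=/user/$(whoami)/tmp/pig'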
Thanks!
UPDATE: We decided to use /tmp/ for the production system. The reason this was an issue at all is that we are running MapR, which seems to try to put the temp directories into the user directory; it succeeds with Hive, but not with Pig.

You can also set the pig temp dir from within a Pig script as follows:
set pig.temp.dir /user/foo/tmp/pig;
For small outputs, I think using the /tmp directory is fine, but for large outputs, I'd recommend users write to their personal directories.

Not a configuration file solution, but you can bake this into the $PIG_HOME/bin/pig script:
PIG_OPTS="$PIG_OPTS -Dpig.temp.dir=/user/`whoami`/tmp/pig"
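Alternatively, assuming your version of bin/pig appends to an existing PIG_OPTS rather than overwriting it (recent Apache Pig releases do), you could export the same option from your shell profile and leave the script untouched:
export PIG_OPTS="$PIG_OPTS -Dpig.temp.dir=/user/`whoami`/tmp/pig"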

Related

How to edit a txt file inside HDFS in the terminal?

Is there any way to modify a txt file inside HDFS directly via the terminal?
Assume I have "my_text_file.txt" and I would like to modify it inside HDFS using the command below.
$ hdfs dfs -XXXX user/my_text_file.txt
I am interested to know what "XXXX" would be, if such a command exists.
Please note that I don't want to make the modification locally and then copy the file to HDFS.
You cannot edit files that are already in HDFS; it is not supported. HDFS works on a "write once, read many" model. So if you want to edit a file, make the changes in a local copy and then move it back to HDFS.
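That round trip might look roughly like this (paths are placeholders; -f overwrites the existing file and needs a reasonably recent Hadoop version):
$ hdfs dfs -get /user/me/my_text_file.txt .                      # copy to local
$ vi my_text_file.txt                                            # edit locally
$ hdfs dfs -put -f my_text_file.txt /user/me/my_text_file.txt    # push the edited copy back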
As explained by @BruceWayne, it's currently not possible. Files stored in HDFS are distributed across the cluster, so editing them directly from the terminal with hdfs commands would be very difficult; only the documented FileSystem shell commands are supported in the terminal.
In principle you could locate the block data on each datanode in the cluster and edit it there, but that would be troublesome.
Alternatively, you can install Hue. With Hue you can edit files in HDFS using a web UI.
You cannot edit files in HDFS, as it works on the principle of "write once, read many". Nowadays, however, you can edit a file using the Hue file browser in Cloudera.

What is the difference between moveFromLocal vs put and copyToLocal vs get in the hadoop hdfs commands

Basically, what is the major difference between using moveFromLocal and copyToLocal instead of the put and get commands in the Hadoop CLI?
moveFromLocal: Similar to put command, except that the source localsrc is deleted after it’s copied.
copyToLocal: Similar to get command, except that the destination is restricted to a local file reference.
Source.
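For a quick illustration of all four (local and HDFS paths here are just placeholders):
$ hadoop fs -put localfile.txt /user/me/            # local -> HDFS, local copy kept
$ hadoop fs -moveFromLocal localfile.txt /user/me/  # local -> HDFS, local copy deleted
$ hadoop fs -get /user/me/remote.txt .              # HDFS -> local
$ hadoop fs -copyToLocal /user/me/remote.txt .      # like get, destination must be a local path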

Source code for hadoop commands

I'd like to see the source code for certain hadoop commands like -put and -ls. I want to be able to add additional information to the log outputs that are associated with running these commands.
How can I find these?
The source can be found in Apache's SVN.
You'll see files such as Ls.java, CopyCommands.java (defines -put, -copyFromLocal), etc., that all define the HDFS commands that you're looking for.
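As a rough sketch of one way to browse those classes today via the Git mirror (the exact paths can shift between releases, so treat the path below as an assumption):
$ git clone https://github.com/apache/hadoop.git
$ ls hadoop/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/
# Ls.java, CopyCommands.java and friends live here in recent releases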

How do I log Pig Latin grunt shell commands that I write?

I am new to Pig and Pig Latin. I'd like to log the commands that I write in the interactive grunt shell, so that I can piece together working Pig Latin scripts. Is this possible? Is there a file that stores the history of commands I've written, similar to my ".bash_history" file?
I'd like access to the ".grunt_history", if such a thing exists, or some way to turn on logging to a file.
The Pig history file is in ~/.pig_history. So, if your user home is /home/joe, the path is /home/joe/.pig_history.
However, you need to take care when locating the user's home directory. You can look it up in /etc/passwd; some home directories are not in the standard location. For example, we use CDH4 and start Grunt with sudo -u hdfs pig. In that situation, the history file is /var/lib/hadoop-hdfs/.pig_history. Here is the source code.
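For example (the second path assumes the CDH-style hdfs user described above):
$ cat ~/.pig_history
$ sudo -u hdfs cat /var/lib/hadoop-hdfs/.pig_history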
You're close - look for ~/.pig_history

How can I run the wordCount example in Hadoop?

I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However, I don't understand the commands that are being used, specifically how to create an input file, upload it to HDFS, and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder (the one that contains "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in local, pseudo-distributed or fully distributed mode), you have to make sure hadoop's bin directory is on your PATH and a couple of related environment variables are set. On linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is installed correctly, type hadoop version and see that no errors are raised. Assuming you followed the instructions on how to set up a single-node cluster and started the hadoop services with the start-all.sh command, you should be good to go:
In local (standalone) mode there is no HDFS at all; Hadoop just uses your local file system, so you reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - everything is implicitly resolved relative to your HDFS user dir. Also, if you're using a client machine to run commands against a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
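One more hedged caveat: depending on your Hadoop version, the target directory may need to exist before you copy into it, in which case create it first (the -p flag is only needed, and only available, on newer releases; older ones create parents by default):
$ hadoop fs -mkdir -p /user/hadoopuser/data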
For the input file, you can use any file(s) that contain text. I used some random files from the Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (it can contain one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in local mode, there's no need to copy anything - just point it at the proper input and output dirs. Make sure the wc dir doesn't exist beforehand or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
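Once the job finishes, a minimal sketch of checking the result (part-r-00000 is the typical name of the first reducer's output file, but it can vary by version):
$ hadoop fs -ls /user/hadoopuser/output/wc
$ hadoop fs -cat /user/hadoopuser/output/wc/part-r-00000 | head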
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
