How can I specify Hadoop XML configuration variables via the Hadoop shell scripts? - hadoop

I'm writing code to create a temporary Hadoop cluster. Unlike most Hadoop clusters, I need the location for logs, HDFS files, etc., to be a specific temporary network location that is different each time the cluster is started. This network directory will be generated at runtime; I do not know the directory name at the time I check in the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At checkin time: I can modify the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At run time: I generate the temporary directory that I want to use for my data storage.
I can instruct most of Hadoop to use this temporary directory by specifying environment variables like HADOOP_LOG_DIR and HADOOP_PID_DIR, and if necessary I can modify the shell scripts to read those environment variables.
However, HDFS determines its local directory to store the filesystem via two properties that are defined in XML files, not environment variables or shell scripts: hadoop.tmp.dir in core-default.xml and dfs.datanode.data.dir in hdfs-default.xml.
Is there any way to edit these XML files to determine the value of hadoop.tmp.dir at runtime? Or, alternatively, is there any way to use environment variables to override the XML-configured value of hadoop.tmp.dir?

We had a similar requirement earlier. Configuring dfs.data.dir and dfs.name.dir as part of HADOOP_OPTS worked well for us. For example:
export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA"
The same approach can be used for other configuration properties as well, such as the NameNode URL.
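A sketch of how that approach could be combined with the environment variables mentioned in the question; $TMP_CLUSTER_DIR is a hypothetical variable set by the cluster launcher, and hadoop.tmp.dir is passed the same way on the assumption that other properties can be configured like this too, as noted above:
# Sketch: feed the runtime-generated directory to both the shell-script
# settings and the XML-configured properties (as JVM system properties,
# per the approach above). $TMP_CLUSTER_DIR is hypothetical.
export HADOOP_LOG_DIR="$TMP_CLUSTER_DIR/logs"
export HADOOP_PID_DIR="$TMP_CLUSTER_DIR/pids"
export HADOOP_OPTS="$HADOOP_OPTS \
  -Dhadoop.tmp.dir=$TMP_CLUSTER_DIR/hadoop-tmp \
  -Ddfs.name.dir=$TMP_CLUSTER_DIR/dfs/name \
  -Ddfs.data.dir=$TMP_CLUSTER_DIR/dfs/data"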

Related

hdfs or hadoop command to sync the files or folder between local to hdfs

I have local files which get added daily, so I want to sync these newly added files to HDFS.
I tried the command below, but it does a complete copy every time; I want a command which copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package and is perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own shell script to compare source/target file times and then copy only the newer files.
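A rough sketch of that do-it-yourself approach, reusing the source and target paths from the question; the marker-file location is an arbitrary choice, and -put -f (overwrite) assumes a reasonably recent Hadoop release:
# Copy only files added since the last run, tracked by a marker file's mtime.
MARKER=/home/user/.last_hdfs_sync
[ -f "$MARKER" ] || touch -t 197001010000 "$MARKER"   # first run: copy everything
find /home/user/files -type f -newer "$MARKER" -print0 |
  while IFS= read -r -d '' f; do
    hdfs dfs -put -f "$f" /data/files/
  done
touch "$MARKER"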

Is pig.temp.dir property mandatory?

Pig Execution Mode = Local
In that case, do we need to set the pig.temp.dir=/temp property, and does this /temp folder need to be present inside HDFS?
Note:
Storing Intermediate Results
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property. The property's default value is "/tmp" which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.
As per the "Storing Intermediate Results" heading at http://pig.apache.org/docs/r0.14.0/start.html#req
You'll still need to have some temp directory, but it needs to be present in your local file system. In local mode, Pig (and MapReduce) does all operations on the local filesystem by default.
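For instance, a small sketch of pointing the property at a local directory for a local-mode run (the directory and script names are made up; Pig accepts Java -D properties when they are given before its other command-line arguments):
# Run Pig in local mode with intermediate results going to a local directory.
mkdir -p /home/user/pig-tmp
pig -Dpig.temp.dir=/home/user/pig-tmp -x local myscript.pig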

Hadoop copying file to hadoop filesystem

I have copied a file from the local filesystem to HDFS, and the file got copied to /user/hduser/in:
hduser#vagrant:/usr/local/hadoop/hadoop-1.2.1$ bin/hadoop fs -copyFromLocal /home/hduser/afile in
Question:
1. How does Hadoop, by default, copy the file to this directory (/user/hduser/in)? Where is this mapping specified in the conf files?
If you write the command like above, the file gets copied to your user's HDFS home directory, which is /user/<username>. See also here: HDFS Home Directory.
You can use an absolute pathname (one starting with "/") just like in a Linux filesystem, if you want to write the file to a different location.
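A quick illustration of the two cases, using the paths from the question (the /data/in destination is made up):
# Relative destination: resolved against the HDFS home directory /user/hduser
bin/hadoop fs -copyFromLocal /home/hduser/afile in         # ends up as /user/hduser/in
# Absolute destination: written exactly where specified
bin/hadoop fs -copyFromLocal /home/hduser/afile /data/in   # ends up as /data/in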
Are you using a default VM? Basically, if you configure Hadoop from the binaries without using a preconfigured yum package, it doesn't have a default path. But if you use yum via a Hortonworks or Cloudera VM, it comes with a default path, I guess.
Check core-site.xml to see the default filesystem URI (fs.defaultFS, or fs.default.name in older releases). So "/" will point to the base URL set in that XML, and any folder mentioned in the command without a full path will be resolved under it.
Hadoop picks up the default filesystem path defined in core-site.xml and writes the data there.
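If you'd rather not hunt through the XML, newer Hadoop releases (2.x and later) can print the effective value directly; on 1.x the property is named fs.default.name and lives in core-site.xml:
# Prints the configured default filesystem URI, e.g. hdfs://localhost:9000 (Hadoop 2.x+)
hdfs getconf -confKey fs.defaultFS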

Which configuration file client is using to connect to hadoop cluster

When the edge node has multiple Hadoop distributions, there can be multiple configuration files scattered across the directories.
In those cases, how do I know which configuration file the client is referring to in order to connect to the cluster (say, for YARN)? One option is to look at the .bashrc file to find out whether the HADOOP_HOME variable is set.
Are there any other options to find this out? (Obviously, using the find command to search for a file will not solve the purpose.)
Hadoop provides the classpath command. Read the description of the command below:
classpath prints the class path needed to get the
Hadoop jar and the required libraries
You can execute this command as:
hadoop classpath
or
yarn classpath
Both of these commands should give you almost identical results.
For example, I got the following output for hadoop classpath:
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\bin>hadoop classpath
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\etc\hadoop;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\conf\;
e:\hdp\tez-0.7.0.2.3.0.0-2557\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\lib\*;
All these paths have HADOOP_HOME as the parent path; in my case, it is "e:\hdp\hadoop-2.7.1.2.3.0.0-2557". From this path you can easily figure out which distribution of Hadoop your client is referring to.
In my case, my client is using the Hadoop configuration and jars from the "e:\hdp\hadoop-2.7.1.2.3.0.0-2557" directory.
You can also run the env command to get HADOOP_HOME for the session. Even if you override HADOOP_HOME, env will give the current value for the session.
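On Linux, a quick way to combine both checks from the shell (a sketch; the grep pattern assumes the stock Apache layout where the configuration directory ends in etc/hadoop):
# Show Hadoop-related variables set in the current session
env | grep '^HADOOP'
# List the classpath entries; the configuration directory is typically the first one
hadoop classpath | tr ':' '\n' | grep 'etc/hadoop'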

PIG automatically connected with default HDFS, how?

I just started learning Hadoop and Pig (over the last two days!) for one of my future projects.
For experiments, I've installed Hadoop (HDFS on the default localhost:9000) in pseudo-distributed mode and Pig (in MapReduce mode).
When I initialized Pig by typing the ./bin/pig command, it launched the Grunt command line and I got a message that Pig had connected to HDFS (localhost:9000); later I was able to access HDFS through Pig successfully.
I was expecting to have to perform some manual configuration for Pig to access HDFS (as per various internet articles).
My question is: where did Pig pick up the default HDFS configuration (localhost:9000)? I checked pig.properties but didn't find anything there. I need this information as I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0) you have to tell it how to connect to HDFS.
I don't know how you did this, but generally it is done by adding the Hadoop conf directory path to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR.
When you start the Grunt shell, Pig locates the directory containing the Hadoop configuration XMLs and takes the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and JobTracker.
For reference, you may have a look at the Pig shell script to see how the environment variables are collected and evaluated.
Pig can connect to the underlying HDFS in three ways:
1. Pig uses HADOOP_HOME to find the Hadoop client to run. Your HADOOP_HOME should already be set up in your bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2. Alternatively, your HADOOP_CONF_DIR may already be set up; it contains the XML files with the Hadoop configuration:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3. If neither of these is set, you can also connect to the underlying HDFS by changing pig.properties, which is present under the PIG_HOME/conf directory.
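Putting options 1 and 2 together with the PIG_CLASSPATH suggestion from the first answer, a minimal sketch of pointing Pig at a specific cluster before starting the Grunt shell (the paths simply reuse the examples above):
# Point Pig at the Hadoop client and its configuration, then start Grunt.
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PIG_CLASSPATH=$HADOOP_CONF_DIR
pig   # mapreduce mode; fs.default.name is read from core-site.xml in HADOOP_CONF_DIR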
