Which configuration file is the client using to connect to the Hadoop cluster - hadoop

When the edge node has multiple Hadoop distributions, there can be multiple configuration files scattered across the directories.
In those cases, how do you know which configuration file the client is referring to when it connects to the cluster (say, for YARN)? One option is to look at the .bashrc file to find out whether the HADOOP_HOME variable is set.
Are there any other options to find this out? (Obviously, using the find command to search for a file will not serve the purpose.)

Hadoop provides the classpath command. Read the description of the command below:
classpath prints the class path needed to get the
Hadoop jar and the required libraries
You can execute this command as:
hadoop classpath
or
yarn classpath
Both of these commands should give you almost identical results.
For example, I got the following output for hadoop classpath:
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\bin>hadoop classpath
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\etc\hadoop;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\conf\;
e:\hdp\tez-0.7.0.2.3.0.0-2557\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\lib\*;
All these paths contain HADOOP_HOME as the parent path. In my case, it is "e:\hdp\hadoop-2.7.1.2.3.0.0-2557". From this path, you can easily figure out which distribution of Hadoop your client is referring to.
In my case, my client is using the Hadoop configuration and jars from the "e:\hdp\hadoop-2.7.1.2.3.0.0-2557" directory.
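On a Linux edge node the separator in the classpath output is ':' rather than ';'. A quick sketch for pulling out just the configuration directory (typically the first entry); the grep pattern is only an example and may need adjusting for your distribution's layout:
# split the classpath into one entry per line and keep the entries that look like config dirs
hadoop classpath | tr ':' '\n' | grep -E 'etc/hadoop|conf'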

You can run the env command to see HADOOP_HOME for the session. Even if you overwrite HADOOP_HOME, env will give the current value for the session.
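For example, something along these lines will show the Hadoop-related variables currently set in your shell session (HADOOP_CONF_DIR will only appear if your environment actually defines it):
# list all Hadoop-related environment variables for the current session
env | grep -i hadoop
# or inspect individual variables directly
echo $HADOOP_HOME
echo $HADOOP_CONF_DIR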

Related

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount Problem.
Things I did so far -
Set up the Hadoop single node cluster following the link below:
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the word count program following the link below:
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem is when I execute the last line to run the program -
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present
The file is also present in the directory with contents
Finally, on a side note, I also tried the following directory structure in the jar command.
No avail! :/
I would really appreciate it if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running in pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
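For instance, a rough sketch of copying the input into HDFS and running the job against HDFS paths (the HDFS directory names here are only examples):
# create an input directory in HDFS and copy the local files into it
hdfs dfs -mkdir -p /user/$USER/input
hdfs dfs -put /usr/local/hadoop/input/* /user/$USER/input
# run the job against HDFS paths; the output directory must not exist yet
hadoop jar wordcount.jar /user/$USER/input /user/$USER/output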
Based on the stacktrace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or from its permissions not allowing your current user to run commands against that path.
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so it's better to get familiar with that rather than mess around with having to install Hadoop yourself from scratch, in my opinion

Why Hive will search its configuration profile in HADOOP_CONF_DIR first?

Today I found that if I copy hive-site.xml into $HADOOP_HOME/etc/hadoop/, Hive will use the hive-site.xml in $HADOOP_HOME/etc/hadoop/ instead of the one in $HIVE_HOME/conf, and it will also search for hive-log4j.properties in $HADOOP_HOME/etc/hadoop/.
If not found, Hive will just use the default one in /lib/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties instead of the customized one in $HIVE_HOME/conf, but why?
I searched for the keyword copy hive-site.xml to HADOOP_HOME in the official Hive manual on apache.org but failed to find any explanation...
My Hive version is hive-1.1.0-cdh5.7.6, Hadoop version hadoop-2.6.0-cdh5.7.6, JDK 1.7.
So, since you've mentioned Sqoop, I'll point out the proper process for picking up the Hive XML configuration.
1) There's a classpath problem if the file isn't found. Copying the file is one solution, but a poor one; a symlink is preferred (see the sketch after this list).
Every time I've used Sqoop, I never had to mess around with any XML files - it just worked. Therefore, both HDP and CDH must have the proper classpath and/or symlinks set up.
2) The documentation states where configurations are loaded from
Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
This classpath controls where configurations are loaded from
3) You can also provide extra files at runtime:
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
sqoop import -files $HIVE_HOME/conf/hive-site.xml ...
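To illustrate point 1, a minimal sketch of the symlink approach, assuming the $HIVE_HOME and $HADOOP_HOME layout from the question:
# link the Hive config into the Hadoop config directory instead of copying it,
# so there is only one copy of hive-site.xml to maintain
ln -s $HIVE_HOME/conf/hive-site.xml $HADOOP_HOME/etc/hadoop/hive-site.xml
# verify the link
ls -l $HADOOP_HOME/etc/hadoop/hive-site.xml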

How can I specify Hadoop XML configuration variables via the Hadoop shell scripts?

I'm writing code to create a temporary Hadoop cluster. Unlike most Hadoop clusters, I need the location for logs, HDFS files, etc., to be in a specific temporary network location that is different each time the cluster is started. This network directory will be generated at runtime; I do not know the directory name at the time I'm checking in the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At checkin time: I can modify the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At run time: I generate the temporary directory that I want to use for my data storage.
I can instruct most of Hadoop to use this temporary directory by specifying environment variables like HADOOP_LOG_DIR and HADOOP_PID_DIR, and if necessary I can modify the shell scripts to read those environment variables.
However, HDFS determines its local directory to store the filesystem via two properties that are defined in XML files, not environment variables or shell scripts: hadoop.tmp.dir in core-default.xml and dfs.datanode.data.dir in hdfs-default.xml.
Is there any way to edit these XML files to determine the value of hadoop.tmp.dir at runtime? Or, alternatively, is there any way to use environment variables to override the XML-configured value of hadoop.tmp.dir?
We had a similar requirement earlier. Configuring dfs.data.dir and dfs.name.dir as part of HADOOP_OPTS worked well for us. For example:
export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA"
This method can be used to set other configuration properties as well, such as the NameNode URL.
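Applied to the question, a sketch of the same -D override idea with a directory generated at run time; the paths are hypothetical, and it assumes hadoop.tmp.dir and dfs.datanode.data.dir respond to the same mechanism as the properties above:
# generate the per-run base directory at startup
TMP_BASE=$(mktemp -d /tmp/hadoop-run-XXXXXX)
export NAMENODE_DATA=$TMP_BASE/dfs/name
export DFS_DATA=$TMP_BASE/dfs/data
# pass the XML properties as -D system properties through HADOOP_OPTS
export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA -Dhadoop.tmp.dir=$TMP_BASE/tmp"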

Hadoop on Mesos fails with "Could not find or load main class org.apache.hadoop.mapred.MesosExecutor"

I have a Mesos cluster setup -- I have verified that the master can see the slaves -- but when I attempt to run a Hadoop job, all tasks wind up with a status of LOST. The same error is present in all the slave stderr logs:
Error: Could not find or load main class org.apache.hadoop.mapred.MesosExecutor
and that is the only line in the stderr logs.
Following the instructions on http://mesosphere.io/learn/run-hadoop-on-mesos/, I have put a modified Hadoop distribution on HDFS which each slave can access.
In the lib directory of the Hadoop distribution, I have added hadoop-mesos-0.0.4.jar and mesos-0.14.2.jar.
I have verified that each slave does in fact download this Hadoop distribution, and that hadoop-mesos-0.0.4.jar contains the class org.apache.hadoop.mapred.MesosExecutor, so I cannot figure out why the class cannot be found.
I am using Hadoop from CDH4.4.0 and mesos-0.15.0-rc4.
Does anyone have any suggestions as to what might be the problem? I know I would normally start by suspecting a CLASSPATH problem, but in this case the mesos-slave is downloading, unpacking, and attempting to run a Hadoop TaskTracker, so I would imagine any CLASSPATH would be set up by the mesos-slave.
In the stdout of the slave logs, the environment is printed. There is a MESOS_HADOOP_HOME which is empty. Should this be set to something? If it is supposed to be set to the downloaded Hadoop distribution, I cannot set it in advance because the Hadoop distribution is downloaded to a new location every time.
In case that is related (maybe some permissions issue), when attempting to browse the slave logs via the master UI, I get the error Error browsing path: ....
The user running mesos-slave can browse to the correct directory when I do so manually.
I found the problem. bin/hadoop of the downloaded Hadoop distribution attempts to find its location by running which $0. However, that will find an existing Hadoop installation if one is present (e.g. /usr/lib/hadoop) and will load the jars under that installation's lib directory instead of the downloaded one's lib directory.
I had to modify bin/hadoop of the downloaded distribution to find its own location with dirname $0 instead of which $0.
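A minimal sketch of that kind of change; the exact lines in bin/hadoop differ between versions, so treat this as illustrative:
# before: resolves to whichever 'hadoop' is first on the PATH (e.g. /usr/lib/hadoop)
# bin=`which $0`
# after: resolve the script's own directory so the downloaded distribution's libs are used
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`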

PIG automatically connected with default HDFS, how?

I just started learning Hadoop and Pig (in the last two days!) for one of my future projects.
For experiments I've installed Hadoop (HDFS on the default localhost:9000) in pseudo-distributed mode and Pig (map-reduce mode).
When I initialized Pig by typing the ./bin/pig command, it launched the GRUNT command line and I got a message that Pig had connected to HDFS (localhost:9000); later I was able to successfully access HDFS through Pig.
I was expecting to have to perform some manual configuration for Pig to access HDFS (as per various internet articles).
My question is, where did Pig identify the default HDFS configuration (localhost:9000) from? I checked pig.properties but I didn't find anything there. I need this info as I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0) you have to tell it how it will connect to HDFS.
I don't know how you did this, but generally this is done by adding the Hadoop conf dir path to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR.
If you are starting the grunt shell, Pig will locate the directory of the Hadoop configuration XMLs and take the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and JobTracker.
For reference you may have a look at the Pig shell script to see how env. variables are collected and evaluated.
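In practice that usually means exporting something like the following before starting the grunt shell; the paths are only examples and should be adjusted to your installation:
# point Pig at the directory that holds core-site.xml and mapred-site.xml
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PIG_CLASSPATH=$HADOOP_CONF_DIR
# start the grunt shell; it should now report the fs.default.name from core-site.xml
./bin/pig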
Pig can connect to the underlying HDFS in 3 ways:
1-
Pig uses HADOOP_HOME to find the Hadoop client to run.
Your HADOOP_HOME should already be set up in your bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2-
Or else your HADOOP_CONF_DIR may already be set up; it contains the XML files for the Hadoop configuration:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3- If neither of these is set up, you can also connect to the underlying HDFS
by changing pig.properties, which is present under the PIG_HOME/conf dir
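For option 3, a hedged sketch of what the relevant pig.properties entries might look like for a pseudo-distributed cluster, assuming your Pig version forwards Hadoop properties from pig.properties to the job configuration (the host and port values are only examples):
# in PIG_HOME/conf/pig.properties
fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001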
