Why does Hive search HADOOP_CONF_DIR for its configuration files first? - hadoop

Today I found that if I copy hive-site.xml into $HADOOP_HOME/etc/hadoop/, Hive uses that copy instead of the one in $HIVE_HOME/conf, and it also looks for hive-log4j.properties in $HADOOP_HOME/etc/hadoop/.
If that file is not found there, Hive just uses the default one in /lib/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties instead of the customized one in $HIVE_HOME/conf. Why?
I searched the official Hive manual on apache.org for "copy hive-site.xml to HADOOP_HOME" but could not find any explanation.
My Hive version is hive-1.1.0-cdh5.7.6, Hadoop version hadoop-2.6.0-cdh5.7.6, JDK 1.7.

Since you mentioned Sqoop, I'll describe how it properly picks up the Hive XML configuration.
1) There's a classpath problem if the file isn't found. Copying the file is one solution, but a poor one. A symlink is preferred.
Every time I've used Sqoop, I never had to touch any XML files; it just worked. So both HDP and CDH presumably have the proper classpath and/or symlinks set up.
2) The documentation states where configurations are loaded from:
Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
This classpath controls where configurations are loaded from.
3) You can also supply extra files at runtime:
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
sqoop import -files $HIVE_HOME/conf/hive-site.xml ...
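To illustrate why a copy placed in the Hadoop conf directory can win, here is a minimal Python sketch of a first-match lookup along a search path. It assumes the file is resolved as a classpath resource and that the Hadoop conf directory comes first on that path; first_match and the temporary directories are made up for the demo, and this is not the actual Hive or Sqoop code.

```python
import os
import tempfile


def first_match(search_path, filename):
    """Return the first directory in search_path containing filename, else None."""
    for entry in search_path:
        # Wildcard entries such as lib/* only pull in jars, so skip them here.
        if entry.endswith("*"):
            continue
        if os.path.isfile(os.path.join(entry, filename)):
            return entry
    return None


# Throwaway directories standing in for the Hadoop and Hive conf dirs (hypothetical layout).
root = tempfile.mkdtemp()
hadoop_conf = os.path.join(root, "hadoop", "etc", "hadoop")
hive_conf = os.path.join(root, "hive", "conf")
for conf_dir in (hadoop_conf, hive_conf):
    os.makedirs(conf_dir)
    with open(os.path.join(conf_dir, "hive-site.xml"), "w") as fh:
        fh.write("<configuration/>\n")

# With the Hadoop conf dir listed first, its copy shadows the one in the Hive conf dir.
winner = first_match([hadoop_conf, hive_conf], "hive-site.xml")
print(winner == hadoop_conf)
```

Swapping the order of the two directories makes the Hive copy win, which is why the order of classpath entries matters more than where the file "should" live.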

Related

Which configuration file is the client using to connect to the Hadoop cluster?

When an edge node has multiple Hadoop distributions installed, configuration files can be scattered across many directories.
In that case, how can I tell which configuration file the client uses to connect to the cluster (say, for YARN)? One option is to look at the .bashrc file to see whether the HADOOP_HOME variable is set.
Are there any other ways to find this out? (Obviously, using the find command to search for a file does not solve the problem.)
Hadoop provides a classpath command. Its description reads:
classpath prints the class path needed to get the
Hadoop jar and the required libraries
You can execute this command as:
hadoop classpath
or
yarn classpath
Both commands should give you almost identical results.
For example, I got the following output from hadoop classpath:
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\bin>hadoop classpath
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\etc\hadoop;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\conf\;
e:\hdp\tez-0.7.0.2.3.0.0-2557\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\lib\*;
The Hadoop entries all have HADOOP_HOME as their parent path (the Tez entries live in their own directory). In my case, it is "e:\hdp\hadoop-2.7.1.2.3.0.0-2557". From this path you can easily tell which Hadoop distribution your client is referring to.
In my case, the client is using the Hadoop configurations and jars from the "e:\hdp\hadoop-2.7.1.2.3.0.0-2557" directory.
You can also run the env command to see HADOOP_HOME for the session. Even if HADOOP_HOME has been overridden, env shows the value currently in effect for the session.
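If you want to pick the candidate configuration directories out of the hadoop classpath output programmatically, a small Python sketch could look like the following. It works on the classpath text (abbreviated here from the sample output above) instead of invoking hadoop itself; plain_directories is a hypothetical helper name, and the separator is assumed to be ; on Windows and : on Linux.

```python
def plain_directories(classpath, sep):
    """Split classpath text and keep entries that are plain directories (no trailing *)."""
    entries = [entry.strip() for entry in classpath.split(sep)]
    return [entry for entry in entries if entry and not entry.endswith("*")]


# Abbreviated sample taken from the Windows output shown above.
sample = (
    r"e:\hdp\hadoop-2.7.1.2.3.0.0-2557\etc\hadoop;"
    r"e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\lib\*;"
    r"e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\*;"
    r"e:\hdp\tez-0.7.0.2.3.0.0-2557\conf\;"
)

dirs = plain_directories(sample, ";")
# The first plain directory is the Hadoop configuration directory in this sample.
print(dirs[0])
```

Wildcard entries only contribute jars, so the remaining plain directories are the places where XML configuration files can be picked up.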

Adding hive jars permanently

Is there any way I can add hive jars permanently instead of adding at session level in hive shell?
Any help would be appreciated
On the HiveServer2 host, create a directory such as /var/lib/hive and put all the necessary jars in it. Then edit hive-site.xml and list these jars in the hive.aux.jars.path property.
Eg:
ADD JAR /home/amal/hive/amaludf.jar
ADD JAR /home/amal/hive/amaludf2.jar
Instead of running the above commands in each session, you can define the jars once for all sessions.
Create a directory for storing these jars on the HiveServer2 host.
mkdir /var/lib/hive
Add all these jars to that directory
Set the property in hive-site.xml
<property>
<name>hive.aux.jars.path</name>
<value>/var/lib/hive</value>
</property>
Restart HiveServer2 after making this change.
Instead of creating a directory and putting all the jars in it, you can also specify the paths of individual jars. The only condition is that all of these jars must be present on the HiveServer2 host.
Eg:
<property>
<name>hive.aux.jars.path</name>
<value>file:///home/amal/hive/udf1.jar,file:///usr/lib/hive/lib/hive-hbase-handler.jar</value>
</property>
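If you have many jars, typing that comma-separated value by hand gets tedious. Here is a minimal Python sketch that builds the value from a directory of jars. The helper name aux_jars_value and the demo file names are hypothetical, and it assumes the jars sit on the local filesystem of the HiveServer2 host.

```python
import tempfile
from pathlib import Path


def aux_jars_value(jar_dir):
    """Build a comma-separated list of file:// URIs for every .jar in jar_dir."""
    jars = sorted(Path(jar_dir).resolve().glob("*.jar"))
    return ",".join(jar.as_uri() for jar in jars)


# Demo with a temporary directory and empty placeholder files (hypothetical names).
demo_dir = Path(tempfile.mkdtemp())
for name in ("udf1.jar", "udf2.jar", "notes.txt"):
    (demo_dir / name).touch()

value = aux_jars_value(demo_dir)
print(value)
```

The printed string can be pasted into the value element of the hive.aux.jars.path property; files that are not jars are ignored.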
You will have to put the jar in the lib folder of Hadoop or Hive on all your nodes.
This can be done in two steps:
The Hive client should be available on all nodes.
The Hive lib location should be defined in the CLASSPATH in hadoop-env.sh, and the same file should be updated across the entire Hadoop cluster.
(hadoop-env.sh should be updated so that CLASSPATH includes Hive, any other locations for user-defined custom jars, and a common location that is available on the entire cluster.)
If the changes do not take effect, you also need to restart Hive/Hadoop.
Create a directory named auxlib in $HIVE_HOME, put all your jars in this directory and restart the Hive server. Run ps -ef | grep hive to list the Hive processes and search for hive.aux.jars.path; you will see that all your jars are listed against this hiveconf setting.

Best place for json Serde JAR in CDH Hadoop for use with Hive/Hue/MapReduce

I'm using Hive/Hue/MapReduce with a json Serde. To get this working I have copied the json_serde.jar to several lib directories on every cluster node:
/opt/cloudera/parcels/CDH/lib/hive/lib
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib
/opt/cloudera/parcels/CDH/lib/hadoop/lib
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib
...
On every CDH update of the cluster I have to do that again.
Is there a more elegant way where the distribution of the Serde in the cluster would be automatic and resistant to updates?
If you are using HiveServer2 (the default in Cloudera 5.0+), the following configuration will work across your entire cluster without having to copy the jar to each node.
Add this to your hive-site.xml config file or, if you're using Cloudera Manager, to the "HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml" config box:
<property>
<name>hive.aux.jars.path</name>
<value>/user/hive/aux_jars/hive-serdes-1.0-snapshot.jar</value>
</property>
Then create the directory in your HDFS filesystem (/user/hive/aux_jars) and place the jar file in it. If you are running HUE you can do this part via the web UI, just click on File Browser at the top right.
It depends on the version of Hue and on whether you are using Beeswax or HiveServer2:
Beeswax: there is a workaround using HIVE_AUX_JARS_PATH: https://issues.cloudera.org/browse/HUE-1127
HiveServer2: it supports a hive.aux.jars.path property in hive-site.xml. HiveServer2 does not support a .hiverc, and Hue is looking at providing an equivalent at some point: https://issues.cloudera.org/browse/HUE-1066

PIG automatically connected with default HDFS, how?

I started learning Hadoop and Pig two days ago for a future project.
To experiment, I installed Hadoop in pseudo-distributed mode (HDFS on the default localhost:9000) and Pig (map-reduce mode).
When I started Pig with the ./bin/pig command, it launched the Grunt command line and printed a message that Pig had connected to HDFS (localhost:9000). After that I was able to access HDFS through Pig.
I was expecting to have to configure Pig manually to access HDFS (as various internet articles suggest).
My question is: where did Pig get the default HDFS configuration (localhost:9000) from? I checked pig.properties but didn't find anything there. I need to know this because I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0), you have to tell it how to connect to HDFS.
I don't know how you did this, but generally it is done by adding the Hadoop conf directory path to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR.
When you start the Grunt shell, Pig locates the directory containing the Hadoop configuration XML files and reads the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and JobTracker.
For reference you may have a look at the Pig shell script to see how env. variables are collected and evaluated.
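To make the lookup concrete, here is a minimal Python sketch that reads a property such as fs.default.name out of Hadoop-style configuration XML. It only mimics the idea; read_property is a hypothetical helper, the XML content is an illustrative sample for a pseudo-distributed setup, and Pig itself does this through the Hadoop Java libraries.

```python
import xml.etree.ElementTree as ET


def read_property(config_xml, name):
    """Return the value of the named property in Hadoop-style configuration XML, or None."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


# Sample core-site.xml content (illustrative values only).
core_site = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

print(read_property(core_site, "fs.default.name"))
```

If you later change the NameNode address in core-site.xml, any client that reads the same configuration directory picks up the new value without further changes.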
Pig can connect to the underlying HDFS in three ways:
1- Pig uses HADOOP_HOME to find the Hadoop client to run. Your HADOOP_HOME should already be set in your bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2- Alternatively, HADOOP_CONF_DIR, which points to the directory containing the Hadoop configuration XML files, may already be set:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3- If neither of these is set, you can still connect to the underlying HDFS by editing pig.properties, which is located in the PIG_HOME/conf directory.

Where does HDFS store files locally by default?

I am running Hadoop with the default configuration on a one-node cluster, and would like to find out where HDFS stores files locally.
Any ideas?
Thanks.
You need to look in your hdfs-default.xml configuration file for the dfs.data.dir setting. The default value is ${hadoop.tmp.dir}/dfs/data; note that ${hadoop.tmp.dir} is itself defined in core-default.xml.
The description for this setting is:
Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
For the current version (2.7.1), the directory appears to be
/tmp/hadoop-${user.name}/dfs/data
This is based on the dfs.datanode.data.dir and hadoop.tmp.dir settings from:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/core-default.xml
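To see how the nested ${...} references end up as a concrete path, here is a small Python sketch of the substitution. The expand helper and the user name alice are made up for illustration, and the property values are the defaults quoted in this thread; Hadoop performs this expansion itself when it loads the configuration.

```python
import re

VAR = re.compile(r"\$\{([^}]+)\}")


def expand(value, props, max_depth=20):
    """Repeatedly replace ${name} with props[name]; unknown names are left as-is."""
    for _ in range(max_depth):
        expanded = VAR.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


props = {
    "user.name": "alice",  # hypothetical user
    "hadoop.tmp.dir": "/tmp/hadoop-${user.name}",  # default quoted in this thread
}

print(expand("${hadoop.tmp.dir}/dfs/data", props))
```

For the hypothetical user alice this resolves to /tmp/hadoop-alice/dfs/data, so on a default single-node setup the blocks live under /tmp unless hadoop.tmp.dir or the data directory setting is overridden.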
As a more recent answer, and to clarify the Hadoop version numbers:
If you use Hadoop 1.2.1 (or something similar), @Binary Nerd's answer still applies.
But if you use Hadoop 2.1.0-beta (or something similar), the option you want to set is dfs.datanode.data.dir, as described in the configuration documentation.
For Hadoop 3.0.0, the HDFS root path is given by the property "dfs.datanode.data.dir".
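Run this at the command prompt to list the root of the HDFS namespace (note that this shows the contents of HDFS, not the local directory where the blocks are stored):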
bin/hadoop fs -ls /

Resources