Should the Hadoop installation path be the same across nodes?

Hadoop 2.7 is installed at /opt/pro/hadoop/hadoop-2.7.3 on the master, and the whole installation is then copied to the slave, but into a different directory, /opt/pro/hadoop-2.7.3. I then updated the environment variables (e.g., HADOOP_HOME) and hdfs-site.xml (for the namenode and datanode) on the slave machine.
Now I can run hadoop version on the slave successfully. However, on the master, start-dfs.sh fails with this message:
17/02/18 10:24:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [master]
master: starting namenode, logging to /opt/pro/hadoop/hadoop-2.7.3/logs/hadoop-shijiex-namenode-shijie-ThinkPad-T410.out
master: starting datanode, logging to /opt/pro/hadoop/hadoop-2.7.3/logs/hadoop-shijiex-datanode-shijie-ThinkPad-T410.out
slave: bash: line 0: cd: /opt/pro/hadoop/hadoop-2.7.3: No such file or directory
slave: bash: /opt/pro/hadoop/hadoop-2.7.3/sbin/hadoop-daemon.sh: No such file or directory
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/pro/hadoop/hadoop-2.7.3/logs/hadoop-shijiex-secondarynamenode-shijie-ThinkPad-T410.out
17/02/18 10:26:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Hadoop uses the master's HADOOP_HOME (/opt/pro/hadoop/hadoop-2.7.3) on the slave, while the HADOOP_HOME on the slave is /opt/pro/hadoop-2.7.3.
So should HADOOP_HOME be the same across nodes when installing?
.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-amd64/bin
export HADOOP_HOME=/opt/pro/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
On the slave server, $HADOOP_HOME/etc/hadoop has a file masters:
xx@wodaxia:/opt/pro/hadoop-2.7.3/etc/hadoop$ cat masters
master

No, not necessarily. But if the paths differ among the nodes, then you cannot use scripts like start-dfs.sh and stop-dfs.sh (and the same for YARN). These scripts refer to the $HADOOP_PREFIX variable of the node where the script is executed.
A snippet from hadoop-daemons.sh, which start-dfs.sh uses to start all the datanodes:
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_PREFIX" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"
The script is written this way on the assumption that all nodes of the cluster share the same $HADOOP_PREFIX or $HADOOP_HOME (deprecated) path.
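To make the failure concrete: slaves.sh expands $HADOOP_PREFIX on the master before ssh-ing into each slave, so the remote command roughly looks like the line below (illustrative only, with this question's paths substituted; not the literal command line the script builds):
ssh slave 'cd /opt/pro/hadoop/hadoop-2.7.3 ; /opt/pro/hadoop/hadoop-2.7.3/sbin/hadoop-daemon.sh --config /opt/pro/hadoop/hadoop-2.7.3/etc/hadoop start datanode'
Since /opt/pro/hadoop/hadoop-2.7.3 does not exist on the slave, you get the "No such file or directory" errors shown above.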
To overcome this,
1) Either try to keep the path the same across all the nodes.
2) Or log in to each node in the cluster and start the DFS process applicable to that node using:
$HADOOP_HOME/sbin/hadoop-daemon.sh start <namenode | datanode | secondarynamenode| journalnode>
The same procedure applies for YARN as well:
$HADOOP_HOME/sbin/yarn-daemon.sh start <resourcemanager | nodemanager>

No, it does not have to be. $HADOOP_HOME is individual to each Hadoop node, and it can be set in different ways. You can define it globally by setting it in the .bashrc file, or it can be set locally in the hadoop-env.sh script in your Hadoop folder, for example. Verify the value on every node of the cluster. If it is set globally, you can check it with echo $HADOOP_HOME. If it is a script option, you can verify the variable by importing it into the current shell and checking it again:
. /opt/pro/hadoop/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
echo $HADOOP_HOME
Besides, make sure that you don't have a hadoop.home.dir property in your configuration, as it overrides the environment variable $HADOOP_HOME.
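A quick way to check for that (a sketch, assuming $HADOOP_CONF_DIR points at your configuration directory):
grep -R "hadoop.home.dir" "$HADOOP_CONF_DIR"
If this prints nothing, the property is not set in your configuration files.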

Related

I can't run Hive from Mac terminal

I downloaded Hive and Hadoop onto my system. When I enter the jps command, all the nodes seem to be running:
81699 SecondaryNameNode
65058 ResourceManager
82039 NodeManager
36086
81463 NameNode
91288 Jps
37193 Launcher
95256 Launcher
81563 DataNode
However, when I try to run Hive using the ./hive command, I get the following error:
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
ERROR: Invalid HADOOP_COMMON_HOME
Unable to determine Hadoop version information.
'hadoop version' returned:
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
ERROR: Invalid HADOOP_COMMON_HOME
This is what my ~/.bashrc file looks like:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_311.jdk
export HADOOP_HOME=/opt/homebrew/Cellar/hadoop/3.3.1
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop
export HIVE_HOME=/Users/arjunpanyam/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

Hadoop : start-dfs.sh does not work when calling directly

I have a very strange problem when starting hadoop.
When I call start-dfs.sh using absolute path /usr/local/hadoop/etc/hadoop/sbin/start-dfs.sh, it starts without any problem.
But then I added hadoop to my environment variables:
export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
I would like to call it directly as start-dfs.sh. But when I start it like this, it throws an error:
20/10/26 16:36:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
Starting namenodes on []
localhost: Error: JAVA_HOME is not set and could not be found.
localhost: Error: JAVA_HOME is not set and could not be found.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Error: JAVA_HOME is not set and could not be found.
20/10/26 16:36:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I wonder what the problem is? My JAVA_HOME and core-site.xml are configured properly. Why doesn't it work when I start it directly from bash?
It seems that you need to set the JAVA_HOME environment variable to where your Java package is located in your (I suppose) Linux distribution. To do that, you have to locate the path of the Java installation.
In order to do that, you can use the following command in your terminal:
find /usr -name java 2> /dev/null
which will output one or more paths (depending on how many Java versions you have on your system).
You can choose one of the versions (or just take the single one you have) and copy its path.
Next, to set the JAVA_HOME environment variable, copy the path you got from the output above and trim the trailing /java component off of it in a text editor.
For my system I chose the third Java version, so I went into the .bashrc file and added these two lines at the bottom (notice that the JAVA_HOME path ends before the /bin directory, while the PATH entry ends after the /bin directory):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-i386
export PATH=$PATH:/usr/lib/jvm/java-11-openjdk-i386/bin/
So the bottom of the .bashrc file now contains the two lines above.
After reloading .bashrc, start-dfs.sh works without the full path to the script (this also works for the start-all.sh and stop-all.sh scripts).
Finally, the problem turned out to be that I have another Hadoop installation in /opt/module. When I call hdfs, for example, it refers to the one in /opt/module rather than the one in /usr/local.
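A quick way to see which installation the shell actually picks up from PATH (a sketch; the first entry listed for each command is the one that runs when you call it without a full path):
type -a hdfs
type -a start-dfs.sh
echo $PATH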

localhost: ERROR: Cannot set priority of datanode process 2984

I set up and configured a multi-node Hadoop cluster. The following appears when I start it.
My Ubuntu is 16.04 and Hadoop is 3.0.2
Starting namenodes on [master]
Starting datanodes
localhost: ERROR: Cannot set priority of datanode process 2984
Starting secondary namenodes [master]
master: ERROR: Cannot set priority of secondarynamenode process 3175
2018-07-17 02:19:39,470 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting resourcemanager
Starting nodemanagers
Can anyone tell me which step went wrong?
I had the same error and fixed it by ensuring that the datanode and namenode locations have the right permissions and are owned by the user starting hadoop daemons.
Check that:
The directory path properties in hdfs-site.xml under $HADOOP_CONF_DIR point to valid locations:
dfs.namenode.name.dir
dfs.datanode.data.dir
dfs.namenode.checkpoint.dir
The Hadoop user must have write permission for these paths.
If write permission is missing for the paths above, the processes may fail to start and produce the error you see. You can verify and fix this with commands like the ones below.
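A minimal sketch, assuming the daemons run as user hadoop and that the /data/hdfs/... paths are the ones configured in your hdfs-site.xml (both are assumptions, so substitute your own values):
# Print the locations the cluster is actually configured to use
hdfs getconf -confKey dfs.namenode.name.dir
hdfs getconf -confKey dfs.datanode.data.dir
# Give the hadoop user ownership of and write access to those locations (paths illustrative)
sudo chown -R hadoop:hadoop /data/hdfs/namenode /data/hdfs/datanode
sudo chmod -R 750 /data/hdfs/namenode /data/hdfs/datanode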
I had the same error and tried the above method, but it didn't work.
I set XXX_USER in all the xxx-env.sh files, and got the same result.
Finally I set HADOOP_SHELL_EXECNAME="root" in ${HADOOP_HOME}/bin/hdfs, and the error disappeared.
The default value of HADOOP_SHELL_EXECNAME is "hdfs".
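For reference, the "XXX_USER" variables mentioned above are the per-daemon user settings of Hadoop 3.x, typically exported in etc/hadoop/hadoop-env.sh; the user name below is only illustrative:
export HDFS_NAMENODE_USER=hadoop
export HDFS_DATANODE_USER=hadoop
export HDFS_SECONDARYNAMENODE_USER=hadoop
export YARN_RESOURCEMANAGER_USER=hadoop
export YARN_NODEMANAGER_USER=hadoop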
I had the same error when I renamed my Ubuntu home directory, and had to edit core-site.xml, changing the value of the property hadoop.tmp.dir to the new path.
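If you are unsure what hadoop.tmp.dir currently resolves to, a quick check (assuming the hdfs command is on your PATH):
hdfs getconf -confKey hadoop.tmp.dir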
Just add the native library directory to java.library.path in your HADOOP_OPTS, like this:
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
I had the same issue. You just need to check the hadoop/logs directory and look for the .log file for the datanode; type more nameofthefile.log and check for errors. Mine was a problem in the configuration; I fixed it and it worked.
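A sketch of that check, assuming the default log location under $HADOOP_HOME (the actual file name includes your user and host names, so the name below is a placeholder):
cd $HADOOP_HOME/logs
ls -lt *datanode*.log
more hadoop-youruser-datanode-yourhost.log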

How can I see the underlying Hadoop file system from Spark

I have started Spark like this:
spark-shell --master local[10]
I'm trying to see the files on the underlying Hadoop installation.
I want to do something like this:
hdfs ls
How can I do it?
You can execute any underlying system/OS commands (like hdfs dfs -ls, or even pure shell/DOS commands) from Scala (which comes by default with Spark) just by importing classes from the sys.process package.
See below for examples.
Linux
import sys.process._
val oldcksum = "cksum oldfile.txt" !!
val newcksum = "cksum newfile.txt" !!
val hdpFiles = "hdfs dfs -ls" !!
Windows
import sys.process._ // This lets underlying OS commands be executed
val oldhash = "certUtil -hashFile PATH_TO_FILE" !! // certUtil is a Windows command
If you plan to read and write from/to HDFS in Spark, you need to first integrate Spark and Hadoop:
http://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
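Per that page, the usual way is to point Spark at your Hadoop configuration by setting HADOOP_CONF_DIR in conf/spark-env.sh (the path below is illustrative):
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop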
If I understand your question correctly, you want to execute HDFS commands from the shell. In my opinion, running a Spark job may not help.
You need to start your HDFS instance first. Below are the commands from the documentation. Once HDFS is started, you can run the shell commands.
To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.
The first time you bring up HDFS, it must be formatted. Format a new distributed filesystem as hdfs:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format
Start the HDFS NameNode with the following command on the designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
Start a HDFS DataNode with the following command on each designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
If etc/hadoop/slaves and ssh trusted access is configured (see Single Node Setup), all of the HDFS processes can be started with a utility script. As hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh
Start the YARN with the following command, run on the designated ResourceManager as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
Run a script to start a NodeManager on each designated host as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager
Start a standalone WebAppProxy server. Run on the WebAppProxy server as yarn. If multiple servers are used with load balancing it should be run on each of them:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver
If etc/hadoop/slaves and ssh trusted access is configured (see Single Node Setup), all of the YARN processes can be started with a utility script. As yarn:
[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh
Start the MapReduce JobHistory Server with the following command, run on the designated server as mapred:
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
The second option is the programmatic way. You can use the FileSystem class from Hadoop (it is a Java implementation) and perform the HDFS operations.
Below is the link for javadoc.
https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html
You can see the underlying file system of HDFS using these commands in the spark-shell:
import scala.sys.process._
val lsOutput = Seq("hdfs","dfs","-ls","/path/to/folder").!!

Backup hdfs directory from full-distributed to a local directory?

I'm trying to back up a directory from HDFS to a local directory. I have a Hadoop/HBase cluster running on EC2. I managed to do what I want running pseudo-distributed on my local machine, but now that I'm fully distributed, the same steps are failing. Here is what worked for pseudo-distributed:
hadoop distcp hdfs://localhost:8020/hbase file:///Users/robocode/Desktop/
Here is what I'm trying on the hadoop namenode (hbase master) on ec2
ec2-user@ip-10-35-53-16:~$ hadoop distcp hdfs://10.35.53.16:8020/hbase file:///~/hbase
The errors I'm getting are below
13/04/19 09:07:40 INFO tools.DistCp: srcPaths=[hdfs://10.35.53.16:8020/hbase]
13/04/19 09:07:40 INFO tools.DistCp: destPath=file:/~/hbase
13/04/19 09:07:41 INFO tools.DistCp: file:/~/hbase does not exist.
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Failed to createfile:/~/hbase
at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1171)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
You can't use the ~ character in Java to represent the current home directory, so change to a fully qualified path, e.g.:
file:///home/user1/hbase
But I think you're going to run into problems in a fully distributed environment, as the distcp command runs a MapReduce job, so the destination path will be interpreted as local to each cluster node.
If you want to pull data down from HDFS to a local directory, you'll need to use the -get or -copyToLocal switches of the hadoop fs command, as in the sketch below.
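For example (run on the machine where you want the local copy; the destination path is illustrative):
hadoop fs -get /hbase /home/ec2-user/hbase-backup
# or, equivalently
hadoop fs -copyToLocal /hbase /home/ec2-user/hbase-backup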
