How to sync Hadoop configuration files to multiple nodes? - hadoop

I used to manage a cluster of only 3 CentOS machines running Hadoop, so scp was enough to copy the configuration files to the other 2 machines.
However, I now have to set up a Hadoop cluster of more than 10 machines. It is really frustrating to sync the files so many times using scp.
I want to find a tool that lets me easily sync the files to all machines, with the machine names defined in a config file, such as:
node1
node2
...
node10
Thanks.

If you do not want to use ZooKeeper, you can modify your hadoop script in $HADOOP_HOME/bin/hadoop and add something like:
if [ "$COMMAND" = "deployConf" ]; then
  # Copy the three site files to every host listed in conf/slaves.
  # Assumes $HADOOP_HOME is the same path on every node.
  for HOST in $(cat "$HADOOP_HOME/conf/slaves"); do
    for FILE in mapred-site.xml core-site.xml hdfs-site.xml; do
      scp "$HADOOP_HOME/conf/$FILE" "$HOST:$HADOOP_HOME/conf/"
    done
  done
  exit 0
fi
That's what I'm using now and it does the job.

Use Zookeeper with Hadoop.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Reference: http://wiki.apache.org/hadoop/ZooKeeper

You have several options. One is to use a tool like rsync: the Hadoop control scripts can distribute configuration files to all nodes of the cluster using rsync. Alternatively, if you need something more sophisticated, you can use a management tool such as Cloudera Manager or Ambari.
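As a rough illustration of the rsync approach, the sketch below loops over a host list like the one in the question and pushes the configuration directory to each node. The function name sync_conf, the DRY_RUN switch, and the example paths are assumptions made for this sketch, not part of Hadoop's own control scripts. It also assumes passwordless SSH and that the configuration directory sits at the same path on every node.

```shell
# Sketch: push a local conf directory to every host listed in a file.
# sync_conf and DRY_RUN are hypothetical names used only for this example.
sync_conf() {
  hosts_file="$1"   # one hostname per line, e.g. node1 ... node10
  conf_dir="$2"     # directory to mirror, same path assumed on every node
  while IFS= read -r host || [ -n "$host" ]; do
    # Skip blank lines and comment lines in the host list.
    case "$host" in ''|'#'*) continue ;; esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only print what would be run.
      echo "rsync -az ${conf_dir}/ ${host}:${conf_dir}/"
    else
      # </dev/null keeps ssh from swallowing the rest of the host list.
      rsync -az "${conf_dir}/" "${host}:${conf_dir}/" </dev/null
    fi
  done < "$hosts_file"
}
```

A call such as sync_conf "$HADOOP_HOME/conf/slaves" "$HADOOP_HOME/conf" would reuse the slaves file as the host list. Setting DRY_RUN=1 first prints the rsync commands without running them.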

If you use InfoSphere BigInsights, there is a script for this called syncconf.sh.

Related

Can Luigi run remote Hadoop jobs?

If one of the tasks in the Luigi graph needs to run on a remote Hadoop cluster, is that possible? The machine on which Luigi runs is different from the Hadoop cluster. Can Luigi still check whether an HDFS file in the remote cluster exists?
I tried to find documentation for this but wasn't able to.
You can run a job that launches any script.
The HDFS target documentation is here:
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.html
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.target.html

What are best practices to run command on all nodes in a HDP cluster?

Often in a Hadoop environment, you need to run a command or a script on, or copy a file to, all nodes in the cluster.
What are efficient ways of doing that (without having to ssh to each node separately)?
Example:
When upgrading Ambari, you are required to run many commands on all nodes where a certain component is installed - e.g. Infra, SmartSense, etc.
I use Ansible to do that. It will do the job for you.
You can also use Puppet, Salt, or Chef.
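For a sense of what the Ansible route looks like, here is a minimal sketch that wraps an ad hoc ansible call using the shell module. The wrapper name run_on_all_nodes, the DRY_RUN switch, and the inventory file name are assumptions for illustration. The inventory is expected to list the cluster hosts, and Ansible must be installed on the machine you run this from.

```shell
# Sketch: run one command on every host in an Ansible inventory.
# run_on_all_nodes and DRY_RUN are hypothetical names for this example.
run_on_all_nodes() {
  inventory="$1"   # path to an Ansible inventory file listing the nodes
  shift            # the remaining arguments form the remote command
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the ad hoc command that would be executed.
    echo "ansible all -i $inventory -m shell -a '$*'"
  else
    ansible all -i "$inventory" -m shell -a "$*"
  fi
}
```

For example, run_on_all_nodes hosts.ini uname -r would report the kernel version of every node. Replacing the pattern all with a group name from the inventory would limit the run to the nodes where a given component is installed.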

Retrieve files from remote HDFS

My local machine does not have an HDFS installation. I want to retrieve files from a remote HDFS cluster. What's the best way to achieve this? Do I need to get the files from HDFS onto the filesystem of one of the cluster machines and then use ssh to retrieve them? I want to be able to do this programmatically, say through a bash script.
Here are the steps:
Make sure there is connectivity between your host and the target cluster.
Configure your host as a client: you need to install compatible Hadoop binaries, ideally on the same operating system as the cluster.
Make sure you have the same configuration files (core-site.xml, hdfs-site.xml).
Run the hadoop fs -get command to get the files directly.
There are also alternatives:
If WebHDFS/HttpFS is configured, you can download files using curl or even your browser, and you can write bash scripts around it.
If your host cannot have Hadoop binaries installed to act as a client, you can use the following instructions:
Enable passwordless login from your host to one of the nodes in the cluster.
Run the command ssh <user>@<host> "hadoop fs -get <hdfs_path> <os_path>"
Then use the scp command to copy the files to your host.
You can put the above two commands in one script.
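The two alternatives above can be sketched as small shell functions. The function names, the temporary path under /tmp, and the example host and port are assumptions for illustration. The WebHDFS variant relies on the REST path /webhdfs/v1/<path>?op=OPEN and on the NameNode HTTP address being reachable from your host. A secured cluster would need authentication that is not shown here.

```shell
# Sketch of both alternatives; all function names here are hypothetical.

# Build the WebHDFS "OPEN" URL for a file.
# $1 = NameNode HTTP address as host:port, $2 = absolute HDFS path
webhdfs_url() {
  printf 'http://%s/webhdfs/v1%s?op=OPEN\n' "$1" "$2"
}

# Download a file over WebHDFS.
# $1 = host:port, $2 = HDFS path, $3 = local destination file
fetch_via_webhdfs() {
  # -L follows the redirect from the NameNode to a DataNode,
  # -f makes curl fail on HTTP errors instead of saving an error page.
  curl -sS -L -f -o "$3" "$(webhdfs_url "$1" "$2")"
}

# Download through a cluster node using ssh and scp.
# $1 = user@host, $2 = HDFS path, $3 = local destination
fetch_via_ssh() {
  # Assumed scratch location on the cluster node; paths must not contain quotes.
  remote_tmp="/tmp/hdfs_fetch.$$"
  ssh "$1" "hadoop fs -get '$2' '$remote_tmp'" &&
    scp -r "$1:$remote_tmp" "$3" &&
    ssh "$1" "rm -rf '$remote_tmp'"
}
```

For example, fetch_via_ssh alice@node1 /data/report.csv ./report.csv copies the file to the node's local disk, pulls it over scp, and then removes the temporary copy.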

Hadoop cluster configuration with Ubuntu Master and Windows slave

Hi I am new to Hadoop.
Hadoop Version (2.2.0)
Goals:
Set up Hadoop standalone - Ubuntu 12 (Completed)
Set up Hadoop standalone - Windows 7 (cygwin being used only for sshd) (Completed)
Set up a cluster with an Ubuntu master and a Windows 7 slave (this is mostly for learning purposes and for setting up a development environment) (Stuck)
Setup relevant to the questions below:
Master running on Ubuntu with Hadoop 2.2.0
Slaves running on Windows 7 with a self-compiled build from the Hadoop 2.2.0 source. I am using cygwin only for sshd.
Passwordless login is set up and I am able to log in both ways using ssh from outside Hadoop. Since my Ubuntu and Windows machines have different usernames, I have set up a config file in the .ssh folder which maps hosts to users.
Questions:
In a cluster, does the username on the master need to be the same as on the slave? I am asking because, after configuring the cluster, when I run start-dfs.sh the logs say that it was able to ssh into the slave nodes but could not find "/home/xxx/hadoop/bin/hadoop-daemon.sh" on the slave. The "xxx" is my master username, not the slave one. Also, since my slave is a pure Windows version, the install is under C:/hadoop/... Does the master look at the environment variable $HADOOP_HOME to find where the install is on the slave? Are there any other environment variables that I need to set?
My goal was to use the Windows Hadoop build on the slave, since Hadoop officially supports Windows now. But is it better to run the Linux build under cygwin to accomplish this? I ask because I see that start-dfs.sh is trying to execute hadoop-daemon.sh and not some *.cmd.
If this setup works out in the future, a possible question is whether Pig, Mahout, etc. will run in this kind of setup, as I have not seen a build of Pig or Mahout for Windows. Do these components need to be present only on the master node, or do they need to be on the slave nodes too? I saw 2 ways of running Mahout when experimenting with standalone mode: first, using the mahout script, which I was able to use on Linux, and second, using the yarn jar command, where I passed in the Mahout jar while using the Windows version. If Mahout or Pig (when using the provided sh script) assumes that the slaves already have the jars in place, then the Ubuntu + Windows combination does not seem to work. Please advise.
As I mentioned, this is more of an experiment than an implementation plan. Our final environment will be completely on Linux. Thank you for your suggestions.
You may have more success with more standard ways of deploying Hadoop. Try using Ubuntu VMs for the master and slaves.
You can also try a pseudo-distributed deployment, in which all of the processes run on a single VM, and thus avoid the need to even consider multiple operating systems.
I have only worked with the same username on all nodes. In general, ssh allows you to log in with a different login name using the -l option, but this might get tricky. You have to list your slaves in the slaves file.
At least in the manual at https://hadoop.apache.org/docs/r0.19.1/cluster_setup.html#Slaves I did not find any way to add usernames. It might be worth trying to add -l login_name to the slave node entry in the slaves file to see if it works.
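The question mentions mapping hosts to users through a config file in the .ssh folder. Since the Hadoop start scripts call plain ssh, that file is one place where a per-host login name can be set without touching the slaves file. The sketch below generates such an entry using the standard Host and User keywords. The function name ssh_host_entry and the example host and user are hypothetical, and this only addresses the login name, not the differing install path on the slave.

```shell
# Sketch: emit an ssh client config entry that maps a host to a login name.
# ssh_host_entry is a hypothetical helper name used only for this example.
ssh_host_entry() {
  # $1 = host name as written in the slaves file, $2 = login name on that host
  printf 'Host %s\n    User %s\n' "$1" "$2"
}
```

Running ssh_host_entry winslave winuser >> ~/.ssh/config on the master would append the entry, after which ssh winslave logs in as winuser without an explicit -l option.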

How to tell if I am about to run Hadoop streaming job on a cluster or in "local" mode?

Hadoop streaming will run the process in "local" mode when there is no Hadoop instance running on the box. I have a shell script that controls a set of Hadoop streaming jobs in sequence, and I need to copy files from HDFS to local disk only when the jobs have run on the cluster rather than locally. Is there a standard way to perform this test? I could do a "ps aux | grep something" but that seems ad hoc.
"Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box."
Can you please point to a reference for this?
A regular or a streaming job will run the way it is configured, so we know ahead of time in which mode a job is run. Check the documentation for configuring Hadoop on a single node and on a cluster in the different modes.
Rather than trying to detect at run time which mode the process is operating in, it is probably better to wrap the tool you are developing in a bash script that explicitly selects local or cluster operation. The O'Reilly Hadoop book describes how to explicitly choose local mode using a configuration file override:
hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp
where conf/hadoop-local.xml is an XML file configured for local operation.
I haven't tried this yet, but I think you can just read out the mapred.job.tracker configuration setting.
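One way to read that setting from a shell script is to pull it out of the site configuration file directly. The sketch below is untested against a real cluster and rests on several assumptions: the function names are hypothetical, the file uses the usual <property><name>...</name><value>...</value></property> layout with each tag on one line, and a missing mapred.job.tracker entry is treated as local mode, which matches the Hadoop 1.x default value of "local".

```shell
# Sketch: read a property from a Hadoop XML config file and test for local mode.
# hadoop_conf_value and is_local_mode are hypothetical helper names.

# $1 = config file (e.g. mapred-site.xml), $2 = property name
hadoop_conf_value() {
  awk -v key="$2" '
    # Remember that the wanted <name> element has been seen.
    index($0, "<name>" key "</name>") { found = 1 }
    # Print the first <value> that follows it and stop.
    found && match($0, /<value>[^<]*<\/value>/) {
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Succeeds when mapred.job.tracker is "local" or absent.
# $1 = config file
is_local_mode() {
  value=$(hadoop_conf_value "$1" mapred.job.tracker)
  [ -z "$value" ] || [ "$value" = "local" ]
}
```

A controlling script could then branch with something like: if is_local_mode "$HADOOP_HOME/conf/mapred-site.xml"; then skip the copy from HDFS; fi. Overrides passed on the command line with -conf or -D are not visible to this check.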

Resources