How to tell if I am about to run a Hadoop streaming job on a cluster or in "local" mode?

Hadoop streaming will run the process in "local" mode when there is no Hadoop instance running on the box. I have a shell script that controls a set of Hadoop streaming jobs in sequence, and I need to make the step that copies files from HDFS to the local filesystem conditional on whether the jobs ran locally or not. Is there a standard way to perform this test? I could do a "ps aux | grep something", but that seems ad hoc.

Hadoop streaming will run the process in "local" mode when there is no Hadoop instance running on the box.
Can you please point to the reference for this?
A regular or a streaming job will run the way it is configured, so we know ahead of time in which mode a job will run. Check the documentation for configuring Hadoop on a single node and on a cluster in the different modes.

Rather than trying to detect at run time which mode the process is operating in, it is probably better to wrap the tool you are developing in a bash script that explicitly selects local vs. cluster operation. The O'Reilly Hadoop book describes how to explicitly choose local mode using a configuration file override:
hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp
where conf/hadoop-local.xml is an XML file configured for local operation.
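For instance, a minimal wrapper might look like the following sketch. The jar name, driver class, and paths are placeholders, and it assumes the driver uses ToolRunner so that the -conf override is actually parsed:

#!/bin/sh
# Hypothetical wrapper: decide local vs. cluster explicitly up front.
# myjob.jar, MyDriver, and the input/output paths are illustrative.
MODE="${1:-cluster}"
if [ "$MODE" = "local" ]; then
    hadoop jar myjob.jar MyDriver -conf conf/hadoop-local.xml input/ output/
else
    hadoop jar myjob.jar MyDriver input/ output/
fi

Because the script itself decides the mode, the later HDFS-to-local copy step can branch on the same variable instead of probing a running process.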

I haven't tried this yet, but I think you can just read the mapred.job.tracker configuration setting.
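As a sketch of how that check might look from a shell script: this is untested and assumes the usual layout of mapred-site.xml, where the name and value elements sit on adjacent lines. In MRv1, a job tracker value of "local" (or no value at all) means the local job runner is used:

# Pull the mapred.job.tracker value out of the active configuration file.
TRACKER=$(grep -A1 'mapred.job.tracker' "$HADOOP_HOME/conf/mapred-site.xml" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
if [ -z "$TRACKER" ] || [ "$TRACKER" = "local" ]; then
    echo "local mode"
else
    echo "cluster mode ($TRACKER)"
fi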

Related

Can Luigi run remote Hadoop jobs?

If one of the tasks in the Luigi graph needs to run on a remote Hadoop cluster, is that possible? The machine on which Luigi runs is different from the Hadoop cluster. Can Luigi still check whether an HDFS file on the remote cluster exists?
I tried to find documentation for this but wasn't able to.
You can run a job that launches any script.
The HDFS target documentation is here:
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.html
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.target.html

Run Pig in local mode from Oozie

I want to run Pig in local mode, which is very easy:
pig -x local file.pig
My requirement is to run Pig in local mode from Oozie.
Is it possible, given that I think Oozie will automatically launch a map task first?
It's possible. When a Pig script is run by Oozie, it runs as a one-map MapReduce job whose single task runs the Pig script, which in turn launches other MapReduce jobs (when Pig is run in mapred mode).
It seems that the Pig action configuration doesn't allow running in local mode, but you can still run a Pig script in local mode using the shell action type. You only have to make sure that your script, input, and output data are in HDFS.
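For example, the script launched by the shell action might look like this sketch. The HDFS paths are placeholders, and it assumes the pig and hadoop clients are available on the worker node that executes the action:

#!/bin/sh
# Pull the script and input data down from HDFS, run Pig locally on this
# node, then push the result back. All /user/me/... paths are illustrative.
hadoop fs -get /user/me/scripts/file.pig .
hadoop fs -get /user/me/input input
pig -x local file.pig
hadoop fs -put output /user/me/output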
I don't think we can run Pig in local mode from Oozie. The comment Vishal wrote makes sense. In some cases, where there is a smaller amount of data, it's better to run Pig in local mode. To run in local mode, you can write a shell script and schedule it in crontab. Trying this through Oozie won't work well, to my knowledge, because Oozie is meant to run against HDFS.
If you want Oozie to run on some data, it expects that data to be in HDFS (i.e., distributed), and you must have the Pig script in HDFS as well. I remember seeing a post from Alan Gates where he mentioned that Pig is designed to process data from/to HDFS, and Hive is for local-to-HDFS or HDFS-to-HDFS.

Configuring pig relation with Hadoop

I'm having troubles understanding the relation between Hadoop and Pig.
I understand Pig's purpose is to hide the MapReduce pattern behind a scripting language, Pig Latin.
What I don't understand is how Hadoop and Pig are linked. So far, the installation procedures I have found seem to assume that Pig runs on the same machine as the main Hadoop node.
Indeed, it uses the Hadoop configuration files.
Is this because Pig only translates the scripts into MapReduce code and sends them to Hadoop?
If that's the case, how could I configure Pig to make it send the scripts to a remote server?
If not, does it mean we always need to have Hadoop running on the machine where Pig runs?
Pig can run in two modes:
Local mode. In this mode the Hadoop cluster is not used at all. All processes run in a single JVM and files are read from the local filesystem. To run Pig in local mode, use the command:
pig -x local
MapReduce mode. In this mode Pig converts scripts to MapReduce jobs and runs them on a Hadoop cluster. It is the default mode.
The cluster can be local or remote. Pig uses the HADOOP_MAPRED_HOME environment variable to find the Hadoop installation on the local machine (see Installing Pig).
If you want to connect to a remote cluster, you should specify the cluster parameters in the pig.properties file. An example for MRv1:
fs.default.name=hdfs://namenode_address:8020/
mapred.job.tracker=jobtracker_address:8021
You can also specify the remote cluster address at the command line:
pig -fs namenode_address:8020 -jt jobtracker_address:8021
Hence, you can install Pig on any machine and connect to a remote cluster. Pig includes the Hadoop client, so you don't have to install Hadoop to use Pig.

Differences between hadoop jar and yarn jar

What's the difference between running a jar file with the commands "hadoop jar" and "yarn jar"?
I've used the "hadoop jar" command on my Mac successfully, but I want to be sure that the execution is correct and runs in parallel across my four cores.
Thanks!!!
Short Answer
They are probably identical for you, but even if they aren't, they should both utilize your cluster to the best of its ability.
Longer Answer
The /usr/bin/yarn script sets up the execution environment so that all of the yarn commands can be run. The /usr/bin/hadoop script isn't quite as concerned with yarn-specific functionality. However, if your cluster is set up to use YARN as the default implementation of MapReduce (MRv2), then hadoop jar will probably act the same as yarn jar for a MapReduce job.
Either way you're probably fine, but you can always check the resource manager (or job tracker) web interface to see how your job is distributed across the cluster (whether it's a single-node cluster or not).
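For a quick comparison, the two invocations look like this; wordcount.jar and WordCount are placeholder names, not from the question:

# On a YARN (MRv2) setup these two should behave identically.
hadoop jar wordcount.jar WordCount /input /output
yarn jar wordcount.jar WordCount /input /output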

How to sync Hadoop configuration files to multiple nodes?

I used to manage a cluster of only 3 CentOS machines running Hadoop, so scp was enough for me to copy the configuration files to the other 2 machines.
However, I now have to set up a Hadoop cluster of more than 10 machines. It is really frustrating to sync the files so many times using scp.
I want to find a tool with which I can easily sync the files to all machines, where the machine names are defined in a config file, such as:
node1
node2
...
node10
Thanks.
If you do not want to use ZooKeeper, you can modify your hadoop script in $HADOOP_HOME/bin/hadoop and add something like:
if [ "$COMMAND" == "deployConf" ]; then
for HOST in `cat $HADOOP_HOME/conf/slaves`
do
scp $HADOOP_HOME/conf/mapred-site.xml $HOST:$HADOOP_HOME/conf
scp $HADOOP_HOME/conf/core-site.xml $HOST:$HADOOP_HOME/conf
scp $HADOOP_HOME/conf/hdfs-site.xml $HOST:$HADOOP_HOME/conf
done
exit 0
fi
That's what I'm using now and it does the job.
Use ZooKeeper with Hadoop.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Reference: http://wiki.apache.org/hadoop/ZooKeeper
You have several options to do that. One way is to use tools like rsync. The Hadoop control scripts can distribute configuration files to all nodes of the cluster using rsync. Alternatively, you can make use of tools like Cloudera Manager or Ambari if you need a more sophisticated way to achieve that.
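For the rsync route, a minimal sketch, assuming passwordless SSH, an identical $HADOOP_HOME layout on every node, and a nodes.txt file holding the machine names as in the question:

#!/bin/sh
# Push the conf directory to every node; rsync only transfers changed files.
for HOST in $(cat nodes.txt); do
    rsync -av "$HADOOP_HOME/conf/" "$HOST:$HADOOP_HOME/conf/"
done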
If you use InfoSphere BigInsights, then there is the script syncconf.sh.
