CDH4.4: Restarting HDFS and MapReduce from shell - hadoop

I'm trying to automate stopping, formatting and starting HDFS and MapReduce services on a Cloudera Hadoop 4.4 cluster, using a bash script.
It's easy to kill HDFS and MapReduce processes using "pkill -U hdfs && pkill -U mapred", but how can I start those processes again, without using the Cloudera Manager GUI?

Well, apparently Cloudera Manager (CM) has a pretty sweet REST API that can start and stop services without the GUI.
Check it out here:
http://cloudera.github.io/cm_api/
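
As a rough sketch of what a call to that API could look like from a bash script: the functions below build the service command URL and POST to it with curl. The host, port, credentials, cluster name and service names are all placeholders (assumptions, not values from the question), and the `v1` path segment may need to be raised to whatever API version your CM release supports. Formatting HDFS is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: start or stop a service through the Cloudera Manager REST API.
# CM_HOST, CM_PORT, CM_USER and CM_PASS are placeholder defaults (assumptions).
CM_HOST="${CM_HOST:-cm-host.example.com}"
CM_PORT="${CM_PORT:-7180}"
CM_USER="${CM_USER:-admin}"
CM_PASS="${CM_PASS:-admin}"

# Build the URL /api/v1/clusters/<cluster>/services/<service>/commands/<action>.
# Names containing spaces must already be URL-encoded by the caller.
cm_command_url() {
  local cluster="$1" service="$2" action="$3"
  printf 'http://%s:%s/api/v1/clusters/%s/services/%s/commands/%s\n' \
    "$CM_HOST" "$CM_PORT" "$cluster" "$service" "$action"
}

# POST the command. Set DRY_RUN=1 to print the request instead of sending it.
cm_service_command() {
  local url
  url="$(cm_command_url "$@")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST $url"
  else
    curl -sS -X POST -u "$CM_USER:$CM_PASS" "$url"
  fi
}

# Example (cluster and service names are hypothetical):
#   cm_service_command "Cluster%201" mapreduce1 stop
#   cm_service_command "Cluster%201" hdfs1 stop
#   cm_service_command "Cluster%201" hdfs1 start
#   cm_service_command "Cluster%201" mapreduce1 start
```

Going through CM instead of pkill keeps CM's view of the service state consistent with what is actually running.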

Related

Cloudera quickstart CDH 5.15 cluster is RUNNING slow

My Cloudera quickstart CDH 5.15 cluster is very slow.
When I run a simple Hadoop command like "hadoop fs -ls", it takes almost 20 seconds,
but local commands like "ls" are very fast. Please help me with this.
The quickstart VM requires 6-8 GB of RAM to work reliably.
That said, JVM startup makes any hadoop command much slower than built-in shell commands that do similar work. There's no way around that.
If you want the Hadoop ls command to be quicker, set up an actual distributed cluster with adequate memory for the NameNode process, which is what ls contacts.

How to start Datanode? (Cannot find start-dfs.sh script)

We are setting up automated deployments on a headless system: so using the GUI is not an option here.
Where is the start-dfs.sh script for HDFS in Hortonworks Data Platform (HDP)? CDH (Cloudera) packages those files under the hadoop/sbin directory. However, when we search for those scripts under HDP, they are not found:
$ pwd
/usr/hdp/current
Which scripts exist in HDP?
[stack@s1-639016 current]$ find -L . -name \*.sh
./hadoop-hdfs-client/sbin/refresh-namenodes.sh
./hadoop-hdfs-client/sbin/distribute-exclude.sh
./hadoop-hdfs-datanode/sbin/refresh-namenodes.sh
./hadoop-hdfs-datanode/sbin/distribute-exclude.sh
./hadoop-hdfs-nfs3/sbin/refresh-namenodes.sh
./hadoop-hdfs-nfs3/sbin/distribute-exclude.sh
./hadoop-hdfs-secondarynamenode/sbin/refresh-namenodes.sh
./hadoop-hdfs-secondarynamenode/sbin/distribute-exclude.sh
./hadoop-hdfs-namenode/sbin/refresh-namenodes.sh
./hadoop-hdfs-namenode/sbin/distribute-exclude.sh
./hadoop-hdfs-journalnode/sbin/refresh-namenodes.sh
./hadoop-hdfs-journalnode/sbin/distribute-exclude.sh
./hadoop-hdfs-portmap/sbin/refresh-namenodes.sh
./hadoop-hdfs-portmap/sbin/distribute-exclude.sh
./hadoop-client/sbin/hadoop-daemon.sh
./hadoop-client/sbin/slaves.sh
./hadoop-client/sbin/hadoop-daemons.sh
./hadoop-client/etc/hadoop/hadoop-env.sh
./hadoop-client/etc/hadoop/kms-env.sh
./hadoop-client/etc/hadoop/mapred-env.sh
./hadoop-client/conf/hadoop-env.sh
./hadoop-client/conf/kms-env.sh
./hadoop-client/conf/mapred-env.sh
./hadoop-client/libexec/kms-config.sh
./hadoop-client/libexec/init-hdfs.sh
./hadoop-client/libexec/hadoop-layout.sh
./hadoop-client/libexec/hadoop-config.sh
./hadoop-client/libexec/hdfs-config.sh
./zookeeper-client/conf/zookeeper-env.sh
./zookeeper-client/bin/zkCli.sh
./zookeeper-client/bin/zkCleanup.sh
./zookeeper-client/bin/zkServer-initialize.sh
./zookeeper-client/bin/zkEnv.sh
./zookeeper-client/bin/zkServer.sh
Notice: there are ZERO start/stop shell scripts.
In particular, I am interested in the start-dfs.sh script that starts the namenode(s), journalnode, and datanodes.
How to start a DataNode:
su - hdfs -c "/usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode";
Github - Hortonworks Start Scripts
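
A minimal sketch of wrapping that command for a headless deployment: it looks for hadoop-daemon.sh in the two locations mentioned in this thread (the /usr/lib/hadoop path from the command above and the hadoop-client/sbin path from the find output) and runs it as the hdfs user. The function names are made up for illustration, and the config directory /etc/hadoop/conf is taken from the command above; adjust both paths to your install.

```shell
#!/usr/bin/env bash
# Print the first existing, executable script among the candidate paths.
find_daemon_script() {
  local candidate
  for candidate in "$@"; do
    if [ -x "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Start or stop one HDFS daemon (namenode, journalnode, datanode, ...) as hdfs.
hdfs_daemon() {
  local action="$1" role="$2" script
  script="$(find_daemon_script \
    /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh \
    /usr/lib/hadoop/bin/hadoop-daemon.sh)" || {
    echo "hadoop-daemon.sh not found" >&2
    return 1
  }
  su - hdfs -c "$script --config /etc/hadoop/conf $action $role"
}

# Usage (run as root on the node that hosts the daemon):
#   hdfs_daemon start datanode
#   hdfs_daemon stop datanode
```

Unlike start-dfs.sh, this only acts on the local node, so the deployment tool has to call it on each host.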
Update
Decided to hunt for it myself.
Spun up a single node with Ambari, installed HDP 2.2 (a), HDP 2.3 (b)
sudo find / -name \*.sh | grep start
Found
(a) /usr/hdp/2.2.8.0-3150/hadoop/src/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/start-dfs.sh
Weird that it doesn't exist in /usr/hdp/current, which should be symlinked.
(b) /hadoop/yarn/local/filecache/10/mapreduce.tar.gz/hadoop/sbin/start-dfs.sh
The recommended way to administer your Hadoop cluster is via the administrator panel. Since you are working with the Hortonworks distribution, it makes more sense for you to use Ambari instead.

How to kill a mapred job started by hive?

I'm working with CDH 5.1 now. It runs normal Hadoop jobs on YARN, but Hive still works with mapred. Sometimes a big query hangs for a long time and I want to kill it.
I can find this big job in the JobTracker web console, but the console doesn't provide a button to kill it.
Another way is to kill it from the command line. However, I couldn't find any running job from the command line.
I have tried 2 commands:
yarn application -list
mapred job -list
How can I kill a big query like this?
You can get the Job ID from Hive CLI when you run a job or from the Web UI. You can also list the job IDs using the application ID from resource manager. Ideally, you should get everything from
mapred job -list
or
hadoop job -list
Using the Job ID you can kill it by using the below command.
hadoop job -kill <job_id>
Another alternative would be to kill the application using
yarn application -kill <application_id>
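
To tie the two commands together in a script, here is a small sketch that picks the kill command from the shape of the ID. It assumes the usual ID formats (job_<timestamp>_<n> and application_<timestamp>_<n>) and, for jobs that run on YARN, that the two IDs differ only in their prefix. The helper names are made up for illustration. The functions only print the command so you can review it before running it.

```shell
#!/usr/bin/env bash
# For MapReduce jobs running on YARN, the job ID and the application ID share
# the same numeric suffix: job_1400000000000_0042 <-> application_1400000000000_0042
job_to_app_id() {
  echo "$1" | sed 's/^job_/application_/'
}

# Print the kill command that matches the kind of ID given.
kill_cmd_for() {
  case "$1" in
    job_*)         echo "mapred job -kill $1" ;;
    application_*) echo "yarn application -kill $1" ;;
    *)             echo "unrecognized id: $1" >&2; return 1 ;;
  esac
}

# Usage (the ID is a placeholder):
#   kill_cmd_for job_1400000000000_0042        # review the command
#   kill_cmd_for job_1400000000000_0042 | sh   # run it
```

If Hive submits to a JobTracker rather than to YARN, there is no application ID at all, so only the job_ form applies.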

Could not find and execute start-all.sh and Stop-all.sh on Cloudera VM for Hadoop

How do I start and stop services from the command line on CDH4? I am new to Hadoop and installed the VM from Cloudera. I could not find start-all.sh and stop-all.sh. How can I stop or start the TaskTracker or DataNode when I want to? It is a single-node cluster running on CentOS. I haven't made any modifications.
Moreover, I see that the directory structure differs across all flavours. I could not locate these .sh files on the VM for my installation.
[cloudera@localhost ~]$ stop-all.sh
bash: stop-all.sh: command not found
Highly appreciate your support.
Use "sudo su hdfs" to start; to stop, just type "exit", and it will stop all the services.
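
Another option worth checking, sketched below under an assumption: package-based CDH4 installs typically ship one init script per daemon under /etc/init.d, with names starting with "hadoop-" (for example hadoop-hdfs-datanode), which replace start-all.sh and stop-all.sh. Whether your VM has them depends on how it was set up, so list the directory first. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# List init scripts whose names start with "hadoop-" in the given directory.
list_hadoop_services() {
  local dir="${1:-/etc/init.d}" path
  for path in "$dir"/hadoop-*; do
    [ -e "$path" ] && basename "$path"
  done
  return 0
}

# Run "service <name> <action>" for every Hadoop init script found.
# Set DRY_RUN=1 to print the commands instead of running them.
hadoop_services() {
  local action="$1" dir="${2:-/etc/init.d}" svc
  for svc in $(list_hadoop_services "$dir"); do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "sudo service $svc $action"
    else
      sudo service "$svc" "$action"
    fi
  done
}

# Usage:
#   DRY_RUN=1 hadoop_services stop   # see what would be stopped
#   hadoop_services stop
#   hadoop_services start
```

To act on a single daemon, call the service command directly with one of the names that list_hadoop_services prints.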

How to tell if I am about to run Hadoop streaming job on a cluster or in "local" mode?

Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box. I have a shell script that is controlling a set of hadoop streaming jobs in sequence and I need to condition copying files from HDFS to local depending on whether the jobs have been running locally or not. Is there a standard way to accomplish this test? I could do a "ps aux | grep something" but that seems ad-hoc.
"Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box."
Can you please point to a reference for this?
A regular or a streaming job will run the way it is configured, so we know ahead of time in which mode a Job is run. Check the documentation for configuring Hadoop on a Single Node and Cluster in different modes.
Rather than trying to detect at run time which mode the process is operating in, it is probably better to wrap the tool you are developing in a bash script that explicitly selects local vs cluster operation. The O'Reilly Hadoop book describes how to explicitly choose local mode using a configuration file override:
hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp
where conf/hadoop-local.xml is an XML file configured for local operation.
I haven't tried this yet, but I think you can just read out the mapred.job.tracker configuration setting.
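
A sketch of that idea, assuming the setting lives in a mapred-site.xml with <name> and <value> each on a single line (the usual layout): read the value with awk and treat a missing or "local" value as local mode, since "local" is the Hadoop 1 default for mapred.job.tracker. The function names and the config path in the usage note are hypothetical. This only reads the site file, so a value overridden with -conf or -D on the command line is not seen.

```shell
#!/usr/bin/env bash
# Print the <value> of one property from a Hadoop *-site.xml file.
get_hadoop_property() {
  local file="$1" key="$2"
  awk -v key="$key" '
    index($0, "<name>" key "</name>") { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }' "$file"
}

# Succeed if mapred.job.tracker is unset or "local" in the given config file.
# Returns 2 if the file cannot be read.
is_local_mode() {
  local value
  value="$(get_hadoop_property "$1" mapred.job.tracker)" || return 2
  [ -z "$value" ] || [ "$value" = "local" ]
}

# Usage (the path is an assumption; use your own config directory):
#   if is_local_mode /etc/hadoop/conf/mapred-site.xml; then
#     echo "local mode: output is already on the local filesystem"
#   else
#     echo "cluster mode: copy results from HDFS"
#   fi
```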
