Starting Hadoop Services using Command Line (CDH 5) - hadoop

I know how to start services using Cloudera manager interface, but I prefer to know what is really happening behind the scene and not rely on "magic".
I read this page but it does not give the desired information
I know there are some .sh files to be used but they seem to vary from version to version, and I'm using the latest as of today (5.3).
I would be grateful to have a list of service starting commands (specifically HDFS)
PS : Looks like somehow Cloudera ditched the classic Apache scripts (start-dfs.sh etc.)

You can figure this out by installing Cloudera's optional service packages.
These use the service command to start services instead of Cloudera Manager.
hadoop-hdfs-namenode - for namenode
hadoop-hdfs-secondarynamenode - for secondary namenode
hadoop-hdfs-datanode - for datanode
hadoop-hdfs-journalnode - for journalnode
You can see the CDH5.9 RPMs here:
http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.9/RPMS/x86_64/
After you install them, you can look at the respective /etc/init.d/SERVICENAME
to understand how they are run (assuming you're comfortable looking at shell scripts).

Related

start-all.sh command not found

I have just installed Cloudera VM setup for hadoop. But when I open the command prompt and want to start all daemons for hadoop using command 'start-all.sh' , I get an error stating "bash : start-all.sh: command not found".
I have tried 'start-dfs.sh' too yet still gives the same error. When I use 'jps' command, I can see that none of the daemons have been started.
You can find start-all.sh and start-dfs.sh scripts in bin or sbin folders. You can use the following command to find that. Go to hadoop installation folder and run this command.
find . -name 'start-all.sh' # Finds files having name similar to start-all.sh
Then you can specify the path to start all the daemons using bash /path/to/start-all.sh
If you're using the QuickStart VM then the right way to start the cluster (as #cricket_007 hinted) is by restarting it in the Cloudera Manager UI. The start-all.sh scripts will not work since those only apply to the Hadoop servers (Name Node, Data Node, Resource Manager, Node Manager ...) but not all the services in the ecosystem (like Hive, Impala, Spark, Oozie, Hue ...).
You can refer to the YouTube video and the official documentation Starting, Stopping, Refreshing, and Restarting a Cluster

CDH 5.3.2 - Need to restart impala daemon from shell/script

I am using CDH 5.3.2 cluster and have a requirement to be able to start/stop impala daemons from a script. The command mentioned in Cloudera Docs
sudo service impala-server start
works fine on my CDH 5.10 local VM but on CDH 5.3.2 cluster I get an error "impala-server: unrecognized service". On checking in /etc/init.d I see that no such service is listed either (while its listed in 5.10 version)
Then i tried to restart the service directly from impala bin directory
cd /usr/bin
./impalad stop
However running into below error now:
E0918 11:55:27.815739 12046 JniFrontend.java:622] FileSystem is file:///
W0918 11:55:27.817589 12046 JniFrontend.java:534] Cannot detect CDH version. Skipping Hadoop configuration checks
E0918 11:55:27.817620 12046 impala-server.cc:210] Unsupported file system. Impala only supports DistributedFileSystem but the configured filesystem is: LocalFileSystem.fs.defaultFS(file:///) might be set incorrectly
E0918 11:55:27.817631 12046 impala-server.cc:212] Aborting Impala Server startup due to improper configuration
I checked core-site.xml on Cloudera Manager and fs.defaultFS is correctly set so not sure where its picking the value from. Any pointers on how to go further on this?
The init.d service packages to start Impala from the command line are meant to be used for CDH users who do NOT want to use Cloudera Manager. The right way to start and stop Impala on a Cloudera Manager cluster is to use the CM API:
https://cloudera.github.io/cm_api/apidocs/v17/index.html
start cluster service API
stop cluster service API
commands API
The tutorial shows how to use the CM APIs but for your situation you probably need to do:
$ curl -X POST -u USER:PASSWORD \
'CM_URL//api/v1/clusters/CLUSTERNAME/services/IMPALA_SERVICE/commands/stop'
replacing USER, PASSWORD, CM_URL, CLUSTERNAME, IMPALA_SERVICE_NAME with the appropriate values. The curl command will return a command ID.
Then poll this API with the command ID to see that the start/stop operation completed.
$ curl -u USER:PASSWORD 'CM_URL//api/v1/commands/COMMAND_ID'
However, if you still want to use the init.d service packages then you'll need to install the impala-server package.

Not able to run Hadoop job remotely

I want to run a hadoop job remotely from a windows machine. The cluster is running on Ubuntu.
Basically, I want to do two things:
Execute the hadoop job remotely.
Retrieve the result from hadoop output directory.
I don't have any idea how to achieve this. I am using hadoop version 1.1.2
I tried passing jobtracker/namenode URL in the Job configuration but it fails.
I have tried the following example : Running java hadoop job on local/remote cluster
Result: Getting error consistently as cannot load directory. It is similar to this post:
Exception while submitting a mapreduce job from remote system
Welcome to a world of pain. I've just implemented this exact use case, but using Hadoop 2.2 (the current stable release) patched and compiled from source.
What I did, in a nutshell, was:
Download the Hadoop 2.2 sources tarball to a Linux machine and decompress it to a temp dir.
Apply these patches which solve the problem of connecting from a Windows client to a Linux server.
Build it from source, using these instructions. It will also ensure that you have 64-bit native libs if you have a 64-bit Linux server. Make sure you fix the build files as the post instructs or the build would fail. Note that after installing protobuf 2.5, you have to run sudo ldconfig, see this post.
Deploy the resulted dist tar from hadoop-2.2.0-src/hadoop-dist/target on the server node(s) and configure it. I can't help you with that since you need to tweak it to your cluster topology.
Install Java on your client Windows machine. Make sure that the path to it has no spaces in it, e.g. c:\java\jdk1.7.
Deploy the same Hadoop dist tar you built on your Windows client. It will contain the crucial fix for the Windox/Linux connection problem.
Compile winutils and Windows native libraries as described in this Stackoverflow answer. It's simpler than building entire Hadoop on Windows.
Set up JAVA_HOME, HADOOP_HOME and PATH environment variables as described in these instructions
Use a text editor or unix2dos (from Cygwin or standalone) to convert all .cmd files in the bin and etc\hadoop directories, otherwise you'll get weird errors about labels when running them.
Configure the connection properties to your cluster in your config XML files, namely fs.default.name, mapreduce.jobtracker.address, yarn.resourcemanager.hostname and the alike.
Add the rest of the configuration required by the patches from item 2. This is required for the client side only. Otherwise the patch won't work.
If you've managed all of that, you can start your Linux Hadoop cluster and connect to it from your Windows command prompt. Joy!

Hadoop cluster configuration with Ubuntu Master and Windows slave

Hi I am new to Hadoop.
Hadoop Version (2.2.0)
Goals:
Setup Hadoop standalone - Ubuntu 12 (Completed)
Setup Hadoop standalone - Windows 7 (cygwin being used for only sshd) (Completed)
Setup cluster with Ubuntu Master and Windows 7 slave (This is mostly for learning purposes and setting up a env for development) (Stuck)
Setup in relationship with the questions below:
Master running on Ubuntu with hadoop 2.2.0
Slaves running on Windows 7 with a self compiled version from hadoop 2.2.0 source. I am using cygwin only for the sshd
password less login setup and i am able to login both ways using ssh
from outside hadoop. Since my Ubuntu and Windows machine have
different usernames I have set up a config file in the .ssh folder
which maps Hosts with users
Questions:
In a cluster does the username in the master need to be same as in the slave. The reason I am asking this is that post configuration of the cluster when I try to use start-dfs.sh the logs say that they are able to ssh into the slave nodes but were not able to find the location "/home/xxx/hadoop/bin/hadoop-daemon.sh" in the slave. The "xxx" is my master username and not the slaveone. Also since my slave in pure Windows version the install is under C:/hadoop/... Does the master look at the env variable $HADOOP_HOME to check where the install is in the slave? Is there any other env variables that I need to set?
My goal was to use the Windows hadoop build on slave since hadoop is officially supporting windows now. But is it better to run the Linux build under cygwin to accomplish this. The question comes since I am seeing that the start-dfs.sh is trying to execute hadoop-daemon.sh and not some *.cmd.
If this setup works out in future, a possible question that I have is whether Pig, Mahout etc will run in this kind of a setup as I have not seen a build of Pig, Mahout for Windows. Does these components need to be present only on the master node or do they need to be in the slave nodes too. I saw 2 ways of running mahout when experimenting with standalone mode first using the mahout script which I was able to use in linux and second using the yarn jar command where I passed in the mahout jar while using the windows version. In the case Mahout/ Pig (when using the provided sh script) will assume that the slaves already have the jars in place then the Ubuntu + Windows combo does not seem to work. Please advice.
As I mentioned this is more as an experiment rather than an implementation plan. Our final env will be completely on linux. Thank you for your suggestions.
You may have more success going with more standard ways of deploying hadoop. Try out using ubuntu vm's for master and slaves.
You can also try to do a pseudo-distributed deployment in which all of the processes run on a single VM and thus avoid the need to even consider multiple os's.
I have only worked with the same username. In general SSH allows to login with a different login name with the -l command. But this might get tricky. You have to list your slaves in the slaves file.
At least at the manual https://hadoop.apache.org/docs/r0.19.1/cluster_setup.html#Slaves I did not find anything to add usernames. it might be worth trying to add -l login_name to the slavenode in the slave conf file and see if it works.

CDH4 installation using tarball

I have been struggling to install CDH via tarball, there is no document that describes the steps or guides through. I do have root access on the server & wish to install CDH4 via tarball in Pseudo mode. Can anyone help?. On the same server apache hadoop is also installed, i want to install this CDH, without effecting the existing apache hadoop.
It will not work..because in the end CDH4 will use the same ports which your existing apache hadoop is using..It will work ..if you shutdown your existing hadoop cluster and then start your CDH4 cluster. Or else change all the port numbers for namenode,secondary namenode,jobtracker, tasktracker and datanode and their respective web UI's port..which is kind of tedious.. It would be also helpful if you provide some error logs..So I can highlight what exactly is the problem.

Resources