CDH 5.3.2 - Need to restart Impala daemon from shell/script

I am using a CDH 5.3.2 cluster and need to be able to start/stop the Impala daemons from a script. The command mentioned in the Cloudera docs
sudo service impala-server start
works fine on my CDH 5.10 local VM, but on the CDH 5.3.2 cluster I get the error "impala-server: unrecognized service". Checking /etc/init.d, I see that no such service is listed either (while it is listed in the 5.10 version).
Then I tried to restart the service directly from the Impala bin directory:
cd /usr/bin
./impalad stop
However, I am now running into the error below:
E0918 11:55:27.815739 12046 JniFrontend.java:622] FileSystem is file:///
W0918 11:55:27.817589 12046 JniFrontend.java:534] Cannot detect CDH version. Skipping Hadoop configuration checks
E0918 11:55:27.817620 12046 impala-server.cc:210] Unsupported file system. Impala only supports DistributedFileSystem but the configured filesystem is: LocalFileSystem.fs.defaultFS(file:///) might be set incorrectly
E0918 11:55:27.817631 12046 impala-server.cc:212] Aborting Impala Server startup due to improper configuration
I checked core-site.xml in Cloudera Manager and fs.defaultFS is set correctly, so I am not sure where it is picking this value up from. Any pointers on how to proceed?

The init.d service packages that start Impala from the command line are meant for CDH users who do NOT use Cloudera Manager. The right way to start and stop Impala on a cluster managed by Cloudera Manager is to use the CM API:
https://cloudera.github.io/cm_api/apidocs/v17/index.html
start cluster service API
stop cluster service API
commands API
The tutorial shows how to use the CM APIs, but for your situation you probably need something like:
$ curl -X POST -u USER:PASSWORD \
'CM_URL/api/v1/clusters/CLUSTERNAME/services/IMPALA_SERVICE/commands/stop'
replacing USER, PASSWORD, CM_URL, CLUSTERNAME, and IMPALA_SERVICE with the appropriate values. The curl command will return a command ID.
Then poll this API with the command ID to check whether the start/stop operation has completed:
$ curl -u USER:PASSWORD 'CM_URL/api/v1/commands/COMMAND_ID'
However, if you still want to use the init.d service packages then you'll need to install the impala-server package.
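The issue-then-poll flow described above can be wrapped in a small script. This is a minimal sketch, not a tested integration: the function names, the CM_USER and CM_PASS environment variables, and the sed-based JSON parsing are all assumptions made for illustration, and it assumes the command reply carries a numeric "id" field plus boolean "active" and "success" fields.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and CM_USER / CM_PASS are hypothetical.

# Build the URL for a service-level command (for example start or stop).
cm_service_command_url() {
  local cm_url="${1%/}" cluster="$2" service="$3" action="$4"
  printf '%s/api/v1/clusters/%s/services/%s/commands/%s\n' \
    "$cm_url" "$cluster" "$service" "$action"
}

# Extract the first numeric or boolean value of FIELD from a JSON reply on stdin.
# Assumption: the reply looks like  "id" : 123  and  "active" : false.
json_scalar() {
  local field="$1"
  sed -n 's/.*"'"$field"'"[[:space:]]*:[[:space:]]*\([0-9a-z]*\).*/\1/p' | head -n 1
}

# Issue the command, then poll (up to 60 times) until it is no longer active.
# Returns 0 only if the final reply reports success.
cm_run_and_wait() {
  local cm_url="${1%/}" cluster="$2" service="$3" action="$4" reply id
  reply=$(curl -s -X POST -u "$CM_USER:$CM_PASS" \
    "$(cm_service_command_url "$cm_url" "$cluster" "$service" "$action")") || return 1
  id=$(printf '%s\n' "$reply" | json_scalar id)
  [ -n "$id" ] || return 1
  for _ in $(seq 1 60); do
    reply=$(curl -s -u "$CM_USER:$CM_PASS" "$cm_url/api/v1/commands/$id") || return 1
    [ "$(printf '%s\n' "$reply" | json_scalar active)" = "true" ] || break
    sleep 5
  done
  [ "$(printf '%s\n' "$reply" | json_scalar success)" = "true" ]
}
```

Calling cm_run_and_wait CM_URL CLUSTERNAME IMPALA_SERVICE stop and then the same call with start would amount to a restart. If a JSON-aware tool is available on the host, it would be a more robust choice than sed for reading the reply.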

Related

start-all.sh command not found

I have just installed the Cloudera VM setup for Hadoop. But when I open the command prompt and try to start all the Hadoop daemons using the command 'start-all.sh', I get the error "bash : start-all.sh: command not found".
I have tried 'start-dfs.sh' too, but it gives the same error. When I run the 'jps' command, I can see that none of the daemons have been started.
You can find the start-all.sh and start-dfs.sh scripts in the bin or sbin folders. To locate them, go to the Hadoop installation folder and run:
find . -name 'start-all.sh' # Finds files named start-all.sh
Then you can start all the daemons by giving the full path: bash /path/to/start-all.sh
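The find-then-run step can be sketched as a small helper. The function name is hypothetical, and the default installation directory in the usage comment is only a guess that needs adjusting to the actual install.

```shell
# Hypothetical helper: print the first regular file under ROOT whose name is NAME.
find_script() {
  local root="$1" name="$2"
  find "$root" -type f -name "$name" 2>/dev/null | head -n 1
}

# Example use (the fallback path is an assumption, not a known location):
# script=$(find_script "${HADOOP_HOME:-/usr/lib/hadoop}" start-all.sh)
# [ -n "$script" ] && bash "$script"
```

The helper prints nothing when no match exists, so the caller can test for an empty result before trying to run anything.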
If you're using the QuickStart VM, then the right way to start the cluster (as @cricket_007 hinted) is to restart it in the Cloudera Manager UI. The start-all.sh scripts will not work, since they only apply to the core Hadoop servers (NameNode, DataNode, ResourceManager, NodeManager ...) and not to all the services in the ecosystem (like Hive, Impala, Spark, Oozie, Hue ...).
You can refer to the YouTube video and the official documentation, "Starting, Stopping, Refreshing, and Restarting a Cluster".

Starting Hadoop Services using Command Line (CDH 5)

I know how to start services using the Cloudera Manager interface, but I prefer to know what is really happening behind the scenes and not rely on "magic".
I read this page, but it does not give the information I need.
I know there are some .sh files to be used, but they seem to vary from version to version, and I'm using the latest as of today (5.3).
I would be grateful for a list of service start commands (specifically for HDFS).
PS: It looks like Cloudera has somehow ditched the classic Apache scripts (start-dfs.sh etc.).
You can figure this out by installing Cloudera's optional service packages.
These use the service command to start services instead of Cloudera Manager.
hadoop-hdfs-namenode - for namenode
hadoop-hdfs-secondarynamenode - for secondary namenode
hadoop-hdfs-datanode - for datanode
hadoop-hdfs-journalnode - for journalnode
You can see the CDH5.9 RPMs here:
http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.9/RPMS/x86_64/
After you install them, you can look at the respective /etc/init.d/SERVICENAME to understand how they are run (assuming you're comfortable looking at shell scripts).
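As a rough illustration of driving these packages with the service command, the sketch below loops over the four HDFS package names listed above and acts only on those whose init script is present on the host. The function name and the DRY_RUN and INIT_DIR variables are hypothetical conveniences, not part of any Cloudera tooling.

```shell
# Sketch: run `service <daemon> <action>` for the HDFS daemons installed on this host.
# DRY_RUN (print instead of run) and INIT_DIR (override /etc/init.d) are hypothetical.
hdfs_daemons="hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode hadoop-hdfs-datanode hadoop-hdfs-journalnode"

hdfs_service() {
  local action="$1" daemon
  for daemon in $hdfs_daemons; do
    # Skip daemons whose init script is not installed here.
    [ -e "${INIT_DIR:-/etc/init.d}/$daemon" ] || continue
    if [ -n "${DRY_RUN:-}" ]; then
      echo "sudo service $daemon $action"
    else
      sudo service "$daemon" "$action"
    fi
  done
}
```

Running it with DRY_RUN=1 first shows which commands would be issued on a given node before anything is actually started or stopped.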

H2O: unable to connect to h2o cluster through python

I have a 5-node Hadoop cluster running HDP 2.3.0. I set up an H2O cluster on YARN as described here.
On running the following command
hadoop jar h2odriver_hdp2.2.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 512m -nodes 3 -output /user/hdfs/H2OTestClusterOutput
I get the following output
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
When I try to execute the command
h2o.init(ip="10.113.57.98", port=54321)
The process remains stuck at this stage. When I try to connect to the web UI at ip:54321, the browser tries endlessly to load the H2O admin page, but nothing is ever displayed.
On forcefully terminating the init process, I get the following error
No instance found at ip and port: 10.113.57.98:54321. Trying to start local jar...
However if I try and use H2O with python without setting up a H2O cluster, everything runs fine.
I executed all commands as the root user. Root user has permissions to read and write from the /user/hdfs hdfs directory.
I'm not sure if this is a permissions error or that the port is not accessible.
Any help would be greatly appreciated.
It looks like you are using H2O2 (H2O Classic). I recommend upgrading your H2O to the latest (H2O 3). There is a build specifically for HDP2.3 here: http://www.h2o.ai/download/h2o/hadoop
Running H2O3 is a little cleaner too:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Also, 512 MB per node is tiny. What is your use case? I would give the nodes some more memory.

How to restart yarn on AWS EMR

I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart yarn to bring the changes into effect.
Is there a command using which I can do this?
Edit (10/26/2017): AWS has since officially published a more detailed Knowledge Center article on how to do this:
https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/.
You can ssh into the master node of your EMR cluster and run
sudo /sbin/stop hadoop-yarn-resourcemanager
sudo /sbin/start hadoop-yarn-resourcemanager
to restart the YARN ResourceManager. EMR AMI 4.x.x uses upstart: /sbin/{start,stop,restart} are all symlinks to /sbin/initctl, which is part of upstart. See the initctl man page for more information.
Alternatively, you can follow the instructions here to propagate your changes to yarn-site.xml - yarn-change-configuration-on-yarn-site-xml
For those who are gonna come from Google
In order to restart a service in EMR, perform the following actions:
Find the name of the service by running the following command:
initctl list
For example, the YARN Resource Manager service is named hadoop-yarn-resourcemanager.
Stop the service by running the following command:
sudo stop hadoop-yarn-resourcemanager
Wait a few seconds, then start the service by running the following command:
sudo start hadoop-yarn-resourcemanager
Note: Stop/start is required; do not use the restart command.
Verify that the process is running by running the following command:
sudo status hadoop-yarn-resourcemanager
Check for the process using ps, and then check the log file for any errors in the log directory /var/log/.
Source : https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/
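The stop, wait, start, and status steps above could be collected into one function, sketched here. The function name, the default pause of 5 seconds, and the DRY_RUN switch are assumptions for illustration; the stop, start, and status commands themselves are the upstart ones quoted above.

```shell
# Sketch of the stop / wait / start / status sequence for an upstart-managed
# EMR service. emr_restart and DRY_RUN are hypothetical names.
emr_restart() {
  local service="$1" pause="${2:-5}" run="sudo"
  # In dry-run mode, print each command instead of executing it.
  [ -n "${DRY_RUN:-}" ] && run="echo sudo"
  $run stop "$service"
  [ -n "${DRY_RUN:-}" ] || sleep "$pause"   # give the process a few seconds to exit
  $run start "$service"
  $run status "$service"
}
```

For example, emr_restart hadoop-yarn-resourcemanager would follow the sequence from the article, deliberately using stop and start rather than the restart command.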
If what you want to do is to enable log-aggregation, it is actually easier to create the cluster with log-aggregation already enabled, as described in the documentation:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-debugging.html
(It is actually enabled by default if you are using emr-4.3.0).
Try restarting this service as well:
hadoop-yarn-nodemanager

Restart hive service on AWS EMR

I am very new to Hive as well as AWS EMR. As per my requirement, I need to create the Hive metastore outside the cluster (from AWS EMR to AWS RDS).
I followed the instructions given in
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-dev-create-metastore-outside.html
I made changes in hive-site.xml and was able to set up the Hive metastore on an Amazon RDS MySQL server. To bring the changes into effect, I am currently rebooting the complete cluster so that Hive starts storing its metastore in AWS RDS. This works.
But I want to avoid rebooting the cluster. Is there any way I can restart just the service?
Just for those who are gonna come from Google
To restart any EMR service
In order to restart a service in EMR, perform the following actions:
Find the name of the service by running the following command:
initctl list
For example, the YARN Resource Manager service is named “hadoop-yarn-resourcemanager”.
Stop the service by running the following command:
sudo stop hadoop-yarn-resourcemanager
Wait a few seconds, then start the service by running the following command:
sudo start hadoop-yarn-resourcemanager
Note: Stop/start is required; do not use the restart command.
Verify that the process is running by running the following command:
sudo status hadoop-yarn-resourcemanager
Check for the process using ps, and then check the log file for any errors in the log directory /var/log/.
Source : https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/
sudo stop hive-metastore
sudo start hive-metastore
On EMR 5.x I have found this to work:
hive --service metastore --stop
hive --service metastore --start
For me this approach worked:
1. Get the PID
2. Kill the process
3. The process restarts by itself
Commands for 1 & 2:
ps aux | grep MetaStore
sudo -u hive kill <pid from above>
If you are not familiar with ps, you can use the following command, which shows the header line (with the PID column) and only the line for the Hive metastore process:
ps aux | egrep "MetaStore|PID" | grep -v grep
The Hive metastore restarted by itself. Validate by running ps again; the PID will have changed.
ps aux | grep MetaStore
You don't have to restart the entire cluster. While launching the cluster, you can specify a hive-site.xml file with the RDS details. If you did not use this option and instead made the changes manually after launching the cluster, you still don't need to restart the entire cluster; just restart the hive-metastore service. The Hive metastore runs on the master node only.
You can launch the cluster in several ways:
1) AWS console
2) Using the API (Java, Python, etc.)
3) Using the AWS CLI
You can keep hive-site.xml in S3 and install it as a bootstrap action while launching the cluster. The AWS API lets you specify a custom hive-site.xml from S3 instead of the one created by default.
If you are using Hive from the master machine alone, you don't have to make the changes on all the machines.
An example of specifying hive-site.xml while launching EMR with the AWS CLI is given below:
aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 \
--bootstrap-actions Name="Install Hive Site Configuration",Path="s3://elasticmapreduce/libs/hive/hive-script",\
Args=["--base-path","s3://elasticmapreduce/libs/hive","--install-hive-site","--hive-site=s3://mybucket/hive-site.xml","--hive-versions","latest"]
