A way to find the ZooKeeper IPs - Hadoop

On our (Cloudera CDH) cluster we have 3 ZooKeeper nodes running.
For parcelling purposes I'm looking for a way to get those nodes' IPs dynamically instead of hard-coding them.
Is there any environment variable or REST call that I'm missing?

I must have missed it, but the env var ZK_QUORUM does the trick!
From cloudera/cm_ext on GitHub:
If you add a dependency on the ZooKeeper service, then any process in that service (e.g. a role daemon process, command process, or client config deployment process) will get the ZooKeeper quorum in the ZK_QUORUM environment variable. This can be used in the control script to add configuration properties for the ZooKeeper quorum.
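As a quick illustration, a minimal sketch of reading that variable from a script launched by the managed process (not Cloudera's own control script; it only assumes ZK_QUORUM is set to the usual comma-separated host:port quorum string):

import os

# ZK_QUORUM typically looks like "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"
zk_quorum = os.environ.get("ZK_QUORUM", "")
zk_hosts = [entry.split(":")[0] for entry in zk_quorum.split(",") if entry]

print("quorum string:", zk_quorum)
print("hosts only:", zk_hosts)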

Related

Which copy of storm.yaml file in the storm cluster is used by storm nimbus and supervisor daemons?

I'm running a Storm cluster and I made some changes in the storm.yaml file. I need to decide whether I have to update the storm.yaml file across all nodes each time I make a change.
Do the daemons on each node use their respective local copies of the config files, or is the one saved on the Nimbus node effective for all?
Each daemon uses its local copy of storm.yaml. Thus, Nimbus and a Supervisor share the same file only if they run on the same machine. Worker JVMs always run on the same machine as their corresponding Supervisor and thus always share its file.
Hence, if you only change Nimbus-related parameters, there is no need to distribute storm.yaml to all Supervisor nodes. If you change Supervisor parameters and want all Supervisors to "see" the same new configuration, you will need to distribute the file to all nodes and restart the Supervisors, too.
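If you do need to push the file out, here is a rough sketch of one way to do it from Python; the hostnames, paths, and service name are placeholders rather than anything from the answer above, and it assumes passwordless SSH to the nodes:

import subprocess

supervisor_hosts = ["supervisor1", "supervisor2", "supervisor3"]  # placeholder hostnames

for host in supervisor_hosts:
    # copy the updated config to the node's local Storm conf directory (path is an assumption)
    subprocess.check_call(["scp", "storm.yaml", f"{host}:/opt/storm/conf/storm.yaml"])
    # restart the Supervisor so it re-reads its local storm.yaml (service name is an assumption)
    subprocess.check_call(["ssh", host, "sudo systemctl restart storm-supervisor"])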

How to execute a shell script on all nodes of an EMR cluster?

Is there a proper way to execute a shell script on every node in a running EMR hadoop cluster?
Everything I look for brings up bootstrap actions, but those only apply when the cluster is starting, not to a running cluster.
My application is using Python, so my current guess is to use boto to list the IPs of each node in the cluster, then loop through the nodes and execute the shell script via SSH.
Is there a better way?
If your cluster is already started, you should use steps.
Steps are executed after the cluster has started, so technically they appear to be what you are looking for.
Be careful: steps are executed only on the master node, so you will have to connect to the rest of your nodes in some way to modify them.
Steps are scripts as well, but they run only on machines in the Master-Instance group of the cluster. This mechanism allows applications like ZooKeeper to configure the master instances and allows applications like HBase and Apache Drill to configure themselves.
Reference
See this also.
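For the "connect to the rest of your nodes" part, here is a hedged sketch of the approach the question itself suggests, using boto3; the region, cluster id, key file, and script name are placeholders, and pagination of list_instances is ignored:

import subprocess
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# collect the private IPs of every node in the running cluster
instances = emr.list_instances(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroupTypes=["MASTER", "CORE", "TASK"],
)["Instances"]
ips = [inst["PrivateIpAddress"] for inst in instances]

# pipe the local shell script into each node over SSH
for ip in ips:
    with open("setup.sh", "rb") as script:
        subprocess.check_call(
            ["ssh", "-i", "my-key.pem", f"hadoop@{ip}", "bash -s"],
            stdin=script,
        )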

Provision to start group of applications on same Mesos slave

I have a cluster of 3 Mesos slaves running two applications: “redis” and “memcached”. redis depends on memcached, and the requirement is that both applications/services start on the same node rather than on different slave nodes.
So I created the application group and added the dependency properly in the JSON file. After launching the JSON via the “/v2/groups” REST API, I observe that sometimes both applications start on the same node, but sometimes they start on different slaves, which breaks our requirement.
The intent/requirement is: if either application fails to start on a slave, both applications should fail over together to another slave node. Also, can I configure the JSON file to tell Marathon to start the application group on slave-1 (a specific slave) first if it is available, and otherwise start it on another slave in the cluster? And if for some reason the group does start on another slave, can Marathon relaunch it on slave-1 once slave-1 is available to serve requests again?
Thanks in advance for help.
Edit/Update (2):
Mesos, Marathon, and DC/OS support for PODs is available now:
DC/OS: https://dcos.io/docs/1.9/usage/pods/using-pods/
Mesos: https://github.com/apache/mesos/blob/master/docs/nested-container-and-task-group.md
Marathon: https://github.com/mesosphere/marathon/blob/master/docs/docs/pods.md
I assume you are talking about Marathon apps.
Marathon application groups don't have any semantics concerning co-location on the same node, and the same is true for dependencies.
You seem to be looking for a Kubernetes-like Pod abstraction in Marathon, which is on the roadmap but not yet available (see the update above :-)).
Hope this helps!
I think this should be possible (as a workaround) if you specify the correct app constraints within the group's JSON.
Have a look at the example request at
https://mesosphere.github.io/marathon/docs/generated/api.html#v2_groups_post
and the constraints syntax at
https://mesosphere.github.io/marathon/docs/constraints.html
e.g.
"constraints": [["hostname", "CLUSTER", "slave-1"]]
should do. The downside is that there will be no automatic failover to another slave that way. Still, I'd be curious why both apps specifically need to run on the same slave node...
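For illustration, a sketch of what such a group POST could look like from Python; the Marathon URL, app ids, commands, and omitted resource settings are all placeholders rather than anything prescribed by the answer:

import requests

group = {
    "id": "/cache-stack",
    "apps": [
        {
            "id": "memcached",
            "cmd": "memcached -u nobody",
            "instances": 1,
            "constraints": [["hostname", "CLUSTER", "slave-1"]],
        },
        {
            "id": "redis",
            "cmd": "redis-server",
            "instances": 1,
            "dependencies": ["/cache-stack/memcached"],
            "constraints": [["hostname", "CLUSTER", "slave-1"]],
        },
    ],
}

# POST the group definition to Marathon's /v2/groups endpoint
resp = requests.post("http://marathon.example.com:8080/v2/groups", json=group)
resp.raise_for_status()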

How can I force Hadoop to use my hostnames instead of IP-XX-XX-XX-XX

So, I'm configuring a 10-node cluster with Hadoop 2.5.2, and so far it's working, but the only issue I have is that when trying to communicate with the nodes, Hadoop is guessing their hostnames based on their IPs instead of using the ones I've configured.
Let me be more specific: this happens when starting a job, but when I start up YARN (for instance), the slave nodes' names are used correctly. The scheme Hadoop uses to auto-generate the node names is IP-XX-XX-XX-XX, so a node with IP 179.30.0.1 would be named IP-179-30-0-1.
This forces me to edit the /etc/hosts file on every node so that its 127.0.0.1 entry matches the name Hadoop expects.
Can I make Hadoop use the names I have given those hosts, or am I forced to do this extra configuration step?

Configure oozie workflow properties for HA JobTracker

With an Oozie workflow, you have to specify the cluster's JobTracker in the properties for the workflow. This is easy when you have a single JobTracker:
jobTracker=hostname:port
When the cluster is configured for an HA (high availability) JobTracker, I need to be able to set up my properties files to hit either of the JobTracker hosts, without having to update all my properties files whenever the JobTracker fails over to the second node.
When accessing one JobTracker over HTTP, it redirects to the other if it isn't running, but Oozie doesn't go through HTTP, so there is no redirect, which causes the workflow to fail if the properties file specifies the JobTracker host that is not running.
How can I configure my property file to handle JobTracker running in HA?
I just finished setting up some Oozie workflows to use HA JobTrackers and NameNodes. The key is to use the logical name of the HA service you configured, not any individual hostnames or ports. For example, the default HA JobTracker name is 'logicaljt'. Replace hostname:port with 'logicaljt', and everything should just work, as long as the node from which you're running Oozie has the appropriate hdfs-site and mapred-site configs properly installed (implicitly, because it is part of the cluster, or explicitly, because a gateway role was added to it).
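For example, the workflow properties would then look something like this (the nameservice name is a placeholder; use the logical names actually defined in your cluster's hdfs-site.xml and mapred-site.xml):
nameNode=hdfs://<nameservice>
jobTracker=logicaljt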
Please specify the nameservice for the cluster on which HA is enabled, e.g. in the properties file:
namenode=hdfs://<nameservice>
jobTracker=<nameservice>:8032
