I have to explicitly boot all the processor nodes every time my master system restarts, i.e.
mpdboot -n 3
P.S. I am implementing a Beowulf cluster.
Try updating to a newer version of MPICH. MPD is quite old and hasn't been supported for some time. The newer versions use Hydra, which doesn't require a separate boot step.
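With Hydra, the launcher starts the remote processes itself (over ssh by default), so there is nothing to boot when the master restarts. A minimal sketch, assuming a host file named hosts that lists your three nodes and a program called ./my_program:
mpiexec -f hosts -n 3 ./my_program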
I've installed a single-node Greenplum DB with 2 segment hosts, each hosting 2 primary and mirror segments, and I want to configure a standby master. Can anyone help me with it?
It is pretty simple.
gpinitstandby -s smdw -a
Note: If you are using one of the cloud Marketplaces that deploys Greenplum for you, the standby master runs on the first segment host. The overhead of running the standby master is pretty small, so it doesn't impact performance. The cloud Marketplaces also have self-healing, so if that node fails, it is replaced and all services are automatically restored.
As Jon said, this is fairly straightforward. Here is a link to the documentation: https://gpdb.docs.pivotal.io/5170/utility_guide/admin_utilities/gpinitstandby.html
If you have follow-up questions, post them here.
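As a hedged addition (the data directory below is just an illustrative path): once gpinitstandby finishes, gpstate -f shows the standby master details so you can confirm it is synchronized, and if the primary master ever fails you promote the standby by running gpactivatestandby on the standby host, pointing it at the standby's master data directory:
gpstate -f
gpactivatestandby -d /data/master/gpseg-1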
I want to set up a Mesosphere cluster (Mesos, DC/OS, Marathon) for running different jobs. The nodes these jobs run on depend on the nature of the job. For example, a job with C# code will run on a Windows node, a job with pure C++ will run on Ubuntu or FreeBSD, and so on. Each of these can again be a cluster, i.e. I want to have, let's say, 2 Windows nodes and 4 Ubuntu nodes. So I would like to know:
Can this be achieved in a single deployment? Or do I need to set up a different cluster for each environment I want, one for Windows, one for Ubuntu, etc.?
Regardless of whether it is a single hybrid cluster or multiple environments, does Mesos provide granularity in what the nodes send back? I.e. I don't want to see only high-level status like job failed or running. My jobs write stats to a file on the system, and I want to relay this back to the "main UI" or the layer that is managing all this.
Can this be achieved in a single deployment?
If you want to use DC/OS, it currently only officially supports CentOS/RedHat; for Ubuntu you need at least 16.04, which uses systemd rather than Ubuntu's old upstart. But AFAIK, Windows is not supported in DC/OS.
So for your scenario, you have to use Mesos directly rather than DC/OS. Then, within one cluster, you can run Mesos agents on both Ubuntu and Windows. You can add a role or attributes when an agent registers with the Mesos master, so the framework can tell them apart and dispatch each job to the proper agent (see the sketch below).
BTW, for Windows you need at least Mesos 1.3.0, which supports Windows, and you have to build it on Windows with Microsoft Visual Studio yourself.
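As an illustration of tagging agents (the master address, work directory, and attribute values here are made up), attributes are passed when the agent starts, and your framework can then inspect them in the resource offers it receives:
mesos-agent --master=10.0.0.1:5050 --work_dir=/var/lib/mesos --attributes="os:windows;arch:x86_64"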
Does Mesos provide granularity in what the nodes send back?
Yes, but you cannot use the default command executor; you need to write your own executor.
In your executor, you can set the data you want to send back:
from mesos.interface import mesos_pb2  # old Mesos Python bindings

# Build a status update and attach an arbitrary payload in the data field.
update = mesos_pb2.TaskStatus()
update.task_id.value = task.task_id.value
update.state = mesos_pb2.TASK_RUNNING
update.data = 'data with a \0 byte'  # e.g. stats your job wrote to its file
driver.sendStatusUpdate(update)
In your framework, you can receive it as follows:
# Scheduler callback: update.data carries the payload set by your executor.
def statusUpdate(self, driver, update):
    slave_id, executor_id = self.taskData[update.task_id.value]
    payload = update.data  # your executor's custom data
Here is an example I found on GitHub which may help you with sending your own data back.
I have set up a Hadoop cluster with Hortonworks Data Platform 2.5. I'm using 1 master and 5 slave (worker) nodes.
Every few days one (or more) of my worker nodes gets a high load and seems to restart the whole CentOS operating system automatically. After the restart the Hadoop components don't run anymore and have to be restarted manually via the Ambari management UI.
Here is a screenshot of the "crashed" node (reboot after the high load value ~4 hours ago):
Here is a screenshot of one of the other "healthy" worker nodes (all other workers have similar values):
The node crashes alternate between the 5 worker nodes; the master node seems to run without problems.
What could cause this problem? Where are these high load values coming from?
This seems to be a kernel problem, as the log file (e.g. /var/spool/abrt/vmcore-127.0.0.1-2017-06-26-12:27:34/backtrace) says something like:
Version: 3.10.0-327.el7.x86_64
BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
After running sudo yum update I had kernel version
[root@myhost ~]# uname -r
3.10.0-514.26.2.el7.x86_64
Since the operating system update, the problem hasn't occurred anymore. I will keep observing the issue and give feedback if necessary.
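As a hedged aside (assuming the default CentOS tooling is installed): rpm -q kernel lists the installed kernel packages so you can confirm the node is booting the updated one, and abrt-cli list shows whether any new crash dumps have been recorded since the update.
rpm -q kernel
abrt-cli list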
I am running a Mesos cluster on 3 instances, each running both mesos-master and mesos-slave. I believe the cluster is configured correctly and is able to run a web app via Docker and Marathon on all three instances.
I set up Jenkins to perform a deployment to the cluster and, as the last step, post to the Marathon REST API to restart the app; however, it fails silently (simply stuck at the deployment stage). If the app is running on only 2 instances, the restart goes smoothly. Does Marathon require one instance to be unoccupied to perform an app restart?
Am I missing something here?
Are there enough free resources in your cluster?
IIRC the default restart behavior will first start the new version and then scale down the old version (hence you need 2× the app's resources).
See Marathon Deployments for details and the upgrade strategy section here.
Here the relevant excerpt from the upgrade strategy:
upgradeStrategy
During an upgrade all instances of an application get replaced by a new version. The upgradeStrategy controls how Marathon stops old versions and launches new versions. It consists of two values:
minimumHealthCapacity (Optional. Default: 1.0) - a number between 0 and 1 that is multiplied with the instance count. This is the minimum number of healthy nodes that do not sacrifice overall application purpose. Marathon will make sure, during the upgrade process, that at any point in time this number of healthy instances is up.
maximumOverCapacity (Optional. Default: 1.0) - a number between 0 and 1 which is multiplied with the instance count. This is the maximum number of additional instances launched at any point of time during the upgrade process.
The default minimumHealthCapacity is 1, which means no old instance can be stopped before another healthy new version is deployed. A value of 0.5 means that during an upgrade half of the old version instances are stopped first to make space for the new version. A value of 0 means take all instances down immediately and replace with the new application.
The default maximumOverCapacity is 1, which means that all old and new instances can co-exist during the upgrade process. A value of 0.1 means that during the upgrade process 10% more capacity than usual may be used for old and new instances. A value of 0.0 means that even during the upgrade process no more capacity may be used for the new instances than usual. Only when an old version is stopped, a new instance can be deployed.
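If you don't have room for 2× the resources, one option (a minimal sketch, not the only way; the Marathon URL and app id below are assumptions) is to relax the upgradeStrategy so Marathon may stop old tasks before starting new ones, and then trigger the restart through the REST API, which is what the Jenkins step does anyway:
import requests

MARATHON = "http://marathon.example.com:8080"  # assumed Marathon endpoint
APP_ID = "/my-web-app"                         # assumed application id

# Let Marathon stop old instances first instead of needing 2x resources.
requests.put(
    MARATHON + "/v2/apps" + APP_ID,
    json={"upgradeStrategy": {"minimumHealthCapacity": 0.5,
                              "maximumOverCapacity": 0}},
).raise_for_status()

# Trigger a rolling restart of the app.
requests.post(MARATHON + "/v2/apps" + APP_ID + "/restart").raise_for_status()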
I configured a multi-node Hadoop environment on AWS (1 master / 3 slaves running on Ubuntu 14.04). Now I am planning to install and configure other Apache bricks (not sure which ones exactly yet). I decided to start with HBase.
Here is my dilemma: should I install ZooKeeper standalone and then HBase (taking into consideration future bricks like Pig, Hive, ...), or should I use the ZooKeeper bundled with HBase?
How might this choice affect the subsequent architecture design?
Thanks for sharing your views/personal experiences!
It doesn't really matter all that much from a capabilities point of view. If you install the bundled HBase+ZK, you'll still be able to use ZK later on to support other bricks. Since installing the bundle is likely to be the quickest path to a working HBase, it is probably the best option for you.
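If you later decide to move to an external ZooKeeper, the switch is mostly configuration (a hedged sketch; hostnames are placeholders): set export HBASE_MANAGES_ZK=false in conf/hbase-env.sh so HBase stops managing its own ZK, and point hbase.zookeeper.quorum in conf/hbase-site.xml at your ensemble, e.g. zk1.example.com,zk2.example.com,zk3.example.com.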
A ZK ensemble is recommended to run on separate machines (an odd number of them) in any production environment.
For learning and experimenting, it can co-exist on one machine.
More information: https://zookeeper.apache.org/doc/r3.3.2/zookeeperAdmin.html
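For reference, a minimal sketch of a 3-node ensemble's zoo.cfg, the same on each server (hostnames and paths are placeholders):
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888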