Running multiple worker daemons SLURM - cluster-computing

I want to run multiple worker daemons on a single machine. As per damienfrancois's answer on "what is the minimum number of computers for a slurm cluster", it can be done. The problem is that currently I am able to run only one worker daemon on a machine. For example, when I run
sudo slurmd -N linux1 -cDvv
sudo slurmd -N linux2 -cDvv
linux1 goes down when I run linux2. Is it possible to run multiple worker daemons on one machine?
Here is my slurm.conf file

As your intention seems to be just testing the behavior of Slurm, I would recommend using front-end mode, which lets you create dummy compute nodes on the same machine.
The Slurm FAQ has more details, but basically you must configure your installation to enable this mode:
./configure --enable-front-end
And configure the nodes in slurm.conf
NodeName=test[1-100] NodeHostName=localhost
In that guide, they also explain how to launch more than one real daemon on the same node by changing the ports, but for my testing purposes it was not necessary.
Good luck!

I had the same issue; I resolved it by modifying the paths of the log files as described in the Slurm documentation on multiple slurmd support.
In your slurm.conf, for example,
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
must be
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
SlurmdSpoolDir=/var/spool/slurmd.%n
Now you can launch multiple slurmd daemons.
Note: I tried with your slurm.conf, and I think some parameters are missing, such as defining two NodeName lines instead of one and specifying which Port each set of nodes should use.
This works for me
# COMPUTE NODES
NodeName=linux[1-10] NodeHostname=linux0 Port=17004 CPUs=1 State=UNKNOWN
NodeName=linux[11-19] NodeHostname=linux0 Port=17005 CPUs=1 State=UNKNOWN
# PARTITIONS
PartitionName=main Nodes=linux1 Default=YES MaxTime=INFINITE State=UP
PartitionName=dev Nodes=linux11 Default=YES MaxTime=INFINITE State=UP
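As a quick check (a sketch using the node names from the config above, and assuming slurmctld is already running on the same host), you can start one slurmd per node name, each in its own terminal, and confirm that both register:
sudo slurmd -N linux1 -cDvv
sudo slurmd -N linux11 -cDvv
sinfo
Both linux1 and linux11 should then show up in the sinfo output.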

Related

What are best practices to run command on all nodes in a HDP cluster?

Often in a Hadoop environment, you are required to run a command or a script, or to copy a file, to all nodes in the cluster.
What are efficient ways of doing that (without having to ssh to each node separately)?
Example:
When upgrading Ambari, you are required to run many commands on all nodes where a certain component is installed - e.g. Infra, SmartSense, etc.
I use Ansible to do that; it will do the job for you.
Alternatively, you can use Puppet, Salt, or Chef.
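For illustration, a minimal sketch using Ansible ad-hoc commands; the inventory group name and hostnames here are hypothetical:
# inventory file, e.g. hosts.ini
[hdp_nodes]
node1.example.com
node2.example.com
node3.example.com
# run a command on every node
ansible hdp_nodes -i hosts.ini -m shell -a "ambari-agent status"
# copy a file to every node
ansible hdp_nodes -i hosts.ini -m copy -a "src=/tmp/upgrade.sh dest=/tmp/upgrade.sh mode=0755"
The shell and copy modules are standard Ansible modules; for anything more involved, you would put the tasks into a playbook.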

redis on windows cluster setup

I have downloaded MSOpenTech Redis version 3.x, which includes the long-awaited clustering feature. My Redis database is all working and I can start my cluster on the minimum 3 nodes required (in cluster mode). Does anyone know how to configure the cluster (it seems no one knows)?
Installing Linux and running the native Linux version is not an option for me sadly.
Any help would be greatly appreciated.
You can follow the Redis Cluster Tutorial, and to create the cluster you can use the redis-trib.rb Ruby script, for which you need to install Ruby for Windows.
For example:
> C:\Ruby22\Bin\ruby.exe redis-trib.rb create --replicas 1 192.168.1.1:7000 192.168.1.1:7001 192.168.1.1:7002 192.168.1.1:7003 192.168.1.1:7004 192.168.1.1:7005
I did not have the option to install Ruby on Windows, but found that the manual steps worked for me. The Ruby script does a lot of checking that things are set up correctly and is the preferred setup route. So beware: here be dragons.
Set each node to run in Cluster mode. Edit the redis.windows-service.conf file and uncomment
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000
Then restart the service.
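For example, from an elevated PowerShell prompt (assuming the service was installed under the default name Redis):
Restart-Service Redis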
Open a PowerShell window, change to the Redis installation folder, and start redis-cli, e.g.
cd "C:\Program Files\Redis"
.\redis-cli.exe
Now you can join the other nodes. Run CLUSTER MEET IPADDRESS PORT for each node other than the instance you happen to be on, e.g.
CLUSTER MEET 10.10.0.2 6379
After a few seconds, running
CLUSTER NODES
should list all the connected nodes, but they will all initially be set as MASTER.
On each of the other nodes, run CLUSTER REPLICATE MASTERNODEID, where MASTERNODEID is the hash-looking value next to the node declared "myself" on your master when running CLUSTER NODES, e.g.
CLUSTER REPLICATE b7c767ab3ab7c4a926ac2fed937cf140b96764a7
Now allocate slots to each master. My setup has three instances, with only one master.
for ($slot=0;$slot -le 16383;$slot++) {
.\redis-cli.exe -h REDMST CLUSTER ADDSLOTS $slot
}
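To verify the assignment afterwards (REDMST again being the master's host), you can inspect the cluster state, for example:
.\redis-cli.exe -h REDMST CLUSTER INFO
Once all slots are allocated, the output should report cluster_state:ok and cluster_slots_assigned:16384.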
Reconnect with redis-cli and try to save data, e.g.
SET foo bar
OK
GET foo
"bar"
Phew! I got most of this from reading https://www.javacodegeeks.com/2015/09/redis-clustering.html#InstallingRedis, which is not Windows-specific.
For the Windows version: open a command window, then type the command below.
C:\ProgramFiles\redis>FOR /L %i IN (0,1,16383) DO ( redis-cli.exe -p 6380 CLUSTER ADDSLOTS %i )
6380 is the port of the master node.

Spark - Add Worker from Local Machine (standalone spark cluster manager)?

When running Spark 1.4.0 on a single machine, I can add a worker by using the command "./bin/spark-class org.apache.spark.deploy.worker.Worker myhostname:7077". The official documentation points out another way: add "myhostname:7077" to the "conf/slaves" file and then execute the command "sbin/start-all.sh", which invokes the master and all workers listed in the conf/slaves file. However, the latter method doesn't work for me (it fails with a time-out error). Can anyone help me with this?
Here is my conf/slaves file (assume the master URL is myhostname:700):
myhostname:700
The conf/slaves file should just be the list of hostnames; you don't need to include the port that Spark runs on (I think if you do, it will try to SSH on that port, which is probably where the timeout comes from).
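In other words, for the setup in the question, the conf/slaves file would contain just the hostname, one worker per line (hostname taken from the question):
myhostname
and you would then run sbin/start-all.sh on the master as before.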

Percona Xtradb Cluster nodes won't start

I set up percona_xtradb_cluster-56 with three nodes in the cluster. To bootstrap the cluster on the first node, I use the following command and it starts just fine:
#/etc/init.d/mysql bootstrap-pxc
The other two nodes, however, fail to start when I start them normally using the command:
#/etc/init.d/mysql start
The error I am getting is "The server quit without updating the PID file". The error log contains this message:
Error in my_thread_global_end(): 1 threads didn't exit 150605 22:10:29
mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended.
The cluster nodes are all running Ubuntu 14.04. When I use percona-xtradb-cluster5.5, the cluster and all the nodes run just fine as expected. But I need to use version 5.6 because I am also using GTID, which is only available in version 5.6 and not supported in earlier versions.
I was following these two pieces of Percona documentation to set up the cluster:
https://www.percona.com/doc/percona-xtradb-cluster/5.6/installation.html#installation
https://www.percona.com/doc/percona-xtradb-cluster/5.6/howtos/ubuntu_howto.html
Any insight or suggestions on how to resolve this issue would be highly appreciated.
The problem is related to memory, as "The Georgia" writes. There should be at least 500 MB available for the default setup and bootstrapping. See http://sysadm.pp.ua/linux/px-cluster.html
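As a quick sanity check before bootstrapping (standard Linux tools, nothing Percona-specific), you can confirm how much memory is actually free and whether the OOM killer stopped mysqld:
free -m
dmesg | grep -i 'killed process'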

Spark: how to set worker-specific SPARK_HOME in standalone mode [duplicate]

This question already has answers here:
How to use start-all.sh to start standalone Worker that uses different SPARK_HOME (than Master)?
I'm setting up a [somewhat ad-hoc] cluster of Spark workers: namely, a couple of lab machines that I have sitting around. However, I've run into a problem when I attempt to start the cluster with start-all.sh: Spark is installed in different directories on the various workers, but the master invokes $SPARK_HOME/sbin/start-all.sh on each one using the master's definition of $SPARK_HOME, even though the path is different for each worker.
Assuming I can't install Spark on identical paths on each worker to the master, how can I get the master to recognize the different worker paths?
EDIT #1: Hmm, I found this thread on the Spark mailing list, which strongly suggests that this is the current implementation: $SPARK_HOME is assumed to be the same for all workers.
I'm playing around with Spark on Windows (my laptop) and have two worker nodes running by starting them manually using a script that contains the following
set SPARK_HOME=C:\dev\programs\spark-1.2.0-worker1
set SPARK_MASTER_IP=master.brad.com
spark-class org.apache.spark.deploy.worker.Worker spark://master.brad.com:7077
I then create a copy of this script with a different SPARK_HOME defined to run my second worker from. When I kick off a spark-submit I see this on Worker_1
15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker1\bin...
and this on Worker_2
15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker2\bin...
So it works; in my case I duplicated the Spark installation directory, but you may be able to get around this.
You might want to consider assigning the name by changing the SPARK_WORKER_DIR line in the spark-env.sh file.
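For example, a possible line in conf/spark-env.sh on a given worker (the path here is hypothetical):
export SPARK_WORKER_DIR=/data/spark/worker1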
A similar question was asked here
The solution I used was to create a symbolic link mimicking the master node's installation path on each worker node, so that when start-all.sh runs on the master node and SSHes into a worker node, it sees identical paths for running the worker scripts.
For example, in my case I had 2 Macs and 1 Linux machine. Both Macs had Spark installed under /Users/<user>/spark, whereas the Linux machine had it under /home/<user>/spark. One of the Macs was the master node, so running start-all.sh would error each time on the Linux machine due to the path mismatch (error: /Users/<user>/spark does not exist).
The simple solution was to mimic the Mac's pathing on the Linux machine using a symbolic link:
open terminal
cd / <-- go to the root of the drive
sudo ln -s home Users <-- create a sym link "Users" pointing to the actual "home" directory.
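You can then confirm the link from the Linux machine, e.g. ls -ld /Users should show Users -> home.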
