Can someone please explain to me how fence_vmware_soap works? - cluster-computing

I was able to set up fence_vmware_soap in my cluster, and I know it is used to prevent data corruption by making sure two nodes do not write to shared storage (LUNs in my case) at the same time. The fence makes sure the unhealthy node is completely down before the active node takes over and writes to the shared disk.
I would like to know what happens, and how one node in the cluster knows the other node is unhealthy, before the unhealthy node kills itself using the fence_vmware_soap agent.
I would really appreciate an answer that explains this in a very simple way, because this is my first time setting up an NFS cluster (active/passive).

I know this thread is kind of old, but:
First, check that your VMware cluster is reachable:
# fence_vmware_soap -a my_host_ip -l my_user -p my_pw --ssl -z -v -o list
I don't know how to do it with Pacemaker, but the solution without it is to change the following in your cluster.conf:
<clusternodes>
  <clusternode name="n1" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device name="vmwarefence" port="rhel5rhcs-node1"
                uuid="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
  <fencedevice agent="fence_vmware_soap" ipaddr="xxx.xxx.xxx.xxx"
               login="root" name="vmwarefence" passwd="pwd" ssl="1"/>
</fencedevices>
Afterwards, you can check the cluster status with:
# clustat

Disconnecting and reconnecting nvme

Is there a way within Amazon/CentOS/Linux to swap the ordering of Nitro disks?
I have an AMI which consistently has devices in the incorrect order; by this I mean nvme1n1 and nvme2n1 should be switched round. If I run nvme id-ctrl -v /dev/nvme1n1 | grep sn, I get a different serial number back following a reboot. I know they're "wrong" because the serial numbers are not reflective of their capacity. Hope that makes sense (I appreciate it's a bit confusing). This only ever occurs on servers with two or more disks; upon a reboot the disks are "correct".
My question is: is there a method of forcing the NVMe devices to disconnect and reconnect (in the hope that the mapping then comes up in the correct order)?
Thanks guys
Amazon Linux version 2017.09.01 and later contains scripts and a udev rule that automatically maps NVMe devices to /dev/xvd?. It is very briefly mentioned in the documentation, but there is not much information there.
You can obtain a copy by launching the Amazon Linux AMI, but there are also other places on the web where they have been posted. For example, I found this gist.
Very simple in the end:
echo 1 > /sys/bus/pci/devices/$(readlink -f /sys/class/nvme/nvme1 | awk -F "/" '{print $5}')/remove
echo 1 > /sys/bus/pci/devices/$(readlink -f /sys/class/nvme/nvme2 | awk -F "/" '{print $5}')/remove
echo 1 > /sys/bus/pci/rescan
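The awk step in the commands above picks the fifth "/"-separated field of the resolved sysfs path, which is the PCI address only when the controller sits directly on the root bus. A minimal Python sketch of the same lookup, taking the directory just above "nvme" instead of a fixed field position (the function name and the sample path are made up for illustration, not taken from a real instance):

```python
def pci_address_from_sysfs_path(resolved_path: str) -> str:
    """Return the PCI address of the device that owns an NVMe controller.

    `resolved_path` is what `readlink -f /sys/class/nvme/nvmeN` prints,
    assumed here to end in .../<pci-address>/nvme/nvmeN.
    """
    parts = resolved_path.rstrip("/").split("/")
    if len(parts) < 3 or parts[-2] != "nvme":
        raise ValueError(f"unexpected sysfs path: {resolved_path!r}")
    return parts[-3]


# Hypothetical resolved path, for illustration only.
sample = "/sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1"
print(pci_address_from_sysfs_path(sample))
```

Writing 1 to /sys/bus/pci/devices/<that address>/remove and then to /sys/bus/pci/rescan is what the shell commands do; both need root.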

Input / Output error when using HDFS NFS Gateway

Getting "Input / output error" when trying to work with files in a mounted HDFS NFS Gateway. This is despite having set dfs.namenode.accesstime.precision=3600000 in Ambari. For example, doing something like...
$ hdfs dfs -cat /hdfs/path/to/some/tsv/file | sed -e "s/$NULL_WITH_TAB/$TAB/g" | hadoop fs -put -f - /hdfs/path/to/some/tsv/file
$ echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" /nfs/hdfs/path/to/some/tsv/file)"
Running the above to remove nulls from a TSV and then inspect that TSV for nulls via the NFS location throws the error, but I am seeing it in many other places too (again, I already have dfs.namenode.accesstime.precision=3600000). Does anyone have any ideas why this may be happening, or debugging suggestions? Can anyone explain what exactly "access time" is in this context?
From discussion on the apache hadoop mailing list:
I think access time refers to the POSIX atime attribute for files, the “time of last access” as described here for instance (https://www.unixtutorial.org/atime-ctime-mtime-in-unix-filesystems). While HDFS keeps a correct modification time (mtime), which is important, easy and cheap, it only keeps a very low-resolution sense of last access time, which is less important, and expensive to monitor and record, as described here (https://issues.apache.org/jira/browse/HADOOP-1869) and here (https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time).
However, to have a conforming NFS API, you must present atime, and so the HDFS NFS implementation does. But first you have to turn it on in the configuration. [...] many sites have been advised to turn it off entirely by setting it to zero, to improve overall HDFS performance. See for example here (https://community.hortonworks.com/articles/43861/scaling-the-hdfs-namenode-part-4-avoiding-performa.html, section "Don't let Reads become Writes"). So if your site has turned off atime in HDFS, you will need to turn it back on to fully enable NFS. Alternatively, you can maintain optimum efficiency by mounting NFS with the "noatime" option, as described in the document you reference.
[...] check under /var/log, e.g. with find /var/log -name '*nfs3*' -print
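To make the "low-resolution access time" idea concrete, here is a simplified Python model of how a precision window such as dfs.namenode.accesstime.precision behaves, as far as I understand it: the stored access time only moves forward when a read happens more than one window after the stored value, and a precision of 0 turns tracking off. This is an illustrative sketch, not HDFS code, and the function name is made up.

```python
def updated_access_time(stored_ms: int, now_ms: int, precision_ms: int) -> int:
    """Return the access time kept after a read at `now_ms` (simplified model)."""
    if precision_ms == 0:
        # Access-time tracking disabled: reads never update the stored value.
        return stored_ms
    if now_ms > stored_ms + precision_ms:
        # The read is outside the precision window, so the value is refreshed.
        return now_ms
    # Inside the window: the read does not turn into a metadata write.
    return stored_ms


HOUR_MS = 3_600_000  # the value used in the question
print(updated_access_time(0, 10 * 60 * 1000, HOUR_MS))  # read after 10 minutes
print(updated_access_time(0, 2 * HOUR_MS, HOUR_MS))     # read after 2 hours
print(updated_access_time(0, 2 * HOUR_MS, 0))           # tracking disabled
```

With a one-hour window, only the second read changes the stored value, which is why most reads stay cheap.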

How to share memory in cluster machine (qsub openmpi)

Dear all,
I have a question about sharing memory in a cluster. I am new to clusters and have failed to solve my problem after several weeks of trying, so I am looking for help here. Any suggestions would be appreciated!
I want to use SOAPdenovo, a program that has been used to assemble the human genome, to assemble my data. However, it failed in one step because of a shortage of memory (my machine has 512G of memory). So I turned to a cluster (which has three big nodes, each also with 512G of memory) and started learning to submit jobs with qsub. Since one node could not solve my problem, I googled and found that OpenMPI may help, but when I ran OpenMPI with demo data, it seemed to only run the command several times. Then I found that, to use OpenMPI, the software must be built against the OpenMPI library, and I don't know whether SOAPdenovo supports OpenMPI. I have asked the author but have not received an answer yet. Supposing SOAPdenovo supports OpenMPI, how should I solve my problem? If it doesn't support OpenMPI, can I use memory on different nodes to run the software?
This problem has tortured me so much; thanks for any help. Below is what I have done, plus some information about the cluster:
Install openmpi and submit the job
1) The script of job:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
export PATH=/tools/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH
soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
workPath="/NGS"
outputPath="assembly/soap/demo"
/tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
2) Submit the job:
qsub -pe orte 60 mpi.qsub
3) The log in ass.err
a) According to the log, it seems SOAPdenovo was run several times:
cat ass.err | grep "Pregraph" | wc -l
60
b) Detailed information
less ass.err (it seems SOAPdenovo was simply run several times, because when I run it on my own machine it outputs only one Pregraph):
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix
In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
and so on
c) Contents of stdout
cat ass.log:
--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.
Host: smp03
PID: 75035
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Information about cluster:
1) qconf -sql
all.q
smp.q
2) qconf -spl
mpi
mpich
orte
zhongxm
3) qconf -sp zhongxm
pe_name zhongxm
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
4) qconf -sq smp.q
qname smp.q
hostlist #smp.q
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
5) qconf -sq all.q
qname all.q
hostlist #allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 16,[c0219.local=32]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists mobile
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
According to https://hpc.unt.edu/soapdenovo the software doesn't support MPI:
This code is NOT compiled with MPI, and should only be used in parallel on a SINGLE node, via a threaded model.
So you can't just start the software with mpiexec on a cluster to get access to more memory. Cluster machines are connected by non-coherent networks (Ethernet, InfiniBand), which are slower than the memory bus, and the machines in a cluster do not share their memory. Clusters use MPI libraries (OpenMPI or MPICH) to communicate over the network, and all requests between nodes are explicit: the program calls MPI_Send in one process and MPI_Recv in another. There are also one-sided calls like MPI_Put/MPI_Get to access remote memory (RDMA, remote direct memory access), but this is not the same as local memory.
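The "explicit requests" point can be illustrated without MPI. The Python sketch below is only an analogy: it starts a second process with its own private memory and hands it data through a pipe, so nothing is shared unless it is sent. A real distributed program would use MPI_Send/MPI_Recv for this, and the program itself has to be written that way; the names WORKER_SOURCE and remote_sum are invented for the example.

```python
import json
import subprocess
import sys

# Code run by the separate worker process. It has its own memory and only
# sees data that arrives as an explicit message on its standard input.
WORKER_SOURCE = """
import json, sys
chunk = json.loads(sys.stdin.readline())   # explicit receive
print(json.dumps(sum(chunk)))              # explicit send of the result
"""


def remote_sum(chunk):
    """Send `chunk` to a separate process and receive its sum back."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps(chunk) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


data = list(range(10))
# The caller decides how to split the data and how to combine the answers.
total = remote_sum(data[:5]) + remote_sum(data[5:])
print(total)
```

A program that was not written to split its work and exchange messages like this simply runs as N independent copies under mpirun, which matches the 60 "Pregraph" entries in the log.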
osgx, thank you very much for your reply, and sorry for the delay in responding.
Since I did not major in computer science, I don't understand some of the terminology very well, such as ELF. So I have some new questions, listed below. Thanks in advance for your help:
1) When I ran "ldd SOAPdenovo-63mer", it output "not a dynamic executable". Does this mean "the code is not compiled with MPI", as you mentioned?
2) In short, I can't solve the problem with the cluster, and I have to look for a machine with more than 512G of memory?
3) I also used another program called ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/), which also failed because of a shortage of memory. According to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), what does "it uses Shared Memory Parallelization" mean? Does it mean it can use memory across the cluster, or only memory within one node, so that I have to find a machine with enough memory?
C1. Can I run ALLPATHS-LG on a cluster?
You can, but it will only use one machine, not the entire cluster. That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
By the way, this is the first time I have posted here. I think I should have used a comment to reply, but since there are so many words, I used "Answer Your Question".

Glusterd, One of the bricks contain the other

When configuring Glusterd I get the following error
'One of the bricks contain the other'
When executing the command
gluster volume create slitaz-volume replica 2 \
    192.168.56.101:/mnt/data 192.168.56.102:/mnt/data
I found a suggested fix using getfattr and setfattr, but when I run the setfattr command it answers with 'No such attribute'.
Make sure the UUIDs of the Gluster nodes are different. Look at /var/lib/glusterd/glusterd.info for the UUID. If they are the same, Gluster won't work. Stop the service, delete that file, and start glusterd again. That will fix your issue.
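The UUID comparison can be scripted once the contents of each node's glusterd.info have been collected. A minimal Python sketch, assuming the file contains a line of the form UUID=<value>; the sample contents and function names below are made up for illustration:

```python
from collections import defaultdict


def parse_uuid(info_text: str) -> str:
    """Extract the value of the UUID= line from glusterd.info contents."""
    for line in info_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "UUID":
            return value.strip()
    raise ValueError("no UUID= line found")


def duplicate_uuids(info_by_host: dict) -> dict:
    """Map each UUID that appears on more than one host to those hosts."""
    hosts_by_uuid = defaultdict(list)
    for host, text in info_by_host.items():
        hosts_by_uuid[parse_uuid(text)].append(host)
    return {uuid: hosts for uuid, hosts in hosts_by_uuid.items() if len(hosts) > 1}


# Made-up file contents; on a real node, read /var/lib/glusterd/glusterd.info.
samples = {
    "192.168.56.101": "UUID=11111111-2222-3333-4444-555555555555\noperating-version=30712\n",
    "192.168.56.102": "UUID=11111111-2222-3333-4444-555555555555\noperating-version=30712\n",
}
print(duplicate_uuids(samples))
```

An empty result means every node reported a distinct UUID; any entry in the result lists the nodes that need the stop, delete, start treatment described above.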

How to detect non-busy machines over a LAN automatically?

I'm writing an MPI program to be run over a local area network. These machines can be ssh'd to by any student at any time.
Although I always test my program at night, the performance has been very inconsistent. My guess is that some nodes were busy when I ran the program.
So my question is: can I write a script to detect non-busy machines and update the machine file? What's an easy way to write it?
Thanks a lot.
SSH into each machine, then read the /proc/loadavg file or determine the "busyness" in some other way.
I think the easiest way would be installing the check_load[1] script from Nagios on every node you want to check and calling it via ssh with some sensible parameters:
# /usr/lib64/nagios/plugins/check_load -w 1,2,3 -c 3,4,5
OK - load average: 0.20, 0.43, 0.50|load1=0.200;1.000;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.1,2,3 -c 3,4,5
WARNING - load average: 0.18, 0.43, 0.50|load1=0.180;0.100;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.01,2,3 -c 0.1,4,5
CRITICAL - load average: 0.41, 0.46, 0.51|load1=0.410;0.010;0.100;0; load5=0.460;2.000;4.000;0; load15=0.510;3.000;5.000;0;
CRITICAL would mean "really busy", WARNING could be "is kinda busy" and OK would mean "the machine is idle".
You have to pay attention to the thresholds you give for the 1/5/15-minute load averages for warning and critical; for instance, a machine with 16 cores having a load of 3 is perfectly OK, while a load of 3 on a single-core machine would mean it's really, really busy.
Good luck!
Alex.
[1] http://nagiosplugins.org/man/check_load
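Both suggestions above come down to: read each machine's load, scale it by the number of cores, and keep the hosts under a threshold. A minimal Python sketch of that selection, assuming key-based ssh login; the host names, the 0.25 load-per-core threshold, and the function names are arbitrary choices for illustration:

```python
import subprocess


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from /proc/loadavg contents."""
    return float(text.split()[0])


def pick_idle_hosts(samples: dict, max_load_per_core: float = 0.25) -> list:
    """`samples` maps host -> (loadavg text, core count); return idle hosts."""
    idle = []
    for host, (loadavg_text, cores) in samples.items():
        if parse_loadavg(loadavg_text) / cores <= max_load_per_core:
            idle.append(host)
    return sorted(idle)


def probe(host: str, timeout_s: int = 5):
    """Fetch (loadavg text, core count) from `host` over ssh."""
    out = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout_s}",
         host, "cat /proc/loadavg; nproc"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return out[0], int(out[1])


# Made-up readings; in practice build this dict with probe() for each host.
samples = {
    "node01": ("0.20 0.43 0.50 1/123 4567", 4),
    "node02": ("3.00 2.50 2.00 5/200 8901", 1),
    "node03": ("3.00 2.50 2.00 5/200 8901", 16),
}
print("\n".join(pick_idle_hosts(samples)))
```

The printed names, one per line, can be written to the machine file before launching the MPI job. Note that node03 passes with a load of 3 because it has 16 cores, while node02 with the same load on one core does not.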
