Ganglia is missing some metrics on some nodes

I use Ganglia to monitor performance-related metrics on cluster nodes, and I installed the gmond Python modules for richer functionality.
However, some metrics from some nodes are missing (e.g. disk_*_read_bytes_per_sec).
A few nodes report the metrics as expected, but others are missing disk_*_read_bytes_per_sec, disk_*_write_bytes_per_sec, or both.
If I restart the gmond daemon, some nodes start working correctly again and some again do not.
I checked the /etc/ganglia/gmond.conf and /etc/ganglia/conf.d/* configuration files. All the compute nodes in the cluster have exactly the same configuration settings, so how can they behave so differently? Where should I look first to resolve the problem?
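One way to narrow this down is to compare what each gmond is actually reporting against what gmetad aggregates. A diagnostic sketch (port 8649 is the default tcp_accept_channel; the log path varies by distro, so both are assumptions to adjust for your setup):

```shell
# Dump a node's gmond metric XML and list the disk metrics it currently exports
nc localhost 8649 | grep -o 'NAME="disk_[^"]*"' | sort -u

# Check whether the Python metric module is loading or throwing errors
# (gmond logs via syslog; the file name depends on your distribution)
grep -i gmond /var/log/messages | grep -i disk
```

Running the first command on a "good" node and a "bad" node shows whether the metric is missing at the source or lost in aggregation.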
Thanks

Related

Setting up a Sensu-Go cluster - cluster is not synchronizing

I'm having an issue setting up my cluster according to the documents, as seen here: https://docs.sensu.io/sensu-go/5.5/guides/clustering/
This is a non-https setup to get my feet wet, I'm not concerned with that at the moment. I just want a running cluster to begin with.
I've set up sensu-backend on my three nodes and configured backend.yml accordingly on all three through an Ansible playbook. However, my cluster does not discover the other two nodes. It simply shows the following:
For backend1:
=== Etcd Cluster ID: 3b0efc7b379f89be
ID                 Name       Peer URLs               Client URLs
────────────────   ────────   ─────────────────────   ─────────────────────
8927110dc66458af   backend1   http://127.0.0.1:2380   http://localhost:2379
For backend2 and backend3, it's the same, except it shows those individual nodes as the only nodes in their cluster.
I've tried both the configuration in the docs, as well as the configuration in this git issue: https://github.com/sensu/sensu-go/issues/1890
None of these have panned out for me. I've ensured all the ports are open, so that's not an issue.
When I do a manual sensuctl cluster member-add X X, I get an error message and it results in the sensu-backend process failing. I can't remove the member, either, because it causes the entire process to not be able to start. I have to revert to an earlier snapshot to fix it.
The configs on all machines are the same, except that the IPs and names are adjusted for each machine:
etcd-advertise-client-urls: "http://XX.XX.XX.20:2379"
etcd-listen-client-urls: "http://XX.XX.XX.20:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-initial-cluster: "backend1=http://XX.XX.XX.20:2380,backend2=http://XX.XX.XX.31:2380,backend3=http://XX.XX.XX.32:2380"
etcd-initial-advertise-peer-urls: "http://XX.XX.XX.20:2380"
etcd-initial-cluster-state: "new" # have also tried existing
etcd-initial-cluster-token: ""
etcd-name: "backend1"
Did you find the answer to your question? I saw that you posted over on the Sensu forums as well.
In any case, the easiest thing to do here would be to stop the cluster, blow away /var/lib/sensu/sensu-backend/etcd/, and reconfigure the cluster. The behavior you're seeing suggests the cluster members were each started individually first; that is the likely cause of the issue, and it is why the etcd directory needs to be blown away.
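That reset can be sketched as follows (paths per the default Sensu Go packages; adjust for your install):

```shell
# On EVERY backend node: stop the service and wipe the embedded etcd state
sudo systemctl stop sensu-backend
sudo rm -rf /var/lib/sensu/sensu-backend/etcd/

# Then start all three backends at (roughly) the same time so they
# bootstrap together as one cluster instead of three single-node clusters
sudo systemctl start sensu-backend

# Verify the members found each other
sensuctl cluster member-list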

Ganglia data isn't stored as expected

I am trying to set up Ganglia in order to monitor Spark on our cluster.
So far I have installed gmond and gmetad on my master server, and gmond on one of my slaves.
My problem is that I can only see one node on my ganglia web frontend.
I have checked the /var/lib/ganglia/rrds folder, where the RRD files are created, and I see that both servers write into a folder with the same name: ip-10-0-0-58.ec2.internal.
How can I instruct Ganglia to write its data to different folders, in order to differentiate between the nodes?
If any info is missing, I will gladly supply it.
Thanks for the help,
Yaron.
In the end, the problem was solved by removing the bind value from the udp_recv_channel part of the gmond.conf on the master.
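For reference, the relevant stanza in gmond.conf looks something like this; the fix was deleting the `bind` line so gmond accepts packets from all sources (the address and port here are illustrative):

```
udp_recv_channel {
  # bind = 10.0.0.58    <- removing this line fixed the aggregation
  port = 8649
}
```

With `bind` removed, packets arriving from the slave's gmond are no longer dropped, so each node shows up under its own hostname.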

Could not determine the current leader

I'm in a situation where I have two masters and four slaves in Mesos, all running fine. But when I try to access Marathon, I get the 'Could not determine the current leader' error. I have Marathon on both masters (117 and 115).
This is basically what I'm running to get marathon up:
java -jar ./bin/../target/marathon-assembly-0.11.0-SNAPSHOT.jar --master 172.16.50.117:5050 --zk zk://172.16.50.115:2181,172.16.50.117:2181/marathon
Could anyone shed some light over this?
First, I would double-check that you're able to talk to Zookeeper from the Marathon hosts.
Next, there are a few related points to be aware of:
Per the Zookeeper administrator's guide (http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkMulitServerSetup) you should have an odd number of Zookeeper instances for HA. A cluster size of two is almost certainly going to turn out badly.
For a highly available Mesos cluster, you should run an odd number of masters and also make sure to set the --quorum flag appropriately based on that number. See the details of how to set the --quorum flag (and why it's important) in the operational guide on the Apache Mesos website here: http://mesos.apache.org/documentation/latest/operational-guide
In a highly-available Mesos cluster (#masters > 1) you should let both the Mesos agents and the frameworks discover the leading master using Zookeeper. This lets them rediscover the leading master in case a failover occurs. In your case assuming canonical ZK ports you would set the --zk flag on the Mesos masters to --zk=zk://172.16.50.117:2181,172.16.50.115:2181/mesos (add a third ZK instance, see the first point above). The same value should be used for the --master flags in both the Mesos agents and Marathon, instead of specifying a single master.
It's best to run an odd number of masters in your cluster. To do so, either add another master so you have three or remove one so you have only one.
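Applied to the command in the question, the advice boils down to pointing both --master and --zk at the Zookeeper ensemble rather than at a single master (the third ZK host below is a placeholder for the additional instance you would add to reach an odd ensemble size):

```shell
java -jar ./bin/../target/marathon-assembly-0.11.0-SNAPSHOT.jar \
  --master zk://172.16.50.115:2181,172.16.50.117:2181,<third-zk-host>:2181/mesos \
  --zk zk://172.16.50.115:2181,172.16.50.117:2181,<third-zk-host>:2181/marathon
```

With --master set to the zk:// URL, Marathon discovers whichever Mesos master is currently leading instead of failing when the hard-coded one loses leadership.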

monitoring hadoop cluster with ganglia

I'm new to Hadoop and am trying to monitor a multi-node cluster using Ganglia. gmond is set up on all nodes and the Ganglia monitor only on the master. However, there are Hadoop metrics graphs only for the master node, and just system metrics for the slaves. Do the Hadoop metrics on the master include the slave metrics as well, or is there a mistake in the configuration files? Any help would be appreciated.
I think you should read this in order to understand how metrics flow between master and slave.
However, in brief: in general, Hadoop-based or HBase-based metrics are emitted directly to the master server (by master server, I mean the server on which gmetad is installed). All OS-related metrics, by contrast, are first collected by the gmond installed on the corresponding slave and then forwarded to the gmond installed on the master server.
So, if you are not getting any OS-related metrics from the slave servers, there is some misconfiguration in your gmond.conf. To learn more about how to configure Ganglia, please read this; it helped me and should help you too if you go through it carefully.
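The direct emission of Hadoop metrics described above is configured per daemon in hadoop-metrics2.properties via the Ganglia sink; a sketch (the host name and period are illustrative, and the sink class version must match your Ganglia release):

```
# hadoop-metrics2.properties on every Hadoop daemon host
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
namenode.sink.ganglia.servers=master-host:8649
datanode.sink.ganglia.servers=master-host:8649
```

If this file only exists (or only lists the sink servers) on the master, that would explain seeing Hadoop graphs for the master alone.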
There is a mistake in your configuration files; more precisely, in how the data is transmitted/collected, whichever approach you use.

ElasticSearch replication

Can somebody provide some instructions on how to configure Elasticsearch for replication? I am running ES on Windows and understand that if I run the bat file multiple times on the same server, separate instances of ES are started and they all connect to each other.
I will be moving to a production environment soon and will have a three-node setup, each node on a different server. Can someone point me at some documentation that gives me a bit more control over the replication setup?
Have a look at the discovery documentation. It works out-of-the-box with multicast discovery, even though you could have problems with firewalls etc., but I would recommend against it in production. I would rather use unicast and configure the host names of the nodes belonging to the cluster in the elasticsearch.yml. That way you make sure nobody is going to join the production cluster from his own machine.
One other thing I would do is configure a proper cluster name specific for every environment.
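A minimal elasticsearch.yml sketch of that advice (the hostnames are placeholders, and the zen discovery setting names correspond to the Elasticsearch versions this question dates from):

```yaml
# elasticsearch.yml on every node
cluster.name: my-production-cluster            # unique name per environment
discovery.zen.ping.multicast.enabled: false    # disable multicast in production
discovery.zen.ping.unicast.hosts: ["es-node1", "es-node2", "es-node3"]
```

With unicast hosts listed explicitly and a distinct cluster name, only the intended machines can join the production cluster.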
Replication is configured per index in Elasticsearch, not per server or node. That is, each index can have a different replication setting; the default number of replicas is 1.
The number of replicas is not related or restricted to the number of nodes. If the number of replicas is greater than the number of data nodes, the index health merely turns yellow because some replicas cannot be allocated; everything else still works fine.
You can reference to the document for further information: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html
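For example, the replica count can be changed on a live index at any time through the update-settings API (the index name and host are illustrative):

```shell
# Raise the number of replicas for one index from the default 1 to 2
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '
{
  "index": { "number_of_replicas": 2 }
}'
```

If the cluster has fewer than three data nodes, this request still succeeds; the extra replica simply stays unassigned and the index health shows yellow, as described above.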
