Ganglia seeing nodes but not metrics - hadoop

I have a Hadoop cluster with 7 nodes: 1 master and 6 core nodes. Ganglia is set up on each machine, and the web front end correctly shows 7 hosts.
But it only shows metrics from the master node (which runs both gmetad and gmond). The other nodes have the same gmond.conf as the master, and the web front end clearly sees them. I don't understand how Ganglia can recognize 7 hosts but only show metrics from the box running gmetad.
Any help would be appreciated. Is there a quick way to see if those nodes are even sending data? Or is this a networking issue?
Update #1: when I telnet to port 8649 on a gmond host that is not the master node, I see the XML but no data. When I telnet to 8649 on the master machine, I see XML and data. Any suggestions on where to go from here?

Set this in the gmond.conf of every node you want to monitor:
send_metadata_interval = 15 # or another small interval, in seconds
Now all of the nodes and their metrics are shown on the master (gmetad).
This extra configuration is necessary if you are running in unicast mode, i.e. if you specify a host in udp_send_channel rather than mcast_join. In multicast mode, the gmond daemons can query each other at any time, so proactively sending the metadata is not required.
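A quick way to confirm whether the other nodes are sending anything at all (a rough check, assuming the default port 8649 and the unicast setup described above) is to watch for incoming UDP traffic on the gmetad/master host and to dump the XML that a node's gmond serves over TCP:
sudo tcpdump -n -i any udp port 8649    # run on the gmetad/master host; you should see packets arriving from every node
nc <node-ip> 8649                       # dumps the XML that node's gmond serves; substitute a real node IP
If tcpdump shows no packets from a node, the problem is in the udp_send_channel/network path rather than in the web front end.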

In the gmond configuration, ensure all of the following is provided:
cluster {
  name = "my cluster"          ## Cluster name -- is this the same name as given in gmetad.conf?
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  #mcast_join = 239.2.11.71    ## Comment this out
  host = 192.168.1.10          ## IP address/hostname of the gmetad node
  port = 8649
  ttl = 1
}
/* Comment out this whole block:
udp_recv_channel {
  ...
}
*/
tcp_accept_channel {
  port = 8649
}
Save and quit, then restart your gmond daemon. Then run "nc localhost 8649" on each node (or "nc <node-ip> 8649" from another machine). Are you able to see XML with metrics now?
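As a follow-up sanity check (assuming the gmetad node is 192.168.1.10 as in the example above and its gmond keeps the tcp_accept_channel on 8649), you can count how many hosts and metrics that gmond is aggregating:
nc 192.168.1.10 8649 | grep -c '<HOST '     # number of hosts reporting in
nc 192.168.1.10 8649 | grep -c '<METRIC '   # total number of metrics collected
If the host count stays at 1, the UDP packets from the other nodes are not arriving, or send_metadata_interval was never set on them.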

Related

Consul Data Center: Leader node not automatically selected after failure of previous leader node

I'm new to Consul, and I have created one data center with 2 server nodes.
I followed the steps provided in this documentation:
https://learn.hashicorp.com/tutorials/consul/deployment-guide?in=consul/datacenter-deploy
The nodes are successfully created, and they are both in sync when I launch a service. Everything works fine up to this point.
However, I face an issue when the leader node fails (goes offline). In that case, the follower node DOES NOT automatically assume the role of leader, and Consul as a whole becomes inaccessible to the service. The follower node even stops responding to requests, although it is still running.
Can anyone help me understand what exactly is wrong with my setup, and how I can keep the setup working so that the follower node automatically becomes the leader and responds to queries from the API gateway?
The documentation below gives some pointers and talks about fulfilling a 'quorum' for automatic selection of a leader. I'm not sure whether it applies in my case:
https://learn.hashicorp.com/tutorials/consul/recovery-outage-primary?in=consul/datacenter-operations#outage-event-in-the-primary-datacenter
Edit:
consul.hcl
First Server:
datacenter = "dc1"
data_dir = "D:/Hashicorp/Consul/data"
encrypt = "<key>"
ca_file = "D:/Hashicorp/Consul/certs/consul-agent-ca.pem"
cert_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-0.pem"
key_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-0-key.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
retry_join = ["<ip1>", "<ip2>"]
Second Server:
datacenter = "dc1"
data_dir = "D:/Hashicorp/Consul/data"
encrypt = "<key>"
ca_file = "D:/Hashicorp/Consul/certs/consul-agent-ca.pem"
cert_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-1.pem"
key_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-1-key.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
retry_join = ["<ip1>", "<ip2>"]
server.hcl:
First Server:
server = true
bootstrap_expect = 2
client_addr = "<ip1>"
ui = true
Second Server:
server = true
bootstrap_expect = 2
client_addr = "<ip2>"
ui = true
The size of the cluster and its ability to form a quorum are absolutely applicable in this case. With 2 servers the quorum size is still 2 (a majority of 2), so when either server fails the remaining one cannot elect itself leader and stops serving requests. You will need a minimum of 3 server nodes in the cluster in order to tolerate the failure of one node without sacrificing the availability of the cluster.
I recommend reading Consul's Raft Protocol Overview as well as reviewing the deployment table at the bottom of that page to understand the failure tolerance provided by various cluster sizes.
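As a rough illustration of what a healthy 3-server setup looks like (these are standard Consul CLI commands; adding a third server and setting bootstrap_expect = 3 on every server is the assumed change), you can verify quorum and leadership from any server:
consul members                    # all three servers should be listed as alive
consul operator raft list-peers   # shows the voters and which one is currently the leader
With three voters the quorum size is 2, so the remaining two servers can elect a new leader whenever any single server goes down.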

Datastax - Cassandra Amazon EC2 Multiregion Setup - Cluster with 3 nodes

I have launched 3 Amazon EC2 instances and set up DataStax Cassandra as follows.
1. Region - US EAST:
cassandra.yaml configuration:
a. listen_address: the private IP of this instance
b. broadcast_address: the public IP of this instance
c. seeds: 50.XX.XX.X1, 50.XX.XX.X2 (public IPs of node1 and node2)
cassandra-rackdc.properties configuration:
dc=DC1
rack=RAC1
dc_suffix=US_EAST_1
2. Region - US WEST and 3. Region - EU IRELAND:
I did the same procedure as above.
The result of the above configuration: each node works fine individually, but when I run nodetool status on each of the three nodes, it lists only the local node.
I am trying to achieve the following:
1. Launch 3 Cassandra nodes in three different regions (US-EAST, US-WEST, EU-IRELAND), with the following configuration:
a. Ec2MultiRegionSnitch
b. Replication strategy: SimpleStrategy
c. Replication factor: 3
d. Read & write consistency level: QUORUM
I want to attain one thing: if any two of the regions (or any two of the nodes) are down, I can survive with the one remaining node.
My questions are: where did I make a mistake, and how can I attain my requirements?
Any help or input is much appreciated.
Thanks.
This is what worked for me with Cassandra 3.0:
endpoint_snitch: Ec2MultiRegionSnitch
listen_address: <leave_blank>
broadcast_address: <public_ip_of_server>
rpc_address: 0.0.0.0
broadcast_rpc_address: <public_ip_of_server>
- seeds: "one_ip_from_other_DC"   # under seed_provider -> parameters
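To verify the result (standard nodetool commands, nothing cluster-specific assumed), run on any node:
nodetool status            # every node should show state UN (Up/Normal), grouped by data center
nodetool describecluster   # all nodes should agree on a single schema version
If nodetool status still lists only the local node, gossip is not getting through, which usually points at the seeds list or at the inter-node port 7000/7001 being blocked between regions.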
Finally, I found the resolution to my issue. I am using SimpleStrategy as the replication strategy, so I do not need to configure cassandra-rackdc.properties.
Once I removed the cassandra-rackdc.properties file from all nodes, everything worked as expected.
Thanks

Unable to add another node to an existing node to form a cluster; couldn't change num_tokens to vnodes

I have installed Cassandra on two individual nodes, both on Amazon. When I try to configure the nodes to form a cluster, I receive the following error:
ERROR [main] 2016-05-12 11:01:26,402 CassandraDaemon.java:381 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Cannot change the number of tokens from 1 to 256.
I am using these settings in the cassandra.yaml file:
listen_address and rpc_address: the private IP address
seeds: the public IP [Elastic IP address]
num_tokens: 256
This message usually appears when num_tokens is changed after the node has been bootstrapped.
The solution is:
Stop Cassandra on all nodes
Delete the data directory (including the data files, commitlog and saved_caches)
Double-check that num_tokens is set to 256, initial_token is commented out, and auto_bootstrap is set to true (or left at its default) in cassandra.yaml
Start Cassandra on all nodes
This will wipe your existing cluster and cause the nodes to bootstrap from scratch again.
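A sketch of those steps as shell commands, assuming a package install with the default data locations under /var/lib/cassandra and the config at /etc/cassandra/cassandra.yaml (adjust the paths to match your data_file_directories):
sudo service cassandra stop                                   # on every node
sudo rm -rf /var/lib/cassandra/data \
            /var/lib/cassandra/commitlog \
            /var/lib/cassandra/saved_caches                   # wipes all local data
grep -E '^[# ]*(num_tokens|initial_token)' /etc/cassandra/cassandra.yaml   # confirm num_tokens: 256 and initial_token commented out
sudo service cassandra start                                  # on every node; auto_bootstrap defaults to true when absent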
Cassandra doesn't support changing between vnodes and static tokens after a datacenter is bootstrapped. If you need to change from vnodes to static tokens or vice versa in an already running cluster, you'll need to create a second datacenter using the new configuration, stream your data across, and then decommission the original nodes.
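A rough outline of that migration path, with a hypothetical keyspace name (my_ks) and data center names (DC1 is the old DC, DC2 the new one built with the desired token configuration):
# 1. Include the new data center in the keyspace's replication settings:
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};"
# 2. On each node in the new data center, stream the existing data across:
nodetool rebuild -- DC1
# 3. Once clients point at the new DC, drop the old DC from replication and retire its nodes:
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': 3};"
nodetool decommission    # run on each old node, one at a time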

ElasticSearch... Not So Elastic?

I have used the following method to build Elasticsearch clusters in the cloud. It works 30%-50% of the time.
I start with 2 CentOS nodes on 2 servers in DigitalOcean's cloud. I then install ES and set the same cluster name in each config/elasticsearch.yml. Then I also set (uncomment):
discovery.zen.ping.multicast.enabled: false
as well as set and uncomment:
discovery.zen.ping.unicast.hosts: ['192.168.10.1:9300', '192.168.10.2:9300']
on each of the 2 servers. (SO reference here.)
Then, to give ES the benefit of the doubt, I run service iptables stop and restart the Elasticsearch service on each node. Sometimes the servers see each other and I get a "cluster" out of Elasticsearch; most of the time they don't see each other, even though multicast is disabled and the unicast hosts array contains specific IP addresses that point to each other, with NO firewall in the way.
Why, ES community? Why does the hello-world equivalent of Elasticsearch prove to be inelastic, to say the least? (Let me openly and readily admit this MUST be user error, else no one would use this technology.)
At first I was trying to build a simple 4-node cluster, but the issues that came up before indexing a single document were ridiculous: I had a 0% success rate. Some nodes saw some other nodes (via head and paramedic) while others had 'dangling indices' and 'unassigned indexes'. When I googled this I found tons of relevant/similar issues and no workable answers.
Can someone send me an example of how to build an Elasticsearch cluster that works?
@Ben_Lim's answer: here it is spelled out for everyone who needs this as a resource.
I took 1 node (this is not for prod), Server1, and changed the following settings in config/elasticsearch.yml:
uncomment node.master: true
uncomment and set network.host: 192.XXX.1.10
uncomment transport.tcp.port: 9300
uncomment discovery.zen.ping.multicast.enabled: false
uncomment and set discovery.zen.ping.unicast.hosts: ["192.XXX.1.10:9300"]
That sets up the master. Then, on each subsequent node that wants to join:
uncomment node.master: false
uncomment and set network.host: 192.XXX.1.11
uncomment transport.tcp.port: 9301
uncomment discovery.zen.ping.multicast.enabled: false
uncomment and set discovery.zen.ping.unicast.hosts: ["192.XXX.1.10:9300"]
Obviously, make sure all nodes have the same cluster name and your iptables firewalls etc. are set up right.
NOTE AGAIN -- this is not for prod, but a way to start testing ES in the cloud; you can tighten up the screws from here.
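A quick way to check whether the two nodes actually formed a single cluster (using the placeholder IP from the example above and the default HTTP port 9200):
curl 'http://192.XXX.1.10:9200/_cluster/health?pretty'   # number_of_nodes should be 2 and status green/yellow
curl 'http://192.XXX.1.10:9200/_cat/nodes?v'             # lists both nodes and marks the elected master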
The most probable problem is that port 9300 is already used by another application, or the master node did not start on port 9300, so the nodes can't communicate with each other.
When you start 2 ES nodes to build a cluster, one node must be elected as the master. The master node will have a communication address of the form hostIP:port. For example:
[2014-01-27 15:15:44,389][INFO ][cluster.service ] [Vakume] new_master [Vakume][FRtqGG4xSKGbM_Yw9_oBLg][inet[/10.0.0.10:9302]], reason: zen-disco-join (elected_as_master)
When you start another node to join the cluster, you can try specifying the master's IP:port. For the example above, you would set:
discovery.zen.ping.unicast.hosts: ["10.0.0.10:9302"]
Then the second node can find the master node and join the cluster.
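To rule out the port problem described above, check what the transport layer actually bound and what the node logged at election time (the log path is the usual package location and may differ on your install):
netstat -tlnp | grep -E ':930[0-9]'                    # which 930x ports are listening, and by which process
grep elected_as_master /var/log/elasticsearch/*.log    # shows the hostIP:port the master advertised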

Ganglia spoof doesn't work when sending data to gmond

I am using Ganglia 3.6.0 for monitoring. I have an application that collects and aggregates metrics for all hosts in the cluster and then sends them to gmond. The application runs on host1.
The problem: with spoof = false, Ganglia ends up treating these as metrics that come only from host1. In fact, the metrics are generated on host1, but on behalf of all hosts in the cluster.
With spoof = true, I expect gmond to accept the host name I specify, but gmond doesn't accept the metrics at all. The metrics are not even shown on host1.
The code I use is copied from GangliaSink (from Hadoop Common), which uses the Ganglia 3.1.x format.
xdr_int(128); // metric_id = metadata_msg
xdr_string(getHostName()); // hostname
xdr_string(name); // metric name
xdr_int(1); // spoof = True
xdr_string(type); // metric type
xdr_string(name); // metric name
xdr_string(gConf.getUnits()); // units
xdr_int(gSlope.ordinal()); // slope
xdr_int(gConf.getTmax()); // tmax, the maximum time between metrics
xdr_int(gConf.getDmax()); // dmax, the lifetime of the metric in seconds (0 = never expire)
xdr_int(1); // number of entries in the extra_value field for Ganglia 3.1.x
xdr_string("GROUP"); /*Group attribute*/
xdr_string(groupName); /*Group value*/
// send the metric to Ganglia hosts
emitToGangliaHosts();
// Now we send out a message with the actual value.
// Technically, we only need to send out the metadata message once for
// each metric, but I don't want to have to record which metrics we did and
// did not send.
xdr_int(133); // we are sending a string value
xdr_string(getHostName()); // hostName
xdr_string(name); // metric name
xdr_int(1); // spoof = True
xdr_string("%s"); // format field
xdr_string(value); // metric value
// send the metric to Ganglia hosts
emitToGangliaHosts();
I did specify the host name for each metric, but it seems it is not used/recognized by gmond.
Solved this... it was a hostName format issue. When spoofing, the hostname needs to be of the form ip:hostname, e.g. 1.2.3.4:host0000001 (in practice any string:string is fine) :-)
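For reference, the same spoofing format can be exercised from the command line with gmetric before debugging your own XDR code (the metric name and value below are made up; --spoof takes the ip:hostname string described above):
gmetric --name=test_metric --value=42 --type=int32 --spoof="1.2.3.4:host0000001"
If this test metric shows up under host0000001 in the web UI, gmond is accepting spoofed data and the problem is in the packets your application builds.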
