I am using Ganglia 3.6.0 for monitoring. I have an application that collects and aggregates some metrics for all hosts in the cluster and then sends them to gmond. The application runs on host1.
The problem is that with spoof = false, Ganglia ends up treating these as metrics that come only from host1, when in fact they are generated on host1 on behalf of all hosts in the cluster.
With spoof = true, I expected gmond to accept the host name I specified, but gmond doesn't accept the metrics at all; they are not even shown for host1.
The code I use is copied from GangliaSink (from hadoop-common), which uses the Ganglia 3.1.x format.
xdr_int(128); // metric_id = metadata_msg
xdr_string(getHostName()); // hostname
xdr_string(name); // metric name
xdr_int(1); // spoof = True
xdr_string(type); // metric type
xdr_string(name); // metric name
xdr_string(gConf.getUnits()); // units
xdr_int(gSlope.ordinal()); // slope
xdr_int(gConf.getTmax()); // tmax, the maximum time between metrics
xdr_int(gConf.getDmax()); // dmax, the lifetime of the metric in seconds
xdr_int(1); /* number of entries in the extra_value field for Ganglia 3.1.x */
xdr_string("GROUP"); /*Group attribute*/
xdr_string(groupName); /*Group value*/
// send the metric to Ganglia hosts
emitToGangliaHosts();
// Now we send out a message with the actual value.
// Technically, we only need to send out the metadata message once for
// each metric, but I don't want to have to record which metrics we did and
// did not send.
xdr_int(133); // we are sending a string value
xdr_string(getHostName()); // hostName
xdr_string(name); // metric name
xdr_int(1); // spoof = True
xdr_string("%s"); // format field
xdr_string(value); // metric value
// send the metric to Ganglia hosts
emitToGangliaHosts();
I do specify the host name for each metric, but gmond does not seem to use or recognize it.
Solved this... it was a hostName format issue. When spoofing, the host name needs to be of the form ip:hostname, e.g. 1.2.3.4:host0000001 (in practice any string:string works) :-)
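For illustration, a minimal sketch of how the spoofed host string could be built before it is written into the XDR messages (the ip and hostName variables here are assumptions, not part of the original GangliaSink code):
// Build the spoofed host string in the "ip:hostname" form gmond expects when spoof = 1.
String spoofedHost = ip + ":" + hostName; // e.g. "1.2.3.4:host0000001"
xdr_string(spoofedHost); // hostname field of both the metadata and the value message
xdr_int(1);              // spoof = True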
As per an earlier discussion (Defining multiple outputs in Logstash whilst handling potential unavailability of an Elasticsearch instance), I'm now using pipelines in Logstash to send input data (from Beats on TCP 5044) to multiple Elasticsearch hosts. The relevant extract from pipelines.yml is shown below.
- pipeline.id: beats
  queue.type: persisted
  config.string: |
    input {
      beats {
        port => 5044
        ssl => true
        ssl_certificate_authorities => '/etc/logstash/config/certs/ca.crt'
        ssl_key => '/etc/logstash/config/certs/forwarder-001.pkcs8.key'
        ssl_certificate => '/etc/logstash/config/certs/forwarder-001.crt'
        ssl_verify_mode => "force_peer"
      }
    }
    output { pipeline { send_to => [es100, es101] } }
- pipeline.id: es100
  path.config: "/etc/logstash/pipelines/es100.conf"
- pipeline.id: es101
  path.config: "/etc/logstash/pipelines/es101.conf"
In each of the pipeline .conf files I have the related virtual address i.e. the file /etc/logstash/pipelines/es101.conf includes the following:
input {
  pipeline {
    address => es101
  }
}
This configuration seems to work well, i.e. data is received by each of the Elasticsearch hosts es100 and es101.
I need to ensure that if one of these hosts is unavailable, the other still receives data, and thanks to a previous tip I'm now using pipelines, which I understand allow for this. However, I'm obviously missing something key in this configuration, as data isn't received by one host when the other is unavailable. Any suggestions are gratefully welcomed.
Firstly, you should configure persistent queues on the downstream pipelines (es100, es101), and size them to contain all the data that arrives during an outage. But even with persistent queues Logstash has an at-least-once delivery model: if the persistent queue fills up, then back-pressure will cause the beats input to stop accepting data. As the documentation on the output isolator pattern says, "If any of the persistent queues of the downstream pipelines ... become full, both outputs will stop".
If you really want to make sure an output is never blocked because another output is unavailable, then you will need to introduce some software with a different delivery model. For example, configure Filebeat to write to Kafka, then have two pipelines that read from Kafka and write to Elasticsearch. If Kafka is configured with an at-most-once delivery model (the default) then it will lose data if it cannot deliver it.
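As a sketch, the downstream pipeline entries in pipelines.yml could be given their own persistent queues along these lines (the queue size is an assumed placeholder; size it to cover your expected outage window):
- pipeline.id: es100
  queue.type: persisted
  queue.max_bytes: 4gb          # assumed value, not from the question
  path.config: "/etc/logstash/pipelines/es100.conf"
- pipeline.id: es101
  queue.type: persisted
  queue.max_bytes: 4gb
  path.config: "/etc/logstash/pipelines/es101.conf"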
I am having a hard time debugging why my metrics don't reach my Prometheus.
I am trying to send simple metrics to the OpenCensus agent and from there pull them with Prometheus, but my code does not fail and I can't see the metrics in Prometheus.
So, how can I debug this? Can I see whether a metric arrived at the OpenCensus agent?
OcAgentMetricsExporter.createAndRegister(
    OcAgentMetricsExporterConfiguration.builder()
        .setEndPoint("IP:55678")
        .setServiceName("my-service-name")
        .setUseInsecure(true)
        .build());

Aggregation latencyDistribution = Aggregation.Distribution.create(BucketBoundaries.create(
    Arrays.asList(0.0, 25.0, 100.0, 200.0, 400.0, 800.0, 10000.0)));

final Measure.MeasureDouble M_LATENCY_MS = Measure.MeasureDouble.create("latency", "The latency in milliseconds", "ms");
final Measure.MeasureLong M_LINES = Measure.MeasureLong.create("lines_in", "The number of lines processed", "1");
final Measure.MeasureLong M_BYTES_IN = Measure.MeasureLong.create("bytes_in", "The number of bytes received", "By");

View[] views = new View[]{
    View.create(View.Name.create("myapp/latency"),
        "The distribution of the latencies",
        M_LATENCY_MS,
        latencyDistribution,
        Collections.emptyList()),
    View.create(View.Name.create("myapp/lines_in"),
        "The number of lines that were received",
        M_LATENCY_MS,
        Aggregation.Count.create(),
        Collections.emptyList()),
};

// Ensure that they are registered so that measurements won't be dropped.
ViewManager manager = Stats.getViewManager();
for (View view : views)
    manager.registerView(view);

final StatsRecorder STATS_RECORDER = Stats.getStatsRecorder();
STATS_RECORDER.newMeasureMap()
    .put(M_LATENCY_MS, 17.0)
    .put(M_LINES, 238)
    .put(M_BYTES_IN, 7000)
    .record();
A good question; I've also struggled with this.
The Agent should provide better logging.
A Prometheus Exporter has been my general-purpose solution to this. Once configured, you can hit the agent's metrics endpoint to confirm that it's originating metrics.
If you're getting metrics, then the issue is with your Prometheus server's target configuration. You can check the server's targets endpoint to ensure it's receiving metrics from the agent.
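For reference, a minimal Prometheus scrape job pointing at the agent's Prometheus exporter might look like this (the job name and target address are assumptions; the port should match the exporter address in the agent config):
scrape_configs:
  - job_name: 'oc-agent'
    scrape_interval: 10s
    static_configs:
      - targets: ['AGENT_IP:9100']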
A second step is to configure zPages and then check /debug/rpcz and /debug/tracez to ensure that gRPC calls are hitting the agent.
E.g.
receivers:
  opencensus:
    address: ":55678"
exporters:
  prometheus:
    address: ":9100"
zpages:
  port: 9999
You should get data on:
http://[AGENT]:9100/metrics
http://[AGENT]:9999/debug/rpcz
http://[AGENT]:9999/debug/tracez
Another hacky solution is to circumvent the Agent and configure e.g. a Prometheus Exporter directly in your code, run it and check that it's shipping metrics (via /metrics).
Oftentimes you need to flush exporters.
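A minimal sketch of that in-process approach, assuming the opencensus-exporter-stats-prometheus artifact and the Prometheus simpleclient HTTPServer are on the classpath (the port is arbitrary):
import io.opencensus.exporter.stats.prometheus.PrometheusStatsCollector;
import io.prometheus.client.exporter.HTTPServer;

// Register the OpenCensus views with the default Prometheus registry ...
PrometheusStatsCollector.createAndRegister();
// ... and expose them on http://localhost:8889/metrics for a quick sanity check.
HTTPServer server = new HTTPServer(8889, true);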
I have Kapacitor 1.3.1 and InfluxDB 1.2.4 running on my machine. Though I have enabled Kapacitor to send its stats, I don't see the _kapacitor database in InfluxDB.
What am I missing here?
kapacitor.config:
hostname = "localhost"
[stats]
# Emit internal statistics about Kapacitor.
# To consume these stats create a stream task
# that selects data from the configured database
# and retention policy.
#
# Example:
# stream|from().database('_kapacitor').retentionPolicy('autogen')...
#
enabled = true
stats-interval = "10s"
database = "_kapacitor"
retention-policy= "autogen"
[[influxdb]]
# Connect to an InfluxDB cluster
# Kapacitor can subscribe, query and write to this cluster.
# Using InfluxDB is not required and can be disabled.
enabled = true
default = true
name = "localhost"
urls = ["http://localhost:8086"]
username = ""
password = ""
timeout = 0
Q: What am I missing here?
A: You got the first step right by enabling the stats functionality in Kapacitor. The next thing you need to do is bounce the Kapacitor engine, so that stats start getting written into its internal database periodically.
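For example, restarting the service picks up the new [stats] settings (assuming a systemd-managed install; adjust for your init system):
sudo systemctl restart kapacitor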
Now the catch is that you'll also need to define a TICK script to pull the stats out of Kapacitor's internal database; from there you can choose what to do with them, e.g. manipulate the data and write it back to InfluxDB, or raise alerts.
Example:
var data = stream
    |from()
        .database('_kapacitor')
        .retentionPolicy('autogen')

data
    |log()
        .prefix('Kapacitor stat =>')
Once you've got your TICK script going, you'll have to do the usual: install it into Kapacitor and then enable it.
kapacitor define test -type stream -tick test.tick -dbrp _kapacitor.autogen
There is a catch here: you need to specify the database and retention policy that you set in the config, otherwise Kapacitor won't know where to look for the data. In this case that is _kapacitor.autogen.
test stream disabled false ["_kapacitor"."autogen"]
Next you enable the stream task.
kapacitor enable test
Output:
[test:log2] 2017/07/26 00:49:21 I! Kapacitor stat =>
{"Name":"ingress","Database":"_kapacitor","RetentionPolicy":"autogen","Group":"","Dimensions":{"ByName":false,"TagNames":null},"Tags":{"cluster_id":"c80d02c0-8c51-4071-8904-1583164e90ec","database":"_internal","host":"kapacitor_stoh","measurement":"tsm1_cache","retention_policy":"monitor","server_id":"82a2d589-db45-4cc5-81b0-674cb80737ac","task_master":"main"},"Fields":{"points_received":4753},"Time":"2017-07-26T00:49:21.75615995Z"}
I have ZooKeeper and a single Kafka broker running, and I want to collect metrics with Metricbeat, index them with Elasticsearch and display them with Kibana.
However, Metricbeat only gets data from the partition metricset, and nothing comes from the consumergroup metricset.
Since the kafka module is defined with a period in metricbeat.yml, shouldn't it send some data on its own, rather than just waiting for user interaction (e.g. a write to a topic)?
To make sure, I tried to create a consumer group and to write to and consume from a topic, but still no data was collected by the consumergroup metricset.
consumergroup is defined in both metricbeat.template.json and metricbeat.template-es2x.json.
While metricbeat.full.yml is completely commented out, this is my kafka module definition in metricbeat.yml:
- module: kafka
  metricsets: ["partition", "consumergroup"]
  enabled: true
  period: 10s
  hosts: ["localhost:9092"]
  client_id: metricbeat1
  retries: 3
  backoff: 250ms
  topics: []
In Metricbeat's logs directory, lines like this show up:
INFO Non-zero metrics in the last 30s:
libbeat.es.published_and_acked_events=109
libbeat.es.publish.write_bytes=88050
libbeat.publisher.messages_in_worker_queues=109
libbeat.es.call_count.PublishEvents=5
fetches.kafka-partition.events=106
fetches.kafka-consumergroup.success=2
libbeat.publisher.published_events=109
libbeat.es.publish.read_bytes=2701
fetches.kafka-partition.success=2
fetches.zookeeper-mntr.events=3
fetches.zookeeper-mntr.success=3
With ZooKeeper's mntr and Kafka's partition, I can see events= and success= values, but for consumergroup there is only success. It looks like no events are fired.
partition and mntr data are properly visible in Kibana, while consumergroup is missing.
The data stored in Elasticsearch is not human-readable (there are some internal strings used for directory names), and the logs do not contain any useful information.
Can anybody help me understand what is going on and fix it (probably Metricbeat) so that it sends data to Elasticsearch? Thanks :)
You need to have an active consumer consuming from the topics for the consumergroup metricset to generate events.
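For a quick test, you could keep a console consumer attached to a topic so that an active consumer group exists while Metricbeat polls (the topic and group names here are placeholders; the --group flag requires Kafka 0.11 or later):
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic test-topic --group metricbeat-test --from-beginning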
I have a Hadoop cluster with 7 nodes: 1 master and 6 core nodes. Ganglia is set up on each machine, and the web front end correctly shows 7 hosts.
But it only shows metrics from the master node (which runs both gmetad and gmond). The other nodes have the same gmond.conf file as the master node, and the web front end clearly sees the nodes. I don't understand how Ganglia can recognize 7 hosts but only show metrics from the box with gmetad.
Any help would be appreciated. Is there a quick way to see if those nodes are even sending data? Or is this a networking issue?
Update #1: when I telnet to port 8649 on a gmond host that is not the master node, I see the XML but no data. When I telnet to 8649 on the master machine, I see XML and data. Any suggestions on where to go from here?
Set this in the gmond.conf file of every node you want to monitor:
send_metadata_interval = 15   # or similar
Now all the nodes and their metrics are shown on the master (gmetad).
This extra configuration is necessary if you are running in unicast mode, i.e., if you are specifying a host in udp_send_channel rather than mcast_join. In multicast mode the gmond daemons can query each other at any time, so proactively sending monitoring data is not required.
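For clarity, the setting lives in the globals section of gmond.conf; a sketch (the 15-second value is just an example):
globals {
  # ... other global settings unchanged ...
  send_metadata_interval = 15   # resend metadata every 15 seconds (needed for unicast)
}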
In the gmond configuration, ensure the following is all provided:
cluster {
  name = "my cluster"         ## Cluster name - is this the same name as given in gmetad conf?
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  #mcast_join = 239.2.11.71   ## Comment this
  host = 192.168.1.10         ## IP address/hostname of gmetad node
  port = 8649
  ttl = 1
}

/* comment out this block itself
udp_recv_channel {
  ...
}
*/

tcp_accept_channel {
  port = 8649
}
Save and quit, then restart your gmond daemon. Then run "nc localhost 8649" (netcat). Are you able to see XML with metrics now?
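As a quick way to confirm whether a node is reporting actual metric data rather than just the XML skeleton, something like this could be run from the gmetad node (the hostname is a placeholder):
nc node01 8649 | grep -c '<METRIC'
# a non-zero count means node01 is publishing metric values on port 8649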