I am having a hard time debugging why my metrics don't reach my Prometheus server.
I am trying to send simple metrics to the OpenCensus agent and from there have Prometheus pull them, but my code does not fail and I can't see the metrics in Prometheus.
So, how can I debug this? Can I see whether a metric arrived at the OpenCensus agent?
OcAgentMetricsExporter.createAndRegister(
    OcAgentMetricsExporterConfiguration.builder()
        .setEndPoint("IP:55678")
        .setServiceName("my-service-name")
        .setUseInsecure(true)
        .build());

Aggregation latencyDistribution = Aggregation.Distribution.create(BucketBoundaries.create(
    Arrays.asList(0.0, 25.0, 100.0, 200.0, 400.0, 800.0, 10000.0)));

final Measure.MeasureDouble M_LATENCY_MS = Measure.MeasureDouble.create("latency", "The latency in milliseconds", "ms");
final Measure.MeasureLong M_LINES = Measure.MeasureLong.create("lines_in", "The number of lines processed", "1");
final Measure.MeasureLong M_BYTES_IN = Measure.MeasureLong.create("bytes_in", "The number of bytes received", "By");

View[] views = new View[]{
    View.create(View.Name.create("myapp/latency"),
        "The distribution of the latencies",
        M_LATENCY_MS,
        latencyDistribution,
        Collections.emptyList()),
    View.create(View.Name.create("myapp/lines_in"),
        "The number of lines that were received",
        M_LINES,
        Aggregation.Count.create(),
        Collections.emptyList()),
};

// Ensure that the views are registered so
// that measurements won't be dropped.
ViewManager manager = Stats.getViewManager();
for (View view : views) {
    manager.registerView(view);
}

final StatsRecorder STATS_RECORDER = Stats.getStatsRecorder();
STATS_RECORDER.newMeasureMap()
    .put(M_LATENCY_MS, 17.0)
    .put(M_LINES, 238)
    .put(M_BYTES_IN, 7000)
    .record();
A good question; I've also struggled with this.
The Agent should provide better logging.
A Prometheus Exporter has been my general-purpose solution to this. Once configured, you can hit the agent's metrics endpoint to confirm that it's originating metrics.
If you're getting metrics there, then the issue is likely with your Prometheus server's target configuration. You can check the server's /targets page to ensure it's scraping the agent.
A second step is to configure zPages and then check /debug/rpcz and /debug/tracez to ensure that gRPC calls are hitting the agent.
E.g.
receivers:
  opencensus:
    address: ":55678"
exporters:
  prometheus:
    address: ":9100"
zpages:
  port: 9999
You should get data on:
http://[AGENT]:9100/metrics
http://[AGENT]:9999/debug/rpcz
http://[AGENT]:9999/debug/tracez
Another, hackier option is to bypass the Agent: configure e.g. a Prometheus Exporter directly in your code, run it, and check that it's shipping metrics (via /metrics).
Oftentimes you need to flush exporters (or keep the process alive long enough for the export interval to fire) before anything is actually sent.
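Your snippet is Java, but just to illustrate the idea, here is a minimal sketch of wiring a Prometheus exporter directly into application code using the OpenCensus Go libraries (the port and namespace are arbitrary; the Java equivalent is the PrometheusStatsCollector exporter):

package main

import (
    "log"
    "net/http"

    "contrib.go.opencensus.io/exporter/prometheus"
    "go.opencensus.io/stats/view"
)

func main() {
    // Register a Prometheus exporter alongside (or instead of) the agent exporter.
    pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "myapp"})
    if err != nil {
        log.Fatalf("failed to create the Prometheus stats exporter: %v", err)
    }
    view.RegisterExporter(pe)

    // The exporter is also an http.Handler that serves the /metrics page,
    // so you can scrape the application directly and check your views.
    http.Handle("/metrics", pe)
    log.Fatal(http.ListenAndServe(":8888", nil))
}

Once the application serves /metrics itself, you can tell immediately whether the problem is in your instrumentation or in the agent/Prometheus plumbing.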
Related
I have an app that monitors some external jobs (among other things). It does this monitoring every 5 minutes.
I'm trying to create a Prometheus gauge to track the count of currently running jobs.
Here is how I declared my gauge
JobStats = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace:   "myapi",
        Subsystem:   "app",
        Name:        "job_count",
        Help:        "Current running jobs in the system",
        ConstLabels: nil,
    },
    []string{"l1", "l2", "l3"},
)
In the code that actually counts the jobs I do:
metrics.JobStats.WithLabelValues(l1, l2, l3).Add(float64(jobs_cnt))
When I query the /metrics endpoint I get the number.
The thing is, this metric only keeps increasing. If I restart the app it resets to zero and then keeps increasing again.
I'm using Grafana to graph this in a dashboard.
My questions are:
How do I get the graph to show the actual number of running jobs (instead of an ever-increasing line)?
Should this be handled in code (like setting the gauge to zero before every collection) or in Grafana?
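For what it's worth, a gauge is meant to be set to the value you just measured rather than incremented: Add() piles each 5-minute sample on top of the previous one, which is exactly the ever-increasing line described above. A minimal sketch (same gauge shape as in the question; the helper function is made up):

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// JobStats has the same shape as the gauge declared above.
var JobStats = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "myapi",
        Subsystem: "app",
        Name:      "job_count",
        Help:      "Current running jobs in the system",
    },
    []string{"l1", "l2", "l3"},
)

// RecordJobCount overwrites the gauge with the latest measurement.
// Add() would accumulate every sample on top of the previous one;
// Set() replaces the stored value, so the graph tracks the real job count.
func RecordJobCount(l1, l2, l3 string, jobsCnt int) {
    JobStats.WithLabelValues(l1, l2, l3).Set(float64(jobsCnt))
}

With Set() there is nothing to fix on the Grafana side.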
I'm trying to modify prometheus mesos exporter to expose framework states:
https://github.com/mesos/mesos_exporter/pull/97/files
A bit about the Mesos exporter: it collects data from both the Mesos /metrics/snapshot endpoint and the /state endpoint.
The issue with the latter, both with the changes in my PR and with the existing metrics reported on slaves, is that the metrics it creates last forever (until the exporter is restarted).
So if, for example, a framework has completed, the metrics reported for this framework will be stale (e.g. they will still show the framework using CPU).
So I'm trying to figure out how I can clear those stale metrics. If I could just clear the entire mesosStateCollector each time before collect runs it would be awesome.
There is a delete method for the different Prometheus vectors (e.g. GaugeVec), but in order to delete a metric I need not only the label name but also the label value of the relevant series.
OK, it turned out to be easier than I thought (if only I had been familiar with Go before approaching this task).
Just type-assert the collector to *prometheus.GaugeVec and reset it:
prometheus.NewGaugeVec(prometheus.GaugeOpts{
    Help:      "Total slave CPUs (fractional)",
    Namespace: "mesos",
    Subsystem: "slave",
    Name:      "cpus",
}, labels): func(st *state, c prometheus.Collector) {
    c.(*prometheus.GaugeVec).Reset() // <-- added this for each GaugeVec
    for _, s := range st.Slaves {
        c.(*prometheus.GaugeVec).WithLabelValues(s.PID).Set(s.Total.CPUs)
    }
},
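For completeness, the delete method mentioned above is DeleteLabelValues, which removes a single labelled series but requires the exact label values, whereas Reset() wipes the whole vector before it is re-populated. A small standalone sketch (the label value is made up):

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
    cpus := prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Namespace: "mesos",
            Subsystem: "slave",
            Name:      "cpus",
            Help:      "Total slave CPUs (fractional)",
        },
        []string{"pid"},
    )
    cpus.WithLabelValues("slave(1)@10.0.0.1:5051").Set(4)

    // Remove one series; this needs the exact label values.
    cpus.DeleteLabelValues("slave(1)@10.0.0.1:5051")

    // Or drop every series at once before re-populating, as in the snippet above.
    cpus.Reset()
}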
Using ES v6.4.3
I'm getting a bunch of TransportService errors when writing a high volume of transactions. The exact error is:
StatusCodeError: 429 - {"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[instance-0000000002][10.44.0.71:19428][indices:data/write/bulk[s][p]]"}],"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TransportService$7#35110df8 on EsThreadPoolExecutor[name = instance-0000000002/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#3fd60b4f[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 1705133]]"},"status":429}
The general consensus seems to be bumping the queue_size so requests don't get dropped. As you can see in the error, my queue_size is the default 200 and it filled up. (I know simply bumping the queue_size is not a magic solution, but it is exactly what I need in this case.)
So, following this doc on how to change elasticsearch.yml settings, I try to add the queue_size bump here:
thread_pool.write.queue_size: 2000
And when I save I get this error:
'thread_pool.write.queue_size': is not allowed
I understand that the user override settings blacklist certain settings. So if my problem is truly that thread_pool.write.queue_size is blacklisted, how can I access my elasticsearch.yml file to change it?
Thank you!
We are using Apache JMeter 2.12 to measure the response time of our JMS queue. However, we would like to see how many of those requests take less than a certain time. According to the official JMeter site (http://jmeter.apache.org/usermanual/component_reference.html), this should be set by the Timeout property. You can see in the screenshot below what our configuration looks like:
However, setting the timeout does not result in an error after sending 100 requests. We can see that some of them apparently take more than that amount of time:
Is there some other setting I am missing or is there a way to achieve my goal?
Thanks!
The JMeter documentation for JMS Point-to-Point describes the timeout as
The timeout in milliseconds for the reply-messages. If a reply has not been received within the specified time, the specific testcase fails and the specific reply message received after the timeout is discarded. Default value is 2000 ms.
This times not the actual sending of the message but the receipt of a response.
The JMeter Point-to-Point sampler source checks whether you have a 'Receive Queue' configured. If you do, it goes through the FixedQueueExecutor path and uses the timeout value; otherwise it does not use the timeout value.
if (useTemporyQueue()) {
    executor = new TemporaryQueueExecutor(session, sendQueue);
} else {
    producer = session.createSender(sendQueue);
    executor = new FixedQueueExecutor(producer, getTimeoutAsInt(), isUseReqMsgIdAsCorrelId());
}
In your screenshot, 'JNDI name Receive Queue' is not defined, so it uses a temporary queue and does not use the timeout. Whether or not the timeout should be supported in this case is best discussed on the JMeter forum.
Alternatively, if you want to see request times in percentiles/buckets, please read this Stack Overflow Q&A:
I want to find out the percentage of HTTPS requests that take less than a second in JMeter
I am using Ganglia 3.6.0 for monitoring. I have an application that collects and aggregates some metrics for all hosts in the cluster, then sends them to gmond. The application runs on host1.
The problem is that when setting spoof = false, Ganglia thinks these are metrics that come only from host1. In fact, the metrics are generated by host1 but describe all hosts in the cluster.
When setting spoof = true, I expected gmond to accept the host name I specified, but gmond doesn't accept the metrics at all; they are not even shown for host1.
The code I use is copied from GangliaSink (from Hadoop Common), which uses the Ganglia 3.1.x format.
xdr_int(128); // metric_id = metadata_msg
xdr_string(getHostName()); // hostname
xdr_string(name); // metric name
xdr_int(1); // spoof = True
xdr_string(type); // metric type
xdr_string(name); // metric name
xdr_string(gConf.getUnits()); // units
xdr_int(gSlope.ordinal()); // slope
xdr_int(gConf.getTmax()); // tmax, the maximum time between metrics
xdr_int(gConf.getDmax()); // dmax, the maximum data value
xdr_int(1); /*Num of the entries in extra_value field for
Ganglia 3.1.x*/
xdr_string("GROUP"); /*Group attribute*/
xdr_string(groupName); /*Group value*/
// send the metric to Ganglia hosts
emitToGangliaHosts();
// Now we send out a message with the actual value.
// Technically, we only need to send out the metadata message once for
// each metric, but I don't want to have to record which metrics we did and
// did not send.
xdr_int(133); // we are sending a string value
xdr_string(getHostName()); // hostName
xdr_string(name); // metric name
xdr_int(1); // spoof = True
xdr_string("%s"); // format field
xdr_string(value); // metric value
// send the metric to Ganglia hosts
emitToGangliaHosts();
I did specify the host name for each metric, but it seems it is not used/recognized by gmond.
Solved this... It was a hostName format issue. The format needs to be like ip:hostname, e.g. 1.2.3.4:host0000001 (any string:string is fine) :-)