Metrics not being reported, but JVm stats are - metrics

Hia All,
I have a service which is pushing metrics to Graphite. However, it is only pushing the JVM metrics, not the info metrics.
When I hit the URL:
http://myservice.com/mypath/info/metrics
I get what I expect, jvm and info metrics. i.e:
`{
"jvm":{
"vm":{
"name":"Java HotSpot(TM) 64-Bit Server VM",
"version":"1.6.0_12-b04"
},
... removed for brevity_
}
},
"info.application":{
"myMethod":{
"type":"counter",
"count":0
},
"doPost":{
... removed for brevity_
}
}
}`
I can see both the JVM metrics and the info metrics there, however, only the JVM metrics are showing in Graphite.
Does anyone have any ideas as to what is causing this?
All help gratefully received!
Thanks
Dave

Make sure you enabled the graphite reporter
GraphiteReporter.enable(metricsRegistry, 1, TimeUnit.MINUTES, graphiteHost, graphitePort, hostname);
And bound all the listeners to the registry.
bindListener(Matchers.any(), new MeteredListener(metricsRegistry));
bindListener(Matchers.any(), new TimedListener(metricsRegistry));
bindListener(Matchers.any(), new GaugeListener(metricsRegistry));

Related

Custom gauge for prometheus Go SDK

I have a APP that monitors some external jobs (among other things). This does this monitoring every 5 Mins
I'm trying to create a prometheus gauge to get the count of currently running jobs.
Here is how I declared my gauge
JobStats= promauto.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "myapi",
Subsystem: "app",
Name: "job_count",
Help: "Current running jobs in the system",
ConstLabels: nil,
},
[]string{"l1", "l2", "l3"},
)
in the code that actually counts the jobs I do
metrics.JobStats.WithLabelValues(l1,l2,l3).add(float64(jobs_cnt))
when I query the /metrics endpoint I get the number
The thing is, this metrics only keeps increasing. If I restart the app this get resets to zer & again keeps increasing
I'm using grafana to graph this in a dashboard.
My question is
Get the graph to show the actual number of jobs (instead of ever increasing line)?
Should this be handled in code (like setting this to zero before every collection?) or in grafana?

clear prometheus metrics from collector

I'm trying to modify prometheus mesos exporter to expose framework states:
https://github.com/mesos/mesos_exporter/pull/97/files
A bit about mesos exporter - it collects data from both mesos /metrics/snapshot endpoint, and /state endpoint.
The issue with the latter, both with the changes in my PR and with existing metrics reported on slaves, is that metrics created lasts for ever (until exporter is restarted).
So if for example a framework was completed, the metrics reported for this framework will be stale (e.g. it will still show the framework is using CPU).
So I'm trying to figure out how I can clear those stale metrics. If I could just clear the entire mesosStateCollector each time before collect is done it would be awesome.
There is a delete method for the different p8s vectors (e.g. GaugeVec), but in order to delete a metric, I need to not only the label name, but also the label value for the relevant metric.
Ok, so seems it was easier than I thought (if only I was familiar with go-lang before approaching this task).
Just need to cast the collector to GaugeVec and reset it:
prometheus.NewGaugeVec(prometheus.GaugeOpts{
Help: "Total slave CPUs (fractional)",
Namespace: "mesos",
Subsystem: "slave",
Name: "cpus",
}, labels): func(st *state, c prometheus.Collector) {
c.(*prometheus.GaugeVec).Reset() ## <-- added this for each GaugeVec
for _, s := range st.Slaves {
c.(*prometheus.GaugeVec).WithLabelValues(s.PID).Set(s.Total.CPUs)
}
},

How to debug opencensus metrics agent?

I am having a hard time debugging why my metrics don't reach my prometheus.
I am trying to send simple metrics to oc agent and from ther pull them with prometheus, but my code does not fail and I can't see the metrics in prometheus.
So, how can I debug it? Can I see if a metric arrived in the oc agent?
cAgentMetricsExporter.createAndRegister(
OcAgentMetricsExporterConfiguration.builder()
.setEndPoint("IP:55678")
.setServiceName("my-service-name")
.setUseInsecure(true)
.build());
Aggregation latencyDistribution = Aggregation.Distribution.create(BucketBoundaries.create(
Arrays.asList(
0.0, 25.0, 100.0, 200.0, 400.0, 800.0, 10000.0)));
final Measure.MeasureDouble M_LATENCY_MS = Measure.MeasureDouble.create("latency", "The latency in milliseconds", "ms");
final Measure.MeasureLong M_LINES = Measure.MeasureLong.create("lines_in", "The number of lines processed", "1");
final Measure.MeasureLong M_BYTES_IN = Measure.MeasureLong.create("bytes_in", "The number of bytes received", "By");
View[] views = new View[]{
View.create(View.Name.create("myapp/latency"),
"The distribution of the latencies",
M_LATENCY_MS,
latencyDistribution,
Collections.emptyList()),
View.create(View.Name.create("myapp/lines_in"),
"The number of lines that were received",
M_LATENCY_MS,
Aggregation.Count.create(),
Collections.emptyList()),
};
// Ensure that they are registered so
// that measurements won't be dropped.
ViewManager manager = Stats.getViewManager();
for (View view : views)
manager.registerView(view);
final StatsRecorder STATS_RECORDER = Stats.getStatsRecorder();
STATS_RECORDER.newMeasureMap()
.put(M_LATENCY_MS, 17.0)
.put(M_LINES, 238)
.put(M_BYTES_IN, 7000)
.record();
A good question; I've also struggled with this.
The Agent should provide better logging.
A Prometheus Exporter has been my general-purpose solution to this. Once configured, you can hit the agent's metrics endpoint to confirm that it's originating metrics.
If you're getting metrics, then the issue is with your Prometheus Server's target confguration. You can check the server's targets endpoint to ensure it's receiving metrics from the agent.
A second step is to configure zPages and then check /debug/rpcz and /debug/tracez to ensure that gRPC calls are hitting the agent.
E.g.
receivers:
opencensus:
address: ":55678"
exporters:
prometheus:
address: ":9100"
zpages:
port: 9999
You should get data on:
http://[AGENT]:9100/metrics
http://[AGENT]:9999/debug/rpcz
http://[AGENT]:9999/debug/tracez
Another hacky solution is to circumvent the Agent and configure e.g. a Prometheus Exporter directly in your code, run it and check that it's shipping metrics (via /metrics).
Oftentimes you need to flush exporters.

Any way to monitor Nifi Processors? Any Utility Dashboard?

If I have developed a NiFi flow and a support person wants to view what's the current state and which processor is currently running, which processor already ran, which ones completed?
I mean to say any dashboard kind of utility provided by NiFi to monitor activities ?
You can use the Reporting tasks and NiFi itself, or a new NiFi instance, that is what I choose.
To do that you must do the following:
Open the reporting task menu
And add the desired reporting tasks
And configure it properly
Then create a flow to manage the reporting data
In my case I am putting the information into an Elasticsearch
There are numerous ways to monitor NiFi flows and status. The status bar along the top of the UI shows running/stopped/invalid processor counts, and cluster status, thread count, etc. The global menu at the top right has options for monitoring JVM usage, flowfiles processed/in/out, CPU, etc.
Each individual processor will show a status icon for running/stopped/invalid/disabled, and can be right-clicked for the same JVM usage, flowfile status, etc. graphs as the global view, but for the individual processor. There are also some Reporting Tasks provided by default to integrate with external monitoring systems, and custom reporting tasks can be written for any other desired visualization or monitoring dashboard.
NiFi doesn’t have the concept of batch/job processing, so processors aren’t “complete”.
1. In built monitoring in Apache NiFi.
Bulletin Board
The bulletin board shows the latest ERROR and WARNING getting generated by NiFi processors in real time. To access the bulletin board, a user will have to go the right hand drop down menu and select the Bulletin Board option. It refreshes automatically and a user can disable it also. A user can also navigate to the actual processor by double-clicking the error. A user can also filter the bulletins by working out with the following −
by message
by name
by id
by group id
Data provenance UI
To monitor the Events occurring on any specific processor or throughout NiFi, a user can access the Data provenance from the same menu as the bulletin board. A user can also filter the events in data provenance repository by working out with the following fields −
by component name
by component type
by type
NiFi Summary UI
Apache NiFi summary also can be accessed from the same menu as the bulletin board. This UI contains information about all the components of that particular NiFi instance or cluster. They can be filtered by name, by type or by URI. There are different tabs for different component types. Following are the components, which can be monitored in the NiFi summary UI −
Processors
Input ports
Output ports
Remote process groups
Connections
Process groups
In this UI, there is a link at the bottom right hand side named system diagnostics to check the JVM statistics.
2. Reporting Tasks
Apache NiFi provides multiple reporting tasks to support external monitoring systems like Ambari, Grafana, etc. A developer can create a custom reporting task or can configure the inbuilt ones to send the metrics of NiFi to the externals monitoring systems. The following table lists down the reporting tasks offered by NiFi 1.7.1.
Reporting Task:
AmbariReportingTask - To setup Ambari Metrics Service for NiFi.
ControllerStatusReportingTask - To report the information from the NiFi summary UI for the last 5 minute.
MonitorDiskUsage - To report and warn about the disk usage of a specific directory.
MonitorMemory To monitor the amount of Java Heap used in a Java Memory pool of JVM.
SiteToSiteBulletinReportingTask To report the errors and warning in bulletins using Site to Site protocol.
SiteToSiteProvenanceReportingTask To report the NiFi Data Provenance events using Site to Site protocol.
3. NiFi API
There is an API named system diagnostics, which can be used to monitor the NiFI stats in any custom developed application.
Request
http://localhost:8080/nifi-api/system-diagnostics
Response
{
"systemDiagnostics": {
"aggregateSnapshot": {
"totalNonHeap": "183.89 MB",
"totalNonHeapBytes": 192819200,
"usedNonHeap": "173.47 MB",
"usedNonHeapBytes": 181894560,
"freeNonHeap": "10.42 MB",
"freeNonHeapBytes": 10924640,
"maxNonHeap": "-1 bytes",
"maxNonHeapBytes": -1,
"totalHeap": "512 MB",
"totalHeapBytes": 536870912,
"usedHeap": "273.37 MB",
"usedHeapBytes": 286652264,
"freeHeap": "238.63 MB",
"freeHeapBytes": 250218648,
"maxHeap": "512 MB",
"maxHeapBytes": 536870912,
"heapUtilization": "53.0%",
"availableProcessors": 4,
"processorLoadAverage": -1,
"totalThreads": 71,
"daemonThreads": 31,
"uptime": "17:30:35.277",
"flowFileRepositoryStorageUsage": {
"freeSpace": "286.93 GB",
"totalSpace": "464.78 GB",
"usedSpace": "177.85 GB",
"freeSpaceBytes": 308090789888,
"totalSpaceBytes": 499057160192,
"usedSpaceBytes": 190966370304,
"utilization": "38.0%"
},
"contentRepositoryStorageUsage": [
{
"identifier": "default",
"freeSpace": "286.93 GB",
"totalSpace": "464.78 GB",
"usedSpace": "177.85 GB",
"freeSpaceBytes": 308090789888,
"totalSpaceBytes": 499057160192,
"usedSpaceBytes": 190966370304,
"utilization": "38.0%"
}
],
"provenanceRepositoryStorageUsage": [
{
"identifier": "default",
"freeSpace": "286.93 GB",
"totalSpace": "464.78 GB",
"usedSpace": "177.85 GB",
"freeSpaceBytes": 308090789888,
"totalSpaceBytes": 499057160192,
"usedSpaceBytes": 190966370304,
"utilization": "38.0%"
}
],
"garbageCollection": [
{
"name": "G1 Young Generation",
"collectionCount": 344,
"collectionTime": "00:00:06.239",
"collectionMillis": 6239
},
{
"name": "G1 Old Generation",
"collectionCount": 0,
"collectionTime": "00:00:00.000",
"collectionMillis": 0
}
],
"statsLastRefreshed": "09:30:20 SGT",
"versionInfo": {
"niFiVersion": "1.7.1",
"javaVendor": "Oracle Corporation",
"javaVersion": "1.8.0_151",
"osName": "Windows 7",
"osVersion": "6.1",
"osArchitecture": "amd64",
"buildTag": "nifi-1.7.1-RC1",
"buildTimestamp": "07/12/2018 12:54:43 SGT"
}
}
}
}
You also can use nifi-api for monitoring. You can receive detailed information about each processor group, controller service or processor.
You can use MonitoFi. It is an open source tool that is highly configurable, uses nifi-api to collect stats about various nifi processors and stores those in influxdb. It also comes with Grafana Dashboards and Alerting functionality.
http://www.monitofi.com
or
https://github.com/microsoft/MonitoFi

AWS EMR - job counters exceeded limit 120

I have a hadoop code base that I inherited and which I'm trying to get running on EMR. But I'm running into issues with the job counters. I get an error saying that I'm exceeding the default limit of 120. I looked into my code and I see I have about 40 counters, and EMR adds another 30 internal counters, but that should still be within the 120 default limit.
I'm running on EMR AMI version 2.4.2, and Amazon 1.0.3 hadoop distribution.
Is there a way to increase the limit? I saw More than 120 counters in hadoop . But I'm not sure how to set this up on EMR.
Is there any way I can get more debug to figure out what is going on?
You can raise the counter limit with this configuration:
[
{
"Classification": "mapred-site",
"Properties": {
"mapreduce.job.counters.max:": "1024"
}
}
]
Here are Amazon's instructions on how to register those instructions with your cluster. (I'm not pasting it here directly because there are many ways to do it, depending on how you create and use your cluster.)

Resources