Any way to monitor Nifi Processors? Any Utility Dashboard? - apache-nifi

If I have developed a NiFi flow and a support person wants to view its current state, how can they see which processor is currently running, which processors have already run, and which ones have completed?
In other words, does NiFi provide any dashboard-style utility for monitoring activity?

You can use Reporting Tasks together with NiFi itself or with a separate NiFi instance, which is what I chose.
To do that, do the following:
Open the reporting task menu
Add the desired reporting tasks
Configure them properly
Then create a flow to handle the reporting data
In my case I am putting the information into Elasticsearch

There are numerous ways to monitor NiFi flows and status. The status bar along the top of the UI shows running/stopped/invalid processor counts, and cluster status, thread count, etc. The global menu at the top right has options for monitoring JVM usage, flowfiles processed/in/out, CPU, etc.
Each individual processor will show a status icon for running/stopped/invalid/disabled, and can be right-clicked for the same JVM usage, flowfile status, etc. graphs as the global view, but for the individual processor. There are also some Reporting Tasks provided by default to integrate with external monitoring systems, and custom reporting tasks can be written for any other desired visualization or monitoring dashboard.
NiFi doesn’t have the concept of batch/job processing, so processors aren’t “complete”.

1. Built-in monitoring in Apache NiFi
Bulletin Board
The bulletin board shows the latest ERROR and WARNING bulletins generated by NiFi processors in real time. To access it, go to the right-hand drop-down menu and select the Bulletin Board option. It refreshes automatically, and the automatic refresh can be disabled. A user can navigate to the processor that raised a bulletin by double-clicking the error. Bulletins can be filtered in the following ways:
by message
by name
by id
by group id
Data provenance UI
To monitor the events occurring on a specific processor or throughout NiFi, a user can open Data Provenance from the same menu as the bulletin board. Events in the data provenance repository can be filtered by the following fields:
by component name
by component type
by type
NiFi Summary UI
The NiFi Summary can also be opened from the same menu as the bulletin board. This UI contains information about all the components of that particular NiFi instance or cluster. They can be filtered by name, by type or by URI. There are different tabs for different component types. The following components can be monitored in the NiFi Summary UI:
Processors
Input ports
Output ports
Remote process groups
Connections
Process groups
In this UI, there is a link named system diagnostics at the bottom right-hand side for checking the JVM statistics.
2. Reporting Tasks
Apache NiFi provides multiple reporting tasks to support external monitoring systems such as Ambari, Grafana, etc. A developer can create a custom reporting task or configure the built-in ones to send NiFi metrics to external monitoring systems. The following table lists the reporting tasks offered by NiFi 1.7.1.
Reporting Task - Description
AmbariReportingTask - To set up Ambari Metrics Service for NiFi.
ControllerStatusReportingTask - To report the information from the NiFi summary UI for the last 5 minutes.
MonitorDiskUsage - To report and warn about the disk usage of a specific directory.
MonitorMemory - To monitor the amount of Java heap used in a memory pool of the JVM.
SiteToSiteBulletinReportingTask - To report the errors and warnings in bulletins using the Site-to-Site protocol.
SiteToSiteProvenanceReportingTask - To report the NiFi Data Provenance events using the Site-to-Site protocol.
3. NiFi API
There is an API named system diagnostics, which can be used to monitor NiFi stats from any custom-developed application.
Request
http://localhost:8080/nifi-api/system-diagnostics
Response
{
"systemDiagnostics": {
"aggregateSnapshot": {
"totalNonHeap": "183.89 MB",
"totalNonHeapBytes": 192819200,
"usedNonHeap": "173.47 MB",
"usedNonHeapBytes": 181894560,
"freeNonHeap": "10.42 MB",
"freeNonHeapBytes": 10924640,
"maxNonHeap": "-1 bytes",
"maxNonHeapBytes": -1,
"totalHeap": "512 MB",
"totalHeapBytes": 536870912,
"usedHeap": "273.37 MB",
"usedHeapBytes": 286652264,
"freeHeap": "238.63 MB",
"freeHeapBytes": 250218648,
"maxHeap": "512 MB",
"maxHeapBytes": 536870912,
"heapUtilization": "53.0%",
"availableProcessors": 4,
"processorLoadAverage": -1,
"totalThreads": 71,
"daemonThreads": 31,
"uptime": "17:30:35.277",
"flowFileRepositoryStorageUsage": {
"freeSpace": "286.93 GB",
"totalSpace": "464.78 GB",
"usedSpace": "177.85 GB",
"freeSpaceBytes": 308090789888,
"totalSpaceBytes": 499057160192,
"usedSpaceBytes": 190966370304,
"utilization": "38.0%"
},
"contentRepositoryStorageUsage": [
{
"identifier": "default",
"freeSpace": "286.93 GB",
"totalSpace": "464.78 GB",
"usedSpace": "177.85 GB",
"freeSpaceBytes": 308090789888,
"totalSpaceBytes": 499057160192,
"usedSpaceBytes": 190966370304,
"utilization": "38.0%"
}
],
"provenanceRepositoryStorageUsage": [
{
"identifier": "default",
"freeSpace": "286.93 GB",
"totalSpace": "464.78 GB",
"usedSpace": "177.85 GB",
"freeSpaceBytes": 308090789888,
"totalSpaceBytes": 499057160192,
"usedSpaceBytes": 190966370304,
"utilization": "38.0%"
}
],
"garbageCollection": [
{
"name": "G1 Young Generation",
"collectionCount": 344,
"collectionTime": "00:00:06.239",
"collectionMillis": 6239
},
{
"name": "G1 Old Generation",
"collectionCount": 0,
"collectionTime": "00:00:00.000",
"collectionMillis": 0
}
],
"statsLastRefreshed": "09:30:20 SGT",
"versionInfo": {
"niFiVersion": "1.7.1",
"javaVendor": "Oracle Corporation",
"javaVersion": "1.8.0_151",
"osName": "Windows 7",
"osVersion": "6.1",
"osArchitecture": "amd64",
"buildTag": "nifi-1.7.1-RC1",
"buildTimestamp": "07/12/2018 12:54:43 SGT"
}
}
}
}
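As a rough illustration of consuming that response from a custom application, here is a minimal Python sketch using only the standard library. It assumes an unsecured NiFi instance at the URL shown above; the function names and the choice of fields to report are my own, not part of NiFi. The sample dictionary is a trimmed copy of the response above so that the sketch runs without a server.

```python
import json
from urllib.request import urlopen

# Assumption: unsecured NiFi on localhost:8080, as in the request above.
NIFI_DIAGNOSTICS_URL = "http://localhost:8080/nifi-api/system-diagnostics"


def fetch_diagnostics(url=NIFI_DIAGNOSTICS_URL):
    """Fetch the system-diagnostics document and return it as a dict."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def summarize(diagnostics):
    """Pick a few health figures out of the aggregateSnapshot section."""
    snapshot = diagnostics["systemDiagnostics"]["aggregateSnapshot"]
    # Note: maxHeapBytes may be -1 when the JVM reports no limit (as maxNonHeapBytes
    # does above); this sketch does not handle that case.
    heap_pct = 100.0 * snapshot["usedHeapBytes"] / snapshot["maxHeapBytes"]
    return {
        "heap_used_pct": round(heap_pct, 1),
        "total_threads": snapshot["totalThreads"],
        "content_repo_utilization": [
            repo["utilization"]
            for repo in snapshot["contentRepositoryStorageUsage"]
        ],
    }


# Trimmed copy of the response shown above.
sample = {
    "systemDiagnostics": {
        "aggregateSnapshot": {
            "usedHeapBytes": 286652264,
            "maxHeapBytes": 536870912,
            "totalThreads": 71,
            "contentRepositoryStorageUsage": [
                {"identifier": "default", "utilization": "38.0%"}
            ],
        }
    }
}

print(summarize(sample))
```

Against a live instance you would call summarize(fetch_diagnostics()) on a schedule and push the result to whatever dashboard you use.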

You can also use nifi-api for monitoring. It returns detailed information about each process group, controller service or processor.
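For example, to answer "which processors are running", a sketch along these lines might work. It assumes the status endpoint /nifi-api/flow/process-groups/root/status?recursive=true and the nested field names used below; treat both as assumptions to verify against the REST API documentation for your NiFi version. The sample data and the helper name processor_states are made up for illustration.

```python
def processor_states(status):
    """Return {processor name: run status}, walking nested process groups.

    Assumed response shape (verify for your NiFi version):
    processGroupStatus.aggregateSnapshot.processorStatusSnapshots[]
        .processorStatusSnapshot.{name, runStatus}
    with child groups under processGroupStatusSnapshots[].
    """
    states = {}

    def walk(group_snapshot):
        for entry in group_snapshot.get("processorStatusSnapshots", []):
            proc = entry["processorStatusSnapshot"]
            states[proc["name"]] = proc["runStatus"]
        for entry in group_snapshot.get("processGroupStatusSnapshots", []):
            walk(entry["processGroupStatusSnapshot"])

    walk(status["processGroupStatus"]["aggregateSnapshot"])
    return states


# Hypothetical, hand-written sample in the assumed shape.
sample_status = {
    "processGroupStatus": {
        "aggregateSnapshot": {
            "processorStatusSnapshots": [
                {"processorStatusSnapshot": {"name": "GetFile", "runStatus": "Running"}}
            ],
            "processGroupStatusSnapshots": [
                {
                    "processGroupStatusSnapshot": {
                        "processorStatusSnapshots": [
                            {"processorStatusSnapshot": {"name": "PutFile", "runStatus": "Stopped"}}
                        ]
                    }
                }
            ],
        }
    }
}

print(processor_states(sample_status))
```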

You can use MonitoFi. It is a highly configurable open source tool that uses nifi-api to collect stats about various NiFi processors and stores them in InfluxDB. It also comes with Grafana dashboards and alerting functionality.
http://www.monitofi.com
or
https://github.com/microsoft/MonitoFi

Related

Custom gauge for prometheus Go SDK

I have an app that monitors some external jobs (among other things). It does this monitoring every 5 minutes.
I'm trying to create a Prometheus gauge to get the count of currently running jobs.
Here is how I declared my gauge:
JobStats = promauto.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace:   "myapi",
		Subsystem:   "app",
		Name:        "job_count",
		Help:        "Current running jobs in the system",
		ConstLabels: nil,
	},
	[]string{"l1", "l2", "l3"},
)
In the code that actually counts the jobs I do:
metrics.JobStats.WithLabelValues(l1, l2, l3).Add(float64(jobs_cnt))
When I query the /metrics endpoint I get the number.
The thing is, this metric only keeps increasing. If I restart the app it resets to zero and then keeps increasing again.
I'm using Grafana to graph this in a dashboard.
My questions are:
How do I get the graph to show the actual number of jobs (instead of an ever-increasing line)?
Should this be handled in code (like setting this to zero before every collection?) or in Grafana?

enableAutoTierToHotFromCool Does not move from cool to hot

I have some Azure Storage Accounts (StorageV2) located in West Europe. All blobs uploaded are by default in the Hot tier and I have this lifecycle rule defined on them:
{
"rules": [
{
"enabled": true,
"name": "moveToCool",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"enableAutoTierToHotFromCool": true,
"tierToCool": {
"daysAfterLastAccessTimeGreaterThan": 1
}
}
},
"filters": {
"blobTypes": [
"blockBlob"
]
}
}
}
]
}
The uploaded blobs are moved to Cool as expected, but after I access them again they still appear under the Cool tier in the portal. Any idea why? (I have waited more than 24 hours for the rule to take effect.)
Some more questions about "enableAutoTierToHotFromCool": true:
Does it depend on the blob size? (For example, if some blobs were moved to Cool and are then accessed simultaneously, does a 1 GiB blob take the same time to move back to Hot as a 10 KiB blob?)
Does it depend on the number of blobs that are accessed? (Is there a queue, so that if multiple Cool blobs are accessed at the same time the requests are served in queue order?)
The enableAutoTierToHotFromCool property is a Boolean value that indicates whether a blob should automatically be tiered from cool back to hot if it is accessed again after being tiered to cool.
It can take up to 48 hours for a new policy to be applied. "enableAutoTierToHotFromCool": true does not depend on the size of the blob, nor on the number of blobs accessed.
If you enable firewall rules for your storage account, lifecycle management requests may be blocked. You can unblock these requests by providing exceptions for trusted Microsoft services. For more information, see the Exceptions section in Configure firewalls and virtual networks.
A lifecycle management policy must be read or written in full. Partial updates are not supported. So try writing the policy in full, including a filter such as:
"prefixMatch": [
"containerName/log"
]
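To make "written in full" concrete, here is a small Python sketch that rebuilds the whole policy from the question with the suggested prefix filter added and prints it as JSON. Placing prefixMatch under filters next to blobTypes is my reading of the suggestion, and "containerName/log" is the placeholder from the answer, not a real path.

```python
import json

# The complete policy from the question, plus the suggested prefix filter.
policy = {
    "rules": [
        {
            "enabled": True,
            "name": "moveToCool",
            "type": "Lifecycle",
            "definition": {
                "actions": {
                    "baseBlob": {
                        "enableAutoTierToHotFromCool": True,
                        "tierToCool": {"daysAfterLastAccessTimeGreaterThan": 1},
                    }
                },
                "filters": {
                    "blobTypes": ["blockBlob"],
                    # Placeholder prefix; replace with a real container/prefix.
                    "prefixMatch": ["containerName/log"],
                },
            },
        }
    ]
}

# Serialize the whole document; the policy is replaced as a unit, not patched.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```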
For more details refer this document:

clear prometheus metrics from collector

I'm trying to modify prometheus mesos exporter to expose framework states:
https://github.com/mesos/mesos_exporter/pull/97/files
A bit about mesos exporter - it collects data from both mesos /metrics/snapshot endpoint, and /state endpoint.
The issue with the latter, both with the changes in my PR and with the existing metrics reported on slaves, is that metrics, once created, last forever (until the exporter is restarted).
So if, for example, a framework has completed, the metrics reported for this framework will be stale (e.g. it will still show the framework as using CPU).
So I'm trying to figure out how I can clear those stale metrics. If I could just clear the entire mesosStateCollector each time before collect is done, it would be awesome.
There is a delete method for the different Prometheus vectors (e.g. GaugeVec), but in order to delete a metric, I need not only the label name but also the label value for the relevant metric.
OK, so it seems it was easier than I thought (if only I had been familiar with Go before approaching this task).
I just need to type-assert the collector to *prometheus.GaugeVec and reset it:
prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Help:      "Total slave CPUs (fractional)",
	Namespace: "mesos",
	Subsystem: "slave",
	Name:      "cpus",
}, labels): func(st *state, c prometheus.Collector) {
	// Added for each GaugeVec: drop every previously reported label set
	// so that stale series disappear before the current values are set.
	c.(*prometheus.GaugeVec).Reset()
	for _, s := range st.Slaves {
		c.(*prometheus.GaugeVec).WithLabelValues(s.PID).Set(s.Total.CPUs)
	}
},

Resource Identifiers between two FHIR servers

Our scenario is an EHR system that is integrating with a device sensor partner using FHIR. In this scenario both companies will have independent FHIR servers. Each of them has different Patient and Organization records with their own identifiers. The preference is for the sensor FHIR server to keep the mapping of EHR identifiers to its own internal identifiers for these resources.
The EHR wants to assign a Patient to a Device with the sensor FHIR server.
Step 1: First the EHR would #GET the list of Device resources for a given Organization where a Patient is not currently assigned from the sensor FHIR server e.g.
/api/Device?organization.identifier=xyz&patient:missing=true
Here I would assume the Organization identifier is that of the EHR system since the EHR system doesn't have knowledge of the sensor system Organization identifier at this point.
The reply to this call would be a bundle of devices:
... snip ...
"owner": {
"reference": "http://sensor-server.com/api/Organization/3"
},
... snip ...
Question 2: Would the owner Organization reference have the identifier from the search or the internal/logical ID as it's known by the sensor FHIR server as in the snippet above?
Step 2: The clinician of the EHR system chooses a Device from the list to assign it to a Patient in the EHR system
Step 3: The EHR system will now issue a #PUT /api/Device/{id} request back to the sensor FHIR server to assign a Patient resource to a Device resource e.g.
{
"resourceType": "Device",
"owner": {
"reference": "http://sensor-server.com/api/Organization/3"
},
"id": "b4994c31f906",
"patient": {
"reference": "https://ehr-server.com/api/Patient/4754475"
},
"identifier": [
{
"use": "official",
"system": "bluetooth",
"value": "b4:99:4c:31:f9:06",
"label": "Bluetooth address"
}
]
}
Question 3: What resource URI/identifier should be used for the Patient resource? I would assume it is that of the EHR system since the EHR system doesn't have knowledge of the sensor system Patient identifier. Notice however, that the Organization reference is to a URI in the sensor FHIR server while the Patient reference is a URI to the EHR system - this smells funny.
Step 4: The EHR can issue a #GET /api/Device/{id} on the sensor FHIR server and get back the Device resource e.g.
{
"resourceType": "Device",
"owner": {
"reference": "http://sensor-server.com/api/Organization/3"
},
"id": "b4994c31f906",
"patient": {
"reference": "https://sensor-server.com/api/Patient/abcdefg"
},
"identifier": [
{
"use": "official",
"system": "bluetooth",
"value": "b4:99:4c:31:f9:06",
"label": "Bluetooth address"
}
]
}
Question 4: Would we expect to see a reference to the Patient containing the absolute URI to the EHR FHIR server (as it was on the #PUT in Step 3) or would/could the sensor FHIR server have modified that to return a reference to a resource in its FHIR server using its internal logical ID?
I didn't see a Question 1, so I'll presume it's the "assume" sentence in front of your first example. If the EHR is querying the device sensor server and the organizations on the device sensor server include the business identifier known by the EHR, then that's reasonable. You would need some sort of business process to ensure that occurs though.
Question 2: The device owner element would be using a resource reference, which means it's pointing to the "id" element of the target organization. Think of resource ids as primary keys. They're typically assigned by the server that's storing the data, though in some architectures, they can be set by the client (who creates the record using PUT instead of POST). In any event, you can't count on them being meaningful business identifiers - and according to most data storage best practices, they generally shouldn't be. And if, as I expect, your scenario involves multiple EHR clients potentially talking to the "device" server, the resource id couldn't possibly align with the business ids of all of the EHRs. (That's a long way of saying "no, 'xyz' probably won't be '3'".)
Question 3: If the EHR has its own server, the EHR client could update the device on the "sensor" server to point to a URL on the EHR server. Whether that's appropriate or not depends on your architecture. If you want other EHRs to recognize the patient, then you'd probably want the "sensor" server to host patients too and for the EHR to look up the patient by business id and then reference the "sensor" server's URL. If not, then pointing to the EHR server's URL is fine.
Question 4: When you do a "GET", it's normal to receive back the same data you specified on a POST. It's legal for the server to change the data, including possibly updating references. But that's likely to confuse a lot of client systems, so it's not generally recommended or typical.
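To illustrate the lookup described under Question 3 (find the patient on the "sensor" server by business identifier, then reference that server's URL), here is a minimal Python sketch. It relies on the standard FHIR search form Patient?identifier=system|value and on search results coming back as a Bundle with entry[].resource. The identifier system URI and the sample bundle are hypothetical, as are the function names.

```python
from urllib.parse import urlencode


def patient_search_url(base_url, system, value):
    """Build a FHIR token search for a Patient by business identifier."""
    return f"{base_url}/Patient?" + urlencode({"identifier": f"{system}|{value}"})


def patient_reference(base_url, bundle):
    """Turn a search Bundle into an absolute reference on that server.

    Raises LookupError unless exactly one Patient matched, because an
    ambiguous or missing match should not be assigned to a device.
    """
    ids = [
        entry["resource"]["id"]
        for entry in bundle.get("entry", [])
        if entry["resource"]["resourceType"] == "Patient"
    ]
    if len(ids) != 1:
        raise LookupError(f"expected exactly one Patient, found {len(ids)}")
    return f"{base_url}/Patient/{ids[0]}"


SENSOR_BASE = "https://sensor-server.com/api"
# Hypothetical identifier system for the EHR's patient numbers.
EHR_PATIENT_SYSTEM = "https://ehr-server.com/patient-ids"

url = patient_search_url(SENSOR_BASE, EHR_PATIENT_SYSTEM, "4754475")

# Hand-written stand-in for the Bundle the sensor server might return.
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [{"resource": {"resourceType": "Patient", "id": "abcdefg"}}],
}

print(url)
print(patient_reference(SENSOR_BASE, sample_bundle))
```

The resulting reference is what the EHR would put in the Device's patient element if the sensor server hosts patients too; otherwise the EHR keeps pointing at its own URL, as described above.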

How to use RabbitMQ http api to see what queue had a messages in a ready state

I have a RabbitMQ server set up with thousands of queues, of which only about 5 are persistent. Every now and then a queue backs up and has about 5-10 messages in a ready state. These messages do not appear to be in the persistent queues. I want to find out which queues had the messages in a ready state, but the only indication that it is happening is on the overview page of the web management console, which covers all queues.
Is there a way to query Rabbit to tell me the stat info for messages that were in a ready state for a period of minutes and which queue they were in?
I would use the HTTP API.
http://rabbit-broker:15672/api/queues
This will give you a list of the current queue states in JSON, so you'll have to keep polling it. Store the "messages_ready" value for each queue "name" over the period you want to monitor. You'll then be able to see which queues have that backlog spike.
You can use plain curl or an HTTP client on whichever platform you prefer.
Please note: the user you connect as must have the monitoring tag to access all the queue information.
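A minimal Python sketch of that polling approach, using only the standard library. The host, credentials and function names are placeholders of mine; the only fields it relies on are "name" and "messages_ready" from the /api/queues response. The sample list stands in for a real response so the filtering step can be run without a broker.

```python
import base64
import json
from urllib.request import Request, urlopen


def fetch_queues(base_url, user, password):
    """GET /api/queues with HTTP basic auth and return the decoded JSON list."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = Request(
        f"{base_url}/api/queues",
        headers={"Authorization": f"Basic {token}"},
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def queues_with_backlog(queues):
    """Map queue name to messages_ready for every queue with ready messages."""
    return {
        q["name"]: q["messages_ready"]
        for q in queues
        if q.get("messages_ready", 0) > 0
    }


# Hand-written stand-in for one poll of /api/queues.
sample_queues = [
    {"name": "orders", "messages_ready": 0},
    {"name": "tmp-7f3a", "messages_ready": 7},
    {"name": "audit"},  # field missing: treated as no backlog
]

print(queues_with_backlog(sample_queues))
```

In practice you would call queues_with_backlog(fetch_queues("http://rabbit-broker:15672", user, password)) every few seconds and log any non-empty result with a timestamp.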
Out of the box there is no easy way AFAIK, you'd have to manually click through the queues and look at their graphs in the UI for the last hour, which is tedious.
I had similar requirements and I found a better way than polling. The docs say that you may get raw samples via api if you use special parameters in the request.
For example in your case, if you are interested in messages with ready state, you may ask your queue for a history of queue lengths, for example last 60 seconds with samples every 1 second (note 15672 is the default port used by rabbitmq_management):
http://rabbitHost:15672/api/queues/vhost/queue?lengths_age=60&lengths_incr=1
For default vhost=/ it will be:
http://rabbitHost:15672/api/queues/%2F/queue?lengths_age=60&lengths_incr=1
Then in the result json there will be some additional _details objects like this:
"messages_ready_details": {
"avg": 8.524590163934427,
"avg_rate": 0.08333333333333333,
"samples": [{
"timestamp": 1532699694000,
"sample": 5
}, {
"timestamp": 1532699693000,
"sample": 11
},
<... more samples ...>
],
"rate": -6.0
},
"messages_ready": 5,
Then on this raw data you may do any stats you need.
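For instance, a short Python sketch that finds the peak queue length in the returned samples. The sample dictionary reuses the two samples shown above; the function name is made up.

```python
def ready_stats(queue_info):
    """Summarize the messages_ready samples returned for one queue."""
    samples = queue_info["messages_ready_details"]["samples"]
    values = [s["sample"] for s in samples]
    # The sample with the largest queue length, to recover when it happened.
    peak = max(samples, key=lambda s: s["sample"])
    return {
        "min": min(values),
        "max": max(values),
        "peak_timestamp_ms": peak["timestamp"],
    }


# The two samples from the response fragment above.
sample_queue = {
    "messages_ready_details": {
        "samples": [
            {"timestamp": 1532699694000, "sample": 5},
            {"timestamp": 1532699693000, "sample": 11},
        ]
    },
    "messages_ready": 5,
}

print(ready_stats(sample_queue))
```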
Other raw data samples appear if you use different parameters in the request:
What sampling will appear? | What parameters are required for it to appear?
Messages sent and received | msg_rates_age / msg_rates_incr
Bytes sent and received | data_rates_age / data_rates_incr
Queue lengths | lengths_age / lengths_incr
Node statistics (e.g. file descriptors, disk space free) | node_stats_age / node_stats_incr
