How to enable Disk I/O monitoring - opennms

I'm having difficulty enabling/configuring SNMP monitoring of disk I/O in OpenNMS version 1.16.0.3.
Let me explain this a bit. In OpenNMS, when you go to "Configure OpenNMS > Operations > Manage SNMP Collections and Data Collection Groups" you can configure SNMP data collection.
Under "Data Collection Groups", select the "netsnmp.xml" data collection group file; there you'll find Disk IO (UCD-SNMP MIB).
My question is: how can this resource (Disk IO) be monitored so that it appears as a graph under Nodes > "name of monitored node" > Resource Graphs?

In the data collection config file (netsnmp.xml), you need to include the collection group in the system definition toward the bottom of the config file.
You'll notice that the group is called "ucd-diskio"
System definitions (systemDef in the config file) are used to create profiles of the metric groups you wish to include in collection attempts. They use a sysOID or sysOID mask to identify the devices to attempt collection against. So find the systemDef that uses a sysoidMask of .1.3.6.1.4.1.8072.3 and add another includeGroup to it:
ucd-diskio
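For illustration, the edited systemDef might look roughly like this (the systemDef name and the other includeGroup shown here are placeholders; keep whatever your netsnmp.xml already contains):

```xml
<!-- Hypothetical excerpt from netsnmp.xml; names are illustrative -->
<systemDef name="Net-SNMP">
  <sysoidMask>.1.3.6.1.4.1.8072.3.</sysoidMask>
  <collect>
    <includeGroup>mib2-host-resources-storage</includeGroup>
    <!-- add this line to start collecting the Disk IO table -->
    <includeGroup>ucd-diskio</includeGroup>
  </collect>
</systemDef>
```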
Then just do a touch on $OPENNMS_HOME/etc/datacollection-config.xml and after a few minutes, if OpenNMS can collect successfully against your node, you should see the I/O data populating.
NOTE: Because disk I/O uses indexed collections against a MIB table, the graphs will appear under a "Disk IO Index" heading and not under the catch-all Host Resources group of graphs. It should use something like the volume name as the index.
Cheers

Related

Grafana/Prometheus visualizing multiple ips as query

I want to have a graph where all recent IPs that requested my webserver are shown with their total request counts. Is something like this doable? Can I add a query and remove it afterwards via Prometheus?
Technically, yes. You will need to:
Expose some metric (probably a counter) in your server; say, requests_count, with a label; say, ip
Whenever you receive a request, increment the metric with the label set to the requester's IP
In Grafana, graph the metric, likely summing it by the IP address to handle the case where you have several horizontally scaled servers handling requests: sum(your_prometheus_namespace_requests_count) by (ip)
Set the Legend of the graph in Grafana to {{ ip }} to 'name' each line after the IP address it represents
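A minimal sketch of the instrumentation side using the official Python client (the metric name, label name, and IPs here are just examples):

```python
from prometheus_client import Counter, REGISTRY

# Example counter with an "ip" label: one time series per distinct IP seen
requests_count = Counter("requests_count", "Requests received, by client IP", ["ip"])

def handle_request(client_ip: str) -> None:
    # ...serve the request, then increment the series for this client...
    requests_count.labels(ip=client_ip).inc()

handle_request("192.168.0.1")
handle_request("192.168.0.1")
handle_request("10.0.0.7")

# The exposed samples are named requests_count_total{ip="..."}
print(REGISTRY.get_sample_value("requests_count_total", {"ip": "192.168.0.1"}))  # -> 2.0
```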
However, every different label value a metric has causes a whole new metric to exist in the Prometheus time-series database; you can think of a metric like requests_count{ip="192.168.0.1"}=1 to be somewhat similar to requests_count_ip_192_168_0_1{}=1 in terms of how it consumes memory. Each metric instance currently being held in the Prometheus TSDB head takes something on the order of 3kB to exist. What that means is that if you're handling millions of requests, you're going to be swamping Prometheus' memory with gigabytes of data just from this one metric alone. A more detailed explanation about this issue exists in this other answer: https://stackoverflow.com/a/69167162/511258
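To make the memory math concrete: at roughly 3 kB per live series, a million distinct IPs is on the order of 3 GB of RAM for this one metric. A toy model (plain Python, not Prometheus internals) of how label values turn into series:

```python
# Toy model: each distinct (metric, label set) pair is its own time series
series = {}

def inc(metric, labels):
    key = (metric, tuple(sorted(labels.items())))
    series[key] = series.get(key, 0) + 1

for ip in ("192.168.0.1", "192.168.0.2", "192.168.0.1"):
    inc("requests_count", {"ip": ip})

# Three increments but only two distinct IPs -> two series held in memory
print(len(series))  # -> 2
```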
With that in mind, this approach would make sense if you know for a fact you expect a small volume of IP addresses to connect (maybe on an internal intranet, or a client you distribute to a small number of known clients), but if you are planning to deploy to the web this would allow a very easy way for people to (unknowingly, most likely) crash your monitoring systems.
You may want to investigate an alternative -- for example, Grafana is capable of ingesting data from some common log aggregation platforms, so perhaps you can do some structured (e.g. JSON) logging, hold that in e.g. Elasticsearch, and then create a graph from the data held within that.

Apache nifi: Difference between the flowfile State and StateManagement

From what I've read here and there, the flowfile repository serves as a write-ahead log for Apache NiFi.
While walking through the configuration files, I've seen that there is a state-management configuration section. In standalone mode, a local-provider is used which writes the state (by default) to ./state/local/.
It seems like the flowfile repo and the state are both used, for example, to recover from a system failure.
Would someone please explain the difference between them? Do they work together?
Also, it's a best practice to have the flowfile repo and the content repo on two separate disks. What about the local state? Should we avoid using the "boot" disk and offload it to another one? Which one: a dedicated disk? Co-located with another repo (I'm co-locating the database and flowfile repos)?
Thanks.
The flow file repository keeps track of all the flow files in the system, which content they point to, which attributes they have, and where they are in the flow.
State Management is an API provided to processors/services that can be used to store and retrieve key/value pairs, typically for remembering where something left off. For example, a source processor that pulls data since some timestamp would want to store the last timestamp it used so that if NiFi restarts it can retrieve this value and start from there again.
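NiFi exposes this to processors as a Java API, but the pattern itself can be sketched generically (this is illustrative Python, not NiFi's actual API):

```python
import json
import os
import tempfile

# Generic sketch of the "remember where I left off" pattern that NiFi's
# state API provides to processors; the file location is arbitrary here.
STATE_FILE = os.path.join(tempfile.gettempdir(), "processor-state.json")

def save_state(state: dict) -> None:
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def load_state() -> dict:
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # first run: no state yet

# A source processor stores the last timestamp it pulled...
save_state({"last_timestamp": "2021-06-01T00:00:00Z"})
# ...and after a restart resumes from that value instead of starting over.
print(load_state()["last_timestamp"])
```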

Nifi processor to route flows based on changeable list of regex

I am trying to use NiFi to act as a router for syslog, based on a list of regexes matched against the syslog.body (NB: as this is just a proof of concept I can change any part if needed).
The thought process is that via a separate system (for now, vi and a text file 😃) an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system (for example, all critical audit data, matched by the regex list, is sent to the audit system and all other data goes to the standard log store).
I know that this can be done with route-by-content processors, but the properties are configured before the processor starts and an admin would have to stop the processor every time they need to make an edit.
I would like to load the list of regexes in periodically (automatically) and have the processor properties be updated.
I don't mind if this is done all natively in NiFi (that is preferable, for elegance and to save an external app being written) or via a REST API call driven by a Python script or something (or can NiFi send REST calls to itself?!).
I appreciate that a processor property cannot be updated while running, so it would have to be stopped to be updated, but that's fine as the queue will buffer for the brief period. Maybe a check to see whether the file has changed could avoid outages for no reason, rather than updating periodically regardless; I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which specifies a dictionary file on disk which contains a list of search terms and monitors the file for changes, reloading in that event. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code or use this as a baseline for a custom processor with those changes.
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.
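As a sketch of the scripted approach (endpoint shape from the NiFi 1.x REST API; the base URL, processor id, and revision values here are assumptions — in practice you'd GET the processor first to obtain its current revision):

```python
import json
import urllib.request

NIFI_API = "http://localhost:8080/nifi-api"  # assumed unsecured local NiFi

def run_status_payload(state: str, client_id: str, version: int) -> dict:
    # Body for PUT /processors/{id}/run-status; state is "RUNNING" or "STOPPED".
    # The revision must match what a prior GET /processors/{id} returned.
    return {"revision": {"clientId": client_id, "version": version}, "state": state}

def set_run_status(processor_id: str, state: str, client_id: str, version: int):
    req = urllib.request.Request(
        f"{NIFI_API}/processors/{processor_id}/run-status",
        data=json.dumps(run_status_payload(state, client_id, version)).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)  # requires a running NiFi instance

# Typical sequence (not executed here): stop, update the regex property via
# PUT /processors/{id}, then start again:
#   set_run_status("processor-uuid", "STOPPED", "my-script", 3)
#   ...PUT the updated processor config...
#   set_run_status("processor-uuid", "RUNNING", "my-script", 4)
```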

List error logs from Stackdriver matching a pattern

I am evaluating approaches for a scenario where I need to fetch a list of logs from Stackdriver. There can be multiple filter criteria (e.g. the payload contains the word 'retry', logs of type 'Warning', ...).
With the help of the GCP SDK I was able to query Stackdriver, but I'm not sure how efficient this approach is. Kindly suggest other approaches where I can use an Elasticsearch client to query Stackdriver and list matching logs.
It looks like you have multiple sets of logs that you wish to consume separately and each of those log sets can be described with a Stackdriver filter. This is a good start since running filters against Stackdriver is an effective way to sort your data. And you are right that running the same filter against Stackdriver over and over again would be pretty inefficient.
The following approach uses Stackdriver log sinks and this is how we manage logs on our GCP account. Our monitoring team is pretty happy with it and it's easy to maintain.
You can read up on log sinks here and aggregated log sinks here.
The general idea is to have Google automatically filter and export the logs for you using multiple log sinks (one sink per filter). The export destination can be Google Storage, BigQuery, or Pub/Sub. Each sink should export to a different location and will do so continuously as long as the sink exists. Also, log sinks can be set up per project or at the organization level (where it can inherit all projects underneath).
For example, let's say you want to set up three log sinks. Each sink uses a different filter and different export location (but all to the same bucket):
Log Sink 1 (compute logs) -> gs://my-example-log-dump-bucket/compute-logs/
Log Sink 2 (network logs) -> gs://my-example-log-dump-bucket/network-logs/
Log Sink 3 (linux logs) -> gs://my-example-log-dump-bucket/linux-logs/
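For illustration, a sink like the first one might be created with gcloud roughly as follows (the sink name, bucket, and filter are assumptions; note that for Cloud Storage the destination is given at the bucket level):

```shell
# Hypothetical sink: route Compute Engine instance logs to a GCS bucket
gcloud logging sinks create compute-logs-sink \
    storage.googleapis.com/my-example-log-dump-bucket \
    --log-filter='resource.type="gce_instance"'
```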
Once this is set up, your code's SDK can just access each location based on what logs it currently needs. This eliminates the need for your code to do the filtering since Google has already handled it for you in the background.
One thing to note: log exports to BigQuery and Pub/Sub are instant, but exports to Google Storage occur at the top of every hour. So if you need a fast turnaround on the logs, avoid Google Storage and go with either BigQuery or Pub/Sub.
Hope this helps!

S3 Ruby Client - when to specify regional endpoint

I have buckets in 2 AWS regions. I'm able to perform puts or gets against both buckets without specifying the regional endpoint (the Ruby client defaults to us-east-1).
I haven't found much relevant info on how requests on a bucket reach the proper regional endpoint when the region is not specified. From what I've found(https://github.com/aws/aws-cli/issues/223#issuecomment-22872906), it appears that requests are routed to the bucket's proper region via DNS.
Does specifying the region have any advantages when performing puts and gets against existing buckets? I'm trying to decide whether I need to specify the appropriate region for operations against a bucket or if I can just rely on it working.
Note that the buckets are long lived so the DNS propagation delays mentioned in the linked github issue are not an issue.
SDK docs for region:
http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/Core/Configuration.html#region-instance_method
I do not think that there is any performance benefit to putting/getting data if you specify the region. All bucket names are supposed to be unique across all regions, and I don't think there's a lot of overhead in that lookup compared to the data throughput.
I welcome comments to the contrary.
