Flink web UI: Monitor Metrics doesn't work - user-interface

Running flink-1.9.0 on YARN (2.6.0-cdh5.11.1), but the Flink web UI metrics don't work, as shown below:

I guess you are looking at the wrong metrics. Since no data flows from one task to another (you can see only one box in the UI), there is nothing to show. The metrics you are looking at only show the data which flows from one Flink task to another. In your example everything happens within this single task.
Look at this example:
You can see two tasks sending data to the map task, which emits this data to another task. Therefore you see incoming and outgoing data.
A source task, on the other hand, never has incoming data (I must admit that this is confusing at first glance):
The number of records received is 0, but it sends a couple of records to the downstream task.
Back to your problem: what you can do is have a look at the operator metrics. If you look at the Metrics tab (the one at the very right), you can select, in addition to the task metrics, some operator metrics. These metrics have a name like 0.Map.numRecordsIn.
The name is assembled as <slot>.<operatorName>.<metricName>. But be aware that these metrics are not recorded: you don't have any historic data, and once you leave this tab or remove a metric, the data collected up to that point is gone. I would recommend using a proper metrics backend like InfluxDB, Prometheus, or Graphite. You can find a description in the Flink docs.
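For example, a minimal flink-conf.yaml snippet wiring up the Prometheus reporter might look like the sketch below (the port range is only an example; you also need to copy the flink-metrics-prometheus jar from opt/ into lib/ before starting the cluster):

# flink-conf.yaml
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260

Prometheus can then scrape each TaskManager on whichever port in that range it picked, and you keep the history that the web UI throws away.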
Hope that helped.

Related

Configuring connectors for multiple topics on Kafka Connect Distributed Mode

We have producers that are sending the following to Kafka:
topic=syslog, ~25,000 events per day
topic=nginx, ~5,000 events per day
topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log
kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...
This can be invoked with bin/connect-standalone.sh. I realize that running, or attempting to run, tasks.max=24 when work is performed in a single process is not ideal. I know that using distributed mode would be a better alternative, but I am unclear on the performance-optimal way to submit connectors to distributed mode. Namely,
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break this up into multiple .properties configs + connectors (e.g. one for syslog, one for nginx, one for zeek.**) and submit them separately?
I understand that tasks should be equal to the number of topics x the number of partitions, but what dictates the number of workers?
Is there anywhere in the documentation that walks through best practices for a situation such as this where there is a noticeable imbalance of throughput for different topics?
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?
It'd be a JSON file, but yes.
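For illustration, the standalone properties above translated into a JSON payload for the distributed REST API might look roughly like this (saved as, say, elasticsearch-sink.json; localhost:8083 is the default Connect REST port and the topic list is truncated here, so treat both as placeholders for your environment):

{
  "name": "elasticsearch-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "24",
    "topics": "syslog,nginx,zeek.conn.log,...",
    "key.ignore": "true",
    "schema.ignore": "true"
  }
}

Submitted with something like: curl -X POST -H "Content-Type: application/json" --data @elasticsearch-sink.json http://localhost:8083/connectors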
what dictates the number of workers?
Up to you. JVM usage is one factor that you can monitor and scale on.
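For context, each worker in distributed mode is just another JVM started with bin/connect-distributed.sh and a worker config; workers that share the same group.id join the same Connect cluster, so "how many workers" is simply how many of these processes you choose to run. A sketch with placeholder values:

# connect-distributed.properties
bootstrap.servers=kafka1:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status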
There isn't really any documentation that I am aware of.

Nutch as stand-by spider with custom processing pipelines

I would like to use Apache Nutch as a spider which only fetches a given URL list (no crawling). The URLs are going to be stored in Redis and I want Nutch to constantly pop them from the list and fetch the HTML. The spider needs to be in stand-by mode: it always waits for new URLs coming into Redis until the user decides to stop the job. Also, I would like to apply my own processing pipelines to the extracted HTML files (not only text extraction). Is it possible to do this with Nutch?
StormCrawler would be a much better fit for achieving this; it was designed to be able to cater for scenarios like the one you described. You'd need to write a custom spout to connect to Redis, reuse the fetcher and parser bolts, then add bolts with your own processing. Some of SC's early users were doing exactly that.
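To make that a bit more concrete, here is a rough, untested sketch of how the topology wiring could look. RedisUrlSpout is a hypothetical spout you would write yourself (shown popping URLs from a Redis list with Jedis); FetcherBolt and JSoupParserBolt are StormCrawler's bolts (package names from the pre-Apache com.digitalpebble releases), and the commented-out bolt stands in for your own processing:

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import redis.clients.jedis.Jedis;

// Hypothetical spout: stays in stand-by and emits whatever URLs land in a Redis list.
public class RedisUrlSpout extends BaseRichSpout {
    private transient Jedis jedis;
    private transient SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.jedis = new Jedis("localhost", 6379); // assumed Redis host/port
    }

    @Override
    public void nextTuple() {
        String url = jedis.lpop("urls"); // assumed name of the Redis list
        if (url != null) {
            // StormCrawler components work on (url, metadata) tuples.
            collector.emit(new Values(url, new Metadata()));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }

    // Wiring: spout -> fetcher -> parser -> your own processing bolt(s).
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("redis", new RedisUrlSpout());
        builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("redis");
        builder.setBolt("parse", new JSoupParserBolt()).shuffleGrouping("fetch");
        // builder.setBolt("process", new MyProcessingBolt()).shuffleGrouping("parse");
        // submit with LocalCluster or StormSubmitter, plus the usual StormCrawler config
    }
}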

Multiple flows with nifi

We have multiple (50+) NiFi flows that all do basically the same thing: pull some data out of a DB, append some columns, convert to Parquet, and upload to HDFS. They differ only in details such as the SQL query to run or the location in HDFS where they land.
The question is how to factor these common NiFi flows out such that any change made to the common flow automatically applies to all derived flows. E.g. if I want to add an extra step to also publish the data to Kafka, I want to make this change once and have it automatically apply to all 50 flows.
We've tried to get this working with NiFi Registry, however it seems like an imperfect fit. Essentially the issue is that NiFi Registry seems to work well for updating a flow in one environment (say UAT) and then automatically updating it in another environment (say prod). It seems less suited for updating multiple flows in the same environment, with one specific example being that it will reset the name of each flow to the template name every time we redeploy, meaning that all flows end up with the same name!
Does anyone know how one is supposed to manage a situation like ours, as I guess it must be pretty common?
Apache NiFi has Process Groups. As the name suggests, process groups are there to group together a set of processors and their pipeline that perform a similar task.
So for your case, what you can do is refactor the flow by moving the common part, which can be reused by different pipelines, into a separate process group with an input port. Connect each outside flow that depends on this reusable flow to the input port of the reusable process group. Depending on your requirements you can create an output port as well in this process group and connect it to the outside flow.
Attaching a sample:
For the sake of explanation, I have made a mock flow, so ignore the processor types that are used and instead look at the names I have given those processors.
The following screenshots show that I read from two different sources and individually connect them to two different processors that apply the source-specific changes.
Then I connect these two flows to the input port of a process group that has the reusable flow inside. So ultimately the two different flows shown in the above screenshot get to share a common reusable flow.
Showing what's inside the reusable flow:
Finally, the output port output to outside connects the reusable flow to the outside component Write to somewhere.
I hope this helps you with refactoring your complex flows. Feel free to get back if you have any queries.

How to define/send derived metrics in addition to built-in Aerospike metrics

I'm trying to ship Aerospike metrics to another node using some available methods, e.g., collectd.
For example, among the Aerospike monitoring metrics, given two fields: say X and Y, how can I define and send a derived metric like Z = X+Y or X/Y?
We could calculate it on the receiver side, but that degrades the overall performance of our application. I would appreciate your guidance.
Thanks.
It can't be done within the Aerospike collectd plugin, as the metrics are more or less shipped immediately once they are read. There's no variable that saves the metrics that have been shipped.
If you can use the Graphite plugin, it keeps track of all gathered metrics and then sends them once at the very end. You can add another stanza for your calculated metrics right before the nmsg line. You'll have to search through the msg[] array for your source metrics.
The Nagios plugin is a very different method. It's a single metric pull, so a wrapper script would be needed to run the plugin for each operand, and run the calculation in the wrapper.
Or you can supplement existing plugins with your own script(s) just for derived metrics. All of our monitoring plugins utilize the Aerospike Info Protocol and you can use asinfo to gather metrics for your operands similar to the previous Nagios method.
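As an illustration of that last approach, here is a hedged sketch that shells out to asinfo, parses the key=value output, and emits one derived metric. The operand names client_read_success and client_read_error are placeholders; substitute whatever metrics you actually want to combine:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class DerivedMetric {
    public static void main(String[] args) throws Exception {
        // Pull the raw statistics from the local node via the info protocol.
        Process p = new ProcessBuilder("asinfo", "-v", "statistics", "-l").start();
        Map<String, Double> stats = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] kv = line.split("=", 2);
                if (kv.length != 2) continue;
                try {
                    stats.put(kv[0].trim(), Double.parseDouble(kv[1].trim()));
                } catch (NumberFormatException ignore) {
                    // skip non-numeric stats
                }
            }
        }
        // Placeholder operands: the X and Y from the question.
        double x = stats.getOrDefault("client_read_success", 0.0);
        double y = stats.getOrDefault("client_read_error", 0.0);
        double z = (x + y) == 0 ? 0 : x / (x + y);
        // Print the derived value; reformat as needed for whichever shipper you use
        // (collectd exec plugin, Graphite line protocol, etc.).
        System.out.println("aerospike.derived.read_success_ratio " + z);
    }
}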

Can't expose data as graph in nagios

I am monitoring the MessageCount attribute of a JMS queue using Nagios. For that I am using the check_jmx plugin, and it gives the output "JMX OK MessageCount=400". I configured a graph for this service, but when I click on the graph icon it shows no data available. This service is not generating any RRD file. How can I configure a graph for my message count monitoring service? In the graph I want to show message count per hour. Do I have to use another plugin?
Nagios graphing addons such as PNP4Nagios use the performance data of the plugin output, which is everything after the |. Run the plugin on the command line and see if it's outputting performance data, and try different verbosity options by adding -vvv to check_jmx.
More info on performance data
check_jmx usage
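For reference, a plugin line that does carry performance data looks something like the hypothetical output below; everything after the | follows Nagios's label=value[UOM];warn;crit;min;max convention, which is what PNP4Nagios turns into RRD files:

JMX OK MessageCount=400 | MessageCount=400;;;0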
As I can't tell from your text which plugin you are using to generate the graphs, I myself recommend PNP4Nagios.
Once installed, it works really well.
For your problem:
and it gives the output as "JMX OK MessageCount=400".
If this is the messages per hour, you don't even have to change anything.
If it's not, you might include the current version of the plugin code on your Nagios in your question, or modify it yourself (store/grab the message count and timestamp in order to calculate your messages per hour).
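If you do end up modifying the plugin, the messages-per-hour calculation alluded to above is just a delta between the current sample and a stored previous one. A minimal, hypothetical Java sketch (the state-file path and the way the current count is passed in are assumptions):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class MessageRate {
    public static void main(String[] args) throws Exception {
        long count = Long.parseLong(args[0]);            // current MessageCount, e.g. parsed from check_jmx output
        long now = System.currentTimeMillis() / 1000;
        Path state = Paths.get("/var/tmp/messagecount.state"); // assumed location for the previous sample

        if (Files.exists(state)) {
            List<String> prev = Files.readAllLines(state);
            long prevCount = Long.parseLong(prev.get(0));
            long prevTime = Long.parseLong(prev.get(1));
            double perHour = (count - prevCount) * 3600.0 / Math.max(1, now - prevTime);
            // Nagios-style output: human-readable text, then performance data after the pipe.
            System.out.printf("JMX OK MessagesPerHour=%.1f | MessagesPerHour=%.1f%n", perHour, perHour);
        }
        Files.write(state, List.of(Long.toString(count), Long.toString(now)));
    }
}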
