NiFi PutSFTP 1.2.0 - Stop Option Not Always Displayed - apache-nifi

In NiFi 1.2.0, using a two-node cluster, I have a simple flow with two processors:
GenerateFlowFile 1.2.0 - Generates data files
PutSFTP 1.2.0 - Puts files via SFTP
Often after I've started both processors and let them run for a short while, I can stop the GenerateFlowFile processor, but I'm not able to stop (or start, for that matter) the PutSFTP processor. The Start and Stop items don't display in the context menu, and I can only view and not edit the processor's configuration. The PutSFTP processor's status icon indicates that it is stopped.
I'm not convinced that the behavior I'm seeing is specific to PutSFTP processors.
Why might this processor be "unstoppable"?

This isn't a direct answer to the question, but I just noticed that, when I refresh my browser, the PutSFTP processor is startable again. The problem seems to lie with the web UI failing to update the processor's context menu for some reason.
I'm using Chrome 62.0.3202.94 (64-bit).

Related

How to reload external configuration data in NiFi processor

I have a custom nifi processor that uses external data for some user controlled configuration. I want to know how to signal the processor to reload the data when it is changed.
I was thinking that a flowfile could be sent to signal the processor, but I am concerned that in a clustered environment only one processor would get the notification and all the others would still be running on the old configuration.
The most common ways to watch a file for changes are the JDK WatchService or Apache Commons IO Monitor...
https://www.baeldung.com/java-watchservice-vs-apache-commons-io-monitor-library
https://www.baeldung.com/java-nio2-watchservice
Your processor could use one of these and reload the data when the file changes; just make sure to synchronize access to the relevant fields between the code that reloads them and the code that uses them during execution.
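For illustration, here is a minimal, hypothetical sketch of that pattern, kept independent of the NiFi API (the class and method names are made up for the example): a background thread, which you would start from the processor (e.g. in an @OnScheduled method), watches the file with the JDK WatchService and swaps a freshly read snapshot into an AtomicReference, so the onTrigger code always sees a complete, consistent copy without explicit locking. On a cluster, each node watches its own copy of the file, so every node reloads independently.

import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class ReloadableConfig {

    private final Path configFile;
    private final AtomicReference<List<String>> snapshot = new AtomicReference<>();

    public ReloadableConfig(Path configFile) throws IOException {
        this.configFile = configFile;
        snapshot.set(Files.readAllLines(configFile));            // initial load
    }

    // Run this on a background thread, e.g. started from an @OnScheduled method.
    public void watch() throws IOException, InterruptedException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            configFile.getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
            while (!Thread.currentThread().isInterrupted()) {
                WatchKey key = watcher.take();                   // blocks until a change
                for (WatchEvent<?> event : key.pollEvents()) {
                    // event.context() is the file name relative to the watched directory
                    if (configFile.getFileName().equals(event.context())) {
                        snapshot.set(Files.readAllLines(configFile));  // atomic swap
                    }
                }
                key.reset();
            }
        }
    }

    // Call this from onTrigger(); it always returns a complete, consistent snapshot.
    public List<String> current() {
        return snapshot.get();
    }
}

Keeping an immutable snapshot behind an AtomicReference (or a volatile field) is the simplest way to satisfy the synchronization requirement mentioned above.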

Apache NIFI Job is not terminating automatically

I am new to the Apache NiFi tool. I am trying to import data from MongoDB and put that data into HDFS. I have created two processors, one for MongoDB and a second for HDFS, and I configured them correctly. The job runs successfully and stores the data in HDFS, but it should terminate automatically on success. Instead it keeps running and creates too many files in HDFS. I want to know how to make an on-demand job in NiFi and how to determine that a job is successful.
GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, and Limit. It has no way of tracking the execution progress, at least for now. What you can do, however, is change the Run Schedule and/or Scheduling Strategy. You can find them by right-clicking the processor and clicking Configure. By default, Run Schedule is 0 sec, which means the processor runs continuously. Changing it to, say, 60 min makes it run once every hour. It will still read the same documents from MongoDB every hour, but since you mentioned that you want to run it only once, I'm suggesting this approach.

How to debug a Flink application for memory and garbage collection?

I'm using Flink 1.1.4 and have added to flink-conf.yaml the configuration parameters for memory debugging, as stated in Memory and Performance Debugging:
taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logIntervalMs: 1000
After restarting Flink, I'm seeing the new parameters added to the Job Manager interface, but I'm unable to see any new logs.
Any idea about what I may be missing?
It seems this was resolved in this mailing list thread.
Key extracts, including one confirming that the exact settings were tested successfully:
That is exactly the right way to do it. Logging has to be at least INFO and the parameter "taskmanager.debug.memory.startLogThread" set to true. The log output should be under "org.apache.flink.runtime.taskmanager.TaskManager". Do you see other outputs for that class in the log?
Make sure you restarted the TaskManager processes after you changed the config file.
Someone else just used the memory logging with the exact described settings - it worked. There is probably some mixup; you may be looking into the wrong log file, or may be setting the value in a different config...
How do you start the Flink cluster? If it's a standalone cluster and you don't use a shared directory, then you'll find the log of the TaskManager on the machine on which the TaskManager runs. If you use YARN, then you can activate log aggregation to retrieve the log easily after the job has finished.
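To make the logging side concrete, here is a hedged example of what the log configuration would need to contain, assuming the default log4j setup that Flink 1.1 ships with in conf/log4j.properties (adjust the appender name to your own setup):

# conf/log4j.properties
log4j.rootLogger=INFO, file

# Redundant when the root logger is already at INFO, but makes the requirement explicit:
# the periodic memory log lines are emitted by this class and must not be filtered out.
log4j.logger.org.apache.flink.runtime.taskmanager.TaskManager=INFO

With taskmanager.debug.memory.startLogThread: true in flink-conf.yaml and the logger at INFO, restart the TaskManager processes and check the TaskManager log file on each worker machine (or the aggregated YARN logs) for the memory usage lines.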

Impala Open/Alive Sessions Monitoring

I have been looking around in the currently available documentation, as well as in the following UIs:
Cloudera Manager (Impala Tab)
Impala StateStore Web UI
Impala Catalog Server Web UI
for a place where I can see the currently open sessions.
Any idea where I could find it, or an alternative method for monitoring live Impala connections?
This is a significant weakness (at least of the version of CM we're using). The only solution I have found so far is:
First:
Cloudera Manager Home > Impala
Click Queries
This will show you all queries that are executing, and that have been executed within the selected time window. This detail is high-level, and we have found it often shows queries as "Executing" that have long since failed (this may have been fixed in a more recent version of CDH than the 5.2.6 we run).
In any case, this list will identify the nodes running impalad in the Coordinator field. To get much greater detail, access node-by-node. If the host running a query you were interested in was called node12, use
http://node12:25000/queries
and look for "In flight" queries at the top.
https://impala.apache.org/docs/build/html/topics/impala_webui.html
Sessions Page
By default, the sessions page of the debug web UI is at http://impala-server-hostname:25000/sessions (non-secure cluster) or https://impala-server-hostname:25000/sessions (secure cluster).
This page displays information about the sessions currently connected to this impalad instance. For example, sessions could include connections from the impala-shell command, JDBC or ODBC applications, or the Impala Query UI in the Hue web interface.
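If you want to keep an eye on this without clicking through every daemon, the debug pages can also be polled by a script. Below is a hypothetical sketch in Java 11 (the host names and the default debug port 25000 are assumptions; a secure cluster would need https plus the appropriate authentication) that fetches the /sessions page from each impalad; the same approach works for the /queries page mentioned above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class ImpalaSessionPoller {
    public static void main(String[] args) throws Exception {
        // Replace with the hosts that actually run impalad in your cluster.
        List<String> impaladHosts = List.of("node12", "node13");
        HttpClient client = HttpClient.newHttpClient();
        for (String host : impaladHosts) {
            // Non-secure cluster assumed; use https:// on a secured cluster.
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://" + host + ":25000/sessions")).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("== " + host + " ==");
            System.out.println(response.body());   // raw HTML listing the sessions connected to this impalad
        }
    }
}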

How to collect Hadoop userlogs?

I am running M/R jobs and logging errors when they occur, rather than making the job fail. There are only a few errors, but the job is run on a hadoop cluster with hundreds of nodes. How to search in task logs without having to manually open each task log in the web ui (jobtaskhistory)? In other words, how to automatically search in M/R task logs that are spread all over the cluster, stored in each node locally?
Side note first: 2.0.0 is oldy moldy (that's the "beta" version of 2.0); you should consider upgrading to a newer stack (e.g. 2.4, 2.5, 2.6).
Starting with 2.0, Hadoop implemented what's called "log aggregation" (though it's not what you might think: the logs are just stored on HDFS). There are a bunch of command-line tools that you can use to fetch the logs and analyze them without having to go through the UI. This is, in fact, much faster than the UI.
Check out this blog post for more information.
Unfortunately, even with the command-line tool, there's no way to get all task logs at once and pipe them to something like grep. You'll have to fetch each task log with a separate command. However, this is at least scriptable (there is a sketch of this after this answer).
The Hadoop community is working on a more robust log analysis tool that will not only store the job logs on HDFS but also let you perform search and other analyses on those logs. However, this tool is still a ways out.
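As a hedged illustration of the "scriptable" part mentioned above (the application id and the ERROR pattern are just placeholders): this sketch shells out to the real yarn logs command and keeps only the matching lines. It assumes log aggregation is enabled and that the yarn CLI is available on the machine where it runs.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class GrepAggregatedLogs {
    public static void main(String[] args) throws Exception {
        String appId = args[0];                              // e.g. an id printed by `yarn application -list`
        String pattern = args.length > 1 ? args[1] : "ERROR"; // default: keep error lines

        // Equivalent of: yarn logs -applicationId <appId> | grep <pattern>
        Process p = new ProcessBuilder("yarn", "logs", "-applicationId", appId)
                .redirectErrorStream(true)
                .start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            reader.lines()
                  .filter(line -> line.contains(pattern))
                  .forEach(System.out::println);
        }
        p.waitFor();
    }
}

Running it once per application id, for example over the ids printed by yarn application -list, effectively greps across all task logs of a job without opening the web UI.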
This is how we did it (at a large internet company): we made sure that only very critical messages were logged, and for those messages we actually did use System.err.println. Keep the aggregate messages per tracker/reducer to only a few KB.
The majority of messages should still use the standard log4j mechanism (which goes to the System logs area)
Go to your http://sandbox-hdp.hortonworks.com:8088/cluster/apps page.
There, look for the application run you are interested in, and for that entry click the History link (in the Tracking UI column).
Then look for the Logs link (in the Logs column) and click on it.
yarn logs -applicationId <myAppId> | grep ...
