NiFi blocked/hanging - apache-nifi

We're running a NiFi 1.7.1 server that picks up files via 3 GetFile processors and uploads them to the cloud.
When the server starts, it runs fine and chugs through any files that it finds. However, after running for a few days it seems to grind to a halt:
The GetFile processors all show 1 thread running, but they don't seem to be doing anything, even though there are files present in their source directories.
Nothing is waiting in any queues.
No messages appear in the logs.
The "top" command shows java using about 3% CPU and 21% mem. This is a 4-CPU server with 8GB of memory.
If I try to stop any processors via the web interface, it will become unresponsive. Upon reloading the page, I get the login screen, but after login it hangs on the loading animation, without showing the flow schema.
If I restart the NiFi service, it suddenly runs fine: it will pick up all the waiting files and not leave any threads hanging (according to the web interface). This will last another few days...
What is going on here? How can I resolve this?
Edit: The three GetFile processors each read from a different folder, but they all send their files to the same place. They are configured to pick up all files (filepattern .*), poll every 10 seconds, minimum file age 1 minute, don't keep the source file. I didn't touch the scheduling tab so it's just defaults.

Related

To get historical data of memory consumption by a particular process before it gets stopped

I have a process (it is a windows service). It throws bad_alloc exception and stops. Later it is being started by another monitoring tool. I want to see the memory related details specific to that process just before it stops.
Tools like Process Explorer and VMMap work on running processes, but once my process stops we lose the data. Is there any way to log this process's data until it stops, or for some set period?
I tried 2 options in VMmap for the same.
(a) The View Running Process option works fine, but it needs a regular 'Refresh' from the user, and if the process is stopped or restarted during a refresh (so that it now has a new PID), the previous data is lost.
(b) Launch and trace a new process (here I have the option of auto-refresh every second), but it is not able to start my Windows service.
Could you please suggest if there are any other ways for it?
I referred to multiple articles on this, but none of them helped in my case.
The reason for capturing logs is that these services run in production on customer machines, so we cannot analyse them at the time of the issue.
I am using Performance Monitor (PerfMon) to capture data specific to my process every 10 minutes. It gives me both historic data and current data.

Azure Website Kudu HTMLLog Analysis shows Always On with high response time

We deployed our Web API as an Azure website under the Standard plan and turned on Always On. After getting multiple memory and CPU alerts, we decided to check the logs via xyz.scm.azurewebsites.net. It seems Always On has a high response time. Could this be causing the high memory and CPU usage? Sometimes the alerts come when no one is even using the system, and they auto-resolve within 5 minutes.
The always on feature only invokes the root of your web app every 5 minutes.
If this is causing high memory or CPU usage, it could point to a memory leak within your application, because without the Always On feature your process gets recycled when idle.
You should check what your app does if you invoke it with the root path and determine why this is causing high response time.

502 server error in Google App Engine Flexible when load testing with JMeter

I have deployed a simple Spring Boot app in Google App Engine Flexible. The app has two APIs, one to add user data to the DB (xxx.appspot.com/add) and the other to get all the user data from the DB (xxx.appspot.com/all).
I wanted to see how GAE scales under load, so I used JMeter to generate load with 100 concurrent users ramped up over 10 seconds, calling these two APIs with a half-second delay, forever. It runs fine for some time (with just one instance), but it starts to fail after 30 seconds or so with a "java.net.SocketException" or "The server responded with a status of 502".
After this error, when I try to access the same API from the browser, it displays,
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The service is back to normal after 30 mins or so, and whenever the load test happens it repeats the same behavior as mentioned above. I expect GAE to auto-scale based on the load coming in to handle it without any down time (using multiple instances), instead it just crashes or blocks the service (without any information in the log). My app.yaml configuration is,
runtime: java
env: flex
service: hello-service
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 10
I am a bit stuck with this one. Any help would be greatly appreciated. Thanks in advance.
The solution was to increase the resource configuration, details below.
Given that I did not set a resource parameter, it defaulted to the predefined values for both CPU and memory. In this case, the default memory was set at 0.6GB. App Engine Flex instances use about 0.4GB for overhead processes. Given that Java is known to consume more memory, there is a great likelihood that the overhead processes consumed more than the approximate 0.4GB value. Instances in App Engine are restarted for a variety of reasons, including optimization due to memory use. This explains why your instances went down and why it shows Tomcat starting up (they got restarted), ending in a 502 error because nginx is not able to complete the request. Fixing the above may lessen, if not completely eliminate, the 502s.
After I specified the resources attribute and increased the configuration in app.yaml, the 502 error seems to be gone.
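For illustration, an app.yaml with an explicit resources section might look like the sketch below. The cpu, memory_gb and disk_size_gb values are assumptions chosen only as an example; the post does not say which sizes were actually used, so they should be tuned to the application's real memory footprint.

```yaml
runtime: java
env: flex
service: hello-service
# Illustrative values only; the original post does not state the sizes chosen.
resources:
  cpu: 2
  memory_gb: 4
  disk_size_gb: 10
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 10
```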

Service Fabric Resource balancer uses stale Reported load

While looking into the resource balancer and dynamic load metrics on Service Fabric, we ran into some questions (Running devbox SDK GA 2.0.135).
In the Service Fabric Explorer (the portal and the standalone application) we can see that balancing is run very often: most of the time it completes almost instantly, and this happens every second. However, the Load Metric Information on the nodes or partitions does not update its values as we report load.
We send a dynamic load report based on our interaction (an HTTP request to a service), increasing the reported load of a single partition by a large amount. This spike becomes visible within about 5 minutes, at which point the balancer actually starts balancing. This seems to be the interval at which the load data gets refreshed. The last reported time gets updated all the time, but without the new value.
We added the metrics to applicationmanifest and the clustermanifest to make sure it gets used in the balancing.
This means the resource balancer uses the same data for 5 minutes. Is this a configurable setting? Is it constrained because it is running on a devbox?
We tried a lot of variables in the clustermanifest, but none seem to affect this refresh time.
If this is not adjustable, can someone explain why you would run the balancer with stale data, and why this 5-minute interval was chosen?
This is indeed a configurable setting, and the default is 5 minutes. The idea behind it is that in prod you have tons of replicas all reporting load all the time, and so you want to batch them up so you don't spam the Cluster Resource Manager with all those as independent messages.
You're probably right in that this value is way too long for local development. We'll look into changing that for the local clusters, but in the meantime you can add the following to your local cluster manifest to change the amount of time we wait by default. If there are other settings already in there, just add the SendLoadReportInterval line. The value is in seconds and you can adjust it accordingly. The below would change the default load reporting interval from 5 minutes (300 seconds) to 1 minute (60 seconds).
<Section Name="ReconfigurationAgent">
  <Parameter Name="SendLoadReportInterval" Value="60" />
</Section>
Please note that doing so does increase load on some of the system services (TANSTAAFL), and as always if you're operating on a generated or complete cluster manifest be sure to Test-ServiceFabricClusterManifest before deploying it. If you're working with a local development cluster the easiest way to get it deployed is probably just to modify the cluster manifest template (by default here: "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\NonSecure\ClusterManifestTemplate.xml") and just add the line, then right click on the Service Fabric Local Cluster Manager in your system tray and select "Reset Local Cluster". This will regenerate the local cluster with your changes to the template.

Azure in role cache exceptions when service scales

I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the service scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure DB connection failures from the processing that sits behind the cache. (If I don't find the entry I want in the cache, I get it from the DB and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connection using the Transient Fault Handling application block.
The retry process uses a fixed interval, retrying every second for 5 tries.
The failures are typically:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
Any help appreciated.
I have also noticed that I am getting a high percentage (> 70%) cache miss rate and when the system is struggling, there is high cpu utilisation (> 80%).
Well, I haven't been able to find out any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
Looking at the last few days' processing stats, it is clear that the high CPU usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily, low cpu usage, low memory usage, no exceptions.
So, whilst I still have not discovered what the source of the problems was, I seem to have overcome them by providing a bigger environment for the service to run in.
--Late news!!! I noticed this morning that from about 06:30, the cpu usage started to climb, along with the time taken for the service to process as it should. Errors started appearing and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this: remote desktop to the instances shows no exceptions in the event log, and other logging is not showing application problems, so I am still stumped.
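On the question of a less aggressive retry policy: the fixed one-second, five-try policy described above retries at the same pace on every instance, which can add load while the cache cluster is still rebalancing. A common alternative is exponential backoff with jitter. The sketch below is language-neutral Python for illustration only; backoff_delays, call_with_retry and is_transient are hypothetical names, not part of the Transient Fault Handling application block or the Azure caching API, and whether this would cure the exceptions seen here is untested.

```python
import random
import time


def backoff_delays(max_tries=5, base=1.0, cap=30.0, jitter=True, rng=random.random):
    """Yield the wait before each retry: max_tries attempts means max_tries - 1 waits."""
    for attempt in range(max_tries - 1):
        # Double the delay on every retry, but never exceed the cap.
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # Full jitter: pick a random wait in [0, delay) so instances do not retry in lockstep.
            delay = delay * rng()
        yield delay


def call_with_retry(operation, is_transient, max_tries=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    delays = backoff_delays(max_tries, base, cap)
    while True:
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc):
                raise  # non-transient errors are not retried
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

Without jitter, the waits for five tries would be 1, 2, 4 and 8 seconds, compared with four waits of 1 second each under the fixed-interval policy. Here is_transient would be a caller-supplied check, for example one that recognises the "temporary failure, please retry later" cache error.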
