ConsumeAzureEventHub 1.15.3 processor suddenly stops consuming data - apache-nifi

The ConsumeAzureEventHub 1.15.3 processor suddenly stops consuming data, and NiFi logs the error below.
error The connection was inactive for more than the allowed 60000 milliseconds and is closed by container 'LinkTracker'
How can I fix this issue?
I tried stopping and restarting the processor, but that didn't work, so I restarted the NiFi service.

Related

Azure VM is closing the idle sessions in my postgres database

I am creating a VM in Azure to host a Postgres instance in Docker and connect to it from my local Spring backend. What happens is that, once connected to the DB, after some time of inactivity any request fails with the following: "HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#f162126 (This connection has been closed.). Possibly consider using a shorter maxLifetime value." Digging around, I realized that my VM behaves as if it closes any connection that becomes inactive, causing the above error. The curious thing is that the sessions are not closed, as you can see in the attached image: even after shutting down my backend the sessions are maintained, and the only way to delete them is to restart the container in which the DB is hosted.
I have tried to reproduce this behavior locally, but it never happens: even if I leave the backend idle for an hour, a request to the DB works as if nothing happened. It only happens with my VM in Azure.
I want to clarify that the sessions that appear in the attached image no longer work, i.e. if I try to query the DB from Spring, the error I mentioned appears and Hikari automatically creates new sessions for its pool. I can reproduce this behavior until reaching 100 sessions that, after a while, stop working again and that Spring never closes when shutting down the backend.
HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#f162126 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
This error is thrown by the isConnectionDead method, which checks whether the connection is still alive and usable; it raises the above error if the connection has already been closed.
You can adjust your maxLifetime setting to resolve this problem. The shortest permitted value is 30000 ms (30 seconds); the default is 1800000 ms (30 minutes).
A connection in the pool can only live for the amount of time controlled by the maxLifetime property. Its value should be several seconds below any connection time limit imposed by the infrastructure or the database.
Reference: HikariCP configuration on GitHub.
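As a minimal sketch (the JDBC URL and credentials are placeholders, not from the original question), configuring maxLifetime in Java so pooled connections are retired before an infrastructure idle cutoff such as Azure's 4-minute default might look like:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolConfig {
    public static HikariDataSource newDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://my-azure-vm:5432/mydb"); // placeholder URL
        config.setUsername("app");      // placeholder credentials
        config.setPassword("secret");
        // Keep maxLifetime several seconds below the infrastructure's idle
        // cutoff (Azure's load balancer closes idle connections after 4 min).
        config.setMaxLifetime(230_000); // ~3 min 50 s, in milliseconds
        return new HikariDataSource(config);
    }
}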
Well, after much research and reviewing various sources, it turns out that Azure applies certain security policies when creating a VM, as Pedro Perez says in the following Stack Exchange post: Azure closing idle network connections
You're hitting a design feature of the software load balancer in front of your VMs. By default it will close any idle connections after 4 minutes, but you can configure the timeout to be anything between 4 and 30 minutes.
So, in order to override the policy that governs your VM, you must create a load balancer, do all the relevant configuration, and create a load-balancing rule for port 5432 (the default Postgres port), setting the Idle timeout anywhere in the 4-to-30-minute range according to your needs.
Finally, configure your VM so that its public IP points to the LB (load balancer) public IP, and everything will work normally.
It is true that if you simply want to live with Azure's default security policies on the VMs you create, you should set maxLifetime to a maximum of 4 minutes in your Spring application.properties or application.yml, as @PratikLad says.
In my case I prefer to keep the default Hikari configuration (maxLifetime of 30 minutes), so I need to create the LB; but if you prefer to change the property, setting it to a maximum of 4 minutes, you would not need to do any of the load-balancer setup mentioned above.
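For the property-based alternative, a Spring Boot application.properties sketch (the exact value is an assumption; anything safely under 4 minutes works):

# Keep pooled connections shorter-lived than Azure's 4-minute idle cutoff
spring.datasource.hikari.max-lifetime=230000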

The FIX client can receive incoming messages but cannot send outgoing heartbeat messages

We have built a FIX client. The client can receive incoming messages but cannot send outgoing heartbeat messages or reply to TestRequest messages; after the last heartbeat was sent, something triggered the client side to stop sending heartbeats altogether.
FIX version: FIX 5.0
The same incident happened before, and we have a tcpdump for one session from that time.
We deploy every FIX session to a separate k8s pod.
We suspected a CPU resource issue because the load average was high around the issue time, but it was not solved after we added more CPU cores. We think the load average was high because of the FIX reconnections.
We suspected an I/O issue because we use AWS EFS, shared by the 3 sessions, for logging and the message store. But it was still not solved after we used pod affinity to assign the 3 sessions to different nodes.
It is not a network issue either, since we can receive FIX messages and other sessions worked well at that time. We have disabled SNAT in the k8s cluster too.
We are using QuickFIX/J 2.2.0 to create the FIX client. We have 3 sessions, which are deployed to k8s pods on separate nodes:
a rate session to get FX prices from the server
an order session to get transaction (execution report) messages from the server; we only send logon/heartbeat/logout messages to the server
a back-office session to get market status
We use the Apache Camel quickfixj component to make our programming easier. It works well most of the time, but reconnections to the FIX servers keep happening across the 3 sessions, roughly once a month; mostly only 2 sessions have issues.
heartbeatInt = 30s
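For context, here is a minimal QuickFIX/J initiator configuration sketch consistent with the CompIDs, host/port, and logon flags visible in the logs below (the file paths are assumptions, not our actual values):

[DEFAULT]
ConnectionType=initiator
ReconnectInterval=5
# Assumed paths; in our deployment the store and log lived on AWS EFS
FileStorePath=/data/fix/store
FileLogPath=/data/fix/log
StartTime=00:00:00
EndTime=00:00:00

# CompIDs, host/port, ResetOnLogon (141=Y) and DefaultApplVerID (1137=9,
# i.e. FIX.5.0SP2) match the message excerpts below
[SESSION]
BeginString=FIXT.1.1
DefaultApplVerID=FIX.5.0SP2
SenderCompID=TA_Quote1
TargetCompID=Quote1
SocketConnectHost=10.60.45.132
SocketConnectPort=11050
HeartBtInt=30
ResetOnLogon=Y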
The FIX event messages at the client side:
20201004-21:10:53.203 Already disconnected: Verifying message failed: quickfix.SessionException: Logon state is not valid for message (MsgType=1)
20201004-21:10:53.271 MINA session created: local=/172.28.65.164:44974, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/10.60.45.132:11050
20201004-21:10:53.537 Initiated logon request
20201004-21:10:53.643 Setting DefaultApplVerID (1137=9) from Logon
20201004-21:10:53.643 Logon contains ResetSeqNumFlag=Y, resetting sequence numbers to 1
20201004-21:10:53.643 Received logon
The FIX incoming messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=Quote1☺52=20201004-21:09:02.887☺56=TA_Quote1☺10=186☺
8=FIXT.1.1☺9=65☺35=0☺34=2514☺49=Quote1☺52=20201004-21:09:33.089☺56=TA_Quote1☺10=185☺
8=FIXT.1.1☺9=74☺35=1☺34=2515☺49=Quote1☺52=20201004-21:09:48.090☺56=TA_Quote1☺112=TEST☺10=203☺
----- 21:10:53.203 Already disconnected ----
8=FIXT.1.1☺9=87☺35=A☺34=1☺49=Quote1☺52=20201004-21:10:53.639☺56=TA_Quote1☺98=0☺108=30☺141=Y☺1137=9☺10=183☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=Quote1☺52=20201004-21:11:23.887☺56=TA_Quote1☺10=026☺
The FIX outgoing messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=TA_Quote1☺52=20201004-21:09:02.884☺56=Quote1☺10=183☺
---- no heartbeat message around 21:09:32 ----
---- 21:10:53.203 Already disconnected ---
8=FIXT.1.1☺9=134☺35=A☺34=1☺49=TA_Quote1☺52=20201004-21:10:53.433☺56=Quote1☺98=0☺108=30☺141=Y☺553=xxxx☺554=xxxxx☺1137=9☺10=098☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=TA_Quote1☺52=20201004-21:11:23.884☺56=Quote1☺10=023☺
8=FIXT.1.1☺9=62☺35=0☺34=3☺49=TA_Quote1☺52=20201004-21:11:53.884☺56=Quote1☺10=027☺
Thread dump taken when the TEST message from the server was received. BTW, the gist is from our development environment, which has the same deployment:
https://gist.github.com/hitxiang/345c8f699b4ad1271749e00b7517bef6
We had enabled debug logging in QuickFIX/J, but it gave little information, only logs for messages received.
The sequence, in time order:
20201101-23:56:02.742 The outgoing heartbeat should be sent at this time; it looks like it is being sent but hung at I/O writing - in Running state
20201101-23:56:18.651 test message from server side to trigger thread dump
20201101-22:57:45.654 server side began to close the connection
20201101-22:57:46.727 thread dump - right
20201101-23:57:48.363 logon message
20201101-22:58:56.515 thread dump - left
The right (2020-11-01T22:57:46.727Z): when it hangs. The left (2020-11-01T22:58:56.515Z): after reconnection.
It looks like the storage - the AWS EFS we are using - made the issue happen.
But the feedback from AWS support is that nothing is wrong on the AWS EFS side.
Maybe it is a network issue between the k8s EC2 instances and AWS EFS.
First, we made logging asynchronous for all sessions, which made the disconnections happen less often.
Second, for the market session, we wrote the sequence files to local disk; the disconnections were gone for the market session.
Third, at last we replaced AWS EFS with AWS EBS (a persistent volume in k8s) for all sessions; a sketch follows below. It works great now.
BTW, AWS EBS is not highly available across zones, but that is better than FIX disconnections.
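As an illustration of the third step, a minimal EBS-backed PersistentVolumeClaim sketch (the claim name and storage class are assumptions; gp2 is the typical EBS-backed class on EKS):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fix-session-store    # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce          # an EBS volume attaches to a single node, unlike EFS
  storageClassName: gp2      # assumed EBS-backed storage class
  resources:
    requests:
      storage: 10Gi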

JMeter throwing "SocketException: Connection reset" during the tests, but no error can be seen in the server log [duplicate]

The target for us to achieve is 500 concurrent users.
We have tried running a test for 100 users over 3 machines, and it ran fine without any errors.
When I tried running the test for 150 or more users with the same number of machines, I started getting the following response:
Response code: Non HTTP response code: java.net.SocketException
Response message: Connection reset
I have also tried increasing the number of machines to 8, but it is of no help. The response time is also very high (156 seconds) for some of the requests.
When we checked the server logs to find out what could be causing this issue, no error logs were found there for the time of the execution.
I'm having a hard time finding out what the issue could be, and the server side is ruling out any issue on their end.
Fixes tried on the JMeter side:
Increasing the heap size
Changing the retry count in the user.properties file
Setting Boolean=True in the hc.parameters file
Using HTTP Request Defaults to change the implementation to HttpClient4
CPU Config:
Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60 GHz (2 processors)
5 GB RAM
64-bit Operating System
The Connection reset error means a failed attempt to write to a socket which has already been closed; at the TCP protocol level it means receiving a TCP RST.
It might be the case that JMeter is closing the connection prematurely: JMeter 5.0 had the httpclient4.time_to_live property set to 2000, and if you're seeing response times > 2 seconds (and you do), most probably JMeter is closing the connection before getting the full response.
You can try increasing this setting to 60000, which matches modern browsers' defaults, or even higher to match your application's response time (if you think 3 minutes is acceptable), or consider upgrading to JMeter 5.3, which has a better default value.
More information: Bug 64289
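As a sketch, the property can be raised in JMeter's user.properties (60000 is the value suggested above):

# Keep pooled connections alive for 60 s instead of JMeter 5.0's 2 s default
httpclient4.time_to_live=60000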

SocketException: Broken pipe upon upgrading httpclient jar version to 4.5.3

I am getting a socket exception for a broken pipe on my client side.
[write] I/O error: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe (Write failed)
[LoggingManagedHttpClientConnection::shutdown] http-outgoing-278: Shutdown connection
1520546494584[20180308 23:01:34] [ConnectionHolder::abortConnection] Connection discarded
1520546494584[20180308 23:01:34] [BasicHttpClientConnectionManager::releaseConnection] Releasing connection [Not bound]
It seems that the upgrade of the httpclient jar is causing the issue.
The issue does not occur with httpclient-4.3.2.
The exception occurs every 2 minutes; the issue is intermittent at times.
After sending expect:100-continue, conn.flush throws the exception.
The client and server are Linux machines.
The client uses the httpclient jar to make requests to the server's REST API.
Please help me debug the issue.
Can the httpclient jar cause such an issue?
Persistent connections that are kept alive by the connection manager can become stale. That is, the target server shuts down the connection on its end while it is idle, without HttpClient being able to react to that event, rendering the connection half-closed or 'stale'.
This is a general limitation of blocking I/O in Java. There is simply no way of finding out whether or not the opposite endpoint has closed the connection other than by attempting to read from the socket.
If a stale connection is used to transmit a request message, the request execution usually fails in the write operation with a SocketException and gets automatically retried.
Apache HttpClient works around this problem by employing the so-called stale connection check, which is essentially a very brief read operation. However, the check can be, and often is, disabled; in fact, it is often advisable to disable it due to the extra latency it introduces.
The handling of stale connections was changed in version 4.4. Previously, the code would check every connection by default before re-using it. Now the code only checks the connection if the elapsed time since its last use exceeds the timeout that has been set. The default timeout is 2000 ms.
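For illustration, a minimal sketch (the class name and target URL are placeholders) of re-tuning that validation on HttpClient 4.4+ via PoolingHttpClientConnectionManager, plus proactive eviction of idle connections:

import java.util.concurrent.TimeUnit;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class StaleCheckExample {
    public static void main(String[] args) throws Exception {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Re-validate connections that have been idle for more than 1 s
        // before leasing them from the pool (the 4.4+ default is 2000 ms).
        cm.setValidateAfterInactivity(1000);

        CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                // Evict connections idle longer than 90 s so they never
                // outlive the server's keep-alive window.
                .evictIdleConnections(90, TimeUnit.SECONDS)
                .build();

        try (CloseableHttpResponse response = client.execute(new HttpGet("https://example.com"))) {
            System.out.println(response.getStatusLine());
        }
    }
}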

How to use the Wait/Notify processors?

I have a running NiFi 1.2.0 instance.
I just tried to use the Wait/Notify processors following this reference:
http://ijokarumawak.github.io/nifi/2017/02/02/nifi-notify-batch/#why-mergecontent-dont-suffice
I dragged the template onto the canvas and tried running it.
It shows the following error on the Wait/Notify processor:
Unable to communicate with cache when processing StandardFlowFileRecord[uuid=faab337f-034c-4137-a2f3-abb46f22b474,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1494485406343-1, container=default, section=1], offset=0, length=7005603],offset=5280310,name=input.csv,size=1054261] due to java.net.ConnectException: Connection refused: connect:
I don't know what this error means.
I haven't applied any patches to 1.2.0; I just downloaded the binary and tried it.
Please point out anything I have missed, and guide me to solve this.
You will need to create and start a DistributedMapCacheServer and DistributedMapCacheClient. The client needs to be configured with the port and host that the server is listening on.
Then the Wait and Notify processors use the DistributedMapCacheClient.
Since you were able to start the processors, you likely already have the client setup, but you don't have the server running.
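For example (values here are typical defaults, not from the original post): in NiFi's Controller Services, add a DistributedMapCacheServer, leave its Port at 4557, and enable it; then add a DistributedMapCacheClientService with Server Hostname set to the node running the server (e.g. localhost) and Server Port 4557, enable it, and select it as the Distributed Cache Service of both the Wait and Notify processors.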
