PPPD connection without interruption - SMS

I am trying to get a system working with a modem and the following services:
permanent data connection (2G/3G/4G) using the old serial interface (not QMI/WMI)
permanent polling via AT commands to receive SMS and gather monitoring information such as signal strength, provider, and cell
Can this modem stay connected 24/7 without any interruption? Currently I see one or two small data interruptions (1 to 10 minutes) almost every day.
The main question is: can a modem stay connected to the provider 24/7 without interruption, or is it standard behavior to be disconnected from time to time?
Additional note: I have multiple devices, and pppd fails within 10 minutes on all devices.
I have collected some logs, and I can see the system is disconnected every ~720 minutes (= 12 hours):
Sep 14 20:23:02 daemon.info pppd[1905]: Connect time 718.6 minutes.
Sep 15 08:23:02 daemon.info pppd[19903]: Connect time 719.9 minutes.
Sep 15 20:23:03 daemon.info pppd[2493]: Connect time 719.9 minutes.
Sep 16 08:23:03 daemon.info pppd[16865]: Connect time 719.9 minutes.
Sep 16 20:23:03 daemon.info pppd[31234]: Connect time 719.8 minutes.
Sep 17 08:23:03 daemon.info pppd[13827]: Connect time 719.8 minutes.

This depends on the provider / cell: the network will de-register your modem every X hours (in my case, every 12 hours).
The best solution I have found is to set the persist option on pppd. The system takes about a minute to re-register, but pppd stays alive.
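As a sketch, a minimal pppd peers file with persist enabled might look like this. The file name, device, speed, and chat script path are placeholders for your setup; persist, holdoff, and maxfail are standard pppd options:

```
# /etc/ppp/peers/mobile -- hypothetical peer file
/dev/ttyUSB0 115200
persist          # do not exit after a disconnect; redial instead
holdoff 10       # wait 10 seconds before redialing
maxfail 0        # retry forever instead of giving up after failed attempts
noauth
defaultroute
usepeerdns
connect "/usr/sbin/chat -v -f /etc/ppp/chat-mobile"
```

Started with `pppd call mobile`, the 12-hour de-registration then shows up as a roughly one-minute gap instead of a dead pppd process.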


packets.go:123: closing bad idle connection: connection reset by peer

I am using Go, the Fiber web framework, MariaDB 10.6, Debian 11, and github.com/go-sql-driver/mysql to connect to MariaDB. I have played with these settings:
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(25)
db.SetConnMaxLifetime(5 * time.Minute)
i.e. I increased the values and decreased them, but I still get one or two warnings like
packets.go:123: closing bad idle connection: connection reset by peer
per minute. Any suggestions?
The answer: I had wait_timeout set to 20 seconds and interactive_timeout to 50 seconds. I increased them and now it's fixed; thanks to @ysth for the solution.
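As a sketch, the server-side timeouts can be inspected and raised from a MariaDB session (28800 seconds is the usual default; pick something well above the pool's ConnMaxLifetime):

```sql
-- Check the current timeouts (defaults are usually 28800 s = 8 h)
SHOW VARIABLES LIKE '%_timeout';

-- Raise them well above the pool's ConnMaxLifetime (5 minutes above)
SET GLOBAL wait_timeout = 28800;
SET GLOBAL interactive_timeout = 28800;
```

Equivalently, keep db.SetConnMaxLifetime comfortably below the server's wait_timeout so the driver retires idle connections before the server drops them. Note that SET GLOBAL only lasts until a server restart; persist the values in the MariaDB config under [mysqld].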

Time drift between processes on the same virtual guest (VMWare host and windows guest)

I'm struggling a bit with time drift between processes on the same virtual guest.
In real life we have around 50 processes sending messages to each other, and the drift
makes reading the logs hard.
To illustrate my problem I am running two processes I wrote on a Windows server 2019.
Process 1 (time_client, TC) finds out what time it is,
and then sends a string representing the current time to
process 2 (the server, TS) via a named pipe.
The string sent looks like '23-May-2022 14:26:55.608'
The printout from TS looks like:
23-May-2022 13:03:29.344 -
23-May-2022 14:39:57.396 -
23-May-2022 14:39:57.492
diff is 00000:00:00:00.096
server is ahead FALSE
where diff is days:hours:minutes:seconds.milliseconds
TS - the server - does the following:
save the time when the process is started
then, upon arrival of a time_string from TC:
get the time from the OS
print the start time, the time from the OS, the time from TC, and the diff (time_from_os - time_string_from_TC)
TC sends a new time_string every minute. TC is started for each invocation and runs to completion, so every time it is a new instance running.
I notice the following after ca. 2 hrs:
win server 2019 on VMWare
diff grows to ca 150 ms after 1 hour
win server 2016 on VMWare
diff grows to ca 50 ms after 1 hour
Linux Rocky 8.6 on VirtualBox - Host Win 10
diff grows to ca 0 ms after 2 hours
The drift between the two processes on Windows is very annoying since it messes up the logs completely.
One process creates an event and sends a message to another process, which handles it several seconds earlier - according to the logs.
The processes are usually up for months, except for some communication processes that are restarted at least daily due to lost communication.
So - it is not that the guest is out of sync - that would be OK.
It is that the processes get a different value of 'now' depending on how long they have been running.
Is this a well-known problem? I'm having a hard time googling it.
The problem could be within VMWare or within Windows.

Postgres connect time delay on Windows

There is a long delay between "forked new backend" and "connection received", from about 200 to 13000 ms. Postgres 12.2, Windows Server 2016.
During this delay the client is waiting for the network packet to start the authentication. Example:
14:26:33.312 CEST 3184 DEBUG: forked new backend, pid=4904 socket=5340
14:26:33.771 CEST 172.30.100.238 [unknown] 4904 LOG: connection received: host=* port=56983
This was discussed earlier here:
PostgreSQL slow connect time on Windows
But I have not found a solution.
After rebooting the server the delay is much shorter, about 50 ms. Then it gradually increases in the course of a few hours. There are about 100 clients connected.
I use IP addresses only in "pg_hba.conf". "log_hostname" is off.
There is BitDefender running on the server but switching it off did not help. Further, Postgres files are excluded from BitDefender checks.
I used Process Monitor, which revealed the following: forking the postgres.exe process takes 3 to 4 ms. Then, after loading DLLs, postgres.exe looks for custom and extended locale info for 648 locales. It finds none of these. This locale search takes 560 ms (there is a gap of 420 ms, though). Perhaps this step can be skipped by setting a connection parameter. After reading some TCP/IP parameters, there are no events for 388 ms. This time period overlaps the 420 ms mentioned above. Then postgres.exe creates a thread. The total connection time measured by the client was 823 ms.
Locale example, performed 648 times:
"02.9760160","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale","REPARSE","Desired Access: Read"
"02.9760500","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale","SUCCESS","Desired Access: Read"
"02.9760673","RegQueryValue","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale\bg-BG","NAME NOT FOUND","Length: 532"
"02.9760827","RegCloseKey","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale","SUCCESS",""
"02.9761052","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale","REPARSE","Desired Access: Read"
"02.9761309","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale","SUCCESS","Desired Access: Read"
"02.9761502","RegQueryValue","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale\bg-BG","NAME NOT FOUND","Length: 532"
"02.9761688","RegCloseKey","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale","SUCCESS",""
No events for 388 ms:
"03.0988152","RegCloseKey","HKLM\System\CurrentControlSet\Services\Tcpip6\Parameters\Winsock","SUCCESS",""
"03.4869332","Thread Create","","SUCCESS","Thread ID: 2036"

Go Kafka `ProduceChannel()` Fills Up and Hangs

I have a server-side app written in Go producing Kafka events. It runs perfectly for days, producing ~1.6k msg/sec, and then hits a sporadic problem, where all Kafka message sending stops, and the server app needs to be manually restarted for Kafka messages to resume sending.
I've included a screenshot of the metric graphs when the incident started. To annotate what I see happening:
For seven days, the app runs perfectly. For every message queued, there is a delivery event notification sent to kafkaProducer.Events(). You can see that num queued = num delivered.
10:39: The issue starts. The delivery notification count quickly drops to zero. Kafka messages keep getting queued, but the callbacks stop.
10:52: kafkaProducer.ProduceChannel() is filled up, and attempts to queue new messages into the Go channel block the goroutine. At this point the app will never send another Kafka message until it is manually restarted.
17:55: I manually restarted the application. kafka message queue/delivery resumes. kafka_produce_attempts drops back to zero.
The one and only place my Go code sends Kafka messages is here:
recordChannelGauge.Inc()
kafkaProducer.ProduceChannel() <- &msg
recordChannelGauge.Dec()
In the metric screenshot, note that recordChannelGauge normally stays at zero because sending a message to ProduceChannel() doesn't block, and each Inc() is immediately followed by a matching Dec(). However, when ProduceChannel() fills up, the goroutine blocks, recordChannelGauge stays at 1, and the goroutine never unblocks until the app is manually restarted.
FYI, my environment details:
Go server binary built with golang 1.10.x
The latest version of github.com/confluentinc/confluent-kafka-go/kafka. This library doesn't use version tags; it's pinned to the latest git commit, which as of this writing is 2 months old, so I'm sure I'm on the latest version.
Server OS Ubuntu 16.04.5
librdkafka1 version librdka0.11.6~1confluent5.0.1-
I suspect this is due to some internal problem in the confluentinc go client, where it doesn't handle some error scenario appropriately.
Also, I see no relevant log output around the time of the problem. I do see sporadic Kafka broker disconnect and timeout errors in the logs before the problem happened, but they don't seem serious. These log messages appeared every few hours or so for days without serious consequence.
Nov 26 06:52:04 01 appserver.linux[6550]: %4|1543215124.447|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 06:52:10 01 appserver.linux[6550]: %4|1543215130.448|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:46:57 01 appserver.linux[6550]: 2018/11/26 08:46:57 Ignored event: kafka-broker-2:9092/bootstrap: Disconnected (after 600000ms in state UP)
Nov 26 08:47:02 01 appserver.linux[6550]: %4|1543222022.803|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:47:09 01 appserver.linux[6550]: %4|1543222029.807|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Zoomed in to problem occurrence
Zoomed out to show before and after
I have a similar problem to yours, and I found an article that might explain the cause.
When there is no message in the blocked topic, after a certain period of time you will get a timeout error like the one below.
%5|1598190018.518|REQTMOUT|rdkafka#consumer-1| [thrd:sasl_ssl://abcd....confluent.cloud:xxxx/2]: sasl_ssl://abcd....confluent.cloud:xxxx/2: Timed out FetchRequest in flight (after 359947ms, timeout #0)
%4|1598190018.840|REQTMOUT|rdkafka#consumer-1| [thrd:sasl_ssl://abcd.confluent.cloud:xxxx/2]: sasl_ssl://abcd.xxxxx.confluent.cloud:xxxx/2: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Link to the article: https://www.thecodebuzz.com/apache-kafka-net-client-producer-consumer-csharp-confluent-examples-ii/
I hope it can be of some help to you.

Socket Exception while running load test with Self Provisioned test rig

I am getting a SocketException while running a load test on a self-provisioned test rig.
I trigger those load tests on the agent machine (self-provisioned test rig) from my local machine.
Note: for the first 2 to 3 minutes test iterations pass; after that we get the SocketException.
Below is the error message :
A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection
failed because connected host has failed to respond.
Below are the stack trace details :
at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult) at
System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure,
Socket s4, Socket s6, Socket& socket, IPAddress& address,
ConnectSocketState state, IAsyncResult asyncResult, Exception&
exception)
Run Time - 20min
Sample rate - 10sec
warm up duration 10sec
number of agents used - 2
Load pattern :
initial load - 10user
max user count - 300
step duration - 10sec
step user count - 10
Changing the above values, I still get the exception in the same way.
I am using Visual Studio 2015 Enterprise.
The question states: start with 10 users, every 10 seconds add 10 users to a maximum of 300. Thus after 29 increments there will be 300 users and that will take 29*10 seconds which is 4m50s. The test will thus (attempt to) run with the maximum load of 300 users for the remaining 15m10s.
Given that all tests pass for the first 2 or 3 minutes, plus the error message, it appears that you are overloading some part of the network. It might be the agents, the servers, or the connections between them. Some network components have a maximum number of active connections, and 300 users might be too many.
Ramping up the load so rapidly means you do not clearly know what the limiting value is. The sampling rate (every 10 seconds) also seems high. At each sampling interval a lot of data is transferred (i.e. the sample data), and that can swamp parts of the network. You should look at the network counters for the agents and controller, and also the servers if available.
I recommend changing the load test steps to add 10 users every 30 seconds, so it takes about 15 minutes to reach 300 users. It may also be worth reducing the sample rate to every 20 seconds.
