Are there any hints when imudp loses messages? - rsyslog

For example, when the imjournal module drops messages due to rate-limiting, we can see "imjournal: <N> messages lost due to rate-limiting" in the journalctl -u rsyslog.service output.
Are there any methods to find out why imudp lost messages (packet loss, rate limiting, buffer overflow, etc.)?
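For context, a minimal sketch of OS-level checks that can at least reveal UDP receive drops on a Linux host (these are kernel counters, not rsyslog-specific hints, and the exact counter names vary by kernel version; rsyslog's impstats module is another option for per-input and queue statistics):
netstat -su                  # look for "packet receive errors" / "receive buffer errors" under Udp:
cat /proc/net/udp            # the trailing "drops" column counts datagrams dropped per socket
ss -ulmpn                    # per-socket receive queue, buffer sizes and owning process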

Related

What do the fields in rpcdebug -c's dmesg output mean?

I'm trying to track down a stall that may or may not be on the client host; it may instead be on the server side. Unfortunately this is all at the kernel RPC level, with some of it, if not all of it, not controlled by my code...
I'm something of a neophyte when it comes to rpc debugging, so any references would be helpful.
I have this output in dmesg when rpcdebug -c -m all is run:
[2109401.599881] -pid- flgs status -client- --rqstp- -timeout ---ops--
[2109401.600055] 51580 0880 0 ffff9af4c4da9800 ffff9af4c4416600 15000 ffffffffc0b26680 nfsv3 GETATTR a:call_status [sunrpc] q:xprt_pending
[2109401.600300] 51581 0880 0 ffff9af4c4da9800 ffff9af42465f800 15000 ffffffffc0b26680 nfsv3 GETATTR a:call_status [sunrpc] q:xprt_pending
I get the PID, flags, and timeout, but:
What are the "client" and "rqstp" fields supposed to mean? If I see either duplicated in subsequent outputs, does that mean the RPC is stalled? And which way?
Is "xprt_pending" a "waiting to send the RPC" queue? If that queue was "delayq," I would know what that means in the context of the problem we're trying to diagnose. But this state doesn't seem to be explained anywhere I find in Google. (And my Google-Fu is usually better than this...)
What is the "ffffffffc0b26680" supposed to be? It repeats all over the output, for EVERY RPC listed.
I'm trying to avoid running with rpcdebug set, because I'm dealing with an intermittent stall, and I'd rather not slow EVERYTHING down in the hopes of catching the stall.
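If some debugging turns out to be unavoidable, one way to limit the slowdown is to enable only a few flag groups instead of all; a sketch, with flag names as listed in the rpcdebug man page (check the man page or rpcdebug -vh on your system for the exact set):
rpcdebug -m rpc -s xprt call    # enable only transport and call tracing for the rpc (sunrpc) module
rpcdebug -m rpc                 # show which flags are currently set
rpcdebug -m rpc -c              # clear all rpc debug flags again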

IBM MQ version 7.5 error AMQ7472: Object %CHLBATCH.706, type scratchpad damaged

We are currently having an issue with an MQ cluster where a CLUSSDR channel is going into retry because the receiving MQ object is showing as damaged.
The configuration is many QMGRs (STAT00-11) sending messages to a cluster of 4 QMGRs: 2 full repositories (HUB01-02) and 2 partial repositories (HUB03-04).
The problem is that on the STAT02 QMGR the CLUSSDR channel to HUB01 is in a retry state
with the MQ log error:
AMQ9506: Message receipt confirmation failed.
and on HUB01 the MQ log errors:
AMQ7472: Object %CHLBATCH.706, type scratchpad damaged. (many)
AMQ9999: Channel 'TO_HUB01' to host 'server02 (n.n.n.n)' ended abnormally.
AMQ9588: Program cannot update queue manager object. (single instance)
AMQ9587: Program cannot open queue manager object (many)
I have now stopped the CLUSSDR on STAT02 to HUB01 and there are no longer any log entries; however, as the QMGRs use linear logging, the log files are not being released on the HUB01 QMGR.
This has introduced a new error:
AMQ7084: Object syncfile, type syncfile damaged.
which is filling up the disk.
I have so far tried to recover the damaged object; the command used on the HUB01 QMGR was
rcrmqobj -m HUB01 -t channel TO_STAT02
and this returned the result "AMQ7085: Object TO_STAT02, type channel not found.", although the following results contradict this:
DIS CLUSQMGR(STAT*) CHANNEL
outputs a list of all the STAT* QMGRs, which includes the TO_STAT02 channel
and the channel status
DIS CHS(TO_STAT*) STATUS
shows all the channels in a RUNNING state, including the supposedly non-existent TO_STAT02.
Has anyone had similar issues? Please note that this is the second occurrence we have had in the last month, on different clusters, and last time we had to take the drastic action of rebuilding the QMGR once the disk space was exhausted and the QMGR crashed.
rcrmqobj -m HUB01 -t syncfile
is the correct way to rebuild a corrupt syncfile, and if using linear logging this will also repair any damaged scratchpad objects. Damaged scratchpad objects should only ever occur through operational or filesystem error, for example if files were deleted or partially restored from backup, so if you are seeing a large number of them you should try to identify the root cause.
rcrmqobj -t channel will be able to recover damage to channel object definitions, but here it is the synchronization data and its index (syncfile) that is damaged/missing. TO_STAT02 sounds like it is a cluster sender channel that MQ clustering maintains from information shared within the cluster - you can check whether a cluster channel has a local channel definition using DEFTYPE on DISPLAY CLUSQMGR.
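For illustration, that check might look like the following in runmqsc on HUB01 (a sketch using the queue manager and channel names from the question; a DEFTYPE of CLUSSDRA or CLUSSDRB indicates an automatically defined cluster sender, while CLUSSDR means a local channel definition exists):
runmqsc HUB01
DIS CLUSQMGR(STAT02) CHANNEL DEFTYPE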

Weird messages "rsyslogd: msg: ruleset ' &è·Æ ' could not be found and could not be assgined to message object" in rsyslog logs

We have an rsyslog configured to receive messages from multiple sources on different ports.
Messages are then assigned to different action rulesets depending on the incoming port.
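For illustration, a minimal sketch of that kind of setup, using imudp with placeholder port numbers and ruleset names (not our actual configuration):
module(load="imudp")
input(type="imudp" port="10514" ruleset="app1")
input(type="imudp" port="10515" ruleset="app2")
ruleset(name="app1") { action(type="omfile" file="/var/log/app1.log") }
ruleset(name="app2") { action(type="omfile" file="/var/log/app2.log") }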
We have noticed that sometimes (but not systematically), after an rsyslog restart, there are errors logged in /var/log/messages with content like
"2022-08-16T16:46:26.841640+02:00 mysyslogserver rsyslogd: msg: ruleset ' 6È B ' could not be found and could not be assgined to message object. This possibly leads to the message being processed incorrectly. We cannot do anything against this, but wanted to let you know. [v8.32.0 try http://www.rsyslog.com/e/3003 ]"
The name of the ruleset changes every time and seems to be a random binary string. Such a message is logged several thousand times (with the same ruleset name), at a rate which often exceeds the rate limit for internal messages.
(And of course we don't have rulesets with such names in our config file... )
Would you know what could be the cause of such an issue? Is it a bug?
Note that in some rulesets we use the "call" statement to call sub-rulesets, but we don't use "call_indirectly".
Thanks in advance for any help.
S.Hemelaer

messages lost due to rate-limiting

We are testing the capacity of a Mail relay based on RHEL 7.6.
We are observing issues when sending a large number of messages (e.g. ~1000 messages in 60 seconds).
While we have sent all the messages and the recipient has received all of them, log entries are missing from /var/log/maillog_rfc5424.
We have the following message in /var/log/messages:
rsyslogd: imjournal: XYZ messages lost due to rate-limiting
We adapted /etc/rsyslog.conf with the following settings, but without effect:
$SystemLogRateLimitInterval 0 # turn off rate limit
$SystemLogRateLimitBurst 0 # turn rate limit off
Any ideas?
The error is from imjournal, but your configuration settings are for imuxsock.
According to the rsyslog configuration page you need to set
$imjournalRatelimitInterval 0
$imjournalRatelimitBurst 0
Note that for very high message rates you might like to change to imuxsock, as it says:
this module may be notably slower than when using imuxsock. The journal provides imuxsock with a copy of all “classical” syslog messages, however, it does not provide structured data. Only if that structured data is needed, imjournal must be used. Otherwise, imjournal may simply be replaced by imuxsock, and we highly suggest doing so.
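A minimal sketch of what that replacement might look like in /etc/rsyslog.conf on a RHEL 7-style system (directive names follow the legacy style already used above; on RHEL you typically also need ForwardToSyslog=yes in /etc/systemd/journald.conf so that messages reach the socket imuxsock reads):
$ModLoad imuxsock          # read messages from the local system log socket
#$ModLoad imjournal        # stop pulling messages from the journal
$OmitLocalLogging off      # let imuxsock actually process the local socket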

GoAccess shows wrong data on the real-time HTML report due to loading data from multiple WebSocket services

I need to run multiple goaccess processes with the --real-time-html option to analyze multiple logs.
My commands are:
/usr/bin/goaccess --real-time-html -o /data/html/log1/index.html -f log/log1.log --port=7890
/usr/bin/goaccess --real-time-html -o /data/html/log2/index.html -f log/log2.log --port=7891
...
When only 1 process is running, everything is OK, and I can see the WebSocket data frames in Chrome; every data frame is generally the same length.
But when 2 or more processes are running, 2 things happen:
On the terminal where the goaccess processes are running, "SIGPIPE caught!" is printed continuously;
On the web page, the dashboard intermittently shows wrong data, and I notice that the WebSocket data frames received by the browser vary considerably in length (which suggests the web page is receiving data frames from other goaccess processes). When the frame length is similar to what I see with only 1 goaccess process running, the data shown on the page is right; when the frame length is different, the data is wrong.
It seems that even though I run each goaccess process with the "--port" option to specify a different port for every WebSocket service, the multiple WebSocket services are still mixed up.
To run multiple instances, you need to ensure the following:
Run each instance on a different port --port.
Different pipes (FIFOs) --fifo-in=/path/in.1 --fifo-out=/path/out.1.
(Optionally) IFF you are using the on-disk storage, then you will need a different path where the DB files are stored --db-path=/path/instance1/.
Examples:
goaccess -f /prod/access.log -o /var/www/html/prod.html --real-time-html --ws-url=192.168.1.2 --port=7890 --fifo-in=/tmp/prod.in --fifo-out=/tmp/prod.out
AND
goaccess -f /dev/access.log -o /var/www/html/dev.html --real-time-html --ws-url=192.168.1.2 --port=7891 --fifo-in=/tmp/dev.in --fifo-out=/tmp/dev.out
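If the report still mixes data after that, a quick sanity check (assuming a Linux host with iproute2) is to confirm that each instance is bound to its own port and that each generated HTML report points at the matching --ws-url and port:
ss -ltnp | grep goaccess    # each goaccess PID should be listening on its own --port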
Source
