After retransmission failures between nodes, both nodes mark each other as dead and do not show each other's status in crm_mon - high-availability

So, to begin with, Node 1 is not showing Node 2 and, similarly, Node 2 does not show Node 1 in the crm_mon output.
After analyzing the corosync logs I found that, because of multiple retransmit failures, both nodes marked each other as dead. I tried stopping and starting corosync and pacemaker, but they are still not forming a cluster and do not show each other in crm_mon.
Logs of Node 2:
For srv-vme-ccs-02
Oct 30 02:22:49 srv-vme-ccs-02 crmd[1973]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now member (was (null))
It is still a member at this point.
Oct 30 10:07:34 srv-vme-ccs-02 corosync[1613]: [TOTEM ] Retransmit List: 117
Oct 30 10:07:35 srv-vme-ccs-02 corosync[1613]: [TOTEM ] Retransmit List: 118
Oct 30 10:07:35 srv-vme-ccs-02 corosync[1613]: [TOTEM ] FAILED TO RECEIVE
Oct 30 10:07:49 srv-vme-ccs-02 arpwatch: bogon 192.168.0.120 d4:be:d9:af:c6:23
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 232: memb=1, new=0, lost=1
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: memb: srv-vme-ccs-02 2561414316
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: lost: srv-vme-ccs-01 2544637100
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 232: memb=1, new=0, lost=0
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: MEMB: srv-vme-ccs-02 2561414316
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: ais_mark_unseen_peer_dead: Node srv-vme-ccs-01 was not seen in the previous transition
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: update_member: Node 2544637100/srv-vme-ccs-01 is now: lost
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: send_member_notification: Sending membership update 232 to 2 children
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [CPG ] chosen downlist: sender r(0) ip(172.20.172.152) ; members(old:2 left:1)
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 30 10:07:59 srv-vme-ccs-02 cib[1968]: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now lost (was member)
Oct 30 10:07:59 srv-vme-ccs-02 cib[1968]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now lost (was member)
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: warning: reap_dead_nodes: Our DC node (srv-vme-ccs-01) left the cluster
Now srv-vme-ccs-01 is no longer a member.
On the other node I find similar logs of failed retransmits.
Logs of Node 1
For srv-vme-ccs-01
Oct 30 09:48:32 [2000] srv-vme-ccs-01 pengine: info: determine_online_status: Node srv-vme-ccs-01 is online
Oct 30 09:48:32 [2000] srv-vme-ccs-01 pengine: info: determine_online_status: Node srv-vme-ccs-02 is online
Oct 30 09:48:59 [2001] srv-vme-ccs-01 crmd: info: update_dc: Unset DC. Was srv-vme-ccs-01
Oct 30 09:48:59 corosync [TOTEM ] Retransmit List: 107 108 109 10a 10b 10c 10d 10e 10f 110 111 112 113 114 115 116 117
Oct 30 09:48:59 corosync [TOTEM ] Retransmit List: 107 108 109 10a 10b 10c 10d 10e 10f 110 111 112 113 114 115 116 117 118
Oct 30 10:08:22 corosync [TOTEM ] A processor failed, forming new configuration.
Oct 30 10:08:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 232: memb=1, new=0, lost=1
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: memb: srv-vme-ccs-01 2544637100
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: lost: srv-vme-ccs-02 2561414316
Oct 30 10:08:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 232: memb=1, new=0, lost=0
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: srv-vme-ccs-01 2544637100
Oct 30 10:08:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node srv-vme-ccs-02 was not seen in the previous transition
Oct 30 10:08:25 corosync [pcmk ] info: update_member: Node 2561414316/srv-vme-ccs-02 is now: lost
Oct 30 10:08:25 corosync [pcmk ] info: send_member_notification: Sending membership update 232 to 2 children
Oct 30 10:08:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-02[2561414316] - state is now lost (was member)
Oct 30 10:08:25 corosync [CPG ] chosen downlist: sender r(0) ip(172.20.172.151) ; members(old:2 left:1)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-02[2561414316] - state is now lost (was member)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: peer_update_callback: srv-vme-ccs-02 is now lost (was member)
Oct 30 10:08:25 corosync [MAIN ] Completed service synchronization, ready to provide service.
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: warning: match_down_event: No match for shutdown action on srv-vme-ccs-02
Oct 30 10:08:25 [1990] srv-vme-ccs-01 pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=9): Try again (6)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: join_make_offer: Skipping srv-vme-ccs-01: already known 1
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: update_dc: Set DC to srv-vme-ccs-01 (3.0.7)
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: info: cib_process_request: Completed cib_modify operation for section crm_config: OK (rc=0, origin=local/crmd/185, version=0.116.3)
So, at the same time, message retransmission was occurring heavily on both nodes (it started after a server rebooted abruptly), and both nodes marked each other as lost members and formed individual clusters, each marking itself as DC.

I found the solution to this:
First, as checked with tcpdump, Pacemaker was using multicasting, and upon investigating with the network team we came to know that multicasting is not enabled on the network.
So we removed mcastaddr and restarted corosync and pacemaker, but corosync refused to start with the error:
No multicast address defined in corosync.conf.
Later, on debugging, we found that the syntax for
transport: udpu
was not correct; it was written as follows:
transport=udpu
So corosync was by default running in multicast mode.
The issue was resolved after correcting corosync.conf.
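For reference, a minimal sketch of what the corrected udpu configuration can look like (corosync 2.x-style nodelist shown; on corosync 1.x with the pcmk plugin the peers are instead listed as member blocks inside the interface section; the addresses are taken from the logs above, other values are placeholders):
totem {
  version: 2
  cluster_name: mycluster
  transport: udpu
}
nodelist {
  node {
    ring0_addr: 172.20.172.151
    nodeid: 1
  }
  node {
    ring0_addr: 172.20.172.152
    nodeid: 2
  }
}
Note the colon after transport; with "transport=udpu" the setting does not take effect, so corosync kept running with its multicast default as described above.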

Related

drbd & Corosync - My drbd works, it shows me that it is upToDate, but it is not

I have a high availability cluster with two nodes, with a resource for drbd, a virtual IP and the mariaDB files shared on the drbd partition.
Everything seems to work OK, but drbd is not syncing the latest files I have created, even though drbd status tells me they are UpToDate.
sudo drbdadm status
iba role:Primary
disk:UpToDate
Pcs also does not show errors
sudo pcs status
Cluster name: cluster_iba
Cluster Summary:
* Stack: corosync
* Current DC: iba2-ip192 (version 2.0.3-4b1f869f0f) - partition with quorum
* Last updated: Tue Feb 22 18:16:20 2022
* Last change: Mon Feb 21 16:19:38 2022 by root via cibadmin on iba1-ip192
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ iba1-ip192 iba2-ip192 ]
Full List of Resources:
* virtual_ip (ocf::heartbeat:IPaddr2): Started iba2-ip192
* Clone Set: DrbdData-clone [DrbdData] (promotable):
* Masters: [ iba2-ip192 ]
* Slaves: [ iba1-ip192 ]
* DrbdFS (ocf::heartbeat:Filesystem): Started iba2-ip192
* WebServer (ocf::heartbeat:apache): Started iba2-ip192
* Maria (ocf::heartbeat:mysql): Started iba2-ip192
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
All constraints:
sudo pcs constraint list --full
Location Constraints:
Ordering Constraints:
promote DrbdData-clone then start DrbdFS (kind:Mandatory) (id:order-DrbdData-clone-DrbdFS-mandatory)
start DrbdFS then start virtual_ip (kind:Mandatory) (id:order-DrbdFS-virtual_ip-mandatory)
start virtual_ip then start WebServer (kind:Mandatory) (id:order-virtual_ip-WebServer-mandatory)
start DrbdFS then start Maria (kind:Mandatory) (id:order-DrbdFS-Maria-mandatory)
Colocation Constraints:
DrbdFS with DrbdData-clone (score:INFINITY) (with-rsc-role:Master) (id:colocation-DrbdFS-DrbdData-clone-INFINITY)
virtual_ip with DrbdFS (score:INFINITY) (id:colocation-virtual_ip-DrbdFS-INFINITY)
WebServer with virtual_ip (score:INFINITY) (id:colocation-WebServer-virtual_ip-INFINITY)
Maria with DrbdFS (score:INFINITY) (id:colocation-Maria-DrbdFS-INFINITY)
Ticket Constraints:
The files in /mnt/datosDRBD in node iba2-ip192 (when it's the master),
/mnt/datosDRBD$ ls -l
total 80
-rw-r--r-- 1 root root 5801 feb 21 12:16 drbd_cfg
-rw-r--r-- 1 root root 10494 feb 21 12:18 fs_cfg
drwx------ 2 root root 16384 feb 21 10:12 lost+found
drwxr-xr-x 4 mysql mysql 4096 feb 22 18:00 mariaDB
-rw-r--r-- 1 root root 17942 feb 21 12:39 MariaDB_cfg
-rw-r--r-- 1 root root 5 feb 21 10:13 testMParicio.txt
-rw-r--r-- 1 root root 13578 feb 21 12:21 WebServer_cfg
And the files in /mnt/datosDRBD in node iba1-ip192 (when it's the master),
ls -l
total 92
-rw-r--r-- 1 root root 5801 feb 21 12:16 drbd_cfg
drwxrwxrwx 5 www-data www-data 4096 feb 22 13:41 FilesSGITV
-rw-r--r-- 1 root root 10494 feb 21 12:18 fs_cfg
drwx------ 2 root root 16384 feb 21 10:12 lost+found
drwxr-xr-x 7 mysql mysql 4096 feb 22 17:55 mariaDB
-rw-r--r-- 1 root root 17942 feb 21 12:39 MariaDB_cfg
-rw-r--r-- 1 root root 5 feb 22 17:58 testMParicio2.txt
-rw-r--r-- 1 www-data www-data 9 feb 22 17:58 testMParicio3.txt
-rw-r--r-- 1 root root 5 feb 21 10:13 testMParicio.txt
-rw-r--r-- 1 root root 13578 feb 21 12:21 WebServer_cfg
All new files, testMParicio2.txt testMParicio3.txt and the folder FilesSGITV are missing.
I do not know what to do. I am very lost.
I appreciate any help, thanks.
(EDIT)
My config for drbd, in both nodes...
cat /etc/drbd.conf
# You can find an example in /usr/share/doc/drbd.../drbd.conf.example
include "drbd.d/global_common.conf";
include "drbd.d/*.res";
And my *.res config, in both nodes too:
resource iba {
  device /dev/drbd0;
  disk /dev/md3;
  meta-disk internal;
  on iba1 {
    address 10.0.0.248:7789;
  }
  on iba2 {
    address 10.0.0.249:7789;
  }
}
drbdadm uses iba1 and iba2, with IPs 10.0.0.248 and 10.0.0.249.
Corosync uses iba1-ip192 and iba2-ip192, with IPs 192.168.1.248 and 192.168.1.249.
cat /etc/hosts
127.0.0.1 localhost
#127.0.1.1 iba1
10.0.0.248 iba1
10.0.0.249 iba2
192.168.1.248 iba1-ip192
192.168.1.249 iba2-ip192
cat /etc/drbd.d/global_common.conf
global {
  usage-count yes;
  udev-always-use-vnr; # treat implicit the same as explicit volumes
}
common {
  handlers {
  }
  startup {
  }
  options {
  }
  disk {
  }
  net {
    protocol C;
  }
}
(EDIT 2)
I have found a problem in /proc/drbd
In the primary node:
cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: FC3433D849E3B88C1E7B55C
0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
ns:0 nr:0 dw:2284 dr:11625 al:6 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:42364728
In the secondary node:
cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: FC3433D849E3B88C1E7B55C
0: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:36538580
The secondary node didn't remember the ssh key; I fixed that with:
ssh-keygen -R 10.0.0.248
ssh-copy-id iba@iba1
But DRBD is still in StandAlone status.
I don't know how to continue
I have found a Split-Brain that did not appear in the status of pcs.
sudo journalctl | grep Split-Brain
feb 21 13:00:10 ibatec1 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
feb 21 13:21:40 ibatec1 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
feb 21 13:27:54 ibatec1 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
I have stopped the cluster, with --force on the master.
Then:
On split-brain victim (assuming the DRBD resource is iba):
drbdadm disconnect iba
drbdadm secondary iba
drbdadm connect --discard-my-data iba
On split-brain survivor:
drbdadm primary iba
drbdadm connect iba
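Optionally, to reduce the chance of having to repeat this manual recovery, DRBD can be told to resolve future split-brains automatically. A sketch of the net section using the DRBD 8.4 policy options (choose policies according to how much data you are willing to discard):
net {
  protocol C;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
}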

Sort a matrix by timestamp

I am not sure if this is applicable, but I need to sort the output below by the timestamp in column 2 (under FROM): the newest should be on the first line and the oldest on the last line. The time format must be kept as it is; I only need to sort by date.
COUNT FROM TO
97 Oct 10 10:00:56 Oct 10 10:18:35
9 Mar 10 10:02:09 Oct 10 10:02:55
768 Oct 10 10:01:09 Oct 10 10:18:24
764 Oct 10 10:00:53 Oct 10 10:18:24
33 Oct 10 10:18:35 Oct 10 10:18:39
306 May 10 10:00:52 Oct 10 10:21:20
3 Oct 10 10:00:52 Oct 10 10:00:52
3 Oct 12 15:33:26 Nov 2 03:30:06
2 Oct 17 09:16:53 Oct 17 09:17:05
18 Nov 2 00:07:24 Nov 2 01:03:13
11 Oct 10 10:00:52 Oct 10 10:00:56
10095 Jun 10 10:00:52 Oct 10 10:18:24
10 Oct 10 10:18:40 Oct 10 10:18:45
1 Nov 2 03:21:32 Nov 2 03:21:32
1 Feb 2 01:31:53 Nov 2 01:31:53
1 Aug 2 03:26:24 Nov 2 03:26:24
1 Nov 2 03:21:32 Nov 2 03:21:32
1 Oct 10 10:18:05 Oct 10 10:18:05
1 Oct 17 09:16:52 Oct 17 09:16:52
1 Jan 10 10:02:55 Oct 10 10:02:55
1 Nov 2 23:24:09 Nov 2 23:29:09
1 Oct 10 10:00:52 Oct 10 10:00:52
1 Oct 10 10:00:53 Oct 10 10:00:53
1 Nov 2 03:22:22 Nov 2 03:22:22
1 Apr 2 06:41:29 Nov 2 06:41:29
The output should keep the same header, with the line below as the first line:
1 Nov 2 23:24:09 Nov 2 23:29:09
and the line below as the last line:
1 Jan 10 10:02:55 Oct 10 10:02:55
Take a look at man sort and you will see that you can sort by columns using the -k option.
This option supports a column number, and optional sort method.
For your case this might work:
sort -k2Mr -k3nr -k4r file.txt
-k2Mr do month sort on column two and reverse it.
-k3nr do numeric sort on column three and reverse it.
-k4r sort on column four and reverse it.
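If the header line (COUNT FROM TO) must stay at the top, as in the expected output above, one option is to sort everything except the first line; assuming the data is in file.txt:
(head -n 1 file.txt; tail -n +2 file.txt | sort -k2Mr -k3nr -k4r)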

Windows SSDP discovery service throttling outgoing SSDP broadcasts

I have a Python app that broadcasts SSDP discovery requests. I noticed the devices I'm attempting to discover aren't always responding. Using Wireshark I found that only some of my broadcasts are reaching the wire. After some troubleshooting I isolated the source of the problem to the SSDP Discovery service - if I disable that service then my packet loss goes away. Also, if I use a multicast address other than SSDP (239.255.255.250) the problem also goes away. So it definitely seems SSDP is throttling my outgoing UDP broadcasts. Any idea why this is? Perhaps trying to coalesce broadcasts/limit traffic? I'm using Windows 7. The problem doesn't occur under OSX.
Here is a quick test app demonstrating the packet loss - both instances running on the same system, the sender instance transmits a packet every second and the receiver reports any gaps in the test-defined packet numbers.
import socket
import struct
import sys
import time

def testSend():
    seqNumber = 0
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(str(seqNumber), ("239.255.255.250", 1900))
        sock.close()
        print("Sent Seq #{:4d} [{}]".format(seqNumber, time.ctime()))
        seqNumber += 1
        time.sleep(1)

def testReceive():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", 1900))
    mreq = struct.pack("4sl", socket.inet_aton("239.255.255.250"), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    expectedSequenceNumber = 0
    while True:
        response = sock.recv(100)
        actualSequenceNumber = int(response)
        if actualSequenceNumber == expectedSequenceNumber:
            print("Good: Received Seq #{:>4s} [{}]".format(response, time.ctime()))
        else:
            print("Bad: Expected Seq #{}, Got #{} ({} frames dropped) [{}]".format(expectedSequenceNumber, actualSequenceNumber, actualSequenceNumber - expectedSequenceNumber, time.ctime()))
        expectedSequenceNumber = actualSequenceNumber + 1

if sys.argv[1] == 'send':
    testSend()
elif sys.argv[1] == 'receive':
    testReceive()
Output on receiving side:
Good: Received Seq # 0 [Sun Sep 20 11:01:18 2015]
Good: Received Seq # 1 [Sun Sep 20 11:01:19 2015]
Good: Received Seq # 2 [Sun Sep 20 11:01:20 2015]
Good: Received Seq # 3 [Sun Sep 20 11:01:21 2015]
Good: Received Seq # 4 [Sun Sep 20 11:01:22 2015]
Bad: Expected Seq #5, Got #12 (7 frames dropped) [Sun Sep 20 11:01:30 2015]
Good: Received Seq # 13 [Sun Sep 20 11:01:31 2015]
Good: Received Seq # 14 [Sun Sep 20 11:01:32 2015]
Good: Received Seq # 15 [Sun Sep 20 11:01:33 2015]
Good: Received Seq # 16 [Sun Sep 20 11:01:34 2015]
Good: Received Seq # 17 [Sun Sep 20 11:01:35 2015]
Good: Received Seq # 18 [Sun Sep 20 11:01:36 2015]
Good: Received Seq # 19 [Sun Sep 20 11:01:37 2015]
Good: Received Seq # 20 [Sun Sep 20 11:01:38 2015]
Bad: Expected Seq #21, Got #51 (30 frames dropped) [Sun Sep 20 11:02:09 2015]
Good: Received Seq # 52 [Sun Sep 20 11:02:10 2015]
Good: Received Seq # 53 [Sun Sep 20 11:02:11 2015]
Good: Received Seq # 54 [Sun Sep 20 11:02:12 2015]
Good: Received Seq # 55 [Sun Sep 20 11:02:14 2015]
Good: Received Seq # 56 [Sun Sep 20 11:02:15 2015]
Good: Received Seq # 57 [Sun Sep 20 11:02:16 2015]
Good: Received Seq # 58 [Sun Sep 20 11:02:17 2015]
Good: Received Seq # 59 [Sun Sep 20 11:02:18 2015]
Good: Received Seq # 60 [Sun Sep 20 11:02:19 2015]
Bad: Expected Seq #61, Got #71 (10 frames dropped) [Sun Sep 20 11:02:30 2015]
Good: Received Seq # 72 [Sun Sep 20 11:02:31 2015]
Good: Received Seq # 73 [Sun Sep 20 11:02:32 2015]
Good: Received Seq # 74 [Sun Sep 20 11:02:33 2015]
Good: Received Seq # 75 [Sun Sep 20 11:02:34 2015]
Good: Received Seq # 76 [Sun Sep 20 11:02:35 2015]
Good: Received Seq # 77 [Sun Sep 20 11:02:36 2015]
Good: Received Seq # 78 [Sun Sep 20 11:02:37 2015]
Good: Received Seq # 79 [Sun Sep 20 11:02:38 2015]
Good: Received Seq # 80 [Sun Sep 20 11:02:39 2015]
Bad: Expected Seq #81, Got #110 (29 frames dropped) [Sun Sep 20 11:03:09 2015]
Good: Received Seq # 111 [Sun Sep 20 11:03:10 2015]
Good: Received Seq # 112 [Sun Sep 20 11:03:11 2015]
Good: Received Seq # 113 [Sun Sep 20 11:03:12 2015]
Good: Received Seq # 114 [Sun Sep 20 11:03:13 2015]
Good: Received Seq # 115 [Sun Sep 20 11:03:14 2015]
Good: Received Seq # 116 [Sun Sep 20 11:03:15 2015]
Good: Received Seq # 117 [Sun Sep 20 11:03:16 2015]
Good: Received Seq # 118 [Sun Sep 20 11:03:17 2015]
Good: Received Seq # 119 [Sun Sep 20 11:03:18 2015]
Bad: Expected Seq #120, Got #130 (10 frames dropped) [Sun Sep 20 11:03:30 2015]
Good: Received Seq # 131 [Sun Sep 20 11:03:31 2015]
Good: Received Seq # 132 [Sun Sep 20 11:03:32 2015]
Good: Received Seq # 133 [Sun Sep 20 11:03:33 2015]
Good: Received Seq # 134 [Sun Sep 20 11:03:34 2015]
Good: Received Seq # 135 [Sun Sep 20 11:03:35 2015]
Good: Received Seq # 136 [Sun Sep 20 11:03:36 2015]
Edit (10/06/15):
I believe I've root-caused the issue. The Windows SSDP Discovery service appears to periodically cycle which interface multicast packets go out on, and also which interface incoming packets are accepted on, even on systems with only one physical network interface configured/online. On my system I have a single wired ethernet network and two virtual VMware network adapters (I'm not running in a VM - these are on the host side and are enabled but not being used). I modified the source of the utility above to support configuring which interface my broadcasts go out on via setsockopt(IP_MULTICAST_IF) and also which interface I listen to broadcasts on via setsockopt(IP_ADD_MEMBERSHIP). I then ran four instances of the utility - one sending on INADDR_ANY, one receiving on INADDR_ANY, and two more listening on each of the VMware virtual network adapters (VMnet1 and VMnet8, both preconfigured with their own fabricated/virtual subnets). When the INADDR_ANY receiver instance starts missing packets, I see them show up on one of the VMware listeners. This is my proof that the Windows SSDP Discovery service is cycling the default adapter used for multicast transmission. I don't see this occur when the SSDP service is disabled. I assume the discovery service is doing this to catch SSDP messages on all network interfaces, although it's not clear why it has to change the system default multicast interface to accomplish this rather than just having multiple sockets, one on each interface in the system.
The workaround is to explicitly set which interface to use for multicast transmissions and listens, rather than relying on INADDR_ANY, which is the traditional way to handle multicasts and works fine on every other single-homed OS platform. Note that you have to explicitly set not just the transmitting interface but also the receiving side, because the discovery service's cycling of the default interface applies both to the default transmit interface and to which interface incoming packets are accepted on for the IP multicast membership group.
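A minimal sketch of that pinning, on top of the test utility above (LOCAL_IP is a placeholder for the physical adapter's address):
import socket

LOCAL_IP = "192.168.1.10"  # placeholder: address of the physical NIC to pin to

# Sender side: pin outgoing multicast to one interface instead of INADDR_ANY
sendSock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sendSock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(LOCAL_IP))
sendSock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
sendSock.sendto(b"0", ("239.255.255.250", 1900))

# Receiver side: join the group on that same interface rather than INADDR_ANY
recvSock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
recvSock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recvSock.bind(("", 1900))
mreq = socket.inet_aton("239.255.255.250") + socket.inet_aton(LOCAL_IP)  # struct ip_mreq: group + interface
recvSock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)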

Why logstash-forwarder is not sending more than 100 events per lumberjack request?

I have seen that it is flush_size that controls the number of events sent per request in lumberjack (logstash-forwarder), and I have set its default to 150 as shown below,
config :flush_size, :validate => :number, :default => 150
FILE: /opt/logstash/lib/logstash/outputs/elasticsearch_http.rb
but still I am not seeing lumberjack sending more than 100 events per request.
Jan 23 16:59:01 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:01.496540 Connecting to [127.0.0.1]:5000 (127.0.0.1)
Jan 23 16:59:01 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:01.828968 Connected to 127.0.0.1
Jan 23 16:59:08 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:08.146238 Registrar received 100 events
Jan 23 16:59:13 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:13.500840 Registrar received 100 events
Jan 23 16:59:16 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:16.938172 Registrar received 100 events
Jan 23 16:59:18 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:18.330341 Registrar received 100 events
Jan 23 16:59:19 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:19.347694 Registrar received 100 events
Jan 23 16:59:20 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:20.341879 Registrar received 100 events
Jan 23 16:59:21 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:21.339127 Registrar received 100 events
Jan 23 16:59:23 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:23.060140 Registrar received 100 events
Jan 23 16:59:24 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:24.680771 Registrar received 100 events
Jan 23 16:59:26 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:26.196146 Registrar received 100 events
Jan 23 16:59:27 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:27.043658 Registrar received 100 events
Jan 23 16:59:28 197nodnb13846 logstash-forwarder[30342]: 2015/01/23 16:59:28.203279 Registrar received 100 events
I have restarted logstash and logstash-forwarder after this but it is still not working.
Edit the init script and change the value of -spool-size:
grep DAEMON_ARGS /etc/init.d/logstash-forwarder
DAEMON_ARGS="-config /etc/logstash-forwarder -spool-size 100 -log-to-syslog"
/etc/init.d/logstash-forwarder restart
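For example, raising the spool size to match the flush_size above would make the edited line look like this (a sketch; the exact DAEMON_ARGS contents depend on your packaging):
DAEMON_ARGS="-config /etc/logstash-forwarder -spool-size 150 -log-to-syslog"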

How to debug this condition of "eth2: tx hang 1 detected on queue 11, resetting adapter"?

I want to send an sk_buff with dev_queue_xmit(), but when I send just 2 packets the network card may hang.
I want to know how to debug this condition.
The /var/log/messages output is:
[root@10g-host2 test]# tail -f /var/log/messages
Sep 29 10:38:22 10g-host2 acpid: waiting for events: event logging is off
Sep 29 10:38:23 10g-host2 acpid: client connected from 2018[68:68]
Sep 29 10:38:23 10g-host2 acpid: 1 client rule loaded
Sep 29 10:38:24 10g-host2 automount[2210]: lookup_read_master: lookup(nisplus): couldn't locate nis+ table auto.master
Sep 29 10:38:24 10g-host2 mcelog: failed to prefill DIMM database from DMI data
Sep 29 10:38:24 10g-host2 xinetd[2246]: xinetd Version 2.3.14 started with libwrap loadavg labeled-networking options compiled in.
Sep 29 10:38:24 10g-host2 xinetd[2246]: Started working: 0 available services
Sep 29 10:38:25 10g-host2 abrtd: Init complete, entering main loop
Sep 29 10:39:41 10g-host2 kernel: vmalloc mmap_buf=ffffc90016e29000 mmap_size=4096
Sep 29 10:39:41 10g-host2 kernel: insmod module wsmmap successfully!
Sep 29 10:39:49 10g-host2 kernel: mmap_buf + 1024 is ffffc90016e29400
Sep 29 10:39:49 10g-host2 kernel: data ffffc90016e2942a, len is 42
Sep 29 10:39:49 10g-host2 kernel: udp data ffffc90016e29422
Sep 29 10:39:49 10g-host2 kernel: ip data ffffc90016e2940e
Sep 29 10:39:49 10g-host2 kernel: eth data ffffc90016e29400
Sep 29 10:39:49 10g-host2 kernel: h_source is ffffc90016e29406, dev_addr is ffff880c235c4750, len is 6result is 0
Sep 29 10:39:50 10g-host2 kernel: mmap_buf + 1024 is ffffc90016e29400
Sep 29 10:39:50 10g-host2 kernel: data ffffc90016e2942a, len is 42
Sep 29 10:39:50 10g-host2 kernel: udp data ffffc90016e29422
Sep 29 10:39:50 10g-host2 kernel: ip data ffffc90016e2940e
Sep 29 10:39:50 10g-host2 kernel: eth data ffffc90016e29400
Sep 29 10:39:50 10g-host2 kernel: h_source is ffffc90016e29406, dev_addr is ffff880c235c4750, len is 6result is 0
Sep 29 10:39:52 10g-host2 kernel: ixgbe 0000:03:00.0: eth2: Detected Tx Unit Hang
Sep 29 10:39:52 10g-host2 kernel: Tx Queue <11>
Sep 29 10:39:52 10g-host2 kernel: TDH, TDT <0>, <5>
Sep 29 10:39:52 10g-host2 kernel: next_to_use <5>
Sep 29 10:39:52 10g-host2 kernel: next_to_clean <0>
Sep 29 10:39:52 10g-host2 kernel: ixgbe 0000:03:00.0: eth2: tx_buffer_info[next_to_clean]
Sep 29 10:39:52 10g-host2 kernel: time_stamp <fffd3dd8>
Sep 29 10:39:52 10g-host2 kernel: jiffies <fffd497f>
Sep 29 10:39:52 10g-host2 kernel: ixgbe 0000:03:00.0: eth2: tx hang 1 detected on queue 11, resetting adapter
Sep 29 10:39:52 10g-host2 kernel: ixgbe 0000:03:00.0: eth2: Reset adapter
Sep 29 10:39:52 10g-host2 kernel: ixgbe 0000:03:00.0: master disable timed out
Sep 29 10:39:53 10g-host2 kernel: ixgbe 0000:03:00.0: eth2: detected SFP+: 5
Sep 29 10:39:54 10g-host2 kernel: ixgbe 0000:03:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Some information about my computer:
ethtool -i eth2
driver: ixgbe
version: 3.21.2
firmware-version: 0x1bab0001
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.5 (Final)
Release: 6.5
Codename: Final
kernel version is: 2.6.32-431.el6.x86_64
Thank you for your help.
I used vmalloc() to allocate the memory for skb->data, and that is what brought the NIC down: vmalloc() memory is only virtually contiguous, so it is not suitable for the NIC's DMA of the packet buffer. I fixed it by using kmalloc() instead.
