DRBD: too many "kernel: block drbdX: Out of sync" messages after /sbin/drbdadm verify - high-availability

On all the systems where I run DRBD, verification leaves many messages like this in the log:
kernel: block drbd0: Out of sync: start=403446112, size=328 (sectors)
On some systems one might blame the workload, but some of these machines are almost idle. The machines are connected over a good-quality 1 Gb network.
These messages do not give me much confidence in the system, and in the end they force me to use cron to check synchronization and resync the out-of-sync blocks, which effectively turns a supposedly synchronous system into an asynchronous one.
Is this normal?
Is there a solution?
Is something wrong with my configuration?
common {
    protocol C;
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }
    syncer {
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        verify-alg sha1;
        rate 40M;
    }
}
resource r0 {
    protocol C;
    startup {
        wfc-timeout 15; # non-zero wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configuration)
        degr-wfc-timeout 60;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "XXXXXXXXXX";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on pro01 {
        device /dev/drbd0;
        disk /dev/pve/vm-100-disk-1;
        address YYY.YYY.YYY.YYY:7788;
        meta-disk internal;
    }
    on pro02 {
        device /dev/drbd0;
        disk /dev/pve/vm-100-disk-1;
        address YYY.YYY.YYY.YYY:7788;
        meta-disk internal;
    }
}

It might happen from time to time, and it's normal.
Just disconnect and connect again - the out-of-sync blocks will then be resynchronized.
DRBD - online verify
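For example, with the resource r0 from the configuration above, a disconnect/connect cycle after a verify run triggers a resync of just the blocks that were marked out of sync:
# drbdadm disconnect r0
# drbdadm connect r0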

There is a long story related to this (http://www.gossamer-threads.com/lists/drbd/users/25227), and I'm still not sure whether this can be (or should be) fixed by the DRBD developers, or whether we need to fix the upper-layer behaviour (KVM in my case).

Related

Memory loads experience different latency on the same core

I am trying to implement a cache-based covert channel in C, but noticed something weird. The sender and the receiver share a physical address by mmap()ing the same file with the MAP_SHARED option. Below is the code for the sender process, which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:
// computes latency of a load operation
static inline CYCLES load_latency(volatile void* p) {
    CYCLES t1 = rdtscp();
    load = *((int *)p);
    CYCLES t2 = rdtscp();
    return (t2 - t1);
}
void send_bit(int one, void *addr) {
    if (one) {
        clflush((void *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
        clflush((void *)addr);
    }
    else {
        x = *((int *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
    }
}
int main(int argc, char **argv) {
    if (argc == 2)
    {
        bit = atoi(argv[1]);
    }
    // transmit bit
    init_address(DEFAULT_FILE_NAME);
    send_bit(bit, address);
    return 0;
}
The load operation takes around 0-1000 cycles (cache hit vs. cache miss) when issued by the same process.
The receiver program loads the same shared physical address and measures the latency during a cache hit or a cache miss; its code is shown below:
int main(int argc, char **argv) {
    init_address(DEFAULT_FILE_NAME);
    rdtscp();
    load__latency = load_latency((void *)address);
    printf("load latency = %d\n", load__latency);
    return 0;
}
(I ran the receiver manually after the sender process terminated)
However, the latency observed in this scenario is very different from the first case. The load operation takes around 5000-10000 cycles.
Both processes have been pinned to the same core ID using the taskset command. So if I'm not wrong, both processes should experience the load latency of the L1 cache on a cache hit and of DRAM on a cache miss. Yet the two processes experience very different latencies. What could be the reason for this observation, and how can I make both processes experience the same latency?
The initial access to an mmaped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.
You're probably timing a page fault if you only do one time measurement per mmap, or per run of a whole program.
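A minimal sketch of pre-faulting the mapping so the first timed access is not a page fault (the file name and length here are placeholders, not from the original code):
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("shared_file", O_RDWR);   /* placeholder file name */
    if (fd < 0)
        return 1;
    /* MAP_POPULATE asks the kernel to fault all pages in up front,
       so the first timed access is not a soft page fault */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    *(volatile int *)p;   /* alternatively: touch the page once before timing */
    munmap(p, 4096);
    close(fd);
    return 0;
}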
(Also, you don't seem to be doing anything to warm up the CPU frequency, so one core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with the core/uncore clock.)
Also note that unless you run the 2nd process immediately (e.g. from the same shell command), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C states. Probably also the TLBs.
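For example, something like the following keeps the gap between the two runs minimal (program names are placeholders for the sender and receiver binaries):
$ taskset -c 0 ./sender 1 && taskset -c 0 ./receiver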
It's also strange that you cast away volatile in load = *((int *)p);
Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that could also soft page fault. If so, RDTSCP will have to wait for it, because the store can't retire.
(But on a TLB hit for the store, it doesn't have to wait for the store to commit to cache, since there's no mfence before rdtscp. A store can retire while its data is still in the store buffer. In fact, it must retire before the store-buffer entry is known to be non-speculative so that it can commit.)
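For illustration, here is one common way to keep the timed region tight using compiler intrinsics (a sketch, not the original code; load_latency2 and the lfence placement are my assumptions):
#include <stdint.h>
#include <x86intrin.h>

/* Sketch: time a single load without a global store inside the timed region.
   _mm_lfence() stops earlier/later instructions from overlapping the load. */
static inline uint64_t load_latency2(volatile int *p) {
    unsigned aux;
    _mm_lfence();
    uint64_t t1 = __rdtscp(&aux);
    int tmp = *p;                 /* the timed load, kept in a local */
    _mm_lfence();
    uint64_t t2 = __rdtscp(&aux);
    (void)tmp;
    return t2 - t1;
}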

Replacing a failed drive in DRBD

How do I correctly set the size of the disk when replacing it, if I want to keep the original disk size?
The new disk is 4 GB, but I want to use only the capacity that was used before and is still in use on the other node's disk (2 GB).
Resource:
resource res-vdb {
    device drbd_res_vdb1 minor 1;
    disk /dev/vdb;
    meta-disk internal;
    protocol C;
    on node01 {
        address 192.168.0.1:7005;
    }
    on node02 {
        address 192.168.0.2:7005;
    }
}
Do I understand correctly that I can take the size from lsblk or from /sys/block/drbd1/size and set it in the res config before drbdadm create-md and drbdadm attach?
i.e. config:
resource res-vdb {
    device drbd_res_vdb1 minor 1;
    disk /dev/vdb;
    meta-disk internal;
    protocol C;
    disk {
        size 2097052K; # <==== 2GB
    }
    on node01 {
        address 192.168.0.1:7005;
    }
    on node02 {
        address 192.168.0.2:7005;
    }
}
You're correct that you can set the size in the DRBD res file before you create-md and attach, in order to explicitly set the size of the DRBD device.
As you've also suggested, you can retrieve the exact size of the DRBD device in various ways, including using lsblk or inspecting the kernel's view with cat /sys/block/drbd1/size, run from the peer node.
However, lsblk is going to do some rounding, and DRBD's parser doesn't seem to accept bytes (B) as a valid unit (drbd-utils 9.13.1 appears to accept only KB, MB, and GB), so you might be better off setting the size in sectors (s).
The size you find in /sys/block/drbd1/size is already in sectors, so an example would be:
# cat /sys/block/drbd1/size
27262072
# cat /etc/drbd.d/r1.res
resource res-vdb {
    protocol C;
    disk /dev/vdb;
    device minor 1;
    disk {
        size 27262072s;
    }
    on centos7-a {
        address 172.16.7.100:7779;
    }
    on centos7-b {
        address 172.16.7.101:7779;
    }
}
All that said, because DRBD auto-negotiates the device size among its peers, you could simply run drbdadm create-md res-vdb and drbdadm up res-vdb, and it should just work.
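For example, on the node with the replaced disk (a sketch using the resource name from the res file above):
# drbdadm create-md res-vdb
# drbdadm up res-vdb
DRBD then negotiates the usable size with the peer automatically on connect.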

How can I measure how long this Linux interrupt handler takes to run?

I am trying to debug a custom Linux serial driver that is missing some receive data. It has one interrupt for 4 serial ports, and the baud rate is 115200. Firstly, I would like to measure how long the interrupt handler takes to run; I have used perf, but it reports percentages rather than seconds. Secondly, does anyone see any issues with the code below that could be improved to speed things up?
void serial_interrupt(int irq, void *dev_id)
{
    ...
    // Need to loop through each port to see which port caused the interrupt.
    list_for_each(lpNode, &serial_ports)
    {
        struct serial_port_module *ser_dev = list_entry(lpNode, struct serial_port_module, port_list);
        lnIsr = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_ISR);
        if (lnIsr & IPM512_RX_INT)
        {
            while (serialdata_is_data_available(ser_dev)) // equals a ioread8()
            {
                lcIn = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_RBR);
                kfifo_in(&ser_dev->rx_fifo, &lcIn, sizeof(lcIn));
                // Notify if anyone is doing a blocking read.
                wake_up_interruptible(&ser_dev->read_queue);
            }
        }
    }
}
Use the ftrace API to try to track down your latency issues. It's worth the time to get to know: https://www.kernel.org/doc/Documentation/trace/ftrace.txt
If this is too heavyweight, what about adding some simple instrumentation yourself? getnstimeofday(struct timespec *ts) is relatively lightweight... with a little code you could expose, via a sysfs debug file, the worst-case execution time, some stats on the latency of calls to this function, and the worst-case number of bytes available per interrupt... if that last number gets near your hardware FIFO size, you're in trouble.
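A rough sketch of that instrumentation (worst_ns and its placement are illustrative, not from the original driver):
static u64 worst_ns;   /* expose via a sysfs/debugfs attribute */

void serial_interrupt(int irq, void *dev_id)
{
    struct timespec start, end;
    u64 delta_ns;

    getnstimeofday(&start);

    /* ... the existing port-scanning loop ... */

    getnstimeofday(&end);
    delta_ns = timespec_to_ns(&end) - timespec_to_ns(&start);
    if (delta_ns > worst_ns)
        worst_ns = delta_ns;
}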
One optimization would be to read the data in batches into a local buffer as long as data is available, then push the entire buffer into the kfifo, and only then wake up any readers:
unsigned char buf[FIFO_BATCH];   /* FIFO_BATCH is illustrative; size it to the hardware FIFO depth */
int cnt = 0;

while (serialdata_is_data_available(ser_dev) && cnt < sizeof(buf))
    buf[cnt++] = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_RBR);

kfifo_in(&ser_dev->rx_fifo, buf, cnt);
wake_up_interruptible(&ser_dev->read_queue);
But execution time of code this simple is not likely to be an issue. You're probably suffering from missed interrupts or unexpected latency of the interrupt handling.

blk_cleanup_queue() doesn't return on block device deregistration

I'm writing a block device driver for a hot-pluggable PCI memory device on 2.6.43.2-6.fc15 (so LDD3 is out of date with respect to a lot of functions), and I'm having trouble getting the block device de-registration to go smoothly. When the device is removed, I go to tear down the gendisk and request_queue, but it hangs in blk_cleanup_queue(). Presumably there's some queue-related cleanup I have neglected to carry out before that, but I can't see any major consistent differences from the other block drivers in that kernel tree that I am using for reference (memstick, cciss, etc.). What steps should I carry out before tidying up the queue and gendisk?
I am implementing .open, .release, .ioctl in the block_ops as well as a mydev_request(struct request_queue *q) attached with blk_init_queue(mydev_request, &mydev->lock), but I'm not sure exactly how to tidy the queue either when requests occur or when de-registering the block device.
This is caused by not ending the requests that you fetch off the queue. To fix it, end each request as follows:
/* Drain the queue, ending every fetched request so blk_cleanup_queue()
   is not left waiting on outstanding requests. */
while ((req = blk_fetch_request(q)) != NULL)
{
    res = mydev_submit_request_sg(mydev, req);
    if (res)
        __blk_end_request_all(req, res);    /* error: fail the whole request */
    else
        __blk_end_request_cur(req, res);    /* success: complete the current chunk */
}

What could cause "The MDL is being inserted twice on the same process list"?

We are developing an NDIS protocol and miniport driver. When the driver is in-use and the system hibernates we get a bug check (blue screen) with the following error:
LOCKED_PAGES_TRACKER_CORRUPTION (d9)
Arguments:
Arg1: 00000001, The MDL is being inserted twice on the same process list.
Arg2: 875da420, Address of internal lock tracking structure.
Arg3: 87785728, Address of memory descriptor list.
Arg4: 00000013, Number of pages locked for the current process.
The stack trace is not especially helpful as our driver does not appear in the listing:
nt!RtlpBreakWithStatusInstruction
nt!KiBugCheckDebugBreak+0x19
nt!KeBugCheck2+0x574
nt!KeBugCheckEx+0x1b
nt!MiAddMdlTracker+0xd8
nt!MmProbeAndLockPages+0x629
nt!NtWriteFile+0x55c
nt!KiFastCallEntry+0xfc
ntdll!KiFastSystemCallRet
ntdll!ZwWriteFile+0xc
kernel32!WriteFile+0xa9
What types of issues could cause this MDL error?
It turns out the problem was related to this code in our IRP_MJ_WRITE handler:
/* If not in D0 state, don't attempt transmits */
if (ndisProtocolOpenContext &&
    ndisProtocolOpenContext->powerState > NetDeviceStateD0)
{
    DEBUG_PRINT(("NPD: system in sleep mode, so no TX\n"));
    return STATUS_UNSUCCESSFUL;
}
This meant that we weren't fully completing the IRP, and NDIS was likely doing something funny as a result. Adding a call to IoCompleteRequest fixed the issue:
/* If not in D0 state, don't attempt transmits */
if (ndisProtocolOpenContext &&
    ndisProtocolOpenContext->powerState > NetDeviceStateD0)
{
    DEBUG_PRINT(("NPD: system in sleep mode, so no TX\n"));
    pIrp->IoStatus.Status = STATUS_UNSUCCESSFUL;
    IoCompleteRequest(pIrp, IO_NO_INCREMENT);
    return STATUS_UNSUCCESSFUL;
}
