Replacing a failed drive in DRBD - disk size

How do I correctly set the disk size when replacing a failed drive, if I want to keep the original size?
The new disk is 4 GB, but I only want to use the capacity that was in use before and is still in use on the other node's disk (2 GB).
Resource:
resource res-vdb {
    device drbd_res_vdb1 minor 1;
    disk /dev/vdb;
    meta-disk internal;
    protocol C;
    on node01 {
        address 192.168.0.1:7005;
    }
    on node02 {
        address 192.168.0.2:7005;
    }
}
Do I understand correctly that I can take the size from lsblk or from /sys/block/drbd1/size and set it in the res config before drbdadm create-md and drbdadm attach?
i.e. config:
resource res-vdb {
    device drbd_res_vdb1 minor 1;
    disk /dev/vdb;
    meta-disk internal;
    protocol C;
    disk {
        size 2097052K;   # <==== 2GB
    }
    on node01 {
        address 192.168.0.1:7005;
    }
    on node02 {
        address 192.168.0.2:7005;
    }
}

You're correct: you can set the size in the DRBD res file before you create-md and attach in order to explicitly control the size of the DRBD device.
As you've also suggested, you can retrieve the exact size of the DRBD device in various ways, including using lsblk or inspecting the kernel settings with cat /sys/block/drbd1/size, run from the peer node.
However, when you use lsblk, it's going to do some rounding. DRBD's parser doesn't seem to accept bytes (B) as a valid unit (drbd-utils version 9.13.1 seems to only accept KB, MB, and GB), so you might be better off setting the size in sectors (s).
The size you find in /sys/block/drbd1/size is already in sectors, so an example would be:
# cat /sys/block/drbd1/size
27262072
# cat /etc/drbd.d/r1.res
resource res-vdb {
    protocol C;
    disk /dev/vdb;
    device minor 1;
    disk {
        size 27262072s;
    }
    on centos7-a {
        address 172.16.7.100:7779;
    }
    on centos7-b {
        address 172.16.7.101:7779;
    }
}
All that said, because DRBD auto-negotiates the device size among its peers, you could simply run drbdadm create-md res-vdb and drbdadm up res-vdb, and it should just work.
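For example, a minimal replacement sequence on the node with the new disk might look like this (same resource as above; the surviving peer stays connected the whole time):
# drbdadm create-md res-vdb
# drbdadm up res-vdb
# drbdadm status res-vdb    # or: cat /proc/drbd -- wait for the resync from the peer to finish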

Related

Memory loads experience different latency on the same core

I am trying to implement a cache-based covert channel in C but noticed something weird. The physical address is shared between the sender and the receiver by mmap()ing the same file with the MAP_SHARED option. Below is the code for the sender process, which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:
// computes latency of a load operation
static inline CYCLES load_latency(volatile void* p) {
    CYCLES t1 = rdtscp();
    load = *((int *)p);
    CYCLES t2 = rdtscp();
    return (t2-t1);
}

void send_bit(int one, void *addr) {
    if(one) {
        clflush((void *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
        clflush((void *)addr);
    }
    else {
        x = *((int *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
    }
}

int main(int argc, char **argv) {
    if(argc == 2)
    {
        bit = atoi(argv[1]);
    }

    // transmit bit
    init_address(DEFAULT_FILE_NAME);
    send_bit(bit, address);
    return 0;
}
The load operation takes around 0-1000 cycles (covering both cache hits and cache misses) when issued by the same process.
The receiver program loads the same shared physical address and measures the latency on a cache hit or a cache miss; its code is shown below:
int main(int argc, char **argv) {
    init_address(DEFAULT_FILE_NAME);
    rdtscp();
    load__latency = load_latency((void *)address);
    printf("load latency = %d\n", load__latency);
    return 0;
}
(I ran the receiver manually after the sender process terminated)
However, the latency observed in this scenario is very different from the first case: the load operation takes around 5000-10000 cycles.
Both processes have been pinned to the same core ID using the taskset command. So, if I'm not wrong, both processes should experience the L1 cache latency on a cache hit and DRAM latency on a cache miss. Yet these two processes experience very different latencies. What could be the reason for this observation, and how can I have both processes experience the same latency?
The initial access to an mmaped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.
You're probably timing a page fault if you only do one time measurement per mmap, or per run of a whole program.
(Also, you don't seem to be doing anything to warm up the CPU frequency, so one core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with core/uncore clock.)
Also note that unless you run the 2nd process immediately (e.g. from the same shell command), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C states. Probably also the TLBs.
It's also strange that you cast away volatile in load = *((int *)p);
Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that could also soft page fault. If so, RDTSCP will have to wait for it, because the store can't retire.
(But on a TLB hit for the store, rdtscp doesn't have to wait for the store to commit to cache, since there's no mfence before it. A store can retire while its data is still in the store buffer; in fact it must retire before the store-buffer entry is known to be non-speculative and can therefore commit.)
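To illustrate the first couple of points, here is a minimal sketch of my own (assuming x86-64 Linux with GCC/Clang; "covert.bin" stands in for the asker's DEFAULT_FILE_NAME and must already exist and be non-empty): it pre-faults the page with MAP_POPULATE and warms the core up before taking a single timed measurement.
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <x86intrin.h>            /* __rdtscp, _mm_clflush */

static inline uint64_t timed_load(volatile int *p) {
    unsigned aux;
    uint64_t t1 = __rdtscp(&aux);
    (void)*p;                     /* the load stays volatile */
    uint64_t t2 = __rdtscp(&aux);
    return t2 - t1;
}

int main(void) {
    int fd = open("covert.bin", O_RDWR);
    if (fd < 0) return 1;

    /* MAP_POPULATE pre-faults the page, so the timed load below
     * measures cache behaviour rather than a page fault. */
    volatile int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    /* Warm-up: fills the TLB entry and gives the core time to leave
     * deep C-states / low frequency before the measurement. */
    for (int i = 0; i < 200000; i++)
        (void)*p;

    printf("hit  latency = %llu cycles\n", (unsigned long long)timed_load(p));
    _mm_clflush((void *)p);       /* evict the line, then time a miss */
    printf("miss latency = %llu cycles\n", (unsigned long long)timed_load(p));

    close(fd);
    return 0;
}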

What does virtual core in YARN vcore mean?

YARN uses the concept of a virtual core (vcore) to manage CPU resources. What is the benefit of using virtual cores? Is there a specific reason why YARN uses vcores?
Here is what the documentation states (emphasis mine)
A node's capacity should be configured with virtual cores equal to its
number of physical cores. A container should be requested with the
number of cores it can saturate, i.e. the average number of threads it
expects to have runnable at a time.
Unless a CPU core is hyper-threaded, it can run only one thread at a time (with hyper-threading the OS actually sees 2 cores for each physical core and can run two threads - of course that's a bit of cheating and nowhere near as efficient as having an actual physical core). Essentially, what it means for the end user is that a core can run a single thread, so if you want parallelism using Java threads, a reasonably good approximation is a number of threads equal to the number of cores. So if your container process (which is a JVM) will require 2 threads, it's better to map it to 2 vcores - that is what the last line means. And for the total capacity of the node, the vcores should be equal to the number of physical cores.
The most important thing to remember is that it is still the OS that schedules threads onto cores, just as it does for any other application; YARN itself has no control over that, beyond choosing the best possible approximation of how many threads to allocate to each container. That is why it is important to take into account other applications running on the OS, CPU cycles used by the kernel, etc., since not all cores will be available to the YARN application all the time.
EDIT: Further research
YARN does not enforce hard limits on CPU, but going through the code I can see how it tries to influence CPU scheduling or the CPU rate. Technically, YARN can launch different container processes - Java, Python, custom shell commands, etc. The responsibility for launching containers in YARN belongs to the ContainerExecutor component of the NodeManager, and I can see code for launching the container along with some platform-dependent hints. For example, in the case of DefaultContainerExecutor (which extends ContainerExecutor), on Windows it uses the "-c" parameter for CPU restriction and on Linux it uses process niceness to influence scheduling. There is another implementation, LinuxContainerExecutor (or better still CgroupsLCEResourcesHandler, since the former does not force the usage of cgroups), which tries to use Linux cgroups to limit the YARN CPU resources on that node. More details can be found here.
ContainerExecutor {
  .......
  .......
  protected String[] getRunCommand(String command, String groupId,
      String userName, Path pidFile, Configuration conf, Resource resource) {
    boolean containerSchedPriorityIsSet = false;
    int containerSchedPriorityAdjustment =
        YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY;

    if (conf.get(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY) !=
        null) {
      containerSchedPriorityIsSet = true;
      containerSchedPriorityAdjustment = conf
          .getInt(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY,
              YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY);
    }

    if (Shell.WINDOWS) {
      int cpuRate = -1;
      int memory = -1;
      if (resource != null) {
        if (conf
            .getBoolean(
                YarnConfiguration.NM_WINDOWS_CONTAINER_MEMORY_LIMIT_ENABLED,
                YarnConfiguration.DEFAULT_NM_WINDOWS_CONTAINER_MEMORY_LIMIT_ENABLED)) {
          memory = resource.getMemory();
        }

        if (conf.getBoolean(
            YarnConfiguration.NM_WINDOWS_CONTAINER_CPU_LIMIT_ENABLED,
            YarnConfiguration.DEFAULT_NM_WINDOWS_CONTAINER_CPU_LIMIT_ENABLED)) {
          int containerVCores = resource.getVirtualCores();
          int nodeVCores = conf.getInt(YarnConfiguration.NM_VCORES,
              YarnConfiguration.DEFAULT_NM_VCORES);
          // cap overall usage to the number of cores allocated to YARN
          int nodeCpuPercentage = Math
              .min(
                  conf.getInt(
                      YarnConfiguration.NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT,
                      YarnConfiguration.DEFAULT_NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT),
                  100);
          nodeCpuPercentage = Math.max(0, nodeCpuPercentage);
          if (nodeCpuPercentage == 0) {
            String message = "Illegal value for "
                + YarnConfiguration.NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT
                + ". Value cannot be less than or equal to 0.";
            throw new IllegalArgumentException(message);
          }
          float yarnVCores = (nodeCpuPercentage * nodeVCores) / 100.0f;
          // CPU should be set to a percentage * 100, e.g. 20% cpu rate limit
          // should be set as 20 * 100. The following setting is equal to:
          // 100 * (100 * (vcores / Total # of cores allocated to YARN))
          cpuRate = Math.min(10000,
              (int) ((containerVCores * 10000) / yarnVCores));
        }
      }
      return new String[] { Shell.WINUTILS, "task", "create", "-m",
          String.valueOf(memory), "-c", String.valueOf(cpuRate), groupId,
          "cmd /c " + command };
    } else {
      List<String> retCommand = new ArrayList<String>();
      if (containerSchedPriorityIsSet) {
        retCommand.addAll(Arrays.asList("nice", "-n",
            Integer.toString(containerSchedPriorityAdjustment)));
      }
      retCommand.addAll(Arrays.asList("bash", command));
      return retCommand.toArray(new String[retCommand.size()]);
    }
  }
}
For Windows (it utilizes winutils.exe), it uses a CPU rate (see the worked example below).
For Linux, it uses niceness as a parameter to control the CPU priority.
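To make the Windows CPU rate concrete (a worked example of the calculation in the code above, not additional YARN behaviour): with the node configured for 8 vcores, the physical-CPU percentage at 100, and a container requesting 2 vcores, yarnVCores = (100 * 8) / 100 = 8 and cpuRate = min(10000, 2 * 10000 / 8) = 2500, which per the code comment means winutils caps that container at a 25% CPU rate.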
"Virtual cores" are merely an abstraction of actual cores. This abstraction or "lie" (as i like to call it), allows YARN (and others) to dynamically spin threads (parallel process) based on availability. Take for example running map reduce on an "elastic" cluster with a processing limit constrained only by your wallet... The cloud baby... The. Cloud.
you can read more here

Accessing 0xCxxxxxxx guest kernel pointers within qemu-system-mips

In my QEMU-based project (system emulation) I analyse various kernel structures of the guest Linux. To read the guest virtual memory I use cpu_memory_rw_debug() function.
In particular, I search struct module linked list in the kernel memory using some kind of heuristics.
Let's assume that the relevant part of an element in this list looks like this:
--------------------- ---------------------
| prev = 0xc1231234 | | prev = 0xc5675678 |
--------------------- ---------------------
| next = 0xc1122334 | | next = 0xc5566778 |
--------------------- ---------------------
| etc.              | | etc.              |
--------------------- ---------------------
When QEMU emulates x86 or ARM, prev/next pointers can be accessed by cpu_memory_rw_debug() and they actually point to previous/next list elements.
However, when QEMU emulates MIPS, I observe the following strange behavior: while the prev/next pointers look like valid kernel pointers in every element of the list, I cannot access what they point to by means of cpu_memory_rw_debug(), because finding the corresponding physical address fails: the access permissions are OK, the virtual CPU is in kernel mode, but tlb->map_address() fails.
Since I can't walk through the linked list, I tried to find the elements one by one - just to see what their prev/next pointers look like - and I actually found all the elements, but all of them reside at 0xAxxxxxxx addresses, not 0xCxxxxxxx, as prev/next imply.
The function r4k_map_address(), which performs the physical address lookup, looks like this (only the relevant excerpt):
#define KSEG0_BASE 0x80000000UL
#define KSEG1_BASE 0xA0000000UL
#define KSEG2_BASE 0xC0000000UL
#define KSEG3_BASE 0xE0000000UL
//..............
    if (address < (int32_t)KSEG1_BASE) {
        /* kseg0 */
        if (kernel_mode) {
            *physical = address - (int32_t)KSEG0_BASE;
            *prot = PAGE_READ | PAGE_WRITE;
        } else {
            ret = TLBRET_BADADDR;
        }
    } else if (address < (int32_t)KSEG2_BASE) {
        /* kseg1 */
        if (kernel_mode) {
            *physical = address - (int32_t)KSEG1_BASE;
            *prot = PAGE_READ | PAGE_WRITE;
        } else {
            ret = TLBRET_BADADDR;
        }
    } else if (address < (int32_t)KSEG3_BASE) {
        /* sseg (kseg2) */
        if (supervisor_mode || kernel_mode) {
            ret = env->tlb->map_address(env, physical, prot, real_address, rw, access_type);
        } else {
            ret = TLBRET_BADADDR;
        }
That is, on MIPS 0xC0000000...0xE0000000 range is mapped differently from lower kernel ranges.
If I replace the TLB access with the direct mapping *physical = address - (int32_t)KSEG1_BASE, I get things working, but that's certainly not the solution.
Does it look like QEMU-related issue or a MIPS-related one? I'd appreciate any idea or debugging direction.
The bottom line is that cpu_memory_rw_debug() doesn't work reliably in qemu-system-mips.
The reason is that QEMU emulates the MIPS software-managed TLB. With this approach, whenever a virtual-to-physical address mapping does not exist in the TLB cache, QEMU raises a "TLB miss" exception, which must be handled by the OS. It is the OS's responsibility to walk the page directory and fill the TLB - QEMU (just like real MIPS hardware) won't do that.
While this approach works for the guest code, it means that guest virtual memory in mapped segments cannot be read reliably with cpu_memory_rw_debug().
As for the question of why kernel structs that actually reside in KSEG2 were observed in KSEG1 - that's just because some virtual ranges of KSEG1 and KSEG2 correspond to the same physical pages.
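A practical consequence: for debugging you can translate unmapped-segment addresses yourself, but mapped segments still require the guest's own page tables. A small helper sketch (not a QEMU API, just the fixed-mapping arithmetic from the r4k_map_address() excerpt above):
#include <stdbool.h>
#include <stdint.h>

/* Translate a MIPS32 kernel virtual address in an unmapped segment
 * (KSEG0/KSEG1) to a guest physical address. Mapped segments
 * (useg/KSEG2/KSEG3) return false: they go through the software-managed
 * TLB, so only the guest's own page tables can resolve them. */
static bool kseg_unmapped_to_phys(uint32_t va, uint32_t *pa)
{
    if (va >= 0x80000000UL && va < 0xA0000000UL) {  /* KSEG0: cached, unmapped */
        *pa = va - 0x80000000UL;
        return true;
    }
    if (va >= 0xA0000000UL && va < 0xC0000000UL) {  /* KSEG1: uncached, unmapped */
        *pa = va - 0xA0000000UL;
        return true;
    }
    return false;
}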

Map USB disk BSD Name to actual mounted drive(s) in OSX

I am trying to get from the USB device BSD Name to the actual mounted volume(s) for that device e.g. device has BSD name "disk2" and mounts a single volume with BSD name "disk2s1" at "/Volumes/USBSTICK".
Here is what I have been doing so far. Using
NSNotificationCenter NSWorkspaceDidMountNotification
I detect when a drive has been added. I then scan through all the USB devices and use
IORegistryEntrySearchCFProperty kIOBSDNameKey
to get the BSD name of the device.
For my USB stick this returns "disk2". Running
system_profiler SPUSBDataType
shows
Product ID: 0x5607
Vendor ID: 0x03f0 (Hewlett Packard)
Serial Number: AA04012700008687
Speed: Up to 480 Mb/sec
Manufacturer: HP
Location ID: 0x14200000 / 25
Current Available (mA): 500
Current Required (mA): 500
Capacity: 16.04 GB (16,039,018,496 bytes)
Removable Media: Yes
Detachable Drive: Yes
BSD Name: disk2
Partition Map Type: MBR (Master Boot Record)
S.M.A.R.T. status: Not Supported
Volumes:
    USBSTICK:
        Capacity: 16.04 GB (16,037,879,808 bytes)
        Available: 5.22 GB (5,224,095,744 bytes)
        Writable: Yes
        File System: MS-DOS FAT32
        BSD Name: disk2s1
        Mount Point: /Volumes/USBSTICK
        Content: Windows_FAT_32
which makes sense since there could be multiple volumes for a single USB device.
I assumed I could use DiskArbitration to find the actual volumes, but
DASessionRef session = DASessionCreate(NULL);
if (session)
{
    DADiskRef disk = DADiskCreateFromBSDName(NULL, session, "disk2");
    if (disk)
    {
        CFDictionaryRef dict = DADiskCopyDescription(disk);
        if (dict)
always returns a NULL dictionary.
So, how do I get from the BSD name of a USB device to the actual mounted volume(s) for that device? I guess it should be possible to iterate over all the volumes, get their BSD names and check whether they start with the device's name (e.g. /Volumes/USBSTICK above is "disk2s1"), but that's hacky - and what if there is a disk20, etc.?
Found a solution: IOBSDNameMatching creates a dictionary that matches the service with a given BSD name. The children of that service can then be searched for their BSD names.
NOTE: This is my first time doing anything on OSX. Also, the 'dict' in the above code was NULL because of a bug, but that dictionary is of no use for this anyway.
Here's some cut down code with no error checking etc.
CFMutableDictionaryRef matchingDict;
matchingDict = IOBSDNameMatching(kIOMasterPortDefault, 0, "disk2");

io_iterator_t itr;
// Might only ever be one service, so MatchingService could be used. Not sure though.
IOServiceGetMatchingServices(kIOMasterPortDefault, matchingDict, &itr);

io_object_t service;
while ((service = IOIteratorNext(itr)))
{
    io_iterator_t children;
    io_registry_entry_t child;

    // Obtain the service's children.
    IORegistryEntryGetChildIterator(service, kIOServicePlane, &children);
    while ((child = IOIteratorNext(children)))
    {
        CFTypeRef name = IORegistryEntrySearchCFProperty(child,
                                                         kIOServicePlane,
                                                         CFSTR(kIOBSDNameKey),
                                                         kCFAllocatorDefault,
                                                         kIORegistryIterateRecursively);
        if (name)
        {
            // Got child BSD Name e.g. "disk2s1"
        }
    }
}
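As a possible follow-up (an untested sketch of mine, not part of the answer above): once you have a partition BSD name such as "disk2s1", DiskArbitration can return the mount point through kDADiskDescriptionVolumePathKey:
#include <DiskArbitration/DiskArbitration.h>

// Map a partition BSD name (e.g. "disk2s1") to its mount point URL.
// Returns NULL if the volume isn't mounted; the caller releases the URL.
CFURLRef CopyMountURLForBSDName(const char *bsdName)
{
    CFURLRef url = NULL;
    DASessionRef session = DASessionCreate(kCFAllocatorDefault);
    if (session)
    {
        DADiskRef disk = DADiskCreateFromBSDName(kCFAllocatorDefault, session, bsdName);
        if (disk)
        {
            CFDictionaryRef desc = DADiskCopyDescription(disk);
            if (desc)
            {
                CFURLRef path = (CFURLRef)CFDictionaryGetValue(desc, kDADiskDescriptionVolumePathKey);
                if (path)
                    url = (CFURLRef)CFRetain(path);
                CFRelease(desc);
            }
            CFRelease(disk);
        }
        CFRelease(session);
    }
    return url;
}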

DRBD: too many "kernel: block drbdX: Out of sync" messages after /sbin/drbdadm verify

On all the systems where I am working with DRBD, there are many messages like this in the log after verification:
kernel: block drbd0: Out of sync: start=403446112, size=328 (sectors)
On some systems one might think it is caused by the workload, but there are also machines that are doing almost no work.
The computers are connected over a good-quality 1 Gb network.
These messages do not give me much confidence in the reliability of the system, and in the end they require a cron job to check the synchronization and resync the faulty blocks, which effectively turns a supposedly synchronous system into an asynchronous one.
Is this normal?
Any solution?
Am I doing something wrong?
common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f"
    }

    syncer {
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        verify-alg sha1;
        rate 40M;
    }
}

resource r0 {
    protocol C;

    startup {
        wfc-timeout 15;  # non-zero wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configuration)
        degr-wfc-timeout 60;
    }

    net {
        cram-hmac-alg sha1;
        shared-secret "XXXXXXXXXX";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    on pro01 {
        device /dev/drbd0;
        disk /dev/pve/vm-100-disk-1;
        address YYY.YYY.YYY.YYY:7788;
        meta-disk internal;
    }

    on pro02 {
        device /dev/drbd0;
        disk /dev/pve/vm-100-disk-1;
        address YYY.YYY.YYY.YYY:7788;
        meta-disk internal;
    }
}
This might happen from time to time and it's normal.
Just disconnect and connect again - then the out-of-sync blocks will be resynced.
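For example, with the r0 resource from the question:
# drbdadm disconnect r0
# drbdadm connect r0
On reconnect, the peers resync the blocks that the verify run marked as out of sync.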
DRBD - online verify
There is a long story related to this (http://www.gossamer-threads.com/lists/drbd/users/25227), and I'm still not sure whether this can (or should) be fixed by the DRBD developers, or whether we need to fix the upper-layer behaviour (KVM in my case).
