CentOS 7: HugePages_Rsvd equals 18446744073709551615 - memory-management

I have an application which uses a large amount of huge pages, for the purpose of DPDK. I allocate the pages at system start and then load/unload the application several times.
After some reloads, the program cannot allocate huge pages anymore. When I look at meminfo, I see:
HugePages_Total: 2656
HugePages_Free: 1504
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
It stays this way, and no application can allocate huge pages, until I reboot the machine.
Any idea?

The decrement_hugepage_resv_vma() function tries to decrement resv_huge_pages below zero, but because resv_huge_pages is an unsigned long, the subtraction wraps around to ULONG_MAX, which is 18446744073709551615 on 64-bit systems.
alloc_huge_page()
    page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve)
        decrement_hugepage_resv_vma() {
            h->resv_huge_pages--;
        }

static void return_unused_surplus_pages(struct hstate *h,
                                        unsigned long unused_resv_pages)
{
    h->resv_huge_pages -= unused_resv_pages;
}
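This wrap-around is easy to reproduce in user space. A minimal stand-alone sketch (ordinary C, not kernel code) that mirrors a decrement of resv_huge_pages while it is already 0:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned long resv_huge_pages = 0;  /* same type as the kernel counter */

    resv_huge_pages--;                  /* 0 - 1 wraps around */

    printf("%lu\n", resv_huge_pages);   /* 18446744073709551615 on 64-bit */
    printf("%d\n", resv_huge_pages == ULONG_MAX);   /* prints 1 */
    return 0;
}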
Another, less likely, cause is that the gather_surplus_pages() function can overflow resv_huge_pages.
Try the following while you run your test:
while true; do date | awk '{print $4}' >> HugePages_Rsvd.log; grep HugePages_Rsvd /proc/meminfo >> HugePages_Rsvd.log; sleep 1; done
My guess: if resv_huge_pages increases slowly, the problem is in h->resv_huge_pages += delta;
But if resv_huge_pages suddenly becomes -1 (18446744073709551615 as an unsigned long), the problem is in h->resv_huge_pages--; (resv_huge_pages was 0 and became -1 after the decrement).
Depending on your kernel, you can check the patch
mm: numa: disable change protection for vma(VM_HUGETLB)
6b79c57b92cdd90853002980609af516d14c4f9c
and the bug report "BUG large value for HugePages_Rsvd".

Related

How to replace a procfs read function that returned both EOF and the byte count read?

I am working on updating our kernel drivers to work with Linux kernel 4.4.0 on Ubuntu 16.04. The drivers last worked with Linux kernel 3.9.2.
In one of the modules, we have procfs entries created to read/write the on-board fan monitoring values. Fan monitoring is used to read/write the CPU or GPU temperature/modulation, etc. values.
The module is using the following api to create procfs entries:
struct proc_dir_entry *create_proc_entry(const char *name, umode_t mode,
                                         struct proc_dir_entry *parent);
Something like:
struct proc_dir_entry *proc_entry =
    create_proc_entry("fmon_gpu_temp", 0644, proc_dir);
proc_entry->read_proc = read_proc;
proc_entry->write_proc = write_proc;
Now, the read_proc is implemented something in this way:
static int read_value(char *buf, char **start, off_t offset, int count, int *eof, void *data)
{
    int len = 0;
    int idx = (int)data;

    if (idx == TEMP_FANCTL)
        len = sprintf(buf, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
                      fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    else if (idx == TEMP_CPU) {
        int i;
        len = sprintf(buf, "%d", fmon_readings[idx]);
        for (i = 0; i < FCTL_MAX_CPUS && fmon_cpu_temps[i]; i++) {
            len += sprintf(buf + len, " CPU%d=%d", i, fmon_cpu_temps[i]);
        }
        len += sprintf(buf + len, "\n");
    }
    else if (idx >= 0 && idx < READINGS_MAX)
        len = sprintf(buf, "%d\n", fmon_readings[idx]);

    *eof = 1;
    return len;
}
This read function definitely assumes that the user has provided enough buffer space to store the temperature value, which is correctly handled in the userspace program. Also, every call to this function returns the value in its entirety, so there is no support for (or need of) subsequent reads of the same temperature value.
Plus, if I use the 'cat' program on this procfs entry from the shell, it correctly displays the value. This works, I think, because the function sets the EOF flag and returns the count of bytes read.
New linux kernels do not support this API anymore.
My question is:
How can I port this to the new procfs API while keeping the same functionality: every read should return the whole value, and the 'cat' program should also work fine and not go into an infinite loop?
The primary user-space interface for reading files on Linux is read(2). Its counterpart in kernel space is the .read function in struct file_operations.
Every other mechanism for reading a file in kernel space (read_proc, seq_file, etc.) is actually a (parameterized) implementation of that .read function.
The only way for the kernel to return an EOF indication to user space is to return 0 as the number of bytes read.
Even the read_proc implementation you have for the 3.9 kernel actually implements the eof flag by returning 0 on the next invocation, and cat actually performs a second read call to find out that the file has ended.
(Moreover, cat performs more than two invocations of read: the first with a count of 1, the second with a count equal to the page size minus 1, and the last with the remaining count.)
The simplest way to implement a "one-shot" read is to use seq_file in single_open() mode.
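For a 4.4-era kernel a port might look like the sketch below (proc_create_data() with a struct file_operations; on 5.6+ kernels struct proc_ops would be used instead). It keeps the "whole value in one read" behaviour; fmon_readings, TEMP_SAMPLES and proc_dir are names from the code above, while TEMP_GPU and the new function names are made up for this example:

#include <linux/proc_fs.h>
#include <linux/seq_file.h>

/* seq_file "show" callback: prints the whole value in one go */
static int fmon_gpu_temp_show(struct seq_file *m, void *v)
{
    int idx = (long)m->private;   /* per-entry index passed via proc_create_data() */

    seq_printf(m, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
               fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    return 0;
}

static int fmon_gpu_temp_open(struct inode *inode, struct file *file)
{
    /* single_open(): one show() per open, then seq_read() reports EOF */
    return single_open(file, fmon_gpu_temp_show, PDE_DATA(inode));
}

static const struct file_operations fmon_gpu_temp_fops = {
    .owner   = THIS_MODULE,
    .open    = fmon_gpu_temp_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = single_release,
};

/* in module init, instead of create_proc_entry(): */
proc_create_data("fmon_gpu_temp", 0644, proc_dir, &fmon_gpu_temp_fops,
                 (void *)(long)TEMP_GPU);

Because single_open() runs the show callback once per open and seq_read() returns 0 afterwards, cat terminates cleanly; a .write handler can be added to the same file_operations to replace the old write_proc.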

Getting TSC rate from x86 kernel

I have an embedded Linux system running on an Atom, which is a new enough CPU to have an invariant TSC (time stamp counter), whose frequency the kernel measures on startup. I use the TSC in my own code to keep time (avoiding kernel calls), and my startup code measures the TSC rate, but I'd rather just use the kernel's measurement. Is there any way to retrieve this from the kernel? It's not in /proc/cpuinfo anywhere.
BPFtrace
As root, you can retrieve the kernel's TSC rate with bpftrace:
# bpftrace -e 'BEGIN { printf("%u\n", *kaddr("tsc_khz")); exit(); }' | tail -n1
(tested it on CentOS 7 and Fedora 29)
That is the value that is defined, exported and maintained/calibrated in arch/x86/kernel/tsc.c.
GDB
Alternatively, also as root, you can also read it from /proc/kcore, e.g.:
# gdb /dev/null /proc/kcore -ex 'x/uw 0x'$(grep '\<tsc_khz\>' /proc/kallsyms \
| cut -d' ' -f1) -batch 2>/dev/null | tail -n 1 | cut -f2
(tested it on CentOS 7 and Fedora 29)
SystemTap
If the system doesn't have bpftrace nor gdb available but SystemTap you can get it like this (as root):
# cat tsc_khz.stp
#!/usr/bin/stap -g
function get_tsc_khz() %{ /* pure */
    THIS->__retvalue = tsc_khz;
%}
probe oneshot {
    printf("%u\n", get_tsc_khz());
}
# ./tsc_khz.stp
Of course, you can also write a small kernel module that provides access to tsc_khz via the /sys pseudo file system. Even better, somebody already did that and a tsc_freq_khz module is available on GitHub. With that the following should work:
# modprobe tsc_freq_khz
$ cat /sys/devices/system/cpu/cpu0/tsc_freq_khz
(tested on Fedora 29, reading the sysfs file doesn't require root)
Kernel Messages
In case nothing of the above is an option you can parse the TSC rate from the kernel logs. But this gets ugly fast because you see different kinds of messages on different hardware and kernels, e.g. on a Fedora 29 i7 system:
$ journalctl --boot | grep 'kernel: tsc:' -i | cut -d' ' -f5-
kernel: tsc: Detected 2800.000 MHz processor
kernel: tsc: Detected 2808.000 MHz TSC
But on a Fedora 29 Intel Atom just:
kernel: tsc: Detected 2200.000 MHz processor
While on a CentOS 7 i5 system:
kernel: tsc: Fast TSC calibration using PIT
kernel: tsc: Detected 1895.542 MHz processor
kernel: tsc: Refined TSC clocksource calibration: 1895.614 MHz
Perf Values
The Linux Kernel doesn't provide an API to read the TSC rate, yet. But it does provide one for getting the mult and shift values that can be used to convert TSC counts to nanoseconds. Those values are derived from tsc_khz - also in arch/x86/kernel/tsc.c - where tsc_khz is initialized and calibrated. And they are shared with userspace.
Example program that uses the perf API and accesses the shared page:
#include <asm/unistd.h>
#include <inttypes.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}
The actual code:
int main(int argc, char **argv)
{
    struct perf_event_attr pe = {
        .type = PERF_TYPE_HARDWARE,
        .size = sizeof(struct perf_event_attr),
        .config = PERF_COUNT_HW_INSTRUCTIONS,
        .disabled = 1,
        .exclude_kernel = 1,
        .exclude_hv = 1
    };
    int fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open failed");
        return 1;
    }
    /* map the first page of the event: the perf_event_mmap_page header */
    void *addr = mmap(NULL, 4*1024, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }
    struct perf_event_mmap_page *pc = addr;
    if (pc->cap_user_time != 1) {
        fprintf(stderr, "Perf system doesn't support user time\n");
        return 1;
    }
    printf("%16s %5s\n", "mult", "shift");
    printf("%16" PRIu32 " %5" PRIu16 "\n", pc->time_mult, pc->time_shift);
    close(fd);
    return 0;
}
Tested it on Fedora 29; it also works for non-root users.
Those values can be used to convert a TSC count to nanoseconds with a function like this one:
static uint64_t mul_u64_u32_shr(uint64_t cyc, uint32_t mult, uint32_t shift)
{
    __uint128_t x = cyc;
    x *= mult;
    x >>= shift;
    return x;
}
CPUID/MSR
Another way to obtain the TSC rate is to follow DPDK's lead.
DPDK on x86_64 basically uses the following strategy:
- Read the 'Time Stamp Counter and Nominal Core Crystal Clock Information Leaf' via cpuid intrinsics (doesn't require special privileges), if available
- Read it from the MSR (requires the rawio capability and read permissions on /dev/cpu/*/msr), if possible
- Calibrate it in userspace by other means, otherwise
FWIW, a quick test shows that the cpuid leaf doesn't seem to be that widely available; e.g. a Skylake i7 and a Goldmont Atom don't have it. Otherwise, as can be seen from the DPDK code, using the MSR requires a bunch of intricate case distinctions.
However, in case the program already uses DPDK, getting the TSC rate, getting TSC values or converting TSC values is just a matter of using the right DPDK API.
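For instance, a minimal sketch of that, assuming the application already initializes the DPDK EAL (rte_get_tsc_hz() and rte_rdtsc() come from rte_cycles.h):

#include <rte_eal.h>
#include <rte_cycles.h>
#include <stdio.h>
#include <inttypes.h>

int main(int argc, char **argv)
{
    /* EAL init parses DPDK's own options and determines the TSC rate */
    if (rte_eal_init(argc, argv) < 0)
        return 1;

    uint64_t hz  = rte_get_tsc_hz();   /* TSC rate in Hz as DPDK determined it */
    uint64_t tsc = rte_rdtsc();        /* current raw TSC value */

    printf("TSC rate: %" PRIu64 " Hz (current count %" PRIu64 ")\n", hz, tsc);
    return 0;
}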
I had a brief look and there doesn't seem to be a built-in way to directly get this information from the kernel.
However, the symbol tsc_khz (which I'm guessing is what you want) is exported by the kernel. You could write a small kernel module that exposes a sysfs interface and use that to read out the value of tsc_khz from userspace.
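A minimal sketch of such a module, assuming it is built against the running kernel's headers; tsc_khz is the variable exported by arch/x86/kernel/tsc.c, while the sysfs directory name tsc_khz_demo is made up for this example:

#include <linux/module.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <asm/tsc.h>          /* declares the exported tsc_khz */

static ssize_t tsc_khz_show(struct kobject *kobj, struct kobj_attribute *attr,
                            char *buf)
{
    return sprintf(buf, "%u\n", tsc_khz);
}

static struct kobj_attribute tsc_khz_attr = __ATTR_RO(tsc_khz);
static struct kobject *tsc_kobj;

static int __init tsc_khz_demo_init(void)
{
    /* creates /sys/kernel/tsc_khz_demo/tsc_khz */
    tsc_kobj = kobject_create_and_add("tsc_khz_demo", kernel_kobj);
    if (!tsc_kobj)
        return -ENOMEM;
    return sysfs_create_file(tsc_kobj, &tsc_khz_attr.attr);
}

static void __exit tsc_khz_demo_exit(void)
{
    kobject_put(tsc_kobj);
}

module_init(tsc_khz_demo_init);
module_exit(tsc_khz_demo_exit);
MODULE_LICENSE("GPL");

After insmod, the value should show up under /sys/kernel/tsc_khz_demo/tsc_khz.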
If writing a kernel module is not an option, it may be possible to use some Dark Magic™ to read out the value directly from the kernel memory space. Parse the kernel binary or System.map file to find the location of the tsc_khz symbol and read it from /dev/{k}mem. This is, of course, only possible provided that the kernel is configured with the appropriate options.
Lastly, from reading the kernel source comments, it looks like there's a possibility that the TSC may be unstable on some platforms. I don't know much about the inner workings of the x86 arch but this may be something you want to take into consideration.
The TSC rate is directly related to "cpu MHz" in /proc/cpuinfo. Actually, the better number to use is "bogomips". The reason is that while the TSC frequency is the maximum CPU frequency, the current "cpu MHz" can vary at the time of your invocation.
The bogomips value is computed at boot. You'll need to adjust this value by the number of cores and the processor count (i.e. the number of hyperthreads). That gives you the [fractional] MHz. That is what I use to do what you want to do.
To get the processor count, look for the last "processor: " line. The processor count is <value> + 1. Call it "cpu_count".
To get the number of cores, any "cpu cores: " line works. The number of cores is <value>. Call it "core_count".
So, the formula is:
smt_count = cpu_count;
if (core_count)
smt_count /= core_count;
cpu_freq_in_khz = (bogomips * scale_factor) / smt_count;
That is extracted from my actual code, which is below.
Here's the actual code I use. You won't be able to use it directly because it relies on boilerplate I have, but it should give you some ideas, particularly on how to compute and adjust the values.
// syslgx/tvtsc -- system time routines (RDTSC)
#include <tgb.h>
#include <zprt.h>
tgb_t systvinit_tgb[] = {
    { .tgb_val = 1, .tgb_tag = "cpu_mhz" },
    { .tgb_val = 2, .tgb_tag = "bogomips" },
    { .tgb_val = 3, .tgb_tag = "processor" },
    { .tgb_val = 4, .tgb_tag = "cpu_cores" },
    { .tgb_val = 5, .tgb_tag = "clflush_size" },
    { .tgb_val = 6, .tgb_tag = "cache_alignment" },
    TGBEOT
};
// _systvinit -- get CPU speed
static void
_systvinit(void)
{
    const char *file;
    const char *dlm;
    XFIL *xfsrc;
    int matchflg;
    char *cp;
    char *cur;
    char *rhs;
    char lhs[1000];
    tgb_pc tgb;
    syskhz_t khzcpu;
    syskhz_t khzbogo;
    syskhz_t khzcur;
    sysmpi_p mpi;

    file = "/proc/cpuinfo";
    xfsrc = fopen(file,"r");
    if (xfsrc == NULL)
        sysfault("systvinit: unable to open '%s' -- %s\n",file,xstrerror());

    dlm = " \t";
    khzcpu = 0;
    khzbogo = 0;

    mpi = &SYS->sys_cpucnt;
    SYSZAPME(mpi);

    // (1) look for "cpu MHz : 3192.515" (preferred)
    // (2) look for "bogomips : 3192.51" (alternate)
    // FIXME/CAE -- on machines with speed-step, bogomips may be preferred (or
    // disable it)
    while (1) {
        cp = fgets(lhs,sizeof(lhs),xfsrc);
        if (cp == NULL)
            break;

        // strip newline
        cp = strchr(lhs,'\n');
        if (cp != NULL)
            *cp = 0;

        // look for symbol value divider
        cp = strchr(lhs,':');
        if (cp == NULL)
            continue;

        // split symbol and value
        *cp = 0;
        rhs = cp + 1;

        // strip trailing whitespace from symbol
        for (cp -= 1; cp >= lhs; --cp) {
            if (! XCTWHITE(*cp))
                break;
            *cp = 0;
        }

        // convert "foo bar" into "foo_bar"
        for (cp = lhs; *cp != 0; ++cp) {
            if (XCTWHITE(*cp))
                *cp = '_';
        }

        // match on interesting data
        matchflg = 0;
        for (tgb = systvinit_tgb; TGBMORE(tgb); ++tgb) {
            if (strcasecmp(lhs,tgb->tgb_tag) == 0) {
                matchflg = tgb->tgb_val;
                break;
            }
        }
        if (! matchflg)
            continue;

        // look for the value
        cp = strtok_r(rhs,dlm,&cur);
        if (cp == NULL)
            continue;

        zprt(ZPXHOWSETUP,"_systvinit: GRAB/%d lhs='%s' cp='%s'\n",
            matchflg,lhs,cp);

        // process the value
        // NOTE: because of Intel's speed step, take the highest cpu speed
        switch (matchflg) {
        case 1:  // genuine CPU speed
            khzcur = _systvinitkhz(cp);
            if (khzcur > khzcpu)
                khzcpu = khzcur;
            break;

        case 2:  // the consolation prize
            khzcur = _systvinitkhz(cp);
            // we've seen some "wild" values
            if (khzcur > 10000000)
                break;
            if (khzcur > khzbogo)
                khzbogo = khzcur;
            break;

        case 3:  // remember # of cpu's so we can adjust bogomips
            mpi->mpi_cpucnt = atoi(cp);
            mpi->mpi_cpucnt += 1;
            break;

        case 4:  // remember # of cpu cores so we can adjust bogomips
            mpi->mpi_corecnt = atoi(cp);
            break;

        case 5:  // cache flush size
            mpi->mpi_cshflush = atoi(cp);
            break;

        case 6:  // cache alignment
            mpi->mpi_cshalign = atoi(cp);
            break;
        }
    }

    fclose(xfsrc);

    // we want to know the number of hyperthreads
    mpi->mpi_smtcnt = mpi->mpi_cpucnt;
    if (mpi->mpi_corecnt)
        mpi->mpi_smtcnt /= mpi->mpi_corecnt;

    zprt(ZPXHOWSETUP,"_systvinit: FINAL khzcpu=%d khzbogo=%d mpi_cpucnt=%d mpi_corecnt=%d mpi_smtcnt=%d mpi_cshalign=%d mpi_cshflush=%d\n",
        khzcpu,khzbogo,mpi->mpi_cpucnt,mpi->mpi_corecnt,mpi->mpi_smtcnt,
        mpi->mpi_cshalign,mpi->mpi_cshflush);

    if ((mpi->mpi_cshalign == 0) || (mpi->mpi_cshflush == 0))
        sysfault("_systvinit: cache parameter fault\n");

    do {
        // use the best reference
        // FIXME/CAE -- with speed step, bogomips is better
#if 0
        if (khzcpu != 0)
            break;
#endif

        khzcpu = khzbogo;
        if (mpi->mpi_smtcnt)
            khzcpu /= mpi->mpi_smtcnt;
        if (khzcpu != 0)
            break;

        sysfault("_systvinit: unable to obtain cpu speed\n");
    } while (0);

    systvkhz(khzcpu);

    zprt(ZPXHOWSETUP,"_systvinit: EXIT\n");
}
// _systvinitkhz -- decode value
// RETURNS: CPU freq in khz
static syskhz_t
_systvinitkhz(char *str)
{
    char *src;
    char *dst;
    int rhscnt;
    char bf[100];
    syskhz_t khz;

    zprt(ZPXHOWSETUP,"_systvinitkhz: ENTER str='%s'\n",str);

    dst = bf;
    src = str;

    // get lhs of lhs.rhs
    for (; *src != 0; ++src, ++dst) {
        if (*src == '.')
            break;
        *dst = *src;
    }

    // skip over the dot
    ++src;

    // get rhs of lhs.rhs and determine how many rhs digits we have
    rhscnt = 0;
    for (; *src != 0; ++src, ++dst, ++rhscnt)
        *dst = *src;
    *dst = 0;

    khz = atol(bf);

    zprt(ZPXHOWSETUP,"_systvinitkhz: PRESCALE bf='%s' khz=%d rhscnt=%d\n",
        bf,khz,rhscnt);

    // scale down (e.g. we got xxxx.yyyy)
    for (; rhscnt > 3; --rhscnt)
        khz /= 10;

    // scale up (e.g. we got xxxx.yy--bogomips does this)
    for (; rhscnt < 3; ++rhscnt)
        khz *= 10;

    zprt(ZPXHOWSETUP,"_systvinitkhz: EXIT khz=%d\n",khz);
    return khz;
}
UPDATE:
Sigh. Yes.
I was using "cpu MHz" from /proc/cpuinfo prior to the introduction of processors with "speed step" technology, so I switched to "bogomips" and the algorithm was derived empirically based on that. When I derived it, I only had access to hyperthreaded machines. However, I've found an old one that is not and the SMT thing isn't valid.
However, it appears that bogomips is always 2x the [maximum] CPU speed; see http://www.clifton.nl/bogo-faq.html. That hasn't always been my experience on all kernel versions over the years [IIRC, I started with 0.99.x], but it's probably a reliable assumption these days.
With "constant TSC" [which all newer processors have], denoted by constant_tsc in the flags: field in /proc/cpuinfo, the TSC rate is the maximum CPU frequency.
Originally, the only way to get the frequency information was from /proc/cpuinfo. Now, however, in more modern kernels, there is another way that may be easier and more definitive [I had code coverage for this in other software of mine, but had forgotten about it]:
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
The content of this file is the maximum CPU frequency in kHz. There are analogous files for the other CPU cores. The files should be identical for most sane motherboards (e.g. ones that are composed of the same model chip and don't try to mix [say] i7s and Atoms). Otherwise, you'd have to keep track of the info on a per-core basis, and that would get messy fast.
The given directory also has other interesting files. For example, if your processor has "speed step" [and some of the other files can tell you that], you can force maximum performance by writing performance to the scaling_governor file. This will disable use of speed step.
If the processor did not have constant_tsc, you'd have to disable speed step [and run the cores at maximum rate] to get accurate measurements.
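If it helps, here is a small sketch that reads that sysfs file from C; the path and the plain-integer kHz format are as described above, and error handling is kept minimal:

#include <stdio.h>

static long max_freq_khz(int cpu)
{
    char path[128];
    long khz = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &khz) != 1)
        khz = -1;
    fclose(f);
    return khz;
}

int main(void)
{
    /* on a constant_tsc processor this should match the TSC rate in kHz */
    printf("cpu0 max freq: %ld kHz\n", max_freq_khz(0));
    return 0;
}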

Why doesn't free execute munmap?

I have the following code:
unsigned char *p = (unsigned char *)valloc(page_size);
if (!p) {
    ret = -1;
    goto out;
}
printf("valloc: allocated %d bytes, virtual address: %p\n", page_size, p);

memset(p, 0xFF, page_size);
memcpy(p, s, sizeof(s));
trace_mem(p, sizeof(s));

printf("Memory: %p - press any key\n", p);
getchar();

if (ioctl(fd, MY_IOC_PATCH) == -1) {
    fprintf(stderr, "ioctl %s error(%d): %s\n ", "MY_IOC_PATCH", errno, strerror(errno));
    ret = -1;
    goto out;
}

if (p) {
    printf("free: freed %d bytes, virtual address: %p\n", page_size, p);
    free(p);
}
.........................
Then I use strace to observe the system calls (strace ./my_program) and I get the following:
fstat64(1, {st_mode=S_IFREG|0644, st_size=1533, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7730000
brk(0) = 0x9d81000
brk(0x9da4000) = 0x9da4000
fstat64(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb772f000
read(0, "\n", 1024) = 1
ioctl(3, RTC_IRQP_SET, 0x1000) = 0
read(0, "\n", 1024) = 1
ioctl(3, RTC_EPOCH_READ, 0x9d82000) = 0
read(0, "\n", 1024) = 1
close(3) = 0
valloc: allocated 4096 bytes, virtual address: 0x9d82000
After the free I don't see munmap. I suppose that free must use munmap to unmap the memory, but it doesn't. What is the reason for that?
I think that Paramagnetic Croissant's comment, above, qualifies as "the Answer" to this one. It is ordinary practice for malloc() implementations to ask the operating system for more memory when they need it, but then never to give it back, on any operating system.
You see, there's really no need to "give it back." Pestering the kernel, asking it to carve out more VM space and to update the memory-management data structures, is a comparatively expensive operation, but it doesn't really "cost" much to keep the storage around. (Releasing the pages doesn't gain you anything, especially if you turn right around and have to ask for them again!) So, you just do it once.
If you stop using those pages, they'll eventually get swapped out, and the physical resource (page frames) will automagically get used for other purposes. "No harm, no foul." But if you suddenly start using that storage again, there's no reason to "pester the kernel" a second (or third) time. The pages just get swapped in again, and off you go.
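A small experiment makes this visible; this is only a sketch and assumes glibc (threshold values and the exact syscalls vary between allocator versions). Run it under strace: the small valloc/free pair normally produces no munmap, while the large allocation is served by its own mmap and munmap'd as soon as it is freed:

#include <stdlib.h>

int main(void)
{
    /* small, page-aligned allocation: typically carved out of the heap, so
     * free() just caches it inside the allocator and no munmap appears */
    void *small = valloc(4096);
    free(small);

    /* large allocation: usually above glibc's mmap threshold (128 KiB by
     * default), so it gets its own mapping and is munmap'd on free() */
    void *large = malloc(1 << 20);
    free(large);

    return 0;
}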
malloc/valloc (the page-aligned variant of malloc) actually hands out addresses from the process's virtual address space. These addresses are mapped to physical addresses by way of page tables that are specific to the particular process. Hence, in my opinion, all the kernel has to do in the case of [vm]alloc is:
1) Attach an anonymous segment to the process.
2) Associate a bunch of virtual address (heap area) entries with physical pages, of course on first use.
In the case of free, it just needs to disassociate the virtual memory entries from the physical pages. Note that since these are anonymous pages, it doesn't need to care where the "data" goes; when mmapping a file it may need to stage the data back to disk.
The physical pages are tracked and managed by the memory manager independently and are governed by caching principles (hot, cold, colour, etc.). Thus there is no question of free trying to give memory back to the kernel, since all it got was a virtual address. It gives the virtual address back to the glibc library, which maintains the virtual address chunks for use by that specific process.

Reliable way to determine file size on POSIX/OS X given a file descriptor

I wrote a function to watch a file (given an fd) grow to a certain size, with a timeout. I'm using kqueue()/kevent() to wait for the file to be "extended", but after I get the notification that the file grew I have to check the file size (and compare it against the desired size). That seems to be easy, but I cannot figure out a way to do it reliably in POSIX.
NB: The timeout will hit if the file doesn't grow at all for the time specified. So, this is not an absolute timeout, just a timeout that some growing happens to the file. I'm on OS X but this question is meant for "every POSIX that has kevent()/kqueue()", that should be OS X and the BSDs I think.
Here's my current version of my function:
/**
* Blocks until `fd` reaches `size`. Times out if `fd` isn't extended for `timeout`
* amount of time. Returns `-1` and sets `errno` to `EFBIG` should the file be bigger
* than wanted.
*/
int fwait_file_size(int fd,
                    off_t size,
                    const struct timespec *restrict timeout)
{
    int ret = -1;
    int kq = kqueue();
    struct kevent changelist[1];

    if (kq < 0) {
        /* errno set by kqueue */
        ret = -1;
        goto out;
    }

    memset(changelist, 0, sizeof(changelist));
    EV_SET(&changelist[0], fd, EVFILT_VNODE, EV_ADD | EV_ENABLE | EV_CLEAR,
           NOTE_DELETE | NOTE_RENAME | NOTE_EXTEND, 0, 0);
    if (kevent(kq, changelist, 1, NULL, 0, NULL) < 0) {
        /* errno set by kevent */
        ret = -1;
        goto out;
    }

    while (true) {
        {
            /* Step 1: Check the size */
            int suc_sz = evaluate_fd_size(fd, size); /* IMPLEMENTATION OF THIS IS THE QUESTION */
            if (suc_sz > 0) {
                /* wanted size */
                ret = 0;
                goto out;
            } else if (suc_sz < 0) {
                /* errno and return code already set */
                ret = -1;
                goto out;
            }
        }

        {
            /* Step 2: Wait for growth */
            int suc_kev = kevent(kq, NULL, 0, changelist, 1, timeout);
            if (0 == suc_kev) {
                /* That's a timeout */
                errno = ETIMEDOUT;
                ret = -1;
                goto out;
            } else if (suc_kev > 0) {
                if (changelist[0].filter == EVFILT_VNODE) {
                    if (changelist[0].fflags & NOTE_RENAME || changelist[0].fflags & NOTE_DELETE) {
                        /* file was deleted, renamed, ... */
                        errno = ENOENT;
                        ret = -1;
                        goto out;
                    }
                }
            } else {
                /* errno set by kevent */
                ret = -1;
                goto out;
            }
        }
    }

out: {
        int errno_save = errno;
        if (kq >= 0) {
            close(kq);
        }
        errno = errno_save;
        return ret;
    }
}
So the basic algorithm works the following way:
Set up the kevent
Check size
Wait for file growth
Steps 2 and 3 are repeated until the file reaches the wanted size.
The code uses a function int evaluate_fd_size(int fd, off_t wanted_size) which will return < 0 for "some error happened or file larger than wanted", == 0 for "file not big enough yet", or > 0 for file has reached the wanted size.
Obviously this only works if evaluate_fd_size is reliable in determining the file size. My first go was to implement it with off_t eof_pos = lseek(fd, 0, SEEK_END) and compare eof_pos against wanted_size. Unfortunately, lseek seems to cache the results. So even when kevent returned with NOTE_EXTEND, meaning the file grew, the result may be the same! Then I thought to switch to fstat but found articles saying that fstat caches as well.
The last thing I tried was using fsync(fd); before off_t eof_pos = lseek(fd, 0, SEEK_END); and suddenly things started working. But:
Nothing states that fsync() really solves my problem
I don't want to fsync() because of performance
EDIT: It's really hard to reproduce, but I saw one case in which fsync() didn't help. It seems to take a (very short) time until the file size is larger after a NOTE_EXTEND event hits user space. fsync() probably just works as a good-enough sleep() and therefore works most of the time.
So, in other words: how do I reliably check the file size in POSIX without opening/closing the file (which I cannot do because I don't know the file name)? Additionally, I can't find a guarantee that this would help.
By the way: int new_fd = dup(fd); off_t eof_pos = lseek(new_fd, 0, SEEK_END); close(new_fd); did not overcome the caching issue.
EDIT 2: I also created an all in one demo program. If it prints Ok, success before exiting, everything went fine. But usually it prints Timeout (10000000) which manifests the race condition: The file size check for the last kevent triggered is smaller than the actual file size at this very moment. Weirdly when using ftruncate() to grow the file instead of write() it seems to work (you can compile the test program with -DUSE_FTRUNCATE to test that).
Nothing states that fsync() really solves my problem
I don't want to fsync() because of performance
Your problem isn't "fstat caching results", it's the I/O system buffering writes. fstat doesn't get updated until the kernel flushes the I/O buffers to the underlying file system.
This is why fsync fixes your problem, and any solution to your problem more or less has to do the equivalent of fsync. (This is what the open/close solution does as a side effect.)
I can't help you with point 2 because I don't see any way to avoid doing the fsync.
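Given that, a sketch of evaluate_fd_size along those lines (fsync() before fstat(); the return-value convention is the one described in the question) might look like this:

#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

/* < 0: error, or file already larger than wanted (errno = EFBIG)
 *   0: file not big enough yet
 * > 0: file has reached the wanted size */
static int evaluate_fd_size(int fd, off_t wanted_size)
{
    struct stat st;

    if (fsync(fd) < 0)       /* flush buffered writes so st_size is current */
        return -1;
    if (fstat(fd, &st) < 0)  /* errno set by fstat */
        return -1;

    if (st.st_size > wanted_size) {
        errno = EFBIG;
        return -1;
    }
    return st.st_size == wanted_size ? 1 : 0;
}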

Solaris libumem: why does it not show a memory leak for the first dynamic allocation?

Say
void main()
{
    void *buff;
    buff = malloc(128);
    buff = malloc(60);
    buff = malloc(30);
    buff = malloc(16);
    free(buff);
    sleep(180);
}
libumem on Solaris 10 shows only 60 bytes and 30 bytes as leaked; why does it not show the 128 bytes as leaked as well?
Three memory leaks are detected by mdb and libumem with your code here:
cat > leak.c <<%
int main() { void *buff; buff = malloc(128); buff = malloc(60); buff = malloc(30); buff = malloc(16); free(buff); sleep(180); }
%
gcc -g leak.c -o leak
pkill leak
UMEM_DEBUG=default UMEM_LOGGING=transaction LD_PRELOAD=libumem.so.1 leak &
sleep 5
rm -f core.leak.*
gcore -o core.leak $(pgrep leak)
mdb leak core.leak.* <<%
::findleaks -d
%
gcore: core.leak.1815 dumped
CACHE LEAKED BUFCTL CALLER
0807f010 1 0808b4b8 main+0x29
0807d810 1 0808d088 main+0x39
0807d010 1 08092cd0 main+0x49
------------------------------------------------------------------------
Total 3 buffers, 280 bytes
umem_alloc_160 leak: 1 buffer, 160 bytes
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
808b4b8 8088f28 1af921ab662 1
807f010 806c000 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
main+0x29
_start+0x83
umem_alloc_80 leak: 1 buffer, 80 bytes
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
808d088 808cf68 1af921c11eb 1
807d810 806c064 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
main+0x39
_start+0x83
umem_alloc_40 leak: 1 buffer, 40 bytes
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
8092cd0 808efc8 1af921f2bd2 1
807d010 806c0c8 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
main+0x49
_start+0x83
