I have a limited understanding of the NuttX OS but have run into a limitation set by the config parameter CONFIG_NFILE_DESCRIPTORS while using the PX4 stack. I'm using a Pixhawk 4 FCU board that has an STM32F76 processor. The firmware build (px4_fmu-v5) sets that parameter to 20 by default. My understanding is that this is a soft limit applied to each module in the stack to limit its I/O. I can increase that limit without any visible issues so far, but this raises a few concerns:
To what extent can I increase the limit of that parameter without causing any issues?
What are the potential consequences of exceeding that limit?
Is there a way to find the hard limit of the number of file descriptors (assuming this is specific to the processor type)? If not, can I monitor the usage of file descriptors per module over an NSH shell?
If this is over-simplifying the issue I'd appreciate any pointers in the right direction, but I would prefer not to delve too deep into NuttX to understand how this generally works and what the limitations are here.
I'm trying to understand the relationship between container_memory_working_set_bytes, process_resident_memory_bytes, and total_rss (container_memory_rss) + file_mapped, so that I am better equipped to alert on a possible OOM.
The numbers go against my understanding (which is puzzling me right now), given that the container/pod is running a single process executing a compiled program written in Go.
Why is the difference between container_memory_working_set_bytes and process_resident_memory_bytes so big (the former is nearly 10 times larger)?
The relationship between container_memory_working_set_bytes and container_memory_rss + file_mapped is also weird here, and not what I expected after reading here
The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals to the value of total_rss from memory.stat file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
So if the cgroup's total resident set size is rss + file_mapped, how can this value be less than container_memory_working_set_bytes for a container that is running in the given cgroup?
This makes me feel that I am misreading something in these stats.
The following PromQL queries were used to build the graph above:
process_resident_memory_bytes{container="sftp-downloader"}
container_memory_working_set_bytes{container="sftp-downloader"}
go_memstats_heap_alloc_bytes{container="sftp-downloader"}
container_memory_mapped_file{container="sftp-downloader"} + container_memory_rss{container="sftp-downloader"}
So the relationship seems to be this:
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file
container_memory_usage_bytes, as its name implies, is the total memory used by the container. Because that total also includes file cache (i.e. inactive_file, which the OS can release under memory pressure), subtracting inactive_file gives container_memory_working_set_bytes.
The relationship between container_memory_rss and the working set can be summed up with the following expression:
container_memory_usage_bytes = container_memory_cache + container_memory_rss
cache reflects data stored on disk that is currently cached in memory. It contains the active and inactive file pages (mentioned above).
This explains why container_memory_working_set_bytes was higher: it still includes active file cache, whereas rss + file_mapped counts file pages only when they are mapped.
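To make the arithmetic concrete, here is a minimal Go sketch of the two formulas above. It assumes cgroup v1 style "key value" lines as found in memory.stat. The sample numbers are hypothetical and are not taken from the graph, and usage is approximated as cache + rss exactly as in the expression above (real usage counters can include more than that). The helper names parseMemoryStat and workingSet are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryStat parses "key value" lines in the style of a cgroup v1
// memory.stat file into a map. Malformed lines are skipped.
func parseMemoryStat(text string) map[string]uint64 {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats
}

// workingSet applies the formula from the answer:
// usage minus inactive file cache, clamped at zero.
func workingSet(usage, inactiveFile uint64) uint64 {
	if inactiveFile > usage {
		return 0
	}
	return usage - inactiveFile
}

// Hypothetical sample values (in bytes), chosen only to show the effect.
const sampleStat = `total_cache 900
total_rss 100
total_mapped_file 20
total_inactive_file 300
total_active_file 600
`

func main() {
	s := parseMemoryStat(sampleStat)
	usage := s["total_cache"] + s["total_rss"] // approximation used in the answer
	fmt.Println("usage:", usage)
	fmt.Println("working set:", workingSet(usage, s["total_inactive_file"]))
	fmt.Println("rss + mapped_file:", s["total_rss"]+s["total_mapped_file"])
}
```

With these made-up numbers the working set comes out at 700 while rss + mapped_file is only 120, because 600 bytes of active file cache are counted in the former but mostly not in the latter.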
Ref #1
Ref #2
Not really an answer, but still two assorted points.
Does this help to make sense of the chart?
Here at my $dayjob, we have faced various issues with how different tools external to the Go runtime count and display the memory usage of a process executing a program written in Go.
On top of that, Go's GC on Linux does not actually release freed memory pages to the kernel but merely uses madvise(2) to mark them as MADV_FREE, so a GC cycle that frees quite a hefty amount of memory does not result in any noticeable change in the "process RSS" readings taken by external tooling (usually cgroups stats).
Hence we export our own metrics, obtained by periodically calling runtime.ReadMemStats (and runtime/debug.ReadGCStats), in every major service written in Go, with the help of a simple package written specifically for that. These readings reflect the Go runtime's own view of the memory under its control.
By the way, the NextGC field of the memory stats is super useful to watch if you have memory limits set for your containers because once that reading reaches or surpasses your memory limit, the process in the container is surely doomed to be eventually shot down by the oom_killer.
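As a rough illustration of the approach described above (not the package the answer refers to), the sketch below reads a few fields from runtime.ReadMemStats and compares NextGC against a container memory limit. The memSnapshot type, the nearLimit helper and the 512 MiB limit are assumptions made up for this example. A real exporter would publish the values as metrics instead of printing them.

```go
package main

import (
	"fmt"
	"runtime"
)

// memSnapshot holds the few runtime readings discussed above.
// (Hypothetical type, introduced only for this sketch.)
type memSnapshot struct {
	HeapAlloc uint64 // bytes of allocated heap objects
	Sys       uint64 // total bytes obtained from the OS by the runtime
	NextGC    uint64 // target heap size of the next GC cycle
}

// readSnapshot asks the Go runtime for its own view of memory usage.
func readSnapshot() memSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memSnapshot{HeapAlloc: m.HeapAlloc, Sys: m.Sys, NextGC: m.NextGC}
}

// nearLimit reports whether the next GC target has reached or passed
// the given memory limit, which is the warning sign mentioned above.
func nearLimit(s memSnapshot, limitBytes uint64) bool {
	return s.NextGC >= limitBytes
}

func main() {
	const limitBytes = 512 << 20 // hypothetical container limit: 512 MiB

	s := readSnapshot()
	fmt.Printf("heap_alloc=%d sys=%d next_gc=%d\n", s.HeapAlloc, s.Sys, s.NextGC)
	if nearLimit(s, limitBytes) {
		fmt.Println("warning: NextGC has reached the container memory limit")
	}
}
```

In a long-running service, readSnapshot would typically be called from a ticker so the readings can be graphed next to the cgroup figures.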
I have written a CUDA kernel in which each thread updates a particular int-sized memory address. Some threads might want to update this address simultaneously.
How does CUDA handle this? Does the operation become atomic? Does this increase the latency of my application in any way? If so, how?
The operation does not become atomic, and it is essentially undefined behavior. When two or more threads write to the same location, one of the values will end up in the location, but there is no way to predict which one.
It can be especially problematic if you are reading and writing, such as to increment a variable.
CUDA provides a set of atomic operations to help.
You may also use other coding techniques such as parallel reductions, to help when there are multiple updates to the same location, such as finding a max or min value.
If you don't care about the order of updates, it should not be a performance issue for newer GPUs which automatically condense writes or reads to a single location in global memory or shared memory, but this is also not specified behavior.
I'm looking for a way to limit memory usage in Go. My application, written in Go, has a large data set that must be loaded into main memory, so I want to limit the maximum memory size of the process to a size specified by the user.
In C, I do this by accumulating the sizes of malloc'ed memory, but I don't know how to do the same thing in Go.
Please let me know if there is a way to do it.
Thank you.
The Go garbage collector is not deterministic and it is conservative. Therefore, using the runtime.MemStats variable is not going to be accurate for your purpose.
Bound your approximate memory usage instead: use the input from the user to cap the amount of data that you allow to be loaded into the process at one time.
Perhaps you want to use ulimit in conjunction with your go code?
You can do this via runtime/debug.SetMemoryLimit
See here for the original proposal.
Take a look here for the GitHub issue.
Besides runtime.MemStats you could use gosigar to monitor system memory.
I need to "calculate" optimum ulimit and fs.file-max values according to my own server's needs.
Please do not confuse this with "how to set those limits in various Linux distros" questions.
I am asking:
Is there any good guide to explain in detail, parameters used for ulimit? (> 2.6 series kernels)
Is there any good guide to show fs.file-max usage metrics?
Actually, there are some old references I could find on the net:
http://www.faqs.org/docs/securing/chap6sec72.html
"something reasonable like 256 for every 4M of RAM we have: i.e. for a machine with 128 MB of RAM, set it to 8192 - 128/4=32 32*256=8192"
Any up to date reference is appreciated.
For fs.file-max, I think in almost all cases you can just leave it alone. If you are running a very busy server of some kind and actually running out of file handles, then you can increase it -- but the value you need to increase it to will depend on exactly what kind of server you are running and what the load on it is. In general you would just need to increase it until you don't run out of file handles any more, or until you realize you need more memory or more systems to handle the load. The gain from "tuning" things by reducing file-max below the default is so minimal as to not be worth thinking about -- my phone works fine with an fs.file-max value of 83588.
By the way, the modern kernel already uses a rule of thumb to set file-max based on the amount of memory in the system; from fs/file_table.c in the 2.6 kernel:
/*
 * One file with associated inode and dcache is very roughly 1K.
 * Per default don't use more than 10% of our memory for files.
 */
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = max_t(unsigned long, n, NR_FILE);
and files_stat.max_files is the setting of fs.file-max; this ends up being about 100 for every 1MB of RAM.
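The rule of thumb can be checked with a small Go sketch that reproduces the quoted formula. It assumes that mempages is simply total RAM divided by the page size, which is a simplification (the kernel works from the memory actually available at boot, so real defaults come out somewhat lower), and it assumes NR_FILE is 8192 as in the kernel headers. The function name defaultFileMax is made up for illustration.

```go
package main

import "fmt"

// nrFile mirrors NR_FILE from the kernel headers (assumed to be 8192).
const nrFile = 8192

// defaultFileMax reproduces the quoted kernel formula:
// n = (mempages * (PAGE_SIZE / 1024)) / 10, with NR_FILE as the floor.
// memBytes and pageSize are both given in bytes.
func defaultFileMax(memBytes, pageSize uint64) uint64 {
	mempages := memBytes / pageSize
	n := mempages * (pageSize / 1024) / 10
	if n < nrFile {
		return nrFile
	}
	return n
}

func main() {
	const pageSize = 4096 // typical page size; an assumption

	for _, mib := range []uint64{128, 1024, 65 * 1024} {
		fmt.Printf("%6d MiB -> fs.file-max %d\n", mib, defaultFileMax(mib<<20, pageSize))
	}
}
```

Under these assumptions 1 GiB of RAM gives 104857, which matches the "about 100 per MB" figure, and 128 MiB gives 13107 rather than the 8192 suggested by the older 256-per-4MB guideline.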
ulimits of course are about limiting resources allocated by users or processes. If you have multiple users or another situation like that, then you can decide how you want to divide up system resources and limit memory use, number of processes, etc. The definitive guide to the details of the limits you can set is the setrlimit man page (and the kernel source, of course).
Typically, vendors of larger systems like Oracle or SAP recommend a very high limit so that it is never hit. I can only recommend this approach. The data structures are allocated dynamically, so as long as you don't need them they don't use up memory. If you actually need them, limiting them will not help you, because an application that reaches the limit will normally crash.
fs.file-max = 6815744 # this is roughly the default limit for a 70GB RAM system
The same is true for the user rlimits (nofile); there you can use 65535.
Note that both recommendations are only good for dedicated servers with one critical application and trusted shell users. A multi-user interactive shell host must have a restrictive max setting.
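Before raising either value it helps to see how close the system actually is to them. The sketch below is Linux-only and assumes /proc is mounted. It prints the current process's nofile rlimit via syscall.Getrlimit and parses /proc/sys/fs/file-nr, whose three fields are the number of allocated file handles, the number of allocated but unused handles, and the fs.file-max ceiling. The parseFileNr helper is made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// parseFileNr parses the contents of /proc/sys/fs/file-nr, which holds
// three whitespace-separated numbers: allocated handles, allocated but
// unused handles, and the system-wide limit (fs.file-max).
func parseFileNr(text string) (allocated, unused, limit uint64, err error) {
	fields := strings.Fields(text)
	if len(fields) != 3 {
		return 0, 0, 0, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	vals := make([]uint64, 3)
	for i, f := range fields {
		if vals[i], err = strconv.ParseUint(f, 10, 64); err != nil {
			return 0, 0, 0, err
		}
	}
	return vals[0], vals[1], vals[2], nil
}

func main() {
	// Per-process limit (what "ulimit -n" reports for this process).
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err == nil {
		fmt.Printf("nofile: soft=%d hard=%d\n", rl.Cur, rl.Max)
	}

	// System-wide usage against fs.file-max.
	data, err := os.ReadFile("/proc/sys/fs/file-nr")
	if err != nil {
		fmt.Println("cannot read file-nr:", err)
		return
	}
	allocated, unused, limit, err := parseFileNr(string(data))
	if err != nil {
		fmt.Println("cannot parse file-nr:", err)
		return
	}
	fmt.Printf("file handles: allocated=%d unused=%d max=%d\n", allocated, unused, limit)
}
```

Sampling the allocated figure under peak load gives a concrete number to size fs.file-max against, instead of a RAM-based guess.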
The purpose of the VirtualLock WinAPI call is to lock pages into the working set of a process. However, the WorkingSet64 API inexplicably doesn't count those pages.
Possibly as a result of this, neither Process Explorer nor the standard Task Manager count locked pages in their per-process memory usage statistics.
What's up with this? Could someone intimately familiar with virtual memory in WinNT shed some light on this inconsistency, which can cause gigabytes of used RAM to go essentially undetected? (think of SQL Server or VirtualBox)
Ah, that is easily explained: You're using the wrong API. GetProcessWorkingSetSize queries the minimum and maximum working set sizes. Those are quotas, not actual values.
The minimum working set size is what Windows will guarantee to keep locked in RAM as long as the world does not end. The maximum working set size is the amount of memory that Windows will allow your process before pages are moved into the pool (they are not necessarily gone, but accessing them causes a fault and re-mapping).
You want GetProcessMemoryInfo.
EDIT:
Since it is now clear that you were not using the wrong API (only named the wrong func), I've done some testing (VirtualAlloc and memory mapped files, both in combination with VirtualLock) on my XP system. At first sight, it looked like you are totally right. Allocating 512MB or memory mapping 512MB out of a 650MB file added 512MB to the virtual size but did not increase the working set. Following with a VirtualLock(512MB) did not affect the working set at all!
Then it occurred to me that VirtualLock took exactly zero time in every case, which did not seem plausible e.g. for having to fetch half a gigabyte from disk. So, I checked the return code and guess what. Windows doesn't think that locking 512MB is a good idea, and will refuse to do it.
Repeated the experiment with only 64MB, and behold, the working set immediately went up by 64MB, just as it should. So, in one word: "works for me".
Just to be sure, you did check the return code?
On a second look, this behaviour is even well-defined and well-documented. The docs to VirtualLock state explicitly:
The maximum number of pages that a process can lock is equal to the number of pages in its minimum working set minus a small overhead.
With and without locking, after appropriately setting the WS quotas:
VirtualBox is a different matter: what you see in Task Manager is only the working set of the "Interface" program and the "Manager" frontend, both of which keep their working sets below 64M at all times. I'm not sure what memory it may allocate in its drivers, though, or whether they lock memory at all.
I'm currently running 2 virtual machines with 1.6GB main memory each. Seeing how my 32-bit Windows only sees 3.25GB, that would leave a mere 50MB for everything else if the memory belonging to the VMs were locked. Besides, Process Explorer tells me that Firefox alone has a working set of 474MB and going up while I'm typing this (holy...?!!). That does not make it likely that all the memory in the virtual machines is really locked, because such figures would be entirely impossible then.
As requested, here's a shot of VMMap:
The figures are admittedly funny... the VM has 1.6GB total, of which according to VMMap 821MiB are reserved and 772MiB are committed, while Process Explorer only shows 163MiB and 54MiB, respectively. Something is definitely fishy there, but I suspect this is probably some obscure VirtualBox hackery rather than a Windows issue.