On a 54-core machine, I use os.Exec() to spawn hundreds of client processes, and manage them with an abundance of goroutines.
Sometimes, but not always, I get this:
runtime: failed to create new OS thread (have 1306 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
My ulimit is pretty high already:
$ ulimit -u
1828079
There's never a problem if I limit myself to, say, 54 clients.
Is there a way I can handle this situation more gracefully? E.g. not bomb out with a fatal error, and just do less/delayed work instead? Or query the system ahead of time and anticipate the maximum amount of stuff I can do (I don't just want to limit to the number of cores though)?
Given my large ulimit, should this error even be happening? grep -c goroutine on the stack output following the fatal error only gives 6087. Each client process (of which there are certainly less than 2000) might have a few goroutines of their own, but nothing crazy.
Edit: the problem only occurs on high-core machines (~60). Keeping everything else constant and just changing the number of cores down to 30 (this being an OpenStack environment, so the same underlying hardware still being used), these runtime errors don't occur.
Related
Dear Stackoverflowians,
I’m having a problem with a Spring Cloud Stream app using a Kafka Streams binder. It’s only in our own Pivotal Cloud Foundry (CF) environment where this issue occurs. I have kind of hit a wall at this point and so I turn to you and your wisdom!
When the application starts up I see the following error
<snip>
2019-08-07T15:17:58.36-0700 [APP/PROC/WEB/0]OUT current active tasks: [0_3, 1_3, 2_3, 3_3, 4_3, 0_7, 1_7, 5_3, 2_7, 3_7, 4_7, 0_11, 1_11, 5_7, 2_11, 3_11, 4_11, 0_15, 1_15, 5_11, 2_15, 3_15, 4_15, 0_19, 1_19, 5_15, 2_19, 3_19, 4_19, 0_23, 1_23, 5_19, 2_23, 3_23, 4_23, 5_23]
2019-08-07T15:17:58.36-0700 [APP/PROC/WEB/0]OUT current standby tasks: []
2019-08-07T15:17:58.36-0700 [APP/PROC/WEB/0]OUT previous active tasks: []
2019-08-07T15:18:02.67-0700 [API/0] OUT Updated app with guid 2db4a719-53ee-4d4a-9573-fe958fae1b4f ({"state"=>"STOPPED"})
2019-08-07T15:18:02.64-0700 [APP/PROC/WEB/0]ERR terminate called after throwing an instance of 'std::system_error'
2019-08-07T15:18:02.64-0700 [APP/PROC/WEB/0]ERR what(): Resource temporarily unavailable
2019-08-07T15:18:02.67-0700 [CELL/0] OUT Stopping instance 516eca4f-ea73-4684-7e48-e43c
2019-08-07T15:18:02.67-0700 [CELL/SSHD/0]OUT Exit status 0
2019-08-07T15:18:02.71-0700 [APP/PROC/WEB/0]OUT Exit status 134
2019-08-07T15:18:02.71-0700 [CELL/0] OUT Destroying container
2019-08-07T15:18:03.62-0700 [CELL/0] OUT Successfully destroyed container
The key here being the line with
what(): Resource temporarily unavailable
The error is related to the number of partitions. If I set the partition count to 12 or less things work. If I double it, the process fails to start with this error.
This doesn’t happen on my local windows dev machine. It also doesn’t happen in my local docker environment when I wrap this app in a docker image and run. I can take the same image and push it to CF or push the app as a java app, I get this error.
Here is some information about the kafka streams app. We have an input topic with a number of partitions. The topic is the output of debezium connector and basically it’s a change log of a bunch of database tables. The topology is not super complex but it’s not trivial. Its job is to aggregate the table update information back into our aggregates. We end up with 17 local stores in the topology. I have a strong suspicion this issue has something to do with rocksdb and the resources available to the CF container the app is in. But I have not the faintest idea what the Resource is that’s “temporarily unavailable”.
As I mentioned, I tried deploying it as a docker container with various jdk8 jvms, different base images centos, debian, I tried various different CF java buildbacks, I tried limiting java heap with relation to max container memory size (thinking that maybe it has something to do with native memory allocation) to no avail.
I’ve also asked our ops folks to up some limits on the containers and open files limit changed from the initial 16k to now 500k+. I saw some file lock related errors as below but they went away after this change.
2019-08-01T15:46:23.69-0700 [APP/PROC/WEB/0]ERR Caused by: org.rocksdb.RocksDBException: lock : /home/vcap/tmp/kafka-streams/cms-cdc/0_7/rocksdb/input/LOCK: No locks available
2019-08-01T15:46:23.69-0700 [APP/PROC/WEB/0]ERR at org.rocksdb.RocksDB.open(Native Method)
However the error what(): Resource temporarily unavailable with higher number of partitions persists.
ulimit -a on the container looks like this
~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1007531
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 524288
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I really do need to understand what the root of this error is. It’s hard to plan in this case not knowing what limit we’re hitting here.
Hope to hear your ideas. Thanks!
Edit:
Is there maybe some way to get more verbose error messages from the rocksdb library or a way to maybe build it so it outputs more info?
Edit 2
I have also tried to customize the rocksdb memory settings using org.apache.kafka.streams.state.RocksDBConfigSetter
The defaults are in org.apache.kafka.streams.state.internals.RocksDBStore#openDB(org.apache.kafka.streams.processor.ProcessorContext)
First I made sure the Java heap settings were well below the container process size limit and left nothing to the memory calculator by setting
JAVA_OPTS: -XX:MaxDirectMemorySize=100m -XX:ReservedCodeCacheSize=240m -XX:MaxMetaspaceSize=145m -Xmx1000m
With this I tried:
1.
Lowering the write buffer size
org.rocksdb.Options#setWriteBufferSize(long)
org.rocksdb.Options#setMaxWriteBufferNumber(int)
2.
Setting max_open_files to half the limit for the container (the total of all db instances)
org.rocksdb.Options#setMaxOpenFiles(int)
3.
I tried turning off the block cache altogether
org.rocksdb.BlockBasedTableConfig#setNoBlockCache
4.
I also tried setting cache_index_and_filter_blocks = true after re-enabling block cache
https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-and-filter-blocks
All to no avail. The above issue is still happening when I set higher number of partitions (24) on the input topic. Now that I have RocksDBConfigSetter with logging in it, I can see that the error happens exactly when the rocksdb is configured.
Edit 3
I still haven't gotten to the bottom of this. I have asked the question on https://www.facebook.com/groups/rocksdb.dev and was advised to trace system calls with strace or similar, but I was not able to obtain the required permissions to do that in our environment.
It has eaten up so much time that I had to settle for a workaround for now. What I ended up doing is refactored the topology to
1) minimize the number of materialized ktables (and the number of resulting rocksdb instances) and
2) break up the topology among multiple processes.
This allowed me to turn on and off topology parts in separate deployments with spring profiles and has given me some limited way forward for now.
Here is what I did:
Ran a q process with a limiting vmem argument
(say in a 100GB system, running vmem of 50GB)
Logged a unix top command
After the entire process was completed, I was trying to analyse the memory usage. I saw that the process %age memory usage crossed 90% mark. I believed that vmem restricts the memory consumption. But it seems that my process used more than 90GB memory at times.
How can this be explained? Am I missing something?
As #terrylynch said, command line parameter "-w" is used to set memory cap in kdb/q process. I checked and found that "-vmem" is not a kdb command line parameter.
vmem, in this context, is used in unix to manage memory of processes.
I am attempting to do a relatively simple task using boost interprocess semaphore and shared memory. I want to fixed buffer of data shared between two processes, where the first process is a producer and the second process is a consumer. The buffer will consist of 3 parts. The first part will be a boost::interprocess::semaphore, used to coordinate the producer/consumer. The second part will just be an integer, so the consumer knows how many items are on the buffer. The third part will be the actual array of items. I have a very basic implementation started, but the processes hang when attempting to access open the shared memory, and i'm not certain why. I am doing this on 64-bit Centos 6.5 with gcc/g++ 4.8.2. I should note also that the machine has two CPUs, and using process affinity I am ensuring that the producer and the consumer both run on separate CPUs.
The code is at http://pastie.org/9693362. I am experiencing the following issues; with the code as is, the consumer and producer both hang at line 3 (fixed_managed_shared_memory shm(open_only, "SharedMem" );). If i comment that line out, then both end up terminating (no error is caught however) at line 26 (the post/wait on the semaphore). This makes me think that somehow the memory isnt being shared, b/c when i print out the addresses, they seem to be properly formed (as in the offsets seem to be correct), and they are properly passed between processes. Is there something I'm missing on how to properly set this up?
Our product consumes a lot of windows resources, such as socket handles, memory, threads and so on. Usually there are 700-900 active threads, but in some cases product can rapidly create new threads and do some work, and close it.
I came across with crash memory dump of our product. With ~* windbg command I can see 817 active threads, but when I run !handle command it prints me these summary:
Type Count
None 15
Event 2603
Section 13
File 705
Directory 4
Mutant 32
WindowStation 2
Semaphore 789
Key 208
Process 1
Thread 5766
Desktop 1
IoCompletion 308
Timer 276
KeyedEvent 1
TpWorkerFactory 48
So, actually process holds 5766 threads. So, my question, When Windows actually frees handles for process? Is it possible some kind of delay, or cashing? Can someone explain this behavior?
I don't think that we have handle leaks, but we have weird behavior in legacy part of system with rapidly creating and closing threads for small tasks. Also I would like to point, that we unlikely run more than 1000 threads simultaneously, I am pretty sure about this.
Thanks.
When you say So, actually process holds 5766 threads., what you really mean is that the process holds 5766 thread handles.
Even though a thread may no longer be running, whether that is the result of a call to ExitThread()/TerminateThread() or returning from the ThreadProc, any handles to that thread will remain valid. This makes it possible to do things like call GetExitCodeThread() on the handle of a thread that has finished its work.
Unfortunately, that means that you have to remember to call CloseHandle() instead of just letting it leak. The MSDN example on Creating Threads covers this to some extent.
Another thing that I will note is that somewhere not too far above 1000 running threads, you are likely to exhaust the amount of virtual address space available to a 32bit process since each thread by default reserves 1MB of address space for its stack.
I hit a bug in my code which uses WSARecv and WSAGetOverlapped result on an overlapped socket. Under heavy load, WSAGetOverlapped returns with WSASYSCALLFAILURE ('A system call that should never fail has failed') and my TCP stream is out of sync afterwards, causing mayhem in the upper levels of my program.
So far I have not been able to isolate it to a given set of hardware or drivers. Has somebody hit this issue as well, and found a solution or workaround?
How many connections, how many pending recvs, how many outsanding sends? What does perfmon or task manager say about the amount of non-paged pool used? How much memory in the box? Does it go away if you run the program on Vista or above? Do you have any LSPs installed?
You could be exhausting non-paged pool and causing a badly written driver to misbehave when it fails to allocate memory. This issue is less likely to bite on Vista or later as the amount of non-paged pool available has increased dramatically (see http://www.lenholgate.com/blog/2009/03/excellent-article-on-non-paged-pool.html for details). Alternatively you might be hitting the "locked pages" limit (you can only lock a fixed number of pages in memory on the OS and each pending I/O operation locks one or more pages depending on buffer size and allocation alignment).
It seems I have solved this issue by sleeping 1ms and retrying the WSAGetOverlapped result when it reports a WSASYSCALLFAILURE.
I had another issue related to overlapped events firing, even though there is no data, which I also had to solve first. The test is now running for over an hour, with a few WSASYSCALLFAILURE handled correctly. Hopefully the overnight test will succeed as well.
#Len: thanks again for your help.
EDIT: The overnight test was successful. My bug was caused by two interdependent issues:
Issue 1: WaitForMultipleObjects in ConnectionSet::select occasionally
signals data on an empty socket, causing SocketConnection::readSync to
deadlock.
Fix: Do a non-blocking read on the first byte of each packet. Reset
ConnectionSet if socket was empty
Issue 2: WSAGetOverlappedResult returns occasionally WSASYSCALLFAILURE,
causing out-of-sync on the TCP stream.
Fix: Retry WSAGetOverlappedResult after a small sleep period.
http://equalizer.svn.sourceforge.net/viewvc/equalizer?view=revision&revision=4649