Setting IO Priority to Low instead of Very Low - winapi

I am trying to set the IO priority of a process to Low. Currently I am calling the SetPriorityClass function with PROCESS_MODE_BACKGROUND_BEGIN. This sets the IO priority to "Very Low". How do I go about setting it one level higher, to "Low"?
http://msdn.microsoft.com/en-us/library/ms686219(VS.85).aspx
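For reference, here is a minimal sketch of the call described above, plus one possible alternative: requesting the Low I/O priority hint on a per-handle basis via SetFileInformationByHandle. Treat the per-handle part as an assumption on my side (it sets a hint for a single file handle on Vista and later, not the process-wide I/O priority), and the file path is just a placeholder.

// Sketch, not a definitive answer.
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* What the question currently does: puts the whole process into background
       mode, which drops its I/O priority to "Very Low". */
    if (!SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN))
        printf("SetPriorityClass failed: %lu\n", GetLastError());

    /* Assumed alternative: ask for the "Low" hint on one specific handle. */
    HANDLE h = CreateFileW(L"C:\\example\\data.bin", /* placeholder path */
                           GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE) {
        FILE_IO_PRIORITY_HINT_INFO hint = { IoPriorityHintLow };
        if (!SetFileInformationByHandle(h, FileIoPriorityHintInfo,
                                        &hint, sizeof(hint)))
            printf("SetFileInformationByHandle failed: %lu\n", GetLastError());
        CloseHandle(h);
    }
    return 0;
}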

How to get the thread priority in the Windows kernel?

I am writing a kernel-mode driver and I would like to get the priority of a user-mode thread (it should be a number between 0 and 15). I have the PETHREAD.
KeQueryPriorityThread returns the current priority of the thread. Note that it does not correspond directly to the value given to the thread via SetThreadPriority; AFAIK it is the combined priority, i.e. process priority + thread priority + dynamic priority boosts.
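A minimal sketch of that call, assuming you already hold a valid, referenced PETHREAD (the cast works because the KTHREAD is the first member of the ETHREAD):

#include <ntddk.h>

// Sketch only: log the current (dynamic) priority of a thread from a
// kernel-mode driver.
VOID LogThreadPriority(PETHREAD Thread)
{
    // KeQueryPriorityThread takes a PKTHREAD; a PETHREAD can be passed
    // because the KTHREAD is embedded at the start of the ETHREAD.
    KPRIORITY prio = KeQueryPriorityThread((PKTHREAD)Thread);

    DbgPrint("Thread %p current priority: %d\n", Thread, (int)prio);
}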

ZeroMQ buffer size v/s High Water Mark

In the ZeroMQ socket options, there are options for both a high water mark and a buffer size.
For sending, it's ZMQ_SNDHWM and ZMQ_SNDBUF.
Can someone explain the difference between these two?
Each one controls a different thing:
ZMQ_SNDBUF: Set kernel transmit buffer size
The ZMQ_SNDBUF option shall set the underlying kernel transmit buffer size for the socket to the specified size in bytes. A ( default ) value of -1 means leave the OS default unchanged.
where man 7 socket says: ( credits go to #Matthew Slatery )
[...]
SO_SNDBUF
Sets or gets the maximum socket send buffer in bytes. The kernel doubles
this value (to allow space for bookkeeping overhead) when it is set using
setsockopt(), and this doubled value is returned by getsockopt(). The
default value is set by the wmem_default sysctl and the maximum allowed
value is set by the wmem_max sysctl. The minimum (doubled) value for this
option is 2048.
[...]
NOTES
Linux assumes that half of the send/receive buffer is used for internal
kernel structures; thus the sysctls are twice what can be observed on
the wire.
[...]
whereas
ZMQ_SNDHWM: Set high water mark for outbound messages
The ZMQ_SNDHWM option shall set the high water mark for outbound messages on the specified socket. The high water mark is a hard limit on the maximum number of outstanding messages ØMQ shall queue in memory for any single peer that the specified socket is communicating with. A ( non-default ) value of zero means no limit.
If this limit has been reached the socket shall enter an exceptional state and depending on the socket type, ØMQ shall take appropriate action such as blocking or dropping sent messages. Refer to the individual socket descriptions in zmq_socket(3) for details on the exact action taken for each socket type.
ØMQ does not guarantee that the socket will accept as many as ZMQ_SNDHWM messages, and the actual limit may be as much as 60-70% lower depending on the flow of messages on the socket.
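To make the two options concrete, here is a minimal C sketch of setting both on a PUB socket before binding (libzmq 3.x/4.x API, where both options take an int; the values and endpoint are arbitrary examples):

#include <zmq.h>
#include <stdio.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *pub = zmq_socket(ctx, ZMQ_PUB);

    int sndhwm = 1000;        /* max messages queued per peer (example value)   */
    int sndbuf = 256 * 1024;  /* kernel SO_SNDBUF size in bytes (example value) */

    /* ZMQ_SNDHWM limits how many outbound messages 0MQ itself will queue. */
    if (zmq_setsockopt(pub, ZMQ_SNDHWM, &sndhwm, sizeof(sndhwm)) != 0)
        fprintf(stderr, "ZMQ_SNDHWM: %s\n", zmq_strerror(zmq_errno()));

    /* ZMQ_SNDBUF sizes the underlying kernel transmit buffer. */
    if (zmq_setsockopt(pub, ZMQ_SNDBUF, &sndbuf, sizeof(sndbuf)) != 0)
        fprintf(stderr, "ZMQ_SNDBUF: %s\n", zmq_strerror(zmq_errno()));

    /* Options must be set before bind/connect to take effect. */
    zmq_bind(pub, "tcp://*:5556");

    zmq_close(pub);
    zmq_ctx_term(ctx);
    return 0;
}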
Verba docent, exempla trahunt ( words teach, examples lead the way ):
Imagine an infrastructure setup where the zmq.PUB side leaves all the settings at their default values and has about 20, 200 or 2000 zmq.SUB subscribers listening to it. The default, unmodified O/S kernel buffer space would soon be depleted, because each .bind()/.connect() relation ( each subscriber ) will try to "stuff" roughly a cumulative sum of ~ 1000 * aVarMessageSIZE [Bytes] of data into it once a broadcast has been fired in a .send( ..., ZMQ_DONTWAIT ) manner. If the O/S has not provided sufficient buffer space -- there we go --
"Houston, we have a problem..."
If the message cannot be queued on the socket, the zmq_send() function shall fail with errno set to EAGAIN.
Q.E.D.

Interpreting Intel VTune's Memory Bound Metric

I see the following when I run Intel VTune on my workload:
Memory Bound 50.8%
I read the Intel doc, which says (Intel doc):
Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation in addition to less common cases where stores could imply back-pressure on the pipeline.
Does that mean that roughly half of the instructions in my app are stalled waiting for memory, or is it more subtle than that?
The pipeline slots concept used by VTune is explained, e.g., here: https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win.
In short, a pipeline slot represents the hardware resources needed to process one uOp. So for 4-wide CPUs (most Intel processors) we can execute 4 uOps each cycle, and the total number of slots is measured as 4 * CPU_CLK_UNHALTED.THREAD by VTune.
The Memory Bound metric is built on the CYCLE_ACTIVITY.STALLS_MEM_ANY event, which directly gives you the stalls due to memory, taking out-of-order execution into account. Basically, the counter is incremented only if the CPU is stalled and at the same time has loads in flight; if there are loads in flight but the CPU is kept busy, it is not accounted as a memory stall.
So the Memory Bound metric provides a fairly accurate estimate of how much the workload is bound by memory performance issues. A value of 50% means that half of the time was wasted waiting for data from memory.
A slot is an execution port of the pipeline. In general, in the VTune documentation, a stall can mean either "not retired" or "not dispatched for execution". In this case, it refers to the number of cycles in which zero uops were dispatched.
According to the VTune include configuration files, Memory Bound is calculated as follows:
Memory_Bound = Memory_Bound_Fraction * BackendBound
Memory_Bound_Fraction is basically the fraction of slots mentioned in the documentation. However, according to the top-down method discussed in the optimization manual, the memory bound metric is relative to the backend bound metric. So this is why it is multiplied by BackendBound.
I'll focus on the first term of the formula, Memory_Bound_Fraction. The formula for the second term, BackendBound, is actually complicated.
Memory_Bound_Fraction is calculated as follows:
Memory_Bound_Fraction = ((CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) * NUM_OF_PORTS) / (Backend_Bound_Cycles * NUM_OF_PORTS)
NUM_OF_PORTS is the number of execution ports of the microarchitecture of the target CPU. This can be simplified to:
Memory_Bound_Fraction = (CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / Backend_Bound_Cycles
CYCLE_ACTIVITY.STALLS_MEM_ANY and RESOURCE_STALLS.SB are performance events. Backend_Bound_Cycles is calculated as follows:
Backend_Bound_Cycles = CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - Few_Uops_Executed_Threshold - Frontend_RS_Empty_Cycles + RESOURCE_STALLS.SB
Few_Uops_Executed_Threshold is either UOPS_EXECUTED.CYCLES_GE_2_UOP_EXEC or UOPS_EXECUTED.CYCLES_GE_3_UOP_EXEC depending on some other metric. Frontend_RS_Empty_Cycles is either RS_EVENTS.EMPTY_CYCLES or zero depending on some metric.
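To make the arithmetic concrete, here is a small sketch that plugs hypothetical raw counter values into the simplified formulas above (all numbers are made up; real values would come from VTune or perf):

#include <stdio.h>

int main(void)
{
    /* Hypothetical event counts -- placeholders, not real measurements. */
    double stalls_mem_any       = 5.0e8;  /* CYCLE_ACTIVITY.STALLS_MEM_ANY          */
    double resource_stalls_sb   = 0.5e8;  /* RESOURCE_STALLS.SB                     */
    double backend_bound_cycles = 8.0e8;  /* Backend_Bound_Cycles                   */
    double backend_bound        = 0.70;   /* BackendBound metric (already computed) */

    /* Memory_Bound_Fraction = (STALLS_MEM_ANY + RESOURCE_STALLS.SB) / Backend_Bound_Cycles */
    double memory_bound_fraction =
        (stalls_mem_any + resource_stalls_sb) / backend_bound_cycles;

    /* Memory_Bound = Memory_Bound_Fraction * BackendBound */
    double memory_bound = memory_bound_fraction * backend_bound;

    printf("Memory_Bound_Fraction = %.3f\n", memory_bound_fraction);
    printf("Memory_Bound          = %.1f%%\n", memory_bound * 100.0);
    return 0;
}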
I realize this answer still needs a lot of additional explanation and BackendBound needs to be expanded. But this early edit makes the answer accurate.

Why Garbage Collect in web apps?

Consider building a web app on a platform where every request is handled by a user-level thread (ULT) (green thread / Erlang process / goroutine / ... any lightweight thread). Assume every request is stateless and that resources like DB connections are obtained at startup of the app and shared between these threads. What is the need for garbage collection in these threads?
Generally such a thread is short-running (a few milliseconds) and, if well designed, doesn't use more than a few KB or MB of memory. If garbage collection of the resources allocated in a thread is done at the exit of the thread, independently of the other threads, then there would be no GC pauses even for the 98th or 99th percentile of requests. All requests would be answered in predictable time.
What is the problem with such a model, and why is it not widely used?
Your assumption might not be true.
if well designed doesn't use more than a few (KB or MB) of memory
Imagine a function for counting words in a text file that is used in a web app. A naive implementation could be:
def count_words(text):
    words = text.split()
    count = {}
    for w in words:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    return count
It allocates more memory than the text itself: the words list and the count dictionary can together easily exceed the size of the input text.

Tuning Netty on 32 Core / 10Gbit Hosts

Netty Server streams to a Netty client (point to point, 1 to 1):
Good case: Server and Client are both 12 cores, 1Gbit NIC => going at a steady rate of 300K 200-byte messages per second
Not So Good case: Server and Client are both 32 cores, 10Gbit NIC => (same code) starting at 130K/s and degrading down to hundreds per second within minutes
Observations
Netperf shows that the "bad" environment is actually quite excellent ( it can stream 600MB/s steadily for half an hour ).
It does not seem to be a client issue: if I swap in a known-good client (written in C) that sets the OS's maximum SO_RCVBUF and does nothing but read byte[]s and ignore them, the behavior is still the same.
Performance degradation starts before the high write watermark ( 200MB, but I tried others ) is reached.
The heap fills up quickly, and of course once it reaches the max, GC kicks in, locking the world; but that happens well after the "bad" symptoms surface. In the "good" environment the heap stays steady at about 1GB, which, given the configs, is logically where it should be.
One thing that I noticed: most of the 32 cores are utilized while Netty Server streams, which I tried to limit by setting all the Boss/NioWorker threads to 1 (although there is a single channel anyway, but just in case):
val bootstrap = new ServerBootstrap(
  new NioServerSocketChannelFactory(
    Executors.newFixedThreadPool( 1 ),
    Executors.newFixedThreadPool( 1 ), 1 ) )

// 1 thread max, memory limitation: 1GB by channel, 2GB global, 100ms of timeout for an inactive thread
// (Long literals, so the 2GB limit does not overflow Int arithmetic)
val pipelineExecutor = new OrderedMemoryAwareThreadPoolExecutor(
  1, 1L * 1024 * 1024 * 1024, 2L * 1024 * 1024 * 1024, 100, TimeUnit.MILLISECONDS,
  Executors.defaultThreadFactory() )

bootstrap.setPipelineFactory(
  new ChannelPipelineFactory {
    def getPipeline = {
      val pipeline = Channels.pipeline( serverHandlers.toArray : _* )
      pipeline.addFirst( "pipelineExecutor", new ExecutionHandler( pipelineExecutor ) )
      pipeline
    }
  } )
But that does not limit the number of cores used => most of the cores are still utilized. I understand that Netty tries to round-robin worker tasks, but I have a suspicion that 32 cores "at once" may just be too much for the NIC to handle.
Question(s)
Any suggestions on the degrading performance?
How do I limit the number of cores used by Netty (without, of course, going the OIO route)?
Side notes: I would've loved to discuss this on Netty's mailing list, but it is closed. I tried Netty's IRC, but it is dead.
Have you tried CPU/interrupt affinity?
The idea is to send the IO/IRQ interrupts to only 1 or 2 cores and to prevent context switches on the other cores.
Give it a go. Try vmstat and monitor the interrupt and context-switch counts before and after.
You may also want to unpin the application from the interrupt handler core(s); a sketch of that follows below.
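For the application side, a minimal Linux sketch of keeping the current process off the cores that service the NIC interrupts (the core numbers are arbitrary examples, and IRQ affinity itself is normally steered separately, e.g. via /proc/irq/*/smp_affinity; for an already-running JVM, taskset -c achieves the same effect):

// Sketch only: keep the application off cores 0-1, assuming those are the
// cores the NIC IRQs have been steered to (the core numbers are examples).
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* Allow this process to run on cores 2..31 only. */
    for (int cpu = 2; cpu < 32; cpu++)
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0 /* current process */, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d pinned away from cores 0-1\n", (int)getpid());
    /* ... start or exec the actual workload here ... */
    return 0;
}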
