Poor performance / lockup with STM

I'm writing a program where a large number of agents listen for events and react to them. Since Control.Concurrent.Chan.dupChan is deprecated, I decided to use TChans as advertised.
The performance of TChan is much worse than I expected. I have the following program that illustrates the issue:
{-# LANGUAGE BangPatterns #-}
module Main where

import Control.Concurrent.STM
import Control.Concurrent
import System.Random (randomRIO)
import Control.Monad (forever, when)

allCoords :: [(Int, Int)]
allCoords = [(x, y) | x <- [0..99], y <- [0..99]]

randomCoords :: IO (Int, Int)
randomCoords = do
  x <- randomRIO (0, 99)
  y <- randomRIO (0, 99)
  return (x, y)

main :: IO ()
main = do
  chan <- newTChanIO :: IO (TChan ((Int, Int), Int))
  -- one watcher thread per coordinate, each with its own duplicate of the channel
  let watcher p = do
        chan' <- atomically $ dupTChan chan
        forkIO $ forever $ do
          r@(p', _counter) <- atomically $ readTChan chan'
          when (p == p') (print r)
        return ()
  mapM_ watcher allCoords
  -- single writer: broadcast random coordinates as fast as possible
  let go !cnt = do
        xy <- randomCoords
        atomically $ writeTChan chan (xy, cnt)
        go (cnt + 1)
  go 1
When compiled (-O) and run, the program will at first output something like this:
./tchantest
((0,25),341)
((0,33),523)
((0,33),654)
((0,35),196)
((0,48),181)
((0,48),446)
((1,15),676)
((1,50),260)
((1,78),561)
((2,30),622)
((2,38),383)
((2,41),365)
((2,50),596)
((2,57),194)
((3,19),259)
((3,27),344)
((3,33),65)
((3,37),124)
((3,49),109)
((3,72),91)
((3,87),637)
((3,96),14)
((4,0),34)
((4,17),390)
((4,73),381)
((4,74),217)
((4,78),150)
((5,7),476)
((5,27),207)
((5,47),197)
((5,49),543)
((5,53),641)
((5,58),175)
((5,70),497)
((5,88),421)
((5,89),617)
((6,0),15)
((6,4),322)
((6,16),661)
((6,18),405)
((6,30),526)
((6,50),183)
((6,61),528)
((7,0),74)
((7,28),479)
((7,66),418)
((7,72),318)
((7,79),101)
((7,84),462)
((7,98),669)
((8,5),126)
((8,64),113)
((8,77),154)
((8,83),265)
((9,4),253)
((9,26),220)
((9,41),255)
((9,63),51)
((9,64),229)
((9,73),621)
((9,76),384)
((9,92),569)
...
And then, at some point, it will stop writing anything while still consuming 100% CPU.
((20,56),186)
((20,58),558)
((20,68),277)
((20,76),102)
((21,5),396)
((21,7),84)
With -threaded the lockup happens even faster and occurs after only a handful of lines. It will also consume whatever number of cores is made available through the RTS -N flag.
Additionally, the performance seems rather poor: only about 100 events per second are processed.
Is this a bug in STM, or am I misunderstanding something about the semantics of STM?

The program is going to perform quite badly. You're spawning off 10,000 threads, all of which will queue up waiting for a single TVar to be written to. So once they're all going, you may well get this happening:
Each of the 10,000 threads tries to read from the channel, finds it empty, and adds itself to the wait queue for the underlying TVar. So you'll have 10,000 queue-up events, and 10,000 threads in the wait queue for the TVar.
Something is written to the channel. This will unqueue each of the 10,000 threads and put them back on the run queue (this may be O(N) or O(1), depending on how the RTS is written).
Each of the 10,000 threads must then process the item to see if it's interested in it, which most won't be.
So each item will cause O(10,000) processing. If you see 100 events per second, that's 100 × 10,000 = 1,000,000 thread wake-ups per second, which means each thread requires about 1 microsecond to wake up, read a couple of TVars, write to one, and queue up again. That doesn't seem so unreasonable. I don't understand why the program would grind to a complete halt, though.
In general, I would scrap this design and replace it as follows:
Have a single thread reading the event channel, which maintains a map from coordinate to interested-receiver-channel. The single thread can then pick out the receiver(s) from the map in O(log N) time (much better than O(N), and with a much smaller constant factor involved), and send the event to just the interested receiver. So you perform just one or two communications to the interested party, rather than 10,000 communications to everyone. A list-based form of the idea is written in CHP in section 5.4 of this paper: http://chplib.files.wordpress.com/2011/05/chp.pdf

This is a great test case! I think you've actually created a rare instance of genuine livelock/starvation. We can test this by compiling with -eventlog and running with -vst, or by compiling with -debug and running with -Ds. We see that even as the program "hangs", the runtime is still working like crazy, jumping between blocked threads.
The high-level reason is that you have one (fast) writer and many (fast) readers. The readers and the writer both need to access the same TVar representing the end of the queue. Let's say that, nondeterministically, one thread succeeds and all the others fail when this happens. Now, as we increase the number of threads in contention to 100*100, the probability of the writer making progress rapidly goes towards zero. In the meantime, the writer in fact takes longer in its access to that TVar than the readers do, so that makes things worse for it.
In this instance, putting a tiny throttle between each invocation of go for the writer (say, threadDelay 100) is enough to fix the problem. It gives the readers enough time to all block between successive writes, and so eliminates the livelock. However, I do think it would be an interesting problem to improve the behavior of the runtime scheduler to deal with situations like this.

Adding to what Neil said, your code also has a space leak (noticeable with smaller n): after fixing the obvious tuple build-up issue by making the tuples strict, I was left with a heap profile (not reproduced here). What's happening, I think, is that the main thread is writing data to the shared TChan faster than the worker threads can read it (TChan, like Chan, is unbounded). So the worker threads spend most of their time re-executing their respective STM transactions, while the main thread is busy stuffing even more data into the channel; this explains why your program hangs.

Related

Go performance penalty in high number of calls to append

I'm writing an emulator in Go, and for debugging purposes I'm logging the CPU state at every emulator cycle, to generate a log file later.
There's something I'm not doing properly, because while the logger is enabled performance drops so much that the emulator becomes unusable.
The profiler clearly shows that the culprit resides in the logging routine (the logStep method).
The logStep method is very simple: it calls CreateState to snapshot the current CPU state into a struct, and then adds it to a slice (in the Log method).
I call this method at every emulated CPU cycle (around 30,000 times per second), and I suspect that either the garbage collector is slowing my execution down or I'm doing something wrong with this data structure.
I gather the profile graph is pointing me to runtime.growslice, caused by the append located in (*cpu6502Logger).Log, but I'm unable to find information on how to do this more efficiently.
Also, I'm scratching my head over why CreateState takes that long just to create a simple struct.
This is what CpuState looks like:
type CpuState struct {
    Registers          Cpu6502Registers
    CurrentInstruction Instruction
    RawOpcode          [4]byte
    EvaluatedAddress   Address
    CyclesSinceReset   uint32
}
This is how I create a CPU Snapshot:
func CreateState(cpu Cpu6502) CpuState {
    pc := cpu.Registers().Pc

    var rawOpcode [4]byte
    rawOpcode[0] = 0x00
    pc++

    instruction := cpu.instructions[rawOpcode[0]]
    for i := byte(0); i < (instruction.Size() - 1); i++ {
        rawOpcode[1+i] = cpu.memory.Read(pc + Address(i))
    }

    _, evaluatedAddress, _, _ := cpu.addressEvaluators[instruction.AddressMode()](pc)

    state := CpuState{
        *cpu.Registers(),
        instruction,
        rawOpcode,
        evaluatedAddress,
        cpu.cycle,
    }

    return state
}
And finally, this is how I add the snapshot to a collection (the Log method in the profile graph). I've also added how I initialize logger.snapshots:
func createCPULogger(outputPath string) cpu6502Logger {
    return cpu6502Logger{
        outputPath: outputPath,
        snapshots:  make([]CpuState, 0, 10024),
    }
}

func (logger *cpu6502Logger) Log(state CpuState) {
    logger.snapshots = append(logger.snapshots, state)
}
Why is it slow
Maintaining one gigantic slice to hold all the data there is becomes very costly, mainly because it has to keep growing. Each time an append outgrows the capacity, the whole backing array is copied to a bigger one to allow expansion. As the slice grows, each reallocation copies more data and gets slower and slower. And you told us that you emulate thousands of CPU states per second.
Solution
The best way to deal with this is to allocate a fixed buffer of some length. We know that eventually it will run out of space; when that happens there are two options. First, you can write all the data from the buffer to the file, truncate the buffer, and start filling it again (then write again). The other option is to save filled buffers in a slice and allocate a new one. Choose whichever fits your machine (a slow disk or little RAM is not good for the second solution).
Why does this help
I think this also helps the emulator itself. There will be performance spikes when the buffer is flushed, but most of the time performance will be at its maximum. Allocating a big chunk of memory is simply slow, as the allocator is less likely to find a fitting section on the first try, and the garbage collector is also very unhappy with frequent allocations. By allocating one buffer (big, but not too big) and filling it, we use a single allocation and store the data in sections. Sections that have already been written out can stay where they are. You could also say that in this case we are managing memory ourselves more than the GC does (no garbage is produced).

Kernel threads vs Timers

I'm writing a kernel module which uses a customized print-on-screen system. Basically each time a print is involved the string is inserted into a linked list.
Every X seconds I need to process the list and perform some operations on the strings before printing them.
Basically I have two choices to implement such a filter:
1) Timer (which restarts itself in the end)
2) Kernel thread which sleeps for X seconds
While the filter is doing its work nothing else may use the linked list, and, of course, while a string is being inserted the filter function must wait.
AFAIK a timer runs in interrupt context, so it cannot sleep, but what about kernel threads? Can they sleep? If so, is there some reason not to use them in my project? What other solution could be used?
To summarize: my filter function has got only 3 requirements:
1) Must be able to printk
2) When using the list everything else which is trying to access the list must block until the filter function finishes execution
3) Must run every X seconds (not a realtime requirement)
kthreads are allowed to sleep. (However, not all kthreads offer sleepful execution to all clients. softirqd for example would not.)
But then again, you could also use spinlocks (and their associated cost) and do without the extra thread (that's basically what the timer does; it uses spin_lock_bh). It's a trade-off, really.
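For illustration, here is a minimal sketch of the kthread approach (function and variable names are made up, not taken from your module); it assumes the pending strings live on a list protected by a mutex, which is fine precisely because a kthread is allowed to sleep:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/err.h>

static DEFINE_MUTEX(msg_lock);          /* protects the pending-string list */
static struct task_struct *filter_task;

static int filter_fn(void *unused)
{
    while (!kthread_should_stop()) {
        mutex_lock(&msg_lock);
        /* walk the list, filter the strings, printk() the results ... */
        mutex_unlock(&msg_lock);
        msleep_interruptible(5000);     /* sleep X seconds between passes */
    }
    return 0;
}

static int __init filter_init(void)
{
    filter_task = kthread_run(filter_fn, NULL, "print_filter");
    return IS_ERR(filter_task) ? PTR_ERR(filter_task) : 0;
}

static void __exit filter_exit(void)
{
    kthread_stop(filter_task);
}

module_init(filter_init);
module_exit(filter_exit);
MODULE_LICENSE("GPL");

Note that whatever inserts strings into the list has to take the same lock; if print() can run in atomic context it cannot take a mutex, which is where the spinlock trade-off above (or the circular buffer suggested below) comes in.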
each time a print is involved the string is inserted into a linked list
I don't really know if you meant print or printk. But if you're talking about printk(), you would need to allocate memory, and you are in trouble because printk() may be called in an atomic context. That leaves you the option of using a circular buffer (and thus you should be tolerant of dropping some strings, because you might not have enough memory to save them all).
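A minimal sketch of that circular-buffer idea (sizes and names are illustrative): fixed-size slots, no allocation on the insert path, and strings are simply dropped when the ring is full, so the insert stays safe to call from atomic context:

#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/errno.h>

#define RING_SLOTS 256
#define MSG_LEN    128

static char ring[RING_SLOTS][MSG_LEN];
static unsigned int ring_head, ring_tail;
static DEFINE_SPINLOCK(ring_lock);

static int ring_push(const char *msg)
{
    unsigned long flags;
    unsigned int next;
    int ret = 0;

    spin_lock_irqsave(&ring_lock, flags);
    next = (ring_head + 1) % RING_SLOTS;
    if (next == ring_tail) {
        ret = -ENOSPC;                  /* full: tolerate dropping the string */
    } else {
        strscpy(ring[ring_head], msg, MSG_LEN);
        ring_head = next;
    }
    spin_unlock_irqrestore(&ring_lock, flags);
    return ret;
}

The periodic filter would drain entries starting at ring_tail under the same lock, dropping the lock before doing anything that might sleep.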
Every X seconds I need to process the list and perform some operations on the strings before printing them.
In that case, I would not even use a kernel thread: I would do the processing in print() itself, if it's not too costly.
Otherwise, I would create a new system call:
sys_get_strings() or something, that would dump the whole linked list into userspace (and remove entries from the list when copied).
This way the whole behavior is controlled by userspace. You could create a daemon that would call the syscall every X seconds. You could also do all the costly processing in userspace.
You could also create a new device, say /dev/print-on-screen:
dev_open would allocate the memory, and print() would no longer be a no-op but would feed the data into the device's pre-allocated memory (so print() still works if it is used in atomic context and so on).
dev_release would throw everything out
dev_read would get you the strings
dev_write could do something on your print-on-screen system

How to locate idle time (and network IO time, etc.) in XPerf?

Let's say I have a contrived program:
#include <Windows.h>

void useless_function()
{
    Sleep(5000);
}

void useful_function()
{
    // ... do some work
    useless_function();
    // ... do some more work
}

int main()
{
    useful_function();
    return 0;
}
Objective: I want the profiler to tell me that useful_function() is needlessly calling useless_function(), which waits for no obvious reason. Under XPerf, this doesn't show up in any of the graphs I have, because the call to WaitForMultipleObjects() seems to be accounted to Idle.exe instead of my own program.
And here's the xperf command line that I currently run:
xperf -on Latency -stackwalk Profile
Any ideas?
(This is not restricted to wait functions. The above might have been solved by placing breakpoints at NtWaitForMultipleObjects. Ideally there would be a way to see the stack samples that take up a lot of wall-clock time, as opposed to only CPU time.)
I think what you are looking for is the Wait Analysis with Ready Thread functionality in XPerf. It captures every context switch and gives you the call stack of the thread once it wakes up from a sleep (or an otherwise blocking operation). In your case, you would see the stack just after the call to Sleep(5000), as well as the time spent sleeping.
The functionality is a bit obscure to use. But it is fortunately well described here:
Use Xperf's Wait Analysis for Application-Performance Troubleshooting
Wait Analysis is the way to do this. You should:
Record the CSWITCH provider, in order to get all context switches
Record call stacks on context switches by adding +CSWITCH to your -stackwalk argument
Probably record call stacks on the ready thread, to get more information on who readied you (i.e. who released the mutex, CS, or semaphore, and where), by adding +READYTHREAD to your -stackwalk (see the combined command below)
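Putting those three points together, the capture command from the question would become something like this (the Latency kernel group already includes the CSWITCH provider):
xperf -on Latency -stackwalk Profile+CSwitch+ReadyThread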
Then you use CPU Usage (Precise) in WPA (or xperfview, but that's ancient) to look at the context switches and find where your TimeSinceLast is high on a thread that shouldn't be going idle. You'll typically want the columns in CPU Usage (Precise) in this sort of order:
NewProcess (your process being switched in)
NewThreadId
NewThreadStack
ReadyingProcess (who made your thread ready to run)
ReadyingThreadId (optional)
ReadyThreadStack (optional, requires +ReadyThread on -stackwalk)
Orange bar
Count
TimeSinceLast (us) - sort by this column, usually
Whatever other columns you want
For details see these particular articles from my blog:
- https://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/
- https://randomascii.wordpress.com/2012/06/19/wpaxperf-trace-analysis-reimagined/
This "profiler" will tell you - just randomly pause it a few times and look at the stack. If do some work takes 5 seconds, and do some more work takes 5 seconds, then 33% of the time the stack will look like this
main: calling useful_function
useful_function: calling useless_function
useless_function: calling Sleep
So roughly 33% of your stack samples will show exactly that. Any line of code that's costing some fraction of wall-clock time will appear on roughly that fraction of samples.
On the rest of the samples you will see it doing the other things.
There are automated profilers that do the same thing in a more pretty way, such as Zoom and LTProf, although they don't actually show you the samples.
I looked at the xperf docs, trying to figure out if you could get stack samples on wall-clock time and get percentages at line-level resolution. It seems you have to be on Windows 7 or Vista. They only bother with functions, not lines, which matters if you have realistically big functions. I couldn't figure out how to get access to the individual samples, which I think is important for seeing why the program is spending its time the way it is.

Standard term for a thread I/O reorder buffer?

I have a case where many threads all concurrently generate data that is ultimately written to one long, serial file stream. I need to somehow serialize these writes so that the stream gets written in the right order.
I.e., I have an input queue of 2048 jobs j0..jn, each of which produces a chunk of data oi. The jobs run in parallel on, say, eight threads, but the output blocks have to appear in the stream in the same order as the corresponding input blocks: the output file has to be in the order o0, o1, o2, ...
The solution to this is pretty self evident: I need some kind of buffer that accumulates and writes the output blocks in the correct order, similar to a CPU reorder buffer in Tomasulo's algorithm, or to the way that TCP reassembles out-of-order packets before passing them to the application layer.
Before I go code it, I'd like to do a quick literature search to see if there are any papers that have solved this problem in a particularly clever or efficient way, since I have severe realtime and memory constraints. I can't seem to find any papers describing this though; a Scholar search on every permutation of [threads, concurrent, reorder buffer, reassembly, io, serialize] hasn't yielded anything useful. I feel like I must just not be searching the right terms.
Is there a common academic name or keyword for this kind of pattern that I can search on?
The Enterprise Integration Patterns book calls this a Resequencer (p282/web).
Actually, you shouldn't need to accumulate the chunks. Most operating systems and languages provide a random-access file abstraction that would allow each thread to independently write its output data to the correct position in the file, without affecting the output data from any of the other threads.
Or are you writing to truly serial output file like a socket?
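For example, on POSIX systems a sketch of that idea could use pwrite(), which takes an explicit file offset and so is safe to call from several threads at once; this assumes every output chunk has a known fixed size (the names below are illustrative):

#include <unistd.h>

#define CHUNK_SIZE 4096

/* Each worker writes its finished chunk straight to that chunk's slot in the
 * file; pwrite() does not use or move the shared file offset. */
static void write_chunk(int fd, long chunk_index, const void *data)
{
    pwrite(fd, data, CHUNK_SIZE, (off_t)chunk_index * CHUNK_SIZE);
}

With variable-sized chunks you would need to know each chunk's offset up front, or fall back to one of the reordering schemes below.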
I wouldn't use a reorderable buffer at all, personally. I'd create one 'job' object per job, and, depending on your environment, either use message passing or mutexes to receive completed data from each job in order. If the next job isn't done, your 'writer' process waits until it is.
I would use a ring buffer that has the same length as the number of threads you are using. The ring buffer would also have the same number of mutexes.
The ring buffer must also know the id of the last chunk it has written to the file; that corresponds to index 0 of the ring buffer.
On adding to the ring buffer, you check whether you can write, i.e. whether index 0 is set; if so, you can write more than one chunk at a time to the file.
If index 0 is not set, simply block the current thread and wait. -- You could also use a ring buffer 2-3 times the length of your number of threads and lock only when appropriate, i.e. when enough jobs to fill the buffer have been launched.
Don't forget to update the last chunk written though ;)
You could also use double buffering when writing to the file.
Have the output queue contain futures rather than the actual data. When you retrieve an item from the input queue, immediately post the corresponding future onto the output queue (taking care to ensure that this preserves the order --- see below). When the worker thread has processed the item it can then set the value on the future. The output thread can read each future from the queue, and block until that future is ready. If later ones become ready early this doesn't affect the output thread at all, provided the futures are in order.
There are two ways to ensure that the futures on the output queue are in the correct order. The first is to use a single mutex for reading from the input queue and writing to the output queue. Each thread locks the mutex, takes an item from the input queue, posts the future to the output queue and releases the mutex.
The second is to have a single master thread that reads from the input queue, posts the future on the output queue, and then hands the item off to a worker thread to execute.
In C++ with a single mutex protecting the queues this would look like:
#include <thread>
#include <mutex>
#include <future>
#include <queue>

struct work_data {};
struct result_data {};

std::mutex queue_mutex;   // a single mutex protects both queues, keeping the futures in order
std::queue<work_data> input_queue;
std::queue<std::future<result_data> > output_queue;

result_data process(work_data const&); // do the actual work

void worker_thread()
{
    for(;;) // substitute an appropriate termination condition
    {
        std::promise<result_data> p;
        work_data data;
        {
            std::lock_guard<std::mutex> lk(queue_mutex);
            if(input_queue.empty())
            {
                continue;
            }
            data = input_queue.front();
            input_queue.pop();
            std::promise<result_data> item_promise;
            output_queue.push(item_promise.get_future());
            p = std::move(item_promise);
        }
        p.set_value(process(data));
    }
}

void write(result_data const&); // write the result to the output stream

void output_thread()
{
    for(;;) // or whatever termination condition
    {
        std::future<result_data> f;
        {
            std::lock_guard<std::mutex> lk(queue_mutex);
            if(output_queue.empty())
            {
                continue;
            }
            f = std::move(output_queue.front());
            output_queue.pop();
        }
        write(f.get());
    }
}

How to wait/block until a semaphore value reaches 0 in windows

Using the semop() function on Unix, it's possible to provide a sembuf struct with sem_op = 0. Essentially this means that the calling process will wait/block until the semaphore's value becomes zero. Is there an equivalent way to achieve this in Windows?
The specific use case I'm trying to implement is to wait until the number of readers reaches zero before letting a writer write. (Yes, this is a somewhat unorthodox way to use semaphores; it's because there is no limit to the number of readers, and so there's no set of constrained resources, which is what semaphores are typically used to manage.)
Documentation on the Unix semop system call can be found here:
http://codeidol.com/unix/advanced-programming-in-unix/Interprocess-Communication/-15.8.-Semaphores/
Assuming you have one writer thread, just have the writer thread gobble up the semaphore. I.e., grab the semaphore via WaitForSingleObject as many times as the count you initialized the semaphore with.
A Windows semaphore counts down from the maximum value (the maximum number of readers allowed) to zero. WaitXxx functions wait for a non-zero semaphore value and decrement it; ReleaseSemaphore increments the semaphore (allowing other threads waiting on the semaphore to unblock). It is not possible to wait on a Windows semaphore in a different way, so a Windows semaphore is probably the wrong choice of synchronization primitive in your case. On Vista/2008 you could use slim read-write locks; if you need to support earlier versions of Windows you'll have to roll your own.
I've never seen any function similar to that in the Win32 API.
I think the way to do this is to call WaitForSingleObject or similar and get a WAIT_OBJECT_0 the same number of times as the maximum count specified when the semaphore was created. You will then hold all the available "slots" and anyone else waiting on the semaphore will block.
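A minimal sketch of that "gobble all the slots" idea in Win32 C (MAX_READERS stands for whatever count the semaphore was created with; the names are illustrative):

#include <windows.h>

#define MAX_READERS 16   /* must match the count used in CreateSemaphore */

/* Writer: acquire every slot so no new reader can enter; this also waits
 * until all current readers have released theirs. */
void writer_enter(HANDLE sem)
{
    for (int i = 0; i < MAX_READERS; ++i)
        WaitForSingleObject(sem, INFINITE);
}

/* Writer: hand all the slots back so readers can run again. */
void writer_leave(HANDLE sem)
{
    ReleaseSemaphore(sem, MAX_READERS, NULL);
}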
The specific use case I'm trying to implement is to wait until the number of readers reaches zero before letting a writer write.
Can you guarantee that the reader count will remain at zero until the writer is all done?
If so, you can implement the equivalent of SysV "wait-for-zero" behavior with a manual-reset event object, signaling the completion of the last reader. Maintain your own (synchronized) count of "active readers", decrementing as readers finish, and then signal the patiently waiting writer via SetEvent() when that count is zero.
If you can't guarantee that the readers will be well behaved, well, then you've got an unhappy race to deal with even with SysV sems.
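A minimal sketch of that event-based scheme (assuming, per the caveat above, that no new readers arrive once the writer starts waiting; all names are illustrative):

#include <windows.h>

static volatile LONG g_readers = 0;  /* count of active readers */
static HANDLE g_no_readers;          /* manual-reset event: signaled when the count is 0 */

void init(void)
{
    /* manual reset, initially signaled because there are no readers yet */
    g_no_readers = CreateEvent(NULL, TRUE, TRUE, NULL);
}

void reader_enter(void)
{
    if (InterlockedIncrement(&g_readers) == 1)
        ResetEvent(g_no_readers);    /* first reader in: writer must wait */
}

void reader_leave(void)
{
    if (InterlockedDecrement(&g_readers) == 0)
        SetEvent(g_no_readers);      /* last reader out: wake the writer */
}

void writer_wait_for_zero(void)
{
    WaitForSingleObject(g_no_readers, INFINITE);
}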
