Standard term for a thread I/O reorder buffer? - algorithm

I have a case where many threads all concurrently generate data that is ultimately written to one long, serial file stream. I need to somehow serialize these writes so that the stream gets written in the right order.
i.e., I have an input queue of 2048 jobs j0..jn, each of which produces a chunk of output oi. The jobs run in parallel on, say, eight threads, but the output blocks have to appear in the stream in the same order as the corresponding input jobs — the output file has to be in the order o0 o1 o2 ...
The solution to this is pretty self-evident: I need some kind of buffer that accumulates and writes the output blocks in the correct order, similar to a CPU reorder buffer in Tomasulo's algorithm, or to the way that TCP reassembles out-of-order packets before passing them to the application layer.
Before I go code it, I'd like to do a quick literature search to see if there are any papers that have solved this problem in a particularly clever or efficient way, since I have severe realtime and memory constraints. I can't seem to find any papers describing this though; a Scholar search on every permutation of [threads, concurrent, reorder buffer, reassembly, io, serialize] hasn't yielded anything useful. I feel like I must just not be searching the right terms.
Is there a common academic name or keyword for this kind of pattern that I can search on?

The Enterprise Integration Patterns book calls this a Resequencer (p282/web).

Actually, you shouldn't need to accumulate the chunks. Most operating systems and languages provide a random-access file abstraction that would allow each thread to independently write its output data to the correct position in the file without affecting the output data from any of the other threads.
Or are you writing to a truly serial output stream, like a socket?
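For a regular file, a minimal sketch of that idea in C, assuming fixed-size output chunks so each chunk's offset is simply its index times the chunk size (CHUNK_SIZE and write_chunk are made-up names, error handling omitted):
#include <fcntl.h>
#include <unistd.h>

#define CHUNK_SIZE 65536   /* assumed fixed output chunk size */

/* fd is a regular file opened once, e.g. open(path, O_WRONLY | O_CREAT, 0644).
   pwrite() ignores the shared file offset, so the workers need no locking among themselves. */
void write_chunk(int fd, long chunk_id, const void *buf, size_t len)
{
    pwrite(fd, buf, len, (off_t)chunk_id * CHUNK_SIZE);
}
Variable-size chunks would need a precomputed offset table instead of the fixed stride.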

I wouldn't use a reorderable buffer at all, personally. I'd create one 'job' object per job, and, depending on your environment, either use message passing or mutexes to receive completed data from each job in order. If the next job isn't done, your 'writer' process waits until it is.
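As a rough illustration of the mutex/condition-variable flavour of this (not the poster's code; job_t, complete_job and NJOBS are invented names), the workers never block on each other — only the writer blocks, and only on the single chunk it needs next:
#include <pthread.h>
#include <stdio.h>

#define NJOBS 2048

typedef struct {
    int    done;
    char  *output;        /* chunk produced by the worker */
    size_t output_len;
} job_t;

static job_t           jobs[NJOBS];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  job_done = PTHREAD_COND_INITIALIZER;

/* Called by a worker when job i has finished. */
void complete_job(int i, char *output, size_t len)
{
    pthread_mutex_lock(&lock);
    jobs[i].output = output;
    jobs[i].output_len = len;
    jobs[i].done = 1;
    pthread_cond_broadcast(&job_done);   /* wake the writer */
    pthread_mutex_unlock(&lock);
}

/* Single writer thread: emits chunks strictly in input order. */
void *writer(void *arg)
{
    FILE *out = arg;
    for (int i = 0; i < NJOBS; i++) {
        pthread_mutex_lock(&lock);
        while (!jobs[i].done)            /* block until the next chunk in order exists */
            pthread_cond_wait(&job_done, &lock);
        pthread_mutex_unlock(&lock);
        fwrite(jobs[i].output, 1, jobs[i].output_len, out);
    }
    return NULL;
}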

I would use a ring buffer that has the same length as the number of threads you are using. The ring buffer would also have the same number of mutexes.
The ring buffer must also know the id of the last chunk it has written to the file; that chunk is equivalent to index 0 of your ring buffer.
On adding to the ring buffer, you check whether you can write, i.e. whether index 0 is set; if so, you can write more than one chunk at a time to the file.
If index 0 is not set, simply block the current thread and wait. -- You could also make the ring buffer 2-3 times the length of your number of threads and lock only when appropriate, i.e. when enough jobs to fill the buffer have been launched.
Don't forget to update the last chunk written though ;)
You could also use double buffering when writing to the file.
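A simplified sketch of that ring buffer in C with pthreads, using a single mutex and a condition variable rather than one mutex per slot (all names invented; the thread that completes the "index 0" chunk drains every consecutive ready slot):
#include <pthread.h>
#include <stdio.h>

#define RING_SIZE 16   /* e.g. 2-3x the number of worker threads */

typedef struct {
    int    ready;
    char  *data;
    size_t len;
} slot_t;

static slot_t          ring[RING_SIZE];
static long            next_to_write;   /* id of the chunk at "index 0" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  slot_freed = PTHREAD_COND_INITIALIZER;

void submit_chunk(FILE *out, long id, char *data, size_t len)
{
    pthread_mutex_lock(&lock);
    /* Wait if this chunk is too far ahead of the last one written. */
    while (id >= next_to_write + RING_SIZE)
        pthread_cond_wait(&slot_freed, &lock);

    slot_t *s = &ring[id % RING_SIZE];
    s->data = data;
    s->len = len;
    s->ready = 1;

    /* Drain every consecutive ready slot starting at "index 0". */
    while (ring[next_to_write % RING_SIZE].ready) {
        slot_t *w = &ring[next_to_write % RING_SIZE];
        fwrite(w->data, 1, w->len, out);   /* done under the lock to keep the sketch simple */
        w->ready = 0;
        next_to_write++;
        pthread_cond_broadcast(&slot_freed);
    }
    pthread_mutex_unlock(&lock);
}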

Have the output queue contain futures rather than the actual data. When you retrieve an item from the input queue, immediately post the corresponding future onto the output queue (taking care to ensure that this preserves the order --- see below). When the worker thread has processed the item it can then set the value on the future. The output thread can read each future from the queue, and block until that future is ready. If later ones become ready early this doesn't affect the output thread at all, provided the futures are in order.
There are two ways to ensure that the futures on the output queue are in the correct order. The first is to use a single mutex for reading from the input queue and writing to the output queue. Each thread locks the mutex, takes an item from the input queue, posts the future to the output queue and releases the mutex.
The second is to have a single master thread that reads from the input queue, posts the future on the output queue, and then hands the item off to a worker thread to execute.
In C++ with a single mutex protecting the queues this would look like:
#include <thread>
#include <mutex>
#include <queue>
#include <future>

struct work_data {};
struct result_data {};

std::mutex queue_mutex;
std::queue<work_data> input_queue;
std::queue<std::future<result_data>> output_queue;

result_data process(work_data const&); // do the actual work

void worker_thread()
{
    for(;;) // substitute an appropriate termination condition
    {
        std::promise<result_data> p;
        work_data data;
        {
            std::lock_guard<std::mutex> lk(queue_mutex);
            if(input_queue.empty())
            {
                // note: this busy-waits while the queue is empty;
                // a condition variable would avoid spinning
                continue;
            }
            data=input_queue.front();
            input_queue.pop();
            std::promise<result_data> item_promise;
            output_queue.push(item_promise.get_future());
            p=std::move(item_promise);
        }
        p.set_value(process(data));
    }
}

void write(result_data const&); // write the result to the output stream

void output_thread()
{
    for(;;) // or whatever termination condition
    {
        std::future<result_data> f;
        {
            std::lock_guard<std::mutex> lk(queue_mutex);
            if(output_queue.empty())
            {
                continue; // busy-waits as above
            }
            f=std::move(output_queue.front());
            output_queue.pop();
        }
        write(f.get());
    }
}

Related

How to check which index in a loop is executing without slowing down the process?

What is the best way to check which index is executing in a loop without slowing the process down too much?
For example I want to find all long fancy numbers and have a loop like
for( long i = 1; i > 0; i++){
//block
}
and I want to learn which i is executing in real time.
Several ways I know of doing this in the block are printing i every time, checking if(i % 10000) and only printing then, or adding a listener.
Which one of these ways is the fastest? Or what do you do in similar cases? Is there any way to access the value of i manually?
Most of my recent experience is with Java, so I'd write something like this
import java.util.concurrent.atomic.AtomicLong;

public class Example {
    public static void main(String[] args) {
        AtomicLong atomicLong = new AtomicLong(1); // initialize to 1
        LoopMonitor lm = new LoopMonitor(atomicLong);
        Thread t = new Thread(lm);
        t.start(); // start LoopMonitor
        while(atomicLong.get() > 0) {
            long l = atomicLong.getAndIncrement(); // equivalent to long l = atomicLong++ if atomicLong were a primitive
            //block
        }
    }

    private static class LoopMonitor implements Runnable {
        private final AtomicLong atomicLong;

        public LoopMonitor(AtomicLong atomicLong) {
            this.atomicLong = atomicLong;
        }

        public void run() {
            while(true) {
                try {
                    System.out.println(atomicLong.longValue()); // print the current count
                    Thread.sleep(1000); // sleep for one second
                } catch (InterruptedException ex) {}
            }
        }
    }
}
Most AtomicLong implementations can be set in one clock cycle even on 32-bit platforms, which is why I used it here instead of a primitive long (you don't want to inadvertently print a half-set long); look into your compiler / platform details to see if you need something like this, but if you're on a 64-bit platform then you can probably use a primitive long regardless of which language you're using. The modified for loop doesn't take much of an efficiency hit - you've replaced a primitive long with a reference to a long, so all you've added is a pointer dereference.
It won't be easy, but probably the only way to probe the value without affecting the process is to access the loop variable in shared memory with another thread. Threading libraries vary from one system to another, so I can't help much there (on Linux I'd probably use pthreads). The "monitor" thread might do something like probe the value once a minute, sleep()ing in between, and so allowing the first thread to run uninterrupted.
To get essentially zero-cost reporting (on multi-CPU machines): make your index a "global" property (class-wide, for instance), and have a separate thread read and report its value.
This report could be timer-based (5 times per second or so).
Note: you may also need a boolean stating "are we in the loop?".
Volatile and Caches
If you're going to be doing this in, say, C / C++ and use a separate monitor thread as previously suggested, then you'll have to make the global/static loop variable volatile. You don't want the compiler deciding to use a register for the loop variable. Some toolchains make that assumption anyway, but there's no harm in being explicit about it.
And then there's the small issue of caches. A separate monitor thread nowadays will end up on a separate core, and that'll mean that the two separate cache subsystems will have to agree on what the value is. That will unavoidably have a small impact on the runtime of the loop.
Real real time constraint?
So that raises the question of just how real-time your loop actually is. I doubt that your timing constraint is such that you're depending on it running within a specific number of CPU clock cycles. Two reasons: a) no modern OS will ever come close to guaranteeing that, you'd have to be running on bare metal; b) most CPUs these days vary their own clock rate behind your back, so you can't count on a specific number of clock cycles corresponding to a specific real-time interval.
Feature rich solution
So assuming that your real time requirement is not that constrained, you may wish to do a more capable monitor thread. Have a shared structure protected by a semaphore which your loop occasionally updates, and your monitor thread periodically inspects and reports progress. For best performance the monitor thread would take the semaphore, copy the structure, release the semaphore and then inspect/print the structure, minimising the semaphore locked time.
The only advantage of this approach over that suggested in previous answers is that you could report more than just the loop variable's value. There may be more information from your loop block that you'd like to report too.
Mutex semaphores in, say, C on Linux are pretty fast these days. Unless your loop block is very lightweight the runtime overhead of a single mutex is not likely to be significant, especially if you're updating the shared structure every 1000 loop iterations. A decent OS will put your threads on separate cores, but for the sake of good form you'd make the monitor thread's priority higher than the thread running the loop. This would ensure that the monitoring does actually happen if the two threads do end up on the same core.
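A bare-bones POSIX sketch of that scheme, with a pthread mutex standing in for the semaphore and every name invented (UPDATE_EVERY, struct progress, monitor); the loop only touches the shared structure every UPDATE_EVERY iterations to keep the overhead low:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define UPDATE_EVERY 1000

struct progress {
    long i;
    long fancy_found;   /* example of extra data you might also want to report */
};

static struct progress shared;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static void *monitor(void *arg)
{
    (void)arg;
    for (;;) {
        struct progress snap;
        pthread_mutex_lock(&shared_lock);
        snap = shared;                       /* copy quickly, then release */
        pthread_mutex_unlock(&shared_lock);
        printf("i = %ld, found = %ld\n", snap.i, snap.fancy_found);
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, monitor, NULL);

    long found = 0;
    for (long i = 1; i > 0; i++) {
        /* block: test i, maybe increment found */
        if (i % UPDATE_EVERY == 0) {         /* amortise the locking cost */
            pthread_mutex_lock(&shared_lock);
            shared.i = i;
            shared.fancy_found = found;
            pthread_mutex_unlock(&shared_lock);
        }
    }
    return 0;   /* process exit also ends the monitor */
}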

Atomic operations in ARM

I've been working on an embedded OS for ARM. However, there are a few things I didn't understand about the architecture even after referring to the ARM ARM and the Linux source.
Atomic operations.
The ARM ARM says that Load and Store instructions are atomic and their execution is guaranteed to complete before an interrupt handler executes. Verified by looking at
arch/arm/include/asm/atomic.h :
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when I want to manipulate this value atomically using the CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc.), which use LDREX and STREX on ARMv7 (my target).
The ARM ARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and the STREX. The thing it does mention is locking the memory bus, which I guess is only helpful for MP systems where there can be more CPUs trying to access the same location at the same time. But for UP (and possibly MP), if a timer interrupt (or an IPI for SMP) fires in this small window between the LDREX and the STREX, the exception handler executes, possibly changes the CPU context and returns to a new task; the shocking part is that it executes 'CLREX', thereby removing any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than LDR and STR for atomicity on a UP system?
I did read something about an exclusive monitor, so I have a possible theory: when the thread resumes and executes the STREX, the monitor causes the store to fail, which can be detected, and the loop can be re-executed using the new value (branch back to the LDREX). Am I right here?
The idea behind the load-linked/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
    new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
    old_value = *dest;
    new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
    do
    {
        if (__LDREXW(dest) != old_value) return 1; // Failure
    } while(__STREXW(new_value, dest));
    return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and the __STREXW].
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
    // The foo_pool_alloc() method should return a slot number in the lower bits and
    // some sort of counter value in the upper bits so that once some particular
    // uint32_t value is returned, that same value will not be returned again unless
    // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
    // the possibility that while one task is performing its update, a second task
    // changes the thing to a new one and releases the old one, and a third task gets
    // given the newly-freed item and changes the thing to that, such that from the
    // point of view of the first task, the thing never changed.)
    uint32_t new_thing = foo_pool_alloc();
    uint32_t old_thing;
    do
    {
        // Capture old reference
        old_thing = foo_current_thing;
        // Compute new thing based on old one
        update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
                     &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
    } while(CompareAndStore(&foo_current_thing, new_thing, old_thing) != 0);
    foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, got the answer from their website.
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

Kernel threads vs Timers

I'm writing a kernel module which uses a customized print-on-screen system. Basically each time a print is involved the string is inserted into a linked list.
Every X seconds I need to process the list and perform some operations on the strings before printing them.
Basically I have two choices to implement such a filter:
1) Timer (which restarts itself in the end)
2) Kernel thread which sleeps for X seconds
While the filter is doing its work nothing else can use the linked list, and, of course, while a string is being inserted the filter function must wait.
AFAIK a timer runs in interrupt context so it cannot sleep, but what about kernel threads? Can they sleep? If so, is there any reason not to use them in my project? What other solution could be used?
To summarize: my filter function has got only 3 requirements:
1) Must be able to printk
2) When using the list everything else which is trying to access the list must block until the filter function finishes execution
3) Must run every X seconds (not a realtime requirement)
kthreads are allowed to sleep. (However, not all kthreads offer sleepful execution to all clients. softirqd for example would not.)
But then again, you could also use spinlocks (and their associated cost) and do without the extra thread (that's basically what the timer approach does: it uses spin_lock_bh). It's a tradeoff really.
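As a hedged sketch of the kthread option (function names, variable names and the period are all invented, and it assumes the producers that insert strings run in process context and can sleep on the mutex), the key point is that both sleeping and mutex_lock() are fine here because a kthread runs in process context:
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/mutex.h>
#include <linux/list.h>

#define FILTER_PERIOD_MS (5 * 1000)   /* "X seconds", picked arbitrarily */

static DEFINE_MUTEX(msg_lock);        /* producers take this before touching the list */
static LIST_HEAD(msg_list);

static int filter_thread(void *data)
{
    while (!kthread_should_stop()) {
        msleep_interruptible(FILTER_PERIOD_MS);   /* sleeping is fine in a kthread */

        mutex_lock(&msg_lock);        /* anyone inserting strings blocks here */
        /* walk msg_list, transform the strings, printk() them, free the entries */
        mutex_unlock(&msg_lock);
    }
    return 0;
}

/* Started from module init with something like
 *     task = kthread_run(filter_thread, NULL, "printfilter");
 * and stopped in module exit with kthread_stop(task). */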
each time a print is involved the string is inserted into a linked list
I don't really know if you meant print or printk. But if you're talking about printk(), you would need to allocate memory, and you are in trouble because printk() may be called in an atomic context. Which leaves you the option of using a circular buffer (and thus, you should be tolerant of dropping some strings, because you might not have enough memory to save all of them).
Every X seconds I need to process the list and perform some operations on the strings before printing them.
In that case, I would not even do a kernel thread: I would do the processing in print() if not too costly.
Otherwise, I would create a new system call:
sys_get_strings() or something, that would dump the whole linked list into userspace (and remove entries from the list when copied).
This way the whole behavior is controlled from userspace. You could create a daemon that calls the syscall every X seconds. You could also do all the costly processing in userspace.
You could also create a new device, say /dev/print-on-screen:
dev_open would allocate the memory, and print() would no longer be a no-op, but would feed the data into the device's pre-allocated memory (in case print() is used in atomic context and all).
dev_release would throw everything out
dev_read would get you the strings
dev_write could do something on your print-on-screen system
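A very rough outline of that device approach (all names invented; simplified to a static buffer rather than allocating in dev_open, and only the read side shown):
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

#define POS_BUF_SIZE 4096

static char   pos_buf[POS_BUF_SIZE];  /* pre-allocated: print() can append here even
                                         from atomic context, e.g. under a spinlock */
static size_t pos_len;

static ssize_t pos_read(struct file *f, char __user *ubuf, size_t count, loff_t *ppos)
{
    /* hand the accumulated strings to userspace */
    return simple_read_from_buffer(ubuf, count, ppos, pos_buf, pos_len);
}

static const struct file_operations pos_fops = {
    .owner = THIS_MODULE,
    .read  = pos_read,
};

static struct miscdevice pos_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "print-on-screen",
    .fops  = &pos_fops,
};

/* registered with misc_register(&pos_dev) in module init,
   removed with misc_deregister(&pos_dev) in module exit */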

How can I use spinlocks on list entries inside the linux kernel?

I'm developing a patch for the Linux kernel. I have to use several lists, and I have to protect them against concurrent modification on a multicore machine. I'm trying to use spinlocks for this, but there's something I can't understand. I have to lock the entries of a list (I'm using the default Linux implementation of linked lists), and it can happen that a process invokes a syscall to remove one element of the list while that same element is locked because some modification is actually being made to it. If I put a spinlock inside the list entry, what happens if a process manages to remove the entry while someone is spinning on its lock? Should I lock the entire list? I'm looking for a piece of code that shows how to handle this situation.
For example, this code shouldn't work (see the comment on the last line of the code):
struct lista {
    int c;
    spinlock_t lock;
    struct list_head list;
};

spinlock_t list_lock;
struct lista lista;

//INSERT
struct lista *cursor;
struct lista *new = (struct lista *) kmalloc(sizeof(struct lista), GFP_KERNEL);
/* do something */
spin_lock(&list_lock);                  // Lock on the whole list
list_for_each_entry(cursor, &lista.list, list) {
    if (cursor->c == something) {
        ...
        spin_unlock(&list_lock);        // unlock
        spin_lock(&cursor->lock);       // Lock on the list entry
        list_add(&new->list, &lista.list);
        spin_unlock(&cursor->lock);     // unlock of the list entry
        ...
    }
}

//REMOVAL
struct lista *cursor;
spin_lock(&list_lock);
list_for_each_entry(cursor, &lista.list, list) {
    if (cursor->c == something) {
        ...
        spin_unlock(&list_lock);        // unlock
        spin_lock(&cursor->lock);       // Lock on the list entry
        list_del(&cursor->list);
        spin_unlock(&cursor->lock);     // unlock of the list entry
        kfree(cursor); // WHEN THE ENTRY IS FREED SOMEONE COULD HAVE TAKEN THE LOCK SINCE IT IS UNLOCKED
        ...
    }
}
Can you help me?
Don't release list_lock until you're done removing the item.
You might end up with the slightly awkward procedure of:
Acquire list lock (this will block other incoming threads)
Acquire item lock, release item lock (this ensures that all earlier threads are done)
Remove item
Release list lock.
Variation: use a reader-writer lock for the list lock.
Threads looking to modify list items take the reader lock; this allows multiple threads to operate on the list in parallel.
Threads looking to remove list items take the writer lock; this waits for all readers to exit and blocks them until you release it. In this case you still have to hold the list lock until you're done removing the item.
In this way you can avoid step 2 above. This might seem conceptually clearer, as you don't need to explain the pointless-looking lock/release.
You almost certainly shouldn't be using spinlocks at all, unless there's concurrent access from hard IRQ context. Use mutexes instead.
The easiest option for your list is just to lock the entire list while you operate on it. Don't worry about per-item locks unless and until you find that there's sufficient contention on the list lock that you need it (and in that case, you probably want to look at using RCU instead, anyway).
Your list head doesn't need to be a struct lista, it should just be a struct list_head. Notice that you keep using &lista.list, that should just be a list_head called "list" or something. See for example the code in drivers/pci/msi.c, notice that dev->msi_list is just a list_head, not a struct msi_desc.
You can't safely drop the list lock and then grab the cursor lock. It's possible that after you dropped the list lock but before you get the cursor lock someone else came in and free'd your cursor. You can juggle the locks, but that is very easy to get wrong.
You almost definitely just want one lock, for the whole list, and no per-item locks. And the list lock should be a mutex unless you need to manipulate the list from interrupt context.
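For illustration only, a minimal version of that single-mutex approach (struct and function names invented, not from the question) might look like:
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct item {
    int c;
    struct list_head list;
};

static LIST_HEAD(item_list);          /* the head is just a list_head, not a struct item */
static DEFINE_MUTEX(item_list_lock);

static void remove_matching(int something)
{
    struct item *cur, *tmp;

    mutex_lock(&item_list_lock);
    /* _safe variant: we may delete the entry we are standing on */
    list_for_each_entry_safe(cur, tmp, &item_list, list) {
        if (cur->c == something) {
            list_del(&cur->list);
            kfree(cur);
        }
    }
    mutex_unlock(&item_list_lock);
}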
If you are not dealing with devices or with those critical sections of the kernel where a spinlock is a must (since it disables preemption, and interrupts on request), then why use spinlocks, which will unnecessarily disable preemption and interrupts?
Using a semaphore or a mutex, taken on the list rather than on each list item, looks like the better solution.

How to wait/block until a semaphore value reaches 0 in windows

Using the semop() function on unix, it's possible to provide a sembuf struct with sem_op =0. Essentially this means that the calling process will wait/block until the semaphore's value becomes zero. Is there an equivalent way to achieve this in windows?
The specific use case I'm trying to implement is to wait until the number of readers reaches zero before letting a writer write. (yes, this is a somewhat unorthodox way to use semaphores; it's because there is no limit to the number of readers and so there's no set of constrained resources which is what semaphores are typically used to manage)
Documentation on unix semop system call can be found here:
http://codeidol.com/unix/advanced-programming-in-unix/Interprocess-Communication/-15.8.-Semaphores/
Assuming you have one writer thread, just have the writer thread gobble up the semaphore. I.e., grab the semaphore via WaitForSingleObject as many times as the count you initialized the semaphore with.
A Windows semaphore counts down from the maximum value (the maximum number of readers allowed) to zero. WaitXxx functions wait for a non-zero semaphore value and decrement it, ReleaseSemaphore increments the semaphore (allowing other threads waiting on the semaphore to unblock). It is not possible to wait on a Windows semaphore in a different way, so a Windows semaphore is probably the wrong choice of synchronization primitive in your case. On Vista/2008 you could use slim read-write locks; if you need to support earlier versions of Windows you'll have to roll your own.
I've never seen any function similar to that in the Win32 API.
I think the way to do this is to call WaitForSingleObject or similar and get a WAIT_OBJECT_0 the same number of times as the maximum count specified when the semaphore was created. You will then hold all the available "slots" and anyone else waiting on the semaphore will block.
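As a small sketch of that (MAX_READERS is a made-up name; this only works if you can cap the number of readers up front, which the question says is not naturally bounded):
#include <windows.h>

#define MAX_READERS 10   /* the count the semaphore was created with:
                            CreateSemaphore(NULL, MAX_READERS, MAX_READERS, NULL) */

/* Writer: take every slot; this only returns once all readers have released theirs. */
void writer_acquire_all(HANDLE hSem)
{
    int i;
    for (i = 0; i < MAX_READERS; i++)
        WaitForSingleObject(hSem, INFINITE);
}

/* Writer: give all the slots back so readers can run again. */
void writer_release_all(HANDLE hSem)
{
    ReleaseSemaphore(hSem, MAX_READERS, NULL);
}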
The specific use case I'm trying to implement
is to wait until the number of readers reaches
zero before letting a writer write.
Can you guarantee that the reader count will remain at zero until the writer is all done?
If so, you can implement the equivalent of SysV "wait-for-zero" behavior with a manual-reset event object, signaling the completion of the last reader. Maintain your own (synchronized) count of "active readers", decrementing as readers finish, and then signal the patiently waiting writer via SetEvent() when that count is zero.
If you can't guarantee that the readers will be well behaved, well, then you've got an unhappy race to deal with even with SysV sems.
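A rough Win32 sketch of that event-based idea (all names invented, error handling omitted); per the caveat above, it assumes no new reader arrives once the writer has been released:
#include <windows.h>

static volatile LONG reader_count = 0;
static HANDLE no_readers;   /* manual-reset event, created signalled:
                               no_readers = CreateEvent(NULL, TRUE, TRUE, NULL); */

void reader_enter(void)
{
    if (InterlockedIncrement(&reader_count) == 1)
        ResetEvent(no_readers);             /* first reader in: writer must now wait */
    /* ... read ... */
}

void reader_leave(void)
{
    if (InterlockedDecrement(&reader_count) == 0)
        SetEvent(no_readers);               /* last reader out: wake the writer */
}

void writer_write(void)
{
    WaitForSingleObject(no_readers, INFINITE);  /* blocks until the reader count hits zero */
    /* ... write ... */
}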
