A circular queue looks obviously better because it reuses the empty space left behind by popped elements. It also saves the time that would otherwise be spent shifting the remaining elements after each pop.
But is there any use case where a plain queue would be preferred over a circular queue?
Definition of queue: we will go with the linear array implementation. Follows FIFO; no overwrites.
Definition of circular queue: the ring buffer implementation. Follows FIFO; no overwrites.
Note: In many languages a queue is just an interface and doesn't say anything about the implementation.
When using an array-based circular queue, a.k.a. a ring buffer, you must handle the situation where you push to a full buffer. You could:
Ignore the insertion
Overwrite the oldest entry
Block until there's space again
(Re)Allocate memory and copy all the content
Use an oversized buffer so this situation never happens
Each of these options has downsides. If you can live with them, or you know you will never fill the buffer, then a ring buffer is the way to go (a minimal sketch follows at the end of this answer).
Options 3 and 4 will induce stuttering. Depending on your use case, you might prefer longer but stable access times and reliability over occasional spikes, and therefore opt for a linked list or some other dynamic implementation, like a deque, instead.
Example use cases are tasks where you have to achieve a stable frame/sampling rate or throughput and can't tolerate stutters, like:
Realtime video and audio processing
Realtime rendering
Networking
Thread pools when you don't want the threads to block for too long when pushing new jobs.
However, a queue based on a linear array will suffer from the same downsides. I don't see a reason for choosing a linear queue over a circular queue (besides the circular queue's slightly higher implementation complexity).
std::queue in C++ uses a deque as the underlying container by default. A deque is essentially a dynamic array of arrays, which seems like a good base for most use cases because it allocates memory in small chunks and hence induces less stuttering.
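For illustration, here is a minimal sketch of option 1 above (ignore the insertion when the buffer is full): a fixed-capacity, single-threaded circular queue. The class and member names are placeholders rather than any particular library's API.

    // Minimal circular queue sketch: fixed capacity, rejects pushes when full,
    // pops from the front without shifting the remaining elements.
    #include <array>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t Capacity>
    class CircularQueue {
    public:
        bool push(const T& value) {
            if (size_ == Capacity) return false;         // full: ignore the insertion (option 1)
            buffer_[(head_ + size_) % Capacity] = value; // write just past the last element
            ++size_;
            return true;
        }
        std::optional<T> pop() {
            if (size_ == 0) return std::nullopt;         // empty
            T value = buffer_[head_];
            head_ = (head_ + 1) % Capacity;              // advance; no lateral shift needed
            --size_;
            return value;
        }
    private:
        std::array<T, Capacity> buffer_{};
        std::size_t head_ = 0;                           // index of the oldest element
        std::size_t size_ = 0;                           // number of stored elements
    };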
With DirectX/C++, suppose you are drawing the same model many times to the screen. You can do this with DrawIndexedInstanced(). You need to set the size of the instance buffer when you create it:
D3D11_BUFFER_DESC_instance.ByteWidth = sizeof(struct_with_instance_data) * instance_count;
If instance_count can vary between a low and a high value, is it customary to create the buffer with the maximum value (max_instance_count) and only draw what is required?
Wouldn't that permanently use a lot of memory?
Is recreating the buffer a slow solution?
What are good methods?
Thank you.
All methods have pros and cons.
Create with the max count — as you pointed out, you'll consume extra memory.
Create with some initial count and implement exponential growth — you can run out of memory at runtime, and growing large buffers may cause spikes in the profiler.
There's another way, which may or may not work for you depending on the application. You can create a reasonably large fixed-size buffer and, to render more instances, call DrawIndexedInstanced multiple times in a loop, replacing the data in the buffer between calls. This works well if the source data is generated at runtime from something else; you'll need to rework that part to produce fixed-size batches (except the last one) instead of the complete buffer. It won't work if the data in the buffer needs to persist across frames, e.g. if you update it with a compute shader.
Why do we need a deque for work-stealing? (e.g. in Cilk) The owner works on the top and the thief steals from the bottom. Why is it useful?
We might have multiple thieves stealing from the bottom. So, don't we need a lock anyway?
I have read somewhere that larger jobs (for example, ones created in a tree) are added to the bottom, so stealing from the bottom is more efficient (less communication, as the thieves stay busier after stealing them). Is that it?
Work stealing actually does need a deque. In the original paper, the authors prove a bound on the memory used on a system with P processors: the limit is given by the maximum size of any stack times the number of processors. That bound is only achievable by following the busy-leaves theorem. Another important feature of work stealing is that:
When a worker does a spawn, it immediately saves the spawner on the deque and starts working on the child. For more information regarding the proofs, please read the original paper, which explains everything I am describing here: http://supertech.csail.mit.edu/papers/steal.pdf
Concurrency control for the deque accesses is not tied to the work-stealing scheduler itself; in fact, much research has gone into removing locks from the deque (by using lock-free structures) and into minimizing memory barriers as much as possible. For example, in this paper (which, sorry, you may not be able to access, but the abstract gives the idea), the authors create a new deque that improves the aforementioned aspects: http://dl.acm.org/citation.cfm?id=1073974
The steals are made from the side the worker is not working on, for several possible reasons:
Since the deque acts as a stack for each worker (the owner of the deque), the "bigger" jobs should be on top of it (as you can see by reading the paper). By "bigger" I mean the ones that will probably have more computation to do. Another important aspect is that stealing from the side opposite to the one the owner works on reduces contention, since in some newer deques a victim and a thief may be working on the same deque at the same time.
The details of the THE protocol are described in section 5 of "The Implementation of the Cilk-5 Multithreaded Language" which is available from MIT: http://supertech.csail.mit.edu/papers/cilk5.pdf
You do not need a deque for work-stealing. It is possible (and people have done it) to use a concurrent data structure to store the pool of tasks. But the problem is that push/pop operations from workers and steal requests from thieves all have to be synchronized.
Since steals are expected to be relatively rare events, it is possible to design a data structure such that synchronization is performed mainly during steal attempts, and even then only when a conflict in popping an item from the data structure is likely. This is exactly why deques were used in Cilk: to minimize synchronization. Workers treat their own deques as a stack, pushing and popping threads from the bottom, but treat the deque of another busy worker as a queue, stealing threads only from the top, whenever they have no local threads to execute. Since steal operations are synchronized, it is okay for multiple thieves to attempt to steal from the same victim.
Larger jobs being added to the bottom is common in divide-and-conquer style algorithms, but not in all of them. There is a wide variety of strategies for what to do during stealing: steal one task, a few tasks, half the tasks, and so on. Each of these variants works well for some applications and not so well in others.
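To make the access pattern concrete, here is a deliberately simplified sketch of such a deque. It takes a mutex on every operation for clarity; real schedulers (Cilk's THE protocol, the Chase-Lev deque) are engineered so the owner's common push/pop path needs no lock at all.

    // Owner pushes/pops at the bottom (LIFO), thieves steal from the top (FIFO).
    #include <deque>
    #include <mutex>
    #include <optional>

    template <typename Task>
    class WorkStealingDeque {
    public:
        void push_bottom(Task t) {               // owner thread only
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push_back(std::move(t));
        }
        std::optional<Task> pop_bottom() {       // owner thread only (stack order)
            std::lock_guard<std::mutex> lock(m_);
            if (tasks_.empty()) return std::nullopt;
            Task t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<Task> steal_top() {        // any thief (queue order, oldest task)
            std::lock_guard<std::mutex> lock(m_);
            if (tasks_.empty()) return std::nullopt;
            Task t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    private:
        std::mutex m_;
        std::deque<Task> tasks_;
    };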
I'm using I/O Completion Ports on Windows. I have an object called 'Stream' that wraps and abstracts a HANDLE (so it can be a socket, a file, and so on).
When I call Stream::read() or Stream::write() (so, ReadFile()/WriteFile() in the case of files, and WSARecv()/WSASend() in the case of sockets), I allocate a new OVERLAPPED structure in order to make a pending I/O request that will be completed in the IOCP loop by some other thread.
Then, when the OVERLAPPED structure is completed by the IOCP loop, it is destroyed there. If Stream::read() or Stream::write() are then called again from the IOCP loop, they allocate new OVERLAPPED structures, and so it goes on forever.
This works just fine. But now I want to improve this by adding caching of OVERLAPPED objects:
when my Stream object does a lot of reads or writes, it absolutely makes sense to cache the OVERLAPPED structures.
But now a problem arises: when I deallocate a Stream object, I must deallocate the cached OVERLAPPED structures, but how can I know whether they have already been completed or are still pending and will be completed later by one of the IOCP loops? So an atomic reference count is needed here. But then the problem is that I have to increment that ref counter for each read or write operation, and decrement it on each IOCP completion of the OVERLAPPED structure or on Stream deletion, and on a server those are a lot of operations, so I'll end up incrementing/decrementing a lot of atomic counters a lot of times.
Will this impact the concurrency of multiple threads very negatively? That is the only concern blocking me from adding an atomic reference counter to each OVERLAPPED structure.
Are my concerns baseless?
I thought this was an important topic to point out, and that a question on SO to gather other people's thoughts on this, and on methods for caching OVERLAPPED structures with IOCP, would be worth it.
I hope to find a clever solution to this, without using atomic ref counters, if that is possible.
Assuming that you bundle a data buffer with the OVERLAPPED structure as a 'per operation' data object then pooling them to avoid excessive allocation/deallocation and heap fragmentation is a good idea.
If you only ever use this object for I/O operations then there's no need for a ref count: simply pull one from the pool, do your WSASend/WSARecv with it, and then release it to the pool once you're done with it in the IOCP completion handler.
If, however, you want to get a bit more complicated and allow these buffers to be passed out to other code then you may want to consider ref counting them if that makes it easier. I do this in my current framework and it allows me to have generic code for the networking side of things and then pass data buffers from read completions out to customer code and they can do what they want with them and when they're done they release them back to the pool. This currently uses a ref count but I'm moving away from that as a minor performance tweak. The ref count is still there but in most situations it only ever goes from 0 -> 1 and then to 0 again, rather than being manipulated at various layers within my framework (this is done by passing the ownership of the buffer out to the user code using a smart pointer).
In most situations I expect that a ref count is unlikely to be your most expensive operation (even on NUMA hardware in situations where your buffers are being used from multiple nodes). More likely the locking involved in putting these things back into a pool will be your bottleneck; I've solved that one so am moving on to the next higher fruit ;)
You also talk about your 'per connection' object and caching your 'per operation' data locally there (which is what I do before pushing them back to the allocator). Whilst ref counts aren't strictly required for the 'per operation' data, the 'per connection' data needs, at least, an atomically modifiable 'num operations in progress' count so that you can tell when you can free IT up. Again, due to my framework design, this has become a normal ref count for which customer code can hold refs as well as active I/O operations. I've yet to work out a way around the need for this counter in a general-purpose framework.
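As a hedged sketch of the 'per operation' pooling idea from the first paragraph of this answer — IoOperation and OperationPool are made-up names, and a production pool would likely use a lock-free stack or per-thread caches rather than a mutex:

    // A pooled 'per operation' object: an OVERLAPPED plus its data buffer.
    #include <winsock2.h>
    #include <windows.h>
    #include <mutex>
    #include <vector>

    struct IoOperation {
        OVERLAPPED overlapped;   // first member, so the OVERLAPPED* handed back by
                                 // GetQueuedCompletionStatus can be cast back to IoOperation*
        WSABUF     wsaBuf;
        char       data[4096];
    };

    class OperationPool {
    public:
        ~OperationPool() { for (IoOperation* op : free_) delete op; }

        IoOperation* acquire() {
            std::lock_guard<std::mutex> lock(m_);
            if (free_.empty()) return new IoOperation();  // value-initialized (zeroed)
            IoOperation* op = free_.back();
            free_.pop_back();
            return op;
        }
        void release(IoOperation* op) {
            ZeroMemory(&op->overlapped, sizeof(op->overlapped)); // reset before reuse
            std::lock_guard<std::mutex> lock(m_);
            free_.push_back(op);
        }
    private:
        std::mutex m_;
        std::vector<IoOperation*> free_;
    };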
It seems true, but my thinking about it has been muddled.
Can someone give a clear explanation, and some crucial cases in which it always works without locking? Thanks!
The real trick behind the single-producer, single-consumer circular queue is that the head and tail pointers are modified atomically. This means that if a position in memory is changed from value A to value B, an observer (i.e. a reader) that reads the memory while its value is being changed will get either A or B as a result, nothing else.
So your queue will not work if, for example, you are using 16-bit pointers but you are changing them in two 8-bit steps (this may happen depending on your CPU architecture and memory alignment requirements). The reader in this case may read an entirely wrong transient value.
So make sure that your pointers are modified atomically in your platform!
This surely depends on the implementation of the cyclic queue. However, if it is as I imagine it, you have two indices: the head and the tail of the queue. The producer works with the tail and the consumer works with the head. They share the message array, but use two different pointers.
The only case in which the producer and the consumer might run into conflict is the one in which, say, the consumer checks for a new message and it arrives just after the check. In such a case the consumer will simply wait a bit and check once more; the correctness of the program is not affected.
The reason it works with a single producer and a single consumer is mainly that the two users do not share much memory. With multiple producers, for example, more than one thread would be accessing the same index and conflicts might arise.
EDIT: as dasblinkenlight mentions in his comment, my reasoning holds true only if both threads increment/decrement their respective counters as the last operation of their consuming/producing.
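Tying the two answers together, here is a minimal sketch of a single-producer/single-consumer ring buffer in C++, where each index is written by exactly one thread and published atomically to the other. The class name and the power-of-two capacity are illustrative choices, not a reference implementation.

    // head_ is written only by the consumer, tail_ only by the producer; the
    // release/acquire pairs make the buffer contents visible before the index.
    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>   // N must be a power of two
    class SpscRing {
    public:
        bool push(const T& value) {        // producer thread only
            const std::size_t tail = tail_.load(std::memory_order_relaxed);
            if (tail - head_.load(std::memory_order_acquire) == N)
                return false;              // full
            buffer_[tail & (N - 1)] = value;
            tail_.store(tail + 1, std::memory_order_release);  // publish as the last step
            return true;
        }
        std::optional<T> pop() {           // consumer thread only
            const std::size_t head = head_.load(std::memory_order_relaxed);
            if (head == tail_.load(std::memory_order_acquire))
                return std::nullopt;       // empty
            T value = buffer_[head & (N - 1)];
            head_.store(head + 1, std::memory_order_release);  // free the slot as the last step
            return value;
        }
    private:
        std::array<T, N> buffer_{};
        std::atomic<std::size_t> head_{0}; // consumer-owned index
        std::atomic<std::size_t> tail_{0}; // producer-owned index
    };

Note that each side updates its own index only as the very last step of a push or pop, which is exactly the condition the EDIT above points out.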
I've been trying to get an understanding of concurrency, and I've been trying to work out what's better: one big IORef lock or many TVars. I've come to the following guidelines. Comments would be appreciated regarding whether these are roughly right or whether I've missed the point.
Let's assume our concurrent data structure is a map m, accessed like m[i]. Let's also say we have two functions, f_easy and f_hard. f_easy is quick, f_hard takes a long time. We'll assume the arguments to f_easy/f_hard are elements of m.
(1) If your transactions look roughly like this m[f_easy(...)] = f_hard(...), use an IORef with atomicModifyIORef. Laziness will ensure that m is only locked for a short time as it's updated with a thunk. Calculating the index effectively locks the structure (as something is going to get updated, but we don't know what yet), but once it's known what that element is, the thunk over the entire structure moves to a thunk only over that particular element, and then only that particular element is "locked".
(2) If your transactions look roughly like this m[f_hard(...)] = f_easy(...), and they don't conflict too much, use lots of TVars. Using an IORef in this case will effectively make the app single-threaded, as you can't calculate two indexes at the same time (there will be an unresolved thunk over the entire structure). TVars let you work out two indexes at the same time; however, the downside is that if two concurrent transactions both access the same element, and one of them is a write, one transaction must be scrapped, which wastes time (which could have been used elsewhere). If this happens a lot, you may be better off with the locks that come (via blackholing) from IORef, but if it doesn't happen very much, you'll get better parallelism with TVars.
Basically in case (2), with IORef you may get 100% efficiency (no wasted work) but only use 1.1 threads, but with TVar if you have a low number of conflicts you might get 80% efficiency but use 10 threads, so you still end up 7 times faster even with the wasted work.
Your guidelines are somewhat similar to the findings of [1] (Section 6) where the performance of the Haskell STM is analyzed:
"In particular, for programs that do not perform much work inside transactions, the commit overhead appears to be very high. To further observe this overhead, an analysis needs to be conducted on the performance of commit-time course-grain and fine-grain STM locking mechanisms."
I use atomicModifyIORef or an MVar when all the synchronization I need is something that simple locking will ensure. When looking at concurrent accesses to a data structure, it also depends on how that data structure is implemented. For example, if you store your data inside an IORef Data.Map and frequently perform read/write access, then I think atomicModifyIORef will degrade to single-threaded performance, as you have conjectured, but the same will be true for a TVar Data.Map. My point is that it's important to use a data structure that is suitable for concurrent programming (balanced trees aren't).
That said, in my opinion the winning argument for using STM is composability: you can combine multiple operations into a single transactions without headaches. In general, this isn't possible using IORef or MVar without introducing new locks.
[1] The limits of software transactional memory (STM): dissecting Haskell STM applications on a many-core environment.
http://dx.doi.org/10.1145/1366230.1366241
Answer to @Clinton's comment:
If a single IORef contains all your data, you can simply use atomicModifyIORef for composition. But if you need to process lots of parallel read/write requests to that data, the performance loss might become significant, since every pair of parallel read/write requests to that data might cause a conflict.
The approach that I would try is to use a data structure where the entries themselves are stored inside a TVar (vs putting the whole data structure into a single TVar). That should reduce the possibility of livelocks, as transactions won't conflict that often.
Of course, you still want to keep your transactions as small as possible and use composability only if it's absolutely necessary to guarantee consistency. So far I haven't encountered a scenario where combining more than a few insert/lookup operations into a single transaction was necessary.
Beyond performance, I see a more fundamental reason for using TVar: the type system ensures you don't do any "unsafe" operations like readIORef or writeIORef. That your data is shared is a property of the type, not of the implementation. EDIT: unsafePerformIO is always unsafe. readIORef is only unsafe if you are also using atomicModifyIORef. At the very least, wrap your IORef in a newtype and only expose a wrapped atomicModifyIORef.
Beyond that, don't use IORef; use MVar or TVar:
The first usage pattern you describe probably does not have nice performance characteristics. You will likely end up being (almost) entirely single-threaded: because of laziness, no actual work happens each time you update the shared state, but whenever you need to use that shared state, the entire accumulated pile of thunks needs to be forced, and it has a linear data dependency structure.
Having 80% efficiency but substantially higher parallelism allows you to exploit a growing number of cores. You can expect minimal performance improvements over the coming years on single-threaded code.
Multi-word CAS is likely coming to a processor near you in the form of "Hardware Transactional Memory", allowing STMs to become far more efficient.
Your code will be more modular: when your design has all shared state behind a single reference, every piece of code has to be changed whenever you add more shared state. TVars, and to a lesser extent MVars, support natural modularity.