As in the title, are read and write operations regarding uint8, atomic?
Logically, it should obviously take a single CPU instruction to read or write an 8-bit variable. But in any case, two cores could simultaneously read and write the same memory; is it possible to end up with stale data this way?
There's no guarantee that accesses to native types are atomic on any platform. This is why sync/atomic exists. See also the advice in the memory model documentation.
Example of a generic way to atomically set a value (Play):
var ax atomic.Value // may be globally accessible

x := uint8(5)
ax.Store(x)           // set atomically
x = ax.Load().(uint8) // read atomically; the type assertion must match the stored type
A probably more efficient solution for uint8 (Play):
var ax int64 // may be globally accessible

x := uint8(5)
atomic.StoreInt64(&ax, 10)       // store atomically
x = uint8(atomic.LoadInt64(&ax)) // load atomically, then convert back to uint8
fmt.Printf("%T %v\n", x, x)      // prints: uint8 10
No. If you want atomic operations, you can use the sync/atomic package.
If you mean "would 8-bit operations be atomic even if I ignore the Go memory model?", then the answer is still: it depends, and probably not.
If the hardware guarantees atomicity of read/write operations, then it might be atomic. But that still doesn't guarantee cache coherence, nor does it stop the compiler from reordering operations. You need to serialize the operations somehow, with the primitives Go provides in the "atomic" package, and use the "sync" package and channels to coordinate between goroutines.
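A minimal sketch of that, assuming you are willing to hold the 8-bit value in a uint32, since sync/atomic has no 8-bit operations:

package main

import (
    "fmt"
    "sync/atomic"
)

func main() {
    var v uint32 // backing storage for the uint8 value

    atomic.StoreUint32(&v, uint32(uint8(5))) // write atomically
    x := uint8(atomic.LoadUint32(&v))        // read atomically, narrow back to uint8

    fmt.Printf("%T %v\n", x, x) // prints: uint8 5
}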
Related
In goroutine A, I do this:

mu.Lock()
n++
mu.Unlock()

and in goroutine B:

fmt.Println(n)

Can I see the changed n in goroutine B immediately? In other words, does a mutex guarantee memory synchronization?
If "immediately" means after the mu.Unlock(), then yes, but keep in mind that before reading the value you should also acquire the lock. For instance, in a multiprocessor (multicore) system, each core has its own cache, and there may be situations where the variable's values in different cores' caches differ.
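A minimal sketch of that discipline (the done channel is only there to make the printed result deterministic):

package main

import (
    "fmt"
    "sync"
)

var (
    mu sync.Mutex
    n  int
)

func main() {
    done := make(chan struct{})

    // goroutine A: write under the lock
    go func() {
        mu.Lock()
        n++
        mu.Unlock()
        close(done)
    }()

    <-done // only to make this sketch deterministic

    // goroutine B's read: take the same lock before reading
    mu.Lock()
    fmt.Println(n) // prints 1
    mu.Unlock()
}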
For this specific situation of a single int variable, using the sync/atomic package is perhaps more beneficial: to update, use atomic.AddUint64(&n, 1), and to read, atomic.LoadUint64(&n).
If the structure of interest is a map, you could use sync.Map, or a plain map guarded by a sync.RWMutex. RWMutex exposes separate locks for updates (m.Lock()) and reads (m.RLock()), so you can have safe parallel reads and, after m.RUnlock(), sequential updates.
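A minimal sketch of an RWMutex-guarded map (SafeMap and its method names are illustrative; it assumes import of "sync"):

type SafeMap struct {
    mu sync.RWMutex
    m  map[string]int
}

func (s *SafeMap) Get(k string) (int, bool) {
    s.mu.RLock() // many readers may hold the read lock at once
    defer s.mu.RUnlock()
    v, ok := s.m[k]
    return v, ok
}

func (s *SafeMap) Set(k string, v int) {
    s.mu.Lock() // writers are exclusive
    defer s.mu.Unlock()
    s.m[k] = v
}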
Say I have two goroutines:
var sequence int64

// writer
for i := sequence; i < max; i++ {
    doSomethingWithSequence(i)
    sequence = i
}

// reader
for {
    doSomeOtherThingWithSequence(sequence)
}
So can I get by without atomic?
Some potential risks I can think of:
1. Reordering (for the writer, updating sequence happens before doSomething) could happen, but I can live with that.
2. sequence is not properly aligned in memory, so the reader might observe a partially updated i. Running on Linux (recent kernel) with x86_64, can we rule that out?
3. The Go compiler "cleverly optimizes" the reader, so the access to i never goes to memory but is cached in a register. Is that possible in Go?

Anything else?
Go's motto: Do not communicate by sharing memory; instead, share memory by communicating. This is an effective best practice most of the time.
Taking your risks in order:

1. If you care about ordering, you care about synchronizing the two goroutines.
2. I don't think those are possible. Anyway, they are not things you should worry about if you properly design the synchronization.
3. The same as above.
Luckily, Go has an integrated data race detector. Try running your example with go run -race; you will probably see the race condition happening on the sequence variable.
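If you then reach for sync/atomic, a race-free sketch of the writer/reader pair might look like this (the loop bounds and sleep are only there to make it a runnable example):

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

func main() {
    var sequence int64
    const max = 1000

    // writer: publish each value with an atomic store
    go func() {
        for i := int64(0); i < max; i++ {
            atomic.StoreInt64(&sequence, i)
        }
    }()

    // reader: observe the value with an atomic load
    for j := 0; j < 5; j++ {
        fmt.Println(atomic.LoadInt64(&sequence))
        time.Sleep(time.Millisecond)
    }
}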
I am porting a lock-free queue from C++11 to Go and I came across things such as
auto currentRead = writeIndex.load(std::memory_order_relaxed);
and in some cases std::memory_order_release and std::memory_order_acquire.
The equivalent of the above in C11 is something like
unsigned long currentRead = atomic_load_explicit(&q->writeIndex,memory_order_relaxed);
The meaning of those is described here.
Is there an equivalent to such a thing in Go, or do I just use something like
var currentRead uint64 = atomic.LoadUint64(&q.writeIndex)
After porting, I benchmarked, and just using LoadUint64 seems to work as expected but is orders of magnitude slower, and I wonder how much effect those specialized ordering operations have on performance.
Further info from the link I attached:
memory_order_relaxed: Relaxed operation: there are no synchronization or ordering constraints, only atomicity is required of this operation.

memory_order_consume: A load operation with this memory order performs a consume operation on the affected memory location: no reads in the current thread dependent on the value currently loaded can be reordered before this load. This ensures that writes to data-dependent variables in other threads that release the same atomic variable are visible in the current thread. On most platforms, this affects compiler optimizations only.

memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no memory accesses in the current thread can be reordered before this load. This ensures that all writes in other threads that release the same atomic variable are visible in the current thread.

memory_order_release: A store operation with this memory order performs the release operation: no memory accesses in the current thread can be reordered after this store. This ensures that all writes in the current thread are visible in other threads that acquire the same atomic variable, and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic.
You need to read The Go Memory Model.
You'll discover that Go has nothing like the control that you have in C++ - there isn't a direct translation of the C++ features in your post. This is a deliberate design decision by the Go authors - the Go motto is Do not communicate by sharing memory; instead, share memory by communicating.
Assuming that the standard Go channel isn't good enough for what you want to do, you'll have two choices for each memory access: use the facilities in sync/atomic or not. Whether you need to use them will depend on a careful reading of the Go Memory Model and an analysis of your code, which only you can do.
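For what it's worth, the Go memory model (as revised for Go 1.19) specifies that sync/atomic operations behave like C++'s sequentially consistent atomics, which is stronger than both memory_order_acquire and memory_order_release and may partly explain the slowdown you measured. So atomic.LoadUint64 really is the closest translation. A hedged sketch of release/acquire-style publication under that model:

package main

import (
    "fmt"
    "sync/atomic"
)

var (
    payload int32  // plain, non-atomic data
    ready   uint32 // publication flag
)

func writer() {
    payload = 42                  // plain write...
    atomic.StoreUint32(&ready, 1) // ...published by the atomic store (release-like)
}

func main() {
    go writer()
    for atomic.LoadUint32(&ready) == 0 { // acquire-like; a busy-wait is fine only for a sketch
    }
    fmt.Println(payload) // guaranteed to print 42
}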
Could someone please mention the flaws and performance drawbacks of the Queue-like implementation below?
type Queue struct {
    sync.Mutex
    Items []interface{}
}

func (q *Queue) Push(item interface{}) {
    q.Lock()
    defer q.Unlock()
    q.Items = append(q.Items, item)
}

func (q *Queue) Pop() interface{} {
    q.Lock()
    defer q.Unlock()
    if len(q.Items) == 0 {
        return nil
    }
    item := q.Items[0]
    q.Items = q.Items[1:]
    return item
}
I also have methods like PopMany and PushMany, and what I am concerned about is: Is too much re-slicing that bad?
You could simply use a buffered channel.
var queue = make(chan interface{}, 100)
The size of the buffer could be determined empirically to be large enough for the high-water mark of the rate of pushes vs. the rate of pops. It should ideally not be much larger than this, to avoid wasting memory.
Indeed, a smaller buffer size will also work, provided the interacting goroutines don't deadlock for other reasons. If you use a smaller buffer size, you are effectively getting queueing via the run-queue of the goroutine time-slice engine, part of the Go runtime. (Quite possibly, a buffer size of zero could work in many circumstances.)
Channels allow many reader goroutines and many writer goroutines. The concurrency of their access is handled automatically by the Go runtime. All writes into the channel are interleaved so as to be a sequential stream. All the reads are also interleaved to extract values sequentially in the same order they were enqueued. Here's further discussion on this topic.
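For illustration, a minimal runnable sketch of pushing and popping with such a channel (the string item is just a placeholder):

package main

import "fmt"

func main() {
    queue := make(chan interface{}, 100)

    // push
    queue <- "job 1"

    // pop without blocking, roughly matching the nil-returning Pop above
    select {
    case item := <-queue:
        fmt.Println(item)
    default:
        fmt.Println("queue is empty")
    }
}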
The re-slicing is not an issue here. It will also make no difference whether you have a thread-safe or unsafe version, as this is pretty much how the re-sizing is meant to be done.
You can alleviate some of the re-sizing overhead by initializing the queue with a capacity:
func NewQueue(capacity int) *Queue {
    return &Queue{
        Items: make([]interface{}, 0, capacity),
    }
}
This will initialize the queue. It can still grow beyond the capacity, but you will not have any unnecessary copying/re-allocation until that capacity is reached.
What may potentially cause problems with many concurrent accesses, is the mutex lock. At some point, you will be spending more time waiting for locks to be released than you are actually doing work. This is a general problem with lock contention and can be solved by implementing the queue as a lock-free data structure.
There are a few third-party packages out there which provide lock free implementations of basic data structures.
Whether this will actually be useful to you can only be determined with some benchmarking. Lock-free structures can have a higher base cost, but they scale much better when you get many concurrent users. There is a cutoff point at which mutex locks become more expensive than the lock-free approach.
I think the best way to approach this is to use a linked list; there is already one available in the standard container/list package.
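A minimal sketch of what that might look like (ListQueue and its method names are illustrative; it assumes imports of "container/list" and "sync"):

type ListQueue struct {
    mu sync.Mutex
    l  *list.List
}

func NewListQueue() *ListQueue { return &ListQueue{l: list.New()} }

func (q *ListQueue) Push(item interface{}) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.l.PushBack(item)
}

func (q *ListQueue) Pop() interface{} {
    q.mu.Lock()
    defer q.mu.Unlock()
    front := q.l.Front()
    if front == nil {
        return nil // empty queue, mirroring the slice version above
    }
    return q.l.Remove(front) // Remove returns the element's Value
}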
The answer marked correct says re-slicing is not an issue. That is not correct; it is an issue. What Dave is suggesting is right: we should set that element to nil, so the underlying array does not keep the popped value reachable.
Read more about slices here: https://go.dev/blog/slices-intro
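In the Pop method above, that fix would look something like:

item := q.Items[0]
q.Items[0] = nil // drop the reference so the popped value can be garbage-collected
q.Items = q.Items[1:]
return item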
I found three functions on MSDN, listed below:

1. InterlockedDecrement()
2. InterlockedDecrementAcquire()
3. InterlockedDecrementRelease()

I know those functions are used to decrement a value as an atomic operation, but I don't know the distinction between the three functions.

(um... but don't ask me what it means exactly)
I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear not to depend on each other.
This is useful, for instance, if you have some code that is, say, updating a structure:
if ( playerMoved ) {
    playerPos.X += dx;
    playerPos.Y += dy;

    // Keep the player above the world's surface.
    if ( playerPos.Z + dz > 0 ) {
        playerPos.Z += dz;
    }
    else {
        playerPos.Z = 0;
    }
}
Most of the above statements may be reordered because there's no data dependency between them; in fact, a superscalar CPU may execute most of those statements simultaneously, or may start working on the Z section sooner, since it doesn't affect X or Y but might take longer.
Here's the problem with that: let's say you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, say, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, and now you may set your 'done' flag before you've actually committed the rest of the structure to memory, and now your "lock-free" queue doesn't work.
This is where Acquire and Release ordering semantics come into play. I signal that I'm starting work by setting a flag with an Acquire semantic, and the CPU guarantees that any memory games I play after that instruction actually stay after that instruction. I signal that I'm done by setting a flag with a Release semantic, and the CPU guarantees that any memory games I played just before the release actually stay before the release.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier: everything that came before this instruction has to actually be committed before it, and everything that came after it has to actually be committed after it. But that might mean I wait for a bunch of irrelevant memory writes from earlier in the instruction stream (perhaps the function-call prologue) that have nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining those both is a full memory barrier.
http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.

Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.
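For completeness, back in Go terms: sync/atomic exposes none of these ordering variants; its operations are sequentially consistent, making atomic.AddInt32 the closest analogue of the plain, full-barrier InterlockedDecrement. A minimal sketch of the classic reference-count use:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var refs int32 = 3 // three hypothetical owners of a shared resource
    var wg sync.WaitGroup

    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Atomically decrement; exactly one goroutine sees the count hit zero.
            if atomic.AddInt32(&refs, -1) == 0 {
                fmt.Println("last reference dropped; safe to free")
            }
        }()
    }
    wg.Wait()
}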