Say I have two go routines:
var sequence int64
// writer
for i := sequence; i < max; i++ {
doSomethingWithSequence(i)
sequence = i
}
// reader
for {
doSomeOtherThingWithSequence(sequence)
}
So can I get by without atomic?
Some potential risks I can think of:
reorder (for the writer, updating sequence happens before doSomething) could happen, but I can live with that.
sequence is not properly aligned in memory so the reader might observe a partially updated i. Running on (recent kernel) linux with x86_64,
can we rule that out?
go compiler 'cleverly optimizes' the reader, so the access to i never goes to memory but cached in a register. Is that possible in go?
Anything else?
Go's motto: Do not communicate by sharing memory; instead, share memory by communicating. Which is an effective best-practice most of the time.
If you care about ordering, you care about synchronizing the two goroutines.
I don't think they are possible. Anyway, those are not things you should worry about if you properly design the synchronization.
The same as above.
Luckily, Go has a data race detector integrated. Try to run your example with go run -race. You will probably see the race condition happening on sequence variable.
Related
I have a big and deeply nested shared struct. Each goroutine of my program may use a different part of the struct, a slice, a map, etc. To make things worse, all these goroutines do long operations, which means it may not be a good idea to use a big lock for that shared struct.
Therefore, I have come up with an idea which is to lock the struct before accessing one part of the struct and then encode it, as soon as the encoding is done the goroutine can release the lock and decodes the data. In this way, one single goroutine won't hold the lock for a long time.
The question is: I'm not sure if this is a good practice, is there a better way to solve this kind of problem? Or is there any better ideology to such kinds of problems?
You may use following techniques instead of sync.Mutex:
sync.RWMutex - read-write lock allows to read the structure by multiple goroutine at the same time and only when write lock is not acquired. If your problem is that multiple goroutines cannot code different parts of struct in parallel, this may be the best choice.
You can granulate locks as you mentioned in the question, but you must be careful to properly acquire all involved locks to preserve consistency when you modify different parts of the struct.
You can take a snapshot of the interesting part of struct under lock and outside lock do some heavy read operation. This will of course put extra pressure on the GC, but it may be worth it.
You could also mix them to achieve better results.
I'm currently writing an encoder and (obviously) want to make it fast.
I have a working system for doing the encoding (so every goroutine is doing the same thing) but I am struggling with finding the right amount of goroutines to run the code in. I basically want to decide on a maximum amount of Goroutines that keeps the CPU busy.
The following thoughts crossed my mind:
If a file is only <1 kB it's not useful to run the code in a lot of goroutines
The amount of goroutines should be influenced by the cores/threads available
running 16 goroutines on a 4x4 GHz CPU will not be a problem but what with a 4x1 GHz CPU?
hard to determine reliably cross-platform
The CPU should be busy but not as busy as to keep other programs from responding (~70-ish %?)
hard to decide beforehand due to clockspeed and other parameters
Now I've tried to decide based on these factors on how many goroutines to use but I'm not quite sure how to do so cross-platform and in a reliable way.
Attempts already made:
using a linear function to determine based on filesize
requires different functions based on CPU
parsing CPU-specs from lscpu
not cross-platform
requires another function to determine based on frequency
Which have not been satisfactory.
You mention in a comment that
every goroutine is reading the file that is to be encoded
But of course the file—any file—is already encoded in some way: as plain text, perhaps, or UTF-8 (stream of bytes), perhaps assembled into units of "lines". Or it might be an image stream, such an an mpeg file, consisting of some number of frames. Or it might be a database, consisting of records. Whatever its input form is, it contains some sort of basic unit that you could feed to your (re-)encoder.
That unit, whatever it may be, is a sensible place to divide work. (How sensible, depends on what it is. See the idea of chunking below.)
Let's say the file consists of independent lines: then use scanner.Scan to read them, and pass each line to a channel that takes lines. Spin off N, for some N, readers that read the channel, one line at a time:
ch := make(chan string)
for i := 0; i < n; i++ {
go readAndEncode(ch)
}
// later, or immediately:
for s := bufio.NewScanner(os.Stdin); s.Scan(); {
ch <- s.Text()
}
close(ch)
If there are 100 lines, and 4 readers, the first four ch <- s.Text() operations go fast, and the fifth one pauses until one of the readers is done encoding and goes back to reading the channel.
If individual lines are too small a unit, perhaps you should read a "chunk" (e.g., 1 MB) at a time. If the chunk has a partial line at the end, back up, or read more, until you have a whole line. Then send the entire data chunk.
Because channels copy the data, you may wish to send a reference to the chunk instead.1 This would be true of any larger data unit. (Lines tend to be short, and the overhead of copying them is generally not very large compared to the overhead of using channels in the first place. If your lines have type string, well, see the footnote.)
If line, or chunk-of-lines, are not the correct unit of work here, figure out what is. Think of goroutines as people (or busy little gophers) who each get one job to do. They can depend on someone else—another person or gopher—to do a smaller job, whatever that might be; and having ten people, or gophers, working on sub-tasks allows a supervisor to manage them. If you need to do the same job N times, and N is not unbounded, you can spin off N goroutines. If N is potentially unbounded, spin off a fixed number (maybe based on #cpus) and feed them work through a channel.
1As Burak Serdar notes, some copies can be elided automatically: e.g., strings are in effect read-only slices. Slice types have three parts: a pointer (reference) to the underlying data, a length, and a capacity. Copying a slice copies these three parts, but not the underlying data. The same goes for strings: string headers omit the capacity, so sending a string through a channel copies only the two header words. Hence many of the obvious and easy-to-code ways of chunking data will already be pretty efficient.
As in the title, are read and write operations regarding uint8, atomic?
Logically it must be a single cpu instruction obviously to read and write for a 8 bit variable. But in any case, two cores could simultaneously read and write from the memory, is it possible to create a stale data this way?
There's no guarantee that the access on native types are on any platform atomic. This is why there is sync/atomic. See also the advice in the memory model documentation.
Example for generic way of atomically setting a value (Play)
var ax atomic.Value // may be globally accessible
x := uint8(5)
// set atomically
ax.Store(x)
x = ax.Load().(uint8)
Probably more efficient solution for uint8 (Play):
var ax int64 // may be globally accessible
x := uint8(5)
atomic.StoreInt64(&ax, 10)
x = uint8(atomic.LoadInt64(&ax))
fmt.Printf("%T %v\n", x, x)
No. If you want atomic operations, you can use the sync/atomic package.
If you mean "would 8bit operations be atomic even if I ignore the Go memory model?", then the answer is still, it depends probably not.
If the hardware guarantees atomicity of read/write operations, then it might be atomic. But that still doesn't guarantee cache coherence, or compiler optimizations from reordering operations. You need to serialize the operations somehow, with the primitives Go provides in the "atomic" package, and using the "sync" package and channels to coordinate between goroutines.
Could someone please mention the flaws and performance drawbacks in the Queue like implementation?
type Queue struct {
sync.Mutex
Items []interface{}
}
func (q *Queue) Push(item interface{}) {
q.Lock()
defer q.Unlock()
q.Items = append(q.Items, item)
}
func (q *Queue) Pop() interface{} {
q.Lock()
defer q.Unlock()
if len(q.Items) == 0 {
return nil
}
item := q.Items[0]
q.Items = q.Items[1:]
return item
}
I also have methods like PopMany and PushMany, and what I am concerned about is: Is too much re-slicing that bad?
You could simply use a buffered channel.
var queue = make(chan interface{}, 100)
The size of the buffer could to be determined empirically to be large enough for the high-water mark for the rate of pushes vs rate of pops. It should ideally not be much larger than this, to avoid wasting memory.
Indeed, a smaller buffer size will also work, provided the interacting goroutines don't deadlock for other reasons. If you use a smaller buffer size, you are effectively getting queueing via the run-queue of the goroutine time-slice engine, part of the Go runtime. (Quite possible, a buffer size of zero could work in many circumstances.)
Channels allow many reader goroutines and many writer goroutines. The concurrency of their access is handled automatically by the Go runtime. All writes into the channel are interleaved so as to be a sequential stream. All the reads are also interleaved to extract values sequentially in the same order they were enqueued. Here's further discussion on this topic.
The re-slicing is not an issue here. It will also make no difference whether you have a thread-safe or unsafe version as this is pretty much how the re-sizing is meant to be done.
You can alleviate some of the re-sizing overhead by initializing the queue with a capacity:
func NewQueue(capacity int) *Queue {
return &Queue {
Items: make([]interface{}, 0, capacity),
}
}
This will initialize the queue. It can still grow beyond the capacity, but you will not be having any unnecessary copying/re-allocation until that capacity is reached.
What may potentially cause problems with many concurrent accesses, is the mutex lock. At some point, you will be spending more time waiting for locks to be released than you are actually doing work. This is a general problem with lock contention and can be solved by implementing the queue as a lock-free data structure.
There are a few third-party packages out there which provide lock free implementations of basic data structures.
Whether this will actually be useful to you can only be determined with some benchmarking. Lock-free structures can have a higher base cost, but they scale much better when you get many concurrent users. There is a cutoff point at which mutex locks become more expensive than the lock-free approach.
I think the best way to approach this is to use a linked list, there is already one available for you in standard package here
The answer marked correct says re-slicing is not an issue. That is not correct, it is an issue. What Dave is suggesting is right, we should mark that element as nil.
Refer more about slices here: https://go.dev/blog/slices-intro
I found out three function from MSDN , below:
1.InterlockedDecrement().
2.InterlockedDecrementAcquire().
3.InterlockedDecrementRelease().
I knew those fucntion use to decrement a value as an atomic operation, but i don't know distinction between the three function
(um... but don't ask me what does it mean exactly)
I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear to not deal with each other.
This is useful, for instance, if you have some code that, maybe is updating a structure:
if ( playerMoved ) {
playerPos.X += dx;
playerPos.Y += dy;
// Keep the player above the world's surface.
if ( playerPos.Z + dz > 0 ) {
playerPos.Z += dz;
}
else {
playerPos.Z = 0;
}
}
Most of above statements may be reordered because there's no data dependency between them, in fact, a superscalar CPU may execute most of those statements simultaneously, or maybe would start working on the Z section sooner, since it doesn't affect X or Y, but might take longer.
Here's the problem with that - lets say that you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, maybe, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, and now you may set your 'done' flag before you've actually committed the rest of the structure to memory, and now your "lock-free" queue doesn't work.
This is where Acquire and Release ordering semantics come into play. I set that I'm doing work by setting a flag or so with an Acquire semantic, and the CPU guarantees that any memory games I play after that instruction stay actually below that instruction. I set that I'm done by setting a flag or so with a Release semantic, and the CPU guarantees that any memory games I had done just before the release actually stay before the release.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier - everything that came before this instruction has to actually be committed before this instruction, and everything that came after this instruction has to be committed actually after this instruction. But that might mean that I wait for a bunch of actually irrelevant memory writes from earlier in the instruction stream (perhaps function call prologue) that has nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining those both is a full memory barrier.
http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations
which read from shared memory, whether they are read-modify-write
operations or plain loads. The operation is then considered a
read-acquire. Acquire semantics prevent memory reordering of the
read-acquire with any read or write operation which follows it in
program order.
Release semantics is a property which can only apply to operations
which write to shared memory, whether they are read-modify-write
operations or plain stores. The operation is then considered a
write-release. Release semantics prevent memory reordering of the
write-release with any read or write operation which precedes it in
program order.
(um... but don't ask me what does it mean exactly)