Go: Seek+Write vs WriteAt performance

I've just started to study Go's filesystem operations. It seems like there are at least two ways to perform random file writes:
// 1. First set the offset, then write data
f.Seek(offset, whence)
f.Write(data)
// 2. Write by offset in one step
f.WriteAt(data, offset)
All three functions (Seek, Write, WriteAt) are implemented on top of different syscalls: on Unix systems, Write is implemented via syscall.Write, while WriteAt uses syscall.Pwrite.
Since Seek+Write performs two syscalls whereas WriteAt requires only one, should the second method be preferred for the sake of better performance?

seek()+read() and seek()+write() are each a pair of syscalls, while pread() and pwrite() are single syscalls. Fewer syscalls means less overhead.
You can definitely go for WriteAt.
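For illustration, here is a minimal sketch of both approaches with error handling; the file name, offset, and data are placeholders. Note that WriteAt does not move the file's offset, and it returns an error if the file was opened with O_APPEND.
package main

import (
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.OpenFile("testfile", os.O_WRONLY|os.O_CREATE, 0o644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    data := []byte("hello")
    const offset = 1024

    // 1. Two syscalls: lseek(2) followed by write(2).
    if _, err := f.Seek(offset, io.SeekStart); err != nil {
        log.Fatal(err)
    }
    if _, err := f.Write(data); err != nil {
        log.Fatal(err)
    }

    // 2. One syscall: pwrite(2). The file offset is left untouched.
    if _, err := f.WriteAt(data, offset); err != nil {
        log.Fatal(err)
    }
}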

Related

Use big.Rat with Go to get Abs() value

I am a beginner with Go and a Java developer.
I am currently working with big.Rat.
I need to get the Abs of a Rat n for which I have to write something like
n.Abs(n) or something like big.Rat{}.Abs(n)
Why didn't Go provide something like just n.Abs()?
Or am I going wrong somewhere?
Go's big package is concerned with memory allocation when it comes to its function signatures. A big.Rat consists of two big.Ints which each contain an array of uints. Unlike an int (native 32 or 64 bit integer), a big.Int must thus be allocated dynamically, depending on its value. For large values this means more elements in the array.
Your proposed function signature n.Abs() would mean that a new array of the same size as n's would have to be allocated for this operation. In reality we often have the case that the original n is no longer needed, thus we can reuse its existing memory. To allow this, the Abs function takes a pointer to an existing big.Rat which might be n itself. The implementation can now reuse the memory. The caller is now in full control of what memory to use for these operations.
This might not make for the nicest API in all use cases. If you just want to do a quick calculation on a few large numbers, on a computer with gigabytes of RAM, you might prefer the n.Abs() version. But if you do numerically expensive computations with a lot of large numbers, you must be able to control your memory. Imagine doing some image manipulation on a Raspberry Pi, for example, where you are more constrained by the available memory. In that case the existing API allows you to be more efficient.
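A minimal sketch of how that destination-argument pattern reads in practice (the values are arbitrary):
package main

import (
    "fmt"
    "math/big"
)

func main() {
    n := big.NewRat(-3, 4)

    // In place: reuse n's own storage for the result.
    n.Abs(n)
    fmt.Println(n) // 3/4

    // Into a separate destination, leaving the operand untouched.
    m := big.NewRat(-5, 7)
    r := new(big.Rat).Abs(m)
    fmt.Println(m, r) // -5/7 5/7
}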

Is it safe to read a function pointer concurrently without a lock?

Suppose I have this:
go func() {
    for range time.Tick(1 * time.Millisecond) {
        a, b = b, a
    }
}()
And elsewhere:
i := a // <-- Is this safe?
For this question, it's unimportant what the value of i is with respect to the original a or b. The only question is whether reading a is safe. That is, is it possible for a to be nil, partially assigned, invalid, undefined, ... anything other than a valid value?
I've tried to make it fail but so far it always succeeds (on my Mac).
I haven't been able to find anything specific beyond this quote in the Go Memory Model doc:
Reads and writes of values larger than a single machine word behave as
multiple machine-word-sized operations in an unspecified order.
Is this implying that a single machine word write is effectively atomic? And, if so, are function pointer writes in Go a single machine word operation?
Update: Here's a properly synchronized solution
Unsynchronized, concurrent access to any variable from multiple goroutines where at least one of them is a write is undefined behavior by The Go Memory Model.
Undefined means what it says: undefined. It may be that your program will work correctly, it may be it will work incorrectly. It may result in losing memory and type safety provided by the Go runtime (see example below). It may even crash your program. Or it may even cause the Earth to explode (probability of that is extremely small, maybe even less than 1e-40, but still...).
This undefined in your case means that yes, i may be nil, partially assigned, invalid, undefined, ... anything other than either a or b. This list is just a tiny subset of all the possible outcomes.
Stop thinking that some data races are (or may be) benign or harmless. They can be the source of the worst problems if left unattended.
Since your code writes to the variable a in one goroutine and reads it in another goroutine (which tries to assign its value to another variable i), it's a data race, and as such it's not safe. It doesn't matter that it works "correctly" in your tests. Someone could take your code as a starting point, extend and build on it, and end up with a catastrophe because of the initially "harmless" data race.
As related questions, read How safe are Golang maps for concurrent Read/Write operations? and Incorrect synchronization in go lang.
Strongly recommended to read the blog post by Dmitry Vyukov: Benign data races: what could possibly go wrong?
Also a very interesting blog post which shows an example which breaks Go's memory safety with intentional data race: Golang data races to break memory safety
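For completeness, the conventional fix is to guard both the swap and the read with the same lock. A minimal, self-contained sketch, where the function values and timings are just placeholders:
package main

import (
    "fmt"
    "sync"
    "time"
)

var (
    mu sync.Mutex
    a  = func() string { return "a" }
    b  = func() string { return "b" }
)

func main() {
    go func() {
        for range time.Tick(1 * time.Millisecond) {
            mu.Lock()
            a, b = b, a
            mu.Unlock()
        }
    }()

    time.Sleep(5 * time.Millisecond)

    // The read takes the same lock as the writer.
    mu.Lock()
    i := a
    mu.Unlock()

    fmt.Println(i())
}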
In terms of race conditions, it's not safe. In short, a race condition occurs when more than one asynchronous routine (coroutine, thread, process, goroutine, etc.) tries to access the same resource and at least one of the accesses is a write. In your example, two goroutines read and write variables of function type; what matters from a concurrency point of view is that those variables occupy some portion of memory, and both goroutines are reading or writing that portion of memory.
Short answer: just run your example using the -race flag with go run -race or go build -race, and you'll see a detected data race.
The answer to your question, as of today, is that if a and b are not larger than a machine word, i must be equal to a or b. Otherwise, it may contain an unspecified value, most likely an interleaving of different parts of a and b.
The Go memory model, as of the version of June 6, 2022, guarantees that even in a program containing a data race, an access to a memory location no larger than a machine word must be atomic.
Otherwise, a read r of a memory location x that is not larger than a machine word must observe some write w such that r does not happen before w and there is no write w' such that w happens before w' and w' happens before r. That is, each read must observe a value written by a preceding or concurrent write.
The happens-before relationship here is defined in the previous section of the memory model.
The result of a racy read from a larger memory location is unspecified, but it is definitely not undefined as in the realm of C++.
Reads of memory locations larger than a single machine word are encouraged but not required to meet the same semantics as word-sized memory locations, observing a single allowed write w. For performance reasons, implementations may instead treat larger operations as a set of individual machine-word-sized operations in an unspecified order. This means that races on multiword data structures can lead to inconsistent values not corresponding to a single write. When the values depend on the consistency of internal (pointer, length) or (pointer, type) pairs, as can be the case for interface values, maps, slices, and strings in most Go implementations, such races can in turn lead to arbitrary memory corruption.

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, so I'm trying to optimize this. I've read in multiple places that writing entire arrays is faster than writing individual elements, i.e. WRITE(10) arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But I'm unclear where my case falls in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
   [calculations to determine m values stored in array arr]
   WRITE(10) m
   DO j=1,m
      WRITE(10) arr(j)
   ENDDO
ENDDO
But m may change each time through the DO i=1,n loop, so writing the whole array arr isn't an option. Collapsing the inner DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up for writing? What about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now and intend to update with my observations. Other suggestions applicable to this case are welcome.
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding that the Portland Group's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization), -fastsse (various performance enhancements, optimized for SSE hardware), -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization in the compiler).
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed and showed a reduction in runtime of about 30% for the WRITE code; the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file, to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if your values are double precision. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O in stream access, not because of the rather insignificant saving in space (see above) but because keeping the leading record marker up to date on writing needs a seek, which is additional effort.
If your data size is very large, above ~ 2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible to Intel), but it should just work. I don't know what Portland does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
The point of avoiding writing an array one element at a time over multiple WRITE() operations is to avoid the multiple WRITE() operations themselves. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

go atomic Load and Store

func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    oldMin := atomic.LoadInt32(&MinimumElectionTimeoutMS)
    oldMax := atomic.LoadInt32(&maximumElectionTimeoutMS)
    atomic.StoreInt32(&MinimumElectionTimeoutMS, int32(newMin))
    atomic.StoreInt32(&maximumElectionTimeoutMS, int32(newMax))
    return int(oldMin), int(oldMax)
}
I came across a Go function like this.
What I'm confused about is: why do we need atomic here? What is it protecting against?
Thanks.
Atomic functions complete a task in an isolated way where all parts of the task appear to happen instantaneously or don't happen at all.
In this case, LoadInt32 and StoreInt32 ensure that an integer is stored and retrieved in a way where someone loading won't get a partial store. However, you need both sides to use atomic functions for this to function correctly. The raft example appears incorrect for at least two reasons.
Two atomic functions do not act as one atomic function, so reading the old value and setting the new one in two lines is a race condition. You may read, then someone else may set, then you set, and you end up returning stale information about what the value was before you set it.
Not everyone accessing MinimumElectionTimeoutMS is using atomic operations. This means that the use of atomics in this function is effectively useless.
How would this be fixed?
func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    oldMin := atomic.SwapInt32(&MinimumElectionTimeoutMS, int32(newMin))
    oldMax := atomic.SwapInt32(&maximumElectionTimeoutMS, int32(newMax))
    return int(oldMin), int(oldMax)
}
This would ensure that oldMin is the minimum that existed just before the swap. However, the entire function is still not atomic: the returned oldMin/oldMax pair could be one that was never set together by a single call to resetElectionTimeoutMS. For that... just use locks (see the sketch further below).
Each function would also need to be changed to do an atomic load:
func minimumElectionTimeout() time.Duration {
    min := atomic.LoadInt32(&MinimumElectionTimeoutMS)
    return time.Duration(min) * time.Millisecond
}
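If the whole reset really must behave as one atomic operation, a lock is the simplest route. A minimal sketch, where electionMu is an illustrative name and the two variables become plain ints guarded by the lock (requires the sync package):
var (
    electionMu               sync.Mutex
    MinimumElectionTimeoutMS int
    maximumElectionTimeoutMS int
)

func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    electionMu.Lock()
    defer electionMu.Unlock()
    oldMin, oldMax := MinimumElectionTimeoutMS, maximumElectionTimeoutMS
    MinimumElectionTimeoutMS, maximumElectionTimeoutMS = newMin, newMax
    return oldMin, oldMax
}
Every other reader of these variables must, of course, take the same lock.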
I recommend you carefully consider the quote VonC mentioned from the golang atomic documentation:
These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package.
If you want to understand atomic operations, I recommend you start with http://preshing.com/20130618/atomic-vs-non-atomic-operations/. That goes over the load and store operations used in your example. However, there are other uses for atomics. The go atomic package overview goes over some cool stuff like atomic swapping (the example I gave), compare and swap (known as CAS), and Adding.
A funny quote from the link I gave you:
it’s well-known that on x86, a 32-bit mov instruction is atomic if the memory operand is naturally aligned, but non-atomic otherwise. In other words, atomicity is only guaranteed when the 32-bit integer is located at an address which is an exact multiple of 4.
In other words, on common systems today, the atomic load and store used in your example are effectively no-ops: the plain operations are already atomic. (That is not guaranteed by the language, though; if you need an operation to be atomic, it is better to say so explicitly.)
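For a quick feel for the other primitives mentioned above (swap, compare-and-swap, add), here is a tiny sketch against an illustrative package-level counter (requires sync/atomic):
var counter int32

func demoAtomics() {
    atomic.AddInt32(&counter, 1)                            // atomic increment
    old := atomic.SwapInt32(&counter, 10)                   // set to 10, return the previous value
    swapped := atomic.CompareAndSwapInt32(&counter, 10, 20) // set to 20 only if it is still 10
    _, _ = old, swapped
}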
Considering that the package atomic provides low-level atomic memory primitives useful for implementing synchronization algorithms, I suppose it was intended to ensure that:
MinimumElectionTimeoutMS isn't modified while being loaded into oldMin;
MinimumElectionTimeoutMS isn't modified while being set to the new value newMin.
But, the package does come with the warning:
These functions require great care to be used correctly.
Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package.
Share memory by communicating; don't communicate by sharing memory.
In this case (server.go from the Raft distributed consensus protocol), synchronizing directly on the variable might be deemed faster than putting a Mutex around the whole function.
Except, as Stephen Weinberg's answer illustrates (upvoted), this isn't how you use atomic. It only makes sure that oldMin is accurate at the moment of the swap.
See another example at "Is the two atomic style code in sync/atomic.once.go necessary?", in relation with the "memory model".
OneOfOne mentions in the comments using atomic CAS as a spinlock (very fast locking):
BenchmarkSpinL-8 2000 708494 ns/op 32315 B/op 2001 allocs/op
BenchmarkMutex-8 1000 1225260 ns/op 78027 B/op 2259 allocs/op
See:
sync/spinlock.go
sync/spinlock_test.go
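For illustration only, here is a minimal sketch of the CAS-as-spinlock idea (not the code behind those links, just the general shape; requires sync/atomic and runtime):
type spinLock int32

func (l *spinLock) Lock() {
    // Spin until we win the 0 -> 1 transition.
    for !atomic.CompareAndSwapInt32((*int32)(l), 0, 1) {
        runtime.Gosched() // yield so other goroutines can run
    }
}

func (l *spinLock) Unlock() {
    atomic.StoreInt32((*int32)(l), 0)
}
A *spinLock satisfies sync.Locker, so it can be dropped in where a Mutex would otherwise be used, but it only pays off when the critical sections are very short.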

Using MPI_Send/Recv to handle chunk of multi-dim array in Fortran 90

I have to send and receive (MPI) a chunk of a multi-dimensional array in FORTRAN 90. The line
MPI_Send(x(2:5,6:8,1),12,MPI_Real,....)
is not supposed to be used, as per the book "Using MPI..." by Gropp, Lusk, and Skjellum. What is the best way to do this? Do I have to create a temporary array and send it or use MPI_Type_Create_Subarray or something like that?
The reason not to use array sections with MPI_SEND is that the compiler has to create a temporary copy with some MPI implementations. This is due to the fact that Fortran can only properly pass array sections to subroutines with explicit interfaces and has to generate temporary "flattened" copies in all other cases, usually on the stack of the calling subroutine. Unfortunately in Fortran before the TR 29113 extension to F2008 there is no way to declare subroutines that take variable type arguments and MPI implementations usually resort to language hacks, e.g. MPI_Send is entirely implemented in C and relies on Fortran always passing the data as a pointer.
Some MPI libraries work around this issue by generating a huge number of overloads for MPI_SEND:
one that takes a single INTEGER
one that takes a 1-d array of INTEGER
one that takes a 2-d array of INTEGER
and so on
The same is then repeated for CHARACTER, LOGICAL, DOUBLE PRECISION, etc. This is still a hack, as it does not cover cases where one passes a user-defined type. Furthermore, it greatly complicates the C implementation, which now has to understand the Fortran array descriptors, which are very compiler-specific.
Fortunately times are changing. The TR 29113 extension to Fortran 2008 includes two new features:
assumed-type arguments: TYPE(*)
assumed-dimension arguments: DIMENSION(..)
The combination of both, i.e. TYPE(*), DIMENSION(..), INTENT(IN) :: buf, describes an argument that can both be of varying type and have any dimension. This is already being taken advantage of in the new mpi_f08 interface in MPI-3.
Non-blocking calls present bigger problems in Fortran that go beyond what Alexander Vogt has described. The reason is that Fortran does not have the concept of suppressing compiler optimisations (i.e. there is no volatile keyword in Fortran). The following code might not run as expected:
INTEGER :: data
data = 10
CALL MPI_IRECV(data, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierr)
! data is not used here
! ...
CALL MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)
! data is used here
One might expect that after the call to MPI_WAIT, data would contain the value received from rank 0, but this might very well not be the case. The reason is that the compiler cannot know that data might change asynchronously after MPI_IRECV returns, and it might therefore keep the value in a register instead. That's why non-blocking MPI calls are generally considered dangerous in Fortran.
TR 29113 has a solution for that second problem too, with the ASYNCHRONOUS attribute. If you take a look at the mpi_f08 definition of MPI_IRECV, its buf argument is declared as:
TYPE(*), DIMENSION(..), INTENT(OUT), ASYNCHRONOUS :: buf
Even if buf is a scalar argument, i.e. no temporary copy is created, a TR 29113 compliant compiler would not resort to register optimisations for the buffer argument.
EDIT: As Hristo Iliev pointed out, MPI_Send is always blocking, but it might choose to send data asynchronously. From here:
MPI_Send will not return until you can use the send buffer.
Non-blocking communication might pose a problem with Fortran when non-contiguous arrays are involved. In that case, the compiler creates a temporary array for the dummy variable and passes it to the subroutine. Once the subroutine has finished, the compiler is at liberty to free the memory of that copy.
That's fine as long as you use blocking communication (MPI_Send), because then the message has been sent by the time the subroutine returns. For non-blocking communication (MPI_Isend), however, the temporary array is the send buffer, and the subroutine returns before the data has actually been sent.
So it might happen, that MPI will send data from a memory location that holds no valid data any more.
So either you create a copy yourself (so that your send buffer is contiguous in memory), or you create a subarray datatype (i.e. tell MPI the addresses in memory of the elements you want to send, e.g. with MPI_Type_create_subarray). There are further alternatives out there, like MPI_Pack, but I have no experience with them.
Which way is faster? Well, that depends:
On the actual implementation of your MPI library
On the data and its distribution
On your compiler
On your hardware
See here for a detailed explanation and further options.

Resources