How to decide on the number of concurrent actions? - go

I'm currently writing an encoder and (obviously) want to make it fast.
I have a working system for doing the encoding (so every goroutine is doing the same thing), but I'm struggling to find the right number of goroutines to run the code in. I basically want to decide on a maximum number of goroutines that keeps the CPU busy.
The following thoughts crossed my mind:
If a file is only <1 kB, it's not useful to run the code in a lot of goroutines
The number of goroutines should be influenced by the cores/threads available
running 16 goroutines on a 4x4 GHz CPU will not be a problem, but what about on a 4x1 GHz CPU?
hard to determine reliably across platforms
The CPU should be busy, but not so busy that it keeps other programs from responding (~70-ish %?)
hard to decide beforehand due to clock speed and other parameters
I've tried to decide how many goroutines to use based on these factors, but I'm not quite sure how to do so reliably and cross-platform.
Attempts already made:
using a linear function of the file size to determine the number
requires a different function for each CPU
parsing CPU specs from lscpu
not cross-platform
requires yet another function to derive a number from the frequency
Neither of which has been satisfactory.

You mention in a comment that
every goroutine is reading the file that is to be encoded
But of course the file—any file—is already encoded in some way: as plain text, perhaps, or UTF-8 (a stream of bytes), perhaps assembled into units of "lines". Or it might be an image stream, such as an MPEG file, consisting of some number of frames. Or it might be a database, consisting of records. Whatever its input form, it contains some sort of basic unit that you could feed to your (re-)encoder.
That unit, whatever it may be, is a sensible place to divide work. (How sensible, depends on what it is. See the idea of chunking below.)
Let's say the file consists of independent lines: then use scanner.Scan to read them, and pass each line to a channel that takes lines. Spin off some number N of readers that read from the channel, one line at a time:
ch := make(chan string)
for i := 0; i < n; i++ {
    go readAndEncode(ch) // each reader consumes lines from ch and encodes them
}
// later, or immediately:
for s := bufio.NewScanner(os.Stdin); s.Scan(); {
    ch <- s.Text()
}
close(ch)
If there are 100 lines, and 4 readers, the first four ch <- s.Text() operations go fast, and the fifth one pauses until one of the readers is done encoding and goes back to reading the channel.
If individual lines are too small a unit, perhaps you should read a "chunk" (e.g., 1 MB) at a time. If the chunk has a partial line at the end, back up, or read more, until you have a whole line. Then send the entire data chunk.
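If it helps, here is one way the chunking idea might look in code. This is only a sketch: chunkLines, the chan []byte, and the chunk size are all illustrative names and values (not from the question), and it assumes the bufio and io imports.
// chunkLines reads src roughly chunkSize bytes at a time, extends each chunk
// to the next newline so that no line is split, and sends the chunks on out.
func chunkLines(src io.Reader, out chan<- []byte, chunkSize int) {
    r := bufio.NewReaderSize(src, chunkSize)
    for {
        chunk := make([]byte, chunkSize)
        n, err := io.ReadFull(r, chunk)
        chunk = chunk[:n]
        if err == nil {
            // The chunk almost certainly ends mid-line: read up to the next newline.
            rest, rerr := r.ReadBytes('\n')
            chunk = append(chunk, rest...)
            err = rerr
        }
        if len(chunk) > 0 {
            out <- chunk
        }
        if err != nil { // io.EOF, or io.ErrUnexpectedEOF from the final short ReadFull
            break
        }
    }
    close(out)
}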
Because channels copy the data, you may wish to send a reference to the chunk instead.[1] This would be true of any larger data unit. (Lines tend to be short, and the overhead of copying them is generally not very large compared to the overhead of using channels in the first place. If your lines have type string, well, see the footnote.)
If a line, or chunk of lines, is not the correct unit of work here, figure out what is. Think of goroutines as people (or busy little gophers) who each get one job to do. They can depend on someone else—another person or gopher—to do a smaller job, whatever that might be; and having ten people, or gophers, working on sub-tasks allows a supervisor to manage them. If you need to do the same job N times, and N is not unbounded, you can spin off N goroutines. If N is potentially unbounded, spin off a fixed number (maybe based on the number of CPUs) and feed them work through a channel; a sketch of that pattern follows the footnote.
[1] As Burak Serdar notes, some copies can be elided automatically: e.g., strings are in effect read-only slices. Slice types have three parts: a pointer (reference) to the underlying data, a length, and a capacity. Copying a slice copies these three parts, but not the underlying data. The same goes for strings: string headers omit the capacity, so sending a string through a channel copies only the two header words. Hence many of the obvious and easy-to-code ways of chunking data will already be pretty efficient.
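To make the fixed-number-of-workers pattern concrete, here is a rough, self-contained sketch. encodeLine is a placeholder for the real encoder, and tying the worker count to runtime.NumCPU() is just one reasonable default, not a rule.
package main

import (
    "bufio"
    "fmt"
    "os"
    "runtime"
    "sync"
)

// encodeLine stands in for whatever per-line encoding work you actually do.
func encodeLine(s string) string { return s }

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup

    n := runtime.NumCPU() // one worker per logical CPU is a reasonable starting point
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for line := range ch {
                _ = encodeLine(line) // do something with the result here
            }
        }()
    }

    for s := bufio.NewScanner(os.Stdin); s.Scan(); {
        ch <- s.Text()
    }
    close(ch) // lets the workers' range loops finish
    wg.Wait()
    fmt.Println("done")
}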

Related

Does seeking on a large file repeatedly and writing affect writing speed?

I need to speed up downloading a large file and I'm using multiple connections for that. I'm using a single goroutine with access to disk and it receives data from multiple goroutines using a channel, as I was advised here.
file, _ := os.Create(filename)
down.destination = file
for info := range down.copyInfo {
    down.destination.Seek(info.start, 0)
    io.CopyN(down.destination, info.from, info.length)
}
The problem is that repeated seeking on a large file seems to make the operation slower. When info.length is larger, it has to seek fewer times, and it seems to do the job faster. But I need to make info.length smaller. Is there a way to make seeking faster? Or should I just download each part to a separate temp file and concatenate them at the end?
A seek itself does not do any I/O; it just sets the position in the file for the next read or write. The number of seeks by itself thus likely doesn't matter. This can also easily be tested by adding dummy seeks without any following read or write operation.
The problem is likely not the number of seeks but the number of write operations. With many small fragments it will need more I/O operations to write the data than with a few large fragments. And each of these I/O operations has a significant overhead. There is the overhead of the system call itself. Then there might be overhead if the fragment is not aligned at the block boundaries of the underlying storage. And with a rotating disk there may be overhead in positioning to the actual sector.
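If you want to verify that claim on your own system, a quick (and admittedly crude) test is to time a large number of seeks that are not followed by any read or write. The temp-file name, offsets, and counts below are arbitrary.
package main

import (
    "fmt"
    "io"
    "log"
    "os"
    "time"
)

func main() {
    f, err := os.CreateTemp("", "seektest")
    if err != nil {
        log.Fatal(err)
    }
    defer os.Remove(f.Name())
    defer f.Close()

    start := time.Now()
    for i := 0; i < 1_000_000; i++ {
        // A seek only updates the file offset; no I/O is issued here.
        if _, err := f.Seek(int64(i%1024)*4096, io.SeekStart); err != nil {
            log.Fatal(err)
        }
    }
    fmt.Println("1e6 dummy seeks took", time.Since(start))
}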

Golang assignment safety with single reader and single writer

Say I have two goroutines:
var sequence int64

// writer
for i := sequence; i < max; i++ {
    doSomethingWithSequence(i)
    sequence = i
}

// reader
for {
    doSomeOtherThingWithSequence(sequence)
}
So can I get by without atomic?
Some potential risks I can think of:
reordering (for the writer, updating sequence happens before doSomething) could happen, but I can live with that.
sequence is not properly aligned in memory, so the reader might observe a partially updated i. Running on Linux (recent kernel) with x86_64, can we rule that out?
the Go compiler 'cleverly optimizes' the reader, so the access to i never goes to memory but is cached in a register. Is that possible in Go?
Anything else?
Go's motto: Do not communicate by sharing memory; instead, share memory by communicating. Which is an effective best-practice most of the time.
If you care about ordering (your first point), you care about synchronizing the two goroutines.
As for the alignment and partial-update concern: I don't think it is possible. Anyway, it is not something you should worry about if you properly design the synchronization.
The same goes for the compiler caching the value in a register.
Luckily, Go has a data race detector integrated. Try running your example with go run -race; you will probably see the race condition happening on the sequence variable.
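If the two goroutines really must share sequence directly rather than communicating over a channel, one conventional way to make the snippet race-free is sync/atomic. This is only a sketch that mirrors the shape of the question; the stubbed-out work and the 100 ms run time are arbitrary.
package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

func main() {
    var sequence int64
    const max = 1_000_000

    // writer
    go func() {
        for i := int64(0); i < max; i++ {
            // doSomethingWithSequence(i) would go here
            atomic.StoreInt64(&sequence, i) // publish the new value safely
        }
    }()

    // reader
    go func() {
        for {
            s := atomic.LoadInt64(&sequence) // observe whatever was last published
            _ = s                            // doSomeOtherThingWithSequence(s) would go here
        }
    }()

    time.Sleep(100 * time.Millisecond)
    fmt.Println("last observed:", atomic.LoadInt64(&sequence))
}
Run it with go run -race to confirm the detector no longer reports a race on sequence.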

Is it efficient to read, process and write one line at a time?

I am working on a project that requires reading a file, making some manipulations on each line, and generating a new file. I am a bit concerned about performance. Which approach is more efficient? I wrote some pseudocode below.
Store everything in an array, close the file, manipulate each line, and write the new array to the output file:
openInputFile()
lineArray[] = readInput()
closeInputFile()
for (i in lineArray)    // i: current line
    manipulate i
    newArray[] += i     // store manipulated line in new array
openOutputFile()
writeOutput(newArray)
closeOutput()
Read each line in a loop; after manipulating it, write the new line to the output:
openInputFile()
openOutputFile()
for (i in inputFile)    // i: current line
    manipulate i
    print manipulated line to output
closeInputFile()
closeOutputFile()
Which one should I choose?
It depends on how large the input file is:
If it is small, it doesn't matter which approach you use.
If it is large enough, then the overhead of holding the entire input file and the entire output file in memory at the same time can have significant performance impacts. (Increased paging load, etcetera.)
If it is really large, you will run out of memory and the application will fail.
If you cannot predict the number of lines there will be, then preallocating the line array is problematic.
Provided that you use buffered input and output streams, the second version will be more efficient, will use less memory, and won't break if the input file is too big.
In both cases you read the input file once and write the output file once. From that perspective, there isn't much difference in efficiency. Filesystems are good at buffering and serialising IO, and your disks are almost always the limiting factor in this sort of thing.
In an extreme case, you do sometimes gain a bit of efficiency with batching your write operations - a single large write is more efficient than lots of small ones. This is very rarely relevant on a modern operating system though, as they'll already be doing that behind the scenes.
So the key difference between the two approaches is memory use - in the former case, you have a much larger memory footprint, and gain no advantage from doing it. You should therefore go for the second choice*.
* Unless you actually need to reference elsewhere in the array, e.g. if you need to sort your data, because you then do need to pull your whole file into memory to manipulate it.
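In Go (the language used elsewhere in this thread), the second, streaming approach might look roughly like the sketch below. manipulate and the file names are placeholders, not anything from the question.
package main

import (
    "bufio"
    "log"
    "os"
    "strings"
)

// manipulate stands in for whatever per-line work you need to do.
func manipulate(line string) string { return strings.ToUpper(line) }

func main() {
    in, err := os.Open("input.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()

    out, err := os.Create("output.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    w := bufio.NewWriter(out)
    defer w.Flush() // make sure buffered output actually reaches the file

    sc := bufio.NewScanner(in)
    for sc.Scan() {
        if _, err := w.WriteString(manipulate(sc.Text()) + "\n"); err != nil {
            log.Fatal(err)
        }
    }
    if err := sc.Err(); err != nil {
        log.Fatal(err)
    }
}
Only one line (plus the I/O buffers) is ever held in memory at a time, which is the point of the second approach.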

How to test algorithm performance on devices with low resources?

I am interested in using Atmel AVR controllers to read data from a LIN bus. Unfortunately, messages on such a bus have no beginning or end indicator, and the only reasonable solution seems to be brute-force parsing. Available data from the bus is loaded into a circular buffer, and the brute-force method finds valid messages in the buffer.
Working with a 64-byte buffer and a 20 MHz ATtiny, how can I test the performance of my code in order to see if buffer overflow is likely to occur? Added: My concern is that the algorithm will run slowly, thus buffering even more data.
A bit about the brute-force algorithm: the second element in the buffer is assumed to be the message size. For example, if the assumed length is 22, the first 21 bytes are XORed and tested against the 22nd byte in the buffer. If the checksum passes, the code checks whether the first (SRC) and third (DST) bytes are what they are supposed to be.
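For reference, the check described above might look roughly like the following. It is written in Go purely for readability (the real code would be AVR C or assembly), the frame layout follows the question, and the sample data is made up.
package main

import "fmt"

// candidateAt reports whether a plausible frame starts at index start of the
// circular buffer buf: byte 2 is taken as the length, the last byte as the
// XOR checksum of everything before it. A real implementation would also
// check the SRC and DST bytes here.
func candidateAt(buf []byte, start int) bool {
    size := int(buf[(start+1)%len(buf)])
    if size < 3 || size > len(buf) {
        return false
    }
    var x byte
    for i := 0; i < size-1; i++ {
        x ^= buf[(start+i)%len(buf)]
    }
    return x == buf[(start+size-1)%len(buf)]
}

func main() {
    // Hand-built 5-byte "frame": SRC, size, DST, payload, checksum.
    frame := []byte{0x10, 5, 0x20, 0xAB, 0x10 ^ 5 ^ 0x20 ^ 0xAB}
    buf := append([]byte{0xFF, 0xFF}, frame...) // some junk before the frame

    for start := range buf {
        if candidateAt(buf, start) {
            fmt.Println("possible frame starting at index", start)
        }
    }
}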
AVR is one of the easiest microcontrollers for performance analysis, because it is a RISC machine with a simple instruction set and a well-known execution time for each instruction.
So, the basic procedure is that you take the assembly code and start calculating different scenarios. Basic register operations take one clock cycle, branches usually two cycles, and memory accesses three cycles. An XOR loop would take maybe 5-10 cycles per byte, so it is relatively cheap. How you get your hands on the assembly code depends on the compiler, but all compilers tend to give you the end result in a reasonably legible form.
Usually, without seeing the algorithm and knowing anything about the timing requirements, it is quite impossible to give a definite answer to this kind of question. However, as the LIN bus speed is limited to 20 kbit/s, you will have around 10 000 clock cycles for each byte (each UART character is 10 bits, so at most 2 000 bytes arrive per second, and 20 MHz / 2 000 ≈ 10 000 cycles per byte). That is enough for almost anything.
A more difficult question is what to do with the LIN framing, which is dependent on timing. It is not a very nice design, as it requires some extra effort from the microcontroller. (What on earth is wrong with using the 9th bit?)
The LIN frame consists of a
break (at least 13 bit times)
synch delimiter (0x55)
message id (8 bits)
message (0..8 x 8 bits)
checksum (8 bits)
There are at least four possible approaches with their ups and downs:
(Your approach.) Start at all possible starting positions and try to figure out where the checksummed message is. Once you are in sync, this is not needed. (Easy, but returns ghost messages with a probability of 1/256. Remember to discard the synch field.)
Use the internal UART and look for the synch field; try to figure out whether the data after the delimiter makes any sense. (This has lower probability of errors than the above, but requires the synch delimiter to come through without glitches and may thus miss messages.)
Look for the break. The easiest way to do this is to timestamp all arriving bytes. It is quite probably not necessary to buffer the incoming data in any way, as the data rate is very low (max. 2000 bytes/s). Nominally, the distance between the end of the last character of a frame and the start of the first character of the next frame is at least 13 bits. As receiving a character takes 10 bits, the delay between receiving the end of the last character in the previous message and the end of the first character of the next message is nominally at least 23 bits. In order to allow some tolerance for the bit timing, the limit could be set to, e.g., 17 bits. If the distance in time between "character received" interrupts exceeds this limit, the characters belong to different frames. Once you have detected the break, you may start collecting a new message. (This works almost according to the official spec.)
Do-it-yourself bit-by-bit. If you do not have a good synchronization between the slave and the master, you will have to determine the master clock using this method. The implementation is not very straightforward, but one example is: http://www.atmel.com/images/doc1637.pdf (I do not claim that one to be foolproof, it is rather simplistic.)
I would go with #3. Create an interrupt for incoming data and whenever data comes you compare the current timestamp (for which you need a counter) to the timestamp of the previous interrupt. If the inter-character time is too long, you start a new message, otherwise append to the old message. Then you may need double buffering for the messages (one you are collecting, another you are analyzing) to avoid very long interrupt routines.
The actual implementation depends on the other structure of your code. This shouldn't take much time.
And if you cannot make sure your clock is synchronized well enough (±4%) to the master clock, then you'll have to look at #4, which is probably much more instructive but quite tedious.
Your fundamental question is this (as I see it):
how can I test the performance of my code in order to see if buffer overflow is likely to occur?
Set a pin high at the start of the algorithm, set it low at the end. Look at it on an oscilloscope (I assume you have one of these - embedded development is very difficult without it.) You'll be able to measure the max time the algorithm takes, and also get some idea of the variability.

Merge sort with goroutines vs normal Mergesort

I wrote two versions of merge sort in Go, one with goroutines and one without, and I am comparing the performance of each.
https://github.com/denniss/goplayground/blob/master/src/example/sort.go#L69
That's the one using goroutines. And this is the one without
https://github.com/denniss/goplayground/blob/master/src/example/sort.go#L8
I have been trying to figure out why the goroutine implementation performs much worse than the one without. These are the numbers I see locally:
go run src/main.go
[5 4 3 2 1]
Normal Mergesort
[1 2 3 4 5]
Time: 724
Mergesort with Goroutines
[1 2 3 4 5]
Time: 26690
Yet I still have not been able to figure out why. I'm wondering if you can give me suggestions or ideas on what to do or look at. It seems to me that the implementation with goroutines should perform at least somewhat better. I say so mainly because of the following lines:
go MergeSortAsync(numbers[0:m], lchan)
go MergeSortAsync(numbers[m:l], rchan)
Using concurrency does not necessarily make an algorithm run faster. In fact, unless the algorithm is inherently parallel, it can slow down the execution.
A processor (CPU) can only do one thing at a time even if, to us, it seems to be doing two things at once. The instructions of two goroutines may be interleaved, but that does not make them run any faster than a single goroutine. A single instruction from only one of the goroutines is ever being executed at any given moment (there are some very low-level exceptions to this, depending on hardware features) -- unless your program is running on more than one core.
As far as I know, the standard merge sort algorithm isn't inherently parallel; some modifications need to be made to optimize it for parallel execution on multiple processors. Even if you're using multiple processors, the algorithm needs to be optimized for it.
These optimizations usually relate to the use of channels. I wouldn't agree that "writing to channels has a big overhead" of itself (Go makes it very performant), however, it does introduce the likely possibility that a goroutine will block. It's not the actual writing to a channel that slows down the program significantly, it's scheduling/synchronizing: the waiting and waking of either goroutine to write or read from the channel is probably the bottleneck.
To complement Not_a_Golfer's answer, I will agree that goroutines definitely shine when executing I/O operations--even on a single core--since these occur far away from the CPU. While one goroutine is waiting on I/O, the scheduler can dispatch another CPU-bound goroutine to run in the meantime. However, goroutines also shine for CPU-intensive operations when deployed across multiple processors/cores.
As others have explained, there is a cost to parallelism. You need to see enough benefit to compensate for that cost, and that only happens when the unit of work is larger than the cost of creating the channels and goroutines and receiving their results.
You can experiment to determine what the unit of work should be. Suppose the unit of work is sorting 1000 elements. In that case, you can change your code very easily like so:
func MergeSortAsync(numbers []int, resultChan chan []int) {
    l := len(numbers)
    if l <= 1000 {
        resultChan <- Mergesort(numbers)
        return
    }
    // ... otherwise split and recurse with goroutines, as before
In other words, once the unit of work is too small to justify using goroutines and channels, use your simple Mergesort without those costs.
There are two main reasons:
Writing to channels has a big overhead. Just as a reference - I tried using channels and goroutines as iterators. They were ~100 times slower than calling methods repeatedly. Of course if the operation that is being piped via the channel takes a long time to perform (say crawling a web page), that difference is negligible.
Goroutines really shine for IO based concurrency, and less so for CPU parallelism.
Considering these two issues, you'll need a lot of CPUs, or longer and less-blocking operations per goroutine, to make this faster.
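Putting the advice above together, a cutoff-based version might look like the sketch below. The 1<<13 threshold is purely illustrative (benchmark to find your own), and sort.Ints stands in for the sequential Mergesort.
package main

import (
    "fmt"
    "math/rand"
    "sort"
)

const cutoff = 1 << 13 // below this size, goroutines and channels cost more than they save

func parallelMergeSort(nums []int) []int {
    if len(nums) <= cutoff {
        out := append([]int(nil), nums...) // copy so the input is not modified
        sort.Ints(out)                     // stand-in for the sequential Mergesort
        return out
    }
    m := len(nums) / 2
    lch := make(chan []int, 1)
    go func() { lch <- parallelMergeSort(nums[:m]) }()
    right := parallelMergeSort(nums[m:]) // sort the right half on this goroutine
    left := <-lch
    return merge(left, right)
}

func merge(a, b []int) []int {
    out := make([]int, 0, len(a)+len(b))
    for len(a) > 0 && len(b) > 0 {
        if a[0] <= b[0] {
            out, a = append(out, a[0]), a[1:]
        } else {
            out, b = append(out, b[0]), b[1:]
        }
    }
    return append(append(out, a...), b...)
}

func main() {
    nums := rand.Perm(1 << 20)
    fmt.Println(sort.IntsAreSorted(parallelMergeSort(nums)))
}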
