Advantages of epoll integration - performance

In a job interview I had to answer this question: what advantages does Go gain from its epoll integration/implementation?
I know what epoll can do, and that its complexity is O(1) regardless of the descriptor count, but I have no idea why Go is better than other languages here.
I found this thread https://news.ycombinator.com/item?id=15624586 where someone suggests the reason may be that Go doesn't use stack switching. That's hard for me to understand. Which part of the program doesn't use stack switching? Every goroutine has its own stack.

It's not the netpoller integration per se that makes Go strong in this field; rather, it's the way that integration is done: instead of being bolted on as a library, the netpoller in Go is tightly integrated into the runtime and the scheduler (which decides which goroutine to run, and when).
The coupling of super-light-weight threads of execution—goroutines—with the netpoller allows for callback-free programming. That is, once your service gets another client connected, you just hand this connection to a goroutine which merely reads the data from it (and writes its response stream to it). As soon as there's no data available when the goroutine wants to read it, the scheduler suspends the goroutine and unblocks it once the netpoller reports there's data available; the same happens when the goroutine wants to write data but the sending buffer is full.
To recap, the netpoller in Go is intertwined with the goroutine scheduler, which allows goroutines to transparently wait for data availability without requiring the programmer to explicitly code an event loop and callbacks, or to deal with "futures" and "promises", which are mere callbacks wrapped in pretty objects.
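To make this concrete, here is a minimal sketch of that callback-free style: a trivial echo server in which every connection gets its own goroutine (the handleConn name and the port are my own choices, not anything prescribed):

    package main

    import (
        "bufio"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Print(err)
                continue
            }
            // One goroutine per connection: the code below *looks* blocking,
            // but when a read or write would block, the runtime parks the
            // goroutine and the netpoller wakes it when the socket is ready.
            go handleConn(conn)
        }
    }

    func handleConn(conn net.Conn) {
        defer conn.Close()
        r := bufio.NewReader(conn)
        for {
            line, err := r.ReadString('\n') // parks the goroutine if no data is available yet
            if err != nil {
                return
            }
            if _, err := conn.Write([]byte(line)); err != nil { // parks if the send buffer is full
                return
            }
        }
    }

There is no event loop and not a single callback in user code; the suspending and resuming described above happens entirely inside the runtime.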
I invite you to read this classic essay, which explains all of this far more eloquently.

Related

Go goroutine under the hood

I'm trying to understand the Go architecture and what "lightweight thread" means. I've already read some material, but I want to ask a question to clarify it.
Am I right in saying that, under the hood, the go keyword just puts the following function into the queue of an internal thread pool, while for the user it looks like the creation of a thread?
This is copied from the Go FAQ:
Why goroutines instead of threads?
Goroutines are part of making concurrency easy to use. The idea, which has been around for a while, is to multiplex independently executing functions—coroutines—onto a set of threads. When a coroutine blocks, such as by calling a blocking system call, the run-time automatically moves other coroutines on the same operating system thread to a different, runnable thread so they won't be blocked. The programmer sees none of this, which is the point. The result, which we call goroutines, can be very cheap: they have little overhead beyond the memory for the stack, which is just a few kilobytes.
What's lacking here is the definition of thread. If we resort to Wikipedia, we find:
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, ...
but that's just a description of, well, the same thing that a goroutine is. The problem here is that the word thread tends to refer to kernel thread and/or user thread (both defined on that same Wikipedia page) and these threads are heavier-weight than the goroutine threads. Which brings us right back to this:
I'm trying to understand golang architecture and what "lightweight thread" means ...
To cut to the chase, this means "lighter than the OS-provided ones". That's really all it means. There are OS-provided threads (on multiple OSes on which Go runs), but they generally do too much and cost too much to switch between so Go provides its own language-level ones that it calls "goroutines" that are much lighter.
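To see "lighter" in practice, here is a minimal sketch that starts a hundred thousand goroutines, a number at which OS-level threads would be prohibitively expensive on most systems:

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        // 100,000 OS threads would exhaust most systems; 100,000 goroutines
        // are routine, since each starts with only a few KB of stack.
        for i := 0; i < 100000; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
            }()
        }
        wg.Wait()
        fmt.Println("done")
    }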
From comments:
Why [is there a] need to move tasks from one thread to another by some planner ...
This is an implementation detail, which involves another aspect of the OS-provided kernel threads:
I can't understand how [a goroutine] can be preempted if single thread process [is] blocked by [a] system call to read [a] long file
The current Go runtime goroutine / thread / processor scheduler (see What is relationship between goroutine and thread in kernel and user state and note that there have been more than just the current implementation) predicts that some system call will block, and makes sure to assign that system call its own OS-level kernel thread (see also JimB's comment). These threads do not count against the GOMAXPROCS setting. This is in fact sometimes a problem, as it's possible for the Go runtime to try to spin off more threads than the OS allows: it might be nice if there were a system-call-thread-pool here (though there are also obvious problems with this).
So, the current runtime creates up to GOMAXPROCS kernel-style OS-level threads and uses those to multiplex up to that many goroutines onto the CPUs, but creates extra kernel-style OS-level threads whenever it wants to. As the blog post linked in the question above notes, the P entities act as queues to hold goroutines (Gs) on a per-processor basis for localized cache lookup (remember that on some systems, especially NUMA ones, it's expensive to reach out "across" CPUs: the scheduler is still willing to do this, but won't do it too often, for some definition of "too often").
Earlier versions of the current scheduler required explicit yields (runtime.Gosched() calls) or various other runtime operations to cause a switch from the current goroutine to some other goroutine. See What exactly does runtime.Gosched do? for example. As of Go 1.14, the runtime provides automatic goroutine preemption on some OSes; see Will Go's scheduler yield control from one goroutine to another for CPU-intensive work?
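A minimal sketch of an explicit yield, the mechanism those older schedulers relied on for tight loops (the GOMAXPROCS(1) call is only there to make the interleaving easy to observe):

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        runtime.GOMAXPROCS(1) // a single P makes the explicit yields easy to observe
        done := make(chan struct{})
        go func() {
            for i := 0; i < 3; i++ {
                fmt.Println("worker", i)
                runtime.Gosched() // explicitly yield the processor to another runnable goroutine
            }
            close(done)
        }()
        for i := 0; i < 3; i++ {
            fmt.Println("main", i)
            runtime.Gosched()
        }
        <-done // wait so the worker's output isn't cut off at exit
    }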

Golang: dispatch the same data to multiple goroutines

There is one goroutine generating data, and there are many goroutines that handle HTTP responses. I want the generated data to be passed to all HTTP handler goroutines; the dispatched data is the same for all of them.
I have thought of two solutions: using a channel pipeline to fan out, or using a mutex and a condition variable.
My concern is whether the former requires a memory allocation to put the data into a channel.
Which should I choose?
Your use case sounds like it benefits from channels. In general, channels are preferred when communication between goroutines is needed; what you describe is a classic fan-out (broadcast) scenario.
Mutexes are used to protect a piece of memory, so that only one goroutine can access or modify it at a time. Often this is the opposite of what people want, which is to parallelize execution.
A good rule of thumb is not to worry about optimization (memory allocation or not) until it actually becomes an issue; premature optimization is a common anti-pattern. A sketch of the channel approach follows.
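Here is a minimal sketch of the channel-based fan-out: one channel per handler, with the broadcast helper and buffer sizes being my own choices. It assumes sending a copy of the value to each handler is acceptable:

    package main

    import (
        "fmt"
        "sync"
    )

    // broadcast sends every value from in to each subscriber channel.
    func broadcast(in <-chan int, subs []chan int) {
        for v := range in {
            for _, s := range subs {
                s <- v // every subscriber receives the same value
            }
        }
        for _, s := range subs {
            close(s)
        }
    }

    func main() {
        in := make(chan int)
        subs := make([]chan int, 3)
        var wg sync.WaitGroup
        for i := range subs {
            subs[i] = make(chan int, 8) // a small buffer decouples the handlers
            wg.Add(1)
            go func(id int, c <-chan int) {
                defer wg.Done()
                for v := range c {
                    fmt.Printf("handler %d got %d\n", id, v)
                }
            }(i, subs[i])
        }
        go broadcast(in, subs)
        for i := 0; i < 5; i++ {
            in <- i // the generator produces data
        }
        close(in)
        wg.Wait()
    }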

Why are goroutines much cheaper than threads in other languages?

In his talk (https://blog.golang.org/concurrency-is-not-parallelism), Rob Pike says that goroutines are similar to threads but much, much cheaper. Can someone explain why?
See "How goroutines work".
They are cheaper in:
Memory consumption:
A thread starts with a large stack (commonly on the order of a megabyte), as opposed to a few kilobytes for a goroutine.
Setup and teardown costs:
(That is why you have to maintain a pool of threads.)
Switching costs:
Threads are scheduled preemptively, and during a thread switch the scheduler needs to save/restore ALL registers.
In Go, by contrast, the runtime manages goroutines throughout, from creation to scheduling to teardown, and the number of registers to save is lower.
Plus, as mentioned in "Go’s march to low-latency GC", a GC is easier to implement when the runtime is in charge of managing goroutines:
Since the introduction of its concurrent GC in Go 1.5, the runtime has kept track of whether a goroutine has executed since its stack was last scanned. The mark termination phase would check each goroutine to see whether it had recently run, and would rescan the few that had.
In Go 1.7, the runtime maintains a separate short list of such goroutines. This removes the need to look through the entire list of goroutines while user code is paused, and greatly reduces the number of memory accesses that can trigger the kernel’s NUMA-related memory migration code.
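A rough way to observe the per-goroutine memory cost yourself; this is only a sketch, and the number it prints will vary by Go version and platform:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)

        const n = 10000
        done := make(chan struct{})
        var wg sync.WaitGroup
        wg.Add(n)
        for i := 0; i < n; i++ {
            go func() {
                wg.Done()
                <-done // park here so the goroutine stays alive while we measure
            }()
        }
        wg.Wait()

        runtime.ReadMemStats(&after)
        fmt.Printf("~%d bytes of stack per goroutine\n",
            (after.StackSys-before.StackSys)/n)
        close(done)
    }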

Is MPI_Bcast() blocking?

Is MPI_Bcast() blocking or non-blocking? In other words, when the root sends data, do all processes block until every process has received the data? If not, how can I synchronize (block) all of them so that none proceeds until all have received the same data?
You need to be a bit careful about terminology here as what MPI means by "blocking" may not be how you have seen it used in other contexts.
In MPI terms, Bcast is blocking. Blocking means that, when the function returns, it has completed the operation it was meant to do. In this case, it means that on return from Bcast in a given process, it is guaranteed that that process's receive buffer contains the data you want to broadcast. The non-blocking version is Ibcast.
In MPI terms, what you are asking is whether the operation is synchronous, i.e. implies synchronisation amongst processes. For a point-to-point operation such as Send, this refers to whether or not the sender waits for the receive to be posted before returning from the send call. For collective operations, the question is whether there is a barrier (as pointed out by @Vladimir). Bcast does not necessarily imply a barrier.
However, the reason I am posting is that, in almost all MPI programs written using the standard Send/Recv calls (as opposed to single-sided Put/Get), you do not care whether or not the broadcast synchronises. All each process cares about is that it has received the data it needs; why would it matter what the other processes are doing? If you subsequently want to communicate with any other process, the MPI routines are designed so that the required synchronisation happens automatically. If you issue a receive and another process is slow, you wait; if you issue a send and the other process has not issued a receive, everything will still work correctly (this assumes you don't call Rsend; you should never call Rsend!). Whether or not there is synchronisation affects performance, but rarely affects whether a program is correct.
Unless processes are interacting via some other mechanism (e.g. all accessing the same file) then it is hard to come up with a real example where you care whether or not the Bcast synchronises. Of course you can always construct some edge case, but in real practical applications of MPI it almost never matters.
Many MPI programs are littered with barriers and in my experience they are almost never required for correctness; the only common use case is to ensure meaningful timings for performance measurements.
No, this kind of blocking (waiting for the other processes to finish their part of the job) would be very bad for performance. Every process continues as soon as it has all it needs; that means the data it was to receive has arrived, or the data to be sent has at least been copied to some buffer.
You can use an MPI_Barrier to synchronize processes if you need to be sure all processes have finished. As already said, it can slow down the program significantly; I use it only for certain diagnostic logging when initializing my code, not during the actual integration.

Handling user interface in a multi-threaded application (or being forced to have a UI-only main thread)

In my application, I have a 'logging' window, which shows all the logging, warnings, errors of the application.
Last year my application was still single-threaded, so this worked quite well.
Now I am introducing multithreading. I quickly noticed that it's not a good idea to update the logging window from different threads. After reading some articles on keeping the UI in the main thread, I created a communication buffer into which the other threads add their logging messages, and from which the main thread takes the messages and shows them in the logging window (this is done in the message loop).
Now, in one part of my application, memory usage increases dramatically, because the separate threads generate lots of logging messages and the main thread cannot empty the communication buffer quickly enough. After a while the memory decreases again (once the other threads have finished their work and the main thread gradually empties the communication buffer).
I solved this problem by putting a maximum size on the communication buffer, but then I ran into a problem in the following situation:
the main thread has to perform a complex action
the main thread takes some parts of the action and lets separate threads execute them
while the separate threads are executing their logic, the main thread processes the results from the other threads and continues with its own work once the other threads have finished
Problem is that in this situation, if the other threads perform logging, there is no UI-message loop, and so the communication buffer is filled, but not emptied.
I see two solutions in solving this problem:
require the main thread to do regular polling of the communication buffer
only performing user interface logic in the main thread (no other logic)
I think the second solution seems the best, but it may not be that easy to introduce in a big application (in my case, one that performs mathematical simulations).
Are there any other solutions or tips?
Or is one of the two proposed the best, easiest, most-pragmatic solution?
Let's make some order first.
you must not hold up UI processing for any length of time the user would notice, or they will be frustrated
you may still perform long operations in the UI thread. This is done by means of a PeekMessage loop. If you design one or more proper PeekMessage loops, you do not need multithreading, except for performance optimization.
you may consider a MsgWaitForMultipleObjects() loop instead of GetMessage if you want to communicate with threads efficiently (always better than polling)
Therefore, if you do not redesign your message loop:
There's no way you can perform synchronous requests from other threads
You may design a separate thread for the logging
All non-UI logic will have to be elsewhere.
About the memory problem:
it is a bad design to let one thread allocate unbounded memory when another thread is stuck. Such a dependency is a clear recipe for disaster.
If the buffer is limited, you need to decide what happens when it is overrun. You have two options: suspend the producing thread, or discard the message.
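That trade-off is language-agnostic; here is a minimal sketch of the "discard the message" option, written in Go for brevity (a Win32 implementation would bound its queue the same way):

    package main

    import "fmt"

    func main() {
        logc := make(chan string, 64) // a bounded buffer caps memory use

        // Producer: try to enqueue; discard rather than block (or grow
        // memory without limit) when the consumer can't keep up.
        go func() {
            dropped := 0
            for i := 0; i < 1000; i++ {
                select {
                case logc <- fmt.Sprintf("message %d", i):
                default:
                    dropped++ // buffer full; the other option is to block here
                }
            }
            fmt.Println("dropped:", dropped)
            close(logc)
        }()

        // Consumer (the UI thread's message loop in the original design)
        // drains the buffer at its own pace.
        for msg := range logc {
            _ = msg // display the message here
        }
    }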
UI code:
It is possible to design logger code that displays messages with incredible speed. Such designs are complicated: they rely on sophisticated caching, data arranged for fast access, viewport management, and rendering only the part that corresponds to the actual pixels the user is looking at.
For most applications this is just a gimmick, because users do not read very fast. Most of the time it is better to design a different approach to showing logs, perhaps a stateful UI that lets the user choose what is interesting to them at the moment. Spy++, for example, and some Sysinternals tools like Regmon and Filemon, are incredibly fast at showing their own logs. You can have a look at their source code.
