Best way to share read intensive data across goroutines

Best way to share read intensive data across goroutines - go

I need to share a large tree (but for simplicity we can think it as a slice of strings) across multiple goroutines (http handlers). The tree is very rarely written, and only by one goroutine, but each http handler needs to read it.
Options I envisioned:
Use a mutex: very expensive and high latency for my use case. Handlers will fight to get a lock even if 99% of the time is not needed, being a read mostly struct.
Use channels: It's hard for me to imagine how I could use channels efficiently inside an http handler: it would need a good bit of boilerplate and it would copy the tree for each invocation, which is expensive.
Use lazy pointers? At invocation the handler get a pointer to the current tree structure, new writes would happen by updating a new copy of the tree, and atomically updating the tree pointer. I should also keep the old tree available until all the running goroutines return. Seems a bit tricky to code.
A mix of the last two? I could use channels to get the latest pointer to the tree, instead of the tree itself. Still a bit hard to imagine how would I write this down.
Is there any other way I'm not seeing? Any suggestion or tip?

Naive answer:
The simplest approach is to use a RWLock as showed in the official doc.
The problem is that after Spectre and Meltdown RWLock are significantly slower than atomics:
Benchmark_RWMutex_parallel-6 66796699 17.96 ns/op
Benchmark_Atomic_parallel-6 1000000000 0.5528 ns/op
The situation becomes exponentially worse with high thread count (above 32 threads) or intel CPUs. You can find more discussion about this here.
Modern answer:
Here's an example using atomics. It can still be improved by using pointers instead of structs, but it's a very good starting point.

Related

Haskell: Concurrent data structure guidelines

I've been trying to get a understanding of concurrency, and I've been trying to work out what's better, one big IORef lock or many TVars. I've came to the following guidelines, comments will be appreciated, regarding whether these are roughly right or whether I've missed the point.
Lets assume our concurrent data structure is a map m, accessed like m[i]. Lets also say we have two functions, f_easy and f_hard. The f_easy is quick, f_hard takes a long time. We'll assume the arguments to f_easy/f_hard are elements of m.
(1) If your transactions look roughly like this m[f_easy(...)] = f_hard(...), use an IORef with atomicModifyIORef. Laziness will ensure that m is only locked for a short time as it's updated with a thunk. Calculating the index effectively locks the structure (as something is going to get updated, but we don't know what yet), but once it's known what that element is, the thunk over the entire structure moves to a thunk only over that particular element, and then only that particular element is "locked".
(2) If your transactions look roughly like this m[f_hard(...)] = f_easy(...), and the don't conflict too much, use lots of TVars. Using an IORef in this case will effectively make the app single threaded, as you can't calculate two indexes at the same time (as there will be an unresolved thunk over the entire structure). TVars let you work out two indexes at the same time, however, the negative is that if two concurrent transactions both access the same element, and one of them is a write, one transaction must be scrapped, which wastes time (which could have been used elsewhere). If this happens a lot, you may be better with locks that come (via blackholing) from IORef, but if it doesn't happen very much, you'll get better parallelism with TVars.
Basically in case (2), with IORef you may get 100% efficiency (no wasted work) but only use 1.1 threads, but with TVar if you have a low number of conflicts you might get 80% efficiency but use 10 threads, so you still end up 7 times faster even with the wasted work.

Your guidelines are somewhat similar to the findings of [1] (Section 6) where the performance of the Haskell STM is analyzed:
"In particular, for programs that do not perform much work inside transactions, the commit overhead appears to be very high. To further observe this overhead, an analysis needs to be conducted on the performance of commit-time course-grain and fine-grain STM locking mechanisms."
I use atomicModifyIORef or an MVar when all the synchronization I need is something that simple locking will ensure. When looking at concurrent accesses to a data structure, it also depends on how this data structure is implemented. For example, if you store your data inside a IORef Data.Map and frequently perform read/write access then I think atmoicModifyIORef will degrade to a single thread performance, as you have conjectured, but the same will be true for a TVar Data.Map. My point is that it's important to use a data structure that is suitable for concurrent programming (balanced trees aren't).
That said, in my opinion the winning argument for using STM is composability: you can combine multiple operations into a single transactions without headaches. In general, this isn't possible using IORef or MVar without introducing new locks.
[1] The limits of software transactional memory (STM): dissecting Haskell STM applications on a many-core environment.
http://dx.doi.org/10.1145/1366230.1366241
Answer to #Clinton's comment:
If a single IORef contains all your data, you can simply use atomicModifyIORef for composition. But if you need to process lots of parallel read/write requests to that data, the performance loss might become significant, since every pair of parallel read/write requests to that data might cause a conflict.
The approach that I would try is to use a data structure where the entries themselves are stored inside a TVar (vs putting the whole data structure into a single TVar). That should reduce the possibility of livelocks, as transactions won't conflict that often.
Of course, you still want to keep your transactions as small as possible and use composability only if it's absolutely necessary to guarantee consistency. So far I haven't encountered a scenario where combining more than a few insert/lookup operations into a single transaction was necessary.

Beyond performance, I see a more fundamental reason to using TVar--the type system ensures you dont do any "unsafe" operations like readIORef or writeIORef. That your data is shared is a property of the type, not of the implementation. EDIT: unsafePerformIO is always unsafe. readIORef is only unsafe if you are also using atomicModifyIORef. At the very least wrap your IORef in a newtype and only expose a wrapped atomicModifyIORef
Beyond that, don't use IORef, use MVar or TVar
The first usage pattern you describe probably does not have nice performance characteristics. You likely end up being (almost) entirely single threaded--because of laziness no actual work happens each time you update the shared state, but whenever you need to use this shared state, the entire accumulated pile of thunks needs to be forced, and has a linear data dependency structure.
Having 80% efficiency but substantially higher parallelism allows you to exploit growing number of cores. You can expect minimal performance improvements over the coming years on single threaded code.
Many word CAS is likely coming to a processor near you in the form of "Hardware Transactional Memory" allowing STMs to become far more efficient.
Your code will be more modular--every piece of code has to be changed if you add more shared state when your design has all shared state behind a single reference. TVars and to a lesser extent MVars support natural modularity.

What are some examples of algorithms and/or data structures are difficult or impossible to implement correctly without garbage collection?

I've heard it mentioned in very general terms that such things exist, but the details are rarely discussed. What are your favourites? What makes it difficult?

The standard implementation of many lock-free structures like a concurrent hash table often are nearly impossible to write without a garbage collector. These structures work by storing long linked lists of elements, then changing the heads of the lists whenever a new value is added or removed. That way, one thread can make a change to the structure so that new threads see the change (they traverse a new linked list) while older threads continue reading older linked lists. It's crucial that the memory be reclaimed by a garbage collector, since otherwise the thread that changed the linked list would have to somehow clean up the list it just replaced, but that list is in use by other threads, this either leads to data races or requires the use of locks, both of which are bad.
A similar argument can be made for multithreaded lock free binary search trees, which use a related trick.

Anything is possible without GC, because computer hardware works without GC and it computes all algorithms. :-) But sometimes it is easier to implement a local GC and use it instead of writing a huge complicated code to do the same without GC. In real-world scenarions, algorithms using GC are often much simpler than their non-GC counterparts.
Anything which generates a lot of temporal data/variables is a big problem (i.e. difficult, not technically impossible) without some sort of garbage collection. For example imagine a web server with php or asp.net support where the engine of php/c# scripts has no garbage collection. Each web request from a browser creates a lot of temporal data on web server and if you had just a small memory leak in a script on server, the whole web server ends in a horrible death...
I mean if many people put scripts or plugins to server, there can be memory leaks. These scenarios require some sort of garbage collection.
Also LINQ in C# creates a lot of temporal objects. It would be paint to use it without garbage collection.

I conjecture that you mean concurrent data structures. If so, a lot of lockfree algorithms require Garbage Collection (or deferred/safe memory reclamation). And the most simplest example is classical lock-free stack:
[search Wikipedia by "ABA problem" - the site prohibits me from posting the link]
(ABA and safe memory reclamation are closely related problems, if you solve the latter than you solve the former too)
Is it impossible to implement them without GC? No, it's not impossible. However, it's definitely more difficult.
One solutions is to implement limited form of GC. For example, strongly thread-safe reference counting:
http://www.1024cores.net/home/lock-free-algorithms/object-life-time-management/differential-reference-counting
Another solution is to design an algorithm around the requirement, so that it just does not require a GC. For example, some lockfree producer-consumer queues (most notable the M&S queue) do require a GC. And here is a simple an efficient queue algorithm that is especially designed around GC requirement:
http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

var v = FunctionThatAllocatesMemory2(FunctionThatAllocatesMemory1());
Without GC there is no way to deallocate the memory returned by FunctionThatAllocatesMemory1().

When should we use a scatter/gather(vectored) IO?

Windows file system supports scatter/gather IO.(Of course, other platform does)
But I don't know when do I use the IO mechanism.
Could you explain me a proper case?
And what benefit can we get from using the I/O mechanism?(Just a little IO request?)

You use Scatter/Gather IO when you are doing lots of random (i.e. non-sequential) reads / writes, and you want to save on context switches / syscalls - Scatter/Gather is a form of batching in this sense. However, unless you've got a very fast disk (or more likely, a large array of disks), the syscall cost is negligible.
If you were writing a Database server, you might care about this, but anything less than a big-iron machine handling thousands or millions of requests a second won't see any benefit.

Paul -- one extra note: one additional advantage is that you hand multiple requests to the disk driver at the same time. The driver then can sort the requests and issue them in the optimal order. While syscall time is small, seek time (many milliseconds) can be punitive (that's less than 1000 I/O's/sec).
Chris's comment about demonstrating the efficiency is pragmatic. Mother nature never lies. Well, almost never.

I would imagine that you would use scatter gatehr IO when you (a) suspected your application had a performance bottleneck, and (b) you built a performance analysis framework that could show significant improvment using it.
Unless you can show a provable improvement, the additional code complexity is just a risk, and theres no magic recipe that says that, when some condition is met, and application will automatically benefit in a significant way from some programming cleverness.
Or - to put it another way - dont base major architectural decisions based on the statements of 'some guy on an internet forum'. Create a test, and find out.

in posix, readv and writev read from or write to discontinuous memory but to read and write discontinuous file ranges from discontinuous memory in one go you want readx and writex which were one of the proposed posix additions
doing a readx is faster then doing a lot of reads as it's only one system call and it lets the disk scheduler have the most io's to reorder i remember some one saying that for the ext2/3/.. fsck program that they wanted this as it knows what ranges it wants

What is "overhead"?

I am a student in Computer Science and I am hearing the word "overhead" a lot when it comes to programs and sorts. What does this mean exactly?

It's the resources required to set up an operation. It might seem unrelated, but necessary.
It's like when you need to go somewhere, you might need a car. But, it would be a lot of overhead to get a car to drive down the street, so you might want to walk. However, the overhead would be worth it if you were going across the country.
In computer science, sometimes we use cars to go down the street because we don't have a better way, or it's not worth our time to "learn how to walk".

The meaning of the word can differ a lot with context. In general, it's resources (most often memory and CPU time) that are used, which do not contribute directly to the intended result, but are required by the technology or method that is being used. Examples:
Protocol overhead: Ethernet frames, IP packets and TCP segments all have headers, TCP connections require handshake packets. Thus, you cannot use the entire bandwidth the hardware is capable of for your actual data. You can reduce the overhead by using larger packet sizes and UDP has a smaller header and no handshake.
Data structure memory overhead: A linked list requires at least one pointer for each element it contains. If the elements are the same size as a pointer, this means a 50% memory overhead, whereas an array can potentially have 0% overhead.
Method call overhead: A well-designed program is broken down into lots of short methods. But each method call requires setting up a stack frame, copying parameters and a return address. This represents CPU overhead compared to a program that does everything in a single monolithic function. Of course, the added maintainability makes it very much worth it, but in some cases, excessive method calls can have a significant performance impact.

You're tired and cant do any more work. You eat food. The energy spent looking for food, getting it and actually eating it consumes energy and is overhead!
Overhead is something wasted in order to accomplish a task. The goal is to make overhead very very small.
In computer science lets say you want to print a number, thats your task. But storing the number, the setting up the display to print it and calling routines to print it, then accessing the number from variable are all overhead.

Wikipedia has us covered:
In computer science, overhead is
generally considered any combination
of excess or indirect computation
time, memory, bandwidth, or other
resources that are required to attain
a particular goal. It is a special
case of engineering overhead.

Overhead typically reffers to the amount of extra resources (memory, processor, time, etc.) that different programming algorithms take.
For example, the overhead of inserting into a balanced Binary Tree could be much larger than the same insert into a simple Linked List (the insert will take longer, use more processing power to balance the Tree, which results in a longer percieved operation time by the user).

For a programmer overhead refers to those system resources which are consumed by your code when it's running on a giving platform on a given set of input data. Usually the term is used in the context of comparing different implementations or possible implementations.
For example we might say that a particular approach might incur considerable CPU overhead while another might incur more memory overhead and yet another might weighted to network overhead (and entail an external dependency, for example).
Let's give a specific example: Compute the average (arithmetic mean) of a set of numbers.
The obvious approach is to loop over the inputs, keeping a running total and a count. When the last number is encountered (signaled by "end of file" EOF, or some sentinel value, or some GUI buttom, whatever) then we simply divide the total by the number of inputs and we're done.
This approach incurs almost no overhead in terms of CPU, memory or other resources. (It's a trivial task).
Another possible approach is to "slurp" the input into a list. iterate over the list to calculate the sum, then divide that by the number of valid items from the list.
By comparison this approach might incur arbitrary amounts of memory overhead.
In a particular bad implementation we might perform the sum operation using recursion but without tail-elimination. Now, in addition to the memory overhead for our list we're also introducing stack overhead (which is a different sort of memory and is often a more limited resource than other forms of memory).
Yet another (arguably more absurd) approach would be to post all of the inputs to some SQL table in an RDBMS. Then simply calling the SQL SUM function on that column of that table. This shifts our local memory overhead to some other server, and incurs network overhead and external dependencies on our execution. (Note that the remote server may or may not have any particular memory overhead associated with this task --- it might shove all the values immediately out to storage, for example).
Hypothetically we might consider an implementation over some sort of cluster (possibly to make the averaging of trillions of values feasible). In this case any necessary encoding and distribution of the values (mapping them out to the nodes) and the collection/collation of the results (reduction) would count as overhead.
We can also talk about the overhead incurred by factors beyond the programmer's own code. For example compilation of some code for 32 or 64 bit processors might entail greater overhead than one would see for an old 8-bit or 16-bit architecture. This might involve larger memory overhead (alignment issues) or CPU overhead (where the CPU is forced to adjust bit ordering or used non-aligned instructions, etc) or both.
Note that the disk space taken up by your code and it's libraries, etc. is not usually referred to as "overhead" but rather is called "footprint." Also the base memory your program consumes (without regard to any data set that it's processing) is called its "footprint" as well.

Overhead is simply the more time consumption in program execution. Example ; when we call a function and its control is passed where it is defined and then its body is executed, this means that we make our CPU to run through a long process( first passing the control to other place in memory and then executing there and then passing the control back to the former position) , consequently it takes alot performance time, hence Overhead. Our goals are to reduce this overhead by using the inline during function definition and calling time, which copies the content of the function at the function call hence we dont pass the control to some other location, but continue our program in a line, hence inline.

You could use a dictionary. The definition is the same. But to save you time, Overhead is work required to do the productive work. For instance, an algorithm runs and does useful work, but requires memory to do its work. This memory allocation takes time, and is not directly related to the work being done, therefore is overhead.

You can check Wikipedia. But mainly when more actions or resources are used. Like if you are familiar with .NET there you can have value types and reference types. Reference types have memory overhead as they require more memory than value types.

A concrete example of overhead is the difference between a "local" procedure call and a "remote" procedure call.
For example, with classic RPC (and many other remote frameworks, like EJB), a function or method call looks the same to a coder whether its a local, in memory call, or a distributed, network call.
For example:
service.function(param1, param2);
Is that a normal method, or a remote method? From what you see here you can't tell.
But you can imagine that the difference in execution times between the two calls are dramatic.
So, while the core implementation will "cost the same", the "overhead" involved is quite different.

Think about the overhead as the time required to manage the threads and coordinate among them. It is a burden if the thread does not have enough task to do. In such a case the overhead cost over come the saved time through using threading and the code takes more time than the sequential one.

To answer you, I would give you an analogy of cooking Rice, for example.
Ideally when we want to cook, we want everything to be available, we want pots to be already clean, rice available in enough quantities. If this is true, then we take less time to cook our rice( less overheads).
On the other hand, let's say you don't have clean water available immediately, you don't have rice, therefore you need to go buy it from the shops first and you need to also get clean water from the tap outside your house. These extra tasks are not standard or let me say to cook rice you don't necessarily have to spend so much time gathering your ingredients. Ideally, your ingredients must be present at the time of wanting to cook your rice.
So the cost of time spent in going to buy your rice from the shops and water from the tap are overheads to cooking rice. They are costs that we can avoid or minimize, as compared to the standard way of cooking rice( everything is around you, you don't have to waste time gathering your ingredients).
The time wasted in collecting ingredients is what we call the Overheads.
In Computer Science, for example in multithreading, communication overheads amongst threads happens when threads have to take turns giving each other access to a certain resource or they are passing information or data to each other. Overheads happen due to context switching.Even though this is crucial to them but it's the wastage of time (CPU cycles) as compared to the traditional way of single threaded programming where there is never a time wastage in communication. A single threaded program does the work straight away.

its anything other than the data itself, ie tcp flags, headers, crc, fcs etc..

Are there any concurrent algorithms that in use that work correctly without any synchronization?

All of the concurrent programs I've seen or heard details of (admittedly a small set) at some point use hardware synchronization features, generally some form of compare-and-swap. The question is: are there any concurrent programs in the wild where the thread interact throughout there life and get away without any synchronization?
Example of what I'm thinking of include:
A program that amounts to a single thread running a yes/no test on a large set of cases and a big pile of threads tagging cases based on a maybe/no tests. This doesn't need synchronization because dirty data will only effect performance rather than correctness.
A program that has many threads updating a data structure where any state that is valid now, will always be valid, so dirty reads or writes don't invalidate anything. An example of this is (I think) path compression in the union-find algorithm.

If you can break work up into completely independent chunks, then yes there are concurrent algorithms whose only synchronisation point is the one at the end of the work where all threads join. Parallel speedup is then a factor of being able to break into tasks whose sizes are as similiar as possible.

Some indirect methods for solving systems of linear equations, like Successive over-relaxation ( http://en.wikipedia.org/wiki/Successive_over-relaxation ), don't really need the iterations to be synchronized.

I think it's a bit trick question because e.g. if you program in C, malloc() must be multi-thread safe and uses hardware synchronization, and in Java the garbage collector requires hardware synchronization anyway. All Java programs require the GC, and hardly any C program makes it without malloc() (or C++ program / new() operator).

There is a whole class of algorithms which are sometimes referred to as "embarallel" (contraction of "embarrassingly parallel"). Many image processing algorithms fall into this class, where each pixel may be processed independently (which makes implementation with e.g. SIMD or GPGPU very straightforward).

Well, without any synchronization at all (even at the end of the algorithm) you obviously can't do anything useful because you can't even transfer the results of concurrent computations to the main thread: suppose that they were on remote machines without any communication channels to the main machine.

The simplest example is inside java.lang.String which is immutable and lazily caches its hash code. This cache is written to without synchronization because (a) its cheaper, (b) the value is recomputable, and (c) JVM guarantees no tearing. The tolerance of data races in purely functional contexts allows tricks like this to be used safely without explicit synchronization.

I agree with Mitch's answer. I would like to add that the ray tracing algorithm can work without synchronization until the point where all threads join.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio