Are relaxed data structures used in real applications? - performance

I am learning about concurrent data structures, and I have come across a broad research literature on relaxed data structures, such as k-relaxed queues, stacks, counters, etc. See, e.g., here
https://github.com/cksystemsgroup/scal#data-structures
or here
http://www.faculty.idc.ac.il/gadi/MyPapers/2015ST-RelaxedDataStructures.pdf
The idea is that a relaxed data structure offers weaker guarantees (e.g., a dequeue operation of a k-relaxed FIFO queue is free to give any of the first k elements). Ideally, the relaxed data structures offer better performance than the strict ones.
The problem is that, as far as I see, these data structures are evaluated only on some micro-benchmarks (as in the first link above), measuring only pure performance/throughput of data structure operations, and not looking at how the relaxation influences the overall application design and performance.
Have you ever used such relaxed data structures in real applications, or can you imagine plugging in relaxed data structures into your application to obtain performance gains? Or are those data structures a purely academic exercise?

Quick Answer
Here's a quick answer by analogy. Relaxed structures are comparable to normal[0] concurrent structures in the same way that normal concurrent structures are comparable to normal non-concurrent structures.
Most uses of data structures in an application are the plain-jane non-concurrent ones, either because they are not shared, or because some type of coarse-grained locking is sufficient to share them. They offer a simple API and are easy to reason about, even in the presence of concurrent access when coarse-grained locking is used. To give a specific example, you'll probably find 20 to 100 uses of plain Map implementations for every use of ConcurrentMap.
That doesn't make ConcurrentMap useless! In the 5% of places where it is used, there may be no real substitute, performance-wise!
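To make that split concrete, here is a minimal Java sketch of the two situations (class and method names are illustrative, not from any particular codebase): a plain Map behind a coarse lock for the common case, and a ConcurrentMap for the rare, genuinely hot shared map.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// The common case: a plain Map guarded by coarse-grained locking.
// Simple to reason about, and perfectly fine when contention is low.
class SessionRegistry {
    private final Map<String, String> sessions = new HashMap<>();

    public synchronized void put(String id, String user) {
        sessions.put(id, user);
    }

    public synchronized String get(String id) {
        return sessions.get(id);
    }
}

// The rarer case: a genuinely hot, shared map where a ConcurrentMap
// (here ConcurrentHashMap) has no real substitute performance-wise.
class HotCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    public String lookup(String key) {
        // computeIfAbsent avoids holding a coarse lock around the whole map.
        return cache.computeIfAbsent(key, k -> expensiveLoad(k));
    }

    private String expensiveLoad(String key) {
        return "value-for-" + key; // stand-in for real work
    }
}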
Similarly, you are usually going to do just fine with the non-relaxed concurrent structures in those places, but there may be an additional 5% (so now 0.25% of total Map use) that could benefit from the relaxed structures (at the cost of additional complexity when reasoning about the system, and possibly theoretically visible service-order changes for clients).
Long Answer
I think there are two separate implied questions here:
Are relaxed data structures actually faster than their non-relaxed equivalents in actual use?
Are relaxed data structures actually useful in real-world applications given their weaker guarantees?
I can speak from my personal experience on both of them. I have used "relaxed structures" before I even knew they were called relaxed structures.
Are Relaxed Structures Actually Faster?
First, maybe you are asking if relaxed structures are actually faster than their non-relaxed equivalents?
The easy answer is "Yes, but only at the extremes of concurrent access".
Microbenchmarks already show the extremes of performance. You usually have one structure which is continually accessed (i.e., with a duty cycle near 100%) from multiple threads. At these extremes of contention, relaxed data structures may show an order of magnitude improvement over their non-relaxed cousins. You have to realize, however, that the vast majority of structures are not accessed in such a fashion. Even structures which already use concurrent data structures (think ConcurrentHashMap in Java) are usually accessed rarely as a percentage of total execution time, and the use of a concurrent structure is necessary more for correctness and high-sigma response time than for performance[1].
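To make the microbenchmark shape concrete, here is a deliberately naive Java sketch of the kind of near-100% duty-cycle test those papers run: every thread hammers a single shared structure in a tight loop (here just an AtomicLong counter; a real harness would handle warmup and measurement much more carefully).

import java.util.concurrent.atomic.AtomicLong;

// Naive sketch of a contention microbenchmark: N threads increment one
// shared counter as fast as they can for roughly one second.
public class ContendedCounterBench {
    public static void main(String[] args) throws InterruptedException {
        final AtomicLong counter = new AtomicLong();               // the contended structure
        final int threads = Runtime.getRuntime().availableProcessors();
        final long endAt = System.nanoTime() + 1_000_000_000L;     // ~1 second window

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                while (System.nanoTime() < endAt) {
                    counter.incrementAndGet();                     // ~100% duty cycle
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("approximate ops in the window: " + counter.get());
    }
}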
Usually No
That said, just because the vast majority of structures in your application don't need higher concurrent performance, that doesn't mean it's worthless: the few structures (if any) that do need it may account for a high percentage of total accesses. What I mean is: if you look at the code level ("How many declarations of map objects in my application need a fast concurrent structure?"), the answer is likely to be "not many", but if you look at the runtime level ("How many accesses to map objects in my application need a fast concurrent structure?"), the answer might be "a lot of them". It depends whether you weight by "use in code" or "use at runtime".
Sometimes Yes
Once you start tuning a highly concurrent application, you are very likely to run into a concurrency bottleneck at some point. Even if your application is very parallel, such as one processing many independent incoming requests in parallel, you'll usually want to introduce some type of shared state for caching, logging, or security checks, or simply because your application does have some amount of shared writable state. Whatever structure you use there may indeed suffer from high contention.
As a final point on the first question: a lot of it also depends on the hardware you use. A typical behavior of a contended structure is that it tends to flatline in throughput under concurrent access, regardless of the number of cores accessing it; i.e., the maximum throughput of the structure is some X, regardless of concurrency. Often that is in fact the best case: you may instead reach peak throughput at a moderate amount of concurrency, after which total throughput drops.
It Is Hardware Dependent
The impact of this depends on your hardware. If you are on a single-socket development machine with 4 cores, then this behavior isn't too terrible. You effectively drop down to 1 core for the contended parts, but that still leaves 25% of the maximum power available (one core). So if the contended (effectively serialized) part of your load is 10% of the total, you still achieve 1 / (0.9 + 0.1 * 4) = 0.77 = 77% of the ideal throughput despite the contention. Not a problem you'd reorganize your application over. Then you deploy to production on a 2-socket box with 18 cores per socket (36 cores total), and it looks like 1 / (0.9 + 0.1 * 36) = 0.22 = 22%; i.e., you are getting less than a quarter of the ideal performance because of your concurrency bottleneck. That is a problem.
That's with the "simple" view of concurrent scaling, where the throughput of the contended part stays constant; in reality, increased contention often decreases it, making things even worse.
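Here is the same arithmetic in a few lines of Java, so you can plug in your own contended fraction and core count (the 10% contended fraction is just the example figure from above).

// Fraction of ideal throughput when a fraction 'contended' of the work
// serializes on one core and the rest scales perfectly across 'cores'.
// This is just the Amdahl-style arithmetic from the paragraph above.
public class ContentionMath {
    static double fractionOfIdeal(double contended, int cores) {
        return 1.0 / ((1.0 - contended) + contended * cores);
    }

    public static void main(String[] args) {
        System.out.printf("4 cores:  %.0f%%%n", 100 * fractionOfIdeal(0.10, 4));  // ~77%
        System.out.printf("36 cores: %.0f%%%n", 100 * fractionOfIdeal(0.10, 36)); // ~22%
    }
}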
Summary
So the long answer finishes the same way as the short answer: yes, there are places in highly concurrent applications that can benefit from structures that perform better under contention, but those places are few. You first have to remove all the spots where false sharing occurs and where you have non-contention bottlenecks; only then will you find the places where true sharing occurs and may benefit.
Can Relaxed Structures Be Used in Practice?
Absolutely, yes. This is the easy part. There are many cases where the guarantees you actually need are weaker than the ones offered by the structure you would normally reach for. For example, if you are just going to store a bunch of objects with no duplicates, in no particular order, you might use an array/vector, a linked list, or a set-like structure. Each choice gives some advantage over the others, at the cost of some disadvantage in other scenarios (for example, arrays allow constant-time access to any element given its index, but cannot insert elements anywhere other than the end without paying an O(n) cost).
You don't often find structures with even weaker guarantees than the existing ones, because outside of concurrency there aren't really any structures that are faster than the basic ones in exchange for weaker guarantees. Try it: choose a "collection" API that has the lowest-common-denominator guarantees out of arrays, linked lists, hash tables, trees, etc. (i.e., it doesn't need fast insertion in the middle, it doesn't offer fast search by element, it doesn't offer fast indexing, and so on). Can you make any basic operation faster?
This scenario is very common: scan through the source code for anything you've written lately and check out how you used List in Java or vector in C++, or just plain arrays: in each case, which of the high performance operations did you actually need? Often you are just holding a collection of elements and iterating over them later - that's it!
Once you introduce the new axis of concurrent performance, you can use weaker guarantees to build a faster structure. Arrays are problematic because to preserve exact insertion order you always end up contending on some type of shared size variable. Linked lists are problematic because the head/tail nodes are contended, and so on.
So there is a place for weaker structures: for example, a "bag" of elements that doesn't offer FIFO insertion/iteration order, doesn't offer fast search by equality, and so on. Many uses are fine with this weakening.
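As a rough illustration (a sketch, not a production design), here is what such a relaxed "bag" might look like in Java: adds go to per-thread stripes so concurrent writers rarely contend, and in exchange you give up insertion order and fast search.

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative relaxed "bag": no ordering guarantees, no fast search,
// just cheap concurrent add() and an occasional drain() for iteration.
// Striping across several queues reduces contention on any single head/tail.
public class StripedBag<T> {
    private final ConcurrentLinkedQueue<T>[] stripes;

    @SuppressWarnings("unchecked")
    public StripedBag(int stripeCount) {
        stripes = new ConcurrentLinkedQueue[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ConcurrentLinkedQueue<>();
        }
    }

    public void add(T item) {
        // Pick a stripe based on the calling thread, so threads mostly touch
        // "their own" queue and rarely contend with each other.
        int idx = (int) (Thread.currentThread().getId() % stripes.length);
        stripes[idx].add(item);
    }

    public List<T> drain() {
        // Iteration order is essentially arbitrary - that's the relaxation.
        List<T> out = new ArrayList<>();
        for (Queue<T> stripe : stripes) {
            for (T item; (item = stripe.poll()) != null; ) {
                out.add(item);
            }
        }
        return out;
    }
}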
Concretely
Here are a few places where I've actually used relaxed structures, or seen them used, in high-performance concurrent systems:
For statistics collection: highly concurrent systems processing mostly independent requests often scale well because the requests do not share much mutable data. Once you start collecting system-wide fine-grained statistics, however, you have a contention point. Usually you don't need any kind of FIFO ordering for statistics and don't need searchability (at the collection point, anyway), so the in-process statistics collector code is an obvious place where relaxed structures are useful (see the sketch after this list).
For logging: in the same way as statistics above, many systems have a global logging facility, which can become a point of contention. Usually you want log messages to appear in FIFO order, since out-of-order messages are very confusing, but you might not care much about the relative order of messages from independent threads, or you might know exactly which messages need to occur in order with respect to other threads. Here again relaxed structures are a natural fit.
The general problem of processing or redistributing a high-volume stream of incoming requests. Imagine what any web server does: it processes a large number of incoming requests. Usually those requests come from many different users, and it doesn't matter if they are processed in the exact order they arrive. There are some ordering cases that do matter, however, usually among requests from the same user or session (e.g., imagine a user who submits some change with a POST and then issues a GET to reload the page: she would expect to see the changes from the previous POST reflected in the GET). So many of the front-end structures handling the requests could relax the usual ordering requirements for performance.
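For the statistics case in particular, Java already ships a mildly relaxed primitive: java.util.concurrent.atomic.LongAdder spreads increments across internal cells and only combines them when asked, so sum() is not an atomic point-in-time snapshot. A minimal sketch of a stats collector built on it (names are illustrative):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative statistics collector: the hot path only increments a LongAdder,
// which is internally striped to avoid contention. The trade-off is that
// snapshot() is not an exact, point-in-time view, which is usually fine
// for monitoring purposes.
public class RequestStats {
    private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    public void record(String event) {
        counters.computeIfAbsent(event, k -> new LongAdder()).increment();
    }

    public long snapshot(String event) {
        LongAdder adder = counters.get(event);
        return adder == null ? 0L : adder.sum(); // may lag concurrent increments
    }
}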
[0] By normal concurrent structures, I mean structures like ConcurrentHashMap which are designed for better performance under concurrency, but usually keep their usual ordering guarantees and so on. That is, they are made about as concurrent as possible within the boundaries of the existing API.
[1] Admittedly, this glosses over the difference between average performance and other "long tail" performance issues. I.e., a structure may be accessed infrequently on average, but under certain scenarios access may skyrocket and it can become a bottleneck under that particular load.

Related

Elixir immutability in a game context

I'm aware that, in order to ensure that all threads reading a memory location see the exact same value, Elixir never overwrites an address in use. Instead, if a variable is changed, the new value is written to a new address.
What I want to know is how that would affect real-time games. For instance, moving in a 3D game would generate a huge number of different values that need to be newly allocated, with the old values released in a timely manner. How much better or worse is this, for a game, compared to simply rewriting the values in memory as needed?
This is a very generic question. Like very generic.
First and foremost, BEAM (the VM behind Elixir and Erlang) prioritizes throughput and predictable responsiveness over latency, but in games latency is king: the more latency you have, the lower your FPS will be.
Second, BEAM was designed primarily for fault tolerance and concurrency, not raw performance, so performance-wise C/C++ will be faster at direct memory access and computation.
In general, immutable data structures bring advantages (safe concurrency, simpler reasoning about programs, fewer headaches during debugging, and simpler algorithms overall: e.g., it is much easier to construct a new red-black tree than to implement concurrent deletion from one) at the cost of raw performance.

Spinlock implementation reasoning

I want to improve the performance of a program by replacing some of the mutexes
with spinlocks. I have found a spinlock implementation in
http://www.boost.org/doc/libs/1_36_0/boost/detail/spinlock_sync.hpp
which I intend to reuse. I believe this implementation is safer than simpler implementations in which threads keep trying forever like the one found here
http://www.boost.org/doc/libs/1_54_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.example_spinlock.implementation
But I need to clarify some things about the yield function found here
http://www.boost.org/doc/libs/1_36_0/boost/detail/yield_k.hpp
First of all, I assume that the numbers 4, 16, 32 are arbitrary. I tested some other values and found that I got the best performance in my case by using different ones.
But can someone explain the reasoning behind the yield code? Specifically, why do we need all three of
BOOST_SMT_PAUSE
sched_yield and
nanosleep
Yes, this concept is known as "adaptive spinlock" - see e.g. https://lwn.net/Articles/271817/.
Usually the numbers are chosen for exponential back-off: https://geidav.wordpress.com/tag/exponential-back-off/
So, the numbers aren't arbitrary. However, which "numbers" work for your case depend on your application patterns, requirements and system resources.
The three methods of introducing "micro-delays" are designed explicitly to balance the cost against the potential gain:
spinning at full speed costs nothing in extra latency, but it results in high power consumption and wasted cycles
a small "cheap" delay (such as a CPU pause instruction) may avoid the cost of a context switch while reducing the CPU load relative to a pure busy-spin
a plain yield may let the OS avoid a context switch altogether, depending on the rest of the system load (e.g. if the number of runnable threads < the number of logical cores)
These trade-offs matter for low-latency applications, where the cost of a context switch or of cache misses is significant.
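For what it's worth, here is a rough Java translation of the same staged back-off idea, with Thread.onSpinWait() standing in for BOOST_SMT_PAUSE, Thread.yield() for sched_yield, and LockSupport.parkNanos() for nanosleep. The thresholds 4 and 16 and the sleep duration are exactly the kind of tunable knobs discussed above, not magic values, and this is a sketch rather than a production lock.

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Sketch of a spinlock with staged back-off, loosely mirroring yield_k:
// spin with a pause hint a few times, then yield, then sleep briefly.
public class BackoffSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        int k = 0;
        while (true) {
            // Test-then-test-and-set keeps cache traffic lower than a blind CAS loop.
            if (!locked.get() && locked.compareAndSet(false, true)) {
                return;
            }
            if (k < 4) {
                Thread.onSpinWait();           // cheap CPU pause hint
            } else if (k < 16) {
                Thread.yield();                // give the scheduler a chance
            } else {
                LockSupport.parkNanos(1_000L); // roughly a microsecond nap
            }
            k++;
        }
    }

    public void unlock() {
        locked.set(false);
    }
}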
TL;DR
All trade-offs try to find a balance between wasting CPU cycles and losing cache/thread efficiency.

Upper bound on speedup

My MPI experience has shown that speedup does not increase linearly with the number of nodes we use (because of the cost of communication); my own measurements show the usual sub-linear curve.
Today a speaker said: "Magically (smiles), in some occasions we can get more speedup than the ideal one!".
He meant that ideally, when we use 4 nodes, we would get a speedup of 4. But in some occasions we can get a speedup greater than 4, with 4 nodes! The topic was related to MPI.
Is this true? If so, can anyone provide a simple example of that? Or maybe he was thinking about adding multithreading to the application (he ran out of time and had to leave, so we could not discuss it)?
Parallel efficiency (speed-up / number of parallel execution units) over unity is not at all uncommon.
The main reason for that is the total cache size available to the parallel program. With more CPUs (or cores), one has access to more cache memory. At some point, a large portion of the data fits inside the cache and this speeds up the computation considerably. Another way to look at it is that the more CPUs/cores you use, the smaller the portion of the data each one gets, until that portion could actually fit inside the cache of the individual CPU. This is sooner or later cancelled by the communication overhead though.
Also, your data shows the speed-up compared to the execution on a single node. Using OpenMP could remove some of the overhead when using MPI for intranode data exchange and therefore result in better speed-up compared to the pure MPI code.
The problem comes from the incorrectly used term ideal speed-up. Ideally, one would account for cache effects. I would rather use linear instead.
Not too sure this is on-topic here, but here goes nothing...
This super-linearity in speed-up can typically occur when you parallelise your code while distributing the data in memory with MPI. In some cases, by distributing the data across several nodes/processes, you end up with chunks of data for each individual process that are small enough to fit in the processor's cache. This cache effect can have a huge impact on the code's performance, leading to great speed-ups and compensating for the increased need for MPI communication. This can be observed in many situations, but it isn't something you can really count on to compensate for poor scalability.
Another case where you can observe this sort of super-linear scalability is an algorithm where you distribute the task of finding a specific element in a large collection: by distributing the work, one of the processes/threads may find the result almost immediately, simply because it happened to be given a range of indexes that starts very close to the answer. But this case is even less reliable than the aforementioned cache effect.
Hope that gives you a flavour of what super-linearity is.
Cache has been mentioned, but it's not the only possible reason. For instance, you could imagine a parallel program which does not have sufficient memory to store all its data structures at low node counts, but does at high node counts. Thus at low node counts the programmer may have been forced to write intermediate values to disk and then read them back in again, or alternatively to re-calculate the data when required. At high node counts these games are no longer required and the program can store all its data in memory. Super-linear speed-up is then possible because at higher node counts the code is simply doing less work, using the extra memory to avoid I/O or recalculation.
Really this is the same as the cache effects noted in the other answers: using extra resources as they become available. And this is really the trick: more nodes doesn't just mean more cores, it also means more of all your resources. So while speed-up really measures your use of cores, if you can also put those other extra resources to good use, you can achieve super-linear speed-up.

Haskell: Concurrent data structure guidelines

I've been trying to get an understanding of concurrency, and I've been trying to work out what's better: one big IORef lock or many TVars. I've come to the following guidelines. Comments would be appreciated regarding whether these are roughly right or whether I've missed the point.
Let's assume our concurrent data structure is a map m, accessed like m[i]. Let's also say we have two functions, f_easy and f_hard. f_easy is quick; f_hard takes a long time. We'll assume the arguments to f_easy/f_hard are elements of m.
(1) If your transactions look roughly like this m[f_easy(...)] = f_hard(...), use an IORef with atomicModifyIORef. Laziness will ensure that m is only locked for a short time as it's updated with a thunk. Calculating the index effectively locks the structure (as something is going to get updated, but we don't know what yet), but once it's known what that element is, the thunk over the entire structure moves to a thunk only over that particular element, and then only that particular element is "locked".
(2) If your transactions look roughly like this m[f_hard(...)] = f_easy(...), and they don't conflict too much, use lots of TVars. Using an IORef in this case will effectively make the app single-threaded, as you can't calculate two indexes at the same time (there will be an unresolved thunk over the entire structure). TVars let you work out two indexes at the same time; the downside is that if two concurrent transactions both access the same element, and one of them is a write, one transaction must be scrapped, which wastes time (which could have been used elsewhere). If this happens a lot, you may be better off with the locks that come (via blackholing) from IORef, but if it doesn't happen very much, you'll get better parallelism with TVars.
Basically in case (2), with IORef you may get 100% efficiency (no wasted work) but only use 1.1 threads, but with TVar if you have a low number of conflicts you might get 80% efficiency but use 10 threads, so you still end up 7 times faster even with the wasted work.
Your guidelines are somewhat similar to the findings of [1] (Section 6) where the performance of the Haskell STM is analyzed:
"In particular, for programs that do not perform much work inside transactions, the commit overhead appears to be very high. To further observe this overhead, an analysis needs to be conducted on the performance of commit-time course-grain and fine-grain STM locking mechanisms."
I use atomicModifyIORef or an MVar when all the synchronization I need is something that simple locking will ensure. When looking at concurrent accesses to a data structure, it also depends on how that data structure is implemented. For example, if you store your data inside an IORef holding a Data.Map and frequently perform read/write access, then I think atomicModifyIORef will degrade to single-threaded performance, as you have conjectured, but the same will be true for a TVar holding a Data.Map. My point is that it's important to use a data structure that is suitable for concurrent programming (balanced trees aren't).
That said, in my opinion the winning argument for using STM is composability: you can combine multiple operations into a single transaction without headaches. In general, this isn't possible using IORef or MVar without introducing new locks.
[1] The limits of software transactional memory (STM): dissecting Haskell STM applications on a many-core environment.
http://dx.doi.org/10.1145/1366230.1366241
Answer to @Clinton's comment:
If a single IORef contains all your data, you can simply use atomicModifyIORef for composition. But if you need to process lots of parallel read/write requests to that data, the performance loss might become significant, since every pair of parallel read/write requests to that data might cause a conflict.
The approach that I would try is to use a data structure where the entries themselves are stored inside a TVar (vs putting the whole data structure into a single TVar). That should reduce the possibility of livelocks, as transactions won't conflict that often.
Of course, you still want to keep your transactions as small as possible and use composability only if it's absolutely necessary to guarantee consistency. So far I haven't encountered a scenario where combining more than a few insert/lookup operations into a single transaction was necessary.
Beyond performance, I see a more fundamental reason to use TVar: the type system ensures you don't do any "unsafe" operations like readIORef or writeIORef. That your data is shared is a property of the type, not of the implementation. EDIT: unsafePerformIO is always unsafe. readIORef is only unsafe if you are also using atomicModifyIORef. At the very least, wrap your IORef in a newtype and only expose a wrapped atomicModifyIORef.
Beyond that, don't use IORef; use MVar or TVar.
The first usage pattern you describe probably does not have nice performance characteristics. You likely end up being (almost) entirely single threaded--because of laziness no actual work happens each time you update the shared state, but whenever you need to use this shared state, the entire accumulated pile of thunks needs to be forced, and has a linear data dependency structure.
Having 80% efficiency but substantially higher parallelism allows you to exploit a growing number of cores. You can expect minimal performance improvements for single-threaded code over the coming years.
Multi-word CAS is likely coming to a processor near you in the form of "hardware transactional memory", allowing STMs to become far more efficient.
Your code will be more modular: when your design keeps all shared state behind a single reference, every piece of code has to be changed whenever you add more shared state. TVars, and to a lesser extent MVars, support natural modularity.

What is "overhead"?

I am a student in Computer Science and I am hearing the word "overhead" a lot when it comes to programs and sorts. What does this mean exactly?
It's the resources required to set up an operation. The setup might seem unrelated to the work itself, but it's necessary.
It's like when you need to go somewhere, you might need a car. But, it would be a lot of overhead to get a car to drive down the street, so you might want to walk. However, the overhead would be worth it if you were going across the country.
In computer science, sometimes we use cars to go down the street because we don't have a better way, or it's not worth our time to "learn how to walk".
The meaning of the word can differ a lot with context. In general, it's resources (most often memory and CPU time) that are used, which do not contribute directly to the intended result, but are required by the technology or method that is being used. Examples:
Protocol overhead: Ethernet frames, IP packets and TCP segments all have headers, and TCP connections require handshake packets. Thus, you cannot use the entire bandwidth the hardware is capable of for your actual data. You can reduce the overhead by using larger packet sizes, and UDP has a smaller header and no handshake.
Data structure memory overhead: A linked list requires at least one pointer for each element it contains. If the elements are the same size as a pointer, this means a 50% memory overhead, whereas an array can potentially have 0% overhead.
Method call overhead: A well-designed program is broken down into lots of short methods. But each method call requires setting up a stack frame, copying parameters and a return address. This represents CPU overhead compared to a program that does everything in a single monolithic function. Of course, the added maintainability makes it very much worth it, but in some cases, excessive method calls can have a significant performance impact.
You're tired and can't do any more work, so you eat food. The energy spent looking for food, getting it, and actually eating it is overhead!
Overhead is something wasted in order to accomplish a task. The goal is to make overhead very very small.
In computer science, let's say you want to print a number; that's your task. But storing the number, setting up the display to print it, calling routines to print it, and accessing the number from a variable are all overhead.
Wikipedia has us covered:
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal. It is a special case of engineering overhead.
Overhead typically refers to the amount of extra resources (memory, processor time, etc.) that different algorithms require.
For example, the overhead of inserting into a balanced binary tree can be much larger than the same insert into a simple linked list (the insert takes longer and uses more processing power to rebalance the tree, which results in a longer perceived operation time for the user).
For a programmer, overhead refers to the system resources consumed by your code when it's running on a given platform with a given set of input data. Usually the term is used in the context of comparing different implementations or possible implementations.
For example, we might say that a particular approach might incur considerable CPU overhead, while another might incur more memory overhead, and yet another might be weighted toward network overhead (and entail an external dependency, for example).
Let's give a specific example: Compute the average (arithmetic mean) of a set of numbers.
The obvious approach is to loop over the inputs, keeping a running total and a count. When the last number is encountered (signaled by end-of-file (EOF), some sentinel value, some GUI button, whatever), we simply divide the total by the number of inputs and we're done.
This approach incurs almost no overhead in terms of CPU, memory or other resources. (It's a trivial task).
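A minimal Java sketch of that streaming approach (reading numbers from standard input; input handling is deliberately simplified):

import java.util.Scanner;

// Streaming average: one pass, a running total and a count, O(1) memory.
// This is the "almost no overhead" approach described above.
public class StreamingAverage {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        double total = 0.0;
        long count = 0;
        while (in.hasNextDouble()) {   // stops at EOF or the first non-number
            total += in.nextDouble();
            count++;
        }
        System.out.println(count == 0 ? "no input" : "average = " + (total / count));
    }
}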
Another possible approach is to "slurp" the input into a list, iterate over the list to calculate the sum, and then divide by the number of valid items in the list.
By comparison this approach might incur arbitrary amounts of memory overhead.
In a particularly bad implementation, we might perform the sum operation using recursion without tail-call elimination. Now, in addition to the memory overhead for our list, we're also introducing stack overhead (which is a different sort of memory and is often a more limited resource than other forms of memory).
Yet another (arguably more absurd) approach would be to post all of the inputs to a table in an RDBMS and then simply call the SQL AVG function on that column. This shifts our local memory overhead to some other server, and incurs network overhead and external dependencies on our execution. (Note that the remote server may or may not have any particular memory overhead associated with this task; it might shove all the values immediately out to storage, for example.)
Hypothetically we might consider an implementation over some sort of cluster (possibly to make the averaging of trillions of values feasible). In this case any necessary encoding and distribution of the values (mapping them out to the nodes) and the collection/collation of the results (reduction) would count as overhead.
We can also talk about the overhead incurred by factors beyond the programmer's own code. For example, compilation of some code for 32- or 64-bit processors might entail greater overhead than one would see on an old 8-bit or 16-bit architecture. This might involve larger memory overhead (alignment issues), CPU overhead (where the CPU is forced to adjust bit ordering or use non-aligned instructions, etc.), or both.
Note that the disk space taken up by your code and its libraries, etc. is not usually referred to as "overhead" but rather is called "footprint." Also, the base memory your program consumes (without regard to any data set that it's processing) is called its "footprint" as well.
Overhead is simply the extra time consumed during program execution. For example, when we call a function, control is passed to where it is defined, its body is executed, and control is then passed back to the former position; the CPU runs through this whole detour, which costs time, hence overhead. One way to reduce this overhead is to declare the function inline, which copies the body of the function to the call site, so we don't pass control to some other location but continue the program in a straight line.
You could use a dictionary; the definition is the same. But to save you time: overhead is the work required in order to do the productive work. For instance, an algorithm runs and does useful work, but requires memory to do so. The memory allocation takes time and is not directly related to the work being done; it is therefore overhead.
You can check Wikipedia, but mainly it's when more actions or resources are used. For example, if you are familiar with .NET, there are value types and reference types; reference types have memory overhead, as they require more memory than value types.
A concrete example of overhead is the difference between a "local" procedure call and a "remote" procedure call.
For example, with classic RPC (and many other remote frameworks, like EJB), a function or method call looks the same to a coder whether it's a local, in-memory call or a distributed, network call.
For example:
service.function(param1, param2);
Is that a normal method, or a remote method? From what you see here you can't tell.
But you can imagine that the difference in execution time between the two calls is dramatic.
So, while the core implementation will "cost the same", the "overhead" involved is quite different.
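A small Java illustration of why the call site hides the overhead: both implementations satisfy the same interface, so service.function(...) looks identical whether the work stays in-process or crosses the network. The "remote" variant below is only a stand-in that simulates latency, not a real RPC framework.

// The caller sees one interface; the overhead lives in the implementation.
interface Service {
    String function(String param1, String param2);
}

// In-memory implementation: essentially just the core work.
class LocalService implements Service {
    public String function(String param1, String param2) {
        return param1 + ":" + param2;
    }
}

// Stand-in for a remote implementation: same signature, but each call pays
// for serialization, a network round trip, and deserialization.
class RemoteService implements Service {
    public String function(String param1, String param2) {
        byte[] request = (param1 + "," + param2).getBytes(); // serialization overhead
        byte[] response = sendOverNetwork(request);          // simulated round trip
        return new String(response);                         // deserialization overhead
    }

    private byte[] sendOverNetwork(byte[] request) {
        try {
            Thread.sleep(5); // pretend network latency; real calls can be far worse
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return request;
    }
}

class CallSiteDemo {
    public static void main(String[] args) {
        Service service = new LocalService(); // swap in new RemoteService(): same call site
        System.out.println(service.function("a", "b"));
    }
}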
Think of overhead as the time required to manage threads and coordinate among them. It is a burden if a thread does not have enough work to do. In such a case, the overhead cost outweighs the time saved by using threading, and the code takes longer than the sequential version.
To answer you, I'll give you the analogy of cooking rice.
Ideally, when we want to cook, we want everything to be available: the pots already clean, the rice on hand in sufficient quantity. If that is the case, then we take less time to cook our rice (less overhead).
On the other hand, suppose you don't have clean water immediately available and you don't have rice, so you need to buy rice from the shops first and fetch clean water from the tap outside your house. These extra tasks are not part of the cooking itself; ideally, your ingredients would already be present when you want to cook.
So the time spent going to buy your rice from the shops and fetching water from the tap is overhead relative to the standard way of cooking rice (everything is around you, and you don't have to waste time gathering your ingredients).
The time wasted collecting the ingredients is what we call the overhead.
In computer science, for example in multithreading, communication overhead among threads arises when threads have to take turns accessing a shared resource or when they pass information or data to each other. Overhead also comes from context switching. Even though this coordination is crucial, it is time (CPU cycles) spent on something other than the actual work, compared to single-threaded programming, where there is no such communication cost; a single-threaded program does the work straight away.
It's anything other than the data itself, i.e. TCP flags, headers, CRC, FCS, etc.
