Golang - Concurrency vs Parallelism vs Sequential

I am learning Go at the moment and got quite frustrated trying to understand the difference between concurrency, parallelism, and sequential execution.
Let's say we have a process that scrapes a slice of 5 URLs and pastes the content into a text file. The process takes 2 seconds per URL.
Sequentially -> it takes 10 seconds since it does one after the other
Parallel -> takes less than 10 seconds since it does them simultaneously but using multiple threads or processors.
Concurrently -> takes less than 10 seconds but it does not require multiple threads or processors.
Not sure if I am right until here. My questions are:
I read that parallelism does things simultaneously (running and listening to music for example) and concurrency handles things at the same time (getting breakfast done while ironing shirts for example).
But if that's the case, why does the concurrent version not take 10 seconds, since at the end of the day you are not doing everything at the same time but just doing bits of everything until it is all complete?

Here's an analogy to explain.
You need to fry 5 eggs, sunny side up. To cook an egg you crack it onto the griddle, wait for a few minutes, then take it off.
The sequential approach is to fry the first egg to completion, then fry the second egg to completion, and so on, until you have 5 fried eggs.
The parallel approach is to hire 5 cooks, tell each of them to fry an egg, and wait until they are all finished.
The concurrent approach is that you cook all 5 eggs yourself the way you would actually do it. That is, you quickly crack each egg onto the pan, then take each one off when it's ready.
The reason you're able to save time without having to hire 5 cooks is because the number of cooks wasn't what was limiting you from going faster. It takes a couple minutes to cook an egg, but it only occupies your attention and your hands for a few seconds at the beginning and end.
The Go runtime and modern OS runtimes are similarly smart. They know that while your thread is waiting to receive a network response, the processor can look for other things to occupy its attention.
The larger picture of concurrency is concerned not primarily with the number of processors, but with resource contention in general. The execution of tasks demands resources, and we cannot use more resources than are available. Processors are one resource, but there is also memory storage, memory bandwidth, network bandwidth, file handles, and the list goes on.
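To make the egg analogy concrete in Go, here is a minimal sketch of the URL scenario from the question (the fetch helper and its 2-second sleep are stand-ins for real scraping, not an actual HTTP call): it starts one goroutine per URL and waits for all of them, and because each goroutine spends nearly all of its time blocked, the whole run takes about 2 seconds even on a single processor.

package main

import (
    "fmt"
    "sync"
    "time"
)

// fetch stands in for scraping one URL; the 2-second sleep simulates
// waiting on the network, which is where nearly all of the time goes.
func fetch(url string) string {
    time.Sleep(2 * time.Second)
    return "contents of " + url
}

func main() {
    urls := []string{"url1", "url2", "url3", "url4", "url5"} // placeholder URLs

    start := time.Now()
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        go func(u string) { // one goroutine per URL
            defer wg.Done()
            fmt.Println(fetch(u))
        }(u)
    }
    wg.Wait() // wait for all 5 goroutines to finish
    fmt.Println("took", time.Since(start)) // roughly 2s, not 10s, even with GOMAXPROCS=1
}

The sequential version is the same loop without the go keyword (and without the WaitGroup); it takes about 10 seconds because each fetch only starts after the previous one has finished.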

Related

Why isn't my application's performance scaling with more CPUs?

I am running a piece of software that is very parallel. There are about 400 commands I need to run that don't depend on each other at all, so I just fork them off and hope that having more CPUs means more processes executed per unit time.
Code:
foreach cmd ($CMD_LIST)
    $cmd &    # fork it off
end
Very simple. Here are my testing results:
On 1 CPU, this takes 1006 seconds, or 16 mins 46 seconds.
With 10 CPUs, this took 600s, or 10 minutes!
Why wouldn't the time taken divide (roughly) by 10? I feel cheated here =(
Edit - of course I'm willing to provide any additional details you want to know; I'm just not sure what's relevant, because in the simplest terms this is all I'm doing.
You are assuming your processes are 100% CPU-bound.
If your processes do any disk or network I/O, the bottleneck will be on those operations, which cannot be parallelised (e.g. one process will download a file at 100k/s, two processes at 50k/s each, so you would see no improvement at all; in fact you could experience a degradation in performance because of the overhead).
See Amdahl's law - it lets you estimate the improvement in performance when parallelising tasks, given the proportion between the parallelisable part and the non-parallelisable part.
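As a rough illustration of Amdahl's law, here is a small sketch (the serial fractions below are invented figures, not measurements of your workload) that evaluates speedup = 1 / (serial + (1 - serial)/n). Your 1006s dropping to only 600s on 10 CPUs (about a 1.7x speedup) is what the formula predicts when somewhere around half of the total time is spent in the non-parallelisable part.

package main

import "fmt"

// amdahl returns the theoretical speedup for a workload in which a
// fraction `serial` of the time cannot be parallelised, run on n CPUs.
func amdahl(serial float64, n int) float64 {
    return 1 / (serial + (1-serial)/float64(n))
}

func main() {
    for _, s := range []float64{0.10, 0.25, 0.50, 0.55} {
        fmt.Printf("serial fraction %.2f, 10 CPUs -> %.2fx speedup\n", s, amdahl(s, 10))
    }
    // 1006s / 600s is about 1.68x, which matches a serial fraction of
    // roughly 0.55, i.e. over half the time is spent in I/O.
}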

Is 16 milliseconds an unusually long length of time for an unblocked thread running on Windows to be waiting for execution?

Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for Windows thread management code, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation prevention mechanism: the so-called Balance Set Manager wakes up every second and looks for ready threads that haven't run for about 3 or 4 seconds; if it finds one, it boosts its priority to 15 and gives it double the normal quantum. It does this for no more than 10 threads per pass (once per second) and scans no more than 16 threads at each priority level per pass. At the end of the quantum, the boosted priority drops back to its base value. You can find out more in the Windows Internals book(s).
So what you observe is pretty normal behavior: threads may not run for seconds.
You may need to elevate priorities, or otherwise look at which other threads are competing for the CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for one of the high-precision timers. Some details in this MSDN link.
First of all, I am not sure Delphi's Now is a good choice for millisecond-precision measurements. The GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision in critical section locking, everything runs pretty fast; however, if you try to enter a critical section that is currently locked by another thread, you eventually hit a wait operation on an internal kernel object (mutex or event), which involves yielding control of the thread and waiting for the scheduler to give control back later.
The "later" above depends on a few things, including the priorities mentioned above, and there is one important thing you omitted from your test: the overall CPU load at the time of your testing. The higher the load, the lower the chance that the thread gets to continue execution soon. 16 ms still looks roughly within reasonable tolerance, and all in all it might depend on your actual implementation.

On the throughput of the Linux scheduler when there are more threads than cores

I have done some measurements of the Linux scheduler. The kernel is "Linux version 2.6.18-194.el5 (mockbuild@x86-005.build.bos.redhat.com)" and the machine has 8 CPUs. The measurement is the only workload on that machine.
The measurement consists of two sets. In the first set, 8 threads are set up, each with the same computation cost. In the second set, one thread is split into two, resulting in 9 threads in total (2 of which cost half as much as each of the other 7 threads).
When I run the two measurement sets, I expect the throughput to be the same, since the total computation cost is the same and the Linux scheduler should (though I'm not sure) schedule those two smaller threads on one core. It turns out there is a dramatic decrease in throughput from 8 threads to 9 threads. Does anyone have an idea what the reason could be?
Edit: @Waldheinz: the threads are set up in order (say 0, 1, ... 7) and an (endless) stream of tuples goes through from thread 0 and 1 up to thread 7. Each tuple spends some time on each thread, doing some computation. All 8 threads have the same computation cost, as in the first set of measurements.
Update: if the number of threads is changed to 16, meaning every core has two threads, throughput improves back to the 8-thread case...
Linux 2.6.18 is quite old now, dating to 2006, and multi-core systems were not as common or important back then. It's possible that your benchmark exercises some of the deficiencies of the O(1) scheduler that the kernel used up until 2.6.23. I forget exactly what those problems were, but it sounds plausible. The O(1) part refers to the fact that overhead of scheduling is essentially constant, but even though that was the case, the scheduler made poor decisions in some situations.
If you can, try a more recent kernel (after 2.6.23) and see if the new completely fair scheduler makes a difference.
Nine women can have nine babies in nine months, a rate of nine months per baby per person. One woman can have one baby in nine months, again a rate of nine months per baby per person. But nine women still need eighteen months to have ten babies, a much worse rate of more than sixteen months per baby per person!
You are assigning your threads chunks of work that are too large and not running your test for long enough to smooth out the chunk size.
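If you want to redo this kind of measurement with the chunking under your control, here is a minimal Go harness (an illustration, not a reproduction of your tuple pipeline): a fixed amount of CPU-bound work is cut into many small chunks that N workers pull from a shared channel, so the load stays balanced even when N doesn't divide the core count evenly, and you can vary the worker count, the chunk count, and the total amount of work to see how much of the 8-vs-9-thread effect survives on your kernel.

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

// spin burns CPU for n iterations of trivial arithmetic.
func spin(n int) int {
    x := 0
    for i := 0; i < n; i++ {
        x += i & 3
    }
    return x
}

// measure splits `total` iterations into `chunks` pieces on a channel and
// has `workers` goroutines drain it, returning the elapsed wall-clock time.
func measure(total, chunks, workers int) time.Duration {
    jobs := make(chan int, chunks)
    for i := 0; i < chunks; i++ {
        jobs <- total / chunks
    }
    close(jobs)

    start := time.Now()
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for n := range jobs {
                spin(n)
            }
        }()
    }
    wg.Wait()
    return time.Since(start)
}

func main() {
    fmt.Println("cores:", runtime.NumCPU())
    const total = 2_000_000_000 // total iterations of busy work
    for _, workers := range []int{8, 9, 16} {
        fmt.Printf("%2d workers, 1000 small chunks: %v\n", workers, measure(total, 1000, workers))
    }
}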

Proving that replacing hardware will improve developer performance

Now the machines we are forced to use are 2 GB RAM, Intel Core 2 Duo E6850 @ 3 GHz CPU...
The policy within the company is that everyone has the same computer no matter what and that they are on a 3 year refresh cycle... Meaning I will have this machine for the next 2 years... :S
We have been complaining like crazy but they said they want proof that upgrading the machines will provide exactly X time saving before doing anything... And with that they are only semi considering giving us more RAM...
Even when you put forward that developer resources are much more expensive than hardware, they first say go away, and then after a while they say prove it. As far as they are concerned, paying wages comes from a different bucket of money than the machines, so they don't care (i.e. the people who could replace the machines don't care, because paying wages doesn't come from their pockets)...
So how can I prove that $X benefit will be gained by spending $Y on new hardware...
The stack I'm working with is as follows: VS 2008, SQL 2005/2008. As duties dictate, we are SQL admins as well as Web/Winform/WebService Developers. So it's very typical to have 2 VS sessions and at least one SQL session open at the same time.
Cheers
Anthony
Actually, the main cost for your boss is not the lost productivity. It is that his developers don't enjoy their working conditions. This leads to:
loss of motivation and productivity
more stress causing illness
external opportunities causing developers to go away
That sounds like a decent machine for your stack. Have you proven to yourself that you're going to get better performance, using real-world tests?
Check with your IT people to see if you can get the disks benchmarked, and max out the memory. Management should be more willing to take these incremental steps first.
The machine looks fine apart from the RAM.
If you want to prove this sort of thing, time all the things you wait for (typically load times and compile times), add it all up, and work out how much it costs you to sit around. From that, make some sort of guess at how much time you'll save (it'll have to be a guess unless you can compare like with like, which is difficult if they won't upgrade your systems). You'll probably find that they'll make the money back on the RAM at least in next to no time, and that's before you even begin to factor in the loss of productivity from people's minds wandering while they wait for stuff to happen.
Unfortunately if they're skeptical then it's unlikely you can prove it to them in a quantitative way alone. Even if you came up with numbers, they'll probably question the methodology. I suggest you see if they're willing to watch a 10 minute demo (maybe call it a presentation), and show them the experience of switching between VS instances (while explaining why you'd need to switch and how often), show them the build process (again explaining why you'd need to create a build and how often), etc.
Ask them if you're allowed to bring your own hardware. If you're really convinced it would make you more productive, upgrade it yourself and when you start producing more ask for a raise or to be reimbursed.
Short of that, though...
I have to ask: what else are you running? I'm not really that familiar with that stack, but it really shouldn't be that taxing. Are they forcing you to run some kind of system-slowing monitoring or antivirus app?
You'd probably have better luck convincing them to let you change that than getting them to roll out new updates.
If you really must convince them, your best bet is to benchmark your machine as accurately as you can and price out exactly what you need upgraded. It's a lot easier to get them to agree to an exact (and low) dollar amount than to some open-ended upgrade.
Even discussing this with them for more than five minutes will cost more than just calling your local PC dealer and buying the RAM out of your own pocket. Ask your project lead whether they can put it on the tab of the project as another "development tool". If (s)he can't, don't bother and cough up the money yourself.
When they come complaining, put the time of the meetings about this on their budget (since they are the ones who came crying). See how long they can take that.
When we had the same issue, my boss bought better gfx cards for the whole team out of his own pockets and went to the PC guys to get each of us a second monitor. A few days later, he went again to get each of us 2GB more RAM, too.
The main cost of slow developer machines comes from the slow builds and the 'context switching', i.e. the time it takes you to switch between the tasks required of you:
Firing up the second instance of VS and waiting for it to load and build
Checking out or updating a source tree
Starting up another instance of VS or checking out a clean source tree to 'have a quick look at' some bug that's been assigned
Multiple build/debug cycles to fix difficult bugs
The mental overhead in switching between different tasks, which shouldn't be underestimated
I made a case a while ago for new hardware after doing a breakdown of the amount of time that was wasted waiting for the machine to catch up. In a typical day we might need to do 2 or 3 full builds at half an hour each. The link time was around 3 minutes, and in a build/debug cycle you might do that 40 times a day. So that's 3.5 hours a day waiting for the machine. The bulk of that is in small 2 or 3 minute pockets which isn't long enough for you to context switch and do something else. It's long enough to check your mail, check stackoverflow, blow your nose and that's about it. So there's nothing else productive you can do with that time.
If you can show that a new machine will build the full project in 15 minutes and link in 1 minute then that's theoretically given you an extra 2 hours of productivity a day (or more realistically, the potential for more build cycles).
So I would get some objective timings that show how long it takes for different parts of your work cycle, then try to do comparative timings on machines with 4GB of RAM, a second drive (eg something fast like a WD Raptor), an SSD, whatever, to come up with some hard figures to support your case.
EDIT: I forgot to mention: present this as your current hardware making you lose productivity, and put a cost on the amount of time lost by multiplying it by a typical developer hourly rate. On this basis I was able to show that a new PC would pay for itself in about a month.
Take a task you do regularly that would be improved with faster hardware - ex: running the test suite, running a build, booting and shutting down a virtual machine - and measure the time it takes with current hardware and with better hardware.
Then compute the monthly, or yearly cost: how many times per month x time gained x hourly salary, and see if this is enough to make a case.
For instance, suppose you make $10,000/month and would gain 5 minutes a day with a better machine. The loss to your company per month would be around (5/60 hours lost per day) x 20 work days/month x ($10,000 / (20 work days x 8 hours/day)) ≈ $105/month. Or about $1,200/year lost because of the machine (assuming I didn't mess up the math...). Now before talking to your manager, think about whether this number is significant.
Now this is assuming that 1) you can measure the improvement, even though you don't have a better machine, and 2) while you are "wasting" your 5 minutes a day, you are not doing anything productive, which is not obvious.
For me, the cost of a slow machine is more psychological, but it's hard to quantify - after a few days of having to wait for things to happen on the PC, I begin to get cranky, which is both bad for my focus, and my co-workers!
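If it helps to make the arithmetic above reusable, here is a tiny sketch (the salary, minutes-saved, and working-day figures are just the example numbers from the previous paragraphs, not real data) that turns minutes saved per day into a monthly and yearly cost.

package main

import "fmt"

func main() {
    const (
        monthlySalary = 10000.0 // example salary from above, $/month
        minutesPerDay = 5.0     // time saved per day with a better machine
        workDays      = 20.0    // work days per month
        hoursPerDay   = 8.0
    )

    hourlyRate := monthlySalary / (workDays * hoursPerDay)      // $62.50/hour
    monthlyLoss := (minutesPerDay / 60) * workDays * hourlyRate // hours lost/month x rate

    fmt.Printf("hourly rate:  $%.2f\n", hourlyRate)
    fmt.Printf("monthly loss: $%.2f\n", monthlyLoss)    // about $104
    fmt.Printf("yearly loss:  $%.2f\n", monthlyLoss*12) // about $1250
}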
It’s easy; hardware is cheap, developers are expensive. Throwing reasonable amounts of money at the machinery should be an absolute no brainer and if your management doesn’t understand that and won’t be guided by your professional opinion then you might be in the wrong job.
As for your machine, throw some more RAM at it and use a fast disk (have a look at how intensive VS is on disk I/O using the Resource Monitor – it's very hungry). Lots of people are going towards 10,000 RPM drives or even SSDs these days, and they make a big difference to your productivity.
Try this; take the price of the hardware you need (say fast disk and more RAM), split it across a six month period (a reasonable time period in which to recoup the investment) and see what it’s worth in “developer time” each day. You’ll probably find it only needs to return you a few minutes a day to pay for itself. Once again, if your management can’t understand or support this then question if you’re in the right place.

Mac OS X: load average

I'm wondering about the load average I always see on my computer; it's 1.76, 1.31, 1.08 at the moment. What does it mean?
If your CPU were a hot dog stand, the load averages would tell you the average number of people standing in line to get served. A load of less than 1.0 means that the hot dog vendor has some spare time between customers; 1.0 means that, while a line never piles up, some customer is always talking to the vendor (who knows where all the hot dogs are going...). Having a "low" load average doesn't mean that your computer isn't doing anything, though; there could be customers who take a really long time to eat their hot dog before getting back in line (that is to say, some process might be waiting on the disk or network to wake it up with some data when it arrives), so having a faster disk or net connection could improve your total dog sales.
Enjoy.
The load average tries to measure the number of active processes at any time. As a measure of CPU utilization, the load average is simplistic, poorly defined, but far from useless. High load averages usually mean that the system is being used heavily and the response time is correspondingly slow. What's high? ... Ideally, you'd like a load average under, say, 3, ... Ultimately, 'high' means high enough so that you don't need uptime to tell you that the system is overloaded.
The load averages shown are for the past 1, 5, and 15 minutes.
The load indicates how many processes are waiting in the queue to be executed. You would have a load of 1 if you have just enough processes that at least one process always has to queue up before it gets executed (i.e. it is not executed immediately).
The numbers in uptime are averages over those time periods.
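If you want to read those numbers programmatically rather than from uptime, here is a small sketch for macOS (shelling out to sysctl is an assumption about your setup; the standard library has no portable call for this):

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

func main() {
    // `sysctl -n vm.loadavg` prints something like "{ 1.76 1.31 1.08 }":
    // the 1-, 5- and 15-minute load averages, the same values uptime shows.
    out, err := exec.Command("sysctl", "-n", "vm.loadavg").Output()
    if err != nil {
        fmt.Println("could not read load average:", err)
        return
    }
    trimmed := strings.Trim(strings.TrimSpace(string(out)), "{}")
    fmt.Println("1m, 5m, 15m load averages:", strings.Fields(trimmed))
}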
