I didn't understand this sentence. Please explain it to me in detail, in easy English:
Go routines are cooperatively scheduled, rather than relying on the kernel to manage their time sharing.
Disclaimer: this is a rough, simplified description of scheduling in the kernel and in the Go runtime, aimed at explaining the concepts rather than at being an exact or detailed explanation of the real system.
As you may (or may not) know, a CPU can't actually run two programs at the same time: a CPU has only one execution thread, which can execute one instruction at a time. The direct consequence on early systems was that you couldn't run two programs at the same time, since each program needed (system-wise) a dedicated thread.
The solution currently adopted is called pseudo-parallelism: given a number of logical threads (e.g. multiple programs), the system executes one of them for a certain amount of time, then switches to the next one. With really small time slices (on the order of milliseconds), you give the human user the illusion of parallelism. This operation is called scheduling.
The Go language doesn't use this system directly: it implements its own scheduler, which runs on top of the system scheduler and schedules the execution of goroutines itself, avoiding the performance cost of using a real OS thread for each goroutine. This kind of system is called lightweight (or green) threading.
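To make that concrete, here is a minimal Go sketch (the numbers and the "work" are arbitrary): it limits the runtime to two OS threads and launches a thousand goroutines, which the Go scheduler multiplexes onto those threads without the kernel having to manage each one.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// Limit the number of OS threads executing Go code at once to 2;
	// the goroutines below are multiplexed onto them by the Go
	// scheduler, not by the kernel.
	runtime.GOMAXPROCS(2)

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Trivial placeholder work; the Go scheduler decides
			// which of the two OS threads each goroutine runs on.
			_ = id * id
		}(i)
	}
	wg.Wait()
	fmt.Println("all goroutines finished on", runtime.GOMAXPROCS(0), "OS threads")
}
```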
Related
Let us say we have a fictitious single-core CPU with a program counter, a basic instruction set (Load, Store, Compare, Branch, Add, Mul), and some ROM and RAM. Upon switching it on, it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing than, say, a Branch.
However, from an outside perspective, if the clock speed remains constant, then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question: what does it mean for the program to "stop"? Does it usually just branch into an infinite loop, or does the PC halt while the CPU waits for an interrupt?
First of all, the idea that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states, which allow for energy saving when there is not much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop that does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives a measure of how busy the CPU is.
So while in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, and so on.
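To make the contrast concrete, here is a minimal Go sketch (the durations are arbitrary) of busy-waiting versus letting the OS idle the CPU:

```go
package main

import "time"

// busyWait spins until the deadline, keeping the CPU near 100%
// even though no useful work is being done.
func busyWait(d time.Duration) {
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		// do nothing, just burn cycles
	}
}

// idleWait yields to the OS scheduler, which can run other
// processes or put the CPU to sleep until the timer fires.
func idleWait(d time.Duration) {
	time.Sleep(d)
}

func main() {
	busyWait(100 * time.Millisecond)
	idleWait(100 * time.Millisecond)
}
```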
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states. The oscillator is still producing cycles, but no instructions execute because parts of the circuitry are idled and the cycles never reach them. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and start executing instructions before your program (i.e. the IP, or Instruction Pointer) actually reaches them. You don't want to include instructions whose execution never actually completes, for example because the processor guessed wrong and has to flush them after a branch mispredict. So a better metric is to count only those instructions that actually complete. Instructions that complete are termed "retired".
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
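As a back-of-the-envelope illustration, here is a minimal sketch of that ratio (the counter values are made up for the example; real numbers would come from a profiler or performance counters):

```go
package main

import "fmt"

// cpi computes cycles-per-instruction from two hardware counters:
// unhalted core cycles and retired instructions.
func cpi(unhaltedCycles, retiredInstructions uint64) float64 {
	return float64(unhaltedCycles) / float64(retiredInstructions)
}

func main() {
	// Made-up counter readings: 3 billion unhalted cycles spent
	// retiring 2 billion instructions.
	fmt.Printf("CPI = %.2f\n", cpi(3_000_000_000, 2_000_000_000)) // CPI = 1.50
}
```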
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don't directly contribute to computing the solution, such as control instructions, is very subjective and difficult to gather statistics on. (There are some you can gather, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, which is the number of SIMD vector elements, such as SSE or AVX lanes, processed per vector instruction executed. These instructions are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I've talked your ear off, check out some of the optimization resources at your friendly local Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I'm not suggesting you need to get VTune, though you can get a free or heavily discounted student license (I think). But the material will tell you a lot about increasing your program's performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not running it anymore).
I've been going through this tutorial on parallel pipelines, and while there is definitely a considerable difference in throughput, couldn't it be even better if the compression stage also took on a read job, since it's just waiting around anyway? The same goes for the write stage... I mean, why not take on a third compression, then switch over to writing two, and then have one of those cores go back to compressing while the other wraps up the third write, and so on?
I apologize if this is obvious. I imagine this is standard practice and has a name; I'm just not sure what it is. Is there any overhead involved with switching jobs like this?
And I know this might be the wrong forum for this last question, but can the GPU switch jobs like this or should the programmable shaders/CUDA cores pretty much be left alone after being programmed?
EDIT: I guess I also don't understand how taking the same six cores used in the 2 cores/stage example would be faster than just giving each of the six cores all three stages. Sure, there would be two cores that would do two, but that's still faster than the top scenario. I would understand it better in the GPU's case, since there is specialized hardware involved for certain computations, but generally speaking I don't see it. Maybe this example is weak or something, because I know parallel processing is here to stay.
This is definitely an issue with pipelining and there are a number of different ways to try and mitigate it.
With specialized hardware, the design will often be tuned to balance the time taken in each stage for typical workloads. Fixed-function stages in GPUs, for example, are typically balanced around the needs of a sample of representative game rendering workloads, with transistors allocated to try to balance the time taken in each stage. Even with static balancing like this, there will usually still be some wasted performance.
An alternative approach that can be used in both software and hardware to balance a pipeline is to break the longer stages down into multiple shorter steps. This is a common strategy in CPU instruction pipelines but can also be useful in software. In your example, the longer running compression step could potentially be broken down into multiple shorter pipeline stages. Depending on the task this may be difficult or impossible to do efficiently however.
Task scheduling systems can be used to help balance workloads across CPUs in a software pipeline. In a task scheduling system, you have a number of worker threads (usually around one per hardware thread) and any task can run on any worker thread. You have an API to set up dependencies between tasks and the task scheduler is responsible for scheduling tasks to run wherever CPU time is available once their dependencies are satisfied. In your example, the cores with idle time running the Read and Write tasks could help out with Compress tasks rather than sitting idle as long as the Compress tasks had their Read task dependencies satisfied.
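Here is a minimal Go sketch of that idea, assuming a read → compress → write pipeline with stand-in stage functions: a small pool of workers pulls compress jobs from a channel, so whichever worker is free takes the next job instead of sitting idle.

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-in stage functions; in the real pipeline these would do
// disk I/O and actual compression.
func read(i int) string        { return fmt.Sprintf("chunk-%d", i) }
func compress(s string) string { return "compressed(" + s + ")" }
func write(s string)           { fmt.Println("wrote", s) }

func main() {
	chunks := make(chan string)
	compressed := make(chan string)

	// Read stage: one goroutine produces chunks.
	go func() {
		for i := 0; i < 10; i++ {
			chunks <- read(i)
		}
		close(chunks)
	}()

	// Compress stage: a pool of workers; any free worker takes the
	// next chunk, which balances the load across cores.
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range chunks {
				compressed <- compress(c)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(compressed)
	}()

	// Write stage: drain the results.
	for c := range compressed {
		write(c)
	}
}
```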
Traditional OS thread schedulers can give some of the same benefits of a task scheduling system. In your example, if the Read threads waited on a semaphore when their work queues were empty (to be signalled when new work was added to the queues), the OS could schedule Compress threads to run on those idle cores. This can work reasonably well for relatively long running pipeline stages (10s of milliseconds) but for shorter pipeline stages (sub 1ms) the overhead of the OS thread scheduling and the length of the thread time slice will likely mean a task scheduling system would give better performance.
Your points are valid. The tutorial is lacking.
If the read, compress, and write operations can all occur at once, independently, the simple non-pipelined case would be the fastest for the six cores. Also notice that in the six-core diagram, the reads and writes never overlap, so the same cores could handle both. You only need four cores.
But consider a case where the reads all access the same disk, so issuing too many read operations in parallel makes the reads take longer because they interfere with each other. In this case you can gain by pipelining the reads, since you start the first compress steps sooner, and those limit the overall performance.
What would be a typical or real-world problem where you would use parallel programming? It can be quite challenging to implement. On the internet, people explain how to use it, but not why.
Performance is the most common reason to use parallel programming. But: not all programs will become faster by using parallel programming. In most cases your algorithm consists of parts that are parallelizable and parts that are inherently sequential. You always have to reason about the potential performance gain of using parallel programming. In some cases the overhead of using it will actually make your program slower. Have a look at Amdahl's law to learn more about the potential performance improvements you can achieve.
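For reference, Amdahl's law can be written as follows, where p is the fraction of the work that can be parallelized and s is the speedup of that part:

```latex
S(s) = \frac{1}{(1 - p) + \frac{p}{s}},
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p}
```

So even with unlimited cores, a program that is only 90% parallelizable (p = 0.9) can never get more than a 10x overall speedup.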
If you just want some examples of parallel computations in use: there are some classes of algorithms that are inherently parallel; see this article, the dwarfs of Berkeley.
Another reason for using a multithreaded application architecture is responsiveness. There are certain functions which block program execution for a certain amount of time, e.g. reads from files, network access, waiting for user input, etc. While waiting like this does not consume CPU power, it often blocks or slows program flow.
Using threads in such cases is simply a good practice to make the code clearer. Instead of using (often complex or unintuitive) checks for inputs, integrating those checks into the program flow, and manually switching between handling input and other tasks, a programmer may choose to use threads and let one thread wait for input while another, for example, performs calculations.
In other words, multiple threads sometimes allow for better use of the different resources at your computer's disposal: the network, the disk, input devices, or simply the monitor.
Generalization: using multiple threads (including parallel data processing) is advisable when the speed and responsiveness gains outweigh the synchronization costs and work required to parallelize the application.
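A minimal Go sketch of that pattern (the "calculation" is just a placeholder heartbeat): one goroutine blocks on user input while the main goroutine keeps doing other work.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

func main() {
	input := make(chan string)

	// One goroutine blocks on user input without stalling the rest
	// of the program.
	go func() {
		scanner := bufio.NewScanner(os.Stdin)
		if scanner.Scan() {
			input <- scanner.Text()
		}
	}()

	// The main goroutine keeps doing other work (here just a
	// heartbeat) until input arrives.
	for {
		select {
		case line := <-input:
			fmt.Println("got input:", line)
			return
		case <-time.After(500 * time.Millisecond):
			fmt.Println("still working...")
		}
	}
}
```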
The reason there is increased interest in parallel programming is partly that the hardware we use today is more parallel (multicore processors, many-core GPUs). To fully benefit from this hardware, you need to program in parallel.
Interestingly, parallel processing also improves battery life:
Having 4 cores at 1 GHz draws less power than a single core at 4 GHz.
A phone with a multicore CPU will try to run as many tasks as possible simultaneously, so it can turn off the CPU when all the work is done. This is sometimes called "the rush to idle".
Now, some programs are easier to parallelize than others. You should not randomly try to parallelize your entire code base. But it can be a useful exercise to do so even if there is no business reason: you will be more ready the day you really need it.
There are very few problems which can't be solved more quickly by a parallel program than by a serial program. There are very few computers which do not have multiple processing units.
I conclude, therefore, that you should use parallel programming all the time.
I have a CSV file with over 1 million rows. I also have a database that contains such data in a formatted way.
I want to check and verify the data in the CSV file and the data in the database.
Is it beneficial (does it reduce time) to use threads for reading from the CSV file and a connection pool for the database?
How well does Ruby handle threading?
I am using MongoDB, also.
It's hard to say without knowing more details about what you want the app to feel like when someone initiates this comparison. So, to answer, here is some general advice that should apply fairly well regardless of the problem you might want to thread.
Threading does NOT make something computationally less costly
Threading doesn't make things less costly in terms of computation time. It just lets two things happen in parallel. So, beware that you're not falling into the common misconception that, "Threading makes my app faster because the user doesn't wait for things." - this isn't true, and threading actually adds quite a bit of complexity.
So, if you kick off this DB vs. CSV comparison task, threading isn't going to make that comparison take any less time. What it might do is allow you to tell the user, "Ok, I'm going to check that for you," right away, while doing the comparison in a separate thread of execution. You still have to figure out how to get back to the user when the comparison is done.
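As a minimal sketch of that shape, in Go rather than Ruby (the comparison function is a made-up stand-in): kick the comparison off in the background, tell the user right away, and report back when it finishes.

```go
package main

import (
	"fmt"
	"time"
)

// compareCSVWithDB is a stand-in for the real comparison; pretend it
// returns the number of mismatched rows.
func compareCSVWithDB() int {
	time.Sleep(2 * time.Second) // simulate the long-running comparison
	return 0
}

func main() {
	done := make(chan int)

	// Run the comparison in the background...
	go func() {
		done <- compareCSVWithDB()
	}()

	// ...tell the user right away, then report back when it's done.
	fmt.Println("Ok, I'm going to check that for you")
	mismatches := <-done
	fmt.Println("comparison finished, mismatches:", mismatches)
}
```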
Think about WHY you want to thread, rather than simply approaching it as whether threading is a good solution for long tasks
Like I said above, threading doesn't make things faster. At best, it uses computing resources in a way that is either more efficient, or gives a better user experience, or both.
If the user of the app (maybe it's just you) doesn't mind waiting for the comparison to run, then don't add threading because you're just going to add complexity and it won't be any faster. If this comparison takes a long time and you'd rather "do it in the background" then threading might be an answer for you. Just be aware that if you do this you're then adding another concern, which is, how do you update the user when the background job is done?
Threading involves extra overhead and app complexity, which you will then have to manage within your app - tread lightly
There are other concerns as well, such as: how do I schedule that worker thread to make sure it doesn't hog the computing resources? Is setting thread priorities an option in my environment, and if so, how will adjusting them affect the use of computing resources?
Threading and the extra overhead involved will almost definitely make your comparison take LONGER (in terms of absolute time it takes to do the comparison). The real advantage is if you don't care about completion time (the time between when the comparison starts and when it is done) but instead the responsiveness of the app to the user, and/or the total throughput that can be achieved (e.g. the number of simultaneous comparisons you can be running, and as a result the total number of comparisons you can complete within a given time span).
Threading doesn't guarantee that your available CPU cores are used efficiently
See Green Threads vs. native threads - some languages (depending on their threading implementation) can schedule threads across CPUs.
Threading doesn't necessarily mean your threads wind up getting run in multiple physical CPU cores - in fact in many cases they definitely won't. If all your app's threads run on the same physical core, then they aren't truly running in parallel - they are just splitting CPU time in a way that may make them look like they are running in parallel.
For these reasons, depending on the structure of your app, it's often less complicated to send background tasks to a separate worker process (process, not thread), which can easily be scheduled onto available CPU cores at the OS level. Separate processes (as opposed to separate threads) also remove a lot of the scheduling concerns within your app, because you essentially offload the decision about how to schedule things onto the OS itself.
This last point is pretty important. OS schedulers are extremely likely to be smarter and more efficiently designed than whatever algorithm you might come up with in your app.
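If you go the separate-process route, the sketch looks roughly like this (the compare-worker binary and its flag are hypothetical): the app simply launches the worker, and the OS decides where and when it runs.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// "compare-worker" is a hypothetical binary that does the heavy
	// comparison; the OS scheduler decides which core it runs on.
	out, err := exec.Command("compare-worker", "--input", "data.csv").Output()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("worker finished: %s", out)
}
```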
Is there any performance benefit to using multiple threads on a computer with a single CPU that does not have hyperthreading?
In terms of speed of computation, no. In fact, things will slow down due to the overhead of managing the threads.
In terms of responsiveness, yes. You can for example have one thread wait on an IO operation and have another run a GUI at the same time.
It depends on your application. If it spends all its time using the CPU, then multithreading will just slow things down - though you may be able to use it to be more responsive to the user and thus give the impression of better performance.
However, if your code is limited by other things, for example using the file system, the network, or any other resource, then multithreading can help, since it allows your application to behave asynchronously. So while one thread is waiting for a file to load from disk, another can be querying a remote webserver and another redrawing the GUI, while another is doing various calculations.
Working with multiple threads can also simplify your business logic, since you don't have to pay so much attention to how various independent tasks need to interleave. If the operating system's scheduling logic is better than yours, then you may indeed see improved performance.
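A minimal Go sketch of that kind of overlap, with a placeholder file name and URL: two goroutines spend most of their time blocked on disk and network I/O, so even a single CPU can make progress on the computation in the meantime.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(2)

	// One goroutine waits on disk I/O...
	go func() {
		defer wg.Done()
		data, err := os.ReadFile("input.dat") // placeholder file name
		fmt.Println("file read:", len(data), "bytes, err:", err)
	}()

	// ...while another waits on the network; on a single CPU these
	// overlap because both spend most of their time blocked.
	go func() {
		defer wg.Done()
		resp, err := http.Get("https://example.com") // placeholder URL
		if err == nil {
			resp.Body.Close()
		}
		fmt.Println("web request done, err:", err)
	}()

	// Meanwhile, the main goroutine keeps computing.
	sum := 0
	for i := 0; i < 1_000_000; i++ {
		sum += i
	}
	fmt.Println("computation result:", sum)

	wg.Wait()
}
```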
You can consider using multithreading on a single CPU:
If you use network resources
If you do I/O-intensive operations
If you pull data from a database
If you rely on other resources with possible delays
If you want your app to react very quickly
When you should not use multithreading on a single CPU:
For CPU-intensive operations that push CPU usage to almost 100%
If you are not sure how to use threads and synchronization
If your application cannot be divided into several parallel parts
Yes, there is a benefit of using multiple threads (or processes) on a single CPU - if one thread is busy waiting for something, others can continue doing useful work.
However this can be offset by the overhead of task switching. You'll have to benchmark and/or profile your application on production grade hardware to find out.
Regardless of the number of CPUs available, if you require preemptive multitasking and/or applications with asynchronous components (i.e. pretty much anything that combines a responsive GUI with a non-trivial amount of computation or continuous I/O processing), multithreading performs much better than the alternative, which is to use multiple processes for each application.
This is because threads in the same process can exchange data much more efficiently than can multiple processes, because they share the same memory context.
See this Wikipedia article on computer multitasking for a fairly concise discussion of these issues.
Absolutely! If you do any kind of I/O, there is a great advantage to having a multithreaded system. While one thread waits for an I/O operation (which is relatively slow), another thread can do useful work.