Should I enable SMP on heterogeneous multi-threaded CPU's? - linux-kernel

I'm building the Linux kernel for a big.LITTLE board and I've been wondering about the CONFIG_SMP option, which enables the kernel's Symmetric-processing support.
Linux's documentation says this should be enabled on Multi-Threaded processors, but I wonder if Symmetric Multi processing wouldn't only work properly on processors that are actually symmetric.
I understand what SMP is, but I haven't found any hint or documentation saying anything about it's use on Linux built for ARM's big.LITTLE.

Yes, if you want to use more than a single core you have to enable CONFIG_SMP. This in itself will make all cores (both big and little ones) available to the kernel.
Then, you have two options (I'm assuming you are using the mainline Linux kernel or something not excessively different from it, e.g. not an Android kernel):
If you also enable CONFIG_BL_SWITCHER (-> Kernel Features -> big.LITTLE support -> big.LITTLE switcher support) and CONFIG_ARM_BIG_LITTLE_CPUFREQ (-> CPU Power Management -> CPU Frequency scaling -> CPU Frequency scaling -> Generic ARM big LITTLE CPUfreq driver), each big core in your SoC will be paired to a little core, and only one of the cores in each pair will be active at any given time, depending on the CPU load. So basically the number of logical cores will be half the number of physical cores, and each logical core will combine one physical big core and one physical little core (unless the total number of big cores differs from the number of little cores, in which case there will be non-paired physical cores that are also logical cores). For each logical core, switching between the big and little physical core will be managed by the cpufreq governor and will be conceptually equivalent to CPU frequency switching.
If you don't enable the above two configuration options, then all physical cores will be available as logical cores, can be active at the same time and are treated by the scheduler as if they were identical.
The first option is more suited if you are aiming at low power consumption, while the second option allows you to get the most out of the CPU.
This will change when Heterogeneous Multi-Processing (HMP) support is integrated in the mainline kernel.

Related

Intel OpenMP library slows down memory bandwidth significantly on AMD platforms by setting KMP_AFFINITY=scatter

For memory-bound programs it is not always faster to use many threads, say the same number as the cores, since threads may compete for memory channels. Usually on a two-socket machine, less threads are better but we need to set affinity policy that distributes the threads across sockets to maximize the memory bandwidth.
Intel OpenMP claims that KMP_AFFINITY=scatter is to achieve this purpose, the opposite value "compact" is to place threads as close as possible. I have used ICC to build the Stream program for benchmarking and this claim is easily validated on Intel machines. And if OMP_PROC_BIND is set, the native OpenMP env vars like OMP_PLACES and OMP_PROC_BIND are ignored. You will get such a warning:
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
However, a benchmark on a newest AMD EPYC machine I obtained shows really bizarre results. KMP_AFFINITY=scatter gives the slowest memory bandwidth possible. It seems that this setting is doing exactly the opposite on AMD machines: placing threads as close as possible so that even the L3 cache at each NUMA node is not even fully utilized. And if I explicitly set OMP_PROC_BIND=spread, it is ignored by Intel OpenMP as the warning above says.
The AMD machine has two sockets, 64 physical cores per socket. I have tested using 128, 64, and 32 threads and I want them to be spread across the whole system. Using OMP_PROC_BIND=spread, Stream gives me a triad speed of 225, 290, and 300 GB/s, respectively. But once I set KMP_AFFINITY=scatter, even when OMP_PROC_BIND=spread is still present, Streams gives 264, 144, and 72 GB/s.
Notice that for 128 threads on 128 cores, setting KMP_AFFINITY=scatter gives better performance, this even further suggests that in fact all the threads are placed as close as possible, but not scattering at all.
In summary, KMP_AFFINITY=scatter displays completely opposite (in the bad way) behavior on AMD machines and it will even overwrite native OpenMP environment regardless the CPU brand. The whole situation sounds a bit fishy, since it is well known that ICC detects the CPU brand and uses the CPU dispatcher in MKL to launch the slower code on non-Intel machines. So why can't ICC simply disable KMP_AFFINITY and restore OMP_PROC_BIND if it detects a non-Intel CPU?
Is this a known issue to someone? Or someone can validate my findings?
To give more context, I am a developer of commercial computational fluid dynamics program and unfortunately we links our program with ICC OpenMP library and KMP_AFFINITY=scatter is set by default because in CFD we must solve large-scale sparse linear systems and this part is extremely memory-bound. I found that with setting KMP_AFFINITY=scatter, our program becomes 4X slower (when using 32 threads) than the actual speed the program can achieve on the AMD machine.
Update:
Now using hwloc-ps I can confirm that KMP_AFFINITY=scatter is actually doing "compact" on my AMD threadripper 3 machine. I have attached the lstopo result. I run my CFD program (built by ICC2017) with 16 threads. OPM_PROC_BIND=spread can place one thread in each CCX so that L3 cache is fully utilized. Hwloc-ps -l -t gives:
While setting KMP_AFFINITY=scatter, I got
I will try the latest ICC/Clang OpenMP runtime and see how it works.
TL;DR: Do not use KMP_AFFINITY. It is not portable. Prefer OMP_PROC_BIND (it cannot be used with KMP_AFFINITY at the same time). You can mix it with OMP_PLACES to bind threads to cores manually. Moreover, numactl should be used to control the memory channel binding or more generally NUMA effects.
Long answer:
Thread binding: OMP_PLACES can be used to bound each thread to a specific core (reducing context switches and NUMA issues). OMP_PROC_BIND and KMP_AFFINITY should theoretically do that correctly, but in practice, they fail to do so on some systems. Note that OMP_PROC_BIND and KMP_AFFINITY are exclusive option: they should not be used together (OMP_PROC_BIND is a new portable replacement of the older KMP_AFFINITY environment variable). As the topology of the core change from one machine to another, you can use the hwloc tool to get the list of the PU ids required by OMP_PLACES. More especially hwloc-calc to get the list and hwloc-ls to check the CPU topology. All threads should be bound separately so that no move is possible. You can check the binding of the threads with hwloc-ps.
NUMA effects: AMD processors are built by assembling multiple CCX connected together with a high-bandwidth connection (AMD Infinity Fabric). Because of that, AMD processors are NUMA systems. If not taken into account, NUMA effects can result in a significant drop in performance. The numactl tool is designed to control/mitigate NUMA effects: processes can be bound to memory channels using the --membind option and the memory allocation policy can be set to --interleave (or --localalloc if the process is NUMA-aware). Ideally, processes/threads should only work on data allocated and first-touched on they local memory channels. If you want to test a configuration on a given CCX you can play with --physcpubind and --cpunodebind.
My guess is that the Intel/Clang runtime does not perform a good thread binding when KMP_AFFINITY=scatter is set because of a bad PU mapping (which could come from a OS bug, a runtime bug or bad user/admin settings). Probably due to the CCX (since mainstream processors containing multiple NUMA nodes were quite rare).
On AMD processors, threads accessing memory of another CCX usually pay an additional significant cost due to data moving through the (quite-slow) Infinity Fabric interconnect and possibly due to its saturation as well as the one of memory channels. I advise you to not trust OpenMP runtime's automatic thread binding (use OMP_PROC_BIND=TRUE), to rather perform the thread/memory bindings manually and then to report bugs if needed.
Here is an example of a resulting command line so as to run your application:
numactl --localalloc OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7}" ./app
PS: be careful about PU/core IDs and logical/physical IDs.

Weight-factor to determine Virtual Cores for a Virtual Machine

I have to design an algorithm which will decide to assign virtual cores to a VM.
e.g. I have 2 options to create a machines. That could be physical/virtual. Let's consider 2 cases:
If I require 1-core of 2.3 GHz, which means I require a processor having the ability to run 2.3 * 10^9 instructions. In the case of assigning a processor having these capabilities to a physical machine, it is ok.
But when I want to assign 1-core of 2.3 GHz to a virtual machine, I want to use a constant weight-factor of value 0.8. I divide the "number of instructions" i.e. 2.3 * 10^9 with the weight-factor 0.8. So, the required processing capability for the virtual-machine should be scaled by this factor. The value turns out to be 2.875 * 10^9.
I want to make sure that from you that, is this a correct way to scale the required processing capabilities by the use of a weight-factor, in the case of virtual-machines.
If yes, are there any related studies or proof of concepts to use this mechanism of determining the number of processors required for a virtual-machine?
In general; for SMT (e.g. hyper-threading) on 80x86 CPUs; with all logical CPUs within a core doing work:
If all logical CPUs are using different resources (e.g. maybe one using SSE instructions and another using general purpose integer instructions); each logical CPU may be as fast as it would be if it was the only logical CPU using the core
If all logical CPUs are fighting for the same resource/s, the performance of the core may be equally divided by the logical CPUs (e.g. with 2 logical CPUs per core, each logical CPU might get half of the performance it would have if it was the only logical CPU using the core).
Note that this may also apply to AMD's Bulldozer (even though it's not considered SMT), where the FPU is shared between cores but the rest of the core is not (in other words, if both cores are pounding the FPU at the same time then performance of both cores will be effected).
This means that (e.g.) for a 2.3 GHz core with 2 logical CPUs per core, each logical CPU may get (the crude equivalent of) anything from 0.75 GHz to 3.4 Ghz; depending on the exact code that each logical CPU happens to be executing and various power management conditions (thermal throttling, turbo-boost, etc).
However, actual performance also depends on things like caches (and cache sharing), RAM chip bandwidth, and virtual machine overheads (which vary from "extreme" for code causing a huge number of VMEXITs to almost nothing). With this in mind (e.g.) for a 2.3 GHz core, each logical CPU may get (the crude equivalent of) anything from a few hundred MHz to 3.4 Ghz; depending on many factors.
Essentially; your "weight" should be any random number from 0.1 to 1.0 depending on a bunch of stuff that you can't/won't know.
Fortunately, any code running inside the virtual machine is likely to be designed to handle a wide variety of different CPUs each with a wide variety of speeds; so it's enough to just assign any CPU to the virtual machine and let the software running inside the virtual machine adapt to whatever performance it was given.
Alternatively (if you need to guarantee some kind of performance or you want to try to hide the timing differences so that code in the VM doesn't know it's not running on real hardware); you can keep track of "virtual time" and "wall clock time" and try to keep these times roughly in sync. For example, if "virtual time" is moving too slow (e.g. because the code inside the VM is causing lots of VMEXITs) you can pretend that the virtual CPU got hot and started thermal throttling to create a plausible/realistic excuse that allows "virtual time" to catch up to "wall clock time"; and if something can happen sooner than it should (e.g. you know that guest is waiting for a virtual timer that will expire in 100 milliseconds and can pretend that 100 milliseconds passed when it didn't) you can deliberately slow down the virtual machine until "wall clock time" catches up to "virtual time". In this case it would be a good idea to give yourself some room to move (pretend the virtual CPU is slower than it could be, because it's easier to slow the virtual machine down than it is to speed it up). Of course this can also be used to hide timing differences caused by SMT, and can hide timing difference caused by sharing CPUs between VMs (e.g. when there's more virtual cores than real cores).
Note: The "alternative alternative" is to say that "virtual time" has nothing to do with "wall clock time" at all. This allows you to (e.g.) emulate a 6 GHz CPU when all you have is an old 1 GHz CPU - it'd just mean that 1 "virtual second" takes about 6 "wall clock seconds".
Also note that with all the security problems in the last 18+ months (e.g. spectre) I'd strongly consider using "cores" as the minimum assignable unit, such that at any point in time a VM gets all logical CPUs belonging to a core or none of the logical CPUs belonging to the core (and refuse to allow logical CPUs within the same core to be assigned to different virtual machines at the same time, because data will probably leak across any of the many side-channels from one VM to another).

is it possible to a vCPU to use different CPUs from two different hardware computers

I'v searched about this but i don't seem to get fair answer.
lets say i wan't to create a vm that has a vCPU, and that vCPU must have 10 cores but i only have 2 computers with 5 cores of physical CPU for each.
is it possible to create one vCPU by relaying on these two physical CPUs to perform like regular one physical CPU?
Update 1: lets say i'm using virtualBox, and the term vCPU is referring to virtual cpu, and it's a well known term.
Update 2: i'm asking this because i'm doing a little research about dynamic provisioning in HPC clusters, and i wan't to know if the word "dynamic" really means allocating virtual cpus dynamically from different hardwares, like bare-metal servers. i don't know if i was searching in the wrong place but no one really answers this question in the docs.
Unfortunately, I have to start by saying that I completely disagree with the answer from OSGX (and I have to start with that as the rest of my answer depends on it). There are documented cases where aggregating CPU power of multiple physical systems into a single system image work great. Even about the comment regarding ScaleMP ...solutions can be ranged from "make target application slower" to "make target application very-very slow" ... - all one needs to do to invalidate that claim is to check the top-rated machines in the SPEC CPU benchmark lists to see machines using ScaleMP are in the top 5 SMPs ever built for performance on this benchmark.
Also, from computer architecture perspective, all large scale machines are essentially a collection of smaller machines with a special fabric (Xbar, Numalink, etc.) and some logic/chipset to manage cache coherence. today's standard fabrics (PCIe Switching, InfiniBand) are just as fast, if not faster, than those proprietary SMP interconnects. Will OSGX claim those SMPs are also "very-very-slow"?
The real question, as with any technology, is what are you trying to achieve. Most technologies are a good fit for one task but not the other. If you are trying to build a large machine (say, combine 16 servers, each with 24 cores, into a 384-core SMP), on-top of which you will be running small VMs, each using single digit number of vCPUs, then this kind of SSI solution would probably work very nicely as to the underlying infrastructure you are merely running a high-throughput computing (HTC) job - just like SPEC CPU is. However, if you are running a thread-parallel software that excessively uses serializing elements (barriers, locks, etc) that require intensive communication between all cores - then maybe you won't see any benefit.
As to the original question on the thread, or rather, the "Update 2" by the author:
...I'm asking this because i'm doing a little research about dynamic provisioning in HPC clusters...
Indeed, there is not a lot of technology out there that enables the creation of a single system from CPUs across a cluster. The technology mentioned earlier, from ScaleMP, does this but only at a physical server granularity (so, if you have a cluster of 100 servers and each cluster node has 24 cores, then you can "dynamically" create virtual machines of 48 cores (2 cluster nodes), 72 cores (3 cluster nodes), and so on, but you could not create a machine with 36 cores (1.5 cluster nodes), nor combine a few vacant CPUs from across different nodes - you either use all the cores from a node to combine into a virtual SMP, or none at all.
I'll use term vCPU as virtual cores and pCPU as physical cores, as it is defined by virtualbox documentation: https://www.virtualbox.org/manual/ch03.html#settings-processor
On the "Processor" tab, you can set how many virtual CPU cores the guest operating systems should see. Starting with version 3.0, VirtualBox supports symmetrical multiprocessing (SMP) and can present up to 32 virtual CPU cores to each virtual machine. You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
And I will try to answer your questions:
lets say i wan't to create a vm that has a vCPU, and that vCPU must have 10 cores but i only have 2 computers with 5 cores of physical CPU for each.
If you want to create virtual machine (with single OS image, SMP machine) all virtual cores should have shared memory. Two physical machines each of 5 cores have in sum 10 cores, but they have no shared memory. So, with classic virtualization software (qemu, kvm, xen, vmware, virtualbox, virtualpc) you is not able to convert two physical machine into single virtual machine.
is it possible to create that vCPU by relaying on these two physical CPUs to perform like regular one physical CPU?
No.
Regular physical machine have one or more CPU chips (sockets) and each chip has one or more cores. First PC had 1 chip with one core; there were servers with two sockets with one core in each. Later multicore chips were made, and huge servers may have 2, 4, 6 or sometimes even 8 sockets, with some number of cores per socket. Also, physical machine has RAM - dynamic computer memory, which is used to store data. Earlier multisocket systems had single memory controller, current multisocket systems have several memory controllers (MC, 1-2 per socket, every controller with 1, 2, or sometimes 3 or 4 channels of memory). Both multicore and multisocket systems allow any CPU core to access any memory, even if it is controlled by MC of other socket. And all accesses to the system memory are coherent (Memorycoherence, Cachecoherence) - any core may write to memory and any other core will see writes from first core in some defined order (according to Consistency model of the system). This is the shared memory.
"two physical" chips of two different machines (your PC and your laptop) have not connected their RAM together and don't implement in hardware any model of memory sharing and coherency. Two different computers interacts using networks (Ethernet, Wifi, .. which just sends packets) or files (store file on USB drive, disconnect from PC, connect to laptop, get the file). Both network and file sharing are not coherent and are not shared memory
i'm using virtualBox
With VirtualBox (and some other virtualization solutions) you may allocate 8 virtual cores for the virtual machine even when your physical machine has 4 cores. But VMM will just emulate that there are 8 cores, scheduling them one after one on available physical cores; so at any time only programs from 4 virtual cores will run on physical cores (https://forums.virtualbox.org/viewtopic.php?f=1&t=30404 " core i7, this is a 4 core .. I can use up to 16 VCPU on virtual Machine .. Yes, it means your host cores will be over-committed. .. The total load of all guest VCPUs will be split among the real CPUs."). In this case you will be able to start 10 core virtual machine on 5 core physical, and application which want to use 10 cores will get them. But performance of the application will be not better as with 5 real CPUs, and it will be less, because there will be "virtual CPU switching" and frequent synchronization will add extra overhead.
Update 2: i'm asking this because i'm doing a little research about dynamic provisioning
If you want to research about "dynamic provisioning", ask about it, not about "running something unknown on two PC at the same time)
in HPC clusters,
There are no single type of "HPC" or "HPC clusters". Different variants of HPC will require different solutions and implementations. Some HPC tasks needs huge amounts of memory (0.25, 0.5, 1, 2 TB) and will run only on shared-memory 4- or 8-socket machines, filled with hugest memory DIMM modules. Other HPC tasks may use GPGPU a lot. Third kind will combine thread parallelism (OpenMP) and process parallelism (MPI), so applications will use shared memory while threads of it runs on single machine, and they will send and receive packets over network to work collectively on one task while running on several (thousands) physical machines. Fourth kind of HPC may want to have 100 or 1000 TB of shared memory; but there are no SMP / NUMA machines with such amounts, so application can be written in Distributed shared memory paradigm/model (Distributed global address space DGAS, Partitioned global address space PGAS) to run on special machines or on huge clusters. Special solutions are used, and in PGAS the global shared memory of 100s TB is emulated from many computers which are connected with network. Program is written in special language or just use special library functions to access memory (list of special variants from Wikipedia: PGAS "Unified Parallel C, Coarray Fortran, Split-C, Fortress, Chapel, X10, UPC++, Global Arrays, DASH and SHMEM"). If the address or the request is in local memory, use it; if it is in memory of other machine, send packet to that machine to request data from memory. Even with fastest (100 Gbit/s) special networks with RDMA capability (network adapter may access memory of the PC without any additional software processing of incoming network packet) the difference between local memory and memory of remote computer is speed: you have higher latency of access and you have lower bandwidth when memory is remote (remote memory is slower than local memory).
If you say "vCPU must have 10 cores" we can read this as "there is application which want 10 core of shared memory system". In theory it is possible to emulate shared memory for application (and it can be possible to create virtualization solution which will use resources from several PC to create single virtual pc with more resources), but in practice this is very complex task and the result probably will has too low performance. There is commercial ScaleMP (very high cost; Wikipedia: ScaleMP "The ScaleMP hypervisor combines x86 servers to create a virtual symmetric multiprocessing system. The process is a type of hardware virtualization called virtualization for aggregation.") and there was commercial Cluster OpenMP from Intel (https://software.intel.com/sites/default/files/1b/1f/6330, https://www.hpcwire.com/2006/05/19/openmp_on_clusters-1/) to convert OpenMP programs (uses threads and shared memory) into MPI-like software with help of library and OS-based handlers of access to remote memory. Both solutions can be ranged from "make target application slower" to "make target application very-very slow" (internet search of scalemp+slow and cluster+openmp+slow), as computer network is always slower that computer memory (network has greater distance than memory - 100m vs 0.2m, network has narrow bus of 2, 4 or 8 high-speed pairs while memory has 64-72 high-speed pairs for every memory channel; network adapter will use external bus of CPU when memory is on internal interface, most data from network must be copied to the memory to become available to CPU).
and i wan't to know if the word "dynamic" really means
no one really answers this question in the docs.
If you want help from other people, show us the context or the docs you have with the task. It can be also useful to you to better understand some basic concepts from computing and from cluster computing (Did you have any CS/HPC courses?).
There are some results from internet search request like "dynamic+provisioning+in+HPC+clusters", but we can't say is it the same HPC variant as you want or not.

Difference between core and processor

What is the difference between a core and a processor?
I've already looked for it on Google, but I only get definitions for multi-core and multi-processor, which is not what I am looking for.
A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations through ALUs. For optimization purposes, a core can also hold on-core caches with copies of frequently used memory chunks.
A CPU may have one or more cores to perform tasks at a given time. These tasks are usually software processes and threads that the OS schedules. Note that the OS may have many threads to run, but the CPU can only run X such tasks at a given time, where X = number cores * number of hardware threads per core. The rest would have to wait for the OS to schedule them whether by preempting currently running tasks or any other means.
In addition to the one or many cores, the CPU will include some interconnect that connects the cores to the outside world, and usually also a large "last-level" shared cache. There are multiple other key elements required to make a CPU work, but their exact locations may differ according to design. You'll need a memory controller to talk to the memory, I/O controllers (display, PCIe, USB, etc..). In the past these elements were outside the CPU, in the complementary "chipset", but most modern design have integrated them into the CPU.
In addition the CPU may have an integrated GPU, and pretty much everything else the designer wanted to keep close for performance, power and manufacturing considerations. CPU design is mostly trending in to what's called system on chip (SoC).
This is a "classic" design, used by most modern general-purpose devices (client PC, servers, and also tablet and smartphones). You can find more elaborate designs, usually in the academy, where the computations is not done in basic "core-like" units.
An image may say more than a thousand words:
* Figure describing the complexity of a modern multi-processor, multi-core system.
Source:
https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
Let's clarify first what is a CPU and what is a core, a central processing unit CPU, can have multiple core units, those cores are a processor by itself, capable of execute a program but it is self contained on the same chip.
In the past one CPU was distributed among quite a few chips, but as Moore's Law progressed they made to have a complete CPU inside one chip (die), since the 90's the manufacturer's started to fit more cores in the same die, so that's the concept of Multi-core.
In these days is possible to have hundreds of cores on the same CPU (chip or die) GPUs, Intel Xeon. Other technique developed in the 90's was simultaneous multi-threading, basically they found that was possible to have another thread in the same single core CPU, since most of the resources were duplicated already like ALU, multiple registers.
So basically a CPU can have multiple cores each of them capable to run one thread or more at the same time, we may expect to have more cores in the future, but with more difficulty to be able to program efficiently.
CPU is a central processing unit. Since 2002 we have only single core processor i.e. we will only perform a single task or a program at a time.
For having multiple programs run at a time we have to use the multiple processor for executing multi processes at a time so we required another motherboard for that and that is very expensive.
So, Intel introduced the concept of hyper threading i.e. it will convert the single CPU into two virtual CPUs i.e we have two cores for our task. Now the CPU is single, but it is only pretending (masqueraded) that it has a dual CPU and performs multiple tasks. But having real multiple cores will be better than that so people develop making multi-core processor i.e. multiple processors on a single box i.e. grabbing a multiple CPU on single big CPU. I.e. multiple cores.
In the early days...like before the 90s...the processors weren't able to do multi tasks that efficiently...coz a single processor could handle just a single task...so when we used to say that my antivirus,microsoft word,vlc,etc. softwares are all running at the same time...that isn't actually true. When I said a processor could handle a single process at a time...I meant it. It actually would process a single task...then it used to pause that task...take another task...complete it if its a short one or again pause it and add it to the queue...then the next. But this 'pause' that I mentioned was so small (appx. 1ns) that you didn't understand that the task has been paused. Eg. On vlc while listening to music there are other apps running simultaneously but as I told you...one program at a time...so the vlc is actually pausing in between for ns so you dont underatand it but the music is actually stopping in between.
But this was about the old processors...
Now-a- days processors ie 3rd gen pcs have multi cored processors. Now the 'cores' can be compared to a 1st or 2nd gen processors itself...embedded onto a single chip, a single processor. So now we understood what are cores ie they are mini processors which combine to become a processor. And each core can handle a single process at a time or multi threads as designed for the OS. And they folloq the same steps as I mentioned above about the single processor.
Eg. A i7 6gen processor has 8 cores...ie 8 mini processors in 1 i7...ie its speed is 8x times the old processors. And this is how multi tasking can be done.
There could be hundreds of cores in a single processor
Eg. Intel i128.
I hope I explaned this well.
I have read all answers, but this link was more clear explanation for me about difference between CPU(Processor) and Core. So I'm leaving here some notes from there.
The main difference between CPU and Core is that the CPU is an electronic circuit inside the computer that carries out instruction to perform arithmetic, logical, control and input/output operations while the core is an execution unit inside the CPU that receives and executes instructions.
Intel's picture is helpful, as shown by Tortuga's best answer. Here's a caption for it.
Processor: One semiconductor chip, the CPU (central processing unit) seated in one socket, circa 1950s-2010s. Over time, more functions have been packed onto the CPU chip. Prior to the 1950s releases of single-chip processors, one processor might have spread across multiple chips. In the mid 2010s the system-on-a-chip chips made it slightly more sketchy to equate one processor to one chip, though that's generally what people mean by processor, as in "this computer has an i7 processor" or "this computer system has four processors."
Core: One block of a CPU, executing one instruction at a time. (You'll see people say one instruction per clock cycle, but some CPUs use multiple clock cycles for some instructions.)

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic bit more I want to know if there is ANY way to (get the trace) see the actual dynamic reordering done for a given program?
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to actual reordering done inside the CPU (there is no publically known way to enable tracing). But there is some emulators of reordering and some of them can give you useful hints.
For modern Intel CPUs (core 2, nehalem, Sandy and Ivy) there is "Intel(R) Architecture Code Analyzer" (IACA) from Intel. It's homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool allows you to look how some linear fragment of code will be splitted into micro-operations and how they will be planned into execution Ports. This tool has some limitations and it is only inexact model of CPU u-op reordering and execution.
There are also some "external" tools for emulating x86/x86_84 CPU internals, I can recommend the PTLsim (or derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models some "PTL" cpu, not real AMD or Intel CPU. The good news is that this PTL is Out-Of-Order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, in arbeit http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf is said that JavaHASE applet is capable of emulating different simple CPUs and even supports Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro operations and then schedule those. What these micro instructions are and the entire process of instruction reordering is a closely guarded secret, so they don't exactly want you to know what is going on.

Resources