Is there a rough-and-ready ratio for converting AWS's EC2 vCPU to Google's GCE vCPU?
For instance, what would be the Google equivalent of an Amazon c5.4xlarge with 16 vCPU? Is this as simple as a 1:1 ratio, and so the equivalent would be Google's c2-standard-16? Or is there some multiplier lurking in the threads . . .
For Intel/AMD machine types they mean exactly the same thing on GCP and AWS: a hyperthread, i.e. a single hardware thread on a CPU core.
On Compute Engine, each virtual CPU (vCPU) is implemented as a single hardware hyper-thread on one of the available CPU Platforms.
https://cloud.google.com/compute/docs/faq
Each vCPU is a thread of either an Intel Xeon core or an AMD EPYC core, except for M6g instances, A1 instances, T2 instances, and m3.medium.
Each vCPU on M6g instances is a core of the AWS Graviton2 processor.
Each vCPU on A1 instances is a core of an AWS Graviton Processor.
https://aws.amazon.com/ec2/instance-types/
The Graviton2 processor is an AWS-exclusive ARM CPU that is marketed as offering a lower cost per unit of performance. Presumably "each vCPU is a core" there because each core in that chip has only one hardware thread.
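As a quick sanity check, here is a minimal sketch (my addition, not from the original answers) that prints how many logical CPUs the guest OS sees; on both a c5.4xlarge and a c2-standard-16 it should report 16, since each vCPU is one hardware hyper-thread.

```c
/* Minimal sketch: print the number of logical CPUs the guest OS sees.
   On both a c5.4xlarge and a c2-standard-16 this should report 16,
   because each vCPU is one hardware hyper-thread. Assumes a POSIX system. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long vcpus = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs = vCPUs */
    printf("logical CPUs (vCPUs) visible: %ld\n", vcpus);
    return 0;
}
```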
Related
I've searched about this but I don't seem to get a fair answer.
Let's say I want to create a VM that has a vCPU, and that vCPU must have 10 cores, but I only have 2 computers with 5 physical CPU cores each.
Is it possible to create one vCPU by relying on these two physical CPUs so that it performs like a regular single physical CPU?
Update 1: Let's say I'm using VirtualBox, and the term vCPU refers to a virtual CPU; it's a well-known term.
Update 2: I'm asking this because I'm doing a little research about dynamic provisioning in HPC clusters, and I want to know if the word "dynamic" really means allocating virtual CPUs dynamically from different hardware, like bare-metal servers. I don't know if I was searching in the wrong place, but no one really answers this question in the docs.
Unfortunately, I have to start by saying that I completely disagree with the answer from OSGX (and I have to start with that, as the rest of my answer depends on it). There are documented cases where aggregating the CPU power of multiple physical systems into a single system image works great. Even regarding the comment about ScaleMP, "...solutions can be ranged from 'make target application slower' to 'make target application very-very slow'...": all one needs to do to invalidate that claim is to check the top-rated machines in the SPEC CPU benchmark lists and see that machines using ScaleMP are among the top 5 SMPs ever built for performance on this benchmark.
Also, from a computer architecture perspective, all large-scale machines are essentially a collection of smaller machines with a special fabric (Xbar, NUMAlink, etc.) and some logic/chipset to manage cache coherence. Today's standard fabrics (PCIe switching, InfiniBand) are just as fast, if not faster, than those proprietary SMP interconnects. Would OSGX claim those SMPs are also "very-very slow"?
The real question, as with any technology, is what you are trying to achieve. Most technologies are a good fit for one task but not another. If you are trying to build a large machine (say, combine 16 servers, each with 24 cores, into a 384-core SMP), on top of which you will be running small VMs, each using a single-digit number of vCPUs, then this kind of SSI solution would probably work very nicely, because to the underlying infrastructure you are merely running a high-throughput computing (HTC) job, just like SPEC CPU is. However, if you are running thread-parallel software that makes heavy use of serializing constructs (barriers, locks, etc.) requiring intensive communication between all cores, then maybe you won't see any benefit.
As to the original question on the thread, or rather, the "Update 2" by the author:
...I'm asking this because I'm doing a little research about dynamic provisioning in HPC clusters...
Indeed, there is not a lot of technology out there that enables the creation of a single system from CPUs across a cluster. The technology mentioned earlier, from ScaleMP, does this, but only at physical-server granularity: if you have a cluster of 100 servers and each cluster node has 24 cores, then you can "dynamically" create virtual machines of 48 cores (2 cluster nodes), 72 cores (3 cluster nodes), and so on, but you could not create a machine with 36 cores (1.5 cluster nodes), nor combine a few vacant CPUs from across different nodes. You either use all the cores from a node to combine into a virtual SMP, or none at all.
I'll use the term vCPU for virtual cores and pCPU for physical cores, as defined by the VirtualBox documentation: https://www.virtualbox.org/manual/ch03.html#settings-processor
On the "Processor" tab, you can set how many virtual CPU cores the guest operating systems should see. Starting with version 3.0, VirtualBox supports symmetrical multiprocessing (SMP) and can present up to 32 virtual CPU cores to each virtual machine. You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
And I will try to answer your questions:
Let's say I want to create a VM that has a vCPU, and that vCPU must have 10 cores, but I only have 2 computers with 5 physical CPU cores each.
If you want to create a virtual machine (a single OS image, an SMP machine), all virtual cores should have shared memory. Two physical machines with 5 cores each have 10 cores in total, but they have no shared memory. So with classic virtualization software (QEMU, KVM, Xen, VMware, VirtualBox, Virtual PC) you are not able to convert two physical machines into a single virtual machine.
Is it possible to create that vCPU by relying on these two physical CPUs so that it performs like a regular single physical CPU?
No.
A regular physical machine has one or more CPU chips (sockets), and each chip has one or more cores. The first PCs had one chip with one core; later there were servers with two sockets, one core each. Then multicore chips were made, and large servers may have 2, 4, 6 or sometimes even 8 sockets, with some number of cores per socket. A physical machine also has RAM, the dynamic memory used to store data. Earlier multi-socket systems had a single memory controller; current multi-socket systems have several memory controllers (MCs, 1-2 per socket, each controller with 1, 2, or sometimes 3 or 4 memory channels). Both multicore and multi-socket systems allow any CPU core to access any memory, even if it is controlled by the MC of another socket. And all accesses to system memory are coherent (memory coherence, cache coherence): any core may write to memory, and any other core will see the writes from the first core in some defined order (according to the consistency model of the system). This is shared memory.
The "two physical" chips of two different machines (your PC and your laptop) do not have their RAM connected together and do not implement in hardware any model of memory sharing and coherency. Two different computers interact using networks (Ethernet, Wi-Fi, .. which just send packets) or files (store a file on a USB drive, disconnect it from the PC, connect it to the laptop, read the file). Both networking and file sharing are not coherent and are not shared memory.
I'm using VirtualBox
With VirtualBox (and some other virtualization solutions) you may allocate 8 virtual cores to a virtual machine even when your physical machine has only 4 cores. The VMM will simply emulate 8 cores, scheduling them one after another on the available physical cores, so at any time only programs from 4 virtual cores actually run on physical cores (https://forums.virtualbox.org/viewtopic.php?f=1&t=30404 "core i7, this is a 4 core .. I can use up to 16 VCPU on virtual Machine .. Yes, it means your host cores will be over-committed. .. The total load of all guest VCPUs will be split among the real CPUs."). In this case you will be able to start a 10-core virtual machine on a 5-core physical host, and an application that wants to use 10 cores will get them. But the performance of the application will be no better than with 5 real CPUs, and it will actually be lower, because the "virtual CPU switching" and frequent synchronization add extra overhead.
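Here is a minimal sketch of that effect (my addition, assuming gcc with -fopenmp): run it with OMP_NUM_THREADS set higher than the number of physical cores and the wall-clock time will not improve over using the physical core count, which illustrates the oversubscription overhead described above.

```c
/* Oversubscription demo: requesting more threads than physical cores still
   runs, but the extra threads are time-sliced onto the same cores. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const long iters = 200000000L;
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < iters; i++)
        sum += 1.0 / (double)(i + 1);           /* CPU-bound work */
    double t1 = omp_get_wtime();
    printf("threads=%d time=%.3fs sum=%f\n",
           omp_get_max_threads(), t1 - t0, sum);
    return 0;
}
```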
Update 2: I'm asking this because I'm doing a little research about dynamic provisioning
If you want to research "dynamic provisioning", ask about that, not about "running something unknown on two PCs at the same time".
in HPC clusters,
There is no single type of "HPC" or "HPC clusters". Different variants of HPC require different solutions and implementations. Some HPC tasks need huge amounts of memory (0.25, 0.5, 1, 2 TB) and will only run on shared-memory 4- or 8-socket machines filled with the largest memory DIMM modules. Other HPC tasks make heavy use of GPGPUs. A third kind combines thread parallelism (OpenMP) with process parallelism (MPI), so the application uses shared memory while its threads run on a single machine, and sends and receives packets over the network to work collectively on one task across several (or thousands of) physical machines.
A fourth kind of HPC may want 100 or 1000 TB of shared memory; there are no SMP/NUMA machines with such amounts, so the application can be written in the distributed shared memory paradigm (distributed global address space, DGAS; partitioned global address space, PGAS) to run on special machines or on huge clusters. Special solutions are used, and in PGAS a global shared memory of hundreds of TB is emulated across many computers connected by a network. The program is written in a special language or just uses special library functions to access memory (a list of variants from Wikipedia's PGAS article: "Unified Parallel C, Coarray Fortran, Split-C, Fortress, Chapel, X10, UPC++, Global Arrays, DASH and SHMEM"). If the requested address is in local memory, it is used directly; if it is in the memory of another machine, a packet is sent to that machine to request the data. Even with the fastest (100 Gbit/s) special networks with RDMA capability (the network adapter may access the memory of the PC without any additional software processing of the incoming packet), the difference between local memory and the memory of a remote computer is speed: you have higher latency and lower bandwidth when the memory is remote (remote memory is slower than local memory).
If you say "vCPU must have 10 cores" we can read this as "there is application which want 10 core of shared memory system". In theory it is possible to emulate shared memory for application (and it can be possible to create virtualization solution which will use resources from several PC to create single virtual pc with more resources), but in practice this is very complex task and the result probably will has too low performance. There is commercial ScaleMP (very high cost; Wikipedia: ScaleMP "The ScaleMP hypervisor combines x86 servers to create a virtual symmetric multiprocessing system. The process is a type of hardware virtualization called virtualization for aggregation.") and there was commercial Cluster OpenMP from Intel (https://software.intel.com/sites/default/files/1b/1f/6330, https://www.hpcwire.com/2006/05/19/openmp_on_clusters-1/) to convert OpenMP programs (uses threads and shared memory) into MPI-like software with help of library and OS-based handlers of access to remote memory. Both solutions can be ranged from "make target application slower" to "make target application very-very slow" (internet search of scalemp+slow and cluster+openmp+slow), as computer network is always slower that computer memory (network has greater distance than memory - 100m vs 0.2m, network has narrow bus of 2, 4 or 8 high-speed pairs while memory has 64-72 high-speed pairs for every memory channel; network adapter will use external bus of CPU when memory is on internal interface, most data from network must be copied to the memory to become available to CPU).
and I want to know if the word "dynamic" really means
no one really answers this question in the docs.
If you want help from other people, show us the context or the docs you have for the task. It would also be useful for you to understand some basic concepts of computing and cluster computing better (have you taken any CS/HPC courses?).
There are some results for an internet search like "dynamic+provisioning+in+HPC+clusters", but we can't say whether that is the same HPC variant you want or not.
What is the difference between a host processor and coprocessor? Specifically Xeon Phi coprocessor and Xeon Phi host processor?
I have some performance results on these machines (a parallelized OpenMP code for the diffusion equation was being run) which show that the host processor works much faster for the same number of threads. I would like to know the differences and relate them to my results.
Just to re-iterate what Jeff said in the comments, you have a Xeon host with an attached Xeon Phi coprocessor. The current generation of Xeon Phi (Knights Corner) is only available as a coprocessor, not as a standalone Xeon Phi host (which should become available in the next generation, Knights Landing).
When you run your program without offloading from your host Xeon, from this website, it looks like you'll be able to run with up to 16 threads. Note that the speed of each of your cores is about 2.2 GHz.
When you run your program in native execution mode on your Xeon Phi coprocessor, you should be able to run with a lot more threads. The optimal number of threads to use depends on the model of Xeon Phi you have (some work best with 56, others with 60). But note that each Xeon Phi core (roughly 1.2 GHz) is noticeably weaker than a single Xeon core (roughly 2.2 GHz). The benefit of the many-core Xeon Phi technology is exactly that: you can run across many cores.
The last very important thing to consider is that the Xeon Phi has a 512-bit-wide SIMD instruction set. Thus, you can exploit SIMD vectorization much more effectively on the Xeon Phi coprocessor than on the host. In your case, I believe your Xeon host only has a 256-bit SIMD vector processing unit. Therefore, if you haven't already, you can improve your performance (up to 16x if you're dealing in single precision) on your Xeon Phi by taking advantage of SIMD vectorization, whereas your Xeon host will only give up to 8x. Just to start you on a Google trek, OpenMP 4.0 allows you to write things like #pragma omp simd in order to tell the compiler when to vectorize lower-level loops throughout your code. If you really want maximum performance from the Xeon Phi, adding SIMD vectorization is a necessity.
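As an illustration of that last point, here is a minimal OpenMP 4.0 SIMD sketch (my addition, assuming a compiler with OpenMP 4.0 support such as icc or a recent gcc, compiled with -fopenmp); on the Xeon Phi's 512-bit vector unit the loop body can process 16 single-precision elements per instruction.

```c
/* Explicit vectorization with #pragma omp simd (OpenMP 4.0). */
#include <stdio.h>

static void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp simd                  /* ask the compiler to vectorize this loop */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 1024 };
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy(N, 2.0f, x, y);
    printf("y[10] = %f\n", y[10]);    /* expect 21.0 */
    return 0;
}
```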
So to directly answer your question: comparing the performance results between your Xeon host and Xeon Phi coprocessor using the same number of cores is useless. We already know that each Xeon Phi core is slower than each Xeon core. You should be comparing the results using the maximum number of cores each allows (60, and 16 respectively) and taking maximum advantage of the vector processing unit if you want a direct comparison.
If you are talking about the current generation (KNC) and not the next (KNL), these are the definitions.
Host processor: The ~8 core/ ~16 thread Xeon that is hosting the coprocessor, meaning the Xeon host off of which the coprocessor is connected via the PCIe bus.
Coprocessor: The ~60 core/~240 thread coprocessor that is hanging off of your Xeon host on the Xeon's PCIe bus.
The host farms off highly parallel / vectorizeable jobs to the coprocessor using either offload instructions or by running them natively using some distributed programming paradigm such as MPI.
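For reference, here is a rough sketch of what "offload instructions" look like with the Intel compiler's offload pragmas (LEO) for the 1st-gen Xeon Phi; the exact clause syntax varies by compiler version, so treat this as illustrative rather than definitive.

```c
/* Sketch of the offload model: the marked region runs on the coprocessor,
   and the in/out clauses move the arrays over the PCIe bus. Non-Intel
   compilers will simply ignore the offload pragma. */
#include <stdio.h>
#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma offload target(mic) in(a, b) out(c)   /* copy a,b in; copy c back   */
    {
        #pragma omp parallel for                  /* use the coprocessor's cores */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[10] = %f (computed on the coprocessor)\n", c[10]);
    return 0;
}
```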
As to the comment about the next generation host processor, the commenter is referring to the fact that the next generation Xeon Phi (KNL) can be configured either as a coprocessor hanging off the PCIe bus (like the 1st gen Xeon Phi, KNC) or as a normal processor that you plug into a motherboard.
I have 2 instances on Amazon EC2. One is a t2.micro machine acting as a web cache server; the other runs a performance-test tool.
When I started a test, TPS (transactions per second) was about 3000. But a few minutes later, TPS dropped to 300.
At first I thought that the CPU credit balance was exhausted, but it was sufficient to process requests. During the test, the maximum outgoing traffic of the web cache was 500 Mbit/s, CPU usage was 60%, and free memory was more than enough.
I couldn't find any cause for the TPS decrease. Is there any limitation on the EC2 machine or network?
There are several factors that could be constraining your processes.
CPU credits on T2 instances
As you referenced, T2 instances use credits for bursting CPU. They are very powerful machines, but each instance is limited to a certain amount of CPU. t2.micro instances are given 10% of CPU, meaning they actually get 100% of the CPU only 10% of the time (at low millisecond resolution).
Instances start with CPU credits for a fast start, and these credits are consumed when the CPU is used faster than the credits are earned. However, you say that the credit balance was sufficient, so this appears not to be the cause.
Network Bandwidth
Each Amazon EC2 instance can use a certain throughput of network bandwidth. Smaller instances have 'low' bandwidth, bigger instances have more. There is no official statement of bandwidth size, but this is an interesting reference from Serverfault: Bandwidth limits for Amazon EC2
Disk IOPS
If your application uses disk access for each transaction, and your instance is using a General Purpose (SSD) instance type, then your disk may have consumed all available burst credits. If your disk is small, this could mean it will run slow (speed is 3 IOPS per GB, so a 20GB disk would run at 60 IOPS). Check the Amazon CloudWatch VolumeQueueLength metric to see if IO is queuing excessively.
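A quick sketch of that baseline arithmetic (my addition; the 20 GB volume size is just the example used in the paragraph above):

```c
/* gp2 baseline IOPS arithmetic: 3 IOPS per provisioned GB. */
#include <stdio.h>

int main(void) {
    int size_gb = 20;
    int baseline_iops = 3 * size_gb;   /* 20 GB -> 60 IOPS once burst credits run out */
    printf("a %d GB gp2 volume sustains about %d IOPS after its burst credits are gone\n",
           size_gb, baseline_iops);
    return 0;
}
```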
Something else
The slowdown could also be due to your application or cache system (e.g. running out of free memory for storing data).
I am aware that the Intel Xeon Phi coprocessor SE10X has 61 cores
and it is suggested to use only 60 cores, since 1 core is used for the offload daemon.
Also, since the Intel Xeon Phi coprocessor 5110P has 60 cores, is it suggested to use 59 cores?
From this MIC-related FAQ:
Sensible Affinities
Under Intel MPSS many of the kernel services and daemons are affinitized to the “Bootstrap Processor” (BSP), which is the last physical core. This is also where the offload daemon runs the services required to support data transfer for offload. It is therefore generally sensible to avoid using this core for user code. (Indeed, as already discussed, the offload system does that automatically by removing the logical CPUs on the last core from the default affinity of offloaded processes).
From this OpenMP on MIC guide:
Offloaded programs inherit an affinity map that hides the last core, which is dedicated to offload system functions. Native programs can use all the cores, making the calculations required for balancing the threads slightly different.
None of these sources is specific to any MIC model; they are about the architecture. So it seems that if you offload to the device and don't use the default affinity, you should indeed avoid the last core.
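As a hedged sketch of what "avoiding the last core" can look like for code that overrides the default affinity (my addition, Linux-specific, assuming 4 hardware threads per core and that the core to avoid owns the highest-numbered logical CPUs; the actual logical-CPU numbering of the BSP core should be verified on the specific platform):

```c
/* Restrict the current process to all logical CPUs except one core's worth. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long ncpus   = sysconf(_SC_NPROCESSORS_ONLN); /* e.g. 244 on a 61-core card */
    long reserve = 4;                             /* hardware threads per core  */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (long c = 0; c < ncpus - reserve; c++)    /* allow all but one core     */
        CPU_SET((int)c, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("restricted to %ld of %ld logical CPUs\n", ncpus - reserve, ncpus);
    return 0;
}
```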
I evaluated the performance of my test code on an Intel Xeon Phi 7120P card. I observed that the code performed best when the number of threads was a multiple of (number of cores - 1). This is because one of the cores is busy running the Linux micro-OS services.
In general:
No. of threads to create >= K * T * (N - 1)
where
K = a positive integer (K = 2 works fine)
T = number of thread contexts in hardware (4 in my case)
N = number of cores present in hardware
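A small worked sketch of this heuristic for a 61-core card (my addition; the K, T and N values follow this answer, and the formula is the answer's empirical rule of thumb, not a vendor recommendation):

```c
/* Worked example of the rule of thumb above. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int K = 2;                       /* positive integer multiplier            */
    int T = 4;                       /* hardware thread contexts per core      */
    int N = 61;                      /* physical cores on the card             */
    int nthreads = K * T * (N - 1);  /* leave the last core for the micro-OS   */
    omp_set_num_threads(nthreads);   /* 2 * 4 * 60 = 480 threads requested     */
    printf("requesting %d OpenMP threads\n", nthreads);
    return 0;
}
```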
When you execute your workload in offload mode (the application runs on the CPU and offloads some computation to the Xeon Phi), it is recommended to leave 1 core for the offload runtime. There is a COI daemon on the Xeon Phi side that runs four service threads to manage offload activity. Keep in mind that 1 physical core on the Xeon Phi runs 4 hardware threads.
In the case of the native execution model, when the application is started directly on the Xeon Phi card, you can use all available cores, since there is no offload activity.
Amazon measures their CPU allotment in terms of virtual cores and EC2 Compute Units. EC2 Compute Units are defined as:
The amount of CPU that is allocated to a particular instance is expressed in terms of these EC2 Compute Units. We use several benchmarks and tests to manage the consistency and predictability of the performance from an EC2 Compute Unit. One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor referenced in our original documentation.
My question is, say I have a "Large Instance" which comes with "4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)". Does this mean I essentially have 4 cores in a logical sense? Would I want to spawn 4 CPU-bound threads? Or are the compute units simply a measure of power, and I have 2 cores?
Also, given the scalability of the servers, would it be better to double the computing power of a single box and host the database and server on the same box? Or should I have 2 separate, weaker boxes?
nicholaides is correct: the Small instances are the equivalent of one core, the Large instances two cores. The remainder of the measurement is expressed as Compute Units, which are defined as follows:
One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
I run my small website on a single small instance, with both web server and database hosted on the one virtual machine. I've been impressed with the performance, but again don't have a tremendous amount of load on it.
If all you're caring for is bang for your buck, I'd try your setup with both servers running on a single small instance (1 core, 1 EC2 unit at $0.10 / hour) and see how that stacks up. The next step up would be a high-CPU medium instance (2 cores, 5 total EC2 units at $0.20 / hour). Unless you're really hammering your servers, I have to believe you'll be able to run them on that single medium instance. For only twice the price of the small instance, you get five times the performance, which is much better than running two small instances.
One thing to be careful of is that the small and high-CPU medium instances are 32-bit, where all others (large, extra large, and high-CPU extra large) are 64-bit. You cannot run a 32-bit Amazon Machine Image on a 64-bit instance, and vice versa. If you're working with a stock AMI, this isn't a problem because you'll usually be able to find both versions of it, but for a custom image it might make you do a little extra work.
"4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)" simply means you get 2 virtual cpu's, each of which is twice as fast as the basic Small instance.
In total, you get 4 times the power of the Small instance, but since you only get 2 cores, it makes sense to start only two threads.
As for your second question, I think Brad Larson answers it pretty well. The Medium instance has a lot of power for the money. We run our DB and web servers on the same host, and it's surprising how many DB-heavy sites you can run on a single machine. However, since it depends on your own application, your best bet is to benchmark it and see how much load it can handle.
If you must scale up I would suggest separating the two services into different servers, instead of running a larger server, simply because it is easier to optimize each host for the specific service.
As I recall, "Compute Units" are not measuring cores but simple a measure of "power."
Also, given the scalability of the servers, would it be better to double the computing power of a single box and host the database and server on the same box? Or should I have 2 separate, weaker boxes?
It really depends on the application. Trying it out and getting hard data might be your best bet.