Couple of questions about Amazon EC2 - amazon-ec2

Amazon measures their CPU allotment in terms of virtual cores and EC2 Compute Units. EC2 Compute Units are defined as:
The amount of CPU that is allocated to a particular instance is expressed in terms of these EC2 Compute Units. We use several benchmarks and tests to manage the consistency and predictability of the performance from an EC2 Compute Unit. One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor referenced in our original documentation.
My question is, say I have a "Large Instance" which comes with "4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)". Does this mean I essentially have 4 cores in a logical sense? Would I want to spawn 4 CPU-bound threads? Or are the compute units simply a measure of power, and I have 2 cores?
Also, given the scalability of the servers, would it be better to double the computing power of a single box and host the database and server on the same box? Or should I have 2 seperate, weaker boxes?

nicholaides is correct, the small instances are the equivalent of one core, the large two cores. The remainder of the measurement is expressed as Compute Units, which are defined as follows:
One EC2 Compute Unit (ECU) provides
the equivalent CPU capacity of a
1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
I run my small website on a single small instance, with both web server and database hosted on the one virtual machine. I've been impressed with the performance, but again don't have a tremendous amount of load on it.
If all you're caring for is bang for your buck, I'd try your setup with both servers running on a single small instance (1 core, 1 EC2 unit at $0.10 / hour) and see how that stacks up. The next step up would be a high-CPU medium instance (2 cores, 5 total EC2 units at $0.20 / hour). Unless you're really hammering your servers, I have to believe you'll be able to run them on that single medium instance. For only twice the price of the small instance, you get five times the performance, which is much better than running two small instances.
One thing to be careful of is that the small and high-CPU medium instances are 32-bit, where all others (large, extra large, and high-CPU extra large) are 64-bit. You cannot run a 32-bit Amazon Machine Image on a 64-bit instance, and vice versa. If you're working with a stock AMI, this isn't a problem because you'll usually be able to find both versions of it, but for a custom image it might make you do a little extra work.

"4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)" simply means you get 2 virtual cpu's, each of which is twice as fast as the basic Small instance.
In total, you get 4 times the power of the Small instance, but since you only get 2 cores, it makes sense to start only two threads.
As for your second question, I think Brad Larson answers it pretty well. The Medium instance has a lot of power for the money. We run our db en web servers on the same host, and it's surprising how many db-heavy sites you can run on a single machine. However, since it depends on your own application your best bet is to benchmark it to see how much load it can handle.
If you must scale up I would suggest separating the two services into different servers, instead of running a larger server, simply because it is easier to optimize each host for the specific service.

As I recall, "Compute Units" are not measuring cores but simple a measure of "power."
Also, given the scalability of the servers, would it be better to double the computing power of a single box and host the database and server on the same box? Or should I have 2 seperate, weaker boxes?
It really depends on the application. Trying it out and getting hard data might be your best bet.

Related

Weight-factor to determine Virtual Cores for a Virtual Machine

I have to design an algorithm which will decide to assign virtual cores to a VM.
e.g. I have 2 options to create a machines. That could be physical/virtual. Let's consider 2 cases:
If I require 1-core of 2.3 GHz, which means I require a processor having the ability to run 2.3 * 10^9 instructions. In the case of assigning a processor having these capabilities to a physical machine, it is ok.
But when I want to assign 1-core of 2.3 GHz to a virtual machine, I want to use a constant weight-factor of value 0.8. I divide the "number of instructions" i.e. 2.3 * 10^9 with the weight-factor 0.8. So, the required processing capability for the virtual-machine should be scaled by this factor. The value turns out to be 2.875 * 10^9.
I want to make sure that from you that, is this a correct way to scale the required processing capabilities by the use of a weight-factor, in the case of virtual-machines.
If yes, are there any related studies or proof of concepts to use this mechanism of determining the number of processors required for a virtual-machine?
In general; for SMT (e.g. hyper-threading) on 80x86 CPUs; with all logical CPUs within a core doing work:
If all logical CPUs are using different resources (e.g. maybe one using SSE instructions and another using general purpose integer instructions); each logical CPU may be as fast as it would be if it was the only logical CPU using the core
If all logical CPUs are fighting for the same resource/s, the performance of the core may be equally divided by the logical CPUs (e.g. with 2 logical CPUs per core, each logical CPU might get half of the performance it would have if it was the only logical CPU using the core).
Note that this may also apply to AMD's Bulldozer (even though it's not considered SMT), where the FPU is shared between cores but the rest of the core is not (in other words, if both cores are pounding the FPU at the same time then performance of both cores will be effected).
This means that (e.g.) for a 2.3 GHz core with 2 logical CPUs per core, each logical CPU may get (the crude equivalent of) anything from 0.75 GHz to 3.4 Ghz; depending on the exact code that each logical CPU happens to be executing and various power management conditions (thermal throttling, turbo-boost, etc).
However, actual performance also depends on things like caches (and cache sharing), RAM chip bandwidth, and virtual machine overheads (which vary from "extreme" for code causing a huge number of VMEXITs to almost nothing). With this in mind (e.g.) for a 2.3 GHz core, each logical CPU may get (the crude equivalent of) anything from a few hundred MHz to 3.4 Ghz; depending on many factors.
Essentially; your "weight" should be any random number from 0.1 to 1.0 depending on a bunch of stuff that you can't/won't know.
Fortunately, any code running inside the virtual machine is likely to be designed to handle a wide variety of different CPUs each with a wide variety of speeds; so it's enough to just assign any CPU to the virtual machine and let the software running inside the virtual machine adapt to whatever performance it was given.
Alternatively (if you need to guarantee some kind of performance or you want to try to hide the timing differences so that code in the VM doesn't know it's not running on real hardware); you can keep track of "virtual time" and "wall clock time" and try to keep these times roughly in sync. For example, if "virtual time" is moving too slow (e.g. because the code inside the VM is causing lots of VMEXITs) you can pretend that the virtual CPU got hot and started thermal throttling to create a plausible/realistic excuse that allows "virtual time" to catch up to "wall clock time"; and if something can happen sooner than it should (e.g. you know that guest is waiting for a virtual timer that will expire in 100 milliseconds and can pretend that 100 milliseconds passed when it didn't) you can deliberately slow down the virtual machine until "wall clock time" catches up to "virtual time". In this case it would be a good idea to give yourself some room to move (pretend the virtual CPU is slower than it could be, because it's easier to slow the virtual machine down than it is to speed it up). Of course this can also be used to hide timing differences caused by SMT, and can hide timing difference caused by sharing CPUs between VMs (e.g. when there's more virtual cores than real cores).
Note: The "alternative alternative" is to say that "virtual time" has nothing to do with "wall clock time" at all. This allows you to (e.g.) emulate a 6 GHz CPU when all you have is an old 1 GHz CPU - it'd just mean that 1 "virtual second" takes about 6 "wall clock seconds".
Also note that with all the security problems in the last 18+ months (e.g. spectre) I'd strongly consider using "cores" as the minimum assignable unit, such that at any point in time a VM gets all logical CPUs belonging to a core or none of the logical CPUs belonging to the core (and refuse to allow logical CPUs within the same core to be assigned to different virtual machines at the same time, because data will probably leak across any of the many side-channels from one VM to another).

is it possible to a vCPU to use different CPUs from two different hardware computers

I'v searched about this but i don't seem to get fair answer.
lets say i wan't to create a vm that has a vCPU, and that vCPU must have 10 cores but i only have 2 computers with 5 cores of physical CPU for each.
is it possible to create one vCPU by relaying on these two physical CPUs to perform like regular one physical CPU?
Update 1: lets say i'm using virtualBox, and the term vCPU is referring to virtual cpu, and it's a well known term.
Update 2: i'm asking this because i'm doing a little research about dynamic provisioning in HPC clusters, and i wan't to know if the word "dynamic" really means allocating virtual cpus dynamically from different hardwares, like bare-metal servers. i don't know if i was searching in the wrong place but no one really answers this question in the docs.
Unfortunately, I have to start by saying that I completely disagree with the answer from OSGX (and I have to start with that as the rest of my answer depends on it). There are documented cases where aggregating CPU power of multiple physical systems into a single system image work great. Even about the comment regarding ScaleMP ...solutions can be ranged from "make target application slower" to "make target application very-very slow" ... - all one needs to do to invalidate that claim is to check the top-rated machines in the SPEC CPU benchmark lists to see machines using ScaleMP are in the top 5 SMPs ever built for performance on this benchmark.
Also, from computer architecture perspective, all large scale machines are essentially a collection of smaller machines with a special fabric (Xbar, Numalink, etc.) and some logic/chipset to manage cache coherence. today's standard fabrics (PCIe Switching, InfiniBand) are just as fast, if not faster, than those proprietary SMP interconnects. Will OSGX claim those SMPs are also "very-very-slow"?
The real question, as with any technology, is what are you trying to achieve. Most technologies are a good fit for one task but not the other. If you are trying to build a large machine (say, combine 16 servers, each with 24 cores, into a 384-core SMP), on-top of which you will be running small VMs, each using single digit number of vCPUs, then this kind of SSI solution would probably work very nicely as to the underlying infrastructure you are merely running a high-throughput computing (HTC) job - just like SPEC CPU is. However, if you are running a thread-parallel software that excessively uses serializing elements (barriers, locks, etc) that require intensive communication between all cores - then maybe you won't see any benefit.
As to the original question on the thread, or rather, the "Update 2" by the author:
...I'm asking this because i'm doing a little research about dynamic provisioning in HPC clusters...
Indeed, there is not a lot of technology out there that enables the creation of a single system from CPUs across a cluster. The technology mentioned earlier, from ScaleMP, does this but only at a physical server granularity (so, if you have a cluster of 100 servers and each cluster node has 24 cores, then you can "dynamically" create virtual machines of 48 cores (2 cluster nodes), 72 cores (3 cluster nodes), and so on, but you could not create a machine with 36 cores (1.5 cluster nodes), nor combine a few vacant CPUs from across different nodes - you either use all the cores from a node to combine into a virtual SMP, or none at all.
I'll use term vCPU as virtual cores and pCPU as physical cores, as it is defined by virtualbox documentation: https://www.virtualbox.org/manual/ch03.html#settings-processor
On the "Processor" tab, you can set how many virtual CPU cores the guest operating systems should see. Starting with version 3.0, VirtualBox supports symmetrical multiprocessing (SMP) and can present up to 32 virtual CPU cores to each virtual machine. You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
And I will try to answer your questions:
lets say i wan't to create a vm that has a vCPU, and that vCPU must have 10 cores but i only have 2 computers with 5 cores of physical CPU for each.
If you want to create virtual machine (with single OS image, SMP machine) all virtual cores should have shared memory. Two physical machines each of 5 cores have in sum 10 cores, but they have no shared memory. So, with classic virtualization software (qemu, kvm, xen, vmware, virtualbox, virtualpc) you is not able to convert two physical machine into single virtual machine.
is it possible to create that vCPU by relaying on these two physical CPUs to perform like regular one physical CPU?
No.
Regular physical machine have one or more CPU chips (sockets) and each chip has one or more cores. First PC had 1 chip with one core; there were servers with two sockets with one core in each. Later multicore chips were made, and huge servers may have 2, 4, 6 or sometimes even 8 sockets, with some number of cores per socket. Also, physical machine has RAM - dynamic computer memory, which is used to store data. Earlier multisocket systems had single memory controller, current multisocket systems have several memory controllers (MC, 1-2 per socket, every controller with 1, 2, or sometimes 3 or 4 channels of memory). Both multicore and multisocket systems allow any CPU core to access any memory, even if it is controlled by MC of other socket. And all accesses to the system memory are coherent (Memorycoherence, Cachecoherence) - any core may write to memory and any other core will see writes from first core in some defined order (according to Consistency model of the system). This is the shared memory.
"two physical" chips of two different machines (your PC and your laptop) have not connected their RAM together and don't implement in hardware any model of memory sharing and coherency. Two different computers interacts using networks (Ethernet, Wifi, .. which just sends packets) or files (store file on USB drive, disconnect from PC, connect to laptop, get the file). Both network and file sharing are not coherent and are not shared memory
i'm using virtualBox
With VirtualBox (and some other virtualization solutions) you may allocate 8 virtual cores for the virtual machine even when your physical machine has 4 cores. But VMM will just emulate that there are 8 cores, scheduling them one after one on available physical cores; so at any time only programs from 4 virtual cores will run on physical cores (https://forums.virtualbox.org/viewtopic.php?f=1&t=30404 " core i7, this is a 4 core .. I can use up to 16 VCPU on virtual Machine .. Yes, it means your host cores will be over-committed. .. The total load of all guest VCPUs will be split among the real CPUs."). In this case you will be able to start 10 core virtual machine on 5 core physical, and application which want to use 10 cores will get them. But performance of the application will be not better as with 5 real CPUs, and it will be less, because there will be "virtual CPU switching" and frequent synchronization will add extra overhead.
Update 2: i'm asking this because i'm doing a little research about dynamic provisioning
If you want to research about "dynamic provisioning", ask about it, not about "running something unknown on two PC at the same time)
in HPC clusters,
There are no single type of "HPC" or "HPC clusters". Different variants of HPC will require different solutions and implementations. Some HPC tasks needs huge amounts of memory (0.25, 0.5, 1, 2 TB) and will run only on shared-memory 4- or 8-socket machines, filled with hugest memory DIMM modules. Other HPC tasks may use GPGPU a lot. Third kind will combine thread parallelism (OpenMP) and process parallelism (MPI), so applications will use shared memory while threads of it runs on single machine, and they will send and receive packets over network to work collectively on one task while running on several (thousands) physical machines. Fourth kind of HPC may want to have 100 or 1000 TB of shared memory; but there are no SMP / NUMA machines with such amounts, so application can be written in Distributed shared memory paradigm/model (Distributed global address space DGAS, Partitioned global address space PGAS) to run on special machines or on huge clusters. Special solutions are used, and in PGAS the global shared memory of 100s TB is emulated from many computers which are connected with network. Program is written in special language or just use special library functions to access memory (list of special variants from Wikipedia: PGAS "Unified Parallel C, Coarray Fortran, Split-C, Fortress, Chapel, X10, UPC++, Global Arrays, DASH and SHMEM"). If the address or the request is in local memory, use it; if it is in memory of other machine, send packet to that machine to request data from memory. Even with fastest (100 Gbit/s) special networks with RDMA capability (network adapter may access memory of the PC without any additional software processing of incoming network packet) the difference between local memory and memory of remote computer is speed: you have higher latency of access and you have lower bandwidth when memory is remote (remote memory is slower than local memory).
If you say "vCPU must have 10 cores" we can read this as "there is application which want 10 core of shared memory system". In theory it is possible to emulate shared memory for application (and it can be possible to create virtualization solution which will use resources from several PC to create single virtual pc with more resources), but in practice this is very complex task and the result probably will has too low performance. There is commercial ScaleMP (very high cost; Wikipedia: ScaleMP "The ScaleMP hypervisor combines x86 servers to create a virtual symmetric multiprocessing system. The process is a type of hardware virtualization called virtualization for aggregation.") and there was commercial Cluster OpenMP from Intel (https://software.intel.com/sites/default/files/1b/1f/6330, https://www.hpcwire.com/2006/05/19/openmp_on_clusters-1/) to convert OpenMP programs (uses threads and shared memory) into MPI-like software with help of library and OS-based handlers of access to remote memory. Both solutions can be ranged from "make target application slower" to "make target application very-very slow" (internet search of scalemp+slow and cluster+openmp+slow), as computer network is always slower that computer memory (network has greater distance than memory - 100m vs 0.2m, network has narrow bus of 2, 4 or 8 high-speed pairs while memory has 64-72 high-speed pairs for every memory channel; network adapter will use external bus of CPU when memory is on internal interface, most data from network must be copied to the memory to become available to CPU).
and i wan't to know if the word "dynamic" really means
no one really answers this question in the docs.
If you want help from other people, show us the context or the docs you have with the task. It can be also useful to you to better understand some basic concepts from computing and from cluster computing (Did you have any CS/HPC courses?).
There are some results from internet search request like "dynamic+provisioning+in+HPC+clusters", but we can't say is it the same HPC variant as you want or not.

How to reduce time taken for large calculations in MATLAB

When using the desktop PC's in my university (Which have 4Gb of ram), calculations in Matlab are fairly speedy, but on my laptop (Which also has 4Gb of ram), the exact same calculations take ages. My laptop is much more modern so I assume it also has a similar clock speed to the desktops.
For example, I have written a program that calculates the solid angle subtended by 50 disks at 500 points. On the desktop PC's this calculation takes about 15 seconds, on my laptop it takes about 5 minutes.
Is there a way to reduce the time taken to perform these calculations? e.g, can I allocate more ram to MATLAB, or can I boot up my PC in a way that optimises it for using MATLAB? I'm thinking that if the processor on my laptop is also doing calculations to run other programs this will slow down the MATLAB calculations. I've closed all other applications, but I know theres probably a lot of stuff going on I can't see. Can I boot my laptop up in a way that will have less of these things going on in the background?
I can't modify the code to make it more efficient.
Thanks!
You might run some of my benchmarks which, along with example results, can be found via:
http://www.roylongbottom.org.uk/
The CPU core used at a particular point in time, is the same on Pentiums, Celerons, Core 2s, Xeons and others. Only differences are L2/L3 cache sizes and external memory bus speeds. So you can compare most results with similar vintage 2 GHz CPUs. Things to try, besides simple number crunching tests.
1 - Try memory test, such as my BusSpeed, to show that caches are being used and RAM not dead slow.
2 - Assuming Windows, check that the offending program is the one using most CPU time in Task Manager, also that with the program not running, that CPU utilisation is around zero.
3 - Check that CPU temperature is not too high, like with SpeedFan (free D/L).
4 - If disk light is flashing, too much RAM might be being used, with some being swapped in and out. Task Manager Performance would show this. Increasing RAM demands can be checked my some of my reliability tests.
There are many things that go into computing power besides RAM. You mention processor speed, but there is also number of cores, GPU capability and more. Programs like MATLAB are designed to take advantage of features like parallelism.
Summary: You can't compare only RAM between two machines and expect to know how they will perform with respect to one another.
Side note: 4 GB is not very much RAM for a modern laptop.
Firstly you should perform a CPU performance benchmark on both computers.
Modern operating systems usually apply the most aggressive power management schemes when it is run on laptop. This usually means turning off one or more cores, or setting them to a very low frequency. For example, a Quad-core CPU that normally runs at 2.0 GHz could be throttled down to 700 MHz on one CPU while the other three are basically put to sleep, while it is on battery. (Remark. Numbers are not taken from a real example.)
The OS manages the CPU frequency in a dynamic way, tweaking it on the order of seconds. You will need a software monitoring tool that actually asks for the CPU frequency every second (without doing busy work itself) in order to know if this is the case.
Plugging in the laptop will make the OS use a less aggressive power management scheme.
(If this is found to be unrelated to MATLAB, please "flag" this post and ask moderator to move this question to the SuperUser site.)

Memory-intense jobs scaling poorly on multi-core cloud instances (ec2, gce, rackspace)?

Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory intense jobs (2.5GB in my case)?
When I run jobs locally on my quad xeon chip, the difference between using 1 core and all 4 cores is about a 25% slowdown with all cores. This is to be expected from what I understand; a drop in clock rate as the cores get used up is part of the multi-core chip design.
But when I run the jobs on a multicore virtual instance, I am seeing a slowdown of like 2x - 4x in processing time between using 1 core and all cores. I've seen this on GCE, EC2, and Rackspace instances. And I have tested many difference instance types, mostly the fastest offered.
So has this behavior been seen by others with jobs about the same size in memory usage?
The jobs I am running are written in fortran. I did not write them, and I'm not really a fortran guy so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch top as they run. They run without the need to communicate with each other, ie., embarrasingly parallel. They each take about 2.5GB in memory.
So my best guess so far is that jobs that use up this much memory take a big hit by the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.
My workaround for now is to use GCE because they have single-core instance that actually runs the jobs as fast as my laptop's chip, and are priced almost proportionally by core.
You might be running into memory bandwidth constraints, depending on your data access pattern.
The linux perf tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:
Running one copy of the single-threaded program on your laptop takes X minutes to complete.
Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.
Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.
Running N copies of the single-threaded program on an N-core virtual cloud instances, each copy takes X * 2-4 minutes to complete.
If so, it sounds like you're either running into a kernel contention or contention for e.g. memory I/O. It would be interesting to see whether various fortran compiler options might help optimize memory access patterns; for example, enabling SSE2 load/store intrinsics or other optimizations. You might also compare results with gcc and intel's fortran compilers.

EC2 Large Instance Vs. 2 x 2 Processor dedicated hosting

I currently have a quad code single processor dedicated hosting with 4GB of RAM at softlayer. I am contemplating upgrading to a dual processor dual core (or quad core). While doing the price comparison with the reserved large instance in amazon, it seems the price is quite comparable to similar dedicated hosting (maybe ec2 is little cheaper like to like).
Anyone has any other point of view or experience that can shed some more light on this? I want to keep the server running 24 x 7 and my concern is the processor speed (not sure what is amazon's computing unit capabilities) and RAM. For hard disk, I guess I will have to use the elastic storage to avoid loss in case of server breakdown!
If you want to have a server running all the time I usually find the dedicated servers cheaper than cloud ones. In the cloud you pay a bit more for the dynamics that you start and stop server whenever you want.
As for ECU. That is a pity that Amazon does not say how they actually measure it. There is a pretty decent try to measure what it means with multiple benchmarks in this article. But they ended with strongly non-linear scale. Another source tells what ECU is directly proportional to UnixBench - first question on this page. Actually the second link is for service that makes comparison of prices in cloud computing. You may find that Amazon may not necessary have the cheapest CPU. But you should be careful though - the CPU measure is based on the mentioned ECU measurement, which not necessary reflect later actual application performance.

Resources