Why does Cloud Functions perform so poorly yet cost so much? - performance

I recently registered for the Google Cloud free tier and took an interest in Cloud Functions, which lets me develop and run code without installing an OS or any software development tools. So I wanted to push its performance to the limit with this function, running on a setup with 2GB of memory and a 2.8GHz CPU:
exports.helloWorld = (req, res) => {
  let message = getTheLottery();
  res.status(200).send(message);
};

function getTheLottery() {
  for (let i = 0, len = 10000000; i < len; i++) {
    const ticket = sha256(makeid(5));
    if (ticket === '7CD743877911812A45CD7974023A2D1ACA9831C82057902A2300874A951E6E17')
      return true;
  }
  return false;
}
The SHA-256 implementation is from here, and makeid (from here) generates a random string.
It takes 40 seconds (40,311 ms) to complete the task. Implementing the same code in C++ or C# with some simple parallelism/multithreading, it easily runs in under 7 seconds on my average school PC with 4GB of RAM and a 2.5GHz i5, which also has to run an OS and a few programs in the background, while Google claims Cloud Functions can run code 75% faster than a normal PC.
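For reference, a rough sketch of the kind of parallel brute-force loop described above (not the code that was actually benchmarked): it splits the 10,000,000 iterations across hardware threads with std::thread. std::hash and make_id below are only stand-ins so the sketch compiles on its own; a real comparison would plug in the same SHA-256 and makeid helpers linked above.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <iostream>
#include <random>
#include <string>
#include <thread>
#include <vector>

// Stand-in for makeid(5): a random alphanumeric string of the given length.
static std::string make_id(std::mt19937 &rng, int len) {
    static const char charset[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    std::uniform_int_distribution<int> pick(0, (int)sizeof(charset) - 2);
    std::string s;
    for (int i = 0; i < len; ++i) s += charset[pick(rng)];
    return s;
}

int main() {
    const std::uint64_t total = 10000000;        // same iteration count as the JS version
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<bool> found{false};

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            std::mt19937 rng(w + 1);             // per-thread RNG, nothing shared
            std::hash<std::string> h;            // stand-in for sha256()
            const std::size_t target = 0;        // stand-in for the winning hex digest
            for (std::uint64_t i = w; i < total && !found; i += workers) {
                if (h(make_id(rng, 5)) == target)
                    found = true;                // "won the lottery", stop all threads
            }
        });
    }
    for (auto &t : pool) t.join();
    std::cout << (found ? "won" : "lost") << "\n";
    return 0;
}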
I haven't tried Azure Functions. Suppose I wired this function into my site and let people run it 1,000,000 times: it would cost $1,657 according to the pricing calculator, and possibly a fortune to run it a few hundred times over a full day, with barely any way to optimize the code or the system since everything is locked down, and the longer it runs, the more it costs by the SECOND. With that money I could rent a VM with a powerful GPU and run fully developed rendering or brute-force software at maximum performance.
Is there a better way for me to run this code? Because this feels like an online code compiler with extra steps. How does this make serverless computing the "future"?

Heavy computation is not a use case for Cloud Functions. In fact, there are quite a few things it's not very good for, given that a server instance has only a single CPU, no GPU, and a maximum execution time of 9 minutes.
If you have a heavy compute load, consider using Compute Engine instead. If you want, you can write a Cloud Function that delegates work to Compute Engine.
Cloud Functions is intended to glue together other parts of your system without having to create and manage formal infrastructure for it. The main benefits are effortless scaling (up and down), paying only for the resources you use, and the ability to trigger on events that occur in your GCP project.

Related

Reliable time measurements with cloud high-performance computing

I conduct research on graph search algorithms. In this research, the ability to reliably (i.e. reproducibly) measure the running time of a single-threaded program, in order to compare the running-time performance of two algorithms, is of paramount importance. The running time is measured inside the program (written in C++) and does not include any access to secondary storage (which happens only during the initial input phase). I used to have access to dedicated nodes of a real (i.e. non-cloud) HPC cluster. I recall that, when I ran my program on such a node twice (with the same input), I got time measurements that differed by a small fraction of a percent. The question is: can I get such reliable time measurements on a cloud HPC platform?
To substantiate the question more, for some algorithms and problem instances, my program may use a large amount of memory (say, 64GB). If I understand correctly, even cloud platforms that promise dedicated cores without hyper-threading and dedicated memory, would construct a virtual machine to satisfy such a memory requirement. The nodes making up that virtual machine may be different between the two runs, resulting in different communication overheads and, as a consequence, different time measurements. So, to repeat the question: can I get reliable time measurements on a cloud HPC platform?
Based on discussions and experiences described here and here, it seems safe to say that you should not expect measurements to always be similar.
That said, depending on the number and duration of the tests, and on whether the test VM is allocated/deallocated between runs, you could achieve an acceptable degree of reliability.
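As a rough illustration of how that reliability could be quantified (a minimal sketch, not tied to any particular cloud or cluster): time the same single-threaded workload several times inside the program and report the spread relative to the mean. workload() below is only a placeholder for the algorithm under test.

#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

static volatile double sink;                     // keeps the compiler from removing the work

static void workload() {                         // placeholder for the real search algorithm
    double acc = 0.0;
    for (int i = 1; i <= 20000000; ++i) acc += std::sqrt((double)i);
    sink = acc;
}

int main() {
    const int runs = 5;
    std::vector<double> ms;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        workload();
        auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    double lo = ms[0], hi = ms[0], sum = 0.0;
    for (double v : ms) { lo = std::min(lo, v); hi = std::max(hi, v); sum += v; }
    const double mean = sum / runs;
    // On a quiet dedicated node the spread should be a small fraction of a percent;
    // on shared or virtualized hardware it is typically larger.
    std::cout << "mean " << mean << " ms, spread "
              << 100.0 * (hi - lo) / mean << "% of mean\n";
}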

How to cheaply process large amounts of data (local setup or cloud)?

I would like to try testing NLP tools against dumps of the web and other corpora, sometimes larger than 4 TB.
If I run this on a mac it's very slow. What is the best way to speed up this process?
deploying to EC2/Heroku and scaling up servers
buying hardware and creating a local setup
I just want to know how this is usually done (processing terabytes in a matter of minutes/seconds), and whether it's cheaper/better to experiment with this in the cloud or whether I need my own hardware setup.
Regardless of the brand of your cloud, the whole idea of cloud computing is to be able to scale-up and scale down in a flexible way.
In a corporate environment you might have a scenario in which you consistently need the same amount of computing resources; if you already have them, it is hard to justify the cloud because you simply don't need the flexibility it provides.
On the other hand, if your processing load is not very predictable, the cloud is your best option, because you can pay more when you use more computing power and less when you don't need as much.
Take into account, though, that not all cloud solutions are the same. For instance, a web role is a highly web-dedicated node whose main purpose is to serve web requests: the more requests it serves, the more you pay.
A virtual machine role, on the other hand, is almost like being given exclusive use of a computer system that you can use for anything you want, running either Linux or Windows; the system keeps running even when you are not using it to its full capacity.
Overall, the costs depend on your own scenario and how well it fits to your needs.
I suppose it depends quite a bit on what kind of experimenting you want to do, for what purpose, and for how long.
If you're looking at buying the hardware and running your own cluster, then you probably want something like Hadoop or Storm to manage the compute nodes. I don't know how feasible it is to go through 4TB of data in a matter of seconds, but again that really depends on the kind of processing you want to do. Counting the frequency of words in the 4TB corpus should be pretty easy (even on your Mac), but building SVMs or doing something like LDA on the lot won't be. One issue you'll run into is that you won't have enough memory to fit all of it, so you'll want a library that can run the methods off disk.
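To illustrate the "easy" end of that spectrum, here is a minimal word-count sketch that streams its input and never holds the corpus in memory (only the vocabulary), so it is limited by disk/network throughput rather than RAM; the invocation in the comment uses a hypothetical file name.

#include <iostream>
#include <string>
#include <unordered_map>

// Example invocation: zcat dump.txt.gz | ./wordcount
int main() {
    std::ios::sync_with_stdio(false);
    std::unordered_map<std::string, long long> counts;
    std::string word;
    while (std::cin >> word)                     // whitespace-delimited tokens, single pass
        ++counts[word];
    for (const auto &kv : counts)
        std::cout << kv.second << '\t' << kv.first << '\n';
}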
If you don't know exactly what your requirements are, then I would use EC2 to set up a test rig to gain a better understanding of what it is you want to do and how much grunt/memory is needed to get it done in the amount of time you require.
We recently bought two compute nodes, 128 cores each with 256GB of memory and a few terabytes of disk space, for I think around £20k or so. These are AMD Interlagos machines. That said, the compute cluster already had InfiniBand storage, so we just had to hook up to that and buy the two compute nodes, not the whole infrastructure.
The obvious thing to do here is to start off with a smaller data set, say a few gigabytes. That'll get you started on your mac, you can experiment with the data and different methods to get an idea of what works and what doesn't, and then move your pipeline to the cloud, and run it with more data. If you don't want to start the experimentation with a single sample, you can always take multiple samples from different parts of the full corpus, just keep the sample sizes down to something you can manage on your own workstation to start off with.
As an aside, I highly recommend the scikit-learn project on GitHub for machine learning. It's written in Python, but most of the matrix operations are done in Fortran or C libraries, so it's pretty fast. The developer community is also extremely active on the project. Another good library that is perhaps a bit more approachable (depending on your level of expertise) is NLTK. It's nowhere near as fast but makes a bit more sense if you're not used to thinking about everything as a matrix.
UPDATE
One thing I forgot to mention is how long your project will be running, or to put it another way, how long you will get some use out of your specialty hardware. If it's a project that is supposed to serve the EU parliament for the next 10 years, then you should definitely buy the hardware. If it's a project for you to get familiar with NLP, then forking out the money might be a bit wasteful, unless you're also planning on starting your own cloud computing rental service :).
That said, I don't know what the real world costs of using EC2 are for something like this. I've never had to use them.

How can I limit the processing power given to a specific program?

I develop on a laptop with a dual-core AMD 1.8 GHz processor, but people frequently run my programs on much weaker systems (a 300 MHz ARM, for example).
I would like to simulate such weak environments on my laptop so I can observe how my program runs. It is an interactive application.
I looked at QEMU and I know how to set up an environment, but it's a bit painful and I didn't see the exact incantation of switches I would need to make QEMU simulate a weaker CPU.
I have VirtualBox, but it doesn't seem like I can virtualize less than 1 full host CPU.
I know about http://cpulimit.sourceforge.net/ which uses SIGSTOP and SIGCONT to try to limit the CPU given to a process, but I am worried this is not really an accurate portrayal of a weaker CPU.
Any ideas?
If your CPU is 1800 MHz and your target is 300 MHz, and your code is like this:
while(1) { /*...*/ }
you can rewrite it like:
long last = gettimestamp();
while (1)
{
    long curr = gettimestamp();
    if (curr - last > 1000)           // out of every second...
    {
        long target = curr + 833;     // ...waste 5/6 of it
        while (gettimestamp() < target)
            ;                         // busy-wait until the window is over
        last = curr;                  // measure the next second from the start of this
                                      // window, not from the end of the wait, so only
                                      // ~1/6 of each second is spent on real work
    }
    // your original code
}
where gettimestamp() is your OS's high frequency timer.
You can choose to work with smaller values for a smoother experience, say 83ms out of every 100ms, or 8ms out of every 10ms, and so on. The lower you go, though, the more timer precision loss will throw off the ratio.
edit: Or how about this? Create a second process that starts the first and attaches itself as a debugger to it, then periodically pauses it and resumes it according to the algorithm above.
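For example, here is a minimal POSIX sketch of that second-process idea, using SIGSTOP/SIGCONT (the same mechanism cpulimit relies on) rather than a real debugger attach; the 17 ms / 83 ms split approximates the 300/1800 ratio discussed above.

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Usage: ./throttle <program> [args...]
int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    pid_t child = fork();
    if (child == 0) {                            // child: run the target program
        execvp(argv[1], &argv[1]);
        _exit(127);                              // exec failed
    }
    for (;;) {
        int status;
        if (waitpid(child, &status, WNOHANG) != 0)
            break;                               // target has exited (or an error occurred)
        kill(child, SIGCONT);
        usleep(17000);                           // let it run for ~17 ms...
        kill(child, SIGSTOP);
        usleep(83000);                           // ...then freeze it for ~83 ms (5/6 of the window)
    }
    return 0;
}

Smaller windows give a smoother interactive feel at the cost of more signal traffic, and as with the busy-wait version above, this only approximates a slower CPU; it does not model cache size, memory bandwidth, or instruction timing.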
You may want to look at an emulator that is built for this. For example, from Microsoft you can find this tech note, http://www.nsbasic.com/ce/info/technotes/TN23.htm.
Without knowing more about the languages you are using, and platforms, it is hard to be more specific, but I would trust the emulator programs to do a good job in providing the test environment.
I picked up a PIIMMX-266 laptop somewhere and installed a minimal Debian on it. That was a perfect solution until it died a few weeks ago. It is a Panasonic model with a non-standard IDE connector (it's neither 40-pin nor 44-pin), so I was unable to replace its HDD with a CF card (a CF-to-IDE adapter costs near zero). Also, the price of such a machine is around USD 50 / EUR 40.
(I was using it to simulate a slow ARM-based machine for our home automation system, which is planned to be able to run on even the smallest and slowest Linux systems. Meanwhile, we've chosen a small and slow computer for home automation purposes: the GuruPlug. It has a roughly 1.2 GHz CPU.)
(I'm not familiar with QEMU, but the manual says that you can use KVM (kernel virtualization) in order to run programs at native speed; I assume that if it's an extra feature then it can be turned off, so, strange but true, it can emulate x86 on x86.)

How do I ensure GUI responsiveness when using OpenCL on the display GPU?

In my relatively short time learning OpenCL I frequently see my application cause the operating system UI to become significantly less responsive (several seconds for a window to respond to a drag for example). I have encountered this problem on Windows Vista and Mac OS X both with NVidia GPUs.
What can I do when using OpenCL on the same GPU as the display to ensure that my application does not significantly degrade UI responsiveness like this? Also, can this be done without taking needless performance losses within my application? (I.e., if the user is not doing some UI-intensive task, I would not expect my application to run any slower than it does now.)
I understand that any answers will be very platform specific (where platform includes OS/GPU/driver combo).
As described in Dr. David Gohara's OpenCL Tutorial Episode 6 (beginning at 43:49), graphics cards cannot be preemptively scheduled at this time. As a result, using the same graphics card both for an intensive OpenCL kernel and the UI (or other GPU-using operations) will result in clunkiness or the visual appearance of freezing. Until graphics cards get preemptively scheduled multitasking (if ever), there's no way to do exactly what you want with just a single graphics card. I don't believe this is a platform-specific issue at all.
However, this problem might be solvable by dividing the problem up. Given the relative speed of whatever single GPU is available (you'll have to do testing to find the right setup), divide up your OpenCL problem to run the kernel multiple times with different parts of the input data, and later combine the output data when all sets of kernels are complete. I would recommend creating kernel sets that take less than 100 milliseconds to run (on a given GPU) so that lag would be, if not unnoticeable, not significantly annoying (the 100 milliseconds figure is a good "rule of thumb" according to this paper).
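For illustration, a minimal sketch of that division, assuming a command queue and kernel have already been created and had their arguments set (global work offsets require OpenCL 1.1 or later):

#include <CL/cl.h>

// Enqueue the same kernel over sub-ranges instead of one huge launch, blocking
// between slices so the display driver regularly gets the GPU back.
void run_in_chunks(cl_command_queue queue, cl_kernel kernel,
                   size_t total_items, size_t chunk_items)
{
    for (size_t offset = 0; offset < total_items; offset += chunk_items) {
        size_t count = total_items - offset;
        if (count > chunk_items)
            count = chunk_items;

        clEnqueueNDRangeKernel(queue, kernel, 1 /* work_dim */,
                               &offset, &count,
                               NULL /* let the driver pick the work-group size */,
                               0, NULL, NULL);
        clFinish(queue);                  // wait for this slice before submitting the next
    }
}

Choose chunk_items so a single slice runs in well under 100 milliseconds on the target GPU; as noted above, the right value has to be found by testing.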
Based on your comment about your program being a command-line application, I assume your application will only run once at any given time, versus being a continuously running application with real-time output, as a lot of OpenCL demos are. My above answer is only satisfactory for non-continuous applications, since real-time performance isn't inherently expected. However, if your application is supposed to be continuous, the only solution currently available is to add a second, simpler graphics card that will only be used for UI.

How do I get repeatable CPU-bound benchmark runtimes on Windows?

We sometimes have to run some CPU-bound tests where we want to measure runtime. The tests last in the order of a minute. The problem is that from run to run the runtime varies by quite a lot (+/- 5%). We suspect that the variation is caused by activity from other applications/services on the system, eg:
Applications doing housekeeping in their idle time (e.g. Visual Studio updating IntelliSense)
Filesystem indexers
etc..
What tips are there to make our benchmark timings more stable?
Currently we minimize all other applications, run the tests at "Above Normal" priority, and don't touch the machine while it runs the test.
The usual approach is to perform lots of repetitions and then discard outliers. So, if distractions such as the disk indexer only crop up once every hour or so, and you do 5-minute runs repeated over 24 hours, you'll have plenty of results where nothing got in the way. It is a good idea to plot the probability density function to make sure you understand what is going on. Also, if you are not interested in startup effects such as getting everything into the processor caches, make sure the experiment runs long enough to make them insignificant.
First of all, if it's just about benchmarking the application itself, you should use CPU time, not wallclock time, as a measure. That is then (almost) free from the influence of whatever the other processes or the system are doing. Secondly, as Dickon Reed pointed out, more repetitions increase confidence.
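A minimal illustration of the difference (the loop is just a stand-in workload): std::clock() accumulates CPU time actually charged to the process, while steady_clock measures elapsed wall time, which also includes time spent preempted by other processes.

#include <chrono>
#include <cmath>
#include <ctime>
#include <iostream>

static volatile double sink;                     // keeps the work from being optimized away

int main() {
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    double acc = 0.0;                            // some CPU-bound work
    for (int i = 1; i <= 50000000; ++i) acc += std::sqrt((double)i);
    sink = acc;

    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();

    std::cout << "CPU time:  " << 1000.0 * (c1 - c0) / CLOCKS_PER_SEC << " ms\n"
              << "Wall time: "
              << std::chrono::duration<double, std::milli>(w1 - w0).count() << " ms\n";
}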
Quote from VC++ team blog, how they do performance tests:
To reduce noise on the benchmarking machines, we take several steps:
Stop as many services and processes as possible.
Disable network driver: this will turn off the interrupts from the NIC caused by broadcast packets.
Set the test’s processor affinity to run on one processor/core only.
Set the run to high priority which will decrease the number of context switches.
Run the test for several iterations.
I do the following:
Call the method x times and measure the time
Do this n times and calculate the mean and standard deviation of those measurements
Try to get x to a point where each measurement takes more than 1 second. This will reduce the noise a bit.
The mean will tell you the average performance of your test and the standard deviation the stability of your test/measurements.
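A minimal sketch of that procedure, with method() as a placeholder for whatever is being benchmarked:

#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

static volatile double sink;                     // keeps the work from being optimized away

static void method() {                           // placeholder workload
    double acc = 0.0;
    for (int i = 1; i <= 1000000; ++i) acc += std::sqrt((double)i);
    sink = acc;
}

int main() {
    const int x = 200;                           // calls per measurement (aim for > 1 s)
    const int n = 10;                            // number of measurements
    std::vector<double> samples;

    for (int s = 0; s < n; ++s) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < x; ++i) method();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }

    double mean = 0.0;
    for (double v : samples) mean += v;
    mean /= n;
    double var = 0.0;
    for (double v : samples) var += (v - mean) * (v - mean);

    std::cout << "mean " << mean << " s, stddev " << std::sqrt(var / n) << " s\n";
}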
I also set my application to a very high priority, and when I test a single-threaded algorithm I pin it to one CPU core to make sure there is no scheduling overhead.
This code demonstrates how to do this in .NET:
using System;                      // IntPtr, Environment
using System.Diagnostics;          // Process, ProcessPriorityClass
using System.Threading;            // Thread, ThreadPriority

// Raise thread and process priority so other work is less likely to preempt the test.
Thread.CurrentThread.Priority = ThreadPriority.Highest;
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
// Pin the process to a single core so the scheduler does not migrate it mid-run.
if (Environment.ProcessorCount > 1)
{
    Process.GetCurrentProcess().ProcessorAffinity =
        new IntPtr(1 << (Environment.ProcessorCount - 1));
}

Resources