How can Oracle User Profiles be put to practical use? - performance

Oracle 10g has Profiles that allow various resources to be limited. Here are some references for clarity - orafaq.com, Oracle Documentation. I am particularly interested in limiting CPU_PER_CALL, LOGICAL_READS_PER_CALL, and COMPOSITE_LIMIT with the goal of preventing a poorly formed statement from destroying performance for every other session.
My problem is that I don't know what values to use for these parameters that will allow your typical long running resource intensive operations while preventing the truly bad ones. I realize that the values will differ based on the hardware, tolerance levels, and queries involved, which is why I am more interested in a method to follow to determine what values are best.

There are a variety of approaches, depending on the situation. The simplest possible approach that has any hope of working is to ask how long the longest running realistic operation would run (that's obviously system-dependent, and depends on whether this is a system you're building or something existing) and to back in to a CPU_PER_CALL based on that time limit and the degree of parallelism. Assuming single-threaded operation, if you can reasonably say that if a query hasn't returned in 30 minutes you want to kill it, you can set CPU_PER_CALL to allocate 30 minutes worth of CPU (obviously most queries aren't going to use 100% constantly, so that 30 minute limit gives you some amount of breathing room).
If this is an existing system, you (or your DBA) can go through AWR/ statspack reports for a reasonable number of days (some systems will need to make sure to look at reports from month/ quarter/ year-end where additional processing may be done) and find the real statements that use the most CPU and I/O. You can then set your profile limits appropriately (i.e. the maximum CPU recorded for a statement in the past month + 30% of breathing room).
Of course, for any limit you pick, someone has to monitor the system to make sure that the limits keep pace. If queries get more and more expensive over time because of increases in data volume, for example, that max + 30% limit might be insufficient in 6 months. You don't want to find that out when the nightly processing aborts, someone has to keep on top of that.
If you are using the enterprise edition, you may be better served looking at Resource Manager rather than profiles. While profiles allow you to kill runaway sessions, Resource Manager allows you to change session priority based on a variety of factors. Rather than killing a query that has used more than 30 minutes of CPU, it may be better to make it lower priority so that it doesn't interfere with other sessions without killing it, in case it is just running long.

Related

Why real time is much higher than "user" and "system" CPU TIME combined?

We have a batch process that executes every day. This week, a job that usually does not past 18 minutes of execution time (real time, as you can see), now is taking more than 45 minutes to finish.
Fullstimmer option is already active, but we don't know why only the real time was increased.
In old documentation there are Fullstimmer stats that could help identify the problem but they do not appear in batch log. (The stats are those down below: Page Faults, Context Switches, Block Operation and so on, as you can see)
It might be an I/O issue. Does anyone know how we can identify if it is really an I/O problem or if it could be some other issue (network, for example)?
To be more specific, this is one of the queries that have increased in time dramatically. As you can see, it is reading from a data base (SQL Server, VAULT schema) and work and writing in work directory.
Number of observations its almost the same:
We asked customer about any change in network traffic, and they said still the same.
Thanks in advance.
For a process to complete, much more needs to be done than the actual calculations on the CPU.
Your data has te be read and your results have to be written.
You might have to wait for other processes to finish first, and if your process includes multiple steps, writing to and reading from disk each time, you will have to wait for the CPU each time too.
In our situation, if real time is much larger than cpu time, we usually see much trafic to our Network File System (nfs).
As a programmer, you might notice that storing intermediate results in WORK is more efficient then on remote libraries.
You might safe much time by creating intermediate results as views instead of tables, IF you only use them once. That is not only possible in SQL, but also in data steps like this
data MY_RESULT / view=MY_RESULT;
set MY_DATA;
where transaction_date between '1jan2022'd and 30jun2022'd;
run;

Maximum number of concurrent requests a webserver can serve assuming average service time to be known

Is it logical to say: "If average service time for a request is X and affordable waiting time for the requests is Y then maximum number of concurrent requests to serve would be Y / X" ?
I think what I'm asking is that if there're any hidden factors that I'm not taking into account!?
If you're talking specifically about webservers, then no, your formula doesn't work, because webservers are designed to handle multiple, simultaneous requests, using forking or threading.
This turns the formula into something far harder to quantify - in my experience, web servers can handle LOTS (i.e. hundreds or thousands) of concurrent requests which consume little or no time, but tend to reduce that concurrency quite dramatically as the requests consume more time.
That means that "average service time" isn't massively useful - it can hide wide variations, and it's actually the outliers that affect you the most.
Broadly yes, but your service provider (webserver in your case) is capable of handling more than one request in parallel, so you should take that into account. I assume you measured end to end service time and havent already averaged it by number of parallel streams. One other thing you didnt and cannot realistically measure is the delay to/from your website.
What you are heading towards is the Erlang unit (not the language using the same name) which is used to described how much load a system can take. Erlangs are unitless (it is just a number) and originated from old school telephony, POTS, where it was used to describe how many wires were needed to handle X calls per time period with low blocking probability. Beyond erlang is engset which is used more for high capacity systems, such as mobile systems.
It also gets used for expensive consultant reports into realtime computer systems and databases to describe the point at which performance degradation is likely to occur. Wikipedia has an article on this http://en.wikipedia.org/wiki/Erlang_(unit) and the book 'Fixed and mobile telecommunications, network systems and services' has a good chapter on performance analysis.
While aimed at telephone systems, just replace with word webserver and it behaves the same. A webserver is the same concept, load is offered that arrives at random intervals to a system with finite parallel capacity. In your case, you can probably calculate total load with load tools easier than parallel capacity and then back calculate the formulas. This is widely done to gain a level of confidence in overall system models.
Erlang/engsetformulas are really useful when you have a randomly arriving load over parallel stream (ie web requests) and a service time that can only be averaged or estimated (ie it varies in real life). You can then calculate the blocking probability, which is the probability a new request will need to wait while current requests are serviced, and how long it will wait. It also helps analyse whether you need to handle more requests in parallel, or make each faster (#lines and holding time in erlang speak)
You will probably look into queuing systems analysis next, as a soon as requests block (queue), the models change slightly.
many factors are not taken into account
memory limits
data locking constraints such as people wanting to update the same data
application latency
caching mechanisms
different users will have different tasks on the site and put different loads
That said, one easy way to get a rough estimate is with apache ab tool (apache benchmark)
Example, get 1000 times the homepage with 100 requests at a time:
ab -c 100 -n 1000 http://www.example.com/

How to do load testing using jmeter and visualVM?

I want to do load testing for 10 million users for my site. The site is a Java based web-app. My approach is to create a Jmeter test plan for all the links and then take a report for the 10 million users. Then use jvisualVM to do profiling and check if there are any bottlenecks.
Is there any better way to do this? Is there any existing demo for doing this? I am doing this for the first time, so any assistance will be very helpful.
You are on the correct path, but your load limit is of with a high factor.
Why I'm saying this is cause your site probably will need more machine to handle 10Milj Concurrent users. A process alone would probably struggle to handle concurrent 32K TCP-streams. Also do some math of the bandwidth it would take to actually handle 10Milj users.
Now I do not know what kind of service you thinking of providing on your site, but when thinking of that JVisualVM slows down processing by a factor 10 (or more for method tracing), you would not actually measure the "real world" if you got JMeter and JVisualVM to work at the same time.
JVisualVM is more useful when you run on lower loads.
To create a good measurement first make sure your have a good baseline.
Make a test with 10 concurrent users, connect up JVisuamVM and let it run for a while, not down all interesting values.
After you have your baseline, then you can start adding more load.
Add 10times the load (ea: 100 users), look at the changes in JVisualVM. Continue this until it becomes obvious that JVisualVM slows you down, for every time to add extra load, make sure you have written down the numbers your are interested in. Plot down the numbers in a graph.
Now... Interpolate the graph (by hand) for the number of users you want. This works for memory usage, disc access etc, but not for used CPU time, cause JVisualVM will eat CPU and give you invalid numbers on that (especially if you have method tracing turned on).
If you really want to go as high as 10Milj users, I would not trust JMeter either, I would write a little test program of my own that performs the test you want. This would be okey, since the the setting up the site to handle 10Milj will also take time, so spending a little extra time of the test tools are not a waste.
Just because you have 10 million users in the database, doesn't mean that you need to load test using that many users. Think about it - is your site really going to have 10 million simultaneous users? For web applications, a ratio of 1:100 registered users is common i.e. you are unlikely to have more than 100K users at any moment.
Can JMeter handle that kind of load? I doubt it. Please try faban instead. It is very light-weight and can support thousands of users on a single VM. You also have much better flexibility in creating your workload and can also automate monitoring of your entire test infrastructure.
Now to the analysis part. You didn't say what server you were using. Any Java appserver will provide sufficient monitoring support. Commercial servers provide nice GUI tools while Tomcat provides extensive monitoring via JMX. You may want to start here before getting down to the JVM level.
For the JVM, you really don't want to use VisualVM while running such a large performance test. Besides to support such a load, I assume you are using multiple appserver/JVM instances. The major performance issue is usually GC, so use the JVM options to collect and log GC information. You will have to post-process the data.
This is a non-trivial exercise - good luck!
There are two types of load testing - bottleneck identification and throughput. The question leads me to believe this is about bottlenecks, so number of users is a something of a red herring, instead the goal being for a given configuration finding areas that can be improved to increase concurrency.
Application bottlenecks usually fall into three categories: database, memory leak, or slow algorithm. Finding them involves putting the application in question under stress (i.e. load) for an extended period of time - at least an hour, perhaps up to several days. Jmeter is a good tool for this purpose. One of the things to consider is running the same test with cookie handling enabled (i.e. Jmeter retains cookies and sends with each subsequent request) and disabled - sometimes you get very different results and this is important because the latter is effectively a simulation of what some crawlers do to your site. Details for bottleneck detection follow:
Database
Tables without indices or SQL statements involving multiple joins are frequent app bottlenecks. Every database server I've dealt with, MySQL, SQL Server, and Oracle has some way of logging or identifying slow running SQL statements. MySQL has the slow query log, whereas SQL Server has dynamic management views that track the slowest running SQL. Once you've got your hands on the slow statements use explain plan to see what the database engine is trying to do, use any features that suggest indices, and consider other strategies - such as denormalization - if those two options do not solve the bottleneck.
Memory Leak
Turn on verbose garbage collection logging and a JMX monitoring port. Then use jConsole, which provides much better graphs, to observe trends. In particular leaks usually show up as filling the Old Gen or Perm Gen spaces. Leaks are a bottleneck with the JVM spends increasing amounts of time attempting garbage collection unsuccessfully until an OOM Error is thrown.
Perm Gen implies the need to increase the space as a command line parameter to the JVM. While Old Gen implies a leak where you should stop the load test, generate a heap dump, and then use Eclipse Memory Analysis Tool to identify the leak.
Slow Algorithm
This is more difficult to track down. The most frequent offenders are synchronization, inter process communication (e.g. RMI, web services), and disk I/O. Another common issue is code using nested loops (look mom O(n^2) performance!).
Best way I've found to find these issues absent some deeper knowledge is generating stack traces. These will tell what all threads are doing at a given point in time. What you're looking for are BLOCKED threads or several threads all accessing the same code. This usually points at some slowness within the codebase.
I blogged, the way I proceeded with the performance test:
Make sure that the server (hardware can be as per the staging/production requirements) has no other installations that can affect the performance.
For setting up the users in DB, a procedure can be used and can be called as a part of jmeter test plan.
Install jmeter on a separate machine, so that jmeter won't affect the performance.
Create a test plan in jmeter (as shown in the figure 1) for all the uri's, with response checking and timer based requests.
Take the initial benchmark, using jmeter.
Check for the low performance uri's. These are the points to expect for bottlenecks.
Try different options for performance improvement, but focus on only one bottleneck at a time.
Try any one fix from step 6 and then take an benchmark. If there is any improvement commit the changes and repeat from step 5. Otherwise revert and try for any other options from step 6.
The next step would be to use load balancing, hardware scaling, clustering, etc. This may include some physical setup and hardware/software cost. Give the results with the scalability options.
For detailed explanation: http://www.daemonthread.com/2011/06/site-performance-tuning-using-jmeter.html
I started using JMeter plugins.
This allows me to gather application metrics available over JMX to use in my Load Test.

How to decide on what hardware to deploy web application

Suppose you have a web application, no specific stack (Java/.NET/LAMP/Django/Rails, all good).
How would you decide on which hardware to deploy it? What rules of thumb exist when determining how many machines you need?
How would you formulate parameters such as concurrent users, simultaneous connections, daily hits and DB read/write ratio to a decision on how much, and which, hardware you need?
Any resources on this issue would be very helpful...
Specifically - any hard numbers from real world experience and case studies would be great.
Capacity Planning is quite a detailed and extensive area. You'll need to accept an iterative model with a "Theoretical Baseline > Load Testing > Tuning & Optimizing" approach.
Theory
The first step is to decide on the Business requirements: how many users are expected for peak usage ? Remember - these numbers are usually inaccurate by some margin.
As an example, let's assume that all the peak traffic (at worst case) will be over 4 hours of the day. So if the website expects 100K hits per day, we dont divide that over 24 hours, but over 4 hours instead. So my site now needs to support a peak traffic of 25K hits per hour.
This breaks down to 417 hits per minute, or 7 hits per second. This is on the front end alone.
Add to this the number of internal transactions such as database operations, any file i/o per user, any batch jobs which might run within the system, reports etc.
Tally all these up to get the number of transactions per second, per minute etc that your system needs to support.
This gets further complicated when you have requirements such as "Avg response time must be 3 seconds etc" which means you have to figure in network latency / firewall / proxy etc
Finally - when it comes to choosing hardware, check out the published datasheets from each manufacturer such as Sun, HP, IBM, Windows etc. These detail the maximum transactions per second under test conditions. We usually accept 50% of those peaks under real conditions :)
But ultimately the choice of the hardware is usually a commercial decision.
Also you need to keep a minimum of 2 servers at each tier : web / app / even db for failover clustering.
Load testing
It's recommended to have a separate reference testing environment throughout the project lifecycle and post-launch so you can come back to run dedicated performance tests on the app. Scale this to be a smaller version of production, so if Prod has 4 servers and Ref has 1, then you test for 25% of the peak transactions etc.
Tuning & Optimizing
Too often, people throw some expensive hardware together and expect it all to work beautifully. You'll need to tune the hardware and OS for various parameters such as TCP timeouts etc - these are published by the software vendors, and these have to be done once the software are finalized. Set these tuning params on the Ref env, test and then decide which ones you need to carry over to Production.
Determine your expected load.
Setup a machine and run some tests against it with a Load testing tool.
How close are you if you only accomplished 10% of the peak load with some margin for error then you know you are going to need some load balancing. Design and implement a solution and test again. Make sure you solution is flexible enough to scale.
Trial and error is pretty much the way to go. It really depends on the individual app and usage patterns.
Test your app with a sample load and measure performance and load metrics. DB queries, disk hits, latency, whatever.
Then get an estimate of the expected load when deployed (go ask the domain expert) (you have to consider average load AND spikes).
Multiply the two and add some just to be sure. That's a really rough idea of what you need.
Then implement it, keeping in mind you usually won't scale linearly and you probably won't get the expected load ;)

How to create a system with 1500 servers that deliver results instantaneously?

I want to create a system that delivers user interface response within 100ms, but which requires minutes of computation. Fortunately, I can divide it up into very small pieces, so that I could distribute this to a lot of servers, let's say 1500 servers. The query would be delivered to one of them, which then redistributes to 10-100 other servers, which then redistribute etc., and after doing the math, results propagate back again and are returned by a single server. In other words, something similar to Google Search.
The problem is, what technology should I use? Cloud computing sounds obvious, but the 1500 servers need to be prepared for their task by having task-specific data available. Can this be done using any of the existing cloud computing platforms? Or should I create 1500 different cloud computing applications and upload them all?
Edit: Dedicated physical servers does not make sense, because the average load will be very, very small. Therefore, it also does not make sense, that we run the servers ourselves - it needs to be some kind of shared servers at an external provider.
Edit2: I basically want to buy 30 CPU minutes in total, and I'm willing to spend up to $3000 on it, equivalent to $144,000 per CPU-day. The only criteria is, that those 30 CPU minutes are spread across 1500 responsive servers.
Edit3: I expect the solution to be something like "Use Google Apps, create 1500 apps and deploy them" or "Contact XYZ and write an asp.net script which their service can deploy, and you pay them based on the amount of CPU time you use" or something like that.
Edit4: A low-end webservice provider, offering asp.net at $1/month would actually solve the problem (!) - I could create 1500 accounts, and the latency is ok (I checked), and everything would be ok - except that I need the 1500 accounts to be on different servers, and I don't know any provider that has enough servers that is able to distribute my accounts on different servers. I am fully aware that the latency will differ from server to server, and that some may be unreliable - but that can be solved in software by retrying on different servers.
Edit5: I just tried it and benchmarked a low-end webservice provider at $1/month. They can do the node calculations and deliver results to my laptop in 15ms, if preloaded. Preloading can be done by making a request shortly before the actual performance is needed. If a node does not respond within 15ms, that node's part of the task can be distributed to a number of other servers, of which one will most likely respond within 15ms. Unfortunately, they don't have 1500 servers, and that's why I'm asking here.
[in advance, apologies to the group for using part of the response space for meta-like matters]
From the OP, Lars D:
I do not consider [this] answer to be an answer to the question, because it does not bring me closer to a solution. I know what cloud computing is, and I know that the algorithm can be perfectly split into more than 300,000 servers if needed, although the extra costs wouldn't give much extra performance because of network latency.
Lars,
I sincerely apologize for reading and responding to your question at a naive and generic level. I hope you can see how both the lack of specifity in the question itself, particularly in its original form, and also the somewhat unusual nature of the problem (1) would prompt me respond to the question in like fashion. This, and the fact that such questions on SO typically emanate from hypotheticals by folks who have put but little thought and research into the process, are my excuses for believing that I, a non-practionner [of massively distributed systems], could help your quest. The many similar responses (some of which had the benefits of the extra insight you provided) and also the many remarks and additional questions addressed to you show that I was not alone with this mindset.
(1) Unsual problem: An [apparently] mostly computational process (no mention of distributed/replicated storage structures), very highly paralellizable (1,500 servers), into fifty-millisecondish-sized tasks which collectively provide a sub-second response (? for human consumption?). And yet, a process that would only be required a few times [daily..?].
Enough looking back!
In practical terms, you may consider some of the following to help improve this SO question (or move it to other/alternate questions), and hence foster the help from experts in the domain.
re-posting as a distinct (more specific) question. In fact, probably several questions: eg. on the [likely] poor latency and/or overhead of mapreduce processes, on the current prices (for specific TOS and volume details), on the rack-awareness of distributed processes at various vendors etc.
Change the title
Add details about the process you have at hand (see many questions in the notes of both the question and of many of the responses)
in some of the questions, add tags specific to a give vendor or technique (EC2, Azure...) as this my bring in the possibly not quite unbuyist but helpful all the same, commentary from agents at these companies
Show that you understand that your quest is somewhat of a tall order
Explicitly state that you wish responses from effective practionners of the underlying technologies (maybe also include folks that are "getting their feet wet" with these technologies as well, since with the exception of the physics/high-energy folks and such, who BTW traditionnaly worked with clusters rather than clouds, many of the technologies and practices are relatively new)
Also, I'll be pleased to take the hint from you (with the implicit non-veto from other folks on this page), to delete my response, if you find that doing so will help foster better responses.
-- original response--
Warning: Not all processes or mathematical calculations can readily be split in individual pieces that can then be run in parallel...
Maybe you can check Wikipedia's entry from Cloud Computing, understanding that cloud computing is however not the only architecture which allows parallel computing.
If your process/calculation can efficitively be chunked in parallelizable pieces, maybe you can look into Hadoop, or other implementations of MapReduce, for an general understanding about these parallel processes. Also, (and I believe utilizing the same or similar algorithms), there also exist commercially available frameworks such as EC2 from amazon.
Beware however that the above systems are not particularly well suited for very quick response time. They fare better with hour long (and then some) data/number crunching and similar jobs, rather than minute long calculations such as the one you wish to parallelize so it provides results in 1/10 second.
The above frameworks are generic, in a sense that they could run processes of most any nature (again, the ones that can at least in part be chunked), but there also exist various offerings for specific applications such as searching or DNA matching etc. The search applications in particular can have very short response times (cf Google for example) and BTW this is in part tied to fact that such jobs can very easily and quickly be chunked for parallel processing.
Sorry, but you are expecting too much.
The problem is that you are expecting to pay for processing power only. Yet your primary constraint is latency, and you expect that to come for free. That doesn't work out. You need to figure out what your latency budgets are.
The mere aggregating of data from multiple compute servers will take several milliseconds per level. There will be a gaussian distribution here, so with 1500 servers the slowest server will respond after 3σ. Since there's going to be a need for a hierarchy, the second level with 40 servers , where again you'll be waiting for the slowest server.
Internet roundtrips also add up quickly; that too should take 20 to 30 ms of your latency budget.
Another consideration is that these hypothethical servers will spend much of their time idle. That means they're powered on, drawing electricity yet not generating revenue. Any party with that many idle servers would turn them off, or at the very least in sleep mode just to conserve electricity.
MapReduce is not the solution! Map Reduce is used in Google, Yahoo and Microsoft for creating the indexes out of the huge data (the whole Web!) they have on their disk. This task is enormous and Map Reduce was built to make it happens in hours instead of years, but starting a Master controller of Map Reduce is already 2 seconds, so for your 100ms this is not an option.
Now, from Hadoop you may get advantages out of the distributed file system. It may allow you to distribute the tasks close to where the data is physically, but that's it. BTW: Setting up and managing an Hadoop Distributed File System means controlling your 1500 servers!
Frankly in your budget I don't see any "cloud" service that will allow you to rent 1500 servers. The only viable solution, is renting time on a Grid Computing solution like Sun and IBM are offering, but they want you to commit to hours of CPU from what I know.
BTW: On Amazon EC2 you have a new server up in a couple of minutes that you need to keep for an hour minimum!
Hope you'll find a solution!
I don't get why you would want to do that, only because "Our user interfaces generally aim to do all actions in less than 100ms, and that criteria should also apply to this".
First, 'aim to' != 'have to', its a guideline, why would u introduce these massive process just because of that. Consider 1500 ms x 100 = 150 secs = 2.5 mins. Reducing the 2.5 mins to a few seconds its a much more healthy goal. There is a place for 'we are processing your request' along with an animation.
So my answer to this is - post a modified version of the question with reasonable goals: a few secs, 30-50 servers. I don't have the answer for that one, but the question as posted here feels wrong. Could even be 6-8 multi-processor servers.
Google does it by having a gigantic farm of small Linux servers, networked together. They use a flavor of Linux that they have custom modified for their search algorithms. Costs are software development and cheap PC's.
It would seem that you are indeed expecting at least 1000-fold speedup from distributing your job to a number of computers. That may be ok. Your latency requirement seems tricky, though.
Have you considered the latencies inherent in distributing the job? Essentially the computers would have to be fairly close together in order to not run into speed of light issues. Also, the data center in which the machines would be would again have to be fairly close to your client so that you can get your request to them and back in less than 100 ms. On the same continent, at least.
Also note that any extra latency requires you to have many more nodes in the system. Losing 50% of available computing time to latency or anything else that doesn't parallelize requires you to double the computing capacity of the parallel portions just to keep up.
I doubt a cloud computing system would be the best fit for a problem like this. My impression at least is that the proponents of cloud computing would prefer to not even tell you where your machines are. Certainly I haven't seen any latency terms in the SLAs that are available.
You have conflicting requirements. You're requirement for 100ms latency is directly at odds with your desire to only run your program sporadically.
One of the characteristics of the Google-search type approach you mentioned in your question is that the latency of the cluster is dependent on the slowest node. So you could have 1499 machines respond in under 100ms, but if one machine took longer, say 1s - whether due to a retry, or because it needed to page you application in, or bad connectivity - your whole cluster would take 1s to produce an answer. It's inescapable with this approach.
The only way to achieve the kinds of latencies you're seeking would be to have all of the machines in your cluster keep your program loaded in RAM - along with all the data it needs - all of the time. Having to load your program from disk, or even having to page it in from disk, is going to take well over 100ms. As soon as one of your servers has to hit the disk, it is game over for your 100ms latency requirement.
In a shared server environment, which is what we're talking about here given your cost constraints, it is a near certainty that at least one of your 1500 servers is going to need to hit the disk in order to activate your app.
So you are either going to have to pay enough to convince someone to keep you program active and in memory at all times, or you're going to have to loosen your latency requirements.
Two trains of thought:
a) if those restraints are really, absolutely, truly founded in common sense, and doable in the way you propose in the nth edit, it seems the presupplied data is not huge. So how about trading storage for precomputation to time. How big would the table(s) be? Terabytes are cheap!
b) This sounds a lot like a employer / customer request that is not well founded in common sense. (from my experience)
Let´s assume the 15 minutes of computation time on one core. I guess thats what you say.
For a reasonable amount of money, you can buy a system with 16 proper, 32 hyperthreading cores and 48 GB RAM.
This should bring us in the 30 second range.
Add a dozen Terabytes of storage, and some precomputation.
Maybe a 10x increase is reachable there.
3 secs.
Are 3 secs too slow? If yes, why?
Sounds like you need to utilise an algorithm like MapReduce: Simplified Data Processing on Large Clusters
Wiki.
Check out Parallel computing and related articles in this WikiPedia-article - "Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers." ... http://en.wikipedia.org/wiki/Parallel_computing
Although Cloud Computing is the cool new kid in town, your scenario sounds more like you need a cluster, i.e. how can I use parallelism to solve a problem in a shorter time.
My solution would be:
Understand that if you got a problem that can be solved in n time steps on one cpu, does not guarantee that it can be solved in n/m on m cpus. Actually n/m is the theoretical lower limit. Parallelism is usually forcing you to communicate more and therefore you'll hardly ever achieve this limit.
Parallelize your sequential algorithm, make sure it is still correct and you don't get any race conditions
Find a provider, see what he can offer you in terms of programming languages / APIs (no experience with that)
What you're asking for doesn't exist, for the simple reason that doing this would require having 1500 instances of your application (likely with substantial in-memory data) idle on 1500 machines - consuming resources on all of them. None of the existing cloud computing offerings bill on such a basis. Platforms like App Engine and Azure don't give you direct control over how your application is distributed, while platforms like Amazon's EC2 charge by the instance-hour, at a rate that would cost you over $2000 a day.

Resources