I have 5 Scenarios in total, and 70 Users segregated to different Scenarios which runs for around 15 Minutes only with 1 Loop configuration.
Is it ideal test duration to evaluate the realistic Performance results?
Or do I need to adjust with the test duration?
Any suggestion on this is highly appreciated.
Thanks
It depends on what you're trying to achieve. 70 concurrent users doesn't look like a real "load" to me, moreover given you have only one loop you may run into the situation when some users have already finished executing their scenarios and were shut down and some are still running or even have not yet been started. So I would recommend monitoring the real concurrency using i.e. Active Threads Over Time listener to see how many users were online at the given stage of the test.
Normally the following testing types are conducted:
Load testing - putting the system under anticipated load and ensuring that main metrics (i.e. response time and throughput) are matching NFRs or SLAs
Soak testing - basically the same as load testing, but it assumes prolonged duration (several hours, overnight or over the weekend). This testing type allows to discover obvious and non-obvious memory leaks
Stress testing - starting with anticipated number of users and gradually increasing the load until response time starts exceeding acceptable threshold or errors start occurring (whatever comes the first). If will shed some light on the slowest or most fragile component, to wit the first performance bottleneck
Check out Why ‘Normal’ Load Testing Isn’t Enough article for more information on the aforementioned performance testing types.
No matter which test you're conducting consider increasing (and decreasing) the load gradually, i.e. come up with proper ramp-up (and ramp-down) strategies, this way you will be able to correlate increasing load with i.e. increasing response time
Performance Tests in java is a bit tricky, it can vary wildly depending on what other programs are running on the system and what its load is.
In an ideal world, you need to use a dedicated system, if you can't make sure to quit all programs you're running (including the IDE), The Java HotSpot compiler kicks in when it sees a ‘hot spot’ in your code. It is therefore quite common that your code will run faster over time! So, you should adapt and repeat your testing methods, investigate memory and CPU usage.
or even better you can use a profiler. There are plenty around, both free profilers and demos / time-locked trials of commercials strength ones.
Related
I have a high-performance software server application that is expected to get increased traffic in the next few months.
I was wondering what approach or methodology is good to use in order to gauge if the server still has the capacity to handle this increased load?
I think you're looking for Stress Testing and the scenario would be something like:
Create a load test simulating current real application usage
Start with current number of users and gradually increase the load until
you reach the "increased traffic" amount
or errors start occurring
or you start observing performance degradation
whatever comes the first
Depending on the outcome you either can state that your server can handle the increased load without any issues or you will come up with the saturation point and the first bottleneck
You might also want to execute a Soak Test - leave the system under high prolonged load for several hours or days, this way you can detect memory leaks or other capacity problems.
More information: Why ‘Normal’ Load Testing Isn’t Enough
Test the product with one-tenth the data and traffic. Be sure the activity is 'realistic'.
Then consider what will happen as traffic grows -- with the RAM, disk, cpu, network, etc, grow linearly or not?
While you are doing that, look for "hot spots". Optimize them.
Will you be using web pages? Databases? Etc. Each of these things scales differently. (In other words, you have not provided enough details in your question.)
Most canned benchmarks focus on one small aspect of computing; applying the results to a specific application is iffy.
I would start by collecting base line data on critical resources - typically, CPU, memory usage, disk usage, network usage - and track them over time. If any of those resources show regular spikes where they remain at 100% capacity for more than a fraction of a second, under current usage, you have a bottleneck somewhere. In this case, you cannot accept additional load without likely outages.
Next, I'd start figuring out what the bottleneck resource for your application is - it varies between applications, but in most cases it's the bottleneck resource that stops you from scaling further. Your CPU might be almost idle, but you're thrashing the disk I/O, for instance. That's a tricky process - load and stress testing are the way to go.
If you can resolve the bottleneck by buying better hardware, do so - it's much cheaper than rewriting the software. If you can't buy better hardware, look at load balancing. If you can't load balance, you've got to look at application architecture and implementation and see if there are ways to move the bottleneck.
It's quite typical for the bottleneck to move from one resource to the next - you've got CPU to behave, but now when you increase traffic, you're spiking disk I/O; once you resolve that, you may get another CPU challenge.
Please can you provide me the detailed analysis of the attached graph,Just a general review.This is a composite graph related to transaction,response and active threads.
I am not able to explain this graph to my client.
Configuration-
Thread-500
Ramp up-100
loop count-1.
Is it good to go or not,or what about the spikes,errors that needs to be handle.What are the reason for this error and what can we do to fix it.Please let me know.enter image description here
http://i.stack.imgur.com/IKfsi.png
At peak transactions(red-line) we would have expected peak errors(pink-line) and peak response time(blue-line). Oddly however, the peak errors occurred on the down ramp of transactions, implying a lag in processing while response times abnormally continued to increase. Based on my experience this implies strained system resources, try increasing the server's system resources and retest for comparison.
2 screenshots is not enough to perform analysis, however looking into them I can figure out the following:
On well-behaved system throughput should increase as the load increases. On this one it doesn't.
Response time continues to grow even when load ended. It normally indicates memory leak on application under test side. Try going for Soak Testing - put the application under prolonged load and see if it fail with the form of out of memory error.
I would also recommend monitoring system resources on application under test side as application can misbehave if it lacks i.e. RAM or due to slow disk. It might be cheaper to add RAM or replace disk than to tune the application to fit into existing hardware.
When using dotTrace, I have to pick a profiling mode and a time measurement method. Profiling modes are:
Tracing
Line-by-line
Sampling
And time measurement methods are:
Wall time (performance counter)
Thread time
Wall time (CPU instruction)
Tracing and line-by-line can't use thread time measurement. But that still leaves me with seven different combinations to try. I've now read the dotTrace help pages on these well over a dozen times, and I remain no more knowledgeable than I started out about which one to pick.
I'm working on a WPF app that reads Word docs, extracts all the paragraphs and styles, and then loops through that extracted content to pick out document sections. I'm trying to optimize this process. (Currently it takes well over an hour to complete, so I'm trying to profile it for a given length of time rather than until it finishes.)
Which profiling and time measurement types would give me the best results? Or if the answer is "It depends", then what does it depend on? What are the pros and cons of a given profiling mode or time measurement method?
Profiling types:
Sampling: fastest but least accurate profiling-type, minimum profiler overhead. Essentially equivalent to pausing the program many times a second and viewing the stacktrace; thus the number of calls per method is approximate. Still useful for identifying performance bottlenecks at the method-level.
Snapshots captured in sampling mode occupy a lot less space on disk (I'd say 5-6 less space.)
Use for initial assessment or when profiling a long-running application (which sounds like your case.)
Tracing: Records the duration taken for each method. App under profiling runs slower but in return, dotTrace shows exact number of calls of each function, and function timing info is more accurate. This is good for diving into details of a problem at the method-level.
Line-by-line: Profiles the program on a per-line basis. Largest resource hog but most fine-grained profiling results. Slows the program way down. The preferred tactic here is to initially profile using another type, and then hand-pick functions for line-by-line profiling.
As for meter kinds, I think they are described quite well in Getting started with dotTrace Performance by the great Hadi Hariri.
Wall time (CPU Instruction): This is the simplest and fastest way to measure wall time (that is, the
time we observe on a wall clock). However, on some older multi-core processors this may produce
incorrect results due to the cores timers being desynchronized. If this is the case, it is recommended
to use Performance Counter.
Wall time (Performance Counter): Performance counters is part of the Windows API and it allows
taking time samples in a hardware-independent way. However, being an API call, every measure takes
substantial time and therefore has an impact on the profiled application.
Thread time: In a multi-threaded application concurrent threads contribute to each other's wall time.
To avoid such interference we can use thread time meter which makes system API calls to get the
amount of time given by the OS scheduler to the thread. The downsides are that taking thread time
samples is much slower than using CPU counter and the precision is also limited by the size of
quantum used by thread scheduler (normally 10ms). This mode is only supported when the Profiling
Type is set to Sampling
However they don't differ too much.
I'm not a wizard in profiling myself but in your case I'd start with sampling to get a list of functions that take ridiculously long to execute, and then I'd mark them for line-by-line profiling.
All too often I read statements about some new framework and their "benchmarks." My question is a general one but to the specific points of:
What approach should a developer take to effectively instrument code to measure performance?
When reading about benchmarks and performance testing, what are some red-flags to watch out for that might not represent real results?
There are two methods of measuring performance: using code instrumentation and using sampling.
The commercial profilers (Hi-Prof, Rational Quantify, AQTime) I used in the past used code instrumentation (some of them could also use sampling) and in my experience, this gives the best, most detailed result. Especially Rational Quantity allow you to zoom in on results, focus on sub trees, remove complete call trees to simulate an improvement, ...
The downside of these instrumenting profilers is that they:
tend to be slow (your code runs about 10 times slower)
take quite some time to instrument your application
don't always correctly handle exceptions in the application (in C++)
can be hard to set up if you have to disable the instrumentation of DLL's (we had to disable instrumentation for Oracle DLL's)
The instrumentation also sometimes skews the times reported for low-level functions like memory allocations, critical sections, ...
The free profilers (Very Sleepy, Luke Stackwalker) that I use use sampling, which means that it is much easier to do a quick performance test and see where the problem lies. These free profilers don't have the full functionality of the commercial profilers (although I submitted the "focus on subtree" functionality for Very Sleepy myself), but since they are fast, they can be very useful.
At this time, my personal favorite is Very Sleepy, with Luke StackWalker coming second.
In both cases (instrumenting and sampling), my experience is that:
It is very difficult to compare the results of profilers over different releases of your application. If you have a performance problem in your release 2.0, profile your release 2.0 and try to improve it, rather than looking for the exact reason why 2.0 is slower than 1.0.
You must never compare the profiling results with the timing (real time, cpu time) results of an application that is run outside the profiler. If your application consumes 5 seconds CPU time outside the profiler, and when run in the profiler the profiler reports that it consumes 10 seconds, there's nothing wrong. Don't think that your application actually takes 10 seconds.
That's why you must consistently check results in the same environment. Consistently compare results of your application when run outside the profiler, or when run inside the profiler. Don't mix the results.
Also use a consistent environment and system. If you get a faster PC, your application could still run slower, e.g. because the screen is larger and more needs to be updated on screen. If moving to a new PC, retest the last (one or two) releases of your application on the new PC so you get an idea on how times scale to the new PC.
This also means: use fixed data sets and check your improvements on these datasets. It could be that an improvement in your application improves the performance of dataset X, but makes it slower with dataset Y. In some cases this may be acceptible.
Discuss with the testing team what results you want to obtain beforehand (see Oded's answer on my own question What's the best way to 'indicate/numerate' performance of an application?).
Realize that a faster application can still use more CPU time than a slower application, if the faster one uses multi-threading and the slower one doesn't. Discuss (as said before) with the testing time what needs to be measured and what doesn't (in the multi-threading case: real time instead of CPU time).
Realize that many small improvements may lead to one big improvement. If you find 10 parts in your application that each take 3% of the time and you can reduce it to 1%, your application will be 20% faster.
It depends what you're trying to do.
1) If you want to maintain general timing information, so you can be alert to regressions, various instrumenting profilers are the way to go. Make sure they measure all kinds of time, not just CPU time.
2) If you want to find ways to make the software faster, that is a distinctly different problem.
You should put the emphasis on the find, not on the measure.
For this, you need something that samples the call stack, not just the program counter (over multiple threads, if necessary). That rules out profilers like gprof.
Importantly, it should sample on wall-clock time, not CPU time, because you are every bit as likely to lose time due to I/O as due to crunching. This rules out some profilers.
It should be able to take samples only when you care, such as not when waiting for user input. This also rules out some profilers.
Finally, and very important, is the summary you get.
It is essential to get per-line percent of time.
The percent of time used by a line is the percent of stack samples containing the line.
Don't settle for function-only timings, even with a call graph.
This rules out still more profilers.
(Forget about "self time", and forget about invocation counts. Those are seldom useful and often misleading.)
Accuracy of finding the problems is what you're after, not accuracy of measuring them. That is a very important point. (You don't need a large number of samples, though it does no harm. The harm is in your head, making you think about measuring, rather than what is it doing.)
One good tool for this is RotateRight's Zoom profiler. Personally I rely on manual sampling.
Any idea how to do performance and scalability testing if no clear performance requirements have been defined?
More information about my application.
The application has 3 components. One component can only run on Linux, the other two components are Java programs so they can run on Linux/Windows/Mac... The 3 components can be deployed to one box or each component can be deployed to one box. Deployment is very flexible. The Linux-only component will capture raw TCP/IP packages over the network, then one Java component will get those raw data from it and assemble them into the data end users will need and output them to hard disk as data files. The last Java component will upload data from data files to my database in batch.
In the absence of 'must be able to perform X iterations within Y seconds...' type requirements, how about these kinds of things:
Does it take twice as long for twice the size of dataset? (yes = good)
Does it take 10x as long for twice the size of dataset? (yes = bad)
Is it CPU bound?
Is it RAM bound (eg lots of swapping to virtual memory)?
Is it IO / Disk bound?
Is there a certain data-set size at which performance suddenly falls off a cliff?
Surprisingly this is how most perf and scalability tests start.
You can clearly do the testing without criteria, you just define the tests and measure the results. I think your question is more in the lines 'how can I establish test passing criteria without performance requirements'. Actually this is not at all uncommon. Many new projects have no clear criteria established. Informally it would be something like 'if it cannot do X per second we failed'. But once you passed X per second (and you better do!) is X the 'pass' criteria? Usually not, what happens is that you establish a new baseline and your performance tests guard against regression: you compare your current numbers with the best you got, and decide if the new build is 'acceptable' as build validation pass (usually orgs will settle here at something like 70-80% as acceptable, open perf bugs, and make sure that by ship time you get back to 90-95% or 100%+. So basically the performance test themselves become their own requirement.
Scalability is a bit more complicated, because there there is no limit. The scope of your test should be to find out where does the product break. Throw enough load at anything and eventually it will break. You need to know where that limit is and, very importantly, find out how does your product break. Does it give a nice error message and revert or does it spills its guts on the floor?
Define your own. Take the initiative and describe the performance goals yourself.
To answer any better, we'd have to know more about your project.
If there has been 'no performance requirement defined', then why are you even testing this?
If there is a performance requirement defined, but it is 'vague', can you indicate in what way it is vague, so that we can better help you?
Short of that, start from the 'vague' requirement, and pick a reasonable target that at least in your opinion meets or exceeds the vague requirement, then go back to the customer and get them to confirm that your clarification meets their requirements and ideally get formal sign-off on that.
Some definitions / assumptions:
Performance = how quickly the application responds to user input, e.g. web page load times
Scalability = how many peak concurrent users the applicaiton can handle.
Firstly perfomance. Performance testing can be quite simple, such as measuring and recording page load times in a development environment and using techniques like applicaiton profiling to identify and fix bottlenecks.
Load. To execute a load test there are four key factors, you will need to get all of these in place to be successfull.
1. Good usage models of how users will use your site and/or application. This can be easy of the application is already in use, but it can be extermely difficult if you are launching a something new, e.g. a Facebook application.
If you can't get targets as requirements, do some research and make some educated assumptions, document and circulate them for feedback.
2. Tools. You need to have performance testing scripts and tools that can excute the scenarios defined in step 1, with the number of expected users in step 1. (This can be quite expensive)
3. Environment. You will need a production like environment that is isolated so your tests can produce repoducible results. (This can also be very expensive.)
4. Technical experts. Once the applicaiton and environment starts breaking you will need to be able to identify the faults and re-configure the environment and or re-code the application once faults are found.
Generally most projects have a "performance testing" box that they need to tick because of some past failure, however they never plan or budget to do it properley. I normally recommend to do budget for and do scalability testing properley or save your money and don't do it at all. Trying to half do it on the cheap is a waste of time.
However any good developer should be able to do performance testing on their local machine and get some good benefits.
rely on tools (fxcop comes to mind)
rely on common sense
If you want to test performance and scalability with no requirements then you should create your own requirements / specs that can be done in the timeline / deadline given to you. After defining the said requirements, you should then tell your supervisor about it if he/she agrees.
To test scalability (assuming you're testing a program/website):
Create lots of users and data and check if your system and database can handle it. MyISAM table type in MySQL can get the job done.
To test performance:
Optimize codes, check it in a slow internet connection, etc.
Short answer: Don't do it!
In order to get a (better) definition write a performance test concept you can discuss with the experts that should define the requirements.
Make assumptions for everything you don't know and document these assumptions explicitly. Assumptions comprise everything that may be relevant to your system's behaviour under load. Correct assumptions will be approved by the experts, incorrect ones will provoke reactions.
For all of those who have read Tom DeMarcos latest book (Adrenaline Junkies ...): This is the strawman pattern. Most people who are not willing to write some specification from scratch will not hesitate to give feedback to your document. Because you need to guess several times when writing your version you need to prepare for being laughed at when being reviewed. But at least you will have better information.
The way I usually approach problems like this is just to get a real or simulated realistic workload and make the program go as fast as possible, within reason. Then if it can't handle the load I need to think about faster hardware, doing parts of the job in parallel, etc.
The performance tuning is in two parts.
Part 1 is the synchronous part, where I tune each "thread", under realistic workload, until it really has little room for improvement.
Part 2 is the asynchronous part, and it is hard work, but needs to be done. For each "thread" I extract a time-stamped log file of when each message sent, each message received, and when each received message is acted upon. I merge these logs into a common timeline of events. Then I go through all of it, or randomly selected parts, and trace the flow of messages between processes. I want to identify, for each message-sequence, what its purpose is (i.e. is it truly necessary), and are there delays between the time of receipt and time of processing, and if so, why.
I've found in this way I can "cut out the fat", and asynchronous processes can run very quickly.
Then if they don't meet requirements, whatever they are, it's not like the software can do any better. It will either take hardware or a fundamental redesign.
Although no clear performance and scalability goals are defined, we can use the high level description of the three components you mention to drive general performance/scalability goals.
Component 1: It seems like a network I/O bound component, so you can use any available network load simulators to generate various work load to saturate the link. Scalability can be measure by varying the workload (10MB, 100MB, 1000MB link ), and measuring the response time , or in a more precise way, the delay associated with receiving the raw data. You can also measure the working set of the links box to drive a realistic idea about your sever requirement ( how much extra memory needed to receive X more workload of packets, ..etc )
Component 2: This component has 2 parts, an I/O bound part ( receiving data from Component 1 ), and a CPU bound part ( assembling the packets ), you can look at the problem as a whole, make sure to saturate your link when you want to measure the CPU bound part, if is is a multi threaded component, you can look for ways to improve look if you don't get 100% CPU utilization, and you can measure time required to assembly X messages, from this you can calculate average wait time to process a message, this can be used later to drive the general performance characteristic of your system and provide and SLA for your users ( you are going to guarantee a response time within X millisecond for example ).
Component 3: Completely I/O bound, and depends on both your hard disk bandwidth, and the back-end database server you use, however you can measure how much do you saturate disk I/O to optimize throughput, how much I/O counts do you require to read X MB of data, and improve around these parameters.
Hope that helps.
Thanks