Related
I know the title of my question is rather vague, so I'll try to clarify as much as I can. Please feel free to moderate this question to make it more useful for the community.
Given a standard LAMP stack with more or less default settings (a bit of tuning is allowed, client-side and server-side caching turned on), running on modern hardware (16Gb RAM, 8-core CPU, unlimited disk space, etc), deploying a reasonably complicated CMS service (a Drupal or Wordpress project for arguments sake) - what amounts of traffic, SQL queries, user requests can I resonably expect to accommodate before I have to start thinking about performance?
NOTE: I know that specifics will greatly depend on the details of the project, i.e. optimizing MySQL queries, indexing stuff, minimizing filesystem hits - assuming web developers did a professional job - I'm really looking for a very rough figure in terms of visits per day, traffic during peak visiting times, how many records before (transactional) MySQL fumbles, so on.
I know the only way to really answer my question is to run load testing on a real project, and I'm concerned that my question may be treated as partly off-top.
I would like to get a set of figures from people with first-hand experience, e.g. "we ran such and such set-up and it handled at least this much load [problems started surfacing after such and such]". I'm also greatly interested in any condenced (I'm short on time atm) reading I can do to get a better understanding of the matter.
P.S. I'm meeting a client tomorrow to talk about his project, and I want to be prepared to reason about performance if his project turns out to be akin FourSquare.
Very tricky to answer without specifics as you have noted. If I was tasked with what you have to do, I would take each component in turn ( network interface, CPU/memory, physical IO load, SMP locking etc) and get the maximum capacity available, divide by rough estimate of use per request.
For example, network io. You might have 1x 1Gb card, which might achieve maybe 100Mbytes/sec. ( I tend to use 80% of theoretical max). How big will a typical 'hit' be? Perhaps 3kbytes average, for HTML, images etc. that means you can achieve 33k requests per second before you bottleneck at the physical level. These numbers are absolute maximums, depending on tools and skills you might not get anywhere near them, but nobody can exceed these maximums.
Repeat the above for every component, perhaps varying your numbers a little, and you will build a quick picture of what is likely to be a concern. Then, consider how you can quickly get more capacity in each component, can you just chuck $$ and gain more performance (eg use SSD drives instead of HD)? Or will you hit a limit that cannot be moved without rearchitecting? Also take into account what resources you have available, do you have lots of skilled programmer time, DBAs, or wads of cash? If you have lots of a resource, you can tend to reduce those constraints easier and quicker as you move along the experience curve.
Do not forget external components too, firewalls may have limits that are lower than expected for sustained traffic.
Sorry I cannot give you real numbers, our workloads are using custom servers, high memory caching and other tricks, and not using all the products you list. However, I would concentrate most on IO/SQL queries and possibly network IO, as these tend to be more hard limits, than CPU/memory, although I'm sure others will have a different opinion.
Obviously, the question is such that does not have a "proper" answer, but I'd like to close it and give some feedback. The client meeting has taken place, performance was indeed a biggie, their hosting platform turned out to be on the Amazon cloud :)
From research I've done independently:
Memcache is a must;
MySQL (or whatever persistent storage instance you're running) is usually the first to go. Solutions include running multiple virtual instances and replicate data between them, distributing the load;
http://highscalability.com/ is a good read :)
I'm wondering what kind of performance hit numerical calculations will have in a virtualized setting? More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized windows OS as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add precisions as needed, but as I don't know much about virtualization, I don't know what info is relevant.
Processes are just bunches of threads which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, the host and the guest processes execute together and differ only in that the I/O of the latter is being trapped and virtualised. Memory is also virtualised but that occurs more or less in the hardware MMU. Guest instructions are directly executed by the CPU otherwise it would not be virtualisation but rather emulation and as long as they do not access any virtualised resources they would execute just as fast as the host instructions. At the end it all depends on how well the CPU could cope with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
Goaded by #Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as an answer:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours, we're seriously thinking about O(10^6) CPU-hours jobs in the near future.
I get well paid to squeeze every last drop of performance out of our codes, I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of performance improvement, sure I may be slow, but it's still cost-effective for our scientists.
I shudder therefore, when bright salespeople offering the latest and best in data center software (of which virtualization is one aspect) which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,00dwt tanker (that was a metaphor).
I have read the question carefully and understand that OP is not proposing that virtualization would help, I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close, I promise I won't be offended !
Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single threaded approach and depending on continued improvements in CPU speed years back would have done?) The program is unlikely to be run on a machine costing more than, say $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written)
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads is to have lots of them doing the same thing. The trouble is they're all contesting for the same resources. This then means lots of locking needs to be introduced, which causes performance issues, and subtle bugs which are infuriating and time consuming to solve.
This isn't sustainable.
Manufacturing worked out at the beginning of the 20th Century, the fastest way to build lots of cars wasn't to have lots of people working on one car, and then, when that one's done, move them all on to the next car. It was to split the process of building the car down into lots of small jobs, with the output of one job feeding the next. They called it assembly lines. In hardware design it's called pipe-lining, and I'll think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - There's still a shared resource on the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted on how these interfaces are made, and message queueing libraries seem to be making big strides here.
There's not one solution for all problems though. This type of pipe-line works great for high throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/Array processors like GPUs, but it only really excels with a certain type of problem. Those problems are ones where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PC's typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVM's rather than one JVM on multiple machines, so their tricks might work for executable langauges also (and maybe the CLR, if needed).
Any idea how to do performance and scalability testing if no clear performance requirements have been defined?
More information about my application.
The application has 3 components. One component can only run on Linux, the other two components are Java programs so they can run on Linux/Windows/Mac... The 3 components can be deployed to one box or each component can be deployed to one box. Deployment is very flexible. The Linux-only component will capture raw TCP/IP packages over the network, then one Java component will get those raw data from it and assemble them into the data end users will need and output them to hard disk as data files. The last Java component will upload data from data files to my database in batch.
In the absence of 'must be able to perform X iterations within Y seconds...' type requirements, how about these kinds of things:
Does it take twice as long for twice the size of dataset? (yes = good)
Does it take 10x as long for twice the size of dataset? (yes = bad)
Is it CPU bound?
Is it RAM bound (eg lots of swapping to virtual memory)?
Is it IO / Disk bound?
Is there a certain data-set size at which performance suddenly falls off a cliff?
Surprisingly this is how most perf and scalability tests start.
You can clearly do the testing without criteria, you just define the tests and measure the results. I think your question is more in the lines 'how can I establish test passing criteria without performance requirements'. Actually this is not at all uncommon. Many new projects have no clear criteria established. Informally it would be something like 'if it cannot do X per second we failed'. But once you passed X per second (and you better do!) is X the 'pass' criteria? Usually not, what happens is that you establish a new baseline and your performance tests guard against regression: you compare your current numbers with the best you got, and decide if the new build is 'acceptable' as build validation pass (usually orgs will settle here at something like 70-80% as acceptable, open perf bugs, and make sure that by ship time you get back to 90-95% or 100%+. So basically the performance test themselves become their own requirement.
Scalability is a bit more complicated, because there there is no limit. The scope of your test should be to find out where does the product break. Throw enough load at anything and eventually it will break. You need to know where that limit is and, very importantly, find out how does your product break. Does it give a nice error message and revert or does it spills its guts on the floor?
Define your own. Take the initiative and describe the performance goals yourself.
To answer any better, we'd have to know more about your project.
If there has been 'no performance requirement defined', then why are you even testing this?
If there is a performance requirement defined, but it is 'vague', can you indicate in what way it is vague, so that we can better help you?
Short of that, start from the 'vague' requirement, and pick a reasonable target that at least in your opinion meets or exceeds the vague requirement, then go back to the customer and get them to confirm that your clarification meets their requirements and ideally get formal sign-off on that.
Some definitions / assumptions:
Performance = how quickly the application responds to user input, e.g. web page load times
Scalability = how many peak concurrent users the applicaiton can handle.
Firstly perfomance. Performance testing can be quite simple, such as measuring and recording page load times in a development environment and using techniques like applicaiton profiling to identify and fix bottlenecks.
Load. To execute a load test there are four key factors, you will need to get all of these in place to be successfull.
1. Good usage models of how users will use your site and/or application. This can be easy of the application is already in use, but it can be extermely difficult if you are launching a something new, e.g. a Facebook application.
If you can't get targets as requirements, do some research and make some educated assumptions, document and circulate them for feedback.
2. Tools. You need to have performance testing scripts and tools that can excute the scenarios defined in step 1, with the number of expected users in step 1. (This can be quite expensive)
3. Environment. You will need a production like environment that is isolated so your tests can produce repoducible results. (This can also be very expensive.)
4. Technical experts. Once the applicaiton and environment starts breaking you will need to be able to identify the faults and re-configure the environment and or re-code the application once faults are found.
Generally most projects have a "performance testing" box that they need to tick because of some past failure, however they never plan or budget to do it properley. I normally recommend to do budget for and do scalability testing properley or save your money and don't do it at all. Trying to half do it on the cheap is a waste of time.
However any good developer should be able to do performance testing on their local machine and get some good benefits.
rely on tools (fxcop comes to mind)
rely on common sense
If you want to test performance and scalability with no requirements then you should create your own requirements / specs that can be done in the timeline / deadline given to you. After defining the said requirements, you should then tell your supervisor about it if he/she agrees.
To test scalability (assuming you're testing a program/website):
Create lots of users and data and check if your system and database can handle it. MyISAM table type in MySQL can get the job done.
To test performance:
Optimize codes, check it in a slow internet connection, etc.
Short answer: Don't do it!
In order to get a (better) definition write a performance test concept you can discuss with the experts that should define the requirements.
Make assumptions for everything you don't know and document these assumptions explicitly. Assumptions comprise everything that may be relevant to your system's behaviour under load. Correct assumptions will be approved by the experts, incorrect ones will provoke reactions.
For all of those who have read Tom DeMarcos latest book (Adrenaline Junkies ...): This is the strawman pattern. Most people who are not willing to write some specification from scratch will not hesitate to give feedback to your document. Because you need to guess several times when writing your version you need to prepare for being laughed at when being reviewed. But at least you will have better information.
The way I usually approach problems like this is just to get a real or simulated realistic workload and make the program go as fast as possible, within reason. Then if it can't handle the load I need to think about faster hardware, doing parts of the job in parallel, etc.
The performance tuning is in two parts.
Part 1 is the synchronous part, where I tune each "thread", under realistic workload, until it really has little room for improvement.
Part 2 is the asynchronous part, and it is hard work, but needs to be done. For each "thread" I extract a time-stamped log file of when each message sent, each message received, and when each received message is acted upon. I merge these logs into a common timeline of events. Then I go through all of it, or randomly selected parts, and trace the flow of messages between processes. I want to identify, for each message-sequence, what its purpose is (i.e. is it truly necessary), and are there delays between the time of receipt and time of processing, and if so, why.
I've found in this way I can "cut out the fat", and asynchronous processes can run very quickly.
Then if they don't meet requirements, whatever they are, it's not like the software can do any better. It will either take hardware or a fundamental redesign.
Although no clear performance and scalability goals are defined, we can use the high level description of the three components you mention to drive general performance/scalability goals.
Component 1: It seems like a network I/O bound component, so you can use any available network load simulators to generate various work load to saturate the link. Scalability can be measure by varying the workload (10MB, 100MB, 1000MB link ), and measuring the response time , or in a more precise way, the delay associated with receiving the raw data. You can also measure the working set of the links box to drive a realistic idea about your sever requirement ( how much extra memory needed to receive X more workload of packets, ..etc )
Component 2: This component has 2 parts, an I/O bound part ( receiving data from Component 1 ), and a CPU bound part ( assembling the packets ), you can look at the problem as a whole, make sure to saturate your link when you want to measure the CPU bound part, if is is a multi threaded component, you can look for ways to improve look if you don't get 100% CPU utilization, and you can measure time required to assembly X messages, from this you can calculate average wait time to process a message, this can be used later to drive the general performance characteristic of your system and provide and SLA for your users ( you are going to guarantee a response time within X millisecond for example ).
Component 3: Completely I/O bound, and depends on both your hard disk bandwidth, and the back-end database server you use, however you can measure how much do you saturate disk I/O to optimize throughput, how much I/O counts do you require to read X MB of data, and improve around these parameters.
Hope that helps.
Thanks
In a typical handheld/portable embedded system device Battery life is a major concern in design of H/W, S/W and the features the device can support. From the Software programming perspective, one is aware of MIPS, Memory(Data and Program) optimized code.
I am aware of the H/W Deep sleep mode, Standby mode that are used to clock the hardware at lower Cycles or turn of the clock entirel to some unused circutis to save power, but i am looking for some ideas from that point of view:
Wherein my code is running and it needs to keep executing, given this how can I write the code "power" efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures, control structures which i should look at to achieve minimum power consumption for a given functionality.
Are there any s/w high level design considerations which one should keep in mind at time of code structure design, or during low level design to make the code as power efficient(Least power consuming) as possible?
Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.
Zeroith, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecend or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread is a power save, wait for next interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI,
sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/ WaitForMultipleObjects().
Puts stress on the thread scheuler ( and your brain) but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!.
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.
Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.
From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.
To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.
Consider using the network interfaces the least you can. You might want to gather information and send it out in bursts instead of constantly send it.
Look at what your compiler generates, particularly for hot areas of code.
If you have low priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.
Simply put, do as little as possible.
Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide your programs into modules that are more or less independent of the other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreas out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.
Set unused memory or flash to 0xFF not 0x00. This is certainly true for flash and eeprom, not sure about s or d ram. For the proms there is an inversion so a 0 is stored as a 1 and takes more energy, a 1 is stored as a zero and takes less. This is why you read 0xFFs after erasing a block.
Rather timely this, article on Hackaday today about measuring power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strengh, slew rate, etc. check them & configure them, the defaults are often full power / max speed.
- returning to the article above, go back and measure the power & see if you can drop it by altering things.
also something that is not trivial to do is reduce precision of the mathematical operations, go for the smallest dataset available and if available by your development environment pack data and aggregate operations.
knuth books could give you all the variant of specific algorithms you need to save memory or cpu, or going with reduced precision minimizing the rounding errors
also, spent some time checking for all the embedded device api - for example most symbian phones could do audio encoding via a specialized hardware
Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.
On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/
Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high level constructs if they expands your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache trashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use floating point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.
Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
If you are really serious,l get a power-aware debugger, which can correlate power usage with your source code. Like this