Where to learn about low-level, hard-core performance stuffs?

Where to learn about low-level, hard-core performance stuffs? - performance

This is actually a 2 part question:
For people who want to squeeze every clock cycle, people talk about pipelines, cache locality, etc.
I have seen these low level performance techniques mentioned here and there but I have not seen a good introduction to the subject, from start to finish. Any resource recommendations? (Google gave me definitions and papers, where I'd really appreciate some kind of worked examples/tutorials real-life hands-on kind of materials)
How does one actually measure this kind of things? Like, as in a profiler of some sort? I know we can always change the code, see the improvement and theorize in retrospect, I am just wondering if there are established tools for the job.
(I know algorithm optimization is where the orders of magnitudes are. I am interested in the metal here)

The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.
That said, to answer your question:
Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to copy and clear registers. For specific references,
http://www.mark.masmcode.com/ and http://www.agner.org/optimize/
Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-features profiler. GNU gprof is open source http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)

EDIT/REWRITE:
If it is books you need Michael Abrash did a good job in this area, Zen of Assembly language, a number of magazine articles, big black book of graphics programming, etc. Much of what he was tuning for is no longer a problem, the problems have changed. What you will get out of this is the ideas of the kinds of things that can cause bottle necks and the kinds of ways to solve. Most important is to time everything, and understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird solutions, you may find an optimization that you were not aware of and didnt realize until you exposed it.
I have only just started reading but See MIPS Run (early/first edition) looks good so far (note that ARM took over MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of text books old and new to be had about MIPS. Mips being designed for performance (At the cost of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself and the I/O around it and what is connected to that I/O. The insides of the processor chips themselves (for higher end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting off the train, from the train to your destination half a minute faster when the train ride was 3 hours is not necessarily a worthwhile optimization.
It is all about learning the hardware, you can probably stay within the ones and zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you really cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps get into the processors. I would recommend simulating HDL, for example processors at opencores, to get a feel for how some folks do their designs and getting a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, media (flash, hard disks, etc) and displays and graphics, networking, and all the types of interfaces between all of those things. And understanding at the clock level or as close to it as you can get, is what it takes.

Intel and AMD provide optimization manuals for x86 and x86-64.
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/
http://developer.amd.com/documentation/guides/pages/default.aspx
Another excellent resource is agner.
http://www.agner.org/optimize/
Some of the key points (in no particular order):
Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare&move op-codes
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes to perform separate tasks avoiding partial stales and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-code (or LOCK prefix) to avoid high level synchronization constructs

Yes, measure, and yes, know all those techniques.
Experienced people will tell you "don't optimize prematurely", which I relate as simply "don't guess".
They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused with their output.
SO is full of them.
What I don't hear a lot of is success stories, with speedup factors achieved.
The method I use is very simple, and I've tried to give lots of examples, including this case.

I'd suggest Optimizing subroutines in assembly
language
An optimization guide for x86 platforms.
It's quite heavy stuff though ;)

Related

Smallest time-slice where OpenMP is useful

I'm aware this will vary between systems and implementations, but I have absolutely no idea if OpenMP is feasible for use where you have easily parallelizable code that doesn't takea long time to run. For instance if I have a loop which would be ideal for OpenMP, but only takes a couple of ms to run on a single thread, will the overhead kill any gains?
My usage is in real-time rendering, where we have perhaps 20ms to generate each frame. So typically we don't have any single block of code taking more than 5-10ms... is OpenMP targeted at this kind of time-scale, or only on operations taking several seconds to run?
Any anecdotal/empirical data is welcome, as well as more 'official' sources. I suppose I'm primarily interested in MSVC++ and GCC OpenMP implementations.
I found this interesting article: http://software.intel.com/en-us/blogs/2010/06/02/using-openmp-to-parallelize-a-game/

The answer, to all empirical questions when data is not available, is 42 ...
But seriously, you're going to have to derive your own answer to this question for your own configuration of hardware, software and problem.
Yes, perhaps when, in 1997, OpenMP was first published, the typical use case was for relatively heavyweight threads of computation taking seconds each. Since then processors have got faster and implementations have got better, I don't think you'll get a good answer to your question without benchmarking. And you're going to have to benchmark anyway, whatever you learn from SO, so stop wasting time and get coding.
You might be persuaded that OpenMP is not useful: but you won't quite trust the answer wrt your own requirements.
You might be persuaded that OpenMP is useful: but you won't quite trust the answer wrt your own requirements.
You might get a lot of conflicting anecdotal evidence: but you won't quite trust the answer wrt your own requirements.
Get benchmarking. And let us know how you get on, SO is far too long on anecdotes (such as the trash I'm writing), far too short on hard data.
PS Don't lose your benchmark codes. Revise them and rerun them as your experience develops and as your hardware, systems software and applications software change. Of one thing I am convinced without data: the answer to your question is a moving target.

Are there any resources for language independent performance tips?

I work with many people that program video games for a living. I have a quite a bit of knowledge in C++ and I know a number of general performance strategies to utilize in day to day programming. Like using prefix ++/-- over post fix.
My problem is that often times people come to me to give them tips on general optimizations they can do on a regular basis when programming, but often times these people program in all sorts of languages. Some use C++, C#, Java, ActionScript, etc.
I am wondering if there are any general performance tips that can be utilized on a day by day programming basis? For example, I would suggest prefix ++/-- over postfix for people programming in another language, but I am just not sure if that is true.
My guess is that it is language specific and the best way to go about general optimizations is to make sure you are not using majorly bloated algorithms, but maybe someone has some advice.

Without going into language specifics, or even knowing whether this is embedded, web, CAD, game, or iPhone programming, there isn't much that can be said. All we know is that there's multiple languages involved, and for some unknown reason performance is always slower than desirable.
First, check your algorithms. A slow algorithm can cause horrible performance. Read up on algorithms and their complexity.
Second, note if there are any really slow operations, such as hitting a database or transmitting information or moving a robot arm. See if the program is doing more of those than it should.
Third, profile. If there's a section of code that's taking 5% of the time, no optimization will make your program more than 5% faster. If a section of code is taking a lot of the time, it's worth looking at.
Fourth, get somebody who knows what they're doing to make any specific optimizations. Test them when they're done to make sure they actually speed up performance. When performance was an issue, I've improved it with some counterintuitive measures, like rolling up loops.

I don't think you can generalize optimization as such. To optimize execution time, you need to dig deep into the language and understand how things work in detail. Just guessing or making assumptions on experiences with other languages won't work! For example, writing x = x << 1 instead of x = x*2 might be a big benefit in C++. In JavaScript it will slow you down.
With all the differences between all the languages it's hard to find generic optimization tips. Maybe for some languages which are similar (f.ex. C# and Java). But if you add both JavaScript and Python to that list I'm pretty sure not many common optimization techniques will be left over.
Also keep in mind that premature optimization is often considered bad practice. Developer-hours are much more expensive than buying additional hardware.
However, there is one thing which comes to mind. Over the past decade or so, Object Relational Mappers have become quite popular. And hence, they emerge(d) in pretty much all popular languages. But you have to be careful with those. It's easy to load tons of data into memory that you will never use in your code if not properly configured. Keep that in mind. Lazy loading might be of some help here. But your mileage will vary.
Optimization depends on so many things that answering such a generic question would make this post explode into a full-fledged paper. In my opinion, optimization should be regarded on a project-by-project basis. Not only Language-by-Language basis.

I think you need to split this into two separate questions:
1) Are there language-agnostic ways to find performance problems? YES. Profile, but avoid the myths around that subject.
2) Are there language-agnostic ways to fix performance problems? IT DEPENDS.
A general language-agnostic principle is: do (1) before you do (2).
In other words, Ready-Aim-Fire, not Ready-Fire-Aim.
Here's an example of performance tuning, in C, but it could be any language.

A few things I have learned since asking this:
I/O operations are usually the most expensive to performance. This holds especially true when you are doing disk or network I/O (which is usually the most expensive because if you have to wait for a response from the other host you have to wait for all processing and I/O operations the remote host does). Only do these operations when absolutely necessary and possibly consider using a cache when possible.
Database operations can be very expensive because of network/disk I/O and the translation time to and from SQL. Using in-memory DB or cache can help reduce I/O issues and some (not all) NoSQL databases can reduce SQL translation time.
Only log important information. Using logging libraries like log4j can help because you can put logging to your hearts desire in your application but you set each message to a certain log level. Whichever log level you set the application to it will only log messages at that level or higher. This way if you need to troubleshoot functionality you only have to change a quick config and restart you application to give you additional messages. Then when you are done just turn you application back to the default level so that you do not log too often.
Only include functionality that is needed. Additional functionality may be nice to have but can increase processing time, provide additional locations for the application to fail, and costs your team development time that could be spent on more important tasks.
Use and configure your memory manager correctly. Garbage collection routines can kill performance if they are not configured correctly. If every minute you application freezes for a second or two for garbage collection your customer probably will not be happy.
Profile only after you have discovered a performance issue. Profilers will make the applications performance look worse than it is because you have your application and the profiler running on the same host, consuming the same hardware resources.
Do not prematurely do performance tuning. There are general practices you can take that should be better on performance in each language, but starting performance tuning in the middle of application development can cost you a lot on development because there is still functionality to be added.
This is not necessarily going to help performance but keep class dependency to a minimal. When you get into performance tuning there is good chance you will have to rewrite whole portions of code, which if there is a lot of dependencies on the section you are performance tuning the greater chance you will break the code. It can often be a domino affect because after fixing the performance issue than you have to fix all the dependencies, and possibly dependencies of the original dependencies. A performance tuning exercise estimate for a few hours can quickly turn into months with an application that has a lot of dependencies.
If performance is a concern do not use interpreted languages (scripting languages).
Only use the hardware you need. Having a system with a 64 core processor may seem cool but if you only have two or three threads running in your application than you are getting little benefit from having 64 cores. In fact, in rare instances having overly excessive hardware can sometimes hurt performance because the chips have to be wired to handle all the hardware which can cause your application to spend more time switching between cores or processors than actually being processed.
Any timing metrics you report make as granular as possible. Currently, you may only need to be worried about the number of milliseconds a process takes but in the future as you make your application faster and faster you may need more granular timings. If version A uses milliseconds and version B uses microseconds, how can you compare performance if version B is taking about the same number of milliseconds. Version B may be better but you just can't tell because version A did not use granular enough metrics.

Optimization! - What is it? How is it done?

Its common to hear about "highly optimized code" or some developer needing to optimize theirs and whatnot. However, as a self-taught, new programmer I've never really understood what exactly do people mean when talking about such things.
Care to explain the general idea of it? Also, recommend some reading materials and really whatever you feel like saying on the matter. Feel free to rant and preach.

Optimize is a term we use lazily to mean "make something better in a certain way". We rarely "optimize" something - more, we just improve it until it meets our expectations.
Optimizations are changes we make in the hopes to optimize some part of the program. A fully optimized program usually means that the developer threw readability out the window and has recoded the algorithm in non-obvious ways to minimize "wall time". (It's not a requirement that "optimized code" be hard to read, it's just a trend.)
One can optimize for:
Memory consumption - Make a program or algorithm's runtime size smaller.
CPU consumption - Make the algorithm computationally less intensive.
Wall time - Do whatever it takes to make something faster
Readability - Instead of making your app better for the computer, you can make it easier for humans to read it.
Some common (and overly generalized) techniques to optimize code include:
Change the algorithm to improve performance characteristics. If you have an algorithm that takes O(n^2) time or space, try to replace that algorithm with one that takes O(n * log n).
To relieve memory consumption, go through the code and look for wasted memory. For example, if you have a string intensive app you can switch to using Substring References (where a reference contains a pointer to the string, plus indices to define its bounds) instead of allocating and copying memory from the original string.
To relieve CPU consumption, cache as many intermediate results if you can. For example, if you need to calculate the standard deviation of a set of data, save that single numerical result instead looping through the set each time you need to know the std dev.

I'll mostly rant with no practical advice.
Measure First. Optimization should be done to places where it matters. Highly optimized code is often difficult to maintain and a source of problems. In places where the code does not slow down execution anyway, I alwasy prefer maintainability to optimizations. Familiarize yourself with Profiling, both intrusive (instrumented) and non-intrusive (low overhead statistical). Learn to read a profiled stack, understand where the time inclusive/time exclusive is spent, why certain patterns show up and how to identify the trouble spots.
You can't fix what you cannot measure. Have your program report through some performance infrastructure the thing it does and the times it takes. I come from a Win32 background so I'm used to the Performance Counters and I'm extremely generous at sprinkling them all over my code. I even automatized the code to generate them.
And finally some words about optimizations. Most discussion about optimization I see focus on stuff any compiler will optimize for you for free. In my experience the greatest source of gains for 'highly optimized code' lies completely elsewhere: memory access. On modern architectures the CPU is idling most of the times, waiting for memory to be served into its pipelines. Between L1 and L2 cache misses, TLB misses, NUMA cross-node access and even GPF that must fetch the page from disk, the memory access pattern of a modern application is the single most important optimization one can make. I'm exaggerating slightly, of course there will be counter example work-loads that will not benefit memory access locality this techniques. But most application will. To be specific, what these techniques mean is simple: cluster your data in memory so that a single CPU can work an a tight memory range containing all it needs, no expensive referencing of memory outside your cache lines or your current page. In practice this can mean something as simple as accessing an array by rows rather than by columns.
I would recommend you read up the Alpha-Sort paper presented at the VLDB conference in 1995. This paper presented how cache sensitive algorithms designed specifically for modern CPU architectures can blow out of the water the old previous benchmarks:
We argue that modern architectures
require algorithm designers to
re-examine their use of the memory
hierarchy. AlphaSort uses clustered
data structures to get good cache
locality...

The general idea is that when you create your source tree in the compilation phase, before generating the code by parsing it, you do an additional step (optimization) where, based on certain heuristics, you collapse branches together, delete branches that aren't used or add extra nodes for temporary variables that are used multiple times.
Think of stuff like this piece of code:
a=(b+c)*3-(b+c)
which gets translated into
-
* +
+ 3 b c
b c
To a parser it would be obvious that the + node with its 2 descendants are identical, so they would be merged into a temp variable, t, and the tree would be rewritten:
-
* t
t 3
Now an even better parser would see that since t is an integer, the tree could be further simplified to:
*
t 2
and the intermediary code that you'd run your code generation step on would finally be
int t=b+c;
a=t*2;
with t marked as a register variable, which is exactly what would be written for assembly.
One final note: you can optimize for more than just run time speed. You can also optimize for memory consumption, which is the opposite. Where unrolling loops and creating temporary copies would help speed up your code, they would also use more memory, so it's a trade off on what your goal is.

Here is an example of some optimization (fixing a poorly made decision) that I did recently. Its very basic, but I hope it illustrates that good gains can be made even from simple changes, and that 'optimization' isn't magic, its just about making the best decisions to accomplish the task at hand.
In an application I was working on there were several LinkedList data structures that were being used to hold various instances of foo.
When the application was in use it was very frequently checking to see if the LinkedListed contained object X. As the ammount of X's started to grow, I noticed that the application was performing more slowly than it should have been.
I ran an profiler, and realized that each 'myList.Contains(x)' call had O(N) because the list has to iterate through each item it contains until it reaches the end or finds a match. This was definitely not efficent.
So what did I do to optimize this code? I switched most of the LinkedList datastructures to HashSets, which can do a '.Contains(X)' call in O(1)- much better.

This is a good question.
Usually the best practice is 1) just write the code to do what you need it to do, 2) then deal with performance, but only if it's an issue. If the program is "fast enough" it's not an issue.
If the program is not fast enough (like it makes you wait) then try some performance tuning. Performance tuning is not like programming. In programming, you think first and then do something. In performance tuning, thinking first is a mistake, because that is guessing.
Don't guess what to fix; diagnose what the program is doing.
Everybody knows that, but mostly they do it anyway.
It is natural to say "Could be the problem is X, Y, or Z" but only the novice acts on guesses. The pro says "but I'm probably wrong".
There are different ways to diagnose performance problems.
The simplest is just to single-step through the program at the assembly-language level, and don't take any shortcuts. That way, if the program is doing unnecessary things, then you are doing the same things, and it will become painfully obvious.
Another is to get a profiling tool, and as others say, measure, measure, measure.
Personally I don't care for measuring. I think it's a fuzzy microscope for the purpose of pinpointing performance problems. I prefer this method, and this is an example of its use.
Good luck.
ADDED: I think you will find, if you go through this exercise a few times, you will learn what coding practices tend to result in performance problems, and you will instinctively avoid them. (This is subtly different from "premature optimization", which is assuming at the beginning that you must be concerned about performance. In fact, you will probably learn, if you don't already know, that premature concern about performance can well cause the very problem it seeks to avoid.)

Optimizing a program means: make it run faster
The only way of making the program faster is making it do less:
find an algorithm that uses fewer operations (e.g. N log N instead of N^2)
avoid slow components of your machine (keep objects in cache instead of in main memory, or in main memory instead of on disk); reducing memory consumption nearly always helps!
Further rules:
In looking for optimization opportunities, adhere to the 80-20-rule: 20% of typical program code accounts for 80% of execution time.
Measure the time before and after every attempted optimization; often enough, optimizations don't.
Only optimize after the program runs correctly!
Also, there are ways to make a program appear to be faster:
separate GUI event processing from back-end tasks; priorize user-visible changes against back-end calculation to keep the front-end "snappy"
give the user something to read while performing long operations (every noticed the slideshows displayed by installers?)

However, as a self-taught, new programmer I've never really understood what exactly do people mean when talking about such things.
Let me share a secret with you: nobody does. There are certain areas where we know mathematically what is and isn't slow. But for the most part, performance is too complicated to be able to understand. If you speed up one part of your code, there's a good possibility you're slowing down another.
Therefore, anyone who tells you that one method is faster than another, there's a good possibility they're just guessing unless one of three things are true:
They have data
They're choosing an algorithm that they know is faster mathematically.
They're choosing a data structure that they know is faster mathematically.

Optimization means trying to improve computer programs for such things as speed. The question is very broad, because optimization can involve compilers improving programs for speed, or human beings doing the same.

I suggest you read a bit of theory first (from books, or Google for lecture slides):
Data structures and algorithms - what the O() notation is, how to calculate it,
what datastructures and algorithms can be used to lower the O-complexity
Book: Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest
Compilers and assembly - how code is translated to machine instructions
Computer architecture - how the CPU, RAM, Cache, Branch predictions, out of order execution ... work
Operating systems - kernel mode, user mode, scheduling processes/threads, mutexes, semaphores, message queues
After reading a bit of each, you should have a basic grasp of all the different aspects of optimization.
Note: I wiki-ed this so people can add book recommendations.

I am going with the idea that optimizing a code is to get the same results in less time. And fully optimized only means they ran out of ideas to make it faster. I throw large buckets of scorn on claims of "fully optimized" code! There's no such thing.
So you want to make your application/program/module run faster? First thing to do (as mentioned earlier) is measure also known as profiling. Do not guess where to optimize. You are not that smart and you will be wrong. My guesses are wrong all the time and large portions of my year are spent profiling and optimizing. So get the computer to do it for you. For PC VTune is a great profiler. I think VS2008 has a built in profiler, but I haven't looked into it. Otherwise measure functions and large pieces of code with performance counters. You'll find sample code for using performance counters on MSDN.
So where are your cycles going? You are probably waiting for data coming from main memory. Go read up on L1 & L2 caches. Understanding how the cache works is half the battle. Hint: Use tight, compact structures that will fit more into a cache-line.
Optimization is lots of fun. And it's never ending too :)
A great book on optimization is Write Great Code: Understanding the Machine by Randall Hyde.

Make sure your application produces correct results before you start optimizing it.

Does Wirth's law still hold true?

Adage made by Niklaus Wirth in 1995:
«Software is getting slower more rapidly than hardware becomes faster»
Do you think it's actually true?
How should you measure "speed" of software? By CPU cycles or rather by time you need to complete some task?
What about software that is actually getting faster and leaner (measured by CPU cycles and MB of RAM) and more responsive with new versions, like Firefox 3.0 compared with 2.0, Linux 2.6 compared with 2.4, Ruby 1.9 compared to 1.8. Or completely new software that is order of magnitude faster then old stuff (like Google's V8 Engine)? Doesn't it negate that law?

Yes I think it is true.
How do I measure the speed of software? Well time to solve tasks is a relevant indicator. For me as a user of software I do not care whether there are 2 or 16 cores in my machine. I want my OS to boot fast, my programs to start fast and I absolutely do not want to wait for simple things like opening files to be done. A software has to just feel fast.
So .. when booting Windows Vista there is no fast software I am watching.
Software / Frameworks often improve their performance. That's great but these are mostly minor changes. The exception proves the rule :)
In my opinion it is all about feeling. And it feels like computers have been faster years ago. Of course I couldn't run the current games and software on those old machines. But they were just faster :)

It's not that software becomes slower, it's that its complexity increases.
We now build upon many levels of abstraction.
When was the last time people on SO coded in assembly language?
Most never have and never will.

It is wrong. Correct is
Software is getting slower at the same rate as hardware becomes faster.
The reason is that this is mostly determined by human patience, which stays the same.
It also neglects to mention that the software of today does more than 30 years ago, even if we ignore eye candy.

In general, the law holds true. As you have stated, there are exceptions "that prove the rule". My brother recently installed Win3.1 on his 2GHz+ PC and it boots in a blink of an eye.
I guess there are many reasons why the law holds:
Many programmers entering the profession now have never had to consider limited speed / resourced systems, so they never really think about the performance of their code.
There's generally a higher importance on getting the code written for deadlines and performance tuning usually comes last after bug fixing / new features.
I find FF's lack of immediate splash dialog annoying as it takes a while for the main window to appear after starting the application and I'm never sure if the click 'worked'. OO also suffers from this.
There are a few articles on the web about changing the perception of speed of software without changing the actual speed.
EDIT:
In addition to the above points, an example of the low importance given to efficiency is this site, or rather, most of the other Q&A sites. This site has always been developed to be fast and responsive and it shows. Compare this to the other sites out there - I've found phpBB based sites are flexible but slow. Google is another example of putting speed high up in importance (it even tells you how long the search took) - compare with other search engines that were around when google started (now, they're all fast thanks to google).
It takes a lot of effort, skill and experience to make fast code which is something I found many programmers lack.

From my own experience, I have to disagree with Wirth's law.
When I first approached a computer (in the 80'), the time for displaying a small still picture was perceptible. Today my computer can decode and display 1080p AVCHD movies in realtime.
Another indicator is the frames per second of video games. Not long ago it used to be around 15fps. Today 30fps to 60 fps are not uncommon.

Quoting from a UX study:
The technological advancements of 21 years have placed modern PCs in a completely different league of varied capacities. But the “User Experience” has not changed much in two decades. Due to bloated code that has to incorporate hundreds of functions that average users don’t even know exist, let alone ever utilize, the software companies have weighed down our PCs to effectively neutralize their vast speed advantages.
Detailed comparison of UX on a vintage Mac and a modern Dual Core: http://hubpages.com/hub/_86_Mac_Plus_Vs_07_AMD_DualCore_You_Wont_Believe_Who_Wins

One of the issues of slow software is a result of most developers using very high end machines with multicore CPUs and loads of RAM as their primary workstation. As a result they don't notice performance issues easily.
Part of their daily activity should be running their code on slower more mainstream hardware that the expected clients will be using. This will show the real world performance and allow them to focus on improving bottlenecks. Or even running within a VM with limited resources can aid in this review.
Faster hardware shouldn't be an excuse for creating slow sloppy code, however it is.

My machine is getting slower and clunkier every day. I attribute most of the slowdown to running antivirus. When I want to speed up, I find that disabling the antivirus works wonders, although I am apprehensive like being in a seedy brothel.

I think that Wirth's law was largely caused by Moore's law - if your code ran slow, you'd just disregard since soon enough, it would run fast enough anyway. Performance didn't matter.
Now that Moore's law has changed direction (more cores rather than faster CPUs), computers don't actually get much faster, so I'd expect performance to become a more important factor in software development (until a really good concurrent programming paradigm hits the mainstream, anyway). There's a limit to how slow software can be while still being useful, y'know.

Yes, software nowadays may be slower or faster, but you're not comparing like with like. The software now has so much more capability, and a lot more is expected of it.
Lets take as an example: Powerpoint. If I created a slideshow with Powerpoint from the early nineties, I can have a slideshow with pretty colours fairly easily, nice text etc. Now, its a slideshow with moving graphics, fancy transitions, nice images.
The point is, yes, software is slower, but it does more.
The same holds true of the people who use the software. back in the 70s, to create a presentation you had to create your own transparencies, maybe even using a pen :-). Now, if you did the same thing, you'd be laughed out of the room. It takes the same time, but the quality is higher.
This (in my opinion) is why computers don't give you gains in productivity, because you spend the same amount of time doing 'the job'. But if you use todays software, your results looks more professional, you gain in quality of work.

Skizz and Dazmogan have it right.
On the one hand, when programmers try to make their software take as few cycles as possible, they succeed, and it is blindingly fast.
On the other hand, when they don't, which is most of the time, their interest in "Galloping Generality" uses up every available cycle and then some.
I do a lot of performance tuning. (My method of choice is random halting.) In nearly every case, the reason for the slowness is over-design of class and data structure.
Oddly enough, the reason usually given for excessively event-driven and redundant data structure is "efficiency".
As Bompuis says, we build upon many layers of abstraction. That is exactly the problem.

Yes it holds true. You have given some prominent examples to counter the thesis, but bear in mind that these examples are developed by a big community of quite knowledgeable people, who are more or less aware of good practices in programming.
People working with the kernel are aware of different CPU's architectures, multicore issues, cache lines, etc. There is an interesting ongoing discussion about inclusion of hardware performance counters support in the mainline kernel. It is interesting from the 'political' point of view, as there is a conflict between the kernel people and people having much experience in performance monitoring.
People developing Firefox understand that the browser should be "lightweight" and fast in order to be popular. And to some extend they manage to do a good job.
New versions of software are supposed to be run on faster hardware in order to have the same user experience. But whether the price is just? How can we asses whether the functionality was added in the efficient way?
But coming back to the main subject, many of the people after finishing their studies are not aware of the issues related to performance, concurrency (or even worse, they do not care). For quite a long time Moore law was providing a stable performance boost. Thus people wrote mediocre code and nobody even noticed that there was something wrong with inefficient algorithms, data-structures or more low-level things.
Then some limitations came into play (thermal efficiency for example) and it is no longer possible to get 'easy' speed for few bucks. People who just depend on hardware performance improvements might get a cold shower. On the other hand, people who have in-depth knowledge of algorithms, data structures, concurrency issues (quite difficult to recruit these...) will continue to write good applications and their value on the job market will increase.
The Wirth law should not only be interpreted literally, it is also about poor code bloat, violating the keep-it-simple-stupid rule and people who waste the opportunity to use the 'faster' hardware.
Also if you happen to work in the area of HPC then these issues become quite obvious.

In some cases it is not true: the frame rate of games and the display/playing of multimedia content is far superior today than it was even a few years ago.
In several aggravatingly common cases, the law holds very, very true. When opening the "My Computer" window in Vista to see your drives and devices takes 10-15 seconds, it feels like we are going backward. I really don't want to start any controversy here but it was that as well as the huge difference in time needed to open Photoshop that drove me off of the Windows platform and on to the Mac. The point is that this slowdown in common tasks is serious enough to make me jump way out of my former comfort zone to get away from it.

Can´t find the sense. Why is this sentence a law?
You never can compare Software and Hardware, they are too different.
Hardware is genuine material and Software is a written code.
The connection is only that Software has to control the performance of Hardware. After an executed step in Hardware the Software needs a completion-sign, so the next order of Software can be done.
Why should I slow down Software? We allways try to make it faster !
It´s a lot of things to do in real physical way to make Hardware faster (changing print-modules or even physical parts of a computer).
It may be senseful,if Wirth means: to do this in one computer (= one Software- and Hardware-System).
To get a higher speed of Hardware it´s necessary to know the function of the Hardware, amount of parallel inputs or outputs at one moment and the frequency of possible switches in one second. Last not least it´s important that different Hardware-prints have the same or with a numeric factor multiplicated frequeny.
So perhaps the Software may slow down automaticaly very easy if You change something in the Hardware. - Wirth was thinking much more in Hardware, he is one of the great inventors since the computer is existing in the German-speaking area.
The other way is not easy. You have to know the System-Software of a computer very exactly to make the Hardware faster by changing the Software (=System-Software, Machine-Programs) of a computer. And if You use more layers you nearly have no direct influence in the speed of the Hardware.
Perhaps this may be the explanation of Wirth´s Law-Thinking....I got it!

Does the advent of MultiCore architectures affect me as a software developer?

As a software developer dealing mostly with high-level programming languages I'm not sure what I can do to appropriately pay attention to the upcoming omni-presence of multicore computers. I write mostly ordinary and non-demanding applications, nevertheless I think it is important to know if I need to change any programming paradigms or even language to master the future.
My question therefore:
How to deal with increasing multicore presence in day-by-day hacking?

Herb Sutter wrote about it in 2005: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

Most problems do not require a lot of CPU time. Really, single cores are quite fast enough for many purposes. When you do find your program is too slow, first profile it and look at your choice of algorithms, architecture, and caching. If that doesn't get you enough, try to divide the problem up into separate processes. Often this is worth doing simply for fault isolation and so that you can understand the CPU and memory usage of each process. Also, normally each process will run on a specific core and make good use of the processor caches, so you won't have to suffer the substantial performance overhead of keeping cache lines consistent. If you go for a multi process design and still find problem needs more CPU time than you get with the machine you have, you are well placed to extend it run over a cluster.
There are situations where you need multiple threads within the same address space, but beware that threads are really hard to get right. Race conditions, especially in non-safe languages, sometimes take weeks to debug; often, simply adding tracing or running under a debugger will change the timings enough to hide the problem. Simply putting locks everywhere often means you get a lot of locking overhead and sometimes so much lock contention that you don't really get the concurrency advantage you were hoping for. Even when you've got the locking right, you then need to profile to tune for cache coherency. Ultimately, if you want to really tune some highly concurrent code, you'll probably end up looking at lock-free constructs and more complex locking schemes than those in current multi-threading libraries.

Learn the benefits of concurrency, and the limits (e.g. Amdahl's law).
So you can, where possible, exploit the only route for higher performance that is going to be open. There is a lot of innovative work happening on easier approaches (futures and task libraries), and old work being rediscovered (functional languages and immutable data).
The free lunch is over, but that does not mean that there is nothing to exploit.

In general, become very friendly with threading. It's a terrible mechanism for parallelization, but it's what we have.
If you do work with .NET, look at the Parallel Extensions. They allow you to easily accomplish many parallel programming tasks.

To benefit from more that just one core you should consider parallelizing your code. Multiple threads, immutable types, and a minimum of synchronization are your new friends.

I think it will depend on what kind of applications you're writing.
Some kind of apps benefit more of the fact that they're run on a mutli-core cpu then others.
If your application can benefit from the multi-core fact, then you should be ready to go parallel.
The free lunch is over; that is: in the past, your application became faster when a new cpu was released and you didn't have to put any effort in your application to get that extra speed.
Now, to take advantage of the capabilities a multi-core cpu offers, you've to make sure that your application can take advantage of it. That is: you've to see which tasks can be executed multithreaded / concurrently, and this brings some issues to the table ...

Learn Erlang/F# (depending on your platform)

Prefer immutable data structures, their use makes software easier to understand not only in concurrent programs.
Learn the tools for concurrency in your language (e.g. java.util.concurrent, JCIP).
Learn a functional language (e.g Haskell).

I've been asked the same question, and the answer is, "it depends". If your Joe Winforms, maybe not so much. If your writing code that must be performant, yes. One of the biggest problem I can see with parallel programming is this: if something can't be parallized, and you lie and tell the run-time to do in parallel anyways, it's not going to crash, it's just going to do things wrong, and you'll get crap results and blame the framework.

Learn OpenMP and MPI for C and C++ code.
OpenMP also applies to other languages as well like Fortran I suppose.

Write smaller programs.
Other code languages/styles will let you do multithreading better (though multithreading is still really hard in any language) but the big benefit for regular developers, IMHO, is the ability to execute lots of smaller programs concurrently to accomplish some much larger task.
So, get in the habit of breaking your problems down into independent components that can be run whenever you want.
You'll build more maintainable software too.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio