efficiency /algorithms vs system specifications - algorithm

We are all talking about the efficiency of the algorithms and it depends on input size -basically.
How about the system specifications of current computer that runs the algorithm? does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHZ, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
I am sure that there must be a performance difference. But, I want to know what is the real relationship between algorithms and system specifications...

An increase in hardware performance will give you a constant C times the running time of your algorithm. Meaning if you have computer A which is overall 2 times slower than computer B. Than your algorithm will be twice as fast on computer B. Twice as fast though really makes hardly no difference when you consider big input values to an algorithm though.
In big O notation that is to say you will have something like O(n) compared to CO(n) = O(cn) = O(n). The complexity of the algorithm and general running time for large values will be about the same on both Computer A and Computer B.
If you analyze an algorithm's running time using something like big O notation, then you will have a much better idea about how the algorithm really works. Computer performance won't give you any kind of advantage when you are comparing an algorithm that is O(logn) compared to O(n^2).
Take a look at some of the data values for n:
I will assume 1 second per operation for the slow computer, and 2 operations for second for the fast computer. I will compare the better algorithm with the slow computer with the worse algorithm with the fast computer.
for n = 10:
Algorithm 1: O(logn): 4 operations
Slow computer: 4 seconds
Algorithm 2: O(n^2): 100 operations
Fast computer: 50 seconds
for n = 100:
Algorithm 1: O(logn): 7 operations
Slow computer: 7 seconds
Algorithm 2: O(n^2): 10,000
operations Fast computer: 1.4 hours
Large difference
for n = 1,000:
Algorithm 1: O(logn): 10 operations
Slow computer: 10 seconds
Algorithm 2: O(n^2): 1,000,000
operations Fast computer: 5.8 days
Huge difference
As n increases, the difference gets bigger and bigger.
Now if you tried to run each of these algorithms on a faster/slower computer for a large input size. It wouldn't matter. Hands down the O(logn) would be faster.

I don't like the answers provided by Brian Bondy and Czimi...
Perhaps this is because I started in a different era, when 32K was considered a lot of memory, and most "personal computers" had 8K bytes, and that now I work in scientific computing where the largest data sets are processed on some of the world's largest systems with thousands of processing nodes and seemingly unbelievable quantities of storage. Therefore I don't overlook certain other elements of the question.
The size of the data set in question makes a fantastic difference. Most all the answers on this question so far ignore this and work for very small numbers N. The other people who have answered have all presumed "it all fits in memory," or something close to that.
For large data sets other factors come into play, and "large" depends on what resources you have to use in solving your problem. Modern systems have the opportunity for off-line storage (e.g. DVDs), networked storage (e.g. nfs), on-line storage (e.g. serial ATA), and two levels of memory storage, system main memory and on-chip cache. How these are leveraged matters and the larger the data set the more they matter. You may or may not need to design access to these into your "algorithm", but if you do, it really matters!
As you increase scale beyond some particular point - the limit of a single CPU and its local memory is about right - these other factors become an increasingly large factor in the overhead of the workload. When I was a Digital, we did some of the first real commercial work on multi-CPU systems and I remember running a benchmark that showed that using a single-CPU as one "unit" of CPU workload capability, a second CPU (in a tightly coupled system) would give you a total of about 1.8. That is, the second CPU added about 0.8. For three, the increase dropped to about 0.6, and four it dropped a lot more, to about 0.2, for a grand total of about 2.6 for a four CPU arrangement, though we had some troubles keeping good numbers with four CPUs due to other effects (the measurement effort became a large fraction of the additional resource). ...The bottom line was that multi-CPUs weren't necessarily all they were cracked up to be - four times the CPU does NOT give you four times the processing power, even though in theory you get four times the flops. ...We repeated the work on the Alpha chip, the first multi-core in history, and the results held up pretty well. Surely there could have been optimizations to improve the fraction each additional CPU gave, and surely there has been a lot of work since then to split computing threads more smartly, but you'll never get it all the way to 100% of each new one, in part because they all slow down some (extra overhead) to coordinate.
Small interjection - we had a saying about this work: "Religate all the Important Stuff to the Compiler!" RISC, get it? This was because the compiler itself had to organize the workload so competing threads didn't step on one another!
Ultimately performing processing of really massive data crunching requires a really smart strategy of moving the data in and out of farther afield data storage through to local memory. And, division of labor within the algorithm is absolutely vital. In work I was doing with Roberto Mechoso at UCLA doing Global Circulation Modeling, they had a data-broker design that is illustrative of the attempts people make to do a great job. Frankly, the result wasn't as good as it could have been, but the design ideas that went into it are worth study. ...Presuming you consider this part of your "algorithm" - and not just the bit twiddling part, then the algorithms management of resources is one of the most vital aspects of reasonable if not optimal resource utilization doing substantial computing.
...I hope this helps answer your inquiry.

One thing not raised so far is that alogorithms are often described in terms of speed, e.g. O(n), O(n log(n)), etc... but they also have characteristics in terms of resource usage, where improved speed, say O(n) versus O (n log(n)), is at the cost of much greater memory usage. In modern computers as resources become exhausted, they are typically replaced with larger slower resources, e.g. swapping memory for disk, where the slower resource is orders of magnitude slower. Thus when we graph the performance of our algorithm against time, and expect a straight line, n log n curve, etc... we often see spikes for large values of n as memory gets exhuasted. In this case, the difference between 1GB and 2GB of RAM can be huge, so in practical terms, the answer to your question is yes, System specification is very important, and selection of algorithms requires knowledge of the system specification and the size of the input data.
For example, I develeop surface modelling and analysis software, and I know that my programs work well on a 32bit XP box for TIN models of 4 million points. The performance difference between 3.5 million and 4 million points is minor. At 4.5 million points the performance degradation is so severe the software is unusable.

Yes, it does depend on system specification. One system might be 10 times faster than another, so it will run bubblesort and quicksort on a set of data 10 times faster than the other.
But when you do analysis of algorithms, you often ignore constant factors like that, which is one thing that big-O notation does. So bubblesort is O(n^2) and quicksort is O(nlogn) (in the average case), and that holds no matter how fast your hardware is.
The interesting thing is when you start comparing apples and oranges. If you're running bubblesort on your fast hardware, you may find it's faster than quicksort on the slow hardware -- but only up to a point. Eventually, with a large enough input set, the quicksort on the slow hardware is going to be faster than bubblesort on the fast hardware.
If you want to start making comparisons like that, you need to do two things together: determine algorithmic complexity including the constant factors, and develop a speed model (e.g. how many iterations of a particular loop it can perform per second) for the actual hardware you're running on. One of the interesting things about Knuth's Art of Computer Programming, compared with other books on algorithms, is that he does both, so that for each algorithm he examines, he calculates how many units of execution time it will take for a given size of input on his (mythical) MIX computer. You could then adjust the calculation for faster or slower hardware -- something that big-O notation doesn't help with.

By your question, do you mean to ask why the efficiency of an algorithm is described only in terms of the input size?
Algorithms are usually described using the Big O Notation. This notation describes the asymptotic behavior of an algorithm; it describes the behavior when the input data is very very large.
So for example, we have two algorithms for sorting.
Algo#1 with O(n)
Algo#2 with O(n^2)
And let's take two PCs:
PC1
PC2 100x faster than PC1
And we have two setups:
PC1 running Algo#1
PC2 running Algo#2
When n is very very large (like billions?) PC1 will still beat PC1 :)

The efficiency of an algorithm doesn't depend on the system specification. The efficiency is described by the Ordo number, which gives you a relation of the processing effort and the size of the input.

Certainly yes. With high CPU the execution time will reduce.
Similarly with higher memory, the time taken to swap data (if applicable) will definitely reduce.

Be aware of the particularities of the language in which you implement your algorithm.
For instance, in java world, As illustrated in this article, a faster computer does not always means a faster runtime:
Same Java program on a single CPU machine can actually run a lot faster than on a multiprocess/multi-core machine!!

Does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHZ, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
In some cases absolutely! If your data set does not fit into memory you will need to use a disk based sorting algorithm such as merge sort. Quoting from Wikipedia:
When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.

Yes. Remember, we have jumped ~4 orders of magnitude since the early 80s. (1 MHz, 10 MHz, 100 MHz, 1000MHz).
But that's only the difference between n=10 and n=10000, in terms of data set sizes. I can purchase a terabyte hard drive...over 6?7? orders of magnitude than my old 20 megabyte drive.
There's a lot more data floating around out there than there is compute power. So while you might be confused about how useful the big-O is at n=50, n=500 kind of sizes...when n=1,000,00 you want to minimize n as much as you can. Anything supralinear is just rough on your compute power...non-polynomial is even worse. This extends all the way from the top of the system to the bottom. So, efficiency is king as soon as you deal with real-world dataset sizes.
Let me give you an example.
I did a junior level database design. By the end I had maybe 5 tables with maybe 20-40 pre-defined categories in them. Added rows, 10, 20 rows. No big deal. I was a whiz. I did it in PHP, not Perl. I was all that and a bag o' chips.
Move to now, a few years later. I'm doing a hobby project in datamining the stock market. I harvest data off a financial site every day - 6100 stocks, with about 10 columns in each stock. Thirty-thousand+ rows per week. My initial design was "normalized", with static data and dynamic data in different tables. As I played around with my queries, learning about things, if I did a bad join, I'd literally crash my server and make it unavailable. So I denormalized. My next phase is tagging and starting actual mining. I don't plan to start making serious predictions until Christmas-time; roughly 11x30K = 330K rows to mine and analyze. Algorithm efficiency will matter, if I want to get my data processed in a timely fashion. Doesn't matter if my CPU was 10 times as fast...if I use a N^2 algorithm, it'd only get done 2x as fast. :-)

But, I want to know what is the real
relationship between algorithms and
system specifications...
I am confused. I see a lot of people here writing a lot of things, however, when I read the quote above as a question, here's all I can say about it:
A better system (= faster CPU, more RAM) runs faster, a worse one (= slower CPU, less RAM) runs slower. The same algorithm (no matter how good or bad it is) will most likely run faster on a better system and slower on the worse one.
A faster algorithm runs faster than a slower one. It will run faster on the slower system and it will run faster on the faster system.
So what exactly was your question again? Is your question "Do we really need a fast algorithm if the system is already that fast? Won't a slow one do as well?" Yes, maybe. But in that case I would ask two questions:
Why selecting a slow algorithm just because the system is fast? That way your code will only run at decent speed on a very fast system. If you choose a fast algorithm, your code might even run at decent speed on a much worse system.
Why trying to intentionally achieve worse performance? Even though a bad algorithm might run within five seconds, what you consider fast enough on the fast machine, a good one might run in 100 milliseconds. So why making your program perform a task in 5 seconds it could perform exactly the same one in 100 milliseconds?
Actually it's point number (2) that really bugs me quite often. So often people say "Hey, don't over optimize, it won't really matter. This code is only such a small system in such a big system". Yes, if you just look at this code isolated, that is true. But there is a saying "Many a mickle makes a muckle". Of course you should optimize the most processor intensive parts first. However, if a system consists out of 100 modules and each of them uses only one percent of the CPU time, optimizing one of them to be twice as fast will only get an overall processing time improvement of 0.5%, close to nothing; that's why most people refrain from doing that. But what people overlook is that optimizing all of them to be twice as fast will get a processing time improvement of 50% (or IOW, the app will run twice as fast as a whole).
So unless there is any good reason for not doing it, why not always using the best algorithm known to solve a problem? The best means the one that shows good performance and a good CPU time/memory usage ratio (as it's useless to take the fastest one if this one needs more memory than a normal customer PC can even take).

Related

Roofline model: How does increasing Arithmetic Intensity allow room for improvements to performance?

Intel Tip: If you can’t break a memory roof, try to rework your algorithm for
higher arithmetic intensity. This will move you to the right and give
you more room to increase performance before hitting the memory
bandwidth roof.
For algorithms in the memory-bound region of a roofline plot, Intel suggests increasing the arithmetic intensity so that they move to the right (compute-bound region) hence providing room to improve the performance, since the performance roof would be higher.
I'm unable to understand how increasing the arithmetic intensity (say, increasing the no. of operations in the algorithm) can possibly improve a performance metric like the wall-clock time taken for the algorithm to run. Wouldn't you need to do more no. of computations even for a higher performance (in FLOPS)? Could someone explain how this is possible?
Increasing the arithmetic alone is not sufficient to make the algorithm faster. The idea is if you have a choice between multiple algorithm and one of them is memory bound, then it is probably better to pick the other one assuming it is not much slower in practice since you can hardly optimize memory-bound algorithm while this is often much easier for compute-bound one. The memory latency did almost not improve much over the last decade and the bandwidth is only slowly increasing (much less than the number of FLOPS of processors). This is known as the Memory Wall (stated several decades ago). Moving data becomes so expensive nowadays that it is sometimes better to recompute operations rather than storing the previous results. This is especially true for very large data since the bigger the data structure the slower it is. This situation is expected to become worse over the next decades. Thus, a slower compute-bound algorithm can become faster than a memory-bound one in a near future (especially if it can-be/is parallelized).

Understanding perf detail when comparing two different implementations of a BFS algorithm

The results below are measured using perf on a compute server with 32 cores. I know my implementation is unoptimized but purposely as I want to make comparisons. I understand that graph algorithms tend to have low locality which researchers try to address.
I'm unclear of the results, though. The time elapsed is misleading. My implementation runs through a graph with about 4mm nodes in about 10 seconds and the rest of the time pre processing. The optimized version uses the same input and traverses about 10 times with each less than a second each so it's really just pre-processing time. I'm not trying to achieve the same. Just understand why that may be based on perf.
I see my page faults are substantially higher. I'm not 100 sure why this is this the case as the annotations (from what I can tell) do not point to any specific piece of my code from mine...
__gnu_cxx::new_allocator<std::_List_node<int> >::construct<int, int const&>
This seems to be when I process the graph itself since I create linked lists for the adjacency lists. I figured this may actually cause issues and wanted to investigate anyway. I should be able to improve page faults (and hopefully performance) by switching to jagged arrays?
The optimized algorithm has a much higher last level cache miss which I thought would explain the primary issue with BFS / graph algorithms with low locality but performance seems to be unaffected by this and my unoptimized is significantly lower.
Then there are the front / back end cycles which seems to be the opposite in terms of performance issues when comparing the two - I'm worse in frontend and the optimized is worse in backend.
Am I missing or not understanding something obvious? I thought there would be something obvious in terms of low locality that would be of issue when looking at perf but I'm confused by the optimized version.
This is my implementation of unoptimized parallel BFS (running once)...
This is using an optimized parallel BFS from a benchmark suite (running 10 times)...
Both take about 40 seconds to pre-process the data once, before doing parallel searching.
Unfortunately perf stat often doesn't given enough information to really determine where the bottleneck in your application is. It is possible to have two applications with wildly different underlying bottlenecks but with very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, and yet one might be dominated by this effect and the other way may almost be not impacted at all, depending on the amount and nature of overlapping work.
So if you try to analyze in depth from these high level counters, you are often just taking stabs in the dark. Still we can make a few observations. You mention:
The optimized algorithm has a much higher last level cache miss which
I thought would explain the primary issue with BFS / graph algorithms
with low locality but performance seems to be unaffected by this and
my unoptimized is significantly lower.
First, LLC misses are ~620 million for the optimized algorithm and ~380 for your algorithm, but you are running the optimized algorithm 10 times in this benchmark and yours only once. So the optimized algorithm has perhaps 62 million misses, and your algorithm has six times the number of LLC misses. Yes, your algorithm has a lower LLC miss rate - but the absolute number of LLC misses is what counts for performance. The lower miss rates just means that you are making every more total accesses than the 6x figure: basically you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but more total misses.
All of this points to accessing more total memory in your unoptimized algorithm, or perhaps accessing it in a much more cache unfriendly fashion. That's would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC) and given that there aren't branch prediction problems, and that you've already identified these as graph algorithms with locality/memory access problems, stalls while waiting for memory are very likely.
Luckily, there is a better way that just trying to reverse engineer what might be the bottleneck based on perf stat output. Intel has developed a whole methodology which tries to this type of top-down analysis in a way that determines the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology effect using Andi Kleen's toplev. I highly recommend you start there.

Memory data layout vs algorithm performance

How layout of a data in memory effects on algorithm performance?
For example merge sort is know for it computational complexity of O(n log n).
But in real world machine that processing algorithm will load/unload blocks of memory into CPU caches / CPU registers and spend auxiliary time on it.
Elements of collection to be sorted could be very scattered throughout the memory, and I wonder it will cause in slower performance vs sorting over gathered together elements.
Is in necessary to take into account how collections are really stores the data in memory?
In terms of big O notation - no. The time you read each block from
RAM to cpu cache is bounded by some constant, let it be C, so even
if you need to load each element in every iteration from RAM to
cache, you are going to need O(C*nlogn) time, but since C is
constant - it remains O(nlogn) time complexity.
In real world applications, especially when dealing with real-time apps, cache performance could be indeed a factor, and should be considered, so the order of accessing data, could matter. This is one of the reasons why quicksort is usually regarded as "faster" - it tends to have nice cache performance.
In addition - there are some algorithms that are developed to enjoy the "best of two worlds" - both O(nlogn) worst case with better constants, such as Timsort.
However, as rule of thumb, you should usually first implement the "easy way", then benchmark to see if it's fast enough, profile if it's not - and optimize the bottleneck. If you'll try to optimize every piece of your code for best cache performance - you will probably never finish writing it.
Profiling, profiling, profiling.
Modern computer architectures have become so complicated that accurate predictions on the running time have become impossible. You should prefer an experimental approach.
Also note that running times are no more deterministic and you should resort to statistical methods.
Architecture killed the algorithmician.
How layout of a data in memory effects on algorithm performance?
Layout is very important especially for large amount of data because access to the main memory is still expensive even for modern CPU:
http://mechanical-sympathy.blogspot.ru/2013/02/cpu-cache-flushing-fallacy.html
And your algo may spend much time on each cache miss:
http://mechanical-sympathy.blogspot.ru/2012/08/memory-access-patterns-are-important.html
Moreover, now there is a special area in Computer Science called Cache-friendly data structures and algos. See, for example, just googled:
http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/LFN02.pdf
etc etc

For parallel algorithm with N threads, can performance gain be more than N?

A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better that linear with number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (parallelization) tells us this is not possible for the general case. At best we can perfectly divide the work by N. The reason for this is that given no serial portion, Amdahl's formula for speedup becomes:
Speedup = 1/(1/N)
where N is the number of processors. This of course reduces to just N.

Algorithms for modern hardware?

Once again, I find myself with a set of broken assumptions. The article itself is about a 10x performance gain by modifying a proven-optimal algorithm to account for virtual memory:
On a modern multi-issue CPU, running
at some gigahertz clock frequency, the
worst-case loss is almost 10 million
instructions per VM page fault. If you
are running with a rotating disk, the
number is more like 100 million
instructions.
What good is an O(log2(n)) algorithm
if those operations cause page faults
and slow disk operations? For most
relevant datasets an O(n) or even an
O(n^2) algorithm, which avoids page
faults, will run circles around it.
Are there more such algorithms around? Should we re-examine all those fundamental building blocks of our education? What else do I need to watch out for when writing my own?
Clarification:
The algorithm in question isn't faster than the proven-optimal one because the Big-O notation is flawed or meaningless. It's faster because the proven-optimal algorithm relies on an assumption that is not true in modern hardware/OSes, namely that all memory access is equal and interchangeable.
You only need to re-examine your algorithms when your customers complain about the slowness of your program or it is missing critical deadlines. Otherwise focus on correctness, robustness, readability, and ease of maintenance. Until these items are achieved any performance optimization is a waste of development time.
Page faults and disk operations may be platform specific. Always profile your code to see where the bottlenecks are. Spending time on these areas will produce the most benefits.
If you're interested, along with page faults and slow disk operations, you may want to aware of:
Cache hits -- Data Oriented Design
Cache hits -- Reducing unnecessary
branches / jumps.
Cache prediction -- Shrink loops so
they fit into the processor's cache.
Again, these items are only after quality has been achieved, customer complaints and a profiler has analyzed your program.
One important thing is to realize that the most common usage of big-O notation (to talk about runtime complexity) is only half of the story - there's another half, namely space complexity (that can also be expressed using big-O) which can also be quite relevant.
Generally these days, memory capacity advances have outpaced computing speed advances (for a single core - parallelization can get around this), so less focus is given to space complexity, but it's still a factor that should be kept in mind, especially on machines with more limited memory or if working with very large amounts of data.
I'll expand on GregS's answer: the difference is between effective complexity and asymptotic complexity. Asymptotic complexity ignores constant factors and is valid only for “large enough” inputs. Oftentimes “large enough” can actually mean “larger than any computer can deal with, now and for a few decades”; this is where the theory (justifiably) gets a bad reputation. Of course there are also cases where “large enough” means n=3 !
A more complex (and thus more accurate) way of looking at this is to first ask “what is the size range of problems you are interested in?” Then you need to measure the efficiency of various algorithms in that size range, to get a feeling for the ‘hidden constants’. Or you can use finer methods of algorithmic asymptotics which actually give you estimates on the constants.
The other thing to look at are ‘transition points’. Of course an algorithm which runs in 2n2 time will be faster than one that runs in 1016nlog(n) times for all n < 1.99 * 1017. So the quadratic algorithm will be the one to choose (unless you are dealing with the sizes of data that CERN worries about). Even sub-exponential terms can bite – 3n3 is a lot better than n3 + 1016n2 for n < 5*1015 (assuming that these are actual complexities).
There are no broken assumptions that I see. Big-O notation is a measure of algorithmic complexity on a very, very, simplified idealized computing machine, and ignoring constant terms. Obviously it is not the final word on actual speeds on actual machines.
O(n) is only part of the story -- a big part, and frequently the dominant part, but not always the dominant part. Once you get to performance optimization (which should not be done too early in your development), you need to consider all of the resources you are using. You can generalize Amdahl's Law to mean that your execution time will be dominated by the most limited resource. Note that also means that the particular hardware on which you're executing must also be considered. A program that is highly optimized and extremely efficient for a massively parallel computer (e.g., CM or MassPar) would probably not do well on a big vector box (e.g., Cray-2) nor on a high-speed microprocessor. That program might not even do well on a massive array of capable microprocessors (map/reduce style). The different optimizations for the different balances of cache, CPU communications, I/O, CPU speed, memory access, etc. mean different performance.
Back when I spent time working on performance optimizations, we would strive for "balanced" performance over the whole system. A super-fast CPU with a slow I/O system rarely made sense, and so on. O() typically considers only CPU complexity. You may be able to trade off memory space (unrolling loops doesn't make O() sense, but it does frequently help real performance); concern about cache hits vice linear memory layouts vice memory bank hits; virtual vs real memory; tape vs rotating disk vs RAID, and so on. If your performance is dominated by CPU activity, with I/O and memory loafing along, big-O is your primary concern. If your CPU is at 5% and the network is at 100%, maybe you can get away from big-O and work on I/O, caching, etc.
Multi-threading, particularly with multi-cores, makes all of that analysis even more complex. This opens up into very extensive discussion. If you are interested, Google can give you months or years of references.

Resources