Is there a relation between optimization and compression?

Optimization is doing more computation with less.
Compression is saying more with less.
Is there a relation between the two? Can you compress computations?

Yes.
Good question.
The connection is Information Theory.
Let me simplify it enormously:
There is a difference between an amount of a) data, and b) information.
Data is a physical quantity of something, such as bits on a disk or on a communication line.
Information is what the data tells whoever is reading it, and as a measure it is always less than or equal to the amount of data that carries it.
The amount by which the data exceeds the information that it carries is called its redundancy.
What compression does is reduce the data so it is closer to its information content, by reducing the redundancy.
What about computation? Think of a program as an information channel taking input data and processing it to recognize its information.
When the program has finished, it "knows" a certain amount of information about the input.
Every elemental operation it performs is one in which it gains (or loses) information, and it is not finished until it gains all the information.
Example: Searching for a particular key value in a table of 1024 numbers.
When the key is found, 10 bits of information have been learned, because it takes 10 bits to indicate where the number is in the table.
If binary search is used, it requires 10 comparisons, so each comparison is "learning" one bit.
There is a faster method if the key can be used as an index directly into the table.
In that case, the operation of indexing "learns" all 10 bits in a single step.
There is a slower method - linear search, which takes, on average, 512 comparisons.
In that method, each comparison only gets a fraction of a bit.
(That's why it's slow.)
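To make that concrete, here is a minimal sketch (in C++) of the three methods; the table contents and the key are illustrative assumptions, chosen so the key sits at index 700:

    #include <iostream>
    #include <vector>

    int main() {
        // A sorted table of 1024 numbers: table[i] = 2*i (illustrative data).
        std::vector<int> table(1024);
        for (int i = 0; i < 1024; ++i) table[i] = 2 * i;
        const int key = 2 * 700;  // the value stored at index 700

        // Linear search: 701 comparisons here, each "learning" only a
        // fraction of a bit.
        int linear = 0;
        for (int i = 0; i < 1024; ++i) {
            ++linear;
            if (table[i] == key) break;
        }

        // Binary search: exactly 10 comparisons, one bit learned per comparison.
        int binary = 0, lo = 0, hi = 1023;
        while (lo < hi) {
            ++binary;
            int mid = (lo + hi) / 2;
            if (table[mid] < key) lo = mid + 1; else hi = mid;
        }

        // Direct indexing: a single step that "learns" all 10 bits at once.
        int direct = key / 2;

        std::cout << "linear: " << linear << ", binary: " << binary
                  << ", index: " << direct << "\n";
    }

Running it reports 701 comparisons for the linear search, 10 for the binary search, and a single indexing step: the same 10 bits of information, gained at very different rates.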
So if computer operations are like data, optimization tries to minimize the number of operations for a given amount of information, while compression tries to minimize the number of data bits for a given amount of information.

It's a common trade-off in algorithms: trade extra space for faster computation (a lookup table of precomputed values), or the opposite (binary search or similar). I'd recommend a read on space and time complexity. You may be able to find some neat tricks helping you optimize or compress, whichever you need more.
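As a tiny illustration of the space-for-time trade, here is a sketch that spends 256 bytes on a precomputed table to count set bits with a few lookups instead of a loop (the table-building recurrence is a well-known trick, not anything specific to the discussion above):

    #include <array>
    #include <cstdint>
    #include <cstdio>

    // Space-for-time: precompute the bit count of every byte once (256 bytes
    // of extra space), then answer each query with four table lookups instead
    // of looping over all 32 bits.
    static std::array<uint8_t, 256> make_table() {
        std::array<uint8_t, 256> t{};
        for (int i = 0; i < 256; ++i)
            t[i] = static_cast<uint8_t>(t[i / 2] + (i & 1));
        return t;
    }

    int popcount32(uint32_t x) {
        static const auto table = make_table();
        return table[x & 0xff] + table[(x >> 8) & 0xff] +
               table[(x >> 16) & 0xff] + table[x >> 24];
    }

    int main() { std::printf("%d\n", popcount32(0xF0F0F0F0u)); }  // prints 16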

Related

Practical Efficiency of binary search

When searching an element or an insertion point in a sorted array, there are basically two approaches: straight search (element by element) or binary search. From the time complexities O(n) vs O(log(n)) we know that binary search is ultimately more efficient, however this does not automatically imply that binary search will always be faster than "normal search".
My question therefore is: can binary search be practically less efficient than "normal" search for small n? If yes, can we estimate the point at which binary search becomes more efficient?
Thanks!
Yes, a binary search can be practically less efficient than a "normal" search for small n. However, it is very hard to estimate the point at which a binary search becomes more efficient (if that is even possible), because it depends heavily on the problem (e.g. data type, search predicate), the hardware (e.g. processor, RAM), and, on modern systems, even the dynamic state of the hardware and the actual data in the sorted array when the search is performed.
The first reason a binary search can be less efficient is vectorization. Modern processors support SIMD instructions working on pretty wide vectors, so a linear search can process many items per cycle, and modern processors can often even execute a few SIMD instructions in parallel per cycle. While linear searches can often be trivially vectorized, that is not the case for binary searches, which are almost inherently sequential. Keep in mind that vectorization is not always possible, nor always done automatically by compilers, especially on non-trivial data types (e.g. composite data structures, pointer-based types) or non-trivial search predicates (e.g. ones with conditionals or memory indirections).
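As a sketch of why the linear case vectorizes so well: a compiler can turn a branch-light loop like the one below into SIMD compares, whereas each step of a binary search depends on the previous one. The exact code generation depends on your compiler and flags, so treat this as illustrative only:

    #include <cstddef>

    // Branch-light linear search over a plain int array: the loop body is a
    // simple compare-and-accumulate, which compilers can auto-vectorize
    // (e.g. comparing 8 ints per AVX2 instruction). A binary search has a
    // data-dependent branch and memory access at each step, so it cannot be
    // vectorized the same way.
    std::size_t lower_bound_linear(const int* a, std::size_t n, int key) {
        std::size_t count = 0;
        for (std::size_t i = 0; i < n; ++i)
            count += (a[i] < key);  // no early exit: keeps the loop vectorizable
        return count;               // index of the first element >= key
    }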
The second reason a binary search can be less efficient is branch predictability. Modern processors try to predict branches ahead of time to avoid pipeline stalls. If the prediction succeeds, branches are taken very quickly; otherwise the processor can stall for several cycles (up to dozens). A branch is easy to predict if it is always true or always false; a randomly taken branch cannot be predicted and causes stalls. Because the array is sorted, branches in a linear search are easy to predict (they go the same way until the element is found), while this is clearly not the case for a binary search. As a result, the speed of a search depends on the item searched for and on the data inside the sorted array.
The same thing applies to cache misses and memory fetches: because the latency of RAM is very large compared to that of arithmetic instructions, modern processors contain dedicated hardware prefetching units that try to predict upcoming memory fetches and pull data into the cache ahead of time. Prefetchers are good at predicting linear/contiguous memory accesses but very bad at random ones. The memory accesses of a linear search are trivial, while those of a binary search appear mostly random to the processor. A cache miss during a binary search will certainly stall the processor for many cycles. If the sorted array is already loaded in cache, however, a binary search over it can be much faster.
But this is not the whole story: using wide SIMD instructions or taking cache misses can lower the frequency of the computing core and thus the speed of the algorithm. The size of the data type also matters a lot, since memory throughput is limited and strided memory accesses are slower than contiguous ones. One should also take into account the additional complexity of a binary search compared to a linear one (i.e. often more instructions to execute). I have probably missed some important points in the above list.
As a programmer, you may need to define a threshold to choose which algorithm to use. If you really need one, the best solution is to find it automatically using a benchmark or auto-tuning methods. Practical experimentation shows that the threshold has shifted over the last decades for a given fixed context (data type, cache state, etc.) in favour of linear searches, so thresholds are generally increasing over time.
My personal advice is not to use a binary search for values of n smaller than 256 / data_type_size_in_bytes with trivial/native data types on mainstream processors. I think it is a good idea to use a binary search when n is bigger than about 1000, and also when the data type is non-trivial or the predicate is expensive.
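Here is a sketch of the kind of thresholded hybrid this implies; the cutoff of 64 is an arbitrary placeholder, to be replaced by whatever your benchmark or auto-tuner finds:

    #include <cstddef>

    // Hybrid lower_bound: binary steps while the range is large, then a short
    // linear scan once the range drops below a tuned cutoff.
    std::size_t lower_bound_hybrid(const int* a, std::size_t n, int key) {
        const std::size_t kCutoff = 64;  // placeholder; tune with a benchmark
        std::size_t lo = 0, hi = n;
        while (hi - lo > kCutoff) {      // binary search on the large range
            std::size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        while (lo < hi && a[lo] < key)   // finish with a linear scan
            ++lo;
        return lo;                       // index of the first element >= key
    }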

Computing vs printing

In my program, I made a few modifications for performance improvement.
First, I removed some repeated 3D point computations.
Second, I removed some print statements.
What I observe is that the second change substantially improved the performance, while the first one did not do much.
Does that mean computations involving floating-point numbers are much less expensive than printing data to the console? Isn't floating-point math considered highly computation-intensive?
Floating-point arithmetic is often more expensive than integer arithmetic, in terms of processor cycles and/or the space required for it in the silicon of processors and/or the energy required for it. However, printing is generally much more expensive.
Typical performance for floating-point additions or multiplications might be a latency of four processor cycles, compared to one for integer additions or multiplications.
Formatting output requires many instructions. Converting numbers to decimal requires divisions, table lookups, or other algorithms. The characters generated to represent a number must be placed in a buffer, and checks must be performed to ensure that internal buffers are not overflowed. When a buffer is full, or a printing operation is complete and must be sent to the output device (rather than merely held in a buffer for future operations), an operating system call must be performed to transfer the data from user memory to an input-output driver. Even simple in-buffer formatting operations may take hundreds of cycles, and printing that requires interaction with the file system or other devices may take thousands of cycles. (The actual upper limit is infinite, since printing may have to wait for some physical device to become ready. But even if all the activity of a particular operation stays inside the computer itself, a print operation may take thousands of cycles.)
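A rough way to see the gap yourself is the sketch below; the loop counts are arbitrary, the timings will vary wildly by machine, and running with stdout attached to a terminal versus redirected to /dev/null will itself change the printing cost dramatically:

    #include <chrono>
    #include <cstdio>

    int main() {
        using clk = std::chrono::steady_clock;
        const int N = 100000;
        volatile double sink = 0.0;  // volatile: keep the loop from being optimized away

        auto t0 = clk::now();
        for (int i = 0; i < N; ++i)
            sink = sink + 1.000001 * i;      // floating-point multiply and add
        auto t1 = clk::now();

        for (int i = 0; i < N; ++i)
            std::printf("%d\n", i);          // format + buffered output + syscalls
        auto t2 = clk::now();

        std::chrono::duration<double> calc = t1 - t0, print = t2 - t1;
        std::fprintf(stderr, "compute: %f s, print: %f s\n",
                     calc.count(), print.count());
    }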

For a parallel algorithm with N threads, can the performance gain be more than N?

A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear in the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (parallelization) tells us this is not possible for the general case. At best we can perfectly divide the work by N. With a parallelizable fraction P of the program and N processors, Amdahl's formula for speedup is:
Speedup = 1 / ((1 - P) + P/N)
Given no serial portion (P = 1), this becomes:
Speedup = 1/(1/N)
where N is the number of processors. This of course reduces to just N.
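For reference, a few lines evaluating the full formula; the parallel fractions used are just example values:

    #include <cstdio>

    // Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
    // parallelizable fraction of the work and n the number of processors.
    double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

    int main() {
        for (double p : {1.0, 0.95, 0.5})   // example parallel fractions
            std::printf("p=%.2f, n=8 -> speedup %.2f\n", p, amdahl(p, 8));
        // p=1.00 gives exactly 8; p=0.95 only ~5.9; p=0.50 only ~1.8.
    }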

Is there a speed difference in ordering by int vs. float?

When retrieving entries in a database, is there a difference between storing values as a float or decimal vs. an int when using ORDER BY in a SELECT statement?
It depends. You didn't specify the RDBMS, so I can only speak to SQL Server specifically, but data types have different storage costs associated with them: ints range from 1 to 8 bytes, decimals from 5 to 17 bytes, and floats from 4 to 8 bytes.
The RDBMS will need to read data pages off disk to find your data (worst case), and it can only fit so many rows on an 8k data page. So if you have 17-byte decimals, each read pulls in 1/17th the number of rows you could have had if you had sized your data correctly and used a tinyint, with its 1-byte storage cost, to store X.
That storage cost has a cascading effect when you go to sort (ORDER BY) your data. The engine will attempt to sort in memory, but if you have a bazillion rows and are starved for memory, it may spill to temp storage for the sort, and then you are paying that cost over and over.
Indexes may help, since the data can be stored in sorted order, but again, getting that data into memory may not be as efficient for obese data types.
[edit]
@Bohemian makes a fine point about the CPU efficiency of integer vs. floating-point comparisons, but it is amazingly rare for the CPU to be pegged on a database server. You are far more likely to be constrained by the disk I/O subsystem and memory, which is why my answer focuses on the speed difference of getting that data into the engine for the sort operation rather than on the CPU cost of comparison.
Since both int and float occupy exactly the same space on disk, and of course in memory, i.e. 32 bits, the only differences are in the way they are processed.
int should be faster to sort than float, because the comparison is simpler: processors can compare ints in one machine cycle, but a float's bits have to be "interpreted" to get a value before comparing (not sure how many cycles, but probably more than one, although some CPUs may have special support for float comparison).
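A quick, unscientific way to put a number on that claim outside of any database is to time std::sort on equal-sized int and float arrays; absolute results depend heavily on CPU and compiler, so this is a sketch, not a verdict. Build it optimized (e.g. -O2), or the comparison is meaningless:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    template <class T>
    double sort_seconds(std::vector<T> v) {   // take by value: sort a copy
        auto t0 = std::chrono::steady_clock::now();
        std::sort(v.begin(), v.end());
        std::chrono::duration<double> d = std::chrono::steady_clock::now() - t0;
        return d.count();
    }

    int main() {
        const std::size_t n = 1 << 20;        // ~1M elements
        std::mt19937 rng(42);
        std::vector<int> ints(n);
        std::vector<float> floats(n);
        for (std::size_t i = 0; i < n; ++i) {
            ints[i] = static_cast<int>(rng());
            floats[i] = static_cast<float>(rng()) / 3.0f;
        }
        std::printf("int:   %f s\n", sort_seconds(ints));
        std::printf("float: %f s\n", sort_seconds(floats));
    }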
In general, the choice of datatypes should be driven by whether the datatype is appropriate for storing the values that are required to be stored. If a given datatype is inadequate, it doesn't matter how efficient it is.
In terms of disk i/o the speed difference is second order. Don't worry about second order effects until your design is good with regard to first order effects.
Correct index design will result in a huge decrease in delays when a query can be retrieved in sorted order to begin with. However, speeding up that query is done at the cost of slowing down other processes, like processes that modify the indexed data. The trade off has to be considered to see whether it's worth it.
In short, worry about the stuff that's going to double your disk i/o or worse before you worry about the stuff that's going to add 10% to your disk i/o

Efficiency of algorithms vs. system specifications

We all talk about the efficiency of algorithms, and it depends on input size, basically.
But what about the system specifications of the computer that runs the algorithm? Does it make any difference to run a given sorting algorithm on a Core 2 Duo 2.6 GHz, 4 GB RAM computer or on a P-2, 256 MB RAM computer?
I am sure there must be a performance difference. But I want to know the real relationship between algorithms and system specifications...
An increase in hardware performance gives you a constant C times the running time of your algorithm. Meaning if computer A is overall 2 times slower than computer B, then your algorithm will be twice as fast on computer B. Twice as fast, though, makes hardly any difference when you consider big input values to an algorithm.
In big-O notation, that is to say you will have something like O(n) compared to C*O(n) = O(C*n) = O(n). The complexity of the algorithm and the general running time for large values will be about the same on both computer A and computer B.
If you analyze an algorithm's running time using something like big-O notation, then you will have a much better idea about how the algorithm really works. Computer performance won't give you any kind of advantage when you are comparing an algorithm that is O(log n) to one that is O(n^2).
Take a look at some of the data values for n:
I will assume 1 second per operation for the slow computer, and 2 operations per second for the fast computer, and compare the better algorithm on the slow computer with the worse algorithm on the fast computer.

For n = 10:
Algorithm 1, O(log n): 4 operations; slow computer: 4 seconds
Algorithm 2, O(n^2): 100 operations; fast computer: 50 seconds

For n = 100:
Algorithm 1, O(log n): 7 operations; slow computer: 7 seconds
Algorithm 2, O(n^2): 10,000 operations; fast computer: 1.4 hours
A large difference.

For n = 1,000:
Algorithm 1, O(log n): 10 operations; slow computer: 10 seconds
Algorithm 2, O(n^2): 1,000,000 operations; fast computer: 5.8 days
A huge difference.
As n increases, the difference gets bigger and bigger.
Now if you tried to run each of these algorithms on a faster or slower computer for a large input size, it wouldn't matter: hands down, the O(log n) algorithm would be faster.
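The same arithmetic in a few lines of C++, so you can plug in your own operation speeds and input sizes; the 1 op/s and 2 op/s figures come straight from the example above:

    #include <cmath>
    #include <cstdio>

    int main() {
        // Slow computer: 1 operation/second; fast computer: 2 operations/second.
        const double slow_ops_per_s = 1.0, fast_ops_per_s = 2.0;
        for (double n : {10.0, 100.0, 1000.0}) {
            double t_logn_slow = std::ceil(std::log2(n)) / slow_ops_per_s;
            double t_n2_fast   = (n * n) / fast_ops_per_s;
            std::printf("n=%6.0f: O(log n) on slow = %6.0f s, "
                        "O(n^2) on fast = %9.0f s\n",
                        n, t_logn_slow, t_n2_fast);
        }
    }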
I don't like the answers provided by Brian Bondy and Czimi...
Perhaps this is because I started in a different era, when 32K was considered a lot of memory and most "personal computers" had 8K bytes, and because I now work in scientific computing, where the largest data sets are processed on some of the world's largest systems with thousands of processing nodes and seemingly unbelievable quantities of storage. Therefore I don't overlook certain other elements of the question.
The size of the data set in question makes a fantastic difference. Almost all the answers on this question so far ignore this and work for very small numbers n. The other people who have answered have all presumed "it all fits in memory," or something close to that.
For large data sets other factors come into play, and "large" depends on what resources you have to use in solving your problem. Modern systems have the opportunity for off-line storage (e.g. DVDs), networked storage (e.g. NFS), on-line storage (e.g. serial ATA), and two levels of memory storage: system main memory and on-chip cache. How these are leveraged matters, and the larger the data set, the more they matter. You may or may not need to design access to these into your "algorithm", but if you do, it really matters!
As you increase scale beyond some particular point - the limit of a single CPU and its local memory is about right - these other factors become an increasingly large factor in the overhead of the workload. When I was at Digital, we did some of the first real commercial work on multi-CPU systems, and I remember running a benchmark that showed that, taking a single CPU as one "unit" of CPU workload capability, a second CPU (in a tightly coupled system) would give you a total of about 1.8. That is, the second CPU added about 0.8. For three, the increase dropped to about 0.6, and for four it dropped a lot more, to about 0.2, for a grand total of about 2.6 for a four-CPU arrangement, though we had some trouble keeping good numbers with four CPUs due to other effects (the measurement effort became a large fraction of the additional resource). ...The bottom line was that multi-CPUs weren't necessarily all they were cracked up to be - four times the CPU does NOT give you four times the processing power, even though in theory you get four times the flops. ...We repeated the work on the Alpha chip, an early multi-core design, and the results held up pretty well. Surely there could have been optimizations to improve the fraction each additional CPU gave, and surely there has been a lot of work since then to split computing threads more smartly, but you'll never get all the way to 100% of each new one, in part because they all slow down some (extra overhead) to coordinate.
Small interjection - we had a saying about this work: "Relegate all the Important Stuff to the Compiler!" RISC, get it? This was because the compiler itself had to organize the workload so competing threads didn't step on one another!
Ultimately, processing really massive data sets requires a really smart strategy for moving data in and out of farther-afield storage through to local memory, and division of labor within the algorithm is absolutely vital. In work I was doing with Roberto Mechoso at UCLA on Global Circulation Modeling, they had a data-broker design that is illustrative of the attempts people make to do a great job. Frankly, the result wasn't as good as it could have been, but the design ideas that went into it are worth study. ...Presuming you consider this part of your "algorithm", and not just the bit-twiddling part, the algorithm's management of resources is one of the most vital aspects of reasonable, if not optimal, resource utilization in substantial computing.
...I hope this helps answer your inquiry.
One thing not raised so far is that algorithms are often described in terms of speed, e.g. O(n), O(n log n), etc., but they also have characteristics in terms of resource usage, where improved speed, say O(n) versus O(n log n), comes at the cost of much greater memory usage. In modern computers, as resources become exhausted, they are typically replaced with larger but slower resources, e.g. swapping memory to disk, where the slower resource is orders of magnitude slower. Thus when we graph the performance of our algorithm against time and expect a straight line, an n log n curve, etc., we often see spikes for large values of n as memory gets exhausted. In this case, the difference between 1 GB and 2 GB of RAM can be huge, so in practical terms the answer to your question is yes: system specification is very important, and selecting an algorithm requires knowledge of the system specification and the size of the input data.
For example, I develop surface modelling and analysis software, and I know that my programs work well on a 32-bit XP box for TIN models of 4 million points. The performance difference between 3.5 million and 4 million points is minor. At 4.5 million points the performance degradation is so severe that the software is unusable.
Yes, it does depend on system specification. One system might be 10 times faster than another, so it will run bubblesort and quicksort on a set of data 10 times faster than the other.
But when you do analysis of algorithms, you often ignore constant factors like that, which is one thing that big-O notation does. So bubblesort is O(n^2) and quicksort is O(n log n) (in the average case), and that holds no matter how fast your hardware is.
The interesting thing is when you start comparing apples and oranges. If you're running bubblesort on your fast hardware, you may find it's faster than quicksort on the slow hardware -- but only up to a point. Eventually, with a large enough input set, the quicksort on the slow hardware is going to be faster than bubblesort on the fast hardware.
If you want to start making comparisons like that, you need to do two things together: determine algorithmic complexity including the constant factors, and develop a speed model (e.g. how many iterations of a particular loop it can perform per second) for the actual hardware you're running on. One of the interesting things about Knuth's Art of Computer Programming, compared with other books on algorithms, is that he does both, so that for each algorithm he examines, he calculates how many units of execution time it will take for a given size of input on his (mythical) MIX computer. You could then adjust the calculation for faster or slower hardware -- something that big-O notation doesn't help with.
By your question, do you mean to ask why the efficiency of an algorithm is described only in terms of the input size?
Algorithms are usually described using the Big O Notation. This notation describes the asymptotic behavior of an algorithm; it describes the behavior when the input data is very very large.
So for example, we have two algorithms for sorting.
Algo#1 with O(n)
Algo#2 with O(n^2)
And let's take two PCs:
PC1
PC2 100x faster than PC1
And we have two setups:
PC1 running Algo#1
PC2 running Algo#2
When n is very, very large (like billions?) PC1 will still beat PC2 :)
The efficiency of an algorithm doesn't depend on the system specification. The efficiency is described by the order of growth (the big-O), which gives you a relation between the processing effort and the size of the input.
Certainly, yes. With a faster CPU the execution time will shrink.
Similarly, with more memory, the time taken to swap data (if applicable) will definitely shrink.
Be aware of the particularities of the language in which you implement your algorithm.
For instance, in the Java world, as illustrated in this article, a faster computer does not always mean a faster runtime: the same Java program on a single-CPU machine can actually run a lot faster than on a multiprocessor/multi-core machine!
Does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHZ, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
In some cases, absolutely! If your data set does not fit into memory, you will need to use a disk-based sorting algorithm such as merge sort. Quoting from Wikipedia:
When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.
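As a sketch of the pattern the quote describes, here is an external-merge-sort skeleton, simulated entirely in memory for brevity; a real external sort would write each sorted run to a disk file and stream the runs back during the merge:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <tuple>
    #include <vector>

    // External-merge-sort pattern: sort fixed-size runs independently, then
    // k-way merge them. On a real system each run would live on disk and be
    // read back sequentially, which is the access pattern disks are good at.
    std::vector<int> external_sort(const std::vector<int>& data,
                                   std::size_t run_size) {
        std::vector<std::vector<int>> runs;
        for (std::size_t i = 0; i < data.size(); i += run_size) {
            std::size_t end = std::min(i + run_size, data.size());
            runs.emplace_back(data.begin() + i, data.begin() + end);
            std::sort(runs.back().begin(), runs.back().end());  // fits in "RAM"
        }
        // k-way merge with a min-heap of (value, run index, offset) triples.
        using Item = std::tuple<int, std::size_t, std::size_t>;
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        for (std::size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);
        std::vector<int> out;
        out.reserve(data.size());
        while (!heap.empty()) {
            auto [v, r, o] = heap.top();
            heap.pop();
            out.push_back(v);
            if (o + 1 < runs[r].size()) heap.emplace(runs[r][o + 1], r, o + 1);
        }
        return out;
    }

The point is the access pattern: during the merge each run is consumed strictly in order, so the total number of comparisons matters less than the fact that every block of "disk" is touched sequentially and only a few times.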
Yes. Remember, we have jumped roughly three or four orders of magnitude in clock speed since the early 80s (1 MHz, 10 MHz, 100 MHz, 1000 MHz).
But that's only the difference between n = 10 and n = 10,000 in terms of data set sizes. I can purchase a terabyte hard drive, some five orders of magnitude larger than my old 20-megabyte drive.
There's a lot more data floating around out there than there is compute power. So while you might wonder how useful big-O is at sizes like n = 50 or n = 500, when n = 1,000,000 you want to minimize n as much as you can. Anything superlinear is just rough on your compute power; non-polynomial is even worse. This extends all the way from the top of the system to the bottom. So efficiency is king as soon as you deal with real-world data set sizes.
Let me give you an example.
I did a junior level database design. By the end I had maybe 5 tables with maybe 20-40 pre-defined categories in them. Added rows, 10, 20 rows. No big deal. I was a whiz. I did it in PHP, not Perl. I was all that and a bag o' chips.
Move to now, a few years later. I'm doing a hobby project in datamining the stock market. I harvest data off a financial site every day: 6100 stocks, with about 10 columns per stock. Thirty thousand plus rows per week. My initial design was "normalized", with static data and dynamic data in different tables. As I played around with my queries, learning about things, if I did a bad join I'd literally crash my server and make it unavailable. So I denormalized. My next phase is tagging and starting actual mining. I don't plan to start making serious predictions until Christmas-time; roughly 11 x 30K = 330K rows to mine and analyze. Algorithm efficiency will matter if I want to get my data processed in a timely fashion. It wouldn't matter if my CPU were 10 times as fast: with an N^2 algorithm, that only buys me about 3x the data in the same time. :-)
But, I want to know what is the real relationship between algorithms and system specifications...
I am confused. I see a lot of people here writing a lot of things, however, when I read the quote above as a question, here's all I can say about it:
A better system (= faster CPU, more RAM) runs faster, a worse one (= slower CPU, less RAM) runs slower. The same algorithm (no matter how good or bad it is) will most likely run faster on a better system and slower on the worse one.
A faster algorithm runs faster than a slower one. It will run faster on the slower system and it will run faster on the faster system.
So what exactly was your question again? Is your question "Do we really need a fast algorithm if the system is already that fast? Won't a slow one do as well?" Yes, maybe. But in that case I would ask two questions:
Why select a slow algorithm just because the system is fast? That way your code will only run at decent speed on a very fast system. If you choose a fast algorithm, your code might run at decent speed even on a much worse system.
Why try to intentionally achieve worse performance? Even though a bad algorithm might run within five seconds, which you consider fast enough on the fast machine, a good one might run in 100 milliseconds. So why make your program take 5 seconds for a task it could perform in 100 milliseconds?
Actually it's point number (2) that really bugs me quite often. So often people say "Hey, don't over-optimize, it won't really matter. This code is only such a small part of such a big system." Yes, if you just look at this code in isolation, that is true. But there is a saying, "Many a mickle makes a muckle." Of course you should optimize the most processor-intensive parts first. However, if a system consists of 100 modules and each of them uses only one percent of the CPU time, optimizing one of them to be twice as fast will only get an overall processing-time improvement of 0.5%, close to nothing; that's why most people refrain from doing that. But what people overlook is that optimizing all of them to be twice as fast will get a processing-time improvement of 50% (in other words, the app will run twice as fast as a whole), as the little calculation below shows.
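The arithmetic of that last point, spelled out: 100 modules, each using 1% of the CPU time, with each optimization halving its module's cost:

    #include <cstdio>

    int main() {
        const int modules = 100;
        for (int optimized : {1, 10, 50, 100}) {
            // Each module initially costs 1 unit; an optimized module costs 0.5.
            double new_total = (modules - optimized) * 1.0 + optimized * 0.5;
            double saved = 100.0 * (modules - new_total) / modules;
            std::printf("optimize %3d of 100 modules -> "
                        "total time %5.1f%%, saved %4.1f%%\n",
                        optimized, new_total, saved);
        }
        // optimize 1   -> saved  0.5%
        // optimize 100 -> saved 50.0%
    }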
So unless there is a good reason not to, why not always use the best algorithm known to solve a problem? "Best" means the one that shows good performance and a good CPU-time/memory-usage ratio, as it's useless to take the fastest one if it needs more memory than a normal customer PC can even hold.
