What is the difference between a floating point merge and an integer merge?

In this paper, two cases have been considered for comparing algorithms - integers and floating points.
I understand the differences between these data types in terms of storage, but I am not sure why there would be a performance difference between them.
Why is there a difference in performance between the following two cases?
Using merge sort on Integers
Using merge sort on Floating Points
I understand that it comes down to a speed comparison in both cases; the question is why these speeds might be different.

The paper states, in section 4, “Conclusion”, “the execution time for merging integers on the CPU is 2.5X faster than the execution time for floating point on the CPU”. This large a difference is surprising on the Intel Nehalem Xeon E5530 used in the measurements. However, the paper does not give information about source code, specific instructions or processor features used in the merge, compiler version, or other tools used. If the processor is used efficiently, there should be only very minor differences in the performance of an integer merge versus a floating-point merge. Thus, it seems likely that the floating-point code used in the test was inefficient and is an indicator of poor tools used rather than any shortcoming of the processor.

Merge sort has an inner loop with quite a few instructions. Comparing floats might be a little more expensive, but only by 1-2 cycles, and you will not notice that difference amid the much larger amount of merge code.
Comparing floats is hardware accelerated and fast compared to everything else you are doing in that algorithm.
Also, the comparison likely can overlap other instructions so the difference in wall-clock time might be exactly zero (or not).
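To make that concrete, here is a minimal sketch (not the paper's code, which isn't published) of a merge of two sorted runs, templated on the element type. The only type-dependent work is the single comparison; the index arithmetic, loads, stores, and branches are identical whether T is int or float, which is why a 2.5X gap is hard to attribute to the comparison itself.

```cpp
#include <cstddef>

// Merge two sorted runs a[0..na) and b[0..nb) into out.
// The only operation that differs between T = int and T = float is
// the comparison a[i] <= b[j]; the surrounding loads, stores, index
// updates, and branches are the same for both element types.
template <typename T>
void merge_runs(const T* a, std::size_t na,
                const T* b, std::size_t nb, T* out) {
    std::size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}
```

Timing merge_runs&lt;int&gt; against merge_runs&lt;float&gt; on the same data sizes is a simple way to check the paper's claim on your own hardware.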

Related

CUDA Sorting Many Vectors / Arrays

I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.
I am trying to parallelize an algorithm which needs to sort each vector and calculate its average, standard deviation and skewness.
In the next step, the algorithm has to delete the maximal element and repeat the calculation of the statistical moments until some criterion is not fulfilled, independently for each vector.
I would like to ask someone more experienced what the best approach is to parallelize this algorithm.
Is it possible to sort more than one vector at once?
Or maybe it is better not to parallelize the sorting, but to run the whole algorithm for each vector as one thread?
200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory.
2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vectors = 200,000 blocks; but you can't have 2,000 threads in a block - that's excessive.
You might be able to use cub's block radix sort, as @talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with thrust, but there's also a good chance you'll have a lot of overhead and complex code (I may be wrong though). Give serious consideration to adapting an existing (bitonic) sort kernel, or even writing your own - although that's more challenging to get right.
Anyway, if you write your own kernel, you can code your "next step" after sorting the data.
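For illustration, here is a rough, untested sketch of the one-block-per-vector idea using CUB's block radix sort. The block size and items-per-thread values are arbitrary choices (not requirements), and the INT_MAX padding assumes you want ascending order.

```cpp
#include <cub/cub.cuh>
#include <climits>
#include <cstddef>

constexpr int VEC_LEN          = 2000;  // elements per vector (from the question)
constexpr int BLOCK_THREADS    = 256;   // assumed block size
constexpr int ITEMS_PER_THREAD = 8;     // 256 * 8 = 2048 >= 2000

// One block sorts one vector in shared memory/registers.
__global__ void sort_vectors(int* d_data)   // d_data holds num_vectors * VEC_LEN ints
{
    using BlockSort = cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
    __shared__ typename BlockSort::TempStorage temp_storage;

    int* vec = d_data + static_cast<std::size_t>(blockIdx.x) * VEC_LEN;  // this block's vector
    int items[ITEMS_PER_THREAD];

    // Load in "blocked" order, padding the tail with INT_MAX so the
    // padding sorts to the end and can simply be ignored when storing.
    cub::LoadDirectBlocked(threadIdx.x, vec, items, VEC_LEN, INT_MAX);

    BlockSort(temp_storage).Sort(items);    // cooperative sort across the block
    __syncthreads();

    // The "next step" (statistics, removing the maximum, etc.) could be
    // coded here; for now just write the real elements back.
    cub::StoreDirectBlocked(threadIdx.x, vec, items, VEC_LEN);
}

// Launch with one block per vector, e.g.:
//   sort_vectors<<<200000, BLOCK_THREADS>>>(d_data);
```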
Or maybe it is better not to parallelize the sorting, but to run the whole algorithm for each vector as one thread?
This depends on how much time your application spends on these sorting efforts at the moment, relative to its entire running time. See also Amdahl's Law for a more formal statement of the above. Having said that - typically it should be worthwhile to parallelize the sorting when you already have data in GPU memory.

What all operations does FLOPS include?

FLOPS stands for FLoating-point Operations Per Second, and I have some idea what floating point is. I want to know what these operations are. Are +, -, *, / the only operations, or do operations like logarithm() and exponential() also count as floating-point operations?
Do + and * of two floats take the same time? And if they take different times, what interpretation should I draw from the statement "performance is 100 FLOPS"? How many + and * are there in one second?
I am not a computer science guy, so kindly try to be less technical. Also let me know if I have understood it completely wrong.
Thanks
There is no specific set of operations that are included in FLOPS, it's just measured using the operations that each processor supports as a single instruction. The basic arithmetic operations are generally supported, but operations like logarithms are calculated using a series of simpler operations.
On modern computers, the supported floating point operations can generally be issued at a rate of one or more per clock cycle (even though each has a latency of several cycles, and division is noticeably slower). Even if the complexity differs a bit between operations, getting the data in and out of the processor is usually the bottleneck.
The reason that FLOPS is still a useful measure of computing speed is that CPUs are not specialized for floating point calculations. Adding more floating point units to the CPU would drive up the FLOPS, but there is no big market for CPUs that are only good at that.
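As a rough illustration of what gets counted, the sketch below times a loop of multiply-adds and divides the operation count by the elapsed time. It measures a single dependent chain, so it will report far less than the processor's peak, but it shows that a FLOPS figure simply means "floating point adds/multiplies completed per second".

```cpp
#include <chrono>
#include <cstdio>

int main() {
    const long long n = 200000000LL;          // number of iterations
    double x = 0.0;
    const double a = 1.000000001, b = 1e-9;

    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < n; ++i)
        x = x * a + b;                        // 1 multiply + 1 add per iteration
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // Print x as well so the compiler cannot optimize the loop away.
    std::printf("~%.3g FLOP/s (x = %g)\n", 2.0 * n / secs, x);
    return 0;
}
```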

GPU and determinism

I was thinking of off-loading some math operations to the GPU. As I'm already using D3D11, I'd use a compute shader to do the work. But the thing is, I need the results to be the same for the same input, no matter what GPU the user might have (the only requirement being that it supports compute shader 4.0).
So is floating point math deterministic on GPUs?
If not, do GPUs support integer math?
I haven't used DirectCompute, only OpenCL.
GPUs definitely support integer math, both 32-bit and 64-bit integers. A couple of questions already have this discussion:
Integer Calculations on GPU
Performance of integer and bitwise operations on GPU
Basically, on modern GPUs 32-bit float and integer operations are equivalent in performance.
As for deterministic results, it depends on your code. For example, if multiple threads perform atomic operations on the same memory, and other threads then read that memory and perform operations depending on the value they see, the ordering can vary between runs, so the results may not be exactly the same every time.
From personal experience, I needed to generate random numbers but also required consistent results. So basically I had a largish array of seeds, one for each thread, and each one was completely independent. Other random number generators which rely on atomic operations and barriers would not have been.
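A minimal sketch of that per-thread-seed idea (written in CUDA for concreteness rather than the original code; the splitmix64-style mixer is just one reasonable choice): each thread derives its values purely from a base seed, its global index and a step counter, so no atomics or barriers are involved and the output is reproducible from run to run.

```cpp
#include <cstdint>

// Stateless, deterministic per-thread random number: mixes (seed, thread
// index, step) with splitmix64-style constants. Hypothetical helper, not
// a library function.
__device__ uint32_t hash_rng(uint64_t seed, uint32_t idx, uint32_t step)
{
    uint64_t z = seed + 0x9E3779B97F4A7C15ULL * (((uint64_t)idx << 32) | step);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return (uint32_t)(z ^ (z >> 31));
}

// Each thread writes its own value; no shared state, so the result is
// identical every run (and on any hardware) for the same seed.
__global__ void fill_random(uint32_t* out, uint64_t base_seed, uint32_t step)
{
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = hash_rng(base_seed, tid, step);
}
```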
The other half of having deterministic results is having the same result given different hardware. With integer operations you should be fairly safe. With floating point operations in OpenCL, avoiding the fast relaxed math option and the native variants of functions would increase chances of getting the same results on different hardware.

Are bit-wise operations common and useful in real-life programming? [closed]

I often bump into interview questions involving a sorted/unsorted array where you are asked to find some property of the array - for example, the number that appears an odd number of times, or the missing number in an unsorted array of size one million. Often the question poses additional constraints such as O(n) runtime complexity or O(1) space complexity.
Both of these problems can be solved pretty efficiently using bit-wise manipulations. Of course these are not all; there's a whole ton of questions like these.
To me, bit-wise programming seems more like a hack or intuition-based, because it works in binary, not decimal. Being a college student without much real-life programming experience, I'm curious whether questions of this type are actually common in real work, or whether they are just brain twisters interviewers use to select the smartest candidates.
If they are indeed useful, in what kind of scenarios are they actually applicable?
Are bit-wise operations common and useful in real-life programming?
The commonality or applicability depends on the problem in hand.
Some real-life projects do benefit from bit-wise operations.
Some examples:
You're setting individual pixels on the screen by directly manipulating the video memory, in which every pixel's color is represented by 1 or 4 bits. So, in every byte you can have packed 8 or 2 pixels and you need to separate them. Basically, your hardware dictates the use of bit-wise operations.
You're dealing with some kind of file format (e.g. GIF) or network protocol that uses individual bits or groups of bits to represent pieces of information. Your data dictates the use of bit-wise operations.
You need to compute some kind of checksum (possibly parity or CRC) or hash value, and some of the most applicable algorithms do this by manipulating bits.
You're implementing (or using) an arbitrary-precision arithmetic library.
You're implementing FFT and you naturally need to reverse bits in an integer or simulate propagation of carry in the opposite direction when adding. The nature of the algorithm requires some bit-wise operations.
You're short of space and need to use as little memory as possible and you squeeze multiple bit values and groups of bits into entire bytes, words, double words and quad words. You choose to use bit-wise operations to save space.
Branches/jumps on your CPU are costly, and you want to improve performance by implementing your code as a series of instructions without any branches; bit-wise instructions can help. The simplest example here is choosing the minimum (or maximum) of two integers. The most natural implementation uses some kind of if statement, which ultimately involves a comparison and a branch. You choose to use bit-wise operations to improve speed (see the sketch after this list).
Your CPU supports floating point arithmetic, but calculating something like a square root is a slow operation, so you instead approximate it using a few fast and simple integer and floating point operations. Here too you benefit from manipulating the bit representation of the floating point format.
You're emulating a CPU or an entire computer and you need to manipulate individual bits (or groups of bits) when decoding instructions, when accessing parts of CPU or hardware registers, when simply emulating bit-wise instructions like OR, AND, XOR, NOT, etc. Your problem flat out requires bit-wise instructions.
You're explaining bit-wise algorithms or tricks or something that needs bit-wise operations to someone else on the web (e.g. here) or in a book. :)
I've personally done all of the above and more in the past 20 years. YMMV, though.
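As an example of the branch-free item in the list above, here is the classic bit-hack minimum of two signed integers. It's a sketch, not a recommendation: it relies on two's complement, and a modern compiler will often turn a plain std::min into a conditional move anyway, so measure before using it.

```cpp
#include <cstdint>

// Branch-free minimum of two signed 32-bit integers.
// -(a < b) is all-ones when a < b and all-zeros otherwise, so the mask
// selects either (a ^ b) or 0, yielding a or b respectively.
static inline int32_t min_branchless(int32_t a, int32_t b)
{
    return b ^ ((a ^ b) & -(int32_t)(a < b));
}
```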
From my experience, it is very useful when you are aiming for speed and efficiency for large datasets.
I use bit vectors a lot in order to represent very large sets, which makes the storage very efficient and operations such as comparisons and combinations very fast. I have also found that bit matrices are very useful for the same reasons, for example finding intersections of a large number of large binary matrices. Using binary masks to specify subsets is also very useful, for example Matlab and Python's Numpy/Scipy use binary masks (essentially binary matrices) to select subsets of elements from matrices.
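A minimal sketch of that bit-vector representation (assuming C++20 for std::popcount): each element of a fixed universe maps to one bit, so set intersection is a word-wise AND and the size of the result is a popcount per word.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <bit>        // std::popcount (C++20)

using BitSet = std::vector<uint64_t>;   // universe of 64 * s.size() elements

inline void set_bit(BitSet& s, std::size_t i)        { s[i >> 6] |= 1ULL << (i & 63); }
inline bool test_bit(const BitSet& s, std::size_t i) { return (s[i >> 6] >> (i & 63)) & 1ULL; }

// Number of elements common to two equally sized bit vectors:
// one AND plus one popcount handles 64 membership tests at a time.
std::size_t intersection_size(const BitSet& a, const BitSet& b)
{
    std::size_t count = 0;
    for (std::size_t w = 0; w < a.size(); ++w)
        count += std::popcount(a[w] & b[w]);
    return count;
}
```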
Whether to use bitwise operations strictly depends on your main concerns.
I was once asked to solve a problem: for a given i, find all the numbers of the form N*i which have no repeating digit within them. I made use of bitwise operations and generated all the numbers in better time, but to my surprise I was asked to rewrite the code without the bitwise operators, because other people found it unreadable, and the code would have to be maintained by many people afterwards. So, if performance is your concern, go for bitwise.
If readability is your concern, reduce their use.
If you want both at the same time, you need to follow a good style of writing code with bitwise operators in a way that keeps it readable and understandable.
Although you can often "avoid it" in user-level code if you really don't care for it, it can be useful for cases where memory consumption is a big issue. Bit operations are often times needed, or even required when dealing with hardware devices or embedded programming in general.
It's common to have I/O registers with many different configuration options addressable through various flag-style bit combinations, or small embedded devices where memory is extremely constrained relative to the modern PC RAM sizes you may be used to in your normal work.
It's also very handy for some optimizations in hot code, where you want to use a branch-free implementation of something that could be expressed with conditional code, but need quicker run-time performance. For example, finding the nearest power of 2 to a given integer can be implemented quite efficiently on some processors using bit hacks over more common solutions.
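For instance, the round-up-to-the-next-power-of-two form of that calculation is a well-known bit trick (it appears in Hacker's Delight and similar collections); a 32-bit sketch:

```cpp
#include <cstdint>

// Round v up to the next power of two. Exact powers of two are returned
// unchanged; 0 maps to 0 because of the initial decrement's wraparound.
static inline uint32_t next_pow2(uint32_t v)
{
    v--;             // so exact powers of two stay put
    v |= v >> 1;     // smear the highest set bit downwards...
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;    // ...until every lower bit is set
    return v + 1;    // then carry into the next power of two
}
```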
There is a great book called "Hacker's Delight" by Henry S. Warren Jr. that is filled with very useful functions for a wide variety of problems that occur in "real world" code. There are also a number of online documents covering similar material.
A famous document from the MIT AI lab in the 1970s known as HAKMEM is another example.

Is trigonometry computationally expensive?

I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, that's why they use trig-lookup tables right?
EDIT: Hmm, so if the only thing that changes is the angle (accurate to 1 degree), would a lookup table with 360 entries (one for every degree) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) require a series of these low-level calculations to be performed. These are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On the Intel x86 processor, floating point addition or subtraction requires 6 clock cycles, multiplication requires 8 clock cycles, and division 30-44 clock cycles. But cosine requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos() and tan() are mathematical functions which are calculated by summing a series, developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table accurate to 1 degree.
http://www.analyzemath.com/trigonometry/trig_1.gif
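A minimal sketch of that 1-degree table (the 360 whole-degree entries are the assumption from the question's edit; real code would often use finer steps or interpolate between entries):

```cpp
#include <array>
#include <cmath>

// Precompute sin() for 0..359 degrees once at startup.
static const std::array<double, 360> kSinTable = [] {
    std::array<double, 360> t{};
    const double kPi = 3.14159265358979323846;
    for (int d = 0; d < 360; ++d)
        t[d] = std::sin(d * kPi / 180.0);
    return t;
}();

// Whole-degree sine: one modulo and one array read, no series evaluation.
inline double sin_deg(int degrees)
{
    int d = degrees % 360;
    if (d < 0) d += 360;     // handle negative angles
    return kSinTable[d];
}
```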
While the quick answer is that they are more expensive than the primitive math functions (addition/multiplication/subtraction etc...) they are not -expensive- in terms of human time. Typically the reason people optimize them with look-up tables and approximations is because they are calling them potentially tens of thousands of times per second and every microsecond could be valuable.
If you're writing a program and just need to call it a couple times a second the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
Yes, (relative to other mathematical operations multiply, divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loop.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store the results in variables instead of recalculating them every time. This also applies within a method/function call where the angle is not going to change. You can be smart by using some identities (calculating sin(theta) from sin(theta/2), or exploiting how the values repeat: sin(theta + 2*pi*n) = sin(theta)) to reduce computation. See this Wikipedia article.
Yes, it is. Trig functions are computed by summing a series, so in general terms they are a lot more costly than a simple mathematical operation. The same goes for sqrt().

Resources