I often bump into interview questions involving a sorted or unsorted array where you are asked to find some property of that array: for example, finding the number that appears an odd number of times, or finding the missing number in an unsorted array of size one million. Often the question imposes additional constraints such as O(n) runtime complexity or O(1) space complexity.
Both of these problems can be solved pretty efficiently using bit-wise manipulation. Of course these are not the only ones; there is a whole ton of questions like these.
To me, bit-wise programming seems more like a hack or something intuition-based, because it works in binary rather than decimal. Being a college student without much real-life programming experience, I'm curious whether questions of this type actually come up in real work, or whether they are just brain teasers interviewers use to select the smartest candidates.
If they are indeed useful, in what kind of scenarios are they actually applicable?
Are bit-wise operations common and useful in real-life programming?
The commonality or applicability depends on the problem in hand.
Some real-life projects do benefit from bit-wise operations.
Some examples:
You're setting individual pixels on the screen by directly manipulating the video memory, in which every pixel's color is represented by 1 or 4 bits. So every byte holds 8 or 2 packed pixels, and you need to separate them. Basically, your hardware dictates the use of bit-wise operations.
You're dealing with some kind of file format (e.g. GIF) or network protocol that uses individual bits or groups of bits to represent pieces of information. Your data dictates the use of bit-wise operations.
You need to compute some kind of checksum (possibly parity or CRC) or hash value, and some of the most applicable algorithms do this by manipulating bits.
You're implementing (or using) an arbitrary-precision arithmetic library.
You're implementing FFT and you naturally need to reverse bits in an integer or simulate propagation of carry in the opposite direction when adding. The nature of the algorithm requires some bit-wise operations.
You're short of space and need to use as little memory as possible and you squeeze multiple bit values and groups of bits into entire bytes, words, double words and quad words. You choose to use bit-wise operations to save space.
Branches/jumps on your CPU are costly, and you want to improve performance by implementing your code as a series of instructions without any branches; bit-wise instructions can help. The simplest example here is choosing the minimum (or maximum) of two integers. The most natural way of implementing it is with some kind of if statement, which ultimately involves comparison and branching (a minimal sketch of the branch-free version appears at the end of this answer). You choose to use bit-wise operations to improve speed.
Your CPU supports floating-point arithmetic, but calculating something like a square root is a slow operation, so you instead approximate it using a few fast and simple integer and floating-point operations. Here too, you benefit from manipulating the bit representation of the floating-point format.
You're emulating a CPU or an entire computer and you need to manipulate individual bits (or groups of bits) when decoding instructions, when accessing parts of CPU or hardware registers, when simply emulating bit-wise instructions like OR, AND, XOR, NOT, etc. Your problem flat out requires bit-wise instructions.
You're explaining bit-wise algorithms or tricks or something that needs bit-wise operations to someone else on the web (e.g. here) or in a book. :)
I've personally done all of the above and more in the past 20 years. YMMV, though.
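As an illustration of the branch-free minimum mentioned in the list above, here is a minimal sketch (my own example; it assumes two's-complement ints, an arithmetic right shift of negative values, and that x - y does not overflow):

```c
/* Minimal sketch of a branch-free minimum.
   Assumes two's-complement int, arithmetic right shift of negative
   values (implementation-defined in C, but common), and no overflow
   in x - y. */
#include <stdio.h>

static int branchless_min(int x, int y)
{
    /* If x < y, (x - y) >> 31 is all ones (-1); otherwise it is 0.
       The mask therefore selects x or y without a conditional jump. */
    int mask = (x - y) >> (sizeof(int) * 8 - 1);
    return y + ((x - y) & mask);
}

int main(void)
{
    printf("%d\n", branchless_min(3, 7));   /* prints 3  */
    printf("%d\n", branchless_min(10, -4)); /* prints -4 */
    return 0;
}
```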
From my experience, it is very useful when you are aiming for speed and efficiency with large datasets.
I use bit vectors a lot in order to represent very large sets, which makes the storage very efficient and operations such as comparisons and combinations very fast. I have also found that bit matrices are very useful for the same reasons, for example finding intersections of a large number of large binary matrices. Using binary masks to specify subsets is also very useful, for example Matlab and Python's Numpy/Scipy use binary masks (essentially binary matrices) to select subsets of elements from matrices.
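As a small illustration of the bit-vector idea (a minimal sketch, not taken from any particular library): sets over a universe of N elements are stored as arrays of 64-bit words, intersection becomes a word-by-word AND, and membership is a single mask test.

```c
/* Minimal bit-vector set sketch: one bit per element of the universe. */
#include <stdint.h>
#include <stdio.h>

#define N      1000                     /* size of the universe */
#define WORDS  ((N + 63) / 64)

typedef struct { uint64_t w[WORDS]; } bitset_t;

static void set_add(bitset_t *s, int i)       { s->w[i / 64] |= (uint64_t)1 << (i % 64); }
static int  set_has(const bitset_t *s, int i) { return (s->w[i / 64] >> (i % 64)) & 1; }

static void set_intersect(bitset_t *out, const bitset_t *a, const bitset_t *b)
{
    for (int k = 0; k < WORDS; k++)
        out->w[k] = a->w[k] & b->w[k];   /* 64 membership tests per AND */
}

int main(void)
{
    bitset_t a = {{0}}, b = {{0}}, c;
    set_add(&a, 10);  set_add(&a, 500);
    set_add(&b, 500); set_add(&b, 999);
    set_intersect(&c, &a, &b);
    printf("%d %d\n", set_has(&c, 500), set_has(&c, 10)); /* prints "1 0" */
    return 0;
}
```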
Whether to use bitwise operations depends mainly on your concerns.
I was once asked to solve a problem: find all numbers of the form N*i, for a given i, that contain no repeating digit. I made use of bitwise operations and generated all the numbers in good time, but to my surprise I was then asked to rewrite the code without the bitwise operators, because the people who would have to use the code later found it unreadable.
So, if performance is your concern, go for bitwise operations.
If readability is your concern, reduce their use.
If you want both at the same time, you need to write your bitwise code in a style that stays readable and understandable.
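For what it's worth, the core of the trick was a digit mask. A minimal sketch of that part (illustrative only, not the original interview code):

```c
/* Detect repeated decimal digits with a 10-bit "seen" mask. */
#include <stdbool.h>

bool has_unique_digits(unsigned long long n)
{
    unsigned seen = 0;              /* bit d set => digit d already seen */
    do {
        unsigned d = n % 10;
        if (seen & (1u << d))       /* digit repeated */
            return false;
        seen |= 1u << d;
        n /= 10;
    } while (n);
    return true;
}
```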
Although you can often "avoid it" in user-level code if you really don't care for it, it can be useful where memory consumption is a big issue. Bit operations are often needed, or even required, when dealing with hardware devices or in embedded programming in general.
It's common to have I/O registers with many different configuration options addressable through various flag-style bit combinations. The same goes for small embedded devices, where memory is extremely constrained relative to the PC RAM sizes you may be used to in your normal work.
It's also very handy for some optimizations in hot code, where you want a branch-free implementation of something that could be expressed with conditional code but needs quicker run-time performance. For example, finding the nearest power of 2 to a given integer can be implemented quite efficiently on some processors using bit hacks rather than more common solutions.
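For instance, here is a sketch of the usual round-up-to-the-next-power-of-two variant of that trick (my own illustration; 32-bit values, with v = 0 wrapping to 0):

```c
/* Round a 32-bit value up to the next power of two (v = 0 maps to 0). */
#include <stdint.h>

static uint32_t next_pow2(uint32_t v)
{
    v--;              /* so that exact powers of two map to themselves */
    v |= v >> 1;      /* smear the highest set bit downwards ...       */
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;     /* ... until all lower bits are set              */
    return v + 1;     /* one past an all-ones pattern is a power of 2  */
}
```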
There is a great book called "Hacker's Delight" by Henry S. Warren Jr. that is filled with very useful functions for a wide variety of problems that occur in "real world" code. There are also a number of online documents with similar material.
A famous document from the MIT AI lab in the 1970s known as HAKMEM is another example.
In this paper, two cases are considered for comparing algorithms: integers and floating-point numbers.
I understand the differences between these data types in terms of storage, but I am not sure why there is a performance difference between them.
Why is there a difference in performance between the following two cases?
Using merge sort on Integers
Using merge sort on Floating Points
I understand that it comes down to comparison speed in both cases; the question is why these speeds might be different.
The paper states, in section 4, “Conclusion”, “the execution time for merging integers on the CPU is 2.5X faster than the execution time for floating point on the CPU”. This large a difference is surprising on the Intel Nehalem Xeon E5530 used in the measurements. However, the paper does not give information about source code, specific instructions or processor features used in the merge, compiler version, or other tools used. If the processor is used efficiently, there should be only very minor differences in the performance of an integer merge versus a floating-point merge. Thus, it seems likely that the floating-point code used in the test was inefficient and is an indicator of poor tools used rather than any shortcoming of the processor.
Merge sort has an inner loop with quite a few instructions. Comparing floats might be a little more expensive, but only by 1-2 cycles. You will not notice that difference amid the much larger amount of merge code.
Comparing floats is hardware accelerated and fast compared to everything else you are doing in that algorithm.
Also, the comparison likely can overlap other instructions so the difference in wall-clock time might be exactly zero (or not).
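To make that concrete, here is a generic merge inner loop (not taken from the paper, just an illustration): the data comparison is a single operation among the index updates, loads, stores, and bounds checks, and switching the element type from int to float changes nothing else.

```c
/* Generic merge of two sorted runs into out[]; the only comparison on
   the data itself is the one in the ternary expression. */
static void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];  /* the only data compare */
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}
```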
As the title suggests, we would like some advice on the fastest algorithm available for pattern matching with the following constraints:
Long dictionary: 256
Short but not fixed length rules (from 1 to 3 or 4 bytes depth at most)
Small (150) number of rules (if 3 bytes) or moderate (~1K) if 4
Better performance than current AC-DFA used in Snort or than AC-DFA-Split again used by Snort
Software based (recent COTS systems like E3 or E5)
Ideally we would like to employ some SIMD / SSE instructions, since they are currently 128 bits wide and will soon be 256, as opposed to the CPU's 64
We started this project by prefiltering Snort AC with the algorithm shown in the Sigmatch paper, but sadly the results have not been that impressive (~12% improvement when compiling with GCC, but none with ICC)
Afterwards we tried to exploit the new pattern-matching capabilities present in SSE 4.2 through the IPP libraries, but there was no performance gain at all (I guess doing it directly in machine code would be better, but certainly more complex)
So back to the original idea. Right now we are working along the lines of Head-Body Segmentation AC, but we are aware that unless we replace the proposed AC-DFA for the head side it will be very hard to get improved performance, although we would at least be able to support many more rules without a significant performance drop
We are aware that bit-parallelism ideas use a lot of memory for long patterns, but the problem scope has been reduced to patterns at most 3 or 4 bytes long, which makes them a feasible alternative
We have found Nedtries in particular, but we would like to know what you think, or whether there are better alternatives
Ideally the source code would be in C and under an open source license.
Our idea was to search for something that moves 1 byte at a time to cope with the different sizes, but does so very efficiently by taking advantage of as much parallelism as possible (SIMD / SSE) and by being as branch-free as possible
I don't know whether it is better to do this bit-wise or byte-wise
Back to a proper keyboard :D
In essence, most algorithms do not correctly exploit current hardware capabilities and limitations. They are very cache-inefficient and very branchy, not to mention that they don't exploit the capabilities now present in COTS CPUs that allow a certain level of parallelism (SIMD, SSE, ...)
This is precisely what we are looking for: an algorithm (or an implementation of an already existing algorithm) that properly considers all of that, with the advantage of not trying to cover all rule lengths, just short ones
For example, I have seen some papers on NFAs claiming that these days their performance can be on par with DFAs, with much lower memory requirements, thanks to proper cache efficiency, enhanced parallelism, etc
Please take a look at:
http://www.slideshare.net/bouma2
Support of 1 and 2 bytes is similar to what Baxter wrote above. Nevertheless, it would help if you could provide the number of single-byte and double-byte strings you expect to be in the DB, and the kind of traffic you are expecting to process (Internet, corporate etc.) - after all, too many single-byte strings may end up in a match for every byte. The idea of Bouma2 is to allow the incorporation of occurrence statistics into the preprocessing stage, thereby reducing the false-positives rate.
It sounds like you are already using high-performance pattern matching. Unless you have some clever new algorithm, or can point to some statistical bias in the data or your rules, it's going to be hard to speed up the raw algorithms.
You might consider treating pairs of characters as pattern match elements. This will make the branching factor of the state machine huge but you presumably don't care about RAM. This might buy you a factor of two.
When running out of steam algorithmically, people often resort to careful hand coding in assembler, including clever use of the SSE instructions. A trick that might be helpful for handling unique sequences wherever they are found is to do a series of comparisons against the elements and form a boolean result by ANDing/ORing rather than conditional branching, because branches are expensive. The SSE instructions might be helpful here, although their alignment requirements might force you to replicate them 4 or 8 times.
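For example, a plain-C sketch of that AND/OR idea for a fixed 4-byte rule (illustrative only; an SSE version would compare 16 bytes at a time with _mm_cmpeq_epi8 / _mm_movemask_epi8):

```c
/* Branch-free match of a 4-byte rule at position p. */
#include <stdint.h>

static int matches4(const uint8_t *p, const uint8_t *rule)
{
    /* XOR is zero only where bytes are equal; OR-ing the four results
       gives zero iff all four bytes match -- no conditional jumps. */
    uint8_t d = (p[0] ^ rule[0]) | (p[1] ^ rule[1])
              | (p[2] ^ rule[2]) | (p[3] ^ rule[3]);
    return d == 0;
}
```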
If the strings you are searching are long, you might distribute subsets of rules to separate CPUs (threads). Partitioning the rules might be tricky.
I'm thinking about different ways to implement arbitrary-precision arithmetic (sometimes called Bignum, Integer or BigInt).
It seems like the common idiom is to use an array for the storage of the actual value and reallocate it as needed if space requirements grow or shrink.
More precisely, it seems that the bit size of the array elements is often the second-largest size commonly supported (probably to make calculations with overflow easier to implement?): e.g., if the language/platform supports 128-bit numbers, use an array of 64-bit numbers plus a 128-bit variable to handle overflow.
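To make that concrete, here is a minimal sketch of what I mean (my own illustration, not from any particular library), using 32-bit limbs with a 64-bit intermediate to catch the carry; the 64/128-bit variant has exactly the same shape:

```c
/* Add two equally-sized bignums stored as little-endian arrays of
   32-bit limbs; a 64-bit intermediate holds the limb sum plus carry.
   Returns the final carry out of the most significant limb. */
#include <stdint.h>
#include <stddef.h>

static uint32_t bignum_add(uint32_t *r, const uint32_t *a,
                           const uint32_t *b, size_t limbs)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < limbs; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i]  = (uint32_t)s;        /* low 32 bits become the result limb */
        carry = s >> 32;            /* high bits propagate to the next limb */
    }
    return (uint32_t)carry;
}
```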
Are there fundamentally different ways to implement arbitrary-precision arithmetic or is the above the “tried and true” way to implement it without huge performance losses?
My question is about the underlying data structure, not the algorithms for operations. I know Karatsuba, Toom-Cook et alii.
It is possible to use the Chinese Remainder Theorem to represent large integers in a fundamentally different way from the usual base-2^n system.
I believe a CRT-based representation will still use an array of elements which, like the conventional representation, are based on the most convenient native arithmetic available. However, these elements hold the remainders of the number when divided by a sequence of primes, not base-2^n digits.
As with the conventional representation, the number of elements used determines the maximum size of the representable number. Unfortunately, it is not easy to compute whether one CRT-based number is greater than another, so it is hard to tell if your representation has overflowed the maximum size. Note that addition and multiplication are very fast in CRT representation, which could be an advantage if you can deal with the overflow issue.
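To illustrate (a toy sketch with tiny made-up moduli, not a real library): a value is stored as its residues modulo a few pairwise-coprime moduli, and addition and multiplication act on each residue independently; the value stays recoverable as long as it remains below the product of the moduli (3*5*7*11 = 1155 here).

```c
/* Toy CRT (residue number system) representation. */
#include <stdio.h>

#define K 4
static const int M[K] = {3, 5, 7, 11};       /* pairwise-coprime moduli */

typedef struct { int r[K]; } crt_t;

static crt_t crt_from(int x)           { crt_t c; for (int i = 0; i < K; i++) c.r[i] = x % M[i]; return c; }
static crt_t crt_add(crt_t a, crt_t b) { crt_t c; for (int i = 0; i < K; i++) c.r[i] = (a.r[i] + b.r[i]) % M[i]; return c; }
static crt_t crt_mul(crt_t a, crt_t b) { crt_t c; for (int i = 0; i < K; i++) c.r[i] = (a.r[i] * b.r[i]) % M[i]; return c; }

int main(void)
{
    crt_t x = crt_from(37), y = crt_from(25);
    crt_t s = crt_add(x, y), p = crt_mul(x, y);
    /* 62 (= 37 + 25) and 925 (= 37 * 25) have exactly these residues. */
    printf("%d %d %d %d\n", s.r[0], s.r[1], s.r[2], s.r[3]);
    printf("%d %d %d %d\n", p.r[0], p.r[1], p.r[2], p.r[3]);
    return 0;
}
```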
However, to answer your question: I believe it is accurate to say that the base-2^n system is indeed the "tried and true" representation, which is used by most popular bignum libraries. I think I recall that there are extant CRT-based bignum libraries, although I have not checked lately to see if they are still around....
I understand that using floating point rather than exact rational arithmetic makes algorithms faster and reduces storage requirements, and that these would have been critical features for software running on the hardware of previous decades, but is this still important? If the calculations were done with exact rational arithmetic, there would be no rounding errors at all, which would simplify many algorithms because you would no longer have to worry about catastrophic cancellation or anything like that.
Floating point is much faster than arbitrary-precision and symbolic packages, and 12-16 significant figures is usually plenty for demanding science/engineering applications where non-integral computations are relevant.
The programming language ABC used rational numbers (x / y where x and y were integers) wherever possible.
Sometimes calculations would become very slow because the numerator and denominator had become very big.
So it turns out that it's a bad idea if you don't put some kind of limit on the numerator and denominator.
In the vast majority of computations, the size of the numbers required to compute answers exactly would quickly grow beyond the point where the computation would be worth the effort, and in many calculations it would grow beyond the point where exact calculation would even be possible. Consider that even running something like a simple third-order IIR filter for a dozen iterations would require a fraction with thousands of bits in the denominator; running the algorithm for a few thousand iterations (hardly an unusual operation) could require more bits in the denominator than there are atoms in the universe.
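As a simpler illustration (a first-order filter rather than the third-order one above, and entirely my own example): running y = (9/10)*y + 1 with exact rationals gives a reduced denominator of 10^(k-1) after k steps, so even 64-bit integers are exhausted before twenty iterations.

```c
/* Exact rational iteration of y <- (9/10)*y + 1 with 64-bit
   numerator/denominator; demonstrates how fast the denominator grows. */
#include <stdio.h>
#include <stdint.h>

static uint64_t gcd(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

int main(void)
{
    uint64_t num = 0, den = 1;                     /* y = 0/1 */
    for (int k = 1; k <= 25; k++) {
        /* num/den <- (9*num + 10*den) / (10*den); check for overflow first */
        if (den > UINT64_MAX / 10 || num > (UINT64_MAX - 10 * den) / 9) {
            printf("64-bit overflow at step %d -- exact arithmetic now needs a bignum\n", k);
            break;
        }
        num = 9 * num + 10 * den;
        den = 10 * den;
        uint64_t g = gcd(num, den);
        num /= g;
        den /= g;
        printf("step %2d: reduced denominator = %llu\n", k, (unsigned long long)den);
    }
    return 0;
}
```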
Many numerical algorithms still require fixed-precision numbers in order to perform well enough. Such calculations can be implemented in hardware because the numbers fit entirely in registers, whereas arbitrary precision calculations must be implemented in software, and there is a massive performance difference between the two. Ask anybody who crunches numbers for a living whether they'd be ok with things running X amount slower, and they probably will say "no that's completely unworkable."
Also, I think you'll find that having arbitrary precision is impractical and even impossible. For example, the number of decimal places can grow fast enough that you'll want to drop some. And then you're back to square one: rounded number problems!
Finally, sometimes digits beyond a certain precision do not matter anyway. For example, the number of significant digits should generally reflect the level of experimental uncertainty.
So, which algorithms do you have in mind?
Traditionally integer arithmetic is easier and cheaper to implement in hardware (uses less space on the die so you can fit more units on there). Especially when you go into the DSP segment this can make a lot of difference.
I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.
My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to treat it as a sparse set of zeroes).
What structures might be good at each extreme?
Are there any in the middle?
Here are a few constraints or hints:
The bits are set only once, and in index order.
I need 100% accuracy, so something like a Bloom filter isn't good enough.
After the set is built, I need to be able to efficiently iterate over the "set" bits.
The bits are randomly distributed, so run-length–encoding algorithms aren't likely to be much better than a simple list of bit indexes.
I'm trying to optimize memory utilization, but speed still carries some weight.
Something with an open source Java implementation is helpful, but not strictly necessary. I'm more interested in the fundamentals.
Unless the data is truly random and has a symmetric 1/0 distribution, this simply becomes a lossless data compression problem and is very analogous to CCITT Group 3 compression used for black-and-white (i.e., binary) FAX images. CCITT Group 3 uses a Huffman coding scheme. In the case of FAX, a fixed set of Huffman codes is used, but for a given data set you can generate a specific set of codes for each data set to improve the compression ratio achieved. As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably generate a binary search tree index to various offset points in the array that would allow you to get close to the desired location and then walk in from there.
Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.
Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress it any significant amount using any scheme.
I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, but this is especially so when the alphabet size is so small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding will do, more effectively.
Maybe it's too late for you, but there is a very fast and memory-efficient library for sparse bit arrays (lossless) and other data types, based on tries. Look at Judy arrays.
Thanks for the answers. This is what I'm going to try for dynamically choosing the right method:
I'll collect all of the first N hits in a conventional bit array, and choose one of three methods, based on the symmetry of this sample.
If the sample is highly asymmetric, I'll simply store the indexes of the set bits (or maybe the distance to the next set bit) in a list.
If the sample is highly symmetric, I'll keep using a conventional bit array.
If the sample is moderately symmetric, I'll use a lossless compression method like the Huffman coding suggested by InSciTekJeff.
The boundaries between the asymmetric, moderate, and symmetric regions will depend on the time required by the various algorithms balanced against the space they need, where the relative value of time versus space would be an adjustable parameter. The space needed for Huffman coding is a function of the symmetry, and I'll profile that with testing. Also, I'll test all three methods to determine the time requirements of my implementation.
It's possible (and actually I'm hoping) that the middle compression method will always be better than the list or the bit array or both. Maybe I can encourage this by choosing a set of Huffman codes adapted for higher or lower symmetry. Then I can simplify the system and just use two methods.
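For reference, a rough sketch of that sampling decision (in C for consistency with the rest of this thread; the same idea maps directly onto java.util.BitSet, and the thresholds are placeholders I still need to tune by profiling):

```c
/* Sample the first SAMPLE bits and pick a representation based on how
   far the 1-density is from 1/2. Thresholds are placeholders. */
#include <stdint.h>
#include <stddef.h>

enum repr { REPR_INDEX_LIST, REPR_PLAIN_BITS, REPR_COMPRESSED };

static enum repr choose_repr(const uint64_t *words, size_t sample_bits)
{
    size_t ones = 0;
    for (size_t i = 0; i < sample_bits / 64; i++)
        ones += (size_t)__builtin_popcountll(words[i]);   /* GCC/Clang intrinsic */

    double density   = (double)ones / (double)sample_bits;
    double asymmetry = density < 0.5 ? 0.5 - density : density - 0.5;

    if (asymmetry > 0.45) return REPR_INDEX_LIST;   /* very sparse (or very dense) */
    if (asymmetry < 0.05) return REPR_PLAIN_BITS;   /* near 50/50: incompressible  */
    return REPR_COMPRESSED;                          /* in between: entropy-code it */
}
```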
One more compression thought:
If the bit array is not crazy long, you could try applying the Burrows-Wheeler transform before using any repetition encoding, such as Huffman. A naive implementation would take O(n^2) memory during (de)compression and O(n^2 log n) time to decompress - there are almost certainly shortcuts to be had, as well. But if there's any sequential structure to your data at all, this should really help the Huffman encoding out.
You could also apply that idea to one block at a time to keep the time/memory usage more practical. Working one block at a time could allow you to always keep most of the data structure compressed if you're reading/writing sequentially.
Straightforward lossless compression is the way to go. To make it searchable you will have to compress relatively small blocks and create an index into an array of the blocks. This index can contain the bit offset of the starting bit in each block.
Quick combinatoric proof that you can't really save much space:
Suppose you have an arbitrary subset of n/2 bits set to 1 out of n total bits. You have (n choose n/2) possibilities. Using Stirling's formula, this is roughly 2^n / sqrt(n) * sqrt(2/pi). If every possibility is equally likely, then there's no way to give more likely choices shorter representations. So we need log_2 (n choose n/2) bits, which is about n - (1/2)log(n) bits.
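Spelling out that last step (my own expansion of the estimate above):

log_2 (n choose n/2) ≈ log_2( 2^n / sqrt(n) * sqrt(2/pi) )
                     = n - (1/2) log_2(n) + (1/2) log_2(2/pi)
                     ≈ n - (1/2) log_2(n) - 0.33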
That's not a very good savings of memory. For example, if you're working with n=2^20 (1 meg), then you can only save about 10 bits. It's just not worth it.
Having said all that, it also seems very unlikely that any really useful data is truly random. In case there's any more structure to your data, there's probably a more optimistic answer.