As the title suggests, we would like some advice on the fastest algorithm available for pattern matching with the following constraints:
Long dictionary: 256
Short but not fixed-length rules (1 to 3 or 4 bytes deep at most)
A small number of rules (~150) if 3 bytes, or a moderate number (~1K) if 4
Better performance than the AC-DFA currently used in Snort, or than the AC-DFA-Split also used by Snort
Software based (recent COTS systems like E3 or E5)
Ideally we would like to employ some SIMD / SSE, since those registers are currently 128 bits wide and will soon be 256, as opposed to the CPU's 64
We started this project by prefiltering Snort's AC with the algorithm shown in the SigMatch paper, but sadly the results have not been that impressive (~12% improvement when compiling with GCC, but none with ICC)
Afterwards we tried to exploit the new pattern-matching capabilities present in SSE 4.2 through the IPP libraries, but saw no performance gain at all (we suspect doing it directly in machine code would be better, but certainly more complex)
So, back to the original idea. Right now we are working along the lines of Head-Body Segmentation AC, but we are aware that unless we replace the proposed AC-DFA on the head side it will be very hard to get improved performance; at least we would be able to support many more rules without a significant performance drop
We are aware that bit-parallelism ideas use a lot of memory for long patterns, but the problem scope has been reduced precisely to 3 or 4 bytes at most, which makes them a feasible alternative
We have found Nedtries in particular, but would like to know what you think, or whether there are better alternatives
Ideally the source code would be in C and under an open source license.
IMHO, our idea was to search for something that moves 1 byte at a time to cope with the different sizes, but does so very efficiently by taking advantage of as much parallelism as possible through SIMD / SSE, while also being as branch-free as possible
I don't know whether it is better to do this bit-wise or byte-wise
Back to a proper keyboard :D
In essence, most algorithms do not correctly exploit current hardware capabilities or limitations. They are very cache-inefficient and very branchy, not to mention that they don't exploit the capabilities now present in COTS CPUs that allow a certain level of parallelism (SIMD, SSE, ...)
This is precisely what we are seeking: an algorithm (or an implementation of an already existing algorithm) that properly considers all of that, with the advantage of not trying to cover all rule lengths, just the short ones
For example, I have seen some papers on NFAs claiming that these days their performance can be on par with DFAs, with much lower memory requirements, thanks to proper cache efficiency, enhanced parallelism, etc.
Please take a look at:
http://www.slideshare.net/bouma2
Support of 1 and 2 bytes is similar to what Baxter wrote above. Nevertheless, it would help if you could provide the number of single-byte and double-byte strings you expect to be in the DB, and the kind of traffic you are expecting to process (Internet, corporate etc.) - after all, too many single-byte strings may end up in a match for every byte. The idea of Bouma2 is to allow the incorporation of occurrence statistics into the preprocessing stage, thereby reducing the false-positives rate.
It sounds like you are already using high-performance pattern matching. Unless you have some clever new algorithm, or can point to some statistical bias in the data or your rules, it's going to be hard to speed up the raw algorithms.
You might consider treating pairs of characters as pattern match elements. This will make the branching factor of the state machine huge but you presumably don't care about RAM. This might buy you a factor of two.
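As a rough, untested sketch of what that looks like in code (the names and layout here are made up, not from Snort or any paper): each state's transition row is indexed by a 16-bit byte pair, so memory grows quickly, and the hard part, which this sketch deliberately skips, is constructing the transitions so that matches starting at odd offsets are still caught.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical pair-alphabet DFA: the "symbol" consumed per step is a byte pair.
struct PairDFA {
    std::vector<uint32_t> trans;      // trans[state * 65536 + pair] -> next state
    std::vector<bool>     accepting;  // accepting[state] -> some rule matched
};

// Scan two bytes per step; odd-length tails and odd-offset matches are the
// construction-time problem this sketch does not solve.
inline bool scan(const PairDFA& dfa, const uint8_t* buf, size_t len) {
    uint32_t state = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint32_t pair = (uint32_t(buf[i]) << 8) | buf[i + 1];
        state = dfa.trans[size_t(state) * 65536 + pair];
        if (dfa.accepting[state])
            return true;              // match positions omitted in this sketch
    }
    return false;
}
```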
When running out of steam algorithmically, people often resort to careful hand coding in assembler, including clever use of the SSE instructions. A trick that might be helpful for handling unique sequences wherever they are found is to do a series of comparisons against the elements and form a boolean result by ANDing/ORing rather than by conditional branching, because branches are expensive. The SSE instructions might be helpful here, although their alignment requirements might force you to replicate them 4 or 8 times.
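To make the AND/OR idea concrete, here is a small sketch using SSE2 intrinsics (the function names `match_mask` and `any_match` and the choice of three single-byte rules are purely illustrative). All per-byte comparisons are ORed into one vector and reduced to a single integer mask, so there is only one branch per 16 input bytes:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

// Test a 16-byte window against three single-byte rules without per-byte branches.
inline uint32_t match_mask(const uint8_t* p, uint8_t r0, uint8_t r1, uint8_t r2) {
    __m128i win = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    __m128i hit = _mm_cmpeq_epi8(win, _mm_set1_epi8(char(r0)));
    hit = _mm_or_si128(hit, _mm_cmpeq_epi8(win, _mm_set1_epi8(char(r1))));
    hit = _mm_or_si128(hit, _mm_cmpeq_epi8(win, _mm_set1_epi8(char(r2))));
    return uint32_t(_mm_movemask_epi8(hit));   // bit i set -> p[i] matched a rule
}

// Scan a buffer 16 bytes at a time; only the combined mask is branched on.
inline bool any_match(const uint8_t* buf, size_t len,
                      uint8_t r0, uint8_t r1, uint8_t r2) {
    size_t i = 0;
    for (; i + 16 <= len; i += 16)
        if (match_mask(buf + i, r0, r1, r2) != 0)
            return true;
    for (; i < len; ++i)                        // scalar tail
        if (buf[i] == r0 || buf[i] == r1 || buf[i] == r2)
            return true;
    return false;
}
```

Using the unaligned load (`_mm_loadu_si128`) sidesteps the alignment issue at a small cost; multi-byte rules would need shifted comparisons or a follow-up verification step on the masked positions.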
If the strings you are searching are long, you might distribute subsets of rules to separate CPUs (threads). Partitioning the rules might be tricky.
This is kind of a long shot, but I am hoping that someone has been in a similar situation, as I am looking for advice on how to efficiently bring a set of large word2vec models into a production environment.
We have a range of trained w2v models with a dimensionality of 300. Due to the underlying data - a huge corpus of POS-tagged words; specialized vocabularies with up to 1 million words - these models became quite large, and we are currently looking into effective ways to expose them to our users without paying too high a price in infrastructure.
Besides trying to better control the vocabulary size, obviously, dimensionality reduction on the feature vectors would be an option. Is anyone aware of publications around that, particularly on how this would affect model quality, and how to best measure this?
Another option is to pre-calculate the top X most similar words for each vocabulary word and provide a lookup table. With the model being that big, this is currently also very inefficient. Are there any known heuristics that could be used to reduce the number of necessary distance calculations from n × (n−1) to a lower number?
Thank you very much!
There are pre-indexing techniques for similarity-search in high-dimensional spaces which can speed nearest-neighbor discovery, but usually at a cost of absolute accuracy. (They also need more memory for the index.)
An example is the ANNOY library. The gensim project includes a demo notebook showing its use with Word2Vec.
I once did some experiments using just 16-bit (rather than 32-bit) floats in a Word2Vec model. It saved memory in the idle state, and nearest-neighbor top-N results were nearly unchanged. But, perhaps because some behind-the-scenes up-conversion to 32-bit floats was still occurring during the one-against-all distance-calculations, speed of operations was actually reduced. (And this suggests that each distance-calculation may have caused a temporary memory expansion offsetting any idle-state savings.) So it's not a quick fix, but further research here – perhaps involving finding/implementing the right routines for float16 array operations – could maybe mean 50% model-size savings and equivalent or even better speed.
For many applications, discarding the least-frequent words doesn't hurt much – or even, when done before training, can improve the quality of the remaining vectors. As many implementations, including gensim, sort the word-vector array in most-to-least-frequent order, you can discard the tail-end of the array to save memory, or limit most_similar() searches to the first-N entries to speed calculations.
Once you've minimized the vocabulary size, you want to be sure the full set is in RAM, and no swapping is triggered during the (typical) full-sweep distance-calculations. If you need multiple processes to serve answers from the same vector set, as in a web service on a multicore machine, gensim's memory-mapping operations can prevent each process from loading its own redundant copy of the vectors. You can see a discussion of this technique in this answer about speeding gensim Word2Vec loading time.
Finally, while precomputing top-N neighbors for a larger vocabulary is both time-consuming and memory-intensive, if your pattern of access is such that some tokens are checked far more than others, a cache of the N most-recently or M most-frequently requested top-N could improve perceived performance a lot – making only less-frequently-requested neighbor-lists require the full distance calculations to every other token.
I often bump into interview questions involving a sorted/unsorted array where you are asked to find some property of that array. For example, finding the number that appears an odd number of times in an array, or finding the missing number in an unsorted array of size one million. Often the question imposes additional constraints, such as O(n) runtime complexity or O(1) space complexity.
Both of these problems can be solved pretty efficiently using bit-wise manipulations. Of course these are not the only ones; there's a whole ton of questions like these.
To me bit-wise programming seems to be more like a hack, or intuition-based, because it works in binary rather than decimal. Being a college student without much real-life programming experience, I'm curious whether questions of this type are actually common in real work, or whether they are just brain twisters interviewers use to select the smartest candidates.
If they are indeed useful, in what kind of scenarios are they actually applicable?
Are bit-wise operations common and useful in real-life programming?
The commonality or applicability depends on the problem in hand.
Some real-life projects do benefit from bit-wise operations.
Some examples:
You're setting individual pixels on the screen by directly manipulating the video memory, in which every pixel's color is represented by 1 or 4 bits. So every byte has 8 or 2 pixels packed into it, and you need to separate them. Basically, your hardware dictates the use of bit-wise operations.
You're dealing with some kind of file format (e.g. GIF) or network protocol that uses individual bits or groups of bits to represent pieces of information. Your data dictates the use of bit-wise operations.
You need to compute some kind of checksum (possibly, parity or CRC) or hash value and some of the most applicable algorithms do this by manipulating with bits.
You're implementing (or using) an arbitrary-precision arithmetic library.
You're implementing FFT and you naturally need to reverse bits in an integer or simulate propagation of carry in the opposite direction when adding. The nature of the algorithm requires some bit-wise operations.
You're short of space and need to use as little memory as possible and you squeeze multiple bit values and groups of bits into entire bytes, words, double words and quad words. You choose to use bit-wise operations to save space.
Branches/jumps on your CPU are costly and you want to improve performance by implementing your code as a series of instructions without any branches; bit-wise instructions can help. The simplest example here would be choosing the minimum (or maximum) of two integers. The most natural way of implementing it is with some kind of if statement, which ultimately involves a comparison and a branch. You choose to use bit-wise operations to improve speed. (A branch-free min is sketched after this list.)
Your CPU supports floating-point arithmetic, but calculating something like a square root is a slow operation, so you simulate it using a few fast and simple integer and floating-point operations. Same here: you benefit from manipulating the bit representation of the floating-point format.
You're emulating a CPU or an entire computer and you need to manipulate individual bits (or groups of bits) when decoding instructions, when accessing parts of CPU or hardware registers, when simply emulating bit-wise instructions like OR, AND, XOR, NOT, etc. Your problem flat out requires bit-wise instructions.
You're explaining bit-wise algorithms or tricks or something that needs bit-wise operations to someone else on the web (e.g. here) or in a book. :)
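For the branches/jumps point above, here is one well-known branch-free formulation of min (nothing project-specific about it; note that a modern compiler will often turn the plain `a < b ? a : b` into a conditional move anyway):

```cpp
#include <cstdint>
#include <cstdio>

// (a < b) is 0 or 1, so -(a < b) is either all-zero or all-one bits;
// the mask then selects between a and b without a conditional jump.
inline int32_t branchless_min(int32_t a, int32_t b) {
    int32_t mask = -static_cast<int32_t>(a < b);   // 0x00000000 or 0xFFFFFFFF
    return b ^ ((a ^ b) & mask);                   // == a if a < b, else b
}

int main() {
    std::printf("%d %d %d\n",
                branchless_min(3, 7),      // 3
                branchless_min(7, 3),      // 3
                branchless_min(-5, 2));    // -5
    return 0;
}
```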
I've personally done all of the above and more in the past 20 years. YMMV, though.
From my experience, it is very useful when you are aiming for speed and efficiency for large datasets.
I use bit vectors a lot in order to represent very large sets, which makes the storage very efficient and operations such as comparisons and combinations very fast. I have also found that bit matrices are very useful for the same reasons, for example finding intersections of a large number of large binary matrices. Using binary masks to specify subsets is also very useful, for example Matlab and Python's Numpy/Scipy use binary masks (essentially binary matrices) to select subsets of elements from matrices.
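As a minimal illustration of that (the `BitSet` type here is a made-up sketch, not a library recommendation): one bit per element means every AND processes 64 elements at once, which is where the speed of comparisons and combinations comes from.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// A set over {0, ..., n-1} stored as one bit per element.
struct BitSet {
    std::vector<uint64_t> words;
    explicit BitSet(size_t n) : words((n + 63) / 64, 0) {}
    void insert(size_t i)         { words[i / 64] |= uint64_t(1) << (i % 64); }
    bool contains(size_t i) const { return (words[i / 64] >> (i % 64)) & 1; }
};

// Intersection handles 64 elements per AND.
BitSet intersect(const BitSet& a, const BitSet& b) {
    BitSet out(a.words.size() * 64);
    for (size_t w = 0; w < a.words.size() && w < b.words.size(); ++w)
        out.words[w] = a.words[w] & b.words[w];
    return out;
}
```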
Whether to use bitwise operations depends entirely on your main concerns.

I was once asked to solve a problem: find all the numbers of the form N*i, for a given i, that don't have a repeating digit within them. I made use of bitwise operations and generated all the numbers correctly in better time, but to my surprise I was asked to rewrite the code without the bitwise operators, because the people who would have to use the code later found it unreadable. (One way the digit check can be done with a bitmask is sketched below.)

So, if performance is your concern, go for bitwise. If readability is your concern, reduce their use. If you want both at the same time, you need to follow a good style of writing code with bitwise operators so that it stays readable and understandable.
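The answer above doesn't give its actual code; the following is just one illustrative way to do the "no repeated digit" test with a bitmask, where bit d of `seen` records whether decimal digit d has already appeared:

```cpp
#include <cstdint>

// Returns true if no decimal digit of n appears more than once.
bool has_distinct_digits(uint64_t n) {
    uint32_t seen = 0;
    do {
        uint32_t d = uint32_t(n % 10);
        if (seen & (1u << d))   // digit already used -> repeat found
            return false;
        seen |= 1u << d;
        n /= 10;
    } while (n != 0);
    return true;
}
```

One would then enumerate i, 2i, 3i, ... and keep the multiples for which the check returns true.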
Although you can often "avoid it" in user-level code if you really don't care for it, it can be useful for cases where memory consumption is a big issue. Bit operations are often needed, or even required, when dealing with hardware devices or embedded programming in general.
It's common to have I/O registers with many different configuration options addressable through various flag-style bit combinations. Or for small embedded devices where memory is extremely constrained relative to modern PC RAM sizes you may be used to in your normal work.
It's also very handy for some optimizations in hot code, where you want to use a branch-free implementation of something that could be expressed with conditional code, but need quicker run-time performance. For example, finding the nearest power of 2 to a given integer can be implemented quite efficiently on some processors using bit hacks over more common solutions.
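As an example of such a bit hack (this is the widely known bit-smearing trick, e.g. from Hacker's Delight and the online "bit twiddling hacks" collections, rounding up to the next power of two for a 32-bit value):

```cpp
#include <cstdint>

// Round a 32-bit value up to the next power of two: propagate the highest
// set bit into all lower positions, then add 1.
uint32_t next_pow2(uint32_t v) {
    v -= 1;          // so exact powers of two map to themselves
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;    // note: returns 0 for v == 0 and on overflow
}
```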
There is a great book called "Hacker's Delight" by Henry S. Warren Jr. that is filled with very useful functions for a wide variety of problems that occur in "real world" code. There are also a number of online documents with similar things.
A famous document from the MIT AI lab in the 1970s known as HAKMEM is another example.
Are there any papers on state-of-the-art UTF-8 validators/decoders? I've seen implementations "in the wild" that use clever loops processing up to 8 bytes per iteration in common cases (e.g. all 7-bit ASCII input).
I don't know about papers; it's probably a bit too specific and narrow a subject for strictly scientific analysis, and more of an engineering problem. You can start by looking at how this is handled in different libraries. Some solutions will use language-specific tricks while others are very general. For Java, you can start with the code of UTF8ByteBufferReader, a part of Javolution. I have found this to be much faster than the character set converters built into the language. I believe (but I'm not sure) that the latter use a common piece of code for many encodings and encoding-specific data files. Javolution, in contrast, has code designed specifically for UTF-8.
There are also some techniques used for specific tasks. For example, if you only need to calculate how many bytes a UTF-8 character takes as you parse the text, you can use a table of 256 values indexed by the first byte of the UTF-8 encoded character. This way of skipping over characters or calculating a string's length in characters is much faster than using bit operations and conditionals.
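The answer's context is Java, but the table idea is language-agnostic. A rough sketch in C++ (names made up, initialization kept deliberately simple and not thread-safe):

```cpp
#include <cstdint>
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence implied by its first byte; 0 for bytes that
// cannot start a sequence (continuation bytes 0x80-0xBF, overlong leads
// 0xC0/0xC1, and 0xF5-0xFF).
inline uint8_t utf8_len_from_lead(uint8_t b) {
    if (b < 0x80) return 1;   // 0xxxxxxx  ASCII
    if (b < 0xC2) return 0;   // continuation or overlong lead byte
    if (b < 0xE0) return 2;   // 110xxxxx
    if (b < 0xF0) return 3;   // 1110xxxx
    if (b < 0xF5) return 4;   // 11110xxx (up to U+10FFFF)
    return 0;                 // invalid
}

// Count characters by hopping from lead byte to lead byte via the 256-entry
// table; malformed bytes are stepped over one at a time.
inline size_t utf8_length(const std::string& s) {
    static uint8_t table[256];
    static bool initialized = false;
    if (!initialized) {
        for (int i = 0; i < 256; ++i) table[i] = utf8_len_from_lead(uint8_t(i));
        initialized = true;
    }
    size_t chars = 0;
    for (size_t i = 0; i < s.size(); ) {
        uint8_t len = table[uint8_t(s[i])];
        i += (len ? len : 1);
        ++chars;
    }
    return chars;
}
```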
For some situations, e.g. if you can waste some memory and you know that most characters you encounter will be from the Basic Multilingual Plane, you could try even more aggressive lookup tables: first calculate the length in bytes by the method described above and, if it's 1 or 2 bytes (maybe 3 makes sense too), look up the decoded char in a table. Remember, however, to benchmark this and any other algorithm you try, as it need not be faster at all (bit operations are quite fast, and with a big lookup table you lose locality of reference, plus the offset calculation isn't completely free, either).
Anyway, I suggest you start by looking at the Javolution code or another similar library.
When I want an array of flags it has typically pained me to use an entire byte (or word) to store each one, as would be the result if I made an array of bools or some other numeric type that could be set to 0 or 1. But now I wonder whether using a structure that is more space-efficient is worth it given the (albeit hopefully very slight) additional overhead of shifting and bit testing.
In my company we use Rogue Wave tools (though hopefully not for much longer) and it's their RWBitVec that I've used for this purpose up until now.
It's mostly about saving memory. If your array of bools is large enough that an 8x improvement in storage space is meaningful, then by all means, use a bitarray.
Note that the memory access is pretty expensive compared to the shift/and, so the bitarray approach is slightly faster than the array-of-chars. Basically it comes down to memory versus programmer time. Remember that premature optimization is a waste of time. I'd use whichever approach is the easiest to develop, and then refactor only after it shows that it's a primary performance bottleneck.
Don't use vector<bool>, it's not really a Container:
http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=98
Use std::bitset (for fixed size bitsets) and boost::dynamic_bitset (for resizeable ones) where appropriate. They aren't Containers either, but they don't look as if they ought to be, so are less likely to cause confusion.
Whether the trade-off is worth it depends, obviously, on how big the arrays are in your program. I think you're right that the overhead of bit access is usually negligible, but if the memory overhead is negligible too then you've nothing to go on there either.
bitsets have the advantage that they do exactly what they say on the tin - none of this "declare an array of chars/ints, but the only legal values are 0 and 1" nonsense. Your code will read about the same as if you'd used an array.
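For illustration (standard library only, nothing project-specific), a std::bitset really does read about like an array of flags, with whole-set operations thrown in:

```cpp
#include <bitset>
#include <iostream>

int main() {
    std::bitset<1024> flags;        // fixed size, one bit per flag

    flags[42] = true;               // reads like an array of bools
    flags.set(100);
    flags.reset(42);

    std::cout << flags.count() << " flags set, flag 100 is "
              << (flags.test(100) ? "on" : "off") << '\n';

    // Set operations work a word at a time.
    std::bitset<1024> other;
    other.set(100);
    std::cout << (flags & other).count() << " flags in common\n";
    return 0;
}
```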
I wrote some code once to unpack a bitmap image line into separate bytes per pixel, then pack it back again after processing. For the code I was benchmarking, it was actually faster to do it that way than to work at the bit level.
I've used a bit array for indexing a HUGE tree. The algorithm was:
check the bit array for whether the entry exists
if the entry doesn't exist
    return null
else
    binary search the tree
    return the value
The advantage is that the tree was huge, so searching for a non-existent entry would cause several cache misses before completing. Thus the algorithm's running time depended on whether the value existed at all.
However, adding that initial bit array check meant fewer cache misses, and it avoided searching the tree at all if the answer wasn't there. By adding this extra step the algorithm became much more robust (actual running time on a real computer became nearly linear, even though Big-O analysis would say otherwise), and overall performance increased by an order of magnitude.
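A rough sketch of that prefilter idea (the `PrefilteredIndex` type and the use of std::set as a stand-in for the real tree are inventions for illustration; for a sparse key space you would hash the key into the bit array, Bloom-filter style, which keeps the "definitely absent" answer cheap):

```cpp
#include <cstdint>
#include <cstddef>
#include <set>
#include <vector>

// A bit array answers "definitely absent" in one memory access, so the
// tree is only walked for likely hits.
struct PrefilteredIndex {
    std::vector<uint64_t> bits;     // one bit per possible key (dense key space)
    std::set<uint32_t>    tree;     // stands in for the real, much larger tree

    explicit PrefilteredIndex(size_t key_space)
        : bits((key_space + 63) / 64, 0) {}

    void insert(uint32_t key) {
        bits[key / 64] |= uint64_t(1) << (key % 64);
        tree.insert(key);
    }

    bool contains(uint32_t key) const {
        // One cache line of the bit array vs. several misses down the tree.
        if (!((bits[key / 64] >> (key % 64)) & 1))
            return false;           // definitely not present
        return tree.count(key) != 0;
    }
};
```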
Like they say, sometimes taking the hardware into consideration matters more than the "ideal" mathematical algorithm.
Modern computers have barrel shifters so that a shift of any number of bits up to 31 takes a few cycles (less than many other instructions). Compilers take advantage of this and bit operations are not only space efficient but in most cases time efficient.
But it really depends on how you're using and testing the bits - there are some inefficient methods that would make using a whole integer faster.
-Adam
Is it worth it? Only if you know that you have a problem with memory usage.
But unless you're either:
Working on an embedded processor with very limited resources, or
Storing an astronomical number of bools
then the answer is no. You'll have to work somewhat harder to achieve the same level of readability in your source by using a bitmap than you will using bools, and unless you're operating under either of the previous two conditions you'll likely find that it doesn't make any noticeable difference to your memory footprint.
In the past I had to develop a program which acted as a rule evaluator. You had an antecedent and some consequents (actions), so if the antecedent evaluated to true, the actions were performed.
At that time I used a modified version of the RETE algorithm (there are three versions of RETE, only the first of which is public) for the antecedent pattern matching. We're talking about a big system here, with millions of operations per rule and some operators "repeated" across several rules.
It's possible I'll have to implement it all over again in another language and, even though I'm experienced with RETE, does anyone know of other pattern matching algorithms? Any suggestions, or should I keep using RETE?
The TREAT algorithm is similar to RETE, but doesn't record partial matches. As a result, it may use less memory than RETE in certain situations. Also, if you modify a significant number of the known facts, then TREAT can be much faster because you don't have to spend time on retractions.
There's also RETE* which balances between RETE and TREAT by saving some join node state depending on how much memory you want to use. So you still save some assertion time, but also get memory and retraction time savings depending on how you tune your system.
You may also want to check out LEAPS, which uses a lazy evaluation scheme and incorporates elements of both RETE and TREAT.
I only have personal experience with RETE, but it seems like RETE* or LEAPS are the better, more flexible choices.