Papers on fast validation of UTF-8 - performance

Are there any papers on state-of-the-art UTF-8 validators/decoders? I've seen implementations "in the wild" that use clever loops to process up to 8 bytes per iteration in common cases (e.g. all 7-bit ASCII input).

I don't know about papers; it's probably a bit too specific and narrow a subject for strictly scientific analysis, more of an engineering problem. You can start by looking at how this is handled in different libraries. Some solutions use language-specific tricks while others are very general. For Java, you can start with the code of UTF8ByteBufferReader, part of Javolution. I have found it to be much faster than the character set converters built into the language. I believe (but I'm not sure) that the latter use a common piece of code for many encodings plus encoding-specific data files. Javolution, in contrast, has code designed specifically for UTF-8.
There are also techniques for specific tasks. For example, if you only need to know how many bytes a UTF-8 character occupies as you parse the text, you can use a table of 256 values indexed by the first byte of the encoded character; skipping over characters or calculating a string's length in characters this way is much faster than using bit operations and conditionals.
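For illustration, a minimal sketch of such a length table in C++ (the helper names are made up for the example; the ranges assume well-formed UTF-8, and continuation bytes are not validated here):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// 256-entry table: length of the UTF-8 sequence started by a given lead byte,
// 0 for bytes that cannot start a well-formed sequence (continuation bytes,
// 0xC0/0xC1, 0xF5-0xFF).
static std::array<std::uint8_t, 256> make_utf8_len_table() {
    std::array<std::uint8_t, 256> t{};
    for (int b = 0x00; b <= 0x7F; ++b) t[b] = 1;  // ASCII
    for (int b = 0xC2; b <= 0xDF; ++b) t[b] = 2;  // 2-byte sequences
    for (int b = 0xE0; b <= 0xEF; ++b) t[b] = 3;  // 3-byte sequences
    for (int b = 0xF0; b <= 0xF4; ++b) t[b] = 4;  // 4-byte sequences
    return t;                                     // everything else stays 0
}

static const std::array<std::uint8_t, 256> utf8_len = make_utf8_len_table();

// Count characters by hopping from lead byte to lead byte. Note that this
// only looks at the lead bytes; it does not validate continuation bytes.
std::size_t utf8_length(const std::string& s) {
    std::size_t chars = 0, i = 0;
    while (i < s.size()) {
        std::uint8_t len = utf8_len[static_cast<unsigned char>(s[i])];
        if (len == 0) break;  // invalid lead byte - treat as an error in real code
        i += len;
        ++chars;
    }
    return chars;
}
```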
For some situations, e.g. if you can spare some memory and you know that most characters you encounter will be from the Basic Multilingual Plane, you could try even more aggressive lookup tables: first calculate the length in bytes by the method described above and, if it's 1 or 2 bytes (maybe 3 makes sense too), look up the decoded char in a table. Remember, however, to benchmark this and any other algorithm you try, as it need not be faster at all (bit operations are quite fast, and with a big lookup table you lose locality of reference, plus the offset calculation isn't completely free either).
Anyway, I suggest you start by looking at the Javolution code or another similar library.

Related

What concepts or algorithms exist for parallelizing parsers?

It seems easy to parallelize parsers for large amounts of input data that is already given in a split format, e.g. a large list of individual database entries, or that is easy to split with a fast preprocessing step, e.g. parsing the grammatical structure of sentences in large texts.
A bit harder seems to be parallel parsing that already requires quite some effort to locate sub-structures in a given input. Common programming language code looks like a good example. In languages like Haskell, which use layout/indentation to separate individual definitions, you could probably check the number of leading spaces of each line once you've found the start of a new definition, skip lines until you find another definition, and pass each skipped chunk to another thread for full parsing.
When it comes to languages like C, JavaScript etc., which use balanced braces to define scopes, the amount of preprocessing work would be much higher. You'd need to go through the whole input, counting braces while taking care of text inside string literals and so on. It's even worse with languages like XML, where you also need to keep track of tag names in the opening/closing tags.
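For concreteness, a rough sketch of such a brace-counting pre-pass (C++; names are made up, and comments, character literals and unbalanced input are ignored for brevity):

```cpp
#include <string>
#include <vector>

// Pre-pass that splits source text into top-level chunks (a chunk ends when
// the brace depth returns to 0). String literals are skipped so that braces
// inside them are not counted.
std::vector<std::string> split_top_level(const std::string& src) {
    std::vector<std::string> chunks;
    std::size_t start = 0;
    int depth = 0;
    bool in_string = false;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (in_string) {
            if (c == '\\') ++i;                   // skip escaped character
            else if (c == '"') in_string = false;
        } else if (c == '"') {
            in_string = true;
        } else if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (--depth == 0) {                   // end of a top-level block
                chunks.push_back(src.substr(start, i - start + 1));
                start = i + 1;
            }
        }
    }
    if (start < src.size())                       // trailing text outside any block
        chunks.push_back(src.substr(start));
    return chunks;
}
// Each chunk could then be handed to a separate thread for full parsing.
```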
I found a parallel version of the CYK parsing algorithm that seems to work for all context-free grammars. But I'm curious what other general concepts/algorithms exist that make it possible to parallelize parsers, including things like the brace counting described above, which would only work for a limited set of languages. This question is not about specific implementations but about the ideas such implementations are based on.
I think you will find McKeeman's 1982 paper on Parallel LR Parsing quite interesting, as it appears to be practical and applies to a broad class of grammars.
The basic scheme is standard LR parsing. What is clever is that the (presumably long) input is divided into roughly N equal-sized chunks (for N processors), and each chunk is parsed separately. Because the starting point of a chunk may (must!) be in the middle of some production, McKeeman's individual parsers, unlike classic LR parsers, start with all possible left contexts (requiring that the LR state machine be augmented) to determine which LR items apply to the chunk. (It shouldn't take very many tokens before an individual parser has determined which states really apply, so this isn't very inefficient.) Then the results of all the parsers are stitched together.
He sort of ducks the problem of partitioning the input in the middle of a token. (You can imagine an arbitrarily big string literal containing text that looks like code, fooling the parser that starts in the middle of it.) What appears to happen is that such a parser runs into an error and abandons its parse; the parser to its left takes up the slack. One can imagine the chunk splitter using a little bit of smarts to mostly avoid this.
He goes on to demonstrate a real parser in which speedups are obtained.
Clever, indeed.

Algorithm to compress a lot of small strings?

I am looking for an algorithm to compress small ASCII strings. They contain lots of letters, but they can also contain numbers and, rarely, special characters. They will be small: about 50-100 bytes on average, 250 max.
Examples:
Android show EditText.setError() above the EditText and not below it
ImageView CENTER_CROP dont work
Prevent an app to show on recent application list on android kitkat 4.4.2
Image can't save validable in android
Android 4.4 SMS - Not receiving sentIntents
Imported android-map-extensions version 2.0 now my R.java file is missing
GCM registering but not receiving messages on pre 4.0.4. devices
I want to compress the titles one by one, not many titles together and I don't care much about CPU and memory usage.
You can use Huffman coding with a shared Huffman tree among all texts you want to compress.
While you would typically construct a Huffman tree for each string to be compressed separately, that would require a lot of storage overhead, which should be avoided here. That's also the major problem when using a standard compression scheme for your case: most of them have some overhead which kills your compression efficiency for very short strings. The ones that don't have a (big) overhead are typically less efficient in general.
When constructing a Huffman tree which is later used for compression and decompression, you typically use the texts which will be compressed to decide which character is encoded with which bits. Since in your case the texts to be compressed seem to be unknown in advance, you need to have some "pseudo" texts to build the tree, maybe from a dictionary of the human language or some experience of previous user data.
Then construct the Huffman tree and store it once in your application; either hardcode it into the binary or provide it in the form of a file. Then you can compress and decompress any texts using this tree. Whenever you decide to change the tree since you gain better experience on which texts are compressed, the compressed string representation also changes. It might be a good idea to introduce versioning and store the tree version together with each string you compress.
Another improvement you might think about is multi-character Huffman encoding. Instead of compressing the texts character by character, you could find frequent syllables or words and put them into the tree too; they then require even fewer bits in the compressed string. This requires a somewhat more complicated compression algorithm, but it might well be worth the effort.
To process a string of bits in the compression and decompression routines in C++ (*), I recommend either boost::dynamic_bitset or std::vector<bool>. Both internally pack multiple bits into bytes.
(*) The question once had the c++ tag, so the OP obviously wanted to implement it in C++. Since the general problem is not specific to a programming language, the tag was removed, but I kept the C++-specific part of the answer.
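To make the scheme a bit more concrete, here is a minimal sketch in C++ (using std::vector<bool> as suggested above). The function names are mine, the frequency table is assumed to come from representative sample texts, and decoding, which just walks the same tree bit by bit, is omitted:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <vector>

// One node of the shared Huffman tree (leaf nodes carry a character).
struct Node {
    char ch;
    std::uint64_t freq;
    std::unique_ptr<Node> left, right;
};

struct ByFreq {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Build the tree once from a shared frequency table, e.g. gathered from
// representative sample titles - not from each string individually.
std::unique_ptr<Node> build_tree(const std::map<char, std::uint64_t>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& p : freq)
        pq.push(new Node{p.first, p.second, nullptr, nullptr});
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{'\0', a->freq + b->freq,
                         std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
    }
    std::unique_ptr<Node> root(pq.top());
    return root;
}

// Walk the tree and record the bit sequence that leads to each leaf.
void collect_codes(const Node* n, std::vector<bool>& prefix,
                   std::map<char, std::vector<bool>>& codes) {
    if (!n->left && !n->right) { codes[n->ch] = prefix; return; }
    prefix.push_back(false);
    collect_codes(n->left.get(), prefix, codes);
    prefix.back() = true;
    collect_codes(n->right.get(), prefix, codes);
    prefix.pop_back();
}

// Encode one string against the shared code table; std::vector<bool> packs
// the bits into bytes internally. Characters missing from the table would
// need an escape mechanism, which is omitted here.
std::vector<bool> encode(const std::string& s,
                         const std::map<char, std::vector<bool>>& codes) {
    std::vector<bool> out;
    for (char c : s) {
        const std::vector<bool>& code = codes.at(c);
        out.insert(out.end(), code.begin(), code.end());
    }
    return out;
}
```

The code table (or the tree it was derived from) would be built once, stored with the application, and reused for every string, possibly together with a version number as described above.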

Text Compression - What algorithm to use

I need to compress some text data of the form
[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|
The data contains a few thousand characters (10000 - 50000 approx.).
I read up on the various compression algorithms, but cannot decide which one to use here.
The important thing here is: the compressed string should contain only alphanumeric characters (or a few special characters like +-/&%#$ ...). I mean, most algorithms produce gibberish ASCII characters as compressed data, right? That must be avoided.
Can someone guide me on how to proceed here?
P.S. The text contains numbers, ' and the | character predominantly. Other characters occur very, very rarely.
Actually, your requirement to limit the output character set to printable characters automatically costs you 25% of your compression gain, as out of the 8 bits per byte you'll end up using roughly 6.
But if that's what you really want, you can always Base64-encode (or, more space-efficiently, Base85-encode) the output to turn the raw byte stream back into printable characters.
Regarding the compression algorithm itself, stick to one of the better-known ones like gzip or bzip2; for both, well-tested open source code exists.
Selecting "the best" algorithm is actually not that easy, here's an excerpt of the list of questions you have to ask yourself:
do i need best speed on the encoding or decoding side (eg bzip is quite asymmetric)
how important is memory efficiency both for the encoder and the decoder? Could be important for embedded applications
is the size of the code important, also for embedded
do I want pre existing well tested code for encoder or decorder or both only in C or also in another language
and so on
The bottom line here is probably: take a representative sample of your data, run some tests with a couple of existing algorithms, and benchmark them on the criteria that are important for your use case.
Just one thought: you can solve your two problems independently. Use whatever algorithm gives you the best compression (just try out a few on your kind of data - bz2, zip, rar, whatever you like - and check the size), and then, to get rid of the "gibberish ASCII" (which is really just bytes), encode your compressed data with Base64.
If you really put much thought into it, you might find a better algorithm for your specific problem, since you only use a few different chars, but if you stumble upon one, I think it's worth a try.
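A rough sketch of that two-step pipeline in C++, assuming zlib is used for the compression step (bzip2 or any of the other candidates would be wired up the same way) and a hand-rolled Base64 encoder for the re-encoding; the function name is made up for the example:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

// Compress with zlib (deflate), then re-encode the raw bytes as Base64 so
// that the result contains only printable characters.
std::string compress_to_base64(const std::string& input) {
    uLongf bound = compressBound(input.size());
    std::vector<Bytef> buf(bound);
    if (compress2(buf.data(), &bound,
                  reinterpret_cast<const Bytef*>(input.data()), input.size(),
                  Z_BEST_COMPRESSION) != Z_OK)
        throw std::runtime_error("compression failed");
    buf.resize(bound);

    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (std::size_t i = 0; i < buf.size(); i += 3) {
        std::uint32_t n = buf[i] << 16;                 // pack up to 3 bytes
        if (i + 1 < buf.size()) n |= buf[i + 1] << 8;
        if (i + 2 < buf.size()) n |= buf[i + 2];
        out += tbl[(n >> 18) & 63];                     // emit 4 Base64 chars
        out += tbl[(n >> 12) & 63];
        out += (i + 1 < buf.size()) ? tbl[(n >> 6) & 63] : '=';
        out += (i + 2 < buf.size()) ? tbl[n & 63] : '=';
    }
    return out;
}
```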

Multiple short rules pattern matching algorithm

As the title suggests, we would like some advice on the fastest algorithm available for pattern matching with the following constraints:
Long dictionary: 256
Short but not fixed-length rules (1 to 3 or 4 bytes at most)
A small number of rules (~150) if 3 bytes, or a moderate number (~1K) if 4
Better performance than the current AC-DFA used in Snort, or than the AC-DFA-Split also used by Snort
Software-based (recent COTS systems like E3 or E5)
Ideally we would like to employ some SIMD / SSE, since those registers are currently 128 bits wide and will soon be 256, as opposed to the CPU's 64
We started this project by prefiltering Snort's AC with the algorithm shown in the SigMatch paper, but sadly the results have not been that impressive (~12% improvement when compiling with GCC, but none with ICC).
Afterwards we tried to exploit the new pattern matching capabilities present in SSE 4.2 through the IPP libraries, but saw no performance gain at all (I guess doing it directly in machine code would be better, but it's certainly more complex).
So, back to the original idea. Right now we are working along the lines of head-body segmentation AC, but we are aware that unless we replace the proposed AC-DFA on the head side it will be very hard to get better performance, though at least we would be able to support many more rules without a significant performance drop.
We are aware that bit-parallelism ideas use a lot of memory for long patterns, but the problem scope has been reduced to 3 or 4 bytes at most, which makes them a feasible alternative.
We have found Nedtries in particular, but we would like to know what you guys think, or whether there are better alternatives.
Ideally the source code would be in C and under an open source license.
IMHO, our idea was to search for something that moves 1 byte at a time to cope with the different rule sizes, but does so very efficiently by taking advantage of as much parallelism as possible (SIMD / SSE) while being as branch-free as possible.
I don't know whether it is better to do this bit-wise or byte-wise.
Back to a proper keyboard :D
In essence, most algorithms do not correctly exploit current hardware capabilities and limitations. They are very cache-inefficient and very branchy, not to mention that they don't exploit the capabilities now present in COTS CPUs that allow a certain level of parallelism (SIMD, SSE, ...).
This is precisely what we are looking for: an algorithm (or an implementation of an already existing algorithm) that properly takes all of that into account, with the advantage of not trying to cover all rule lengths, just short ones.
For example, I have seen some papers on NFAs claiming that these days their performance can be on par with DFAs, with much lower memory requirements, thanks to proper cache efficiency, enhanced parallelism, etc.
Please take a look at:
http://www.slideshare.net/bouma2
Support of 1 and 2 bytes is similar to what Baxter wrote above. Nevertheless, it would help if you could provide the number of single-byte and double-byte strings you expect to be in the DB, and the kind of traffic you expect to process (Internet, corporate etc.) - after all, too many single-byte strings may end up matching every byte. The idea of Bouma2 is to allow occurrence statistics to be incorporated into the preprocessing stage, thereby reducing the false-positive rate.
It sounds like you are already using high-performance pattern matching. Unless you have some clever new algorithm, or can point to some statistical bias in the data or your rules, it's going to be hard to speed up the raw algorithms.
You might consider treating pairs of characters as pattern-match elements. This will make the branching factor of the state machine huge, but you presumably don't care about RAM. This might buy you a factor of two.
When running out of steam algorithmically, people often resort to careful hand coding in assembler, including clever use of the SSE instructions. A trick that might help with unique sequences wherever they are found is to do a series of comparisons against the elements and form a boolean result by ANDing/ORing rather than by conditional branching, because branches are expensive. The SSE instructions might be helpful here, although their alignment requirements might force you to replicate them 4 or 8 times.
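As an illustration of that compare-and-mask style (SSE2 only, not tuned; the function name is made up), a scan for a single byte can be written so that the 16 comparisons per block produce one bit mask instead of 16 conditional branches:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

// Return the index of the first occurrence of `needle` in `buf`, or `len`
// if not found. The inner comparison yields a 16-bit mask; only the
// "any hit in this block?" test branches.
std::size_t find_byte_sse2(const unsigned char* buf, std::size_t len,
                           unsigned char needle) {
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(needle));
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i block = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, pattern));
        if (mask != 0)
            return i + __builtin_ctz(mask);   // lowest set bit (GCC/Clang builtin)
    }
    for (; i < len; ++i)                      // scalar tail
        if (buf[i] == needle) return i;
    return len;
}
```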
If the strings you are searching are long, you might distribute subsets of the rules to separate CPUs (threads). Partitioning the rules might be tricky.

How does PDF417 barcode decoding recover from damaged labels?

I recently learned about PDF417 barcodes and I was astonished that I can still read the barcode after I ripped it in half and scanned only a fragment of the original label.
How can the barcode decoding be that robust? Which (types of) algorithms are used during encoding and decoding?
EDIT: I understand the general philosophy of introducing redundancy to create robustness, but I'm interested in more details, i.e. how this is done with PDF417.
The PDF417 format allows for varying levels of duplication/redundancy in its content. The level of redundancy used affects how much of the barcode can be obscured or removed while still leaving the contents readable.
PDF417 itself does not use any particular algorithm; it's a specification for encoding data.
I think there is a confusion between the barcode format and the data it conveys.
The various barcode formats (PDF417, Aztec, DataMatrix) specify a way to encode data, be it numerical, alphabetic or binary... the exact content though is left unspecified.
From what I have seen, Reed-Solomon is often the algorithm used for redundancy. With this algorithm the exact level of redundancy is up to you, and from what I've been dealing with there are libraries for it at least in Java and C.
Now, it is up to you to specify what the exact content of your barcode should be, including the algorithm used for redundancy and the parameters used by this algorithm. And of course you'll need to work hand in hand with those who are going to decode it :)
Note: QR seems slightly different, with explicit zones for redundancy data.
I don't know PDF417 in detail. I know that QR codes use Reed-Solomon correction. It is an oversampling technique. To get the concept: suppose you have a polynomial of degree 6. Technically, you need seven points to describe this polynomial uniquely, so you can perfectly transmit the information about the whole polynomial with just seven points. However, if one of these seven is corrupted, you lose the information entirely. To work around this issue, you extract a larger number of points from the polynomial and write them down. As long as you have at least seven of them, that is enough to reconstruct your original information.
In other words, you trade space for robustness, by introducing more and more redundancy. Nothing new here.
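To make the oversampling idea concrete, here is a toy sketch in C++ using plain polynomials over the reals (a real Reed-Solomon code does the same thing with finite-field arithmetic): any 7 surviving samples of a degree-6 polynomial suffice to recover its constant term, here via Lagrange interpolation at x = 0.

```cpp
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Evaluate a polynomial given by its coefficients (constant term first).
double eval(const std::vector<double>& coeff, double x) {
    double y = 0.0;
    for (std::size_t i = coeff.size(); i-- > 0; ) y = y * x + coeff[i];
    return y;
}

// Recover p(0) from a set of sample points via Lagrange interpolation.
double interpolate_at_zero(const std::vector<std::pair<double, double>>& pts) {
    double result = 0.0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double term = pts[i].second;
        for (std::size_t j = 0; j < pts.size(); ++j)
            if (j != i) term *= (0.0 - pts[j].first) / (pts[i].first - pts[j].first);
        result += term;
    }
    return result;
}

int main() {
    // Degree-6 polynomial; its constant term (42) stands in for the "message".
    std::vector<double> msg = {42, 3, -1, 0.5, 2, -0.25, 1};

    // Oversample: 10 points, although 7 would be enough to pin the polynomial down.
    std::vector<std::pair<double, double>> samples;
    for (int x = 1; x <= 10; ++x) samples.push_back({double(x), eval(msg, x)});

    // "Lose" the first three samples; the 7 survivors still recover the constant term.
    std::vector<std::pair<double, double>> survivors(samples.begin() + 3,
                                                     samples.end());
    std::cout << interpolate_at_zero(survivors) << "\n";   // prints 42 (up to rounding)
}
```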
I do not think the trade-off between space and robustness is any different here than anywhere else. Think RAID - say, RAID 5: you can yank a disk out of the array and the data is still available. The price? An extra disk. Or, in terms of the barcode, the extra space the label occupies.

Resources