How does PDF417 barcode decoding recover from damaged labels? - algorithm

I recently learned about PDF417 barcodes and I was astonished that I can still read the barcode after I ripped it in half and scanned only a fragment of the original label.
How can the barcode decoding be that robust? Which (types of) algorithms are used during encoding and decoding?
EDIT: I understand the general philosophy of introducing redundancy to create robustness, but I'm interested in more details, i.e. how this is done with PDF417.

The PDF417 format allows for varying levels of redundancy in its content (its error-correction "security levels" run from 0 to 8). The level of redundancy used determines how much of the barcode can be obscured or removed while still leaving the contents readable.

PDF417 itself does not prescribe any particular scheme for your data; it's a specification for encoding data.
I think there is a confusion between the barcode format and the data it conveys.
The various barcode formats (PDF417, Aztec, DataMatrix) specify a way to encode data, be it numerical, alphabetic or binary... the exact content though is left unspecified.
From what I have seen, Reed-Solomon is often the algorithm used for redundancy. The exact level of redundancy is up to you with this algorithm, and there are libraries available, at least in Java and C, from what I've been dealing with.
Now, it is up to you to specify what the exact content of your barcode should be, including the algorithm used for redundancy and the parameters used by this algorithm. And of course you'll need to work hand in hand with those who are going to decode it :)
Note: QR seems slightly different, with explicit zones for redundancy data.
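To make that concrete, here is a hedged sketch of adding your own Reed-Solomon redundancy to the payload before it goes into the barcode. It assumes the third-party Python reedsolo package purely for illustration; the Java and C libraries mentioned above expose the same encode/decode idea, and the payload text is made up.

    from reedsolo import RSCodec

    rsc = RSCodec(10)                        # 10 parity bytes: corrects up to 5 corrupted bytes
    payload = rsc.encode(b"ORDER-2826672")   # this is what you would actually put in the barcode

    # Simulate scan damage: flip a couple of bytes of the payload.
    damaged = bytearray(payload)
    damaged[2] ^= 0xFF
    damaged[7] ^= 0xFF

    decoded = rsc.decode(bytes(damaged))[0]  # recent reedsolo versions return a (message, ...) tuple
    print(bytes(decoded))                    # b'ORDER-2826672'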

I don't know PDF417, but I know that QR codes use Reed-Solomon error correction. It is an oversampling technique. To get the concept: suppose you have a polynomial of degree 6. Technically, you need seven points to describe this polynomial uniquely, so you can perfectly transmit the information about the whole polynomial with just seven points. However, if one of these seven is corrupted, you lose the whole message. To work around this issue, you evaluate the polynomial at a larger number of points and write those down. As long as at least seven of them survive, that is enough to reconstruct your original information.
In other words, you trade space for robustness, by introducing more and more redundancy. Nothing new here.
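A minimal sketch of that oversampling idea, using exact rational arithmetic for readability (real Reed-Solomon does the same thing over a finite field, and can also locate which points are corrupted rather than merely missing):

    from fractions import Fraction

    def evaluate(coeffs, x):
        # Evaluate a polynomial given as [c0, c1, ..., c6] at x.
        return sum(c * x ** i for i, c in enumerate(coeffs))

    def interpolate(points):
        # Lagrange interpolation: the unique degree-6 polynomial through 7 points.
        def poly(x):
            total = Fraction(0)
            for i, (xi, yi) in enumerate(points):
                term = Fraction(yi)
                for j, (xj, _) in enumerate(points):
                    if j != i:
                        term *= Fraction(x - xj, xi - xj)
                total += term
            return total
        return poly

    message = [3, 1, 4, 1, 5, 9, 2]                           # 7 values = a degree-6 polynomial
    samples = [(x, evaluate(message, x)) for x in range(10)]  # oversample: 10 points instead of 7

    surviving = samples[:3] + samples[6:]                     # lose any 3 points, keep 7
    recovered = interpolate(surviving)
    assert all(recovered(x) == y for x, y in samples)         # the lost points come back too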

I do not think the concept of a trade-off between space and robustness is any different here than anywhere else. Think RAID, let's say RAID 5: you can yank a disk out of the array and the data is still available. The price? An extra disk. Or, in terms of the barcode, the extra space the label occupies.

Related

How to add redundancy into an OCR-scanned code

This is more of an algorithm question - I am not very mathematical, so I was looking for an engineering solution... If this is off topic for SO let me know and I will delete the question.
I created a mashup of open source goodness to do Optical Character Recognition on difficult backgrounds: https://github.com/metalaureate/tesseract-docker-ocr
I want to use it to scan labels with a pre-defined ID code, e.g., 2826672. The accuracy is about 70% for digits.
Question: how do I add redundancy programmatically to my code to increase accuracy to 99%, and how do I decode it? I can imagine some really kludgy ways, like doubling and inverting the digits, but I don't know how to do this in a way that honors information theory without my having to translate a lot of math.
How do I add and decode digits to correct for OCR errors?
If you have the freedom of actually printing the labels, then there's no real reason to stick with plain ol' numbers. Use QR codes instead. Both the size (information capacity) and the level of redundancy are configurable, so you can customize them to fit your specific scenario. Internally, Reed-Solomon error correction is used. There are plenty of libraries for both QR code generation and recognition from a scan.
Further info is available in Wikipedia.
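For example, a hedged sketch using the Python qrcode package (the label text and file name are placeholders); the error_correction parameter is where you buy the extra redundancy:

    import qrcode
    from qrcode.constants import ERROR_CORRECT_H

    # ERROR_CORRECT_H lets roughly 30% of the codewords be damaged and still decode.
    qr = qrcode.QRCode(error_correction=ERROR_CORRECT_H)
    qr.add_data("2826672")            # the pre-defined ID code from the question
    qr.make(fit=True)                 # pick the smallest symbol version that fits the data
    qr.make_image().save("label.png")

Any barcode-scanning library on the reading side then hands you the ID back directly, which should be far more reliable than OCR on raw printed digits.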

Text Compression - What algorithm to use

I need to compress some text data of the form
[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|
The data contains somewhere between 10,000 and 50,000 characters.
I read up on the various compression algorithms, but cannot decide which one to use here.
The important thing here is: the compressed string should contain only alphanumeric characters (or a few special characters like +-/&%#$). Most algorithms produce arbitrary bytes ("gibberish" ASCII) as compressed output, right? That must be avoided.
Can someone guide me on how to proceed here?
P.S. The text contains numbers, ' and the | character predominantly. Other characters occur very rarely.
Actually, your requirement to limit the output character set to printable characters automatically costs you about 25% of your compression gain, as out of 8 bits per byte you'll end up using roughly 6.
But if that's what you really want, you can always Base64-encode (or, more space-efficiently, Base85-encode) the output to convert the raw byte stream back into printable characters.
Regarding the compression algorithm itself, stick to one of the better-known ones like gzip or bzip2; well-tested open source code exists for both.
Selecting "the best" algorithm is actually not that easy, here's an excerpt of the list of questions you have to ask yourself:
do i need best speed on the encoding or decoding side (eg bzip is quite asymmetric)
how important is memory efficiency both for the encoder and the decoder? Could be important for embedded applications
is the size of the code important, also for embedded
do I want pre existing well tested code for encoder or decorder or both only in C or also in another language
and so on
The bottom line here is probably, take a representative sample of your data and run some tests with a couple of existing algorithms, and benchmark them on the criteria that are important for your use case.
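Such a test can be as small as the following sketch, which compares the Python standard-library codecs on a made-up sample (substitute your real data):

    import bz2, lzma, time, zlib

    sample = b'[70,165,531,0|70,166,562|"hi",167,578|' * 1000   # stand-in for real data

    for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)):
        start = time.perf_counter()
        out = compress(sample)
        elapsed = (time.perf_counter() - start) * 1000
        print(f"{name}: {len(out)} bytes, {elapsed:.1f} ms")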
Just one thought: you can solve your two problems independently. Use whatever algorithm gives you the best compression (just try out a few on your kind of data: bz2, zip, rar, whatever you like, and check the size), and then, to get rid of the "gibberish ASCII" (that's actually just raw bytes), encode your compressed data with Base64.
If you really put much thought into it, you might find a better algorithm for your specific problem, since you only use a few different characters, but if you stumble upon one, I think it's worth a try.
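A minimal sketch of that two-step pipeline (compress, then Base64 so the result is printable ASCII), using only the standard library:

    import base64
    import zlib

    text = '[70,165,531,0|70,166,562|"hi",167,578|70,171,593|'

    packed = base64.b64encode(zlib.compress(text.encode("utf-8"))).decode("ascii")
    print(packed)                                          # letters, digits, and + / =

    restored = zlib.decompress(base64.b64decode(packed)).decode("utf-8")
    assert restored == text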

Reverse "jpeg" compression algorithm?

I have to write a tool that manages very large data sets (well, large for an ordinary workstation). Basically I need something that works the opposite way to the JPEG format: the dataset must stay intact on disk, where it can be arbitrarily large, but be lossily compressed when it is read into memory, with only the sub-part in use at any given time decompressed on the fly. I have started looking at IPP (Intel Integrated Performance Primitives), but it's not really clear yet whether I can use it for what I need to do.
Can anyone point me in the right direction?
Thank you.
Given the nature of your data, it seems you are handling some kind of raw samples.
So the easiest and most generic "lossy" technique is to drop the lower bits, reducing precision down to the level you want.
Note that you really want to "drop the lower bits", which is quite different from "rounding to the nearest power of 10". Computers work in base 2, and you want all your lower bits to be zero for compression to perform as well as possible. This method assumes that the selected compression algorithm will take advantage of the predictable pattern of zero bits.
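A tiny illustration of that masking step (the sample values are made up):

    # Drop the k lowest bits of each sample so the compressor sees runs of zero bits.
    k = 4
    samples = [40317, 40322, 40298, 40355]
    quantized = [s & ~((1 << k) - 1) for s in samples]
    print(quantized)    # [40304, 40320, 40288, 40352]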
Another method, more complex and more specific, could be to convert your values as an index into a table. The advantage is that you can "target" precision where you want it. The obvious drawback is that the table will be specific to a distribution pattern.
On top of that, you may also store not the value itself but its delta from the preceding value, if there is any kind of relation between consecutive values. This helps compression too.
For the data to compress well, you will need to group it into packets of an appropriate size, such as 64 KB; on a single field, no compression algorithm will give you suitable results. This, in turn, means that each time you want to access a field, you need to decompress the whole packet, so tune the packet size to how you access the data. Sequential access is easiest to deal with in such circumstances.
Regarding compression algorithm, since these data are going to be "live", you need something very fast, so that accessing the data has very small latency impact.
There are several open-source alternatives out there for that use. For easier license management, I would recommend a BSD-licensed alternative. Since you use C++, the following ones look suitable:
http://code.google.com/p/snappy/
and
http://code.google.com/p/lz4/
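A rough sketch tying these pieces together: quantize, delta-encode, pack a block, and compress it. zlib is used here only as a standard-library stand-in for a fast codec such as LZ4 or Snappy, and all the sizes are arbitrary:

    import struct
    import zlib

    def compress_block(samples, drop_bits=4):
        quantized = [s & ~((1 << drop_bits) - 1) for s in samples]
        deltas = [quantized[0]] + [b - a for a, b in zip(quantized, quantized[1:])]
        raw = struct.pack("<%di" % len(deltas), *deltas)
        return zlib.compress(raw, level=1)      # level 1 favours speed over ratio

    def decompress_block(blob, count):
        deltas = struct.unpack("<%di" % count, zlib.decompress(blob))
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    samples = list(range(40000, 40000 + 16384))          # stand-in for real sensor data
    blob = compress_block(samples)
    print(len(blob), "compressed bytes for", len(samples), "samples")
    assert decompress_block(blob, len(samples)) == [s & ~15 for s in samples]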

self-encoded QR barcode?

I was wondering if it's possible to create a QR code in some file format, say PNG, then encode that PNG in a QR code, such that the resulting QR code is the same one you started with?
I don't think so. Each QR code needs to encode the original data along with a variable amount of redundancy.
So to encode the original QR code, you would need to encode the same amount of information plus the additional redundancy, which means the result can't be the same, since it encodes more information.
There are different sizes of QR codes, ranging from 21x21 to 177x177 modules. They can hold anywhere from 152 to roughly 23,600 data bits (the 177x177 symbol has about 31,000 modules in total). Unfortunately, even using 1 bit per "pixel", the amount of data a code can hold never reaches the number of bits required to store its own image.
There are sizes, though, for which it is not far off. I imagine some simple compression algorithm, or maybe even ignoring common parts like the calibration areas, could get to a point where you can store some representation of it in itself. It seems feasible to me that you could find a way to store a QR code of some size as a QR code of the same size.
The problem then is constructing a code which creates itself. With different error correction options, there is room to fudge a few pixels around, which helps the probability that such a thing is possible, but it would still take a fair bit of magic. Perhaps some sort of genetic algorithm could do better than brute force, but you may need to read the full spec and build one cleverly by hand. The search space is pretty big.
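To make that capacity gap concrete, a back-of-the-envelope check; the capacity figure is the commonly quoted maximum for a version-40 code at the lowest error-correction level (an assumption on my part, not from the answer above):

    version = 40
    modules = (17 + 4 * version) ** 2    # 177 x 177 = 31,329 modules to reproduce
    capacity_bits = 23_648               # assumed max data bits at EC level L
    print(modules, capacity_bits, capacity_bits >= modules)    # 31329 23648 False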
As freespace mentioned, it's not possible to encode an image in that same image itself, for several reasons.
I have created a QR code which contains a URL which (again) points to the original image:
http://qr.ai/qqq
I really think that's the closest you can get.
A QR code can contain at most 7089 numeric digits, 4296 alphanumeric characters, or 2953 raw bytes. A few thousand bytes is enough to store a small image (like a small QR code).
The only issue here is that most QR readers expect QR codes to contain text (not image data).

Combining semacodes and steganography?

Update
I asked this question quite a while ago now, and I was curious if anything like this has been developed since I asked the question?
I don't even know if there is a term for this kind of algorithm, and I guess there won't be if nobody has invented it yet. However, that also makes googling for it a bit hard. Does anybody know if there is a term for this algorithm/principle yet?
This is an idea I have been thinking about, but I do not quite know how to solve it. I would like to know if any solutions like this exists out there, or if you guys have any idea how this could be implemented.
Steganography
Steganography is basically the art of hiding messages. In modern days we do this digitally by, for example, modifying the least significant bits of an image like the one below. Thus for every pixel, and for every colour component of that pixel, we might be able to hide a bit or two.
This alteration is not visible to the naked eye, but analysing the least significant bits might reveal patterns that expose the existence, and possibly the content, of a hidden message. To counter this we simply encrypt the message before embedding it in the image, which keeps the message safe and also helps prevent discovery of the existence of a hidden message.
Thus, in principle, steganography provides the following:
Hiding encrypted message in any kind of media data. (Images, music, video, etc.)
Complete deniability of the existence of a hidden message without the correct key.
Extraction of the hidden message with the correct key.
Semacodes
Semacodes are a way of encoding data in a visual representation that can be printed, copied, and scanned easily. The Data Matrix shown below is an example of a semacode containing the famous Lorem Ipsum text. It is essentially a 2D barcode with a higher capacity than usual barcodes. Programs for generating semacodes are readily available, and ditto for software for reading them, especially for cell phones. Semacodes usually contain error-correcting codes, are generally very robust, and can be read even when badly damaged.
Thus semacodes have the following properties:
Data encoding that may be printed and copied.
May be scanned and interpreted even in damaged (dirty) conditions; generally a very robust encoding.
Combining it
So my idea is to create something that combines these two, with all of the combined properties. This means it would have to:
Embed an encrypted message in any media, probably a scanned image.
The message should be extractable even if the image is printed and scanned, and even partly damaged.
The existence of an embedded message should be undetectable without the key used for encryption.
So, first of all I would like to know if any solutions, algorithms or research is available on this? Secondly I would like to hear any ideas/thoughts on how this might be done?
I really hope to get a good discussion going on the possibilities and feasibility of implementing something like this, and I am looking forward to reading your answers.
Update
Thanks for all the good input on this. I will probably work a bit more on this idea when I have more time. I am convinced it must be possible. Think about research in embedding watermarks in music and movies.
I imagine part of the robustness of a semacode to damage/dirt/obscuration is the high contrast between the two states of any "cell". The reader can still make a good guess as to the actual state, even with some distortion.
That sort of contrast is not available in a photographic image, and is the very reason why steganography works - the lsb bit-flipping has almost no visual effect on the image itself, while digital fidelity ensures that a non-visual system can still very accurately read the embedded data.
As the two applications are sort of at opposite ends of the analog/digital spectrum (semacodes are all about being decipherable by analog (visual) processing but live on paper, not in a digital file; steganography is all about the bits in the file and cares nothing for the analog representation, whether light or sound or something else), I imagine a combination of the two will be extremely difficult, if not impossible.
Essentially what you're thinking of is being able to steganographically embed something in an image, print the image, make a colour photocopy of it, scan it in, and still be able to extract the embedded data.
I'm afraid I can't help, but if anyone achieves this, I'll be DAMN impressed! :)
It's not a complete answer, but you should look at watermarking. This technique solves your first two goals (embedable in a printed image and readable even from partly damaged scan).
Part of watermarking's resilience to distortion and transcription errors (from going from digital to analog and back) comes from redundancy (e.g. repeating the data several times). That redundancy would make the watermark detectable even without a key. However, you might be able to use redundancy techniques that are more subtle, maybe something related to erasure coding or secret sharing.
I know that's not a complete answer, but hopefully those leads will point you in the right direction!
What language/environment are you using? It shouldn't be that hard to write code that opens both the image and the semacode as bitmaps (the latter as monochrome) and sets the lowest bit(s) of each byte of each pixel in the colour image to the value of the corresponding pixel of the monochrome bitmap.
(optionally expand the semacode bitmap first to the same pixel-dimensions extending with white)
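A hedged sketch of exactly that, assuming Pillow is available; the file names are placeholders, and the resize is a crude stand-in for the white-padding step suggested above:

    from PIL import Image

    cover = Image.open("photo.png").convert("RGB")
    code = Image.open("semacode.png").convert("1")    # monochrome semacode bitmap
    code = code.resize(cover.size)                    # naive size match instead of padding

    out = cover.copy()
    pixels, code_pixels = out.load(), code.load()

    for y in range(out.height):
        for x in range(out.width):
            r, g, b = pixels[x, y]
            bit = 1 if code_pixels[x, y] else 0       # state of the semacode cell
            pixels[x, y] = ((r & ~1) | bit, g, b)     # hide it in the red channel's LSB

    out.save("stego.png")                             # extract later by reading that bit back

Note that this survives lossless formats like PNG, but not a print-and-scan cycle, which is the hard part discussed above.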
