Accuracy of barcodes - barcode

In books (e.g. "Barcodes for Mobile Devices", ISBN 978-0-521-88839-4), papers (e.g. "Bar Codes May Have Poorer Error Rates Than Commonly Believed", DOI: 10.1373/clinchem.2010.153288) and on websites, figures are given for the accuracy or error rates of barcodes.
The figures vary; for Code39, for example, they range from 1 error in 1.7 million, through 1 error in 3 million, to 1 error in 4.5 million.
Where do these numbers come from, and how can one calculate them (e.g. for Code39)?
I couldn't find any useful information in the definition of Code39 in ISO/IEC 16388:2007 either.

The "error rate" these numbers describe is the read error rate, i.e. how often a barcode may be read incorrectly when scanned. In order for barcodes to be useful this needs to be a very low value and so barcode formats that have lower read error rates are potentially better (although there are other factors involved as well).
These numbers are presumably determined by empirical testing. On the website you linked to there is a further link to a study by Ohio University that describes the methodology they used, which is an example of how this can be done:
An automated test apparatus was constructed and used for the test. The apparatus included a robot which loaded carrier sheets onto oscillating stages that were moved under four fixed mounted, “hand held” moving beam, visible laser diode bar code scanners. Scanner output was a series of digital pulses. Decoding of all symbols was performed in a computer using software programs based on standard reference decode algorithms. Each symbol was scanned by each scanner until 283 decodes were obtained. [...] An error occurred and was recorded whenever the decoded data did not match the encoded data for a given symbol.
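A figure such as "1 error in 1.7 million" from a test like the one quoted above is just the number of decodes divided by the number of mismatches. The sketch below shows that arithmetic; the counts in the usage note are made up for illustration and are not taken from the study. For a test that saw no errors at all, it also reports the "rule of three" (3 / n), a common approximation for the 95% upper bound on the true rate.

```python
def error_rate_summary(decodes: int, errors: int) -> dict:
    """Summarise a read test from its raw counts (hypothetical helper)."""
    if decodes <= 0 or errors < 0 or errors > decodes:
        raise ValueError("need decodes > 0 and 0 <= errors <= decodes")
    return {
        # Fraction of decodes whose data did not match the encoded data.
        "observed_rate": errors / decodes,
        # The same figure expressed as "1 error in N"; undefined with no errors.
        "one_error_in": decodes / errors if errors else None,
        # Rule of three: approximate 95% upper bound when zero errors were seen.
        "upper_95_if_no_errors": 3 / decodes if errors == 0 else None,
    }
```

For example, 2 misreads in 3,400,000 decodes would be reported as 1 error in 1.7 million, while a test with 1,000,000 clean decodes only supports a claim of roughly "fewer than 3 errors per million".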

Go to the barcodefaq site you linked to and click on the barcode type, e.g. UPC, and you will get a PDF that explains the methodology used. The cited article explains the errors encountered and contains links to further information.


Teller transactions archive - print barcode on papers

I am looking into options for automatically indexing the daily documents generated by tellers in bank operations. The documents do not have any reference number and are handwritten by the customer.
So, to auto-index these documents and store them in an EDMS, we have to put the core banking transaction reference number on each one. What options do I have? Print a barcode label containing the transaction number and attach it to the paper? Or use a machine into which I can feed the paper so that it prints a barcode on it?
Does anyone know the right hardware or software for this?
Depends on how complex you want to be. Perhaps these documents could be multiple (stapled?) pages. Would you want to index each page, and would the pages then form an associated sequence (e.g. doc. 00001-01 to -20)?
Next caper is to consider the form of the number. It's best to formulate a check-digiting system so that a printed number can be manually entered and the check-digit verifies that the number hasn't been miskeyed.
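As one possible check-digit scheme (an assumption; the answer does not name one), the widely used Luhn algorithm appends a single digit that catches any single mistyped digit and most swaps of adjacent digits. A minimal sketch for numeric reference numbers:

```python
def luhn_check_digit(payload: str) -> str:
    """Return the Luhn check digit for a string of decimal digits."""
    if not payload.isdigit():
        raise ValueError("payload must contain decimal digits only")
    total = 0
    # Walk from the right; the digit next to the check digit is doubled first.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return str((10 - total % 10) % 10)


def luhn_is_valid(number: str) -> bool:
    """True if the last digit of `number` is the correct Luhn check digit."""
    return (
        number.isdigit()
        and len(number) > 1
        and luhn_check_digit(number[:-1]) == number[-1]
    )
```

The printed label would then carry the reference number followed by its check digit, and the indexing software would reject any keyed number for which `luhn_is_valid` returns False.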
Now - if these documents may be different sizes for instance, or potentially a wad of paperwork, how would you feed them through a printer?
So I'd suggest that a good choice would be to produce your numbers on a specialist barcode-printer with human-readable line on the same label. Some idiot will want to insist on using cheap thermally-sensitive labels, but these almost inevitably deteriorate with time. I'd choose thermal-transfer labels which are a little more complex - your tellers would need to be able to load label-rolls and also the transfer-ribbon (a little like a typewriter-ribbon, if you remember those) but basically any monkey could do it.
Even then, there are three grades of ribbon - wax, resin and a combination. Problem with wax is that it can become worn - same thing as you get with laser-printing where the pages get stuck together if they are left to their own devices for a while. Another reason you don't use laser printers in this role - apart from the fact that you'd need to produce sheets of labels to attach rather than ones and twos on-demand is that the laser processing will cook the glue on the sheets. Fine for an address label with a lifetime of a few days, but disastrous when you may be storing documents for years. Document goes one way, label another...
Resin is the best but most expensive choice. It has better wearing characteristics.
My choice would be a Zebra TLP2824plus using thermal-transfer paper and resin ribbon. Software is easy - it just means you need to go back 20 years in time and forget all about drivers - just send a string to the printer as if it was a generic text printer. The formatting of the label - well, the manual will show you that...
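To give an idea of what "just send a string" looks like, here is a rough sketch that builds a label in ZPL, Zebra's label language. It assumes the printer is set up to accept ZPL, and the coordinates and bar height are arbitrary placeholder values; the printer manual remains the authority on the exact commands and on how the string reaches the device (serial, USB or network).

```python
def build_zpl_label(reference: str) -> str:
    """Build a ZPL string for one Code 128 label (illustrative values only)."""
    if not reference.isalnum():
        raise ValueError("this sketch only accepts letters and digits")
    return (
        "^XA"                 # start of label format
        "^FO30,30"            # field origin: 30 dots from the left and the top
        "^BCN,80,Y,N,N"       # Code 128, normal orientation, 80 dots high,
                              # with the human-readable line printed below
        f"^FD{reference}^FS"  # field data, then field separator
        "^XZ"                 # end of label format
    )
```

`build_zpl_label("TX0001")` yields a single line of text that can be written to the printer port unchanged.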
Other technologies and approaches would probably be more complex than simply producing and attaching barcode labels. For instance, if you were to have an inkjet printer like those that are used to mark (milk/juice) cartons - well, it would have to deal with different sizes of paper, and different weights from near-cardboard to airmail paper. It would also have a substantial footprint since the paper would need to be physically presented to the printer. Then there's all the problems of disassembling and reassembling a stapled wad. And who can control precisely where the printing would occur? What may suit one document may not suit another - it may have inconveniently-placed logos or other artwork in the "standard" position for that-sized paper.
Another issue is colour. There's no restriction on background colour with a label (yellow or fluoro pink for example) - it would be easy to locate when necessary. Contrast that with the-ink's-running-low washed-out ink printing on a grey background. White labels wouldn't stand out all that well on the majority of (white) documents.
BUT a strong alternative technology would be to have reels of labels pre-printed by a commercial printing establishment rather than producing them with a special printer on-demand. Reels are better than sheets - they are easier to use especially for people with short fingernails.

What are some algorithms for symbol-by-symbol handwriting recognition?

I think there are some algorithms that evaluate difference between drawn symbol and expected one, or something like that. Any help will be appreciated :))
You can implement a simple Neural Network to recognize handwritten digits. The simplest type to implement is a feed-forward network trained via backpropagation (it can be trained stochastically or in batch-mode). There are a few improvements that you can make to the backpropagation algorithm that will help your neural network learn faster (momentum, Silva and Almeida's algorithm, simulated annealing).
As far as looking at the difference between a real symbol and an expected image, one algorithm that I've seen used is the k-nearest-neighbor algorithm. Here is a paper that describes using the k-nearest-neighbor algorithm for character recognition (edit: I had the wrong link earlier. The link I've provided requires you to pay for the paper; I'm trying to find a free version of the paper).
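To make the "difference between a drawn symbol and an expected one" idea concrete, here is a minimal k-nearest-neighbor sketch. It assumes symbols are already binarised and flattened into equal-length tuples of 0/1 pixels, and it uses the Hamming distance (number of differing pixels) as the difference measure; both choices, and the tiny 3x3 glyphs, are illustrative assumptions rather than what the cited paper does.

```python
from collections import Counter


def hamming(a, b):
    """Number of pixel positions at which two equal-length images differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(sample, training, k=3):
    """Label `sample` by majority vote among its k closest training images.

    training: list of (pixels, label) pairs.
    """
    nearest = sorted(training, key=lambda item: hamming(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 3x3 glyphs, written row by row (two variants per symbol).
TRAINING = [
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "I"),
    ((0, 1, 0, 0, 1, 0, 0, 1, 1), "I"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 1), "L"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 0), "L"),
    ((1, 1, 1, 0, 1, 0, 0, 1, 0), "T"),
    ((1, 1, 1, 0, 1, 0, 0, 0, 0), "T"),
]
```

A slightly damaged "L" such as `(1, 0, 0, 1, 0, 0, 0, 1, 1)` is still closest to the two stored "L" variants, so the vote comes out as "L".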
If you were using a neural network to recognize your characters, the steps involved would be:
Design your neural network with an appropriate training algorithm. I suggest starting with the simplest (stochastic backpropagation) and then improving the algorithm as desired, while you train your network.
Get a good sample of training data. For my neural network, which recognizes handwritten digits, I used the MNIST database.
Convert the training data into an input vector for your neural network. For the MNIST data, you will need to binarize the images. I used a threshold value of 128. I started with Otsu's method, but that didn't give me the results I wanted.
Create your network. Since the images from MNIST come in an array of 28x28, you have an input vector with 784 components and 1 bias (so 785 inputs), to your neural network. I used one hidden layer with the number of nodes set as per the guidelines outlined here (along with a bias). Your output vector will have 10 components (one for each digit).
Randomly present training data (so randomly ordered digits, with random input image for each digit) to your network and train it until it reaches a desired error-level.
Run test data (MNIST data comes with this as well) against your neural network to verify that it recognizes digits correctly.
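The data-preparation steps above can be sketched as follows. The function names are made up for illustration, and treating a pixel equal to the threshold as "on" is an assumption; the answer only states that a threshold of 128 was used.

```python
def to_input_vector(image, threshold=128):
    """Binarise a grayscale image (rows of 0..255) and unroll it.

    Returns a flat list of 0.0/1.0 pixel values with a constant bias input
    of 1.0 appended, so a 28x28 image gives 784 + 1 = 785 inputs.
    """
    pixels = [1.0 if value >= threshold else 0.0 for row in image for value in row]
    return pixels + [1.0]


def one_hot(digit, classes=10):
    """Target output vector: 1.0 at the digit's position, 0.0 elsewhere."""
    if not 0 <= digit < classes:
        raise ValueError("digit out of range")
    return [1.0 if index == digit else 0.0 for index in range(classes)]
```

Each training step would then present `to_input_vector(image)` to the network and compare its 10 outputs against `one_hot(label)`.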
You can check out an example here (shameless plug) that tries to recognize handwritten digits. I trained the network using data from MNIST.
Expect to spend some time getting yourself up to speed on neural network concepts, if you decide to go this route. It took me at least 3-4 days of reading and writing code before I actually understood the concept. A good resource is I recommend starting with trying to implement neural networks to simulate the AND, OR, and XOR boolean operations (using a threshold activation function). This should give you an idea of the basic concepts. When it actually comes down to training your network, you can try to train a neural network that recognizes the XOR boolean operator; it's a good place to start for an introduction to learning algorithms.
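As a starting point for the boolean exercise, the sketch below wires up threshold units with hand-picked weights (nothing is learned here, and the weight values are just one workable choice). It also shows why XOR is the interesting case: AND and OR each fit in a single unit, while XOR needs a second layer.

```python
def step(value):
    """Threshold activation: fire (1) when the weighted sum is non-negative."""
    return 1 if value >= 0 else 0


def neuron(weights, bias, inputs):
    """A single threshold unit: weighted sum of the inputs plus bias."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def AND(a, b):
    return neuron((1, 1), -1.5, (a, b))


def OR(a, b):
    return neuron((1, 1), -0.5, (a, b))


def NAND(a, b):
    return neuron((-1, -1), 1.5, (a, b))


def XOR(a, b):
    # Not linearly separable, so it takes two layers:
    # a hidden layer (OR, NAND) feeding an output unit (AND).
    return AND(OR(a, b), NAND(a, b))
```

Training a network to find such weights for XOR by itself is then a natural next exercise.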
When it comes to building the neural network, you can use existing frameworks like Encog, but I found it to be far more satisfactory to build the network myself (you learn more that way I think). If you want to look at some source, you can check out a project that I have on github (shameless plug) that has some basic classes in Java that help you build and train simple neural-networks.
Good luck!
I've found a few sources that use k-nearest-neighbors for digit and/or character recognition:
Bangla Basic Character Recognition Using Digital Curvelet Transform (The curvelet coefficients of an
original image as well as its morphologically altered versions are used to train separate k–
nearest neighbor classifiers. The output values of these classifiers are fused using a simple
majority voting scheme to arrive at a final decision.)
The Homepage of Nearest Neighbors and Similarity Search
Fast and Accurate Handwritten Character Recognition using Approximate Nearest Neighbors Search on Large Databases
Nearest Neighbor Retrieval and Classification
For resources on Neural Networks, I found the following links to be useful:
CS-449: Neural Networks
Artificial Neural Networks: A neural network tutorial
An introduction to neural networks
Neural Networks with Java
Introduction to backpropagation Neural Networks
Momentum and Learning Rate Adaptation (this page goes over a few enhancements to the standard backpropagation algorithm that can result in faster learning)
Have you checked Detexify? I think it does pretty much what you want.
It is open source, so you could take a look at how it is implemented.
You can get the code from here (if I do not recall wrongly, it is in Haskell)
In particular what you are looking for should be in Sim.hs
I hope it helps
If you have not implemented machine learning algorithms before you should really check out:
It's a free class taught by Andrew Ng, Director of the Stanford Machine Learning Centre. The course is taught entirely online and focuses on implementing a wide range of machine learning algorithms. It does not go too much into the theoretical intricacies of the algorithms but rather teaches you how to choose, implement and use the algorithms and how to diagnose their performance. It is unique in that your implementation of the algorithms is checked automatically! It's great for getting started in machine learning as you have instantaneous feedback.
The class also includes at least two exercises on recognising handwritten digits. (Programming Exercise 3: with multinomial classification and Programming Exercise 4: with feed-forward neural networks)
The class has started a while ago but it should still be possible to sign up. If not, a new run should start early next year. If you want to be able to check your implementations you need to sign up for the "Advanced Track".
One way to implement handwriting recognition
The answer to this question depends on a number of factors, including what kind of resource constraints you have (embedded platform) and whether you have a good library of correctly labelled symbols: i.e. different examples of a handwritten letter for which you know what letter they represent.
If you have a decent sized library, implementation of a quick and dirty standard machine learning algorithm is probably the way to go. You can use multinomial classifiers, neural networks or support vector machines.
I believe a support vector machine would be fastest to implement as there are excellent libraries out there that handle the machine learning portion of the code for you, e.g. libSVM. If you are familiar with using machine learning algorithms, this should take you less than 30 minutes to implement.
The basic procedure you would probably want to implement is as follows:
Learning what symbols "look like"
1. Binarise the images in your library.
2. Unroll the images into vectors / 1-D arrays.
3. Pass the "vector representation" of the images in your library and their labels to libSVM to get it to learn how the pixel coverage relates to the represented symbol for the images in the library.
4. The algorithm gives you back a set of model parameters which describe the recognition algorithm that was learned.
You should repeat 1-4 for each character you want to recognise to get an appropriate set of model parameters.
Note: steps 1-4 you only have to carry out once for your library (but once for each symbol you want to recognise). You can do this on your developer machine and only include the parameters in the code you ship / distribute.
If you want to recognise a symbol:
Each set of model parameters describes an algorithm which tests whether a character represents one specific character - or not. You "recognise" a character by testing all the models with the current symbol and then selecting the model that best fits the symbol you are testing.
This testing is done by again passing the model parameters and the symbol to test in unrolled form to the SVM library which will return the goodness-of-fit for the tested model.
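The "one model per symbol, pick the best fit" procedure can be sketched without any external library. To keep the example self-contained, a simple perceptron stands in for the SVM here; this is a deliberate substitution, not libSVM's API, and the 2x2 "images" and their labels are invented for illustration.

```python
def train_perceptron(examples, epochs=10):
    """Train one linear model. examples: list of (vector, target), target +1/-1."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for vector, target in examples:
            score = sum(w * x for w, x in zip(weights, vector)) + bias
            if target * score <= 0:  # wrong side of (or on) the boundary
                weights = [w + target * x for w, x in zip(weights, vector)]
                bias += target
    return weights, bias


def train_one_vs_rest(library):
    """Learn one 'is it this symbol or not?' model per distinct label."""
    models = {}
    for label in sorted({lab for _, lab in library}):
        examples = [(vec, 1 if lab == label else -1) for vec, lab in library]
        models[label] = train_perceptron(examples)
    return models


def recognise(models, vector):
    """Score the symbol against every model and return the best-fitting label."""
    def score(model):
        weights, bias = model
        return sum(w * x for w, x in zip(weights, vector)) + bias
    return max(models, key=lambda label: score(models[label]))


# Binarised, unrolled 2x2 images with their labels (toy library).
LIBRARY = [((1, 1, 0, 0), "A"), ((0, 0, 1, 1), "B"), ((1, 0, 0, 1), "C")]
MODELS = train_one_vs_rest(LIBRARY)
```

Only `MODELS` (the learned parameters) and `recognise` would need to ship with the application; the training functions can stay on the developer machine.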

True Random Number Generator using atmospheric noise

I have to build a one-time pad system, and for that I have to build my own TRNG. I want to know how to record atmospheric noise and use it to generate random numbers. So far I've tried recording a .wav file and reading it in Java, but the values don't seem very... random. Any suggestions? I know about, but I can't really use their generators; I have to build my own, so what I want is some insight into how the folks at have built their number generator, with atmospheric noise as a source of 'randomness'.
Non Real-time solution
What you can do is record the ambient audio in the room beforehand and save it as a temporary WAV file. The WAV format is based on the RIFF specification. Strip the WAV header, which is 44 bytes long in the canonical PCM layout, then read the audio bytes and convert them as needed, depending on whether you want to generate WORDs, DWORDs, or BYTEs. That should give you some random values to work with.
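A minimal sketch of the reading step, assuming a 16-bit PCM recording. Rather than skipping a fixed 44 bytes, it lets Python's standard `wave` module find the sample data, because WAV files that carry extra chunks have longer headers. The function name is made up for illustration.

```python
import struct
import wave


def read_samples(path):
    """Return the samples of a 16-bit PCM WAV file as signed integers (WORDs)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch assumes 16-bit samples")
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    # WAV stores 16-bit samples as little-endian signed integers.
    return list(struct.unpack(f"<{count}h", raw[: count * 2]))
```

For a stereo file the left and right samples are interleaved in the returned list. Raw samples like these are strongly correlated from one to the next, which is probably why the values in the question did not look random.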
Real-time solution
I do not know whether you want to program this in Java or some other language, nor do I know the intended platform, so I cannot recommend any specific real-time audio processing library.
For C# you can use NAudio to record the audio in real time and receive the audio bytes. You can then convert the audio bytes into DWORDs, QWORDs, WORDs, etc., which should give you some random values. Remember to stop recording and to release unmanaged resources once you have finished generating random numbers.
Good Resources On The WAV File Specification
Link to the specification (Easy to understand)
The answer is unknown and probably intentionally so. Although it is hard to be sure, the site seems to be a combination of charity and for-profit work. Each radio source only produces a few Kbps of random data. From how he describes it in many links, I don't see evidence of a CSRNG. It doesn't matter. For OTP purposes, if it's not truly random, it's a glorified stream cipher. (I think that's what Bruce and others have always said.)
I find it hard to recall when a good CSRNG was broken. I'd recommend you use something like ISAAC or a properly implemented block/stream cipher. Perfect Paper Passwords does this. Use a Fortuna construction with the internals of Fortuna using the above ciphers/algorithms to produce the majority of the random data. The Fortuna system can regularly have data injected into it by a TRNG. The very best TRNG on a budget is plus locally generated stuff. The best cheap, hardware solution is a VIA Artigo board with VIA Padlock (TRNG + acceleration for SHA-1, SHA256, AES, & RSA) for $300. They have libraries to help you use things, too. (There's even a pseudo-TRNG that uses processor timing under network load.)
Remember, the crypto is usually the strongest link in the chain. System security exists on many levels: processor, firmware, peripheral firmware (esp DMA), kernel mode code, OS, trusted middleware or OS functions, application. Security as a whole includes users, policy, physical security, EMSEC, etc. Anyone worrying way too much about RNG's is usually wasting effort. Just use an accepted solution or something I mentioned above. Then, focus on the rest. Especially, how people and systems interact. Configuration, patching, choice of OS, policies. Most problems happen there.
I recall an article on that I can't seem to find now. All I remember is that they used the LSB of the noise they were measuring. The MSBs will certainly not be random. They then generated a string of 1s and 0s based on the LSB. Don't do something silly like a simple binary conversion; that won't work. You may have to sample the noise in binary to give the LSB a more uniform distribution.
The trick they used to ensure an even distribution was to not use this string of 1s and 0s directly as the random numbers. Instead they would parse the string 2 bits at a time. Every time the bits matched (i.e. 00 or 11) they added a 1 to their random string. Every time the bits flipped (i.e. 01 or 10) they added a 0 to their random string.
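The pairing rule recalled above is easy to sketch, and it is worth putting next to the classic von Neumann extractor, which may be what the article actually used. For independent bits that are 1 with probability p, the match/flip rule outputs a 1 with probability p² + (1 − p)², which is closer to one half than p but still not exactly one half; the von Neumann rule discards matching pairs and is exactly unbiased under the same independence assumption. Function names are illustrative.

```python
def pair_match_bits(bits):
    """Rule as recalled above: 00 or 11 -> 1, 01 or 10 -> 0."""
    return [1 if bits[i] == bits[i + 1] else 0 for i in range(0, len(bits) - 1, 2)]


def von_neumann_bits(bits):
    """Von Neumann extractor: 01 -> 0, 10 -> 1, matching pairs are discarded."""
    output = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            output.append(first)
    return output
```

The price of the von Neumann rule is throughput: at best one output bit per four input bits on average, and fewer when the source is biased.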
If you make your own TRNG, make sure you verify it!
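One very basic verification step is the frequency (monobit) test from the NIST SP 800-22 suite, sketched below. It only checks that 1s and 0s are roughly balanced, so passing it proves little by itself (a strictly alternating 0101... sequence passes), but a failure is a clear sign that something is wrong.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for 'ones and zeros are balanced'."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map 1 -> +1 and 0 -> -1 and add everything up.
    total = sum(1 if bit else -1 for bit in bits)
    return math.erfc(abs(total) / math.sqrt(2 * n))
```

A common convention is to treat a p-value below 0.01 as a failure. Established test batteries run many more tests than this one.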
It is hardly possible to get real random numbers out of software. Even the static in your wav file is likely to be influenced by periodic EMI generated by your computer and is therefore not purely random.
Can you use special hardware or are you forced to stick to pure software? Why won't pseudo random numbers satisfy your needs? They will do fine on a relatively small number of random samples. Because you want to use the random numbers in an OTP, I guess you won't be using it in a big scale.
Can you provide a little more detail?
The atmospheric noise approach to generating random numbers is complex because the atmosphere is filled with non-random signals, all of which pollute the entropy you seek. There is an easier way.
Chances are good your CPU already contains a true random number generator, assuming you have an Intel Ivy Bridge-based Core/Xeon processor, which became available in April, 2012. (The new Haswell architecture also has this feature).
Intel's random generator exploits the random effects of thermal noise inside an unstable digital circuit. Thermal noise is just random atomic vibrations, which is pretty much the same underlying physical phenomenon that uses when it samples atmospheric noise. The sampled random bits go through a sophisticated conditioning and testing process to eliminate pollution from non-random signals. I highly recommend this excellent article on IEEE Spectrum which describes the process in detail.
Intel added a new x86 instruction called RDRAND that allows programs to directly retrieve these random numbers. Although Java does not yet support direct access to RDRAND, it's possible using JNI. This is the approach I took with the drnglib project. For example:
DigitalRandom random = new DigitalRandom();
System.out.println(random.nextInt());
The nextInt() method is implemented as a JNI native call that invokes RDRAND. The performance is pretty good considering the quality of randomness. Using eight threads, I've generated ~760 MB/sec of random data.
True random number generators (TRNGs) usually draw on natural sources such as seismic signals, non-stationary bio-signals, etc. The two issues faced by these generators are:
1) The data points are non-uniformly distributed
2) It takes a very long time to generate a large sequence of numbers (especially when millions are required).
However, their most important advantage is their unpredictable nature. To overcome these issues while retaining that advantage, it is better to use the output of a TRNG to seed a pseudo-random number generator. For this, you may try sampling the amplitude of atmospheric noise at random time points and using the values to seed a PRNG.
This will help you to get large numbers of uniformly distributed values. As the seed is unpredictable, the output of PRNG too becomes unpredictable.
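A minimal sketch of the seeding idea, assuming the noise samples have already been collected as bytes. The samples are hashed with SHA-256 so that every bit of the seed depends on all of them. Python's `random.Random` (a Mersenne Twister) is used only to keep the example short: it is not cryptographically secure, and a pad produced by any seeded PRNG is only as strong as that PRNG, so it is no longer a true one-time pad.

```python
import hashlib
import random


def seeded_prng(noise: bytes) -> random.Random:
    """Return a PRNG whose seed is derived from the recorded noise bytes."""
    if not noise:
        raise ValueError("need some noise bytes to derive a seed")
    seed = int.from_bytes(hashlib.sha256(noise).digest(), "big")
    return random.Random(seed)
```

The same noise always reproduces the same output stream, which is exactly why the seed material has to stay secret and must never be reused.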

Algorithm for matching data

I have a project where I am testing a device that is very sensitive to noise (electromagnetic, radio, etc...). The device generates 5-6 bytes per second of binary data (looks like gibberish to an untrained eye) based on a given input (audio).
Depending on noise, sometimes the device will miss characters, sometimes it will insert random characters, and sometimes several of both.
I have written an app that gives the user an ability to see on the fly the errors that it generates (as compared to the master file [e.g. what the device should output in ideal conditions]). My algorithm basically takes each byte in the live data and compares it to the byte in the same position in the known master file. If the bytes don't match, I have a window of 10 characters both ways from the current position, where I'll seek a match nearby. If that matches (plus a validation or two), I visually mark up the location in the UI and register an error.
This approach works reasonably well and actually, given the speed of the incoming data, works real time as well. However, I feel like what I am doing is not optimal and the approach would fall apart if the data would stream at higher rates.
Are there other approaches I could take? Are there known algorithms for this type of thing?
I read many years ago that NASA's data collection outfit (e.g. ones that communicate with crafts in space and on the Moon/Mars) have had a 0.00001% loss of data despite tremendous interference in space.
Any ideas?
I presume of main interest is the signal generated by the device? What is more important? Detecting when an error has occurred or making the signal 'robust' against such errors? I do a lot of signal processing lately and denoising a signal is part of my routine, I'm basically trying to estimate the real signal and remove any contaminants.
I don't know how the signal generated by the device is further used...if it's being recorded to a computer, then you can easily apply some denoising, try wavelet denoising for instance. You will find packages for doing this in several languages of your choice.
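On the comparison itself: finding dropped and inserted bytes relative to a master file is a sequence-alignment problem. As one readily available option (not the method described in the question), Python's standard `difflib.SequenceMatcher` aligns the two byte strings and reports where they differ. The sketch below works on data that has already been captured; it is not a streaming implementation, and the function name is made up.

```python
from difflib import SequenceMatcher


def find_stream_errors(master: bytes, live: bytes):
    """List the places where `live` deviates from `master`.

    Returns (kind, master_position, expected_bytes, received_bytes) tuples,
    where kind is 'insert', 'delete' or 'replace'.
    """
    matcher = SequenceMatcher(None, master, live, autojunk=False)
    errors = []
    for kind, m_start, m_end, l_start, l_end in matcher.get_opcodes():
        if kind != "equal":
            errors.append((kind, m_start, master[m_start:m_end], live[l_start:l_end]))
    return errors
```

With master `b"ABCDEFGH"` and live data `b"ABXCDFGH"`, this reports an inserted `X` at master position 2 and a missing `E` at position 4. For live use, one could run it over a sliding window of recent bytes instead of the whole capture.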

How does PDF417 barcode decoding recover from damaged labels?

I recently learned about PDF417 barcodes and I was astonished that I can still read the barcode after I ripped it in half and scanned only a fragment of the original label.
How can the barcode decoding be that robust? Which (types of) algorithms are used during encoding and decoding?
EDIT: I understand the general philosophy of introducing redundancy to create robustness, but I'm interested in more details, i.e. how this is done with PDF417.
The PDF417 format allows for varying levels of redundancy in its content: it defines error correction levels 0 through 8, implemented with Reed-Solomon codes. The level used determines how much of the barcode can be obscured or removed while still leaving the contents readable.
PDF417 does not use anything. It's a specification of encoding of data.
I think there is a confusion between the barcode format and the data it conveys.
The various barcode formats (PDF417, Aztec, DataMatrix) specify a way to encode data, be it numerical, alphabetic or binary... the exact content though is left unspecified.
From what I have seen, Reed-Solomon is often the algorithm used for redundancy. The exact level of redundancy is up to you with this algorithm and there are libraries at least in Java and C from what I've been dealing with.
Now, it is up to you to specify what the exact content of your barcode should be, including the algorithm used for redundancy and the parameters used by this algorithm. And of course you'll need to work hand in hand with those who are going to decode it :)
Note: QR seems slightly different, with explicit zones for redundancy data.
I don't know PDF417, but I know that QR codes use Reed-Solomon error correction. It is an oversampling technique. To get the concept: suppose you have a polynomial of degree 6. Technically, you need seven points to describe this polynomial uniquely, so you can perfectly transmit the information about the whole polynomial with just seven points. However, if one of these seven is corrupted, you lose the whole information. To work around this issue, you extract a larger number of points from the polynomial and write them down. As long as you have at least seven of the bunch, it is enough to reconstruct your original information.
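The polynomial picture can be turned into a small working sketch. The assumptions are loud ones: arithmetic is done modulo the small prime 257 (PDF417's own Reed-Solomon code works modulo 929, and QR codes use a field of 256 elements), the seven message symbols are taken to be the polynomial's values at x = 1..7, and only missing points at known positions (erasures) are handled. Real decoders also locate and correct wrong values, which takes considerably more machinery.

```python
P = 257  # prime modulus; every symbol is a number in 0..256


def interpolate_at(points, x):
    """Value at x of the unique lowest-degree polynomial through `points` (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                numerator = numerator * (x - xj) % P
                denominator = denominator * (xi - xj) % P
        # pow(d, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * numerator * pow(denominator, -1, P)) % P
    return total


def encode(message, n_points):
    """Message symbols become points 1..k; extra points k+1..n add redundancy."""
    base = list(enumerate(message, start=1))
    extra = [(x, interpolate_at(base, x)) for x in range(len(message) + 1, n_points + 1)]
    return base + extra


def decode(surviving, k):
    """Rebuild the k message symbols from any k surviving (x, y) points."""
    if len(surviving) < k:
        raise ValueError("not enough points left to reconstruct the message")
    chosen = surviving[:k]
    return [interpolate_at(chosen, x) for x in range(1, k + 1)]
```

Encoding a 7-symbol message into 10 points means any 3 points can be torn off and the remaining 7 still reproduce the message, which is the effect seen when scanning a damaged label.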
In other words, you trade space for robustness, by introducing more and more redundancy. Nothing new here.
I do not think the concept of a trade-off between space and robustness is any different here than anywhere else. Think RAID, let's say RAID 5 - you can yank a disk out of the array and the data is still available. The price? - an extra disk. Or in terms of the barcode - extra space the label occupies.
