This is more of an algorithmy question - I am not very mathematical so was looking for an engineery solution... If this is off topic for SO let me know and I will delete the question.
I created a mashup of open source goodness to do Optical Character Recognition on difficult backgrounds: https://github.com/metalaureate/tesseract-docker-ocr
I want to use it to scan labels with a pre-defined ID code, e.g., 2826672. The accuracy is about 70% for digits.
Question: how do I add redundancy programmatically to my code to increase accuracy to 99%, and how do I decode it? I can imagine some really kludgy ways, like doubling and inverting the digits, but I don't know how to do this in a way that honors information theory without my having to translate a lot of math.
How do I add and decode digits to correct for OCR errors?
If you have the freedom of actually printing the labels, then there's no real reason to stick with plain ol' numbers. Use QR codes instead. Both the size (information capacity) and the amount of redundancy are configurable, so you can tune them to fit your specific scenario. Internally, Reed-Solomon error correction is used. There are plenty of libraries for both QR code generation and for recognizing codes in a scan.
Further info is available on Wikipedia.
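If you end up generating the labels yourself, here is a minimal sketch with the Python qrcode package (just one of the many available libraries; the output file name is illustrative, and the ID is the one from your question). ERROR_CORRECT_H lets roughly 30% of the symbol be damaged and still decode:

import qrcode

# Highest error-correction level: about 30% of the code area can be destroyed.
qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_H)
qr.add_data("2826672")
qr.make(fit=True)
qr.make_image().save("label_2826672.png")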
I know there exist many, many books about image processing, but I need a recommendation for a particularly good one giving practical hints for using the algorithms. I don't need background information about HOW an algorithm works, e.g. the Hough transform or the Canny filter, as I know that already from various books. But I need good advice on how to use those filters efficiently and in particular on how to set the thresholds etc.
It currently gives me a huge headache how to choose those values. When I set them to fixed values, they work for one picture, and when the illumination changes slightly, they don't work anymore for various reasons. So I wonder how to set them dynamically from image-specific values. I read on SO to e.g. set the Canny thresholds to:
low = 0.666*mean(img)
high = 1.333*mean(img)
(http://www.kerrywong.com/2009/05/07/canny-edge-detection-auto-thresholding/)
but somehow I haven't had much success with it so far.
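For concreteness, that kind of auto-thresholding rule looks like this in OpenCV (a commonly suggested median-based variant; sigma = 0.33 is just a typical starting point, not something from the linked article):

import cv2
import numpy as np

def auto_canny(gray, sigma=0.33):
    # Derive both thresholds from the median of the image instead of fixed values.
    v = np.median(gray)
    low = int(max(0, (1.0 - sigma) * v))
    high = int(min(255, (1.0 + sigma) * v))
    return cv2.Canny(gray, low, high)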
I'm interested in good advice for books etc. in particular, but included the specific example of how to determine thresholds for Canny to make it a valid SO question :-)
If you want a practical book, typically the book will be targeted to a specific library.
Here you can find a list of most of the books about the OpenCV library.
I would like to build a program to detect how close a user's audio recording is to another recording in order to correct the user's pronunciation. For example:
I record myself saying "Good morning"
I let a foreign student record "Good morning"
Compare his recording to mine to see if his pronunciation was good enough.
I've seen this in some language learning tools (I believe Rosetta Stone does this), but how is it done? Note we're only dealing with speech (and not, say, music). What are some algorithms or libraries I should look into?
A lot of people seem to be suggesting some sort of edit distance, which IMO is a totally wrong approach for determining the similarity of two speech patterns, especially for patterns as short as the OP is implying. The specific algorithms used in speech recognition are in fact nearly the opposite of what you would like to use here. The problem in speech recognition is resolving many similar pronunciations to the same representation. The problem here is to take a number of slightly different pronunciations and get some kind of meaningful distance between them.
I've done quite a bit of this stuff for large scale data science, and while I can't comment on exactly how proprietary programs do it, I can comment on how it's done in academia and provide a solution that is straightforward and will give you the power and flexibility that you want for this approach.
Firstly: I'm assuming that what you have is some chunk of audio without any filtering done on it, just as it would be acquired from a microphone. The first step is to eliminate background noise. There are a number of different methods for this, but I'm going to assume that what you want is something that will work well without being incredibly difficult to implement.
Filter the audio using scipy's filtering module here. There are a lot of frequencies that microphones pick up that are simply not useful for categorizing speech. I would suggest either a Bessel or a Butterworth filter to ensure that your waveform is preserved through filtering. The fundamental frequencies for everyday speech are generally between 800 and 2000 Hz (reference), so a reasonable cutoff would be something like 300 to 4000 Hz, just to make sure you don't lose anything.
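A minimal sketch of that filtering step with scipy (the cutoffs and filter order are just the values suggested above, not tuned for any particular microphone):

import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(audio, sample_rate, low_hz=300.0, high_hz=4000.0, order=4):
    # Butterworth band-pass; filtering forwards and backwards (sosfiltfilt)
    # keeps the phase intact, so the waveform shape is preserved.
    nyquist = sample_rate / 2.0
    sos = butter(order, [low_hz / nyquist, high_hz / nyquist],
                 btype="bandpass", output="sos")
    return sosfiltfilt(sos, np.asarray(audio, dtype=float))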
Look for the least active portion of speech and assume that is a reasonable representation of background noise. At this point you're going to want to run a series of fourier transforms along your data (or generate a spectrogram) and find the part of your speech recording that has the lowest average frequency response. Once you have that snapshot, you should subtract it from all other points in your audio sample.
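A rough sketch of that idea with scipy's spectrogram (the quietest frame is taken as the noise estimate, as described above; this is the simple approach, not a proper noise-reduction algorithm):

import numpy as np
from scipy.signal import spectrogram

def subtract_noise_floor(audio, sample_rate):
    # Rows are frequency bins, columns are time frames.
    freqs, times, spec = spectrogram(audio, fs=sample_rate)
    # The frame with the lowest total energy serves as the background-noise snapshot.
    noise = spec[:, np.argmin(spec.sum(axis=0))]
    # Subtract it from every frame, clipping negative values to zero.
    return np.clip(spec - noise[:, None], 0.0, None), freqs, times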
At this point you should have an audio file that is mostly just your user's speech and should be ready to be compared to another file that has gone through this process. Now, we want to actually clip the sound and compare this clip to some master clip.
Secondly: You're going to want to come up with a distance metric between two speech patterns. There are a number of ways to do this, but I'm going to assume we have the output of part one and some master file that has been through similar processing.
Generate a spectrogram of the audio file in question (example). The output from this is ultimately going to be an image that can be represented as a 2-d array of frequency response values. A spectrogram is essentially a fourier transform over time where the colour corresponds to intensity.
Use OpenCV (has python bindings, example) to run blob detection on your spectrogram. Effectively this is going to look for the big colorful blob in the middle of your spectrogram, and give you some limits on this. Effectively, what this should do, is return a significantly more sparse version of the original 2d-array that solely represents the speech in question. (With the assumption that your audio file will have some trailing stuff on the front and back ends of recording)
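A sketch of that step with OpenCV (here simply Otsu-thresholding the spectrogram and taking the bounding box of everything above threshold as the blob; a real blob detector would be more selective):

import cv2
import numpy as np

def speech_blob(spec):
    # Scale the spectrogram to 0-255 so OpenCV can threshold it.
    img = cv2.normalize(spec, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Bounding box of all above-threshold pixels, i.e. the speech region.
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))
    return spec[y:y + h, x:x + w]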
Normalize the two blobs to account for differences in speech speed. Everyone talks at a different speed, and as such your blobs will probably have different sizes along the x-axis (time). Without this step your algorithm would effectively be penalizing speaking speed, which you don't want here. This step isn't needed if you also want to make sure that they speak at the same speed as the master copy, but I would suggest it. Basically you want to stretch out the shorter version by multiplying its time axis by a constant that's just the ratio of the lengths of your two blobs.
You should also normalize the two blobs based on maximum and minimum intensity to account for people who talk at different volumes. Again, this is up to your discretion, but to fix this you should find similar ratios for the total span of intensities that you have, as well as the two recordings' maximum intensities, and make sure that these two values match up between your 2-d arrays.
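Both normalization steps can be approximated by resampling each blob to a common shape and rescaling its intensities, e.g. (the target shape is arbitrary):

import cv2
import numpy as np

def normalize_blob(blob, target_shape=(128, 128)):
    # Stretch or compress the time (and frequency) axes to a common shape...
    resized = cv2.resize(blob.astype(np.float32),
                         (target_shape[1], target_shape[0]))  # cv2 wants (width, height)
    # ...then rescale intensities into [0, 1] to remove volume differences.
    lo, hi = float(resized.min()), float(resized.max())
    return (resized - lo) / (hi - lo) if hi > lo else resized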
Third: Now that you have 2-d arrays representing your two speech events, which should in theory contain all of their useful information, it's time to directly compare them. Luckily, comparing two matrices is a well-solved problem and there are a number of ways to move forward.
Personally I would suggest using a metric like Cosine Similarity to determine the difference between your two blobs, but that's not the only solution and while it'll give you a quick validation, you can do better.
You could try subtracting one matrix from the other and get an evaluation of how much difference there is between them, which would probably be more accurate than simple cosine distance.
It might be overkill, but you could assume that there are certain regions of speech that matter more or less for evaluating difference between blobs (it might not matter if someone uses a long i instead of a short i, but a g instead of a k could be a different word entirely). For something like that you'd want to develop a mask for the difference array in the previous step and multiply all your values by that.
Whichever method you choose, you can now simply set some difference threshold and make sure that the difference between the two blobs is below your desired threshold. If it is, the captured speech is similar enough to be correct. Otherwise have them try again.
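As a sketch of the simplest of those options (cosine similarity on the two equally-sized blobs from the previous steps; the 0.85 threshold is a made-up starting point you would tune on known-good recordings):

import numpy as np

def cosine_similarity(a, b):
    # Flatten the two normalized blobs and compare them as vectors.
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_pronunciation_ok(student_blob, master_blob, threshold=0.85):
    return cosine_similarity(student_blob, master_blob) >= threshold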
I hope that's helpful, and again, I can't assure you that this is the exact algorithm that a company uses since that information is hugely proprietary and not open to the public, but I can assure you that methods similar to these are used in the very best papers in academia and that these methods will get you a great balance of accuracy and ease of implementation. Let me know if you have any questions, and good luck with your future data science exploits!
The musicg api https://code.google.com/p/musicg/ has an audio fingerprint generator and scorer, along with source code to show how it's done.
I think it looks for the most similar point in each track, then scores based on how far it can match.
It might look something like
import com.musicg.wave.Wave;
import com.musicg.fingerprint.FingerprintSimilarity;

// compare the fingerprints of the two recordings and get a similarity score
Wave wave1 = new Wave("voice1.wav");
Wave wave2 = new Wave("voice2.wav");
FingerprintSimilarity similarity = wave1.getFingerprintSimilarity(wave2);
float score = similarity.getSimilarity();
Idea:
The way biotechnologists align two sequences is as follows: each sequence is represented as a string over an alphabet (e.g. A/C/G/T for DNA; the biological details are irrelevant for us), where each letter (here, an entry) represents one unit of the sequence. The quality of an alignment (its score) is calculated from the similarity of each pair of corresponding entries, and the number and length of the blank entries that need to be inserted to produce that alignment.
The same algorithm (http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm) can be used for pronunciations, with substitution costs derived from substitution frequencies in a set of alternate pronunciations. Then you can calculate alignment scores to measure the similarity between two pronunciations in a way that is sensitive to the differences between phonemes. Measures of similarity that can be used here are Levenshtein distance, phoneme error rate, and word error rate.
Algorithms
The minimum number of insertions, deletions and substitutions required to transform one sequence into another is the Levenshtein distance. More info at http://php.net/manual/en/function.levenshtein.php
Phoneme error rate (PER) is the Levenshtein distance between a predicted pronunciation and the reference pronunciation, divided by the number of phonemes in the reference pronunciation.
Word error rate (WER) is the proportion of predicted pronunciations with at least one phoneme error to the total number of pronunciations.
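A small Python sketch of those two measures (the phoneme sequences are made-up ARPABET-style transcriptions of "good morning", purely for illustration):

def levenshtein(a, b):
    # Dynamic-programming edit distance; works on lists of phonemes as well as strings.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Phoneme error rate: edit distance divided by the reference length.
reference = ["G", "UH", "D", "M", "AO", "R", "N", "IH", "NG"]
predicted = ["G", "UH", "T", "M", "AO", "N", "IH", "NG"]
print(levenshtein(predicted, reference) / len(reference))  # 2 errors / 9 phonemes ≈ 0.22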
Source: Did an Internship on this at UW-Madison
A carefully configured Levenshtein distance should do the trick.
I know this question is out of date but...
To solve a similar problem I used the Google Speech Recognition API to check WHAT was said, and visually compared scaled waveforms of volume changes to detect differences in rhythm.
Code & video of the result.
You can use musicg https://code.google.com/p/musicg/ as roy zhang suggested. On Android, just include the musicg jar file in your project and use it. A tested example:
import com.musicg.wave.Wave;
import com.musicg.fingerprint.FingerprintSimilarity;
//somewhere in your code add
String file1 = Environment.getExternalStorageDirectory().getAbsolutePath();
file1 += "/test.wav";
String file2 = Environment.getExternalStorageDirectory().getAbsolutePath();
file2 += "/test2.wav"; // second recording to compare against the first
Wave w1 = new Wave(file1);
Wave w2 = new Wave(file2);
FingerprintSimilarity fps = w1.getFingerprintSimilarity(w2);
float score = fps.getScore();
float sim = fps.getSimilarity();
Log.d("score", score+"");
Log.d("similarities", sim+"");
Good luck
If this is only to check the pronunciation [of course with a different accent], you can do this:
Step 1: Using some speech-to-text tool [say Dragon Dictation], get the text of what was said.
Step 2: Compare the string that was produced with the string that was actually meant to be pronounced.
Step 3: If you find any discrepancy between the strings, it means the word was not pronounced correctly, and you can suggest the correct pronunciation.
You have to look into speech recognition algorithms. I understand that you don't need to translate speech to text (that is done by speech recognition algorithms); however, in your case many of the algorithms would be the same.
Probably, HMMs (Hidden Markov Models) would be helpful here.
Also look into here: http://htk.eng.cam.ac.uk/
For my problem it would be best to find a numeric representation of Kazakh national ornaments for generating new ones, but other approaches are also fine.
The ornaments essentially consist of combinations of relatively basic ornaments. Usually the ornaments are symmetrical.
Here are a few examples of basic elements:
(The images are a bit distorted)
And this is an example of a more complex ornament:
How could I encode an ornament's representation in as few numbers as possible, so that I could write a program that would generate an ornament given some sequence of numbers?
Any ideas are appreciated.
As I write this, I have thought that generating images of snowflakes may be somewhat relevant, although it's possibly just a fractal.
You have to realize that your question is actually not how to represent them but how to generate them.
Still, you might get some ideas. But don't hold your breath, because it can get complicated.
EDIT:
In researching problems such as this you could start with L-systems; this paper seems to convey the idea.
Actually here's an attempt at an answer:
Represent it as a set of grammar rules.
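For a flavour of what grammar rules buy you, here is a toy L-system sketch in Python (the rule and axiom are made up; a turtle-graphics renderer would turn the output string into a drawing):

# Rewrite rules expand an axiom into a long string of drawing commands.
rules = {"F": "F+F--F+F"}   # Koch-curve-like rule, purely illustrative
axiom = "F--F--F"

def expand(axiom, rules, iterations):
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

print(expand(axiom, rules, 3))  # feed this to a turtle: F = draw, + / - = turn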
I found this dissertation (read it as a book) on image texture generation: Image Texture Tools
Update
I asked this question quite a while ago now, and I was curious if anything like this has been developed since I asked the question?
I don't even know if there is a term for this kind of algorithm, and I guess there won't be if nobody has invented it yet. However it also makes googling for this a bit hard. Does anybody know if there is a term for this algorithm/principle yet?
This is an idea I have been thinking about, but I do not quite know how to solve it. I would like to know if any solutions like this exists out there, or if you guys have any idea how this could be implemented.
Steganography
Steganography is basically the art of hiding messages. In modern days we do this digitally by e.g. modifying the least significant bits in an image like the one below. Thus for every colour component of every pixel we might be able to hide a bit or two.
This alteration is not visible to the naked eye, but analysing the least significant bits might reveal patterns that expose the existence and possibly the content of a hidden message. To counter this we simply encrypt the message before embedding it in the image, which keeps the message safe and also helps prevent discovery of the existence of a hidden message.
Thus, in principle, steganography provides the following:
Hiding encrypted message in any kind of media data. (Images, music, video, etc.)
Complete deniability of the existence of a hidden message without the correct key.
Extraction of the hidden message with the correct key.
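A toy sketch of the LSB idea in Python with numpy (embed_lsb and extract_lsb are made-up names; a real tool would encrypt the message first, as described above):

import numpy as np

def embed_lsb(pixels, bits):
    # pixels: HxWx3 uint8 image; overwrite one least significant bit per colour value.
    flat = pixels.reshape(-1).copy()
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | bit
    return flat.reshape(pixels.shape)

def extract_lsb(pixels, n_bits):
    return [int(v & 1) for v in pixels.reshape(-1)[:n_bits]]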
Semacodes
Semacodes are a way of encoding data in a visual representation that may be printed, copied, and scanned easily. The Data Matrix shown below is an example of a semacode containing the famous Lorem Ipsum text. This is essentially a 2D barcode with a higher capacity than usual barcodes. Programs for generating semacodes are readily available, and ditto for software for reading them, especially for cell phones. Semacodes usually contain error correcting codes, are generally very robust, and can be read even when badly damaged.
Thus semacodes have the following properties:
Data encoding that may be printed and copied.
May be scanned and interpreted even in damaged (dirty) conditions, and generally a very robust encoding.
Combining it
So my idea is to create something that combines these two, with all of the combined properties. This means it would have to:
Embed an encrypted message in any media, probably a scanned image.
The message should be extractable even if the image is printed and scanned, and even partly damaged.
The existence of an embedded message should be undetectable without the key used for encryption.
So, first of all I would like to know whether any solutions, algorithms or research are available on this. Secondly, I would like to hear any ideas/thoughts on how this might be done.
I really hope to get a good discussion going on the possibilities and feasibility of implementing something like this, and I am looking forward to reading your answers.
Update
Thanks for all the good input on this. I will probably work a bit more on this idea when I have more time. I am convinced it must be possible. Think about research in embedding watermarks in music and movies.
I imagine part of the robustness of a semacode to damage/dirt/obscuration is the high contrast between the two states of any "cell". The reader can still make a good guess as to the actual state, even with some distortion.
That sort of contrast is not available in a photographic image, and is the very reason why steganography works - the lsb bit-flipping has almost no visual effect on the image itself, while digital fidelity ensures that a non-visual system can still very accurately read the embedded data.
As the two applications are sort of at opposite ends of the analog/digital spectrum (semacodes are all about being decipherable by analog (visual) processing but are on paper, not digital; steganography is all about the bits in the file and cares nothing for the analog representation, whether light or sound or something else), I imagine a combination of the two will be extremely difficult, if not impossible.
Essentially what you're thinking of is being able to steganographically embed something in an image, print the image, make a colour photocopy of it, scan it in, and still be able to extract the embedded data.
I'm afraid I can't help, but if anyone achieves this, I'll be DAMN impressed! :)
It's not a complete answer, but you should look at watermarking. This technique solves your first two goals (embeddable in a printed image, and readable even from a partly damaged scan).
Part of watermarking's robustness to distortion and transcription errors (from going from digital to analog and back) comes from redundancy (e.g. repeating the data several times). That would make the watermark detectable even without a key. However, you might be able to use redundancy techniques that are more subtle, maybe something related to erasure coding or secret sharing.
I know that's not a complete answer, but hopefully those leads will point you in the right direction!
What language/environment are you using? It shouldn't be that hard to write code that opens both the image and the semacode as bitmaps (the latter as monochrome), and sets the lowest bit(s) of each byte of each pixel in the colour image to the value of the corresponding pixel of the monochrome bitmap.
(Optionally expand the semacode bitmap first to the same pixel dimensions, extending with white.)
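A rough sketch of that with numpy and Pillow (the file names are placeholders; the semacode is assumed to be already rendered as a black-and-white image no larger than the cover photo):

import numpy as np
from PIL import Image

cover = np.array(Image.open("photo.png").convert("RGB"))
code = np.array(Image.open("semacode.png").convert("1"), dtype=np.uint8)

# Pad the semacode with white (1) up to the cover's pixel dimensions.
padded = np.ones(cover.shape[:2], dtype=np.uint8)
padded[:code.shape[0], :code.shape[1]] = code

# Write each semacode pixel into the lowest bit of the red channel.
stego = cover.copy()
stego[..., 0] = (stego[..., 0] & 0xFE) | padded
Image.fromarray(stego).save("stego.png")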
I recently learned about PDF417 barcodes and I was astonished that I can still read the barcode after I ripped it in half and scanned only a fragment of the original label.
How can the barcode decoding be that robust? Which (types of) algorithms are used during encoding and decoding?
EDIT: I understand the general philosophy of introducing redundancy to create robustness, but I'm interested in more details, i.e. how this is done with PDF417.
The PDF417 format allows for varying levels of duplication/redundancy in its content. The level of redundancy used will affect how much of the barcode can be obscured or removed while still leaving the contents readable.
PDF417 does not use anything. It's a specification for encoding data.
I think there is a confusion between the barcode format and the data it conveys.
The various barcode formats (PDF417, Aztec, DataMatrix) specify a way to encode data, be it numerical, alphabetic or binary... the exact content though is left unspecified.
From what I have seen, Reed-Solomon is often the algorithm used for redundancy. The exact level of redundancy is up to you with this algorithm and there are libraries at least in Java and C from what I've been dealing with.
Now, it is up to you to specify what the exact content of your barcode should be, including the algorithm used for redundancy and the parameters used by this algorithm. And of course you'll need to work hand in hand with those who are going to decode it :)
Note: QR seems slightly different, with explicit zones for redundancy data.
I don't know PDF417. I know that QR codes use Reed-Solomon correction. It is an oversampling technique. To get the concept: suppose you have a polynomial of degree 6. Technically, you need seven points to describe this polynomial uniquely, so you can perfectly transmit the information about the whole polynomial with just seven points. However, if one of these seven is corrupted, you lose the whole message. To work around this issue, you extract a larger number of points from the polynomial and write them down. As long as you have at least seven of them, that is enough to reconstruct your original information.
In other words, you trade space for robustness, by introducing more and more redundancy. Nothing new here.
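To make the oversampling idea concrete, a small numpy sketch (not actual Reed-Solomon, which works over finite fields rather than floating point, but the same recover-from-any-seven-points principle):

import numpy as np

# Degree-6 polynomial: its 7 coefficients are the "message".
rng = np.random.default_rng(0)
coeffs = rng.integers(-5, 5, size=7)

# Oversample: evaluate at 12 points instead of the minimum 7.
xs = np.arange(12)
ys = np.polyval(coeffs, xs)

# Drop 5 samples; any 7 survivors still determine the polynomial exactly.
survivors = [0, 2, 3, 6, 8, 10, 11]
recovered = np.polyfit(xs[survivors], ys[survivors], deg=6)
print(np.round(recovered).astype(int))  # rounding absorbs floating-point error
print(coeffs)                           # the same 7 numbers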
I do not think the concept of the trade-off between space and robustness is any different here than anywhere else. Think RAID, let's say RAID 5 - you can yank a disk out of the array and the data is still available. The price? An extra disk. Or in terms of the barcode - the extra space the label occupies.