Algorithm for matching data

I have a project where I am testing a device that is very sensitive to noise (electromagnetic, radio, etc.). The device generates 5-6 bytes per second of binary data (it looks like gibberish to the untrained eye) based on a given input (audio).
Depending on the noise, the device will sometimes miss characters, sometimes insert random characters, and sometimes do both several times over.
I have written an app that lets the user see, on the fly, the errors the device generates, as compared to a master file (i.e. what the device should output in ideal conditions). My algorithm basically takes each byte in the live data and compares it to the byte in the same position in the known master file. If the bytes don't match, I search a window of 10 characters either side of the current position for a nearby match. If one is found (and it passes a validation or two), I visually mark up the location in the UI and register an error.
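To make the approach concrete, here is a minimal Python sketch of the windowed comparison described above. The function name, the error labels and the single-byte resync test are my own assumptions for illustration; the extra validation steps mentioned above are not shown.

```python
def compare_streams(live, master, window=10):
    """Return a list of (kind, live_index, master_index) error records."""
    errors = []
    i = j = 0  # i indexes the live data, j indexes the master file
    while i < len(live) and j < len(master):
        if live[i] == master[j]:
            i += 1
            j += 1
            continue
        # Mismatch: look for the smallest shift that re-aligns the streams.
        resynced = False
        for shift in range(1, window + 1):
            # Device dropped bytes: live[i] matches further ahead in master.
            if j + shift < len(master) and live[i] == master[j + shift]:
                errors.append(("missing", i, j))
                j += shift
                resynced = True
                break
            # Device inserted bytes: master[j] matches further ahead in live.
            if i + shift < len(live) and live[i + shift] == master[j]:
                errors.append(("inserted", i, j))
                i += shift
                resynced = True
                break
        if not resynced:
            # No nearby match: treat it as a single corrupted byte.
            errors.append(("corrupt", i, j))
            i += 1
            j += 1
    return errors
```

A single matching byte is a weak resync signal, so a real version would probably require a few consecutive matches before accepting the shift.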
This approach works reasonably well and, given the speed of the incoming data, works in real time as well. However, I feel that what I am doing is not optimal, and that the approach would fall apart if the data streamed at higher rates.
Are there other approaches I could take? Are there known algorithms for this type of thing?
I read many years ago that NASA's data collection operations (e.g. the ones that communicate with craft in space and on the Moon/Mars) have had a 0.00001% loss of data despite tremendous interference in space.
Any ideas?

I presume the main interest is the signal generated by the device? Which is more important: detecting when an error has occurred, or making the signal robust against such errors? I have been doing a lot of signal processing lately, and denoising a signal is part of my routine: I'm basically trying to estimate the real signal and remove any contaminants.
I don't know how the signal generated by the device is used downstream. If it's being recorded to a computer, you can easily apply some denoising; try wavelet denoising, for instance. You will find packages for this in several languages.


How to scale an algorithm/service/system with multiple machines?

I have had some interviews recently, and it's quite common to be asked scaling questions.
For example: you have a long list of words (a dictionary) and a list of characters as inputs. Design an algorithm to find the shortest word in the dictionary that contains all the characters in the list. Then the interviewer asks how to scale your algorithm across multiple machines.
Another example: you have designed a traffic light control system for one intersection in a city. How do you scale this control system to the whole city, which has many intersections?
I never have any idea how to approach this kind of "scale" problem. Any suggestions and comments are welcome.
Your first question is completely different from your second question. In fact, the control of traffic lights in cities is a local operation. There are boxes nearby that you can tune, and an optical sensor on top of the light that detects waiting cars. I guess if you need to optimize some objective function of traffic flow, you can route information to a server process, and then the question becomes how to scale this server process over multiple machines.
I am no expert in the design of distributed algorithms, which spans a whole field of research. But the questions in undergrad interviews are usually not that specialized. After all, they are not interviewing a graduate student specializing in those fields. Your first question, for example, is quite generic.
Normally these questions involve multiple data structures (several lists and hashtables) interacting (joining, iterating, etc.) to solve a problem. Once you have worked out a basic solution, scaling is basically copying that solution onto many machines and running the copies on partitions of the input at the same time. (Of course, in many cases this is difficult if not impossible, but interview questions won't be that hard.)
That is, you have many identical workers splitting the input workload and working at the same time, but those workers are processes on different machines. That brings in the problems of communication protocols, network latency and so on, but we will ignore these to get to the basics.
The most common way to scale is to let each worker hold a copy of the smaller data structures and to split the larger data structures among the workers as workload. In your example (first question), the list of characters is small, so you would give each worker a copy of the list and a portion of the dictionary to check against it. Notice that the other way around won't work: if every worker holds the whole dictionary, the workers consume a large amount of memory in total, and scaling up saves you nothing.
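As a rough Python sketch of that idea: every worker gets the full character list and one slice of the dictionary, and a final step combines the partial answers. All the names here are made up for illustration, the workers run sequentially in one process rather than on separate machines, and I am assuming that repeated characters in the list must be matched with repeats in the word.

```python
from collections import Counter

def contains_all(word, chars):
    """True if word contains every character in chars, counting repeats."""
    need = Counter(chars)
    have = Counter(word)
    return all(have[c] >= n for c, n in need.items())

def worker(dict_slice, chars):
    """One worker: the shortest matching word in its slice, or None."""
    matches = [w for w in dict_slice if contains_all(w, chars)]
    return min(matches, key=len) if matches else None

def partition(words, n_workers):
    """Split the dictionary into n_workers roughly equal slices."""
    return [words[k::n_workers] for k in range(n_workers)]

def shortest_word(words, chars, n_workers=4):
    # Each call to worker() stands in for a process on a separate machine.
    partials = [worker(s, chars) for s in partition(words, n_workers)]
    # Combine step: pick the shortest of the per-worker answers.
    candidates = [p for p in partials if p is not None]
    return min(candidates, key=len) if candidates else None
```

Only the small per-worker results travel back over the network, which is what makes this split cheap to combine.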
If your problem gets larger, you may need more layers of splitting, which also implies that you need a way of combining the outputs from the workers that take in the split input. This is the general concept and motivation behind the MapReduce framework and its derivatives.
Hope it helps...
For the first question, start with how to search for words that contain all the characters in the list in a way that can run on different machines at the same time (not yet the shortest). I would use map-reduce as the basis.
First, this problem really can run on different machines at the same time, because each word in the database can be checked independently: to check one word you don't have to wait for the previous word or the next word, so you can literally send each word to a different computer to be checked.
Using map-reduce, you map each word as a value and check whether it contains every character in the list.
Map(Word, keyout, valueout) {
    // Word comes from the database; keyout and valueout are the inputs to Reduce.
    if (Word contains every char in the char list) {
        // Send the word to a shared output file, which a framework such as Hadoop manages.
        sharedOutput(Key, Word);
    }
}
After the Map step has run, all the words you want from the database are in the shared file. The reduce step can then be something simple that reduces them by length. And ta-da, you get the shortest one.
As for the second question, multithreading comes to mind. The intersections are really independent problems: each intersection has its own timer, right? So to handle a large number of intersections, you can use multithreading.
In simple terms, you use the cores of the processor to control intersections in parallel rather than looping through all the intersections one by one. Allocating the intersections across the cores makes the process faster.

Mac OS X: Audio frequency shift by change of sample rate?

I want to change the frequency of a voice recording by changing sample rate on Mac OS X.
This is a research project aimed at people who stutter. It's essential that the latency is very low – this is, for instance, why I'm not considering Fast Fourier Transforms. Instead, I want to collect samples at a rate of, say, 44kHz, then do one of two things:
1) Play the samples back twice as slowly (i.e. 22kHz). This will result in increasing asynchrony with the source. It would be useful if I can restart the sampling every 1 second or so to prevent the asynchrony from becoming too noticeable.
2) Play the samples back twice as quickly. Obviously, it's impossible to do this continuously (i.e. can't play back samples which haven't been collected yet). To get around this, I'm intending to gate the playback with a square wave. Samples will be played back twice as quickly as they were recorded during the peak of the square wave. Nothing will be heard (but samples will still be collected) during the trough of the square wave.
I've prepared a PDF which describes the project in more detail here:
A friend has helped me with some of the programming for this using PortAudio. Unfortunately, we're getting very long latencies. I think this might be because PortAudio is working at too high a level. From the code, it looks to me as if PortAudio is buffering the incoming audio stream and then making alterations which are prima facie similar to the ones I've described above, but which are in fact operations on the buffered stream.
This isn't what I want at all. It's essential that the processing unit does as little as possible. Referring to the conditions (1) and (2) above, all the computer should do is to (1) play back the samples without any manipulation but twice as slowly; or (2) store the incoming samples then play them back twice as quickly. There should be no other processing whatsoever. I think this is the only way I'll get the very low latencies I'm looking for.
I wondered if it would be better to try doing this directly in Core Audio for OS X, rather than using PortAudio? This would limit platform compatibility. But the low latency is much more important than compatibility.
Am I likely to be able to do what I want using a mid-level service, such as Audio Units? Or would I need to write directly for a low-level service such as I/O Kit? How would I go about it?
It looks like the best thing for you would be to use something like Max/MSP or Pure Data. This will let you avoid working with text-based languages and should let you rapidly develop what you're looking to do. I/O Kit is a bit too low-level for what you're trying to do.
Since Max is not a text-based language, sharing the code itself is a bit tricky on sites like Stack Overflow. I've included a screengrab. You can copy and paste Max code, but it's a bit ugly and inappropriate for this.
Here's a quick description. The box that says rect~ 1 is generating a square wave at Hz. The snapshot~ box captures the values this spits out. The if boxes check when the value is greater than zero or less than zero (peaks and troughs). On a trough, the record~ box records the signal from the microphone box and stores it in a buffer. The groove~ box is a sampler that plays back the audio in this buffer; when it receives a bang from the if box, it plays back the audio. The sig~ box is used to control the playback rate.
Also, you may not know this but the .PDF you're trying to share is unavailable.
One other thing, if latency is important, you should learn about something called a click train. This is basically where you send a signal with a single 1 at the start and time how long it takes for that value to get through your system.
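A minimal Python sketch of that measurement, assuming you can capture the system's output as a list of samples. The function names, the threshold and the fake_system stand-in (a pure delay used only to exercise the code) are hypothetical.

```python
def make_impulse(n_samples):
    """Test signal: a single 1.0 followed by silence."""
    return [1.0] + [0.0] * (n_samples - 1)

def measure_latency(output, sample_rate, threshold=0.5):
    """Seconds until the impulse shows up in output, or None if it never does."""
    for index, sample in enumerate(output):
        if abs(sample) >= threshold:
            return index / sample_rate
    return None

def fake_system(signal, delay_samples):
    """Hypothetical stand-in for the audio chain: a pure delay."""
    return ([0.0] * delay_samples + signal)[:len(signal)]
```

For example, a 441-sample delay at a 44100 Hz sample rate comes out as 0.01 s.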

Neural network and algorithm(s), predicting future outcome from past

I am working on an algorithm where I am given some inputs and the outputs for them. Given the outputs for 3 months (give or take), I need a way to find or calculate what the future output might be.
This problem can be related to the stock exchange: we are given certain constraints and certain outcomes, and we need to find the next one.
I stumbled upon neural network stock market prediction, you can Google it, or you can read about it here, here and here.
To get started on the algorithm, I couldn't figure out what the structure of the layers should be.
The given constraints are:
The output would always be integer.
The output would always be between 1 and 100.
There is no exact input as such. Just like the stock market, we only know that the stock price fluctuates between 1 and 100, so we might (or might not?) consider this the only input.
We have records for the last 3 months (or more).
Now, my first question is: how many nodes should I use for the input layer?
The output is just one node, fine. But as I said, should I use 100 nodes for the input layer (given that the stock price is always an integer between 1 and 100)?
What about the hidden layer? How many nodes there? If I use 100 nodes there too, I don't think that would train the network well, because I think that for each input we also need to take all previous inputs into account.
Say we are calculating the output for the 1st day of the 4th month: we should have 90 nodes in the hidden/middle layer (assuming each month is 30 days for simplicity). Now there are two cases:
Our prediction was correct and outcome was same as we predicted.
Our prediction failed, and the outcome was different than what we predicted.
Whichever is the case, when we calculate the output for the 2nd day of the 4th month, we need not only those 90 inputs but also the last actual result (not the prediction, even if they are the same), so we now have 91 nodes in our middle/hidden layer.
And so on: the number of nodes would keep increasing each day, AFAICT.
So, my other question is: how do I define the number of nodes in the hidden/middle layer if it changes dynamically?
My last question: is there any other algorithm for this kind of thing that I am not aware of, and that I should be using instead of messing around with neural networks?
Lastly, is there anything I might be missing that could cause the algorithm I am building to mispredict the output? Any caveats, or anything that might make it go wrong?
There is much to say in answer to your question. In fact, it addresses the problem of time series forecasting in general, and the application of neural networks to this task. I'm writing down only the most important points here; after reading this you should dig into Google's results for the query time series prediction neural network. There are many works where the principles are covered in detail. A variety of software implementations (with source code) also exist (here is just one example with code in C++).
1) I must say that the problem is 99% about data preprocessing and choosing the correct input/output factors, and only 1% about the concrete instrument you use, whether neural networks or something else. As a side note, neural networks can internally implement most other data analysis methods. For example, you can use a neural network for Principal Component Analysis (PCA), which is closely related to SVD, mentioned in another answer.
2) It's very rare for input/output values to fit strictly inside a specific region. Real-life data can be considered unbounded in absolute value (even if its changes seem to form a channel, the channel can be broken in a moment), but a neural network can operate only in stable conditions. This is why the data is normally converted into increments first (by calculating the delta between the i-th point and the (i-1)-th, or taking the log of their ratio). I suggest you do this with your data anyway, even though you state it lies inside the [1, 100] region. If you don't, the neural network will most likely degenerate into a so-called naive predictor, which produces a forecast in which each next value equals the previous one.
The data is then normalized into [0, 1] or [-1, +1]. The second is appropriate for time series prediction, where +1 denotes a move up and -1 a move down. Use the hyperbolic tangent (tanh) activation function for the neurons in your net.
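A small Python sketch of the preprocessing in point 2, assuming the series is a plain list of positive numbers. The function names are mine, and dividing by the largest absolute value is just one simple way to land in [-1, +1] while keeping the sign of each move.

```python
import math

def differences(series):
    """Delta between each point and the previous one."""
    return [b - a for a, b in zip(series, series[1:])]

def log_returns(series):
    """Log of the ratio between each point and the previous one."""
    return [math.log(b / a) for a, b in zip(series, series[1:])]

def scale_to_unit(values):
    """Scale into [-1, +1] by dividing by the largest absolute value."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]
```

In practice the scaling factor would be computed on the training data only and then reused for new points; otherwise a later, larger move would change the scale of everything before it.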
3) You should feed the NN with input data obtained from a sliding window of dates. For example, if you have data for a year and every point is a day, you should choose the size of the window (say, a month) and slide it day by day, from the past to the future. The day just at the right bound of the window is the target output for the NN. This is a very simple approach (there are much more complicated ones); I mention it only because you ask how to handle data that arrives continuously. The answer is that you don't need to change or enlarge your NN every day. Just use a constant structure with a fixed window size and "forget" (do not provide to the NN) the oldest point. It's important that you do not treat all the data you have as a single input, but divide it into many small vectors and train the NN on them, so the net can generalize from the data and find regularities.
4) The size of the sliding window is your NN input size. The output size is 1. You should experiment with the hidden layer size to find the best performance. Start with a value somewhere between the input and output sizes, for example sqrt(in*out).
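Points 3 and 4 can be sketched in Python roughly like this. The names are hypothetical, and I am assuming the target is the point immediately after each window.

```python
import math

def sliding_windows(series, window):
    """Return (inputs, target) pairs: `window` consecutive points and the next one."""
    pairs = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        pairs.append((inputs, target))
    return pairs

def initial_hidden_size(n_inputs, n_outputs=1):
    """Starting guess for the hidden layer size: sqrt(in * out), rounded."""
    return max(1, round(math.sqrt(n_inputs * n_outputs)))
```

With a 30-day window and one output this gives a starting hidden layer of 5 nodes, and 90 days of history yields 60 training vectors instead of a single 90-value input.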
According to recent research, Recurrent Neural Networks seem to work better for time series forecasting tasks.
I agree with Stan when he says
1) I must say that the problem is 99% about data preprocessing
I've applied Neural Networks for 25+ years to various aerospace applications including helicopter flight control - setting up the input/output data set is everything - all else is secondary.
I'm amazed by smirkman's comment that Neural Networks were quickly dropped "as they produced nothing worthwhile". That tells me that whoever was working with Neural Networks had little experience with them.
Given that the topic discusses neural network stock market prediction - I'll say that I've made it work. Test results are downloadable from my website at
I don't give away how it was done but there's enough interesting data that should make you want to explore using Neural Networks more seriously.
This kind of problem was particularly well researched by the thousands of people who wanted to win the $1M Netflix Prize.
Earlier submissions were often based on K Nearest Neighbours. Later submissions were made using Singular Value Decomposition, Support Vector Machines and Stochastic Gradient Descent. The winner used a blend of several techniques.
Reading the excellent Community forums will give you many insights about the best methods to predict the future from the past. You'll also find loads of source code for the different methods.
Amusingly, neural networks were quickly dropped, as they produced nothing worthwhile (and I personally have yet to see a non-trivial NN produce anything of value).
If you are starting out, I'd suggest SVD as a first path; it's quite easy to make and often produces surprising insights into data.
Good luck!

Comparing 2 one dimensional signals

I have the following problem: I have 2 signals over time. They are from the same source so they should be the same. I want to check if they really are.
They may be measured with different sample rates.
The start and end times do not correlate: the measurements do not start at the same time or end at the same time.
There may be a time offset between the two signals.
My thoughts go along Fourier transformation, convolution and statistical methods for comparison. Can someone post me some links where I can find more information on how to handle this?
You can easily correct for the phase by just shifting them so their centers of mass line up. (Or alternatively, in the Fourier domain just multiplying by the inverse of the phase of the first coefficient.)
Similarly, if you want to line up the signals given only partial data, you can just cross-correlate them and take the lag with the maximal value (which is again easy to do in the Fourier domain).
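Here is a direct (non-Fourier) Python sketch of that cross-correlation search, assuming both signals are already at the same sample rate. The function name and the max_lag parameter are illustrative only.

```python
def best_lag(a, b, max_lag):
    """Lag in samples by which b is delayed relative to a, maximising the correlation."""
    def score(lag):
        # Sum of a[i] * b[i + lag] over the region where the signals overlap.
        return sum(a[i] * b[i + lag]
                   for i in range(len(a))
                   if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=score)
```

This costs on the order of len(a) * max_lag multiplications, which is why the Fourier-domain version is preferred for long signals. Subtracting each signal's mean first usually makes the peak easier to find.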
That leaves the only tricky part of this process as dealing with the sampling rates. Now if you know a-priori what the sample rates are, (and if they are related by a rational number), you can just use sinc interpolation/downsampling to rescale them to a common sampling rate:
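A bare-bones Python sketch of sinc interpolation onto a new sample grid. The names are hypothetical and the sum runs over the finite set of samples you have, so the result is only approximate near the ends. When going down in rate, the signal would need low-pass filtering first, which is not shown here.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def resample(samples, rate_in, rate_out):
    """Whittaker-Shannon interpolation of samples onto a grid at rate_out."""
    n_out = int(len(samples) * rate_out / rate_in)
    out = []
    for m in range(n_out):
        t = m * rate_in / rate_out  # output position, in units of input samples
        out.append(sum(x * sinc(t - n) for n, x in enumerate(samples)))
    return out
```

Once both signals are on a common rate, they can be compared sample by sample.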
If you don't know the sampling rate, you may be a bit screwed. Technically, you can try just brute forcing over all the different rescalings of your signal, but doing this tends to be either slow or else give mediocre results.
As a last suggestion, if you just want to match sounds exactly you can try using the cepstrum and verifying that the peaks of the signal are close enough to within some tolerance. This type of analysis is used a lot in sound and speech recognition, with some refinements to make it operate a bit more locally. It tends to work best with frequency modulated data like speech and music:
Fourier transformation does sound like the right way.
There is too much mathematical information for me to just start explaining here so if you really wanna know what's going on with that (cause I don't think you can just use FT without understanding it) you should use this reference from MIT OpenCourseWare:
Hope it helped.
If you are working with a linux box and the waveforms that need to be processed have already been recorded, you can try to use the file command to display details about the recording. It gives you the sampling rate when it is invoked on a wav file, though I am not sure what format you are recording in.
If the signals are time-shifted with respect to each other, you may try convolving one with a delta function at increasing delays and then comparing. In MATLAB, conv and the like should be good enough.
These are just 'crude' attempts (almost like hacking at the problem). There may be algorithms that are shift-invariant that may do a better job.
Hope that helps.

error correcting codes aimed at slow CPUs transmitting to fast CPUs

I'm looking for a forward error-correcting code that is relatively easy/fast to encode on a microcontroller; decode will be done on a PC so it can be more complicated.
I don't know that much about error-correcting codes and except for the simple Hamming codes they seem to all be more complicated than I can handle.
Any recommendations?
edit: I'm going to cut things short and accept Carl's answer... I guess there were two things I didn't mention:
(1) I don't strictly need the error correction, it's just advantageous for me, and I figured that there might be some error correction algorithm that was a reasonable benefit for a minimal cost. Hamming codes are probably about the right fit and even they seem like they might be too costly for my encoding application.
(2) The greater advantage than the error correction itself is the ability to resync properly to packets that follow an error. (if I get out of sync for a long time, that's bad) So I think it's just better if I keep things simple.
I haven't quite gotten straight how much overhead you can afford. In your comment you say a 16-bit error detection/correction code is about right, but you don't specify how large a block you're thinking of attaching that to. To be meaningful, you should probably express the allowable overhead as a percentage. 16 bits of error correction for 64 bits of data is a lot different from 16 bits of error correction for a kilobyte of data.
If you can afford something like 15-20% overhead or so, you can probably use a convolutional code with a Viterbi decoder. This is highly asymmetrical: the convolutional coder is quite simple (basically a shift register, with output taps leading to XORs). A really large one might use a 16-bit register with a half dozen or so XORs.
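To show how little the encoder has to do, here is a Python sketch of a small convolutional encoder. The parameters are my own choice for illustration: a 3-bit register with the common tap pair 7 and 5 (octal), which emits two output bits per input bit. That is far more overhead than the figure above, which would need a higher-rate or punctured code that is not shown here.

```python
def conv_encode(bits, taps=(0b111, 0b101), constraint_len=3):
    """Convolutional encoder: a shift register with XOR taps, len(taps) outputs per input bit."""
    register = 0
    mask = (1 << constraint_len) - 1
    out = []
    for bit in bits:
        # Shift the new bit in at the low end of the register.
        register = ((register << 1) | bit) & mask
        for tap in taps:
            # Each output bit is the XOR (parity) of the tapped register positions.
            out.append(bin(register & tap).count("1") & 1)
    return out

# conv_encode([1, 0, 1, 1]) -> [1, 1, 1, 0, 0, 0, 0, 1]
```

On a microcontroller the same loop is a shift, an AND and a parity computation per output bit, with no tables and almost no RAM.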
Fortunately you have a heavier-duty computer to handle the decoding, because a Viterbi decoder can be a fearsome beast. Especially as you use a larger encoder to reduce the overhead, the size of the decoder explodes. The size of the decoder is exponential with respect to the size of the code group.
Turbo codes were mentioned. These can make better use of available bandwidth than convolutional codes with Viterbi decoders -- but they use a considerably more complex encoder -- a minimum of two convolutional coders of a specific type (recursive systematic convolutional encoders). As such, they don't seem to fit your specification as well.
The problem with simple error-correcting codes is that they'll let you recover from single-bit or maybe 2-bit errors, but they usually won't detect or patch up major damage.
Thus, my recommendation would be instead to divide your data streams up into blocks (1 KB, 10 KB, ... 1 MB at most) and calculate a checksum for each block. Then, when the data arrives on the other CPU, you can ascertain whether it is correct and request retransmission of that block if not. So the receiving computer would either acknowledge and wait for the next block, or negative-acknowledge and expect a re-send.
Yes, we're implementing a subset of TCP/IP here. But there's a reason this protocol was so successful: It works!
For a checksum, I'd recommend CRC-32. It requires a table of (I think) 256 32-bit numbers and some reasonably easy computation (array indexing, shifts and XOR, mostly), so it's fairly easy for a "dumb" CPU to compute.
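For reference, a table-driven CRC-32 looks roughly like this in Python. This is the common reflected variant with polynomial 0xEDB88320, as used by zlib and Ethernet; on the microcontroller you would normally store the table as a constant rather than build it at startup.

```python
def make_crc32_table():
    """Build the 256-entry lookup table for the reflected CRC-32 polynomial."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table

CRC32_TABLE = make_crc32_table()

def crc32(data):
    """CRC-32 of a bytes object: one table lookup, one shift and two XORs per byte."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = CRC32_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

# crc32(b"123456789") -> 0xCBF43926, the standard check value
```

The sender appends the 4-byte result to each block, and the receiver recomputes it and compares.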
I'd suggest using a packet-based form of forward-error correction. If you have to send six equal-length packets, send each of them with enough information to identify it as "packet 1 of 6", "2 of 6", etc. along with one more packet whose first payload byte is the xor of the first payload byte of packets 1-6, whose second payload byte is the xor of the second byte of packet 1-6, etc. Code which receives any six packets out of the seven total will be able to reconstruct the missing one. As a slight enhancement, use one "parity" packet for the "even-numbered" packets and another for the "odd" ones. If you do that, the system will be able to recover from any burst error which is no longer than a packet.
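A short Python sketch of the single-parity-packet scheme, assuming equal-length payloads and that the "n of 6" numbering is handled elsewhere. The function names are made up for illustration.

```python
def xor_packets(packets):
    """Byte-wise XOR of equal-length packets."""
    result = bytearray(len(packets[0]))
    for packet in packets:
        for k, byte in enumerate(packet):
            result[k] ^= byte
    return bytes(result)

def make_parity(packets):
    """Sender side: the extra packet is the XOR of all data packets."""
    return xor_packets(packets)

def recover_missing(received, parity):
    """Receiver side: XOR the surviving data packets with the parity packet."""
    return xor_packets(list(received) + [parity])
```

The encoder cost is one XOR per payload byte, so it suits a small microcontroller. The even/odd variant simply runs make_parity twice, once over each half of the packets.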
