I have almost completed my open source DCF77 decoder project. It all started when I noticed that the standard (Arduino) DCF77 libraries perform very poorly on noisy signals. In particular, I was never able to get the time out of the decoders when the antenna was close to the computer or when my washing machine was running.
My first approach was to add a (digital) exponential filter + trigger to the incoming signal.
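In code, the idea was roughly this (a minimal sketch; the smoothing constant and trigger thresholds are illustrative, not the exact values I used):
#include <stdint.h>

// Exponential smoothing of the sampled DCF77 input plus a hysteresis trigger.
static int16_t filtered = 0;    // smoothed signal level, 0..256
static bool    decoded  = false;

bool filter_and_trigger(bool raw_sample) {
    const int16_t target = raw_sample ? 256 : 0;
    filtered += (target - filtered) / 8;    // exponential filter, alpha = 1/8
    if (filtered > 192) decoded = true;     // upper trigger threshold
    if (filtered <  64) decoded = false;    // lower threshold (hysteresis)
    return decoded;
}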
Although this improved the situation significantly, it was still not really good. Then I started to read some standard books on digital signal processing and especially the original works of Claude Elwood Shannon. My conclusion was that the proper approach would be to not "decode" the signal at all because it is (except for leap seconds) completely known a priori. Instead it would be more appropriate to match the received data to a locally synthesized signal and just determine the proper phase. This in turn would reduce the effective bandwidth by some orders of magnitude and thus reduce the noise significantly.
Phase detection implies the need for fast convolution. The standard approach for efficient convolution is of course the fast Fourier transform. However, I am implementing this for the Arduino / ATmega328, so I have only 2 kB of RAM. So instead of the straightforward approach with an FFT, I started stacking matched phase locked loop filters. I documented the different project stages here:
First try: exponential filter
Start of the better approach: phase lock to the signal / seconds ticks
Phase lock to the minutes
Decoding minute and hour data
Decoding the whole signal
Adding a local clock to deal with signal loss
Using local synthesized signal for faster lock reacquisition after signal loss
I searched the internet quite extensively and found no similar approach. Still I wonder if there are similar (and maybe better) implementations. Or if there exist research on this kind of signal reconstruction.
What I am not searching for: designing optimized codes for getting close to the Shannon limit. I am also not searching for information on the superimposed PRNG code on DCF77. I also do not need hints on "matched filters" as my current implementation is an approximation of a matched filter. Specific hints on Viterbi Decoders or Trellis approaches are not what I am searching for - unless they address the issue of tight CPU and RAM constraints.
What I am searching for: are there any descriptions / implementations of other non-trivial algorithms for decoding signals like DCF77, with limited CPU and RAM in the presence of significant noise? Maybe in some books or papers from the pre internet era?
Have you considered using a chip matched filter to perform your convolution?
http://en.wikipedia.org/wiki/Matched_filter
They are almost trivially easy to implement, as each chip / bit period can be implemented as an add/subtract delay line (use a circular buffer).
A simple one for a square wave (it will also work, though less optimally, with other waveforms) of unknown sequence but known frequency can be implemented something like this:
// Circular buffer: insert() returns the element that falls out of the delay line
#include <array>

template <int length>
class CircularBuffer {
public:
    // constructor
    CircularBuffer() : element_pos(0) {
        buffer.fill(0);
    }
    // destructor
    ~CircularBuffer() {}

    int insert(int new_element) {
        int oldest = buffer[element_pos];   // value dropping out of the delay line
        buffer[element_pos] = new_element;
        element_pos += 1;
        if (element_pos == length) {
            element_pos = 0;
        }
        return oldest;
    }

private:
    std::array<int, length> buffer;
    int element_pos;
};

// Filter class (defined after CircularBuffer, which it uses as a member)
template <int samples_per_bit>
class matchedFilter {
public:
    // constructor
    matchedFilter() : acc(0) {}
    // destructor
    ~matchedFilter() {}

    int filterInput(int next_sample) {
        int temp = sample_buffer.insert(next_sample);   // sample from one bit period ago
        temp -= next_sample;
        temp -= result_buffer.insert(temp);             // result from one bit period ago
        return temp;
    }

private:
    int acc;   // accumulator (unused in this simple version)
    CircularBuffer<samples_per_bit> sample_buffer;
    CircularBuffer<samples_per_bit> result_buffer;
};
As you can see, resource-wise this is relatively trivial. If there is a specific waveform you're after, you can cascade these together to give a longer correlation.
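For illustration, a hypothetical use of the filter above might look like this (the 100 samples per bit period and the peak tracking are arbitrary choices, not part of any particular decoder):
// Hypothetical usage: run a block of samples through the filter and track the
// strongest response; a large response suggests alignment with a bit edge.
int strongest_response(const int* samples, int count) {
    matchedFilter<100> filter;              // 100 samples per bit period (arbitrary)
    int best = 0;
    for (int i = 0; i < count; ++i) {
        int response = filter.filterInput(samples[i]);
        if (response > best) best = response;
    }
    return best;
}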
The reference to matched filters by Ollie B. is not what I was asking for. I already covered this before in my blog.
However, by now I have received a very good hint by private mail. There exists a paper, "Performance Analysis and Receiver Architectures of DCF77 Radio-Controlled Clocks" by Daniel Engeler. This is the kind of stuff I am searching for.
With further searches starting from the Engeler paper, I found the following German patents: DE3733966A1 - Anordnung zum Empfang stark gestoerter Signale des Senders dcf-77 (arrangement for receiving heavily disturbed signals of the DCF77 transmitter) and DE4219417C2 - Schmalbandempfänger für Datensignale (narrowband receiver for data signals).
Related
I have a problem which is more easily solved with an HLS tool than by writing down the raw VHDL / Verilog. Currently I'm using a Xilinx Virtex-7, although I think this has already been solved by some other vendors.
I can use VHDL 2008.
So imagine in VHDL you have many calculations such as:
p1 <= a * b - c;
p2 <= p1 * d - e;
p3 <= p2 * f - g;
p4 <= p2 * p1 - p3;
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times. Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain, especially when resizing signed vectors down.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with an HLS tool. Ideally I would like it to handle the widths and bit-shift the data accordingly.
Does such a tool exist? Which one would you recommend?
Bonus points:
Do any of these tools handle floating point maths and let you control precision?
There are lots of ways to accomplish your goal. But first to address your points.
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times.
Not necessarily. If your inputs a through g are all fixed point, you can use ieee.numeric_std or in VHDL-2008 you can use ieee.fixed_pkg. These will infer DSP cores (such as the DSP48 on Xilinx). For example:
-- Assume a, b, and c are all signed integers (or implicit fixed point)
signal a : signed(7 downto 0);
signal b : signed(7 downto 0);
signal c : signed(7 downto 0);
signal p1 : signed(a'length+b'length downto 0); -- a*b needs a'length + b'length bits; the subtraction of c adds one more, hence a'length + b'length + 1 bits in total.
...
p1 <= a*b - resize(c, p1'length);
This will imply multipliers and adders.
And this can be similarly done with UFIXED or SFIXED. But you do need to track the bit widths.
Also, there is a floating point package (ieee.float_pkg), but I would NOT recommend that for hardware. You are better off timing and resource-wise to implement it in fixed point.
Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain.
You can do this automatically. Look at my example above. You can easily determine widths based on the operations. Multiplications sum the number of bits. Additions add a single bit. So, if I have:
y <= a * b;
Then I can derive the length of y as simply a'length + b'length. It can be done. The issue, however, is bit growth. The chain of operations you describe will grow significantly if you keep full precision. At certain points you will need to truncate or round to reduce the number of bits. This is the hard part, as how much error you can tolerate depends upon the algorithm and the expected input data.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with an HLS tool. Ideally I would like it to handle the widths and bit-shift the data accordingly.
Automatic handling is the hard part. In VHDL this will not happen (nor Verilog for that matter). But you can track it fairly well and have bit widths update as necessary. However, it will not automatically handle things like rounding, truncation, and managing error bounds. A DSP engineer should be handling those issues and directing the RTL developer on the appropriate widths and when to round or truncate.
Does such a tool exist? Which one would you recommend?
There are a variety of options to do this at a higher level. None of these are particularly frugal with respect to resources. Matlab has a code generation tool that will convert Matlab models (suitably constructed) into RTL. It will even analyze issues such as rounding, truncation, and determine appropriate bit widths. You can control the precision, but it is fixed point. We've played with it, and found it very far from producing efficient, high-speed code.
Alternatively, Xilinx does have an HLS suite (see Vivado). I'm not all that well versed in the methodology, but as I understand it, it allows writing C code to implement algorithms. The C code is then "synthesized" to something that executes in some sort of execution engine. You still have to interface that C code to RTL infrastructure, and that's a challenge in its own right. The reason we have so far not pursued it heavily (even though we do DSP-heavy designs) is that it is a big challenge to simulate both the HLS and RTL together as a system.
In the past I found flopoco to generate arbitrary math functions in hardware. If I recall correctly, it supports many types of functions. For instance, it could generate an arithmetic core to compute something like a=3*sin²(x+pi/3). For these calculations it allows you to specify the overall precision of the inputs/outputs (for floating point / fixed point) or the width of the inputs (integer). The execution frequency and whether or not to pipeline the function can also be specified.
Here is an old tutorial I found on how to use it: tutorial
I'd like to know if someone has experience in writing a HAL AudioUnit rendering callback that takes advantage of multi-core processors and/or symmetric multiprocessing.
My scenario is the following:
A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its rendering callback) takes care of additively synthesizing n sinusoid partials with independent individually varying and live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested loop method (per partial, per frame, per channel).
However, upon reaching a certain upper limit for the number of partials "n", the processor gets overloaded and starts producing drop-outs, while three other processors remain idle.
Aside from the general discussion about additive synthesis being "processor expensive" in comparison to, let's say, wavetable synthesis, I need to know whether this can be resolved the right way, by taking advantage of multiprocessing on a multi-processor or multi-core machine. Breaking the rendering thread into sub-threads does not seem the right way, since the render callback is already a time-constrained thread in itself, and the final output has to be sample-accurate in terms of latency. Has someone had positive experience and valid methods in resolving such an issue?
System: 10.7.x
CPU: quad-core i7
Thanks in advance,
CA
This is challenging because OS X is not designed for something like this. There is a single audio thread - it's the highest priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread). I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
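As a rough sketch of the kind of lock-free structure meant here, a generic single-producer / single-consumer ring buffer (plain C++, not tied to any CoreAudio API) could look like this:
#include <atomic>
#include <cstddef>

// The worker thread push()es rendered samples, the audio render callback
// pop()s them; neither side ever blocks or allocates. N must be a power of two.
template <size_t N>
class SpscRing {
public:
    bool push(float v) {
        size_t w = write_.load(std::memory_order_relaxed);
        size_t next = (w + 1) & (N - 1);
        if (next == read_.load(std::memory_order_acquire)) return false;  // full
        buf_[w] = v;
        write_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(float& v) {
        size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire)) return false;    // empty
        v = buf_[r];
        read_.store((r + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    float buf_[N] = {};
    std::atomic<size_t> read_{0}, write_{0};
};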
Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.
If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
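For illustration, assuming the per-sample data is already laid out in contiguous float arrays, the inner partial loop could be replaced by a sketch like this (vvsinf comes from vForce, also part of Accelerate; buffer setup and phase updates are omitted):
#include <Accelerate/Accelerate.h>

// Compute one output sample of an oscillator bank with vectorized calls.
// 'sines' and 'products' are caller-provided scratch buffers of length n.
float render_one_sample(const float *phases, const float *amps,
                        float *sines, float *products, int n) {
    float sum = 0.0f;
    vvsinf(sines, phases, &n);                      // sines[i] = sinf(phases[i])
    vDSP_vmul(sines, 1, amps, 1, products, 1, n);   // products[i] = sines[i] * amps[i]
    vDSP_sve(products, 1, &sum, n);                 // sum over all partials
    return sum;
}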
As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render using various methods (GCD, OpenCL, etc.) and they were all either a no-go or unpredictable. There is (or at least WAS... I have not checked recently) a built-in AU called 'deferred renderer', I believe, and it threads the input and output separately, but I seem to remember that there was latency involved, so that might not help.
Also, if you are testing in AU Lab, I believe that it is set up specifically to only call on a single thread (I think that is still the case), so you might need to tinker with another test host to see if it still chokes when the load is distributed.
Sorry I couldn't help more, but I thought those few bits of info might be helpful.
Sorry for replying to my own question; I don't know how to add the relevant information otherwise. Edit doesn't seem to work, and a comment is way too short.
First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.
This would work perfectly for so-called overlap/add resynthesis based on the IFFT. Yet I haven't found a way to vectorize the kind of process I'm using, which is called "oscillator-bank resynthesis" and is notorious for taxing the processor (F. R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly" and the last value stored into the control struct for further interpolation. The direction of time and the time stretch depend on live input. Not all partials exist all the time, and the placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing data in a way that minimizes the number of math operations...
If someone could point me at an example of positive practice, I'd be very grateful.
// Here's the simplified code snippet:
OSStatus AdditiveRenderProc(void                       *inRefCon,
                            AudioUnitRenderActionFlags *ioActionFlags,
                            const AudioTimeStamp       *inTimeStamp,
                            UInt32                      inBusNumber,
                            UInt32                      inNumberFrames,
                            AudioBufferList            *ioData)
{
    // local variables' declaration and behaviour-setting conditional statements
    // some local variables are here for debugging convenience
    // {... ... ...}

    // Get the time-breakpoint parameters out of the gen struct
    AdditiveGenerator *gen = (AdditiveGenerator*)inRefCon;

    // compute interpolated values for each partial's each frame
    // {deltaf[p]... ampf[p][frame]... ...}

    // here comes the brute-force "processor eater" (single channel only!)
    Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;
    for (UInt32 frame = 0; frame < inNumberFrames; frame++)
    {
        buf[frame] = 0.;
        for (UInt32 p = 0; p < candidates; p++) {
            if (gen->partialFrequencyf[p] < NYQUISTF)
                buf[frame] += sinf(phasef[p]) * ampf[p][frame];
            phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p]*frame);
            if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
        }
        buf[frame] *= ovampf[frame];
    }

    for (UInt32 p = 0; p < candidates; p++) {
        // store the updated parameters back to the gen struct
        // {... ... ...}
        ;
    }
    return noErr;
}
Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060, although I'm testing it now on a GTX 260, which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N, which are the dimensions of a rectangular image received as a parameter; these can vary greatly, from 50x50 to 10000x10000 pixels or more, although I'm mostly interested in working with the bigger images in CUDA.
Now each image has to be traced in all directions and angles, and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing the kernel below, which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays, all of length SUM, into the global memory of the device one after another, since I need to access them for the calculations, like this:
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All copies are done before the kernel calls. According to the CUDA profiler in Nsight, the highest memcpy duration is 246.016 us for a 500x500 image, so that is not taking too long.
But kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the CUDA profiler for the kernel below on a 500x500 image, and 5.052 seconds for the kernel with the longest duration), so I need to see if I can optimize them.
I arrange the data this way
First the block dimension
dim3 dimBlock(256,1,1);
then the number of blocks per Grid
dim3 dimGrid((SUM+255)/256);
which gives 895 blocks for a 500x500 image.
I'm not sure how to use coalescing and shared memory in my case, or even whether it's a good idea to call the kernel several times with different portions of the data. The data is independent, so I could in theory call that kernel several times rather than with all 229080 threads at once if need be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the CUDA profiler, the Global Store Efficiency is 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values.
Is that good? Bad? How can I interpret those values? Sadly, devices of compute capability 1.3 don't provide other useful info for assessing the code, like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis are "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency", but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM, int cT, double Bn, size_t M, size_t N,
                          float* imagen_cuda, double* vector_trazo_cuda,
                          long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda,
                          long* tbegin_cuda, long* tendbegin_cuda)
{
    long xi;
    long yi;
    int t;
    int k;
    int a;
    int ji;
    long idHilo = blockIdx.x*blockDim.x + threadIdx.x;
    int neighborhood[31];
    int v = 0;

    if (idHilo < SUM) {
        for (t = 15; t <= tendbegin_cuda[idHilo]-15; t++) {
            xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
            yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);
            neighborhood[v] = floor(xi/Bn);
            ji = floor(yi/Bn);
            if (fabs((double)neighborhood[v]) < M && fabs((double)ji) < N)
            {
                if (tendbegin_cuda[idHilo] > 30 && v == 30) {
                    if (t == 0)
                        vector_trazo_cuda[20+idHilo*31] = 0;
                    for (k = 1; k <= 15; k++)
                        vector_trazo_cuda[20+idHilo*31] = vector_trazo_cuda[20+idHilo*31] +
                            fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])] -
                                 imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
                    for (a = 0; a < 30; a++)
                        neighborhood[a] = neighborhood[a+1];
                    v = v-1;
                }
                v = v+1;
            }
        }
    }
}
EDIT:
Changing the DP flops to SP flops only slightly improved the duration. Unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math, fewer iterations, less memory, etc.?
If precision is not a big concern, use single-precision floating point (or half-precision floating point on newer architectures). Part of the reason it didn't affect your performance much when you briefly tried it is that you're still using double-precision calculations on your floating-point data (fabs takes a double, so if you use it with a float, it converts your float to a double, does double math, returns a double, and converts back to float; use fabsf).
If you don't need to use the absolute full precision of float use fast math (compiler option).
Multiply is much faster than divide (especially for full precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside kernel.
Don't know if it gets optimized out, but you should use increment and decrement operators: v=v-1; could be v--; etc.
Casting to an int will truncate toward zero, while floor() will truncate toward negative infinity. You probably don't need the explicit floor(); also, use floorf() for float, as above. When you use it for the intermediate computations on integer types, they're already truncated, so you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.).
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
.
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this:
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (it is also better because if you exceed the precision of double in the previous case, you would be rounding to the nearest power of 2). Also, those four lines should be outside the for loop... you don't need to recompute them if they don't depend on t. Together, I wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads, try to read once from global, handle calculations on local memory, and write once to global (if at all possible).
Compile with -lineinfo always. It makes profiling easier, and I haven't been able to measure any overhead whatsoever (using kernels in the 0.1 to 10 ms execution time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to let the compiler use registers when possible; this is a big topic.
As always, don't change everything at once. I typed all this out without compiling/testing, so I may have an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up from a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code is available at CUDA Tuning with CUDAFY.
I have a stream of binary data and want to convert it to raw waveform sound data, which I can send to the speakers.
This is what the old-school modems did in order to transfer binary data over the phone line (producing the typical modemish sound). It is called modulation.
Then I need a reverse process - from the raw waveform samples, I want to obtain the exact binary data. This is called demodulation.
Any bitrate will work for a start.
The sound is played using computer speakers and sampled using a microphone.
Bandwidth would be quite low (low quality microphone).
There is some background noise but not much.
I found one particular way to do this - Frequency shift keying. The problem is I can't find any source code.
Can you please point me to an implementation of FSK in any language?
Or offer any alternative encoding binary<->sound with available source code?
The simplest modulation scheme would be amplitude modulation (technically for the digital realm this would be called Amplitude Shift Keying). Take a fixed frequency (let's say 10Khz), your "carrier wave", and use the bits in your binary data to turn it on and off. If your data rate is 10 bits per second you will be toggling the 10KHz signal on and off at that rate. The demodulation would be an (optional) 10KHz filter followed by comparing with a threshold. This is a fairly simple scheme to implement. Generally the higher the signal frequency and your available bandwidth, the faster you can switch that signal on and off.
A very cool/fun application here would be to encode/decode as morse code and see how fast you can go.
FSK, shifting between two frequencies is more efficient in bandwidth and more immune to noise but will make the demodulator more complex as you need to distinguish between the two frequencies.
Advanced modulation schemes such as Phase Shift Keying are good at getting the highest bit rate for a given bandwidth and signal-to-noise ratio, but they are more complicated to implement. Analog phone modems needed to deal with certain bandwidth (e.g. as little as 3 kHz) and noise limitations. If you need to get the highest possible bitrate given bandwidth and noise limitations, then this is the way to go.
For actual code samples of advanced modulation schemes I would investigate application notes from DSP vendors (such as TI and Analog Devices) as those were common applications for DSPs.
Implementing a PI/4 Shift D-QPSK Baseband Modem Using the TMS320C50
QPSK modulation demystified
V.34 Transmitter and Receiver Implementation on the TMS320C50 DSP
Another very simple and not so efficient method is to use DTMF. Those are the tones generated by phone keypads where each symbol is a combination of two frequencies. If you Google you'll find a lot of source code. Depending on your application/requirements this may be a simple solution.
Let's dive into some simple scheme implementation details, something like the morse code I mentioned earlier. We can use a "dot" for 0 and a "dash" for 1. An advantage of a morse-like scheme is that it also solves the framing problem, as you can resynchronize your sampling after every space. For simplicity, let's pick the "carrier wave" frequency at 11 kHz and assume your wave output is 44 kHz, 16-bit, mono. We'll also use a square wave, which will create harmonics, but we don't really care. If 11 kHz is beyond your microphone's frequency response then just divide all frequencies by 2, for example. We'll pick some arbitrary level, 10000, and so our "on" waveform looks like this:
{10000, 10000, 0, 0, 10000, 10000, 0, 0, 10000, 10000, 0, 0, ...} // 4 samples = one 11 kHz period
and our "off" waveform is just all zeros. I leave the coding of this part as an excersize to the reader.
And so we have something like:
const int dot_samples = 400; // ~10ms - speed up later
const int space_samples = 400; // ~10ms
const int dash_samples = 800; // ~20ms
void encode(uint8_t* source, int length, int16_t* target)   // assumes enough room in target
{
    for (int i = 0; i < length; i++)
    {
        for (int j = 0; j < 8; j++)
        {
            if ((source[i] >> j) & 1)   // if the data bit is 1 we'll encode a dash
            {
                generate_on(&target, dash_samples);   // generate ON wave for n samples and update target ptr
            }
            else                        // otherwise a dot
            {
                generate_on(&target, dot_samples);    // generate ON wave for n samples and update target ptr
            }
            generate_off(&target, space_samples);     // generate zeros
        }
    }
}
The decoder is a bit more complicated, but here's an outline (a rough code sketch follows the steps):
Optionally band-pass filter the sampled signal around 11 kHz. This will improve performance in a noisy environment. FIR filters are pretty simple, and there are a few online design applets that will generate the filter for you.
Threshold the signal. Every value above 1/2 the maximum amplitude is 1, every value below is 0. This assumes you have sampled the entire signal. If this is in real time, you either pick a fixed threshold or do some sort of automatic gain control where you track the maximum signal level over some time.
Scan for start of dot or dash. You probably want to see at least a certain number of 1's in your dot period to consider the samples a dot. Then keep scanning to see if this is a dash. Don't expect a perfect signal - you'll see a few 0's in the middle of your 1's and a few 1's in the middle of your 0's. If there's little noise then differentiating the "on" periods from the "off" periods should be fairly easy.
Then reverse the above process. If you see dash push a 1 bit to your buffer, if a dot push a zero.
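Under the simplifying assumptions that the whole capture is in memory and the timing matches the encoder reasonably well, steps 2 to 4 might be sketched like this (the constants mirror the encoder above; this is illustrative only, not tested against real recordings):
#include <vector>
#include <cstdint>

// Rough sketch of steps 2-4: threshold, then classify each symbol as a dot (0)
// or a dash (1) by how many samples in a dash-length window are "on".
std::vector<int> decode_bits(const int16_t* samples, int count) {
    const int dot = 400, dash = 800;

    // step 2: threshold at half the maximum amplitude seen in the capture
    int peak = 0;
    for (int i = 0; i < count; ++i) {
        int a = samples[i] < 0 ? -samples[i] : samples[i];
        if (a > peak) peak = a;
    }
    const int threshold = peak / 2;

    // steps 3 and 4: scan for symbols and classify them
    std::vector<int> bits;
    int i = 0;
    while (i < count) {
        while (i < count && samples[i] < threshold) ++i;   // skip the space
        if (i >= count) break;
        int on = 0;                                        // "on" samples within one dash window
        for (int j = i; j < i + dash && j < count; ++j)
            if (samples[j] >= threshold) ++on;
        // the square-wave carrier is high about half the time, so expect
        // roughly dot/2 "on" samples for a dot and dash/2 for a dash
        const int decide = (dot / 2 + dash / 2) / 2;
        bits.push_back(on > decide ? 1 : 0);
        i += bits.back() ? dash : dot;                     // jump past the symbol
    }
    return bits;
}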
One purpose of modulation/demodulation is to adapt to channel characteristics. For instance, the channel might not be able to pass DC. Another purpose is to overcome a given amount and type of noise in the channel, while still transferring data above some given error rate.
For FSK, you simply want routines that can generate sine waves at two different frequencies on the transmit end, and filter and detect two different frequencies on the receiving end. The length of each segment of sine waves, the separation in frequency, and the amplitude will depend on the data rate and the amount of noise you need to overcome.
In the simplest case, zero noise, simply produce N or 2N sine waves within successive fixed time frames. Something like:
x[i] = amplitude * sin( i * 2 * pi * (data[j] ? 1.0 : 2.0) * freq / sampleRate )
On the receiving end, you can sample the signal at well above twice the max frequency, measure the distance between zero crossings, and see if you find a bunch of short-period or long-period waveforms. Much fancier methods using digital signal processing filters (IIR, FIR, etc.) and various statistical detectors can be used in the presence of non-zero noise.
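A rough sketch of that zero-crossing idea (the frame length, sample rate and tone frequencies are illustrative parameters, not tied to a particular modem):
#include <vector>
#include <cmath>

// Classify each fixed-length frame as a 1 or a 0 by estimating its frequency
// from the number of zero crossings and picking the nearer of the two tones.
std::vector<int> demodulate_fsk(const float* x, int count, int frame_len,
                                float sample_rate, float freq1, float freq0) {
    std::vector<int> bits;
    for (int start = 0; start + frame_len <= count; start += frame_len) {
        int crossings = 0;
        for (int i = start + 1; i < start + frame_len; ++i)
            if ((x[i - 1] < 0.0f) != (x[i] < 0.0f)) ++crossings;
        float est = (crossings * 0.5f) * sample_rate / frame_len;  // two crossings per cycle
        bits.push_back(std::fabs(est - freq1) < std::fabs(est - freq0) ? 1 : 0);
    }
    return bits;
}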
I have two audio recordings of a same signal by 2 different microphones (for example, in a WAV format), but one of them is recorded with delay, for example, several seconds.
It's easy to identify such a delay visually when viewing these signals in some kind of waveform viewer - i.e. just by spotting the first visible peak in every signal and ensuring that they're the same shape:
(source: greycat.ru)
But how do I do it programmatically - find out what this delay (t) is? The two digitized signals are slightly different (because the microphones are different, were at different positions, because of ADC setups, etc.).
I've dug around a bit and found out that this problem is usually called "time-delay estimation" and it has myriad approaches - for example, one of them.
But are there any simple and ready-made solutions available, such as a command-line utility, a library, or a straightforward algorithm?
Conclusion: I found no simple implementation and wrote a simple command-line utility myself - available at https://bitbucket.org/GreyCat/calc-sound-delay (GPLv3-licensed). It implements a very simple search-for-maximum algorithm described at Wikipedia.
The technique you're looking for is called cross correlation. It's a very simple, if somewhat compute-intensive, technique which can be used for solving various problems, including measuring the time difference (aka lag) between two similar signals (the signals do not need to be identical).
If you have a reasonable idea of your lag value (or at least the range of lag values that are expected) then you can reduce the total amount of computation considerably. Ditto if you can put a definite limit on how much accuracy you need.
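As an illustration of the idea, including restricting the lag range to save computation, a plain time-domain version might look roughly like this (the signals and the maximum lag are whatever your recordings dictate):
#include <vector>
#include <cstddef>

// Returns the lag (in samples) of b relative to a that maximizes the
// cross correlation; a positive result means b is delayed with respect to a.
long best_lag(const std::vector<float>& a, const std::vector<float>& b, long max_lag) {
    long best = 0;
    double best_score = -1e300;
    for (long lag = -max_lag; lag <= max_lag; ++lag) {
        double score = 0.0;
        for (size_t i = 0; i < a.size(); ++i) {
            long j = static_cast<long>(i) + lag;
            if (j >= 0 && j < static_cast<long>(b.size()))
                score += static_cast<double>(a[i]) * b[j];
        }
        if (score > best_score) { best_score = score; best = lag; }
    }
    return best;
}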
Having had the same problem, and without success in finding a tool to sync the start of video/audio recordings automatically, I decided to make syncstart (github).
It is a command-line tool. The basic code behind it is this:
import numpy as np
from scipy import fft
from scipy.io import wavfile
r1,s1 = wavfile.read(in1)
r2,s2 = wavfile.read(in2)
assert r1==r2, "syncstart normalizes using ffmpeg"
fs = r1
ls1 = len(s1)
ls2 = len(s2)
padsize = ls1+ls2+1
padsize = 2**(int(np.log(padsize)/np.log(2))+1)
s1pad = np.zeros(padsize)
s1pad[:ls1] = s1
s2pad = np.zeros(padsize)
s2pad[:ls2] = s2
corr = fft.ifft(fft.fft(s1pad)*np.conj(fft.fft(s2pad)))
ca = np.absolute(corr)
xmax = np.argmax(ca)
if xmax > padsize // 2:
    file,offset = in2,(padsize-xmax)/fs
else:
    file,offset = in1,xmax/fs
A very straightforward thing to do is just to check whether the peaks exceed some threshold; the time between the high peak on line A and the high peak on line B is probably your delay. Just try tinkering a bit with the thresholds, and if the graphs are usually as clear as the picture you posted, then you should be fine.