Algorithms for finding coincidences in sets of data

Background & Motivation
I have three lists of timestamps (UNIX timestamps with a 10-digit subsecond part, e.g. 1451606400.9475839201). The lists are not necessarily the same size, but each is ordered in non-descending order.
Each list corresponds to data from a real-life instrument, of which there are 3. The instruments record a timestamp each time they observe an "event". The issue is that the instruments are very sensitive, so they record timestamps on the order of 10+ Hz, and I'm working with ~1 year of data, where only a tiny portion corresponds to actual events.
Unfortunately, it's difficult to put a number on precisely how many timestamps should be real events.
We may assume that the "random" timestamps are uniformly distributed (and, in practice, this seems to be the case). There are, however, gaps in the data (e.g. all three instruments may have gone down during the months of March, April, May). These gaps will be the same for all three lists.
Unfortunately, we cannot assume that the clocks for the three instruments are well synchronized. A drift of a fraction of a microsecond to a few microseconds can be expected. Further, the "events" in question are light, and the instruments are close together, so we can calculate the maximum difference in observation time, on the order of ~10-15 microseconds (assuming the clocks were synchronized and there was no noise).
Goal:
Using only the timestamps, I want to identify those which are most "likely" to correspond to a "real" event, such that further analysis can be conducted.
What I've Tried:
My first inclination was to produce, from the three lists, a list of triplets (one timestamp contributed by each list/instrument) such that max(A,B,C) - min(A,B,C) was minimized. Something like this simple algorithm.
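For concreteness, a rough sketch of that kind of scan (not the linked code; the list names and the coincidence window value are illustrative assumptions):
def find_coincidences(a, b, c, window=15e-6):
    """Three-pointer scan over the sorted lists: record the current
    (A, B, C) triplet if its spread fits within the window, then advance
    the pointer that holds the smallest timestamp."""
    lists = [a, b, c]
    idx = [0, 0, 0]
    out = []
    while all(idx[k] < len(lists[k]) for k in range(3)):
        vals = [lists[k][idx[k]] for k in range(3)]
        if max(vals) - min(vals) <= window:
            out.append(tuple(vals))
        k_min = min(range(3), key=lambda k: vals[k])  # advance the smallest
        idx[k_min] += 1
    return out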
Unfortunately, this approach found very few, and sometimes no, "coincidences" that fell within a reasonable time window. Further, upon closer analysis, the few that did turned out not to correspond to real events.
Next, I tried the above, but this time minimizing the RSS error, which I defined, for a triplet A, B, C, as (A-B)**2 + (A-C)**2 + (B-C)**2. This found only a few more triplets within a reasonable time window, and none corresponded to real events.
Lastly, I tried simply iterating over all elements of the first vector, and finding the closest match in the second and third vectors (by binary search), then repeating for the second and third vectors. This gave me identical results to the RSS minimization code.
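For reference, that nearest-match step can be done with Python's bisect module (a minimal sketch; the per-list iteration and any symmetry checks are omitted):
from bisect import bisect_left

def nearest(sorted_list, t):
    """Return the element of sorted_list closest to t (binary search)."""
    i = bisect_left(sorted_list, t)
    candidates = []
    if i > 0:
        candidates.append(sorted_list[i - 1])
    if i < len(sorted_list):
        candidates.append(sorted_list[i])
    return min(candidates, key=lambda x: abs(x - t))

# e.g. for each t in list A, pair it with nearest(B, t) and nearest(C, t)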
Is there a better or a "standard" approach?
I'm not worried about anything except effectively finding these "real events"; that includes efficiency... if it works well, speed/efficiency is of little concern.

Related

Rapid change detection algorithm

I'm logging temperature values in a room, saving them to the database. I'd like to be alerted when temperature rises suddenly. I can't set fixed values, because 18°C is acceptable in winter and 25°C is acceptable in summer. But if it jumps from 20°C to 25°C during, let's say, 30 minutes and stays like this for 5 minutes (to eliminate false readouts), I'd like to be informed.
My current idea is to take the readouts from the last 30 minutes (A) and the readouts from the last 5 minutes (B), calculate the median of A and of B, and check whether the difference between them exceeds my desired threshold.
Is this correct way to solve this or is there a better algorithm? I searched for a specific one but most of them seem overcomplicated.
Thanks!
Detecting changes in a time series is a well-researched subject, and hundreds if not thousands of papers have been written on it. As you've seen, many methods are quite advanced, but they have proved to be quite useful for many use cases. Whatever method you choose, you should evaluate it against real or simulated data, and optimize its parameters for your use case.
As you request, let me suggest a very simple method that in many cases proves to be good enough, and which is quite similar to the one you considered.
Basically, you have two concerns:
Detecting a monotonic change in a sampled noisy signal
Ignoring false readouts
First, note that medians are not commonly used for detecting trends. For the series (1,2,3,30,35,3,2,1), the medians of 5 consecutive terms are (3, 3, 3, 3). It is much more common to use averages.
One common trick is to throw away the extreme values before averaging (e.g. for each 7 values, average only the middle 5). If many false readouts are expected, try to take measurements at a faster rate and throw away more extreme values (e.g. for each 13 values, average the middle 9).
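A minimal sketch of that trimmed average, assuming the readouts are available as a list of floats (names illustrative):
def trimmed_mean(values, keep=5):
    """Average only the middle `keep` values, discarding the extremes
    (e.g. for each 7 readouts, average the middle 5)."""
    s = sorted(values)
    drop = (len(s) - keep) // 2
    middle = s[drop:drop + keep]
    return sum(middle) / len(middle)

# e.g. trimmed_mean(last_seven_readouts, keep=5)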
Also, you should throw away unfeasible values and replace them with the last measured value (unfeasible means out of range, or non-physical change rate).
Your idea of comparing a short-period measure with a long-period one is a good one, and indeed it is commonly used (e.g. in econometrics).
Quoting from "Financial Econometric Models - Some Contributions to the Field" [Nicolau, 2007]:
Buy and sell signals are generated by two moving averages of the price
level: a long-period average and a short-period average. A typical
moving average trading rule prescribes a buy (sell) when the
short-period moving average crosses the long-period moving average
from below (above) (i.e. when the original time series is rising
(falling) relatively fast).
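Translated to the temperature problem, a minimal sketch of that short-versus-long comparison might look as follows (the window lengths, sampling rate and threshold are illustrative assumptions):
def rising_fast(readouts, short_n=5, long_n=30, threshold=3.0):
    """Alert when the short-period average exceeds the long-period
    average by more than `threshold` degrees.
    `readouts` holds one temperature per minute, newest last."""
    if len(readouts) < long_n:
        return False
    long_avg = sum(readouts[-long_n:]) / long_n
    short_avg = sum(readouts[-short_n:]) / short_n
    return short_avg - long_avg > threshold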
When you say "rises suddenly," mathematically you are talking about the magnitude of the derivative of the temperature signal.
There is a nice algorithm, called the Savitzky–Golay filter, that simultaneously smooths a signal and calculates its derivative. It's explained with examples on Wikipedia, or you can use Matlab to help you generate the convolution coefficients required. Once you have the coefficients, the calculation is very simple.
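If Python is available, scipy ships an implementation, so you don't have to generate the coefficients yourself. A minimal sketch, assuming one sample per minute and an illustrative alert threshold:
import numpy as np
from scipy.signal import savgol_filter

# one temperature sample per minute (illustrative data)
temps = np.array([20.0, 20.1, 20.0, 20.2, 20.1, 21.0, 22.0, 23.0, 24.0, 24.5, 25.0])
dt = 60.0  # seconds between samples

# smooth and differentiate in one pass: quadratic fit over an 11-sample window
dT_dt = savgol_filter(temps, window_length=11, polyorder=2, deriv=1, delta=dt)

# alert if the latest rate exceeds, say, 5 degrees C per 30 minutes
alert = dT_dt[-1] > 5.0 / 1800.0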

Gene representation for production planning with constraints

I'm trying to improve the throughput of a production system. The exact type of the system isn't relevant (I think).
Description
The system consists of a LINE of stations (numbered 1, 2, 3...) and an ARM.
The system receives an ITEM at random times.
Each ITEM has a PLAN associated with it (for example, ITEM1 may have a PLAN which
says it needs to go through station 3, then 1, then 5). The PLAN includes timing information on
how long the ITEM would be at each station (a range of hard max/min values).
Every STATION can hold one ITEM at a time.
The ARM is used to move each ITEM from one STATION to the next. Each PLAN includes
timing information for the ARM as well, which is a fixed value.
Current Practice
I have two current (working) planning solutions.
The first maintains a master list of usage for each STATION; consider this a 'booking' approach.
As each new ITEM-N enters, the system searches ahead to find the earliest possible slot where
PLAN-N would fit. So for example, it would try to fit it at t=0, then progressively try higher
delays till it found a fit (well actually I have some heuristics here to cut down processing time,
but the approach holds).
The second maintains a list for each ITEM specifying when it is to start. When a new ITEM-N
enters, the system compares its PLAN-N with all existing lists to find a suitable time to
start. Again, it starts at t=0 then progressively tries higher delays.
Neither of the two solutions takes advantage of the range of times an ITEM is allowed at each
station; a fixed time is assumed (midpoint or minimum).
Ideal Solution
It's quite self-evident that there exist situations where an incoming ITEM would be able to
start earlier than otherwise possible if some of the current ITEMs changed the duration they
spend in certain STATIONs, whether by shortening that duration (so the new ITEM could enter
the STATION instead) or lengthening that duration (so the ARM has time to move the
ITEM).
I'm trying to implement a Genetic Algorithm solution to the problem. My current gene contains N
numbers (between 0 and 1), where N is the total number of station visits among all items currently in the
system plus the new item which is to be added in. It's trivial to convert this gene to
actual durations (0 would be the min duration, 1 would be the max, scaled linearly in between).
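A minimal sketch of that decoding step, assuming each planned station visit carries its own (min, max) duration bounds (names and bounds are illustrative):
def decode_gene(gene, bounds):
    """Map gene values in [0, 1] to durations, scaling linearly between
    each station visit's (min_duration, max_duration) bounds."""
    assert len(gene) == len(bounds)
    return [lo + g * (hi - lo) for g, (lo, hi) in zip(gene, bounds)]

# e.g. decode_gene([0.0, 0.5, 1.0], [(10, 20), (5, 9), (30, 60)])
# -> [10.0, 7.0, 60.0]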
However, this gene representation consistently produces unusable plans which overlap with each
other. The reason for this is that when multiple items are already arranged ideally (consecutive in
time, planning-wise), no variation in durations is possible. This is unavoidable because once items
are already being processed, they cannot be delayed or brought forward.
An example of the above situation: say ITEMA is in STATION3 from t1 to t2 and from t3 to
t4. ITEMB then comes along and occupies STATION3 from t2 to t3 (so STATION3 is fully
utilized between t1 and t4). With my current gene representation, I'm virtually guaranteed never to
find a valid solution, since that would require certain elements of the gene to have exactly the
correct value so as not to generate an overlap.
Questions
Is there a better gene representation than the one I describe above?
Would I be better served doing some simple hill-climbing to find modifiable timings? Or is a GA
actually suited to this problem?

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself (ETC stands for Estimated Time of Completion):
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining data will be copied at the average copy speed thus far, which may or may not be a realistic assumption (I'm dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), compute the average copy speed over that window (call it currAvg), and just do something like:
ETC = currTime + (totalSize - sizeDone) / currAvg
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
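In Python, for instance, the weights can be fitted with an ordinary least-squares one-liner as well (a sketch with made-up training data; the array names are illustrative):
import numpy as np

# illustrative training data: remaining-time predictions from the two
# methods, and the remaining time that was actually observed
etc1 = np.array([120.0, 90.0, 60.0, 30.0])   # method 1: overall average speed
etc2 = np.array([100.0, 95.0, 50.0, 35.0])   # method 2: last-n-seconds average
actual = np.array([110.0, 92.0, 55.0, 33.0])

X = np.column_stack([etc1, etc2])
w, *_ = np.linalg.lstsq(X, actual, rcond=None)  # least-squares weights

blended_prediction = X @ w  # combined estimate for each measurement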
An excellent resource for studying statistical learning methods is The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
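A minimal scalar sketch of that idea, modelling the copy speed as a random walk observed through noisy per-interval measurements (the variance values are illustrative tuning parameters, not something from the answer):
def kalman_speed(measurements, process_var=1e4, meas_var=1e6):
    """Scalar Kalman filter tracking copy speed (bytes/s).
    `measurements` are noisy instantaneous speed readings."""
    x, p = measurements[0], 1.0       # initial estimate and its variance
    estimates = []
    for z in measurements:
        p += process_var              # predict: uncertainty grows
        k = p / (p + meas_var)        # Kalman gain
        x += k * (z - x)              # update with the new measurement
        p *= (1 - k)
        estimates.append(x)
    return estimates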
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
At the start of a transfer, the ETC is based on a historic speed value, which is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer you are to the end, the more weight is given to actual data from the transfer (a sketch of this follows below).
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
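A minimal sketch of the first of those two approaches, assuming a stored historic speed and using the fraction complete as the weight (both the names and the weighting scheme are illustrative):
def blended_speed(historic_speed, current_speed, fraction_done):
    """Weight the current transfer's measured speed more heavily as the
    transfer progresses, and the historic speed more heavily early on."""
    return (1 - fraction_done) * historic_speed + fraction_done * current_speed

def etc(now, total_size, size_done, historic_speed, current_speed):
    """Estimated time of completion from the blended speed."""
    speed = blended_speed(historic_speed, current_speed, size_done / total_size)
    return now + (total_size - size_done) / speed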
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Apart from the statistical approaches, one simple way to get a good estimate of the current speed while smoothing out some noise or spikes is to take a weighted approach.
You already experimented with the sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, give more weight to the more recent measurements, since they are more indicative of the evolution (a bit like a derivative).
Example: suppose you have the last 10 windows (most recent x0, least recent x9). Since the weights 10, 9, ..., 1 sum to 55, you could compute the speed as:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)
Once you have a good assessment of the likely speed, you are close to getting a good estimated time.
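A minimal sketch of that weighted window, assuming you record the number of bytes copied in each window (names illustrative):
def weighted_speed(bytes_per_window, window_time):
    """Weighted average speed over the last N windows, newest first:
    the most recent window gets weight N, the oldest weight 1, and the
    weights sum to N * (N + 1) / 2 (55 for N = 10)."""
    n = len(bytes_per_window)
    weighted = sum((n - i) * b for i, b in enumerate(bytes_per_window))
    total_weight = n * (n + 1) // 2
    return weighted / (total_weight * window_time)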
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have shown that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should be conservative in the estimates presented at first (reserving time for a potential slow-down).
A simple way to get that is to have a factor, depending on the completion percentage, that you use to tweak the presented completion. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic function produces a nice speed-up toward the completion time. Other functions could use a suitably normalized exponential form (e.g. (e^x - 1) / (e - 1)), etc.
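One possible reading of that scheme, with factor(x) = x^2 so that the displayed completion is cubic overall (purely illustrative):
def presented_completion(real_completion):
    """Understate progress so the bar tends to speed up, never slow down:
    factor(x) = x**2 satisfies factor(x) <= x and factor(1) == 1."""
    factor = real_completion ** 2
    return real_completion * factor

# e.g. presented_completion(0.4) -> 0.064, presented_completion(1.0) -> 1.0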

Fuzzy matching/chunking algorithm

Background: I have video clips and audio tracks that I want to sync with said videos.
From the video clips, I'll extract a reference audio track.
I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.
I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it'd be extremely tedious, so I want to be able to determine these locations programmatically.
Example:
0 1 2
012345678901234567890123
ref: --part1------part2------
syn: -----part1----part2-----
# (let `-` denote silence)
Output:
[(2, 6), (5, 9),     # part1
 (13, 17), (14, 18)] # part2
My idea is, starting from the beginning:
Fingerprint 2 large chunks* of audio and see if they match:
If yes: move on to the next chunk
If not:
Go down both tracks looking for the first non-silent portion of each
Offset the target to match the original
Go back to the beginning of the loop
# * chunk size determined by heuristics and modifiable
The main problem here is that sound matching and fingerprinting are fuzzy and relatively expensive operations.
Ideally I want to do them as few times as possible. Ideas?
Sounds like you're not looking to spend a lot of time delving into audio processing/engineering, and hence you want something you can quickly understand that just works. If you're willing to go with something more complex, see here for a very good reference.
That being the case, I'd expect simple loudness and zero crossing measures would be sufficient to identify portions of sound. This is great because you can use techniques similar to rsync.
Choose some number of samples as a chunk size and march through your reference audio data at that interval. Calculate the zero-crossing measure for each chunk (you likely want a logarithm, or a fast approximation of one, of a simple zero-crossing count). Store the chunks in a 2D spatial structure based on time and the zero-crossing measure.
Then march through your actual audio data at a much finer step. (It probably doesn't need to be as small as one sample.) Note that you don't have to recompute the measures for the entire chunk size -- just subtract out the zero-crossings no longer in the chunk and add in the new ones that are. (You'll still need to compute the logarithm or approximation thereof.)
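A minimal sketch of that incremental update, assuming `samples` is a sequence of signed PCM values and using log1p of the raw crossing count as the measure (the names, step and chunk sizes are illustrative):
import math

def zero_crossings(samples):
    """Count sign changes in a block of audio samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

def rolling_zc_measure(samples, chunk_size, step):
    """Slide a window through the audio, updating the zero-crossing count
    incrementally instead of recomputing it for every position."""
    zc = zero_crossings(samples[:chunk_size])
    measures = [math.log1p(zc)]
    for start in range(step, len(samples) - chunk_size + 1, step):
        # remove crossings that left the window, add those that entered
        zc -= zero_crossings(samples[start - step:start + 1])
        zc += zero_crossings(samples[start + chunk_size - step - 1:start + chunk_size])
        measures.append(math.log1p(zc))
    return measures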
Look for the 'next' chunk with a close enough frequency. Note that since what you're looking for is in order from start to finish, there's no reason to look at -all- chunks. In fact, we don't want to since we're far more likely to get false positives.
If the chunk matches well enough, see if it matches all the way out to silence.
The only concerning point is the 2D spatial structure, but honestly this can be made much easier if you're willing to forgo a strict approximation window. Then you can just have overlapping bins. That way all you need to do is check two bins for all the values after a certain time -- essentially two binary searches through a search structure.
The disadvantage to all of this is it may require some tweaking to get right and isn't a proven method.
If you can reliably distinguish silence from non-silence as you suggest and if the only differences are insertions of silence, then it seems the only non-trivial case is where silence is inserted where there was none before:
ref: --part1part2--
syn: ---part1---part2----
If you can make your chunk size adaptive to the silence, your algorithm should be fine. That is, if your chunk size is equivalent to two characters in the above example, your algorithm would recognize "pa" matches "pa" and "rt" matches "rt" but for the third chunk it must recognize the silence in syn and adapt the chunk size to compare "1" to "1" instead of "1p" to "1-".
For more complicated edits, you might be able to adapt a weighted Shortest Edit Distance algorithm in which inserting or removing silence has zero cost.
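A minimal sketch of such a weighted edit distance over the symbolic example above, treating '-' as silence (the chunk-level application and the actual silence detector are left out):
def silence_tolerant_distance(ref, syn, silence='-'):
    """Edit distance where inserting/deleting silence is free and every
    other insertion, deletion or substitution costs 1."""
    n, m = len(ref), len(syn)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # dp[i][j]: cost for ref[:i] vs syn[:j]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + (0 if ref[i - 1] == silence else 1)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + (0 if syn[j - 1] == silence else 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + (0 if ref[i - 1] == silence else 1),         # drop from ref
                dp[i][j - 1] + (0 if syn[j - 1] == silence else 1),         # drop from syn
                dp[i - 1][j - 1] + (0 if ref[i - 1] == syn[j - 1] else 1),  # match/substitute
            )
    return dp[n][m]

# e.g. silence_tolerant_distance("--part1part2--", "---part1---part2----") == 0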

Looking for ideas for a simple pattern matching algorithm to run on a microcontroller

I'm working on a project to recognize simple audio patterns. I have two data sets, each made up of between 4 and 32 note/duration pairs. One set is predefined, the other comes from an incoming data stream. The two strongly correlated data sets are often of different lengths, but roughly the same "shape". My goal is to come up with some sort of ranking as to how well the two data sets correlate/match.
I have converted the incoming frequencies to pitch and shifted the incoming data stream's pitch so that its average pitch matches that of the predefined data set. I also stretch/compress the incoming data set's durations to match the overall duration of the predefined set. Here are two graphical examples of data that should be ranked as strongly correlated:
http://s2.postimage.org/FVeG0-ee3c23ecc094a55b15e538c3a0d83dd5.gif
(Sorry, as a new user I couldn't directly post images)
I'm doing this on an 8-bit microcontroller, so resources are minimal. Speed is less of an issue; a second or two of processing isn't a deal breaker.
It wouldn't surprise me if there is an obvious solution, I've just been staring at the problem too long. Any ideas?
Thanks in advance...
Couldn't see the graphic, but... Divide the spectrum into bins. You've probably already done this, but they may be too fine. Depending on your application, consider dividing the spectrum into, say, 16 or 32 bins, maybe logarithmically, since that is how we hear. Then compare the ratios of the power in each bin. E.g., compare the ratio of 500 Hz to 1000 Hz in the first sample with that same ratio in the second sample. That gets rid of any problem with unequal amplitudes between the samples.
1D signal matching is often done using convolution. However, this may be processor-intensive.
A simpler algorithm would be to first check whether the durations of each note in the two signals are roughly equal, and then check whether the next-frequency patterns of the two signals are the same. What I mean by next-frequency pattern is to reduce the ordered list of frequencies to an ordered list of whether the next frequency is higher or lower. So something that goes 500 Hz to 1000 Hz to 700 Hz to 400 Hz would simply become Higher-Lower-Lower. This may be good enough, depending on your purposes.
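A minimal sketch of that contour comparison (written in Python just to illustrate the idea; on the microcontroller it would be a few lines of C, and ties are treated as "lower" here):
def contour(freqs):
    """Reduce an ordered list of frequencies (Hz) to an up/down pattern."""
    return ['H' if b > a else 'L' for a, b in zip(freqs, freqs[1:])]

def contours_match(predefined, incoming):
    """Rough match: same up/down shape, ignoring absolute pitch."""
    return contour(predefined) == contour(incoming)

# e.g. contour([500, 1000, 700, 400]) -> ['H', 'L', 'L']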

Resources