Extracting periodicities from message arrival timestamps - algorithm

I have a set of messages which each have an arrival timestamp. I would like to analyze the set and determine the periodicities of the messages' arrival. (Having that, I can then detect, with some degree of certainty, when subsequent messages are late or missing.) So a discrete Fourier transform seems the logical choice for pulling the frequency (or frequencies) out of the set.
However, all the explanations of discrete Fourier transforms I've seen start from a finite set of values sampled at a constant frequency, whereas what I have is simply a set of monotonically increasing timestamp values.
Convert to time series data?
I've thought of selecting a small resolution -- e.g. one second -- and then producing a time series beginning at the time of the first message and running through the current real time, with a value of 0 or 1 at each of those time points (mostly zeros, with ones at the arrival time of each message).
More specifics
I have many sets: I need to perform this calculation many times, as I have many different sets of messages to analyze. Each set of messages might be on the order of 1,000 messages spanning up to a year of real time. So if I converted one set of messages into a time series as described above, that's ~32 million time series data points (seconds in a year), with only ~1,000 non-zero values.
Some of the message sets are more frequent: ~5,000 messages over a scale of days -- so that would be more like ~400,000 time series data points, but still with only ~5,000 non-zero values.
Is this sane (convert arrival times to a time series and then head for plain FFT work)? Or is there a different way to apply Fourier transforms to my actual data (message arrival times)?

I suggest that you bin the message counts into evenly-spaced bins of a suitable duration, and then treat these bins as a time series and generate a spectrum from the series, using e.g. an FFT-based method. The resulting spectrum should show any periodicities as peaks around particular bin frequencies.
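A minimal sketch of that approach, assuming the timestamps are UNIX seconds held in a NumPy array (the bin width and the toy data are arbitrary choices):

    import numpy as np

    def arrival_spectrum(timestamps, bin_width=1.0):
        """Bin arrival timestamps into a counts-per-bin series and return the
        frequencies (Hz) and magnitudes of its spectrum."""
        t = np.asarray(timestamps, dtype=float)
        t = t - t.min()                                # start the series at zero
        n_bins = int(np.ceil(t.max() / bin_width)) + 1
        counts, _ = np.histogram(t, bins=n_bins, range=(0.0, n_bins * bin_width))
        counts = counts - counts.mean()                # drop the DC component
        freqs = np.fft.rfftfreq(n_bins, d=bin_width)
        return freqs, np.abs(np.fft.rfft(counts))

    # Example: messages arriving roughly every 300 s over about a day
    ts = 1.5e9 + np.cumsum(np.random.normal(300.0, 5.0, size=288))
    freqs, spec = arrival_spectrum(ts, bin_width=10.0)
    peak = freqs[1:][np.argmax(spec[1:])]              # skip the zero-frequency bin
    print("dominant period ~ %.0f s" % (1.0 / peak))

Note the trade-off in the bin width: coarser bins shrink the series (so the full-year case needn't be 32 million points), but they cap the highest detectable frequency at the Nyquist limit of 1/(2 * bin_width).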

Algorithms for finding coincidences in sets of data

Background & Motivation
I have three lists of timestamps (UNIX timestamps plus a 10-digit subsecond part, e.g. 1451606400.9475839201). They are not necessarily of the same size; however, each list is ordered (in non-descending order).
Each list corresponds to data from a real-life instrument, of which there are three. The instruments in question record a timestamp each time they observe an "event". The issue is that the instruments are very sensitive, so they record timestamps at a rate of 10+ Hz, and I'm working with ~1 year of data, of which only a tiny portion corresponds to actual events.
Unfortunately, it's difficult to put a number on precisely how many timestamps should be real events.
We may assume that the "random" timestamps are uniformly distributed (and, in practice, this seems to be the case). There are, however, gaps in the data (e.g. all three instruments may have gone down during the months of March, April, May). These gaps will be the same for all three lists.
Unfortunately, we cannot assume that the clocks of the three instruments are well synchronized. A drift of a fraction of a microsecond to a few microseconds can be expected. Further, the "events" in question are light, and the instruments are close together, so we can bound the maximum difference in observation time at roughly 10-15 microseconds (assuming the clocks were synchronized and there is no noise).
Goal:
Using only the timestamps, I want to identify those which are most "likely" to correspond to a "real" event, such that further analysis can be conducted.
What I've Tried:
My first inclination was to produce, from the three lists, a list of triplets, with one timestamp contributed by each list/instrument, such that max(A,B,C) - min(A,B,C) was minimized. Something like this simple algorithm.
Unfortunately, this found very few, and sometimes no, "coincidences" that fell within a reasonable time window. Moreover, the few that did, on closer analysis, didn't correspond to real events.
Next, I tried the above, but this time minimizing the RSS error, which for a triplet A,B,C I defined as (A-B)**2 + (A-C)**2 + (B-C)**2. This found not many more triplets within a reasonable time window, and none corresponded to real events.
Lastly, I tried simply iterating over all elements of the first vector, and finding the closest match in the second and third vectors (by binary search), then repeating for the second and third vectors. This gave me identical results to the RSS minimization code.
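For reference, that last pass looks something like the sketch below (bisect-based closest match, anchored on the first list; the 15-microsecond window is just the bound mentioned earlier):

    import bisect

    def closest(sorted_list, x):
        """Return the element of sorted_list nearest to x (binary search)."""
        i = bisect.bisect_left(sorted_list, x)
        candidates = sorted_list[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda v: abs(v - x))

    def candidate_triplets(A, B, C, window=15e-6):
        """Yield (a, b, c) triplets whose spread max-min is within `window`,
        anchored on each element of A (B and C must be sorted)."""
        for a in A:
            b = closest(B, a)
            c = closest(C, a)
            if max(a, b, c) - min(a, b, c) <= window:
                yield (a, b, c)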
Is there a better or a "standard" approach?
I don't care about anything other than effectively finding these "real events". That includes efficiency: if it works well, speed is of little concern.

Incremental price graph approximation

I need to display a cryptocurrency price graph similar to what is done on CoinMarketCap: https://coinmarketcap.com/currencies/bitcoin/
There could be gigabytes of data for one currency pair over a long period of time, so sending all the data to the client is not an option.
After doing some research I ended up using a Douglas-Peucker line approximation algorithm: https://www.codeproject.com/Articles/18936/A-C-Implementation-of-Douglas-Peucker-Line-Appro It lets me reduce the number of points sent to the client, but there's one problem: every time new data arrives I have to go through all the data on the server again, and since I'd like to update the client in real time, that takes a lot of resources.
So I'm thinking about some kind of progressive algorithm where, say, if I need to display data for the last month, I can split the data into 5-minute intervals, preprocess only the last interval, and once it's complete, drop the first one. I'm debating either customising the Douglas-Peucker algorithm (but I'm not sure it fits this scenario) or finding an algorithm designed for this purpose (any hint would be highly appreciated).
Constantly re-computing all of the reduction points when new data arrives would change your graph continuously, so the graph would lack consistency: the graph seen by one user would differ from the graph seen by another user, the graph would change when a user refreshes the page (this shouldn't happen!), and even after a server/application shutdown your data needs to be consistent with what it was before.
This is how I would approach it:
Keep your already-computed reduced points as they are. Suppose you are getting data every second and you have computed reduced points for a 5-minute-interval graph; save those points in a bounded (limiting) queue. Now gather all the per-second data for the next 5 minutes, perform the reduction on those 300 data points, and add the resulting reduced point to your limiting queue.
I would make the queue synchronized: the main thread returns the data points in the queue whenever there is an API call, and the worker thread computes the reduced point for each 5-minute window once the data for the entire interval is available.
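A minimal sketch of that queue-plus-worker idea, assuming per-second (timestamp, price) samples; the deque size, the 5-minute window, and the reduce_window placeholder (which keeps only first/last/min/max instead of a real Douglas-Peucker pass) are all illustrative choices:

    import collections
    import threading

    WINDOW_SECONDS = 300        # one 5-minute window at one sample per second
    MAX_POINTS = 8640           # e.g. roughly a month of 5-minute reduced points

    reduced = collections.deque(maxlen=MAX_POINTS)   # what the API serves
    lock = threading.Lock()
    pending = []                # raw (timestamp, price) samples of the open window

    def reduce_window(samples):
        """Placeholder reduction: keep first, last, min and max of the window.
        A Douglas-Peucker pass could be swapped in here for finer shape."""
        keep = {samples[0], samples[-1],
                min(samples, key=lambda s: s[1]),
                max(samples, key=lambda s: s[1])}
        return sorted(keep)

    def on_new_sample(ts, price):
        """Called by the ingest thread for every raw sample."""
        pending.append((ts, price))
        if ts - pending[0][0] >= WINDOW_SECONDS:
            points = reduce_window(pending)
            pending.clear()
            with lock:
                reduced.extend(points)   # oldest points fall off automatically

    def api_get_points():
        """Called on an API request; returns a stable snapshot."""
        with lock:
            return list(reduced)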
I'd use a tree.
A sub-node contains the "precision" and "average" values.
"precision" means the date-range. For example: 1 minute, 10 minutes, 1 day, 1 month, etc. This also means a level in the tree.
"average" is the value that best represents the price for a range. You can use a simple average, a linear regression, or whatever you decide as "best".
So if you need 600 points (say, from the window size), you can find the precision as prec = total_date_range / 600, with some rounding to your existing ranges.
Once you have the prec, you just retrieve the nodes at that prec level.
Since there are gigabytes of data, I'd slice them into std::vector objects. The tree would store ids referring to these vectors for the lowest-level nodes. The rest of the nodes could also be implemented as indices into vectors.
Updating with new data only requires updating one branch (or even creating a new one), starting from the root, and it touches only a few sub-nodes.
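A rough sketch of the precision-level idea, with per-level dictionaries of running sums standing in for the tree nodes and the std::vector slices; the level boundaries and the 600-point budget are illustrative:

    # Illustrative precision levels in seconds: 1 min, 10 min, 1 h, 1 day, ~1 month
    LEVELS = [60, 600, 3600, 86400, 2592000]

    class PriceTree:
        def __init__(self):
            # level -> {bucket_start_ts: (price_sum, count)}; keeping running
            # sums makes the "average" cheap to update incrementally
            self.levels = {p: {} for p in LEVELS}

        def add(self, ts, price):
            """New data touches exactly one bucket per level (one branch)."""
            for p in LEVELS:
                start = ts - (ts % p)
                s, c = self.levels[p].get(start, (0.0, 0))
                self.levels[p][start] = (s + price, c + 1)

        def query(self, t0, t1, max_points=600):
            """Pick the finest precision that still yields <= max_points buckets."""
            wanted = (t1 - t0) / max_points
            prec = next((p for p in LEVELS if p >= wanted), LEVELS[-1])
            return [(start, s / c)
                    for start, (s, c) in sorted(self.levels[prec].items())
                    if t0 <= start < t1]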

How to detect the precise sampling interval from samples stored in a database?

A hardware sensor is sampled precisely (precise period of sampling) using a real-time unit. However, the time value is not sent to the database together with the sampled value. Instead, time of insertion of the record to the database is stored for the sample in the database. The DATETIME type is used, and the GETDATE() function is used to get current time (Microsoft SQL Server).
How can I reconstruct the precise sampling times?
As the sampling interval is (or should be) exactly 60 seconds, there was no need earlier for a more precise solution. (This is an old, third-party solution with a lot of historical samples, so it is not possible to fix the design.)
For processing of the samples, I need to reconstruct the correct time instants for the samples. There is no problem with shifting the time of the whole sequence (that is, it does not matter if the start time is somewhat off; it need not be absolute). On the other hand, the sampling interval should be detected as precisely as possible. I also cannot be sure that the sampling interval was exactly 60 seconds (as mentioned above), nor that it was really constant (there may be slight differences based on the temperature of the device).
When processing the samples, I want to get:
start time
the sampling interval
the sequence of the sample values
When reconstructing the samples, I need to convert it back to tuples:
time of the sample
value of the sample
Because of that, for the sequence with n samples, the time of the last sample should be equal to start_time + sampling_interval * (n - 1), and it should be reasonably close to the original end time stored in the database.
Think of it as the stored sample times oscillating slightly around the real sample times (a constant delay between the sampling and the insertion into the database is not a problem here).
I was thinking about calculating the mean value and the corrected standard deviation of the intervals between consecutive stored sample times.
Discontinuity detection: if a calculated interval is more than 3 sigma off the mean value, I would consider it a discontinuity of the sampled curve (say, the machine was switched off, or some outside event led to missing samples). In that case, I want to start processing a new sequence. (The sampling frequency could also have changed.)
Is there any well-known approach to the problem? If yes, can you point me to the article(s)? Or can you give me the name or acronym of the algorithm?
+1 to looking at the difference sequence. We can model the difference sequence as the sum of a low frequency truth (the true rate of the samples, slowly varying over time) and high frequency noise (the random delay to get the sample into the database). You want a low-pass filter to remove the latter.
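A minimal sketch of that, assuming the stored insertion times sit in a NumPy array; the moving-average window is an arbitrary choice, and the 3-sigma break test is the one proposed in the question:

    import numpy as np

    def estimate_intervals(insert_times, window=31, sigma=3.0):
        """Estimate the slowly varying sampling interval from the stored
        insertion times and flag likely discontinuities."""
        d = np.diff(np.asarray(insert_times, dtype=float))
        # Low-pass: centred moving average over `window` consecutive differences
        kernel = np.ones(window) / window
        smoothed = np.convolve(d, kernel, mode="same")
        # Break test from the question: a difference more than 3 sigma off
        resid = d - smoothed
        breaks = np.flatnonzero(np.abs(resid) > sigma * resid.std())
        return smoothed, breaks

Within each break-free segment you can then reconstruct times as start_time + k * interval, with the interval taken from the smoothed estimate (or from a least-squares fit of the stored times against the sample index k). If the gaps themselves inflate the standard deviation too much, a median filter or a MAD-based threshold is a more robust variant.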

Different clustering algorithms to cluster timeseries events

I have a very large input file with the following format:
ID \t time \t duration \t Description \t status
The status column is limited to either lower case a,s,i or upper case A,S,I, or a mix of the two (sample elements in the status column: a, si, I, asi, ASI, aSI, Asi, ...).
The ultimate purpose is to cluster events that start and end at "close enough" times, in order to recognize that those events contribute to a bigger event. "Close enough" here is determined by a threshold theta, let's say 1 hour for now (it could be 2 hours, or more, etc.). If two events have start times within 1 hour of each other and end times within 1 hour of each other, we cluster them together and say that they are part of a big event.
One other thing is that I only care about events that have all lower case letters in the status column.
So below is my initial input processing:
I filter the input file so that it only contains rows that have all lower case letters
This task is already done in Python using MapReduce and Hadoop
Then I take the output file and split it into "boxes" where each box represents a time range (ex: 11am-12pm - box1, 12pm-1pm - box2, 1pm-2pm - box 3, etc...)
Then use MapReduce again to sort each box based on start time (ascending order)
The final output is an ordered list of start times.
Now I want to develop an algorithm to group "similar events" together in the output above. Similar events are determined by start and end time.
My current algorithm is (a rough sketch follows the list):
pick the first item in the list
find any event in the list whose start time is within 1 hour of the first event's start time and whose duration is within +/- 1 hour of the first item's duration (the duration determines the end time)
then cluster them together (basically I want to cluster events that happen in the same time frame)
if no other event is found, move to the next available event (one that has not yet been clustered)
keep doing this until there are no more events to cluster
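A rough sketch of that pass, assuming events are (start, duration) tuples in seconds and theta is the 1-hour threshold; because the list is sorted by start time, the inner scan can stop as soon as start times drift past theta:

    def greedy_cluster(events, theta=3600):
        """Greedy single pass over events sorted by start time.
        Each event is a (start, duration) tuple in seconds."""
        events = sorted(events)
        used = [False] * len(events)
        clusters = []
        for i, (s_i, d_i) in enumerate(events):
            if used[i]:
                continue
            used[i] = True
            cluster = [(s_i, d_i)]
            for j in range(i + 1, len(events)):
                s_j, d_j = events[j]
                if s_j - s_i > theta:      # sorted by start, so we can stop early
                    break
                if not used[j] and abs(d_j - d_i) <= theta:
                    used[j] = True
                    cluster.append((s_j, d_j))
            clusters.append(cluster)
        return clusters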
I'm not sure if my algorithm will work, or whether it's efficient. I'm trying to find an algorithm that is better than O(n^2), so hierarchical clustering won't work. K-means might not work either, since I don't know ahead of time how many clusters I will need.
There could be other clustering algorithms that fit my case better. I think I might need some equations in my algorithm to calculate the distance within a cluster in order to determine similarity. I'm pretty new to this field, so any help directing me to the right path is greatly appreciated.
Thanks a lot in advance.
You could try the DBSCAN density-based clustering algorithm, which is O(n log n) (guaranteed ONLY when using an indexing data structure such as a k-d tree, ball tree, etc.; otherwise it's O(n^2)). Events that are part of a bigger event should lie in areas of high density. It seems to be a great fit for your problem.
You might need to implement your own distance function in order to perform neighborhood query (to find nearest events).
Also, it would be better to represent event time in POSIX-time format.
Here is an example.
Depending on the environment you use, DBSCAN implementations are available in Java (ELKI), Python (scikit-learn), and R (fpc).
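For instance, a minimal scikit-learn sketch, assuming each event has already been reduced to (start time in POSIX seconds, duration in seconds); eps, min_samples, and the Chebyshev metric (which requires neighbours to be within theta on both the start and the end coordinate) are choices to tune, not prescriptions:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_events(events, theta=3600, min_samples=2):
        """events: (start_posix_seconds, duration_seconds) pairs, lower-case rows only."""
        X = np.array([(s, s + d) for s, d in events], dtype=float)  # (start, end)
        # Chebyshev distance: a neighbour must be within theta on BOTH coordinates,
        # i.e. start times within theta AND end times within theta
        labels = DBSCAN(eps=theta, min_samples=min_samples,
                        metric="chebyshev").fit_predict(X)
        return labels   # -1 marks noise; other labels are cluster ids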

Looking for ideas for a simple pattern matching algorithm to run on a microcontroller

I'm working on a project to recognize simple audio patterns. I have two data sets, each made up of between 4 and 32 note/duration pairs. One set is predefined, the other is from an incoming data stream. The length of the two strongly correlated data sets is often different, but roughly the same "shape". My goal is to come up with some sort of ranking as to how well the two data sets correlate/match.
I have converted the incoming frequencies to pitch and shifted the incoming data stream's pitch so that its average pitch matches that of the predefined data set. I also stretch/compress the incoming data set's durations to match the overall duration of the predefined set. Here are two graphical examples of data that should be ranked as strongly correlated:
http://s2.postimage.org/FVeG0-ee3c23ecc094a55b15e538c3a0d83dd5.gif
(Sorry, as a new user I couldn't directly post images)
I'm doing this on an 8-bit microcontroller, so resources are minimal. Speed is less of an issue; a second or two of processing isn't a deal breaker.
It wouldn't surprise me if there is an obvious solution, I've just been staring at the problem too long. Any ideas?
Thanks in advance...
Couldn't see the graphic, but... Divide the spectrum into bins. You've probably done this already, but they may be too fine. Depending on your application, consider dividing the spectrum into, say, 16 or 32 bins, perhaps logarithmically, since that is how we hear. Then compare the ratios of the power in each bin. E.g., compare the ratio of 500 Hz power to 1000 Hz power in the first sample with that same ratio in the second sample. That gets rid of any problem with unequal amplitudes between the samples.
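As a rough sketch of the ratio comparison (plain NumPy for clarity rather than microcontroller code; the log-ratio score is just one way to turn the idea into a number):

    import numpy as np

    def ratio_distance(power_a, power_b, eps=1e-9):
        """Compare two binned spectra by their relative bin levels; normalising
        each spectrum first removes the overall amplitude, so only the ratios
        between bins matter. Lower score = better match."""
        a = np.asarray(power_a, dtype=float) + eps
        b = np.asarray(power_b, dtype=float) + eps
        a /= a.sum()
        b /= b.sum()
        return float(np.sum(np.abs(np.log(a) - np.log(b))))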
1D signal matching is often done using convolution. However, this may be processor-intensive.
A simpler algorithm is to first check whether the durations of each note in the two signals are roughly equal, and then check whether the next-frequency patterns of the two signals are the same. By next-frequency pattern I mean decomposing the ordered list of frequencies into an ordered list of whether the next frequency is higher or lower. So something that goes 500 Hz to 1000 Hz to 700 Hz to 400 Hz would simply become Higher-Lower-Lower. This may be good enough, depending on your purposes.
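A minimal sketch of that contour comparison; the '=' case for repeated notes and the fractional match score are additions for illustration:

    def contour(freqs):
        """Reduce a note sequence to its up/down pattern, e.g.
        [500, 1000, 700, 400] -> ['H', 'L', 'L']."""
        return ['H' if b > a else 'L' if b < a else '='
                for a, b in zip(freqs, freqs[1:])]

    def contour_match(reference, incoming):
        """Fraction of contour steps that agree (1.0 = identical shape)."""
        r, i = contour(reference), contour(incoming)
        n = min(len(r), len(i))
        return sum(1 for x, y in zip(r, i) if x == y) / n if n else 0.0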
