Log data reduction for variable bandwidth data link - algorithm

I have an embedded system which generates samples (16-bit numbers) at 1ms intervals. The variable uplink bandwidth can at best transfer a sample every 5ms, so I am looking for ways to adaptively reduce the data rate while minimizing the loss of important information -- in this case the minimum and maximum values in a time interval.
A scheme which I think should work involves sparse coding and a variation of lossy compression. It works like this:
The system will internally store the min and max values during a 10ms interval.
The system will internally queue a limited number (say 50) of these data pairs.
No loss of min or max values is allowed but the time interval in which they occur may vary.
When the queue gets full, neighboring data pairs will be combined starting at the end of the queue so that the converted min/max pairs now represent 20ms intervals.
The scheme should be iterative so that further interval combining to 40ms, 80ms etc is done when necessary.
The scheme should be linearly weighted across the length of the queue so that there is no combining for the newest data and maximum necessary combining of the oldest data.
For example with a queue of length 6, successive data reduction should cause the data pairs to cover these intervals:
initial: 10 10 10 10 10 10 (60ms, queue full)
70ms: 10 10 10 10 10 20
80ms: 10 10 10 10 20 20
90ms: 10 10 20 20 20 20
100ms: 10 10 20 20 20 40
110ms: 10 10 20 20 40 40
120ms: 10 20 20 20 40 40
130ms: 10 20 20 40 40 40
140ms: 10 20 20 40 40 80
New samples are added on the left, data is read out from the right.
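For concreteness, here is a minimal sketch (in Python, purely to illustrate the data flow) of the queue and one possible merge step. It uses a simple policy of merging the oldest adjacent pair of equal intervals rather than the exact linear weighting described above; the min/max values themselves are never lost, only the interval they cover grows.

from collections import deque

QUEUE_LEN = 50
queue = deque()              # index 0 = newest pair, index -1 = oldest pair

def push_pair(lo, hi, interval_ms=10):
    queue.appendleft((lo, hi, interval_ms))
    if len(queue) > QUEUE_LEN:
        merge_one_pair()

def merge_one_pair():
    # Walk from the oldest end and merge the first neighbouring pair with
    # equal intervals; min and max are preserved exactly, only the time
    # interval they refer to doubles.
    items = list(queue)      # items[0] = newest
    for i in range(len(items) - 1, 0, -1):
        lo_new, hi_new, t_new = items[i - 1]
        lo_old, hi_old, t_old = items[i]
        if t_new == t_old:
            items[i - 1:i + 1] = [(min(lo_new, lo_old), max(hi_new, hi_old), t_new + t_old)]
            break
    else:
        # Fallback: no equal neighbours exist, just merge the two oldest entries.
        lo_a, hi_a, t_a = items[-2]
        lo_b, hi_b, t_b = items[-1]
        items[-2:] = [(min(lo_a, lo_b), max(hi_a, hi_b), t_a + t_b)]
    queue.clear()
    queue.extend(items)

For the queue itself a fixed-size ring buffer (or a deque, as here) is sufficient: all operations happen at the two ends or in a single linear scan, so a linked list or tree would add little.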
This idea obviously falls into the categories of lossy-compression and sparse-coding.
I assume this is a problem that must occur often in data logging applications with limited uplink bandwidth, so some "standard" solution might have emerged.
I have deliberately simplified and left out other issues such as time stamping.
Questions:
Are there already algorithms which do this kind of data logging? I am not looking for standard lossy picture or video compression algorithms, but something more specific to data logging as described above.
What would be the most appropriate implementation for the queue? Linked list? Tree?

The term you are looking for is "lossy compression" (See: http://en.wikipedia.org/wiki/Lossy_compression ). The optimal compression method depends on various aspects such as the distribution of your data.

As I understand it, you want to transmit min() and max() of all samples in a time period,
e.g. you want to transmit min/max every 10ms while taking samples every 1ms?
If you do not need the individual samples, you can simply compare each new sample against the running min/max:
#include <stdint.h>

void sample_loop(void)
{
    int16_t sample, min = INT16_MAX, max = INT16_MIN; /* first sample always overwrites the initial values */
    unsigned i = 0;

    while (1) {
        sample = getSample();          /* one 16-bit sample every 1ms */
        if (sample < min)
            min = sample;
        if (sample > max)
            max = sample;
        if (++i % 10 == 0) {           /* every 10th sample, i.e. every 10ms */
            send(min, max);
            /* if each period should be handled separately: min = INT16_MAX; max = INT16_MIN; */
        }
    }
}
You can also save bandwidth by sending data only on changes (this depends on the sample data: if it does not change very quickly you will save a lot).

Define a combination cost function that matches your needs, e.g. (len(i) + len(i+1)) / i^2, then iterate over the array to find the "cheapest" neighboring pair to merge.
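A minimal sketch of that search, assuming the queue is a Python list of (min, max, length_ms) tuples with index 0 being the newest entry, and using the example cost above with i + 1 in place of i to avoid division by zero:

def cheapest_merge_index(pairs):
    # pairs[i] = (lo, hi, length_ms), with i = 0 the newest entry.
    # Older pairs (larger i) are cheaper to merge, newer ones expensive.
    best_i, best_cost = None, float("inf")
    for i in range(len(pairs) - 1):
        cost = (pairs[i][2] + pairs[i + 1][2]) / float((i + 1) ** 2)
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i            # merge pairs[best_i] and pairs[best_i + 1]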

Related

Algorithm / data structure for rate of change calculation with limited memory

Certain sensors are to trigger a signal based on the rate of change of the value rather than a threshold.
For instance, heat detectors in fire alarms are supposed to trigger an alarm quicker if the rate of temperature rise is higher: A temperature rise of 1K/min should trigger an alarm after 30 minutes, a rise of 5K/min after 5 minutes and a rise of 30K/min after 30 seconds.
 
I am wondering how this is implemented in embedded systems, where resources are scarce. Is there a clever data structure to minimize the data stored?
 
The naive approach would be to measure the temperature every 5 seconds or so and keep the data for 30 minutes. On these data one can calculate change rates over arbitrary time windows. But this requires a lot of memory.
 
I thought about small windows (e.g. 10 seconds) for which min and max are stored, but this would not save much memory.
 
From a mathematical point of view, the examples you have described can be greatly simplified:
1K/min for 30 mins equals a total change of 30K
5K/min for 5 mins equals a total change of 25K
Obviously there is some adjustment to be made because you have picked round numbers for the example, but it sounds like what you care about is having a single threshold for the total change. This makes sense because taking the integral of a differential results in just a delta.
However, if we disregard the numeric example and just focus on your original question then here are some answers:
First, it has already been mentioned in the comments that one byte every five seconds for half an hour is really not very much memory at all for almost any modern microcontroller, as long as you are able to keep your main RAM turned on between samples, which you usually can.
If however you need to discard the contents of RAM between samples to preserve battery life, then a simpler method is just to calculate one differential at a time.
In your example you want to have a much higher sample rate (every 5 seconds) than the time you wish to calculate the delta over (eg: 30 mins). You can reduce your storage needs to a single data point if you make your sample rate equal to your delta period. The single previous value could be stored in a small battery retained memory (eg: backup registers on STM32).
Obviously if you choose this approach you will have to compromise between accuracy and latency, but maybe 30 seconds would be a suitable timebase for your temperature alarm example.
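A minimal sketch of that single-datapoint approach, with hypothetical read_temperature() / trigger_alarm() hooks and illustrative numbers (the period and threshold here are assumptions, not recommendations):

import time

DELTA_PERIOD_S = 30        # sample once per delta period (assumption)
ALARM_DELTA_K = 15.0       # rise over one period that triggers the alarm (30 K/min over 30 s)

previous = read_temperature()          # the only value that must be retained between samples
while True:
    time.sleep(DELTA_PERIOD_S)
    current = read_temperature()
    if current - previous >= ALARM_DELTA_K:
        trigger_alarm()
    previous = current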
You can also set several thresholds of K/sec, and then allocate counters to count how many consecutive times each threshold has been exceeded. This requires only one extra integer per threshold.
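And a minimal sketch of those per-threshold counters, assuming a 5-second sample period; the threshold/duration pairs are derived from the rates in the question and trigger_alarm() is a hypothetical hook:

SAMPLE_PERIOD_S = 5
THRESHOLDS = [                 # (rise in K per 5-second sample, consecutive samples required)
    (1.0 / 12, 360),           # ~1 K/min sustained for 30 minutes
    (5.0 / 12, 60),            # ~5 K/min sustained for 5 minutes
    (2.5, 6),                  # ~30 K/min sustained for 30 seconds
]
counters = [0] * len(THRESHOLDS)

def on_sample(delta_k):
    # delta_k = temperature change since the previous 5-second sample
    for idx, (limit, required) in enumerate(THRESHOLDS):
        counters[idx] = counters[idx] + 1 if delta_k >= limit else 0
        if counters[idx] >= required:
            trigger_alarm()    # hypothetical alarm hook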
In signal processing terms, the procedure you want to perform is:
Apply a low-pass filter to smooth quick variations in the temperature
Take the derivative of its output
The cut-off frequency of the filter would be set according to the time frame. There are 2 ways to do this.
You could apply a FIR (finite impulse response) filter, which is a weighted moving average over the time frame of interest. Naively, this requires a lot of memory, but it's not bad if you do a multi-stage decimation first to reduce your sample rate. It ends up being a little complicated, but you have fine control over the response.
You could apply an IIR (infinite impulse response) filter, which utilizes feedback of the output. The exponential moving average is the simplest example of this. These filters require far less memory -- only a few samples' worth -- but your control over the precise shape of the response is limited. A classic example like the Butterworth filter would probably be great for your application, though.
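As a minimal sketch of the IIR route, here is an exponential moving average followed by a discrete derivative; the smoothing factor and sample period are illustrative assumptions:

ALPHA = 0.05               # smoothing factor: smaller = lower cut-off frequency (assumption)
DT_S = 5.0                 # sample period in seconds (assumption)

ema = None
prev_ema = None

def smoothed_rate(sample):
    # Returns the low-pass-filtered rate of change in K per second,
    # or None until two filtered values are available.
    global ema, prev_ema
    ema = sample if ema is None else ALPHA * sample + (1.0 - ALPHA) * ema
    rate = None if prev_ema is None else (ema - prev_ema) / DT_S
    prev_ema = ema
    return rate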

NiFi MergeRecords leaving out one file

I'm using NiFi to take in some user data and combine all the JSONs into one record. The MergeRecord processor is working just like I need, except it always leaves out one record (usually the same one every time). The processor is set to run every 60 seconds. I can't understand why, because there are only 56 records to merge. I've included images below for any help y'all may have.
Firstly, you have 56 FlowFiles; that does not necessarily mean 56 Records unless you have 1 Record per FlowFile.
You are using MergeRecord which counts Records, not files.
Your current config is set to Min 50 - Max 1000 Records
If you have 56 files with 1 Record in each, then merging 50 files is enough to meet the Minimum condition and release the bucket.
You also say Merge is set to run every 60 seconds, and perhaps this is not doing what you think it is. In almost all cases, Merge should be left to the default 0 sec schedule.
NiFi has no idea what "all" means; it takes an input and works on it - it does not know if or when the next input will come.
If every FlowFile is 1 Record, and it is categorically always 56 and that will never change, then your setting could be Min 56 - Max 56 and that will always merge 56 Records at a time.
However, that is very inflexible - if it suddenly changes to 57, you need to modify the flow.
Instead, you could set the Min-Max to very high numbers, say 10,000-20,000 and then set a Max Bin Age to 60 seconds (and the processor scheduling back to 0 sec). This would have the effect of merging every Record that enters the processor until A) 10-20k Records have been merged, or B) 60 seconds expire.
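Concretely, that suggestion corresponds to something like the following MergeRecord settings (a sketch; check the exact property names in your NiFi version):

Minimum Number of Records : 10000
Maximum Number of Records : 20000
Max Bin Age               : 60 sec
Run Schedule (Scheduling tab) : 0 sec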
Example scenarios:
A) All 56 arrive within the first 2 seconds of the flow starting
All 56 are merged into 1 file 60 seconds after the first file arrives
B) 53 arrive within the first 60 seconds, 3 arrive in the second 60 seconds
The first 53 are merged into 1 file 60 seconds after the first file arrives; the last 3 are merged into another file 60 seconds after the first of the 3 arrives
C) 10,000 arrive in the first 5 seconds
All 10k will merge immediately into 1 file; they will not wait for 60 seconds

Evenly schedule timed items into fixed size containers

I got a task at work to evenly schedule timed commercial items into pre-defined commercial breaks (containers).
Each campaign has a set of commercials with or without a spreading order. I need to allow users to choose multiple campaigns and distribute all the commercials to best fit the breaks within a time window.
Example of Campaign A:
Item | Duration | # of times to schedule | Order
A1   | 15 sec   | 2                      | 1
A2   | 25 sec   | 2                      | 2
A3   | 30 sec   | 2                      | 3
Required outcome:
Each item should appear only once in a break, no repeating.
If there is a specific order, try to best fit while keeping the order; if there is no order, shuffle the items.
At the end of the process the breaks should contain an even amount of commercial time.
An ideal spread would fully fit all chosen campaigns into the breaks.
For example: Campaign {Item,Duration,#ofTimes,Order}
Campaign A, which has the set {A1,15,2,1},{A2,25,2,2},{A3,10,1,3}
Campaign B, which has the set {B1,20,2,2},{B2,35,3,1}
Campaign C, which has the set {C1,10,1,1},{C2,15,2,3},{C3,15,1,2},{C4,40,1,4}
A client will choose to schedule those campaigns on a specific date that holds 5 breaks of 60 seconds each.
A good outcome would result in:
Container 1: {A1,15}{B2,35}{C1,10} total of 60 sec
Container 2: {C3,15}{A2,25}{B1,20} total of 60 sec
Container 3: {A3,10}{C2,15}{B2,35} total of 60 sec
Container 4: {C4,40}{B1,20} total of 60 sec
Container 5: {C2,15}{A3,10}{B2,35} total of 60 sec
Of course, it's rare that everything will fit so perfectly in real-life examples.
There are so many combinations with a large number of items that I'm not sure how to go about it. The order of items inside a break needs to be dynamically calculated so that the end result best fits all the items into the breaks.
If this architecture is poor and someone has a better idea (like giving priority to items instead of an order and scheduling based on priority), I'll be glad to hear it.
It seems like simulated annealing might be a good way to approach this problem. Just incorporate your constraints (keeping the order, even spreading, and fitting into the 60-sec frame) into the scoring function. Your random neighbor function might just swap 2 items with each other or move an item to a different frame.
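A minimal sketch of such an annealing loop, where score() returns a penalty (lower is better) encoding the ordering, even-spread and 60-second constraints, and random_neighbor() does the swap/move described above; both are hypothetical hooks you would supply:

import math
import random

def anneal(initial, score, random_neighbor,
           t_start=1.0, t_end=0.001, cooling=0.995):
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        candidate = random_neighbor(current)   # e.g. swap two items or move one to another break
        cand_score = score(candidate)
        # Always accept improvements; accept worse schedules with a
        # temperature-dependent probability to escape local optima.
        if cand_score < current_score or random.random() < math.exp((current_score - cand_score) / t):
            current, current_score = candidate, cand_score
            if cand_score < best_score:
                best, best_score = candidate, cand_score
        t *= cooling
    return best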

Algorithm for aggregating stock chart datapoints out of many DB entries

I have a database with stock values in a table, for example:
id - unique id for this entry
stockId - ticker symbol of the stock
value - price of the stock
timestamp - timestamp of that price
I would like to create separate arrays for timeframes of 24 hours, 7 days and 1 month from my database entries, each array containing datapoints for a stock chart.
For some stockIds, I have just a few data points per hour, for others it could be hundreds or thousands.
My question:
What is a good algorithm to "aggregate" the possibly many datapoints into a few - for example, for the 24-hour chart I would like to have at most 10 datapoints per hour. How do I handle exceptionally high / low values?
What is the common approach in regards to stock charts?
Thank you for reading!
Some options: (assuming 10 points per hour, i.e. one roughly every 6 minutes)
For every 6 minute period, pick the data point closest to the centre of the period
For every 6 minute period, take the average of the points over that period
For an hour period, find the maximum and minimum for each 4-minute period, then pick the 5 largest maxima and the 5 smallest minima from these respective sets (4 minutes is somewhat arbitrarily chosen).
I originally thought to pick the 5 minimum points and the 5 maximum points such that each maximum point is at least 8 minutes apart, and similarly for minimum points.
The 8 minutes here is so we don't have all the points stacked up on each other. Why 8 minutes? At an even distribution, 60/5 = 12 minutes, so just moving a bit away from that gives us 8 minutes.
But, in terms of the implementation, the 4 minutes approach will be much simpler and should give similar results.
You'll have to see which one gives you the most desirable results. The last one is likely to give a decent indication of variation across the period, whereas the second one is likely to have a more stable graph. The first one can be a bit more erratic.
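A minimal sketch of the first two options above, assuming the rows come out of the database as (unix_timestamp, value) pairs sorted by time, with a 6-minute bucket:

BUCKET_S = 6 * 60          # 10 datapoints per hour

def closest_to_centre(rows):
    # Option 1: one point per bucket, the sample nearest the bucket centre.
    buckets = {}
    for ts, value in rows:
        buckets.setdefault(ts // BUCKET_S, []).append((ts, value))
    return [min(samples, key=lambda p: abs(p[0] - (b * BUCKET_S + BUCKET_S / 2)))
            for b, samples in sorted(buckets.items())]

def bucket_average(rows):
    # Option 2: one point per bucket, the average of its samples.
    buckets = {}
    for ts, value in rows:
        buckets.setdefault(ts // BUCKET_S, []).append(value)
    return [(b * BUCKET_S + BUCKET_S / 2, sum(v) / len(v))
            for b, v in sorted(buckets.items())]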

Algorithm to decompose set of timestamps into subsets with even temporal spacing

I have a dataset containing > 100,000 records where each record has a timestamp.
This dataset has been aggregated from several "controller" nodes which each collect their data from a set of children nodes. Each controller collects these records periodically, (e.g. once every 5 minutes or once every 10 minutes), and it is the controller that applies the timestamp to the records.
E.g:
Controller One might have 20 records timestamped at time t, 23 records timestamped at time t + 5 minutes, 33 records at time t + 10 minutes.
Controller Two might have 30 records timestamped at time (t + 2 minutes) + 10 minutes, 32 records timestamped at time (t + 2 minutes) + 20 minutes, 41 records timestamped at time (t + 2 minutes) + 30 minutes etcetera.
Assume now that the only information you have is the set of all timestamps and a count of how many records appeared at each timestamp. That is to say, you don't know i) which sets of records were produced by which controller, ii) the collection interval of each controller, or iii) the total number of controllers. Is there an algorithm which can decompose the set of all timestamps into individual subsets such that the variance in the difference between consecutive (ordered) elements of each given subset is very close to 0, while adding any element from one subset i to another subset j would increase this variance? Keep in mind that, for this dataset, a single controller's "periodicity" could fluctuate by +/- a few seconds because of CPU timing/network latency etc.
My ultimate objective here is to establish a) how many controllers there are and b) the sampling interval of each controller. So far I've been thinking about the problem in terms of periodic functions, so perhaps there are some decomposition methods from that area that could be useful.
The other point to make is that I don't need to know which controller each record came from, I just need to know the sampling interval of each controller. So e.g. if there were two controllers that both started sampling at time u, and one sampled at 5-minute intervals and the other at 50-minute intervals, it would be hard to separate the two at the 50-minute mark because 5 is a factor of 50. This doesn't matter, so long as I can garner enough information to work out the intervals of each controller despite these occasional overlaps.
One basic approach would be to perform an FFT decomposition (or, if you're feeling fancy, a periodogram) of the dataset and look for peaks in the resulting spectrum. This will give you a crude approximation of the periods of the controllers, and may even give you an estimate of their number (and by looking at the height of the peaks, it can tell you how many records were logged).
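A minimal sketch of that approach (using numpy purely for illustration), assuming the timestamps are available as seconds; it bins the events into an evenly sampled count series and reports the strongest spectral peaks as candidate controller periods:

import numpy as np

def dominant_periods(timestamps, bin_s=1.0, top_n=5):
    # If you also have a record count per timestamp, it can be passed to
    # np.histogram via the weights argument instead of repeating timestamps.
    ts = np.sort(np.asarray(timestamps, dtype=float))
    edges = np.arange(ts[0], ts[-1] + bin_s, bin_s)
    counts, _ = np.histogram(ts, bins=edges)
    counts = counts - counts.mean()                  # drop the DC component
    spectrum = np.abs(np.fft.rfft(counts))
    freqs = np.fft.rfftfreq(len(counts), d=bin_s)
    strongest = np.argsort(spectrum[1:])[::-1][:top_n] + 1   # skip the zero-frequency bin
    # Each result is (period in seconds, relative peak strength).
    return [(1.0 / freqs[i], spectrum[i]) for i in strongest]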
