Given a series of timestamps of network I/O activities, what's the best algorithm to calculate activeness for all timeslots in a day?
For example, choose slot size =30 seconds, then 1 day = 24*60*60/30 =2880 slots. For one timeslot, label it as active or inactive based on the timeseries data given (active if there is activity in this slot, inactive otherwise). Then calculate activeness ratio = (# active slots) / (# total slots).
Suggestions?
I'm not sure what you mean by "best" in this context, but bear with me.
So, we need a function that will take a number of time slots in a 24 hour day, and a timestamp, and return the time slot to which the timestamp belongs. Something like:
int GetSlot(int numberOfSlots, int secondsSinceMidnight)
{
int secondsInSlot = 24 * 60 * 60 / numberOfSlots;
return secondsSinceMidnight / secondsInSlot;
}
Now, create an empty map data structure from time slots to timestamp counts. Start looping through the set of timestamps. For each timestamp, call GetSlot; call its slot assignedSlot. We check the map data structure to see if it contains an entry for assignedSlot. If it does, we increase the mapped counter by one. Otherwise, we add a new entry for assignedSlot and set the timestamp count to one. Continue for all timestamps.
At the end, we have one entry in the map data structure for each active slot. We know the total number of slots, so getting the average number of active slots is easy: map.size() / numberOfSlots. We've remembered more information than you technically need, but still.
This is O(n) time and O(n) space.
An alternative would be to sort the timestamps in ascending order, then loop over them, counting active time slots as you go. This could be made O(n log n) time and O(1) space.
If you have a bunch of timestamps that are tightly clustered within a few time slots, the first approach will most likely be more efficient. If you have fewer timestamps but they're more evenly distributed throughout time slots, the second approach could be better.
Related
I am building an agent-based model for product usage. I am trying to develop a function to decide whether the user is using the product at a given time, while incorporating randomness.
So, say we know the user spends a total of 1 hour per day using the product, and we know the average distribution of this time (e.g., most used at 6-8pm).
How can I generate a set of usage/non-usage times (i.e., during each 10 minute block is the user active or not) while ensuring that at the end of the day the total active time sums to one hour.
In most cases I would just run the distributor without concern for the total, and then at the end normalize by making it proportional to the total target time so the total was 1 hour. However, I can't do that because time blocks must be 10 minutes. I think this is a different question because I'm really not computing time ranges, I'm computing booleans to associate with different 10 minute time blocks (e.g., the user was/was not active during a given block).
Is there a standard way to do this?
I did some more thinking and figured it out, if anyone else is looking at this.
The approach to take is this: You know the allowed number n of 10-minute time blocks for a given agent.
Iterate n times, and on each iteration select a time block out of the day subject to your activity distribution function.
Main point is to iterate over the number of time blocks you want to place, not over the entire day.
There's a large set of objects. Set is dynamic: objects can be added or deleted any time. Let's call the total number of objects N.
Each object has two properties: mass (M) and time (T) of last update.
Every X minutes a small batch of those should be selected for processing, which updates their T to current time. Total M of all objects in a batch is limited: not more than L.
I am looking to solve three tasks here:
find a next batch object picking algorithm;
introduce object classes: simple, priority (granted fit into at least each n-th batch) and frequent (fit into each batch);
forecast system capacity exhaust (time to add next server = increase L).
What kind of model best describes such a system?
The whole thing is about a service that processes the "objects" in time intervals. Each object should be "measured" each N hours. N can vary in a range. X is fixed.
Objects are added/deleted by humans. N grows exponentially, rather slow, with some spikes caused by publications. Of course forecast can't be precise, just some estimate. M varies from 0 to 1E7 with exponential distribution, most are closer to 0.
I see there can be several strategies here:
A. full throttle - pack each batch as much as close to 100%. As N grows, average interval a particular object gets a hit will grow.
B. equal temperament :) - try to keep an average interval around some value. A batch fill level will be growing from some low level. When it reaches closer to 100% – time to get more servers.
C. - ?
Here is a pretty complete design for your problem.
Your question does not optimally match your description of the system this is for. So I'll assume that the description is accurate.
When you schedule a measurement you should pass an object, a first time it can be measured, and when you want the measurement to happen by. The object should have a weight attribute and a measured method. When the measurement happens, the measured method will be called, and the difference between your classes is whether, and with what parameters, they will reschedule themselves.
Internally you will need a couple of priority queues. See http://en.wikipedia.org/wiki/Heap_(data_structure) for details on how to implement one.
The first queue is by time the measurement can happen, all of the objects that can't be measured yet. Every time you schedule a batch you will use that to find all of the new measurements that can happen.
The second queue is of measurements that are ready to go now, and is organized by which scheduling period they should happen by, and then weight. I would make them both ascending. You can schedule a batch by pulling items off of that queue until you've got enough to send off.
Now you need to know how much to put in each batch. Given the system that you have described, a spike of events can be put in manually, but over time you'd like those spikes to smooth out. Therefore I would recommend option B, equal temperament. So to do this, as you put each object into the "ready now" queue, you can calculate its "average work weight" as its weight divided by the number of periods until it is supposed to happen. Store that with the object, and keep a running total of what run rate you should be at. Every period I would suggest that you keep adding to the batch until one of three conditions has been met:
You run out of objects.
You hit your maximum batch capacity.
You exceed 1.1 times your running total of your average work weight. The extra 10% is because it is better to use a bit more capacity now than to run out of capacity later.
And finally, capacity planning.
For this you need to use some heuristic. Here is a reasonable one which may need some tweaking for your system. Maintain an array of your past 10 measurements of running total of average work weight. Maintain an "exponentially damped average of your high water mark." Do that by updating each time according to the formula:
average_high_water_mark
= 0.95 * average_high_water_mark
+ 0.5 * max(last 10 running work weight)
If average_high_water_mark ever gets within, say, 2 servers of your maximum capacity, then add more servers. (The idea is that a server should be able to die without leaving you hosed.)
I think answer A is good. Bin packing is to maximize or minimize and you have only one batch. Sort the objects by m and n.
I am trying to implement something along the lines of a Moving Average.
In this system, there are no guarantees of a quantity of Integers per time period. I do need to calculate the Average for each period. Therefore, I cannot simply slide over the list of integers by quantity as this would not be relative to time.
I can keep a record of each value with its associated time. We will have a ton of data running through the system so it is important to 'garbage collect' the old data.
It may also be important to note that I need to save the average to disk after the end of each period. However, they may be some overlap between saving the data to disk and having data from a new period being introduced.
What are some efficient data structures I can use to store, slide, and garbage collect this type of data?
The description of the problem and the question conflict: what is described is not a moving average, since the average for each time period is distinct. ("I need to compute the average for each period.") So that admits a truly trivial solution:
For each period, maintain a count and a sum of observations.
At the end of the period, compute the average
I suspect that what is actually wanted is something like: Every second (computation period), I want to know the average observation over the past minute (aggregation period).
This can be solved simply with a circular buffer of buckets, each of which represents the value for one computation period. There will be aggregation period / computation period such buckets. Again, each bucket contains a count and a sum. Also, a current total/sum and a cumulative total sum/count are maintained. Each observation is added to the current total/sum.
At the end of a each computation period:
subtract the sum/count for the (circularly) first period from the cumulative sum/count
add the current sum/count to the cumulative sum/count
report the average based on the cumulative sum/count
replace the values of the first period with the current sum/count
clear the current sum/count
advance the origin of the circular buffer.
If you really need to be able to compute at any time at all the average of the previous observations over some given period, you'd need a more complicated data structure, basically an expandable circular buffer. However, such precise computations are rarely actually necessary, and a bucketed approximation, as per the above algorithm, is usually adequate for data purposes, and is much more sustainable over the long term for memory management, since its memory requirements are fixed from the start.
I have a dataset containing > 100,000 records where each record has a timestamp.
This dataset has been aggregated from several "controller" nodes which each collect their data from a set of children nodes. Each controller collects these records periodically, (e.g. once every 5 minutes or once every 10 minutes), and it is the controller that applies the timestamp to the records.
E.g:
Controller One might have 20 records timestamped at time t, 23 records timestamped at time t + 5 minutes, 33 records at time t + 10 minutes.
Controller Two might have 30 records timestamped at time (t + 2 minutes) + 10 minutes, 32 records timestamped at time (t + 2 minutes) + 20 minutes, 41 records timestamped at time (t + 2 minutes) + 30 minutes etcetera.
Assume now that the only information you have is the set of all timestamps and a count of how many records appeared at each timestamp. That is to say, you don't know i) which sets of records were produced by which controller, ii) the collection interval of each controller or ii) the total number of controllers. Is there an algorithm which can decompose the set of all timestamps into individual subsets such that the variance in difference between consecutive (ordered) elements of each given subset is very close to 0, while adding any element from one subset i to another subset j would increase this variance? Keep in mind, for this dataset, a single controller's "periodicity" could fluctuate by +/- a few seconds because of CPU timing/network latency etc.
My ultimate objective here is to establish a) how many controllers there are and b) the sampling interval of each controller. So far I've been thinking about the problem in terms of periodic functions, so perhaps there are some decomposition methods from that area that could be useful.
The other point to make is that I don't need to know which controller each record came from, I just need to know the sampling interval of each controller. So e.g. if there were two controllers that both started sampling at time u, and one sampled at 5-minute intervals and the other at 50-minute intervals, it would be hard to separate the two at the 50-minute mark because 5 is a factor of 50. This doesn't matter, so long as I can garner enough information to work out the intervals of each controller despite these occasional overlaps.
One basic approach would be to perform an FFT decomposition (or, if you're feeling fancy, a periodogram) of the dataset and look for peaks in the resulting spectrum. This will give you a crude approximation of the periods of the controllers, and may even give you an estimate of their number (and by looking at the height of the peaks, it can tell you how many records were logged).
I have messages coming into my program with millisecond resolution (anywhere from zero to a couple hundred messages a millisecond).
I'd like to do some analysis. Specifically, I want to maintain multiple rolling windows of the message counts, updated as messages come in. For example,
# of messages in last second
# of messages in last minute
# of messages in last half-hour divided by # of messages in last hour
I can't just maintain a simple count like "1,017 messages in last second", since I won't know when a message is older than 1 second and therefore should no longer be in the count...
I thought of maintaining a queue of all the messages, searching for the youngest message that's older than one second, and inferring the count from the index. However, this seems like it would be too slow, and would eat up a lot of memory.
What can I do to keep track of these counts in my program so that I can efficiently get these values in real-time?
This is easiest handled by a cyclic buffer.
A cyclic buffer has a fixed number of elements, and a pointer to it. You can add an element to the buffer, and when you do, you increment the pointer to the next element. If you get past the fixed-length buffer you start from the beginning. It's a space and time efficient way to store "last N" items.
Now in your case you could have one cyclic buffer of 1,000 counters, each one counting the number of messages during one millisecond. Adding all the 1,000 counters gives you the total count during last second. Of course you can optimize the reporting part by incrementally updating the count, i.e. deduct form the count the number you overwrite when you insert and then add the new number.
You can then have another cyclic buffer that has 60 slots and counts the aggregate number of messages in whole seconds; once a second, you take the total count of the millisecond buffer and write the count to the buffer having resolution of seconds, etc.
Here C-like pseudocode:
int msecbuf[1000]; // initialized with zeroes
int secbuf[60]; // ditto
int msecptr = 0, secptr = 0;
int count = 0;
int msec_total_ctr = 0;
void msg_received() { count++; }
void every_msec() {
msec_total_ctr -= msecbuf[msecptr];
msecbuf[msecptr] = count;
msec_total_ctr += msecbuf[msecptr];
count = 0;
msecptr = (msecptr + 1) % 1000;
}
void every_sec() {
secbuf[secptr] = msec_total_ctr;
secptr = (secptr + 1) % 60;
}
You want exponential smoothing, otherwise known as an exponential weighted moving average. Take an EWMA of the time since the last message arrived, and then divide that time into a second. You can run several of these with different weights to cover effectively longer time intervals. Effectively, you're using an infinitely long window then, so you don't have to worry about expiring data; the reducing weights do it for you.
For the last millisecord, keep the count. When the millisecord slice goes to the next one, reset count and add count to a millisecond rolling buffer array. If you keep this cummulative, you can extract the # of messages / second with a fixed amount of memory.
When a 0,1 second slice (or some other small value next to 1 minute) is done, sum up last 0,1*1000 items from the rolling buffer array and place that in the next rolling buffer. This way you kan keep the millisecord rolling buffer small (1000 items for 1s max lookup) and the buffer for lookup the minute also (600 items).
You can do the next trick for whole minutes of 0,1 minutes intervals. All questions asked can be queried by summing (or when using cummulative , substracting two values) a few integers.
The only disadvantage is that the last sec value wil change every ms and the minute value only every 0,1 secand the hour value (and derivatives with the % in last 1/2 hour) every 0,1 minute. But at least you keep your memory usage at bay.
Your rolling display window can only update so fast, lets say you want to update it 10 times a second, so for 1 second's worth of data, you would need 10 values. Each value would contain the number of messages that showed up in that 1/10 of a second. Lets call these values bins, each bin holds 1/10 of a second's worth of data. Every 100 milliseconds, one of the bins gets discarded and a new bin is set to the number of messages that have show up in that 100 milliseconds.
You would need an array of 36K bins to hold an hour's worth information about your message rate if you wanted to preserve a precision of 1/10 of a second for the whole hour. But that seems overkill.
But I think it would be more reasonable to have the precision drop off as the time inteval gets larger.
Maybe you keep 1 second's worth of data accurate to 100 milliseconds, 1 minutes worth of data accurate to the second, 1 hour's worth of data accurate to the minute, and so on.
I thought of maintaining a queue of all the messages, searching for the youngest message that's older than one second, and inferring the count from the index. However, this seems like it would be too slow, and would eat up a lot of memory.
A better idea would be maintaining a linked list of the messages, adding new messages to the head (with a timestamp), and popping them from the tail as they expire. Or even not pop them - just keep a pointer to the oldest message that came in in the desired timeframe, and advance it towards the head when that message expires (this allows you to keep track of multiply timeframes with one list).
You could compute the count when needed by walking from the tail to the head, or just store the count separately, incrementing it whenever you add a value to the head, and decrementing it whenever you advance the tail.