Let's, for example here, look at this widget. It reads from sysfs, more precisely the files:
/sys/class/net/wlan0/statistics/tx_bytes
/sys/class/net/wlan0/statistics/rx_bytes
And displays the bandwidth in Megabits per second. Now, the drill is, the widget is set to update every 1/4 of a second (250 ms). How can the widget then calculate speed per second if a second has not passed? Does it multiply the number it gets by 4? What's the drill?
The values read from tx_bytes and rx_bytes are always current. The widget just has to read the values every 250 ms and memorize at least the last 4 values. On each update, the difference between the current value and the value read 1 second ago can be taken, divided by 125,000 (the number of bytes in a megabit), and correctly reported as the bandwidth in Megabits per second.
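A minimal sketch of that read-and-diff loop in Python (the sysfs path is from the question; the 5-sample buffer is my choice so that the oldest reading is roughly one second old):

```python
import time
from collections import deque

MEGABIT_BYTES = 125_000  # 1 Mbit = 1,000,000 bits = 125,000 bytes

def read_counter(path):
    """Read a cumulative byte counter such as rx_bytes from sysfs."""
    with open(path) as f:
        return int(f.read())

def mbps(newest, oldest):
    """Bandwidth in Mbit/s given two counter readings taken ~1 s apart."""
    return (newest - oldest) / MEGABIT_BYTES

# The widget loop: keep the current reading plus the 4 previous ones, so
# samples[0] was taken ~1 second before samples[-1].
#
# path = "/sys/class/net/wlan0/statistics/rx_bytes"
# samples = deque([read_counter(path)] * 5, maxlen=5)
# while True:
#     time.sleep(0.25)
#     samples.append(read_counter(path))
#     print(f"{mbps(samples[-1], samples[0]):.2f} Mbit/s")
```

Note that the counters are cumulative, so no multiplication by 4 is needed: the difference over a full second is taken directly.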
I have a 15 min sliding window, and can aggregate at any given time over this data within this window. Due to memory constraints, I can't increase the size of the window. I still think I should be able to get aggregates (like trending items which is basically freq counter etc.) over a day/a week.
It doesn't have to be a very accurate count, just needs to filter out the top 3-5.
Will running a cron job every 15 mins and putting it into 4 (15min) counters work?
Can I update some kind of rolling counter over the aggregate?
Is there any other method to do this?
My suggestion is an exponentially decaying moving average, as is done for the Unix load average. (See http://www.howtogeek.com/194642/understanding-the-load-average-on-linux-and-other-unix-like-systems/ for an explanation.)
What you do is pick a constant 0 < k < 1 then update every 5 minutes as follows:
moving_average = k * average_over_last_5_min + (1-k) * moving_average
This will behave something like an average over the last 5/k minutes. So if you set k = 1/(24.0 * 60.0 / 5.0) = 0.00347222222222222 then you get roughly a daily moving average. Divide k by a further 7 and you get roughly a weekly moving average.
The averages won't be exact, but should work perfectly well to identify what things are trending recently.
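A sketch of that update rule in Python; the constants follow the formula above, and the constant-input loop is just to illustrate how the average converges:

```python
def update_moving_average(moving_average, average_over_last_5_min, k):
    """One exponentially decaying moving-average step, run every 5 minutes."""
    return k * average_over_last_5_min + (1 - k) * moving_average

# k for a roughly daily window: updates every 5 minutes, 24*60/5 = 288 per day.
k_daily = 1 / (24.0 * 60.0 / 5.0)   # ~0.00347
k_weekly = k_daily / 7

avg = 0.0
for _ in range(288):                # one "day" of updates with a constant input
    avg = update_moving_average(avg, 10.0, k_daily)
# After a day, avg has pulled about 63% of the way (1 - 1/e) toward 10.
```

Keeping one such average per item and ranking by it is enough to surface the top 3-5 trending items, even though the averages themselves are approximate.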
I have a database with stock values in a table, for example:
id - unique id for this entry
stockId - ticker symbol of the stock
value - price of the stock
timestamp - timestamp of that price
I would like to create separate arrays for timeframes of 24 hours, 7 days and 1 month from my database entries, each array containing datapoints for a stock chart.
For some stockIds, I have just a few data points per hour, for others it could be hundreds or thousands.
My question:
What is a good algorithm to "aggregate" the possibly many datapoints into a few: for example, for the 24-hour chart I would like to have at most 10 datapoints per hour. How do I handle exceptionally high / low values?
What is the common approach in regards to stock charts?
Thank you for reading!
Some options: (assuming 10 points per hour, i.e. one roughly every 6 minutes)
For every 6 minute period, pick the data point closest to the centre of the period
For every 6 minute period, take the average of the points over that period
For an hour period, find the maximum and minimum for each 4-minute period and pick the 5 largest maxima and 5 smallest minima from these respective sets (4 minutes is somewhat arbitrarily chosen).
I originally thought to pick the 5 minimum points and the 5 maximum points such that each maximum point is at least 8 minutes apart, and similarly for minimum points.
The 8 minutes here is so we don't have all the points stacked up on each other. Why 8 minutes? At an even distribution, 60/5 = 12 minutes, so just moving a bit away from that gives us 8 minutes.
But, in terms of the implementation, the 4 minutes approach will be much simpler and should give similar results.
You'll have to see which one gives you the most desirable results. The last one is likely to give a decent indication of variation across the period, whereas the second one is likely to have a more stable graph. The first one can be a bit more erratic.
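As a sketch, the second option (per-bucket averaging) might look like this in Python; the 6-minute bucket size and the (timestamp, value) tuple layout are assumptions:

```python
from statistics import mean

def downsample_average(points, bucket_seconds=360):
    """Average (unix_timestamp, value) points per 6-minute bucket,
    returning one (bucket_centre_timestamp, average_value) per bucket."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts // bucket_seconds, []).append(value)
    return [(b * bucket_seconds + bucket_seconds / 2, mean(vals))
            for b, vals in sorted(buckets.items())]
```

The first and third options only change what is done inside each bucket (pick the sample closest to the centre, or collect extremes), so the same bucketing skeleton applies.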
I have a dataset containing > 100,000 records where each record has a timestamp.
This dataset has been aggregated from several "controller" nodes which each collect their data from a set of children nodes. Each controller collects these records periodically, (e.g. once every 5 minutes or once every 10 minutes), and it is the controller that applies the timestamp to the records.
E.g:
Controller One might have 20 records timestamped at time t, 23 records timestamped at time t + 5 minutes, 33 records at time t + 10 minutes.
Controller Two might have 30 records timestamped at time (t + 2 minutes) + 10 minutes, 32 records timestamped at time (t + 2 minutes) + 20 minutes, 41 records timestamped at time (t + 2 minutes) + 30 minutes etcetera.
Assume now that the only information you have is the set of all timestamps and a count of how many records appeared at each timestamp. That is to say, you don't know i) which sets of records were produced by which controller, ii) the collection interval of each controller or iii) the total number of controllers. Is there an algorithm which can decompose the set of all timestamps into individual subsets such that the variance in difference between consecutive (ordered) elements of each given subset is very close to 0, while adding any element from one subset i to another subset j would increase this variance? Keep in mind that, for this dataset, a single controller's "periodicity" could fluctuate by +/- a few seconds because of CPU timing/network latency etc.
My ultimate objective here is to establish a) how many controllers there are and b) the sampling interval of each controller. So far I've been thinking about the problem in terms of periodic functions, so perhaps there are some decomposition methods from that area that could be useful.
The other point to make is that I don't need to know which controller each record came from, I just need to know the sampling interval of each controller. So e.g. if there were two controllers that both started sampling at time u, and one sampled at 5-minute intervals and the other at 50-minute intervals, it would be hard to separate the two at the 50-minute mark because 5 is a factor of 50. This doesn't matter, so long as I can garner enough information to work out the intervals of each controller despite these occasional overlaps.
One basic approach would be to perform an FFT decomposition (or, if you're feeling fancy, a periodogram) of the dataset and look for peaks in the resulting spectrum. This will give you a crude approximation of the periods of the controllers, and may even give you an estimate of their number (and by looking at the height of the peaks, it can tell you how many records were logged).
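A rough sketch of that FFT approach, assuming timestamps in seconds and NumPy available; the bin size and the naive peak picking are my choices, not part of the answer:

```python
import numpy as np

def dominant_periods(timestamps, bin_seconds=60.0, top=3):
    """Bin record timestamps into a counts-per-bin series and report the
    periods (in seconds) of the largest peaks in its magnitude spectrum."""
    t = np.asarray(timestamps, dtype=float)
    t -= t.min()
    nbins = max(1, int(np.ceil(t.max() / bin_seconds)))
    counts, _ = np.histogram(t, bins=nbins, range=(0, nbins * bin_seconds))
    counts = counts - counts.mean()          # remove the DC component
    spectrum = np.abs(np.fft.rfft(counts))
    freqs = np.fft.rfftfreq(nbins, d=bin_seconds)
    peak_idx = np.argsort(spectrum)[::-1]    # strongest bins first
    return [1.0 / freqs[i] for i in peak_idx if freqs[i] > 0][:top]
```

A proper implementation would look for well-separated peaks rather than just the largest bins (harmonics of one controller show up as extra peaks), but this shows the shape of the method.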
While displaying the download status in a window, I have information like:
1) Total file size (f)
2) Downloaded file size (f')
3) Current download speed (s)
A naive time-remaining calculation would be (f - f') / s, but this value is way too shaky (6m remaining / 2h remaining / 5m remaining! deja vu?! :)
Is there a calculation which is both more stable and not extremely wrong (e.g. showing 1h even when the download is about to complete)?
We solved a similar problem in the following way. We weren't interested in how fast the download was over the entire time, just roughly how long it was expected to take based on recent activity but, as you say, not so recent that the figures would be jumping all over the place.
The reason we weren't interested in the entire time frame was that a download could do 1M/s for half an hour then switch up to 10M/s for the next ten minutes. That first half hour will drag down the average speed quite severely, despite the fact that you're now honkin' along at quite a pace.
We created a circular buffer with each cell holding the amount downloaded in a 1-second period. The circular buffer size was 300, allowing for 5 minutes of historical data, and every cell was initialized to zero.
We also maintained a total (the sum of all entries in the buffer, so also initially zero) and the count (zero, obviously).
Every second, we would figure out how much data had been downloaded since the last second and then:
subtract the current cell from the total.
put the current figure into that cell and advance the cell pointer.
add that current figure to the total.
increase the count if it wasn't already 300.
update the figure displayed to the user, based on total / count.
Basically, in pseudo-code:
def init (sz):
    buffer = new int[sz]
    for i = 0 to sz - 1:
        buffer[i] = 0
    total = 0
    count = 0
    index = 0
    maxsz = sz

def update (kbps):
    total = total - buffer[index] + kbps
    buffer[index] = kbps
    index = (index + 1) % maxsz
    if count < maxsz:
        count = count + 1
    return total / count
You can change your resolution (1 second) and history (300) to suit your situation but we found 5 minutes was more than long enough that it smoothed out the irregularities but still gradually adjusted to more permanent changes in a timely fashion.
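The pseudo-code translates almost directly into runnable Python; a sketch with the same fields, wrapped in a class (names are mine):

```python
class DownloadSmoother:
    """Circular buffer of per-second byte counts; update() returns the
    average speed over the last `sz` seconds."""

    def __init__(self, sz=300):
        self.buffer = [0] * sz   # bytes downloaded in each 1-second slot
        self.total = 0           # running sum of the buffer
        self.count = 0           # slots filled so far, capped at sz
        self.index = 0           # next slot to overwrite
        self.maxsz = sz

    def update(self, bytes_this_second):
        # Drop the oldest sample from the running total, add the newest.
        self.total += bytes_this_second - self.buffer[self.index]
        self.buffer[self.index] = bytes_this_second
        self.index = (self.index + 1) % self.maxsz
        if self.count < self.maxsz:
            self.count += 1
        return self.total / self.count

# Called once per second; the remaining bytes divided by this smoothed
# speed gives a stable time-remaining estimate.
smoother = DownloadSmoother(sz=5)
for chunk in (100, 120, 80, 100, 100):
    speed = smoother.update(chunk)
```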
Smooth s (exponential moving avg. or similar).
I prefer using the average speed over the last 10 seconds and dividing the remaining part by that. Dividing by the current speed is way too unstable, while dividing by the average over the whole download cannot handle permanent speed changes (like another download starting).
Why not compute the download speed as an average over the whole download, that is:
s = f' / elapsed time
That way it would smooth out over time.
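In code, this whole-download-average suggestion is just (names are mine):

```python
def eta_seconds(total_bytes, downloaded_bytes, elapsed_seconds):
    """Time remaining using the average speed s = f' / elapsed over the
    whole download. Returns None until any data has arrived."""
    if downloaded_bytes == 0:
        return None
    avg_speed = downloaded_bytes / elapsed_seconds
    return (total_bytes - downloaded_bytes) / avg_speed
```

As noted in the other answers, this is very stable but slow to react: a long slow start keeps dragging the estimate down even after the speed picks up.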