Download time remaining predictor - algorithm

Are there any widgets for predicting when a download (or any other process) will finish based on percent done history?
The trivial version would just do a two-point fit based on the start time, current time and percent done, but better options are possible.
A GUI widget would be nice, but a class that just returns the value would be just fine.

The theoretical algorithm I would attempt, if I were to write such a widget, would be something like:
Record the amount of data transferred within a one second period (a literal KiB/s)
Remember the last 5 or 10 such periods (to get a recent average KiB/s)
Subtract the transferred size from the total size (to get the bytes remaining)
???
Widget!
That oughta do it...
(the missing step being: kibibytes remaining divided by the average KiB/s - a rough sketch of this follows below)
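A minimal sketch of that idea in Python, working in bytes rather than KiB (the class name, the 10-sample window and the method names are my own choices, not anything standard):

from collections import deque

class RateEstimator:
    """Keeps the last few per-second transfer measurements and
    estimates the time remaining from their average."""
    def __init__(self, total_bytes, window=10):
        self.total_bytes = total_bytes
        self.transferred = 0
        self.samples = deque(maxlen=window)   # bytes moved in each 1 s period

    def add_sample(self, bytes_this_second):
        self.transferred += bytes_this_second
        self.samples.append(bytes_this_second)

    def seconds_remaining(self):
        if not self.samples:
            return None
        avg_rate = sum(self.samples) / len(self.samples)   # recent average bytes/s
        if avg_rate == 0:
            return None
        return (self.total_bytes - self.transferred) / avg_rate

A GUI wrapper would just call seconds_remaining() once a second and format the result.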

Most progress bar widgets have an update() method which takes a percentage done.
They then calculate the time remaining based on the time since start and the latest percentage. If you have a process with a long setup time, you might want to add more functionality so that it doesn't include this time, perhaps by having the prediction clock reset when sent a 0% update.
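A sketch of that update() behaviour (the class is illustrative, not from any particular toolkit; it just applies the two-point fit and restarts the clock on a 0% update):

import time

class TimeRemainingEstimator:
    """Two-point estimate from the start time and the latest percentage,
    resetting the clock whenever a 0% update arrives."""
    def __init__(self):
        self.start = None

    def update(self, percent_done):
        now = time.monotonic()
        if percent_done <= 0 or self.start is None:
            self.start = now                  # reset: setup time is not counted
            return None
        elapsed = now - self.start
        return elapsed * (100 - percent_done) / percent_done   # seconds remaining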

It would sort of have the same effect as some of the other proposals, but by collecting percent-done data as a function of time, statistical methods could be used to generate a least-squares best-fit line (maximizing R^2) and project it to 100% done. To make it take the current operation into account, a heavier weight could be placed on the newer data (or the older data could be thinned), and to make it ignore short-term fluctuations, a heavy weight can be put on the first data point.
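For illustration, a weighted least-squares fit of percent done against time, projected forward to 100% (a sketch only; the particular weighting, which anchors on the first point and favours newer data, is just one arbitrary choice):

def projected_finish_time(times, percents, weights):
    """Weighted least-squares fit percent = a + b*t, then solve for
    the time at which the fitted line reaches 100%."""
    sw = sum(weights)
    mt = sum(w * t for w, t in zip(weights, times)) / sw
    mp = sum(w * p for w, p in zip(weights, percents)) / sw
    cov = sum(w * (t - mt) * (p - mp) for w, t, p in zip(weights, times, percents))
    var = sum(w * (t - mt) ** 2 for w, t in zip(weights, times))
    if var == 0 or cov == 0:
        return None
    b = cov / var           # percent per unit time
    a = mp - b * mt
    return (100 - a) / b    # time at which the fitted line reaches 100%

# One possible weighting: anchor heavily on the first data point,
# then weight newer points a little more than older ones.
def make_weights(n, anchor=5.0):
    return [anchor] + [1.0 + i / n for i in range(1, n)]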

Related

Algorithm / data structure for rate of change calculation with limited memory

Certain sensors are to trigger a signal based on the rate of change of the value rather than a threshold.
For instance, heat detectors in fire alarms are supposed to trigger an alarm quicker if the rate of temperature rise is higher: A temperature rise of 1K/min should trigger an alarm after 30 minutes, a rise of 5K/min after 5 minutes and a rise of 30K/min after 30 seconds.
 
I am wondering how this is implemented in embedded systems, where resources are scarce. Is there a clever data structure to minimize the data stored?
 
The naive approach would be to measure the temperature every 5 seconds or so and keep the data for 30 minutes. On these data one can calculate change rates over arbitrary time windows. But this requires a lot of memory.
 
I thought about small windows (e.g. 10 seconds) for which min and max are stored, but this would not save much memory.
 
From a mathematical point of view, the examples you have described can be greatly simplified:
1K/min for 30 mins equals a total change of 30K
5K/min for 5 mins equals a total change of 25K
Obviously there is some adjustment to be made because you have picked round numbers for the example, but it sounds like what you care about is having a single threshold for the total change. This makes sense because taking the integral of a differential results in just a delta.
However, if we disregard the numeric example and just focus on your original question then here are some answers:
First, it has already been mentioned in the comments that one byte every five seconds for half an hour is really not very much memory at all for almost any modern microcontroller, as long as you are able to keep your main RAM turned on between samples, which you usually can.
If however you need to discard the contents of RAM between samples to preserve battery life, then a simpler method is just to calculate one differential at a time.
In your example you want to have a much higher sample rate (every 5 seconds) than the time you wish to calculate the delta over (eg: 30 mins). You can reduce your storage needs to a single data point if you make your sample rate equal to your delta period. The single previous value could be stored in a small battery retained memory (eg: backup registers on STM32).
Obviously if you choose this approach you will have to compromise between accuracy and latency, but maybe 30 seconds would be a suitable timebase for your temperature alarm example.
You can also set several thresholds of K/sec, and then allocate counters to count how many consecutive times each threshold has been exceeded. This requires only one extra integer per threshold.
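A sketch of those last two ideas together, written in Python for brevity (on a microcontroller this would more likely be C): a single stored sample on a 30-second timebase plus one small counter per threshold. The threshold/count pairs simply restate the 1 K/min, 5 K/min and 30 K/min figures from the question; treat them as placeholders.

# Rate thresholds (K per 30 s sample) and how many consecutive samples
# must exceed each one before the alarm trips. Placeholder values:
THRESHOLDS = [
    (15.0, 1),    # ~30 K/min sustained for 30 s
    (2.5, 10),    # ~5 K/min sustained for 5 min
    (0.5, 60),    # ~1 K/min sustained for 30 min
]

previous_sample = None
counters = [0] * len(THRESHOLDS)     # one small integer per threshold

def on_sample(temperature):
    """Called once per 30 s sample; returns True if the alarm should trip."""
    global previous_sample
    if previous_sample is None:
        previous_sample = temperature
        return False
    delta = temperature - previous_sample    # only one stored value is needed
    previous_sample = temperature
    alarm = False
    for i, (threshold, needed) in enumerate(THRESHOLDS):
        counters[i] = counters[i] + 1 if delta >= threshold else 0
        if counters[i] >= needed:
            alarm = True
    return alarm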
In signal processing terms, the procedure you want to perform is:
Apply a low-pass filter to smooth quick variations in the temperature
Take the derivative of its output
The cut-off frequency of the filter would be set according to the time frame. There are 2 ways to do this.
You could apply a FIR (finite impulse response) filter, which is a weighted moving average over the time frame of interest. Naively, this requires a lot of memory, but it's not bad if you do a multi-stage decimation first to reduce your sample rate. It ends up being a little complicated, but you have fine control over the response.
You could apply an IIR (infinite impulse response) filter, which uses feedback of the output. The exponential moving average is the simplest example of this. These filters require far less memory -- only a few samples' worth -- but your control over the precise shape of the response is limited. A classic example like the Butterworth filter would probably be great for your application, though.
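A minimal sketch of that IIR approach: an exponential moving average as the low-pass filter, followed by a first difference as the derivative (the smoothing factor alpha is an assumption you would tune to your time frame; the class name is mine):

class SmoothedRate:
    """Exponential moving average (a simple IIR low-pass filter) followed by a
    first difference, giving a smoothed rate of change per sample period."""
    def __init__(self, alpha=0.05):
        self.alpha = alpha          # smaller alpha = lower cut-off frequency
        self.ema = None
        self.prev_ema = None

    def update(self, value):
        if self.ema is None:
            self.ema = value
        else:
            self.prev_ema = self.ema
            self.ema = self.alpha * value + (1 - self.alpha) * self.ema
        if self.prev_ema is None:
            return 0.0
        return self.ema - self.prev_ema   # rate of change per sample period

Only two state variables are kept, which is where the memory saving comes from.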

How do programs estimate how long a process will take to complete?

In some loading bars there will be something like "2 minutes remaining". Does the programmer time how long the process takes on their computer and use that value somehow? Or does the program calculate it all itself? Or another method?
This calculation is actually done at run time, because the time taken to execute or download a certain program depends on internet speed, RAM, processor speed and so on, so it would be hard to have one universally predicted time based on the programmer's computer. Typically this is calculated from how much of the program has already been downloaded compared to the size of the file, taking into account how long it took to download that much data. From there the program extrapolates how much longer it will take to finish your download based on how fast it has operated up to that point in time.
Those 'x minutes remaining' interface elements, which (ideally) indicate how much time it will take to complete a certain task, are simply forecasts based on how long that task has taken so far and how much work on that task has been accomplished.
For example, suppose I have an app that will upload a batch of images to a server (all of which are the same size, for simplicity). Here's a general idea of how the code that indicates time remaining will work:
Before we begin, we assign the current time to a variable. Also, at this point the time remaining indicator is not visible. Then, in a for... loop (sketched here with hypothetical uploadImage and showTimeRemaining helpers):
var startTime = Date.now();
for (var i = 0; i < batchOfImages.length; i++)
{
    // Upload one image (hypothetical synchronous helper, for simplicity)
    uploadImage(batchOfImages[i]);
    // Total time expended on this task so far
    var elapsedMs = Date.now() - startTime;
    // Average time per image so far (i + 1 images uploaded)
    var averagePerImageMs = elapsedMs / (i + 1);
    // Likely time needed for the images still to be uploaded
    var remainingMs = averagePerImageMs * (batchOfImages.length - (i + 1));
    // Update the indicator text and make sure the indicator is visible
    showTimeRemaining(remainingMs);
}

Specific Cache Hit Rate calculation

Scenario:
Suppose we have infinite cache memory size. Caching is limited only by a timeout; the value of this timeout is half an hour. The cache is initially empty.
Problem:
We have 50,000 distinct requests. Our system is querying, randomly, at the rate of 15 requests/second, i.e. 27,000 requests in half an hour. What kind of curve or average value of cache hit rate could we expect for the first 5 hours?
Note: This scenario is fixed. I need an approach to find out the hit rate. If you think the tag is wrong, please suggest an appropriate tag.
I think you're right and this is a math question (certainly not a programming problem).
One approach is to consider the extremes: what is the hit rate for the first query when the system starts running? For the second query? After one second? After 10? After a minute? And what is the likelihood that any random query will be found in the cache once the system has been running a long time?
These are a few specific values, and together they give you a curve. I don't think great numeric precision is necessary; the long-term average and the shape of the curve are more interesting.
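To put rough numbers on it, assume requests are uniform over the 50,000 keys and that an entry expires 30 minutes after it is inserted (the question doesn't say whether a hit refreshes the timeout). Under that assumption a renewal argument gives a long-run hit rate of roughly 0.54 / 1.54, about 35% (each key is requested on average 0.54 times during the 30-minute lifetime that starts with the one miss that inserted it); if hits did refresh the timeout it would be roughly 1 - e^(-0.54), about 42%. A quick simulation sketch in Python, which also shows the warm-up curve (all parameter names are mine):

import random

def hit_rate_per_half_hour(num_keys=50_000, rate=15, timeout=1800, hours=5):
    """Simulate the cache and report the hit rate for each 30-minute window."""
    cache = {}                        # key -> insertion time in seconds
    rates, hits, total = [], 0, 0
    for t in range(hours * 3600):     # one-second time steps
        for _ in range(rate):
            key = random.randrange(num_keys)
            total += 1
            if key in cache and t - cache[key] < timeout:
                hits += 1             # hit: entry still valid, timeout not refreshed
            else:
                cache[key] = t        # miss: (re)insert the entry
        if (t + 1) % 1800 == 0:
            rates.append(hits / total)
            hits = total = 0
    return rates

print(hit_rate_per_half_hour())   # low in the first half hour, then settles near the long-run value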

Bin packing parts of a dynamic set, considering lastupdate

There's a large set of objects. The set is dynamic: objects can be added or deleted at any time. Let's call the total number of objects N.
Each object has two properties: mass (M) and time (T) of last update.
Every X minutes a small batch of those should be selected for processing, which updates their T to the current time. The total M of all objects in a batch is limited: not more than L.
I am looking to solve three tasks here:
find an algorithm for picking the objects of the next batch;
introduce object classes: simple, priority (guaranteed to fit into at least every n-th batch) and frequent (fit into each batch);
forecast when system capacity will be exhausted (the time to add the next server, i.e. increase L).
What kind of model best describes such a system?
The whole thing is about a service that processes the "objects" at time intervals. Each object should be "measured" every N hours. N can vary in a range. X is fixed.
Objects are added/deleted by humans. N grows exponentially, rather slowly, with some spikes caused by publications. Of course the forecast can't be precise, just an estimate. M varies from 0 to 1E7 with an exponential distribution; most values are closer to 0.
I see there can be several strategies here:
A. Full throttle: pack each batch as close to 100% as possible. As N grows, the average interval between hits for a particular object will grow.
B. Equal temperament :) - try to keep the average interval around some value. The batch fill level will grow from some low level; when it gets close to 100%, it is time to get more servers.
C. - ?
Here is a pretty complete design for your problem.
Your question does not quite match your description of the system this is for, so I'll assume that the description is accurate.
When you schedule a measurement you should pass an object, the earliest time it can be measured, and the time you want the measurement to happen by. The object should have a weight attribute and a measured method. When the measurement happens, the measured method will be called, and the difference between your classes is whether, and with what parameters, they will reschedule themselves.
Internally you will need a couple of priority queues. See http://en.wikipedia.org/wiki/Heap_(data_structure) for details on how to implement one.
The first queue, ordered by the time the measurement can happen, holds all of the objects that can't be measured yet. Every time you schedule a batch you will use it to find all of the new measurements that can now happen.
The second queue is of measurements that are ready to go now, and is organized by which scheduling period they should happen by, and then weight. I would make them both ascending. You can schedule a batch by pulling items off of that queue until you've got enough to send off.
Now you need to know how much to put in each batch. Given the system that you have described, a spike of events can be put in manually, but over time you'd like those spikes to smooth out. Therefore I would recommend option B, equal temperament. To do this, as you put each object into the "ready now" queue, calculate its "average work weight" as its weight divided by the number of periods until it is supposed to happen. Store that with the object, and keep a running total of the run rate you should be at. Every period, I would suggest that you keep adding to the batch until one of three conditions has been met (as sketched after the list):
You run out of objects.
You hit your maximum batch capacity.
You exceed 1.1 times your running total of your average work weight. The extra 10% is because it is better to use a bit more capacity now than to run out of capacity later.
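A sketch of that batch-building procedure, using Python's heapq for both priority queues. Objects are assumed to carry weight and deadline_period attributes, and exactly how the running total is adjusted once an object lands in a batch is my own guess; everything here is illustrative rather than a definitive implementation.

import heapq, itertools

seq = itertools.count()       # tie-breaker so heap entries never compare objects
not_ready = []                # heap of (earliest_time, seq, obj)
ready = []                    # heap of ((deadline_period, weight), seq, obj)
running_total = 0.0           # sum of "average work weight" of pending objects

def schedule(obj, earliest_time):
    heapq.heappush(not_ready, (earliest_time, next(seq), obj))

def build_batch(now, current_period, max_capacity):
    global running_total
    # Move everything that has become measurable into the "ready now" queue,
    # computing its "average work weight" on the way in.
    while not_ready and not_ready[0][0] <= now:
        _, _, obj = heapq.heappop(not_ready)
        periods_left = max(1, obj.deadline_period - current_period)
        obj.avg_work_weight = obj.weight / periods_left
        running_total += obj.avg_work_weight
        heapq.heappush(ready, ((obj.deadline_period, obj.weight), next(seq), obj))
    target = 1.1 * running_total                  # the suggested 10% of headroom
    batch, batch_weight = [], 0.0
    while ready:                                  # stop 1: we run out of objects
        obj = ready[0][2]
        if batch_weight + obj.weight > max_capacity:   # stop 2: maximum batch capacity
            break
        if batch_weight >= target:                     # stop 3: 1.1 x the run rate
            break
        heapq.heappop(ready)
        running_total -= obj.avg_work_weight      # no longer pending (my interpretation)
        batch.append(obj)
        batch_weight += obj.weight
    return batch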
And finally, capacity planning.
For this you need to use some heuristic. Here is a reasonable one, which may need some tweaking for your system. Maintain an array of your past 10 measurements of the running total of average work weight. Maintain an "exponentially damped average of your high water mark" by updating it each time according to the formula:
average_high_water_mark = 0.95 * average_high_water_mark + 0.05 * max(last 10 running work weights)
If average_high_water_mark ever gets within, say, 2 servers of your maximum capacity, then add more servers. (The idea is that a server should be able to die without leaving you hosed.)
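That heuristic in a few lines (a sketch; the 0.95/0.05 damping and the 10-sample window follow the formula above, while the function and variable names are mine):

from collections import deque

recent_run_rates = deque(maxlen=10)   # last 10 running totals of average work weight
average_high_water_mark = 0.0

def time_to_add_servers(current_run_rate, max_capacity, per_server_capacity):
    """Returns True when the damped high-water mark gets within ~2 servers of capacity."""
    global average_high_water_mark
    recent_run_rates.append(current_run_rate)
    average_high_water_mark = (0.95 * average_high_water_mark
                               + 0.05 * max(recent_run_rates))
    return average_high_water_mark >= max_capacity - 2 * per_server_capacity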
I think option A is good. Bin packing is about maximizing or minimizing, and you have only one batch. Sort the objects by m and n.

Sliding Window over Time - Data Structure and Garbage Collection

I am trying to implement something along the lines of a Moving Average.
In this system, there are no guarantees of a quantity of Integers per time period. I do need to calculate the Average for each period. Therefore, I cannot simply slide over the list of integers by quantity as this would not be relative to time.
I can keep a record of each value with its associated time. We will have a ton of data running through the system so it is important to 'garbage collect' the old data.
It may also be important to note that I need to save the average to disk at the end of each period. However, there may be some overlap between saving the data to disk and having data from a new period being introduced.
What are some efficient data structures I can use to store, slide, and garbage collect this type of data?
The description of the problem and the question conflict: what is described is not a moving average, since the average for each time period is distinct. ("I need to compute the average for each period.") So that admits a truly trivial solution:
For each period, maintain a count and a sum of observations.
At the end of the period, compute the average
I suspect that what is actually wanted is something like: Every second (computation period), I want to know the average observation over the past minute (aggregation period).
This can be solved simply with a circular buffer of buckets, each of which represents the value for one computation period. There will be (aggregation period / computation period) such buckets. Again, each bucket contains a count and a sum. Also, a current sum/count and a cumulative sum/count are maintained. Each observation is added to the current sum/count.
At the end of each computation period:
subtract the sum/count for the (circularly) first period from the cumulative sum/count
add the current sum/count to the cumulative sum/count
report the average based on the cumulative sum/count
replace the values of the first period with the current sum/count
clear the current sum/count
advance the origin of the circular buffer.
If you really need to be able to compute at any time at all the average of the previous observations over some given period, you'd need a more complicated data structure, basically an expandable circular buffer. However, such precise computations are rarely actually necessary, and a bucketed approximation, as per the above algorithm, is usually adequate for data purposes, and is much more sustainable over the long term for memory management, since its memory requirements are fixed from the start.
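A sketch of that bucketed approach in Python, assuming the caller invokes end_period() once per computation period (class and method names are my own):

class BucketedMovingAverage:
    """Circular buffer of (sum, count) buckets, one per computation period,
    covering one aggregation period in total."""
    def __init__(self, buckets):                 # buckets = aggregation / computation
        self.buckets = [(0.0, 0)] * buckets
        self.origin = 0                          # index of the oldest bucket
        self.cur_sum, self.cur_count = 0.0, 0
        self.cum_sum, self.cum_count = 0.0, 0

    def observe(self, value):
        self.cur_sum += value
        self.cur_count += 1

    def end_period(self):
        """Call at the end of each computation period; returns the current average."""
        old_sum, old_count = self.buckets[self.origin]
        self.cum_sum -= old_sum                  # subtract the oldest bucket
        self.cum_count -= old_count
        self.cum_sum += self.cur_sum             # add the period just finished
        self.cum_count += self.cur_count
        average = self.cum_sum / self.cum_count if self.cum_count else None
        self.buckets[self.origin] = (self.cur_sum, self.cur_count)   # overwrite oldest
        self.cur_sum, self.cur_count = 0.0, 0    # clear the current sum/count
        self.origin = (self.origin + 1) % len(self.buckets)          # advance the origin
        return average

# ma = BucketedMovingAverage(buckets=60)   # e.g. 60 one-second buckets = one minute
# ma.observe(4.2); ma.observe(3.8)
# print(ma.end_period())

Old buckets are overwritten in place, so memory use is fixed and nothing needs separate garbage collection.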
