Number of passes in external merge - algorithm

There don't seem to be any preexisting questions on this, at least from a title search. I am trying to find the optimal number of passes for an external merge. So, if we have 1000 chunks of data, one pass would be a 1000-way merge. Two passes could be 5 groups of 200 chunks each, then a final merge of 1 group of 5 chunks. And so on. I've done some math, which must have a flaw, because it looks like two passes never beat one pass. It could very well be a misunderstanding of how data is read, though.
First, a numerical example:
Data: 100 GB
Ram: 1 GB
Since we have 1 GB of memory, we can load 1 GB at a time and sort it with quicksort or mergesort. Now we have 100 sorted chunks. We can do a 100-way merge by making RAM/(chunks+1)-sized buckets = 1024 MB/101 ≈ 10.14 MB: one 10.14 MB input bucket for each of the 100 chunks, plus one output bucket also of size 10.14 MB. As we merge, whenever an input bucket empties, we do a disk seek to refill it. Likewise, when the output bucket fills, we write it to disk and empty it. I claim that the number of "times the disk needs to read" is (data/ram)*(chunks+1). I get this from the fact that we have ram/(chunks+1)-sized input buckets, and we must read the entire data in a given pass, so we read (data/bucket_size) times. In other words, every time an input bucket empties we must refill it. Over 100 chunks here, that's numChunks*(chunk_size/bucket_size) = data_size/bucket_size = 100*(1024 MB/10.14 MB). Since bucket_size = ram/(chunks+1), this is (data/ram)*(chunks+1) = (100*1024 MB/1024 MB)*101 = 10100 reads.
For a two-pass system, we do A groups of B chunks each, then a final merge of the A resulting runs. Using the previous logic, we have numReads = A*( (data/ram)*(B+1) ) + 1*( (data/ram)*(A+1) ). We also have A*B = data/ram. For instance, 10 groups of 10 chunks, where each chunk is 1 GB: here A = 10, B = 10, and 10*10 = 100/1 = 100, which is data/ram. This is because data/ram was the original number of chunks; for two passes, we want to break data/ram into A groups of B chunks each.
I'll try to break down the formula here, let D = data, A = #groups, B = #chunks/group, R = RAM
A*(D/R)*(B+1) + 1*(D/R)*(A+1) - This is A times the number of reads of an external merge on B #chunks plus the final merge on A #chunks.
A = D/(R*B) => D^2/(B*R^2) * (B+1) + D/R * (D/(R*B)+1)
(D^2/R^2)*[1 + 2/B] + D/R is the number of reads for a two-pass external merge. For one pass, we have (data/ram)*(chunks+1), where chunks = data/ram, so for one pass we have D^2/R^2 + D/R. We see that the two-pass count only approaches that as B (the number of chunks per group) goes to infinity, and even then it merely matches the one-pass count of D^2/R^2 + D/R. So there must be something about the reads I'm missing, or my math is flawed. Thanks to anyone who takes the time to help me!

You ignore the fact that the total time it takes to read a block of data from disk is the sum of
The access time which is roughly constant and on the order of several milliseconds for rotating hard disk drives.
The transfer time which depends on the size of the data block and the transfer rate.
As the number of chunks increases, the size of the input buffers (you call them buckets) decreases. The smaller the input buffers get, the more pronounced is the effect of the constant access time on the total time it takes to fill a buffer. At a certain point, the time to fill a buffer will be almost completely dominated by the access time. So the total time for a merge pass begins to scale with the number of buffer refills rather than the amount of data read.
That's where additional merge passes can speed up the process. They allow you to use fewer, larger input buffers and mitigate the effect of the access time.
Edit: Here's a quick back-of-the-envelope calculation to give an idea about where the break-even point is.
The total transfer time can be calculated easily. All the data has to be read and written once per pass:
total_transfer_time = num_passes * 2 * data / transfer_rate
The total access time for buffer reads is:
total_access_time = num_passes * num_buffer_reads * access_time
Since there's only a single output buffer, it can be made larger than the input buffers without wasting too much memory, so I'll ignore the access time for writes. The number of buffer reads is data / buffer_size, buffer size is about ram / num_chunks for the one-pass approach, and the number of chunks is data / ram. So we have:
total_access_time1 = num_chunks^2 * access_time
For the two-pass solution, it makes sense to use sqrt(num_chunks) buffers to minimize access time. So buffer size is ram / sqrt(num_chunks) and we have:
total_access_time2 = 2 * (data / (ram / sqrt(num_chunks))) * access_time
= 2 * num_chunks^1.5 * access_time
So if we use transfer_rate = 100 MB/s, access_time = 10 ms, data = 100 GB, ram = 1 GB, the total time is:
total_time1 = (2 * 100 GB / 100 MB/s) + 100^2 * 10 ms
= 2000 s + 100 s = 2100 s
total_time2 = (2 * 2 * 100 GB / 100 MB/s) + 2 * 100^1.5 * 10 ms
= 4000 s + 20 s = 4020 s
The effect of access time is still very small. So let's change data to 1000 GB:
total_time1 = (2 * 1000 GB / 100 MB/s) + 1000^2 * 10 ms
= 20000 s + 10000 s = 30000 s
total_time2 = (2 * 2 * 1000 GB / 100 MB/s) + 2 * 1000^1.5 * 10 ms
= 40000 s + 632 s = 40632 s
Now a third of the time in the one-pass version is spent on disk seeks. Let's try with 5000 GB:
total_time1 = (2 * 5000 GB / 100 MB/s) + 5000^2 * 10 ms
= 100000 s + 250000 s = 350000 s
total_time2 = (2 * 2 * 5000 GB / 100 MB/s) + 2 * 5000^1.5 * 10 ms
= 200000 s + 7071 s = 207071 s
Now the two-pass version is faster.
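To make the break-even point easy to explore, here is a minimal Python sketch of the same back-of-the-envelope model (an illustration only, not code from the answer; the 100 MB/s transfer rate and 10 ms access time are the assumed values from above, and the two-pass variant uses the sqrt(num_chunks) buffer choice):
```
TRANSFER_RATE_GB_S = 0.1   # 100 MB/s
ACCESS_TIME_S = 0.010      # 10 ms per buffer read (seek)

def merge_time(data_gb, ram_gb, num_passes):
    num_chunks = data_gb / ram_gb
    # Every pass reads and writes all of the data once.
    transfer = num_passes * 2 * data_gb / TRANSFER_RATE_GB_S
    if num_passes == 1:
        buffer_reads = num_chunks ** 2        # buffer size = ram / num_chunks
    else:
        buffer_reads = 2 * num_chunks ** 1.5  # two passes, buffer size = ram / sqrt(num_chunks)
    return transfer + buffer_reads * ACCESS_TIME_S

for data_gb in (100, 1000, 5000):
    print(data_gb, "GB:",
          round(merge_time(data_gb, 1, 1)), "s one-pass,",
          round(merge_time(data_gb, 1, 2)), "s two-pass")
```
Running it reproduces the 2100/4020, 30000/40632, and 350000/207071 second figures above.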

To get an optimum you need a more sophisticated model of the disk. Let the time to fill a block of size S be rS + k, where k is the seek time and r is the read time per unit of data (the reciprocal of the transfer rate).
If you divide RAM of size M into C+1 buffers of size M/(C+1), then the time to load RAM once is (C+1) (r M/(C+1) + k) = rM + k(C+1). So as you'd expect, making C smaller speeds up read time by eliminating seeks. It's fastest to read all of memory in one sequential block, but merging doesn't allow it. We must make a tradeoff. That's where we need to look for the optimum.
With total data size of c times RAM size, there are c chunks to be merged.
In the one pass scheme, C=c, and the total read time must be just the time to fill RAM c times over: c (rM + k(c+1)) = c(rM + kc + k).
In the two pass scheme with an N-way division of data for the first pass, that pass has C=c/N and in the second pass, C=N. So total cost is
c ( rM + k(c/N+1) ) + c ( rM + k(N+1) ) = c ( 2rM + k(c/N + N) + 2k )
Note this model omits write time. You should fill that in eventually unless you're assuming it's overlapped I/O on a different device and thus can be ignored.
It's not hard to see here that if c and k are suitably large, then the c/N+N term in the 2-pass model can be so small compared to the c in the one-pass that the 2-pass model will be faster.
I'm going to stop now, but you can carry this logic on to (probably) get a closed approximation formula for an arbitrary number of passes. This will require solving an infinite series. Then you can set the derivative to zero and solve for an estimate of the optimal pass number. If life is good you'll also learn the optimal value of N by setting the gradient of a 2d function in pass number and N to zero. My intuition says N ~ sqrt(c).
If the math gets intractable, you could still simulate a reasonable range of numbers of passes with the kind of simple algebra above at the start and pick an optimum that way.
This is an interesting problem and I'm sorry I don't have more time to work on it at the moment. I hope the analysis framework is enough to let you punch through to a nice result.
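If the closed form gets messy, the simulation route is easy. Here is a minimal Python sketch of the rS + k model above (an illustration only, with made-up but plausible constants), which evaluates the one-pass cost and scans N for the cheapest two-pass split:
```
def one_pass_cost(c, M, r, k):
    # Fill RAM c times with C = c buffers: c * (r*M + k*(c+1))
    return c * (r * M + k * (c + 1))

def two_pass_cost(c, M, r, k, N):
    # First pass: C = c/N chunks per merge; second pass: C = N runs.
    return c * (2 * r * M + k * (c / N + N) + 2 * k)

def best_two_pass(c, M, r, k):
    return min((two_pass_cost(c, M, r, k, N), N) for N in range(2, c))

# Assumed numbers: 1 GB RAM, 100 MB/s (r = 10 s/GB), 10 ms seeks, 5 TB of data (c = 5000).
M, r, k, c = 1.0, 10.0, 0.010, 5000
print("one pass:", one_pass_cost(c, M, r, k), "s")
cost, N = best_two_pass(c, M, r, k)
print("two pass:", round(cost), "s at N =", N)   # N lands near sqrt(c), matching the intuition above
```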

Related

How to compute the achieved FLOPS of a MPI program which calls cuBlas function

I am accelerating an MPI program using cuBLAS functions. To evaluate the application's efficiency, I want to know the FLOPS, memory usage and other characteristics of the GPU after the program has run, especially FLOPS.
I have read the relevant question: How to calculate Gflops of a kernel. I think the answers give two ways to calculate the FLOPS of a program:
The model count of an operation divided by the cost time of the operation
Using NVIDIA's profiling tools
The first solution doesn't depend on any tools. But I'm not sure the meaning of model count. It's O(f(N))? Like the model count of GEMM is O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the cost time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?
The second solution uses nvprof, which is deprecated now and replaced by Nsight Systems and Nsight Compute. But those two tools only work for a CUDA program, not for an MPI program that launches CUDA functions. So I am wondering whether there is a tool to profile a program that launches CUDA functions.
I have been searching for an answer for two days but still can't find an acceptable solution.
But I'm not sure the meaning of model count. It's O(f(N))? Like the model count of GEMM is O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the cost time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?
The standard BLAS GEMM operation is C <- alpha * (A dot B) + beta * C and for A (m by k), B (k by n) and C (m by n), each inner product of a row of A and a column of B multiplied by alpha is 2 * k + 1 flop and there are m * n inner products in A dot B and another 2 * m * n flop for adding beta * C to that dot product. So the total model FLOP count is (2 * k + 3) * (m * n) when alpha and beta are both non-zero.
For your example, assuming alpha = 1 and beta = 0 and the implementation is smart enough to skip the extra operations (and most are) GEMM flop count is (2 * 5) * (4 * 6) = 240, and if the execution time is 0.5 seconds, the model arithmetic throughput is 240 / 0.5 = 480 flop/s.
I would recommend using that approach if you really need to calculate the performance of GEMM (or other BLAS/LAPACK operations). This is the way that most of the computational linear algebra literature and benchmarking has worked since the 1970s, and it is how most reported results you will find are calculated, including the HPC LINPACK benchmark.
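As a quick illustration of that approach (a sketch only, not tied to any particular BLAS implementation), the model count above can be turned into a throughput estimate like this:
```
def gemm_model_flops(m, n, k, alpha=1.0, beta=0.0):
    """Model FLOP count for C <- alpha*(A@B) + beta*C, following the counting above."""
    flops = 2 * k * m * n        # multiplies and adds in A dot B
    if alpha != 1.0:
        flops += m * n           # scaling each inner product by alpha
    if beta != 0.0:
        flops += 2 * m * n       # multiplying by beta and adding to C
    return flops

# The 4x5 times 5x6 example, alpha = 1, beta = 0, runtime 0.5 s:
flop = gemm_model_flops(m=4, n=6, k=5)
print(flop, "flop ->", flop / 0.5, "flop/s")   # 240 flop -> 480.0 flop/s
```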
The documentation section Using the CLI to Analyze MPI Codes states clearly how to use nsys to collect MPI program runtime information.
And the GitLab project Roofline Model on NVIDIA GPUs uses ncu to collect the achieved FLOPS and memory usage of the program. The methodology to compute these metrics is:
Time:
sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
FLOPs:
DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum
SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum
HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum
Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum
Bytes:
DRAM: dram__bytes.sum
L2: lts__t_bytes.sum
L1: l1tex__t_bytes.sum
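Once you have those metric sums as plain numbers (for example, exported from ncu's CSV output), combining them is just arithmetic. A minimal Python sketch with hypothetical variable names, for the double-precision case:
```
def achieved_dp_flops(dadd, dfma, dmul, cycles_elapsed, cycles_per_second):
    """Double-precision FLOP/s from the ncu metrics listed above.

    dadd, dfma, dmul:  sm__sass_thread_inst_executed_op_d{add,fma,mul}_pred_on.sum
    cycles_elapsed:    sm__cycles_elapsed.avg
    cycles_per_second: sm__cycles_elapsed.avg.per_second
    """
    time_s = cycles_elapsed / cycles_per_second  # kernel time
    flop = dadd + 2 * dfma + dmul                # an FMA counts as 2 flop
    return flop / time_s
```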

What's the best SAS PROC SURVEYSELECT options for accomplishing a semi-controlled random set?

The scenario I'm working with is creating a macro that takes in a data set and produces a random stratified sample. The stratification should be by the column STATE, and each state should get an equal number of records in the sample (when possible).
The size of the sample needed has some set rules that we have to abide by which are:
If the total data set size is <= 50 then let the sample size = the entire data set
Else if the total data set size is between 51 and 500 then let the sample size = 50
Else if the total data set size is between 501 and 999 then let the sample size = 10% of the total data set size (n*.10) given that n = the total data set size.
Else if the total data set size is > 999 then let the sample size = 100
SAMPLESIZE is currently defined in code as:
/*sets sample size in accordance to standards*/
%if &num>=0 and &num<=50 %then %let samplesize=&num;
%else %if &num<501 %then %let samplesize=50;
%else %if &num<1000 %then %let samplesize=%sysevalf((&num*.10),ceil);
%else %let samplesize=100;
The data set I used for testing has a total number of records of 550 (so the sample size needed would be 55) with each state totaling the following number:
IN = 100
KY = 217
MO = 189
OH = 8
WI = 36
Applying the STRATA option for SURVEYSELECT works great when each state has the minimum number of records needed to satisfy the sample size. In this case the sample size for each stratum would be 11.
You can see that the OH stratum does not satisfy the minimum requirement for the sample size here, since there are only 8 records with OH in the data set, which leads to the following error:
ERROR: The sample size, 11, is greater than the number of sampling units, 8.
UPDATE (7/14/21): I was able to resolve the error by using the SELECTALL option, and I was also able to grab from other states to fill in the missing records for OH using the ALLOC option for STRATA, so my updated SURVEYSELECT statement now looks like this:
```
PROC SURVEYSELECT DATA=UniqueList OUT=UniqueListsamp METHOD=SRS SAMPSIZE=&samplesize
  SELECTALL NOPRINT;
  STRATA PROVIDER_STATE / ALLOC=(.2 .2 .2 .2 .2);
RUN;
```
What I would like to achieve in this scenario is to make the ALLOC option work in a way that can handle any number of states found in the input file. My understanding is that the option requires hard-coded decimals that add up to 1, dependent on the number of strata used (in this case 5, so 1/5 would be 5 instances of .2 that add up to a total of 1). This works great if we know the total number of states ahead of time, but that will not be the case when the code gets implemented for use. Is there a way to do a calculation (1 / number of states = .2) and then input that value, as many times as the number of states found, separated by a comma or a space (.2 .2 .2 .2 .2), into the ALLOC option?
You can pass a dataset as the argument to SAMPSIZE in surveyselect. I think that's what you need here.
Taking your counts as a starting point, I first just create a dataset matching your actual input. Then I run a tabulate to get your counts back. Then I parse the tabulate to figure out how many to pull, and how many per state, and make sure it's not asking for too many. This gives us a first idea of what's going to be pulled per state, and gives us a dataset that lets us modify that number.
The question of how to pull those last 3 is complicated, because it's not straightforward - how do you want to pull those 3? Should you pick the states "randomly" to add one to? What if a state only had 1 left, and you actually want 3 per state? It gets a bit messy to do this, and if you're not doing this frequently, it might be easier to just do it analytically. A proper system will have detailed checks, several passes, and the assumption that everything that can go wrong, will.
In this example I just go ahead and take the "extra" - so I sample 56. That gets you very close to your desired sample while sticking to your sampling plan ratios evenly and not having different amounts per state (among those states that can supply them). If you want to actually sample exactly 55, you need to decide how to allocate that 12th record - to the 3 largest states? To three random states? Up to you, but the work is similar.
data for_gen;
input state $ count;
do id = 1 to count;
state_id = cats(state,put(id,z3.));
output;
end;
keep state state_id;
datalines;
IN 100
KY 217
MO 189
OH 8
WI 36
;;;;
run;
*create a listing, including the overall row (which will be on top);
proc tabulate data=for_gen out=state_counts(keep=state n);
class state;
table (all state),n;
run;
*now distribute the sample, first pass;
data sample_counts;
set state_counts nobs=statecount end=eof;
retain total_sample sample_per_state states_left;
if _n_ = 1 then do;
*the sample size rules;
if n lt 500 then total_sample = min(50,n);
else total_sample = min(100,floor(n/10));
*how many per state;
sample_per_state = ceil(total_sample/(statecount-1)); *or maybe floor?;
end;
else do;
*here we are in the per-state section;
_NSIZE_ = min(n,sample_per_state);
*allocate sample amounts, remove the used sample quantity from the total quantity, and keep track of how many states still have sample remaining;
total_sample = total_sample - _NSIZE_;
if n ne _nsize_ then states_left+1;
end;
*save the remaining info in macro variables to use later;
if eof then do;
call symputx('sample_left',total_sample);
call symputx('states_left',states_left);
end;
if state ne ' ' then output;
run;
*allocate the remaining sample - we assume we want "at least" the sample count;
data sample_secondpass;
set sample_counts end=eof;
retain total_sample_left &sample_left.
total_states_left &states_left.
leftover 0
;
if total_sample_left gt 0 and total_states_left gt 0 then do;
per_state = ceil(total_sample_left/total_states_left);
if n gt (_nsize_ + per_State) then do;
_nsize_ = _nsize_ + per_state;
end;
else do;
leftover = leftover + (_nsize_ + per_state - n);
_nsize_ = n;
end;
end;
if eof then call symputx('leftover',leftover);
run;
* Use the sample counts dataset to run the surveyselect;
proc surveyselect sampsize=sample_secondpass data=for_gen;
strata state;
run;

Huge memory allocation running a julia function?

I am trying to run the following function at the Julia prompt, but when timing the function I see far too many memory allocations, which I can't explain.
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
for m = 1:length(Pf)
i = 0
for k = 1:iters
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
#thresh = erfcinv(2Pf) * sqrt(2/L) + 1
#Pd_the = 0.5 * erfc(((thresh - (snr + 1)) * sqrt(L)) / (2*(snr + 1)))
end
Running that function in the julia command on my laptop, I get the following shocking numbers:
julia> @time pdpf(1000, 10000)
17.621551 seconds (9.00 M allocations: 30.294 GB, 7.10% gc time)
What is wrong with my code? Any help is appreciated.
I don't think this memory allocation is so surprising. For instance, consider all of the times that the inner loop gets executed:
for m = 1:length(Pf) this gives you 100 executions
for k = 1:iters this gives you 10,000 executions based on the arguments you supply to the function.
randn(L) this gives you a random vector of length 1,000, based on the arguments you supply to the function.
Thus, just considering these, you've got 100*10,000*1000 = 1 billion Float64 random numbers being generated. Each one of them takes 64 bits = 8 bytes. I.e. 8GB right there. And, you've got two calls to randn(L) which means that you're at 16GB allocations already.
You then have y = s + n which means another 8GB allocations, taking you up to 24GB. I haven't looked in detail on the remaining code to get you from 24GB to 30GB allocations, but this should show you that it's not hard for the GB allocations to start adding up in your code.
If you're looking at places to improve, I'll give you a hint that these lines can be improved by using the properties of normal random variables:
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
You should easily be able to cut down the allocations here from 24GB to 8GB in this way. Note that y will be a normal random variable here as you've defined it, and think up a way to generate a normal random variable with an identical distribution to what y has now.
Another small thing, snr is a constant inside your function. Yet, you keep taking its sqrt 1 million separate times. In some settings, 'checking your work' can be helpful, but I think that you can be confident the computer will get it right the first time and thus you don't need to make it keep re-doing this calculation ; ). There are other similar places you can improve your code to avoid duplicate computations here that I'll leave to you to locate.
aireties gives a good answer for why you have so many allocations. You can do more to reduce the number of allocations. Using this property we know that y = s+n is really y = sqrt(snr) * randn(L) + randn(L) and so we can instead do y = rvvar*randn(L) where rvvar= sqrt(1+sqrt(snr)^2) is defined outside the loop (thanks for the fix!). This will halve the number of random variables needed.
Outside the loop you can save sqrt(2/L) to cut down a little bit of time.
I don't think transpose is special-cased yet, so try using dot(y,y) instead of y'*y. I know dot for sure is just a loop without having to transpose, while the other may transpose depending on the version of Julia.
Something that would help performance (but not allocations) would be to use one big randn(L,iters) and loop through that. The reason is that if you make all of your random numbers at once, it's faster since it can use SIMD and a bunch of other goodies. If you want to implicitly do that without changing your code much, you can use ChunkedArrays.jl, where you can use rands = ChunkedArray(randn,L) to initialize it, and then every time you want a randn(L), you instead use next(rands). Inside, the ChunkedArray actually makes bigger vectors and replenishes them as needed, but this way you can just get your randn(L) without having to keep track of all of that.
Edit:
ChunkedArrays probably only save time when L is smaller. This gives the code:
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
rvvar= sqrt(1+sqrt(snr)^2)
for m = 1:length(Pf)
i = 0
for k = 1:iters
y = rvvar*randn(L)
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
end
which runs in half the time of the version using two randn calls. Indeed, from ProfileView we get:
@profile pdpf(1000, 10000)
using ProfileView
ProfileView.view()
I circled the two parts for the line y = rvvar*randn(L), so the vast majority of the time is spent in random number generation. Last time I checked you could still get a decent speedup on random number generation by changing to the VSL.jl library, but you need MKL linked to your Julia build. Note that from the Google Summer of Code page you can see that there is a project to make a repo RNG.jl with faster pseudo-RNGs. It looks like it already has a few new ones implemented. You may want to check them out and see if they give speedups (or help out with that project!)

Implementing an algorithm?

I have to write a small program to implement the following algorithm:
Assume you have a search algorithm which, at each level of recursion, excludes half of the data from consideration when searching for a specific data item. Search stops only when one data item is left. How many levels of recursion are required when the number of elements in the data is 1024?
Does anybody have an idea about how to analyze this, or any suggestions on how to start?
You need to find the minimal value of d such that:
1 * 2 * 2 * 2 * .... * 2 = 1024
____________________
total of d times
The above is true, because each multiplication by 2 is actually one level up in the recursion, you go up from the stop clause of 1 element, until you get the initial data size, which is 1024.
The above equation is actually 2^d = 1024
And it is solved easily by taking log_2 of both sides:
log_2(2^d) = log_2(1024)
d = 10
P.S. Note that the above is the number of recursive calls, exclusive of the initial call, so total number of calls to the method is d+1=11, one from the calling environment, and 10 from the method itself.
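If you want to sanity-check the answer programmatically (just an illustration, not part of the assignment), you can simulate the halving directly, for example in Python:
```
import math

def levels(n):
    """Count how many halvings it takes to go from n elements down to 1."""
    d = 0
    while n > 1:
        n //= 2      # each level of recursion discards half of the data
        d += 1
    return d

print(levels(1024))            # 10
print(int(math.log2(1024)))    # 10, the closed-form answer
```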

What's the best way to calculate remaining download time?

Let's say you want to calculate the remaining download time, and you have all the information needed, that is: file size, dl'ed size, size left, time elapsed, momentary dl speed, etc.
How would you calculate the remaining dl time?
Of course, the straightforward way would be either: size left / momentary dl speed, or: (time elapsed / dl'ed size) * size left.
Only that the first would be subject to deviations in the momentary speed, and the latter wouldn't adapt well to altering speeds.
Must be some smarter way to do that, right? Take a look at the pirated software and music you currently download with uTorrent. It's easy to notice that it does more than the simple calculation mentioned before. Actually, I noticed that sometimes when the dl speed drops, the time remaining also drops for a couple of moments until it readjusts.
Well, as you said, using the absolutely current download speed isn't a great method, because it tends to fluctuate. However, something like an overall average isn't a great idea either, because there may be large fluctuations there as well.
Consider if I start downloading a file at the same time as 9 others. I'm only getting 10% of my normal speed, but halfway through the file, the other 9 finish. Now I'm downloading at 10x the speed I started at. My original 10% speed shouldn't be a factor in the "how much time is left?" calculation any more.
Personally, I'd probably take an average over the last 30 seconds or so, and use that. That should do calculations based on recent speed without fluctuating wildly. 30 seconds may not be the right amount; it would take some experimentation to figure out a good window.
Another option would be to set a sort of "fluctuation threshold", where you don't do any recalculation until the speed changes by more than that threshold. For example (random number, again, would require experimentation), you could set the threshold at 10%. Then, if you're downloading at 100kb/s, you don't recalculate the remaining time until the download speed changes to either below 90kb/s or 110kb/s. If one of those changes happens, the time is recalculated and a new threshold is set.
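A sliding-window estimate along those lines could look like the following Python sketch (an illustration only; the 30-second window and one-sample-per-second cadence are the assumptions discussed above, and the class name is made up):
```
from collections import deque

class SlidingWindowEta:
    """Estimate remaining time from the average speed over the last `window` samples."""
    def __init__(self, total_bytes, window=30):
        self.total = total_bytes
        self.downloaded = 0
        self.samples = deque(maxlen=window)   # one speed sample per second

    def update(self, bytes_this_second):
        self.downloaded += bytes_this_second
        self.samples.append(bytes_this_second)

    def eta_seconds(self):
        if not self.samples:
            return float("inf")
        avg_speed = sum(self.samples) / len(self.samples)
        if avg_speed == 0:
            return float("inf")
        return (self.total - self.downloaded) / avg_speed
```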
You could use an averaging algorithm where the old values decay linearly. If S_n is the speed at time n and A_{n-1} is the average at time n-1, then define your average speed as follows.
A_1 = S_1
A_2 = (S_1 + S_2)/2
A_n = S_n/(n-1) + A_{n-1}(1-1/(n-1))
In English, this means that the longer in the past a measurement occurred, the less it matters because its importance has decayed.
Compare this to the normal averaging algorithm:
A_n = S_n/n + A_{n-1}(1-1/n)
You could also have it geometrically decay, which would weight the most recent speeds very heavily:
A_n = S_n/2 + A_{n-1}/2
If the speeds are 4,3,5,6 then
A_4 = 4.5 (simple average)
A_4 = 4.8333 (linear decay)
A_4 = 5.125 (geometric decay)
Example in PHP
Beware that $n+1 (not $n) is the number of data points so far, due to PHP's arrays being zero-indexed. To match the above example, set n == $n+1 (equivalently, n-1 == $n).
<?php
$s = [4,3,5,6];

// average
$a = [];
for ($n = 0; $n < count($s); ++$n)
{
    if ($n == 0)
        $a[$n] = $s[$n];
    else
    {
        // $n+1 = number of data points so far
        $weight = 1/($n+1);
        $a[$n] = $s[$n] * $weight + $a[$n-1] * (1 - $weight);
    }
}
var_dump($a);

// linear decay
$a = [];
for ($n = 0; $n < count($s); ++$n)
{
    if ($n == 0)
        $a[$n] = $s[$n];
    elseif ($n == 1)
        $a[$n] = ($s[$n] + $s[$n-1]) / 2;
    else
    {
        // $n = number of data points so far - 1
        $weight = 1/($n);
        $a[$n] = $s[$n] * $weight + $a[$n-1] * (1 - $weight);
    }
}
var_dump($a);

// geometric decay
$a = [];
for ($n = 0; $n < count($s); ++$n)
{
    if ($n == 0)
        $a[$n] = $s[$n];
    else
    {
        $weight = 1/2;
        $a[$n] = $s[$n] * $weight + $a[$n-1] * (1 - $weight);
    }
}
var_dump($a);
Output
array (size=4)
0 => int 4
1 => float 3.5
2 => float 4
3 => float 4.5
array (size=4)
0 => int 4
1 => float 3.5
2 => float 4.25
3 => float 4.8333333333333
array (size=4)
0 => int 4
1 => float 3.5
2 => float 4.25
3 => float 5.125
The obvious way would be something in between: you need a 'moving average' of the download speed.
I think it's just an averaging algorithm. It averages the rate over a few seconds.
What you could do also is keep track of your average speed and show a calculation of that as well.
For anyone interested, I wrote an open-source library in C# called Progression that has a "moving-average" implementation: ETACalculator.cs.
The Progression library defines an easy-to-use structure for reporting several types of progress. It also easily handles nested progress for very smooth progress reporting.
EDIT: Here's what I finally suggest; I tried it and it provides quite satisfying results:
I have a zero-initialized array for each download speed between 0 - 500 kB/s (could be higher if you expect such speeds) in 1 kB/s steps.
I sample the download speed momentarily (every second is a good interval), and increment the corresponding array item by one.
Now I know how many seconds I have spent downloading the file at each speed. The sum of all these values is the elapsed time (in seconds). The sum of these values multiplied by the corresponding speed is the size downloaded so far.
If I take the ratio between each value in the array and the elapsed time, assuming the speed variation pattern stabilizes, I can form a formula to predict the time each size will take. The size, in this case, is the size remaining. So here's what I do:
I take the sum of each array item value multiplied by the corresponding speed (the index) and divide it by the elapsed time. Then I divide the size left by this value, and that's the time left.
It takes a few seconds to stabilize, and then works pretty damn well.
Note that this is a "complicated" average, so the method of discarding older values (moving average) might improve it even further.
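For reference, here is a minimal Python sketch of that bookkeeping (hypothetical names; it assumes one speed sample per second and 1 kB/s bins, as described above):
```
class SpeedHistogramEta:
    """Remaining-time estimate from a histogram of observed download speeds."""
    def __init__(self, max_speed_kbs=500):
        self.seconds_at_speed = [0] * (max_speed_kbs + 1)   # index = speed in kB/s

    def sample(self, speed_kbs):
        """Record one second spent downloading at roughly `speed_kbs`."""
        bucket = min(int(speed_kbs), len(self.seconds_at_speed) - 1)
        self.seconds_at_speed[bucket] += 1

    def eta_seconds(self, size_left_kb):
        elapsed = sum(self.seconds_at_speed)                                  # seconds so far
        downloaded = sum(s * t for s, t in enumerate(self.seconds_at_speed))  # kB so far
        if elapsed == 0 or downloaded == 0:
            return float("inf")
        avg_speed_kbs = downloaded / elapsed
        return size_left_kb / avg_speed_kbs
```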
