Token bucket vs Fixed window (Traffic Burst) - algorithm

I was comparing Token bucket and Fixed window rate limiting algorithm, But a bit confused with traffic bursts in both algorithm.
Let's say i want to limit traffic to 10 requests/minute.
In Token bucket, tokens are added at the rate of 10 tokens per minute.
Time Requests AvailableTokens
10:00:00 0 10 (added 10 tokens)
10:00:58 10 0
10:01:00 0 10 (added 10 tokens)
10:01:01 10 0
Now if we see at timestamp 10:01:01, in last minute 20 requests were allowed, more than our limit.
Similarly, With Fixed window algorithms.
Window size: 1 minute.
Window RequestCount IncomingRequests
10:00:00 10 10 req at 10:00:58
10:01:00 10 10 req at 10:01:01
Similar problem is here as well.
Does both the algorithms suffer from this problem, or is there a gap in my understanding?

I had the same confusion about those algorithms.
The trick with the Token Bucket is that Bucket size(b) and Refill rate(r) don't have to be equal.
For your particular example, you could set Bucket size to be b = 5 and refill rate r = 1/10 (1 token per 10 seconds).
With this example, the client is still able to make 11 requests per minute, but that's already less than 20 as in your example, and they are spread over time. And I also believe if you play with the parameters you can achieve a strategy when >10 requests/min is not allowed at all.
Time Requests AvailableTokens
10:00:00 0 5 (we have 5 tokens initially)
10:00:10 0 5 (refill attempt failed cause Bucket is full)
10:00:20 0 5 (refill attempt failed cause Bucket is full)
10:00:30 0 5 (refill attempt failed cause Bucket is full)
10:00:40 0 5 (refill attempt failed cause Bucket is full)
10:00:50 0 5 (refill attempt failed cause Bucket is full)
10:00:58 5 0
10:01:00 0 1 (refill 1 token)
10:01:10 0 2 (refill 1 token)
10:01:20 0 3 (refill 1 token)
10:01:30 0 4 (refill 1 token)
10:01:40 0 5 (refill 1 token)
10:01:49 5 0
10:01:50 0 1 (refill 1 token)
10:01:56 1 0
Other options:
b = 10 and r = 1/10
b = 9 and r = 1/10

Related

MQTT Keep Alive byte format

The MQTT 3.1.1 documentation is very clear and helpful, however I am having trouble understanding the meaning of one section regarding the Keep Alive byte structure in the connect message.
The documentation states:
The Keep Alive is a time interval measured in seconds. Expressed as a 16-bit word, it is the maximum time interval that is permitted to elapse between the point at which the Client finishes transmitting one Control Packet and the point it starts sending the next.
And gives an example of a keep alive payload:
Keep Alive MSB (0) 0 0 0 0 0 0 0 0
Keep Alive LSB (10) 0 0 0 0 1 0 1 0
I have interpreted this to represent a keep alive interval of 10 seconds, as the interval is given in seconds and that makes the most sense. However I'm not sure how you would represent longer intervals of, for example, 10 minutes.
Finally, would the maximum keep alive interval of 65535 seconds (~18 hours) be represented by these bytes
Keep Alive MSB (255) 1 1 1 1 1 1 1 1
Keep Alive LSB (255) 1 1 1 1 1 1 1 1
Thank you for your help
2^16=65536 seconds 65536/60 = 1092.27 minutes 1092.27/60 = 18.20 hours
0.20hour*60 = 12minutes y 0.27min*60 = 16.2sec
result: 18 hours,12minutes, 16sec
10 minutes = 600 seconds
600 in binary -> 0000 0010 0101 1000
And yes 65535 is the largest number that can be represented by a 16bit binary field, but there are very few situations where an 18 hour keep alive interval would make sense.

How can you improve computation time when predicting KNN Imputation?

I feel like my run time is extremely slow for my data set, this is the code:
library(caret)
library(data.table)
knnImputeValues <- preProcess(mainData[trainingRows, imputeColumns], method = c("zv", "knnImpute"))
knnTransformed <- predict(knnImputeValues, mainData[ 1:1000, imputeColumns])
the PreProcess into knnImputeValues run's fairly quickly, however the predict function takes a tremendous amount of time. When I calculated it on a subset of the data this was the result:
testtime <- system.time(knnTransformed <- predict(knnImputeValues, mainData[ 1:15000, imputeColumns
testtime
user 969.78
system 38.70
elapsed 1010.72
Additionally, it should be noted that caret preprocess uses "RANN".
Now my full dataset is:
str(mainData[ , imputeColumns])
'data.frame': 1809032 obs. of 16 variables:
$ V1: int 3 5 5 4 4 4 3 4 3 3 ...
$ V2: Factor w/ 3 levels "1000000","1500000",..: 1 1 3 1 1 1 1 3 1 1 ...
$ V3: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V4: int 2 5 5 12 4 5 11 8 7 8 ...
$ V5: int 2 0 0 2 0 0 1 3 2 8 ...
$ V6: int 648 489 489 472 472 472 497 642 696 696 ...
$ V7: Factor w/ 4 levels "","N","U","Y": 4 1 1 1 1 1 1 1 1 1 ...
$ V8: int 0 0 0 0 0 0 0 1 1 1 ...
$ V9: num 0 0 0 0 0 ...
$ V10: Factor w/ 56 levels "1","2","3","4",..: 45 19 19 19 19 19 19 46 46 46 ...
$ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V12: num 2 5 5 12 4 5 11 8 7 8 ...
$ V13: num 2 0 0 2 0 0 1 3 2 8 ...
$ V14: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 3 3 ...
$ V15: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
$ V16: num 657 756 756 756 756 ...
So is there something I'm doing wrong, or is this typical for how long it will take to run this? If you back of the envelop extrapolate (which I know isn't entire accurate) you'd get what 33 days?
Also it looks like system time is very low and user time is very high, is that normal?
My computer is a laptop, with a Intel(R) Core(TM) i5-6300U CPU # 2.40Ghz processor.
Additionally would this improve the runtime of the predict function?
cl <- makeCluster(4)
registerDoParallel()
I tried it, and it didn't seem to make a difference other than all the processors looked more active in my task manager.
FOCUSED QUESTION: I'm using Caret package to do KNN Imputation on 1.8 Million Rows, the way I'm currently doing it will take over a month to run, how do I write this in such a way that I could do it in a much faster amount of time(if possible)?
Thank you for any help provided. And the answer might very well be "that's how long it takes don't bother" I just want to rule out any possible mistakes.
You can speed this up via the imputation package and use of canopies which can be installed from Github:
Sys.setenv("PKG_CXXFLAGS"="-std=c++0x")
devtools::install_github("alexwhitworth/imputation")
Canopies use a cheap distance metric--in this case distance from the data mean vector--to get approximate neighbors. In general, we wish to keep the canopies each sized < 100k so for 1.8M rows, we'll use 20 canopies:
library("imputation")
to_impute <- mainData[trainingRows, imputeColumns] ## OP undefined
imputed <- kNN_impute(to_impute, k= 10, q= 2, verbose= TRUE,
parallel= TRUE, n_canopies= 20)
NOTE:
The imputation package requires numeric data inputs. You have several factor variables in your str output. They will cause this to fail.
You'll also get some mean vector imputation if you have fulling missing rows.
# note this example data is too small for canopies to be useful
# meant solely to illustrate
set.seed(2143L)
x1 <- matrix(rnorm(1000), 100, 10)
x1[sample(1:1000, size= 50, replace= FALSE)] <- NA
x_imp <- kNN_impute(x1, k=5, q=2, n_canopies= 10)
sum(is.na(x_imp[[1]])) # 0
# with fully missing rows
x2 <- x1; x2[5,] <- NA
x_imp <- kNN_impute(x2, k=5, q=2, n_canopies= 10)
[1] "Computing canopies kNN solution provided within canopies"
[1] "Canopies complete... calculating kNN."
row(s) 1 are entirely missing.
These row(s)' values will be imputed to column means.
Warning message:
In FUN(X[[i]], ...) :
Rows with entirely missing values imputed to column means.

Dynamic Programming - Two spies at the river

I think this is a very complicated dynamic programming problem.
Two spies each have a secret number in [1..m]. To exchange numbers they agree to meet at the river and "innocently" take turns throwing stones: from a pile of n=26 identical stones, each spy in turn throws at least one stone in the river.
The only information is in the number of stones each thrown in each turn. What is the largest m can be so they are sure they can complete the exchange?
Develop a recursive formula to count. Here is the start of the table; complete it to n=26. (You should not expect a closed form.)
n 1 2 3 4 5 6 7 8 9 10 11 12
m 1 1 1 2 2 3 4 6 8 12 16 23
Here are some hints from our professor: I suggest changing the problem to making the following table: Let R(n,m) be the range of numbers [1..R(n,m)] that A can indicate to B if they start with n stones, and both know that A has to also receive a number in [1..m] from B.
For example, if A needs no more information, R(n,1) can be computed by considering how many stones A could throw (one to n), then B thows 1 (if any remain) and A gets to decide again. The base cases R(0,1) = R(1,1) = 1, and you can write a recursive rule if you are careful at the boundaries. (You should find the Fibonacci numbers for R(n,1).)
If A needs information, then B has to send it by his or her choices, so things are a little more complicated. Here is the start of the table:
n\ m 1 2 3 4 5
0 1 0 0 0 0
1 1 0 0 0 0
2 2 0 0 0 0
3 3 1 0 0 0
4 5 2 1 0 0
5 8 4 2 1 1
6 13 7 4 3 2
7 21 12 8 6 4
8 34 20 15 11 8
9 55 33 27 19 16
From the R(n,m) table, how would you recover the entries of the earlier table (the table showing m as a function of n)?

Looking for failing test case to DP solution to MARTIAN on SPOJ

I am trying to solve the MARTIAN problem on SPOJ
My algorithm is as follows:
Define dp[i][j]=max amount of minerals that can be mined in the rectangle form 0,0 to i,j.
Use the recurrence
dp[i][j] = max(dp[i-1][j] + total amount of yeyenum
in the i-th row up to the j-th column,
dp[i][j-1] + total amount of bloggium
in the j-th column up to the cell i-th row)
However such an approach yields a WA (Wrong Answer). Can someone please provide me with a test case where such and approach will not work?
I am not looking for the correct algorithm just a test case where this approach fails as. I've been unable to find the bug myself.
Try this on your code(modified from the example given):
4 4
0 0 10 60
1 3 10 0
4 2 1 3
1 1 20 0
10 0 0 0
1 1 1 10
0 0 5 3
5 10 10 10
0 0
If you start by looking at [4][4], you'll choose Bloggium, because you can get 23 bloggium by going up, and only 22 Yeyenum from going left. However, you're going to miss a huge amount of Yeyenum.
Using your algorithm, you'll get 23 + 22 + 7 + 14 + 10 = 76.
If you choose the large Yeyenum, you'll get 70 + 14 + 10 + 22 = 116(all Yeyenum, since the bloggium gets blocked).

How to substitute a for-loop with vecorization acting several thousand times per data.frame row?

Being still quite wet behind the ears concerning R and - more important - vectorization, I cannot get my head around how to speed up the code below.
The for-loop calculates a number of seeds falling onto a road for several road segments with different densities of seed-generating plants by applying a random propability for every seed.
As my real data frame has ~200k rows and seed numbers are up to 300k/segment, using the example below would take several hours on my current machine.
#Example data.frame
df <- data.frame(Density=c(0,0,0,3,0,120,300,120,0,0))
#Example SeedRain vector
SeedRainDists <- c(7.72,-43.11,16.80,-9.04,1.22,0.70,16.48,75.06,42.64,-5.50)
#Calculating the number of seeds from plant densities
df$Seeds <- df$Density * 500
#Applying a probability of reaching the road for every seed
df$SeedsOnRoad <- apply(as.matrix(df$Seeds),1,function(x){
SeedsOut <- 0
if(x>0){
#Summing up the number of seeds reaching a certain distance
for(i in 1:x){
SeedsOut <- SeedsOut +
ifelse(sample(SeedRainDists,1,replace=T)>40,1,0)
}
}
return(SeedsOut)
})
If someone might give me a hint as to how the loop could be substituted by vectorization - or maybe how the data could be organized better in the first place to improve performance - I would be very grateful!
Edit: Roland's answer showed that I may have oversimplified the question. In the for-loop I extract a random value from a distribution of distances recorded by another author (that's why I can't supply the data here). Added an exemplary vector with likely values for SeedRain distances.
This should do about the same simulation:
df$SeedsOnRoad2 <- sapply(df$Seeds,function(x){
rbinom(1,x,0.6)
})
# Density Seeds SeedsOnRoad SeedsOnRoad2
#1 0 0 0 0
#2 0 0 0 0
#3 0 0 0 0
#4 3 1500 892 877
#5 0 0 0 0
#6 120 60000 36048 36158
#7 300 150000 90031 89875
#8 120 60000 35985 35773
#9 0 0 0 0
#10 0 0 0 0
One option is generate the sample() for all Seeds per row of df in a single go.
Using set.seed(1) before your loop-based code I get:
> df
Density Seeds SeedsOnRoad
1 0 0 0
2 0 0 0
3 0 0 0
4 3 1500 289
5 0 0 0
6 120 60000 12044
7 300 150000 29984
8 120 60000 12079
9 0 0 0
10 0 0 0
I get the same answer in a fraction of the time if I do:
set.seed(1)
tmp <- sapply(df$Seeds,
function(x) sum(sample(SeedRainDists, x, replace = TRUE) > 40)))
> tmp
[1] 0 0 0 289 0 12044 29984 12079 0 0
For comparison:
df <- transform(df, GavSeedsOnRoad = tmp)
df
> df
Density Seeds SeedsOnRoad GavSeedsOnRoad
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 3 1500 289 289
5 0 0 0 0
6 120 60000 12044 12044
7 300 150000 29984 29984
8 120 60000 12079 12079
9 0 0 0 0
10 0 0 0 0
The points to note here are:
try to avoid calling a function repeatedly in a loop if you the function is vectorised or can generate the entire end result with a single call. Here you were calling sample() Seeds times for each row of df, each call returning a single sample from SeedRainDists. Here I do a single sample() call asking for sample size Seeds, for each row of df - hence I call sample 10 times, your code called it 271500 times.
even if you have to repeatedly call a function in a loop, remove from the loop anything that is vectorised that could be done on the entire result after the loop is done. An example here is your accumulating of SeedsOut, which is calling +() a large number of times.
Better would have been to collect each SeedsOut in a vector, and then sum() that vector outside the loop. E.g.
SeedsOut <- numeric(length = x)
for(i in seq_len(x)) {
SeedsOut[i] <- ifelse(sample(SeedRainDists,1,replace=TRUE)>40,1,0)
}
sum(SeedOut)
Note that R treats a logical as if it were numeric 0s or 1s where used in any mathematical function. Hence
sum(ifelse(sample(SeedRainDists, 100, replace=TRUE)>40,1,0))
and
sum(sample(SeedRainDists, 100, replace=TRUE)>40)
would give the same result if run with the same set.seed().
There may be a fancier way of doing the sampling requiring fewer calls to sample() (and there is, sample(SeedRainDists, sum(Seeds), replace = TRUE) > 40 but then you need to take care of selecting the right elements of that vector for each row of df - not hard, just a light cumbersome), but what i show may be efficient enough?

Resources