Raft algorithm, prevent term increase - etcd

In the Raft algorithm, the term always increases. Is there a good way to keep the term from eventually reaching the limit of its type? I store the term as a tinyint and don't want to change the type, and since there is a round of elections every minute, the term grows rapidly.
d.Log.Term(d.Log.Term() + 1)
d.Log.State(BeCandidate)
I want to modify term=0 after the follower receives the heartbeat, but this does not work.

max(uint64) = 18446744073709551615
one year = 525600 minutes
so:
max(uint64) / 525600 ≈ 35096545041304.32 years
Should I really worry about it?
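The arithmetic above can be reproduced for other counter widths. This is a minimal sketch in Python, assuming one term increment per election, a year of 365 days, and an unsigned counter; the function name `years_until_overflow` is made up for illustration and is not part of etcd.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def years_until_overflow(bits: int, elections_per_minute: float = 1.0) -> float:
    """Years before an unsigned counter of `bits` bits reaches its maximum.

    Assumes the term grows by exactly one per election.
    """
    max_value = 2**bits - 1
    return max_value / elections_per_minute / MINUTES_PER_YEAR


if __name__ == "__main__":
    for bits in (8, 32, 64):
        print(f"{bits}-bit term: {years_until_overflow(bits):.6g} years")
```

Under these assumptions an 8-bit unsigned term (maximum 255) is exhausted after 255 elections, a little over four hours, whereas a 64-bit term lasts far longer than any deployment.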

Related

Which algorithm for assigning pattern shifts

I'm looking for a good algorithm or technique to find the best solution to the following problem. First I'll introduce the context, then the problem.
I work for a company with more than 2000 employees. All of them work on shift patterns: every employee has a pattern that specifies the sequence of workdays and free days. We have these patterns:
5-2-5-2 (5 days work, 2 free, 5 days work, 2 free) and so on.
5-2-4-3
5-4-5-3
5-3-5-3
At the moment we have all these patterns with different starting points; that is, a pattern can start on a given date at any position within the pattern. For example, the pattern 5-4-5-3 has 17 possible starting positions, because 5+4+5+3 = 17.
https://en.wikipedia.org/wiki/Shift_plan
Now the problem:
Every 6 months each employee can change pattern and start at any position within the new pattern.
But we must analyze all the requests and accept or reject each one so as to obtain the best combination for the company's operation. Ideally every day would have the same workforce. We understand that this is impossible, but an algorithm would help us find a good solution, even if not a perfect one.
I was reading about the "Nurse scheduling problem" with Google OR-Tools, but I don't understand how to model the pattern sequences to create a solution for this problem. I read some opinions about GA (genetic algorithms), and all of them said that this kind of solution is not good for this kind of problem.
Does anyone have a similar problem? Can someone give me a more accurate example with Google OR-Tools than the example on GitHub?
It's not necessary to find a strictly optimal solution; the roster is currently done manually, and I'm pretty sure the result is considerably sub-optimal most of the time.
Sounds a lot like OptaWeb Employee Rostering, which is a vertical on top of OptaPlanner, the constraint solver. Take a look at the source code. It's all open source.
I think this can be modeled as a MIP model.
Thinking out loud:
Introduce a binary decision variable:
δ(i,p) = 1 if pattern i is selected for person p
         0 otherwise
This includes the current pattern (say i=0). This would allow the cases:
an employee does not submit a new pattern (then we only have i=0 for this employee)
an employee submits one or more preferable patterns
We have the constraints:
sum(i, δ(i,p)) = 1 ∀p
sum((i,p), pattern(i,p,t)*δ(i,p)) ≈ requiredlevel(t) ∀t
δ(i,p) ∈ {0,1}
Here pattern(i,p,t) describes pattern i: it is 1 if period t is covered when pattern (i,p) is used and 0 otherwise. I use ≈ to indicate "approximately". (This is easily modeled using slacks and possibly penalty terms in the objective.)
Now we maximize
maximize sum((i,p), weight(i,p) * δ(i,p))
where weight(i,p) indicates the preference for a pattern (e.g. weight(0,p)=0, i.e. no bonus points when a newly preferred pattern is not selected).
Something like this should not be too difficult to set up. Of course many refinements are possible. These types of models tend to solve quite quickly.
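To make the model above concrete, here is a tiny sketch in Python. It is not a MIP solver: it simply enumerates every way of picking one candidate per person (the constraint sum(i, δ(i,p)) = 1) and scores each choice as total weight minus a penalty on the deviation from the required level, which plays the role of the slack. The helper names, the penalty value of 10 and the convention that a pattern alternates work and free blocks starting with work are all assumptions for illustration; a real instance with 2000 employees needs a proper solver.

```python
from itertools import product


def expand(pattern, offset, horizon):
    """Return a 0/1 list: 1 if day t is a workday for `pattern` started at `offset`."""
    cycle = []
    for k, length in enumerate(pattern):
        # Even-indexed blocks are work days, odd-indexed blocks are free days.
        cycle += [1 if k % 2 == 0 else 0] * length
    return [cycle[(t + offset) % len(cycle)] for t in range(horizon)]


def best_assignment(candidates, weights, required, penalty=10.0):
    """Brute-force stand-in for the MIP.

    candidates[p] is a list of (pattern, offset) options for person p,
    weights[p][i] is the preference for option i, and required[t] is the
    desired head count on day t.
    """
    horizon = len(required)
    best_score, best_choice = float("-inf"), None
    for choice in product(*(range(len(c)) for c in candidates)):
        cover = [0] * horizon
        for p, i in enumerate(choice):
            pattern, offset = candidates[p][i]
            for t, works in enumerate(expand(pattern, offset, horizon)):
                cover[t] += works
        slack = sum(abs(c - r) for c, r in zip(cover, required))
        score = sum(weights[p][i] for p, i in enumerate(choice)) - penalty * slack
        if score > best_score:
            best_score, best_choice = score, choice
    return best_choice, best_score
```

For example, with one person fixed on 5-2-5-2 from day 0 and a second person who may either keep that pattern or shift it by 5 days, `best_assignment` picks the shifted option when one person per day is required, because it covers the weekend gap.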
What is the workflow?
If you have a fixed roster and one person proposes a new pattern, just remove this person's contribution, test all (17) starting points of the new pattern and score them.
If you can change patterns or starting points for more than one employee, create an integer variable per starting point. From this starting point, it is easy to compute the person's contribution for each shifted day of the pattern. Then you can optimize quality of service w.r.t. the starting points of each pattern, summing potential contributions per day of the week for each employee.
Is that clear?
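The single-request workflow (remove one person, try every starting point, score each) can be sketched as follows. This is only an illustration in Python: the squared-deviation score, the `target` head count and the function names are assumptions, and the pattern is assumed to alternate work and free blocks starting with work.

```python
def work_cycle(pattern):
    """Expand e.g. (5, 4, 5, 3) into a 17-day list of 1 (work) / 0 (free)."""
    cycle = []
    for k, length in enumerate(pattern):
        cycle.extend([1 if k % 2 == 0 else 0] * length)
    return cycle


def score_offsets(pattern, base_cover, target):
    """Score every starting point of `pattern`; lower is better.

    base_cover[t] is the head count on day t without this person.
    Returns (score, offset) pairs sorted from best to worst.
    """
    cycle = work_cycle(pattern)
    n = len(cycle)
    scores = []
    for offset in range(n):
        total = 0
        for t, base in enumerate(base_cover):
            staffed = base + cycle[(t + offset) % n]
            total += (staffed - target) ** 2
        scores.append((total, offset))
    return sorted(scores)
```

The first element of the returned list is the starting point that leaves the daily head count closest to the target under this particular score.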

Algorithm recommendation for calculating score jumps

I have to come up with an algorithm that determines jumps and changes in a person's scores. Imagine that a person participates in a competition every day and the scores are recorded daily. My task is to compute a person's performance ratio over a given time interval. For example, a person who scored 7 yesterday and 6 today has a negative performance of -1.
My current solution:
I have two collections of numbers representing scores, where each element is a person's daily score (three days each):
dataFor2014-07-11/13 = {6,6,6}
dataFor2014-07-13/15 = {6,3,5}
double personsScores = AVG(dataFor2014-07-13/15) - AVG(dataFor2014-07-11/13);
Output: 4.67 - 6 = -1.33 (the person's performance for the past six days is negative.)
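The calculation above, written as runnable Python for checking the numbers. The function name `performance_change` is made up for illustration.

```python
from statistics import mean


def performance_change(previous, current):
    """Difference between the average of the current and the previous window."""
    return mean(current) - mean(previous)


if __name__ == "__main__":
    print(round(performance_change([6, 6, 6], [6, 3, 5]), 2))
```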
Do you think this is a reasonable algorithm? Do you have any suggestions for improving it, or can you recommend a better solution?
I think this is more of a mathematical problem, and the Math forum would probably be a better place to ask.
Convolution filters are probably the way to go. This is a technique for drawing a smooth 'graph' through a set of points (the individual scores).
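One simple convolution filter is a moving average, i.e. convolving the scores with a flat kernel. The sketch below is in Python and assumes a window of three days; both the kernel shape and the window are illustrative choices, not something the answer specifies.

```python
def moving_average(scores, window=3):
    """Convolve `scores` with a flat kernel of length `window` (valid part only)."""
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    kernel = [1.0 / window] * window
    return [
        sum(k * s for k, s in zip(kernel, scores[i:i + window]))
        for i in range(len(scores) - window + 1)
    ]
```

Applied to the five daily scores 6, 6, 6, 3, 5 (assuming the two collections in the question overlap on 2014-07-13), this gives a smoothed series whose successive differences show the trend without reacting fully to a single bad day.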

Feedback on ranking algorithm options for my website

I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour))
Denom = numVotesThisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = numVotesTotal if numVotesTotal >= 10, otherwise 10
Highest number is hottest
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something? In the first formula you have numVotesTotal in the denominator, so a higher all-time vote count means a post will never be very hot, even if it is not that old.
For example, suppose I have two posts, P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is hotter than P2. That doesn't make sense to me.
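The two formulas from the question, sketched in Python so the P1/P2 example can be checked. The function names are made up, the logarithm is assumed to be base 10 (the question does not say), and a post with no votes at all is assumed to score zero in the first formula.

```python
import math


def hot_score_ratio(controversial, votes_this_hour, votes_total):
    """First formula: votes this hour relative to all earlier votes."""
    denom = votes_total - votes_this_hour
    if denom == 0:
        denom = votes_this_hour  # fallback described in the question
    if denom == 0:
        return 0.0  # assumption: no votes at all means a score of zero
    return controversial * votes_this_hour / denom


def hot_score_decay(controversial, votes_total, time_passed, interval):
    """Second formula: log of total votes minus a linear time penalty."""
    x = max(votes_total, 10)
    return controversial * math.log10(x) - time_passed / interval


if __name__ == "__main__":
    print(hot_score_ratio(1.0, 10, 20))     # P1
    print(hot_score_ratio(1.0, 200, 1000))  # P2
```

With equal controversy, P1 scores 10 / (20 - 10) = 1 and P2 scores 200 / (1000 - 200) = 0.25, which is the ordering the answer objects to.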
I think the first algorithm relies too heavily on the instantaneous trend. Think of NASCAR: the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of an average trend. I think both have their uses.
Consider two posts with the same total votes and controversy rating, where post one receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm, but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)
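A minimal sketch of the combined score described above, in Python. The time frame names, the multiplier values in the example and the function names are assumptions for illustration; a time frame with no votes at all is assumed to contribute zero.

```python
def timeframe_share(votes_for_topic, all_votes):
    """votes_for_topic_in_timeframe / all_votes_in_timeframe, or 0 if nobody voted."""
    return votes_for_topic / all_votes if all_votes else 0.0


def combined_hotness(topic_votes, total_votes, multipliers):
    """Weighted sum of a topic's vote share over several time frames."""
    return sum(
        multipliers[frame]
        * timeframe_share(topic_votes.get(frame, 0), total_votes.get(frame, 0))
        for frame in multipliers
    )
```

For instance, a topic with 5 of 50 votes this hour and 20 of 100 votes this week, with multipliers 3 and 1, gets 3 * 0.1 + 1 * 0.2 = 0.5. Sorting topics by this value and taking the top n gives the list to display.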

Algorithm to determine most popular article last week, month and year?

I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).
I've been mulling over this for a while, but I'm not a great statistician, so I figured I could maybe get some input here.
Here are the variables available:
Time [date] the article was originally published
Time [date] the article was recommended by editors (if it has been)
Number of votes the article has received from users (total, in the last week, in the last month, in the last year)
Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
Comments on the article (total, in the last week, in the last month, in the last year)
Number of times a user has saved the article to their reading-list (Total, in the last week, in the last month, in the last year)
Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (Total, in the last week, in the last month, in the last year)
Time [date] the article was dubbed 'article of the week' (if it has been)
Right now I'm doing some weighting on each variable and dividing by the number of times the article has been read. That's pretty much all I could come up with after reading up on Weighted Means. My biggest problem is that some user articles are always at the top of the popular list, probably because the author is "cheating".
I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.
Anyone with a more statistically adept mind than mine willing to help me out?
Thanks!
I think the weighted means approach is a good one. But I think there are two things you need to work out.
How to weigh the criteria.
How to prevent "gaming" of the system
How to weigh the criteria
This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision making process, ranking the criteria is often the most difficult part of the process. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:
      c1    c2    c3    ...
c1    1     4     2
c2    1/4   1     1/2
c3    1/2   2     1
...
This shows that C1 is 4 times as important as C2, which is half as important as C3. Use a finite pool of weightings, say 1.0 since that's easy. Distributing it over the criteria, we have C1 + C2 + C3 = 1 with C1 = 4 * C2 and C3 = 2 * C2, so C1 = 4/7, C3 = 2/7, C2 = 1/7. Where discrepancies arise (for instance if you think C1 = 2*C2 = 3*C3, but C3 = 2*C2), that's a good error indication: it means that you're inconsistent with your relative rankings, so go back and reexamine them. I forget the name of this procedure, comments would be helpful here. This is all well documented.
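The weights can be derived from the comparison table mechanically. The sketch below, in Python, normalizes each column to sum to 1 and averages across columns, which is one common way to do it and reproduces 4/7, 1/7, 2/7 for the table above; the function names and the exact-fraction representation are choices made for illustration. The consistency check tests whether a(i,j) * a(j,k) = a(i,k) holds for every triple.

```python
from fractions import Fraction as F


def weights_from_pairwise(matrix):
    """Normalize each column to sum to 1, then average across columns."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [sum(matrix[i][j] / col_sums[j] for j in range(n)) / n for i in range(n)]


def is_consistent(matrix):
    """True if every chain of comparisons agrees with the direct comparison."""
    n = len(matrix)
    return all(
        matrix[i][j] * matrix[j][k] == matrix[i][k]
        for i in range(n) for j in range(n) for k in range(n)
    )


# The table above: rows and columns are c1, c2, c3.
comparisons = [
    [F(1), F(4), F(2)],
    [F(1, 4), F(1), F(1, 2)],
    [F(1, 2), F(2), F(1)],
]

if __name__ == "__main__":
    print(weights_from_pairwise(comparisons), is_consistent(comparisons))
```

A table built from the inconsistent example (C1 = 2*C2 = 3*C3 but C3 = 2*C2) fails `is_consistent`, which is the signal to go back and reexamine the rankings.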
Now, this all probably seems a bit arbitrary to you at this point. They're for the most part numbers you pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them in the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Finagle the numbers until they produce something close to that ordering.
Preventing gaming
This is the second important aspect. No matter what system you use, if you can't prevent "cheating" it will ultimately fail. You need to be able to limit voting (should an IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important the criterion, the more you need to prevent it from being gamed.
You can use outlier theory for detecting anomalies. A very naive way of looking for outliers is using the Mahalanobis distance. This is a measure that takes into account the spread of your data and calculates the relative distance from the center. It can be interpreted as how many standard deviations the article is from the center. This will, however, also flag genuinely very popular articles, but it gives you a first indication that something is odd.
A second, more general approach is building a model. You could regress the variables that can be manipulated by users against those related to editors. One would expect that users and editors would agree to some extent. If they don't, then it's again an indication something is odd.
In both cases, you'll need to define some threshold and try to find some weighting based on that. A possible approach is to use the square-rooted Mahalanobis distance as an inverse weight. If you're far away from the center, your score will be pulled down. The same can be done using the residuals from the model. Here you could even take the sign into account. If the editor score is lower than what would be expected based on the user score, the residual will be negative. If the editor score is higher than what would be expected based on the user score, the residual is positive and it's very unlikely that the article is gamed. This allows you to define some rules to reweigh the given scores.
An example in R:
# Test data frame generated at random
test <- data.frame(
  quoted = rpois(100, 12),
  seen = rbinom(100, 60, 0.3),
  download = rbinom(100, 30, 0.3)
)
# Create some link between user-vars and editorial
test <- within(test, {
  editorial = round((quoted + seen + download) / 10 + rpois(100, 1))
})
# Add two test cases
test[101, ] <- c(20, 18, 13, 0) # bad article, hyped by few spammers
test[102, ] <- c(20, 18, 13, 8) # genuinely good article
# Mahalanobis distances (mahalanobis() returns squared distances)
mah <- mahalanobis(test, colMeans(test), cov(test))
# Simple linear modelling
mod <- lm(editorial ~ quoted * seen * download, data = test)
# The plots
op <- par(mfrow = c(1, 2))
hist(mah, breaks = 20, col = "grey", main = "Mahalanobis distance")
points(mah[101], 0, col = "red", pch = 19)
points(mah[102], 0, col = "darkgreen", pch = 19)
legend("topright", legend = c("high rated by editors", "gamed"),
       pch = 19, col = c("darkgreen", "red"))
hist(resid(mod), breaks = 20, col = "grey", main = "Residuals model", xlim = c(-6, 4))
points(resid(mod)[101], 0, col = "red", pch = 19)
points(resid(mod)[102], 0, col = "darkgreen", pch = 19)
par(op)
There are any number of ways to do this, and what works for you will depend on your actual dataset and what outcomes you desire for specific articles. As a rough reworking, though, I would suggest moving the number of times the article has been read into the weighted variables and dividing by the age of the article instead, since the older an article is, the more likely it is to have higher numbers in each category.
For example
// x[i] = any given variable above
// w[i] = weighting for that variable
// age = days since published OR
// days since editor recommendation OR
// average of both OR
// ...
score = (x[1]*w[1] + ... + x[n]*w[n]) / age
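The same formula as a small Python function. The function name and the choice to reject a non-positive age are assumptions for illustration; which variables go in and how age is measured is left open, as in the pseudocode.

```python
def age_weighted_score(values, weights, age_days):
    """Weighted sum of the variables divided by the article's age in days."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if age_days <= 0:
        raise ValueError("age_days must be positive")
    return sum(x * w for x, w in zip(values, weights)) / age_days
```

For example, 100 views weighted 0.1 and 10 votes weighted 2.0 on a five-day-old article give (10 + 20) / 5 = 6.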
Your problem of wanting to promote new articles more but not wanting to punish genuinely popular old articles requires consideration of how you can tell whether or not an article is genuinely popular. Then just use the "genuine-ness" algorithm to weight the votes or views rather than a static weighting. You can also change any of the other weightings to be functions rather than constants, and then have non-linear weightings for any variables you wish.
// Fw = some non-linear function
// (possibly multi-variable) that calculates
// a sub-score for the given variable(s)
score = (Fw1(x[1]) + ... + FwN(x[n]))/FwAge(age)

Algorithm(s) for spotting anomalies ("spikes") in traffic data

I find myself needing to process network traffic captured with tcpdump. Reading the traffic is not hard, but what gets a bit tricky is spotting where there are "spikes" in the traffic. I'm mostly concerned with TCP SYN packets and what I want to do is find days where there's a sudden rise in the traffic for a given destination port. There's quite a bit of data to process (roughly one year).
What I've tried so far is an exponential moving average. This was good enough to get some interesting measures out, but compared with external data sources it seems to be a bit too aggressive in flagging things as abnormal.
I've considered using a combination of the exponential moving average plus historical data (possibly from 7 days in the past, thinking that there ought to be a weekly cycle to what I am seeing), as some papers I've read seem to have managed to model resource usage that way with good success.
So, does anyone know of a good method, or somewhere to go and read up on this sort of thing?
The moving average I've been using looks roughly like:
avg = avg+0.96*(new-avg)
With avg being the EMA and new being the new measure. I have been experimenting with which thresholds to use, and found that a combination of "must be a given factor higher than the average prior to weighing the new value in" and "must be at least 3 higher" gives the least bad result.
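The EMA with the two thresholds described above might look like this in Python. It is a sketch only: the function name, the default `factor` of 2.0 (the question only says "a given factor") and seeding the average with the first sample are assumptions.

```python
def ema_spikes(counts, alpha=0.96, factor=2.0, min_rise=3.0):
    """Return indices flagged as spikes.

    Each value is compared with the EMA *before* it is weighed in: it must be
    more than `factor` times the average and at least `min_rise` above it.
    """
    if not counts:
        return []
    avg = float(counts[0])  # assumption: seed the EMA with the first sample
    spikes = []
    for i, new in enumerate(counts[1:], start=1):
        if new > factor * avg and new - avg >= min_rise:
            spikes.append(i)
        avg = avg + alpha * (new - avg)
    return spikes
```

With the daily counts 10, 10, 10, 50, 10, 10 only index 3 is flagged; the day after the spike is not, because the average has already moved up to follow it.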
This is widely studied in intrusion detection literature. This is a seminal paper on the issue which shows, among other things, how to analyze tcpdump data to gain relevant insights.
This is the paper: http://www.usenix.org/publications/library/proceedings/sec98/full_papers/full_papers/lee/lee_html/lee.html. They use the RIPPER rule induction system; I guess you could replace it with something newer, such as http://www.newty.de/pnc2/ or http://www.data-miner.com/rik.html
I would apply two low-pass filters to the data, one with a long time constant, T1, and one with a short time constant, T2. You would then look at the magnitude difference in output from these two filters and when it exceeds a certain threshold, K, then that would be a spike. The hardest part is tuning T1, T2 and K so that you don't get too many false positives and you don't miss any small spikes.
The following is a single pole IIR low-pass filter:
out = k * old + (1 - k) * new
Here old is the previous filter output and new is the new sample.
The value of k determines the time constant and is usually close to 1.0 (but < 1.0 of course).
I am suggesting that you apply two such filters in parallel, with different time constants, e.g. start with say k = 0.9 for one (short time constant) and k = 0.99 for the other (long time constant) and then look at the magnitude difference in their outputs. The magnitude difference will be small most of the time, but will become large when there is a spike.
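A minimal Python sketch of the two parallel filters. The function names, the default threshold of 5.0 and seeding each filter with the first sample are assumptions; T1, T2 and K still have to be tuned against real data as described above.

```python
def low_pass(samples, k):
    """Single-pole IIR low-pass filter: state = k * state + (1 - k) * sample."""
    out = []
    state = float(samples[0])  # assumption: seed with the first sample
    for x in samples:
        state = k * state + (1 - k) * x
        out.append(state)
    return out


def two_filter_spikes(samples, k_short=0.9, k_long=0.99, threshold=5.0):
    """Indices where the short and long filters differ by more than `threshold`."""
    if not samples:
        return []
    short = low_pass(samples, k_short)
    long_ = low_pass(samples, k_long)
    return [
        i for i, (s, l) in enumerate(zip(short, long_))
        if abs(s - l) > threshold
    ]
```

Note that after a large spike the short filter needs several samples to decay, so a run of consecutive indices is flagged; the first index of each run marks where the spike begins.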
