Should we give concrete examples when tasking in TDD? - tdd

Here takes the "taximeter" requirement as example. Say we have go an simple requirement from client:
If the distance is <= 2.0km, the price is $6 fixed. After that, the charge of every km should be $0.8, and the final result should be rounded
Now I want to develop this requirement in TDD and I should split it into small tasks, and write them on paper. I have several different versions, but not sure which is the best
version 1, simplest
the starting fare is fixed
after starting fare distance, every km have fixed charge
the final price should be rounded
version 2, give some concrete numbers appears in the requirement
the starting fare is fixed $6 for distance <= 2km
after starting fare distance, every km have fixed charge $0.8
the final price should be rounded
version 3, with some examples
the starting fare is fixed $6 for distance <= 2km
e.g. 0km == $6, 2km == $6
after starting fare distance, every km have fixed charge $0.8
e.g. 3km == $6.8, 10km == $12.4
the final price should be rounded
e.g. $4 => $4, $4.49 => $4, $4.5 => $5
I'm not sure which one is the best as a task in TDD, is there any principles, examples or best practise?

Writing a list of tasks is simply a reminder list to help guide your TDD session, so the first version is a good example of that. Obviously, as you start each task, you will look at the requirement to get the actual numbers you'll need for your tests.

I'm not sure I get the point of writing the tasks out on paper (or anywhere else). Why not just write them as unit tests? No sense in duplicating the the effort.
Between the selections, it's a toss up between v2 and v3. It would depend on where the example numbers came from. If the examples come from the users, then I like v3.

Related

Statistics/Algorithm: How do I compare a weekly graph with its own history to see when in the past it was almost the same?

I’ve got a statistical/mathematical problem I’m stumped on and I was really hoping to get some help. I’m working on a research where I need to compare a weekly graph with its own history to see when in the past it was almost the same. Think of this as “finding the closest match”. The information is displayed as a line graph, but it’s readily available as raw data:
Date...................Result
08/10/18......52.5
08/07/18......60.2
08/06/18......58.5
08/05/18......55.4
08/04/18......55.2
and so on...
What I really want is the output to be a form of correlation between the current data points with the other set of 5 concurrent data points in history. So, something like:
Date range.....................Correlation
07/10/18-07/15/18....0.98
We’ll be getting a code written in Python for the software to do this automatically (so that as new data is added, it automatically runs and finds the closest set of numbers to match the current one).
Here’s where the difficulty sets in: Since numbers are on a general upward trend over time, we don’t want it to compare the absolute value (since the numbers might never really match). One suggestion has been to compare the delta (rate of change as a percentage over the previous day), or using a log scale.
I’m wondering: how do I go about this? What kind of calculation I can use to get the desired results? I’ve looked at the different kind of correlation equations, but they don’t account for the “shape” of the data, and they generally just average it out. The shape of the line chart is the important thing.
Thanks very much in advance!
I would simply divide the data of each week by their average (i.e., normalize them to an average of 1), then sum the squares of the differences of each day of each pair of weeks. This sum is what you want to minimize.
If you don't care about how much a graph oscillates relative to its mean, you can normalize also the variance. For each week, calculate mean and variance, then subtract the mean and divide by the root of the variance. Each week will have mean 0 and variance 1. Then minimize the sum of squares of differences like before.
If the normalization of data is all you can change in your workflow, just leave out the sum of squares of differences minimization part.

Algorithm recommendation for calculating score jumps

I've to come up with an algorithm which would determine jumps and changes of a person's personal scores. Just imagine that every day a person participates in a competition and the scores are recorded on daily basis. Now my task is to come up with a person's performance ratio based on provided time interval. For example: A person scored 7 yesterday, scored 6 today, which means the performance is negative: -1 .
My current solution:
I've two collection of numbers which represents scores, where each element is a daily score of a person (three days):
dataFor2014-07-11/13 = {6,6,6}
dataFor2014-07-13/15 = {6,3,5}
double personsScores = AVG(dataFor2014-07-13/15) - AVG(dataFor2014-07-11/13);
Output: 4.6 - 6 = -1.3 (person's performance for past six days is negative.)
Do you think this is reasonable algorithm? Do you have any suggestions how can I improve it and recommend any better solution?
I think this is more a mathematical problem and the Math forum would probably better to ask.
Probably, convolution filters are the way to go. This is a technique to make a 'graph' through some points (the individual scores).

Sorting People into Groups based on Votes

I have a problem with finding a algorithm for sorting a dataset of people. I try to explain as detailed as possible:
The story starts with a survey. A bunch of people, lets say 600 can choose between 20-25 projects. They make a #1-wish, #2-wish and #3-wish, where #1 is the most wanted project they want to take part and wish 3 the "not-perfect-but-most-acceptable-choose".
These project are limited in their number of participants. Every project can join around 30 people (based on the number of people and count of projects).
The algorithm puts the people in the different projects and should find the best possible combination.
The problem is that you can't just put all the people with their number 1 wish X in the certain project and stuff all the other with also number 1 wish X in there number 2 wish because that would not be the most "happiest" situation for everybody.
You may can think of what I mean when you imagine that for everybody who get his number 1 wish you get 100 points, for everybody who get his number 2 wish 60 points, number 3 wish 30 points and who get not in one of his wishes 0 points. And you want to get as most points as possible.
I hope you get my problem. This is for a school-project day.
Is there something out there that could help me? Do you have any idea? I would be thankful for every tipp!!
Kind regards
You can solve this optimally by formulating it as a min cost network flow problem.
Add a node for each person, and one for each project.
Set cost for a flow between a person and a project according to their preferences.
(As Networkx provides a min cost flow, but not max cost flow I have set the costs to be
negative.)
For example, using Networkx and Python:
import networkx as nx
G=nx.DiGraph()
prefs={'Tom':['Project1','Project2','Project3'],
'Dick':['Project2','Project1','Project3'],
'Harry':['Project1','Project3','Project1']}
capacities={'Project1':2,'Project2':10,'Project3':4}
num_persons=len(prefs)
G.add_node('dest',demand=num_persons)
A=[]
for person,projectlist in prefs.items():
G.add_node(person,demand=-1)
for i,project in enumerate(projectlist):
if i==0:
cost=-100 # happy to assign first choices
elif i==1:
cost=-60 # slightly unhappy to assign second choices
else:
cost=-30 # very unhappy to assign third choices
G.add_edge(person,project,capacity=1,weight=cost) # Edge taken if person does this project
for project,c in capacities.items():
G.add_edge(project,'dest',capacity=c,weight=0)
flowdict = nx.min_cost_flow(G)
for person in prefs:
for project,flow in flowdict[person].items():
if flow:
print person,'joins',project
In this code Tom's number 1 choice is Project1, followed by Project2, then Project3.
The capacities dictionary specifies the upper limit on how many people can join each project.
My algorithm would be something like this:
mainloop
wishlevel = 1
loop
Distribute people into all projects according to wishlevel wish
loop through projects, counting population
If population exceeds maximum
Distribute excess non-redistributed people into their wishlevel+1 projects that are under-populated
tag distributed people as 'redistributed' to avoid moving again
endif
endloop
wishlevel = wishlevel + 1
loop until wishlevel == 3
mainloop until no project exceeds max population
This should make several passes through the data set until everything is evened out. This algorithm may result in an endless loop if you restrict redistribution of already-redistributed people in the event that one project fills up with such people as the algorithm progresses, so you might try it without that restriction.

Feedback on ranking algorithm options for my website

I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour)
Denom = numVotesTuisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = { numVotesTotal if numVotesTotal >= 10, 10 if numVotesTotal < 10
Highest number is hottest
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something. In the first formula you have numVotesTotal in the denominator. So higher number of votes all time will mean it will never be so hot even if it is not so old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is more famous than P2. It doesn't make sense to me.
I think the first algorithm relies too heavily on instantaneous trend. Think of NASCAR, the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of average trend. I think both have their uses.
So for two posts with the same total votes and controversial rating, but where posts one receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)

Algorithm to determine most popular article last week, month and year?

I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).
I've been mulling on this for a while, but I'm not a great statitician so I figured I could maybe get some input here.
Here are the variables available:
Time [date] the article was originally published
Time [date] the article was recommended by editors (if it has been)
Amount of votes the article has received from users (total, in the last week, in the last month, in the last year)
Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
Comments on the article (total, in the last week, in the last month, in the last year)
Number of times a user has saved the article to their reading-list (Total, in the last week, in the last month, in the last year)
Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (Total, in the last week, in the last month, in the last year)
Time [date] the article was dubbed 'article of the week' (if it has been)
Right now I'm doing some weighting on each variable, and dividing by the times it has been read. That's pretty much all I could come up with after reading up on Weighted Means. My biggest problem is that there are some user-articles that are always on the top of the popular-list. Probably because the author is "cheating".
I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.
Anyone with a more statistically adept mind than mine willing to help me out?
Thanks!
I think the weighted means approach is a good one. But I think there are two things you need to work out.
How to weigh the criteria.
How to prevent "gaming" of the system
How to weigh the criteria
This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision making process, ranking the criteria is often the most difficult part of the process. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:
c1 c2 c3 ...
c1 1 4 2
c2 1/4 1 1/2
c3 1/2 2 1
...
This shows that C1 is 4 times as important as C2 which is half as important as C3. Use a finite pool of weightings, say 1.0 since that's easy. Distributing it over the criteria we have 4 * C1 + 2 * C3 + C2 = 1 or roughly C1 = 4/7, C3 = 2/7, C2 = 1/7. Where discrepencies arise (for instance if you think C1 = 2*C2 = 3*C3, but C3 = 2*C2), that's a good error indication: it means that you're inconsistent with your relative rankings so go back and reexamine them. I forget the name of this procedure, comments would be helpful here. This is all well documented.
Now, this all probably seems a bit arbitrary to you at this point. They're for the most part numbers you pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them in the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Finagle the numbers until they produce something close to that ordering.
Preventing gaming
This is the second important aspect. No matter what system you use, if you can't prevent "cheating" it will ultimately fail. You need to be able to limit voting (should an IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important the criterion, the more you need to prevent it from being gamed.
You can use outlier theory for detecting anomalies. A very naive way of looking for outliers is using the mahalanobis distance. This is a measure that takes into account the spread of your data, and calculates the relative distance from the center. It can be interpreted as how many standard deviations the article is from the center. This will however include also genuinely very popular articles, but it gives you a first indication that something is odd.
A second, more general approach is building a model. You could regress the variables that can be manipulated by users against those related to editors. One would expect that users and editors would agree to some extent. If they don't, then it's again an indication something is odd.
In both cases, you'll need to define some treshold and try to find some weighting based on that. A possible approach is to use the square rooted mahalanobis distance as an inverse weight. If you're far away from the center, your score will be pulled down. Same can be done using the residuals from the model. Here you could even take the sign into account. If the editor score is lower than what would be expected based on the user score, the residual will be negative. if the editor score is higher than what would be expected based on the user score, the residual is positive and it's very unlikely that the article is gamed. This allows you to define some rules to reweigh the given scores.
An example in R:
Code :
#Test data frame generated at random
test <- data.frame(
quoted = rpois(100,12),
seen = rbinom(100,60,0.3),
download = rbinom(100,30,0.3)
)
#Create some link between user-vars and editorial
test <- within(test,{
editorial = round((quoted+seen+download)/10+rpois(100,1))
})
#add two test cases
test[101,]<-c(20,18,13,0) #bad article, hyped by few spammers
test[102,]<-c(20,18,13,8) # genuinely good article
# mahalanobis distances
mah <- mahalanobis(test,colMeans(test),cov(test))
# simple linear modelling
mod <- lm(editorial~quoted*seen*download,data=test)
# the plots
op <- par(mfrow=c(1,2))
hist(mah,breaks=20,col="grey",main="Mahalanobis distance")
points(mah[101],0,col="red",pch=19)
points(mah[102],0,,col="darkgreen",pch=19)
legend("topright",legend=c("high rated by editors","gamed"),
pch=19,col=c("darkgreen","red"))
hist(resid(mod),breaks=20,col="grey",main="Residuals model",xlim=c(-6,4))
points(resid(mod)[101],0,col="red",pch=19)
points(resid(mod)[102],0,,col="darkgreen",pch=19)
par(op)
There are any number of ways to do this, and what works for you will depend on your actual dataset and what outcomes you desire for specific articles. As a rough reworking though, I would suggest moving the times it has been read to the weighted numbers and dividing by age of the article, since the older an article is, the more likely it is to have higher numbers in each category.
For example
// x[i] = any given variable above
// w[i] = weighting for that variable
// age = days since published OR
// days since editor recommendation OR
// average of both OR
// ...
score = (x[1]w[1] + ... + x[n]w[n])/age
Your problem of wanting to promote new articles more but not wanting to punish genuinely popular old articles requires consideration of how you can tell whether or not an article is genuinely popular. Then just use the "genuine-ness" algorithm to weight the votes or views rather than a static weighting. You can also change any of the other weightings to be functions rather than constants, and then have non-linear weightings for any variables you wish.
// Fw = some non-linear function
// (possibly multi-variable) that calculates
// a sub-score for the given variable(s)
score = (Fw1(x[1]) + ... + FwN(x[n]))/FwAge(age)

Resources