How do I estimate task size for an open source project? - estimation

The scale of an open source project is completely different from the projects I do at the office: work is done in spare time, volunteer contributions may not materialize, the resources are personal rather than corporate, and so on.
Clearly the chestnut "do the smallest thing that works" applies, but beyond that, are there any more formal methods for estimating the appropriate size of an open source project, for example by number of tables, number of web pages, or--heaven forbid--function point counting?
What estimation tools would work best for these sorts of projects?

I was recently asked to estimate how long it would take to build an enormous system just by looking at screenshot mockups. Management was asking for a gut feel in under an hour, without asking any questions.
I listed out all the modules (pages, reports, big queries, etc.) that I could see and started giving them relative estimates. e.g.:
Task 1: 8 units
Task 2: 16 units
Task 3: 4 units
Then I added a bunch of modules we had already done for this customer along with the relative number of units and actual number of hours/days. This told me what my ratio of units to hours was so I could guess (more than estimate) how long the unknown tasks should take. For example, if I found that an 8 unit task took us 16 hours in the past (2 hours/unit), I'd estimate that the above tasks might take:
Task 1: 8 units * 2 hours/unit = 16 hours
Task 2: 16 units * 2 hours/unit = 32 hours
Task 3: 4 units * 2 hours/unit = 8 hours
This approach enabled me to methodically consider the work to be done and apply some structure around guessing how long it would take to implement.
Of course I delivered my +/- guess with a generous disclaimer.
Then, if you want a calendar schedule from this, estimate how many hours per week you will work on the project and see what you come up with.
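For what it's worth, the arithmetic above is easy to script. Here is a minimal Python sketch of the ratio approach; the historical figures, the 10 hours/week availability, and the task names are illustrative numbers, not anything from a real project.

# Sketch of the units-to-hours ratio approach described above.
# All figures here are illustrative.

historical = [
    {"units": 8, "actual_hours": 16},    # comparable modules we already shipped
    {"units": 12, "actual_hours": 26},
]

# Derive the hours-per-unit ratio from past work.
hours_per_unit = sum(t["actual_hours"] for t in historical) / sum(t["units"] for t in historical)

new_tasks = {"Task 1": 8, "Task 2": 16, "Task 3": 4}

for name, units in new_tasks.items():
    print(f"{name}: {units} units * {hours_per_unit:.1f} h/unit = {units * hours_per_unit:.0f} hours")

# Turn the total into a rough calendar estimate.
total_hours = sum(new_tasks.values()) * hours_per_unit
hours_per_week = 10   # spare-time availability; pure assumption
print(f"~{total_hours / hours_per_week:.1f} weeks at {hours_per_week} h/week")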

Related

Algorithm to find the best possible available times

Here is my scenario,
I run a massage place which offers various types of massages, say a 30-minute massage, a 45-minute massage, a 1-hour massage, etc. I have 50 rooms, 100 employees, and 30 pieces of equipment. When a customer books a massage appointment, the appointment requires 1 room, 1 employee, and 1 piece of equipment to be available.
What is a good algorithm to find available resources for 10 guests on a given day?
Resources:
Room – 50
Staff – 100
Equipment – 30
Business Hours : 9AM - 6PM
Staff Hours: 9AM- 6PM
No of guests: 10
Services
5 guests – 1-hour massages
3 guests – 45-minute massages
2 guests – 1-hour massages
They are coming around the same time. Assume there are no other appointments on that day.
What is the best way to get:
Top 10 results – the fastest search that meets all conditions returns the top 10 result set. The top ten is defined by earliest available time: 9–11 AM is the best result set; 9 AM–5 PM is not as good.
Exhaustive search (find all combinations) – every possible combination.
First available match (only return the first match) – stop after the first combination that meets all conditions has been found.
I would appreciate your help.
Thanks
Nick
First, it seems the exact numbers of employees, rooms, and equipment are irrelevant; you only care about which of those is the lowest number. That is your inventory. So in your case, inventory = 30.
Next, it sounds like you can service all 10 people at the same time within the first hour of business. In fact, you can service 30 people at the same time.
So, no algorithm is necessary to figure that out; it's a static solution. If you take @Mario The Spoon's advice and weight the different-duration massages with their corresponding profits, then you can start optimizing when you have more than 30 customers at a time.
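As a rough sketch of that point in Python (the numbers are simply the ones from the question):

# Concurrent capacity is the scarcest resource.
rooms, staff, equipment = 50, 100, 30
capacity = min(rooms, staff, equipment)   # 30 guests can be served at once

guests = [60] * 5 + [45] * 3 + [60] * 2   # requested massage lengths in minutes

if len(guests) <= capacity:
    # Everyone can start at opening time; the last guest finishes after the longest service.
    print("All guests can start at 9:00 AM and finish within the first hour.")
else:
    print("More guests than concurrent capacity; real scheduling is needed.")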
Looks like you are trying to solve a problem for which there are quite specialized software applications. If your problem is small enough, you could try to do a brute force approach using some looping and backtracking, but as soon as the problem becomes too big, it will take too much time to iterate through all possibilities.
If the problem starts to get big, look for more specialized software. Things to look for are "constraint based optimization" and "constraint programming".
E.g., the ECLiPSe tool is an open-source constraint programming environment. You can find some examples on http://eclipseclp.org/examples/index.html. One nice example you can find there is the SEND+MORE=MONEY problem. In this problem you have the following equation:
      S E N D
  +   M O R E
  -----------
  = M O N E Y
Replace every letter by a digit so that the sum is correct.
This also illustrates that although you can solve this brute-force, there are more intelligent ways to solve this (see http://eclipseclp.org/examples/sendmore.pl.txt).
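For comparison, here is a brute-force version of the same puzzle in Python; the constraint-programming version linked above solves it far more intelligently by propagating constraints instead of enumerating permutations.

# Brute-force SEND + MORE = MONEY: try every assignment of distinct digits.
from itertools import permutations

letters = "SENDMORY"                          # the 8 distinct letters in the puzzle
for digits in permutations(range(10), len(letters)):
    d = dict(zip(letters, digits))
    if d["S"] == 0 or d["M"] == 0:            # leading digits must be non-zero
        continue
    send = int("".join(str(d[c]) for c in "SEND"))
    more = int("".join(str(d[c]) for c in "MORE"))
    money = int("".join(str(d[c]) for c in "MONEY"))
    if send + more == money:
        print(send, "+", more, "=", money)    # 9567 + 1085 = 10652
        break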
Just an idea to find a solution:
You might want to try to solve it with a constraint satisfaction problem (CSP) algorithm. That's what some people do when they have to solve timetabling problems in general (e.g., room reservation at a university).
There are several tricks to improve CSP performance, like forward checking, or building a DAG and then doing a topological sort, and so on...
Just let me know, if you need more information about CSP :)

slot machine payout calculation

There's this question but it has nothing close to help me out here.
I tried to find information about it on the internet, but this subject is so swarmed with articles on "how to win" and other unrelated stuff that I could barely find anything worth posting here.
My question is how would I assure a payout of 95% over a year?
Theoretically, of course.
So far I can think of three obvious variables to consider within the calculation: Machine payout term (year in my case), total paid and total received in that term.
Now I could simply pick a random number within the paid/received gap and fix the slot results shown to the player, but I'm not sure this is how it's done.
This method sounds reasonable, however, although it involves building the slot results backwards.
I could also make a huge list of all possibilities, save them in a database in randomized order, and simply poll one of them each time.
This has many flaws, the biggest one being the huge list I'm going to get (millions/billions of records).
I certainly hope this question will be marked with an "Answer" (:
You have to make reel strips instead of a huge database. Here is a brief example for a very basic 3-reel game containing 3 symbols:
Paytable:
3xA = 5
3xB = 10
3xC = 20
A reel strip is the sequence of symbols on each reel. For the calculations you only need the quantity of each symbol on each reel:
A = 3, 1, 1 (3 symbols on 1st reel, 1 symbol on 2nd, 1 symbol on 3rd reel)
B = 1, 1, 2
C = 1, 1, 1
Full cycle (total number of all possible combinations) is 5 * 3 * 4 = 60
Now you can calculate probability of each combination:
3xA = 3 * 1 * 1 / full cycle = 0.05
3xB = 1 * 1 * 2 / full cycle = 0.0333
3xC = 1 * 1 * 1 / full cycle = 0.0166
Then you can calculate the return for each combination:
3xA = 5 * 0.05 = 0.25 (25% from AAA)
3xB = 10 * 0.0333 = 0.333 (33.3% from BBB)
3xC = 20 * 0.0166 = 0.333 (33.3% from CCC)
Total return = 91.66%
Finally, you can shuffle the symbols on each reel to get the reels-strips, e.g. "ABACA" for the 1st reel. Then pick a random number between 1 and the length of the strip, e.g. 1 to 5 for the 1st reel. This number is the middle symbol. The upper and lower ones are from the strip. If you picked from the edge of the strip, use the first or last one to loop the strip (it's a virtual reel). Then score the result.
In real life you might want to have Wild-symbols, free spins and bonuses. They all are pretty complicated to describe in this answer.
In this sample the Hit Frequency is 10% (total combinations = 60 and prize combinations = 6). Most people use Excel to calculate this stuff; however, you may find some good tools for slot math.
Proper keywords for Google: PAR-sheet, "slot math can be fun" book.
For sweepstakes or Class-2 machines you can't use this stuff. You have to display a combination by the given prize instead. This is a pretty different task, so you may try to prepare a database storing the combinations sorted by the prize amount.
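The return calculation above is small enough to reproduce in a few lines of Python, which also makes it easy to experiment with different symbol counts and paytables (the numbers below are just the ones from the example):

# Reel-strip return (RTP) calculation from the example above.
reels = {                       # symbol -> count on (reel 1, reel 2, reel 3)
    "A": (3, 1, 1),
    "B": (1, 1, 2),
    "C": (1, 1, 1),
}
paytable = {"A": 5, "B": 10, "C": 20}   # prize for three of a kind

reel_lengths = [sum(counts[i] for counts in reels.values()) for i in range(3)]
full_cycle = reel_lengths[0] * reel_lengths[1] * reel_lengths[2]   # 5 * 3 * 4 = 60

total_return = 0.0
for symbol, counts in reels.items():
    probability = (counts[0] * counts[1] * counts[2]) / full_cycle
    total_return += paytable[symbol] * probability
    print(f"3x{symbol}: p = {probability:.4f}, return = {paytable[symbol] * probability:.4f}")

print(f"Total return: {total_return:.2%}")   # 91.67%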
Well, the first problem is with the keyword assure: if you are dealing with randomness, you cannot assure anything unless you change the logic of the slot machine.
Consider the following algorithm though. I think this style of thinking is more reliable than plotting graphs of averages to achieve 95%:
if( customer_able_to_win() )
{
    calculate_how_to_win();
}
else
{
    no_win();
}
customer_able_to_win() consults your data log of how much intake you have received versus how much you have paid out. If you are under 95%, you can pay out, so customer_able_to_win() returns true. In that case, calculate_how_to_win() works out how much the customer is allowed to win based on your percentage. Let's choose a sampling period of 24 hours: if over the last 24 hours I've paid out 90% of the money I've taken in, then I can pay out up to 5% more; let's say that 5% comes to $100. calculate_how_to_win() then says I can pay out up to $100, so I would find a set of reels that would pay out $100 or less, and that user could win. You could add a little randomness to it, but to ensure your 95% you'll have to have some other rules, such as a forced maximum payout if you get below, say, 80%, and so on.
If you change the algorithm a little by adding randomness to the mix, you will need more of these caveats. So to make it APPEAR random to the user, you could do:
if( customer_able_to_win() && payout_percent() < 90% )
{
    calculate_how_to_win(); // up to 5% payout
}
else
{
    no_win();
}
With something like that, it will go on a losing streak after you hit 95% until you reach 90%, then it will go on a winning streak of random increments until you reach 95%.
This isn't a full algorithm answer, but more of a direction on how to think about how the slot machine works.
I've always envisioned this is the way slot machines work, especially video poker, because the no_win() function would calculate how to lose but make the hand appear to be one card off, to tease you into thinking you were going to win, instead of dealing a 'fair' game where the randomness just happens to fall that way.
Think of the entire process as: first decide whether the player is going to win; if they are, decide how they will win; if they're not, decide how they will lose. That is instead of random number generators determining whether they win or not.
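Here is a rough Python sketch of that intake-versus-payout bookkeeping. It is not a real slot engine; the 95% target, the 90% trigger, and the win multipliers are placeholder numbers chosen only to show the throttling behaviour described above.

import random

TARGET = 0.95    # long-run payout target
TRIGGER = 0.90   # start allowing wins again below this ratio

class PayoutTracker:
    def __init__(self):
        self.taken_in = 0.0
        self.paid_out = 0.0

    def ratio(self):
        return self.paid_out / self.taken_in if self.taken_in else 0.0

    def spin(self, bet):
        self.taken_in += bet
        if self.ratio() < TRIGGER:
            # How much we can still pay out without exceeding the target.
            headroom = self.taken_in * TARGET - self.paid_out
            win = max(0.0, min(headroom, bet * random.choice([0, 2, 5, 10])))
        else:
            win = 0.0                # losing streak until the ratio drops again
        self.paid_out += win
        return win

tracker = PayoutTracker()
for _ in range(10000):
    tracker.spin(1.0)
print(f"payout ratio after 10,000 spins: {tracker.ratio():.3f}")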
I worked many years ago for an internet casino in Australia, this one being the only one in the world that was regulated completely by a government body. The algorithms you speak of that produce "structured randomness" are obviously extremely complex especially when you are talking multiple lines in all directions, double up, pick the suit, multiple progressive jackpots and the like.
Our state's poker machine laws demand a payout of 97% of what goes in. For the regulator to be satisfied that our machine did this, they made us run 10 million mock turns of the machine and then wanted to see that our game paid out at what the law states, within the tiniest range of error (we had many, many machines running a script to auto-play, simulating the clicks, for about a week before we hit the 10 million).
Anyhow, the algorithms you speak of are EXPENSIVE! They range from maybe $500k to several million per machine, so as you can understand, no one is going to hand them over for free, that's for sure. If you wanted a single-line machine it would be easy enough to do. Just work out your symbols/cards and what pay structure you want for each. Then you could just distribute those payouts amongst the non-payouts until you got your respective figure. Obviously, the more options there are, the longer it will take to pay out at that respective rate; it may even pay out more early in the piece. Hit frequency and prize size are also factors you may want to consider.
A simple way to do it, if you assume that people win a roughly constant number of times per time period (sketched in code after the steps below):
Create a collection of all possible tumbler combinations with how much each one pays out.
The first time someone plays, in that time period, you can offer all combinations at equal probability.
If they win, take that amount off the total left for the time period, and remove from the available options any combination that would payout more than you have left.
Repeat with the reduced combinations until all the money is gone for that time period.
Reset and start again for the next time period.
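A minimal sketch of that diminishing-pool idea, with made-up combination payouts and a made-up per-period budget:

import random

combinations = {"AAA": 50, "BBB": 20, "CCC": 10, "no win": 0}   # payout per combination

def play_period(budget, spins):
    results = []
    for _ in range(spins):
        # Only offer combinations we can still afford this period.
        affordable = [c for c, pay in combinations.items() if pay <= budget]
        result = random.choice(affordable)
        budget -= combinations[result]
        results.append((result, combinations[result]))
    return results

for outcome in play_period(budget=100.0, spins=10):
    print(outcome)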

Why does task X appear two times for unit 0 at clock cycles 4 and 5?

In the image below, why does task X appear two times for unit 0 at clock cycles 4 and 5?
I have to make a program for the arrangement of the pipeline, but I need to know why the above happens in order to complete it.
Is it just because the author wants it to repeat?
I'm pretty certain it just means that a task takes two clocks in unit 0 the second time through. The fact that it takes seven clocks in total points to this: 1 in unit0, 1 in unit1, 1 in unit2, 1 in unit3, 2 more in unit0, and finally 1 in unit4.
It may well just be a contrived example so that there was a conflict when shifting by one clock (the author had to do something to ensure that task 2 would catch up to task 1 and that seems the easiest solution) or unit0 may well be a non-linear processor of some sort.
Another example would have been trying to pump in a task at the point where the previous task was re-entering unit0.
What they're trying to show is that, given a maximum duration within a unit of N cycles in a pipeline, you have to limit your injections of work to one every N cycles to be sure of no conflict.
My bet (based on the small number of authors I know) would be on the author doing the minimal amount of work to describe the problem :-)
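To make the spacing argument concrete, here is a small Python check for the two tasks in the figure. The stage sequence is guessed from the seven-clock description above (unit0 once, then units 1-3, then unit0 for two more clocks, then unit4), so treat it as an illustration rather than the author's exact table.

def busy_cycles(stages, start):
    """Return the set of (unit, clock) slots a task occupies, starting at `start`."""
    occupied, t = set(), start
    for unit, duration in stages:
        for _ in range(duration):
            occupied.add((unit, t))
            t += 1
    return occupied

# unit0, unit1, unit2, unit3, unit0 (2 clocks), unit4 -> 7 clocks total
stages = [(0, 1), (1, 1), (2, 1), (3, 1), (0, 2), (4, 1)]

task1 = busy_cycles(stages, start=0)
for gap in (1, 2):
    task2 = busy_cycles(stages, start=gap)
    print(f"inject task 2 after {gap} clock(s): conflict = {bool(task1 & task2)}")
# gap 1 -> True (both want unit0 at clock 5), gap 2 -> False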

How to notice unusual news activity

Suppose you were able to keep track of the news mentions of different entities, like, say, "Steve Jobs" and "Steve Ballmer".
What are some ways you could tell whether the number of mentions of an entity in a given time period was unusual relative to its normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of 50% might be unusual (an increase of 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase of 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see if this change was a substantial change outside of their usual variance.
You could also try some normalization -- one very simple one would be that each category has a total number of mentions (m), a percent change from the last time period (δ), and then some normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name            m     m0    δ     z
Steve Jobs      4950  4500  0.10  495
Steve Ballmer   400   300   0.33  132
Larry Ellison   50    10    4.00  200
Andy Nobody     50    40    0.20  10
Here, a 400% change for unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my spike of 20% is still a low 10. You could tweak this algorithm depending on what you feel are good weights, or use standard deviation or the rolling average to find whether this is far away from their "expected" results.
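That normalization is a one-liner per entity; a quick sketch using the numbers from the table:

counts = {                      # name -> (mentions this period, mentions last period)
    "Steve Jobs":    (4950, 4500),
    "Steve Ballmer": (400, 300),
    "Larry Ellison": (50, 10),
    "Andy Nobody":   (50, 40),
}

for name, (m, m0) in counts.items():
    delta = (m - m0) / m0       # relative change from the previous period
    z = m * delta               # weight the change by current volume
    print(f"{name:14s} m={m:5d} delta={delta:5.2f} z={z:6.1f}")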
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
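A minimal sketch of that test; the history values are made up, and X defaults to 1 as suggested:

from statistics import mean, stdev

def is_unusual(history, current, x=1.0):
    """Flag `current` if it is more than x standard deviations from the historical mean."""
    if len(history) < 2:
        return False                         # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(current - mu) > x * sigma

history = [12, 15, 9, 14, 11, 13, 10]        # made-up daily story counts
print(is_unusual(history, current=30))       # True: a clear spike
print(is_unusual(history, current=12))       # False: within normal variation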
Way oversimplified:
Store people's names and the number of articles created in the past 24 hours that involve their name. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? When searching through articles, how do you grab names? Once you grab a new name, do you search all articles for it? How do you separate Steve Jobs of Apple from Steve Jobs the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you insert yourself. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables in this, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
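A rough sketch of the weighting, sampling, and comparison steps just described, with an exponential decay standing in for the sampling weights; the decay rate and the data are pure assumptions:

import random

def spike_probability(past_counts, today, draws=10000, decay=0.99):
    """past_counts: oldest-to-newest daily mention counts for one person."""
    n = len(past_counts)
    weights = [decay ** (n - 1 - i) for i in range(n)]    # newer days weigh more
    sampled = random.choices(past_counts, weights=weights, k=draws)
    # Fraction of simulated days at or above today's count: how "expected" today is.
    return sum(1 for c in sampled if c >= today) / draws

past = [random.randint(900, 1100) for _ in range(365)]    # fake year of data
print(spike_probability(past, today=1500))                # ~0.0: very unusual
print(spike_probability(past, today=1000))                # ~0.5: business as usual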
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.

How far should you go with estimating your tasks? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
What is a good resolution of time to use for estimating the duration of your tasks?
Is it something like 0.5, 1, 2, 5 days, or should you go down to hours (0.5, 1, 2, 4) and then continue up to days?
Should a change to a label's text be a task at all? (ETA < 1 min)
Suggestions?
Scrum uses the values 0, 0.5, 1, 2, 3, 5, 8, 13, 20, 40 and 100. The values are in hours when you plan in finer detail (breaking feature requests down into technical tickets) and in days when you plan the bigger picture (large features).
In general, if you estimate a ticket at over 20 hours (or 20 days), it should be split up into smaller pieces.
Well, it really depends. Personally, I like tasks to be smaller (estimated in hours, usually using something close to the Fibonacci series: 0.5, 1, 2, 3, 5, 8, ...).
As for small tasks, they should usually be estimated too. Even minor changes require some work, like creating tests, checking that nothing else broke, deploying to the server, etc. You could add a 15-minute value to the series for things like these :)
It's really hard to predict the future.
The units (minutes, hours, days, weeks, fortnights) don't matter.
Pick a unit that makes your manager happy.
Just be clear that an estimate of 30 minutes, .5 hour or .0625 days is only a guess, not a fact.
An estimate of 0.0625 days or 30 minutes looks really precise because it has a lot of decimal places. However, any ambiguity about the requirements, the architecture, the language, the libraries, the unit tests, or anything else will make this number incorrect.
The very best you can hope for is that the average of all your estimates is reasonably close to the actual facts as they unfold. This means that half your estimates will be too low and half will be too high. It also means that some fraction of your estimates will be really, really far from your manager's hoped-for accuracy.
Planning and estimating time required is never the goal of your project, so those units must serve some purpose.
A good rule to use is this: split the task into smaller chunks until you know exactly what you should do next (and that next thing is not more planning). This "knowing exactly what to do" test is a little subjective, but tasks longer than 2 days rarely pass it.
I guess it depends on how accurate you want to be.
I personally use minutes, as "days" or "months" can be misleading time periods. For example, if you say something will take 1 day, does that mean 24 hours of solid work? Or 8 hours? Or the average 3-4 hours of actual productivity within a working day?
All tasks should be listed, but if they are small you can often group them. But remember that even when it's only changing a label's text, there is more time involved than just changing the label: you have to find it, open the file, make the change, commit the changes, update any documentation, test it, etc. So tasks very rarely take less than 1 minute.
Also try to put an upper limit on the task time. If you're adding tasks that take over 3-4 hours, break them down into smaller sub-tasks and list those. This will give you a much more accurate estimate.
I have found myself sorting tasks into these categories:
Half a day
Whole day
Whole week
Several weeks
In my work environment where we do a lot of second-level support, it doesn't make sense to have a more granular system. It is also good to have some slack in the planning, so that you can take an hour to make small improvements to your work environment.
I tend to use units of hours when they are appropriate (1h, 4.5h, 6h, etc.). Once days creep into the equation, I stop using hours and keep days as the only unit (3d, 7d, 10d). Overestimating slightly gives room to accommodate those undesired but expected complications you will hit along the way.
There's a difference between estimated work effort (how long something should take to complete) and estimated duration (when that something is expected to be completed, given the other tasks).
You can never predict the future, and any estimate is just that: an estimate. The likelihood of being able to predict a week out in half-hour increments isn't great; if the first 6 tasks each take 5 minutes longer than planned, you're already half an hour off after just 3 hours of work.
I would suggest creating a work breakdown structure (WBS) to determine effort, and then grouping the tasks into chunks of no less than a day and no more than a week (depending on the overall duration of the project; you never want an increment representing more than 2-3% of the overall work effort). This will allow programmers to switch between tasks (normal working conditions) without the pressure of having to meet a specific half-hour delivery.
