What are some of the applications of Order Statistics Algorithms? - algorithm

I am watching the MIT lectures and Eric Demaine says that they discussed some of the applications of Order Statistics Algorithms. I was wondering if the SO community would help me figure out some of the applications of the selection algorithms.

Finding the median is a common application of such algorithms; e.g. I've used it in image processing for the median filter. Min, max, and k-NN also use order statistic algorithms, so those are other applications.
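For illustration, here is a minimal quickselect sketch (expected O(n)) for the k-th smallest element, which is the core selection step behind something like a median filter. The function name and the sample window values are made up for the example, not taken from any particular library:

    import random

    def quickselect(items, k):
        """Return the k-th smallest element (0-based) of items in expected O(n) time."""
        items = list(items)
        while True:
            pivot = random.choice(items)
            lows = [x for x in items if x < pivot]
            pivots = [x for x in items if x == pivot]
            highs = [x for x in items if x > pivot]
            if k < len(lows):
                items = lows
            elif k < len(lows) + len(pivots):
                return pivot
            else:
                k -= len(lows) + len(pivots)
                items = highs

    # e.g. the median of a 3x3 neighbourhood in a median filter
    window = [12, 200, 13, 11, 14, 250, 10, 13, 12]
    print(quickselect(window, len(window) // 2))  # 13

Unlike sorting the window (O(n log n)), selection only does the work needed to place the k-th element.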

Here are some other applications that I can think of in addition to what Jacob said:
Most services care about the 95th or 99th percentile of latency rather than the mean, because they want to keep most of their users happy.
In machine learning, if you want to convert a continuous-valued feature into Boolean features by bucketing it, one common approach is to partition it by percentile so that the cardinality of each Boolean feature is roughly similar.
There are probably hundreds of applications of order statistics. The algorithms to compute them can change based on what kind of scaling you need and what approximations you can tolerate. If you can give more context on what Eric Demaine says, you can probably get better answers.
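As a rough sketch of the percentile-bucketing idea mentioned above (the feature values here are invented), you can compute percentile cut points and then assign each value to a bucket, so the buckets hold roughly equal numbers of examples:

    import numpy as np

    values = np.array([0.1, 3.2, 0.4, 7.5, 2.2, 9.9, 5.1, 0.8, 4.4, 6.3])

    # Quartile boundaries (25th, 50th, 75th percentiles) as bucket edges.
    edges = np.percentile(values, [25, 50, 75])
    buckets = np.digitize(values, edges)  # bucket index 0..3 for each value

    print(edges)
    print(buckets)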

Related

Best Algorithm for Minimum Cost Maximum Flow?

Can someone tell me which is the best algorithm for minimum cost maximum flow (and easy to implement), and where I can read about it? I searched online, got the names of many algorithms, and am unable to decide which one to study.
From my experience benchmarking MCF in an industry setting, there are three publicly available implementations that are competitive:
Andrew V Goldberg's cost scaling implementation.
Coin-OR's Lemon library cost scaling implementation.
Coin-OR's Network Simplex implementation.
I would try those in that order if you are limited for time. Other honorable mentions are:
Google-OR's cost scaling implementation. I haven't benchmarked this, but I'd expect it to be competitive with those above.
MCFClass has several implementations listed under various restricted licenses for commercial use. RelaxIV is very competitive but restrictive.
In terms of studying the literature and a survey of competitive algorithms, the work of Király and Kovács is an excellent starting point.
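If you just need a working baseline rather than a competitive solver, a minimal sketch using NetworkX is shown below; the small graph is made up, and NetworkX's implementation is a reference one, not tuned for performance like the libraries listed above:

    import networkx as nx

    # Hypothetical network: edge capacities and per-unit costs.
    G = nx.DiGraph()
    G.add_edge("s", "a", capacity=4, weight=2)
    G.add_edge("s", "b", capacity=3, weight=1)
    G.add_edge("a", "b", capacity=2, weight=1)
    G.add_edge("a", "t", capacity=3, weight=3)
    G.add_edge("b", "t", capacity=5, weight=1)

    flow = nx.max_flow_min_cost(G, "s", "t")  # maximum s-t flow of minimum total cost
    print(flow)
    print("total cost:", nx.cost_of_flow(G, flow))

This is handy for validating a faster implementation on small instances before scaling up.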

Algorithm for clustering people with similar interests

I want to cluster people into groups based on their interests. For example, people who like machine learning and graphs may be placed in one group, and people interested in mathematics and economics may be placed in a different group.
The algorithm should decide which people have the most closely matching interests, based on the interests of the people, and create clusters. It should also be able to output the other people in the group in which a particular person is placed.
This does not sound like a particularly difficult clustering problem, and any of the off-the-shelf clustering algorithms will probably work well. If you know how many clusters you want, then try k-means or k-medoid clustering. If you don't know how many clusters, then try agglomerative clustering.
The difficult part of the problem will be the features. You mentioned that 'interests' could be used as the features upon which to cluster, but feature engineering and selection will always involve some trial and error.
Without more context on your problem, I can't really give a definite answer. Most clustering algorithms will work, though; the problem is how "good" your results are. I'm quoting the word "good" because you'll need some sort of metric to measure that (generally inter-cluster and intra-cluster distance).
Here's the advice given to me when I was learning how to decide on an algorithm for data mining: try the simplest algorithms first - quite often these are overlooked but perform quite well (Naive Bayes for supervised learning is a classic example).
To start you off, try something like k-means, which is a simple and popular method; you can find more info here: http://en.wikipedia.org/wiki/K-means_clustering (if you look at the Software section you can also find a list of implementations that you could try).
The second part of the criteria is to be able to output the other people in the group based on a target person. This is doable with any clustering algorithm: since you'll have X subsets of people, you simply need to find the subset the target person is in, then iterate over that subset and print out everyone in it.
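To make that concrete, here is a minimal sketch (the people and interests are invented) that one-hot encodes each person's interest list, runs k-means with scikit-learn, and then prints the other people in the target person's cluster:

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import MultiLabelBinarizer

    people = {
        "alice": ["machine learning", "graphs"],
        "bob":   ["graphs", "machine learning"],
        "carol": ["mathematics", "economics"],
        "dave":  ["economics", "mathematics", "finance"],
    }

    names = list(people)
    X = MultiLabelBinarizer().fit_transform(people.values())  # one-hot interest vectors

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    target = "alice"
    cluster = labels[names.index(target)]
    group = [n for n, l in zip(names, labels) if l == cluster and n != target]
    print(f"people grouped with {target}: {group}")

The feature engineering (how you encode "interests") matters far more than the choice of clustering algorithm here.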
I think the right approach will be k-means clustering. The most important part of your problem is feature selection.
Try with some features that you think are most important and simply apply k-means in a statistical programming language like R; inspect the result and improve it by modifying the features or selecting more appropriate ones.
Trial and error can give you insight if you are not sure about feature selection.
If you can provide some sample data, it will be easier to give specific solutions to your problem.
It's coming a bit late, but there's actually an app in the Windows store that does exactly that: finding profiles with similar characteristics.
It's called k-modo.

Using software metrics for measuring the productivity of pair programming

What software metrics can be used to measure the performance of pair programming?
To be clear:
Are there any metrics that measure pair programming specifically, rather than the individual programmer? What parameters are used for the measurement?
For example, if we want to measure the cost for both individual and pair programming:
let's assume that for individual programming Cost = x, so for the pair it will be Cost = 2x,
right?
And the same for time: for the individual Time = t, while for the pair Time = 2t.
So if I would like to use lines of code to measure the product size, is there any difference between individual and pair programming using this metric?
Any ideas?
Sorry to spoil your party, but lines of code is one of the worst metrics possible, especially if people know their assessment or bonus is in any way tied to the metric. It actively encourages cut-and-paste programming and other atrocities.
It's more effort, but why don't you categorise the workload in terms of expected effort for one person, based on your historical data? Or get some programmers to agree to do a few projects redundantly, rotating between pair programming and individual work, so you can see how the same programmers go at each. As one good programmer can be more productive than two average programmers (I vaguely remember an old IBM study concluding someone in the top percentile was 27x more productive than the median), it's useful to see the same programmers doing it both ways.
If objectively discovering the right process through such an experiment is too costly in terms of lost short-term productivity, then you're better off not bothering with the LOC metrics anyway... good programmers knowing their work arrangements are being based on such metrics will probably be highly unimpressed.
Remember that there are also intangibles involved... pair programming - IMHO - forces people to keep focused, and to make design decisions that are more rounded and professional. Just the social contact can help relieve boredom, though it may stress some people too. My suspicion is that - whether or not it's faster to begin with - it makes for better, more maintainable results. It also ensures skill and knowledge transfer. You should factor in such intangible aspects as best you can - maybe doing interviews or anonymous surveys with the trial participants.
I guess what you're trying to ask is how to measure the efficiency of a team that uses pair programming. If so, then the answer is that measuring efficiency doesn't depend on the method or process the team is using. You should try to evaluate the quality of their product releases, with metrics like the number of issues identified post-release. Probably also the velocity.
And, please, don't use lines of code for efficiency measurement. It doesn't make sense. Lines of code is a measure of product size, not developer efficiency. It's like using height or weight to judge how smart you are. There is no correlation between the amount of code and individual efficiency.
if you are interested in more software metrics, take a look at http://www.sdlcmetrics.org

Initial Genetic Programming Parameters

I did a little GP (note: very little) work in college and have been playing around with it recently. My question is in regards to the initial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work better when you have a noisy fitness function. I think this is because the growth of sub-groups in the population over successive generations acts to give more sampling of the fitness function. I typically use 100 for less noisy/deterministic functions, 1000+ for noisy ones.
For the number of generations it is best to measure improvements in the fitness function and stop when it meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out; if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an implementation without explicit limits but penalise or eliminate programs that run too long (which is probably what you really care about...). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation/recombination operators don't seem to matter all that much. As long as you have a comprehensive set of mutations, any reasonably balanced distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements, so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.
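As a rough starting point reflecting the numbers above (these values are a guess to tune from, not canon), a plain-Python parameter block and generational loop skeleton might look like the following; make_individual, evaluate, select, crossover, and mutate are placeholders for your own problem-specific implementation:

    import random

    # Ballpark run settings; tune per problem/DSL.
    PARAMS = {
        "population_size": 1000,    # ~100 for clean fitness functions, 1000+ for noisy ones
        "max_generations": 300,     # stop earlier once the target fitness is met
        "init_tree_depth": (2, 6),  # min/max depth for the initial random trees
        "max_node_count": 1000,     # hard cap on program size
        "p_crossover": 0.8,
        "p_mutation": 0.15,
        "p_reproduction": 0.05,     # the exact split rarely matters if mutations are comprehensive
    }

    def run_gp(params, make_individual, evaluate, select, crossover, mutate, target_fitness):
        """Generic generational loop; the callables are problem-specific and assumed given."""
        pop = [make_individual(*params["init_tree_depth"])
               for _ in range(params["population_size"])]
        for _ in range(params["max_generations"]):
            scored = [(evaluate(ind), ind) for ind in pop]
            if max(f for f, _ in scored) >= target_fitness:
                break
            nxt = []
            while len(nxt) < len(pop):
                r = random.random()
                if r < params["p_crossover"]:
                    nxt.append(crossover(select(scored), select(scored)))
                elif r < params["p_crossover"] + params["p_mutation"]:
                    nxt.append(mutate(select(scored)))
                else:
                    nxt.append(select(scored))
            pop = nxt
        return max(pop, key=evaluate)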
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
"Any problem in computer science can be solved with another layer of indirection (except for too many layers of indirection)." - David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data by varying parameters on a very simple problem and link given operators and parameter values (such as mutation rates, etc.) to given results as a function of population size, etc.
Once I started getting into GAs a bit more, I realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
Talking from my (limited) experience: if you decide to simplify the problem, use a fixed way to implement crossover and selection, and just play with population size and mutation rate (implemented in a given way), trying to come up with general results, you'll soon realize that too many variables are still in play. At the end of the day, the number of generations after which you will statistically get a decent result (however you want to define decent) still depends primarily on the problem you're solving, and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of the effect of given GA parameters!).
It is certainly possible to draft a set of guidelines, as the (rare but good) literature proves, but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in exactly the same way and the fitness is evaluated in a somewhat equivalent way (which more often than not means you're dealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community -
Some regard populations in the (low) thousands as sufficient, whereas Koza and others often don't deem it worthy to start a GP run with less than a million individuals in the GP population ;-)
As mentioned before it depends on your personal taste and experiences, resources and probably the GP system used!
Cheers,
Jan

Staff Rostering algorithms

We are embarking on some R&D for a staff rostering system, and I know that there are some suggested algorithms such as the memetic algorithm etc., but I cannot find any additional information on the web.
Does anyone know any research journals, or pseudocode out there which better explains these algorithms?
Thanks,
Devan
Here is a useful document:
Memetic Algorithms for Nurse Rostering (pdf)
It contains a little bit of theory and pseudo-code.
The scheduling problem is NP-hard and is usually solved using genetic algorithms (GA).
You can start learning about GAs from the Wikipedia article.
You may also want to look at a technique called "simulated annealing". Like genetic algorithms, this uses an evaluation function to determine the quality of candidate solutions, but generating the candidates tends to be simpler. Each type of algorithm gives better results in certain circumstances; from a brief Google survey it feels like genetic algorithms have the edge, but annealing will be quicker to implement.
Here is a comparison paper (for a different domain, not scheduling):
http://www.ee.utulsa.edu/~tmanikas/Pubs/gasa-TR-96-101.pdf
We have used simulated annealing in a large scheduling application and it did work well.
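A bare-bones simulated annealing loop for a roster might look like the sketch below; the cost and neighbour functions are placeholders you would write for your own constraints (e.g. a neighbour move could swap two people's shifts), and the temperature schedule is just an illustrative default:

    import math
    import random

    def anneal(initial_roster, cost, neighbour,
               t_start=100.0, t_end=0.1, cooling=0.995):
        """Generic simulated annealing: cost() scores a roster (lower is better),
        neighbour() returns a slightly modified copy of a roster."""
        current, current_cost = initial_roster, cost(initial_roster)
        best, best_cost = current, current_cost
        t = t_start
        while t > t_end:
            candidate = neighbour(current)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse rosters with a probability
            # that shrinks as the temperature drops.
            if delta < 0 or random.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
            t *= cooling
        return best, best_cost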
To be honest, if the volume of staff is less than about 40, I would recommend giving a visual representation of the roster and letting the user finalise the schedule. Perhaps you would use an algorithm to produce a candidate schedule to start with, and then let the user play with it. You could still use the evaluation function to check the user's work and give feedback on how good their solution is.
There are many many many issues to consider when setting up a roster schedule, so aku's tip about genetic algorithms is the best one.
You need a good evaluation function to determine the quality of the roster for such an algorithm, and you can, and should, consider things like the following, but not limited to (a rough sketch of such an evaluation function follows this list):
have you solved the workload problem with this roster? (i.e. do you have enough people at work at all times?)
if not, can you live with the consequences? (for hospitals, you might have to postpone lunch 15 minutes one day in order to have enough people available for it, or just stretch it slightly out in time)
is the roster a good one, considering things like shift stability for each person, their days off, and whether or not they get weekends off with some regularity?
is the roster legal, taking into account local regulations that govern things like how much time must pass between one shift and the next (downtime) and how much each person can work inside a given interval (day, week, month)?
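As a rough illustration of how those criteria can be folded into a single evaluation function, here is a sketch in which the roster format, weights, and rules are all invented for the example (real regulations and shift structures vary):

    def roster_cost(roster, required_staff, min_rest_hours=11):
        """roster: {person: [(day, start_hour, end_hour), ...]} -- a made-up format.
        required_staff: {(day, hour): people_needed}. Lower cost is better;
        hard constraints get much larger weights than soft preferences."""
        cost = 0

        # Workload: enough people on duty at all times?
        staffed = {}
        for person, shifts in roster.items():
            for day, start, end in shifts:
                for hour in range(start, end):
                    staffed[(day, hour)] = staffed.get((day, hour), 0) + 1
        for slot, needed in required_staff.items():
            cost += 100 * max(0, needed - staffed.get(slot, 0))  # heavy penalty: understaffing

        # Legality: minimum rest between consecutive shifts.
        for person, shifts in roster.items():
            ordered = sorted(shifts)
            for (d1, _, e1), (d2, s2, _) in zip(ordered, ordered[1:]):
                rest = (d2 - d1) * 24 + s2 - e1
                if rest < min_rest_hours:
                    cost += 100  # heavy penalty: illegal roster

        # Soft preference: weekends off (days 5 and 6 in this toy encoding).
        for person, shifts in roster.items():
            cost += 5 * sum(1 for day, _, _ in shifts if day % 7 in (5, 6))

        return cost

Such a function plugs directly into a genetic algorithm's fitness evaluation or into the simulated annealing loop sketched above.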
I read a rostering algo paper by these guys a while back.
Or by using OR ;)

Resources