I'd like to set up a system that crowdsources the best 10 items from a set that can vary from 20 to 2000 items (the ranking among the top ten is not important). There is an excellent Stack Overflow post on algorithms for doing the actual sort at
How to rank a million images with a crowdsourced sort. I am leaning toward asking users which they like best between two items and then using the TrueSkill algorithm.
My question is: given that I am using something like TrueSkill, what is the best algorithm for deciding which pairs of items to show a user to rate? I will have a limited number of opportunities to ask people which items they like best, so it is important that the pairs presented give the system the most valuable information for identifying the top 10. Again, I am mostly interested in finding the top ten, less so in how the rest of the items rank amongst themselves or even how the top ten rank amongst themselves.
This problem is very similar to organizing a knock-out tournament where the skills of the players are not well known and the number of players is very high (think school-level tennis tournaments). Since a round robin (O(n^2) matches) is very expensive, but a simple knock-out tournament is too simplistic, the usual option is to go with a k-elimination structure. Essentially, every player (in your context, an item) is knocked out of contention after losing k games. Take a look at the double-elimination structure: http://en.wikipedia.org/wiki/Double-elimination_tournament .
Perhaps you can modify it sufficiently to meet your needs.
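To make the k-elimination idea concrete, here is a minimal Python sketch (not TrueSkill itself): items are paired at random each round, pick up a loss whenever the user prefers the other item, and drop out after k losses. The `ask_user` callback is a placeholder for the real crowdsourced comparison, so treat this as a sketch under those assumptions.

```python
import random

def k_elimination_top(items, ask_user, k=2, keep=10):
    """Keep pairing surviving items until only about `keep` remain.

    ask_user(a, b) must return the preferred item; here it is a
    placeholder for the real crowdsourced comparison.
    """
    losses = {item: 0 for item in items}
    alive = list(items)
    while len(alive) > keep:
        random.shuffle(alive)
        # Pair up survivors; an odd item out gets a bye this round.
        for a, b in zip(alive[0::2], alive[1::2]):
            loser = b if ask_user(a, b) == a else a
            losses[loser] += 1
        # Note: several items may reach k losses in the same round,
        # so the final set can be slightly smaller than `keep`.
        alive = [x for x in alive if losses[x] < k]
    return alive
```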
Another well-known algorithm for this was developed to calculate rankings in Go or chess tournaments. You can have a look at the MacMahon algorithms, which calculate such pairings and the ranks at the same time. It should be possible to truncate this algorithm so that it only produces a set of the 10 best items.
You can find more details in Christian Gerlach's thesis, where he describes the actual optimization algorithm (unfortunately the thesis is in German).
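As a very rough, simplified sketch of the MacMahon idea (ignoring the real pairing constraints used in tournaments): every item carries a score, winners gain a point, and each round pairs items with similar scores. `ask_user` is again a placeholder for the crowdsourced comparison.

```python
import random

def mcmahon_round(items, scores, ask_user):
    """Pair items with similar scores and update scores from the results.

    scores maps item -> current MacMahon-style score;
    ask_user(a, b) is a placeholder returning the preferred item.
    """
    # Sort by score (random tie-break) so list neighbours have similar scores.
    ordered = sorted(items, key=lambda x: (scores[x], random.random()), reverse=True)
    for a, b in zip(ordered[0::2], ordered[1::2]):
        winner = ask_user(a, b)
        scores[winner] += 1
    return scores

# After several rounds, the 10 highest-scoring items approximate the top ten:
# top10 = sorted(items, key=scores.get, reverse=True)[:10]
```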
Assume that I have an array of Twitter users and their followers, and I want to identify the 5 users with the largest number of unique followers, such that if I ask them to retweet an advertisement for my product, it would reach the largest number of users.
I do not have formal programming or computer science training. However, I do understand algorithms and basic CS concepts. It would be great if a solution could be provided in a way a layman could follow.
This is the "Maximum coverage problem", which is a class of problems thought to be difficult to solve efficiently (so-called NP-hard problems). You can read about the problem on wikipedia: https://en.wikipedia.org/wiki/Maximum_coverage_problem
A simple algorithm to solve it is to enumerate all subsets of size 5 of your users and measure the size of the union of their followers. It is not an efficient solution, since if you've got n users there are on the order of n^5 subsets of size 5 (assuming n is large).
If you want a solution that's feasible to code and may be reasonably efficient in real-world cases, you could represent the problem as an integer linear program (ILP) and use a solver such as GLPK. The details of how to represent max-coverage as an ILP are given on the Wikipedia page. Getting it working will require some effort, though, and it may still not work well if your problem is large.
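For illustration, here is a minimal sketch of the brute-force enumeration described above, assuming `followers` maps each user to a set of follower ids (an assumption about the input format). As noted, this is exactly the inefficient approach, so it is only practical for small n.

```python
from itertools import combinations

def best_five(followers):
    """Brute force: try every 5-user subset and keep the one whose
    follower sets cover the most unique accounts.

    followers is assumed to map user -> set of follower ids.
    """
    best_subset, best_reach = None, -1
    for subset in combinations(followers, 5):
        reach = len(set().union(*(followers[u] for u in subset)))
        if reach > best_reach:
            best_subset, best_reach = subset, reach
    return best_subset, best_reach
```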
I have a folder full of images. There are too many to just 'rank'. I made a program that shows two at a time and lets the user pick which one of the two is better. At the end I would like all of the photos to be ordered from best to worst.
I am purely trying to optimize for the fewest number of comparisons possible. I don't care if the program runs in n-cubed time. I've read the other questions here on similar topics, but I'm looking for something more advanced.
I'm thinking of some sort of algorithm that, based on the comparisons you've already made, chooses the two images to compare next that will offer the most information. Maybe even an algorithm that makes complex connections to help determine the order and potential orders.
Like I said, I don't care if it is slow; I am purely trying to minimize comparisons.
If a total order exists, you need at least log2(n!) ≈ n*log2(n) comparisons. This can easily be proved mathematically; there is no way around it. So regular O(n log n) sorting algorithms will do the job.
What you are trying to do is called a 'topological sort'. Google it and read about it on Wikipedia. You can achieve partial sorts with fewer comparisons. It's a kind of gradual sort: the more comparisons you make, the better the result will be.
However, what do you do if no total order exists? Humans are not able to generate a total order for subjective tasks.
For example, picture 1 is better than 2, 2 is better than 3, but 3 is better than 1.
In this case no sorting algorithm can produce a permutation that matches all the decisions. During the topological sort, you can detect those inconsistent decisions (cycles) and get rid of them.
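One hedged way to put this into code is to build a "better-than" graph from the recorded decisions and run Kahn's topological sort; anything caught in a cycle (inconsistent decisions) simply fails to resolve and can be reported separately. The representation below is an assumption for illustration, not the asker's program.

```python
from collections import defaultdict, deque

def partial_order(decisions):
    """Topologically sort pictures from (winner, loser) decision pairs.

    Returns (ordered, leftover): `ordered` is a best-to-worst list that
    respects every acyclic decision; `leftover` holds pictures stuck in
    inconsistent (cyclic) decisions.
    """
    better_than = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for winner, loser in decisions:
        nodes.update((winner, loser))
        if loser not in better_than[winner]:
            better_than[winner].add(loser)
            indegree[loser] += 1

    queue = deque(n for n in nodes if indegree[n] == 0)
    ordered = []
    while queue:
        n = queue.popleft()
        ordered.append(n)
        for m in better_than[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)

    leftover = [n for n in nodes if n not in ordered]
    return ordered, leftover
```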
You are looking for a sorting algorithm - pick one. Most algorithms just need a comparison function (a < b?). This is where you show the user two pictures and have them choose the better one.
You might want to read through some of the algorithms and choose the best one for you. E.g. with quicksort, you would pick a random picture and the user would have to compare this picture against all other pictures in the first round, which might be too boring from the end user's perspective.
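As a sketch, the comparison function can simply be a prompt to the user; Python's built-in sort then performs close to the n*log2(n) minimum number of comparisons. The `ask_user` prompt below is only a placeholder for the real UI.

```python
from functools import cmp_to_key

def ask_user(a, b):
    """Placeholder UI: show pictures a and b, return -1 if the user
    prefers a, +1 if the user prefers b."""
    choice = input(f"Which is better, {a} or {b}? ")
    return -1 if choice == a else 1

def rank_pictures(pictures):
    # Best picture first; every comparison is answered by a human.
    return sorted(pictures, key=cmp_to_key(ask_user))
```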
What would be a good heuristic to use to solve the following challenge?
Quality Blimps Inc. is looking to expand their sales to other cities (N), so they hired you as a salesman to fly to other cities to sell blimps. Blimps can be expensive to travel with, so you will need to determine how many blimps to take along with you on each trip and when to return to headquarters to get more. Quality Blimps has an unlimited supply of blimps.

You will be able to sell only one blimp in each city you visit, but you do not need to visit every city, since some have expensive travel costs. Each city has an initial price that blimps sell for, but this goes down by a certain percentage as more blimps are sold (and the novelty wears off). Find a good route that will maximize profits.
https://www.hackerrank.com/codesprint4/challenges/tbsp
This challenge is similar to the standard Travelling Salesman Problem, but with some extra twists: the salesman needs to track both his own travel costs and the blimps', and each city has a different price at which blimps sell, with prices going down over the course of his journey. What would be a fast algorithm (i.e. O(n log n)) to use to maximize profit?
The cost of transporting the blimps in a way makes the TSP simpler. If the salesman is in city A and wants to go to B, he can compare the cost of going directly to B with the cost of going back to headquarters first to pick up more blimps, i.e. is it cheaper to take an extra blimp to B via A, or to go back in between? This check will create a series of looped trips, which the salesman could then work through in order of highest revenue. But what would be a good way to determine these loops in the first place?
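A small sketch of that check, with a hypothetical `travel_cost(src, dst, n_blimps)` function standing in for the real leg cost, and the restock simplified to picking up a single extra blimp:

```python
def detour_via_hq(hq, a, b, blimps_left, travel_cost):
    """Decide whether to go A -> B directly or A -> HQ -> B to restock.

    travel_cost(src, dst, n_blimps) is a hypothetical function giving the
    cost of one leg while carrying n_blimps blimps.
    """
    direct = travel_cost(a, b, blimps_left)
    restock = (travel_cost(a, hq, blimps_left)
               + travel_cost(hq, b, blimps_left + 1))
    return restock < direct  # True: cheaper to swing by headquarters first
```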
This is a search problem. Assuming the network is larger than can be solved by brute force, the best algorithm would be Monte Carlo Tree Search.
Algorithms for such problems are usually of the "run the solution lots of times, choose the best one" kind. Also, to choose what solution to try next, use results of previous iterations.
As for a specific algorithm, try backtracking with pruning, simulated annealing, tabu search, genetic algorithms, or neural networks (in order of what I find relevant). Also, the Monte Carlo tree search idea proposed by Tyler Durden looks pretty cool.
I have read the original problem. Since the number of cities is large, it is impossible to get the exact answer; an approximation algorithm (AA) is the only choice. As @maniek mentioned, there are many approximation algorithms to choose from. If you have experience with them, that would be best: you can pick one you are familiar with and get an approximate answer. However, if you haven't used approximation algorithms before, maybe you can start with backtracking with pruning.
As for this problem, you can easily get these pruning rules:
Bring as few blimps as possible. That means when you start from HQ, visit cities A, B, C, ... and then return to HQ (you can regard this as a single round), the number of blimps you bring is the same as the number of cities you will visit.
As the sale price gets lower, once it drops below the travel expense the city should never be visited. That gives a termination condition for the backtracking.
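A hedged sketch of how the second rule could be checked in code, assuming each city has a base price and a fixed percentage decay per blimp already sold (both hypothetical parameters, since the real decay rule comes from the challenge input):

```python
def worth_visiting(blimps_sold_so_far, travel_expense, base_price, decay_percent):
    """Pruning rule: skip a city once its (decayed) sale price can no
    longer cover the cost of travelling there.

    The price is assumed to drop by decay_percent for every blimp
    already sold on the trip.
    """
    price_now = base_price * (1 - decay_percent / 100.0) ** blimps_sold_so_far
    return price_now > travel_expense
```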
You can even apply a KNN-style clustering step first to group cities that are located near each other, then start from HQ and visit each group.
To sum up, this is indeed an open problem. There is no single best answer. Maybe in one case backtracking gives the best answer, while in another simulated annealing is best. Calculating an approximate answer is enough, and there are many ways to achieve this goal. All in all, the most practical approach is to write as many approximation algorithms as you can, compare their results, and output the best one.
This looks like a classic optimization problem, which I know can be handled with the simulated annealing algorithm (having worked on what I think was the first commercial use of simulated annealing, the Wintek electronic CAD autoplacement program, back in the 1980s). Most optimization algorithms can handle problems with many variables like yours; the question is setting up the fitness function correctly so you get the optimum solution (in your case, lowest cost).
Other optimization algorithms may be worth a look -- I just have experience in implementing the simulated annealing algorithm.
(If you go the simulated annealing route, you might want to get "Simulated Annealing: Theory and Applications (Mathematics and Its Applications)" by P.J. van Laarhoven and E.H. Aarts, but you will have to hunt it down since it is out of print. It might even be the book I used back in the 1980s.)
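For readers who haven't used it before, here is a generic simulated annealing skeleton. It is not tied to the blimp problem's exact cost model; `cost` and `neighbour` are placeholders you would define for your own route representation (e.g. cost = negative profit, neighbour = swap two cities or move an HQ restocking stop).

```python
import math
import random

def simulated_annealing(initial, cost, neighbour,
                        t_start=1.0, t_end=1e-3, alpha=0.995):
    """Generic simulated annealing loop.

    cost(state)      -> value to minimise (e.g. negative total profit)
    neighbour(state) -> a slightly perturbed copy of the state
    """
    state, best = initial, initial
    t = t_start
    while t > t_end:
        candidate = neighbour(state)
        delta = cost(candidate) - cost(state)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = candidate
            if cost(state) < cost(best):
                best = state
        t *= alpha
    return best
```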
I'm working on a university scheduling problem and am using a simple genetic algorithm for this. Actually, it works great and improves the objective function value from roughly 0% to 90% within an hour. But then the process slows down dramatically and it takes days to reach the best solution. I have seen a lot of papers saying that it is reasonable to mix other algorithms with the genetic one. Could you please give me some advice on which algorithm can be combined with a genetic one, and how it can be applied to speed up the solving process? The main question is: how can any heuristic be applied to such a complex-structured problem? I have no idea how, for instance, greedy heuristics could be applied here.
Thanks to everyone in advance! Really appreciate your help!
Problem description:
I have:
an array filled with ScheduleSlot objects
an array filled with Lesson objects
I do:
Standard two-point crossover
Mutation (move a random lesson to a random position)
Rough selection (select only the n best individuals for the next population)
Additional information for @Dougal and @izomorphius:
I'm trying to construct a university schedule which has no breaks between lessons, no overlaps, and no geographically distributed lessons for groups and professors.
The fitness function is really simple: fitness = -1000*numberOfOverlaps - 1000*numberOfDistrebutedLessons - 20*numberOfBreaks (or something like that; we can simply change the coefficients in front of the variables).
At the very beginning I generate my individuals by just placing lessons in random rooms, times and days.
Mutation and crossover, as described above, are really trivial:
Crossover - take two parent schedules, randomly choose the point and the range of the crossover, and just exchange those parts of the parent schedules, generating two child schedules.
Mutation - take a child schedule and move n random lessons to random positions.
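Since the operators above are simple, here is a rough sketch of what they might look like in code, assuming a schedule is represented as a list of slot indices (one per lesson). That representation and the three counting functions are assumptions for illustration, not the poster's actual code.

```python
import random

def fitness(schedule, count_overlaps, count_distributed, count_breaks):
    # Coefficients copied from the question; the counting functions are
    # placeholders for the real constraint checks.
    return (-1000 * count_overlaps(schedule)
            - 1000 * count_distributed(schedule)
            - 20 * count_breaks(schedule))

def two_point_crossover(parent_a, parent_b):
    # Pick two cut points and swap the middle segment between parents.
    i, j = sorted(random.sample(range(len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def mutate(schedule, n_slots, n_moves=1):
    # Move n randomly chosen lessons to random slots.
    schedule = list(schedule)
    for _ in range(n_moves):
        lesson = random.randrange(len(schedule))
        schedule[lesson] = random.randrange(n_slots)
    return schedule
```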
My initial observation: you have chosen the coefficients in front of numberOfOverlaps, numberOfDistrebutedLessons and numberOfBreaks somewhat arbitrarily. My experience shows that such choices are usually not the best ones, and it is better to let the computer choose them. I propose writing a second algorithm to choose them - it could be a neural network, a second genetic algorithm or hill climbing. The idea is to compute how good a result you get after a certain amount of time and try to optimize the choice of these three values.
Another idea: after getting a result you may try to brute-force optimize it. What I mean is the following: for the initial problem, the "silly" solution would be a backtracking search that checks all the possibilities, usually implemented as a DFS. On its own that would be very slow, but you may try depth-first search with iterative deepening, or simply a depth-restricted DFS, to polish the solution the GA found.
For many problems, I find that a Lamarckian-style GA works well, combining a local search into the GA algorithm.
For your case, I would try to introduce a partial systematic search as the local search. There are two obvious ways to do this, and you should probably try both.
1. Alternate GA iterations with local-search iterations. For your local search you could, for example, brute-force all the lessons assigned in a single day while leaving everything else unchanged. Another possibility is to move a randomly selected lesson to all free slots to find the best choice for it (see the sketch after this answer). The key is to minimise the cost of the brute search while still having the chance to find local improvements.
2. Add a new operator alongside mutation and crossover that performs your local search. (You might find that the mutation operator is less useful in the hybrid scheme, so just replacing it could be viable.)
In essence, you will be combining the global exploration of the GA with an efficient local search. Several GA frameworks include features to assist in this combination. For example, GAUL implements the alternate scheme 1 above, with either the full population or just the new offspring at each iteration.
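As a concrete illustration of the second suggestion in scheme 1 (move one randomly chosen lesson to every available slot and keep the best placement), here is a small sketch. The list-of-slot-indices representation and the `fitness` callable are the same assumptions as before, not a description of any particular framework's API.

```python
import random

def local_search_move(schedule, n_slots, fitness):
    """Move one randomly chosen lesson to whichever slot scores best.

    schedule is a list of slot indices (one per lesson) and fitness is
    the evaluation function; both are stand-ins for the real model.
    """
    lesson = random.randrange(len(schedule))
    best = list(schedule)
    for slot in range(n_slots):
        candidate = list(schedule)
        candidate[lesson] = slot
        if fitness(candidate) > fitness(best):
            best = candidate
    return best
```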
There are simple online games of 20 Questions powered by an eerily accurate AI.
How do they guess so well?
You can think of it as the Binary Search Algorithm.
In each iteration, we ask a question which should eliminate roughly half of the possible word choices. If there are a total of N words, then we can expect to get an answer after about log2(N) questions.
With 20 questions, we should optimally be able to find a word among 2^20 ≈ 1 million words.
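A toy sketch of that halving strategy: each yes/no answer filters the candidate set, so roughly log2(N) informative questions suffice. The `questions` mapping (question text to the set of words for which the answer is "yes") and the `ask` callback are assumptions made purely for illustration.

```python
def play(candidates, questions, ask):
    """Repeatedly ask the question that splits the remaining candidates
    most evenly, then keep only the candidates consistent with the answer.

    questions: {question_text: set of candidates whose answer is "yes"}
    ask(question_text): returns True/False from the human player.
    """
    remaining = set(candidates)
    while len(remaining) > 1:
        # Best question = the one whose yes-set is closest to half of `remaining`.
        q = min(questions,
                key=lambda q: abs(len(questions[q] & remaining) - len(remaining) / 2))
        yes_set = questions[q] & remaining
        if not yes_set or yes_set == remaining:
            break  # no remaining question distinguishes these candidates
        remaining = yes_set if ask(q) else remaining - yes_set
    return next(iter(remaining), None)

# With perfectly even splits, 20 questions distinguish 2**20 = 1,048,576 words.
```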
One easy way to eliminate outliers (wrong answers) would probably be to use something like RANSAC. This means that, instead of taking into account all the questions that have been answered, you randomly pick a smaller subset that is enough to give you a single answer. You then repeat this a few times with different random subsets of questions, until you see that most of the time you are getting the same result; then you know you have the right answer.
Of course this is just one way of many ways of solving this problem.
I recommend reading about the game here: http://en.wikipedia.org/wiki/Twenty_Questions
In particular the Computers section:
The game suggests that the information (as measured by Shannon's entropy statistic) required to identify an arbitrary object is about 20 bits. The game is often used as an example when teaching people about information theory. Mathematically, if each question is structured to eliminate half the objects, 20 questions will allow the questioner to distinguish between 2^20 = 1,048,576 subjects. Accordingly, the most effective strategy for Twenty Questions is to ask questions that will split the field of remaining possibilities roughly in half each time. The process is analogous to a binary search algorithm in computer science.
A decision tree supports this kind of application directly. Decision trees are commonly used in artificial intelligence.
A decision tree is a binary tree that asks "the best" question at each branch to distinguish between the collections represented by its left and right children. The best question is determined by some learning algorithm that the creators of the 20 questions application use to build the tree. Then, as other posters point out, a tree 20 levels deep gives you a million things.
A simple way to define "the best" question at each point is to look for a property that most evenly divides the collection in half. That way, when you get a yes/no answer to that question, you get rid of about half of the collection at each step. This way you can approximate binary search.
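A hedged sketch of building such a tree offline, assuming a hypothetical mapping from each property to the set of objects that have it (the real 20 Questions systems learn this data rather than having it handed to them):

```python
def build_tree(objects, properties):
    """Build a yes/no decision tree.  Each internal node asks about the
    property that most evenly splits the objects that reach it.

    properties: {property_name: set of objects that have that property}
    Returns either a single object (leaf) or a
    (property_name, yes_subtree, no_subtree) tuple.
    """
    objects = set(objects)
    if len(objects) <= 1:
        return next(iter(objects), None)
    # Pick the property whose yes-set is closest to half of the objects.
    name = min(properties,
               key=lambda p: abs(len(properties[p] & objects) - len(objects) / 2))
    yes = properties[name] & objects
    no = objects - yes
    if not yes or not no:
        return next(iter(objects), None)  # nothing separates these objects
    return (name, build_tree(yes, properties), build_tree(no, properties))
```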
Wikipedia gives a more complete example:
http://en.wikipedia.org/wiki/Decision_tree_learning
And some general background:
http://en.wikipedia.org/wiki/Decision_tree
It bills itself as "the neural net on the internet", and therein lies the key. It likely stores the question/answer probabilities in a sparse matrix. Using those probabilities, it's able to use a decision-tree algorithm to deduce which question to ask next to best narrow down the possibilities. Once it narrows the number of possible answers to a few dozen, or once it has reached 20 questions, it starts reading off the most likely answers.
The really intriguing aspect of 20q.net is that unlike most decision tree and neural network algorithms I'm aware of, 20q supports a sparse matrix and incremental updates.
Edit: Turns out the answer's been on the net this whole time. Robin Burgener, the inventor, described his algorithm in detail in his 2005 patent filing.
It is using a learning algorithm.
k-NN is a good example of one of these.
Wikipedia: k-Nearest Neighbor Algorithm
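Purely as an illustration of how k-NN might be applied here (this is an assumption, not a description of how the real game works): store each known object as a vector of yes/no answers to the same questions, and match the player's answers to the k closest stored examples.

```python
from collections import Counter

def knn_guess(answer_vector, examples, k=3):
    """examples: list of (name, tuple_of_0_1_answers) pairs; several
    examples may share a name.  Returns the majority name among the k
    examples closest to the player's answers (Hamming distance)."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    nearest = sorted(examples, key=lambda ex: hamming(answer_vector, ex[1]))[:k]
    return Counter(name for name, _ in nearest).most_common(1)[0][0]
```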