Algorithm used in the game Ruzzle

I have been thinking about the algorithm used in the game Ruzzle for a while. The aim of the game is to find any valid word in a given grid; a word can be matched in any direction: up-down, down-up, left-right, right-left, and along both diagonals.
Let's make it simple: find a given word in a 2-D grid with the same matching constraints (directions). What is the algorithm with the best possible time complexity?
For example, FOREVER can be found in this grid.
H O F E R
L R E T O
S N V O R
P Q T E N

You can further optimise it by using a Trie data structure. Once you fill in the structure with all the English words (or those of another language), you can then check in O(1) whether a particular neighbour character needs to be explored or not.
Please notice that at this point you are trading storage for time: you will presumably need more RAM to store the entire Trie, but you will query it faster than checking an ordered list of words.
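To make the idea concrete, here is a minimal Python sketch (my own, not the poster's code) of a dictionary trie driving a depth-first search over the grid, using Ruzzle-style adjacency in which a path may step to any of the eight neighbouring cells but may not revisit a cell:

def build_trie(words):
    """Plain dict-of-dicts trie; the key '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def find_words(grid, trie):
    """DFS from every cell, abandoning a path as soon as the trie has no
    branch for the next character (the O(1) check mentioned above)."""
    rows, cols, found = len(grid), len(grid[0]), set()

    def dfs(r, c, node, path, used):
        ch = grid[r][c]
        if ch not in node:
            return                                   # prune this direction
        node, path = node[ch], path + ch
        if "$" in node:
            found.add(path)
        used.add((r, c))
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in used:
                    dfs(nr, nc, node, path, used)
        used.discard((r, c))

    for r in range(rows):
        for c in range(cols):
            dfs(r, c, trie, "", set())
    return found

grid = ["HOFER", "LRETO", "SNVOR", "PQTEN"]
print(find_words(grid, build_trie(["FOREVER", "NET", "TEN"])))   # all three are found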
In terms of the architecture behind the game, I think they use a dedicated server which works full time storing new games (matrices) and their lists of admissible words in a DB. During a game, your device receives an ID, downloads the matrix and the list of admissible words, and that is enough to let you play. At the end of each game everything is deleted and the final score (just an integer) is submitted to the server, which updates your profile. In the actual game there's a bit more, because they also have badges and statistics (but that's trivial stuff to collect). Bear in mind that this is just the way I would design and develop it.
What do you think? Can we do even better?

I have created a course about making Ruzzle game mechanics in Unity. Please check it out and you will find exact answers:
https://www.udemy.com/word-game-unity/?couponCode=wordgamecourse

Related

Quick relative ranking algorithm

Let's say I own 100 video games, and I want to order them from most liked to least liked. It's very hard to give each video game a numeric value that represents how much I like it, so I thought of comparing them to each other.
One solution I came up with is picking 2 random video games, selecting which one I like more, and discarding the other. Unfortunately this solution only lets me know the #1 video game, since that would be the last one remaining, and it provides little information about the others. I could then repeat the process for the remaining 99 video games, and so on, but that is very impractical: O(n^2).
Are there any O(n) (or just reasonable) algorithms that can be used to sort data based on relative criteria?
If you want to present the games in a sequential order, you need to decide upon it.
It is possible to derive a sequential order from a set of pairwise comparisons.
Here is an example. You have 100 video games. We assume that every video game is associated with a parameter a_i (where i ranges from 1 to 100). It is a real number that describes how "much" you like the game. We don't know the values of those parameters yet. We then choose a function that describes how likely it is that you prefer video game i over video game j in terms of the parameters. We choose the logistic curve and define
P[i preferred over j] = 1 / (1 + e^(a_j - a_i))
Now when a_i = a_j you have P = 0.5, and when, say, a_i = 1 and a_j = 0 you have P = 1/(1 + e^(-1)) ≈ 0.73, showing that a relatively higher parameter value increases the probability that the corresponding video game is preferred.
Now then, when you have your actual comparison results in a table, you use the method of maximum likelihood to calculate the actual values of the parameters a_i. Then you sort your video games in descending order of the calculated parameters.
What happens is that the maximum likelihood method calculates the values of the parameters a_i that make the actual observed preferences as likely as possible, so the calculated parameters represent the best guess about a total ordering between the video games. Note that for this to work, you need to compare video games to other video games enough times: every game needs at least one comparison, and the comparisons cannot form disjoint subsets (e.g. you compare A to B to C to A, and D to E to F to D, but there is no comparison between a game from {A,B,C} and a game from {D,E,F}).
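As a concrete illustration (my own sketch, not part of the original answer), here is a tiny pure-Python gradient-ascent fit of those parameters under the logistic model above; comparisons is assumed to be a list of (winner, loser) index pairs:

import math

def fit_strengths(n_games, comparisons, iters=2000, lr=0.05):
    """Plain gradient ascent on the Bradley-Terry log-likelihood.
    comparisons: list of (winner, loser) index pairs, 0-based."""
    a = [0.0] * n_games
    for _ in range(iters):
        grad = [0.0] * n_games
        for w, l in comparisons:
            # P(w preferred over l) = 1 / (1 + e^(a_l - a_w))
            p = 1.0 / (1.0 + math.exp(a[l] - a[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        a = [ai + lr * g for ai, g in zip(a, grad)]
    return a

# Example: strengths = fit_strengths(100, comparisons)
#          ranking = sorted(range(100), key=lambda i: strengths[i], reverse=True)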
You could use quicksort, a.k.a. pivot sort. Pick a game and compare every other game to it, so you have a group of worse games and a group of better games. Repeat for each half recursively. Average-case performance is n log n.
http://en.wikipedia.org/wiki/Quicksort
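A minimal sketch of that idea in Python, assuming a prefer(a, b) callback that returns True when you like a more than b (i.e. the question you answer for each pair):

def rank(games, prefer):
    """Quicksort by pairwise preference: expected O(n log n) questions."""
    if len(games) <= 1:
        return games
    pivot, rest = games[0], games[1:]
    better = [g for g in rest if prefer(g, pivot)]
    worse = [g for g in rest if not prefer(g, pivot)]
    return rank(better, prefer) + [pivot] + rank(worse, prefer)   # best first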
Another way would be to extend your idea: display more than 2 games and sort them by your rating. The idea is similar to using a merge sort to rate your games. If you pick the games for each rating round well, you won't need a lot of iterations, just a few. IMO O(n) will be quite hard, because your observation (as a human) is limited.
As a start, you could keep a sorted list and insert each element one by one using binary search, giving you an approach with O(n log n) comparisons.
I'm also certain that you can't beat O(n log n), unless I misunderstood what you want. Basically, what you're telling me is that you want to be able to sort some elements (in your example, video games) using only comparisons.
Think of your algorithm as this: you start with the n! possible ways to arrange your games, and every time you make a comparison, you split the arrangements into POSSIBLE and IMPOSSIBLE, discarding the latter group. (POSSIBLE here meaning that the arrangement is consistent with the comparisons you have made)
In the worst-case scenario, the POSSIBLE group is always at least as big as the IMPOSSIBLE group. In that case, no comparison reduces the search space by more than a factor of 2, meaning you need at least log_2(n!) = O(n log n) comparisons to narrow the space down to a single arrangement, giving you your ordering of games.
As to whether there is an O(n) way to sort n objects: there isn't. The lower bound on such a comparison-based sort is O(n log n).
There is a special case, however. If you have a unique and bounded preference, then you can do what's called a bucket sort.
A preference is unique if no two games tie.
A preference is bounded if there is a minimum and maximum value for your preference.
Let 1 .. m be the bound for your set of games.
Just create an array with m elements, and place each game in the index according to your preference.
Now you can just do a linear scan over the array for your sorted order.
But of course, that's not comparison-based.
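A minimal sketch of that special case in Python, assuming preference(g) returns a unique integer between 1 and m:

def bucket_rank(games, preference, m):
    """One pass to fill the buckets, one linear scan to read them back."""
    buckets = [None] * (m + 1)
    for g in games:
        buckets[preference(g)] = g                    # unique => no collisions
    return [g for g in buckets if g is not None]      # least to most liked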
One possibility would be to create several criteria C1, C2, ..., Cn like:
video quality
difficulty
interest of scenario
...
You pass each game through this sieve.
Then you compare a subset of game pairs (2-rank choice) and tell which one you prefer. There exist Multi-Criteria Decision-Making/Analysis (MCDM or MCDA) algorithms that will transform your 2-rank choices into a multi-criteria ranking function; for example, one could calculate coefficients a1, ..., an to build a linear ranking function a1*C1 + a2*C2 + ... + an*Cn.
Good algorithms won't let you choose pairs at random, but will propose the pairs to compare based on a non-dominated subset.
See Wikipedia http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis which gives some useful links, and be prepared to do/read some math.
Or buy software like ModeFrontier, which has some of these algorithms embedded (a bit expensive if it's just for ranking a game library).
I don't think it can be done in O(n) time. The best we can get is O(n log n), using merge sort or quicksort.
The way I would approach this is to have an array with the game title and a counting slot.
Object[][] Games = new Object[100][2];   // Games[i][0] = title, Games[i][1] = vote count
Games[0][0] = "Game Title1";
Games[0][1] = 2;
Games[1][0] = "Game Title2";
Games[1][1] = 1;
Every time you vote for a game, add one to its Games[*][1] slot; from there you can sort based on that count.
While not O(n), a pairwise comparison is one way to rank the elements of a set relative to one another.
To implement the algorithm:
Create a 100x100 matrix.
Each row represents a game and each column represents a game; the game at r1 is the same as the game at c1, r2 = c2, ..., r100 = c100.
Here is some quick pseudo-code to describe the algorithm:
for each row r
    for each column c greater than r        // visit each unordered pair once
        if game r is better than game c
            score[r]++
        else
            score[c]++
        end
    end
end
sort_by_score()                             // highest score = most preferred
I understand it's hard to quantify how much you like something, but what if you created several "fields" on which you would judge each game:
graphics
story
multiplayer
etc...
Score each game 1-5 out of 5 in each category (with higher weights for the categories you deem more important). Try to create an objective scale for judging (possibly using external sources, e.g. Metacritic).
Then you add them all up, which gives an overall rating of how much you like each game, and use a sorting algorithm (merge sort? insertion sort?) to place them in order. That would be O(n*m + n log n) [n = games, m = categories], which is pretty good considering m is likely to be very small.
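For example, a quick sketch of that weighted-sum scoring in Python; the category names, weights and marks below are made up purely for illustration:

weights = {"graphics": 2, "story": 3, "multiplayer": 1}          # your importance weights
scores = {
    "Game A": {"graphics": 4, "story": 5, "multiplayer": 2},
    "Game B": {"graphics": 3, "story": 2, "multiplayer": 5},
}
totals = {g: sum(weights[c] * v for c, v in cat.items()) for g, cat in scores.items()}
ranking = sorted(totals, key=totals.get, reverse=True)            # O(n*m) then O(n log n)
print(ranking, totals)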
If you were really determined, you could use machine learning to approximate future games based on your past selection.

Optimal placement of objects wrt pairwise similarity weights

OK, this is an abstract algorithmic challenge, and it will remain abstract, since where I am going to use it is top secret.
Suppose we have a set of objects O = {o_1, ..., o_N} and a symmetric similarity matrix S where s_ij is the pairwise correlation of objects o_i and o_j.
Assume also that we have a one-dimensional space with discrete positions where objects may be placed (like having N boxes in a row, or chairs for people).
Given a certain placement, we measure the cost of moving from the position of one object to that of another object as the number of boxes we need to pass to reach the target, multiplied by their pairwise object similarity. Moving from a position to the box immediately before or after it has zero cost.
Imagine an example where for three objects we have the following similarity matrix:
    1.0 0.5 0.8
S = 0.5 1.0 0.1
    0.8 0.1 1.0
Then the best ordering of the objects in the three boxes is obviously:
[o_3] [o_1] [o_2]
The cost of this ordering is the sum of costs (counting boxes) for moving between every pair of objects. Here the only non-zero cost is for the distance between o_2 and o_3, equal to 1 box * 0.1 sim = 0.1; the mirrored placement has exactly the same cost:
[o_2] [o_1] [o_3]
On the other hand:
[o_1] [o_2] [o_3]
would have cost = cost(o_1 --> o_3) = 1 box * 0.8 sim = 0.8.
The target is to determine a placement of the N objects in the available positions that minimizes the above-mentioned overall cost over all possible pairs of objects!
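To pin the objective down, here is a small Python sketch (my own code, using 0-based object indices) of the cost function, together with an exact brute force that is only feasible for tiny N:

from itertools import permutations

def placement_cost(order, sim):
    """Total cost of one placement: for every pair of objects, the number of
    boxes strictly between them times their pairwise similarity (adjacent
    objects cost nothing)."""
    pos = {obj: p for p, obj in enumerate(order)}
    n = len(order)
    return sum(sim[i][j] * (abs(pos[i] - pos[j]) - 1)
               for i in range(n) for j in range(i + 1, n))

def brute_force(sim):
    """Exact minimum by trying all N! placements -- only for very small N."""
    n = len(sim)
    return min(permutations(range(n)), key=lambda order: placement_cost(order, sim))

S = [[1.0, 0.5, 0.8],
     [0.5, 1.0, 0.1],
     [0.8, 0.1, 1.0]]
print(brute_force(S))   # (1, 0, 2), i.e. [o_2] [o_1] [o_3] -- the mirror image of
                        # the ordering above, with the same minimal cost of 0.1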
An analogy: imagine a table with chairs side by side in a single row (like the boxes), and you need to seat N people. These people have certain relations, say how likely one of them is to want to speak to another; speaking means standing up, passing by a number of chairs, and talking to the person sitting there. When two people sit on adjacent chairs they don't need to move in order to talk to each other.
So how can we seat these people so that the distance-cost between every two people is minimized? In other words, over the course of the evening the total distance walked by the guests should be close to the minimum.
Greedy search is... ok forget it!
I am interested in hearing whether there is a standard formulation of such a problem for which I could find some literature, and also in different search approaches (e.g. dynamic programming, tabu search, simulated annealing, and other techniques from the combinatorial optimization field).
Looking forward to hearing your ideas.
PS. My question has something in common with this thread: Algorithm for ordering a list of Objects, but I think here it is better posed as a problem and probably slightly different.
That sounds like an instance of the Quadratic Assignment Problem. The speciality is due to the fact that the locations are placed on one line only, but I don't think this will make it easier to solve. The QAP in general is NP hard. Unless I misinterpreted your problem you can't find an optimal algorithm that solves the problem in polynomial time without proving P=NP at the same time.
If the instances are small you can use exact methods such as branch and bound. You can also use tabu search or other metaheuristics if the problem is more difficult. We have an implementation of the QAP and some metaheuristics in HeuristicLab. You can configure the problem in the GUI, just paste the similarity and the distance matrix into the appropriate parameters. Try starting with the robust Taboo Search. It's an older, but still quite well working algorithm. Taillard also has the C code for it on his website if you want to implement it for yourself. Our implementation is based on that code.
There has been a lot of publications done on the QAP. More modern algorithms combine genetic search abilities with local search heuristics (e. g. Genetic Local Search from Stützle IIRC).
Here's a variation of the already posted method. I don't think this one is optimal, but it may be a start.
Create a list of all the pairs in descending cost order.
While the list is not empty:
    Pop the head item from the list.
    If neither element is in an existing group, create a new group containing the pair.
    If one element is in an existing group, add the other element to whichever end puts it closer to the group member.
    If both elements are in existing groups, combine the groups so as to minimize the distance between the pair.
Group combining may require reversing the order of a group, and the data structure should be designed to support that.
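A rough Python sketch of this heuristic (my own reading of the steps; plain lists stand in for the reversible group structure, and sim[i][j] is the similarity of 0-based objects i and j):

def greedy_grouping(n, sim):
    """Greedy pair-grouping: process pairs from most to least similar and
    keep the members of costly pairs close together."""
    pairs = sorted(((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    group_of = {}      # object -> group id
    groups = {}        # group id -> ordered list of objects
    next_id = 0
    for _, i, j in pairs:
        gi, gj = group_of.get(i), group_of.get(j)
        if gi is None and gj is None:
            groups[next_id] = [i, j]                 # start a new group
            group_of[i] = group_of[j] = next_id
            next_id += 1
        elif gi is None or gj is None:
            # Exactly one endpoint is already placed: append the free one to
            # whichever end of that group is closer to its partner.
            placed, free = (j, i) if gi is None else (i, j)
            g = groups[group_of[placed]]
            if g.index(placed) < len(g) - 1 - g.index(placed):
                g.insert(0, free)
            else:
                g.append(free)
            group_of[free] = group_of[placed]
        elif gi != gj:
            # Merge two groups, reversing them if needed so i and j end up close.
            a, b = groups[gi], groups[gj]
            if a.index(i) < len(a) / 2:
                a.reverse()                          # move i towards the end of a
            if b.index(j) > len(b) / 2:
                b.reverse()                          # move j towards the start of b
            a.extend(b)
            for v in b:
                group_of[v] = gi
            del groups[gj]
    # Any groups still separate at the end are simply concatenated.
    return [v for g in groups.values() for v in g]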
Let me help my own thread with a simplistic ordering approach.
1. Order the pairs in the upper triangle of the similarity matrix.
2. Start with the pair of objects having the highest similarity weight and place them in the two centre positions.
3. The next object may be put on the left or the right side of the placed objects. Each time, select the object that, when put to the left or to the right, has the highest cost with respect to the pre-placed objects. Repeat Step 3 until all objects are placed.
The selection in Step 3 is made because, if you left this object aside and placed it later, its cost would again be the greatest among the remaining objects, and even larger (it would be farther from the pre-placed objects). So the costly placements should be made as early as possible.
This is too simple and of course does not discover a good solution.
Another approach is to
1. start with a complete ordering generated somehow (random or from another algorithm)
2. try to improve it using "swaps" of object pairs.
I believe local minima would be a huge deterrent.
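A minimal sketch of step 2 in Python; cost is assumed to be any function that returns the total placement cost (for example the placement_cost sketch shown earlier in this thread):

def improve_by_swaps(placement, cost):
    """Keep trying pairwise swaps and accept any that lowers the total cost,
    until no single swap helps (i.e. a local optimum, as noted above)."""
    placement = list(placement)
    best = cost(placement)
    improved = True
    while improved:
        improved = False
        for a in range(len(placement)):
            for b in range(a + 1, len(placement)):
                placement[a], placement[b] = placement[b], placement[a]
                c = cost(placement)
                if c < best:
                    best, improved = c, True
                else:
                    placement[a], placement[b] = placement[b], placement[a]   # undo
    return placement, best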

Algorithm for Connect 4 Evaluation of Data Set

I am working on a connect 4 AI, and saw many people were using this data set, containing all the legal positions at 8 ply, and their eventual outcome.
I am using a standard minimax with alpha/beta pruning as my search algorithm. It seems like this data set could be really useful for my AI. However, I'm trying to find the best way to implement it. I thought the best approach might be to process the list and use the board state as a hash key for the eventual result (win, loss, draw).
What is the best way to design an AI to use a data set like this? Is my idea of hashing the board state and using it in a traditional search algorithm (e.g. minimax) on the right track, or is there a better way?
Update: I ended up converting the large move database to a plain text format, where 1 represented X and -1 represented O. Then I used a string of the board state, and an integer representing the eventual outcome, and put them in an std::unordered_map (see Stack Overflow With Unordered Map for a problem I ran into). The performance of the map was excellent: it built quickly, and the lookups were fast. However, I never quite got the search right. Is the right way to approach the problem to just search the database when the number of turns in the game is less than 8, then switch over to a regular alpha-beta?
Your approach seems correct.
For the first 8 moves, use alpha-beta algorithm, and use the look-up table to evaluate the value of each node at depth 8.
Once you have "exhausted" the table (exceeded 8 moves in the game) - you should switch to regular alpha-beta algorithm, that ends with terminal states (leaves in the game tree).
This is extremely helpful because:
Remember that the complexity of searching the tree is O(B^d) - where B is the branch factor (number of possible moves per state) and d is the needed depth until the end.
By using this approach you effectively decrease both B and d for the maximal waiting times (longest moves needed to be calculated) because:
Your maximal depth shrinks significantly to d-8 (only for the last moves), effectively decreasing d!
The branching factor itself tends to shrink in this game after a few moves (many moves become impossible or lead to defeat and should not be explored); this decreases B.
For the first move, you shrink the number of developed nodes as well, to B^8 instead of B^d.
So, because of these - the maximal waiting time decreases significantly by using this approach.
Also note: if you find the optimization insufficient, you can always expand your look-up table (to the first 9, 10, ... moves). Of course this increases the needed space exponentially; it's a trade-off you need to examine to choose what best serves your needs (even storing the table in the file system, if main memory is not enough, could be considered).
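To illustrate the shape of that hybrid search, here is a hedged Python sketch (not from the answer): board_key, legal_moves, play and game_over_score are hypothetical helpers you would supply for your own board representation, and outcome_table is the 8-ply hash map described in the question, assumed to hold -1/0/+1 from the point of view of the side to move:

def search(board, ply, alpha, beta, outcome_table):
    """Negamax with alpha-beta pruning; at exactly 8 plies we stop searching
    and trust the precomputed table instead."""
    if ply == 8:
        return outcome_table[board_key(board)]   # hypothetical helper
    terminal = game_over_score(board)            # None while the game is open
    if terminal is not None:
        return terminal
    best = float("-inf")
    for move in legal_moves(board):
        value = -search(play(board, move), ply + 1, -beta, -alpha, outcome_table)
        best = max(best, value)
        alpha = max(alpha, value)
        if alpha >= beta:                        # prune: opponent avoids this line
            break
    return best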

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contains the words "the", "there", "her", "here", and "ere". So that example could have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path-finding algorithm. I believe I could start with a random letter from the list as a starting point; then each remaining letter would be used to create a "path". I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not even a good idea? I don't know which path-finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remaining letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate the need to score permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table - don't look only one letter back, but say 2 or 3. This increases the probability table, but gives you more chance of creating a valid word.
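A small Python sketch of that generator, using only a first-order table; the 0.1 added to every weight is an arbitrary smoothing floor so that letter pairs never seen in the dictionary remain possible:

import random
from collections import defaultdict

def transition_table(words):
    """Count how often each letter is followed by each other letter."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        for a, b in zip(w, w[1:]):
            counts[a][b] += 1
    return counts

def markov_permutation(letters, counts):
    """Build one permutation, sampling each next letter from the remaining
    pool with probability proportional to its transition count."""
    pool = list(letters)
    random.shuffle(pool)
    out = [pool.pop()]
    while pool:
        prev = out[-1]
        weights = [counts[prev][c] + 0.1 for c in pool]
        nxt = random.choices(pool, weights=weights)[0]
        pool.remove(nxt)
        out.append(nxt)
    return "".join(out)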
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
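For example, a bare-bones annealing loop in Python, where score is assumed to be the Aho-Corasick word count the question already has, and the cooling schedule is an arbitrary choice:

import math, random

def anneal(letters, score, steps=200000, t0=2.0, t1=0.01):
    """Random swap neighbours; accept worse permutations with a probability
    that shrinks as the temperature cools."""
    cur = list(letters)
    random.shuffle(cur)
    cur_score = score(cur)
    best, best_score = cur[:], cur_score
    for step in range(steps):
        t = t0 * (t1 / t0) ** (step / steps)               # geometric cooling
        i, j = random.sample(range(len(cur)), 2)
        cur[i], cur[j] = cur[j], cur[i]
        new_score = score(cur)
        if new_score >= cur_score or random.random() < math.exp((new_score - cur_score) / t):
            cur_score = new_score
            if cur_score > best_score:
                best, best_score = cur[:], cur_score
        else:
            cur[i], cur[j] = cur[j], cur[i]                # undo the swap
    return "".join(best), best_score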
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the result still is, or contains, a word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
Using JavaScript and Node.js, I implemented a jumble solver that uses a dictionary, builds a tree, and then traverses the tree; after that you can get all possible words. I explained the algorithm in detail in this article and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js

Guaranteeing Unique Surrogate Key Assignment - Maximum Matching for Non-Bipartite Graph

I am maintaining a data warehouse with multiple sources of data about a class of entities that have to be merged. Each source has a natural key, and what is supposed to happen is that one and only one surrogate key is created for each natural key for all time. If one record from one source system with a particular natural key represents the same entity as another record from another source system with a different natural key, the same surrogate key will be assigned to both.
In other words, if source system A has natural key ABC representing the same entity as source system B's natural key DEF, we would assign the same surrogate key to both. The table would look like this:
SURROGATE_KEY   SOURCE_A_NATURAL_KEY   SOURCE_B_NATURAL_KEY
1               ABC                    DEF
That was the plan. However, this system has been in production for a while, and the surrogate key assignment is a mess. Source system A would give natural key ABC on one day, before source system B knew about it. The DW assigned surrogate key 1 to it. Then source system B started giving natural key DEF, which represents the same thing as source system A's natural key ABC. The DW incorrectly gave this combo surrogate key 2. The table would look like this:
SURROGATE_KEY   SOURCE_A_NATURAL_KEY   SOURCE_B_NATURAL_KEY
1               ABC                    NULL
2               ABC                    DEF
So the warehouse is a mess. There are much more complex situations than this. I have a short timeline for a cleanup that requires figuring out a clean set of surrogate key to natural key mappings.
A little Googling reveals that this can be modeled as a matching problem in a non-bipartite graph:
Wikipedia - Matching
MIT 18.433 Combinatorial Optimization - Lecture Notes on Non-Bipartite Matching
I need an easy-to-understand implementation (not an optimally performing one) of Edmonds' paths, trees, and flowers algorithm. I don't have a formal math or CS background, what I do have is self-taught, and I'm not in a math-y headspace tonight. Can anyone help? A well-written explanation that guides me to an implementation would be deeply appreciated.
EDIT:
A math approach is optimal because we want to maximize global fitness. A greedy approach (first take all instances of A, then B, then C...) paints you into a local-maximum corner.
In any case, I got this pushed back to the business analysts to do manually (all 20 million of them). I'm helping them with functions to assess global match quality. This is ideal since they're the ones signing off anyways, so my backside is covered.
Not using surrogate keys doesn't change the matching problem. There's still a 1:1 natural key mapping that has to be discovered and maintained. The surrogate key is a convenient anchor for that, and nothing more.
I get the impression you're going about this the wrong way; as cdonner says, there are other ways to just rebuild the key structure without going through this mess. In particular, you need to guarantee that natural keys are always unique for a given record (violating this condition is what got you into this mess!). Having both ABC and DEF identify the same record is disastrous, but ultimately repairable. I'm not even sure why you need surrogate keys at all; while they do have many advantages, I'd give some consideration to going pure-relational and just gutting them from your schema, a la Celko; it might just get you out of this mess. But that's a decision that would have to be made after looking at your whole schema.
To address your potential solution, I've pulled out my copy of D. B. West's Introduction to Graph Theory, second edition, which describes the blossom algorithm on page 144. You'll need some mathematical background, with both mathematical notation and graph theory, to follow the algorithm, but it's sufficiently concise that I think it can help (if you decide to go this route). If you need explanation, first consult a resource on graph theory (Wikipedia, your local library, Google, wherever), or ask if you're not finding what you need.
3.3.17. Algorithm. (Edmonds' Blossom Algorithm [1965a]---sketch).
Input. A graph G, a matching M in G, an M-unsaturated vertex u.
Idea. Explore M-alternating paths from u, recording for each vertex the vertex from which it was reached, and contracting blossoms when found. Maintain sets S and T analogous to those in Algorithm 3.2.1, with S consisting of u and the vertices reached along saturated edges. Reaching an unsaturated vertex yields an augmentation.
Initialization. S = {u} and T = {} (empty set).
Iteration. If S has no unmarked vertex, stop; there is no M-augmenting path from u. Otherwise, select an unmarked v in S. To explore from v, successively consider each y in N(v) such that y is not in T.
If y is unsaturated by M, then trace back from y (expanding blossoms as needed) to report an M-augmenting (u, y)-path.
If y is in S, then a blossom has been found. Suspend the exploration of v and contract the blossom, replacing its vertices in S and T by a single new vertex in S. Continue the search from this vertex in the smaller graph.
Otherwise, y is matched to some w by M. Include y in T (reached from v), and include w in S (reached from y).
After exploring all such neighbors of v, mark v and iterate.
The algorithm as described here runs in time O(n^4), where n is the number of vertices. West gives references to versions that run as fast as O(n^(5/2)) or O(n^(1/2) m) (m being the number of edges). If you want these references, or citations to Edmonds' original paper, just ask and I'll dig them out of the index (which kind of sucks in this book).
I think you would be better off by establishing a set of rules and attacking your key mapping table with a set of simple queries that enforce each rule, in an iterative fashion. Maybe I am oversimplifying because your example is simple.
The following are examples of rules - only you can decide which ones apply:
if there are duplicates, use the lowest (oldest) surrogate key
use the natural keys from the row with the highest (latest) surrogate key
use the natural keys from the most complete mapping row
use the most recent occurrence of every natural key
... ?
Writing queries that rebuild your key mapping is trivial, once you have established the rules. I am not sure how this could be a math problem?
If you are looking for an implementation, Eppstein's PADS library has a matching algorithm that should be fast enough for your purposes; the general matching algorithm is in CardinalityMatching.py. The comments in the implementation explain what is going on. The library is easy to use: to supply a graph in Python, you represent it as a dictionary G such that G[v] gives a list (or set) of the neighbors of vertex v.
Example:
G = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
gives a line graph with 4 vertices.
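If installing PADS is inconvenient, networkx (an assumption on my part, not something mentioned in the answer) ships the same blossom-based matching:

import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 4)])              # the same 4-vertex path
M = nx.max_weight_matching(G, maxcardinality=True)  # maximum matching via blossoms
print(M)                                            # e.g. {(1, 2), (3, 4)}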
