Whenever I play Sudoku, I see the finished puzzle as an overspecified version of the original input. Like 8b/10b, Reed-Solomon codes, turbo codes, or low-density parity-check codes. With ECC the computer has to solve a puzzle to produce the correct data, and with Sudoku the human has to solve a puzzle to produce 81 digits of fun.
Do you think any of these ECC codes would make a good pencil and paper game? (8b/10b -- the home version!)
Is there a good way to represent data as Sudoku puzzles to make the most ridiculous ECC available?
Representing arbitrary data as a Sudoku puzzle is not particularly feasible: the total number of Sudoku grids (and thus the number of distinct pieces of information that could be represented by a puzzle) is far too low (approximately 6.7e21, or roughly 72 bits) to encode a significant amount of data (no more than about 9 bytes).
Add to this the computational difficulty of producing an unambiguous puzzle for a given solution, and the widely varying data density of the optimal puzzle for different solutions.
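A quick back-of-the-envelope check of that figure (using the commonly cited count of valid completed 9x9 grids):

import math

grids = 6_670_903_752_021_072_936_960   # number of valid completed 9x9 Sudoku grids
bits = math.log2(grids)                 # information content of one full grid
print(bits, bits / 8)                   # ~72.5 bits, i.e. about 9 bytes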
Another way to look at the overspecification in the final result is to consider the original state as the result of a compression algorithm.
Nonograms are another example of a very information-sparse result being represented in the form of an information-dense puzzle.
I am working on a project that uses a form of the recursive division algorithm that is normally used to create a fractal-like maze. Now I would like to cite the creator/author of this algorithm to give them credit, but I am unsure who invented it.
Jamis Buck has a very famous blog entry https://weblog.jamisbuck.org/2011/1/12/maze-generation-recursive-division-algorithm , titled: "A novel method for generating fractal-like mazes is presented, with sample code and an animation", but did he actually come up with that idea? I do not read his post as saying that he invented it; to me it sounds more like he is just describing it.
I was not able to find a clear publication/author through Google and Wikipedia(en) or any other specifically maze-focused site. Does anyone know about the origin of this algorithm?
No, the algorithm is not mine; I learned it from Walter Pullen’s page, here: https://www.astrolog.org/labyrnth/algrithm.htm. He also does not cite an author for this algorithm. It should be noted that many maze algorithms are adaptations of more general graph algorithms, so it could be that this one has an analog in graph theory somewhere...
At any rate, it’s not mine. I used the word “novel” in my post in the second sense given here (https://www.merriam-webster.com/dictionary/novel) “original or striking especially in conception or style”, not in the sense of “new”.
The recursive division algorithm was invented and named by John Perry, who added it to Wikipedia's Maze generation algorithm page in November 2006.
To assist with questions like this, Walter Pullen's Maze generation algorithms page has been updated to add a "Source" column to the table of Maze generation and solving algorithms, to list who invented, named, or otherwise popularized each algorithm. Yes, it is true that many Maze algorithms are adaptations of general graph theory algorithms, and that many algorithms are similar to or variations on others, etc.
For example, recursive division is a type of "nested Mazes", "fractal Mazes", or "Mazes by subdivision", which means generating smaller Mazes within each cell of a larger Maze. That is an older and more general concept, and others (for example the Maze generation program Daedalus) have done nested cell Mazes since 2003. However, it would always divide the Maze into the SAME sized submazes, e.g. a 2x2 equal sized cell Maze with equal sized 2x2 cell Mazes within each cell, with the process repeated 6 times.
John Perry introduced the idea of dividing a Maze into RANDOM sized 2x2 cell Mazes instead of just fixed sized 2x2 cell Mazes, which can generate a Maze of any dimensions instead of just 2^N power passages as with nested cell. The result looks a bit more organic (like a tree trunk) instead of grid based (like a Borg cube). Note that the implementation of recursive division in Daedalus is a bit more generalized, in that Daedalus divides the Maze into either 1x2 or 2x1 cell Mazes (option chosen randomly too) which looks even more random and organic. Also, note that both same and random sized division ("nested cell" algorithm and "recursive division" algorithm) can be implemented in a virtual manner to produce virtual Mazes of enormous size (in which not all of the Maze has to be in memory at once).
This may be a repeat of the question here: Predict Huffman compression ratio without constructing the tree
So basically, I have the probability distributions of two datasets with the same variables but different probabilities. Now, is there any way that, by looking at the variable distributions, I can say with some degree of confidence that one dataset, when passed through a Huffman coding implementation, would achieve a higher compression ratio than the other?
One of the solutions that I came across was to calculate the upper bound using conditional entropy and then compute the average code length. Is there any other approach that I could explore before using the said method?
Thanks a lot.
I don't know what "to some degree confidently" means, but you can get a lower bound on the compressed size of each set by computing the zero-order entropy as done in the linked question (the negative of the sum of the probabilities times the log of the probabilities). Then the lower entropy very likely produces a shorter Huffman coding than the higher entropy. It is not definite, as I am sure that one could come up with a counter-example.
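For example, a minimal sketch of that zero-order entropy bound (assuming each dataset is summarized as a dict of symbol probabilities):

import math

def zero_order_entropy(probs):
    # probs: dict mapping symbol -> probability; result is bits per symbol
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# The distribution with the lower entropy will almost always Huffman-code
# to fewer bits per symbol.
a = {'x': 0.5, 'y': 0.25, 'z': 0.25}
b = {'x': 0.9, 'y': 0.05, 'z': 0.05}
print(zero_order_entropy(a), zero_order_entropy(b))  # 1.5 vs. ~0.569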
You also need to send a description of the code itself if you want to decode it on the other end, which adds a wrinkle to the comparison. However if the data is much larger than the code description, then that will be lost in the noise.
Simply generating the code, the coded data, and the code description is very fast. The best solution is to do that, and compare the resulting number of bits directly.
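A minimal sketch of that direct comparison, building only the Huffman code lengths from symbol counts via the usual heapq construction (the names here are illustrative):

import heapq
from collections import Counter

def huffman_bits(data):
    # Total number of bits a Huffman code would use for `data`
    # (not counting the code description).
    counts = Counter(data)
    if len(counts) == 1:
        return len(data)  # degenerate case: one distinct symbol, one bit each
    # Heap items: (weight, tie_breaker, {symbol: code_length_so_far})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    _, _, lengths = heap[0]
    return sum(counts[s] * lengths[s] for s in counts)

print(huffman_bits("abracadabra"))  # 23 bits for the coded data

Run it on both datasets and compare the two bit counts directly.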
I got the following interesting task:
Given a list of 1 million 16-digit numbers (say, credit card numbers): 990,000 of them are purely random numbers generated by a computer system, and 10,000 were created manually by fraudsters. These numbers are labeled as genuine or fraud. Build an algorithm to predict the non-random numbers.
My approach so far is a bit of a brute force: looking at the non-random numbers to find patterns (such as repeated digits, e.g. 22222, or sequences, e.g. 01234).
I wonder if there's a ready-made algorithm or tool for this kind of task. I imagine this task should be quite common in the fraud-analytics community.
Thanks.
First off, if you know they're credit card numbers, use Luhn's algorithm, which is a quick checksum algorithm for valid credit card numbers.
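For reference, a minimal Luhn check (the standard algorithm, taking the number as a string of digits):

def luhn_valid(number: str) -> bool:
    # Double every second digit from the right, subtract 9 when the result
    # exceeds 9, and require the total to be a multiple of 10.
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True for this well-known test number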
However, if they are simply 16-digit integers, there are a couple of approaches that you can use. It is hard to tell whether an individual number came from a random source (the number 1111111111111111 is just as likely as any other number out of a random number generator). As for your repeated numbers and patterns, that is very reminiscent of the concept of Kolmogorov complexity (see links below). You could try looking for patterns in this brute-force way, but I feel it would be quite inaccurate, as humans might actually tend to avoid obviously repeated digits and sequences in these numbers!
Instead, I suggest focusing on the way people generate numbers. You can treat human input like a very poor random number generator. So I recommend just making a list yourself of random human-entered numbers, if you don't have another dataset. Then, you can use machine learning to build a classifier that distinguishes purely random numbers (those without the 'human-like' attributes your machine learning algorithm has recognized). As for the metrics for the statistical classifier, Kolmogorov complexity could be one, perhaps the frequency of digits another (see Benford's law on Wikipedia), and the number of repeating digits a third (humans might try to avoid repeating digits to look non-random, so let your classifier do the work!)
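As a rough illustration, a feature-extraction sketch along those lines (the features chosen here are illustrative assumptions, not a tuned model; zlib's compressed length stands in as a crude proxy for Kolmogorov complexity):

import zlib
from collections import Counter
from itertools import groupby

def features(number: str):
    counts = Counter(number)
    return {
        # crude complexity proxy: more structure compresses to fewer bytes
        'compressed_len': len(zlib.compress(number.encode())),
        # longest run of a repeated digit (humans often avoid long runs)
        'max_run': max(len(list(g)) for _, g in groupby(number)),
        # how many distinct digits appear (humans tend to spread digits out)
        'distinct_digits': len(counts),
        # count of the most frequent digit
        'max_digit_freq': max(counts.values()),
    }

print(features("1234123412341234"))
print(features("7308156294018375"))

# Feed these feature vectors, together with the genuine/fraud labels,
# into any off-the-shelf classifier (logistic regression, random forest, ...).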
From my personal experience, tough problems like this are a textbook case for machine learning algorithms and statistical classifiers.
Hope this helps!
Links:
Kolmogorov Complexity
Complexity calculator
Imagine there are two same-sized sets of numbers.
Is it possible, and how, to create a function, an algorithm, or a subroutine which exactly maps input items to output items? Like:
Input = 1, 2, 3, 4
Output = 2, 3, 4, 5
and the function would be:
def f(x):
    return x + 1
And by "function" I mean something slightly more comlex than [1]:
def f(x):
    if x == 1: return 2
    if x == 2: return 3
    if x == 3: return 4
    if x == 4: return 5
This would be useful for creating special hash functions or function approximations.
Update:
What I am trying to ask is whether there is a way to compress the trivial mapping example from above [1].
Finding the shortest program that outputs some string (sequence, function etc.) is equivalent to finding its Kolmogorov complexity, which is undecidable.
If "impossible" is not a satisfying answer, you have to restrict your problem. In all appropriately restricted cases (polynomials, rational functions, linear recurrences) finding an optimal algorithm will be easy as long as you understand what you're doing. Examples:
polynomial - Lagrange interpolation (see the sketch after this list)
rational function - Pade approximation
boolean formula - Karnaugh map
approximate solution - regression, linear case: linear regression
general packing of data - data compression; some techniques, like run-length encoding, are lossless, some not.
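For the polynomial case, here is a minimal Lagrange interpolation sketch (pure Python; it evaluates the interpolating polynomial at a point rather than returning its coefficients):

def lagrange_eval(xs, ys, x):
    # Evaluate the unique polynomial of degree < len(xs) passing through
    # the points (xs[i], ys[i]) at location x.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# The trivial mapping from the question, 1,2,3,4 -> 2,3,4,5, is recovered as x + 1:
print(lagrange_eval([1, 2, 3, 4], [2, 3, 4, 5], 10))  # 11.0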
In the case of polynomial sequences, it often helps to consider the difference sequence b_n = a_{n+1} - a_n; this reduces a quadratic relation to a linear one, and a linear one to a constant sequence, and so on. But there's no silver bullet. You might build some heuristics (e.g. Mathematica has FindSequenceFunction - check that page to get an impression of how complex this can get) using genetic algorithms, random guesses, checking many built-in sequences and their compositions and so on. No matter what, any such program - in theory - is infinitely distant from perfection due to the undecidability of Kolmogorov complexity. In practice, you might get satisfactory results, but this requires a lot of man-years.
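A small illustration of that differencing trick (using the quadratic sequence a_n = n^2 as an example):

def differences(seq):
    # One differencing step: b[n] = a[n+1] - a[n]
    return [b - a for a, b in zip(seq, seq[1:])]

a = [n * n for n in range(8)]       # 0, 1, 4, 9, 16, 25, 36, 49
print(differences(a))               # 1, 3, 5, 7, 9, 11, 13  (linear)
print(differences(differences(a)))  # 2, 2, 2, 2, 2, 2       (constant)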
See also another SO question. You might also implement some wrapper to OEIS in your application.
Fields:
Mostly, the limits of what can be done are described in
complexity theory - describing what problems can be solved "fast", like finding the shortest path in a graph, and what cannot, like playing a generalized version of checkers (it is EXPTIME-complete).
information theory - describing how much "information" is carried by a random variable. For example, take coin tossing. Normally, it takes 1 bit to encode the result, and n bits to encode n results (using a long 0-1 sequence). Suppose now that you have a biased coin that gives tails 90% of the time. Then, it is possible to find another way of describing n results that on average gives a much shorter sequence. The number of bits per toss needed for optimal coding (less than 1 in that case!) is called entropy; the plot in that article shows how much information is carried (1 bit for 1/2-1/2, less than 1 bit for a biased coin, 0 bits if the coin always lands on the same side). See the short calculation after this list.
algorithmic information theory - that attempts to join complexity theory and information theory. Kolmogorov complexity belongs here. You may consider a string "random" if it has large Kolmogorov complexity: aaaaaaaaaaaa is not a random string, f8a34olx probably is. So, a random string is incompressible (Volchan's What is a random sequence is a very readable introduction.). Chaitin's algorithmic information theory book is available for download. Quote: "[...] we construct an equation involving only whole numbers and addition, multiplication and exponentiation, with the property that if one varies a parameter and asks whether the number of solutions is finite or infinite, the answer to this question is indistinguishable from the result of independent tosses of a fair coin." (in other words no algorithm can guess that result with probability > 1/2). I haven't read that book however, so can't rate it.
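Here is the short calculation behind the biased-coin figure mentioned above:

import math

def coin_entropy(p):
    # Shannon entropy, in bits per toss, of a coin with P(tails) = p
    q = 1 - p
    return -(p * math.log2(p) + q * math.log2(q))

print(coin_entropy(0.5))  # 1.0 bit per toss for a fair coin
print(coin_entropy(0.9))  # ~0.469 bits per toss for the 90/10 coin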
Strongly related to information theory is coding theory, that describes error-correcting codes. Example result: it is possible to encode 4 bits to 7 bits such that it will be possible to detect and correct any single error, or detect two errors (Hamming(7,4)).
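A minimal sketch of that Hamming(7,4) code (one common bit layout; conventions for bit ordering vary):

def hamming74_encode(d1, d2, d3, d4):
    # Each parity bit covers three of the four data bits.
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_syndrome(word):
    # 0 means the word looks clean; otherwise the value is the 1-based
    # position of the single flipped bit.
    p1, p2, d1, p3, d2, d3, d4 = word
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    return s1 + 2 * s2 + 4 * s3

word = hamming74_encode(1, 0, 1, 1)
word[5] ^= 1                      # flip one bit "in transit"
print(hamming74_syndrome(word))   # 6: the corrupted position, which can now be flipped back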
The "positive" side are:
symbolic algorithms for Lagrange interpolation and Pade approximation are a part of computer algebra/symbolic computation; von zur Gathen, Gerhard "Modern Computer Algebra" is a good reference.
data compression - here you'd better ask someone else for references :)
Ok, I don't understand your question, but I'm going to give it a shot.
If you only have 2 sets of numbers and you want to find f where y = f(x), then you can try curve-fitting to give you an approximate "map".
In this case, it's linear so curve-fitting would work. You could try different models to see which works best and choose based on minimizing an error metric.
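For instance, a minimal curve-fitting sketch using the sets from the question (assuming numpy is available; the model here is a degree-1 polynomial, i.e. a straight line):

import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 3, 4, 5])

# Least-squares fit of a line y = a*x + b
a, b = np.polyfit(x, y, 1)
print(a, b)                    # ~1.0 and ~1.0, i.e. f(x) = x + 1
print(np.polyval([a, b], 10))  # ~11.0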
Is this what you had in mind?
Here's another link to curve-fitting.
It seems to me that you want a hashtable. These are based on hash functions, and there are known hash functions that work better than others depending on the expected input and desired output.
If what you want is an algorithmic way of mapping arbitrary input to arbitrary output, this is not feasible in the general case, as it totally depends on the input and output sets.
For example, in the trivial sample you have there, the function is immediately obvious: f(x): x + 1. In other cases it may be very hard or even impossible to generate an exact function describing the mapping; you would have to approximate it or just use a map directly.
In some cases (such as your example), linear regression or similar statistical models could find the relation between your input and output sets.
Doing this in the general case is arbitrarily difficult. For example, consider a block cipher used in ECB mode: it maps an input integer to an output integer, but, by design, deriving any general mapping from specific examples is infeasible. In fact, for a good cipher, even with the complete set of mappings between input and output blocks, you still couldn't determine how to calculate that mapping on a general basis.
Obviously, a cipher is an extreme example, but it serves to illustrate that there's no (known) general procedure for doing what you ask.
Discerning an underlying map from input and output data is exactly what Neural Nets are about! You have unknowingly stumbled across a great branch of research in computer science.
I'm creating a puzzle game that, while playable by hand for easy levels, is meant to be solved by computer programs for harder ones. The puzzle is a flood fill on a hexagonal board. You can try a prototype here.
Here is how the puzzle works: by choosing a color from the top, you perform a flood fill starting from the upper-left tile. This progressively converts the board to a solid color. The challenge is to do this in a certain number of moves.
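For concreteness, a minimal sketch of a single move (a square grid stands in for the hex board here, just to keep the neighbor function short):

from collections import deque

def flood_move(board, new_color):
    # Recolor the connected region containing the upper-left tile.
    rows, cols = len(board), len(board[0])
    old_color = board[0][0]
    if old_color == new_color:
        return
    queue = deque([(0, 0)])
    board[0][0] = new_color
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and board[nr][nc] == old_color:
                board[nr][nc] = new_color
                queue.append((nr, nc))

board = [[1, 2, 2],
         [1, 1, 3],
         [3, 2, 3]]
flood_move(board, 2)
print(board)  # [[2, 2, 2], [2, 2, 3], [3, 2, 3]] -- the 2s now form one larger region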
I have created several puzzles similar to this, and the key is to use an algorithm that generates boards that are hard to solve without knowing how they were created. For example, here we might produce a board by reversing the flood fill: working backwards from a solid board until it has been unflooded. We know how many steps this took, and can set this as a lower bound on a solution.
The problem I'm facing is that when I try this approach, my upper-bound is way too high. It becomes trivial to solve the puzzle within this number of moves, even by moving randomly.
An approach that is not a solution is generating a random board and then solving it optimally and setting this as the target. The point is to create a puzzle where solving it optimally is NP time or at least a hard P.
So what I'm looking for is an algorithm that can generate extremely hard boards where solving them, as they get larger, becomes a serious challenge.
When doing RSA encryption, we do not find prime numbers; we select random numbers and then apply tests to them that give us increasingly high probability of the numbers being prime, without ever proving it.
I suggest the same. Try to find conditions that give a good likelihood of the puzzle having the desired properties, and test for those. Or you could use genetic algorithms/neural networks and train them to recognize the "good" puzzles, which amounts to the same thing.
I would try to prove that it is NP-complete or in P, to get a feel for the configurations which are difficult.
I'd also abstract away the hexagons and use a representation as a graph.
I've played the rectangular flood puzzle a lot (http://labpixies.com/gadget_page.php?id=10). Excited to see a Hex version! I think generating a hard game is as easy as avoiding large blocks of the same color in the puzzle. At least in the rectangular cases I've seen, nearly all the puzzles that can be solved in a small number of steps have large color blocks.
P.S. I think your "lower bound" is not valid. When working forwards, if a good strategy is used, you could actually finish in fewer steps. The "lower bound" is really an upper bound for the optimum solution.