What kind of PRNG would match such scatter plots?

I've been given the challenge to find the seed from a series of pseudo-randomly generated alphanumerical IDs and after some analysis, I'm stuck in a dead end that I hope you'll be able to get me out of.
Each ID is obtained by passing the previous one through an encryption algorithm that I'm supposed to reverse engineer in order to find the seed. The list given to me consists of the first 2070 IDs (without the seed, obviously). The IDs start as 4 alphanumeric characters and switch to 5 after some time (e.g. "2xm1", "34nj", "avdfe", "2lgq9").
This switch happens once the algorithm, after encrypting an ID, returns an ID that has already been generated previously. At this point, it appends one character to the returned ID, making it longer and thus unique. It then proceeds as usual, generating IDs of the new length. This effectively means that the generation algorithm is not injective: two different IDs can be mapped to the same output.
My first reflex was to convert those IDs from base 36 to some other base, notably decimal. Using the results, I scatter-plotted the IDs' decimal values against their rank in the list, and noticed a pattern whose origin I couldn't understand.
After splitting the list by ID length, I made the same scatter plot for the 4-character and 5-character sub-lists, which let me see the strange density patterns.
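For reference, a minimal sketch (mine, not the original analysis) of the decoding and plotting step, assuming the IDs sit one per line in a hypothetical "ids.txt" and that matplotlib is available:

    import matplotlib.pyplot as plt

    with open("ids.txt") as f:
        ids = [line.strip() for line in f if line.strip()]

    # int(s, 36) treats 0-9 and a-z as base-36 digits, which matches the ID alphabet.
    values = [int(i, 36) for i in ids]

    # Group (rank, value) pairs by ID length so the 4- and 5-character
    # populations can be plotted separately.
    by_length = {}
    for rank, (ident, value) in enumerate(zip(ids, values)):
        by_length.setdefault(len(ident), []).append((rank, value))

    for length, points in sorted(by_length.items()):
        ranks, vals = zip(*points)
        plt.figure()
        plt.scatter(ranks, vals, s=4)
        plt.title(f"{length}-character IDs")
        plt.xlabel("rank in list")
        plt.ylabel("decimal value")
    plt.show()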
After some analysis, I've observed two things:
For each sub-list, the boundary between the two densities is at 6 × 36^(n-1), n being the number of characters in the ID. In other words, it sits at 1/6th of the full range of values for a given ID length, that range being [0; (36^n)-1].
The distribution of the IDs relative to this boundary tends towards 50/50: half of them fall above the 1/6th limit and half below it.
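Both observations are easy to check numerically; a small sketch, reusing the ids list from the snippet above:

    # Count, for each ID length n, how many decoded values fall below the
    # 6 * 36^(n-1) boundary.
    from collections import defaultdict

    counts = defaultdict(lambda: [0, 0])  # length -> [below, above]
    for ident in ids:
        n = len(ident)
        boundary = 6 * 36 ** (n - 1)
        counts[n][0 if int(ident, 36) < boundary else 1] += 1

    for n, (below, above) in sorted(counts.items()):
        total = below + above
        print(f"{n} chars: {below / total:.1%} below the 1/6 boundary, "
              f"{above / total:.1%} above")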
I've tried to correlate this behavior with the scatter plots of other known PRNGs, but none of them matched what I get on my graphs.
I'm hoping some of you might know about an encryption method, formula, or function matching such a specific scatter plot, or have any idea about what could be going on behind the scenes.
Thanks in advance for your answers.

This answer may not be very useful, but I think it can help. The plot you've shown most likely doesn't correspond to any of the commonly used PRNGs, and it certainly doesn't correspond to a cryptographic PRNG.
But I have one observation, though I don't know if it helps. This PRNG seems to have a full period equal to the full cycle of numbers that can be generated for a fixed number of characters. I mean that it follows a pattern for 4 characters, then repeats that pattern at a higher magnitude for 5 characters, which probably means the same distribution pattern will repeat again for 6 characters at an even higher magnitude.
So, in summary, this pattern could be exploited if you knew the value of this magnitude: you would then know the increments for the 6-character plot, and you could simply stretch the 5-character graph along the Y-axis to get some kind of solution (which would be the seed for the 6-character graph).
EDIT: To make things clearer regarding your comment: what I mean is that this PRNG generates random numbers, but those numbers do not go on forever without repeating; at some point the same sequence is regenerated. Your description confirms this: when the algorithm encounters a number generated before (i.e. it reaches the point where the sequence would repeat), it just adds one extra character, which doesn't change the distribution on the graph but makes the graph look stretched along the Y-axis (as if the Y-intercept of the graph's function just got bigger).

Related

What data structure/algorithm to use to compute similarity between input sequence and a database of stored sequences?

By this question, I mean if I have an input sequence abchytreq and a database / data structure containing jbohytbbq, I would compare the two elements pairwise to get a match of 5/9, or 55%, because of the pairs (b-b, hyt-hyt, q-q). Each sequence additionally needs to be linked to another object (but I don't think this will be hard to do). The sequence does not necessarily need to be a string.
The maximum number of elements in the sequence is about 100. This is easy to do when the database/datastructure has only one or a few sequences to compare to, but I need to compare the input sequence to over 100000 (mostly) unique sequences, and then return a certain number of the most similar previously stored data matches. Additionally, each element of the sequence could have a different weighting. Back to the first example: if the first input element was weighted double, abchytreq would only be a 50% match to jbohytbbq.
I was thinking of using BLAST and creating a little hack as needed to account for any weighting, but I figured that might be a little bit overkill. What do you think?
One more thing. Like I said, comparison needs to be pairwise, e.g. abcdefg would be a zero percent match to bcdefgh.
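For concreteness, the pairwise metric described above can be sketched as follows (the helper names weighted_match and top_k are mine, not from an existing library). A brute-force scan over 100000 stored sequences of length ~100 is on the order of ten million character comparisons, so it may well be fast enough before reaching for BLAST:

    import heapq

    def weighted_match(a, b, weights=None):
        """Fraction of (weighted) positions where a and b agree; 0 if lengths differ."""
        if len(a) != len(b):
            return 0.0
        if weights is None:
            weights = [1.0] * len(a)
        total = sum(weights)
        matched = sum(w for x, y, w in zip(a, b, weights) if x == y)
        return matched / total if total else 0.0

    def top_k(query, database, k=10, weights=None):
        """Return the k stored sequences most similar to the query."""
        return heapq.nlargest(k, database, key=lambda s: weighted_match(query, s, weights))

    # The example from the question: 5 of 9 positions match.
    print(weighted_match("abchytreq", "jbohytbbq"))                 # ~0.56
    # With the first position weighted double, the match drops to 50%.
    print(weighted_match("abchytreq", "jbohytbbq", [2] + [1] * 8))  # 0.5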
A modified Edit Distance algorithm with weightings for character positions could help.
https://www.biostars.org/p/11863/
Multiply the resulting distance matrix with a matrix of weights for character positions.
I'm not entirely clear on the question; for instance, would you return all matches of 90% or better, regardless of how many or few there are, or would you return the best 10% of the input, even if some of them match only 50%? Here are a couple of suggestions:
First: Do you know the story of the wise bachelor? The foolish bachelor makes a list of requirements for his mate --- slender, not blonde (Mom was blonde, and he hates her), high IQ, rich, good cook, loves horses, etc --- then spends his life considering one mate after another, rejecting each for failing one of his requirements, and dies unfulfilled. The wise bachelor considers that he will meet 100 marriageable women in his life, examines the first sqrt(100) = 10 of them, then marries the next mate with a better score than the best of the first ten; she might not be perfect, but she's good enough. There's some theorem of statistics that says the square root of the population size is the right cutoff, but I don't know what it's called.
Second: I suppose that you have a scoring function that tells you exactly which of two dictionary words is the better match to the target, but is expensive to compute. Perhaps you can find a partial scoring function that is easy to compute and would allow you to quickly scan the dictionary, discarding those inputs that are unlikely to be winners, and then apply your total scoring function only to that subset of the dictionary that passed the partial scoring function. You'll have to define the partial scoring function based on your needs. For instance, you might want to apply your total scoring function to only the first five characters of the target and the dictionary word; if that doesn't eliminate enough of the dictionary, increase to ten characters on each side.
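A sketch of that two-stage idea; partial_score here is just the positional match over the first few characters, and full_score stands in for whatever expensive scoring function you already have:

    def partial_score(query, candidate, prefix_len=5):
        """Cheap filter: positional match over the first prefix_len characters."""
        p, c = query[:prefix_len], candidate[:prefix_len]
        if not p or len(p) != len(c):
            return 0.0
        return sum(a == b for a, b in zip(p, c)) / len(p)

    def search(query, database, full_score, cutoff=0.4, k=10):
        """Apply the expensive full_score only to candidates passing the cheap filter."""
        survivors = [c for c in database if partial_score(query, c) >= cutoff]
        return sorted(survivors, key=lambda c: full_score(query, c), reverse=True)[:k]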

Algorithm for global multiple sequence alignment using only indels

I'm writing a Sublime Text script to align several lines of code. The script takes each line, splits it by a predefined set of delimiters (,;:=), and rejoins it with each segment in a 'column' padded to the same width. This works well when all lines have the same set of delimiters, but some lines may have extra segments, an optional comma at the end, and so forth.
My idea is to come up with a canonical list of delimiters. Specifically, given several strings of delimiters, I would like to find the shortest string that can be formed from any of the given strings using only insertions, with ties broken in some sensible manner. After some research, I learned that this is the well-known problem of global multiple sequence alignment, except that there are no mismatches, only matches and indels.
The dynamic programming approach, unfortunately, is exponential in the number of strings - at least in the general case. Is there any hope for a faster solution when mismatches are disallowed?
I'm a little hesitant to make a blanket statement that there is no such hope, even when mismatches are disallowed, but I'm pretty sure that there isn't. Here's why.
The size of the dynamic programming table generated when doing sequence alignment is approximately (string length)^(number of strings), hence the exponential run-time/space requirement. To give you a feel of where that comes from, here's an example with two strings, ABC and ACB, each of length 3. This gives us a 3x3 table:
A B C
A 0 1 2
C 1 1 1
B 2 1 2
We initialize this table starting from the upper left and working our way down to the lower right from there. The total cost to get to any location in the table is given by the number at that location (for simplicity, I'm assuming that insertions, deletions, and substitutions all have a cost of 1). The operation used to get to a given location is given by the direction that you moved from the previous value. Moving to the right means you are inserting elements from the top string. Moving down inserts elements from the sideways string. Moving diagonally means you are aligning elements from the top and bottom. If these elements don't match, then this represents a substitution and you increase the cost to get there.
And that's the problem. Saying mismatches aren't allowed doesn't rule out the operations that are responsible for the length and height of the table (insertions/deletions). Worse, disallowing mismatches doesn't even rule out a potential move. Diagonal movements in the table are still possible sometimes, just not when the two elements don't match. Plus, you still need to check to see if the elements match, so you're basically still considering that move. As a result, this shouldn't be able to improve your worst case time and seems unlikely to have a substantial effect on your average or best case time either.
On the bright side, this is a pretty important problem in bioinformatics, so people have come up with some solutions. They have their flaws, but may work well enough for your case (particularly since you are likely to have fewer spurious alignments than you would with DNA, given that your strings are not composed of a four-letter alphabet). So take a look at Star Alignment and Neighbor Joining.
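For what it's worth, the two-string version of the "matches and indels only" problem is exactly the shortest common supersequence, which falls straight out of the standard LCS table; a sketch (mine, not part of the answer above) that could serve as the pairwise building block for a star-alignment-style heuristic:

    def shortest_common_supersequence(a, b):
        m, n = len(a), len(b)
        # lcs[i][j] = length of the longest common subsequence of a[:i] and b[:j]
        lcs = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    lcs[i][j] = lcs[i - 1][j - 1] + 1
                else:
                    lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
        # Walk back through the table, emitting matched characters once and
        # indel characters from whichever string they came from.
        out, i, j = [], m, n
        while i > 0 and j > 0:
            if a[i - 1] == b[j - 1]:
                out.append(a[i - 1]); i -= 1; j -= 1
            elif lcs[i - 1][j] >= lcs[i][j - 1]:
                out.append(a[i - 1]); i -= 1
            else:
                out.append(b[j - 1]); j -= 1
        out.extend(reversed(a[:i]))
        out.extend(reversed(b[:j]))
        return "".join(reversed(out))

    print(shortest_common_supersequence(",;:=", ",;=,"))  # ",;:=,"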

Reconstructing a signal from random samples with holes

I've encountered the following problem as part of my master's thesis, and having been unable to find a suitable solution over the last few weeks, I'll ask the masses.
Problem 1
Assume there exists an (unknown) sequence of symbols of a known length. Say for instance
ABCBACBBBAACBAABCCBABBCA... # 2000 Symbols long
Now, given N samples from arbitrary positions in the sequence, the task is to reconstruct the original sequence. For instance:
ABCBACBBBAA
ACBBBAACBAABCCBAB
CBACBBBAACBAAB
BAABCCBABBCA
...
Problem 2 (harder)
Now, on the bright side, there is no limit to how many samples I can take, whilst on the not-so-bright side there is more to the story:
The samples are noisy, i.e. there might be errors.
There are known holes in the samples: I am only able to observe every 4th-6th symbol.
Thus the samples actually look more like this:
A A A
A A A C
C B B
B B C* # The C should have been an A.
...
I have tried the following:
Let S be the set of all partial noisy sequences with holes.
Greedy algorithm with random sampling and sliding window:
1. Let X be the "best" sequence so far.
2. Set X to a random sample from S.
3. Choose a sequence v from S.
4. Slide v along X, score the match, and choose the "best" sequence as the new X.
5. Repeat from step 3.
The problem with this algorithm is that I have been unable to find a good metric to score the sequences, especially when considering the holes plus the noise. The results tended to favor shorter sequences and varied widely between subsequent runs. Ideas to resolve this are most welcome.
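For what it's worth, one possible shape for the slide-and-score step, with holes represented as None and the score normalized by the number of observed overlapping positions so that longer overlaps are not penalized (the normalization and the minimum-overlap threshold are my suggestions, not something from the question):

    def slide_score(x, v, min_overlap=3):
        """Best (score, offset) when sliding sample v along the current estimate x."""
        best = (float("-inf"), 0)
        for offset in range(-len(v) + 1, len(x)):
            matches = observed = 0
            for k, symbol in enumerate(v):
                pos = offset + k
                # Skip holes in either sequence and positions outside x.
                if symbol is None or pos < 0 or pos >= len(x) or x[pos] is None:
                    continue
                observed += 1
                matches += (symbol == x[pos])
            if observed >= min_overlap:
                score = matches / observed  # fraction correct over the observed overlap
                if score > best[0]:
                    best = (score, offset)
        return best

    x = ['A', None, 'B', 'B', None, 'A', 'C']
    v = ['B', None, 'A', 'C']
    print(slide_score(x, v))  # (1.0, 3)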
Trying to align the start of the sequence.
This approach attempted to use the fact that I might be able to identify a suffix in the strings that likely make up the beginning of the unknown sequence. However, due to the holes in the samples, I would need to shift even the matching sequences a few steps right or left. This results in exponential complexity and makes the problem intractable.
I have also played with the idea of using a Hidden Markov Model, but am thwarted on how to deal with the missing data.
Other ideas include trying max flow through a graph built from the strings (I don't think this will work), trellis decoding [Viterbi] (I don't see how I can deal with samples starting in the middle of the unknown sequence), and more.
Any fresh ideas are very welcome. Links/references to relevant articles are like manna!
Specific information about my data set
I have three symbols S (start), A and B.
I am < 60% certain any given symbol is sampled correctly.
The S symbol should only appear a few times at the start of the master sequence, but does occur more often due to misclassification.
The symbol B occurs about 1.5 times as often as A in the master sequence.
Problem 1 is known as the Shortest Common Supersequence problem. It is NP-hard for more than two input strings, even with only two symbols. Problem 2 is an instance of Multiple Sequence Alignment. There are many algorithms and implementations for it, mostly heuristic since it is also NP-hard in general.

Generating minimal/irreducible Sudokus

A Sudoku puzzle is minimal (also called irreducible) if it has a unique solution, but removing any digit would yield a puzzle with multiple solutions. In other words, every digit is necessary to determine the solution.
I have a basic algorithm to generate minimal Sudokus:
Generate a completed puzzle.
Visit each cell in a random order. For each visited cell:
Tentatively remove its digit
Solve the puzzle twice using a recursive backtracking algorithm. One solver tries the digits 1-9 in forward order, the other in reverse order. In a sense, the solvers are traversing a search tree containing all possible configurations, but from opposite ends. This means that the two solutions will match iff the puzzle has a unique solution.
If the puzzle has a unique solution, remove the digit permanently; otherwise, put it back in.
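In code, that loop looks roughly like the following sketch; solve(grid, order) stands in for the recursive backtracking solver (a hypothetical helper, not shown here) that tries candidate digits in the given order and returns the solved grid:

    import random

    def minimize(grid, solve):
        """grid: 9x9 list of lists holding a completed puzzle (digits 1-9)."""
        cells = [(r, c) for r in range(9) for c in range(9)]
        random.shuffle(cells)                       # visit cells in a random order
        for r, c in cells:
            saved = grid[r][c]
            grid[r][c] = 0                          # tentatively remove the digit
            forward = solve(grid, order=range(1, 10))
            backward = solve(grid, order=range(9, 0, -1))
            if forward != backward:                 # two different solutions: not unique
                grid[r][c] = saved                  # put the digit back
        return grid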
This method is guaranteed to produce a minimal puzzle, but it's quite slow (100 ms on my computer, several seconds on a smartphone). I would like to reduce the number of solves, but all the obvious ways I can think of are incorrect. For example:
Adding digits instead of removing them. The advantage of this is that since minimal puzzles require at least 17 filled digits, the first 17 digits are guaranteed to not have a unique solution, reducing the amount of solving. Unfortunately, because the cells are visited in a random order, many unnecessary digits will be added before the one important digit that "locks down" a unique solution. For instance, if the first 9 cells added are all in the same column, there's a great deal of redundant information there.
If no other digit can replace the current one, keep it in and do not solve the puzzle. Because checking if a placement is legal is thousands of times faster than solving the puzzle twice, this could be a huge time-saver. However, just because there's no other legal digit now doesn't mean there won't be later, once we remove other digits.
Since we generated the original solution, solve only once for each cell and see if it matches the original. This doesn't work because the original solution could be anywhere within the search tree of possible solutions. For example, if the original solution is near the "left" side of the tree, and we start searching from the left, we will miss solutions on the right side of the tree.
I would also like to optimize the solving algorithm itself. The hard part is determining if a solution is unique. I can make micro-optimizations like creating a bitmask of legal placements for each cell, as described in this wonderful post. However, more advanced algorithms like Dancing Links or simulated annealing are not designed to determine uniqueness, but just to find any solution.
How can I optimize my minimal Sudoku generator?
I have an idea on the 2nd option you suggested; it should work better provided you add 3 extra checks for the first 17 numbers:
1. Pick a list of 17 random numbers between 1 and 9.
2. Add each number at a random location, provided the newly added number doesn't fail the 3 basic criteria of Sudoku:
- there is no same number in the same row
- there is no same number in the same column
- there is no same number in the same 3x3 box
3. If this check fails, move to the next column or row and check the 3 basic criteria again.
4. If there is no next row or column (i.e. you are at the 9th column or 9th row), add to the 1st column.
5. Once the 17 digits are filled, run your solver logic on this and look for your unique solution.
Here are the main optimizations I implemented with (highly approximate) percentage increases in speed:
Using bitmasks to keep track of which constraints (row, column, box) are satisfied in each cell. This makes it much faster to look up whether a placement is legal, but slower to make a placement. A complicating factor in generating puzzles with bitmasks, rather than just solving them, is that digits may have to be removed, which means you need to keep track of the three types of constraints as distinct bits. A small further optimization is to save the masks for each digit and each constraint in arrays. (A sketch of this idea appears after this list.) 40%
Timing out the generation and restarting if it takes too long. See here. The optimal strategy is to increase the timeout period after each failed generation, to reduce the chance that it goes on indefinitely. 30%, mainly from reducing the worst-case runtimes.
mbeckish and user295691's suggestions (see the comments to the original post). 25%
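For illustration, a sketch of the bitmask idea from the first item above (my own code, not the poster's): each row, column and box keeps a 9-bit mask of the digits already placed, so a legality check is a few bitwise operations and a removal just clears the corresponding bits:

    class Masks:
        def __init__(self, grid):
            self.rows = [0] * 9
            self.cols = [0] * 9
            self.boxes = [0] * 9
            for r in range(9):
                for c in range(9):
                    if grid[r][c]:
                        self.place(r, c, grid[r][c])

        @staticmethod
        def box(r, c):
            return (r // 3) * 3 + c // 3

        def legal(self, r, c, d):
            bit = 1 << (d - 1)
            used = self.rows[r] | self.cols[c] | self.boxes[self.box(r, c)]
            return not (used & bit)

        def place(self, r, c, d):
            bit = 1 << (d - 1)
            self.rows[r] |= bit
            self.cols[c] |= bit
            self.boxes[self.box(r, c)] |= bit

        def remove(self, r, c, d):
            mask = ~(1 << (d - 1))
            self.rows[r] &= mask
            self.cols[c] &= mask
            self.boxes[self.box(r, c)] &= mask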

How to apply the Levenshtein distance to a set of target strings?

Let TARGET be a set of strings that I expect to be spoken.
Let SOURCE be the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).
I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically returns the distance between a source string and a target string, that is the number of changes needed to transform the source string into the target string.
But, how can I apply this algorithm to a set of target strings?
I thought I'd use the following method:
For each string that belongs to TARGET, I calculate the distance from each string in SOURCE. In this way we obtain an m-by-n matrix, where m is the cardinality of TARGET and n is the cardinality of SOURCE. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target.
Calculating the average of the values on each row, you can obtain the average distance between the i-th target and the output of the speech recognizer. Let's call it average_on_row(i), where i is the row index.
Finally, for each row, I calculate the standard deviation of all values in the row. For each row, I also perform the sum of all the standard deviations. The result is a column vector, in which each element (let's call it standard_deviation_sum(i)) refers to a string of TARGET.
The string associated with the smallest standard_deviation_sum could be the sentence pronounced by the user. Can the method I used be considered correct? Or are there other methods?
Obviously, values that are too high indicate that the sentence pronounced by the user probably does not belong to TARGET.
I'm not an expert, but your proposal does not make sense. First of all, in practice I'd expect the cardinality of TARGET to be very large, if not infinite. Second, I don't believe the Levenshtein distance or a similar similarity metric will be useful.
If:
you could really define the SOURCE and TARGET sets,
all strings in SOURCE were equally probable,
all strings in TARGET were equally probable,
the strings in SOURCE and TARGET consisted not of characters but of phonemes,
then I believe your best bet would be to find the pair p in SOURCE, q in TARGET such that distance(p,q) is minimal. Especially since you cannot guarantee the equal-probability parts, I think you should think about the problem from scratch, do some research and come up with a completely different design. The usual methodology for speech recognition is the use of Hidden Markov Models. I would start from there.
Answer to your comment: Choose whichever is more probable. If you don't consider probabilities, it is hopeless.
[Suppose the following example is on phonemes, not characters]
Suppose the recognized word is "chees" and the target set is {"cheese", "chess"}. You must calculate P(cheese|chees) and P(chess|chees). What I'm trying to say is that not every substitution is equiprobable. If you model probabilities as distances between strings, then at the very least you must allow that, for example, d("c","s") < d("c","q") (it is common to confuse the letters c and s, but not common to confuse c and q). Adapting the distance calculation algorithm is easy; coming up with good values for all pairs is difficult.
You must also somehow estimate P(cheese|context) and P(chess|context). If we are talking about board games, chess is more probable; if we are talking about dairy products, cheese is more probable. This is why you'll need large amounts of data to come up with such estimates, and it is also why Hidden Markov Models are good for this kind of problem.
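To see why a plain distance cannot settle the example above, here is a standard Levenshtein implementation: both targets are exactly one edit away from "chees", so distance alone cannot choose between them, which is where the probability estimates just described come in:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution or match
            prev = cur
        return prev[-1]

    for target in ["cheese", "chess"]:
        print(target, levenshtein("chees", target))  # both print a distance of 1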
You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.
In a "context independent" situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.
In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.
Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).
You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (eg. assume there are k different sets of letters whose substitution rate is different...).
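A sketch of a distance built from those penalties; I negate the log-ratios so that more probable operations cost less (the answer states the penalties as plain logs, so treat the sign convention, the example probabilities and the 26-letter alphabet as my assumptions):

    from math import log

    def weighted_distance(source, target, pi, pd, ps, k):
        pp = 1 - ps - pd                   # probability of reproducing a symbol, as above
        ins = -log(pi / pp / k)            # insertion penalty
        dele = -log(pd / pp)               # deletion penalty
        sub = -log(ps / pp / (k - 1))      # substitution penalty (matches cost 0)
        m, n = len(source), len(target)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + dele
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + ins
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                step = 0.0 if source[i - 1] == target[j - 1] else sub
                d[i][j] = min(d[i - 1][j - 1] + step,
                              d[i - 1][j] + dele,
                              d[i][j - 1] + ins)
        return d[m][n]

    # Made-up probabilities: with these numbers a substitution is cheaper than an
    # insertion, so "chess" comes out closer to "chees" than "cheese" does.
    print(weighted_distance("chees", "cheese", pi=0.02, pd=0.02, ps=0.05, k=26))
    print(weighted_distance("chees", "chess", pi=0.02, pd=0.02, ps=0.05, k=26))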
