Generating random fixed length permutations of a string - algorithm

Let's say my alphabet contains X letters and my language supports only Y-letter words (Y < X, of course). I need to generate all the possible words in random order.
E.g.
Alphabet=a,b,c,d,e,f,g
Y=3
So the words would be:
aaa
aab
aac
aba
..
bbb
ccc
..
(the above should be generated in random order)
The trivial way to do it would be to generate the words and then shuffle the list. I do NOT want to do that. I want to generate the words in random order.
random(n) = letter[x].random(n-1) will not work, because then you'll get a run of words all starting with letter[x], which makes the list not so random.
Any code/pseudocode appreciated.

As other answers have implied, there are two main approaches: 1) track what you've already generated (the proposed solutions in this category suffer from possibly never terminating), or 2) track which permutations have yet to be produced (which implies that the permutations must be pre-generated, something specifically disallowed in the requirements). Here's another solution that is guaranteed to terminate and does not require pre-generation, but may not meet your randomization requirements (which are vague at this point).
General overview: generate a tree to track what's been generated or what's remaining. "select" new permutations by traversing random links in the tree, pruning the tree at the leafs after generation of that permutation to prevent it from being generated again.
Without a whiteboard to diagram this, I hope this description is good enough to describe what I mean: Create a "node" that has links to other nodes for every letter in the alphabet. This could be implemented using a generic map of alphabet letters to nodes or if your alphabet is fixed, you could create specific references. The node represents the available letters in the alphabet that can be "produced" next for generating a permutation. Start generating permutations by visiting the root node, selecting a random letter from the available letters in that node, then traversing that reference to the next node. With each traversal, a letter is produced for the permutation. When a leaf is reached (i.e. a permutation is fully constructed), you'd backtrack up the tree to see if the parent nodes have any available permutations remaining; if not, the parent node can be pruned.
As an implementation detail, each node could store either the set of letters that can still be produced at that point or the set that can no longer be produced. To reduce storage, you could let a node hold whichever set is smaller, with a flag indicating which representation it is using: while more than half the alphabet is still available from that node, store the letters already produced, and switch to storing the letters remaining once fewer than half are available.
Using such a tree structure limits what can be produced without having to pre-generate all combinations since you don't need to pre-construct the entire tree (it can be constructed as the permutations are generated) and you're guaranteed to complete because of the purging of the nodes (i.e. you're only traversing links to nodes when that's an allowed combination for an unproduced permutation).
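Here is a rough Python sketch of that idea (random_words and new_node are my own names, not from the question); it builds the tree lazily and prunes exhausted branches, so every word comes out exactly once and the loop always terminates:

import random

def random_words(alphabet, length):
    # Lazily build a tree of remaining choices; prune exhausted branches after each word,
    # so every possible word is produced exactly once and the generator always terminates.
    def new_node():
        return {"remaining": list(alphabet), "children": {}}
    root = new_node()
    for _ in range(len(alphabet) ** length):
        node, path, visited = root, [], [root]
        for depth in range(length):
            letter = random.choice(node["remaining"])
            path.append(letter)
            if depth < length - 1:
                node = node["children"].setdefault(letter, new_node())
                visited.append(node)
            else:
                node["remaining"].remove(letter)      # this final letter is used up here
        for depth in range(length - 1, 0, -1):        # backtrack and prune exhausted subtrees
            if not visited[depth]["remaining"]:
                parent = visited[depth - 1]
                parent["children"].pop(path[depth - 1], None)
                parent["remaining"].remove(path[depth - 1])
        yield "".join(path)

For the question's example, list(random_words("abcdefg", 3)) produces all 343 three-letter words in a randomized (though, as noted below, not necessarily uniformly random) order.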
I believe the randomization of the technique is a little odd, however, and I don't think each combination is equally likely to be generated at any given time, though I haven't really thought through this. It's also probably worth noting that even though the full tree isn't necessarily generated up front, the overhead involved will likely be enough such that you may be better off pre-generating all permutations.

I think you can do something pretty straightforward by generating a random array of characters based on the alphabet you have (in C#):
char[] alphabet = {'a', 'b', 'c', 'd'};
int wordLength = 3;
Random rand = new Random();
for (int i = 0; i < 5; i++)
{
    char[] word = new char[wordLength];
    for (int j = 0; j < wordLength; j++)
    {
        word[j] = alphabet[rand.Next(alphabet.Length)];
    }
    Console.WriteLine(new string(word));
}
Obviously this might generate duplicates, but you could store the results in a hash set (e.g. HashSet&lt;string&gt;) to check for duplicates if you need to.

So I take it what you want is to produce a permutation of the set using as little memory as possible.
First off, it can't be done using no memory. For your first string, you want a function that could produce any of the strings with equal likelihood. Say that function is called nextString(). If you call nextString() again without changing anything in the state, of course it will once again be able to produce any of the strings.
So you need to store something. The question is, what do you need to store, and how much space will it take?
The strings can be seen as numbers 0 to X^Y - 1 (aaa=0, aab=1, aac=2, ..., aba=X, ...). So to store a single string as efficiently as possible, you'd need lg(X^Y) bits. Let's say X = 16 and Y = 2. Then you'd need 1 byte of storage to uniquely specify a string.
Of course the most naive algorithm is to mark each string as it is produced, which takes X^Y bits, which in my example is 256 bits (32 bytes). This is what you've said you don't want to do. You can use a shuffle algorithm as discussed in this question: Creating a random ordered list from an ordered list (you won't need to store the strings as you produce them through the shuffle algorithm, but you still need to mark them).
Ok, now the question is, can we do better than that? How much do we need to store, total?
Well, on the first call, we don't need any storage. On the second call, we need to know which one was produced before. On the last call, we only need to know which one is the last one left. So the worst case is when we're halfway through. When we're halfway through, there have been 128 strings produced, and there are 128 to go. We need to know which are left to produce. Assuming the process is truly random, any split is possible. There are (256 choose 128) possibilities. In order to potentially be able to store any of these, we need lg(256 choose 128) bits, which according to google calculator is 251.67. So if you were really clever you could squeeze the information into 4 fewer bits than the naive algorithm. Probably not worth it.
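If you want to check those numbers, here is a quick sanity check in Python for the same X = 16, Y = 2 example:

import math

total = 16 ** 2                                  # X = 16, Y = 2 -> 256 strings
naive_bits = total                               # one "already produced" flag per string
halfway_bits = math.log2(math.comb(total, total // 2))
print(naive_bits, round(halfway_bits, 2))        # 256 vs. 251.67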
If you just want it to look randomish with very little storage, see this question: Looking for an algorithm to spit out a sequence of numbers in a (pseudo) random order

Related

Repeated DNA sequence

The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach that uses a rolling hash function: for each sequence of length k, a hash is computed and stored in a map. To check whether the current sequence is a repetition, we compute its hash and check whether the hash already exists in the hash map. If yes, we include this sequence in our result; otherwise we add it to the hash map.
Rolling hash here means that when moving to the next sequence by sliding the window by one, we reuse the hash of the previous sequence: we remove the contribution of its first character and add the contribution of the newly added character, i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't work out how to build a perfect hash function so that collisions are avoided. It would be a great help if somebody could explain how to make a perfect hash in general and, most importantly, in this case.
This is actually a research problem.
Let's come to terms with some facts:
Input = N, input length = |N|
You have to move a sliding window of size k (here k = 10) over the input, so you must live with O(|N|) or more.
Your rolling hash is a form of locality-sensitive deterministic hashing; the downside of deterministic hashing is that its benefit diminishes as you encounter more and more similar strings, which become harder to hash without collisions.
The longer your input, the less effective hashing will be.
Given these facts, "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
So what alternatives do you have?
Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes give false positives, but this can be mitigated by using several filters.
Cuckoo hashing: similar to Bloom filters, but it uses less memory, allows locality-sensitive "hashing", and gives worst-case constant lookup time.
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has at least 2 children, with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward but memory consumption is less.
My favorite: the FM-index. In my opinion the cleanest solution uses the Burrows-Wheeler Transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence into an integer by bit manipulation.
If your input k is relatively small, say around 10, you can encode your DNA sequence in an int via bit manipulation. Since each character in the sequence has only 4 possibilities, A, C, G, T, you can make your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string with 10 characters as the hash key. Instead, you only need 20 bits of an integer to represent the previous key string.
Then when you do your rolling hash, you left-shift the integer that stores your previous sequence by 2 bits, then use a bit operation like |= to set the last two bits to your new character's code. And remember to clear the 2 bits that were shifted out of the 2k-bit window, i.e. remove the contribution of the character that just left the sliding window.
By doing this, a string can be stored in an integer, and using that integer as the hash key is nicer and cheaper in terms of the complexity of the hash function computation. If k is slightly larger than 16, you can use a 64-bit long value instead. Otherwise, you might need a bitset or a bit array, but hashing those becomes another issue.
Therefore, I'd say this solution is a nice attempt at this problem when the sequence length is relatively small, i.e. when it can be stored in a single integer or long.
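A small Python sketch of this 2-bit rolling encoding (the function name repeated_kmers is mine); since the 2-bit code is exact for a fixed k, there are no collisions to worry about:

def repeated_kmers(dna, k=10):
    # 2 bits per base; for a fixed k the encoding is exact, so there are no collisions
    code = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
    mask = (1 << (2 * k)) - 1                     # keep only the last k bases
    seen, repeated = set(), set()
    h = 0
    for i, base in enumerate(dna):
        h = ((h << 2) | code[base]) & mask        # slide the window by one base
        if i >= k - 1:
            if h in seen:
                repeated.add(dna[i - k + 1:i + 1])
            else:
                seen.add(h)
    return repeated

# repeated_kmers("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT") -> {'AAAAACCCCC', 'CCCCCAAAAA'}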
You can build the suffix array and the LCP array. Iterate through the LCP array; every time you see a value greater than or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater than or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both the suffix array and the LCP array can be done in linear time, so overall the solution is linear with respect to the size of the input plus output.
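For illustration, here is a short Python sketch of this idea; it builds the suffix array by plain sorting and computes LCPs naively (roughly O(n^2 log n)), not with the linear-time constructions (e.g. SA-IS plus Kasai's algorithm) this answer refers to, so treat it as a readability aid rather than the real thing:

def repeated_kmers_sa(s, k):
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])    # suffix array by plain sorting
    def lcp(i, j):                                # longest common prefix of two suffixes
        length = 0
        while i + length < n and j + length < n and s[i + length] == s[j + length]:
            length += 1
        return length
    result, in_run = [], False
    for a, b in zip(sa, sa[1:]):                  # LCPs of adjacent suffixes in the array
        if lcp(a, b) >= k:
            if not in_run:                        # report each run of values >= k once
                result.append(s[b:b + k])
            in_run = True
        else:
            in_run = False
    return result

# repeated_kmers_sa("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10) -> ['AAAAACCCCC', 'CCCCCAAAAA']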
What you could do is use the Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT says that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have the three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size roughly 10^18. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all; as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than for a collision to happen, since you can make the collision probability as small as you like.

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and records of where these people have been, exactly M of these records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, obviously, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. persons A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurring pairs?
So far I've built an N by N table and looped through each row, incrementing table(A,B) every time A co-occurs with B in a row. Obviously this is an awful approach and takes O(N^2) to O(N^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
answer = []
for S in sets:
    for (i, j) in pairs from S:
        count[(i,j)]++
        if threshold == count[(i,j)]:
            answer.append((i,j))
If you have M sets of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
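A runnable Python version of that pseudocode (assuming each record is simply a list of person IDs, as in the question's example):

from collections import defaultdict
from itertools import combinations

def frequent_pairs(records, threshold):
    count = defaultdict(int)
    answer = []
    for record in records:
        # canonical (sorted) pairs; set() ignores a person listed twice in one row
        for pair in combinations(sorted(set(record)), 2):
            count[pair] += 1
            if count[pair] == threshold:
                answer.append(pair)
    return answer

# frequent_pairs([[1, 50, 299], [1, 2, 3, 4, 5, 50, 287], [1, 50, 299]], 3) -> [(1, 50)]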
The relevant concept for your case is Market Basket Analysis. In this context there are several algorithms; for example, the Apriori algorithm can be applied to your case, restricted to itemsets of size 2.
Moreover, for finding association rules with a specific support threshold (which in your case is the threshold value), LSH and MinHash can be used as well.
You could use probability to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed-up. Then double-check any pairs whose sampled count comes close enough to 1/50th of the threshold.
To double-check any pairs, you can either go through the whole list again, or you could double-check more efficiently if you build some clever kind of reverse index as you go. E.g. encode each person's row indices into 64-bit integers; you could use binary search / merge sort type techniques to see which 64-bit integers to compare, and use bit operations to compare 64-bit integers for matches. Other things to look up could be reverse indexing and binary indexed trees / Fenwick trees.

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character, and a dictionary of valid words (say, a normal English dictionary). You have to form a word out of the character list such that the score is maximal and the word is valid.
I could think of a solution involving a trie built from the dictionary and backtracking over the available characters, but could not formulate it properly. Does anyone know the correct approach, or can you come up with one?
First iterate over your letters and count how many times each character of the English alphabet occurs. Store this in an array of size 26, where the first cell corresponds to a, the second to b, and so on. Name this array cnt. Now iterate over all words and for each word form a similar array of size 26. For each cell in this array, check that cnt has at least as many occurrences. If that is the case you can write the word, otherwise you can't. If you can write the word, compute its score and keep the maximum in a helper variable.
This approach will have linear complexity and this is also the best asymptotic complexity you can possibly have(after all the input you're given is of linear size).
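A minimal Python sketch of that counting check, using collections.Counter in place of the fixed 26-cell array (best_word is my own name, and the letter scores are assumed to be given):

from collections import Counter

def best_word(letters, dictionary, score):
    # 'score' maps each letter to its point value (assumed to be given)
    have = Counter(letters)
    best, best_points = None, -1
    for word in dictionary:
        need = Counter(word)
        if all(have[c] >= n for c, n in need.items()):   # enough copies of every letter?
            points = sum(score[c] for c in word)
            if points > best_points:
                best, best_points = word, points
    return best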
Inspired by Programmer Person's answer (initially I thought that approach was O(n!), so I discarded it). It needs O(number of words) setup and then O(2^(chars in query)) for each query. This is exponential, but in Scrabble you only have 7 letter tiles at a time, so you only need to check 128 possibilities!
The first observation is that the order of characters in the query or word doesn't matter, so you want to process your list into a set of bags of chars. A way to do that is to 'sort' the word, so "bac" and "cab" both become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product

# requires the NLTK word corpus: nltk.download('words')
scores = {c: s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)): w for w in nltk.corpus.words.words()}

while True:
    query = input("What do you have?")
    if not query:
        break
    # make it look like our preprocessed word list
    query = anagram(sanitize(query))
    results = {}
    # all variants for our query
    for mask in product((True, False), repeat=len(query)):
        # get the variant given the mask
        masked = "".join(c for i, c in enumerate(query) if mask[i])
        # check if it's valid
        if masked in anagrams:
            # score it, also getting the word back would be nice
            results[sum(scores[c] for c in masked)] = anagrams[masked]
    if results:
        print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat, you represent it as aet. If the word is tea, you also represent it as aet; bubble is represented as bbbelu, etc.
Since this is Scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to check at most 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles and look the result up in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if they are needed twice. You'll still have to check whether a word can actually be formed: for a word like "bubble", the first test only tells you that you have the letters b, u, l, e, but not necessarily 3 b's.
If you also sort the list of words by point value before doing the check, the first hit is the best one. This has another advantage: you can count the points that you have and not bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (keep a small table with the index of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
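A small Python sketch of the 32-bit mask filter described above (word_mask and candidates are my own names; it assumes lowercase a-z words, and survivors still need the exact letter-count check the answer mentions):

def word_mask(word, common="etaoin"):
    # 26 bits for "letter used at least once" + 6 bits for "common letter used twice"
    mask = 0
    for c in set(word):
        mask |= 1 << (ord(c) - ord('a'))
    for i, c in enumerate(common):
        if word.count(c) >= 2:
            mask |= 1 << (26 + i)
    return mask

def candidates(rack, dictionary):
    rack_mask = word_mask(rack)
    # keep words whose required bits are all available; survivors still need the
    # exact letter-count check described above (e.g. three b's for "bubble")
    return [w for w in dictionary if (word_mask(w) & ~rack_mask) == 0]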
Here is a brute force approach in Python, using an English dictionary containing 58,109 words. This approach is actually quite fast, timing in at about 0.3 seconds on each run.
from random import shuffle
from string import ascii_lowercase

def getValue(word):
    return sum(key[x] for x in word)

if __name__ == '__main__':
    v = list(range(26))
    shuffle(v)
    key = dict(zip(ascii_lowercase, v))
    with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
        wordDict = f.read().splitlines()
    valued = [(getValue(x), x) for x in wordDict]
    print(max(valued))
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed and the scores are fixed, and that only the available letters change (as in Scrabble)? Otherwise, I think there is nothing better than looking up each word of the dictionary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters, for instance Q > Z > J > X > K > ... > A > E > I > ... > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D', where the children of a node are ordered by the value of their letters (so the first child of the root would be Q, the second Z, ..., the last one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth-first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node value is larger than your current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, any branch that starts with a one-point letter such as A is discarded, because it can score at most 8x1, which is less than the value of Z). I bet that you will explore only very few branches each time.
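Here is a rough Python sketch of that trie with the per-node maximum used for pruning (build_trie, best_play and letter_score are my own names; the letter scores are assumed to be given, and ties in letter value are broken alphabetically):

from collections import Counter

def build_trie(dictionary, letter_score):
    # trie over value-ordered anagrams; each node remembers the best word value below it
    order = lambda w: sorted(w, key=lambda c: (-letter_score[c], c))
    root = {"best": 0, "words": [], "children": {}}
    for word in dictionary:
        value = sum(letter_score[c] for c in word)
        node = root
        node["best"] = max(node["best"], value)
        for c in order(word):
            node = node["children"].setdefault(c, {"best": 0, "words": [], "children": {}})
            node["best"] = max(node["best"], value)
        node["words"].append((value, word))
    return root

def best_play(trie, rack, letter_score):
    best = [0, None]                               # best value found so far, and its word
    have = Counter(rack)
    def dfs(node):
        for value, word in node["words"]:
            if value > best[0]:
                best[:] = [value, word]
        # children in decreasing letter value; skip branches that cannot beat the best so far
        for c, child in sorted(node["children"].items(), key=lambda kv: -letter_score[kv[0]]):
            if have[c] and child["best"] > best[0]:
                have[c] -= 1
                dfs(child)
                have[c] += 1
    dfs(trie)
    return best[1]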

Algorithm for finding basis of a set of bitstrings?

This is for a diff utility I'm writing in C++.
I have a list of n character-sets {"a", "abc", "abcde", "bcd", "de"} (taken from an alphabet of k=5 different letters). I need a way to observe that the entire list can be constructed by disjunctions of the character-sets {"a", "bc", "d", "e"}. That is, "b" and "c" are linearly dependent, and every other pair of letters is independent.
In the bit-twiddling version, the character-sets above are represented as {10000, 11100, 11111, 01110, 00011}, and I need a way to observe that they can all be constructed by ORing together bitstrings from the smaller set {10000, 01100, 00010, 00001}.
In other words, I believe I'm looking for a "discrete basis" of a set of n different bit-vectors in {0,1}^k. This paper claims the general problem is NP-complete... but luckily I'm only looking for a solution to small cases (k < 32).
I can think of really stupid algorithms for generating the basis. For example: For each of the k^2 pairs of letters, try to demonstrate (by an O(n) search) that they're dependent. But I really feel like there's an efficient bit-twiddling algorithm that I just haven't stumbled upon yet. Does anyone know it?
EDIT: I ended up not really needing a solution to this problem after all. But I'd still like to know if there is a simple bit-twiddling solution.
I'm thinking of a disjoint-set data structure, like union-find turned on its head (rather than combining nodes, we split them).
Algorithm:
Create an array main where you assign all the positions to the same group, then:
for each bitstring curr
    for each position i
        if (curr[i] == 1)
            // max of main can be stored for constant time access
            main[i] += max of main from previous iteration
Then all the distinct numbers in main are your different sets (possibly using the actual union-find algorithm).
Example:
So, main = 22222. (I won't use 1 as groups to reduce possible confusion, as curr uses bitstrings).
curr = 10000
main = 42222 // first bit (=2) += max (=2)
curr = 11100
main = 86622 // first 3 bits (=422) += max (=4)
curr = 11111
main = 16-14-14-10-10
curr = 01110
main = 16-30-30-26-10
curr = 00011
main = 16-30-30-56-40
Then split by distinct numbers:
{10000, 01100, 00010, 00001}
Improvement:
To reduce the speed at which main increases, we can replace
main[i] += max of main from previous iteration
with
main[i] += 1 + (max - min) of main from previous iteration
EDIT: improvement based on j_random_hacker's comment.
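A compact Python version of the basic (non-improved) scheme; following the answer's own caveat, it simply treats equal final labels as the same group rather than running a real union-find:

def split_groups(bitstrings):
    # bitstrings are equal-length strings of '0'/'1'; returns the basis as bitstrings
    k = len(bitstrings[0])
    main = [1] * k                                 # every position starts in the same group
    for curr in bitstrings:
        prev_max = max(main)                       # max of main from the previous iteration
        for i, bit in enumerate(curr):
            if bit == '1':
                main[i] += prev_max
    groups = {}
    for i, label in enumerate(main):               # equal labels -> same group
        groups.setdefault(label, set()).add(i)
    return ["".join('1' if i in g else '0' for i in range(k)) for g in groups.values()]

# split_groups(["10000", "11100", "11111", "01110", "00011"])
# -> ['10000', '01100', '00010', '00001']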
You could combine the passes of the stupid algorithm at the cost of space.
Make a bit vector called violations that is (k - 1) k / 2 bits long (so, 496 for k = 32.) Take a single pass over character sets. For each, and for each pair of letters, look for violations (i.e. XOR the bits for those letters, OR the result into the corresponding position in violations.) When you're done, negate and read off what's left.
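Here is a simple Python sketch of that single pass (it records violated pairs in a set rather than packing them into a literal (k-1)k/2-bit vector, which is the answer's space optimization):

from itertools import combinations

def dependent_pairs(charsets):
    # charsets: equal-length '0'/'1' strings, one bit per letter
    k = len(charsets[0])
    violated = set()
    for s in charsets:
        for i, j in combinations(range(k), 2):
            if s[i] != s[j]:                       # letters i and j disagree in this set
                violated.add((i, j))
    # pairs that were never violated always co-occur, i.e. they are "dependent"
    return [p for p in combinations(range(k), 2) if p not in violated]

# dependent_pairs(["10000", "11100", "11111", "01110", "00011"]) -> [(1, 2)]  (b and c)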
You could give Principal Component Analysis a try. There are some flavors of PCA designed for binary or more generally for categorical data.
Since someone showed it to be NP-complete, for large vocabularies I doubt you will do better than a brute-force search (with various prunings possible) of the entire set of possibilities, O((2^k - 1) * n), at least in a worst-case scenario; probably some heuristics will help in many cases, as outlined in the paper you linked. This is your "stupid" approach generalized to all possible basis strings instead of just pairs of letters.
However, for small vocabs, I think an approach like this would do a lot better:
1. Are your words disjoint? If so, you are done (the simple case of independent words like "abc" and "def").
2. Perform a bitwise AND on each possible pair of words. This gives you an initial set of candidate basis strings.
3. Go to step 1, but instead of using the original words, use the current candidate basis strings.
Afterwards you also need to include any individual letter which is not a subset of one of the final accepted candidates. Maybe some other minor bookkeeping for things like unused letters (using something like a bitwise OR over all possible words).
Considering your simple example:
First pass gives you a, abc, bc, bcd, de, d
Second pass gives you a, bc, d
Bookkeeping gives you a, bc, d, e
I don't have a proof that this is right but I think intuitively it is at least in the right direction. The advantage lies in using the words instead of the brute force's approach of using possible candidates. With a large enough set of words, this approach would become terrible, but for vocabularies up to say a few hundred or maybe even a few thousand I bet it would be pretty quick. The nice thing is that it will still work even for a huge value of k.
If you like the answer and bounty it, I'd be happy to try to solve it in 20 lines of code :) and come up with a more convincing proof. Seems very doable to me.
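For what it's worth, here is a rough Python sketch of the repeated pairwise-AND passes plus the bookkeeping step described above (basis_candidates is my own name; words are taken as k-bit integers, and the loop simply stops when the candidate set stops changing):

def basis_candidates(words, k):
    # words are k-bit integers; repeat the pairwise-AND pass until nothing changes
    cur = {w for w in words if w}
    seen = set()
    while frozenset(cur) not in seen:
        seen.add(frozenset(cur))
        lst = sorted(cur)
        nxt = {lst[i] & lst[j]
               for i in range(len(lst)) for j in range(i + 1, len(lst))
               if lst[i] & lst[j]}
        if not nxt:
            break
        cur = nxt
    # bookkeeping: any used letter absent from every candidate becomes its own basis element
    used = 0
    for w in words:
        used |= w
    covered = 0
    for v in cur:
        covered |= v
    for bit in range(k):
        if (used & (1 << bit)) and not (covered & (1 << bit)):
            cur.add(1 << bit)
    return cur

# with bit order abcde: basis_candidates([0b10000, 0b11100, 0b11111, 0b01110, 0b00011], 5)
# -> {16, 12, 2, 1}  (0b10000, 0b01100, 0b00010, 0b00001 = "a", "bc", "d", "e")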

Generate sequence of integers in random order without constructing the whole list upfront [duplicate]

How can I generate the list of integers from 1 to N but in a random order, without ever constructing the whole list in memory?
(To be clear: Each number in the generated list must only appear once, so it must be the equivalent to creating the whole list in memory first, then shuffling.)
This has been determined to be a duplicate of this question.
A very simple generator: 1 + ((power(r,x) - 1) mod p) takes each value from 1 to p-1 exactly once, in a scrambled order, as x runs from 1 to p-1, where p is a prime and r is a primitive root modulo p.
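A quick illustration in Python with the (hypothetical) small parameters p = 7 and r = 3, where 3 is a primitive root mod 7:

# small illustrative check: every value 1..6 appears exactly once
p, r = 7, 3
print([1 + (pow(r, x, p) - 1) % p for x in range(1, p)])    # [3, 2, 6, 4, 5, 1]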
Not the whole list technically, but you could use a bit mask to decide if a number has already been selected. This has a lot less storage than the number list itself.
Set all N bits to 0, then for each desired number:
use one of the normal linear congruent methods to select a number from 1 to N.
if that number has already been used, find the next highest unused (0 bit), with wrap.
set that number's bit to 1 and return it.
That way you're guaranteed only one use per number and relatively random results.
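A Python sketch of that scheme (random.randint stands in for the linear congruential generator the answer mentions):

import random

def random_order(n):
    # one flag per number; pick randomly, fall forward to the next unused value on collision
    used = [False] * (n + 1)                       # numbers are 1..n, index 0 unused
    for _ in range(n):
        x = random.randint(1, n)                   # stand-in for a linear congruential generator
        while used[x]:
            x = x % n + 1                          # next highest unused, with wrap
        used[x] = True
        yield x

# list(random_order(10)) -> a permutation of 1..10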
It might help to specify a language you are searching a solution for.
You could use a dynamic list to store the numbers you have generated, since you need a record of which numbers you have already created. Every time you create a new number, check whether it is already in the list; if it is, throw it away and try again.
The only way without such a list would be to use a number space so large that a duplicate is unlikely, like a UUID, if the algorithm is working correctly - but this doesn't guarantee that no duplicate is generated; it is just highly unlikely.
You will need at least half of the total list's memory, just to remember what you did already.
If you are in tough memory conditions, you may try so:
Keep the results generated so far in a tree: generate a random number and insert it into the tree. If you cannot insert it (it is already there), generate another number and try again, and so on, until the tree is half full.
When the tree is half full, you invert it: construct a tree holding the numbers that you haven't used yet, then pick from it in random order.
It has some overhead for keeping the tree structure, but it may help when your pointers are considerably smaller in size than your data is.

Resources