Finding long repeated substrings in a massive string - performance

I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I'm looking for.
I have a really really long string (hundreds of megabytes). I have about 1 GB of RAM.
This is why building a suffix trie with counting data is too inefficient space-wise to work for me. To quote Wikipedia's Suffix tree:
storing a string's suffix tree typically requires significantly more space than storing the string itself.
The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.
And that was Wikipedia's comment on the tree, not the trie.
How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?
(Some wikipedia links to avoid people posting them as the 'answer': Algorithms on strings and especially Longest repeated substring problem ;-) )

The effective way to do this is to create an index of the substrings and sort them. This is an O(n lg n) operation.
BWT compression does this step, so it's a well-understood problem, and there are radix-sort and suffix-sort implementations (some claiming O(n)) to make it as efficient as possible. It still takes a long time, perhaps several seconds for large texts.
If you want to use utility code, C++ std::stable_sort() performs much better than std::sort() for natural language (and much faster than C's qsort(), but for different reasons).
Then visiting each item to see the length of its common substring with its neighbours is O(n).
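For illustration, here is a minimal in-memory Python sketch of that pipeline (sort the suffix start indices, then compare each suffix with its neighbour); the function name and the toy input are mine, and it deliberately ignores the memory constraints from the question:

def longest_repeated_substring(data):
    # Sort suffixes by content via their start indices. Building the key slices
    # costs O(n) memory each, so this is only a sketch of the idea, not a
    # production suffix array.
    n = len(data)
    suffixes = sorted(range(n), key=lambda i: data[i:])
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # length of the common prefix shared by two neighbouring suffixes
        k = 0
        while a + k < n and b + k < n and data[a + k] == data[b + k]:
            k += 1
        if k > len(best):
            best = data[a:a + k]
    return best

print(longest_repeated_substring("banana"))  # -> "ana"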

You could look at disk-based suffix trees. I found this Suffix tree implementation library through Google, plus a bunch of articles that could help implementing it yourself.

You could solve this using divide and conquer. I think this should be the same algorithmic complexity as using a trie, but maybe less efficient implementation-wise
void LongSubstrings(string data, string prefix, IEnumerable<int> positions)
{
    var buffers = new Dictionary<char, DiskBackedBuffer>();
    foreach (int position in positions)
    {
        if (position >= data.Length)
            continue; // this suffix has been fully consumed
        char nextChar = data[position];
        if (!buffers.ContainsKey(nextChar))
            buffers[nextChar] = new DiskBackedBuffer();
        buffers[nextChar].Add(position + 1);
    }

    foreach (char c in buffers.Keys)
    {
        if (buffers[c].Count > 1)
            LongSubstrings(data, prefix + c, buffers[c]);
        else if (buffers[c].Count == 1)
            Console.WriteLine("Unique sequence: {0}", prefix + c);
    }
}

void LongSubstrings(string data)
{
    LongSubstrings(data, "", Enumerable.Range(0, data.Length));
}
After this, you would need to make a class that implemented DiskBackedBuffer such that it was a list of numbers, and when the buffer got to a certain size, it would write itself out to disk using a temporary file, and recall from disk when read from.

Answering my own question:
Given that a long match is also a short match, you can trade multiple passes for RAM by first finding shorter matches and then seeing if you can 'grow' these matches.
The literal approach to this is to build a trie (with counts in each node) of all sequences of some fixed length in the data. You then cull all the nodes that don't match your criteria (e.g. the longest match). Then do a subsequent pass through the data, building the trie out deeper, but not broader. Repeat until you've found the longest repeated sequence(s).
A good friend suggested hashing. By hashing the fixed-length character sequence starting at each character, you now have the issue of finding duplicate hash values (and verifying the duplication, as hashing is lossy). If you allocate an array the length of the data to hold the hash values, you can do interesting things, e.g. to see if a match is longer than your fixed-length pass of the data, you can just compare the sequences of hashes rather than regenerating them. Etc.
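As a rough illustration of the hashing idea, here is a Python sketch (the names and parameters are mine) that buckets the start positions of every fixed-length window by a rolling hash and then verifies each bucket, since hashing is lossy:

def repeated_blocks(data, k, base=257, mod=(1 << 61) - 1):
    # Map each k-character window to its start positions via a rolling hash,
    # then return {substring: [positions]} for substrings occurring more than once.
    n = len(data)
    if n < k:
        return {}
    lead = pow(base, k - 1, mod)          # weight of the window's first character
    h = 0
    for ch in data[:k]:                   # hash of the first window
        h = (h * base + ord(ch)) % mod
    buckets = {h: [0]}
    for i in range(1, n - k + 1):
        # slide the window: drop data[i-1], append data[i+k-1]
        h = ((h - ord(data[i - 1]) * lead) * base + ord(data[i + k - 1])) % mod
        buckets.setdefault(h, []).append(i)
    repeats = {}
    for positions in buckets.values():    # verify buckets to weed out collisions
        if len(positions) < 2:
            continue
        by_text = {}
        for p in positions:
            by_text.setdefault(data[p:p + k], []).append(p)
        for text, pos in by_text.items():
            if len(pos) > 1:
                repeats[text] = pos
    return repeats

# repeated_blocks("mississippi", 3) finds 'iss' at [1, 4] and 'ssi' at [2, 5]

Positions that share a window can then be "grown" to the left and right to recover the full-length matches, as described above.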

What about a simple program like this:
S = "ABAABBCCAAABBCCM"
def findRepeat(S):
n = len(S)
#find the maxim lenth of repeated string first
msn = int(floor(n/2))
#start with maximum length
for i in range(msn,1,-1):
substr = findFixedRepeat(S, i)
if substr:
return substr
print 'No repeated string'
return 0
def findFixedRepeat(str, n):
l = len(str)
i = 0
while ((i + n -1) < l):
ss = S[i:i+n]
bb = S[i+n:]
try:
ff = bb.index(ss)
except:
ff = -1
if ff >= 0:
return ss;
i = i+1
return 0
print findRepeat(S)

Is this text with word breaks? Then I'd suspect you want a variation of keyword-in-context: make a copy of each line n times for the n words in a line, breaking each line at each word; sort the whole thing alphabetically; look for repeats.
If it's a single long honking string, like say bioinformatic DNA sequences, then you want to build something like your trie on disk; build a record for each character with a disk offset for the next-nodes. I'd have a look at Volume 3 of Knuth, section 5.4, "external sorting".

Can you solve your problem by building a suffix array instead? Otherwise you'll likely need to use one of the disk-based suffix trees mentioned in the other answers.

Just a belated thought that occurred to me...
Depending on your OS/environment (e.g. 64-bit pointers and mmap() being available), you might be able to create a very large suffix tree on disk through mmap(), and then keep a cached, most-frequently-accessed subset of that tree in memory.

The easiest way might just be to plunk down the $100 for a bunch more RAM. Otherwise, you'll likely have to look at disk backed structures for holding your suffix tree.

Related

How to find the correct "word part" records that make up an input "word string", given a word part dataset?

In agglutinative languages, "words" is a fuzzy concept. Examples include Turkish, Inuktitut, and many Native American languages (amongst others). In them, "words" are often/usually composed of a "base" and multiple prefixes/suffixes. So you might have ama-ebi-na-mo-kay-i-mang-na (I just made that up), where ebi is the base and the rest are affixes. Let's say this means "walking early in the morning when the birds start singing": ama/early ebi/walk na/-ing mo/during kay/bird i/plural mang/sing na/-ing. These words can get quite long, like 30+ "letters".
So I was playing around with creating a "dictionary" for a language like this, but it's not realistic to write definitions or "dictionary entries" as your typical English "words", because there are a possibly infinite number of words! (All combinations of prefixes/bases/suffixes). So instead, I was trying to think maybe you could have just these "word parts" in the database (prefixes/suffixes/bases, which can't stand by themselves actually in the real spoken language, but are clearly distinct in terms of adding meaning). By having a database of word parts, you would then (in theory) query by passing as input a long say 20-character "word", and it would figure out how to break this word down into word parts because of the database (somehow).
That is, it would take amaebinamokayimangna as input, and know that it can be broken down into ama-ebi-na-mo-kay-i-mang-na, and then it simply queries the database for those parts to return whatever metadata is associated with those parts.
What would you need to do to accomplish this, basically, at a high level? Assuming you had a database (SQL or just a text file) containing these affixes and bases, how could you take the input and know that it breaks down into these parts organized in this way? Maybe it turns out there are other parts in the DB which can be arranged like a-ma-e-bina-mo-kay-im-ang-na, which is spelled the exact same way (if you remove the hyphens), so it would likely find that as a result too, and return it as another possible match.
The only way (a naive way) I can think of solving this currently is to break the input string into ngrams like this:
function getNgrams(str, { min = 1, max = 8 } = {}) {
  const ngrams = []
  const points = Array.from(str)
  const n = points.length
  let minSize = min
  while (minSize <= max) {
    for (let i = 0; i < (n - minSize + 1); i++) {
      const ngram = points.slice(i, i + minSize)
      ngrams.push(ngram.join(''))
    }
    minSize++
  }
  return ngrams
}
And it would then check the database to see if any of those ngrams exist, maybe passing in whether this is a prefix (start of word), infix, or suffix (end of word) part. The database parts table would have { id, text, is_start, is_end } sort of thing. But this would be horribly inefficient and probably wouldn't work. Figuring out how to go about solving this seems really complex.
So wondering, how would you solve this? At a high level, what is the main vision you see of how you would tackle this, either in a SQL database or some other approach?
The goal is, save to some persisted area the word parts, and how they are combined (if they are a prefix/infix/suffix), and then take as input a string which could be generated from those parts, and try and figure out what the parts are from the persisted data, and then return those parts in the correct order.
First consider the simplified problem where we have a combination of prefixes only. To be able to split this into prefixes, we would do:
1. Store all the prefixes in a trie.
2. Let's say the input has n characters. Create an array of length n (of numbers, if you need just one possible split, or of sets of numbers, if you need all possible splits). We will store in this array, for each index, from which positions of the input string that index can be reached by adding a prefix from the dictionary.
3. For each substring starting with the 1st character of the input, if it belongs to the trie, mark its end index as reachable from the 0th position (i.e. there is a path from the 0th position to the k-th position). The trie allows us to do this in O(n).
4. For all i = 2..n, if the i-th character can be reached from the beginning, repeat the previous step for the substrings starting at i, marking their end positions as reachable from the (i-1)th position as appropriate (i.e. there is a path from the (i-1)th position to the ((i-1)+k)th position).
5. At the end, we can traverse these indices backwards, starting at the end of the array. Each time we jump to an index stored in the array, we are skipping a prefix in the dictionary. Each path from the last position to the first position gives us a possible split. Since we repeated step 4 only for positions that can be reached from the 0th position, all paths are guaranteed to end up at the 0th position.
Building the array takes O(n^2) time (assuming we have the trie built already). Traversing the array to find all possible splits is O(n*s), where s is the number of possible splits. In any case, we can say if there is a possible split as soon as we have built the array.
The problem with prefixes, suffixes and base words is a slight modification of the above:
Build the "previous" indices for prefixes and "next" for suffixes (possibly starting from the end of the input and tracking the suffixes backwards).
For each base word in the string (all of which we can also find efficiently, in O(n^2), using a trie), see if the starting position can be reached from the left using prefixes, and the end position can be reached from the right using suffixes. If yes, you have a split.
As you can see, the keywords are trie and dynamic programming. The problem of finding only a single split requires O(n^2) time after the tries are built. Tries can be built in O(m) time where m is the total length of added strings.
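A compact Python sketch of the prefix-only case might look like the following; for brevity it uses a plain set where the answer uses a trie, and the sample word and affix set are taken from the question:

def split_into_prefixes(word, prefixes):
    # Dynamic-programming split: reachable[i] holds the positions j from which
    # position i can be reached by appending one dictionary piece word[j:i].
    n = len(word)
    reachable = [[] for _ in range(n + 1)]
    reachable[0] = [None]                  # position 0 is reachable by definition
    for j in range(n):
        if not reachable[j]:
            continue                       # j itself cannot be reached; skip it
        for i in range(j + 1, n + 1):
            if word[j:i] in prefixes:      # a trie would avoid re-hashing each slice
                reachable[i].append(j)
    if not reachable[n]:
        return None
    # walk the links backwards from the end to recover one split
    parts, i = [], n
    while i > 0:
        j = reachable[i][0]
        parts.append(word[j:i])
        i = j
    return list(reversed(parts))

print(split_into_prefixes("amaebinamokayimangna",
                          {"ama", "ebi", "na", "mo", "kay", "i", "mang"}))
# -> ['ama', 'ebi', 'na', 'mo', 'kay', 'i', 'mang', 'na']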

Discovering Consecutive Repetitive Patterns in a String

I am trying to search for the maximal number of substring repetitions inside a string; here are a few examples:
"AQMQMB" => QM (2x)
"AQMPQMB" => <nothing>
"AACABABCABCABCP" => A (2x), AB (2x), ABC (3x)
As you can see, I am searching for consecutive substrings only, and this seems to be a problem because all compression algorithms (at least that I am aware of) either don't care about consecutivity (LZ*) or are too simple to handle consecutive patterns rather than single data items (RLE). I think using suffix tree-related algorithms is also not useful due to the same problem.
I think there are some bio-informatics algorithms that can do this, does anyone have an idea about such algorithm?
Edit
In the second example there might be multiple possibilities of consecutive patterns (thanks to Eugen Rieck for the notice, read comments below), however in my use case any of these possibilities is actually acceptable.
This is what I used for a similar problem:
<?php
$input="AACABABCABCABCP";

//Prepare index array (A..Z) - adapt to your character range
$idx=array();
for ($i="A"; strlen($i)==1; $i++) $idx[$i]=array();

//Prepare hits array
$hits=array();

//Loop
$len=strlen($input);
for ($i=0;$i<$len;$i++) {
    //Current character
    $current=$input[$i];
    //Cycle past occurrences of character
    foreach ($idx[$current] as $offset) {
        //Check if substring from past occurrence to now matches oncoming
        $matchlen=$i-$offset;
        $match=substr($input,$offset,$matchlen);
        if ($match==substr($input,$i,$matchlen)) {
            //match found - store it
            if (isset($hits[$match])) $hits[$match][]=$i;
            else $hits[$match]=array($offset,$i);
        }
    }
    //Store current character in index
    $idx[$current][]=$i;
}
print_r($hits);
?>
I suspect it to be O(N*N/M) time with N being string length and M being the width of the character range.
It outputs what I think are the correct answers for your example.
Edit:
This algorithm has the advantage of keeping valid results while running, so it is usable for streams, as long as you can look ahead via some buffering. It pays for this with efficiency.
Edit 2:
If one were to allow a maximum length for repetition detection, this would decrease space and time usage: expelling past occurrences that are too "early" via something like if ($matchlen > MAX_MATCH_LEN) ... limits both the index size and the string comparison length.
Suffix tree related algorithms are useful here.
One is described in Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 9.6). It uses a combination of divide-and-conquer approach and suffix trees and has time complexity O(N log N + Z) where Z is the number of substring repetitions.
The same book describes a simpler O(N^2) algorithm for this problem, also using suffix trees.

Is there a better way to calculate the frequency of all symbols in a file?

Okay, so, say I have a text file (not necessarily containing every possible symbol) and I'd like to calculate the frequency of each symbol and, after calculating the frequency, I then need to access each symbol and its frequency from most frequent to least frequent. The symbols are not necessarily ASCII characters, they could be arbitrary byte sequences, albeit all of the same length.
I was considering doing something like this (in pseudocode):
function add_to_heap (symbol)
    freq = heap.find(symbol).frequency
    if (freq.exists? == true)
        freq++
    else
        symbol.freq = 1
        heap.insert(symbol)

MaxBinaryHeap heap
while somefile != EOF
    symbol = read_byte(somefile)
    heap.add_to_heap(symbol)
heap.sort_by_frequency()
while heap.root != empty
    root = heap.extract_root()
    do_stuff(root)
I was wondering: is there a better, simpler way to calculate and store how many times each symbol occurs in a file?
You can always use a HashMap instead of the heap. That way you'll be performing operations that are O(1) for each symbol found instead of O(log n), where n is the number of items currently on the heap.
However, if the number of distinct symbols is bounded by a reasonable number (1 byte is ideal, 2 bytes should still be fine), you can just use an array of that size and again have O(1), but with a significantly lower constant cost.
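For example, a minimal Python sketch along these lines (the file name and symbol size are placeholders) counts with a hash map and only sorts the distinct symbols at the end:

from collections import Counter

def frequencies(path, symbol_size=1):
    # Count fixed-size byte symbols with a hash map (O(1) per symbol), then sort
    # the distinct symbols by count once at the end.
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            symbol = f.read(symbol_size)
            if len(symbol) < symbol_size:   # EOF (or a trailing partial symbol)
                break
            counts[symbol] += 1
    return counts.most_common()             # [(symbol, count), ...], most frequent first

# for symbol, count in frequencies("somefile.bin", symbol_size=2):
#     do_stuff(symbol, count)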
If you're looking for a "best" solution based on running times, here's what I'd suggest:
When you're reading the file, you should have your symbols sorted (or hashed) by the value of the symbols themselves, not their frequencies. This'll let you find the current symbol in your list of already seen symbols quickly, rather than having to search through your entire list. You should also have that initial structure be able to perform fast inserts - I'd recommend a binary tree or a hash table.
Once you've read all your symbols, then you should switch your ordering to be based on the frequency counts. I'd read everything into an array and then perform an in-place sort, but there are a bunch of equivalent ways to do this.
Hope this helps!

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient way to separate the individual English words from such compound strings?
I'm not sure how much time you have or how frequently you have to do this (is it a one-time operation? daily? weekly?), but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results have a low score, and abandoning that split in favour of 'Spice' and 'Jet', where both Trie lookups would yield a good combined result.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
The pseudo code is something like this:
parseWords(compoundWord)
    # Make the graph
    graph = makeGraph()
    N = compoundWord.length
    for index = 0 to N
        graph.addVertex(index)

    # Add an edge for each dictionary word found in the compound word
    for index = 0 to N - 1
        for length = 1 to min(N - index, MAX_WORD_LENGTH)
            potentialWord = compoundWord.substr(index, length)
            if dictionary.isElement(potentialWord)
                graph.addEdge(index, index + length, 1)

    # Now find a list of edges which define the shortest path
    edges = graph.shortestPath(0, N)

    # Change these edges back into words.
    result = makeList()
    for e in edges
        result.add(compoundWord.substr(e.start, e.stop - e.start))
    return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
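Here is a small Python sketch of that construction (the names and the toy dictionary are mine); because every edge has weight 1, a plain breadth-first search already yields the fewest-words split:

from collections import deque

def fewest_word_split(compound, dictionary, max_word_len=20):
    # Vertices are positions 0..n; an edge i -> j exists when compound[i:j] is a
    # dictionary word. BFS from 0 to n gives a split with the fewest words.
    n = len(compound)
    prev = {0: None}                      # position -> predecessor position
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i == n:
            break
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            if j not in prev and compound[i:j] in dictionary:
                prev[j] = i
                queue.append(j)
    if n not in prev:
        return None                       # no split exists
    words, j = [], n
    while prev[j] is not None:
        words.append(compound[prev[j]:j])
        j = prev[j]
    return list(reversed(words))

print(fewest_word_split("spicejet", {"spice", "spic", "jet", "ice"}))  # ['spice', 'jet']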
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (Ones that come to mind at present, I know there are more.):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: it looks as if I am going to have to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
def opendict(dictionary=r"g:\words\words(3).txt"):
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            for i in range(4, len(word) - 4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses', ):
                        print(word, left, right)
It sounds to me like you want to store your dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*", where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'
... 27 substrings.
So, find a function to generate all those (slide across your string using window lengths of 2, 3, 4 ... (len(yourstring) - 1)) and then simply check each of those in a set or hash table.
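A tiny Python sketch of that check (the dictionary set here is just a placeholder):

def substrings(word, min_len=2):
    # every substring of length min_len .. len(word) - 1, as enumerated above
    n = len(word)
    return {word[i:i + k] for k in range(min_len, n) for i in range(n - k + 1)}

dictionary = {"spice", "jet", "ice", "spic"}   # stand-in for a real word set
print(substrings("spicejet") & dictionary)     # contains 'spice', 'jet', 'ice', 'spic'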
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# Python; is_word(s) is assumed to exist, e.g. "s in dictionary_set"
def splitword(word):
    # word is a string of n characters, indexed 0..n-1
    n = len(word)
    for i in range(1, n + 1):
        head = word[:i]          # first i characters
        tail = word[i:]          # everything else
        if is_word(head):
            if i == n:
                return [head]    # the whole remaining string is a single word
            rest = splitword(tail)
            if rest:             # the tail split successfully into words
                return [head] + rest
    return []                    # no successful split found, and 'word' is not a word
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
1. Start with the sorted list of words to split, and a sorted list of declined words (the dictionary).
2. Create a result list of objects which store: the remaining word and the list of matched words.
3. Fill the result list with the words to split as remaining words.
4. Walk through the result array and the dictionary concurrently, always advancing the lesser of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass.
5. Any time you find a match, i.e. a word to split that starts with a dictionary word, replace it in the result list with the matched dictionary word and the remaining part. You have to take into account possible multiples.
6. Any time the remaining part is empty, you have found a final result.
7. Any time you don't find a match on the "left side", in other words every time you increment the result pointer because of no match, delete the corresponding result item. This word has no matches and can't be split.
8. Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop (go back to step 4) until this list is empty.

Generating random fixed length permutations of a string

Let's say my alphabet contains X letters and my language supports only Y-letter words (Y < X, of course). I need to generate all the possible words in random order.
E.g.
Alphabet=a,b,c,d,e,f,g
Y=3
So the words would be:
aaa
aab
aac
aba
..
bbb
ccc
..
(the above should be generated in random order)
The trivial way to do it would be to generate the words and then randomize the list. I DON'T want to do that. I want to generate the words in random order.
random(n) = letter[x].random(n-1) will not work because then you'll have a list of words starting with letter[x], which will make the list not so random.
Any code/pseudocode appreciated.
As other answers have implied, there are two main approaches: 1) track what you've already generated (the proposed solutions in this category suffer from possibly never terminating), or 2) track which permutations have yet to be produced (which implies that the permutations must be pre-generated, which was specifically disallowed in the requirements). Here's another solution that is guaranteed to terminate and does not require pre-generation, but may not meet your randomization requirements (which are vague at this point).
General overview: generate a tree to track what's been generated or what's remaining. "select" new permutations by traversing random links in the tree, pruning the tree at the leafs after generation of that permutation to prevent it from being generated again.
Without a whiteboard to diagram this, I hope this description is good enough to describe what I mean: Create a "node" that has links to other nodes for every letter in the alphabet. This could be implemented using a generic map of alphabet letters to nodes or if your alphabet is fixed, you could create specific references. The node represents the available letters in the alphabet that can be "produced" next for generating a permutation. Start generating permutations by visiting the root node, selecting a random letter from the available letters in that node, then traversing that reference to the next node. With each traversal, a letter is produced for the permutation. When a leaf is reached (i.e. a permutation is fully constructed), you'd backtrack up the tree to see if the parent nodes have any available permutations remaining; if not, the parent node can be pruned.
As an implementation detail, the node could store the set of letters that are not available to be produced at that point or the set of letters that are still available to be produced at that point. In order to possibly reduce storage requirements, you could also allow the node to store either with a flag indicating which it's doing so that when the node allows more than half the alphabet it stores the letters produced so far and switch to using the letters remaining when there's less than half the alphabet available.
Using such a tree structure limits what can be produced without having to pre-generate all combinations since you don't need to pre-construct the entire tree (it can be constructed as the permutations are generated) and you're guaranteed to complete because of the purging of the nodes (i.e. you're only traversing links to nodes when that's an allowed combination for an unproduced permutation).
I believe the randomization of the technique is a little odd, however, and I don't think each combination is equally likely to be generated at any given time, though I haven't really thought through this. It's also probably worth noting that even though the full tree isn't necessarily generated up front, the overhead involved will likely be enough such that you may be better off pre-generating all permutations.
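For what it's worth, here is one possible Python sketch of the pruned-tree idea described above; the node layout and function names are mine, and the caveat about the randomization being uneven applies to it as well:

import random

class Node:
    def __init__(self, alphabet):
        self.children = {}               # letter -> Node, created lazily
        self.available = list(alphabet)  # letters whose subtrees still hold unproduced words

def next_word(root, alphabet, length):
    # Return one not-yet-produced word, or None once every word has been produced.
    if not root.available:
        return None
    node, path, visited = root, [], []
    for depth in range(length):
        letter = random.choice(node.available)
        path.append(letter)
        visited.append((node, letter))
        if depth < length - 1:           # leaves need no child nodes
            child = node.children.get(letter)
            if child is None:
                child = node.children[letter] = Node(alphabet)
            node = child
    # prune: the deepest letter is now consumed; propagate exhaustion upwards
    exhausted = True
    for parent, letter in reversed(visited):
        if exhausted:
            parent.available.remove(letter)
            parent.children.pop(letter, None)
        exhausted = not parent.available
        if not exhausted:
            break
    return ''.join(path)

# Produce every 3-letter word over a 7-letter alphabet exactly once, in random order.
alphabet = "abcdefg"
root = Node(alphabet)
words = []
while True:
    w = next_word(root, alphabet, 3)
    if w is None:
        break
    words.append(w)
assert len(words) == len(set(words)) == len(alphabet) ** 3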
I think you can do something pretty straightforward by generating a random array of characters based on the alphabet you have (in c#):
char[] alphabet = {'a', 'b', 'c', 'd'};
int wordLength = 3;
Random rand = new Random();
for (int i = 0; i < 5; i++)
{
    char[] word = new char[wordLength];
    for (int j = 0; j < wordLength; j++)
    {
        word[j] = alphabet[rand.Next(alphabet.Length)];
    }
    Console.WriteLine(new string(word));
}
Obviously this might generate duplicates but you could maybe store results in a hashmap or something to check for duplicates if you need to.
So I take it what you want is to produce a permutation of the set using as little memory as possible.
First off, it can't be done using no memory. For your first string, you want a function that could produce any of the strings with equal likelihood. Say that function is called nextString(). If you call nextString() again without changing anything in the state, of course it will once again be able to produce any of the strings.
So you need to store something. The question is, what do you need to store, and how much space will it take?
The strings can be seen as numbers 0 to X^Y - 1 (aaa=0, aab=1, aac=2, ..., aba=X, ...). So to store a single string as efficiently as possible, you'd need lg(X^Y) bits. Let's say X = 16 and Y = 2. Then you'd need 1 byte of storage to uniquely specify a string.
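To make that encoding concrete, here is a small Python sketch of the bijection (the function names are mine), treating each word as a Y-digit base-X number:

def index_to_word(idx, alphabet, Y):
    # decode an integer in [0, X**Y) into a Y-letter word, most significant digit first
    X = len(alphabet)
    letters = []
    for _ in range(Y):
        idx, digit = divmod(idx, X)
        letters.append(alphabet[digit])
    return ''.join(reversed(letters))

def word_to_index(word, alphabet):
    # the inverse: read the word back as a base-X number
    X = len(alphabet)
    idx = 0
    for ch in word:
        idx = idx * X + alphabet.index(ch)
    return idx

print(index_to_word(0, "abcdefg", 3), index_to_word(1, "abcdefg", 3))  # aaa aab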
Of course the most naive algorithm is to mark each string as it is produced, which takes X^Y bits, which in my example is 256 bits (32 bytes). This is what you've said you don't want to do. You can use a shuffle algorithm as discussed in this question: Creating a random ordered list from an ordered list (you won't need to store the strings as you produce them through the shuffle algorithm, but you still need to mark them).
Ok, now the question is, can we do better than that? How much do we need to store, total?
Well, on the first call, we don't need any storage. On the second call, we need to know which one was produced before. On the last call, we only need to know which one is the last one left. So the worst case is when we're halfway through. When we're halfway through, there have been 128 strings produced, and there are 128 to go. We need to know which are left to produce. Assuming the process is truly random, any split is possible. There are (256 choose 128) possibilities. In order to potentially be able to store any of these, we need lg(256 choose 128) bits, which according to google calculator is 251.67. So if you were really clever you could squeeze the information into 4 fewer bits than the naive algorithm. Probably not worth it.
If you just want it to look randomish with very little storage, see this question: Looking for an algorithm to spit out a sequence of numbers in a (pseudo) random order
