Python: all possible words (permutations) of fixed length in mini-alphabet - algorithm

Let's say I have a string like so:
abcdefghijklmnopqrstuvwxyz1234567890!##$%^&*()-_+={}[]\:;"'?/>.<,`~|€
This is basically a list of all the characters on my keyboard. How could I get all possible combinations for, let's say, a "word" made up of 8 of these chars? I know there are going to be millions of possibilities.
Cheers!

Difference between permutations and combinations
You are either looking for a permutation or a combination.
'abc' and 'bac' are different permutations, but they are the same combination {a,b,c}.
Permutations of 'abc' (of every length): '', 'a', 'b', 'c', 'ab', 'ba', 'ac', 'ca', 'bc', 'cb', 'abc', 'acb', 'bac', 'bca', 'cab', 'cba'
Combinations of 'abc' (of every size): {}, {'a'}, {'b'}, {'c'}, {'a','b'}, {'b','c'}, {'a','c'}, {'a','b','c'}
In python
Use from itertools import * (since the functions there really should be in the default namespace), or import itertools if you'd like.
If you care about permutations:
permutations(yourString, 8)
If you care about combinations :
combinations(yourString, 8)
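A quick illustration on a tiny alphabet (note that permutations never repeat a character, so if "words" with repeated characters such as 'aaaaaaaa' should be included, itertools.product(yourString, repeat=8) is the tool instead):

from itertools import permutations, combinations, product
[''.join(p) for p in permutations('abc', 2)]    # ['ab', 'ac', 'ba', 'bc', 'ca', 'cb']
[''.join(c) for c in combinations('abc', 2)]    # ['ab', 'ac', 'bc']
[''.join(w) for w in product('abc', repeat=2)]  # ['aa', 'ab', 'ac', 'ba', ..., 'cc']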
In other languages
In other languages, there are simple recursive or iterative algorithms to generate these. See Wikipedia or Stack Overflow, e.g. http://en.wikipedia.org/wiki/Permutation#Systematic_generation_of_all_permutations
Important note
Do note that the number of length-k permutations of n distinct characters is n!/(n-k)!, so for example your 69-character string would have
(69 choose 8) ≈ 8.4 billion combinations of length 8, and therefore...
(69 choose 8) * 8! = 69!/61! ≈ 3.37 × 10^14 permutations of length 8.
You'll run out of memory if you are storing every permutation. Even if you don't (because you're reducing them), it'll take a long time to run, maybe somewhere between 1-10 days on a modern computer.
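These counts are easy to verify in Python 3.8+ with the math module:

import math
math.comb(69, 8)                    # 8361453672 (≈ 8.4 billion combinations)
math.perm(69, 8)                    # 337133812055040 (≈ 3.37e14 permutations)
math.perm(69, 8) == math.comb(69, 8) * math.factorial(8)   # True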

Related

Binary random number with a specific number of ones

I am looking to randomly insert ones into a binary number where each specific set of bits has a fixed number of ones.
For example, if I have a 15 bit number, all 3 sets of 5 bits must have exactly 3 ones each. I need to generate say, 40 such unique binary numbers.
import numpy as np
N = 15
K = 9 # K zeros, N-K ones
arr = np.array([0] * K + [1] * (N-K))
np.random.shuffle(arr)
This is something that I discovered, but the issue with this solution is that it does not guarantee the ones are distributed the way I want: all the ones could be grouped together right at the beginning, leaving the last set of 5 bits all zeroes, and that is not what I'm looking for.
Also, this method does not guarantee that all the numbers I generate are unique.
Looking for any suggestions regarding this. Thank you!
If I understand the question correctly, you could do something like this in Python:
import random

def valgen():
    set_bits = [
        *random.sample(range(0, 5), 3),
        *random.sample(range(5, 10), 3),
        *random.sample(range(10, 15), 3),
    ]
    return sum(1 << i for i in set_bits)
I.e. sample three integer values, without replacement, from each block of five bit positions, and set those bits in the result.
If you want 40 unique values, I'd do:
vals = {valgen() for _ in range(40)}
while len(vals) < 40:
    vals.add(valgen())
See the birthday problem for why you should expect approximately one duplicate per set of 40.
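A quick sanity check of the construction (my addition, not part of the original answer):

# Each 5-bit block of a generated value should contain exactly 3 set bits.
v = valgen()
assert all(bin((v >> shift) & 0b11111).count('1') == 3 for shift in (0, 5, 10))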

Given a dictionary of words and an array of letters, find the maximum number of dictionary words which can be created using those letters

Each letter can be used only once. There may be more than one instance of the same letter in the array.
We can assume that each word in the dict can be spelled using the letters. The goal is to return the maximum number of words.
Example 1:
arr = ['a', 'b', 'z', 'z', 'z', 'z']
dict = ['ab', 'azz', 'bzz']
// returns 2 ( for [ 'azz', 'bzz' ])
Example 2:
arr = ['g', 't', 'o', 'g', 'w', 'r', 'd', 'e', 'a', 'b']
dict = ['we', 'bag', 'got', 'word']
// returns 3 ( for ['we', 'bag', 'got'] )
EDIT for clarity to adhere to SO guidelines:
Looking for a solution. I was given this problem during an interview. My solution is below, but it was rejected as too slow.
1.) For each word w in dict:
    - Remove w's letters from arr.
    - With the remaining letters, count how many other words could be spelled.
      Put that count down as w's "score".
2.) With every word scored, select the word with the highest score, and
    remove that word and its letters from the input array.
3.) Repeat this process until no more words can be spelled from the remaining
    set of letters.
This is a fairly generic packing problem with up to 26 resources. If I were trying to solve this problem in practice, I would formulate it as an integer program and apply an integer program solver. Here's an example formulation for the given instance:
maximize x_ab + x_azz + x_bzz
subject to
constraint a: x_ab + x_azz <= 1
constraint b: x_ab + x_bzz <= 1
constraint z: 2 x_azz + 2 x_bzz <= 4
x_ab, x_azz, x_bzz in {0, 1} (or integer >= 0 depending on the exact variant)
The solver will solve the linear relaxation of this program and in the process put a price on each letter indicating how useful it is to make words, which guides the solver quickly to a provably optimal solution on surprisingly large instances (though this is an NP-hard problem for arbitrary-size alphabets, so don't expect much on artificial instances such as those resulting from NP-hardness reductions).
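To make that concrete, here is a minimal sketch of the same formulation using the PuLP modeling library and its bundled CBC solver; the function and variable names are my own illustration, not part of the original answer:

from collections import Counter
import pulp  # pip install pulp

def max_words(letters, words):
    available = Counter(letters)
    prob = pulp.LpProblem("word_packing", pulp.LpMaximize)
    # One 0/1 decision variable per dictionary word: use it or don't.
    x = {w: pulp.LpVariable("x_" + w, cat="Binary") for w in words}
    prob += pulp.lpSum(x.values())  # objective: number of words spelled
    for letter, count in available.items():
        # Each letter can be consumed at most as often as it appears in arr.
        prob += pulp.lpSum(w.count(letter) * x[w] for w in words) <= count
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [w for w in words if x[w].value() == 1]

print(max_words(['a', 'b', 'z', 'z', 'z', 'z'], ['ab', 'azz', 'bzz']))
# ['azz', 'bzz']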
I don't know what your interviewer was looking for -- maybe a dynamic program whose states are multisets of unused letters.
One possible dynamic-programming recurrence is the following (note the 1 + when word i is taken, which only applies when enough letters remain):

WordCount(dict, i, remainingLetterCounts) =
    max(WordCount(dict, i-1, remainingLetterCounts),
        1 + WordCount(dict, i-1, remainingLetterCountsAfterReducing(dict[i])))
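A minimal memoized sketch of this recurrence in Python (all names here are mine):

from functools import lru_cache
from collections import Counter

def word_count(words, letters):
    alphabet = sorted(set(letters))
    index = {c: j for j, c in enumerate(alphabet)}
    start = tuple(Counter(letters)[c] for c in alphabet)

    @lru_cache(maxsize=None)
    def best(i, remaining):
        if i == len(words):
            return 0
        result = best(i + 1, remaining)  # skip word i
        need = Counter(words[i])
        if all(c in index and remaining[index[c]] >= k for c, k in need.items()):
            left = tuple(r - need.get(c, 0) for c, r in zip(alphabet, remaining))
            result = max(result, 1 + best(i + 1, left))  # take word i
        return result

    return best(0, start)

print(word_count(['ab', 'azz', 'bzz'], ['a', 'b', 'z', 'z', 'z', 'z']))  # 2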
I see it as a multidimensional problem. Was the interviewer impressed by your answer?
Turn the list of letters into a set of letter-occurrence pairs, where the occurrence count is incremented on each repeat of the same letter in the list, e.g. aba becomes the set {a-1, b-1, a-2}.
Translate each word in the dictionary, independently, in a similar manner, so the word coo becomes the set {c-1, o-2}.
A word is accepted if the set of its letter-occurrences is a subset of the set generated from the original list of letters.
For a fixed alphabet and maximum letter frequency, this could be implemented quite quickly using bitsets; but, again, how fast is fast enough?
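In Python, the same subset test falls out of collections.Counter directly (a small sketch of the idea, not the bitset version):

from collections import Counter

def can_spell(word, letters):
    # Counter subtraction drops non-positive counts, so an empty result
    # means the word's letter multiset is a subset of the available letters.
    return not (Counter(word) - Counter(letters))

print(can_spell('coo', 'cool'))   # True
print(can_spell('coo', 'code'))   # False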

Encoding Permutations With Repeating Values

I'm trying to generate all combinations of A,B,C,D,E in three positions:
A,A,A
A,A,B
C,A,E
C,B,A
C,B,B
etc...
I've learned about factorial number systems and combinatorial number systems, but I'm still stuck finding the right implementation. Generally in the past I've used recursion to solve this problem, but in this case I don't want to generate the whole list to find one value, so I need an encoding.
Ideally I have an integer encoding for the combinations, so I can simply call a function with an iteration integer to generate the correct permutation.
Also, what is this called and how can I learn more about the variations in approaches? Some similar solutions I've seen generate only non-repeating combinations (ABC, ABD); others don't reuse values.
My guess based on my past recursion approach is that permutation(0) would result in aaa and permutation(100) would result in adw.
The specific combinations you look for seem to be just "any of A,B,C,D,E on each position".
In this case, they are much akin to a "pentary" (base 5) positional numeral system: you have three digits, and each of them may independently be 0 (A), 1 (B), 2 (C), 3 (D), or 4 (E).
The same goes for encoding these as integers: just number them from 0 to 5^3 - 1.
For a number k, the "combination" is (k div 5^2) mod 5, (k div 5^1) mod 5, (k div 5^0) mod 5, with ABCDE encoded as 01234, respectively.
For a "combination" like "xyz", first map the letters ABCDE to digits 01234 to get x, y, and z, and then the encoding number is x*5^2 + y*5^1 + z*5^0.

How to generate all unique 4 character permutations of a seed string? [duplicate]

Possible Duplicate:
Algorithm to return all combinations of k elements from n
Generate Distinct Combinations PHP
I have an array containing a number of characters/letters, e.g.:
$seed = array('a','b','c','d','e','f',.....,'z','1','2','3',...'9');
I want to get all possible unique 4 character combinations/permutations from the seed, for example:
abcd, azxy, ag12, aaaa, etc
What's the best way to accomplish this?
I have thought about dividing the seed array into 4-letter groups, then going through each group and generating all possible combinations of that group, but that will leave out many combinations (i.e. it will process abcd and wxyz, but not abyz and wxcd).
For each character in the array, write that character followed by each of the unique 3 character strings either from the characters after it (if you actually mean combinations) or from all the characters (which is what I think you mean).
How to generate all unique 3 character permutations of a seed string?
See this very similar question.
You may also want to read about recursion.
Python code
>>> def product(chars, n):
...     if n == 0:
...         yield ''
...     else:
...         for c in chars:
...             for result in product(chars, n - 1):  # Recursive call
...                 yield c + result
...
>>> list(product(['a', 'b', 'c'], 2))
['aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
(Note: in real Python code you should use itertools.product rather than writing it yourself.)
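For instance, the 4-character case from this question (a sketch, assuming seed holds the character list from the question):

from itertools import product
seed = ['a', 'b', 'c', 'd']  # stand-in for the full $seed array
words = [''.join(p) for p in product(seed, repeat=4)]
print(len(words))  # 4**4 == 256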
Generating permutations is like summing up numbers. This is beautifully explained in the freely available book Higher Order Perl, page 128

sorting algorithm where pairwise-comparison can return more information than -1, 0, +1

Most sort algorithms rely on a pairwise comparison that determines whether A < B, A = B or A > B.
I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).
For some applications (such as near sorting or approximate sorting), this would enable a reasonable sort to be determined with fewer comparisons.
The extra information can indeed be used to minimize the total number of comparisons. Calls to the super_comparison function can be used to make deductions equivalent to a great number of calls to a regular comparison function. For example, a much-less-than b and c little-less-than b implies a < c < b.
The deductions can be organized into bins or partitions which can each be sorted separately. Effectively, this is equivalent to QuickSort with an n-way partition. Here's an implementation in Python:
from collections import defaultdict
from random import choice

def quicksort(seq, compare):
    'Stable in-place sort using a 3-or-more-way comparison function'
    # Make an n-way partition on a random pivot value
    segments = defaultdict(list)
    pivot = choice(seq)
    for x in seq:
        ranking = 0 if x is pivot else compare(x, pivot)
        segments[ranking].append(x)
    seq.clear()
    # Recursively sort each segment and store it in the sequence
    for ranking, segment in sorted(segments.items()):
        if ranking and len(segment) > 1:
            quicksort(segment, compare)
        seq += segment

if __name__ == '__main__':
    from random import randrange
    from math import log10

    def super_compare(a, b):
        'Compare with extra logarithmic near/far information'
        c = -1 if a < b else 1 if a > b else 0
        return c * (int(log10(max(abs(a - b), 1.0))) + 1)

    n = 10000
    data = [randrange(4*n) for i in range(n)]
    goal = sorted(data)
    quicksort(data, super_compare)
    print(data == goal)
By instrumenting this code with the trace module, it is possible to measure the performance gain. In the above code, a regular three-way compare uses 133,000 comparisons while a super comparison function reduces the number of calls to 85,000.
The code also makes it easy to experiment with a variety of comparison functions. This will show that naïve n-way comparison functions do very little to help the sort. For example, if the comparison function returns +/-2 for differences greater than four and +/-1 for differences of four or less, there is only a modest 5% reduction in the number of comparisons. The root cause is that the coarse-grained partitions used in the beginning only have a handful of "near matches" and everything else falls into "far matches".
An improvement on the super comparison is to cover logarithmic ranges (i.e. +/-1 if within ten, +/-2 if within a hundred, +/-3 if within a thousand).
An ideal comparison function would be adaptive. For any given sequence size, the comparison function should strive to subdivide the sequence into partitions of roughly equal size. Information theory tells us that this will maximize the number of bits of information per comparison.
The adaptive approach makes good intuitive sense as well. People should first be partitioned into love vs like before making more refined distinctions such as love-a-lot vs love-a-little. Further partitioning passes should each make finer and finer distinctions.
You can use a modified quick sort. Let me explain with an example where the comparison function returns [-2, -1, 0, 1, 2]. Say you have an array A to sort.
Create 5 empty arrays - Aminus2, Aminus1, A0, Aplus1, Aplus2.
Pick an arbitrary element of A, X.
For each element of the array, compare it with X.
Depending on the result, place the element in one of the Aminus2, Aminus1, A0, Aplus1, Aplus2 arrays.
Apply the same sort recursively to Aminus2, Aminus1, Aplus1, Aplus2 (note: you don't need to sort A0, as all the elements there are equal to X).
Concatenate the arrays to get the final result: A = Aminus2 + Aminus1 + A0 + Aplus1 + Aplus2.
It seems like using raindog's modified quicksort would let you stream out results sooner and perhaps page into them faster.
Maybe those features are already available from a carefully-controlled qsort operation? I haven't thought much about it.
This also sounds kind of like radix sort except instead of looking at each digit (or other kind of bucket rule), you're making up buckets from the rich comparisons. I have a hard time thinking of a case where rich comparisons are available but digits (or something like them) aren't.
I can't think of any situation in which this would be really useful. Even if I could, I suspect the added CPU cycles needed to sort fuzzy values would be more than those "extra comparisons" you allude to. But I'll still offer a suggestion.
Consider this possibility (all strings use the 27 characters a-z and _):
            11111111112
   12345678901234567890
1/ now_is_the_time
2/ now_is_never
3/ now_we_have_to_go
4/ aaa
5/ ___
Obviously strings 1 and 2 are more similar than 1 and 3 and much more similar than 1 and 4.
One approach is to scale the difference value for each identical character position and use the first different character to set the last position.
Putting aside signs for the moment, compare string 1 with string 2: they differ in position 8, by 't' - 'n'. That's a difference of 6. In order to turn that into a single digit 1-9, we use the formula:
digit = ceiling(9 * abs(diff) / 26)
since the maximum difference is 26. The minimum difference of 1 becomes the digit 1. The maximum difference of 26 becomes the digit 9. Our difference of 6 becomes 3.
And because the difference is in position 8, our comparison function will return 3 × 10^-8 (actually it will return the negative of that, since string 1 comes after string 2).
Using a similar process for strings 1 and 4, the comparison function returns -5 × 10^-1. The highest possible return (strings 4 and 5) has a difference in position 1 of '_' - 'a' (26), which generates the digit 9 and hence gives us 9 × 10^-1.
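A sketch of this scheme in Python (the alphabet ordering, function name, and sign convention follow the description above; this is my illustration, not the answerer's code):

from math import ceil

CHARS = 'abcdefghijklmnopqrstuvwxyz_'  # the 27-character alphabet

def fuzzy_compare(s, t):
    for pos, (a, b) in enumerate(zip(s, t), start=1):
        if a != b:
            diff = CHARS.index(a) - CHARS.index(b)
            digit = ceil(9 * abs(diff) / 26)  # scale 1..26 onto 1..9
            sign = -1 if diff > 0 else 1      # negative when s sorts after t
            return sign * digit * 10.0 ** -pos
    return 0.0  # ignores length differences, as does the example above

print(fuzzy_compare('now_is_the_time', 'now_is_never'))  # ≈ -3e-08
print(fuzzy_compare('now_is_the_time', 'aaa'))           # ≈ -0.5
print(fuzzy_compare('aaa', '___'))                       # ≈ 0.9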
Take these suggestions and use them as you see fit. I'd be interested in knowing how your fuzzy comparison code ends up working out.
Considering you are looking to order a number of items based on human comparison, you might want to approach this problem like a sports tournament. You might allow each human vote to increase the score of the winner by 3 and decrease the loser's by 3, +2 and -2, +1 and -1, or just 0 and 0 for a draw.
Then you just do a regular sort based on the scores.
Another alternative would be a single or double elimination tournament structure.
You can use two comparisons to achieve this. Multiply the more important comparison by 2, and add them together.
Here is an example of what I mean in Perl.
It compares two array references by the first element, then by the second element.
use strict;
use warnings;
use 5.010;
my @array = (
    [a => 2],
    [b => 1],
    [a => 1],
    [c => 0]
);

say "$_->[0] => $_->[1]" for sort {
    ($a->[0] cmp $b->[0]) * 2 +
    ($a->[1] <=> $b->[1])
} @array;
a => 1
a => 2
b => 1
c => 0
You could extend this to any number of comparisons very easily.
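The same weighting trick in Python, via functools.cmp_to_key (a sketch paralleling the Perl above):

from functools import cmp_to_key

def compare(x, y):
    primary = (x[0] > y[0]) - (x[0] < y[0])    # like Perl's cmp
    secondary = (x[1] > y[1]) - (x[1] < y[1])  # like Perl's <=>
    return primary * 2 + secondary             # the primary comparison dominates

pairs = [('a', 2), ('b', 1), ('a', 1), ('c', 0)]
print(sorted(pairs, key=cmp_to_key(compare)))
# [('a', 1), ('a', 2), ('b', 1), ('c', 0)]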
Perhaps there's a good reason to do this but I don't think it beats the alternatives for any given situation and certainly isn't good for general cases. The reason? Unless you know something about the domain of the input data and about the distribution of values you can't really improve over, say, quicksort. And if you do know those things, there are often ways that would be much more effective.
Anti-example: suppose your comparison returns a value of "huge difference" for numbers differing by more than 1000, and that the input is {0, 10000, 20000, 30000, ...}
Anti-example: same as above but with input {0, 10000, 10001, 10002, 20000, 20001, ...}
But, you say, I know my inputs don't look like that! Well, in that case tell us what your inputs really look like, in detail. Then someone might be able to really help.
For instance, once I needed to sort historical data. The data was kept sorted. When new data were added, they were appended and then the sort was run again. I did not have the information of where the new data began. I designed a hybrid sort for this situation that handily beat qsort and others by picking a sort that was quick on already-sorted data and tweaking it to be fast (essentially switching to qsort) when it encountered unsorted data.
The only way you're going to improve over the general purpose sorts is to know your data. And if you want answers you're going to have to communicate that here very well.
