Evolutionary Algorithm to guess a string, messed up by replication

I am working on a Python script to test out genetic programming.
As an exercise I have made a simple script that tries to guess
a string without the whole population part.
My code is:
# acts as a gene
# it has three operations:
# Mutation : One character is changed
# Replication: a sequencepart is duplicated
# Extinction : A sequencepart is lost
# Crossover : the sequence is crossed with another Sequence
import random
class StringGene:
    def __init__(self, s):
        self.sequence = s
        self.allowedChars = "ABCDEFGHIJKLMOPQRSTUVWXYZ/{}[]*()+-"
    def __str__(self):
        return self.sequence
    def Mutation(self):
        x = random.randint(0, len(self.sequence)-1)
        r = random.randint(0, len(self.allowedChars)-1)
        d = self.sequence
        self.sequence = d[:x-1]+ self.allowedChars[r] + d[x:]
    def Replication(self):
        x1 = random.randint(0, len(self.sequence)-1)
        x2 = random.randint(0, len(self.sequence)-1)
        self.sequence =self.sequence[:x1]+ self.sequence[x1:x2] + self.sequence[x2:]
        self.sequence = self.sequence[:32]
    def Extinction(self):
        x1 = random.randint(0, len(self.sequence)-1)
        x2 = random.randint(0, len(self.sequence)-1)
        self.sequence = self.sequence[:x1] + self.sequence[x2:]
    def CrossOver(self, s):
        x1 = random.randint(0, len(self.sequence)-1)
        x2 = random.randint(0, len(s)-1)
        self.sequence = self.sequence[:x1+1]+ s[x2:]
        #x1 = random.randint(0, len(self.sequence)-1)
        #self.sequence = s[:x2 ] + self.sequence[x1+1:]
if __name__== "__main__":
    import itertools
    def hamdist(str1, str2):
        if (len(str2)>len(str1)):
            str1, str2 = str2, str1
        str2 = str2.ljust(len(str1))
        return sum(itertools.imap(str.__ne__, str1, str2))
    g = StringGene("Hi there, Hello World !")
    g.Mutation()
    print "gm: " + str(g)
    g.Replication()
    print "gr: " + str(g)
    g.Extinction()
    print "ge: " + str(g)
    h = StringGene("Hello there, partner")
    print "h: " + str(h)
    g.CrossOver(str(h))
    print "gc: " + str(g)
    change = 0
    oldres = 100
    solutionstring = "Hello Daniel. Nice to meet you."
    best = StringGene("")
    res = 100
    print solutionstring
    while (res > 0):
        g.Mutation()
        g.Replication()
        g.Extinction()
        res = hamdist(str(g), solutionstring)
        if res<oldres:
            print "'"+ str(g) + "'"
            print "'"+ str(best) + "'"
            best = g
            oldres = res
        else :
            g = best
        change = change + 1
    print "Solution:" + str(g)+ " " + str(hamdist(solutionstring, str(g))) + str (change)
I have a crude Hamming distance as a measure of how far the solution string
differs from the current one. However, I want to be able to have a varying
length in the guessing, so I introduced replication and deletion of parts
of the string.
Now, however, the string grows infinitely and the solution string is never
found. Can you point out where I went wrong?
Can you suggest improvements?
Cheers

Your StringGene objects are mutable, which means that when you do an operation like best = g, you are making both g and best reference the same object. Since after that first step you only have a single object, every mutation gets applied permanently, whether or not it's successful, and all comparisons between g and best are comparisons between the same object.
You either need to implement a copy operator, or make instances immutable, and have each mutation operator return a modified version of the 'gene'.
Also, if the first mutation fails to improve the string, you set g to best, which is an empty string, throwing away your starting string entirely.
Finally, the canonical test string is "Methinks it is like a weasel".
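A minimal sketch of the immutable approach, assuming Python 3 and plain strings as genes. The names `mutate`, `distance` and `hill_climb` are hypothetical and not part of the original script, and the length-changing operators are left out to keep the example short. Because each mutation returns a new string, `best` only changes when a candidate actually scores better:

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # assumed alphabet for this sketch

def mutate(sequence, rng):
    """Return a NEW string with one random position replaced."""
    if not sequence:
        return sequence
    i = rng.randrange(len(sequence))
    return sequence[:i] + rng.choice(ALPHABET) + sequence[i + 1:]

def distance(a, b):
    """Hamming distance after padding the shorter string with spaces."""
    longer = max(len(a), len(b))
    return sum(x != y for x, y in zip(a.ljust(longer), b.ljust(longer)))

def hill_climb(start, target, rng, max_steps=200000):
    """Keep the best string seen so far; rejected candidates are simply dropped."""
    best = start
    best_score = distance(best, target)
    for _ in range(max_steps):
        if best_score == 0:
            break
        candidate = mutate(best, rng)  # best itself is never modified
        score = distance(candidate, target)
        if score < best_score:
            best, best_score = candidate, score
    return best
```

Starting from the real initial string rather than an empty `best` also avoids throwing the starting point away on the first rejected mutation.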

The simplest thing might be to limit how long the guessed string is allowed to be. Don't allow guesses above a certain length.
I had a look at your code and I'm not good enough in Python to find any bugs, but it might be that you're simply referencing or indexing the array incorrectly, so that new characters are always added to the guess string and it keeps growing. I don't know if that's the bug, but things like that have happened to me before, so double-check your array indices. ;)

I think your fitness function is too simple. I would play with using two variables, one the size distance and the other your "hamdist". The larger the size difference, the more it affects the total fitness. So add the two together, weighting one of them by some percentage constant.
I'm also not very familiar with Python, but it looks to me like this is not what you're doing.
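A rough sketch of such a two-part fitness, assuming Python 3. The function name `fitness` and the `length_weight` parameter are made up for illustration, and the weight value is a guess that would need tuning:

```python
def fitness(guess, target, length_weight=2.0):
    """Lower is better: character mismatches plus a weighted length penalty."""
    # Compare only the overlapping part character by character.
    mismatches = sum(a != b for a, b in zip(guess, target))
    # Penalise any difference in length separately.
    length_gap = abs(len(guess) - len(target))
    return mismatches + length_weight * length_gap
```

With a weight above 1, a guess that is one character too long scores worse than a guess of the right length with one wrong character, which pushes the search towards the correct length first.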

First of all, what you are doing is a genetic algorithm, not genetic programming (which is a related but different concept).
I don't know Python, but it looks like you have a major problem in your extinction function. As far as I can tell, if x1 > x2 it causes the string to increase in size instead of decreasing (the part between x1 and x2 is effectively doubled). What would happen in the replication function when x1 > x2, I can't tell without knowing Python.
Also keep in mind that maintaining a population is key to effectively solving problems with genetic algorithms. Crossovers are the essential part of the algorithm, and they make little or no sense if they are not made between population members (also, the more varied the population is, the better, most of the time). The code you presented depends on mutations of a single specimen to achieve your expected result, and is thus highly unlikely to produce anything useful faster than a simple brute-force method.
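The slicing behaviour behind the extinction and replication steps can be checked in isolation. This is a sketch assuming Python 3; the free functions below mirror the slicing in the question's methods but take the indices as arguments (and omit the truncation to 32 characters), and `extinction_fixed` is a hypothetical repair that orders the indices first:

```python
def extinction(sequence, x1, x2):
    # Same slicing as the question: grows the string when x1 > x2.
    return sequence[:x1] + sequence[x2:]

def replication(sequence, x1, x2):
    # Same slicing as the question: a no-op when x1 <= x2,
    # and the middle slice is empty when x1 > x2.
    return sequence[:x1] + sequence[x1:x2] + sequence[x2:]

def extinction_fixed(sequence, x1, x2):
    # Order the cut points so a part is always removed, never duplicated.
    lo, hi = sorted((x1, x2))
    return sequence[:lo] + sequence[hi:]
```

For "ABCDEFGH", cutting with (2, 5) removes "CDE", while (5, 2) yields "ABCDECDEFGH", so both operators can only grow the string whenever x1 > x2.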

Related

How to use the trained char-rnn to generate words?

When the char-rnn is trained, the weights of the network are fixed. If I use the same first char, how can I get different sentences? For example, the two sentences "What is wrong?" and "What can I do for you?"
have the same first character "W". Can the char-rnn generate the two different sentences?
Yes, you can get different results from the same state by sampling. Take a look at min-char-rnn by Andrej Karpathy. The sample code is at line 63:
def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in xrange(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes
Starting from the same hidden vector h and seed char seed_ix, you'll have a deterministic distribution over the next char - p. But the result is random, because the code performs np.random.choice instead of np.argmax. If the distribution is highly peaked at some char, you'll still get the same outcome most of the time, but in most cases several next chars are highly probable and they will be sampled, thus changing the whole generated sequence.
Note that this isn't the only possible sampling procedure: temperature-based sampling is more popular. You can take a look at, for instance, this post for overview.
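As an illustration of temperature-based sampling, here is a small sketch in plain Python 3 (standard library only, not taken from min-char-rnn). The names `softmax` and `sample_index` are hypothetical; `logits` stands for the unnormalised scores `y` in the code above:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature sharpens the peak."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_index(logits, temperature, rng):
    """Draw one index at random according to the tempered distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A temperature close to 0 behaves almost like argmax, while a high temperature approaches uniform sampling, so the same seed character can lead to more or less varied continuations.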

Correct sequences of parentheses

Correct sequences of parentheses can be defined recursively:
The empty string "" is a correct sequence.
If "X" and "Y" are correct sequences, then "XY" (the concatenation of
X and Y) is a correct sequence.
If "X" is a correct sequence, then "(X)" is a correct sequence.
Each correct parentheses sequence can be derived using the above
rules.
Given two strings s1 and s2. Each character in these strings is a parenthesis, but the strings themselves are not necessarily correct sequences of parentheses.
You would like to interleave the two sequences so that they will form a correct parentheses sequence. Note that sometimes two different ways of interleaving the two sequences will produce the same final sequence of characters. Even if that happens, we count each of the ways separately.
Compute and return the number of different ways to produce a correct parentheses sequence, modulo 10^9 + 7.
Example: s1 = (() and s2 = ())
[Figure: correct sequences of parentheses, with s1 in red and s2 in blue]
I don't understand the recursive algorithm: what do X and Y mean? And what does "modulo 10^9 + 7" mean?
First, I tried generating all permutations of s1 and s2 and then counting the balanced ones. But that way is wrong, isn't it?
class InterleavingParenthesis:
    def countWays(self, s1, s2):
        sequences = list(self.__exchange(list(s1 + s2)))
        corrects = 0
        for sequence in sequences:
            if self.__isCorrect(sequence):
                corrects += 1
    def __isCorrect(self, sequence):
        s = Stack()
        balanced = True
        i = 0
        while i < len(sequence) and balanced:
            if '(' == sequence[i]:
                s.stack(sequence[i])
            elif s.isEmpty():
                balanced = False
            else: s.remove()
            i += 1
        if s.isEmpty() and balanced: return True
        else: return False
    def __exchange(self, s):
        if len(s) <= 0: yield s
        else:
            for i in range(len(s)):
                for p in self.__exchange(s[:i] + s[i + 1:]):
                    yield [s[i]] + p
class Stack:
    def __init__(self):
        self.items = []
    def stack(self, data):
        self.items.append(data)
    def remove(self):
        self.items.pop()
    def isEmpty(self):
        return self.items == []
Here's an example that shows how this recursive property works:
Start with:
X = "()()(())"
Through property 2, we break this into further X and Y:
X = "()" ; Y = "()(())"
For X, we can look at the insides with property 3.
X = ""
Because of property 1, we know this is valid.
For Y, we use property 2 again:
X = "()"
Y = "(())"
Using the same recursion as before (property 2, then property 1) we know that X is valid. Note that in code, you usually have to go through the same process, I'm just saving time for humans. For Y, you use property 3:
X = "()"
And again.. :
X = ""
And with property 1, you know this is valid.
Because all sub-parts of "()()(())" are valid, "()()(())" is valid. That's an example of recursion: You keep breaking things down into smaller problems until they are solvable. In code, you would have the function call itself with regards to a smaller part of it, in your case, X and Y.
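The breakdown above can be written as a short recursive function. This is only a sketch, assuming Python 3 and input made up solely of "(" and ")"; the name `is_correct` is hypothetical:

```python
def is_correct(s):
    """Check a parentheses string using the three recursive rules."""
    if s == "":
        return True  # rule 1: the empty string is correct
    if s[0] != "(":
        return False  # a correct non-empty sequence must open first
    depth = 0
    for i, ch in enumerate(s):
        depth += 1 if ch == "(" else -1
        if depth == 0:
            # s = "(" + X + ")" + Y: check X (rule 3) and Y (rule 2).
            return is_correct(s[1:i]) and is_correct(s[i + 1:])
    return False  # the first "(" is never closed
```

Each call strips one matched pair and recurses on the two smaller pieces, which is the "keep breaking things down" step in code form.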
As for the question you were given, there is a bit that doesn't make sense to me. I don't get how there is any room for doubt in any string of parentheses, like in the image you linked. In "((()())())" for example, there is no way these two parentheses do not match up: "((()())())". Therefore my answer would be that there is only one permutation for every valid string of parentheses, but this obviously is wrong somehow.
Could you or anyone else expand on this?
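On the counting task itself: interleaving keeps the characters of s1 in their order and the characters of s2 in their order, so it is not the same as permuting s1 + s2, and the same final string can arise from several interleavings. "Modulo 10^9 + 7" just means reporting the remainder of the count after dividing by 1000000007. One possible approach, sketched here in Python 3 under those assumptions (the name `count_interleavings` is hypothetical, and this is not necessarily the intended reference solution), is dynamic programming over how many characters of each string have been used:

```python
MOD = 10**9 + 7

def count_interleavings(s1, s2):
    """Count interleavings of s1 and s2 that form a correct sequence."""
    def prefix_balances(s):
        # balances[k] = (number of "(") - (number of ")") in s[:k]
        balances = [0]
        for ch in s:
            balances.append(balances[-1] + (1 if ch == "(" else -1))
        return balances

    b1, b2 = prefix_balances(s1), prefix_balances(s2)
    n, m = len(s1), len(s2)
    if b1[n] + b2[m] != 0:
        return 0  # unequal numbers of "(" and ")" overall
    # ways[i][j]: valid ways to merge s1[:i] and s2[:j] so far
    ways = [[0] * (m + 1) for _ in range(n + 1)]
    ways[0][0] = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if b1[i] + b2[j] < 0:
                continue  # more ")" than "(" in this prefix: invalid
            total = 0
            if i > 0:
                total += ways[i - 1][j]  # last character came from s1
            if j > 0:
                total += ways[i][j - 1]  # last character came from s2
            ways[i][j] = total % MOD
    return ways[n][m]
```

For the example s1 = (() and s2 = ()) this gives 19. The balance of a merged prefix depends only on how many characters were taken from each string, which is why a two-dimensional table is enough.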

Homework: Implementing Karp-Rabin; For the hash values modulo q, explain why it is a bad idea to use q as a power of 2?

I have a two-fold homework problem, Implement Karp-Rabin and run it on a test file and the second part:
For the hash values modulo q, explain why it is a bad idea to use q as a power of 2. Can you construct a terrible example e.g. for q=64
and n=15?
This is my implementation of the algorithm:
def karp_rabin(text, pattern):
    # setup
    alphabet = 'ACGT'
    d = len(alphabet)
    n = len(pattern)
    d_n = d**n
    q = 2**32-1
    m = {char:i for i,char in enumerate(alphabet)}
    positions = []
    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))
    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]
    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
            positions.append(i)
    return ' '.join(map(str, positions))
...The second part of the question is referring to this part of the code/algo:
pattern_hash = kr_hash(pattern)
for i in range(0, len(text) - n + 1):
    text_hash = update_hash() if i else kr_hash(text[i:n])
    # the modulo q used to check if the hashes are congruent
    if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
        positions.append(i)
I don't understand why it would be a bad idea to use q as a power of 2. I've tried running the algorithm on the test file provided(which is the genome of ecoli) and there's no discernible difference.
I tried looking at the formula for how the hash is derived (I'm not good at math) trying to find some common factors that would be really bad for powers of two but found nothing. I feel like if q is a power of 2 it should cause a lot of clashes for the hashes so you'd need to compare strings a lot more but I didn't find anything along those lines either.
I'd really appreciate help on this since I'm stumped. If someone wants to point out what I can do better in the first part (code efficiency, readability, correctness etc.) I'd also be thrilled to hear your input on that.
There is a problem if q divides some power of d, because then only a few characters contribute to the hash. For example in your code d=4, if you take q=64 only the last three characters determine the hash (d**3 = 64).
I don't really see a problem if q is a power of 2 but gcd(d,q) = 1.
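A small sketch that makes the q = 64 case concrete, assuming Python 3 and the same 'ACGT' alphabet with d = 4. The function name `poly_hash` is hypothetical:

```python
def poly_hash(s, d, q, alphabet="ACGT"):
    """Polynomial hash of s in base d, reduced modulo q at every step."""
    value = 0
    for ch in s:
        value = (value * d + alphabet.index(ch)) % q
    return value

# With d = 4 and q = 64 = 4**3, every character except the last three
# is multiplied by a multiple of 64 and drops out of the hash.
same_suffix_a = poly_hash("AAAAAACGT", 4, 64)
same_suffix_b = poly_hash("TTTTTTCGT", 4, 64)
```

Both strings hash to the same value modulo 64 because they share the suffix "CGT", while a modulus that shares no factor with d, such as the prime 67, tells them apart.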
Your implementation looks a bit strange because instead of
if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
you could also use
if pattern_hash == text_hash and pattern == text[i:i+n]:
which would be better because you get fewer collisions.
The Thue–Morse sequence has the property that its polynomial hash quickly becomes zero when the hash modulus is a power of 2, for whatever polynomial base (d). So if you search for a short Thue–Morse sequence in a longer one, you will get a great many hash collisions.
For example, your code, slightly adapted:
def karp_rabin(text, pattern):
    # setup
    alphabet = '01'
    d = 15
    n = len(pattern)
    d_n = d**n
    q = 32
    m = {char:i for i,char in enumerate(alphabet)}
    positions = []
    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))
    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]
    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q : #and pattern == text[i:i+n]:
            positions.append(i)
    return ' '.join(map(str, positions))
print(karp_rabin('0110100110010110100101100110100110010110011010010110100110010110', '0110100110010110'))
outputs a lot of positions, although only three of them are proper matches.
Note that I have dropped the and pattern == text[i:i+n] check. Obviously, if you restore it, the result will be correct, but it is also obvious that the algorithm will do much more work checking this additional condition than for other q. In fact, because there are so many collisions, the whole idea of the algorithm stops working: you could almost as effectively write a simple algorithm that checks every position for a match.
Also note that your implementation is quite strange. The whole idea of polynomial hashing is to take the modulo operation each time you compute the hash. Otherwise your pattern_hash and text_hash are very big numbers. In other languages this might mean arithmetic overflow, but in Python this will invoke big integer arithmetic, which is slow and once again defeats the whole idea of the algorithm.
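A sketch of what taking the modulo at every step could look like, assuming Python 3. The name `rabin_karp`, the base d = 256 (character codes via `ord`) and the prime modulus are choices made for this example, not taken from the question:

```python
def rabin_karp(text, pattern, d=256, q=1_000_000_007):
    """Return all start positions of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(d, m - 1, q)  # weight of the leading character, already mod q
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * d + ord(pattern[i])) % q
        t_hash = (t_hash * d + ord(text[i])) % q
    positions = []
    for i in range(n - m + 1):
        # Compare the strings only when the small hashes agree.
        if p_hash == t_hash and text[i:i + m] == pattern:
            positions.append(i)
        if i < n - m:
            # Drop text[i], shift, and append text[i + m], all modulo q.
            t_hash = ((t_hash - ord(text[i]) * high) * d + ord(text[i + m])) % q
    return positions
```

The hashes never exceed q, so each update is constant-time arithmetic on small integers instead of growing big-integer values.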

Simple linear equation with binary search? [duplicate]

Is there a way to solve a simple linear equation like
-x+3 = x+5
using binary search, or any other numerical method?
BACKGROUND:
My question comes up because I want to solve equations like "2x+5-(3x+2)=x+5". Possible operators are *, -, + and brackets.
I first thought of converting both sides of the equation to infix notation, and then performing some kind of binary search.
What do you think of this approach? I'm supposed to solve this in less than 40 minutes in an interview.
It is not hard to write a simple parser that reduces $-x+3 -(x+5) = 0$, or any other similar expression, algebraically to $a*x + b = 0$ for accumulated constants $a$ and $b$. Then, one can easily compute the exact solution as $x = -b/a$.
If you really want a numerical approach, observe that each side describes its own linear function graph, i.e., $y_l = -x_l+3$ on the left and $y_r = x_r + 5$ on the right. Thus, finding a solution to this equation is the same as finding an intersection point of both functions. Therefore you can start with any value $x=x_l=x_r$ and evaluate both sides to get the corresponding left and right $y$-values $y_l$ and $y_r$. If their difference is $0$, then you have found a solution (either the unique intersection point by luck, or both lines are equal as in $2x = 2x$). Otherwise, check, e.g., position $x+1$. If the new difference $y_l - y_r$ is unchanged, both lines are parallel (for example $2x = 2x + 7$). Otherwise the difference has moved farther from or nearer to $0$ (from the positive or negative side). So, now you have all that you need to numerically test further points $x$ (e.g., in a binary search fashion, if you first look for some $x$ that achieves a positive $y$-difference and another $x$ that achieves a negative $y$-difference and then run binary search between them) to approximate the $x$-value for which the difference $y_l - y_r$ is $0$. (Of course, you could alternatively compute the solution algebraically again, since evaluating the lines at two positions gives you all the information that you need to compute the intersection point exactly.)
Thus, the numerical approach is quite absurd here, but it motivates this algorithmic way of thinking.
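Both ideas can be sketched in a few lines of Python 3, assuming the difference $y_l - y_r$ is available as a function `f` of $x$ (for example from an evaluator for each side). The names `solve_by_two_evaluations` and `solve_by_bisection`, and the default search interval, are assumptions for this illustration:

```python
def solve_by_two_evaluations(f):
    """Exact solution of the linear equation f(x) = a*x + b = 0."""
    b = f(0.0)
    a = f(1.0) - b  # slope from two evaluations
    if a == 0:
        raise ValueError("no unique solution: the two sides are parallel or identical")
    return -b / a

def solve_by_bisection(f, lo=-1e6, hi=1e6, tol=1e-9):
    """Approximate a root of f between lo and hi by binary search."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if (f_lo > 0) == (f_hi > 0):
        raise ValueError("f has the same sign at both ends of the interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_mid == 0:
            return mid
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid  # root lies in the lower half
    return (lo + hi) / 2
```

For -x+3 = x+5, passing `lambda x: (-x + 3) - (x + 5)` gives x = -1 exactly with the first function and to within the tolerance with the second.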
Do you really need to solve it with a numerical approach? I'm pretty sure you can, but it's not so hard to parse the expression and solve it analytically. I mean, if it is indeed a linear equation, it's just a matter of discovering the coefficient of x and the free term when the equation is reduced. In the 26 minutes since this question was asked, I made a simple parser to do that, by hand:
import re, sys, json
TOKENS = {
    'FREE': '[0-9]+',
    'XTERM': '[0-9]*x',
    'ADD': '\+',
    'SUB': '-',
    'POW': '\^',
    'MUL': '\*',
    'EQL': '=',
    'LPAREN': '\(',
    'RPAREN': '\)',
    'EOF': '$'
}
class Token:
    EOF = lambda p: Token('EOF', '', p)
    def __init__(self, name, raw, position):
        self.name = name
        self.image = raw.strip()
        self.raw = raw
        self.position = position
class Expr:
    def __init__(self, x, c):
        self.x = x
        self.c = c
    def add(self, e):
        return Expr(self.x + e.x, self.c + e.c)
    def sub(self, e):
        return Expr(self.x - e.x, self.c - e.c)
    def mul(self, e):
        return Expr(self.x * e.c + e.x * self.c, self.c * e.c)
    def neg(self):
        return Expr(-self.x, -self.c)
class Scanner:
    def __init__(self, expr):
        self.expr = expr
        self.position = 0
    def match(self, name):
        match = re.match('^\s*'+TOKENS[name], self.expr[self.position:])
        return Token(name, match.group(), self.position) if match else None
    def peek(self, *allowed):
        for match in map(self.match, allowed):
            if match: return match
    def next(self, *allowed):
        token = self.peek(*TOKENS)
        self.position += len(token.raw)
        return token
    def maybe(self, *allowed):
        if self.peek(*allowed):
            return self.next(*allowed)
    def following(self, value, *allowed):
        self.next(*allowed)
        return value
    def expect(self, **actions):
        token = self.next(*actions.keys())
        return actions[token.name](token)
def evaluate(expr, variables={}):
    tokens = Scanner(expr)
    def Binary(higher, **ops):
        e = higher()
        while tokens.peek(*ops):
            e = ops[tokens.next(*ops).name](e, higher())
        return e
    def Equation():
        left = Add()
        tokens.next('EQL')
        right = Add()
        return left.sub(right)
    def Add(): return Binary(Mul, ADD=Expr.add, SUB=Expr.sub)
    def Mul(): return Binary(Neg, MUL=Expr.mul)
    def Neg():
        return Neg().neg() if tokens.maybe('SUB') else Primary()
    def Primary():
        return tokens.expect(
            FREE = lambda x: Expr(0, float(x.image)),
            XTERM = lambda x: Expr(float(x.image[:-1] or 1), 0),
            LPAREN = lambda x: tokens.following(Add(), 'RPAREN'))
    expr = tokens.following(Equation(), 'EOF')
    return -expr.c / float(expr.x)
print evaluate('2+2 = x')
print evaluate('-x+3 = x+5')
print evaluate('2x+5-(3x+2)=x+5')
First, your question is related to solving expressions with a binary tree. One method you can use is to construct a binary tree with the highest-priority operator at the root, followed by the lower-priority operators, with the operands as leaf nodes. You can learn more about this method under equation solving.

Checking if two strings are permutations of each other in Python

I'm checking if two strings a and b are permutations of each other, and I'm wondering what the ideal way to do this is in Python. From the Zen of Python, "There should be one -- and preferably only one -- obvious way to do it," but I see there are at least two ways:
sorted(a) == sorted(b)
and
all(a.count(char) == b.count(char) for char in a)
but the first one is slower when (for example) the first char of a is nowhere in b, and the second is slower when they are actually permutations.
Is there any better (either in the sense of more Pythonic, or in the sense of faster on average) way to do it? Or should I just choose from these two depending on which situation I expect to be most common?
Here is a way which is O(n), asymptotically better than the two ways you suggest.
import collections
def same_permutation(a, b):
    d = collections.defaultdict(int)
    for x in a:
        d[x] += 1
    for x in b:
        d[x] -= 1
    return not any(d.itervalues())
## same_permutation([1,2,3],[2,3,1])
#. True
## same_permutation([1,2,3],[2,3,1,1])
#. False
"but the first one is slower when (for example) the first char of a is nowhere in b".
This kind of degenerate-case performance analysis is not a good idea. It's a rat-hole of lost time thinking up all kinds of obscure special cases.
Only do the O-style "overall" analysis.
Overall, the sorts are O( n log( n ) ).
The a.count(char) for char in a solution is O( n^2 ). Each count pass is a full examination of the string.
If some obscure special case happens to be faster -- or slower, that's possibly interesting. But it only matters when you know the frequency of your obscure special cases. When analyzing sort algorithms, it's important to note that a fair number of sorts involve data that's already in the proper order (either by luck or by a clever design), so sort performance on pre-sorted data matters.
In your obscure special case ("the first char of a is nowhere in b") is this frequent enough to matter? If it's just a special case you thought of, set it aside. If it's a fact about your data, then consider it.
Heuristically, you're probably better off splitting them based on string size.
Pseudocode:
returnvalue = false
if len(a) == len(b)
    if len(a) < threshold
        returnvalue = (sorted(a) == sorted(b))
    else
        returnvalue = naminsmethod(a, b)
return returnvalue
If performance is critical, and string size can be large or small then this is what I'd do.
It's pretty common to split things like this based on input size or type. Algorithms have different strengths or weaknesses and it would be foolish to use one where another would be better... In this case Namin's method is O(n), but has a larger constant factor than the O(n log n) sorted method.
I think the first one is the "obvious" way. It is shorter, clearer, and likely to be faster in many cases because Python's built-in sort is highly optimized.
Your second example won't actually work:
all(a.count(char) == b.count(char) for char in a)
will only work if b does not contain extra characters not in a. It also does duplicate work if the characters in string a repeat.
If you want to know whether two strings are permutations of the same unique characters, just do:
set(a) == set(b)
To correct your second example:
all(a.count(char) == b.count(char) for char in set(a) | set(b))
set() objects overload the bitwise OR operator so that it evaluates to the union of both sets. This ensures that you loop over each distinct character of both strings exactly once.
That said, the sorted() method is much simpler and more intuitive, and would be what I would use.
Here are some timed executions on very small strings, using two different methods:
1. sorting
2. counting (specifically the original method by @namin).
a, b, c = 'confused', 'unfocused', 'foncused'
sort_method = lambda x,y: sorted(x) == sorted(y)
def count_method(a, b):
    d = {}
    for x in a:
        d[x] = d.get(x, 0) + 1
    for x in b:
        d[x] = d.get(x, 0) - 1
    for v in d.itervalues():
        if v != 0:
            return False
    return True
Average run times of the 2 methods over 100,000 loops are:
non-match (string a and b)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.b)'
100000 loops, best of 3: 9.72 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.b)'
10000 loops, best of 3: 28.1 usec per loop
match (string a and c)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.c)'
100000 loops, best of 3: 9.47 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.c)'
100000 loops, best of 3: 24.6 usec per loop
Keep in mind that the strings used are very small. The time complexities of the methods are different, so you'll see different results with very large strings. Choose according to your data; you may even use a combination of the two.
Sorry that my code is not in Python, I have never used it, but I am sure this can be easily translated into python. I believe this is faster than all the other examples already posted. It is also O(n), but stops as soon as possible:
public boolean isPermutation(String a, String b) {
    if (a.length() != b.length()) {
        return false;
    }
    int[] charCount = new int[256];
    for (int i = 0; i < a.length(); ++i) {
        ++charCount[a.charAt(i)];
    }
    for (int i = 0; i < b.length(); ++i) {
        if (--charCount[b.charAt(i)] < 0) {
            return false;
        }
    }
    return true;
}
First I don't use a dictionary but an array of size 256 for all the characters. Accessing the index should be much faster. Then when the second string is iterated, I immediately return false when the count gets below 0. When the second loop has finished, you can be sure that the strings are a permutation, because the strings have equal length and no character was used more often in b compared to a.
Here's martinus's code in Python. It only works for ASCII strings:
def is_permutation(a, b):
    if len(a) != len(b):
        return False
    char_count = [0] * 256
    for c in a:
        char_count[ord(c)] += 1
    for c in b:
        char_count[ord(c)] -= 1
        if char_count[ord(c)] < 0:
            return False
    return True
I did a pretty thorough comparison in Java with all words in a book I had. The counting method beats the sorting method in every way. The results:
Testing against 9227 words.
Permutation testing by sorting ... done. 18.582 s
Permutation testing by counting ... done. 14.949 s
If anyone wants the algorithm and test data set, comment away.
First, for solving such problems, e.g. whether String 1 and String 2 are exactly the same or not, easily, you can use an "if" since it is O(1).
Second, it is important to consider whether the strings contain only numerical values or can also contain words. If the latter is true (words and numerical values are in the string at the same time), your first solution will not work. You can enhance it by using the "ord()" function to convert each character to its ASCII numerical value. However, in the end, you are using sort; therefore, in the worst case your time complexity will be O(NlogN). This time complexity is not bad. But you can do better. You can make it O(N).
My "suggestion" is to use an array (list) and a set at the same time. Note that finding a value in an array needs iteration, so its time complexity is O(N), but searching for a value in a set (which I guess is implemented with a hash table in Python, I'm not sure) has O(1) time complexity:
def Permutation2(Str1, Str2):
    ArrStr1 = list(Str1) #convert Str1 to array
    SetStr2 = set(Str2) #convert Str2 to set
    ArrExtra = []
    if len(Str1) != len(Str2): #check their length
        return False
    elif Str1 == Str2: #check their values
        return True
    for x in xrange(len(ArrStr1)):
        ArrExtra.append(ArrStr1[x])
    for x in xrange(len(ArrExtra)): #of course len(ArrExtra) == len(ArrStr1) ==len(ArrStr2)
        if ArrExtra[x] in SetStr2: #checking in set is O(1)
            continue
        else:
            return False
    return True
Go with the first one - it's much more straightforward and easier to understand. If you're actually dealing with incredibly large strings and performance is a real issue, then don't use Python, use something like C.
As far as the Zen of Python is concerned, that there should only be one obvious way to do things refers to small, simple things. Obviously for any sufficiently complicated task, there will always be zillions of small variations on ways to do it.
In Python 3.1/2.7 you can just use collections.Counter(a) == collections.Counter(b).
But sorted(a) == sorted(b) is still the most obvious IMHO. You are talking about permutations - changing order - so sorting is the obvious operation to erase that difference.
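For reference, the Counter comparison wrapped in a function, as a minimal Python 3 sketch (the function name `is_permutation` is just for illustration):

```python
from collections import Counter

def is_permutation(a, b):
    """True when a and b contain the same items with the same multiplicities."""
    return Counter(a) == Counter(b)
```

Unlike a comparison of sets, this distinguishes "aab" from "abb" because the counts of each character are compared, not just their presence.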
This is derived from @patros' answer.
from collections import Counter
def is_anagram(a, b, threshold=1000000):
    """Returns true if one sequence is a permutation of the other.
    Ignores whitespace and character case.
    Compares sorted sequences if the length is below the threshold,
    otherwise compares dictionaries that contain the frequency of the
    elements.
    """
    a, b = a.strip().lower(), b.strip().lower()
    length_a, length_b = len(a), len(b)
    if length_a != length_b:
        return False
    if length_a < threshold:
        return sorted(a) == sorted(b)
    return Counter(a) == Counter(b) # Or use @namin's method if you don't want to create two dictionaries and don't mind the extra typing.
This is an O(n) solution in Python using hashing with dictionaries. Notice that I don't use default dictionaries because the code can stop this way if we determine the two strings are not permutations after checking the second letter for instance.
def if_two_words_are_permutations(s1, s2):
    if len(s1) != len(s2):
        return False
    dic = {}
    for ch in s1:
        if ch in dic.keys():
            dic[ch] += 1
        else:
            dic[ch] = 1
    for ch in s2:
        if not ch in dic.keys():
            return False
        elif dic[ch] == 0:
            return False
        else:
            dic[ch] -= 1
    return True
This is a PHP function I wrote about a week ago which checks if two words are anagrams. How would this compare (if implemented the same in python) to the other methods suggested? Comments?
public function is_anagram($word1, $word2) {
    $letters1 = str_split($word1);
    $letters2 = str_split($word2);
    if (count($letters1) == count($letters2)) {
        foreach ($letters1 as $letter) {
            $index = array_search($letter, $letters2);
            if ($index !== false) {
                unset($letters2[$index]);
            }
            else { return false; }
        }
        return true;
    }
    return false;
}
Here's a literal translation to Python of the PHP version (by JFS):
def is_anagram(word1, word2):
    letters2 = list(word2)
    if len(word1) == len(word2):
        for letter in word1:
            try:
                del letters2[letters2.index(letter)]
            except ValueError:
                return False
        return True
    return False
Comments:
1. The algorithm is O(N**2). Compare it to @namin's version (it is O(N)).
2. The multiple returns in the function look horrible.
This version is faster than any of the examples presented so far, except that it is 20% slower than sorted(x) == sorted(y) for short strings. It depends on the use case, but generally a 20% performance gain is insufficient to justify complicating the code by using different versions for short and long strings (as in @patros's answer).
It doesn't use len so it accepts any iterable therefore it works even for data that do not fit in memory e.g., given two big text files with many repeated lines it answers whether the files have the same lines (lines can be in any order).
def isanagram(iterable1, iterable2):
    d = {}
    get = d.get
    for c in iterable1:
        d[c] = get(c, 0) + 1
    try:
        for c in iterable2:
            d[c] -= 1
        return not any(d.itervalues())
    except KeyError:
        return False
It is unclear why this version is faster than the defaultdict (@namin's) one for large iterable1 (tested on a 25MB thesaurus).
If we replace get in the loop by try: ... except KeyError then it performs 2 times slower for short strings i.e. when there are few duplicates.
In Swift (or another language's implementation), you could look at the encoded values (in this case Unicode) and see if they match.
Something like:
let string1EncodedValues = "Hello".unicodeScalars.map() {
    //each encoded value
    $0
    //Now add the values
}.reduce(0){ total, value in
    total + value.value
}
let string2EncodedValues = "oellH".unicodeScalars.map() {
    $0
}.reduce(0) { total, value in
    total + value.value
}
let equalStrings = string1EncodedValues == string2EncodedValues ? true : false
You will need to handle spaces and cases as needed.
def matchPermutation(s1, s2):
    a = []
    b = []
    if len(s1) != len(s2):
        print 'length should be the same'
        return
    for i in range(len(s1)):
        a.append(s1[i])
    for i in range(len(s2)):
        b.append(s2[i])
    if set(a) == set(b):
        print 'Permutation of each other'
    else:
        print 'Not a permutation of each other'
    return
#matchPermutation('rav', 'var') #prints 'Permutation of each other'
matchPermutation('rav', 'abc') #prints 'Not a permutation of each other'
# First method
def permutation(s1,s2):
    if len(s1) != len(s2): return False
    return ' '.join(sorted(s1)) == ' '.join(sorted(s2))
# second method
def permutation1(s1,s2):
    if len(s1) != len(s2): return False
    array = [0]*128
    for c in s1:
        array[ord(c)] +=1
    for c in s2:
        array[ord(c)] -=1
        if (array[ord(c)]) < 0:
            return False
    return True
How about something like this? It is pretty straightforward and readable. It is for strings, as per the OP.
Note that the complexity of sorted() is O(n log n).
def checkPermutation(a,b):
    # input: strings a and b
    # return: boolean true if a is Permutation of b
    if len(a) != len(b):
        return False
    else:
        s_a = ''.join(sorted(a))
        s_b = ''.join(sorted(b))
        if s_a == s_b:
            return True
        else:
            return False
# test inputs
a = 'sRF7w0qbGp4fdgEyNlscUFyouETaPHAiQ2WIxzohiafEGJLw03N8ALvqMw6reLN1kHRjDeDausQBEuIWkIBfqUtsaZcPGoqAIkLlugTxjxLhkRvq5d6i55l4oBH1QoaMXHIZC5nA0K5KPBD9uIwa789sP0ZKV4X6'
b = 'Vq3EeiLGfsAOH2PW6skMN8mEmUAtUKRDIY1kow9t1vIEhe81318wSMICGwf7Rv2qrLrpbeh8bh4hlRLZXDSMyZJYWfejLND4u9EhnNI51DXcQKrceKl9arWqOl7sWIw3EBkeu7Fw4TmyfYwPqCf6oUR0UIdsAVNwbyyiajtQHKh2EKLM1KlY6NdvQTTA7JKn6bLInhFvwZ4yKKbzkgGhF3Oogtnmzl29fW6Q2p0GPuFoueZ74aqlveGTYc0zcXUJkMzltzohoRdMUKP4r5XhbsGBED8ReDbL3ouPhsFchERvvNuaIWLUCY4gl8OW06SMuvceZrCg7EkSFxxprYurHz7VQ2muxzQHj7RG2k3khxbz2ZAhWIlBBtPtg4oXIQ7cbcwgmBXaTXSBgBe3Y8ywYBjinjEjRJjVAiZkWoPrt8JtZv249XiN0MTVYj0ZW6zmcvjZtRn32U3KLMOdjLnRFUP2I3HJtp99uVlM9ghIpae0EfC0v2g78LkZE1YAKsuqCiiy7DVOhyAZUbOrRwXOEDHxUyXwCmo1zfVkPVhwysx8HhH7Iy0yHAMr0Tb97BqcpmmyBsrSgsV1aT3sjY0ctDLibrxbRXBAOexncqB4BBKWJoWkQwUZkFfPXemZkWYmE72w5CFlI6kuwBQp27dCDZ39kRG7Txs1MbsUnRnNHBy1hSOZvTQRYZPX0VmU8SVGUqzwm1ECBHZakQK4RUquk3txKCqbDfbrNmnsEcjFaiMFWkY3Esg6p3Mm41KWysTpzN6287iXjgGSBw6CBv0hH635WiZ0u47IpUD5mY9rkraDDl5sDgd3f586EWJdKAaou3jR7eYU7YuJT3RQVRI0MuS0ec0xYID3WTUI0ckImz2ck7lrtfnkewzRMZSE2ANBkEmg2XAmwrCv0gy4ExW5DayGRXoqUv06ZLGCcBEiaF0fRMlinhElZTVrGPqqhT03WSq4P97JbXA90zUxiHCnsPjuRTthYl7ZaiVZwNt3RtYT4Ff1VQ5KXRwRzdzkRMsubBX7YEhhtl0ZGVlYiP4N4t00Jr7fB4687eabUqK6jcUVpXEpTvKDbj0JLcLYsneM9fsievUz193f6aMQ5o5fm4Ilx3TUZiX4AUsoyd8CD2SK3NkiLuR255BDIA0Zbgnj2XLyQPiJ1T4fjStpjxKOTzsQsZxpThY9Fvjvoxcs3HAiXjLtZ0TSOX6n4ZLjV3TdJMc4PonwqIb3lAndlTMnuzEPof2dXnpexoVm5c37XQ7fBkoMBJ4ydnW25XKYJbkrueRDSwtJGHjY37dob4jPg0axM5uWbqGocXQ4DyiVm5GhvuYX32RQaOtXXXw8cWK6JcSUnlP1gGLMNZEGeDXOuGWiy4AJ7SH93ZQ4iPgoxdfCuW0qbsLKT2HopcY9dtBIRzr91wnES9lDL49tpuW77LSt5dGA0YLSeWAaZt9bDrduE0gDZQ2yX4SDvAOn4PMcbFRfTqzdZXONmO7ruBHHb1tVFlBFNc4xkoetDO2s7mpiVG6YR4EYMFIG1hBPh7Evhttb34AQzqImSQm1gyL3O7n3p98Kqb9qqIPbN1kuhtW5mIbIioWW2n7MHY7E5mt0'
print(checkPermutation(a, b)) #optional
def permute(str1,str2):
    if sorted(str1) == sorted(str2):
        return True
    else:
        return False
str1="hello"
str2='olehl'
a=permute(str1,str2)
print(a)
from collections import defaultdict
def permutation(s1,s2):
    h = defaultdict(int)
    for ch in s1:
        h[ch]+=1
    for ch in s2:
        h[ch]-=1
    for key in h.keys():
        if h[key]!=0 or len(s1)!= len(s2):
            return False
    return True
print(permutation("tictac","tactic"))

Resources