Checking if two strings are permutations of each other in Python - algorithm

I'm checking if two strings a and b are permutations of each other, and I'm wondering what the ideal way to do this is in Python. From the Zen of Python, "There should be one -- and preferably only one -- obvious way to do it," but I see there are at least two ways:
sorted(a) == sorted(b)
and
all(a.count(char) == b.count(char) for char in a)
but the first one is slower when (for example) the first char of a is nowhere in b, and the second is slower when they are actually permutations.
Is there any better (either in the sense of more Pythonic, or in the sense of faster on average) way to do it? Or should I just choose from these two depending on which situation I expect to be most common?

Here is a way which is O(n), asymptotically better than the two ways you suggest.
import collections
def same_permutation(a, b):
d = collections.defaultdict(int)
for x in a:
d[x] += 1
for x in b:
d[x] -= 1
return not any(d.itervalues())
## same_permutation([1,2,3],[2,3,1])
#. True
## same_permutation([1,2,3],[2,3,1,1])
#. False

"but the first one is slower when (for example) the first char of a is nowhere in b".
This kind of degenerate-case performance analysis is not a good idea. It's a rat-hole of lost time thinking up all kinds of obscure special cases.
Only do the O-style "overall" analysis.
Overall, the sorts are O( n log( n ) ).
The a.count(char) for char in a solution is O( n 2 ). Each count pass is a full examination of the string.
If some obscure special case happens to be faster -- or slower, that's possibly interesting. But it only matters when you know the frequency of your obscure special cases. When analyzing sort algorithms, it's important to note that a fair number of sorts involve data that's already in the proper order (either by luck or by a clever design), so sort performance on pre-sorted data matters.
In your obscure special case ("the first char of a is nowhere in b") is this frequent enough to matter? If it's just a special case you thought of, set it aside. If it's a fact about your data, then consider it.

heuristically you're probably better to split them off based on string size.
Pseudocode:
returnvalue = false
if len(a) == len(b)
if len(a) < threshold
returnvalue = (sorted(a) == sorted(b))
else
returnvalue = naminsmethod(a, b)
return returnvalue
If performance is critical, and string size can be large or small then this is what I'd do.
It's pretty common to split things like this based on input size or type. Algorithms have different strengths or weaknesses and it would be foolish to use one where another would be better... In this case Namin's method is O(n), but has a larger constant factor than the O(n log n) sorted method.

I think the first one is the "obvious" way. It is shorter, clearer, and likely to be faster in many cases because Python's built-in sort is highly optimized.

Your second example won't actually work:
all(a.count(char) == b.count(char) for char in a)
will only work if b does not contain extra characters not in a. It also does duplicate work if the characters in string a repeat.
If you want to know whether two strings are permutations of the same unique characters, just do:
set(a) == set(b)
To correct your second example:
all(str1.count(char) == str2.count(char) for char in set(a) | set(b))
set() objects overload the bitwise OR operator so that it will evaluate to the union of both sets. This will make sure that you will loop over all the characters of both strings once for each character only.
That said, the sorted() method is much simpler and more intuitive, and would be what I would use.

Here are some timed executions on very small strings, using two different methods:
1. sorting
2. counting (specifically the original method by #namin).
a, b, c = 'confused', 'unfocused', 'foncused'
sort_method = lambda x,y: sorted(x) == sorted(y)
def count_method(a, b):
d = {}
for x in a:
d[x] = d.get(x, 0) + 1
for x in b:
d[x] = d.get(x, 0) - 1
for v in d.itervalues():
if v != 0:
return False
return True
Average run times of the 2 methods over 100,000 loops are:
non-match (string a and b)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.b)'
100000 loops, best of 3: 9.72 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.b)'
10000 loops, best of 3: 28.1 usec per loop
match (string a and c)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.c)'
100000 loops, best of 3: 9.47 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.c)'
100000 loops, best of 3: 24.6 usec per loop
Keep in mind that the strings used are very small. The time complexity of the methods are different, so you'll see different results with very large strings. Choose according to your data, you may even use a combination of the two.

Sorry that my code is not in Python, I have never used it, but I am sure this can be easily translated into python. I believe this is faster than all the other examples already posted. It is also O(n), but stops as soon as possible:
public boolean isPermutation(String a, String b) {
if (a.length() != b.length()) {
return false;
}
int[] charCount = new int[256];
for (int i = 0; i < a.length(); ++i) {
++charCount[a.charAt(i)];
}
for (int i = 0; i < b.length(); ++i) {
if (--charCount[b.charAt(i)] < 0) {
return false;
}
}
return true;
}
First I don't use a dictionary but an array of size 256 for all the characters. Accessing the index should be much faster. Then when the second string is iterated, I immediately return false when the count gets below 0. When the second loop has finished, you can be sure that the strings are a permutation, because the strings have equal length and no character was used more often in b compared to a.

Here's martinus code in python. It only works for ascii strings:
def is_permutation(a, b):
if len(a) != len(b):
return False
char_count = [0] * 256
for c in a:
char_count[ord(c)] += 1
for c in b:
char_count[ord(c)] -= 1
if char_count[ord(c)] < 0:
return False
return True

I did a pretty thorough comparison in Java with all words in a book I had. The counting method beats the sorting method in every way. The results:
Testing against 9227 words.
Permutation testing by sorting ... done. 18.582 s
Permutation testing by counting ... done. 14.949 s
If anyone wants the algorithm and test data set, comment away.

First, for solving such problems, e.g. whether String 1 and String 2 are exactly the same or not, easily, you can use an "if" since it is O(1).
Second, it is important to consider that whether they are only numerical values or they can be also words in the string. If the latter one is true (words and numerical values are in the string at the same time), your first solution will not work. You can enhance it by using "ord()" function to make it ASCII numerical value. However, in the end, you are using sort; therefore, in the worst case your time complexity will be O(NlogN). This time complexity is not bad. But, you can do better. You can make it O(N).
My "suggestion" is using Array(list) and set at the same time. Note that finding a value in Array needs iteration so it's time complexity is O(N), but searching a value in set (which I guess it is implemented with HashTable in Python, I'm not sure) has O(1) time complexity:
def Permutation2(Str1, Str2):
ArrStr1 = list(Str1) #convert Str1 to array
SetStr2 = set(Str2) #convert Str2 to set
ArrExtra = []
if len(Str1) != len(Str2): #check their length
return False
elif Str1 == Str2: #check their values
return True
for x in xrange(len(ArrStr1)):
ArrExtra.append(ArrStr1[x])
for x in xrange(len(ArrExtra)): #of course len(ArrExtra) == len(ArrStr1) ==len(ArrStr2)
if ArrExtra[x] in SetStr2: #checking in set is O(1)
continue
else:
return False
return True

Go with the first one - it's much more straightforward and easier to understand. If you're actually dealing with incredibly large strings and performance is a real issue, then don't use Python, use something like C.
As far as the Zen of Python is concerned, that there should only be one obvious way to do things refers to small, simple things. Obviously for any sufficiently complicated task, there will always be zillions of small variations on ways to do it.

In Python 3.1/2.7 you can just use collections.Counter(a) == collections.Counter(b).
But sorted(a) == sorted(b) is still the most obvious IMHO. You are talking about permutations - changing order - so sorting is the obvious operation to erase that difference.

This is derived from #patros' answer.
from collections import Counter
def is_anagram(a, b, threshold=1000000):
"""Returns true if one sequence is a permutation of the other.
Ignores whitespace and character case.
Compares sorted sequences if the length is below the threshold,
otherwise compares dictionaries that contain the frequency of the
elements.
"""
a, b = a.strip().lower(), b.strip().lower()
length_a, length_b = len(a), len(b)
if length_a != length_b:
return False
if length_a < threshold:
return sorted(a) == sorted(b)
return Counter(a) == Counter(b) # Or use #namin's method if you don't want to create two dictionaries and don't mind the extra typing.

This is an O(n) solution in Python using hashing with dictionaries. Notice that I don't use default dictionaries because the code can stop this way if we determine the two strings are not permutations after checking the second letter for instance.
def if_two_words_are_permutations(s1, s2):
if len(s1) != len(s2):
return False
dic = {}
for ch in s1:
if ch in dic.keys():
dic[ch] += 1
else:
dic[ch] = 1
for ch in s2:
if not ch in dic.keys():
return False
elif dic[ch] == 0:
return False
else:
dic[ch] -= 1
return True

This is a PHP function I wrote about a week ago which checks if two words are anagrams. How would this compare (if implemented the same in python) to the other methods suggested? Comments?
public function is_anagram($word1, $word2) {
$letters1 = str_split($word1);
$letters2 = str_split($word2);
if (count($letters1) == count($letters2)) {
foreach ($letters1 as $letter) {
$index = array_search($letter, $letters2);
if ($index !== false) {
unset($letters2[$index]);
}
else { return false; }
}
return true;
}
return false;
}
Here's a literal translation to Python of the PHP version (by JFS):
def is_anagram(word1, word2):
letters2 = list(word2)
if len(word1) == len(word2):
for letter in word1:
try:
del letters2[letters2.index(letter)]
except ValueError:
return False
return True
return False
Comments:
1. The algorithm is O(N**2). Compare it to #namin's version (it is O(N)).
2. The multiple returns in the function look horrible.

This version is faster than any examples presented so far except it is 20% slower than sorted(x) == sorted(y) for short strings. It depends on use cases but generally 20% performance gain is insufficient to justify a complication of the code by using different version for short and long strings (as in #patros's answer).
It doesn't use len so it accepts any iterable therefore it works even for data that do not fit in memory e.g., given two big text files with many repeated lines it answers whether the files have the same lines (lines can be in any order).
def isanagram(iterable1, iterable2):
d = {}
get = d.get
for c in iterable1:
d[c] = get(c, 0) + 1
try:
for c in iterable2:
d[c] -= 1
return not any(d.itervalues())
except KeyError:
return False
It is unclear why this version is faster then defaultdict (#namin's) one for large iterable1 (tested on 25MB thesaurus).
If we replace get in the loop by try: ... except KeyError then it performs 2 times slower for short strings i.e. when there are few duplicates.

In Swift (or another languages implementation), you could look at the encoded values ( in this case Unicode) and see if they match.
Something like:
let string1EncodedValues = "Hello".unicodeScalars.map() {
//each encoded value
$0
//Now add the values
}.reduce(0){ total, value in
total + value.value
}
let string2EncodedValues = "oellH".unicodeScalars.map() {
$0
}.reduce(0) { total, value in
total + value.value
}
let equalStrings = string1EncodedValues == string2EncodedValues ? true : false
You will need to handle spaces and cases as needed.

def matchPermutation(s1, s2):
a = []
b = []
if len(s1) != len(s2):
print 'length should be the same'
return
for i in range(len(s1)):
a.append(s1[i])
for i in range(len(s2)):
b.append(s2[i])
if set(a) == set(b):
print 'Permutation of each other'
else:
print 'Not a permutation of each other'
return
#matchPermutaion('rav', 'var') #returns True
matchPermutaion('rav', 'abc') #returns False

Checking if two strings are permutations of each other in Python
# First method
def permutation(s1,s2):
if len(s1) != len(s2):return False;
return ' '.join(sorted(s1)) == ' '.join(sorted(s2))
# second method
def permutation1(s1,s2):
if len(s1) != len(s2):return False;
array = [0]*128;
for c in s1:
array[ord(c)] +=1
for c in s2:
array[ord(c)] -=1
if (array[ord(c)]) < 0:
return False
return True

How about something like this. Pretty straight-forward and readable. This is for strings since the as per the OP.
Given that the complexity of sorted() is O(n log n).
def checkPermutation(a,b):
# input: strings a and b
# return: boolean true if a is Permutation of b
if len(a) != len(b):
return False
else:
s_a = ''.join(sorted(a))
s_b = ''.join(sorted(b))
if s_a == s_b:
return True
else:
return False
# test inputs
a = 'sRF7w0qbGp4fdgEyNlscUFyouETaPHAiQ2WIxzohiafEGJLw03N8ALvqMw6reLN1kHRjDeDausQBEuIWkIBfqUtsaZcPGoqAIkLlugTxjxLhkRvq5d6i55l4oBH1QoaMXHIZC5nA0K5KPBD9uIwa789sP0ZKV4X6'
b = 'Vq3EeiLGfsAOH2PW6skMN8mEmUAtUKRDIY1kow9t1vIEhe81318wSMICGwf7Rv2qrLrpbeh8bh4hlRLZXDSMyZJYWfejLND4u9EhnNI51DXcQKrceKl9arWqOl7sWIw3EBkeu7Fw4TmyfYwPqCf6oUR0UIdsAVNwbyyiajtQHKh2EKLM1KlY6NdvQTTA7JKn6bLInhFvwZ4yKKbzkgGhF3Oogtnmzl29fW6Q2p0GPuFoueZ74aqlveGTYc0zcXUJkMzltzohoRdMUKP4r5XhbsGBED8ReDbL3ouPhsFchERvvNuaIWLUCY4gl8OW06SMuvceZrCg7EkSFxxprYurHz7VQ2muxzQHj7RG2k3khxbz2ZAhWIlBBtPtg4oXIQ7cbcwgmBXaTXSBgBe3Y8ywYBjinjEjRJjVAiZkWoPrt8JtZv249XiN0MTVYj0ZW6zmcvjZtRn32U3KLMOdjLnRFUP2I3HJtp99uVlM9ghIpae0EfC0v2g78LkZE1YAKsuqCiiy7DVOhyAZUbOrRwXOEDHxUyXwCmo1zfVkPVhwysx8HhH7Iy0yHAMr0Tb97BqcpmmyBsrSgsV1aT3sjY0ctDLibrxbRXBAOexncqB4BBKWJoWkQwUZkFfPXemZkWYmE72w5CFlI6kuwBQp27dCDZ39kRG7Txs1MbsUnRnNHBy1hSOZvTQRYZPX0VmU8SVGUqzwm1ECBHZakQK4RUquk3txKCqbDfbrNmnsEcjFaiMFWkY3Esg6p3Mm41KWysTpzN6287iXjgGSBw6CBv0hH635WiZ0u47IpUD5mY9rkraDDl5sDgd3f586EWJdKAaou3jR7eYU7YuJT3RQVRI0MuS0ec0xYID3WTUI0ckImz2ck7lrtfnkewzRMZSE2ANBkEmg2XAmwrCv0gy4ExW5DayGRXoqUv06ZLGCcBEiaF0fRMlinhElZTVrGPqqhT03WSq4P97JbXA90zUxiHCnsPjuRTthYl7ZaiVZwNt3RtYT4Ff1VQ5KXRwRzdzkRMsubBX7YEhhtl0ZGVlYiP4N4t00Jr7fB4687eabUqK6jcUVpXEpTvKDbj0JLcLYsneM9fsievUz193f6aMQ5o5fm4Ilx3TUZiX4AUsoyd8CD2SK3NkiLuR255BDIA0Zbgnj2XLyQPiJ1T4fjStpjxKOTzsQsZxpThY9Fvjvoxcs3HAiXjLtZ0TSOX6n4ZLjV3TdJMc4PonwqIb3lAndlTMnuzEPof2dXnpexoVm5c37XQ7fBkoMBJ4ydnW25XKYJbkrueRDSwtJGHjY37dob4jPg0axM5uWbqGocXQ4DyiVm5GhvuYX32RQaOtXXXw8cWK6JcSUnlP1gGLMNZEGeDXOuGWiy4AJ7SH93ZQ4iPgoxdfCuW0qbsLKT2HopcY9dtBIRzr91wnES9lDL49tpuW77LSt5dGA0YLSeWAaZt9bDrduE0gDZQ2yX4SDvAOn4PMcbFRfTqzdZXONmO7ruBHHb1tVFlBFNc4xkoetDO2s7mpiVG6YR4EYMFIG1hBPh7Evhttb34AQzqImSQm1gyL3O7n3p98Kqb9qqIPbN1kuhtW5mIbIioWW2n7MHY7E5mt0'
print(checkPermutation(a, b)) #optional

def permute(str1,str2):
if sorted(str1) == sorted(str2):
return True
else:
return False
str1="hello"
str2='olehl'
a=permute(str1,str2)
print(a

from collections import defaultdict
def permutation(s1,s2):
h = defaultdict(int)
for ch in s1:
h[ch]+=1
for ch in s2:
h[ch]-=1
for key in h.keys():
if h[key]!=0 or len(s1)!= len(s2):
return False
return True
print(permutation("tictac","tactic"))

Related

List comprehension that involves summing values

I was presented with a problem where I had to create an algorithm that takes any given input integer of even length and, based on the input, determines whether the sum of the first n digits is equal to the sum of the last n digits, where each n is the equal to the length of the number divided by two (e.g. 2130 returns True, 3304 returns False).
My solution, which works but is rather unwieldy, was as follows:
def ticket(num):
list_num = [int(x) for x in str(num)]
half_length = int(len(list_num)/2)
for i in range(half_length*2):
first_half = list_num[0:half_length]
second_half = list_num[half_length::]
if sum(first_half) == sum(second_half):
return True
else:
return False
In an effort to improve my understanding of list comprehensions, I've tried to look at ways that I can make this more efficient but am struggling to do so. Any guidance would be much appreciated.
EDIT:
Thank you to jarmod:
Reorganised as so:
def ticket(num):
list_num = [int(x) for x in str(num)]
half_length = int(len(list_num)/2)
return sum(list_num[0:half_length]) == sum(list_num[half_length::])
ticket(1230)
You can remove the unnecessary assignment and loops to create a shorter, more pythonic solution:
def ticket(num):
half_length = int(len(list_num)/2)
first_half = sum(list(num[:half_length]))
second_half = sum(list(num[half_length:]))
return first_half == second_half
It can be further shortened though, which isn't as readable:
def ticket(num):
return sum(list(num[:len(num)//2])) == sum(list(num[len(num)//2:]))

is Valid Golomb Ruler

I want to write a function which can test whether a list of numbers is a valid Golomb ruler or not. I know how to do it in O(n^2) time (using nested for loops) but I am looking for a simple and more optimized way of doing it. I am trying to do it in python and my function takes a list of integers as argument.
A Golomb ruler is defined as a set :
Iff
(Wikipedia)
If this is 3SUM-hard, then no one knows how to do much better than quadratic. I would be amazed if this weren't 3SUM-hard, but the hypothetical reduction looks as though it would be rather technical.
it's my code in python which checks if the list of integers forms a rule of Golomb or not:
def is_golomb_ruler(ruler: list[int]) -> bool:
marks_count = len(ruler)
# check if first element is not 0
if ruler[0] != 0:
return False
# check if contains negative values
for item in ruler:
if item < 0:
return False
# check differences table
tab_diff = []
for i in range(marks_count - 1):
d = ruler[i + 1] - ruler[i]
if tab_diff.__contains__(d):
return False
tab_diff.append(d)
index = marks_count - 2
diff_len = 2
for i in range(index, 0, -1):
for j in range(i):
d = 0
for k in range(diff_len):
d += tab_diff[k + j]
if tab_diff.__contains__(d):
return False
tab_diff.append(d)
diff_len += 1
# case of two identical marks
if tab_diff.__contains__(0):
return False
return True
you find my repo of the complete Golomb problem solved by the metaheuristic method (Simulated annealing and genetic algorithms) here :
GolombProblem_Metaheuristic_Project

Probabilistic Sieve of Eratosthenes

Consider the following algorithm.
function Rand():
return a uniformly random real between 0.0 and 1.0
function Sieve(n):
assert(n >= 2)
for i = 2 to n
X[i] = true
for i = 2 to n
if (X[i])
for j = i+1 to n
if (Rand() < 1/i)
X[j] = false
return X[n]
What is the probability that Sieve(k) returns true as a function of k ?
Let's define a series of random variables recursively:
Let Xk,r denote the indicator variable, taking value 1 iff X[k] == true by the end of the iteration in which the variable i took value r.
In order to have fewer symbols and since it makes more intuitive sense with the code, we'll just write Xk,i which is valid although would have been confusing in the definition since i taking value i is confusing when the first refers to the variable in the loop and the latter to the value of the variable.
Now we note that:
P(Xk,i ~ 0) = P(Xk,i-1 ~ 0) + P(Xk,i-1 ~ 1) * P(Xk-1,i-1 ~ 1) * 1/i
(~ is used in place of = just to make it understandable, since = would otherwise take two separate meanings and looks confusing).
This equality holds by virtue of the fact that either X[k] was false at the end of the i iteration either because it was false at the end of the i-1, or it was true at that point, but in that last iteration X[k-1] was true and so we entered the loop and changed X[k] with probability of 1/i. The events are mutually exclusive, so there is no intersection.
The base of the recursion is simply the fact that P(Xk,1 ~ 1) = 1 and P(X2,i ~ 1) = 1.
Lastly, we note simply that P(X[k] == true) = P(Xk,k-1 ~ 1).
This can be programmed rather easily. Here's a javascript implementation that employs memoisation (you can benchmark if using nested indices is better than string concatenation for the dictionary index, you could also redesign the calculation to maintain the same runtime complexity but not run out of stack size by building bottom-up and not top-down). Naturally the implementation will have a runtime complexity of O(k^2) so it's not practical for arbitrarily large numbers:
function P(k) {
if (k<2 || k!==Math.round(k)) return -1;
var _ = {};
function _P(n,i) {
if(n===2) return 1;
if(i===1) return 1;
var $ = n+'_'+i;
if($ in _) return _[$];
return _[$] = 1-(1-_P(n,i-1) + _P(n,i-1)*_P(n-1,i-1)*1/i);
}
return _P(k,k-1);
}
P(1000); // 0.12274162882390949
More interesting would be how the 1/i probability changes things. I.e. whether or not the probability converges to 0 or to some other value, and if so, how changing the 1/i affects that.
Of course if you ask on mathSE you might get a better answer - this answer is pretty simplistic, I'm sure there is a way to manipulate it to acquire a direct formula.

Programming Interview Question / how to find if any two integers in an array sum to zero?

Not a homework question, but a possible interview question...
Given an array of integers, write an algorithm that will check if the sum of any two is zero.
What is the Big O of this solution?
Looking for non brute force methods
Use a lookup table: Scan through the array, inserting all positive values into the table. If you encounter a negative value of the same magnitude (which you can easily lookup in the table); the sum of them will be zero. The lookup table can be a hashtable to conserve memory.
This solution should be O(N).
Pseudo code:
var table = new HashSet<int>();
var array = // your int array
foreach(int n in array)
{
if ( !table.Contains(n) )
table.Add(n);
if ( table.Contains(n*-1) )
// You found it.;
}
The hashtable solution others have mentioned is usually O(n), but it can also degenerate to O(n^2) in theory.
Here's a Theta(n log n) solution that never degenerates:
Sort the array (optimal quicksort, heap sort, merge sort are all Theta(n log n))
for i = 1, array.len - 1
binary search for -array[i] in i+1, array.len
If your binary search ever returns true, then you can stop the algorithm and you have a solution.
An O(n log n) solution (i.e., the sort) would be to sort all the data values then run a pointer from lowest to highest at the same time you run a pointer from highest to lowest:
def findmatch(array n):
lo = first_index_of(n)
hi = last_index_of(n)
while true:
if lo >= hi: # Catch where pointers have met.
return false
if n[lo] = -n[hi]: # Catch the match.
return true
if sign(n[lo]) = sign(n[hi]): # Catch where pointers are now same sign.
return false
if -n[lo] > n[hi]: # Move relevant pointer.
lo = lo + 1
else:
hi = hi - 1
An O(n) time complexity solution is to maintain an array of all values met:
def findmatch(array n):
maxval = maximum_value_in(n) # This is O(n).
array b = new array(0..maxval) # This is O(1).
zero_all(b) # This is O(n).
for i in index(n): # This is O(n).
if n[i] = 0:
if b[0] = 1:
return true
b[0] = 1
nextfor
if n[i] < 0:
if -n[i] <= maxval:
if b[-n[i]] = 1:
return true;
b[-n[i]] = -1
nextfor
if b[n[i]] = -1:
return true;
b[n[i]] = 1
This works by simply maintaining a sign for a given magnitude, every possible magnitude between 0 and the maximum value.
So, if at any point we find -12, we set b[12] to -1. Then later, if we find 12, we know we have a pair. Same for finding the positive first except we set the sign to 1. If we find two -12's in a row, that still sets b[12] to -1, waiting for a 12 to offset it.
The only special cases in this code are:
0 is treated specially since we need to detect it despite its somewhat strange properties in this algorithm (I treat it specially so as to not complicate the positive and negative cases).
low negative values whose magnitude is higher than the highest positive value can be safely ignored since no match is possible.
As with most tricky "minimise-time-complexity" algorithms, this one has a trade-off in that it may have a higher space complexity (such as when there's only one element in the array that happens to be positive two billion).
In that case, you would probably revert to the sorting O(n log n) solution but, if you know the limits up front (say if you're restricting the integers to the range [-100,100]), this can be a powerful optimisation.
In retrospect, perhaps a cleaner-looking solution may have been:
def findmatch(array num):
# Array empty means no match possible.
if num.size = 0:
return false
# Find biggest value, no match possible if empty.
max_positive = num[0]
for i = 1 to num.size - 1:
if num[i] > max_positive:
max_positive = num[i]
if max_positive < 0:
return false
# Create and init array of positives.
array found = new array[max_positive+1]
for i = 1 to found.size - 1:
found[i] = false
zero_found = false
# Check every value.
for i = 0 to num.size - 1:
# More than one zero means match is found.
if num[i] = 0:
if zero_found:
return true
zero_found = true
# Otherwise store fact that you found positive.
if num[i] > 0:
found[num[i]] = true
# Check every value again.
for i = 0 to num.size - 1:
# If negative and within positive range and positive was found, it's a match.
if num[i] < 0 and -num[i] <= max_positive:
if found[-num[i]]:
return true
# No matches found, return false.
return false
This makes one full pass and a partial pass (or full on no match) whereas the original made the partial pass only but I think it's easier to read and only needs one bit per number (positive found or not found) rather than two (none, positive or negative found). In any case, it's still very much O(n) time complexity.
I think IVlad's answer is probably what you're after, but here's a slightly more off the wall approach.
If the integers are likely to be small and memory is not a constraint, then you can use a BitArray collection. This is a .NET class in System.Collections, though Microsoft's C++ has a bitset equivalent.
The BitArray class allocates a lump of memory, and fills it with zeroes. You can then 'get' and 'set' bits at a designated index, so you could call myBitArray.Set(18, true), which would set the bit at index 18 in the memory block (which then reads something like 00000000, 00000000, 00100000). The operation to set a bit is an O(1) operation.
So, assuming a 32 bit integer scope, and 1Gb of spare memory, you could do the following approach:
BitArray myPositives = new BitArray(int.MaxValue);
BitArray myNegatives = new BitArray(int.MaxValue);
bool pairIsFound = false;
for each (int testValue in arrayOfIntegers)
{
if (testValue < 0)
{
// -ve number - have we seen the +ve yet?
if (myPositives.get(-testValue))
{
pairIsFound = true;
break;
}
// Not seen the +ve, so log that we've seen the -ve.
myNegatives.set(-testValue, true);
}
else
{
// +ve number (inc. zero). Have we seen the -ve yet?
if (myNegatives.get(testValue))
{
pairIsFound = true;
break;
}
// Not seen the -ve, so log that we've seen the +ve.
myPositives.set(testValue, true);
if (testValue == 0)
{
myNegatives.set(0, true);
}
}
}
// query setting of pairIsFound to see if a pair totals to zero.
Now I'm no statistician, but I think this is an O(n) algorithm. There is no sorting required, and the longest duration scenario is when no pairs exist and the whole integer array is iterated through.
Well - it's different, but I think it's the fastest solution posted so far.
Comments?
Maybe stick each number in a hash table, and if you see a negative one check for a collision? O(n). Are you sure the question isn't to find if ANY sum of elements in the array is equal to 0?
Given a sorted array you can find number pairs (-n and +n) by using two pointers:
the first pointer moves forward (over the negative numbers),
the second pointer moves backwards (over the positive numbers),
depending on the values the pointers point at you move one of the pointers (the one where the absolute value is larger)
you stop as soon as the pointers meet or one passed 0
same values (one negative, one possitive or both null) are a match.
Now, this is O(n), but sorting (if neccessary) is O(n*log(n)).
EDIT: example code (C#)
// sorted array
var numbers = new[]
{
-5, -3, -1, 0, 0, 0, 1, 2, 4, 5, 7, 10 , 12
};
var npointer = 0; // pointer to negative numbers
var ppointer = numbers.Length - 1; // pointer to positive numbers
while( npointer < ppointer )
{
var nnumber = numbers[npointer];
var pnumber = numbers[ppointer];
// each pointer scans only its number range (neg or pos)
if( nnumber > 0 || pnumber < 0 )
{
break;
}
// Do we have a match?
if( nnumber + pnumber == 0 )
{
Debug.WriteLine( nnumber + " + " + pnumber );
}
// Adjust one pointer
if( -nnumber > pnumber )
{
npointer++;
}
else
{
ppointer--;
}
}
Interesting: we have 0, 0, 0 in the array. The algorithm will output two pairs. But in fact there are three pairs ... we need more specification what exactly should be output.
Here's a nice mathematical way to do it: Keep in mind all prime numbers (i.e. construct an array prime[0 .. max(array)], where n is the length of the input array, so that prime[i] stands for the i-th prime.
counter = 1
for i in inputarray:
if (i >= 0):
counter = counter * prime[i]
for i in inputarray:
if (i <= 0):
if (counter % prime[-i] == 0):
return "found"
return "not found"
However, the problem when it comes to implementation is that storing/multiplying prime numbers is in a traditional model just O(1), but if the array (i.e. n) is large enough, this model is inapropriate.
However, it is a theoretic algorithm that does the job.
Here's a slight variation on IVlad's solution which I think is conceptually simpler, and also n log n but with fewer comparisons. The general idea is to start on both ends of the sorted array, and march the indices towards each other. At each step, only move the index whose array value is further from 0 -- in only Theta(n) comparisons, you'll know the answer.
sort the array (n log n)
loop, starting with i=0, j=n-1
if a[i] == -a[j], then stop:
if a[i] != 0 or i != j, report success, else failure
if i >= j, then stop: report failure
if abs(a[i]) > abs(a[j]) then i++ else j--
(Yeah, probably a bunch of corner cases in here I didn't think about. You can thank that pint of homebrew for that.)
e.g.,
[ -4, -3, -1, 0, 1, 2 ] notes:
^i ^j a[i]!=a[j], i<j, abs(a[i])>abs(a[j])
^i ^j a[i]!=a[j], i<j, abs(a[i])>abs(a[j])
^i ^j a[i]!=a[j], i<j, abs(a[i])<abs(a[j])
^i ^j a[i]==a[j] -> done
The sum of two integers can only be zero if one is the negative of the other, like 7 and -7, or 2 and -2.

Sorting a string that could contain either a time or distance

I have implemented a sorting algorithm for a custom string that represents either time or distance data for track & field events. Below is the format
'10:03.00 - Either ten minutes and three seconds or 10 feet, three inches
The result of the sort is that for field events, the longest throw or jump would be the first element while for running events, the fastest time would be first. Below is the code I am currently using for field events. I didn't post the running_event_sort since it is the same logic with the greater than/less than swapped. While it works, it just seems overly complex and needs to be refactored. I am open to suggestions. Any help would be great.
event_participants.sort!{ |a, b| Participant.field_event_sort(a, b) }
class Participant
def self.field_event_sort(a, b)
a_parts = a.time_distance.scan(/'([\d]*):([\d]*).([\d]*)/)
b_parts = b.time_distance.scan(/'([\d]*):([\d]*).([\d]*)/)
if(a_parts.empty? || b_parts.empty?)
0
elsif a_parts[0][0] == b_parts[0][0]
if a_parts[0][1] == b_parts[0][1]
if a_parts[0][2] > b_parts[0][2]
-1
elsif a_parts[0][2] < b_parts[0][2]
1
else
0
end
elsif a_parts[0][1] > b_parts[0][1]
-1
else
1
end
elsif a_parts[0][0] > b_parts[0][0]
-1
else
1
end
end
end
This is a situation where #sort_by could simplify your code enormously:
event_participants = event_participants.sort_by do |s|
if s =~ /'(\d+):(\d+)\.(\d+)/
[ $1, $2, $3 ].map { |digits| digits.to_i }
else
[]
end
end.reverse
Here, I parse the relevant times into an array of integers, and use those as a sorting key for the data. Array comparisons are done entry by entry, with the first being the most significant, so this works well.
One thing you don't do is convert the digits to integers, which you most likely want to do. Otherwise, you'll have issues with "100" < "2" #=> true. This is why I added the #map step.
Also, in your regex, the square brackets around \d are unnecessary, though you do want to escape the period so it doesn't match all characters.
One way the code I gave doesn't match the code you gave is in the situation where a line doesn't contain any distances. Your code will compare them as equal to surrounding lines (which may get you into trouble if the sorting algorithm assumes equality is transitive. That is a == b, b == c implies a ==c, which is not the case for your code : for example a = "'10:00.1", b = "frog", c="'9:99:9").
#sort_by sorts in ascending order, so the call to #reverse will change it into descending order. #sort_by also has the advantage of only parsing out the comparison values once, whereas your algorithm will have to parse each line for every comparison.
Instead of implementing the sort like this, maybe you should have a TrackTime and FieldDistance models. They don't necessarily need to be persisted - the Participant
model could create them from time_distance when it is loaded.
You're probably going to want to be able to get the difference between two values, validate values as well sort values in the future. The model would make it easy to add these features. Also it would make unit testing a lot easier.
I'd also separate time and distance into two separate fields. Having dual purpose columns in the database only causes pain down the line in my experience.
I don't know ruby but here's some c-like pseudo code that refactors this a bit.
/// In c, I would probably shorten this with the ? operator.
int compareIntegers(a, b) {
int result = 0;
if (a < b) {
result = -1;
} else if (a > b) {
result = 1;
}
return result;
}
int compareValues(a, b) {
int result = 0;
if (!/* check for empty*/) {
int majorA = /* part before first colon */
int majorB = /* part before first colon */
int minorA = /* part after first colon */
int minorB = /* part after first colon */
/// In c, I would probably shorten this with the ? operator.
result = compareIntegers(majorA, majorB);
if (result == 0) {
result = compareIntegers(minorA, minorB);
}
}
return result;
}
Your routine looks fine but you could just remove the ''', ':' and '.' and treat the result as a numeric string. In other words the 10' 5" would become 1005 and 10' 4" would be 1004. 1005 is clearly more than 1004.
Since the higer order elements are on the left, it will sort naturally. This also works with time for the same reasons.
I agree that converting to integers will make is simpler. Also note that for integers
if a > b
1
elsif a < b
-1
else
0
can be simplified to a<=>b. To get the reverse use -(a <=> b).
In this scenario:
Since you know you are working with feet, inches, and (whatever your third unit of measure is), why not just create a total sum of the two values you are comparing?
So after these two lines:
a_parts = a.time_distance.scan(/'([\d]):([\d]).([\d])/)
b_parts = b.time_distance.scan(/'([\d]):([\d]).([\d])/)
Generate the total distance for a_parts and b_parts:
totalDistanceA = a_parts[0][0].to_i * 12 + a_parts[0][1].to_i + b_parts[0][2].to_i * (whatever your third unit of measure factor against the size of an inch)
totalDistanceB = b_parts[0][0].to_i * 12 + b_parts[0][1].to_i + b_parts[0][2].to_i * (whatever your third unit of measure factor against the size of an inch)
Then return the comparison of these two values:
totalDistanceA <=> totalDistanceB
Note that you should keep the validation you are already making that checks if a_parts and b_parts are empty or not:
a_parts.empty? || b_parts.empty?
For doing the sorting by time scenario, do the exact same thing except with different factors (for example, 60 seconds to a min).
Why not do
a_val = a_parts[0][0].to_i * 10000 + a_parts[0][1].to_i * 100 + a_parts[0][2].to_i
b_val = b_parts[0][0].to_i * 10000 + b_parts[0][1].to_i * 100 + b_parts[0][2].to_i
a_val <=> b_val
The numbers won't make sense to subtract, etc but they should sort ok.
You may want to check [1] and [2] are always two digits in the regexp.

Resources