Efficiently randomly sampling List while maintaining order

I would like to take random samples from very large lists while maintaining the order. I wrote the script below, but it requires .map(idx => ls(idx)) which is very wasteful. I can see a way of making this more efficient with a helper function and tail recursion, but I feel that there must be a simpler solution that I'm missing.
Is there a clean and more efficient way of doing this?
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random
    .shuffle(ls.indices.toList)
    .take(sampleSize)
    .sorted
    .map(idx => ls(idx))
}
val sampleList = List("t","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n")
// imagine the list is much longer though
sampledList(sampleList, 5) // List(e, u, i, r, n)
EDIT:
It appears I was unclear: I am referring to maintaining the order of the values, not the original List collection.

If by "maintaining the order of the values" you mean keeping the elements of the sample in the same order as in the ls list, then a small modification to your original solution greatly improves performance:
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random.shuffle(ls.zipWithIndex).take(sampleSize).sortBy(_._2).map(_._1)
}
This solution has a complexity of O(n + k*log(k)), where n is the list's size, and k is the sample size, while your solution is O(n + k * log(k) + n*k).

Here is a (more complex) alternative that has O(n) complexity. You can't get any better in terms of complexity (though you could get better performance by using another collection, in particular a collection that has a constant-time size implementation). I did a quick benchmark which indicated that the speedup is very substantial.
import scala.util.Random
import scala.annotation.tailrec
def sampledList[T](ls: List[T], sampleSize: Int) = {
  @tailrec
  def rec(list: List[T], listSize: Int, sample: List[T], sampleSize: Int): List[T] = {
    require(listSize >= sampleSize,
      s"listSize must be >= sampleSize, but got listSize=$listSize and sampleSize=$sampleSize"
    )
    list match {
      case hd :: tl =>
        // Keep hd with probability sampleSize / listSize
        if (Random.nextInt(listSize) < sampleSize)
          rec(tl, listSize-1, hd :: sample, sampleSize-1)
        else rec(tl, listSize-1, sample, sampleSize)
      case Nil =>
        require(sampleSize == 0, // Should never happen
          s"sampleSize must be zero at the end of processing, but got $sampleSize"
        )
        sample
    }
  }
  rec(ls, ls.size, Nil, sampleSize).reverse
}
The above implementation simply iterates over the list and keeps (or drops) the current element with a probability designed to give each element the same chance of being selected. My logic may have a flaw, but at first blush it seems sound to me.
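For readers who want to experiment with the idea outside Scala, here is a minimal Python sketch of the same selection-sampling scheme. The function name `sample_preserving_order` and the `rng` parameter are assumptions made for illustration; they are not part of the answer above.

```python
import random

def sample_preserving_order(items, k, rng=random):
    """Return k items chosen uniformly at random, in their original order."""
    n = len(items)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(items)")
    sample = []
    needed = k
    for seen, item in enumerate(items):
        remaining = n - seen  # items not yet examined, including this one
        # Keep the item with probability needed / remaining.
        if rng.randrange(remaining) < needed:
            sample.append(item)
            needed -= 1
    return sample

print(sample_preserving_order(list("the quick brown"), 5))
```

Because an item is always kept once `needed == remaining` and never kept once `needed == 0`, the result always has exactly `k` elements.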

Here's another O(n) implementation that should have a uniform probability for each element:
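import scala.collection.mutable.ListBuffer
import scala.util.Random
implicit class SampleSeqOps[T](s: Seq[T]) {
  def sample(n: Int, r: Random = Random): Seq[T] = {
    assert(n >= 0 && n <= s.length)
    val res = ListBuffer[T]()
    val length = s.length
    var samplesNeeded = n
    for { (e, i) <- s.zipWithIndex } {
      // Probability of keeping the current element
      val p = samplesNeeded.toDouble / (length - i)
      // Strict comparison: nextDouble() can return 0.0, which must not select when p == 0
      if (p > r.nextDouble()) {
        res += e
        samplesNeeded -= 1
      }
    }
    res.toSeq
  }
}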
I'm using it frequently with collections > 100'000 elements and the performance seems reasonable.
It's probably the same idea as in Régis Jean-Gilles's answer but I think the imperative solution is slightly more readable in this case.

Perhaps I don't quite understand, but since Lists are immutable you don't really need to worry about 'maintaining the order' since the original List is never touched. Wouldn't the following suffice?
def sampledList[T](ls: List[T], sampleSize: Int) =
  Random.shuffle(ls).take(sampleSize)

While my previous answer has linear complexity, it does have the drawback of requiring two passes, the first one corresponding to the need to compute the length before doing anything else. Besides affecting the running time, we might want to sample a very large collection for which it is not practical nor efficient to load the whole collection in memory at once, in which case we'd like to be able to work with a simple iterator.
As it happens, we don't need to invent anything to fix this. There is a simple and clever algorithm called reservoir sampling which does exactly this (building a sample as we iterate over a collection, all in one pass). With a minor modification we can also preserve the order, as required:
import scala.util.Random
def sampledList[T](ls: TraversableOnce[T], sampleSize: Int, preserveOrder: Boolean = false, rng: Random = new Random): Iterable[T] = {
  val result = collection.mutable.Buffer.empty[(T, Int)]
  for ((item, n) <- ls.toIterator.zipWithIndex) {
    if (n < sampleSize) result += (item -> n)
    else {
      // n + 1 items have been seen so far, so draw from [0, n]
      val s = rng.nextInt(n + 1)
      if (s < sampleSize) {
        result(s) = (item -> n)
      }
    }
  }
  if (preserveOrder) {
    result.sortBy(_._2).map(_._1)
  }
  else result.map(_._1)
}
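As a rough illustration of reservoir sampling with the order-preserving tweak, here is a minimal Python sketch. It assumes the standard "Algorithm R" formulation, in which the item at zero-based position n replaces a reservoir slot with probability k / (n + 1); the names `reservoir_sample`, `preserve_order` and `rng` are chosen for this example only.

```python
import random

def reservoir_sample(iterable, k, preserve_order=False, rng=random):
    """Pick k items uniformly at random from an iterable in a single pass."""
    reservoir = []  # holds (position, item) pairs
    for n, item in enumerate(iterable):
        if n < k:
            reservoir.append((n, item))
        else:
            # n + 1 items have been seen; pick a slot among all of them.
            slot = rng.randrange(n + 1)
            if slot < k:
                reservoir[slot] = (n, item)
    if preserve_order:
        # Restore the original order using the recorded positions.
        reservoir.sort(key=lambda pair: pair[0])
    return [item for _, item in reservoir]

# Works on a plain iterator, so the input never has to fit in memory.
print(reservoir_sample(iter(range(1000)), 5, preserve_order=True))
```

The sort at the end costs O(k log k), which is negligible when the sample is much smaller than the input.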

Related

Performance issue in Scala code - O(nlgn) faster than O(n)

I am just getting started with scala. I am trying to learn it by solving easy problems on leetcode. Here's my first (successful) attempt at LC #977:
def sortedSquares(A: Array[Int]): Array[Int] = {
  A.map(math.abs).map(x => x * x).sorted
}
Because of the sorting, I expected this to run O(NlgN) time, N being the size of the input array. But I know that there is a two-pointers solution to this problem which has a run-time complexity of O(N). So I went ahead and implemented that in my noob Scala:
def sortedSquaresHelper(A: Array[Int], result: Vector[Int], left: Int, right: Int): Array[Int] = {
  if (left < 0 && right >= A.size) {
    result.toArray
  } else if (left < 0) {
    sortedSquaresHelper(A, result :+ A(right) * A(right), left, right + 1)
  } else if (right >= A.size) {
    sortedSquaresHelper(A, result :+ A(left) * A(left), left - 1, right)
  } else {
    if (math.abs(A(left)) < math.abs(A(right))) {
      sortedSquaresHelper(A, result :+ A(left) * A(left), left - 1, right)
    } else {
      sortedSquaresHelper(A, result :+ A(right) * A(right), left, right + 1)
    }
  }
}
def sortedSquares(A: Array[Int]): Array[Int] = {
  val min_idx = A.zipWithIndex.reduceLeft((x, y) => if (math.abs(x._1) < math.abs(y._1)) x else y)._2
  val result: Vector[Int] = Vector(A(min_idx) * A(min_idx))
  sortedSquaresHelper(A, result, min_idx - 1, min_idx + 1)
}
Turns out, the first version ran faster than the second one. Now I am quite confused about what I might have gotten wrong. Is there something about the recursion in second version that is causing high overhead?
I'd also like some suggestion about what is the idiomatic scala-way of writing the second solution? This is my first serious foray into functional programming, and I am struggling to write the function in a tail-recursive manner.

Vectors are significantly slower than Arrays overall. In particular:
"Vector item-by-item construction is 5-15x slower than List or mutable.Buffer construction, and even 40x slower than pre-allocating an Array in the cases where that's possible."
With map and sorted, the length of the resulting array is known in advance, so it can be preallocated.
As that page mentions, Vector#:+ is really logarithmic, so you end up with O(n log n) in both cases. Even if you didn't, O(n log n) vs O(n) technically only says something about how performance changes when input grows indefinitely. It's mostly aligned with what's faster for small inputs as well, but only mostly.
"I'd also like some suggestion about what is the idiomatic scala-way of writing the second solution?"
One way is to construct a List in the reverse order, and then reverse it at the end. Or use an ArrayBuffer: even though it's mutable, you can effectively ignore that because you don't keep a reference to the older state anywhere.
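To make the preallocation point concrete, here is a small Python sketch of a two-pointer solution that fills a preallocated result from the back. Note that this is a different variant from the question's code: it walks inward from both ends of the sorted input rather than outward from the smallest absolute value. The name `sorted_squares` is made up for this example.

```python
def sorted_squares(values):
    """Square a list sorted in non-decreasing order; return the squares sorted."""
    result = [0] * len(values)  # preallocated, filled from the largest square down
    left, right = 0, len(values) - 1
    for pos in range(len(values) - 1, -1, -1):
        # The largest remaining square is always at one of the two ends.
        if abs(values[left]) > abs(values[right]):
            result[pos] = values[left] ** 2
            left += 1
        else:
            result[pos] = values[right] ** 2
            right -= 1
    return result

print(sorted_squares([-4, -1, 0, 3, 10]))  # [0, 1, 9, 16, 100]
```

Every element is visited once and every write is a constant-time index assignment, so the whole pass is O(n) with no append overhead.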

Generate random numbers where the difference is always positive

I am trying to generate some numbers such that the difference is always positive. The user inputs the number of digits and the number of rows they want. For example, 3 digits and 3 rows will produce
971
888
121
I want to make sure the difference of those is always positive. Is there some kind of algorithm I can use? Right now I just have my program create numbers, then subtract them, and if the result comes out negative, it does it again... and again. It is very slow.
I was thinking of first generating the difference and then adding to it until the desired number of rows is reached. But I ran into problems if it generates a very large number.
Here is the code I use to generate a random number with X digits, just in case it matters:
private fun createRandomNumber(digits: Int): Int {
    val numberArray = IntArray(digits)
    for (number in 0 until numberArray.size){
        numberArray[number] = 9
    }
    val maxnumber:Int = numberArray.joinToString("").toInt()
    numberArray[0] = 1
    for (number in 1 until numberArray.size){
        numberArray[number] = 0
    }
    val minnumber:Int = numberArray.joinToString("").toInt()
    return (minnumber..maxnumber).random()
}
Based on the suggestion by Jeff Bowman, I began by sorting an array with all the numbers that are generated, and that speeds everything up to an acceptable level!

Even though @forpas's solution is fine, it still runs in O(n log n) because of the final sorting. My solution generates increasing intervals in which to draw the random numbers (to keep the distribution even), and then maps each interval to a random number in that range, which avoids the need to sort the final list. The complexity is O(n).
I chose to use Stream to avoid mutation or explicit recursion, but it is not mandatory.
Example
import java.lang.Math.pow
import java.util.stream.Collectors
import java.util.stream.Stream
import kotlin.random.Random

fun main(args: Array<String>) {
    val count = 20L
    val digits = 5
    val min = pow(10.0, digits.toDouble() - 1).toLong()
    val max = min*10 - 1
    // Width of each interval; one number is drawn per interval
    val gap = (max - min)/count + 1
    val numbers =
        Stream.iterate(Pair(min, min + gap)) { (_, prev) -> Pair(prev, prev + gap) }
            .map { (start, end) -> Random.nextLong(start, end) }
            .limit(count)
            .collect(Collectors.toList())
    numbers.forEach(::println)
}
Output
11298
16284
20841
26084
31960
35538
37208
45325
46970
52918
57514
59769
67689
70135
75338
78075
84561
86652
91938
99931
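The interval idea can also be sketched without streams. The Python version below is an illustration under stated assumptions, not a translation of the code above: it splits the range of `digits`-digit numbers into `count` consecutive, non-empty intervals using integer arithmetic, so the last interval ends exactly at the largest valid number. The name `increasing_random_numbers` is hypothetical.

```python
import random

def increasing_random_numbers(count, digits, rng=random):
    """Return `count` strictly increasing random numbers with `digits` digits."""
    low = 10 ** (digits - 1)
    size = 9 * low  # how many numbers have exactly `digits` digits
    if not 0 < count <= size:
        raise ValueError("count must be between 1 and the size of the range")
    numbers = []
    for i in range(count):
        # Interval i covers [start, end]; consecutive intervals never overlap.
        start = low + i * size // count
        end = low + (i + 1) * size // count - 1
        numbers.append(rng.randint(start, end))
    return numbers

for n in increasing_random_numbers(5, 3):
    print(n)
```

One trade-off worth keeping in mind: because exactly one number is drawn per interval, the output is always evenly spread, so tightly clustered sequences can never occur.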

I would use this function to create a random number with a certain number of digits:
fun createRandomNumber(digits: Int) = (10f.pow(digits - 1).toInt() until 10f.pow(digits).toInt()).shuffled().first()
You will need this import:
import kotlin.math.pow
And then with this:
fun main(args: Array<String>) {
    print("how many numbers?: ")
    val numbers = readLine()!!.toInt()
    print("how many digits?: ")
    val digits = readLine()!!.toInt()
    val set = mutableSetOf<Int>()
    do {
        set.add(createRandomNumber(digits))
    } while (set.size < numbers)
    val array = set.toTypedArray().sortedArrayDescending()
    array.forEach { println(it) }
}
You get the user's input and create a set of random numbers.
With toTypedArray().sortedArrayDescending() you get the array.

Converting Map[Int, Double] to breeze.linalg.SparseVector

I am new to the Breeze library and I would like to convert a Map[Int, Double] to breeze.linalg.SparseVector, and ideally without having to specify a fixed length of the SparseVector. I managed to achieve the goal with this clumsy code:
import breeze.linalg.{SparseVector => SBV}
val mySparseVector: SBV[Double] = new SBV[Double](Array.empty, Array.empty, 10000)
myMap foreach { e => mySparseVector(e._1) = e._2 }
Not only do I have to specify a fixed length of 10,000, but the code runs in O(n), where n is the size of the map. Is there a better way?

You can use VectorBuilder. There's a (sadly) undocumented feature where if you tell it the length is -1, it will happily let you add things. You will have to (annoyingly) set the length before you construct the result...
import breeze.linalg.VectorBuilder
val vb = new VectorBuilder[Double](length = -1)
myMap foreach { e => vb.add(e._1, e._2) }
vb.length = myMap.keys.max + 1
vb.toSparseVector
(Your code is actually n^2 because SparseVector has to be sorted so you're repeatedly moving elements around in an array. VectorBuilder gives you n log n, which is the best you can do.)

Given a number, produce another random number that is the same every time and distinct from all other results

Basically, I would like help designing an algorithm that takes a given number, and returns a random number that is unrelated to the first number. The stipulations being that a) the given output number will always be the same for a similar input number, and b) within a certain range (ex. 1-100), all output numbers are distinct. ie., no two different input numbers under 100 will give the same output number.
I know it's easy to do by creating an ordered list of numbers, shuffling them randomly, and then returning the input's index. But I want to know if it can be done without any caching at all. Perhaps with some kind of hashing algorithm? Mostly the reason for this is that if the range of possible outputs were much larger, say 10000000000, then it would be ludicrous to generate an entire range of numbers and then shuffle them randomly, if you were only going to get a few results out of it.
Doesn't matter what language it's done in, I just want to know if it's possible. I've been thinking about this problem for a long time and I can't think of a solution besides the one I've already come up with.
Edit: I just had another idea; it would be interesting to have another algorithm that returned the reverse of the first one. Whether or not that's possible would be an interesting challenge to explore.

This sounds like a non-repeating random number generator. There are several possible approaches to this.
As described in this article, we can generate them by selecting a prime number p that satisfies p % 4 = 3 and is large enough (greater than the maximum value in the output range), and generating them this way:
int randomNumberUnique(int p, int x) {
    // p must be a prime with p % 4 == 3, and x must be in [0, p)
    long long residue = ((long long) x * x) % p;
    if (x * 2 < p)
        return (int) residue;
    else
        return (int) (p - residue);
}
This algorithm will cover all values in [0 , p) for an input in range [0 , p).
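The claim that every value is hit exactly once can be checked with a short Python sketch. The function name `permute_qr` and the choice p = 103 (a prime with 103 % 4 == 3) are assumptions made for this example.

```python
def permute_qr(x, p):
    """Map x in [0, p) to a unique value in [0, p).

    Assumes p is a prime with p % 4 == 3.
    """
    residue = (x * x) % p
    # The lower half maps to quadratic residues, the upper half to their negations.
    return residue if 2 * x < p else p - residue

p = 103
outputs = [permute_qr(x, p) for x in range(p)]
# Every value in [0, p) appears exactly once.
print(sorted(outputs) == list(range(p)))  # True
```

The reason this works is that for such primes the negation of a quadratic residue is never itself a quadratic residue, so the two halves cannot collide.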

Here's an example in C#:
private void DoIt()
{
    const long m = 101;
    const long x = 387420489; // must be coprime to m
    var multInv = MultiplicativeInverse(x, m);
    var nums = new HashSet<long>();
    for (long i = 0; i < 100; ++i)
    {
        var encoded = i*x%m;
        var decoded = encoded*multInv%m;
        Console.WriteLine("{0} => {1} => {2}", i, encoded, decoded);
        if (!nums.Add(encoded))
        {
            Console.WriteLine("Duplicate");
        }
    }
}
private long MultiplicativeInverse(long x, long modulus)
{
    return ExtendedEuclideanDivision(x, modulus).Item1%modulus;
}
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
    if (a < 0)
    {
        var result = ExtendedEuclideanDivision(-a, b);
        return Tuple.Create(-result.Item1, result.Item2);
    }
    if (b < 0)
    {
        var result = ExtendedEuclideanDivision(a, -b);
        return Tuple.Create(result.Item1, -result.Item2);
    }
    if (b == 0)
    {
        return Tuple.Create(1L, 0L);
    }
    var q = a/b;
    var r = a%b;
    var rslt = ExtendedEuclideanDivision(b, r);
    var s = rslt.Item1;
    var t = rslt.Item2;
    return Tuple.Create(t, s - q*t);
}
That generates numbers in the range 0-100, from inputs in the range 0-99 (the loop bound). Each input results in a unique output.
It also shows how to reverse the process, using the multiplicative inverse.
You can extend the range by increasing the value of m. x must be coprime with m.
Code cribbed from Eric Lippert's article, A practical use of multiplicative inverses, and a few of the previous articles in that series.
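For a compact view of the same encode/decode round trip, here is a hedged Python sketch. It reuses the multiplier 387420489 and modulus 101 from the C# example; the helper names `multiplicative_inverse` and `make_codec` are made up for illustration, and the inverse is computed with an iterative extended Euclidean algorithm rather than the recursive one shown above.

```python
def multiplicative_inverse(x, modulus):
    """Return y such that (x * y) % modulus == 1, via the extended Euclidean algorithm."""
    old_r, r = x % modulus, modulus
    old_s, s = 1, 0
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        raise ValueError("x and modulus must be coprime")
    return old_s % modulus  # Python's % keeps the result non-negative

def make_codec(multiplier, modulus):
    """Build an (encode, decode) pair that permutes the range [0, modulus)."""
    inverse = multiplicative_inverse(multiplier, modulus)
    def encode(i):
        return (i * multiplier) % modulus
    def decode(e):
        return (e * inverse) % modulus
    return encode, decode

encode, decode = make_codec(387420489, 101)
for i in range(5):
    print(i, "=>", encode(i), "=>", decode(encode(i)))
```

Since multiplication by a value coprime to the modulus is a bijection on [0, modulus), distinct inputs always give distinct outputs, and decoding recovers the input.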

You cannot have completely unrelated numbers (particularly if you want the reverse as well).
There is a concept of the modular inverse of a number, but this would work only if the range bound is a prime, e.g. 100 will not work, you would need 101 (a prime). This can provide you with a pseudo-random number if you want.
Here is the concept of modulo inverse:
If there are two numbers a and b, such that
(a * b) % p = 1
where p is any number, then
a and b are modular inverses of each other.
For this to be true, if we have to find the modular inverse of a wrt a number p, then a and p must be co-prime, ie. gcd(a,p) = 1
So, for all numbers in a range to have modular inverses, the range bound must be a prime number.
A few outputs for range bound 101 will be:
1 == 1
2 == 51
3 == 34
4 == 76
etc.
EDIT:
Hey... actually, you know, you can combine the modulo inverse approach with the method defined by @Paul. Since every pair will be unique and all numbers will be covered, your random number can be:
random(k) = randomNumberUnique(ModuloInverse(k), p) // this is Paul's function

Checking if two strings are permutations of each other in Python

I'm checking if two strings a and b are permutations of each other, and I'm wondering what the ideal way to do this is in Python. From the Zen of Python, "There should be one -- and preferably only one -- obvious way to do it," but I see there are at least two ways:
sorted(a) == sorted(b)
and
all(a.count(char) == b.count(char) for char in a)
but the first one is slower when (for example) the first char of a is nowhere in b, and the second is slower when they are actually permutations.
Is there any better (either in the sense of more Pythonic, or in the sense of faster on average) way to do it? Or should I just choose from these two depending on which situation I expect to be most common?

Here is a way which is O(n), asymptotically better than the two ways you suggest.
import collections
def same_permutation(a, b):
    d = collections.defaultdict(int)
    for x in a:
        d[x] += 1
    for x in b:
        d[x] -= 1
    # All counts are zero exactly when a and b contain the same items
    return not any(d.values())
# same_permutation([1,2,3],[2,3,1])
# -> True
# same_permutation([1,2,3],[2,3,1,1])
# -> False

"but the first one is slower when (for example) the first char of a is nowhere in b".
This kind of degenerate-case performance analysis is not a good idea. It's a rat-hole of lost time thinking up all kinds of obscure special cases.
Only do the O-style "overall" analysis.
Overall, the sorts are O(n log n).
The a.count(char) for char in a solution is O(n^2). Each count pass is a full examination of the string.
If some obscure special case happens to be faster -- or slower, that's possibly interesting. But it only matters when you know the frequency of your obscure special cases. When analyzing sort algorithms, it's important to note that a fair number of sorts involve data that's already in the proper order (either by luck or by a clever design), so sort performance on pre-sorted data matters.
In your obscure special case ("the first char of a is nowhere in b") is this frequent enough to matter? If it's just a special case you thought of, set it aside. If it's a fact about your data, then consider it.

Heuristically, you're probably better off splitting them based on string size.
Pseudocode:
returnvalue = false
if len(a) == len(b)
    if len(a) < threshold
        returnvalue = (sorted(a) == sorted(b))
    else
        returnvalue = naminsmethod(a, b)
return returnvalue
If performance is critical, and string size can be large or small then this is what I'd do.
It's pretty common to split things like this based on input size or type. Algorithms have different strengths or weaknesses and it would be foolish to use one where another would be better... In this case Namin's method is O(n), but has a larger constant factor than the O(n log n) sorted method.

I think the first one is the "obvious" way. It is shorter, clearer, and likely to be faster in many cases because Python's built-in sort is highly optimized.

Your second example won't actually work:
all(a.count(char) == b.count(char) for char in a)
will only work if b does not contain extra characters not in a. It also does duplicate work if the characters in string a repeat.
If you want to know whether two strings are permutations of the same unique characters, just do:
set(a) == set(b)
To correct your second example:
all(a.count(char) == b.count(char) for char in set(a) | set(b))
set() objects overload the bitwise OR operator so that it will evaluate to the union of both sets. This will make sure that you will loop over all the characters of both strings once for each character only.
That said, the sorted() method is much simpler and more intuitive, and would be what I would use.
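A short Python sketch makes the difference between the two counting variants visible. The function names `count_check_original` and `count_check_fixed` are made up for this example.

```python
def count_check_original(a, b):
    # Only inspects characters that occur in a.
    return all(a.count(ch) == b.count(ch) for ch in a)

def count_check_fixed(a, b):
    # Inspects every distinct character of both strings once.
    return all(a.count(ch) == b.count(ch) for ch in set(a) | set(b))

print(count_check_original("ab", "abc"))  # True, although "abc" has an extra "c"
print(count_check_fixed("ab", "abc"))     # False
print(count_check_fixed("abc", "cab"))    # True
```

Both variants still call `count` once per character examined, so they remain quadratic in the worst case.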

Here are some timed executions on very small strings, using two different methods:
1. sorting
2. counting (specifically the original method by @namin).
a, b, c = 'confused', 'unfocused', 'foncused'
sort_method = lambda x,y: sorted(x) == sorted(y)
def count_method(a, b):
    d = {}
    for x in a:
        d[x] = d.get(x, 0) + 1
    for x in b:
        d[x] = d.get(x, 0) - 1
    for v in d.values():
        if v != 0:
            return False
    return True
Average run times of the 2 methods over 100,000 loops are:
non-match (string a and b)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.b)'
100000 loops, best of 3: 9.72 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.b)'
10000 loops, best of 3: 28.1 usec per loop
match (string a and c)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.c)'
100000 loops, best of 3: 9.47 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.c)'
100000 loops, best of 3: 24.6 usec per loop
Keep in mind that the strings used are very small. The time complexity of the methods are different, so you'll see different results with very large strings. Choose according to your data, you may even use a combination of the two.

Sorry that my code is not in Python, I have never used it, but I am sure this can be easily translated into python. I believe this is faster than all the other examples already posted. It is also O(n), but stops as soon as possible:
public boolean isPermutation(String a, String b) {
    if (a.length() != b.length()) {
        return false;
    }
    int[] charCount = new int[256];
    for (int i = 0; i < a.length(); ++i) {
        ++charCount[a.charAt(i)];
    }
    for (int i = 0; i < b.length(); ++i) {
        if (--charCount[b.charAt(i)] < 0) {
            return false;
        }
    }
    return true;
}
First I don't use a dictionary but an array of size 256 for all the characters. Accessing the index should be much faster. Then when the second string is iterated, I immediately return false when the count gets below 0. When the second loop has finished, you can be sure that the strings are a permutation, because the strings have equal length and no character was used more often in b compared to a.

Here's martinus's code in Python. It only works for ASCII strings:
def is_permutation(a, b):
    if len(a) != len(b):
        return False
    char_count = [0] * 256
    for c in a:
        char_count[ord(c)] += 1
    for c in b:
        char_count[ord(c)] -= 1
        # A negative count means b uses c more often than a does
        if char_count[ord(c)] < 0:
            return False
    return True

I did a pretty thorough comparison in Java with all words in a book I had. The counting method beats the sorting method in every way. The results:
Testing against 9227 words.
Permutation testing by sorting ... done. 18.582 s
Permutation testing by counting ... done. 14.949 s
If anyone wants the algorithm and test data set, comment away.

First, to handle the simple case where String 1 and String 2 are exactly the same, you can use an "if" up front.
Second, it is important to consider whether the strings contain only numerical values or can also contain words. If the latter is true (words and numerical values are in the string at the same time), your first solution will not work. You can enhance it by using the "ord()" function to get the ASCII numerical value. However, in the end you are using sort; therefore, in the worst case your time complexity will be O(N log N). This time complexity is not bad. But you can do better. You can make it O(N).
My "suggestion" is to use an array (list) and a set at the same time. Note that finding a value in an array needs iteration, so its time complexity is O(N), but searching for a value in a set (which I guess is implemented with a hash table in Python, I'm not sure) has O(1) time complexity:
def Permutation2(Str1, Str2):
    ArrStr1 = list(Str1) #convert Str1 to array
    SetStr2 = set(Str2) #convert Str2 to set
    ArrExtra = []
    if len(Str1) != len(Str2): #check their length
        return False
    elif Str1 == Str2: #check their values
        return True
    for x in range(len(ArrStr1)):
        ArrExtra.append(ArrStr1[x])
    for x in range(len(ArrExtra)): #of course len(ArrExtra) == len(ArrStr1) ==len(ArrStr2)
        # Note: set membership ignores how often each character occurs
        if ArrExtra[x] in SetStr2: #checking in set is O(1)
            continue
        else:
            return False
    return True

Go with the first one - it's much more straightforward and easier to understand. If you're actually dealing with incredibly large strings and performance is a real issue, then don't use Python, use something like C.
As far as the Zen of Python is concerned, that there should only be one obvious way to do things refers to small, simple things. Obviously for any sufficiently complicated task, there will always be zillions of small variations on ways to do it.

In Python 3.1/2.7 you can just use collections.Counter(a) == collections.Counter(b).
But sorted(a) == sorted(b) is still the most obvious IMHO. You are talking about permutations - changing order - so sorting is the obvious operation to erase that difference.
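As a minimal sketch of the Counter approach, the snippet below wraps the comparison in a function. The name `is_permutation` and the early length check are additions for this example, not part of the answer above.

```python
from collections import Counter

def is_permutation(a, b):
    """Return True if a and b contain the same items with the same multiplicities."""
    # The length check is a cheap early exit before building the two counters.
    return len(a) == len(b) and Counter(a) == Counter(b)

print(is_permutation("listen", "silent"))  # True
print(is_permutation("aab", "abb"))        # False
```

Counter keys can be any hashable value, so the same function works for non-ASCII strings and for lists of hashable items.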

This is derived from @patros's answer.
from collections import Counter
def is_anagram(a, b, threshold=1000000):
    """Returns true if one sequence is a permutation of the other.
    Ignores leading/trailing whitespace and character case.
    Compares sorted sequences if the length is below the threshold,
    otherwise compares dictionaries that contain the frequency of the
    elements.
    """
    a, b = a.strip().lower(), b.strip().lower()
    length_a, length_b = len(a), len(b)
    if length_a != length_b:
        return False
    if length_a < threshold:
        return sorted(a) == sorted(b)
    return Counter(a) == Counter(b) # Or use @namin's method if you don't want to create two dictionaries and don't mind the extra typing.

This is an O(n) solution in Python using hashing with dictionaries. Notice that I don't use default dictionaries because the code can stop this way if we determine the two strings are not permutations after checking the second letter for instance.
def if_two_words_are_permutations(s1, s2):
    if len(s1) != len(s2):
        return False
    dic = {}
    for ch in s1:
        if ch in dic:
            dic[ch] += 1
        else:
            dic[ch] = 1
    for ch in s2:
        # Stop as soon as s2 uses a character more often than s1 does
        if ch not in dic:
            return False
        elif dic[ch] == 0:
            return False
        else:
            dic[ch] -= 1
    return True

This is a PHP function I wrote about a week ago which checks if two words are anagrams. How would this compare (if implemented the same in python) to the other methods suggested? Comments?
public function is_anagram($word1, $word2) {
    $letters1 = str_split($word1);
    $letters2 = str_split($word2);
    if (count($letters1) == count($letters2)) {
        foreach ($letters1 as $letter) {
            $index = array_search($letter, $letters2);
            if ($index !== false) {
                unset($letters2[$index]);
            }
            else { return false; }
        }
        return true;
    }
    return false;
}
Here's a literal translation to Python of the PHP version (by JFS):
def is_anagram(word1, word2):
    letters2 = list(word2)
    if len(word1) == len(word2):
        for letter in word1:
            try:
                del letters2[letters2.index(letter)]
            except ValueError:
                return False
        return True
    return False
Comments:
1. The algorithm is O(N**2). Compare it to @namin's version (it is O(N)).
2. The multiple returns in the function look horrible.

This version is faster than any of the examples presented so far, except that it is 20% slower than sorted(x) == sorted(y) for short strings. It depends on the use case, but generally a 20% performance gain is insufficient to justify complicating the code with different versions for short and long strings (as in @patros's answer).
It doesn't use len so it accepts any iterable therefore it works even for data that do not fit in memory e.g., given two big text files with many repeated lines it answers whether the files have the same lines (lines can be in any order).
def isanagram(iterable1, iterable2):
    d = {}
    get = d.get
    for c in iterable1:
        d[c] = get(c, 0) + 1
    try:
        for c in iterable2:
            d[c] -= 1  # raises KeyError if c never occurred in iterable1
        return not any(d.values())
    except KeyError:
        return False
It is unclear why this version is faster than the defaultdict one (@namin's) for large iterable1 (tested on a 25MB thesaurus).
If we replace get in the loop by try: ... except KeyError then it performs 2 times slower for short strings i.e. when there are few duplicates.

In Swift (or another language's implementation), you could look at the encoded values (in this case Unicode) and see if their sums match. Note that equal sums are a necessary condition, not a sufficient one: "ad" and "bc" have the same sum without being permutations of each other.
Something like:
let string1EncodedValues = "Hello".unicodeScalars.reduce(0) { total, scalar in
    // Add up the encoded value of each scalar
    total + scalar.value
}
let string2EncodedValues = "oellH".unicodeScalars.reduce(0) { total, scalar in
    total + scalar.value
}
let equalStrings = string1EncodedValues == string2EncodedValues
You will need to handle spaces and cases as needed.

def matchPermutation(s1, s2):
    a = []
    b = []
    if len(s1) != len(s2):
        print('length should be the same')
        return
    for i in range(len(s1)):
        a.append(s1[i])
    for i in range(len(s2)):
        b.append(s2[i])
    # Note: comparing sets ignores how often each character occurs
    if set(a) == set(b):
        print('Permutation of each other')
    else:
        print('Not a permutation of each other')
    return
matchPermutation('rav', 'var') # prints 'Permutation of each other'
matchPermutation('rav', 'abc') # prints 'Not a permutation of each other'

Checking if two strings are permutations of each other in Python
# First method
def permutation(s1, s2):
    if len(s1) != len(s2):
        return False
    return ' '.join(sorted(s1)) == ' '.join(sorted(s2))
# second method
def permutation1(s1, s2):
    if len(s1) != len(s2):
        return False
    array = [0] * 128  # one counter per ASCII character
    for c in s1:
        array[ord(c)] += 1
    for c in s2:
        array[ord(c)] -= 1
        if array[ord(c)] < 0:
            return False
    return True

How about something like this? It is pretty straightforward and readable. This is for strings, as per the OP.
The complexity is O(n log n), dominated by sorted().
def checkPermutation(a, b):
    # input: strings a and b
    # return: True if a is a permutation of b, False otherwise
    if len(a) != len(b):
        return False
    return ''.join(sorted(a)) == ''.join(sorted(b))
# test inputs
a = 'sRF7w0qbGp4fdgEyNlscUFyouETaPHAiQ2WIxzohiafEGJLw03N8ALvqMw6reLN1kHRjDeDausQBEuIWkIBfqUtsaZcPGoqAIkLlugTxjxLhkRvq5d6i55l4oBH1QoaMXHIZC5nA0K5KPBD9uIwa789sP0ZKV4X6'
b = 'Vq3EeiLGfsAOH2PW6skMN8mEmUAtUKRDIY1kow9t1vIEhe81318wSMICGwf7Rv2qrLrpbeh8bh4hlRLZXDSMyZJYWfejLND4u9EhnNI51DXcQKrceKl9arWqOl7sWIw3EBkeu7Fw4TmyfYwPqCf6oUR0UIdsAVNwbyyiajtQHKh2EKLM1KlY6NdvQTTA7JKn6bLInhFvwZ4yKKbzkgGhF3Oogtnmzl29fW6Q2p0GPuFoueZ74aqlveGTYc0zcXUJkMzltzohoRdMUKP4r5XhbsGBED8ReDbL3ouPhsFchERvvNuaIWLUCY4gl8OW06SMuvceZrCg7EkSFxxprYurHz7VQ2muxzQHj7RG2k3khxbz2ZAhWIlBBtPtg4oXIQ7cbcwgmBXaTXSBgBe3Y8ywYBjinjEjRJjVAiZkWoPrt8JtZv249XiN0MTVYj0ZW6zmcvjZtRn32U3KLMOdjLnRFUP2I3HJtp99uVlM9ghIpae0EfC0v2g78LkZE1YAKsuqCiiy7DVOhyAZUbOrRwXOEDHxUyXwCmo1zfVkPVhwysx8HhH7Iy0yHAMr0Tb97BqcpmmyBsrSgsV1aT3sjY0ctDLibrxbRXBAOexncqB4BBKWJoWkQwUZkFfPXemZkWYmE72w5CFlI6kuwBQp27dCDZ39kRG7Txs1MbsUnRnNHBy1hSOZvTQRYZPX0VmU8SVGUqzwm1ECBHZakQK4RUquk3txKCqbDfbrNmnsEcjFaiMFWkY3Esg6p3Mm41KWysTpzN6287iXjgGSBw6CBv0hH635WiZ0u47IpUD5mY9rkraDDl5sDgd3f586EWJdKAaou3jR7eYU7YuJT3RQVRI0MuS0ec0xYID3WTUI0ckImz2ck7lrtfnkewzRMZSE2ANBkEmg2XAmwrCv0gy4ExW5DayGRXoqUv06ZLGCcBEiaF0fRMlinhElZTVrGPqqhT03WSq4P97JbXA90zUxiHCnsPjuRTthYl7ZaiVZwNt3RtYT4Ff1VQ5KXRwRzdzkRMsubBX7YEhhtl0ZGVlYiP4N4t00Jr7fB4687eabUqK6jcUVpXEpTvKDbj0JLcLYsneM9fsievUz193f6aMQ5o5fm4Ilx3TUZiX4AUsoyd8CD2SK3NkiLuR255BDIA0Zbgnj2XLyQPiJ1T4fjStpjxKOTzsQsZxpThY9Fvjvoxcs3HAiXjLtZ0TSOX6n4ZLjV3TdJMc4PonwqIb3lAndlTMnuzEPof2dXnpexoVm5c37XQ7fBkoMBJ4ydnW25XKYJbkrueRDSwtJGHjY37dob4jPg0axM5uWbqGocXQ4DyiVm5GhvuYX32RQaOtXXXw8cWK6JcSUnlP1gGLMNZEGeDXOuGWiy4AJ7SH93ZQ4iPgoxdfCuW0qbsLKT2HopcY9dtBIRzr91wnES9lDL49tpuW77LSt5dGA0YLSeWAaZt9bDrduE0gDZQ2yX4SDvAOn4PMcbFRfTqzdZXONmO7ruBHHb1tVFlBFNc4xkoetDO2s7mpiVG6YR4EYMFIG1hBPh7Evhttb34AQzqImSQm1gyL3O7n3p98Kqb9qqIPbN1kuhtW5mIbIioWW2n7MHY7E5mt0'
print(checkPermutation(a, b)) #optional

def permute(str1, str2):
    return sorted(str1) == sorted(str2)
str1 = "hello"
str2 = 'olehl'
a = permute(str1, str2)
print(a)

from collections import defaultdict
def permutation(s1, s2):
    if len(s1) != len(s2):
        return False
    h = defaultdict(int)
    for ch in s1:
        h[ch] += 1
    for ch in s2:
        h[ch] -= 1
    # Every count is zero exactly when s1 and s2 use the same characters equally often
    return all(count == 0 for count in h.values())
print(permutation("tictac","tactic"))
