Scala best way to find pairs in a collection [duplicate]

This question already has answers here: Composing a list of all pairs (3 answers). Closed 5 years ago.
I'm trying to find the most efficient way of producing pairs from a Scala collection. For example,
val list = List(1,2,3)
should produce these pairs
(1,2) (1,3) (2,1) (2,3) (3,1) (3,2)
My current implementation seems quite expensive. How can I optimize it further?
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val pairs = list.flatMap { currentElement =>
  val clonedList: mutable.ListBuffer[Int] = list.to[ListBuffer]
  val currentIndex = list.indexOf(currentElement)
  val removedValue = clonedList.remove(currentIndex)
  clonedList.map { y =>
    (currentElement, y)
  }
}

val l = Array(1, 2, 3, 4)
val result = scala.collection.mutable.HashSet[(Int, Int)]()
for (i <- 0 until l.size) {
  for (j <- (i + 1) until l.size) {
    result += l(i) -> l(j)
    result += l(j) -> l(i)
  }
}
Several optimizations here. First, the inner loop only traverses the list from the current element to the end, halving the number of iterations. Then we keep object creation to a minimum (only the tuples are created and added to a mutable HashSet). Finally, the HashSet handles duplicates for free. An additional optimization would be to check whether the set already contains the tuple, to avoid creating an object for nothing.
For 1,000 elements, it takes less than 1s on my laptop. 7s for 10k distinct elements.
Note that recursively, you could do it this way:
def combi(s: Seq[Int]): Seq[(Int, Int)] =
  if (s.isEmpty)
    Seq()
  else
    s.tail.flatMap(x => Seq(s.head -> x, x -> s.head)) ++ combi(s.tail)
It takes a little bit more than 1s for 1000 elements.

Supposing that "most optimal way" can be interpreted in different ways (most of the time I treat it as the one that makes me most productive), I suggest the following approach:
val originalList = (1 to 1000).toStream
def orderedPairs[T](list: Stream[T]) = list.combinations(2).map(p => (p(0), p(1))).toStream
val pairs = orderedPairs(originalList) ++ orderedPairs(originalList.reverse)
println(pairs.slice(0, 1000).toList)
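For reference, here is a minimal sketch (not taken from any of the answers above) of the same result using a plain for-comprehension over indices, which avoids the indexOf lookup and list copies from the question:

def allPairs[T](xs: Seq[T]): Seq[(T, T)] =
  for {
    i <- xs.indices   // every position
    j <- xs.indices   // paired with every other position
    if i != j
  } yield (xs(i), xs(j))

allPairs(List(1, 2, 3))
// Vector((1,2), (1,3), (2,1), (2,3), (3,1), (3,2))

Pairing by index rather than by value also behaves correctly when the input contains duplicate values, which the indexOf-based version does not.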

Related

Why is my Binary Search implementation in Scala so slow?

Recently, I implemented this Binary Search, which is supposed to run in under 6 seconds for Scala, yet it runs for 12-13 seconds on the machine that checks the assignments.
Note before you read the code: the input consists of two lines: the first is the list of numbers to search in, and the second is the list of "search terms" to look up in that list. The expected output lists the index of each term in the list of numbers. Each input line can contain at most 10^5 numbers, and each number is at most 10^9.
For example:
Input:
5 1 5 8 12 13   // the first number 5 indicates the length of the following sequence
5 8 1 23 1 11   // the first number 5 indicates the length of the following sequence
Output:
2 0 -1 0 -1     // index of each term in the input array
My solution:
import scala.annotation.tailrec
import scala.io.StdIn.readLine

object BinarySearch extends App {
  val n_items = readLine().split(" ").map(BigInt(_))
  val n = n_items(0)
  val items = n_items.drop(1)
  val k :: terms = readLine().split(" ").map(BigInt(_)).toList

  println(search(terms, items).mkString(" "))

  def search(terms: List[BigInt], items: Array[BigInt]): Array[BigInt] = {
    @tailrec
    def go(terms: List[BigInt], results: Array[BigInt]): Array[BigInt] = terms match {
      case List() => results
      case head :: tail => go(tail, results :+ find(head))
    }

    def find(term: BigInt): BigInt = {
      @tailrec
      def go(left: BigInt, right: BigInt): BigInt = {
        if (left > right) { -1 }
        else {
          val middle = left + (right - left) / 2
          val middle_val = items(middle.toInt)
          middle_val match {
            case m if m == term => middle
            case m if m <= term => go(middle + 1, right)
            case m if m > term => go(left, middle - 1)
          }
        }
      }
      go(0, n - 1)
    }

    go(terms, Array())
  }
}
What makes this code so slow? Thank you
I am worried about the complexity of
results :+ find(head)
Appending an item to an Array of length L is O(L), since the whole array is copied, so if you have n results to compute, the total complexity will be O(n*n).
Try using a mutable ArrayBuffer instead of an Array to accumulate the results, or simply mapping the input terms through the find function.
In other words replace
go(terms, Array())
with
terms.map( x => find(x) ).toArray
By the way, the limits on the problem are small enough that using BigInt is overkill and probably making the code significantly slower. Normal ints should be large enough for this problem.
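Putting both suggestions together, here is a rough sketch (an assumption of how the revised code could look, not the original poster's code) that uses plain Int values and maps the terms through find:

import scala.annotation.tailrec

// Sketch: the same binary search, but with Int instead of BigInt and
// terms.map(find) instead of repeatedly appending to an Array.
def search(terms: List[Int], items: Array[Int]): Array[Int] = {
  def find(term: Int): Int = {
    @tailrec
    def go(left: Int, right: Int): Int =
      if (left > right) -1
      else {
        val middle = left + (right - left) / 2
        items(middle) match {
          case m if m == term => middle
          case m if m < term  => go(middle + 1, right)
          case _              => go(left, middle - 1)
        }
      }
    go(0, items.length - 1)
  }
  terms.map(find).toArray
}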

Find most unique words, penalizing words in common

suppose I have n classes like:
A: this,is,a,test,of,the,salmon,system
B: i,like,to,test,the,flounder,system
C: to,test,a,salmon,is,like,to,test,the,iodine,system
I want to get the most unique words for each class, so something with a ranking that gives me
A: salmon
B: flounder
C: iodine, salmon
(as their first elements; it can be a ranking of all words)
How do I do this? There will be hundreds of input classes each with tens of thousands of tokens.
I'm guessing this is essentially the sort of thing any search engine back-end does, but I'd like a fairly simple standalone thing.
Using a language like Python, you can write this efficiently in 8 lines. For hundreds of groups, each with tens of thousands of tokens, the running time should be at most a few minutes (although I haven't tried this on actual input).
1. Create a hash-based dictionary mapping each word to its number of occurrences.
2. Iterate over all groups, and over all words in each group, updating this dictionary.
3. For each group,
a. If you need a total ranking, sort with the value in the dictionary as the criterion.
b. If you need the top k, use an order-statistics type of algorithm, again using the value in the dictionary as the criterion.
Steps 1 + 2 should have expected linear complexity in the total number of words.
Step 3 is n log(n) per group for a total ranking, and linear in the total number of words otherwise.
Here is the Python code for the top k. Assume all_groups is a list of lists of strings, and that k = 10.
from collections import Counter
import heapq
import operator

k = 10
c = Counter()
for g in all_groups:
    c.update(g)
for g in all_groups:
    print(heapq.nsmallest(k, [(w, c[w]) for w in g], key=operator.itemgetter(1)))
From what I understand of your question, I came to this solution: rank each class's words by how rarely they are used compared with all the other classes.
var a = "this,is,a,test,of,the,salmon,system".split(","),
    b = "i,like,to,test,the,flounder,system".split(","),
    c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(","),
    map = {},
    min,
    key,
    parse = function(stringArr) {
        var length = stringArr.length,
            i, count;
        for (i = 0; i < length; i++) {
            if (count = map[stringArr[i]]) {
                map[stringArr[i]] = count + 1;
            } else {
                map[stringArr[i]] = 1;
            }
        }
    },
    get = function(stringArr) {
        min = Infinity;
        stringArr.forEach((item) => {
            if (map[item] < min) {
                min = map[item];
                key = item;
            }
        });
        console.log(key);
    };

parse(a);
parse(b);
parse(c);
get(a);
get(b);
get(c);
Ignore the classes, go through all the words and make a frequency table.
Then, for each class select the word with the lowest frequency.
Example in Python (slightly unpythonic solution to maintain readability for non-Python users):
a = "this,is,a,test,of,the,salmon,system".split(",")
b = "i,like,to,test,the,flounder,system".split(",")
c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(",")
freq = {}
for word in a + b + c:
    freq[word] = (freq[word] if word in freq else 0) + 1
print("a: ", min(a, key=lambda w: freq[w]))
print("b: ", min(b, key=lambda w: freq[w]))
print("c: ", min(c, key=lambda w: freq[w]))

Scala Filter and Collect is slow

I am just beginning with Scala development and am trying to filter out unnecessary lines from an iterator using filter and collect. But the operation seems to be too slow.
import scala.io.Source

val src = Source.fromFile("/home/Documents/1987.csv") // 1.2 Million
val iter = src.getLines().map(_.split(":"))
val iter250 = iter.take(250000) // Only interested in the first 250,000
val intrestedIndices = Range(1, 100000, 3).toSeq // This could be any order
val slicedData = iter250.zipWithIndex
// Takes 3 minutes
val firstCase = slicedData.collect { case (x, i) if intrestedIndices.contains(i) => x }.size
// Takes 3 minutes
val secondCase = slicedData.filter(x => intrestedIndices.contains(x._2)).size
// Takes 1 second
val thirdCase = slicedData.collect { case (x, i) if i % 3 == 0 => x }.size
It appears the intrestedIndices.contains(_) part is slowing down the program in the first and second cases. Is there an alternative way to speed this process up?
The following suggestion helped solve the problem:
You iterate over all interestedIndices in the first two cases in linear time. Use a Set instead of a Seq to improve performance. – Sergey Lagutin
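A minimal sketch of that fix, reusing the variable names from the question (a Set membership check is effectively O(1), while Seq.contains scans the sequence linearly):

val intrestedIndices = Range(1, 100000, 3).toSet
// Same collect as before, but each contains check is now a hash lookup.
val firstCase = slicedData.collect { case (x, i) if intrestedIndices.contains(i) => x }.size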
For the record, here's a method to filter with an (ordered) Seq of indices, not necessarily equidistant, without scanning the indices at each step:
def filterInteresting[T](it: Iterator[T], indices: Seq[Int]): Iterator[T] =
  it.zipWithIndex.scanLeft((indices, None: Option[T])) {
    case ((indices, _), (elem, index)) => indices match {
      case h :: t if h == index => (t, Some(elem))
      case l => (l, None)
    }
  }.map(_._2).flatten

Algorithm for combining different age groups together based on their values

Let's say we have an array of age groups and an array of the number of people in each age group
For example:
Ages = ("1-13", "14-20", "21-30", "31-40", "41-50", "51+")
People = (1, 10, 21, 3, 2, 1)
I want an algorithm that combines these age groups, using the following logic, whenever a group has fewer than 5 people. The algorithm that I have so far does the following:
Start from the last element ("51+"). Can it be combined with the next group ("41-50")? If yes, add the counts (1+2) and combine their labels, so we get the following:
Ages = ("1-13", "14-20", "21-30", "31-40", "41+")
People = (1, 10, 21, 3, 3)
Take the last one again (now "41+"). Can you combine it with the next group ("31-40")? The answer is yes, so we get:
Ages = ("1-13", "14-20", "21-30", "31+")
People = (1, 10, 21, 6)
Since the group "31+" now has 6 members, we cannot collapse it into the next group.
We cannot collapse "21-30" into the next one ("14-20") either, since it has 21 members.
"14-20" also has 10 people (>5), so we don't do anything with it either.
For the first group ("1-13"), since it has only one person and there is no group before it, we combine it with the next group "14-20" and get the following:
Ages = ("1-20", "21-30", "31+")
People = (11, 21, 6)
I have an implementation of this algorithm that uses many flags to keep track of whether any data has changed, and it makes several passes over the two arrays to finish the task.
My question is whether you know a more efficient way of doing the same thing. Any data structure or algorithm that lets me do this without so much bookkeeping would be great.
Update:
An extreme example would be (5, 1, 5).
In the first pass it becomes (5, 6) [collapsing the one on the right into the one in the middle].
Then we have (5, 6). We cannot touch the 6 since it is larger than our threshold of 5, so we go to the next one (the 5 on the far left). Since it is less than or equal to 5 and it is the last one on the left, we group it with the one on its right, so we finally get (11).
Here is an OCaml implementation of a left-to-right merge algorithm:
let close_group acc cur_count cur_names =
  (List.rev cur_names, cur_count) :: acc

let merge_small_groups mini l =
  let acc, cur_count, cur_names =
    List.fold_left (
      fun (acc, cur_count, cur_names) (name, count) ->
        if cur_count <= mini || count <= mini then
          (acc, cur_count + count, name :: cur_names)
        else
          (close_group acc cur_count cur_names, count, [name])
    ) ([], 0, []) l
  in
  List.rev (close_group acc cur_count cur_names)

let input = [
  "1-13", 1;
  "14-20", 10;
  "21-30", 21;
  "31-40", 3;
  "41-50", 2;
  "51+", 1
]

let output = merge_small_groups 5 input
(* output = [(["1-13"; "14-20"], 11); (["21-30"; "31-40"; "41-50"; "51+"], 27)] *)
As you can see, the result of merging from left to right may not be what you want.
Depending on the goal, it may make more sense to merge the pair of consecutive elements whose sum is smallest and iterate until all counts are above the minimum of 5.
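To keep the thread in one language, here is a rough Scala sketch (an assumption, not part of the OCaml answer) of that alternative: repeatedly merge the adjacent pair with the smallest combined count until every group is above the threshold. Label merging is simplified to plain concatenation:

def mergeSmallest(groups: List[(String, Int)], threshold: Int): List[(String, Int)] =
  if (groups.size <= 1 || groups.forall(_._2 > threshold)) groups
  else {
    // Index of the adjacent pair whose combined count is smallest.
    val i = groups.indices.init.minBy(j => groups(j)._2 + groups(j + 1)._2)
    val merged = (groups(i)._1 + "," + groups(i + 1)._1, groups(i)._2 + groups(i + 1)._2)
    mergeSmallest(groups.take(i) ::: merged :: groups.drop(i + 2), threshold)
  }

mergeSmallest(List("1-13" -> 1, "14-20" -> 10, "21-30" -> 21,
                   "31-40" -> 3, "41-50" -> 2, "51+" -> 1), 5)
// Result: three groups with counts 11, 21 and 6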
Here is my Scala approach.
We start with two lists:
val people = List (1, 10, 21, 3, 2, 1)
val ages = List ("1-13", "14-20", "21-30", "31-40", "41-50", "51+")
and combine them into a kind of mapping:
val agegroup = ages.zip (people)
Then define a method to merge two Strings describing an (open-ended) interval. If one of them carries the + as in "51+", it is passed as the first parameter.
/**
 * combine age-strings
 *   a+  b-c => b+
 *   a-b c-d => c-b
 */
def merge (xs: String, ys: String) = {
  val xab = xs.split ("[+-]")
  val yab = ys.split ("-")
  if (xs.contains ("+")) yab(0) + "+"
  else yab(0) + "-" + xab(1)
}
Here is the real work:
/**
 * reverse the list, combine groups < threshold.
 */
def remap (map: List[(String, Int)], threshold: Int) = {
  def remap (mappings: List[(String, Int)]): List[(String, Int)] = mappings match {
    case Nil => Nil
    case x :: Nil => x :: Nil
    case x :: y :: xs =>
      if (x._2 > threshold) x :: remap (y :: xs)
      else remap ((merge (x._1, y._1), x._2 + y._2) :: xs)
  }
  val nearly = (remap (map.reverse)).reverse
  // check the first element
  if (!nearly.isEmpty && nearly.length > 1 && nearly(0)._2 < threshold) {
    val a = nearly(0)
    val b = nearly(1)
    val rest = nearly.tail.tail
    (merge (b._1, a._1), a._2 + b._2) :: rest
  } else nearly
}
and the invocation:
println (remap (agegroup, 5))
with result:
scala> println (remap (agegroup, 5))
List((1-20,11), (21-30,21), (31+,6))
The result is a list of pairs: age group and member count.
I guess the main part is easy to understand: There are 3 basic cases: an empty list, which can't be grouped, a list of one group, which is the solution itself, and more than one element.
If the first element is bigger than the threshold (5, 6, whatever), yield it and proceed with the rest (I reverse the list at the beginning, to start from the end); if not, combine it with the second element and recurse with this combined element and the rest.
If 2 elements get combined, the merge-method for the strings is called.
The list is reversed, remapped, and the result reversed again. Now the first element has to be inspected and possibly combined.
We're done.
I think a good data structure would be a linked list of pairs, where each pair contains the age span and the count. Using that, you can easily walk the list, and join two pairs in O(1).

Better random "feeling" integer generator for short sequences

I'm trying to figure out a way to create random numbers that "feel" random over short sequences. This is for a quiz game, where there are four possible choices, and the software needs to pick one of the four spots in which to put the correct answer before filling in the other three with distractors.
Obviously, arc4random() % 4 will create more than sufficiently random results over a long sequence, but in a short sequence it's entirely possible (and a frequent occurrence!) to have five or six of the same number come back in a row. This is what I'm trying to avoid.
I also don't want to simply say "never pick the same square twice," because that results in only three possible answers for every question but the first. Currently I'm doing something like this:
bool acceptable = NO;
do {
    currentAnswer = arc4random() % 4;
    if (currentAnswer == lastAnswer) {
        if (arc4random() % 4 == 0) {
            acceptable = YES;
        }
    } else {
        acceptable = YES;
    }
} while (!acceptable);
Is there a better solution to this that I'm overlooking?
If your question was how to compute currentAnswer using your example's probabilities non-iteratively, Guffa has your answer.
If the question is how to avoid random-clustering without violating equiprobability and you know the upper bound of the length of the list, then consider the following algorithm which is kind of like un-sorting:
from random import randrange
# randrange(a, b) yields a <= N < b

def decluster():
    for i in range(seq_len):
        j = (i + 1) % seq_len
        if seq[i] == seq[j]:
            i_swap = randrange(i, seq_len)  # is the best lower bound 0, i, or j?
            if seq[j] != seq[i_swap]:
                print('swap', j, i_swap, (seq[j], seq[i_swap]))
                seq[j], seq[i_swap] = seq[i_swap], seq[j]

seq_len = 20
seq = [randrange(1, 5) for _ in range(seq_len)]; print(seq)
decluster(); print(seq)
decluster(); print(seq)
where any relation to actual working Python code is purely coincidental. I'm pretty sure the prior probabilities are maintained, and it does seem to break clusters (and occasionally adds some). But I'm pretty sleepy, so this is for amusement purposes only.
You populate an array of outcomes, then shuffle it, then assign them in that order.
So for just 8 questions:
answer_slots = [0,0,1,1,2,2,3,3]
shuffle(answer_slots)
print answer_slots
[1,3,2,1,0,2,3,0]
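In Scala, a minimal sketch of the same idea is a one-liner using the standard library shuffle (assuming, as above, 8 questions and 4 answer positions used twice each):

import scala.util.Random

// Each answer position 0-3 appears exactly twice across the 8 questions.
val answerSlots = Random.shuffle(List(0, 0, 1, 1, 2, 2, 3, 3))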
To reduce the probability for a repeated number by 25%, you can pick a random number between 0 and 3.75, and then rotate it so that the 0.75 ends up at the previous answer.
To avoid using floating point values, you can multiply the factors by four:
Pseudo code (where / is an integer division):
currentAnswer = ((random(0..14) + (lastAnswer + 1) * 4) % 16) / 4
Set up a weighted array. Let's say the last value was a 2. Make an array like this:
array = [0,0,0,0,1,1,1,1,2,3,3,3,3];
Then pick a number in the array.
newValue = array[arc4random() % 13];
Now switch to using math instead of an array.
newValue = ( ( ( arc4random() % 13 ) / 4 ) + 1 + oldValue ) % 4;
For P possibilities and a weight 0<W<=1 use:
newValue = ( ( ( arc4random() % (P/W-P(1-W)) ) * W ) + 1 + oldValue ) % P;
For P=4 and W=1/4, (P/W-P(1-W)) = 13. This says the last value will be 1/4 as likely as other values.
If you completely eliminate the most recent answer it will be just as noticeable as the most recent answer showing up too often. I do not know what weight will feel right to you, but 1/4 is a good starting point.
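For completeness, here is a hedged Scala sketch of the weighted pick described above (P = 4 positions, with the previous answer a quarter as likely as the others):

import scala.util.Random

// nextInt(13) / 4 yields 0, 1, 2 with 4 chances each and 3 with only 1 chance;
// the rotation (+ 1 + lastAnswer) maps that rare slot onto the previous answer.
def nextAnswer(lastAnswer: Int): Int =
  ((Random.nextInt(13) / 4) + 1 + lastAnswer) % 4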
