Grouping adjacent elements in a list - algorithm

Let's say I want to write a function that does this:
input: [1,1,3,3,4,2,2,5,6,6]
output: [[1,1],[3,3],[4],[2,2],[5],[6,6]]
It groups adjacent elements that are the same.
What should the name of this method be? Is there a standard name for this operation?

In [1,1,3,3,4,2,2,5,6,6], a thing like [1,1] is very often referred to as a run (as in run-length encoding; see RLE in Scala). I'd therefore call the method groupRuns.

import scala.annotation.tailrec

@tailrec
def groupRuns[A](c: Seq[A], acc: Seq[Seq[A]] = Seq.empty): Seq[Seq[A]] = {
  c match {
    case Seq() => acc
    case xs =>
      // take the prefix equal to the head, then recurse on the remainder
      val (same, rest) = xs.span { _ == xs.head }
      groupRuns(rest, acc :+ same)
  }
}
scala> groupRuns(Vector(1, 1, 3, 3, 4, 2, 2, 5, 6, 6))
res7: Seq[Seq[Int]] = List(Vector(1, 1), Vector(3, 3), Vector(4), Vector(2, 2), Vector(5), Vector(6, 6))
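As a side note, once you have the runs, run-length encoding itself is a one-liner on top of groupRuns (a quick sketch, not part of the original answer):
scala> groupRuns(Vector(1, 1, 3, 3, 4, 2, 2, 5, 6, 6)).map(r => (r.head, r.size))
res8: Seq[(Int, Int)] = List((1,2), (3,2), (4,1), (2,2), (5,1), (6,2))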

Related

Fill a nested structure with values from a linear supply stream

I got stuck on the following problem:
Imagine we have an array structure; it could be any structure, but for this example let's use:
[
[ [1, 2], [3, 4], [5, 6] ],
[ 7, 8, 9, 10 ]
]
For convenience, I transform this structure into a flat array like:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
Imagine that after certain operations our array looks like this:
[ 1, 2, 3, 4, 12515, 25125, 12512, 8, 9, 10]
NOTE: those values are the result of some operation; I just want to point out that it is independent of the structure or the positions.
What I would like to know is... given the first array structure, how can I transform the last flat array into the same structure as the first? So it will look like:
[
[ [1, 2], [3, 4] , [12515, 25125] ],
[ 12512, 8, 9, 10]
]
Any suggestions? I was just hardcoding the positions into the given structure, but that's not dynamic.
Just recurse through the structure, and use an iterator to generate the values in order:
function fillWithStream(structure, iterator) {
  for (var i = 0; i < structure.length; i++)
    if (Array.isArray(structure[i]))
      fillWithStream(structure[i], iterator); // recurse into nested arrays
    else
      structure[i] = getNext(iterator); // leaf: overwrite with the next value
}
function getNext(iterator) {
  const res = iterator.next();
  if (res.done) throw new Error("not enough elements in the iterator");
  return res.value;
}
var structure = [
  [ [1, 2], [3, 4], [5, 6] ],
  [ 7, 8, 9, 10 ]
];
var seq = [1, 2, 3, 4, 12515, 25125, 12512, 8, 9, 10];
fillWithStream(structure, seq[Symbol.iterator]());
console.log(JSON.stringify(structure));
Here is a sketch in Scala. Whatever your language is, you first have to represent the tree-like data structure somehow:
sealed trait NestedArray
case class Leaf(arr: Array[Int]) extends NestedArray {
  override def toString = arr.mkString("[", ",", "]")
}
case class Node(children: Array[NestedArray]) extends NestedArray {
  override def toString =
    children
      .flatMap(_.toString.split("\n"))
      .map(" " + _)
      .mkString("[\n", "\n", "\n]")
}
object NestedArray {
  def apply(ints: Int*) = Leaf(ints.toArray)
  def apply(cs: NestedArray*) = Node(cs.toArray)
}
The only important part is the differentiation between the leaf nodes that hold arrays of integers, and the inner nodes that hold their child-nodes in arrays. The toString methods and extra constructors are not that important, it's mostly just for the little demo below.
Now you essentially want to build an encoder-decoder, where the encode part simply flattens everything, and decode part takes another nested array as argument, and reshapes a flat array into the shape of the nested array. The flattening is very simple:
def encode(a: NestedArray): Array[Int] = a match {
  case Leaf(arr) => arr
  case Node(cs)  => cs flatMap encode
}
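For instance, using the NestedArray constructors defined above (my example, not from the original answer):
encode(NestedArray(NestedArray(1, 2), NestedArray(3, 4)))
// => Array(1, 2, 3, 4)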
Restoring the structure isn't all that difficult either. I've decided to keep track of the position in the array by passing around an explicit int index:
def decode(
    shape: NestedArray,
    flatArr: Array[Int]
): NestedArray = {
  def recHelper(
      startIdx: Int,
      subshape: NestedArray
  ): (Int, NestedArray) = subshape match {
    case Leaf(a) => {
      // copy as many elements from the flat array as this leaf originally held
      val n = a.size
      val subArray = Array.ofDim[Int](n)
      System.arraycopy(flatArr, startIdx, subArray, 0, n)
      (startIdx + n, Leaf(subArray))
    }
    case Node(cs) => {
      // thread the index through the children, left to right
      var idx = startIdx
      val childNodes = for (c <- cs) yield {
        val (i, a) = recHelper(idx, c)
        idx = i
        a
      }
      (idx, Node(childNodes))
    }
  }
  recHelper(0, shape)._2
}
Your example:
val original = NestedArray(
  NestedArray(NestedArray(1, 2), NestedArray(3, 4), NestedArray(5, 6)),
  NestedArray(NestedArray(7, 8, 9, 10))
)
println(original)
Here is what it looks like as an ASCII tree:
[
 [
  [1,2]
  [3,4]
  [5,6]
 ]
 [
  [7,8,9,10]
 ]
]
Now reconstruct a tree of same shape from a different array:
val flatArr = Array(1, 2, 3, 4, 12515, 25125, 12512, 8, 9, 10)
val reconstructed = decode(original, flatArr)
println(reconstructed)
this gives you:
[
 [
  [1,2]
  [3,4]
  [12515,25125]
 ]
 [
  [12512,8,9,10]
 ]
]
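As a quick sanity check (my addition, relying only on the encode/decode defined above), the two functions should round-trip:
// encoding and then decoding against the same shape should be lossless
val roundTrip = decode(original, encode(original))
println(roundTrip) // prints the same ASCII tree as `original`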
I hope this is more or less comprehensible to anyone who does some functional programming in a not-too-remote descendant of ML.
It turns out I've already answered your question a few months back, or at least a very similar one.
The code there needs to be tweaked a little bit, to make it fit here. In Scheme:
(define (merge-tree-fringe vals tree k)
  (cond
    [(null? tree)
     (k vals '())]
    [(not (pair? tree))         ; for each leaf:
     (k (cdr vals) (car vals))] ;   USE the first of vals
    [else
     (merge-tree-fringe vals (car tree) (lambda (Avals r)      ; collect 'r' from car,
       (merge-tree-fringe Avals (cdr tree) (lambda (Dvals q)   ; collect 'q' from cdr,
         (k Dvals (cons r q))))))]))                           ; return the last vals and the combined results
The first argument is a linear list of values; the second is the nested list whose structure is to be re-created. Making sure there are enough elements in the linear list of values is on you.
We call it as
> (merge-tree-fringe '(1 2 3 4 5 6 7 8) '(a ((b) c) d) (lambda (vs r) (list r vs)))
'((1 ((2) 3) 4) (5 6 7 8))
> (merge-tree-fringe '(1 2 3 4 5 6 7 8) '(a ((b) c) d) (lambda (vs r) r))
'(1 ((2) 3) 4)
There's some verbiage at the linked answer explaining what's going on. Long story short, it's written in CPS – continuation-passing style:
We process a part of the nested structure, substituting its leaves with values from the linear supply; then we process the rest of the structure with the remaining supply; then we combine the two results back together. For LISP-like nested lists, the two sub-parts are usually the "car" and the "cdr" of a "cons" cell, i.e. the tree's top node.
This is doing what Bergi's code is doing, essentially, but in a functional style.
In an imaginary pattern-matching pseudocode, which might be easier to read/follow, it is
merge-tree-fringe vals tree = g vals tree (vs r => r)
  where
  g vals [a, ...d] k = g vals a (avals r =>       -- avals: vals remaining after 'a'
                         g avals d (dvals q =>    -- dvals: remaining after 'd'
                           k dvals [r, ...q] ))   -- combine the results
  g vals []        k = k vals []                  -- empty
  g [v, ...vs] _   k = k vs v                     -- leaf: replace it
This computational pattern of threading a changing state through the computations is exactly what the State monad is about; with Haskell's do notation the above would be written as
merge_tree_fringe vals tree = evalState (g tree) vals
  where
  g [a, ...d] = do { r <- g a ; q <- g d ; return [r, ...q] }
  g []        = do { return [] }
  g _         = do { [v, ...vs] <- get ; put vs ; return v }   -- leaf: replace it
put and get work with the state, which is manipulated, updated, and passed around implicitly; vals is the initial state, and the final state is silently discarded by evalState, just as our (vs r => r) above does, but explicitly.
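A rough Scala transliteration of the same idea (my sketch, not from the original answer): an Iterator over the supply plays the role of the implicitly threaded state, and whatever is left in it at the end is simply discarded, like evalState does.
def mergeTreeFringe(vals: List[Int], tree: List[Any]): List[Any] = {
  val supply = vals.iterator // the mutable "state" threaded through the traversal
  def g(t: List[Any]): List[Any] = t.map {
    case sub: List[_] => g(sub)        // inner node: recurse
    case _            => supply.next() // leaf: replace with the next value
  }
  g(tree)
}

mergeTreeFringe(List(1, 2, 3, 4, 5, 6, 7, 8), List("a", List(List("b"), "c"), "d"))
// => List(1, List(List(2), 3), 4)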

How to merge lists that contain a recursive common attribute

The input is List(1,2), List(3,4), List(1000), List(5,6), List(100, 1, 3), List(99, 4, 5).
The expected output is: List(1,2,3,4,5,6,99,100), List(1000)
I tried to use foldLeft, but I found that a single O(n) pass would miss some elements. I wonder whether there is a Scala collection API or method I can use to solve this puzzle. Also, I'd prefer the solution to be functional, if possible.
def merge(lists: List[List[Int]]): List[List[Int]] = {
  ???
}
Thanks in advance.
You can try this function; it also works well on huge lists:
def merge(input: List[List[Int]]): List[List[Int]] = {
  val sets: Set[Set[Int]] = input.map(_.toSet).toSet
  // a set counts as intersecting if it shares elements with at least one
  // other set (the count includes the set itself, hence > 1)
  def hasIntersect(set: Set[Int]): Boolean =
    sets.count(set.intersect(_).nonEmpty) > 1
  val (merged, rejected) = sets partition hasIntersect
  List(merged.flatten, rejected.flatten).map(_.toList.sorted)
}
merge(List(List(1, 2), List(3, 4), List(1000), List(5, 6), List(100, 1, 3), List(99, 4, 5)))
You will get the result in the format
res0: List[List[Int]] = List(List(1, 2, 3, 4, 5, 6, 99, 100), List(1000))
Please let me know if you have any further doubts. I would be happy to clarify them.
Here is a recursive solution for your reference:
def merge(a: List[List[Int]]): List[List[Int]] = {
  a match {
    case Nil => Nil
    case h :: l =>
      l.partition(_.intersect(h) != Nil) match {
        case (Nil, _) =>
          // no intersection: just merge the rest and add this one
          h :: merge(l)
        case (intersects, others) =>
          // it has intersections: merge them into one list and continue merging
          merge((h :: intersects).flatten.distinct :: others)
      }
  }
}
res9: List[List[Int]] = List(List(1, 2, 100, 3, 4, 99, 5, 6), List(1000))
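Because intersecting lists are merged and then re-examined, chains of overlaps collapse correctly as well; for example (my own illustration):
merge(List(List(1, 2), List(2, 3), List(3, 4)))
// => List(List(1, 2, 3, 4))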
All you need are filter, toSet and sorted calls, as in
def merge(lists: List[List[Int]]): List[List[Int]] = {
  val flattenedList = lists.flatten
  // lists containing at least one element that occurs more than once overall
  val repeatedList = lists.filter(list =>
    list.map(x => flattenedList.count(_ == x) > 1).contains(true))
  val notRepeatedList = lists.diff(repeatedList)
  List(repeatedList.flatten.toSet.toList.sorted) ++ notRepeatedList
}
and then calling the merge function as
val lists = List(List(1,2), List(3,4), List(1000), List(5,6), List(100, 1,3), List(99, 4, 5))
println(merge(lists))
would give you
List(List(1, 2, 3, 4, 5, 6, 99, 100), List(1000))

Scala: How to get the Top N elements of an Iterable with Grouping (or Binning)

I have used the solution mentioned here to get the top n elements of a Scala Iterable, efficiently.
End example:
scala> val li = List (4, 3, 6, 7, 1, 2, 9, 5)
li: List[Int] = List(4, 3, 6, 7, 1, 2, 9, 5)
scala> top (2, li)
res0: List[Int] = List(2, 1)
Now, suppose I want to get the top n elements at a lower resolution. The range of integers may somehow be divided/binned/grouped into sub-ranges, such as modulo 2: {0-1, 2-3, 4-5, ...}, and within each sub-range I do not differentiate between integers; e.g. 0 and 1 are all the same to me. Therefore, the top element in the above example would still be 1, but the next element would be either 2 or 3. Put differently, these results are equivalent:
scala> top (2, li)
res0: List[Int] = List(2, 1)
scala> top (2, li)
res0: List[Int] = List(3, 1)
How do I change this nice function to fit these needs?
Is my intuition correct that this kind of sort should be faster? Since the sort is over the bins/groups, we can then take all or some of the elements of the bins, in no particular order, until we get to n elements.
Comments:
The binning/grouping is something simple and fixed, like modulo k; it doesn't have to be generic, e.g. allowing sub-ranges of different lengths.
Inside each bin, assuming we need only some of the elements, we can just take the first ones, or even some random ones; it doesn't have to follow any specific system.
Per the comment, you're just changing the comparison.
In this version, 4 and 3 compare equal and 4 is taken first.
object Firstly extends App {
  def firstly(taking: Int, vs: List[Int]) = {
    import collection.mutable.{ SortedSet => S }
    def bucketed(i: Int) = (i + 1) / 2
    vs.foldLeft(S.empty[Int]) { (s, i) =>
      if (s.size < taking) s += i                 // not full yet: just add
      else if (bucketed(i) >= bucketed(s.last)) s // same or worse bucket: skip
      else {
        s += i      // strictly better bucket: add it
        s -= s.last // and evict the current worst element
      }
    }
  }
  assert(firstly(taking = 2, List(4, 6, 7, 1, 9, 3, 5)) == Set(4, 1))
}
Edit: an example of sorting buckets instead of keeping a sorted "top N" (with bucketed as defined above):
scala> List(4, 6, 7, 1, 9, 3, 5).groupBy(bucketed).toList.sortBy {
     |   case (i, vs) => i }.flatMap {
     |   case (i, vs) => vs }.take(5)
res10: List[Int] = List(1, 4, 3, 6, 5)

scala> List(4, 6, 7, 1, 9, 3, 5).groupBy(bucketed).toList.sortBy {
     |   case (i, vs) => i }.map {
     |   case (i, vs) => vs.head }.take(5)
res11: List[Int] = List(1, 4, 6, 7, 9)
Not sure which of the last two results you prefer. As to whether sorting buckets is better, it depends on how many buckets there are.
How about mapping with integer division before using the original algorithm?
def top(n: Int, li: List[Int]) = li.sorted.distinct.take(n)

val li = List(4, 3, 6, 7, 1, 2, 9, 5)
top(2, li) // List(1, 2)

def topBin(n: Int, bin: Int, li: List[Int]) =
  top(n, li.map(_ / bin))                      // bin indices, e.g. List(0, 1)
    .map(i => (i * bin) until ((i + 1) * bin)) // turn bin indices back into ranges

topBin(2, 2, li) // List(0 to 1, 2 to 3)
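If you need actual elements of the original list rather than ranges, one possibility (my addition, not part of the original answer) is to pick a representative per returned range:
topBin(2, 2, li).map(range => li.find(range.contains).get)
// e.g. List(1, 3) -- any element falling into each bin would do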

Multiply collection and randomly merge with other - Apache Spark

I am given two collections (RDDs) and a number of samples. Let's say:
val v = sc.parallelize(List("a", "b", "c"))
val a = sc.parallelize(List(1, 2, 3, 4, 5))
val samplesCount = 2
I want to create two collections (samples) consisting of pairs where one value is from 'v' and the second one is from 'a'. Each collection must contain all values from 'v' and random values from 'a'.
Example result would be:
(
(("a", 3), ("b", 5), ("c", 1)),
(("a", 4), ("b", 2), ("c", 5))
)
One more thing to add: the values from 'v' or 'a' can't repeat within a sample.
I can't think of any good way to achieve this.
You randomly shuffle the RDD to be sampled and then join the two RDDs by line index:
import org.apache.spark.rdd.RDD

def shuffle[A: reflect.ClassTag](a: RDD[A]): RDD[A] = {
  // key every element by a random number, sort by that key, then drop the keys
  val randomized = a.map(util.Random.nextInt -> _)
  randomized.sortByKey().values
}

def joinLines[A: reflect.ClassTag, B](a: RDD[A], b: RDD[B]): RDD[(A, B)] = {
  // pair up the i-th element of one RDD with the i-th element of the other
  val aNumbered = a.zipWithIndex.map { case (x, i) => (i, x) }
  val bNumbered = b.zipWithIndex.map { case (x, i) => (i, x) }
  aNumbered.join(bNumbered).values
}
val v = sc.parallelize(List("a", "b", "c"))
val a = sc.parallelize(List(1, 2, 3, 4, 5))
val sampled = joinLines(v, shuffle(a))
RDDs are immutable, so you don't need to "multiply" anything. If you want multiple samples, just do:
val sampledRDDs: Seq[RDD[(String, Int)]] =
  (1 to samplesCount).map(_ => joinLines(v, shuffle(a)))
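For a quick look at the result on the driver (a small sketch, assuming the dataset is small enough to collect):
sampledRDDs.foreach { rdd =>
  println(rdd.collect().toList) // e.g. List((a,3), (b,5), (c,1))
}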

Looking for the best solution

Consider this list composed of objects which are instances of case classes:
A, B, Opt(A), C, Opt(D), F, Opt(C), G, Opt(H)
I want to normalize this list to get this result:
A, B, C, Opt(D), F, G, Opt(H)
As you see, if there are elements A and Opt(A), I replace them with just A; or, said another way, I have to remove the Opt(A) element.
I would like:
the most performant solution
the shortest solution
This might be a little more concise, as filtering is what you want ;-):
scala> List(1,2,3,Some(4),5,Some(5))
res0: List[Any] = List(1, 2, 3, Some(4), 5, Some(5))

scala> res0.filter {
     |   case Some(x) => !res0.contains(x)
     |   case _ => true
     | }
res1: List[Any] = List(1, 2, 3, Some(4), 5)
Edit: for large collections it might be good to call toSet first, or to use a Set directly.
Not the most efficient solution, but certainly a simple one.
scala> case class Opt[A](a: A)
defined class Opt

scala> val xs = List(1, 2, Opt(1), 3, Opt(4), 6, Opt(3), 7, Opt(8))
xs: List[Any] = List(1, 2, Opt(1), 3, Opt(4), 6, Opt(3), 7, Opt(8))

scala> xs flatMap {
     |   case o @ Opt(x) => if (xs contains x) None else Some(o)
     |   case x => Some(x)
     | }
res5: List[Any] = List(1, 2, 3, Opt(4), 6, 7, Opt(8))
If you don't care about order, then efficiency leads you to use a Set:
xs.foldLeft(Set.empty[Any])({ case (set, x) => x match {
  case Some(y) => if (set contains y) set else set + x
  case y       => if (set contains Some(y)) set - Some(y) + y else set + y
}}).toList
Alternatively:
val (opts, ints) = xs.toSet.partition(_.isInstanceOf[Option[_]])
(opts -- (ints map (Option(_))) ++ ints).toList
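A quick check of the foldLeft version above (my own illustration, wrapped in a hypothetical normalize helper) shows it also handles a Some(_) arriving before its plain value:
def normalize(xs: List[Any]): List[Any] =
  xs.foldLeft(Set.empty[Any]) { (set, x) => x match {
    case Some(y) => if (set contains y) set else set + x
    case y       => if (set contains Some(y)) set - Some(y) + y else set + y
  }}.toList

normalize(List(Some(5), 1, 2, Some(1), 5)) // => List(1, 2, 5) in some set order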
