Multiply collection and randomly merge with other - Apache Spark - algorithm

I am given two collections (RDDs) and a number of samples. Let's say:
val v = sc.parallelize(List("a", "b", "c"))
val a = sc.parallelize(List(1, 2, 3, 4, 5))
val samplesCount = 2
I want to create two collections (samples) of pairs, where the first value of each pair comes from v and the second one from a. Each sample must contain all values from v, each paired with a random value from a.
Example result would be:
(
(("a", 3), ("b", 5), ("c", 1)),
(("a", 4), ("b", 2), ("c", 5))
)
One more constraint: the values from v or a can't repeat within a sample.
I can't think of any good way to achieve this.

You randomly shuffle the RDD to be sampled and then join the two RDDs by line index:
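import org.apache.spark.rdd.RDD

// Shuffle by attaching a random key to every element and sorting on that key.
def shuffle[A: reflect.ClassTag](a: RDD[A]): RDD[A] = {
  val randomized = a.map(util.Random.nextInt() -> _)
  randomized.sortByKey().values
}

// Pair up the elements of two RDDs that share the same line index.
// The inner join drops the surplus elements of the longer RDD.
def joinLines[A: reflect.ClassTag, B](a: RDD[A], b: RDD[B]): RDD[(A, B)] = {
  val aNumbered = a.zipWithIndex.map { case (x, i) => (i, x) }
  val bNumbered = b.zipWithIndex.map { case (x, i) => (i, x) }
  aNumbered.join(bNumbered).values
}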
val v = sc.parallelize(List("a", "b", "c"))
val a = sc.parallelize(List(1, 2, 3, 4, 5))
val sampled = joinLines(v, shuffle(a))
RDDs are immutable, so you don't need to "multiply" anything. If you want multiple samples just do:
val sampledRDDs: Seq[RDD[(String, Int)]] =
(1 to samplesCount).map(_ => joinLines(v, shuffle(a)))
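For intuition, the same shuffle-then-pair idea can be sketched on plain local Scala collections, without Spark. This is only an illustration: the helper name `sampleOnce`, the `*Local` values and the fixed seed are assumptions made for this example, not part of the answer above, and it does not replace the RDD version for data that does not fit on one machine.

```scala
import scala.util.Random

// Hypothetical helper: pair every element of `v` with a distinct random element of `a`.
def sampleOnce[A, B](v: Seq[A], a: Seq[B], rng: Random): Seq[(A, B)] = {
  require(a.size >= v.size, "`a` must have at least as many elements as `v`")
  // Shuffling `a` and zipping keeps all of `v` and never reuses a value of `a`.
  v.zip(rng.shuffle(a))
}

val vLocal = Seq("a", "b", "c")
val aLocal = Seq(1, 2, 3, 4, 5)
val rng = new Random(42) // fixed seed only to make runs repeatable
val samplesLocal: Seq[Seq[(String, Int)]] =
  Seq.fill(2)(sampleOnce(vLocal, aLocal, rng))
```

Because `zip` stops at the shorter sequence, the surplus values of the shuffled `a` are simply dropped, which plays the same role as the inner join in `joinLines`.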

Related

lattice xyplot 3x3 arrangement

I have 3 xyplots from lattice. Up to now I have only ever used split, as in
print(pd1, split = c(1, 1, 2, 2), more = TRUE)
print(pd2, split = c(2, 1, 2, 2), more = TRUE)
etc., to arrange plots in a 2x2 manner. However, how can I use it for a 1x3 or 3x3 arrangement? I tried some positions, but I have not quite understood how it actually works.
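# I think this is what you want
library(lattice)
data <- data.frame(class = LETTERS[1:6], value = 1:6)
pd1 <- dotplot(value ~ class, data)
pd2 <- dotplot(class ~ value, data)
pd3 <- dotplot(class ~ value | cut(value, c(0, 3, 6)), data)
# split = c(x, y, nx, ny): put the plot at column x, row y of an nx-by-ny grid.
# Here: one column, three rows. For one row of three, use c(1, 1, 3, 1), c(2, 1, 3, 1), c(3, 1, 3, 1).
print(pd1, split = c(1, 1, 1, 3), more = TRUE)
print(pd2, split = c(1, 2, 1, 3), more = TRUE)
print(pd3, split = c(1, 3, 1, 3))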

Fill a nested structure with values from a linear supply stream

I got stuck on the following problem:
Imagine we have an array structure, any structure, but for this example let's use:
[
[ [1, 2], [3, 4], [5, 6] ],
[ 7, 8, 9, 10 ]
]
For convenience, I transform this structure into a flat array like:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
Imagine that after certain operations our array looks like this:
[ 1, 2, 3, 4, 12515, 25125, 12512, 8, 9, 10]
NOTE: those values are the result of some operation; I just want to point out that they are independent of the structure and of their positions.
What I would like to know is... given the first array structure, how can I transform the last flat array into the same structure as the first? So it will look like:
[
[ [1, 2], [3, 4] , [12515, 25125] ],
[ 12512, 8, 9, 10]
]
Any suggestions? So far I have just been hardcoding the positions into the given structure, but that's not dynamic.
Just recurse through the structure, and use an iterator to generate the values in order:
function fillWithStream(structure, iterator) {
  for (var i=0; i<structure.length; i++)
    if (Array.isArray(structure[i]))
      fillWithStream(structure[i], iterator);
    else
      structure[i] = getNext(iterator);
}
function getNext(iterator) {
  const res = iterator.next();
  if (res.done) throw new Error("not enough elements in the iterator");
  return res.value;
}
var structure = [
  [ [1, 2], [3, 4], [5, 6] ],
  [ 7, 8, 9, 10 ]
];
var seq = [1, 2, 3, 4, 12515, 25125, 12512, 8, 9, 10];
fillWithStream(structure, seq[Symbol.iterator]())
console.log(JSON.stringify(structure));
Here is a sketch in Scala. Whatever your language is, you first have to represent the tree-like data structure somehow:
sealed trait NestedArray
case class Leaf(arr: Array[Int]) extends NestedArray {
  override def toString = arr.mkString("[", ",", "]")
}
case class Node(children: Array[NestedArray]) extends NestedArray {
  override def toString =
    children
      .flatMap(_.toString.split("\n"))
      .map(" " + _)
      .mkString("[\n", "\n", "\n]")
}
object NestedArray {
  def apply(ints: Int*) = Leaf(ints.toArray)
  def apply(cs: NestedArray*) = Node(cs.toArray)
}
The only important part is the distinction between the leaf nodes, which hold arrays of integers, and the inner nodes, which hold their child nodes in arrays. The toString methods and extra constructors are not that important; they are mostly there for the little demo below.
Now you essentially want to build an encoder-decoder, where the encode part simply flattens everything, and decode part takes another nested array as argument, and reshapes a flat array into the shape of the nested array. The flattening is very simple:
def encode(a: NestedArray): Array[Int] = a match {
  case Leaf(arr) => arr
  case Node(cs) => cs flatMap encode
}
Restoring the structure isn't all that difficult either. I've decided to keep track of the position in the array by passing around an explicit int index:
def decode(
  shape: NestedArray,
  flatArr: Array[Int]
): NestedArray = {
  def recHelper(
    startIdx: Int,
    subshape: NestedArray
  ): (Int, NestedArray) = subshape match {
    case Leaf(a) => {
      val n = a.size
      val subArray = Array.ofDim[Int](n)
      System.arraycopy(flatArr, startIdx, subArray, 0, n)
      (startIdx + n, Leaf(subArray))
    }
    case Node(cs) => {
      var idx = startIdx
      val childNodes = for (c <- cs) yield {
        val (i, a) = recHelper(idx, c)
        idx = i
        a
      }
      (idx, Node(childNodes))
    }
  }
  recHelper(0, shape)._2
}
Your example:
val original = NestedArray(
  NestedArray(NestedArray(1, 2), NestedArray(3, 4), NestedArray(5, 6)),
  NestedArray(NestedArray(7, 8, 9, 10))
)
println(original)
Here is what it looks like as ASCII-tree:
[
 [
  [1,2]
  [3,4]
  [5,6]
 ]
 [
  [7,8,9,10]
 ]
]
Now reconstruct a tree of same shape from a different array:
val flatArr = Array(1, 2, 3, 4, 12515, 25125, 12512, 8, 9, 10)
val reconstructed = decode(original, flatArr)
println(reconstructed)
This gives you:
[
 [
  [1,2]
  [3,4]
  [12515,25125]
 ]
 [
  [12512,8,9,10]
 ]
]
I hope that should be more or less comprehensible for anyone who does some functional programming in a not-too-remote descendant of ML.
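A shorter variant of the same decoder, as a sketch: instead of passing an explicit index around, it pulls values from an `Iterator`, much like the JavaScript answer does. The type names `Tree`, `Values` and `Branch` are made up for this example, and `Vector` is used instead of `Array` only so that results can be compared with `==`.

```scala
sealed trait Tree
final case class Values(items: Vector[Int]) extends Tree      // leaf holding numbers
final case class Branch(children: Vector[Tree]) extends Tree  // inner node

// Flatten the tree into the left-to-right sequence of its numbers.
def flatten(t: Tree): Vector[Int] = t match {
  case Values(items)    => items
  case Branch(children) => children.flatMap(flatten)
}

// Rebuild a tree of the same shape, taking each number from `supply` in order.
// Throws NoSuchElementException if `supply` runs out of values.
def refill(shape: Tree, supply: Iterator[Int]): Tree = shape match {
  case Values(items)    => Values(items.map(_ => supply.next()))
  case Branch(children) => Branch(children.map(refill(_, supply)))
}

val shape: Tree = Branch(Vector(
  Branch(Vector(Values(Vector(1, 2)), Values(Vector(3, 4)), Values(Vector(5, 6)))),
  Values(Vector(7, 8, 9, 10))
))
val flat = Vector(1, 2, 3, 4, 12515, 25125, 12512, 8, 9, 10)
val rebuilt = refill(shape, flat.iterator)
```

The iterator is mutable shared state, so this version relies on `Vector.map` visiting elements strictly from left to right; the index-passing `decode` above makes that ordering explicit instead.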
Turns out I've already answered your question a few months back, a very similar one to it anyway.
The code there needs to be tweaked a little bit, to make it fit here. In Scheme:
(define (merge-tree-fringe vals tree k)
  (cond
    [(null? tree)
     (k vals '())]
    [(not (pair? tree))                  ; for each leaf:
     (k (cdr vals) (car vals))]          ; USE the first of vals
    [else
     (merge-tree-fringe vals (car tree) (lambda (Avals r)      ; collect 'r' from car,
       (merge-tree-fringe Avals (cdr tree) (lambda (Dvals q)   ; collect 'q' from cdr,
         (k Dvals (cons r q))))))]))     ; return the last vals and the combined results
The first argument is a linear list of values, the second is the nested list whose structure is to be re-created. Making sure there are enough elements in the linear list of values is on you.
We call it as
> (merge-tree-fringe '(1 2 3 4 5 6 7 8) '(a ((b) c) d) (lambda (vs r) (list r vs)))
'((1 ((2) 3) 4) (5 6 7 8))
> (merge-tree-fringe '(1 2 3 4 5 6 7 8) '(a ((b) c) d) (lambda (vs r) r))
'(1 ((2) 3) 4)
There's some verbiage at the linked answer explaining what's going on. Long story short, it's written in CPS (continuation-passing style):
We process a part of the nested structure while substituting the leaves with the values from the linear supply; then we're processing the rest of the structure with the remaining supply; then we combine back the two results we got from processing the two sub-parts. For LISP-like nested lists, it's usually the "car" and the "cdr" of the "cons" cell, i.e. the tree's top node.
This is doing what Bergi's code is doing, essentially, but in a functional style.
In an imaginary pattern-matching pseudocode, which might be easier to read/follow, it is
merge-tree-fringe vals tree = g vals tree (vs r => r)
  where
  g vals [a, ...d] k = g vals a (avals r =>     -- avals: vals remaining after 'a'
                         g avals d (dvals q =>  -- dvals: remaining after 'd'
                           k dvals [r, ...q] )) -- combine the results
  g vals []        k = k vals []                -- empty
  g [v, ...vs] _   k = k vs v                   -- leaf: replace it
This computational pattern of threading a changing state through the computations is exactly what the State monad is about; with Haskell's do notation the above would be written as
merge_tree_fringe vals tree = evalState (g tree) vals
  where
  g [a, ...d] = do { r <- g a ; q <- g d ; return [r, ...q] }
  g []        = do { return [] }
  g _         = do { [v, ...vs] <- get ; put vs ; return v }  -- leaf: replace
put and get work with the state being manipulated, updated and passed around implicitly; vals being the initial state; the final state being silently discarded by evalState, like our (vs r => r) above also does, but explicitly so.

Scala : Sorting list of number based on another list

I am implementing an algorithm in Scala where I have a set of nodes (integers), and each node has one property associated with it; let's call that property "d" (which is again an integer).
I have a List[Int]; this list contains the nodes in descending order of "d".
I also have a Map[Int, Iterable[Int]], where the key is a node and the value is the list of all its neighbors.
The question is: how can I store the list of neighbors for each node in the Map in descending order of property "d"?
Example:
List 1: List[1,5,7,2,4,8,6,3] --> Imagine this list is sorted in some order and has all the numbers.
Map: Map[Int, Iterable[Int]] --> [1, Iterable[2,3,4,5,6]]
This iterable may or may not have all numbers.
In simple words, I want the numbers in the Iterable to be in the same order as in List 1.
So my entry in the Map should be: [1, Iterable[5,2,4,6,3]]
The easiest way to do this is to just filter the sorted list.
val list = List(1,5,7,2,4,8,6,3)
val map = Map(1 -> List(2,3,4,5,6),
2 -> List(1,2,7,8))
val map2 = map.mapValues(neighbors => list.filter(neighbors.contains))
println(map2)
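The filter above calls `contains` on a List for every element of the sorted list, which is linear per lookup. A possible refinement, sketched under the assumption that neighbor lists may be large, is to turn each neighbor collection into a Set first. The names `order`, `neighbors` and `sortedNeighbors` are chosen for this example only.

```scala
val order = List(1, 5, 7, 2, 4, 8, 6, 3)
val neighbors: Map[Int, Iterable[Int]] =
  Map(1 -> List(2, 3, 4, 5, 6), 2 -> List(1, 2, 7, 8))

// For each node, keep the elements of `order` that are among its neighbors.
val sortedNeighbors: Map[Int, List[Int]] =
  neighbors.map { case (node, ns) =>
    val wanted = ns.toSet        // a Set[Int] is also a function Int => Boolean
    node -> order.filter(wanted)
  }
```

As with the original filter, duplicate neighbors collapse to a single entry and neighbors that do not occur in `order` are dropped.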
Here is a possible solution utilizing foldLeft (note we get an ArrayBuffer at end instead of desired Iterable, but the type signature does say Iterable):
scala> val orderTemplate = List(1,5,7,2,4,8,6,3)
orderTemplate: List[Int] = List(1, 5, 7, 2, 4, 8, 6, 3)
scala> val toOrder = Map(1 -> Iterable(2,3,4,5,6))
toOrder: scala.collection.immutable.Map[Int,Iterable[Int]] = Map(1 -> List(2, 3, 4, 5, 6))
scala> val ordered = toOrder.mapValues(iterable =>
orderTemplate.foldLeft(Iterable.empty[Int])((a, i) =>
if (iterable.toBuffer.contains(i)) a.toBuffer :+ i
else a
)
)
ordered: scala.collection.immutable.Map[Int,Iterable[Int]] = Map(1 -> ArrayBuffer(5, 2, 4, 6, 3))
Here's what I got.
val lst = List(1,5,7,2,4,8,6,3)
val itr = Iterable(2,3,4,5,6)
itr.map(x => (lst.indexOf(x), x))
.toArray
.sorted
.map(_._2)
.toIterable // res0: Iterable[Int] = WrappedArray(5, 2, 4, 6, 3)
I coupled each entry with its relative index in the full list.
Can't sort iterables so went with Array (for no particular reason).
Tuples sorting defaults to the first element.
Remove the indexes.
Back to Iterable.
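The steps above look up `lst.indexOf(x)` once per element, which scans the list each time. A small variation, offered only as a sketch, precomputes the positions in a Map and sorts by them. The names `rank` and `sortByOrder` are hypothetical, and sending values missing from the ordering to the end is a design choice of this example, not something the answer specifies.

```scala
val order = List(1, 5, 7, 2, 4, 8, 6, 3)

// Position of every node in the reference ordering.
val rank: Map[Int, Int] = order.zipWithIndex.toMap

// Sort neighbors by their position in `order`; unknown values go last.
def sortByOrder(ns: Iterable[Int]): List[Int] =
  ns.toList.sortBy(n => rank.getOrElse(n, Int.MaxValue))

val sorted = sortByOrder(Iterable(2, 3, 4, 5, 6))
```

Unlike the filter-based answers, this keeps duplicates and values that are absent from `order`.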

Grouping adjacent elements in a list

Let's say I want to write a function that does this:
input: [1,1,3,3,4,2,2,5,6,6]
output: [[1,1],[3,3],[4],[2,2],[5],[6,6]]
It's grouping adjacent elements that are same.
What should the name of this method be? Is there a standard name for this operation?
In [1,1,3,3,4,2,2,5,6,6], a thing like [1,1] is very often referred to as a run (as in run-length encoding, see RLE in Scala). I'd therefore call the method groupRuns.
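import scala.annotation.tailrec

@tailrec
def groupRuns[A](c: Seq[A], acc: Seq[Seq[A]] = Seq.empty): Seq[Seq[A]] = {
  c match {
    case Seq() => acc
    case xs =>
      // span splits off the longest prefix equal to the head, i.e. one run
      val (same, rest) = xs.span { _ == xs.head }
      groupRuns(rest, acc :+ same)
  }
}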
scala> groupRuns(Vector(1, 1, 3, 3, 4, 2, 2, 5, 6, 6))
res7: Seq[Seq[Int]] = List(Vector(1, 1), Vector(3, 3), Vector(4), Vector(2, 2), Vector(5), Vector(6, 6))
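Since the name comes from run-length encoding, here is a sketch of how the two relate. It uses an alternative single-pass grouping over a List with `foldRight`, which prepends to the accumulator instead of appending. The names `groupRunsList` and `runLengths` are made up for this example and are not a standard library API.

```scala
// Group adjacent equal elements, walking the list from the right.
def groupRunsList[A](xs: List[A]): List[List[A]] =
  xs.foldRight(List.empty[List[A]]) {
    // Current element continues the run at the front of the accumulator.
    case (x, (run @ (y :: _)) :: rest) if x == y => (x :: run) :: rest
    // Otherwise it starts a new run.
    case (x, acc) => List(x) :: acc
  }

// Run-length encoding: each run becomes (element, length of the run).
def runLengths[A](xs: List[A]): List[(A, Int)] =
  groupRunsList(xs).map(run => (run.head, run.length))

val runs = groupRunsList(List(1, 1, 3, 3, 4, 2, 2, 5, 6, 6))
val encoded = runLengths(List(1, 1, 3, 3, 4, 2, 2, 5, 6, 6))
```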

Looking for the best solution

Consider this list composed of objects which are instances of case classes:
A, B, Opt(A), C, Opt(D), F, Opt(C), G, Opt(H)
I want to normalize this list to get this result:
A, B, C, Opt(D), F, G, Opt(H)
As you can see, if both A and Opt(A) are present, I replace them with just A; in other words, I have to remove the Opt(A) element.
I would like:
the most efficient solution in terms of performance
the shortest solution
This might be a little more concise, as filtering is what you want ;-):
scala> List(1,2,3,Some(4),5,Some(5))
res0: List[Any] = List(1, 2, 3, Some(4), 5, Some(5))
scala> res0.filter {
| case Some(x) => !res0.contains(x)
| case _ => true
| }
res1: List[Any] = List(1, 2, 3, Some(4), 5)
edit: For large collections it might be good to use a toSet or directly use a Set.
Not the most efficient solution, but certainly a simple one.
scala> case class Opt[A](a: A)
defined class Opt
scala> val xs = List(1, 2, Opt(1), 3, Opt(4), 6, Opt(3), 7, Opt(8))
xs: List[Any] = List(1, 2, Opt(1), 3, Opt(4), 6, Opt(3), 7, Opt(8))
scala> xs flatMap {
| case o @ Opt(x) => if (xs contains x) None else Some(o)
| case x => Some(x)
| }
res5: List[Any] = List(1, 2, 3, Opt(4), 6, 7, Opt(8))
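The `xs contains x` check above scans the whole list for every Opt. If the list is large and the original order must be kept, one option is to collect the plain (non-Opt) elements into a Set once and filter against it. This is a sketch only; the function name `normalize` is an assumption for this example.

```scala
final case class Opt[A](a: A)

// Drop every Opt(x) whose x also occurs unwrapped in the list; keep order.
def normalize(xs: List[Any]): List[Any] = {
  // All elements that are not wrapped in Opt, for constant-time lookup.
  val plain: Set[Any] = xs.filterNot(_.isInstanceOf[Opt[_]]).toSet
  xs.filter {
    case Opt(x) => !plain.contains(x)
    case _      => true
  }
}

val normalized = normalize(List(1, 2, Opt(1), 3, Opt(4), 6, Opt(3), 7, Opt(8)))
```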
If you don't care about order then efficiency leads you to use a Set:
xs.foldLeft(Set.empty[Any])({ case (set, x) => x match {
  case Some(y) => if (set contains y) set else set + x
  case y => if (set contains Some(y)) set - Some(y) + y else set + y
}}).toList
Alternatively:
val (opts, ints) = xs.toSet.partition(_.isInstanceOf[Option[_]])
((opts -- ints.map(Option(_))) ++ ints).toList

Resources