How to carry over a calculated value within an RDD? - Apache Spark - syntax

SOLVED: There is no good solution to this problem
I am sure that this is just a syntax-relevant question and that answer is an easy one.
What I am trying to achieve is to:
-pass a variable to RDD
-change the variable according to RDD data
-get the adjusted variable
Let's say I have:
var b = 2
val x = sc.parallelize(0 to 3)
what I want to do is to obtain the value 2+0 + 2+0+1 + 2+0+1+2 + 2+0+1+2+3 = 18
That is, the value 18 by doing something like
b = x.map(i=> … b+i...).collect
The problem is, for each i, I need to carry over the value b, to be incremented with the next i
I want to use this logic for adding the elements to an array that is external to RDD
How would I do that without doing the collect first ?

As mentioned in the comments, it's not possible to mutate one variable with the contents of an RDD as RDDs are distributed across potentially many different nodes while mutable variables are local to each executor (JVM).
Although not particularly performant, it's possible to implement these requirements on Spark by translating the sequential algorithm into a series of transformations that can be executed in a distributed environment.
Using the same example as in the question, this algorithm could be expressed in Spark as:
val initialOffset = 2
val rdd = sc.parallelize(0 to 3)
val halfCartesian = rdd.cartesian(rdd).filter{case (x,y) => x>=y}
val partialSums = halfCartesian.reduceByKey(_ + _)
val adjustedPartials = partialSums.map{case (k,v) => v+initialOffset}
val total = adjustedPartials.reduce(_ + _)
scala> total
res33: Int = 18
Note that cartesian is a very expensive transformation as it creates (m x n) elements, or in this case n^2.
This is just to say that it's not impossible, but probably not ideal.
If the amount of data to be processed sequentially would fit in the memory of one machine (maybe after filtering/reduce), then Scala has a built-in collection operation to realize exactly what's being asked: scan[Left|Right]
val arr = Array(0,1,2,3)
val cumulativeScan = arr.scanLeft(initialOffset)(_ + _)
// we remove head b/c scan adds the given element at the start of the sequence
val result = cumulativeScan.tail.sum
result: Int = 18
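When the data is too large to collect but the cartesian product is too expensive, a common middle ground is a two-pass prefix sum: first sum each partition, then carry the totals of the preceding partitions into each one. The sketch below is plain Python rather than Spark code; the nested lists stand in for RDD partitions, and `carried_total` is a hypothetical name, not a Spark API.

```python
from itertools import accumulate


def carried_total(partitions, initial_offset):
    """Sum of all running totals, computed partition by partition."""
    # Pass 1: one small number per partition (cheap to bring to a driver).
    part_sums = [sum(part) for part in partitions]
    # The value carried into a partition is the initial offset plus the
    # sums of every partition before it.
    shifted = [initial_offset + s for s in accumulate(part_sums)]
    carry_in = [initial_offset] + shifted[:-1]
    # Pass 2: each partition can now be processed independently.
    total = 0
    for part, carry in zip(partitions, carry_in):
        running = carry
        for value in part:
            running += value
            total += running
    return total


print(carried_total([[0, 1], [2, 3]], 2))
```

On Spark, the two passes would probably map onto per-partition operations such as mapPartitionsWithIndex, with only the short list of per-partition sums returned to the driver.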

Related

find keystrokes for On Screen Keyboard scala

I am trying to solve a recent interview question using Scala..
You have an on-screen keyboard which is a grid of 6 rows with 5 columns each. The letters A to Z and a blank space are arranged in the grid row first.
You can use this on-screen keyboard to type words with your TV remote, by pressing the Left, Right, Up, Down or OK keys to type each character.
Question: given an input string, find the sequence of keystrokes needed to be pressed on the remote to type the input.
The code implementation can be found at
https://github.com/mradityagoyal/scala/blob/master/OnScrKb/src/main/scala/OnScrKB.scala
I have tried to solve this using three different approaches..
Simple foldLeft.
def keystrokesByFL(input: String, startChar: Char = 'A'): String = {
  val zero = ("", startChar)
  //(acc, last) + next => (acc+ aToB , next)
  def op(zero: (String, Char), next: Char): (String, Char) = zero match {
    case (acc, last) => (acc + path(last, next), next)
  }
  val result = input.foldLeft(zero)(op)
  result._1
}
Divide and conquer - uses a divide-and-conquer mechanism. The algorithm is similar to merge sort:
- We split the input word in two if its length is > 3.
- We recursively call the subroutine to get the paths of the left and right halves of the split.
- In the end, we add the keystrokes for the first half + the keystrokes from the end of the first string to the start of the second string + the keystrokes for the second half.
- Essentially, we keep dividing the input string into two smaller halves until a piece is shorter than 4 characters; for pieces shorter than 4 we use the foldLeft version.
def keystrokesByDnQ(input: String, startChar: Char = 'A'): String = {
  def splitAndMerge(in: String, startChar: Char): String = {
    if (in.length() < 4) {
      // if length is < 4 then don't split, as you might end up with one side having only 1 char.
      keystrokesByFL(in, startChar)
    } else {
      // split
      val (x, y) = in.splitAt(in.length() / 2)
      splitAndMerge(x, startChar) + splitAndMerge(y, x.last)
    }
  }
  splitAndMerge(input, startChar)
}
Fold - uses the property that the underlying operation is associative (but not commutative). For example, keystrokes("ABCDEFGHI", startChar = 'A') == keystrokes("ABC", startChar = 'A') + keystrokes("DEF", 'C') + keystrokes("GHI", 'F')
def keystrokesByF(input: String, startChar: Char = 'A'): String = {
  // map each character in input to case class PathAcc("CharAsString", "")
  val mapped = input.map { x => PathAcc(text = "" + x, path = "") }
  val z = PathAcc(text = "" + startChar, path = "") // the starting char.
  def op(left: PathAcc, right: PathAcc): PathAcc = {
    PathAcc(text = left.text + right.text, path = left.path + path(left.text.last, right.text.head) + right.path)
  }
  val foldresult = mapped.fold(z)(op)
  foldresult.path
}
My questions:
1. Is the divide and conquer approach better than Fold?
2. Are Fold and divide and conquer better than foldLeft (for this specific problem)?
3. Is there a way I can represent the divide and conquer approach or the Fold approach as a Monad? I can see the associative law being satisfied, but I am not able to figure out whether a monoid is present here, and if so, what does it achieve for me?
4. Is divide and conquer the best approach available for this particular problem?
5. Which approach is better suited for Spark?
Any suggestions are welcome..
Here's how I would do it:
def keystrokes(input: String, start: Char): String =
((start + input) zip input).par.map((path _).tupled).fold("")(_ ++ _)
The main point here is using the par method to parallelize the sequence of (Char, Char), so that it can parallelize the map, and take the optimal implementation for fold.
The algorithm simply takes the characters in the String two by two (representing the units of path to be walked), computes the path between them, and then concatenates the results. Note that fold("")(_ ++ _) is basically mkString (although mkString on a parallel collection is implemented by seq.mkString, so it is much less efficient).
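The pairwise scheme can be sketched in plain Python. The question does not show its `path` function, so the one below is an assumption: a 5-column, row-first layout of A to Z plus a blank, with "U", "D", "L", "R" for the arrow keys and "*" for OK. Horizontal moves come first when moving down so that the cursor never crosses the empty cells of the short last row.

```python
LAYOUT = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # assumed layout: 27 keys, 5 per row
COLS = 5


def position(ch):
    """(row, column) of a key in the row-first grid."""
    return divmod(LAYOUT.index(ch), COLS)


def path(src, dst):
    """Hypothetical path function: arrow presses from src to dst, then OK."""
    (r1, c1), (r2, c2) = position(src), position(dst)
    vertical = ("D" if r2 > r1 else "U") * abs(r2 - r1)
    horizontal = ("R" if c2 > c1 else "L") * abs(c2 - c1)
    # Moving down: go sideways first, inside a full row.
    moves = horizontal + vertical if r2 > r1 else vertical + horizontal
    return moves + "*"


def keystrokes(text, start="A"):
    """Pair each character with its predecessor and join the unit paths."""
    previous = start + text
    return "".join(path(a, b) for a, b in zip(previous, text))


print(keystrokes("AB"))
```

Each unit path depends only on its own pair of characters, which is why the map step can run in any order or in parallel before the pieces are concatenated.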
What your implementations dearly miss is parallelization of tasks. Even in your divide-and-conquer approach, you never run code in parallel, so you will wait for the first half to be finished before starting the second half (even though they are totally independent).
Assuming you use parallelization, the classical implementation of fold on parallel sequences is precisely the divide-and-conquer algorithm you described, but it may be better optimized (for instance, it may choose a value other than 3 for the chunk size; I tend to trust the scala-collection implementers on these matters).
Note that fold on String is probably implemented with foldLeft, so it adds nothing over what you did with foldLeft, unless you use .par before.
Back to your questions (I'll mostly repeat what I just said):
1) Yes, the divide and conquer is better than fold... on String (but not on parallelized String)
2) Fold can only be better than FoldLeft with some kind of parallelization, in which case it will be as good as (or better than, if there is a better implementation for a particular parallelized collection) divide-and-conquer.
3) I don't see what monads have to do with anything here. The operator and zero for fold must indeed form a monoid (otherwise, you'll have problems with operation ordering if the operator is not associative, and unwanted noise if zero is not a neutral element).
4) Yes, as far as I know, once parallelized.
5) Spark is inherently parallel, so the main issue would be to join all the pieces back together in the end. What I mean is that an RDD is not ordered, so you'll need to keep some information on which piece of input should be put where in your cluster. Once you've done that correctly (using partitions and such, which would probably be a whole question in itself), using map and fold still works like a charm (Spark was designed to have an API as close as possible to scala-collection, so that's really nice here).
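To make the monoid remark in 3) concrete, here is a small Python sketch of a combine operation in the spirit of the question's PathAcc. The `step` function is a hypothetical stand-in for the real `path`, and `None` plays the role of the neutral element; both are assumptions for illustration only.

```python
from functools import reduce


def step(a, b):
    """Hypothetical stand-in for path(a, b)."""
    return f"{a}>{b} "


def combine(left, right):
    """Associative merge of (first_char, last_char, path) segments."""
    if left is None:   # None is the neutral element
        return right
    if right is None:
        return left
    left_first, left_last, left_path = left
    right_first, right_last, right_path = right
    bridge = step(left_last, right_first)
    return (left_first, right_last, left_path + bridge + right_path)


def fold(text):
    """Sequential fold over one-character segments."""
    segments = [(ch, ch, "") for ch in text]
    return reduce(combine, segments, None)


# Any grouping of the pieces gives the same answer.
print(combine(fold("AB"), fold("CD")) == fold("ABCD"))
```

Because the grouping does not matter, a parallel fold is free to split the input wherever it likes, which is the practical benefit of having a monoid here.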

Why does this tensorflow loop require so much memory?

I have a contrived version of a complicated network:
import tensorflow as tf
a = tf.ones([1000])
b = tf.ones([1000])
for i in range(int(1e6)):
    a = a * b
My intuition is that this should require very little memory. Just the space for the initial array allocation and a string of commands that utilizes the nodes and overwrites the memory stored in tensor 'a' at each step. But memory usage grows quite rapidly.
What is going on here, and how can I decrease memory usage when I compute a tensor and overwrite it a bunch of times?
Edit:
Thanks to Yaroslav's suggestions the solution turned out to be using a while_loop to minimize the number of nodes on the graph. This works great and is much faster, requires far less memory, and is all contained in-graph.
import tensorflow as tf
a = tf.ones([1000])
b = tf.ones([1000])
cond = lambda _i, _1, _2: tf.less(_i, int(1e6))
body = lambda _i, _a, _b: [tf.add(_i, 1), _a * _b, _b]
i = tf.constant(0)
output = tf.while_loop(cond, body, [i, a, b])
with tf.Session() as sess:
    result = sess.run(output)
    print(result)
Your a*b command translates to tf.mul(a, b), which is equivalent to tf.mul(a, b, g=tf.get_default_graph()). This command adds a Mul node to the current Graph object, so you are trying to add 1 million Mul nodes to the current graph. That's also problematic because you can't serialize a Graph object larger than 2GB, and some checks may fail once you are dealing with such a large graph.
I'd recommend reading Programming Models for Deep Learning by MXNet folks. TensorFlow is "symbolic" programming in their terminology, and you are treating it as imperative.
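The difference between the two styles can be illustrated with a toy model in plain Python. This is not how TensorFlow is implemented; `Graph` and `Node` are hypothetical classes that only mimic the idea that every `*` records a new node instead of computing a value.

```python
class Graph:
    """Toy symbolic graph: it only records operations."""

    def __init__(self):
        self.nodes = []

    def add(self, op, inputs):
        node = Node(self, op, inputs)
        self.nodes.append(node)
        return node


class Node:
    def __init__(self, graph, op, inputs):
        self.graph = graph
        self.op = op
        self.inputs = inputs

    def __mul__(self, other):
        # Nothing is multiplied here; a new "Mul" node is appended.
        return self.graph.add("Mul", (self, other))


graph = Graph()
a = graph.add("Const", ())
b = graph.add("Const", ())
for _ in range(1000):
    a = a * b  # rebinding the name does not free the old nodes

node_count = len(graph.nodes)
print(node_count)
```

In this model the graph ends up with one node per loop iteration plus the two constants, which mirrors why the original loop's memory grows with the iteration count.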
To get what you want using a Python loop, you could construct the multiplication op once and run it repeatedly, using feed_dict to feed the updates:
sess = tf.Session()
mul_op = a*b
result = sess.run(a)
for i in range(int(1e6)):
    result = sess.run(mul_op, feed_dict={a: result})
For more efficiency, you could use tf.Variable objects and var.assign to avoid Python<->TensorFlow data transfers.

Spark example program runs very slow

I tried to use Spark to work on a simple graph problem. I found an example program in the Spark source folder: transitive_closure.py, which computes the transitive closure of a graph with no more than 200 edges and vertices. But on my own laptop, it runs for more than 10 minutes and doesn't terminate. The command line I use is: spark-submit transitive_closure.py.
I wonder why Spark is so slow even when computing such a small transitive closure. Is this a common case? Is there any configuration I am missing?
The program is shown below, and can be found in spark install folder at their website.
from __future__ import print_function
import sys
from random import Random
from pyspark import SparkContext
numEdges = 200
numVertices = 100
rand = Random(42)
def generateGraph():
    edges = set()
    while len(edges) < numEdges:
        src = rand.randrange(0, numEdges)
        dst = rand.randrange(0, numEdges)
        if src != dst:
            edges.add((src, dst))
    return edges
if __name__ == "__main__":
    """
    Usage: transitive_closure [partitions]
    """
    sc = SparkContext(appName="PythonTransitiveClosure")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    tc = sc.parallelize(generateGraph(), partitions).cache()
    # Linear transitive closure: each round grows paths by one edge,
    # by joining the graph's edges with the already-discovered paths.
    # e.g. join the path (y, z) from the TC with the edge (x, y) from
    # the graph to obtain the path (x, z).
    # Because join() joins on keys, the edges are stored in reversed order.
    edges = tc.map(lambda x_y: (x_y[1], x_y[0]))
    oldCount = 0
    nextCount = tc.count()
    while True:
        oldCount = nextCount
        # Perform the join, obtaining an RDD of (y, (z, x)) pairs,
        # then project the result to obtain the new (x, z) paths.
        new_edges = tc.join(edges).map(lambda __a_b: (__a_b[1][1], __a_b[1][0]))
        tc = tc.union(new_edges).distinct().cache()
        nextCount = tc.count()
        if nextCount == oldCount:
            break
    print("TC has %i edges" % tc.count())
    sc.stop()
There can be many reasons why this code doesn't perform particularly well on your machine, but most likely this is just another variant of the problem described in Spark iteration time increasing exponentially when using join. The simplest way to check whether that is indeed the case is to provide the spark.default.parallelism parameter on submit:
bin/spark-submit --conf spark.default.parallelism=2 \
examples/src/main/python/transitive_closure.py
If not limited otherwise, SparkContext.union, RDD.join and RDD.union set the number of partitions of the child to the total number of partitions in the parents. Usually this is the desired behavior, but it can become extremely inefficient if applied iteratively.
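A rough back-of-the-envelope model shows how quickly this compounds in the loop above. The sketch takes the rule just stated at face value and additionally assumes that distinct() keeps the partition count of its input; `partition_growth` is a hypothetical helper, not a Spark function, and the real numbers depend on the Spark version and configuration.

```python
def partition_growth(initial, rounds):
    """Partition count of tc after each loop iteration (simplified model)."""
    tc = edges = initial
    history = [tc]
    for _ in range(rounds):
        joined = tc + edges  # join: total of the parents' partitions
        tc = tc + joined     # union; distinct() assumed to keep the count
        history.append(tc)
    return history


print(partition_growth(2, 4))
```

Under these assumptions the count more than doubles every round, so a handful of iterations over a 200-edge graph already leaves Spark scheduling far more tasks than there is data to process.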
The usage says the command line is
transitive_closure [partitions]
Setting the default parallelism will only help with the joins in each partition, not the initial distribution of work.
I'm going to argue that MORE partitions should be used. Setting the default parallelism may still help, but the code you posted sets the number explicitly (the argument passed, or 2 if none is given). The absolute minimum should be the number of cores available to Spark, otherwise you're always working at less than 100%.

matlab code optimization - clustering algorithm KFCG

Background
I have a large set of vectors (orientation data in an axis-angle representation... the axis is the vector) that I want to apply a clustering algorithm to. I tried kmeans but the computational time was too long (it never finished). So instead I am trying to implement the KFCG algorithm, which is faster (Kirke 2010):
Initially we have one cluster with the entire training vectors and the codevector C1 which is centroid. In the first iteration of the algorithm, the clusters are formed by comparing first element of training vector Xi with first element of code vector C1. The vector Xi is grouped into the cluster 1 if xi1< c11 otherwise vector Xi is grouped into cluster2 as shown in Figure 2(a) where codevector dimension space is 2. In second iteration, the cluster 1 is split into two by comparing second element Xi2 of vector Xi belonging to cluster 1 with that of the second element of the codevector. Cluster 2 is split into two by comparing the second element Xi2 of vector Xi belonging to cluster 2 with that of the second element of the codevector as shown in Figure 2(b). This procedure is repeated till the codebook size is reached to the size specified by user.
I'm unsure what ratio is appropriate for the codebook, but it shouldn't matter for the code optimization. Also note mine is 3-D so the same process is done for the 3rd dimension.
My code attempts
I've tried implementing the above algorithm in Matlab 2013 (Student Version). Here are some different structures I've tried, but they take way too long (I have never seen one complete):
%training vectors:
% Atgood is an Nx4 matrix (see the test data below if you want to test)
vecA = Atgood(:,1:3);
roA = size(vecA,1);
%Codebook size, Nsel, is ratio of data
remainFrac2=0.5;
Nseltemp = remainFrac2*roA; %codebook size
%Ensure selected size after nearest power of 2 is NOT greater than roA
if 2^round(log2(Nseltemp)) < roA
NselIter = round(log2(Nseltemp));
else
NselIter = ceil(log2(Nseltemp)-1);
end
Nsel = 2^NselIter; %power of 2 - for LGB and other algorithms
MAIN BLOCK TO OPTIMIZE:
%KFCG:
%%cluster = cell(1,Nsel); %Unsure #rows - Don't know how to initialize if need mean...
codevec(1,1:3) = mean(vecA,1);
count1=1;
count2=1;
ind=1;
for kk = 1:NselIter
hh2 = 1:2:size(codevec,1)*2;
for hh1 = 1:length(hh2)
hh=hh2(hh1);
% for ii = 1:roA
% if vecA(ii,ind) < codevec(hh1,ind)
% cluster{1,hh}(count1,1:4) = Atgood(ii,:); %want all 4 elements
% count1=count1+1;
% else
% cluster{1,hh+1}(count2,1:4) = Atgood(ii,:); %want all 4
% count2=count2+1;
% end
% end
%EDIT: My ATTEMPT at optimizing above for loop:
repcv=repmat(codevec(hh1,ind),[size(vecA,1),1]);
splitind = vecA(:,ind)>=repcv;
splitind2 = vecA(:,ind)<repcv;
cluster{1,hh}=vecA(splitind,:);
cluster{1,hh+1}=vecA(splitind2,:);
end
clear codevec
%Only mean the 1x3 vector portion of the cluster - for centroid
codevec = cell2mat((cellfun(@(x) mean(x(:,1:3),1),cluster,'UniformOutput',false))');
if ind < 3
ind = ind+1;
else
ind=1;
end
end
if length(codevec) ~= Nsel
warning('codevec ~= Nsel');
end
Alternatively, instead of cells I thought 3D Matrices would be faster? I tried but it was slower using my method of appending the next row each iteration (temp=[]; for...temp=[temp;new];)
Also, I wasn't sure what was best to loop with, for or while:
%If initialize cell to full length
while length(find(~cellfun('isempty',cluster))) < Nsel
Well, anyways, the first method was fastest for me.
Questions
Is the logic standard? Not in the sense that it matches with the algorithm described, but from a coding perspective, any weird methods I employed (especially with those multiple inner loops) that slows it down? Where can I speed up (you can just point me to resources or previous questions)?
My array size, Atgood, is 1,000,000x4 making NselIter=19; - do I just need to find a way to decrease this size or can the code be optimized?
Should this be asked on CodeReview? If so, I'll move it.
Testing Data
Here's some random vectors you can use to test:
for ii=1:1000 %My size is ~ 1,000,000
omega = 2*rand(3,1)-1;
omega = (omega/norm(omega))';
Atgood(ii,1:4) = [omega,57];
end
Your biggest issue is re-iterating through all of vecA FOR EACH CODEVECTOR, rather than just the vectors that are part of the corresponding cluster. You're supposed to split each cluster on its codevector. As it is, your cluster structure grows and grows, and each iteration processes more and more samples.
Your second issue is the loop around the comparisons, and the appending of samples to build up the clusters. Both of those can be solved by vectorizing the comparison operation. Oh, I just saw your edit, where this was optimized. Much better. But codevec(hh1,ind) is just a scalar, so you don't even need the repmat.
Try this version:
% (preallocs added in edit)
cluster = cell(1,Nsel);
codevec = zeros(Nsel, 3);
codevec(1,:) = mean(Atgood(:,1:3),1);
cluster{1} = Atgood;
nClusters = 1;
ind = 1;
while nClusters < Nsel
for c = 1:nClusters
lower_cluster_logical = cluster{c}(:,ind) < codevec(c,ind);
cluster{nClusters+c} = cluster{c}(~lower_cluster_logical,:);
cluster{c} = cluster{c}(lower_cluster_logical,:);
codevec(c,:) = mean(cluster{c}(:,1:3), 1);
codevec(nClusters+c,:) = mean(cluster{nClusters+c}(:,1:3), 1);
end
ind = rem(ind,3) + 1;
nClusters = nClusters*2;
end
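For readers who want to check the logic outside Matlab, the same per-cluster splitting can be sketched in plain Python. This is an illustration under assumptions, not a port of the code above: `kfcg` is a hypothetical name, clusters are plain lists, the split threshold is the cluster mean along the current dimension, and the two halves are stored next to each other rather than at index nClusters+c.

```python
def kfcg(vectors, codebook_size):
    """Split every cluster on its own centroid, cycling through x, y, z."""
    clusters = [list(vectors)]
    dim = 0
    while len(clusters) < codebook_size:
        next_clusters = []
        for cluster in clusters:
            if not cluster:
                # An empty cluster has no centroid; keep two empty halves.
                next_clusters += [[], []]
                continue
            # Only the members of this cluster are compared.
            threshold = sum(v[dim] for v in cluster) / len(cluster)
            lower = [v for v in cluster if v[dim] < threshold]
            upper = [v for v in cluster if v[dim] >= threshold]
            next_clusters += [lower, upper]
        clusters = next_clusters
        dim = (dim + 1) % 3
    return clusters


points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
print(kfcg(points, 4))
```

Every vector is touched once per pass, so the total work is roughly the number of vectors times the number of passes, instead of growing with the number of codevectors.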

Scala: fastest `remove(i: Int)` in mutable sequence

Which implementation from scala.collection.mutable package should I take if I intend to do lots of by-index-deletions, like remove(i: Int), in a single-threaded environment? The most obvious choice, ListBuffer, says that it may take linear time depending on buffer size. Is there some collection with log(n) or even constant time for this operation?
Removal operations, including buf remove i, are not part of Seq; they are part of the Buffer trait under scala.collection.mutable. (See Buffers)
See the first table on Performance Characteristics. I am guessing buf remove i has the same characteristics as insert, which is linear for both ArrayBuffer and ListBuffer.
As documented in Array Buffers, they use arrays internally, and List Buffers use linked lists (that's still O(n) for remove).
As an alternative, immutable Vector may give you an effective constant time.
Vectors are represented as trees with a high branching factor. Every tree node contains up to 32 elements of the vector or contains up to 32 other tree nodes. [...] So for all vectors of reasonable size, an element selection involves up to 5 primitive array selections. This is what we meant when we wrote that element access is "effectively constant time".
scala> import scala.collection.immutable._
import scala.collection.immutable._
scala> def remove[A](xs: Vector[A], i: Int) = (xs take i) ++ (xs drop (i + 1))
remove: [A](xs: scala.collection.immutable.Vector[A],i: Int)scala.collection.immutable.Vector[A]
scala> val foo = Vector(1, 2, 3, 4, 5)
foo: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3, 4, 5)
scala> remove(foo, 2)
res0: scala.collection.immutable.Vector[Int] = Vector(1, 2, 4, 5)
Note, however, that an operation with a high constant overhead may not beat a fast linear-time one until the data size is significantly large.
Depending on your exact use case, you may be able to use LinkedHashMap from scala.collection.mutable.
Although you cannot remove by index, you can remove by a unique key in constant time, and it maintains a deterministic ordering when you iterate.
scala> val foo = new scala.collection.mutable.LinkedHashMap[String,String]
foo: scala.collection.mutable.LinkedHashMap[String,String] = Map()
scala> foo += "A" -> "A"
res0: foo.type = Map((A,A))
scala> foo += "B" -> "B"
res1: foo.type = Map((A,A), (B,B))
scala> foo += "C" -> "C"
res2: foo.type = Map((A,A), (B,B), (C,C))
scala> foo -= "B"
res3: foo.type = Map((A,A), (C,C))
Java's ArrayList effectively has constant time complexity if the last element is the one to be removed. Look at the following snippet copied from its source code,
int numMoved = size - index - 1;
if (numMoved > 0)
System.arraycopy(elementData, index+1, elementData, index,
numMoved);
elementData[--size] = null; // clear to let GC do its work
As you can see, if numMoved is equal to 0, remove will not shift and copy the array at all. This can be quite useful in some scenarios. For example, if you do not care much about the ordering, you can remove an element by swapping it with the last element and then deleting the last element from the ArrayList, which effectively makes the remove operation constant time. I was hoping ArrayBuffer would do the same, but unfortunately that is not the case.
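The swap-with-last trick is independent of the collection library. A minimal sketch in Python, where `swap_remove` is a hypothetical helper name and a list stands in for the array-backed buffer:

```python
def swap_remove(items, index):
    """Remove items[index] in constant time by giving up element order."""
    # Move the doomed element to the end, then drop the end.
    items[index], items[-1] = items[-1], items[index]
    return items.pop()


values = [1, 2, 3, 4, 5]
removed = swap_remove(values, 1)
print(removed, values)
```

The price is that the former last element now sits at the removed position, so this only suits cases where the order of the remaining elements does not matter.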
