How does the Scala compiler handle unused variable values? - performance

Using Scala and Spark, I have the following construction:
val rdd1: RDD[String] = ...
val rdd2: RDD[(String, Any)] = ...
val rdd1pairs = rdd1.map(s => (s, s))
val result = rdd2.join(rdd1pairs)
.map { case (_: String, (e: Any, _)) => e }
The purpose of mapping rdd1 into a PairRDD is the join with rdd2 in the subsequent step. However, I am actually only interested in the values of rdd2, hence the mapping step in the last line which omits the keys. Actually, this is an intersection between rdd2 and rdd1 performed with Spark's join() for efficiency reasons.
My question refers to the keys of rdd1pairs: they are created for syntactical reasons only (to allow the join) in the first map step and are later discarded without any usage. How does the compiler handle this? Does it matter in terms of memory consumption whether I use the String s (as shown in the example)? Should I replace it by null or 0 to save a little memory? Does the compiler actually create and store these objects (references) or does it notice that they are never used?

In this case, I think it is what Spark does at runtime that influences the outcome rather than the compiler: whether or not Spark can optimise its execution pipeline to avoid creating the redundant duplication of s. I'm not sure, but I think Spark will materialise rdd1pairs in memory.
Instead of mapping to (String, String) you could use (String, Unit):
rdd1.map(s => (s,()))
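For reference, a minimal sketch of what the whole pipeline could look like with Unit keys (assuming rdd1 and rdd2 are defined as in the question, with the usual Spark pair-RDD imports in scope):
// Sketch only: rdd1: RDD[String], rdd2: RDD[(String, Any)] as in the question.
val rdd1pairs = rdd1.map(s => (s, ()))   // the Unit value carries no payload
val result = rdd2.join(rdd1pairs)        // RDD[(String, (Any, Unit))]
  .map { case (_, (e, _)) => e }         // keep only the values of rdd2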
What you're doing is basically a filter of rdd2 based on rdd1. If rdd1 is significantly smaller than rdd2, another method would be to represent the data of rdd1 as a broadcast variable rather than an RDD, and simply filter rdd2. This avoids any shuffling or reduce phase, so may be quicker, but will only work if the data of rdd1 is small enough to fit on each node.
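A minimal sketch of that broadcast-and-filter approach, assuming rdd1 is small enough to collect to the driver and sc is the SparkContext:
// Sketch only: build a broadcast set of keys and filter rdd2 locally on each executor.
val keys = sc.broadcast(rdd1.collect().toSet)
val result = rdd2
  .filter { case (k, _) => keys.value.contains(k) }  // no shuffle involved
  .map(_._2)                                         // keep only the values, as in the question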
EDIT:
To see how using Unit rather than String saves space, consider the following two examples:
object size extends App {
  (1 to 1000000).map(i => ("foo" + i, ()))
  val input = readLine("prompt> ")
}
and
object size extends App {
  (1 to 1000000).map(i => ("foo" + i, "foo" + i))
  val input = readLine("prompt> ")
}
Using the jstat command, as described in the question "How to check heap usage of a running JVM from the command line?", the first version uses significantly less heap than the second.
Edit 2:
Unit is effectively a singleton object with no contents, so logically, it should not require any serialization. The fact that the type definition contains Unit tells you all you need to be able to deserialize a structure which has a field of type Unit.
Spark uses Java Serialization by default. Consider the following:
object Main extends App {
  import java.io.{ObjectOutputStream, FileOutputStream}

  case class Foo(a: String, b: String)
  case class Bar(a: String, b: String, c: Unit)

  val str = "abcdef"
  val foo = Foo("abcdef", "xyz")
  val bar = Bar("abcdef", "xyz", ())

  val fos = new FileOutputStream("foo.obj")
  val fo = new ObjectOutputStream(fos)
  val bos = new FileOutputStream("bar.obj")
  val bo = new ObjectOutputStream(bos)

  fo writeObject foo
  bo writeObject bar

  // close the streams so the serialized bytes are actually flushed to disk
  fo.close()
  bo.close()
}
The two files turn out to be identical in size and content:
�� sr Main$Foo3�,�z \ L at Ljava/lang/String;L bq ~ xpt abcdeft xyz
and
�� sr Main$Bar+a!N��b L at Ljava/lang/String;L bq ~ xpt abcdeft xyz
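If you would rather confirm this programmatically than by eyeballing the raw bytes, a quick check of the on-disk sizes (assuming the two files above were written and flushed) could look like:
import java.io.File

// Hypothetical check: compare the sizes of the two serialized files.
val fooSize = new File("foo.obj").length
val barSize = new File("bar.obj").length
println(s"foo.obj: $fooSize bytes, bar.obj: $barSize bytes, equal: ${fooSize == barSize}")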

Related

Performance Difference Using Update Operation on a Mutable Map in Scala with a Large Size Data

I would like to know if an update operation on a mutable map is better in performance than reassignment.
Let's assume I have the following Map:
val m = Map(1 -> Set("apple", "banana"),
            2 -> Set("banana", "cabbage"),
            3 -> Set("cabbage", "dumplings"))
which I would like to reverse into this map:
Map("apple" -> Set(1),
"banana" -> Set(1, 2),
"cabbage" -> Set(2, 3),
"dumplings" -> Set(3))
The code to do so is:
def reverse(m: Map[Int, Set[String]]) = {
  var rm = Map[String, Set[Int]]()
  m.keySet foreach { k =>
    m(k) foreach { e =>
      rm = rm + (e -> (rm.getOrElse(e, Set()) + k))
    }
  }
  rm
}
Would it be more efficient to use the update operator on a map if it is very large in size?
The code using the update on map is as follows:
def reverse(m: Map[Int, Set[String]]) = {
  var rm = scala.collection.mutable.Map[String, Set[Int]]()
  m.keySet foreach { k =>
    m(k) foreach { e =>
      rm.update(e, rm.getOrElse(e, Set()) + k)
    }
  }
  rm
}
I ran some tests using Rex Kerr's Thyme utility.
First I created some test data.
val rndm = new util.Random
val dna = Seq('A', 'C', 'G', 'T')
val m = (1 to 4000).map(_ -> Set(rndm.shuffle(dna).mkString,
                                 rndm.shuffle(dna).mkString)).toMap
Then I timed some runs with both the immutable.Map and mutable.Map versions. Here's an example result:
Time: 2.417 ms 95% CI 2.337 ms - 2.498 ms (n=19) // immutable
Time: 1.618 ms 95% CI 1.579 ms - 1.657 ms (n=19) // mutable
Time: 2.278 ms 95% CI 2.238 ms - 2.319 ms (n=19) // functional version
As you can see, using a mutable Map with update() has a significant performance advantage.
Just for fun I also compared these results with a more functional version of a Map reverse (or what I call a Map inverter). No var or any mutable type involved.
m.flatten{case(k, vs) => vs.map((_, k))}
.groupBy(_._1)
.mapValues(_.map(_._2).toSet)
This version consistently beat your immutable version but still doesn't come close to the mutable timings.
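In case the explicit argument to flatten reads oddly, a hedged equivalent that goes through a Seq first (so duplicate keys are not collapsed by a Map builder) would be:
// Same inversion, spelled with flatMap over a Seq of pairs.
m.toSeq
  .flatMap { case (k, vs) => vs.map((_, k)) }
  .groupBy(_._1)
  .mapValues(_.map(_._2).toSet)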
The trade-off between mutable and immutable collections usually comes down to this:
immutable collections are safer to share and allow structural sharing
mutable collections have better performance
Some time ago I did a comparison of the performance of mutable and immutable Maps in Scala, and the difference was about 2 to 3 times in favor of the mutable ones.
So, when performance is not critical I usually go with immutable collections for safety and readability.
For example, in your case the functional "Scala way" of performing this transformation would be something like this:
m.view
.flatMap(x => x._2.map(_ -> x._1)) // flatten map to lazy view of String->Int pairs
.groupBy(_._1) // group pairs by String part
.mapValues(_.map(_._2).toSet) // extract all Int parts into Set
Although I used a lazy view to avoid creating intermediate collections, groupBy still creates a mutable map internally (you may want to check its sources; the logic is quite similar to what you have written), which in turn gets converted to an immutable Map that is then wrapped by mapValues.
Now, if you want to squeeze out every bit of performance, you want to use mutable collections and do as few updates of immutable collections as possible.
In your case this means having a Map of mutable Sets as your intermediate buffer:
import scala.collection.mutable

def transform(m: Map[Int, Set[String]]): Map[String, Set[Int]] = {
  val accum: Map[String, mutable.Set[Int]] =
    m.valuesIterator.flatten.map(_ -> mutable.Set[Int]()).toMap
  for ((k, vals) <- m; v <- vals) {
    accum(v) += k
  }
  accum.mapValues(_.toSet)
}
Note that I'm not updating the accum map itself once it's created: I'm doing exactly one map lookup and one set update for each value, while in both your examples there was an additional map update.
I believe this code is reasonably optimal performance-wise. I didn't perform any tests myself, but I highly encourage you to do that on your real data and post the results here.
Also, if you want to go even further, you might want to try a mutable BitSet instead of Set[Int]. If the ints in your data are fairly small it might yield a minor performance increase.
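A minimal sketch of that BitSet variant, assuming the Int keys are non-negative and small enough for a bit set to pay off (transformBitSet is a hypothetical name, not from the original answer):
import scala.collection.mutable

// Sketch only: same shape as transform above, accumulating into mutable.BitSet.
def transformBitSet(m: Map[Int, Set[String]]): Map[String, Set[Int]] = {
  val accum: Map[String, mutable.BitSet] =
    m.valuesIterator.flatten.map(_ -> mutable.BitSet()).toMap
  for ((k, vals) <- m; v <- vals) {
    accum(v) += k
  }
  accum.mapValues(_.toSet)  // BitSet.toSet yields a plain immutable Set[Int]
}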
Just using @Aivean's method in a functional way:
def transform(mp: Map[Int, Set[String]]) = {
  val accum = mp.values.flatten.toSet
    .map((_ -> scala.collection.mutable.Set[Int]())).toMap
  mp.map { case (k, vals) => vals.map(v => accum(v) += k) }
  accum.mapValues(_.toSet)
}

Conversion of mutable collections to immutable introduces performance penalty

I've encountered a strange behavior regarding conversion of mutable collections to immutable ones, which might significantly affect performance.
Let's take a look at the following code:
val map: Map[String, Set[Int]] = createMap()
while (true) {
  map.get("existing-key")
}
It simply creates a map once, and then repeatedly accesses one of its entries, which contains a set as a value. It may create the map in several ways:
With immutable collections:
def createMap() = keys.map(key => key -> (1 to amount).toSet).toMap
Or with mutable collections (note the two conversion options at the end):
def createMap() = {
  val map = mutable.Map[String, mutable.Set[Int]]()
  for (key <- keys) {
    val set = map.getOrElseUpdate(key, mutable.Set())
    for (i <- 1 to amount) {
      set.add(i)
    }
  }
  map.toMap.mapValues(_.toSet) // option #1
  map.mapValues(_.toSet).toMap // option #2
}
Curiously enough, the mutable option #1 creates a map that invokes toSet on its value every time get is invoked (if the entry exists), which may introduce a significant performance hit (depending on the use case).
Why is this happening? How can this be avoided?
mapValues simply returns a map view which maps every key of this map to f(this(key)). The resulting map wraps the original map without copying any elements.
Looking at the implementation, mapValues returns an instance of MappedValues which overrides the get function:
def get(key: K) = self.get(key).map(f)
If you want to force the materialization of the map, call toMap after the mapValues call. Just like you did in #2!
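A tiny sketch of the difference, assuming pre-2.13 collections (where mapValues returns a lazy view, as described above); base is a hypothetical stand-in for the question's mutable map:
import scala.collection.mutable

// Sketch only: the println shows how often the mapping function runs.
val base = mutable.Map("a" -> mutable.Set(1, 2, 3))

// A view over the mutable map: the function runs again on every successful get.
val lazyMap = base.mapValues { s => println("toSet runs"); s.toSet }
// toMap copies the view into a real immutable Map, running the function once per entry.
val forced = base.mapValues { s => println("toSet runs"); s.toSet }.toMap

lazyMap.get("a")
lazyMap.get("a")   // "toSet runs" is printed on both calls
forced.get("a")
forced.get("a")    // nothing printed here; the function already ran during .toMap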

Scala regex splitting on InputStream

I'm parsing a resource file and splitting on empty lines, using the following code:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val fooString = source.mkString
val fooParsedSections = fooString.split("\\r\\n[\\f\\t ]*\\r\\n")
I believe this is pulling the input stream into memory as a full string, and then splitting on the regex. This works fine for the relatively small file I'm parsing, but it's not ideal and I'm curious how I could improve it--
Two ideas are:
read the input stream line-by-line and have a buffer of segments that I build up, splitting on empty lines
read the stream character-by-character and parse segments based off of a small finite state machine
However, I'd love to not maintain a mutable buffer if possible.
Any suggestions? This is just for a personal fun project, and I want to learn how to do this in an efficient and functional manner.
You can use the Stream.span method to get the prefix before the empty line, then repeat. Here's a helper function for that:
def sections(lines: Stream[String]): Stream[String] = {
  if (lines.isEmpty) Stream.empty
  else {
    // cutting off the longest `prefix` before an empty line
    val (prefix, suffix) = lines.span { _.trim.nonEmpty }
    // dropping any empty lines (there may be several)
    val rest = suffix.dropWhile { _.trim.isEmpty }
    // grouping back the prefix lines and calling recursion
    prefix.mkString("\n") #:: sections(rest)
  }
}
Note that Stream's #:: method is lazy and doesn't evaluate the right operand until it's needed. Here is how you can apply it to your use case:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val parsedSections = sections(source.getLines.toStream)
The Source.getLines method returns an Iterator[String], which we convert to a Stream and pass to the helper function. You can also call .toIterator at the end if you process the groups of lines on the way and don't need to store them. See the Stream docs for details.
EDIT
If you still want to use a regex, you can replace .trim.nonEmpty in the function above with a call to String's matches method.
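For instance, a hedged variant of the helper using matches, mirroring the question's blank-line pattern of spaces, tabs and form feeds (sectionsRegex is a hypothetical name):
// Sketch only: a line counts as a separator when it matches the blank-line pattern.
def sectionsRegex(lines: Stream[String]): Stream[String] = {
  val blank = "[\\f\\t ]*"
  if (lines.isEmpty) Stream.empty
  else {
    val (prefix, suffix) = lines.span(line => !line.matches(blank))
    val rest = suffix.dropWhile(_.matches(blank))
    prefix.mkString("\n") #:: sectionsRegex(rest)
  }
}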

Huge memory consumption in Map Task in Spark

I have a lot of files that contain roughly 60.000.000 lines. All of my files are formatted in the format {timestamp}#{producer}#{messageId}#{data_bytes}\n
I walk through my files one by one and also want to build one output file per input file.
Because some of the lines depend on previous lines, I grouped them by their producer. Whenever a line depends on one or more previous lines, their producer is always the same.
After grouping up all of the lines, I give them to my Java parser.
The parser then will contain all parsed data objects in memory and output it as JSON afterwards.
To visualize how I think my job is processed, I threw together the following "flow graph". Note that I did not visualize the groupByKey shuffling process.
My problems:
I expected Spark to split up the files, process the splits with separate tasks and save each task output to a "part"-file.
However, my tasks run out of memory and get killed by YARN before they can finish: Container killed by YARN for exceeding memory limits. 7.6 GB of 7.5 GB physical memory used
My Parser is throwing all parsed data objects into memory. I can't change the code of the Parser.
Please note that my code works for smaller files (for example two files with 600.000 lines each as the input to my Job)
My questions:
How can I make sure that Spark will create a result for every file-split in my map task? (Maybe they will if my tasks succeed but I will never see the output as of now.)
I thought that my map transformation val lineMap = lines.map ... (see Scala code below) produces a partitioned rdd. Thus I expect the values of the rdd to be split in some way before calling my second map task.
Furthermore, I thought that calling saveAsTextFile on this rdd lineMap will produce an output task that runs after each of my map tasks has finished. If my assumptions are correct, why do my executors still run out of memory? Is Spark doing several (too) big file splits and processing them concurrently, which leads to the Parser filling up the memory?
Is repartitioning the lineMap rdd to get more (smaller) inputs for my Parser a good idea?
Is there somewhere an additional reducer step which I am not aware of? Like results being aggregated before getting written to file or similar?
Scala code (I left out irrelevant code parts):
def main(args: Array[String]) {
  val inputFilePath = args(0)
  val outputFilePath = args(1)
  val inputFiles = fs.listStatus(new Path(inputFilePath))
  inputFiles.foreach( filename => {
    processData(filename.getPath, ...)
  })
}

def processData(filePath: Path, ...) {
  val lines = sc.textFile(filePath.toString())
  val lineMap = lines.map(line => (line.split(" ")(1), line)).groupByKey()
  val parsedLines = lineMap.map { case (key, values) => parseLinesByKey(key, values, config) }
  // each output should be saved separately
  parsedLines.saveAsTextFile(outputFilePath.toString() + "/" + filePath.getName)
}

def parseLinesByKey(key: String, values: Iterable[String], config: Config) = {
  val importer = new LogFileImporter(...)
  importer.parseData(values.toIterator.asJava, ...)
  // importer now contains all parsed data objects in memory that could be parsed
  // from the given values.
  val jsonMapper = getJsonMapper(...)
  val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)
  (key, jsonStringData)
}
I fixed this by removing the groupByKey call and implementing a new FileInputFormat as well as a RecordReader to remove my limitation that lines depend on other lines. For now, I implemented it so that each split contains a 50.000 byte overlap from the previous split. This ensures that all lines that depend on previous lines can be parsed correctly.
I will now go ahead and still look through the last 50.000 bytes of the previous split, but only copy over lines that actually affect the parsing of the current split. This minimizes the overhead while keeping the task highly parallelizable.
The following links pointed me in the right direction. Because the topic of FileInputFormat/RecordReader is quite complicated at first sight (it was for me at least), it is good to read through these articles and understand whether this is suitable for your problem or not:
https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/
http://www.ae.be/blog-en/ingesting-data-spark-using-custom-hadoop-fileinputformat/
Relevant code parts from the ae.be article, just in case the website goes down. The author (@Gurdt) uses this to detect whether a chat message contains an escaped line break (the line ends with "\") and appends the escaped lines together until an unescaped \n is found. This allows him to retrieve messages that span two or more lines. The code is written in Scala:
Usage
val conf = new Configuration(sparkContext.hadoopConfiguration)
val rdd = sparkContext.newAPIHadoopFile("data.txt", classOf[MyFileInputFormat],
  classOf[LongWritable], classOf[Text], conf)
FileInputFormat
class MyFileInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext):
      RecordReader[LongWritable, Text] = new MyRecordReader()
}
RecordReader
class MyRecordReader() extends RecordReader[LongWritable, Text] {
  var start, end, pos = 0L
  var reader: LineReader = null
  var key = new LongWritable
  var value = new Text

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    // split position in data (start one byte earlier to detect if
    // the split starts in the middle of a previous record)
    val split = inputSplit.asInstanceOf[FileSplit]
    start = 0L.max(split.getStart - 1)
    end = start + split.getLength

    // open a stream to the data, pointing to the start of the split
    val stream = split.getPath.getFileSystem(context.getConfiguration)
      .open(split.getPath)
    stream.seek(start)
    reader = new LineReader(stream, context.getConfiguration)

    // if the split starts at a newline, we want to start yet another byte
    // earlier to check if the newline was escaped or not
    val firstByte = stream.readByte().toInt
    if (firstByte == '\n')
      start = 0L.max(start - 1)
    stream.seek(start)

    if (start != 0)
      skipRemainderFromPreviousSplit(reader)
  }

  def skipRemainderFromPreviousSplit(reader: LineReader): Unit = {
    var readAnotherLine = true
    while (readAnotherLine) {
      // read next line
      val buffer = new Text()
      start += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
      pos = start

      // detect if delimiter was escaped
      readAnotherLine = buffer.getLength >= 1 &&               // something was read
        buffer.charAt(buffer.getLength - 1) == '\\' &&         // newline was escaped
        pos <= end                                             // seek head hasn't passed the split
    }
  }

  override def nextKeyValue(): Boolean = {
    key.set(pos)

    // read newlines until an unescaped newline is read
    var lastNewlineWasEscaped = false
    while (pos < end || lastNewlineWasEscaped) {
      // read next line
      val buffer = new Text
      pos += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)

      // append newly read data to previous data if necessary
      value = if (lastNewlineWasEscaped) new Text(value + "\n" + buffer) else buffer

      // detect if delimiter was escaped
      lastNewlineWasEscaped = buffer.charAt(buffer.getLength - 1) == '\\'

      // let Spark know that a key-value pair is ready!
      if (!lastNewlineWasEscaped)
        return true
    }

    // end of split reached?
    return false
  }
}
Note: You might need to implement getCurrentKey, getCurrentValue, close and getProgress in your RecordReader as well.
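For completeness, a minimal sketch of those remaining overrides as they could sit inside the MyRecordReader class above; the progress formula is a plain guess based on the byte positions and not part of the original article:
// Sketch only: straightforward implementations of the remaining RecordReader methods.
override def getCurrentKey(): LongWritable = key

override def getCurrentValue(): Text = value

override def getProgress(): Float =
  if (end == start) 1.0f
  else ((pos - start).toFloat / (end - start)).min(1.0f)

override def close(): Unit = {
  if (reader != null) reader.close()
}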

Scala collection transformation performance: single looping vs. multiple looping

When there is a collection and you must perform two or more operations on all of its elements, which is faster:
val f1: String => String = _.reverse
val f2: String => String = _.toUpperCase
val elements: Seq[String] = List("a", "b", "c")
iterate multiple times and perform one operation per loop
val result = elements.map(f1).map(f2)
This approach has the advantage that the result after applying the first function could be reused.
iterate once and perform all operations on each element together
val result = elements.map(element => f2(f1(element)))
or
val result = elements.map(f1.andThen(f2))
Is there any difference in performance between these two approaches? And if yes, which is faster?
Here's the thing: transforming a collection is roughly O(N) times the cost of the functions applied, so I doubt the second set of choices you present above makes even the slightest difference in runtime. The first option you list is a different story: each map call creates a new intermediate collection, and that overhead can be avoided. That's where "view" collections come in (see this good example I spotted):
In Scala, what does "view" do?
If you had to apply several mapping operations you might do this:
val result = elements.view.map(f1).map(f2).force
(force at the end causes all functions to evaluate)
The second set of examples above might be a tiny bit faster, but the "view" option could make your code more readable if you have many of these, or complex anonymous functions, used in the mapping.
Composing functions to produce a single-pass transformation will probably gain you some performance, but can quickly become unreadable. Consider using views as an alternative. While this will create intermediate collections:
val result = elements.map(f1).map(f2)
this will perform lazy evaluation and compose the functions the same way you do:
val result = elements.view.map(f1).map(f2)
Notice that the result type will be SeqView, so you might want to convert it to a List later with toList.
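For example, a small sketch of materialising the view, reusing f1, f2 and elements from the question:
// Sketch only: nothing is computed until the view is materialised.
val view = elements.view.map(f1).map(f2)      // no intermediate collections are built
val materialized: List[String] = view.toList  // both functions applied element by element here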
