Scala regex splitting on InputStream - performance

I'm parsing a resource file and splitting on empty lines, using the following code:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val fooString = source.mkString
val fooParsedSections = fooString.split("\\r\\n[\\f\\t ]*\\r\\n")
I believe this pulls the entire input stream into memory as a single string and then splits on the regex. This works fine for the relatively small file I'm parsing, but it's not ideal, and I'm curious how I could improve it.
Two ideas are:
read the input stream line-by-line and have a buffer of segments that I build up, splitting on empty lines
read the stream character-by-character and parse segments based off of a small finite state machine
However, I'd love to not maintain a mutable buffer if possible.
Any suggestions? This is just for a personal fun project, and I want to learn how to do this in an efficient and functional manner.

You can use the Stream.span method to get the prefix before the empty line, then repeat. Here's a helper function for that:
def sections(lines: Stream[String]): Stream[String] = {
  if (lines.isEmpty) Stream.empty
  else {
    // cutting off the longest `prefix` before an empty line
    val (prefix, suffix) = lines.span { _.trim.nonEmpty }
    // dropping any empty lines (there may be several)
    val rest = suffix.dropWhile { _.trim.isEmpty }
    // grouping back the prefix lines and calling recursion
    prefix.mkString("\n") #:: sections(rest)
  }
}
Note that Stream's #:: method is lazy and doesn't evaluate the right operand until it's needed. Here is how you can apply it to your use case:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val parsedSections = sections(source.getLines.toStream)
The Source.getLines method returns an Iterator[String], which we convert to a Stream and then pass to the helper function. You can also call .toIterator at the end if you process the groups of lines as you go and don't need to store them. See the Stream docs for details.
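For illustration, here's how the helper behaves on a small in-memory stream (a made-up input, just to show the grouping; the result is still built lazily):
val demo = Stream("a", "b", "", "c", "", "", "d")
sections(demo).toList  // List("a\nb", "c", "d")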
EDIT
If you still want to use a regex, you can replace .trim.nonEmpty in the function above with a call to the String matches method.
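For example, here's a sketch of how the two lines inside the helper could look, reusing the whitespace class from your original regex (this assumes a "blank" line may contain form feeds, tabs and spaces):
val blank = "[\\f\\t ]*"
val (prefix, suffix) = lines.span { line => !line.matches(blank) }
val rest = suffix.dropWhile { _.matches(blank) }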

Related

Kotlin map not working with List of String

I have been working on code where I have to generate all possible ways to construct the target string. I am using the below-mentioned code.
Print Statement:
println("---------- How Construct -------")
println("${
window.howConstruct("purple", listOf(
"purp",
"p",
"ur",
"le",
"purpl"
))
}")
Function Call:
fun howConstruct(
    target: String,
    wordBank: List<String>,
): List<List<String>> {
    if (target.isEmpty()) return emptyList()
    var result = emptyList<List<String>>()
    for (word in wordBank) {
        if (target.indexOf(word) == 0) { // Starting with prefix
            val substring = target.substring(word.length)
            val suffixWays = howConstruct(substring, wordBank)
            val targetWays = suffixWays.map { way ->
                val a = way.toMutableList().apply {
                    add(word)
                }
                a.toList()
            }
            result = targetWays
        }
    }
    return result
}
Expected Output:
[['purp','le'],['p','ur','p','le']]
Current Output:
[]
Your code is almost working; only a couple of small changes are needed to get the required output:
If the target is empty, return listOf(emptyList()) instead of emptyList().
Use add(0, word) instead of add(word).
The first of those changes is the important one. Your function returns a list of matches; and since each match is itself a list of strings, it returns a list of lists of strings. Once your code has matched the entire target and calls itself one last time, it returns an empty list, i.e. no matches, instead of a list containing an empty list, meaning one match with no remaining strings.
The second change simply fixes the order of strings within each match, which was reversed (because it appended the prefix after the returned suffix match).
However, there are many other ways that code could be improved. Rather than list them all individually, it's probably easier to give an alternative version:
fun howConstruct(target: String, wordBank: List<String>): List<List<String>> =
    if (target == "") listOf(emptyList())
    else wordBank.filter { target.endsWith(it) }  // Look for suffixes of the target in the word bank
        .flatMap { suffix: String ->
            howConstruct(target.removeSuffix(suffix), wordBank)  // For each, recurse to search the rest
                .map { it + suffix }                             // And append the suffix to each match.
        }
That does almost exactly the same as your code, except that it searches from the end of the string — matching suffixes — instead of from the beginning. The result is the same; the main benefit is that it's simpler to append a suffix string to a partial match list (using +) than to prepend a prefix (which is quite messy, as you found).
However, it's a lot more concise, mainly because it uses a functional style — in particular, it uses filter() to determine which words are valid suffixes, and flatMap() to collate the list of matches corresponding to each one recursively, as well as map() to append the suffix to each one (like your code does). That avoids all the business of looping over lists, creating lists, and adding to them. As a result, it doesn't need to deal with mutable lists or variables, avoiding some sources of confusion and error.
I've written it as an expression body (with = instead of { … }) for simplicity. I find that's simpler and clearer for short functions, though this one is about the limit. It might also fit as an extension function on String, since it's effectively returning a transformation of the string without any side effects; again, that tends to work best on short functions.
There are also several small tweaks. It's a bit simpler — and more efficient — to use startsWith() or endsWith() instead of indexOf(); removePrefix() or removeSuffix() is arguably slightly clearer than substring(); and I find == "" clearer than isEmpty().
(Also, the name howConstruct() doesn't really describe the result very well, but I haven't come up with anything better so far…)
Many of these changes are of course a matter of personal preference, and I'm sure other developers would write it in many other ways! But I hope this has given some ideas.

Huge memory consumption in Map Task in Spark

I have a lot of files that contain roughly 60,000,000 lines. All of my files use the format {timestamp}#{producer}#{messageId}#{data_bytes}\n
I walk through my files one by one and also want to build one output file per input file.
Because some of the lines depend on previous lines, I grouped them by their producer. Whenever a line depends on one or more previous lines, their producer is always the same.
After grouping up all of the lines, I give them to my Java parser.
The parser then will contain all parsed data objects in memory and output it as JSON afterwards.
To visualize how I think my Job is processed, I threw together the following "flow graph". Note that I did not visualize the groupByKey shuffling process.
My problems:
I expected Spark to split up the files, process the splits with separate tasks and save each task output to a "part"-file.
However, my tasks run out of memory and get killed by YARN before they can finish: Container killed by YARN for exceeding memory limits. 7.6 GB of 7.5 GB physical memory used
My Parser is throwing all parsed data objects into memory. I can't change the code of the Parser.
Please note that my code works for smaller files (for example two files with 600.000 lines each as the input to my Job)
My questions:
How can I make sure that Spark will create a result for every file-split in my map task? (Maybe they will if my tasks succeed but I will never see the output as of now.)
I thought that my map transformation val lineMap = lines.map ... (see Scala code below) produces a partitioned rdd. Thus I expect the values of the rdd to be split in some way before calling my second map task.
Furthermore, I thought that calling saveAsTextFile on this rdd lineMap would produce an output task that runs after each of my map tasks has finished. If my assumptions are correct, why do my executors still run out of memory? Is Spark processing several (too) big file splits concurrently, which leads to the Parser filling up the memory?
Is repartitioning the lineMap rdd to get more (smaller) inputs for my Parser a good idea?
Is there somewhere an additional reducer step which I am not aware of? Like results being aggregated before getting written to file or similar?
Scala code (I left out irrelevant code parts):
def main(args: Array[String]) {
  val inputFilePath = args(0)
  val outputFilePath = args(1)

  val inputFiles = fs.listStatus(new Path(inputFilePath))
  inputFiles.foreach( filename => {
    processData(filename.getPath, ...)
  })
}

def processData(filePath: Path, ...) {
  val lines = sc.textFile(filePath.toString())
  val lineMap = lines.map(line => (line.split(" ")(1), line)).groupByKey()
  val parsedLines = lineMap.map{ case(key, values) => parseLinesByKey(key, values, config) }
  //each output should be saved separately
  parsedLines.saveAsTextFile(outputFilePath.toString() + "/" + filePath.getName)
}

def parseLinesByKey(key: String, values: Iterable[String], config: Config) = {
  val importer = new LogFileImporter(...)
  importer.parseData(values.toIterator.asJava, ...)
  //importer from now contains all parsed data objects in memory that could be parsed
  //from the given values.
  val jsonMapper = getJsonMapper(...)
  val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)
  (key, jsonStringData)
}
I fixed this by removing the groupByKey call and implementing a new FileInputFormat as well as a RecordReader to remove my limitation that lines depend on other lines. For now, I implemented it so that each split contains a 50,000 byte overhead from the previous split. This ensures that all lines that depend on previous lines can be parsed correctly.
I will now go ahead and still look through the last 50,000 bytes of the previous split, but only copy over lines that actually affect the parsing of the current split. Thus, I minimize the overhead and still get a highly parallelizable task.
The following links pointed me in the right direction. Because the topic of FileInputFormat/RecordReader is quite complicated at first sight (it was for me, at least), it is good to read through these articles and understand whether this is suitable for your problem or not:
https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/
http://www.ae.be/blog-en/ingesting-data-spark-using-custom-hadoop-fileinputformat/
Relevant code parts from the ae.be article, just in case the website goes down. The author (@Gurdt) uses this to detect whether a chat message contains an escaped line return (the line ends with "\") and appends the escaped lines together until an unescaped \n is found. This allows him to retrieve messages that span two or more lines. The code is written in Scala:
Usage
val conf = new Configuration(sparkContext.hadoopConfiguration)
val rdd = sparkContext.newAPIHadoopFile("data.txt", classOf[MyFileInputFormat],
classOf[LongWritable], classOf[Text], conf)
FileInputFormat
class MyFileInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext):
      RecordReader[LongWritable, Text] = new MyRecordReader()
}
RecordReader
class MyRecordReader() extends RecordReader[LongWritable, Text] {
  var start, end, pos = 0L
  var reader: LineReader = null
  var key = new LongWritable
  var value = new Text

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    // split position in data (start one byte earlier to detect if
    // the split starts in the middle of a previous record)
    val split = inputSplit.asInstanceOf[FileSplit]
    start = 0.max(split.getStart - 1)
    end = start + split.getLength

    // open a stream to the data, pointing to the start of the split
    val stream = split.getPath.getFileSystem(context.getConfiguration)
      .open(split.getPath)
    stream.seek(start)
    reader = new LineReader(stream, context.getConfiguration)

    // if the split starts at a newline, we want to start yet another byte
    // earlier to check if the newline was escaped or not
    val firstByte = stream.readByte().toInt
    if (firstByte == '\n')
      start = 0.max(start - 1)
    stream.seek(start)

    if (start != 0)
      skipRemainderFromPreviousSplit(reader)
  }

  def skipRemainderFromPreviousSplit(reader: LineReader): Unit = {
    var readAnotherLine = true
    while (readAnotherLine) {
      // read next line
      val buffer = new Text()
      start += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
      pos = start

      // detect if delimiter was escaped
      readAnotherLine = buffer.getLength >= 1 &&                 // something was read
        buffer.charAt(buffer.getLength - 1) == '\\' &&           // newline was escaped
        pos <= end                                               // seek head hasn't passed the split
    }
  }

  override def nextKeyValue(): Boolean = {
    key.set(pos)

    // read newlines until an unescaped newline is read
    var lastNewlineWasEscaped = false
    while (pos < end || lastNewlineWasEscaped) {
      // read next line
      val buffer = new Text
      pos += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)

      // append newly read data to previous data if necessary
      value = if (lastNewlineWasEscaped) new Text(value + "\n" + buffer) else buffer

      // detect if delimiter was escaped
      lastNewlineWasEscaped = buffer.charAt(buffer.getLength - 1) == '\\'

      // let Spark know that a key-value pair is ready!
      if (!lastNewlineWasEscaped)
        return true
    }

    // end of split reached?
    return false
  }
}
Note: You might need to implement getCurrentKey, getCurrentValue, close and getProgress in your RecordReader as well.
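For completeness, here is a minimal sketch of those remaining overrides, assuming the fields from the class above (key, value, start, end, pos, reader):
override def getCurrentKey(): LongWritable = key
override def getCurrentValue(): Text = value
override def getProgress(): Float =
  if (end == start) 0.0f else math.min(1.0f, (pos - start).toFloat / (end - start))
override def close(): Unit = if (reader != null) reader.close()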

How does Scala compiler handle unused variable values?

Using Scala and Spark, I have the following construction:
val rdd1: RDD[String] = ...
val rdd2: RDD[(String, Any)] = ...

val rdd1pairs = rdd1.map(s => (s, s))
val result = rdd2.join(rdd1pairs)
                 .map { case (_: String, (e: Any, _)) => e }
The purpose of mapping rdd1 into a PairRDD is the join with rdd2 in the subsequent step. However, I am actually only interested in the values of rdd2, hence the mapping step in the last line which omits the keys. Actually, this is an intersection between rdd2 and rdd1 performed with Spark's join() for efficiency reasons.
My question refers to the keys of rdd1pairs: they are created for syntactical reasons only (to allow the join) in the first map step and are later discarded without any usage. How does the compiler handle this? Does it matter in terms of memory consumption whether I use the String s (as shown in the example)? Should I replace it by null or 0 to save a little memory? Does the compiler actually create and store these objects (references) or does it notice that they are never used?
In this case, I think it is what Spark does at runtime that influences the outcome rather than the compiler: whether or not Spark can optimise its execution pipeline to avoid creating the redundant duplication of s. I'm not sure, but I think Spark will create rdd1pairs in memory.
Instead of mapping to (String, String) you could use (String, Unit):
rdd1.map(s => (s,()))
What you're doing is basically a filter of rdd2 based on rdd1. If rdd1 is significantly smaller than rdd2, another method would be to represent the data of rdd1 as a broadcast variable rather than an RDD, and simply filter rdd2. This avoids any shuffling or reduce phase, so may be quicker, but will only work if the data of rdd1 is small enough to fit on each node.
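A rough sketch of that broadcast-and-filter variant (assuming sc is your SparkContext and the data of rdd1 fits comfortably in memory on the driver and executors):
val rdd1Keys = sc.broadcast(rdd1.collect().toSet)
val result = rdd2
  .filter { case (k, _) => rdd1Keys.value.contains(k) }  // keep only pairs whose key appears in rdd1
  .map { case (_, v) => v }                              // and drop the keys, as before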
EDIT:
To see how using Unit rather than String saves space, consider the following examples:
object size extends App {
  (1 to 1000000).map(i => ("foo"+i, ()))
  val input = readLine("prompt> ")
}
and
object size extends App {
  (1 to 1000000).map(i => ("foo"+i, "foo"+i))
  val input = readLine("prompt> ")
}
Using the jstat command as described in this question (How to check heap usage of a running JVM from the command line?), the first version uses significantly less heap than the second.
Edit 2:
Unit is effectively a singleton object with no contents, so logically, it should not require any serialization. The fact that the type definition contains Unit tells you all you need to be able to deserialize a structure which has a field of type Unit.
Spark uses Java Serialization by default. Consider the following:
object Main extends App {
  import java.io.{ObjectOutputStream, FileOutputStream}

  case class Foo(a: String, b: String)
  case class Bar(a: String, b: String, c: Unit)

  val str = "abcdef"
  val foo = Foo("abcdef", "xyz")
  val bar = Bar("abcdef", "xyz", ())

  val fos = new FileOutputStream("foo.obj")
  val fo = new ObjectOutputStream(fos)
  val bos = new FileOutputStream("bar.obj")
  val bo = new ObjectOutputStream(bos)

  fo writeObject foo
  bo writeObject bar
}
The two files are of identical size:
�� sr Main$Foo3�,�z \ L at Ljava/lang/String;L bq ~ xpt abcdeft xyz
and
�� sr Main$Bar+a!N��b L at Ljava/lang/String;L bq ~ xpt abcdeft xyz

java 8 stream interference versus non-interference

I understand why the following code is ok. Because the collection is being modified before calling the terminal operation.
List<String> wordList = ...;
Stream<String> words = wordList.stream();
wordList.add("END"); // Ok
long n = words.distinct().count();
But why is this code not ok?
Stream<String> words = wordList.stream();
words.forEach(s -> { if (s.length() < 12) wordList.remove(s); }); // Error—interference
Stream.forEach() is a terminal operation, and the underlying wordList collection is modified after the terminal operation has been started.
Joachim's answer is correct, +1.
You didn't ask specifically, but for the benefit of other readers, here are a couple techniques for rewriting the program a different way, avoiding stream interference problems.
If you want to mutate the list in-place, you can do so with a new default method on List instead of using streams:
wordList.removeIf(s -> s.length() < 12);
If you want to leave the original list intact but create a modified copy, you can use a stream and a collector to do that:
List<String> newList = wordList.stream()
    .filter(s -> s.length() >= 12)
    .collect(Collectors.toList());
Note that I had to invert the sense of the condition, since filter takes a predicate that keeps values in the stream if the condition is true.

Processing a bunch of strings efficiently

I need to read some data from a file in chunks of 128 MB, and then for each line I will do some processing. The naive way is to use split to convert the string into a collection of lines and then process each line, but that may not be efficient, since it creates a collection that simply stores a temporary result, which could be costly. Is there a way to do this with better performance?
The file is huge, so I kicked off several threads, and each thread will pick up a 128 MB chunk. In the following snippet, rawString is a 128 MB chunk.
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines = rawString.split("\n")
for (line <- lines) {
  ...
}
It'd be better to read text line by line:
import scala.io.Source

for (line <- Source.fromFile("file.txt").getLines()) {
  ...
}
I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out--this solution captures everything delimited on both sides by \n.
Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with just handling Unix line encodings, you would want to
def lines(bs: Array[Byte]): Array[String] = {
  // collect the indices of every '\n' in the byte array
  val xs = Array.newBuilder[Int]
  var i = 0
  while (i < bs.length) {
    if (bs(i) == '\n') xs += i
    i += 1
  }
  val ix = xs.result
  // build one String per stretch of bytes between consecutive newlines
  // (the partial lines before the first and after the last '\n' are dropped)
  val ss = new Array[String](0 max (ix.length - 1))
  i = 1
  while (i < ix.length) {
    ss(i-1) = new String(bs, ix(i-1)+1, ix(i)-ix(i-1)-1)
    i += 1
  }
  ss
}
Of course this is rather long and messy code, but if you're really worried about performance this sort of thing (heavy use of low-level operations on primitives) is the way to go. (This also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data) since you don't need the full string representation around.)
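A usage sketch tying this back to the chunked read in the question (randomAccessFile, start, and byteBuffer as in the original snippet):
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)      // byteBuffer: Array[Byte] holding the ~128 MB chunk
for (line <- lines(byteBuffer)) {
  // process each complete line in this chunk; the partial lines at the
  // chunk boundaries still need to be handled separately
}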
