Processing a bunch of strings efficiently - performance

I need to read data from a file in chunks of 128 MB and then process each line. The naive way is to use split to convert the string into a collection of lines and then process each line, but that may be inefficient because it creates a collection that only stores a temporary result, which could be costly. Is there a way to do this with better performance?
The file is huge, so I start several threads, and each thread picks up one 128 MB chunk. In the following script, rawString is one such chunk.
val rawString = new String(byteBuffer)
val lines=rawString.split("\n")
for(line <- lines){

It'd be better to read text line by line:
for(line <- Source.fromFile("file.txt").getLines()) {

I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out--this solution captures everything delimited on both sides by \n.
Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with just handling Unix line encodings, you would want to
def lines(bs: Array[Byte]): Array[String] = {
  // record the index of every '\n' byte in the chunk
  val xs = Array.newBuilder[Int]
  var i = 0
  while (i<bs.length) {
    if (bs(i)=='\n') xs += i
    i += 1
  }
  val ix = xs.result
  // one string for each pair of adjacent newlines
  val ss = new Array[String](0 max (ix.length-1))
  i = 1
  while (i < ix.length) {
    ss(i-1) = new String(bs, ix(i-1)+1, ix(i)-ix(i-1)-1)
    i += 1
  }
  ss
}
Of course this is rather long and messy code, but if you're really worried about performance this sort of thing (heavy use of low-level operations on primitives) is the way to go. (This also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data) since you don't need the full string representation around.)
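For comparison, here is a minimal Python sketch of the same idea: scan the raw bytes for newline positions and decode only the slices between them, rather than decoding the whole chunk and splitting it. The function name lines_between_newlines and the UTF-8 default are my own assumptions, not part of the answer above. As in the Scala version, the partial lines before the first and after the last newline are dropped.

```python
def lines_between_newlines(chunk: bytes, encoding: str = "utf-8"):
    """Yield every line that is delimited by a newline byte on both sides."""
    prev = chunk.find(b"\n")
    while prev != -1:
        nxt = chunk.find(b"\n", prev + 1)
        if nxt == -1:
            break  # trailing partial line: left for the caller to handle
        # Decode only this slice, so no full-size decoded copy of the chunk is built.
        yield chunk[prev + 1:nxt].decode(encoding)
        prev = nxt
```

Because it is a generator, each line can be processed and discarded without ever holding the full collection of lines in memory.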


What's the Input key of MapReduce by default?

I'm using MapReduce on Hadoop 2.6.0, and I want to skip the first six lines of my data file, so I use
return ;
{do ....}
in my map() function.
But that did not work. I found that the input key of map() is not the line number within the file; it is the sum of the lengths of all preceding lines. Why? This does not match what many books say.
If you look at the code, the key is the actual byte offset within the file, not the line number.
If you want to skip the first n lines of your file, you probably have to write your own input format / record reader, or make sure that you keep a line counter in the mapper logic à la:
int lines = 0;
public void map(LongWritable key, Text value, ...) {
  // skip the first six lines seen by this mapper
  if(++lines <= 6) { return; }
  // ... process the remaining lines
}
This obviously doesn't work if the text file is split (that is, if there is more than one mapper). So writing a dedicated InputFormat is the cleanest way to solve this problem.
Another trick would be to measure how many bytes the first n lines occupy in that specific file and then skip that many bytes at the start.
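To illustrate the byte-counting trick outside of Hadoop, here is a small Python sketch. It measures the header length once, then drops every record whose key (its byte offset, as with the default text input key) falls inside the header. The helper names header_byte_length, records_with_offsets and skip_header are hypothetical, and records_with_offsets only simulates what the framework would pass to map().

```python
import io

def header_byte_length(stream, n_lines):
    """Return how many bytes the first n_lines lines of a binary stream occupy."""
    total = 0
    for _ in range(n_lines):
        line = stream.readline()
        if not line:
            break  # file has fewer than n_lines lines
        total += len(line)
    return total

def records_with_offsets(stream):
    """Simulate the mapper input: yield (byte offset of line start, line)."""
    offset = 0
    for line in stream:
        yield offset, line
        offset += len(line)

def skip_header(records, header_bytes):
    """Keep only the records that start at or after the end of the header."""
    return [(key, value) for key, value in records if key >= header_bytes]
```

Unlike a per-mapper line counter, the comparison against the offset still makes sense when the file is split, because the keys are offsets into the whole file.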

Scala regex splitting on InputStream

I'm parsing a resource file and splitting on empty lines, using the following code:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val fooString = source.mkString
val fooParsedSections = fooString.split("\\r\\n[\\f\\t ]*\\r\\n")
I believe this pulls the whole input stream into memory as one string and then splits it on the regex. This works fine for the relatively small file I'm parsing, but it's not ideal, and I'm curious how I could improve it.
Two ideas are:
read the input stream line-by-line and have a buffer of segments that I build up, splitting on empty lines
read the stream character-by-character and parse segments based off of a small finite state machine
However, I'd love to not maintain a mutable buffer if possible.
Any suggestions? This is just for a personal fun project, and I want to learn how to do this in an efficient and functional manner.
You can use the Stream.span method to get the prefix before the empty line, then repeat. Here's a helper function for that:
def sections(lines: Stream[String]): Stream[String] = {
  if (lines.isEmpty) Stream.empty
  else {
    // cutting off the longest `prefix` before an empty line
    val (prefix, suffix) = lines.span { _.trim.nonEmpty }
    // dropping any empty lines (there may be several)
    val rest = suffix.dropWhile{ _.trim.isEmpty }
    // grouping back the prefix lines and calling recursion
    prefix.mkString("\n") #:: sections(rest)
  }
}
Note that Stream's #:: method is lazy and doesn't evaluate the right operand until it's needed. Here is how you can apply it to your use case:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val parsedSections = sections(source.getLines.toStream)
The getLines method returns an Iterator[String], which we convert to a Stream and pass to the helper function. You can also call .toIterator at the end if you process the groups of lines on the way and don't need to store them. See the Stream docs for details.
If you still want to use regex, you can change .trim.nonEmpty in the function above to the use of the String matches method.
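The same lazy grouping can be sketched in Python with itertools.groupby, which also avoids a hand-maintained mutable buffer. This is only an illustration of the technique in another language, not a translation of the helper above; the function name sections is reused for readability, and treating whitespace-only lines as empty is an assumption carried over from the .trim.nonEmpty test.

```python
from itertools import groupby

def sections(lines):
    """Lazily yield each run of consecutive non-blank lines as one string."""
    # groupby starts a new group whenever the "is blank" flag changes.
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield "\n".join(line.rstrip("\r\n") for line in group)
```

Passing an open file object as lines keeps memory use proportional to the largest section rather than to the whole file.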

Huge memory consumption in Map Task in Spark

I have a lot of files that each contain roughly 60.000.000 lines. Every line has the format {timestamp}#{producer}#{messageId}#{data_bytes}\n
I walk through my files one by one and also want to build one output file per input file.
Because some of the lines depend on previous lines, I grouped them by their producer. Whenever a line depends on one or more previous lines, their producer is always the same.
After grouping up all of the lines, I give them to my Java parser.
The parser then will contain all parsed data objects in memory and output it as JSON afterwards.
To visualize how I think my job is processed, I threw together the following "flow graph". Note that I did not visualize the groupByKey shuffling process.
My problems:
I expected Spark to split up the files, process the splits with separate tasks and save each task output to a "part"-file.
However, my tasks run out of memory and get killed by YARN before they can finish: Container killed by YARN for exceeding memory limits. 7.6 GB of 7.5 GB physical memory used
My Parser is throwing all parsed data objects into memory. I can't change the code of the Parser.
Please note that my code works for smaller files (for example two files with 600.000 lines each as the input to my Job)
My questions:
How can I make sure that Spark will create a result for every file-split in my map task? (Maybe they will if my tasks succeed but I will never see the output as of now.)
I thought that my map transformation val lineMap = ... (see Scala code below) produces a partitioned rdd. Thus I expect the values of the rdd to be split in some way before calling my second map task.
Furthermore, I thought that calling saveAsTextFile on this rdd lineMap will produce an output task that runs after each of my map tasks has finished. If my assumptions are correct, why do my executors still run out of memory? Is Spark creating several (too) big file splits and processing them concurrently, which leads to the Parser filling up the memory?
Is repartitioning the lineMap rdd to get more (smaller) inputs for my Parser a good idea?
Is there somewhere an additional reducer step which I am not aware of? Like results being aggregated before getting written to file or similar?
Scala code (I left out irrelevant code parts):
def main(args: Array[String]) {
  val inputFilePath = args(0)
  val outputFilePath = args(1)
  val inputFiles = fs.listStatus(new Path(inputFilePath))
  inputFiles.foreach( filename => {
    processData(filename.getPath, ...)
  })
}

def processData(filePath: Path, ...) {
  val lines = sc.textFile(filePath.toString())
  val lineMap = lines.map(line => (line.split(" ")(1), line)).groupByKey()
  val parsedLines = lineMap.map{ case(key, values) => parseLinesByKey(key, values, config) }
  //each output should be saved separately
  parsedLines.saveAsTextFile(outputFilePath.toString() + "/" + filePath.getName)
}

def parseLinesByKey(key: String, values: Iterable[String], config : Config) = {
  val importer = new LogFileImporter(...)
  importer.parseData(values.toIterator.asJava, ...)
  //importer from now contains all parsed data objects in memory that could be parsed
  //from the given values.
  val jsonMapper = getJsonMapper(...)
  val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)
  (key, jsonStringData)
}
I fixed this by removing the groupByKey call and implementing a new FileInputFormat as well as a RecordReader to remove my limitations that lines depend on other lines. For now, I implemented it so that each split will contain a 50.000 Byte overhead of the previous split. This will ensure that all lines that depend on previous lines can be parsed correctly.
I will now go ahead and still look through the last 50.000 Bytes of the previous split, but only copy over lines that actually affect the parsing of the current split. Thus, I minimize the overhead and still get a highly parallelizable task.
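As a rough illustration of the overlap idea (not my actual FileInputFormat code), the byte ranges can be computed like this in Python. The function name overlapping_splits and its parameters are made up for this sketch; each split reads from read_start but only emits records that begin in [start, end), so the bytes before start serve purely as context for dependent lines.

```python
def overlapping_splits(file_size, split_size, overlap):
    """Return (read_start, start, end) for each split of a file of file_size bytes."""
    splits = []
    for start in range(0, file_size, split_size):
        end = min(start + split_size, file_size)
        # Reach back into the previous split, but never before the start of the file.
        read_start = max(0, start - overlap)
        splits.append((read_start, start, end))
    return splits
```

With a 50.000 byte overlap and typical split sizes of many megabytes, the extra data read per split stays small compared to the split itself.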
The following links pointed me in the right direction. Because the topic of FileInputFormat/RecordReader is quite complicated at first sight (it was for me at least), it is good to read through these articles and understand whether this approach is suitable for your problem or not:
Relevant code parts from the article, just in case the website goes down. The author (#Gurdt) uses this to detect whether a chat message contains an escaped line return (the line ends with "\") and appends the escaped lines together until an unescaped \n is found. This allows him to retrieve messages that span two or more lines. The code, written in Scala:
val conf = new Configuration(sparkContext.hadoopConfiguration)
val rdd = sparkContext.newAPIHadoopFile("data.txt", classOf[MyFileInputFormat],
  classOf[LongWritable], classOf[Text], conf)

class MyFileInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext):
    RecordReader[LongWritable, Text] = new MyRecordReader()
}

class MyRecordReader() extends RecordReader[LongWritable, Text] {
  var start, end, pos = 0L
  var reader: LineReader = null
  var key = new LongWritable
  var value = new Text

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    // split position in data (start one byte earlier to detect if
    // the split starts in the middle of a previous record)
    val split = inputSplit.asInstanceOf[FileSplit]
    start = math.max(0L, split.getStart - 1)
    end = start + split.getLength

    // open a stream to the data, pointing to the start of the split
    val stream = split.getPath.getFileSystem(context.getConfiguration)
      .open(split.getPath)
    stream.seek(start)
    reader = new LineReader(stream, context.getConfiguration)

    // if the split starts at a newline, we want to start yet another byte
    // earlier to check if the newline was escaped or not
    val firstByte = stream.readByte().toInt
    if(firstByte == '\n')
      start = math.max(0L, start - 1)
    stream.seek(start)

    if(start != 0)
      skipRemainderFromPreviousSplit(reader)
  }

  def skipRemainderFromPreviousSplit(reader: LineReader): Unit = {
    var readAnotherLine = true
    while(readAnotherLine) {
      // read next line
      val buffer = new Text()
      start += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
      pos = start

      // detect if delimiter was escaped
      readAnotherLine = buffer.getLength >= 1 && // something was read
        buffer.charAt(buffer.getLength - 1) == '\\' && // newline was escaped
        pos <= end // seek head hasn't passed the split
    }
  }

  override def nextKeyValue(): Boolean = {
    // read newlines until an unescaped newline is read
    var lastNewlineWasEscaped = false
    while (pos < end || lastNewlineWasEscaped) {
      // read next line
      val buffer = new Text
      pos += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)

      // append newly read data to previous data if necessary
      value = if(lastNewlineWasEscaped) new Text(value + "\n" + buffer) else buffer

      // detect if delimiter was escaped
      lastNewlineWasEscaped = buffer.charAt(buffer.getLength - 1) == '\\'

      // let Spark know that a key-value pair is ready!
      if(!lastNewlineWasEscaped) {
        key.set(pos)
        return true
      }
    }

    // end of split reached?
    return false
  }
}
Note: You might need to implement getCurrentKey, getCurrentValue, close and getProgress in your RecordReader as well.
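Stripped of the Hadoop plumbing, the core of that record reader is "keep appending lines while the previous one ended in a backslash". Here is a minimal Python sketch of just that rule, so it is easier to see what nextKeyValue is doing. The function name join_escaped_lines is hypothetical, and like the Scala code it keeps the backslash and joins the pieces with a newline.

```python
def join_escaped_lines(lines):
    """Yield logical records, merging physical lines whose newline was escaped."""
    pending = []
    for line in lines:
        line = line.rstrip("\n")
        pending.append(line)
        # A trailing backslash means the newline was escaped: keep reading.
        if not line.endswith("\\"):
            yield "\n".join(pending)
            pending = []
    if pending:
        # Input ended right after an escaped newline: emit what we have.
        yield "\n".join(pending)
```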

How to see if a string exists in a huge (>19GB) sorted file?

I have files that can be 19 GB or larger; they are huge but sorted. Can I use the fact that they are sorted to my advantage when checking whether a certain string exists?
I looked at something called sgrep, but I'm not sure if it's what I'm looking for. For example, I will have a 19 GB text file with millions of rows like
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of rows to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping the file for the string and seeing if a result comes back? I don't even need the line; I just need a boolean, true/false, for whether it is in the file.
Actually sgrep is what I was looking for. The reason I got confused was because structured grep has the same name as sorted grep and I was installing the wrong package. sgrep is amazing
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
sub file_has_line(file, target) {
  a = 0
  z = file.length
  while (a < z) {
    m = (a+z)/2
    chunk = file.read(m, 200)
    // That is, read 200 bytes, starting at m.
    line = chunk.split(/\n/)[1]
    // split the chunk on newlines, and keep only the second piece
    // (the first complete line after m).
    if line == target
      return true
    if line < target
      a = m + 1
    else
      z = m - 1
  }
  return false
}
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
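Here is a runnable Python sketch of that binary search, written so that it handles lines of any length and compares only the first column. The names sorted_file_contains and _column_at, and the comma separator default, are my own assumptions. It also assumes the file is sorted bytewise on the first column (what sort produces under LC_ALL=C); with a locale-aware sort order the comparisons would not be valid.

```python
import os

def _column_at(f, offset, sep):
    """Return the first column of the first line starting at or after offset, or None."""
    if offset > 0:
        # Step back one byte and discard through the next newline, so a line
        # that starts exactly at offset is not skipped.
        f.seek(offset - 1)
        f.readline()
    else:
        f.seek(0)
    line = f.readline()
    if not line:
        return None
    return line.rstrip(b"\r\n").split(sep, 1)[0]

def sorted_file_contains(path, key, sep=b","):
    """Binary search over byte offsets for a line whose first column equals key."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            col = _column_at(f, mid, sep)
            if col is not None and col < key:
                lo = mid + 1  # key can only be further down the file
            else:
                hi = mid      # key is here or further up
        return _column_at(f, lo, sep) == key
```

Usage would look like sorted_file_contains("big.txt", b"ABCDEFG"), where "big.txt" stands in for the real file. Each probe reads about one or two lines, so the total I/O is a few dozen small reads.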
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
  target_array = target_array.sort
  target = ''
  line = file.readln()
  hits = []
  do {
    if line < target
      line = file.readln()
    elsif line > target
      // take the next-smallest target
      target = target_array.shift()
    elsif line == target
      hits.push(line)
      line = file.readln()
  } while not file.eof()
  return hits
}
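A runnable Python version of that parallel walk might look like the following. It is a sketch under a few assumptions: find_targets is a made-up name, the lines are already stripped of their trailing newline, and the file's sort order agrees with Python's string comparison.

```python
def find_targets(sorted_lines, targets):
    """Return the targets that occur in sorted_lines, scanning the input once."""
    hits = []
    lines = iter(sorted_lines)
    line = next(lines, None)
    for target in sorted(set(targets)):
        # Advance the file cursor until it reaches or passes the target.
        while line is not None and line < target:
            line = next(lines, None)
        if line is None:
            break  # end of file: no remaining target can match
        if line == target:
            hits.append(target)
    return hits
```

Both sequences only ever move forward, so the cost is one pass over the file plus the sort of the target list.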

word distribution problem

I have a big file of words (~100 GB) and limited memory (4 GB). I need to calculate the word distribution from this file. One option is to divide it into chunks, sort each chunk, and then merge to calculate the word distribution. Is there any faster way? One idea is to sample, but I'm not sure how to implement that so it returns a result close to the correct one.
You can build a Trie structure where each leaf (and some nodes) will contain the current count. As words will intersect with each other 4GB should be enough to process 100 GB of data.
Naively I would just build up a hash table until it hits a certain limit in memory, then sort it in memory and write this out. Finally, you can do n-way merging of each chunk. At most you will have 100/4 chunks or so, but probably many fewer provided some words are more common than others (and how they cluster).
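A minimal Python sketch of that hash-table-then-merge approach follows. The function name count_words_external, the max_distinct limit (a stand-in for "until it hits a certain limit in memory") and the tab-separated run files are all assumptions for illustration; it also assumes words contain no tabs or newlines.

```python
import heapq
import os
from collections import Counter
from itertools import groupby

def count_words_external(words, max_distinct, tmp_dir):
    """Yield (word, count) in sorted order while holding at most max_distinct counters."""
    runs = []
    counts = Counter()

    def spill():
        # Write the current partial counts as one sorted run, then start over.
        path = os.path.join(tmp_dir, f"run{len(runs)}.tsv")
        with open(path, "w", encoding="utf-8") as out:
            for word in sorted(counts):
                out.write(f"{word}\t{counts[word]}\n")
        runs.append(path)
        counts.clear()

    for word in words:
        counts[word] += 1
        if len(counts) >= max_distinct:
            spill()
    if counts:
        spill()

    def read_run(path):
        with open(path, encoding="utf-8") as f:
            for row in f:
                word, n = row.rstrip("\n").split("\t")
                yield word, int(n)

    # n-way merge of the sorted runs; equal words from different runs end up adjacent.
    merged = heapq.merge(*(read_run(p) for p in runs))
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        yield word, sum(n for _, n in group)
```

Frequent words collapse inside each in-memory table before they are written, which is why the number of runs is usually much smaller than the raw size ratio suggests.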
Another option is to use a trie which was built for this kind of thing. Each character in the string becomes a branch in a 256-way tree and at the leaf you have the counter. Look up the data structure on the web.
If you can pardon the pun, "trie" this:
using System.Collections.Generic;
using System.Linq;

public class Trie : Dictionary<char, Trie>
{
    public int Frequency { get; set; }

    public void Add(string word)
    {
        this.Add(word.ToCharArray());
    }

    private void Add(char[] chars)
    {
        if (chars == null || chars.Length == 0)
            throw new System.ArgumentException();
        var first = chars[0];
        if (!this.ContainsKey(first))
            this.Add(first, new Trie());
        if (chars.Length == 1)
            this[first].Frequency += 1;
        else
            // recurse on the rest of the word
            this[first].Add(chars.Skip(1).ToArray());
    }

    public int GetFrequency(string word)
    {
        return this.GetFrequency(word.ToCharArray());
    }

    private int GetFrequency(char[] chars)
    {
        if (chars == null || chars.Length == 0)
            throw new System.ArgumentException();
        var first = chars[0];
        if (!this.ContainsKey(first))
            return 0;
        if (chars.Length == 1)
            return this[first].Frequency;
        return this[first].GetFrequency(chars.Skip(1).ToArray());
    }
}
Then you can call code like this:
var t = new Trie();
t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");
var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1
You should be able to add code to traverse the trie and return a flat list of words and their frequencies.
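For what the traversal could look like, here is a small Python sketch of a counting trie with a depth-first walk that returns the flat list. It is not a port of the C# class: TrieNode, add_word and word_frequencies are names I made up, and the walk uses an explicit stack so that long words cannot hit the recursion limit.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # maps one character to the child node
        self.count = 0      # number of times a word ended at this node

def add_word(root, word):
    """Insert word into the trie rooted at root and bump its counter."""
    node = root
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            child = node.children[ch] = TrieNode()
        node = child
    node.count += 1

def word_frequencies(root):
    """Yield (word, count) for every stored word, in sorted order."""
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        if node.count:
            yield prefix, node.count
        # Push children in reverse so the smallest character is popped first.
        for ch in sorted(node.children, reverse=True):
            stack.append((prefix + ch, node.children[ch]))
```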
If you find that this too still blows your memory limit then might I suggest that you "divide and conquer". Maybe scan the source data for all the first characters and then run the trie separately against each and then concatenate the results after all of the runs.
Do you know how many different words you have? If not a lot (e.g. a hundred thousand), then you can stream the input, split it into words, and use a hash table to keep the counts. After the input is done, just traverse the result.
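In Python that streaming hash table is only a few lines with collections.Counter. This sketch assumes words are separated by whitespace; count_words is a hypothetical name.

```python
from collections import Counter

def count_words(lines):
    """Stream over lines and count whitespace-separated words in a hash table."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts
```

Passing an open file object as lines reads one line at a time, so memory use depends on the number of distinct words, not on the file size.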
Just use a DBM file. It’s a hash on disk. If you use the more recent versions, you can use a B+Tree to get in-order traversal.
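As a sketch of the DBM idea, Python's standard dbm module gives a dict-like hash on disk. The function names below are made up, and note that this module stores keys and values as bytes and does not expose the B+Tree in-order traversal mentioned above, so the keys come back in no particular order.

```python
import dbm

def count_words_dbm(words, db_path):
    """Count words in an on-disk DBM file instead of an in-memory dict."""
    with dbm.open(db_path, "c") as db:  # "c": create the file if it is missing
        for word in words:
            key = word.encode("utf-8")
            current = int(db.get(key, b"0"))
            db[key] = str(current + 1).encode("ascii")

def read_counts(db_path):
    """Load all counts back into a regular dict (only sensible for small results)."""
    with dbm.open(db_path, "r") as db:
        return {k.decode("utf-8"): int(db[k]) for k in db.keys()}
```

Every increment is a disk-backed read and write, so expect this to be much slower per word than an in-memory table; the benefit is that memory use stays flat.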
Why not use any relational DB? The procedure would be as simple as:
Create a table with the word and count.
Create an index on word. Some databases have a word index (e.g. Progress).
Do SELECT on this table with the word.
If word exists then increase counter.
Otherwise - add it to the table.
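The steps above can be sketched with Python's built-in sqlite3 module. SQLite is just a convenient stand-in for "any relational DB", and the table name word_counts and column n are assumptions; declaring word as the primary key gives the index from step two.

```python
import sqlite3

def count_words_sql(words, db_path=":memory:"):
    """Follow the select / update-or-insert procedure and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for word in words:
        row = conn.execute(
            "SELECT n FROM word_counts WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            # Word not seen yet: add it to the table.
            conn.execute("INSERT INTO word_counts (word, n) VALUES (?, 1)", (word,))
        else:
            # Word exists: increase its counter.
            conn.execute("UPDATE word_counts SET n = n + 1 WHERE word = ?", (word,))
    conn.commit()
    return conn
```

The final distribution is then a single query, for example SELECT word, n FROM word_counts ORDER BY n DESC.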
If you are using Python, you can rely on the built-in iteration protocol (iter): iterating over a file reads it line by line and will not cause memory problems. You should not "return" the value but "yield" it.
Here is a sample that I used to read a file and get the vector values.
def __iter__(self):
    # stream the file one line at a time instead of loading it all
    with open(self.temp_file_name) as f:
        for line in f:
            yield self.dictionary.doc2bow(line.lower().split())
