Splits in Hadoop with a variable-length / non-delimited binary file

I've just started working on a Hadoop-based ingester for OpenStreetMap data. There are a few formats, but I've been targeting a Protocol Buffers-based format (note: it's not pure PB).
It looks to me like it would be more efficient to pre-split the file into a sequence file, as opposed to handling the variable-length encoding in a custom record reader / input format, but I'd like a sanity check.
The format is described in more detail at PBF Format Description, but basically it's a collection of [BlobHeader, Blob] blocks.
There's a BlobHeader:
message BlobHeader {
    required string type = 1;
    optional bytes indexdata = 2;
    required int32 datasize = 3;
}
And then the Blob (the size of which is given by the datasize field in the header):
message Blob {
    optional bytes raw = 1; // No compression
    optional int32 raw_size = 2; // Only set when compressed, to the uncompressed size
    optional bytes zlib_data = 3;
    // optional bytes lzma_data = 4; // PROPOSED.
    // optional bytes OBSOLETE_bzip2_data = 5; // Deprecated.
}
There's obviously more structure once you get down into the blob, but I would handle that in the mapper. What I would like to do initially is have one blob per mapper (later it might be some multiple of blobs per mapper).
Some of the other input formats/record readers use a "big enough" split size and then seek backwards/forwards to a delimiter. But since there is no delimiter that would let me find the offsets of the blobs/headers, and no index that points to them either, I can't see any way to get my split points without first streaming through the file.
Now, I wouldn't need to actually read the entire file off disk: I could read a header, use its datasize to seek past the blob, record that as the next split point, and repeat. But that's about the only alternative to pre-splitting into a sequence file I can come up with.
Is there a better way to handle this - or if not, thoughts on the two suggestions?

Well, I went with parsing the binary file in the getSplits method, and since I'm skipping over 99% of the data it's plenty fast (~20 seconds for the 22 GB planet-osm world file). Here's the getSplits method in case anyone else stumbles across this:
@Override
public List<InputSplit> getSplits(JobContext context){
    List<InputSplit> splits = new ArrayList<InputSplit>();
    FileSystem fs = null;
    Path file = OSMPBFInputFormat.getInputPaths(context)[0];
    FSDataInputStream in = null;
    try {
        fs = FileSystem.get(context.getConfiguration());
        in = fs.open(file);
        long pos = 0;
        while (in.available() > 0){
            int len = in.readInt();              // 4-byte length of the BlobHeader
            byte[] blobHeader = new byte[len];
            in.readFully(blobHeader);            // read the whole header (a plain read() may return short)
            BlobHeader h = BlobHeader.parseFrom(blobHeader);
            FileSplit split = new FileSplit(file, pos, len + h.getDatasize(), new String[] {});
            splits.add(split);
            pos += 4;                            // length prefix
            pos += len;                          // header
            pos += h.getDatasize();              // blob
            in.skip(h.getDatasize());            // jump straight to the next length prefix
        }
    } catch (IOException e) {
        sLogger.error(e.getLocalizedMessage());
    } finally {
        if (in != null) { try { in.close(); } catch (Exception e) {} }
        if (fs != null) { try { fs.close(); } catch (Exception e) {} }
    }
    return splits;
}
It's working fine so far, though I haven't ground-truthed the output yet. It's definitely faster than copying the PBF to HDFS, converting it to a sequence file in a single mapper, then ingesting (copy time dominates). It's also ~20% faster than having an external program copy to a sequence file in HDFS and then running a mapper against HDFS (I scripted the latter).
So no complaints here.
Note that as written this generates a mapper for every block, which is ~23k mappers for the planet world file. I'm actually bundling up multiple blocks per split: just loop through x number of times before a split gets added to the collection, as sketched below.
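Roughly, the bundling variant could look like this. This is a sketch rather than my exact code: blocksPerSplit is just an illustrative knob, and the error handling/logging from the version above is omitted.
@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    int blocksPerSplit = 32;   // illustrative value; tune for your cluster
    List<InputSplit> splits = new ArrayList<InputSplit>();
    Path file = OSMPBFInputFormat.getInputPaths(context)[0];
    FileSystem fs = FileSystem.get(context.getConfiguration());
    FSDataInputStream in = fs.open(file);
    try {
        long pos = 0;
        while (in.available() > 0) {
            long splitStart = pos;
            long splitLength = 0;
            int blocks = 0;
            while (blocks < blocksPerSplit && in.available() > 0) {
                int len = in.readInt();                       // 4-byte BlobHeader length
                byte[] blobHeader = new byte[len];
                in.readFully(blobHeader);
                BlobHeader h = BlobHeader.parseFrom(blobHeader);
                long blockLength = 4 + len + h.getDatasize(); // prefix + header + blob
                pos += blockLength;
                splitLength += blockLength;
                in.seek(pos);                                 // jump over the blob itself
                blocks++;
            }
            splits.add(new FileSplit(file, splitStart, splitLength, new String[] {}));
        }
    } finally {
        in.close();
    }
    return splits;
}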
For the BlobHeader class, I just compiled the protobuf .proto file from the OSM wiki link above. You can also pull it pre-generated from the OSM-binary project if you want; the Maven fragment is:
<dependency>
    <groupId>org.openstreetmap.osmosis</groupId>
    <artifactId>osmosis-osm-binary</artifactId>
    <version>0.43-RELEASE</version>
</dependency>
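For the consuming side (one blob per mapper), a record reader only has to seek to the split start, read the 4-byte length prefix, the BlobHeader, and then the blob bytes. A minimal sketch, assuming the splits produced above; the class name and the Text/BytesWritable key/value choice are illustrative, not the code I'm actually running:
public class OSMBlobRecordReader extends RecordReader<Text, BytesWritable> {
    private FSDataInputStream in;
    private boolean done = false;
    private final Text key = new Text();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
        in = fs.open(split.getPath());
        in.seek(split.getStart());           // each split starts at a 4-byte length prefix
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (done) return false;
        int len = in.readInt();              // BlobHeader length
        byte[] headerBytes = new byte[len];
        in.readFully(headerBytes);
        BlobHeader header = BlobHeader.parseFrom(headerBytes);
        byte[] blobBytes = new byte[header.getDatasize()];
        in.readFully(blobBytes);             // raw Blob message; decompress/parse in the mapper
        key.set(header.getType());
        value.set(blobBytes, 0, blobBytes.length);
        done = true;                         // one blob per split in this sketch
        return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() throws IOException { if (in != null) in.close(); }
}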

Related

C++ Write large array as CSV to disk

I'm trying to write out an image from a camera to disk. To keep it simple for post-processing downstream, we chose to use a CSV format (other systems we have use this format as opposed to binary). The issue is that writing a 1 MB image takes ~1 second.
I'm looking to speed this up by at least 10x. Ideally we maintain the CSV format, but we can move to a binary file if needed. This is all running on a Windows 10 machine and compiled with VS 2017.
The code below is inefficient but prints out a rectangular image with ',' separation between pixels. I'm sure 99% of the inefficiency comes from the millions of repetitive IO requests in the loop, but I'm not sure how to append ',' without something like this...
int toCSV(uns16 * bufferCopy, uns32 bufsize, uns32 imgWidth, string directory, string fileout)
{
    std::ofstream outputFile(directory + fileout);
    for (uns32 i = 0; i < bufsize; i++)
    {
        outputFile << (bufferCopy[i]);
        if (i % (imgWidth+1) < imgWidth)
            outputFile << ",";
        else
            outputFile << "\n";
    }
    outputFile.close();
    return 0;
}
An example of the output looks like this (just with ~400x~600 rows):
32.323,23.456,54.323,45.332
45.343,73.846,35.328,15.842
32.323,23.456,54.323,45.332
45.343,73.846,35.328,15.842
Update:
If I move to a binary format, I'll be using the question below as reference code:
Writing a binary file in C++ very fast

Why should we use OutputStream.write(byte[] b, int off, int len) instead of OutputStream.write(byte[] b)?

Sorry, everybody. It's a Java beginner question, but I think it will be helpful for a lot of Java learners.
FileInputStream fis = new FileInputStream(file);
OutputStream os = socket.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
The code above is part of a FileSenderClient class which sends files from a client to a server using java.io and java.net.Socket.
My question is: in the above code, why should we use
os.write(buffer, 0, len)
instead of
os.write(buffer)
To put the question another way: what is the point of having a "len" parameter for the OutputStream.write() method?
It seems both versions work fine.
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
Because you only want to write data that you actually read. Consider the case where the input consists of N buffers plus one byte. Without the len parameter you would write (N+1)*1024 bytes instead of N*1024+1 bytes. Consider also the case of reading from a socket, or indeed the general case of reading: the actual contract of InputStream.read() is that it transfers at least one byte, not that it fills the buffer. Often it can't, for one reason or another.
It seems both versions work fine.
No they're not.
They actually do not work in the same way.
It is very likely that you used a very small text file to test. But if you look carefully, you will find there are a lot of extra spaces at the end of the file you received, and that the received file is larger than the file you sent.
The reason is that you created a byte array of size 1024, but the last read() does not have enough data to fill it. The tail of the byte array is therefore left as zero bytes (or leftover bytes from the previous read). When the buffer is written to the file, those bytes are written too, and the zero bytes show up as spaces " " in Windows Notepad...
If you use an advanced text editor like Notepad++ or Sublime Text to view the file you received, you will see these NUL characters.
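To make the difference concrete, here is a small self-contained illustration (the file names are made up): a 1500-byte input copied with write(buffer, 0, len) stays 1500 bytes, while the write(buffer) variant typically comes out as 2048 bytes, because the final, partially filled buffer is written in full.
import java.io.*;

public class WriteLenDemo {
    public static void main(String[] args) throws IOException {
        File src = new File("demo-src.bin");
        try (FileOutputStream fos = new FileOutputStream(src)) {
            fos.write(new byte[1500]);  // 1500 bytes, deliberately not a multiple of 1024
        }
        System.out.println(copy(src, new File("demo-good.bin"), true));   // prints 1500
        System.out.println(copy(src, new File("demo-bad.bin"), false));   // typically prints 2048
    }

    static long copy(File src, File dst, boolean useLen) throws IOException {
        try (InputStream in = new FileInputStream(src);
             OutputStream out = new FileOutputStream(dst)) {
            byte[] buffer = new byte[1024];
            int len;
            while ((len = in.read(buffer)) != -1) {
                if (useLen) {
                    out.write(buffer, 0, len);  // only the bytes just read
                } else {
                    out.write(buffer);          // the whole buffer, every time
                }
            }
        }
        return dst.length();
    }
}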

Huge memory consumption in Map Task in Spark

I have a lot of files that each contain roughly 60,000,000 lines. All of my files are formatted as {timestamp}#{producer}#{messageId}#{data_bytes}\n
I walk through my files one by one and also want to build one output file per input file.
Because some of the lines depend on previous lines, I grouped them by their producer. Whenever a line depends on one or more previous lines, their producer is always the same.
After grouping all of the lines, I give them to my Java parser.
The parser then holds all parsed data objects in memory and outputs them as JSON afterwards.
To visualize how I think my job is processed, I threw together the following "flow graph". Note that I did not visualize the groupByKey shuffling process.
My problems:
I expected Spark to split up the files, process the splits with separate tasks and save each task output to a "part"-file.
However, my tasks run out of memory and get killed by YARN before they can finish: Container killed by YARN for exceeding memory limits. 7.6 GB of 7.5 GB physical memory used
My Parser is throwing all parsed data objects into memory. I can't change the code of the Parser.
Please note that my code works for smaller files (for example, two files with 600,000 lines each as the input to my job).
My questions:
How can I make sure that Spark will create a result for every file-split in my map task? (Maybe they will if my tasks succeed but I will never see the output as of now.)
I thought that my map transformation val lineMap = lines.map ... (see the Scala code below) produces a partitioned RDD. Thus I expect the values of the RDD to be split in some way before my second map task is called.
Furthermore, I thought that calling saveAsTextFile on this RDD lineMap would produce an output task that runs after each of my map tasks has finished. If my assumptions are correct, why do my executors still run out of memory? Is Spark making several (too) big file splits and processing them concurrently, which leads to the parser filling up the memory?
Is repartitioning the lineMap RDD to get more (smaller) inputs for my parser a good idea?
Is there an additional reducer step somewhere that I am not aware of, like results being aggregated before getting written to file?
Scala code (I left out the irrelevant parts):
def main(args: Array[String]) {
    val inputFilePath = args(0)
    val outputFilePath = args(1)
    val inputFiles = fs.listStatus(new Path(inputFilePath))
    inputFiles.foreach( filename => {
        processData(filename.getPath, ...)
    })
}

def processData(filePath: Path, ...) {
    val lines = sc.textFile(filePath.toString())
    val lineMap = lines.map(line => (line.split(" ")(1), line)).groupByKey()
    val parsedLines = lineMap.map{ case(key, values) => parseLinesByKey(key, values, config) }
    //each output should be saved separately
    parsedLines.saveAsTextFile(outputFilePath.toString() + "/" + filePath.getName)
}
def parseLinesByKey(key: String, values: Iterable[String], config : Config) = {
    val importer = new LogFileImporter(...)
    importer.parseData(values.toIterator.asJava, ...)
    //importer from now contains all parsed data objects in memory that could be parsed
    //from the given values.
    val jsonMapper = getJsonMapper(...)
    val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)
    (key, jsonStringData)
}
I fixed this by removing the groupByKey call and implementing a new FileInputFormat as well as a RecordReader, which removes my limitation that lines depend on other lines. For now, I implemented it so that each split contains a 50,000-byte overhead from the previous split. This ensures that all lines that depend on previous lines can be parsed correctly.
I will now go ahead and still look through the last 50,000 bytes of the previous split, but only copy over lines that actually affect the parsing of the current split. Thus, I minimize the overhead and still get a highly parallelizable task.
The following links pointed me in the right direction. Because the topic of FileInputFormat/RecordReader is quite complicated at first sight (it was for me, at least), it is good to read through these articles and understand whether this is suitable for your problem or not:
https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/
http://www.ae.be/blog-en/ingesting-data-spark-using-custom-hadoop-fileinputformat/
Relevant code parts from the ae.be article, just in case the website goes down. The author (@Gurdt) uses this to detect whether a chat message contains an escaped line return (the line ends with "\") and appends the escaped lines together until an unescaped \n is found. This allows him to retrieve messages that span two or more lines. The code is written in Scala:
Usage
val conf = new Configuration(sparkContext.hadoopConfiguration)
val rdd = sparkContext.newAPIHadoopFile("data.txt", classOf[MyFileInputFormat],
    classOf[LongWritable], classOf[Text], conf)
FileInputFormat
class MyFileInputFormat extends FileInputFormat[LongWritable, Text] {
    override def createRecordReader(split: InputSplit, context: TaskAttemptContext):
        RecordReader[LongWritable, Text] = new MyRecordReader()
}
RecordReader
class MyRecordReader() extends RecordReader[LongWritable, Text] {
    var start, end, pos = 0L
    var reader: LineReader = null
    var key = new LongWritable
    var value = new Text

    override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
        // split position in data (start one byte earlier to detect if
        // the split starts in the middle of a previous record)
        val split = inputSplit.asInstanceOf[FileSplit]
        start = 0.max(split.getStart - 1)
        end = start + split.getLength

        // open a stream to the data, pointing to the start of the split
        val stream = split.getPath.getFileSystem(context.getConfiguration)
            .open(split.getPath)
        stream.seek(start)
        reader = new LineReader(stream, context.getConfiguration)

        // if the split starts at a newline, we want to start yet another byte
        // earlier to check if the newline was escaped or not
        val firstByte = stream.readByte().toInt
        if (firstByte == '\n')
            start = 0.max(start - 1)
        stream.seek(start)

        if (start != 0)
            skipRemainderFromPreviousSplit(reader)
    }

    def skipRemainderFromPreviousSplit(reader: LineReader): Unit = {
        var readAnotherLine = true
        while (readAnotherLine) {
            // read next line
            val buffer = new Text()
            start += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
            pos = start
            // detect if delimiter was escaped
            readAnotherLine = buffer.getLength >= 1 &&                  // something was read
                buffer.charAt(buffer.getLength - 1) == '\\' &&          // newline was escaped
                pos <= end                                              // seek head hasn't passed the split
        }
    }

    override def nextKeyValue(): Boolean = {
        key.set(pos)
        // read newlines until an unescaped newline is read
        var lastNewlineWasEscaped = false
        while (pos < end || lastNewlineWasEscaped) {
            // read next line
            val buffer = new Text
            pos += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
            // append newly read data to previous data if necessary
            value = if (lastNewlineWasEscaped) new Text(value + "\n" + buffer) else buffer
            // detect if delimiter was escaped
            lastNewlineWasEscaped = buffer.charAt(buffer.getLength - 1) == '\\'
            // let Spark know that a key-value pair is ready!
            if (!lastNewlineWasEscaped)
                return true
        }
        // end of split reached?
        return false
    }
}
Note: You might need to implement getCurrentKey, getCurrentValue, close and getProgress in your RecordReader as well.
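For reference, in the Java mapreduce API those remaining overrides are usually one-liners. A minimal sketch, assuming a reader with key, value, start, end, pos and reader fields analogous to the Scala class above (this is illustrative, not part of the article's code):
@Override
public LongWritable getCurrentKey() {
    return key;
}

@Override
public Text getCurrentValue() {
    return value;
}

@Override
public float getProgress() {
    if (end == start) {
        return 0.0f;
    }
    // fraction of the split consumed so far, clamped to [0, 1]
    return Math.min(1.0f, (pos - start) / (float) (end - start));
}

@Override
public void close() throws IOException {
    if (reader != null) {
        reader.close();  // also closes the underlying stream
    }
}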

save array of bigint to file

I have a large array of big integers which I need to save to the disk for the next user session. I don't have a data base or the option to use one.
Will large numbers take more space in memory when I write them to a text file?
What is the best way to store this array for later use?
I'd just go for it using a normal text file; the integer will take as much space as its string representation, so 2759275918572192759185721 as a big integer will take the same space on disk as "2759275918572192759185721" as a string (which isn't that much).
When reading them from the file, you simply parse them again.
IMPORTANT: There is NO error handling in this code! You absolutely MUST add try-catch-finally to catch IOException and NumberFormatException!
File file = new File("C:\\Users\\Phiwa\\Desktop\\test.txt");
if (!file.exists())
    file.createNewFile();

// Write the BigInteger to the file as its decimal string
FileWriter fw = new FileWriter(file);
BigInteger bint1 = new BigInteger("999999999999999999");
fw.write(bint1.toString());
fw.flush();
fw.close();

// Read it back from the file
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
String str = br.readLine();
br.close();
// Parse the decimal string directly; going through Long.parseLong
// would overflow for values that don't fit in a long
BigInteger bint2 = new BigInteger(str);
System.out.println(bint2);
This works; did you already try it?
Things are a bit different if you have the integer in the format 0x999999999999999; in that case you would use:
BigInteger bint2 = new BigInteger(str.replace("0x", ""), 16);
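Since the question is about an array of big integers, here is a minimal sketch of the full round trip, one decimal value per line (the class and method names are just illustrative):
import java.io.*;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class BigIntegerFileStore {

    // write one decimal string per line
    public static void save(List<BigInteger> values, File file) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
            for (BigInteger value : values) {
                out.println(value.toString());
            }
        }
    }

    // parse every non-empty line back into a BigInteger
    public static List<BigInteger> load(File file) throws IOException {
        List<BigInteger> values = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    values.add(new BigInteger(line));
                }
            }
        }
        return values;
    }
}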

Process a bunch of strings efficiently

I need to read some data from a file in chunks of 128 MB, and then for each line I will do some processing. The naive way is to use split to convert the string into a collection of lines and then process each line, but maybe that is not efficient, as it creates a collection which simply stores a temporary result, which could be costly. Is there a way with better performance?
The file is huge, so I kicked off several threads; each thread will pick up a 128 MB chunk. In the following snippet, rawString is a 128 MB chunk.
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines = rawString.split("\n")
for (line <- lines) {
    ...
}
It'd be better to read text line by line:
import scala.io.Source

for (line <- Source.fromFile("file.txt").getLines()) {
    ...
}
I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out; this solution captures everything delimited on both sides by \n.
Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with handling only Unix line endings, you would want to:
def lines(bs: Array[Byte]): Array[String] = {
    val xs = Array.newBuilder[Int]
    var i = 0
    while (i < bs.length) {
        if (bs(i) == '\n') xs += i
        i += 1
    }
    val ix = xs.result
    val ss = new Array[String](0 max (ix.length - 1))
    i = 1
    while (i < ix.length) {
        ss(i-1) = new String(bs, ix(i-1) + 1, ix(i) - ix(i-1) - 1)
        i += 1
    }
    ss
}
Of course this is rather long and messy code, but if you're really worried about performance this sort of thing (heavy use of low-level operations on primitives) is the way to go. (This also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data) since you don't need the full string representation around.)
