C++ Write large array as CSV to disk - performance

I'm trying to write an image from a camera to disk. To keep downstream post-processing simple, we chose a CSV format (our other systems use this format rather than binary). The issue is that writing a 1 MB image takes ~1 second.
I'm looking to speed this up by at least 10x. Ideally we keep the CSV format, but we can move to a binary file if needed. This all runs on a Windows 10 machine and is compiled with VS 2017.
The code below is inefficient, but it prints out a rectangular image with ',' delimiters between pixels. I suspect 99% of the inefficiency comes from the million repetitive I/O requests, one per loop iteration, but I'm not sure how to append ',' without something like this...
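#include <fstream>
#include <string>

// uns16 / uns32 are assumed to be the camera SDK's unsigned 16- and 32-bit typedefs.
int toCSV(uns16 * bufferCopy, uns32 bufsize, uns32 imgWidth, std::string directory, std::string fileout)
{
    std::ofstream outputFile(directory + fileout);
    for (uns32 i = 0; i < bufsize; i++)
    {
        outputFile << (bufferCopy[i]);
        // Comma between pixels, newline after the last pixel of each row.
        if ((i + 1) % imgWidth != 0)
            outputFile << ",";
        else
            outputFile << "\n";
    }
    outputFile.close();
    return 0;
}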
An example of the output looks like this (Just with ~400x~600 rows):
32.323,23.456,54.323,45.332
45.343,73.846,35.328,15.842
32.323,23.456,54.323,45.332
45.343,73.846,35.328,15.842
Update:
If I move to a binary format, I'll be using the question below as reference code:
Writing a binary file in C++ very fast

Related

Why should we use OutputStream.write(byte[] b, int off, int len) instead of OutputStream.write(byte[] b)?

Sorry, everybody. It's a Java beginner question, but I think it will be helpful for a lot of Java learners.
FileInputStream fis = new FileInputStream(file);
OutputStream os = socket.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
The code above is part of a FileSenderClient class that sends files from a client to a server using java.io and java.net.Socket.
My question is: in the code above, why should we use
os.write(buffer, 0, len)
instead of
os.write(buffer)
To put it another way: what is the point of the "len" parameter of the OutputStream.write() method?
It seems both codes are working fine.
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
Because you only want to write data that you actually read. Consider the case where the input consists of N buffers plus one byte. Without the len parameter you would write (N+1)*1024 bytes instead of N*1024+1 bytes. Consider also the case of reading from a socket, or indeed the general case of reading: the actual contract of InputStream.read() is that it transfers at least one byte, not that it fills the buffer. Often it can't, for one reason or another.
It seems both codes are working fine.
No they're not.
The two versions do not actually behave the same way.
It is very likely you tested with a very small text file. But if you look carefully, you will find a lot of extra "spaces" at the end of the file you received, and the received file is larger than the file you sent.
The reason is that you created a byte array of size 1024, but the last read() did not have enough data to fill it. The unfilled tail of the array still holds zero bytes (or stale data left over from a previous read). If you write the whole array, those bytes are written to the file too, and NUL bytes show up as spaces " " in Windows Notepad...
If you view the received file in a more capable text editor such as Notepad++ or Sublime Text, you will see these NUL characters.
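To see the difference concretely, here is a minimal sketch of the same copy loop, written in Python rather than Java. The language choice and the names copy_stream and honor_len are illustrative assumptions, not part of the question. readinto() plays the role of read(buffer), and the slice buf[:n] plays the role of write(buffer, 0, len).

```python
import io


def copy_stream(src, dst, buf_size=1024, honor_len=True):
    """Copy src to dst through a fixed-size buffer; return the bytes read."""
    buf = bytearray(buf_size)
    total = 0
    while True:
        n = src.readinto(buf)  # may fill only part of the buffer
        if not n:              # 0 means end of stream
            break
        # Correct: write only the n bytes just read.
        # Incorrect (honor_len=False): write the whole buffer, stale tail included.
        dst.write(buf[:n] if honor_len else buf)
        total += n
    return total


payload = b"x" * 2049  # two full buffers plus one byte
good = io.BytesIO()
bad = io.BytesIO()
copy_stream(io.BytesIO(payload), good)
copy_stream(io.BytesIO(payload), bad, honor_len=False)
```

With this 2049-byte input, the first copy reproduces the input exactly, while the second writes three full 1024-byte buffers, which is the "N buffers plus one byte" case described above.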

How to use htslib/samtools to transform SAM/BAM reads?

I'm using the htslib library to read SAM/BAM files, and it works perfectly. I can also write the alignments back to a new SAM/BAM file.
For example, the following code prints the DNA sequence of an alignment:
bam1_t *b = ...;
int i;
for (i = 0; i < b->core.l_qseq; ++i) {
    printf("%c", seq_nt16_str[bam_seqi(bam_get_seq(b),i)]);
}
Question: How do I change the query sequence? Say, change the first letter to 'T'? bam_get_seq returns the sequence of a read, but there is no bam_set_seq function? Ideally, I'm looking for something like:
bam_set_seq(b, 'TTTT') # My new DNA sequence
If I can figure out how to do the update, I know how to write the information to a new SAM/BAM file.

How to see if a string exists in a huge (>19GB) sorted file?

I have files that can be 19GB or larger. They are huge, but sorted. Can I use the fact that they are sorted to my advantage when searching to see whether a certain string exists?
I looked at something called sgrep, but I'm not sure if it's what I'm looking for. As an example, I will have a 19GB text file with millions of rows like
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of rows to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping this file for the string and seeing if a result comes back? I don't even need the line; I just need a boolean, true/false, for whether it is in the file.
Actually, sgrep is what I was looking for. I got confused because structured grep has the same name as sorted grep, and I was installing the wrong package. sgrep is amazing.
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file (log2 of a couple hundred million lines is about 28).
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
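sub file_has_line(file, target) {
    a = 0
    z = file.length
    while (a < z) {
        m = (a+z)/2
        chunk = file.read(m, 200)
        // That is, read 200 bytes, starting at m.
        line = chunk.split(/\n/)[1]
        // Split the chunk on newlines and keep only the second piece (index 1):
        // the first piece is usually a partial line, the second is complete.
        if line == target
            return true
        if line < target
            a = m + 1    // target sorts later, so search the upper half
        else
            z = m        // target sorts earlier, so search the lower half
    }
    return false
}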
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
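The binary search described above can be sketched in Python as follows. This is a minimal sketch, not a tuned tool: the function names, the comma separator, and the assumption that the file is sorted by its first column in plain byte order are all mine. Instead of reading a fixed 200-byte chunk, it seeks to an offset and uses readline() to skip the partial line.

```python
import os
import tempfile


def line_at_or_after(f, pos):
    """Return the first line that starts at byte offset >= pos (b"" at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        # Step back one byte: if pos is already a line start, this readline()
        # consumes only the preceding newline; otherwise it discards the
        # remainder of the line that pos falls inside.
        f.seek(pos - 1)
        f.readline()
    return f.readline()


def first_column(line, sep=b","):
    """Return the first field of a line, without the line terminator."""
    return line.rstrip(b"\r\n").split(sep, 1)[0]


def file_has_key(path, key, sep=b","):
    """True if some line's first column equals key (file sorted by that column)."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has a key >= `key`.
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_at_or_after(f, mid)
            if not line or first_column(line, sep) >= key:
                hi = mid
            else:
                lo = mid + 1
        line = line_at_or_after(f, lo)
        return bool(line) and first_column(line, sep) == key


# Small demonstration on a temporary file (the rows are made up).
rows = [b"AAA,1\n", b"ABCDEFG,1234\n", b"MMM,3\n", b"XEROX,4\n"]
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.writelines(rows)
found = file_has_key(tmp.name, b"ABCDEFG")
missing = file_has_key(tmp.name, b"B")
os.remove(tmp.name)
```

Each probe costs one seek and at most two short reads, so the number of probes grows with the logarithm of the file size in bytes.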
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
    target_array = target_array.sort
    target = target_array.shift()    // smallest target first
    line = file.readln()
    hits = []
    // Stop as soon as either the file or the target list is exhausted.
    while (line and target) {
        if line < target
            line = file.readln()
        elsif line > target
            target = target_array.shift()
        else
            hits.push(line)
            line = file.readln()
    }
    return hits
}
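The parallel walk above can be sketched in Python like this. It is a minimal sketch under my own assumptions: the name keys_present is hypothetical, targets are compared against the first comma-separated column, and the file's sort order matches Python's default string ordering. It takes any iterable of lines, so an open file object can be passed directly.

```python
def keys_present(lines, targets, sep=","):
    """Return the targets that occur as a first column in the sorted lines."""
    wanted = iter(sorted(set(targets)))
    target = next(wanted, None)
    hits = []
    for line in lines:
        key = line.rstrip("\r\n").split(sep, 1)[0]
        # Drop targets that sort before the current key: they cannot appear later.
        while target is not None and target < key:
            target = next(wanted, None)
        if target is None:
            break  # every target is resolved, so stop reading early
        if target == key:
            hits.append(target)
            target = next(wanted, None)
    return hits


# Made-up sample data standing in for the large sorted file.
sample = ["AAA,1\n", "ABCDEFG,1234\n", "MMM,3\n", "XEROX,4\n"]
result = keys_present(sample, ["XEROX", "B", "ABCDEFG"])
```

The file is read at most once from start to end, and the loop exits early once the last target has been resolved.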

Splits in hadoop with variable-length/non-delimited binary file

I've just started working on a Hadoop-based ingester for OpenStreetMap data. There are a few formats, but I've been targeting a Protocol Buffers based format (note: it's not pure pb).
It looks to me like it would be more efficient to pre-split the file into a sequence file, as opposed to handling the variable-length encoding in a custom record reader / input format, but I would like a sanity check.
The format is described in more detail at PBF Format Description
But basically it's a collection of [BlobHeader,Blob] blocks.
There's a BlobHeader
message BlobHeader {
    required string type = 1;
    optional bytes indexdata = 2;
    required int32 datasize = 3;
}
And then the Blob (the size of which is defined by the datasize parameter in the header)
message Blob {
    optional bytes raw = 1; // No compression
    optional int32 raw_size = 2; // Only set when compressed, to the uncompressed size
    optional bytes zlib_data = 3;
    // optional bytes lzma_data = 4; // PROPOSED.
    // optional bytes OBSOLETE_bzip2_data = 5; // Deprecated.
}
There's more structure once you get down into the blob obviously - but I would handle that in the mapper - what I would like to do is initially have one blob per mapper (later might be some multiple of blobs per mapper).
Some of the other input formats/record readers use a "big enough" split size, and then seek backwards/forwards to a delimiter - but since there is no delimiter that would let me know the offset of blobs/headers - and no index that points to them either - I can't see any way to get my split points without first streaming through the file.
Now I wouldn't need to actually read the entire file off of disks - I could start with reading the header, using that info to seek past the blob, set that as the first split point, then repeat. But that's about the only alternative to pre-splitting into a sequence file I can come up with.
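The header-hopping scan described above can be sketched in Python as follows. This is a minimal sketch with several assumptions of mine: the function names are hypothetical, the 4-byte length prefix is read as big-endian, and datasize is pulled out of the BlobHeader with a tiny hand-written decoder for the Protocol Buffers wire format rather than generated classes. Each reported length includes the 4-byte prefix so that consecutive blocks tile the file, which is a choice of this sketch.

```python
import io
import struct


def read_varint(buf, i):
    """Decode a base-128 varint starting at buf[i]; return (value, next_index)."""
    value = shift = 0
    while True:
        byte = buf[i]
        i += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, i
        shift += 7


def blob_datasize(header):
    """Extract field 3 (datasize) from a serialized BlobHeader."""
    i = 0
    while i < len(header):
        tag, i = read_varint(header, i)
        field, wire = tag >> 3, tag & 7
        if wire == 0:    # varint field
            value, i = read_varint(header, i)
            if field == 3:
                return value
        elif wire == 2:  # length-delimited field (type, indexdata): skip it
            length, i = read_varint(header, i)
            i += length
        else:
            raise ValueError("unexpected wire type %d" % wire)
    raise ValueError("BlobHeader has no datasize field")


def block_offsets(stream):
    """Return (offset, total_length) for every [BlobHeader, Blob] block."""
    blocks = []
    pos = 0
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            break
        (header_len,) = struct.unpack(">I", prefix)
        datasize = blob_datasize(stream.read(header_len))
        total = 4 + header_len + datasize
        blocks.append((pos, total))
        pos += total
        stream.seek(pos)  # jump over the blob without reading it
    return blocks


# Two synthetic blocks: type="OSMData", datasize 300 and 5, zero-filled blobs.
h1 = b"\x0a\x07OSMData\x18\xac\x02"
h2 = b"\x0a\x07OSMData\x18\x05"
data = (struct.pack(">I", len(h1)) + h1 + bytes(300)
        + struct.pack(">I", len(h2)) + h2 + bytes(5))
blocks = block_offsets(io.BytesIO(data))
```

Only the length prefix and header of each block are read; the blob bytes are skipped with a seek, which is why scanning even a very large file this way touches a small fraction of it.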
Is there a better way to handle this - or if not, thoughts on the two suggestions?
Well, I went with parsing the binary file in the getSplits method, and since I'm skipping over 99% of the data it's plenty fast (~20 seconds for the planet-osm 22GB world file). Here's the getSplits method if anyone else stumbles along.
@Override
public List<InputSplit> getSplits(JobContext context){
    List<InputSplit> splits = new ArrayList<InputSplit>();
    FileSystem fs = null;
    Path file = OSMPBFInputFormat.getInputPaths(context)[0];
    FSDataInputStream in = null;
    try {
        fs = FileSystem.get(context.getConfiguration());
        in = fs.open(file);
        long pos = 0;
        while (in.available() > 0){
            // Each block starts with a 4-byte length of the BlobHeader that follows.
            int len = in.readInt();
            byte[] blobHeader = new byte[len];
            // readFully blocks until the whole header is read; read() may return fewer bytes.
            in.readFully(blobHeader);
            BlobHeader h = BlobHeader.parseFrom(blobHeader);
            FileSplit split = new FileSplit(file, pos,len + h.getDatasize(), new String[] {});
            splits.add(split);
            // Advance past the length prefix, the header, and the blob itself.
            pos += 4;
            pos += len;
            pos += h.getDatasize();
            in.skip(h.getDatasize());
        }
    } catch (IOException e) {
        sLogger.error(e.getLocalizedMessage());
    } finally {
        if (in != null) {try {in.close();}catch(Exception e){}};
        if (fs != null) {try {fs.close();}catch(Exception e){}};
    }
    return splits;
}
It's working fine so far, though I haven't ground-truthed the output yet. It's definitely faster than copying the pbf to hdfs, converting to a sequence file in a single mapper, then ingesting (copy time dominates). It's also ~20% faster than having an external program copy to a sequence file in hdfs, then running a mapper against hdfs (I scripted the latter).
So no complaints here.
Note that this generates a mapper for every block, which is ~23k mappers for the planet world file. I'm actually bundling up multiple blocks per split: just loop through x number of times before a split gets added to the collection.
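The bundling step mentioned above might look like the following Python sketch. The function name bundle_blocks and the (offset, length) tuple representation are assumptions of mine, not the author's code; it presumes the blocks are contiguous and listed in file order.

```python
def bundle_blocks(blocks, per_split):
    """Merge every `per_split` consecutive (offset, length) blocks into one split."""
    splits = []
    for i in range(0, len(blocks), per_split):
        group = blocks[i:i + per_split]
        start = group[0][0]
        end = group[-1][0] + group[-1][1]  # end offset of the last block in the group
        splits.append((start, end - start))
    return splits


# Three made-up contiguous blocks, bundled two per split.
splits = bundle_blocks([(0, 316), (316, 20), (336, 10)], 2)
```

The last split simply holds whatever blocks are left over, so it may be smaller than the others.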
For the BlobHeader I just compiled the protobuf .proto file from the OSM wiki link above. You can also pull it pre-generated from the OSM-binary class if you want - maven fragment is:
<dependency>
    <groupId>org.openstreetmap.osmosis</groupId>
    <artifactId>osmosis-osm-binary</artifactId>
    <version>0.43-RELEASE</version>
</dependency>

Process a bunch of strings efficiently

I need to read some data from a file in chunks of 128M and then do some processing on each line. The naive way is to use split to convert the string into a collection of lines and then process each line, but that may not be efficient, since it creates a collection that only stores a temporary result, which could be costly. Is there a way to do this with better performance?
The file is huge, so I kicked off several threads, and each thread picks up a 128M chunk. In the following script, rawString is a chunk of 128M.
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines = rawString.split("\n")
for (line <- lines) {
  ...
}
It'd be better to read text line by line:
import scala.io.Source
for (line <- Source.fromFile("file.txt").getLines()) {
  ...
}
I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out--this solution captures everything delimited on both sides by \n.
Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with just handling Unix line endings, you would want something like
def lines(bs: Array[Byte]): Array[String] = {
  val xs = Array.newBuilder[Int]
  var i = 0
  while (i<bs.length) {
    if (bs(i)=='\n') xs += i
    i += 1
  }
  val ix = xs.result
  val ss = new Array[String](0 max (ix.length-1))
  i = 1
  while (i < ix.length) {
    ss(i-1) = new String(bs, ix(i-1)+1, ix(i)-ix(i-1)-1)
    i += 1
  }
  ss
}
Of course this is rather long and messy code, but if you're really worried about performance this sort of thing (heavy use of low-level operations on primitives) is the way to go. (This also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data) since you don't need the full string representation around.)
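For comparison, the same idea (scan the raw bytes for newlines and decode only the slices between them) can be sketched in Python. This is a minimal sketch with assumptions of mine: the name interior_lines is hypothetical, the data is assumed to be UTF-8 or ASCII, and a generator is used so that no intermediate collection of lines is built. Like the code above, it only yields lines delimited by \n on both sides.

```python
def interior_lines(chunk):
    """Yield each line in chunk that has a newline on both sides of it."""
    start = chunk.find(b"\n")
    while start != -1:
        end = chunk.find(b"\n", start + 1)
        if end == -1:
            break  # what follows is the partial line at the end of the chunk
        yield chunk[start + 1:end].decode("utf-8")
        start = end


# Made-up chunk: a partial line at each end, three complete lines in between.
lines = list(interior_lines(b"tail\nfoo\nbar\n\nhea"))
```

The partial lines at the start and end of the chunk ("tail" and "hea" here) are left out and still need to be stitched together with the neighbouring chunks.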

Resources