What is the meaning of "Wrap it in BufferedInputStream if you need to read lots of data one byte at a time"?

As a new xz-javadoc user, I am trying to use XZInputStream to read decompressed bytes, so I am reading the xz-javadoc (http://tukaani.org/xz/xz-javadoc/org/tukaani/xz/XZInputStream.html).
The description of the read() method on that page contains the following text:
Reading lots of data with read() from this input stream may be inefficient. Wrap it in BufferedInputStream if you need to read lots of data one byte at a time.
What does this mean? Should I wrap this input stream in a BufferedInputStream?

What does this mean? Should I wrap this input stream in a BufferedInputStream?
It means this:
// XZInputStream takes an InputStream, hence the FileInputStream wrapper
InputStream is = new BufferedInputStream(new XZInputStream(new FileInputStream(file)));
int by;
while ((by = is.read()) != -1)
{
    // do stuff with "by"
}
is.close();
So although you're reading byte by byte, the underlying input is buffered.
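If you control the read loop yourself, an alternative to wrapping in BufferedInputStream is to read in larger chunks with read(byte[], int, int), which avoids the per-byte call overhead entirely. A minimal sketch, not from the original answer; the path "archive.xz" is a placeholder:

import java.io.FileInputStream;
import java.io.InputStream;
import org.tukaani.xz.XZInputStream;

try (InputStream in = new XZInputStream(new FileInputStream("archive.xz"))) {
    byte[] buf = new byte[8192];
    int n;
    // each call pulls up to 8192 decompressed bytes at once
    while ((n = in.read(buf, 0, buf.length)) != -1) {
        // process buf[0..n) here
    }
}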

Why should we use OutputStream.write(byte[] b, int off, int len) instead of OutputStream.write(byte[] b)?

Sorry, everybody. It's a Java beginner question, but I think it will be helpful for a lot of Java learners.
FileInputStream fis = new FileInputStream(file);
OutputStream os = socket.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
The code above is part of a FileSenderClient class, which sends files from a client to a server using java.io and java.net.Socket.
My question is: in the above code, why should we use
os.write(buffer, 0, len)
instead of
os.write(buffer)
To ask it another way: what is the point of the "len" parameter of OutputStream.write()?
Both versions seem to work fine.
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
Because you only want to write data that you actually read. Consider the case where the input consists of N buffers plus one byte. Without the len parameter you would write (N+1)*1024 bytes instead of N*1024+1 bytes. Consider also the case of reading from a socket, or indeed the general case of reading: the actual contract of InputStream.read() is that it transfers at least one byte, not that it fills the buffer. Often it can't, for one reason or another.
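To make that contract concrete, here is a minimal sketch of a helper that fills a buffer despite short reads; it is not part of the original answer, and the name readFully is ours (java.io.DataInputStream ships a similar method):

import java.io.IOException;
import java.io.InputStream;

// Fills buf as far as the stream allows; returns the number of bytes actually read.
static int readFully(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
        // read() may return fewer bytes than requested, so track the offset
        int n = in.read(buf, off, buf.length - off);
        if (n == -1) {
            break; // stream ended before the buffer was full
        }
        off += n;
    }
    return off;
}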
Both versions seem to work fine.
No, they don't; they do not actually work the same way.
It is very likely that you tested with a very small text file. But if you look carefully, you will find a lot of extra characters at the end of the file you received, and the received file will be larger than the one you sent.
The reason is that you created a byte array of size 1024, but you don't always have that much data to read() into it. The unused tail of the array is therefore left filled with zero (NUL) bytes, and when you write the whole array those NUL bytes are written into the file too, showing up as spaces " " in Windows Notepad.
If you open the received file in an editor like Notepad++ or Sublime Text, you will see these NUL characters.

Scala regex splitting on InputStream

I'm parsing a resource file and splitting on empty lines, using the following code:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val fooString = source.mkString
val fooParsedSections = fooString.split("\\r\\n[\\f\\t ]*\\r\\n")
I believe this pulls the entire input stream into memory as a single string and then splits on the regex. This works fine for the relatively small file I'm parsing, but it's not ideal and I'm curious how I could improve it.
Two ideas are:
read the input stream line-by-line and have a buffer of segments that I build up, splitting on empty lines
read the stream character-by-character and parse segments based off of a small finite state machine
However, I'd love to not maintain a mutable buffer if possible.
Any suggestions? This is just for a personal fun project, and I want to learn how to do this in an efficient and functional manner.
You can use the Stream.span method to get the prefix before the empty line, then repeat. Here's a helper function for that:
def sections(lines: Stream[String]): Stream[String] = {
  if (lines.isEmpty) Stream.empty
  else {
    // cut off the longest `prefix` before an empty line
    val (prefix, suffix) = lines.span { _.trim.nonEmpty }
    // drop the empty lines (there may be several)
    val rest = suffix.dropWhile { _.trim.isEmpty }
    // group the prefix lines back together and recurse
    prefix.mkString("\n") #:: sections(rest)
  }
}
Note that Stream's #:: method is lazy and doesn't evaluate the right operand until it's needed. Here is how you can apply it to your use case:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val parsedSections = sections(source.getLines.toStream)
The Source.getLines method returns an Iterator[String], which we convert to a Stream and pass to the helper function. You can also call .toIterator at the end if you process the groups of lines as you go and don't need to store them. See the Stream docs for details.
EDIT
If you still want to use a regex, you can replace .trim.nonEmpty in the function above with a call to the String matches method.

Save array of BigInteger to file

I have a large array of big integers which I need to save to the disk for the next user session. I don't have a data base or the option to use one.
Will large numbers take more space in memory when I write them to a text file?
What is the best way to store this array for later use?
I'd just go for a normal text file; an integer will take as much space as its string representation, so 2759275918572192759185721 as a big integer takes the same space as 2759275918572192759185721 as a string (which isn't that much).
When reading them back from the file, you simply parse them again.
IMPORTANT: There is NO error handling in this code! You absolutely MUST add try-catch-finally to catch IOException and NumberFormatException!
File file = new File("C:\\Users\\Phiwa\\Desktop\\test.txt");
if (!file.exists())
    file.createNewFile();

FileWriter fw = new FileWriter(file);
BigInteger bint1 = new BigInteger("999999999999999999");
fw.write(bint1.toString());
fw.flush();
fw.close();
// BigInteger has been written to the file

// Read it back from the file
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
String str = br.readLine();
br.close();
// new BigInteger(str) parses arbitrary-precision values;
// Long.parseLong would overflow for numbers beyond the long range
BigInteger bint2 = new BigInteger(str);
System.out.println(bint2);
This works; did you already try it?
Things are a bit different if you have the integer in the format 0x999999999999999; in that case you would use
BigInteger bint2 = new BigInteger(str.replace("0x", ""), 16);
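Since the question is about an array of big integers, here is a minimal sketch extending the same idea, one decimal string per line; it is not from the original answer, and "bigints.txt" is a placeholder path:

import java.io.*;
import java.math.BigInteger;
import java.util.*;

List<BigInteger> values = Arrays.asList(
        new BigInteger("2759275918572192759185721"),
        new BigInteger("999999999999999999"));

// write: one decimal string per line
try (PrintWriter pw = new PrintWriter(new FileWriter("bigints.txt"))) {
    for (BigInteger b : values)
        pw.println(b);
}

// read: parse each line back into a BigInteger
List<BigInteger> restored = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader("bigints.txt"))) {
    String line;
    while ((line = br.readLine()) != null)
        restored.add(new BigInteger(line));
}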

Splits in Hadoop with a variable-length/non-delimited binary file

I've just started working on a Hadoop-based ingester for OpenStreetMap data. There are a few formats, but I've been targeting a Protocol Buffers-based format (note: it's not pure protobuf).
It looks to me like it would be more efficient to pre-split the file into a sequence file, as opposed to handling the variable-length encoding in a custom record reader / input format, but I'd like a sanity check.
The format is described in more detail in the PBF Format Description.
Basically, it's a collection of [BlobHeader, Blob] blocks.
There's a BlobHeader:
message BlobHeader {
    required string type = 1;
    optional bytes indexdata = 2;
    required int32 datasize = 3;
}
And then the Blob (whose size is given by the datasize field in the header):
message Blob {
    optional bytes raw = 1; // No compression
    optional int32 raw_size = 2; // Only set when compressed, to the uncompressed size
    optional bytes zlib_data = 3;
    // optional bytes lzma_data = 4; // PROPOSED.
    // optional bytes OBSOLETE_bzip2_data = 5; // Deprecated.
}
There's obviously more structure once you get down into the blob, but I would handle that in the mapper. What I'd like to do initially is have one blob per mapper (later, perhaps some multiple of blobs per mapper).
Some of the other input formats/record readers use a "big enough" split size and then seek backwards/forwards to a delimiter, but since there is no delimiter that would let me locate the start of a blob/header, and no index that points to them either, I can't see any way to find my split points without first streaming through the file.
Now, I wouldn't need to actually read the entire file off disk: I could read a header, use that information to seek past the blob, record that as the first split point, and repeat. But that's about the only alternative to pre-splitting into a sequence file I can come up with.
Is there a better way to handle this - or if not, thoughts on the two suggestions?
Well, I went with parsing the binary file in the getSplits method, and since I'm skipping over 99% of the data it's plenty fast (~20 seconds for the 22 GB planet-osm world file). Here's the getSplits method, if anyone else stumbles along:
@Override
public List<InputSplit> getSplits(JobContext context) {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    FileSystem fs = null;
    Path file = OSMPBFInputFormat.getInputPaths(context)[0];
    FSDataInputStream in = null;
    try {
        fs = FileSystem.get(context.getConfiguration());
        in = fs.open(file);
        long pos = 0;
        while (in.available() > 0) {
            int len = in.readInt();
            byte[] blobHeader = new byte[len];
            in.readFully(blobHeader); // read() may return short; readFully fills the array
            BlobHeader h = BlobHeader.parseFrom(blobHeader);
            FileSplit split = new FileSplit(file, pos, len + h.getDatasize(), new String[] {});
            splits.add(split);
            pos += 4;
            pos += len;
            pos += h.getDatasize();
            in.skip(h.getDatasize());
        }
    } catch (IOException e) {
        sLogger.error(e.getLocalizedMessage());
    } finally {
        if (in != null) { try { in.close(); } catch (Exception e) {} }
        if (fs != null) { try { fs.close(); } catch (Exception e) {} }
    }
    return splits;
}
It's working fine so far, though I haven't ground-truthed the output yet. It's definitely faster than copying the pbf to HDFS, converting to a sequence file in a single mapper, and then ingesting (copy time dominates). It's also ~20% faster than having an external program copy to a sequence file in HDFS and then running a mapper against HDFS (I scripted the latter).
So no complaints here.
Note that this generates a mapper for every block, which is ~23k mappers for the planet world file. I'm actually bundling up multiple blocks per split: just loop x times before a split gets added to the collection, as sketched below.
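A minimal sketch of that grouping step, assuming the (offset, length) extent of each blob has already been collected as in getSplits above; the helper name groupBlobs and the blobsPerSplit parameter are ours, not from the original code:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// extents holds one {offset, length} pair per blob, in file order
static List<FileSplit> groupBlobs(Path file, List<long[]> extents, int blobsPerSplit) {
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (int i = 0; i < extents.size(); i += blobsPerSplit) {
        int last = Math.min(i + blobsPerSplit, extents.size()) - 1;
        long start = extents.get(i)[0];
        // the split runs from the first blob's offset to the end of the last blob
        long length = extents.get(last)[0] + extents.get(last)[1] - start;
        splits.add(new FileSplit(file, start, length, new String[] {}));
    }
    return splits;
}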
For the BlobHeader I just compiled the protobuf .proto file from the OSM wiki link above. You can also pull it pre-generated from the OSM-binary class if you want; the Maven fragment is:
<dependency>
    <groupId>org.openstreetmap.osmosis</groupId>
    <artifactId>osmosis-osm-binary</artifactId>
    <version>0.43-RELEASE</version>
</dependency>

Is it possible to send several different datatypes at once with boost::asio without casting?

At the moment I'm filling a std::vector with all of my data and then sending it with async_write. All of the packets I send have a 2-byte header, which tells the receiver how much further to read (if any further at all). The code which generates this std::vector is:
std::vector<boost::asio::const_buffer> BasePacket::buffer()
{
    std::vector<boost::asio::const_buffer> buffers;
    buffers.push_back(boost::asio::buffer(headerBytes_)); // This is just a boost::array<uint8_t, 2>
    return buffers;
}

std::vector<boost::asio::const_buffer> UpdatePacket::buffer()
{
    printf("Making an update packet into a buffer.\n");
    std::vector<boost::asio::const_buffer> buffers = BasePacket::buffer();
    boost::array<uint16_t, 2> test = { 30, 40 };
    buffers.push_back(boost::asio::buffer(test));
    return buffers;
}
This is read by:
void readHeader(const boost::system::error_code& error, size_t bytesTransferred)
{
    if (error)
    {
        printf("Error reading header: %s\n", error.message().c_str());
        return;
    }
    // At this point 2 bytes have been read into boost::array<uint8_t, 2> header
    uint8_t primeByte = header.data()[0];
    uint8_t supByte = header.data()[1];
    switch (primeByte)
    {
    // Unrelated case removed
    case PACKETHEADER::UPDATE:
        // Read the first 4 bytes as two 16-bit numbers representing the size of the update
        boost::array<uint16_t, 2> buf;
        printf("Attempting to read the first two Uint16's.\n");
        boost::asio::read(mySocket, boost::asio::buffer(buf));
        printf("The update has size %d x %d\n", buf.data()[0], buf.data()[1]);
        break;
    }
    // Keep listening
    boost::asio::async_read(mySocket, boost::asio::buffer(header),
        boost::bind(readHeader, boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
}
The code compiles, but it doesn't print 30 x 40 as I would expect. Instead it prints 188 x 40. If I stretch the second array out, only the first byte is messed up. However, if I add a third array before sending (but still read the same amount), the values of the second array all get messed up. I'm guessing this is related to how I'm reading (in chunks into one buffer, rather than mirroring how I'm writing).
Ideally I'd like to avoid casting everything to bytes and reading/writing that way, since it's less clear and probably less portable, but I know that's an option. However, if there is a better way, I'm fine with rewriting what I have.
The first problem I see is a lifetime issue with the data you are sending. An asio::buffer simply wraps a data buffer that you continue to own.
The UpdatePacket::buffer() method creates a boost::array, wraps it, and pushes the wrapper onto the buffers std::vector. When the method returns, the boost::array goes out of scope, and the asio::buffer is left pointing at garbage.
There may be other issues, but this is a good start. Mind the lifetimes of your data buffers in Asio.
