How to continuously read a binary file in Crystal and get Bytes out of it?

Reading binary files in Crystal is supposed to be done with Bytes.new(size) and File#read, but... what if you don't know how many bytes you'll read in advance, and you want to keep reading chunks at a time?
Here's an example, reading 3 chunks from an imaginary file format that specifies the length of data chunks with an initial byte:
file = File.open "something.bin", "rb"
The following doesn't work, since Bytes can't be concatenated (as it's really a Slice(UInt8), and slices can't be concatenated):
data = Bytes.new(0)
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk
end
The best thing I've come up with is to use an Array(UInt8) instead of Bytes, and call to_a on all the bytes read:
data = [] of UInt8
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk.to_a
end
However, there's then seemingly no way to turn that back into Bytes (Array#to_slice was removed), which is needed for many applications and recommended by the authors to be the type of all binary data.
So... how do I keep reading from a file, concatenating to the end of previous data, and get Bytes out of it?

One solution would be to copy the data to a resized Bytes on every iteration. You could also collect the Bytes instances in a container (e.g. Array) and merge them at the end, but that would all mean additional copy operations.
The best solution would probably be to use a buffer that is large enough to fit all data that could possibly be read - or at least be very likely to (resize if necessary).
If the maximum size is just 3 * 255 bytes this is a no-brainer. You can size down at the end if the buffer is too large.
data = Bytes.new 3 * UInt8::MAX
bytes_read = 0
3.times do
  bytes_to_read = file.read_byte.not_nil!
  # fill exactly bytes_to_read bytes, starting at the current write position
  file.read_fully(data[bytes_read, bytes_to_read])
  bytes_read += bytes_to_read
end
# resize to actual size at the end:
data = data[0, bytes_read]
Note: As the data format tells you how many bytes to read, you should use read_fully instead of read, which would silently return early if fewer bytes are actually available.
EDIT: Since the number of chunks and thus the maximum size is not known in advance (per comment), you should use a dynamically resizing buffer. This can be easily implemented using IO::Memory, which will take care of resizing the buffer accordingly if necessary.
io = IO::Memory.new
loop do
  bytes_to_read = file.read_byte
  break if bytes_to_read.nil?
  IO.copy(file, io, bytes_to_read)
end
data = io.to_slice
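For comparison, here is a minimal Python sketch of the same length-prefixed loop with a growable buffer. It is not Crystal code; the function name read_chunks and the sample bytes are made up for illustration, and io.BytesIO only plays the role that IO::Memory plays above.

```python
import io

def read_chunks(stream):
    """Read [length byte][payload] records until EOF; return all payloads joined."""
    out = io.BytesIO()               # growable buffer, similar in spirit to IO::Memory
    while True:
        header = stream.read(1)
        if not header:               # clean EOF: no more chunks
            break
        size = header[0]
        payload = stream.read(size)
        if len(payload) != size:     # truncated input: fail loudly instead of ignoring it
            raise EOFError(f"expected {size} bytes, got {len(payload)}")
        out.write(payload)
    return out.getvalue()

# Hypothetical sample: chunks "abc", "de", and an empty chunk.
sample = io.BytesIO(bytes([3]) + b"abc" + bytes([2]) + b"de" + bytes([0]))
data = read_chunks(sample)
```

The explicit length check mirrors the read_fully advice: a short read is treated as an error rather than silently accepted.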

Related

Reimplementing QSerialPort canReadLine() and readLine() methods

I am trying to receive custom framed raw bytes via QSerialPort using value 0 as delimiter in asynchronous mode (using signals instead of polling).
The inconvenience is that QSerialPort doesn't seem to have a method that can read serial data until a specified byte value is encountered e.g. read_until (delimiter_value) in pyserial.
I was wondering if it's possible to reimplement QSerialPort's readLine() function in Python so that it reads until 0 byte value is encountered instead of '\n'. Similarly, it would be handy to reimplement canReadLine() as well.
I know that it is possible to use readAll() method and then parse the data for delimiter value. But this approach likely implies more code and decrease in efficiency. I would like to have the lowest overhead possible when processing the frames (serial baud rate and number of incoming bytes are large). However, if you know a fast approach to do it, I would like to take a look.
I ended up parsing the frame, it seems to work well enough.
Below is a method extract from my script which receives and parses serial data asynchronously. self.serial_buffer is a QByteArray array initialized inside a custom class init method. You can also use a globally declared bytearray but you will have to check for your delimiter value in another way.
@pyqtSlot()
def receive(self):
    self.serial_buffer += self.serial.readAll()  # Append all pending serial data
    while True:
        del_pos = self.serial_buffer.indexOf(b'\x00')  # b'\x00' is the delimiter byte
        if del_pos == -1:
            break  # del_pos is -1 if b'\x00' is not found
        frame = self.serial_buffer[:del_pos]  # Copy data up to the delimiter
        # Keep the remaining data, excluding the frame and its delimiter;
        # the next search then starts at the beginning of the shortened buffer
        self.serial_buffer = self.serial_buffer[del_pos + 1:]
        self.process_frame(frame)  # Process frame
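The same parsing logic can be tried without any Qt dependency. The sketch below is an illustration only: the FrameSplitter class and its feed method are hypothetical names, and a plain bytearray stands in for the QByteArray buffer. It trims the buffer once per call instead of once per frame, which avoids repeated copying when many frames arrive together.

```python
class FrameSplitter:
    """Accumulates incoming bytes and returns complete delimiter-terminated frames."""

    def __init__(self, delimiter=b"\x00"):
        self.delimiter = delimiter
        self.buffer = bytearray()

    def feed(self, data):
        """Append new bytes and return a list of complete frames (delimiter removed)."""
        self.buffer += data
        frames = []
        start = 0
        while True:
            pos = self.buffer.find(self.delimiter, start)
            if pos == -1:
                break                          # no further complete frame
            frames.append(bytes(self.buffer[start:pos]))
            start = pos + len(self.delimiter)  # skip past the delimiter
        del self.buffer[:start]                # keep only the incomplete tail
        return frames

splitter = FrameSplitter()
first = splitter.feed(b"ab\x00cd")       # one complete frame, "cd" stays buffered
second = splitter.feed(b"e\x00\x00f")    # completes "cde", then an empty frame
```

In a Qt slot, the call would presumably look like feeding bytes(self.serial.readAll()) into such an object and processing each returned frame.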

How to efficiently slice binary data in Ruby?

After reviewing SO post Ruby: Split binary data, I used the following code which works.
z = 'A' * 1_000_000
z.bytes.each_slice( STREAMING_CHUNK_SIZE ).each do | chunk |
  c = chunk.pack( 'C*' )
end
However, it is very slow:
Benchmark.realtime do
...
=> 0.0983949700021185
98ms to slice and pack a 1MB file. This is very slow.
Use Case:
Server receives binary data from an external API, and streams it using socket.write chunk.pack( 'C*' ).
The data is expected to be between 50KB and 5MB, with an average of 500KB.
So, how to efficiently slice binary data in Ruby?
Notes
Your code looks nice, uses the correct Ruby methods and the correct syntax, but it still:
creates a huge Array of Integers
slices this big Array into multiple Arrays
packs those Arrays back into Strings
Alternative
The following code extracts the parts directly from the string, without converting anything :
def get_binary_chunks(string, size)
  Array.new((string.bytesize + size - 1) / size) { |i| string.byteslice(i * size, size) }
end
(string.bytesize + size - 1) / size is a ceiling division: it avoids missing the last chunk if it is smaller than size.
Performance
With a 500kB pdf file and chunks of 12345 bytes, Fruity returns :
Running each test 16 times. Test will take about 28 seconds.
_eric_duminil is faster than _b_seven by 380x ± 100.0
get_binary_chunks is also 6x faster than StringIO#each(n) with this example.
Further optimization
If you're sure the string is binary (not UTF-8 with multibyte characters like 'ä'), you can use slice instead of byteslice:
def get_binary_chunks(string, size)
  Array.new(((string.length + size - 1) / size)) { |i| string.slice(i * size, size) }
end
which makes the code even faster (about 500x compared to your method).
If you use this code with a Unicode String, the chunks will have size characters but might have more than size bytes.
Using the chunks directly
Finally, if you're not interested in getting an Array of Strings, you could use the chunks directly :
def send_binary_chunks(socket, string, size)
  ((string.length + size - 1) / size).times do |i|
    socket.write string.slice(i * size, size)
  end
end
Use StringIO#each(n) with a string that has BINARY encoding:
require 'stringio'
string.force_encoding(Encoding::BINARY)
StringIO.new(string).each(size) { |chunk| socket.write(chunk) }
This only allocates each intermediate chunk just before pushing it to the socket.

Julia: How to modify a column of a matrix that has been saved as a binary file?

I am working with large matrices of data (Nrow x Ncol) that are too large to be stored in memory. Instead, it is standard in my field of work to save the data into a binary file. Due to the nature of the work, I only need to access 1 column of the matrix at a time. I also need to be able to modify a column and then save the updated column back into the binary file. So far I have managed to figure out how to save a matrix as a binary file and how to read 1 'column' of the matrix from the binary file into memory. However, after I edit the contents of a column I cannot figure out how to save that column back into the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been saved to disk.
Nrow = 500
Ncol = 325
data = eye(Float32,Nrow,Ncol)
stream_data = open("data","w")
write(stream_data,data[:])
close(stream_data)
Reading the entire file from disk and then reshaping back into the matrix is straightforward:
stream_data = open("data","r")
data_matrix = read(stream_data,Float32,Nrow*Ncol)
data_matrix = reshape(data_matrix,Nrow,Ncol)
close(stream_data)
As I said before, the data-matrices I am working with are too large to read into memory and as a result the code written above would normally not be possible to execute. Instead, I need to work with 1 column at a time. The following is a solution to read 1 column (e.g. the 7th column) of the matrix into memory:
icol = 7
stream_data = open("data","r")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
data_col = read(stream_data,Float32,Nrow)
close(stream_data)
Note that the coefficient '4' in the 'position_data' variable is because I am working with Float32. Also, I don't fully understand what the seek command is doing here, but it seems to be giving me the correct output based on the following tests:
data == data_matrix # true
data[:,7] == data_col # true
For the sake of this problem, let's say I have determined that the column I loaded (i.e. the 7th column) needs to be replaced with zeros:
data_col = zeros(Float32,size(data_col))
The problem now, is to figure out how to save this column back into the binary file without affecting any of the other data. Naturally I intend to use 'write' to perform this task. However, I am not entirely sure how to proceed. I know I need to start by opening up a stream to the data; however I am not sure what 'mode' I need to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
The original binary file (before my failed attempt to edit the binary file) occupied 650000 bytes on disk. This is consistent with the fact that the matrix is size 500x325 and Float32 numbers occupy 4 bytes (i.e. 4*500*325 = 650000). However, after my attempt to edit the binary file I have observed that the binary file now occupies only 14000 bytes of space. Some quick mental math shows that 14000 bytes corresponds to 7 columns of data (4*500*7 = 14000). A quick check confirms that the binary file has replaced all of the original data with a new matrix with size 500x7, and whose elements are all zeros.
stream_data = open("data","r")
data_new_matrix = read(stream_data,Float32,Nrow*7)
data_new_matrix = reshape(data_new_matrix,Nrow,7)
sum(abs(data_new_matrix)) # 0.0f0
What do I need to do/change in order to modify only the 7th 'column' in the binary file?
Instead of
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
in the OP, write
icol = 7
stream_data = open("data","r+")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
i.e. replace "w" with "r+" and everything works.
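The role of seek and of the open mode can be illustrated outside Julia as well. The following Python sketch is an analogue under stated assumptions, not the Julia API: it uses a small 5 x 4 stand-in matrix, a temporary file, and the hypothetical helpers write_column and read_column. The file holds float32 values in column-major order, so column icol (1-based) starts at byte offset 4 * Nrow * (icol - 1); opening with "r+b" allows writing at that offset without truncating the file, whereas a "w" mode empties the file first.

```python
import os
import tempfile
from array import array

NROW, NCOL = 5, 4   # small stand-ins for the 500 x 325 matrix
ITEM = 4            # bytes per float32 value

def write_column(path, icol, values):
    """Overwrite 1-based column icol in place (column-major float32 file)."""
    with open(path, "r+b") as f:            # read/write without truncation
        f.seek(ITEM * NROW * (icol - 1))    # jump to the first byte of the column
        f.write(array("f", values).tobytes())

def read_column(path, icol):
    """Read 1-based column icol as a list of floats."""
    with open(path, "rb") as f:
        f.seek(ITEM * NROW * (icol - 1))
        col = array("f")
        col.frombytes(f.read(ITEM * NROW))
        return list(col)

# Create a file of ones, then zero out column 3 only.
path = os.path.join(tempfile.mkdtemp(), "data")
with open(path, "wb") as f:
    f.write(array("f", [1.0] * (NROW * NCOL)).tobytes())

write_column(path, 3, [0.0] * NROW)
```

After the call, the file size is unchanged and only the bytes belonging to column 3 differ.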
The reference for open is http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open and it explains the various modes. Preferably, open shouldn't be called with the mode-string parameter, which is somewhat confusing and definitely slower.
You can use SharedArrays for the need you describe:
data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncols))
# do something with data
data[:,1]=a[:,1].+1
exit()
# restart julia
data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncols))
@show data[1,1]
# prints 1
Now, be mindful that you're supposed to handle synchronisation to read/write from/to this file (if you have async workers) and that you're not supposed to change the size of the array (unless you know what you're doing).

Algorithm for fragmenting data into packets

Let's say I want to fragment some data units into packets (the maximum size per packet is, say, 1024 bytes). Each data unit can be of variable size, say:
a = 20 bytes
b = 1000 bytes
c = 10 bytes
d = 800 bytes
Can anyone please suggest any efficient algorithm to create packets with such random data efficiently utilizing the bandwidth? I cannot split the individual data units into bytes...they go whole inside a packet.
EDIT: The ordering of data units is of no concern!
There are several different ways, depending on your requirements and how much time you want to spend on it. The general problem, as @amit mentioned in comments, is NP-hard. But you can get some improvement with some simple changes.
Before we go there, are you sure you really need to do this? Most networking layers have a packet-sized (or larger) buffer. When you write to the network, it puts your data in that buffer. If you don't fill the buffer completely, the code will delay briefly before sending. If you add more data during that delay, the new data is added to the buffer. The buffer is sent once it fills, or after the delay timeout expires.
So if you have a loop that writes one byte at a time to the network, it's not like you'll be creating a large number of one-byte packets.
On the receiving side, the lowest level networking layer receives an entire packet, but there's no guarantee that your call to receive the data will get the entire packet. That is, the sender might send an 800 byte packet, but on the receiving end the first call to read might only return 50 or 273 bytes.
This depends, of course, at what level you're reading the data. If you're talking about something like Java or .NET, where your interface to the network stack is through a socket, you almost certainly can't guarantee that a call to socket.Read() will return an entire packet.
Now, if you can guarantee that every call to read returns an entire packet, then the easiest way to pack things would be to serialize everything into one big buffer and then send it out in multiple 1,024-byte packets. You'll want to create a header at the front of the first packet that says how many total bytes will be sent, so the receiver knows what to expect. The result will be a bunch of 1,024-byte packets, potentially followed by a final packet that is somewhat smaller.
If you want to make sure that a data object is fully contained within a single packet, then you have to do something like:
add a to buffer
if remaining buffer < size of b
    send buffer
    clear buffer
add b to buffer
if remaining buffer < size of c
    send buffer
    clear buffer
add c to buffer
... etc ...
Here's some simple JavaScript pseudo code. The packets will stay ordered, and the bandwidth will be used as well as that fixed order allows.
var packets = [];
var PACKET_SIZE = 1024;
var currentPacket = [];
function write(data) {
    var len = currentPacket.length + data.length;
    if (len < PACKET_SIZE) {
        // still room left: append and keep filling
        currentPacket = currentPacket.concat(data);
    } else if (len === PACKET_SIZE) {
        // exact fit: close the packet
        packets.push(currentPacket.concat(data));
        currentPacket = [];
    } else { // len > PACKET_SIZE
        // does not fit: close the current packet and start a new one
        packets.push(currentPacket);
        currentPacket = data;
    }
}
function flush() {
    if (currentPacket.length > 0) {
        packets.push(currentPacket);
        currentPacket = [];
    }
}
// data20bytes etc. are arrays of the given length, assumed to be defined elsewhere
write(data20bytes);
write(data1000bytes);
write(data10bytes);
write(data800bytes);
flush();
EDIT: Since you have all of the data chunks and you want to package them optimally out of order (bin packing), you are left with either trying every permutation of the chunks for an exact answer or compromising with a best-guess/first-fit type algorithm.
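One common first-fit variant is first-fit decreasing: sort the units from largest to smallest and put each one into the first packet that still has room. The Python sketch below is a heuristic under stated assumptions, not an exact solver; the function name first_fit_decreasing and the dictionary layout are chosen for illustration, and the sizes are the ones from the question.

```python
def first_fit_decreasing(sizes, capacity=1024):
    """Pack named units into packets; returns a list of lists of unit names."""
    packets = []   # each entry: [remaining_space, [names]]
    # Largest units first, so small ones can fill the leftover gaps.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size > capacity:
            raise ValueError(f"{name} ({size} bytes) does not fit in one packet")
        for packet in packets:
            if packet[0] >= size:          # first packet with enough room
                packet[0] -= size
                packet[1].append(name)
                break
        else:                              # no existing packet fits: open a new one
            packets.append([capacity - size, [name]])
    return [names for _, names in packets]

units = {"a": 20, "b": 1000, "c": 10, "d": 800}
packets = first_fit_decreasing(units)
# packets is [["b", "a"], ["d", "c"]]: two packets instead of the three
# that in-order filling would produce for a, b, c, d.
```

The result is not guaranteed to be optimal in general, but it runs in roughly quadratic time at worst and needs no permutation search.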

Last byte in Huffman compression

I am wondering what the best way is to handle the last byte in Huffman compression. I have some nice code in C++ that compresses text files very well, but currently I must also write the number of coded chars to my coded file (it equals the input file size), because I have no idea how to handle the last byte better.
For example, the last char to compress is 'a', whose code is 011, and I am just starting a new byte to write, so the last byte will look like:
011 + 5 bits of trash, which I set to zeros, for example.
And when I am decoding this coded file, it may happen that the code 00000 (or one with fewer zeros) is the code for some char, so I will get a trash char at the end of my decoded file.
As I wrote in the first paragraph, I avoid this by saving the number of chars of the input file in the coded file, and while decoding, I read the coded file until I reach that number (not until EndOfFile, so as not to reach those 5 example zeros).
It's not really efficient: the size of the coded file is increased by a long number.
How can I handle this in a better way?
Your approach (writing the number of encoded bytes to the file) is a perfectly reasonable approach. If you want to try a different avenue, you could consider inventing a new "pseudo-EOF" character that marks the end of the input (I'll denote it as □). Whenever you want to compress a string s, you instead compress the string s□. This means that when you build up your encoding tree, you would include one copy of the □ character so that you have a unique encoding for □. Then, when you write out the string to the file, you would write out the bit patterns for the characters of the string as normal, then write out the bit pattern for □. If there are leftover bits, you can just leave them set arbitrarily.
The advantage to this approach is that as you decode the file, if at any point you find the □ character, you can immediately stop decoding bits because you know that you have hit the end of the file. This does not require you to store the number of bytes that were written out anywhere - the encoding implicitly marks its own endpoint.
The disadvantage to this setup is that it might increase the length of the bit patterns used by certain characters, since you will need to assign a bit pattern to □ in addition to all the other characters.
I teach an introductory programming course and we use Huffman encoding as one of our assignments. We have students use the above approach, since it's a bit easier than having to write out the number of bits or bytes before the file contents. For more details, you could take a look at this handout or these lecture slides from the course.
Hope this helps!
I know this is an old question, but still, there's an alternate, so it might help someone.
When you're writing your compressed file to output, you probably have some integer keeping track of where you are in the current byte (for bit shifting).
char c, p;
p = '\0';
int curr = 7;
while (infile.get(c))
{
    std::string trav = GetTraversal(c);
    for (int i = 0; i < trav.size(); i++)
    {
        if (trav[i] == '1')
            p += (1 << curr);
        if (--curr < 0)
        {
            outfile.put(p);
            p = '\0';
            curr = 7;
        }
    }
}
if (curr < 7)
    outfile.put(p);
At the end of this block, (curr+1)%8 equals the number of trash bits in the last data byte. You can then store it at the end as a single extra byte, and just keep it in mind when you're decompressing.
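A minimal sketch of that idea in Python, assuming the encoded bits are available as a string of '0' and '1' characters; the function names pack_bits and unpack_bits are made up for illustration and are not part of the C++ code above. The packer fills bytes most significant bit first, pads the last byte with zeros, and appends one extra byte holding the number of padding bits; the unpacker uses that byte to discard the padding.

```python
def pack_bits(bits):
    """Pack a '0'/'1' string into bytes (MSB first) plus one trailing pad-count byte."""
    pad = (-len(bits)) % 8                 # number of trash bits in the last data byte
    padded = bits + "0" * pad
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return body + bytes([pad])

def unpack_bits(blob):
    """Inverse of pack_bits: drop the pad-count byte and the padding bits."""
    pad = blob[-1]
    bits = "".join(f"{byte:08b}" for byte in blob[:-1])
    return bits[:len(bits) - pad]

encoded = pack_bits("011")   # code for 'a' from the question, followed by 5 pad bits
decoded = unpack_bits(encoded)
```

The cost is a single extra byte per file, regardless of the input size.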
