How to advance past a deflate byte sequence contained in a byte stream?

I have a byte stream that is a concatenation of sections, where each section is composed of a header plus a deflated byte stream.
I need to split this byte stream into sections, but the header only contains information about the data in uncompressed form, with no hint about the compressed data length, so I cannot simply advance through the stream and parse the next section.
So far the only way I found to advance past the deflated byte sequence is to parse it according to the DEFLATE specification (RFC 1951). From what I understood by reading the specification, a deflate stream is composed of blocks, which can be compressed blocks or literal blocks.
Literal blocks contain a size header which can be used to easily advance past it.
Compressed blocks are composed of 'prefix codes', which are bit sequences of variable length that have special meanings to the deflate algorithm. Since I'm only interested in finding the deflated stream's length, I guess the only code I need to look for is '0000000', which according to the specification signals the end of a block.
So I came up with this CoffeeScript function to parse the deflate stream (I'm working on Node.js):
# The job of this function is to return the position
# after the deflate stream contained in 'buffer'. The
# deflated stream begins at 'pos'.
advanceDeflateStream = (buffer, pos) ->
  byteOffset = 0
  finalBlock = false
  while 1
    if byteOffset == 6
      firstTypeBit = 0b00000001 & buffer[pos]
      pos++
      secondTypeBit = 0b10000000 & buffer[pos]
      type = firstTypeBit | (secondTypeBit << 1)
    else
      if byteOffset == 7
        pos++
      type = buffer[pos] & (0b01100000 >>> byteOffset)
    if type == 0
      # Literal block
      # ignore the remaining bits and advance position
      byteOffset = 0
      pos++
      len = buffer.readUInt16LE(pos)
      pos += 2
      lenComplement = buffer.readUInt16LE(pos)
      if (len ^ ~lenComplement)
        throw new Error('Literal block length check fail')
      pos += (2 + len) # Advance past literal block
    else if type in [1, 2]
      # huffman block
      # we are only interested in finding the 'block end' marker
      # which is signaled by the bit string 0000000 (256)
      eob = false
      matchedZeros = 0
      while !eob
        byte = buffer[pos]
        for i in [byteOffset..7]
          # loop the remaining bits looking for 7 consecutive zeros
          if (byte ^ (0b10000000 >>> byteOffset)) >>> (7 - byteOffset)
            matchedZeros++
          else
            # reset counter
            matchedZeros = 0
          if matchedZeros == 7
            eob = true
            break
          byteOffset++
        if !eob
          byteOffset = 0
          pos++
    else
      throw new Error('Invalid deflate block')
    finalBlock = buffer[pos] & (0b10000000 >>> byteOffset)
    if finalBlock
      break
  return pos
To check if this works, I wrote a simple mocha test case:
zlib = require 'zlib'
test 'sample deflate stream', (done) ->
  data = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' # length 30
  zlib.deflate data, (err, deflated) ->
    # deflated.length == 11
    advanceDeflateStream(deflated, 0).should.eql(11)
    done()
The problem is that this test fails and I do not know how to debug it. I will accept any answer that points out what I missed in the parsing algorithm, or that contains a correct version of the above function in any language.

The only way to find the end of a deflate stream, or even of a single deflate block, is to decode all of the Huffman codes contained within. There is no bit pattern you can search for that cannot appear earlier in the stream.
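Since the question accepts an answer in any language, here is a minimal sketch in Python (not from the original answer) that does exactly this: it finds the end of the stream by actually decompressing it, letting zlib do the Huffman decoding. It assumes a zlib-wrapped stream, which is what Node's zlib.deflate emits; for a raw deflate stream, construct the decompressor with wbits=-15.

import zlib

def advance_deflate_stream(buffer, pos=0):
    # Decompress from pos; zlib itself tracks where the stream ends.
    d = zlib.decompressobj()  # use zlib.decompressobj(-15) for raw deflate
    d.decompress(buffer[pos:])
    if not d.eof:
        raise ValueError('truncated deflate stream')
    # unused_data holds whatever bytes followed the end of the stream
    return len(buffer) - len(d.unused_data)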

Related

Ruby openssl encryption with DES-CBC incorrect result

I am trying to replicate the encryption result from here in Ruby using OpenSSL: https://emvlab.org/descalc/?key=18074F7ADD44C903&iv=18074F7ADD44C903&input=4E5A56564F4C563230313641454E5300&mode=cbc&action=Encrypt&output=25C843BA5C043FFFB50F76E43A211F8D
Original string = "NZVVOLV2016AENS"
String converted to hexadecimal = "4e5a56564f4c563230313641454e53"
iv = "18074F7ADD44C903"
key = "18074F7ADD44C903"
Expected result = "9B699B4C59F1444E8D37806FA9D15F81"
Here is my ruby code:
require 'openssl'
require "base64"
include Base64
iv = "08074F7ADD44C903"
cipher = "08074F7ADD44C903"
def encode(string)
  puts "Attempting encryption - Input: #{string}"
  encrypt = OpenSSL::Cipher.new('DES-CBC')
  encrypt.encrypt
  encrypt.key = ["18074F7ADD44C903"].pack('H*') #.scan(/../).map{|b|b.hex}.pack('c*')
  encrypt.iv = ["18074F7ADD44C903"].pack('H*')
  result = encrypt.update(string) + encrypt.final
  puts "Raw output: #{result.inspect}"
  unpacked = result.unpack('H*')[0]
  puts "Encrypted key is: #{unpacked}"
  puts "Encrypted Should be: 9B699B4C59F1444E8D37806FA9D15F81"
  return unpacked
end

res = encode("NZVVOLV2016AENS")
Output:
Encrypted key is: 9b699b4c59f1444ea723ab91e89c023a
Encrypted Should be: 9B699B4C59F1444E8D37806FA9D15F81
Interestingly, the first half of the result is correct, and the last 16 digits are incorrect.
The web site uses Zero padding by default, while the Ruby code uses PKCS#7 padding by default.
Ruby does not seem to support Zero padding, so disable the default padding and implement Zero padding yourself.
Zero padding pads to the next full block size with 0x00 values. The block size for DES is 8 bytes. If the last block of the plaintext is already filled, no padding is done:
def zeroPad(string, blocksize)
  len = string.bytes.length
  padLen = (blocksize - len % blocksize) % blocksize
  string += "\0" * padLen
  return string
end
In the encode() function (which would more accurately be named encrypt()), the following lines must be added before encryption:
encrypt.padding = 0 # disable PKCS#7 padding
string = zeroPad(string, 8) # enable Zero padding
The modified Ruby code then gives the same ciphertext as the web site.
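As a cross-check (not part of the original Ruby answer), the same zero-padded DES-CBC encryption can be sketched in Python, assuming the third-party pycryptodome package is available:

from Crypto.Cipher import DES  # pip install pycryptodome

def zero_pad(data, block_size):
    # pad with 0x00 up to the next full block; do nothing if already aligned
    pad_len = (block_size - len(data) % block_size) % block_size
    return data + b'\x00' * pad_len

key = bytes.fromhex('18074F7ADD44C903')
iv = bytes.fromhex('18074F7ADD44C903')
cipher = DES.new(key, DES.MODE_CBC, iv=iv)
ciphertext = cipher.encrypt(zero_pad(b'NZVVOLV2016AENS', 8))
print(ciphertext.hex().upper())  # expected per the question: 9B699B4C59F1444E8D37806FA9D15F81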
Note that DES is insecure, and it is also insecure to use the key as IV (as well as a static IV). Furthermore, Zero padding is unreliable compared to PKCS#7 padding.

Ruby Zlib compression gives different outputs for the same input

I have this ruby method for compressing a string -
require 'stringio'  # requires added so the snippet runs standalone
require 'zlib'

def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  compressed_data = output.string
  compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte arrays for the compressed outputs and compare them.
The output is "Different" when I run the code below:
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
  puts 'Same'
else
  puts 'Different'
end
The output is "Same" when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The modification time is stored in the gzip header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing - it seems like a bug. So you have to set it to something other than 0 (this is expected to be fixed in future releases).
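For comparison (not from the original answer), the same idea expressed in Python, where gzip.compress (3.8+) accepts a fixed mtime directly and a zero value is honored:

import gzip

a = gzip.compress(b'hello world', mtime=0)
b = gzip.compress(b'hello world', mtime=0)
print(a == b)  # True: with a fixed mtime the output is byte-identical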
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.

Why are my byte arrays not different even though print() says they are?

I am new to Python, so please forgive me if I'm asking a dumb question. In my function I generate a random byte array for a given number of bytes, called "input_data", then I add some bit errors bytewise and store the result in another byte array called "output_data". The print function shows that it works exactly as expected: there are different bytes. But if I compare the byte arrays afterwards, they seem to be identical!
import binascii  # imports added for completeness; not shown in the original snippet
import random

def simulate_ber(packet_length, ber, verbose=False):
    # generate input data
    input_data = bytearray(random.getrandbits(8) for _ in xrange(packet_length))
    if(verbose):
        print(binascii.hexlify(input_data)+" <-- simulated input vector")
    output_data = input_data
    #add bit errors
    num_errors = 0
    for byte in range(len(input_data)):
        error_mask = 0
        for bit in range(0,7,1):
            if(random.uniform(0, 1)*100 < ber):
                error_mask |= 1 << bit
                num_errors += 1
        output_data[byte] = input_data[byte] ^ error_mask
    if(verbose):
        print(binascii.hexlify(output_data)+" <-- output vector")
        print("number of simulated bit errors: " + str(num_errors))
    if(input_data == output_data):
        print ("data identical")
number of packets: 1
bytes per packet: 16
simulated bit error rate: 5
start simulation...
0d3e896d61d50645e4e3fa648346091a <-- simulated input vector
0d3e896f61d51647e4e3fe648346001a <-- output vector
number of simulated bit errors: 6
data identical
Where is the bug? I am sure the problem is somewhere between my ears...
Thank you in advance for your help!
output_data = input_data
Python is a referential language: when you do the above, both variables refer to the same object in memory. For example:
>>> y=['Hello']
>>> x=y
>>> x.append('World!')
>>> x
['Hello', 'World!']
>>> y
['Hello', 'World!']
Cast output_data as a new bytearray and you should be good:
output_data = bytearray(input_data)
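A quick check of the fix (illustrative, not from the original answer):

input_data = bytearray(b'\x0d\x3e')
output_data = bytearray(input_data)  # independent copy
output_data[0] ^= 0xff
print(input_data == output_data)     # False: the copy mutates independently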

Improving an algorithm for substring search when reading ZIP files

So I have a ZIP reader library, and I read ZIP files by first figuring out where the EOCD record is (the standard way "from the tail"). I have to look for a pattern that is roughly this:
4byte_magic_number, fixed_n_bytes, 2_bytes_of_comment_size, comment
The byte size of the comment is provided in 2_bytes_of_comment_size. Just scanning for the magic number is insufficient, because I eager-read a substantial portion at the tail of the file - basically the maximum size the ZIP EOCD record can be - and then look for this pattern in there.
So far, I have come up with this:
def locate_eocd_signature(in_str)
  # We have to scan from the _very_ tail. We read the very minimum size
  # the EOCD record can have (up to and including the comment size), using
  # a sliding window. Once our end offset matches the comment size we found our
  # EOCD marker.
  eocd_signature_int = 0x06054b50
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  end_location = minimum_record_size * -1
  loop do
    # If the window is nil, we have rolled off the start of the string, nothing to do here.
    # We use negative values because if we used positive slice indices
    # we would have to detect the rollover ourselves
    break unless window = in_str[end_location, minimum_record_size]
    window_location = in_str.bytesize + end_location
    unpacked = window.unpack(unpack_pattern)
    # If we found the signature, pick up the comment size, and check if the size of the window
    # plus that comment size is where we are in the string. If we are - bingo.
    if unpacked[0] == 0x06054b50 && comment_size = unpacked[-1]
      assumed_eocd_location = in_str.bytesize - comment_size - minimum_record_size
      # if the comment size is where we should be at - we found our EOCD
      return assumed_eocd_location if assumed_eocd_location == window_location
    end
    end_location -= 1 # Shift the window back, by one byte, and try again.
  end
end
but it just screams ugly at me. Is there a better way to do something like this? Is there a pack specifier that says "all the bytes in binary until the end of the string" that I do not know of? Then I could tack that onto the end of the pack specifier, for example... A bit at a loss here.
In the end I opted for the following optimization. First, I made a method for finding all the indices of a given substring in a string - there is no stdlib builtin for this.
def all_indices_of_substr_in_str(of_substring, in_string)
  last_i = 0
  found_at_indices = []
  while last_i = in_string.index(of_substring, last_i)
    found_at_indices << last_i
    last_i += of_substring.bytesize
  end
  found_at_indices
end
Then, we use it to "latch" onto the offsets in our buffer where our signature was found.
def locate_eocd_signature(in_str)
  eocd_signature = 0x06054b50
  eocd_signature_str = [eocd_signature].pack('V')
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  str_size = in_str.bytesize
  indices = all_indices_of_substr_in_str(eocd_signature_str, in_str)
  indices.each do |check_at|
    maybe_record = in_str[check_at..str_size]
    # If the record is smaller than the minimum - we will never recover anything
    break if maybe_record.bytesize < minimum_record_size
    # Now we check if the record ends with the combination
    # of the comment size and an arbitrary byte string of that size.
    # If it does - we found our match
    *_unused, comment_size = maybe_record.unpack(unpack_pattern)
    if (maybe_record.bytesize - minimum_record_size) == comment_size
      return check_at # Found the EOCD marker location
    end
  end
  # If we haven't caught anything, return nil deliberately instead of returning the last statement
  nil
end

Lua string.format using UTF-8 characters

How can I get the 'right' formatting using string.format with strings containing UTF-8 characters?
Example:
local str = "\xE2\x88\x9E"
print(utf8.len(str), string.len(str))
print(str)
print(string.format("###%-5s###", str))
print(string.format("###%-5s###", 'x'))
Output:
1 3
∞
###∞ ###
###x ###
It looks like string.format uses the byte length of the infinity sign instead of its character length.
Is there a UTF-8 string.format equivalent?
function utf8.format(fmt, ...)
  local args, strings, pos = {...}, {}, 0
  for spec in fmt:gmatch'%%.-([%a%%])' do
    pos = pos + 1
    local s = args[pos]
    if spec == 's' and type(s) == 'string' and s ~= '' then
      table.insert(strings, s)
      args[pos] = '\1'..('\2'):rep(utf8.len(s)-1)
    end
  end
  return (
    fmt:format(table.unpack(args))
      :gsub('\1\2*', function() return table.remove(strings, 1) end)
  )
end
local str = "\xE2\x88\x9E"
print(string.format("###%-5s###", str)) --> ###∞ ###
print(string.format("###%-5s###", 'x')) --> ###x ###
print(utf8.format ("###%-5s###", str)) --> ###∞ ###
print(utf8.format ("###%-5s###", 'x')) --> ###x ###
Lua added the utf8 library in version 5.3, with only minimal functionality. It is relatively new and not a focus of the language. Your issue is in how the characters are interpreted and rendered, and display width is not a concern of the standard library or of typical Lua use.
For now, you can simply widen the width in your pattern to compensate for the multi-byte input: the infinity sign is three bytes but one character, so %-7s produces the same visible padding for it that %-5s produces for a one-byte character like 'x'.
