How to efficiently slice binary data in Ruby?

After reviewing the SO post Ruby: Split binary data, I used the following code, which works:
z = 'A' * 1_000_000
z.bytes.each_slice( STREAMING_CHUNK_SIZE ).each do | chunk |
c = chunk.pack( 'C*' )
end
However, it is very slow:
Benchmark.realtime do
...
=> 0.0983949700021185
98ms to slice and pack a 1MB file. This is very slow.
Use Case:
Server receives binary data from an external API, and streams it using socket.write chunk.pack( 'C*' ).
The data is expected to be between 50KB and 5MB, with an average of 500KB.
So, how to efficiently slice binary data in Ruby?

Notes
Your code looks nice, uses the correct Ruby methods and the correct syntax, but it still:
creates a huge Array of Integers,
slices this big Array into multiple smaller Arrays,
packs those Arrays back into Strings.
Alternative
The following code extracts the parts directly from the string, without converting anything:
def get_binary_chunks(string, size)
Array.new(((string.length + size - 1) / size)) { |i| string.byteslice(i * size, size) }
end
((string.length + size - 1) / size) is just a way to round up, so the last chunk isn't missed when it is smaller than size.
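For instance, with a 1,000-byte string and 300-byte chunks, the expression rounds up as expected:
(1000 + 300 - 1) / 300 # => 4 (three full 300-byte chunks plus one 100-byte chunk)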
Performance
With a 500 kB PDF file and chunks of 12345 bytes, Fruity returns:
Running each test 16 times. Test will take about 28 seconds.
_eric_duminil is faster than _b_seven by 380x ± 100.0
get_binary_chunks is also about 6 times faster than StringIO#each(n) with this example.
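For reference, a comparison along these lines could be set up with the fruity gem; the input file and chunk size below are just placeholders:
require 'fruity'
string = File.binread('sample.pdf') # placeholder: any ~500 kB binary file
size = 12345
compare do
# the byteslice-based method defined above
eric_duminil { get_binary_chunks(string, size) }
# the original bytes / each_slice / pack approach
b_seven { string.bytes.each_slice(size).map { |chunk| chunk.pack('C*') } }
end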
Further optimization
If you're sure the string is binary (not UTF8 with multibyte characters like 'ä'), you can use slice instead of byteslice:
def get_binary_chunks(string, size)
Array.new(((string.length + size - 1) / size)) { |i| string.slice(i * size, size) }
end
which makes the code even faster (about 500x compared to your method).
If you use this code with a Unicode String, the chunks will have size characters but might have more than size bytes.
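As a quick illustration of the difference (a small example, assuming a UTF-8 encoded string):
s = 'ä' * 4 # 4 characters, 8 bytes in UTF-8
s.slice(0, 3) # => "äää" (3 characters, 6 bytes)
s.byteslice(0, 3) # => "ä\xC3" (3 bytes, cutting the second character in half)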
Using the chunks directly
Finally, if you're not interested in getting an Array of Strings, you could use the chunks directly:
def send_binary_chunks(socket, string, size)
((string.length + size - 1) / size).times do |i|
socket.write string.slice(i * size, size)
end
end

Use StringIO#each(n) with a string that has BINARY encoding:
require 'stringio'
string.force_encoding(Encoding::BINARY)
StringIO.new(string).each(size) { |chunk| socket.write(chunk) }
This only allocates each intermediate chunk string right before writing it to the socket.

Related

merge multiple large hashes

I'm trying to merge 3 hashes using the .merge method. It works perfectly with small hashes, but when I try with larger hashes I'm getting an error. Probably a memory overflow.
[1] 50734 killed ruby bin/app.rb
Example:
a = {sheet_01: { 1=>"One", 2=>"Two", 3=>"Three"} }
b = {sheet_02: { 1=>"aaa", 2=>"bbb", 3=>"ccc"} }
c = {sheet_03: { 1=>"zzz", 2=>"www", 3=>"yyy"} }
a[:sheet_01].merge(b[:sheet_02], c[:sheet_03]) do |_key, v1, v2|
v1 + v2
end
# {1=>"Oneaaazzz", 2=>"Twobbbwww", 3=>"Threecccyyy"}
but if I test these hashes with 600 values, my program crashes
Avoid Copying Large In-Memory Data Structures Under Memory Pressure
Without a lot more information about your system or your real data, no one can really debug this for you. However, it seems likely that Ruby or its parent process is running out of memory, but the limited error message you provided doesn't tell us that it's being reaped by the OOM killer. It's not a Ruby problem per se; you'll have to look at both your memory and swap usage at the system level for that.
However, it's safe to say that merging large hashes the way you are is potentially memory intensive. This isn't just about the size of the hashes, but also (potentially) about their contents. If you're under memory pressure, you may want to consider:
Using Hash#merge! rather than Hash#merge, as the latter makes a new copy of the merged hash rather than mutating the receiver in place (a minimal sketch follows below).
Using scope gates to ensure that the memory held by your hash variables becomes reclaimable by the garbage collector as soon as you're done with them.
Switching to a different storage mechanism, as passing large in-memory hashes around is inherently more memory intensive than calls to a database, an on-disk structure, or an external key/value store.
You may also want to revisit why your hashes are so big, and whether that's really the best representation of your data or your business logic. Large, persistent data structures in memory are sometimes an indication that you're not representing or manipulating your data structures as efficiently as possible, but your mileage may vary.
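As a minimal sketch of the first suggestion, applied to the example hashes from the question (note that passing multiple hashes to merge!/merge requires Ruby 2.6+):
# Mutates a[:sheet_01] in place instead of allocating a merged copy.
a[:sheet_01].merge!(b[:sheet_02], c[:sheet_03]) do |_key, v1, v2|
v1 + v2
end
# a[:sheet_01] is now {1=>"Oneaaazzz", 2=>"Twobbbwww", 3=>"Threecccyyy"}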
So, I wrote a little benchmark to try and reproduce it, but I failed.
As you can see, I am using random strings of roughly 1000 characters (768 random bytes are Base64-encoded into roughly 1024 characters), which is about 100x the size of your strings. I am using 10,000 lines per sheet, which is more than 10x what you have. And I am using 26 hashes, which is almost 10x what you have. All in all, my memory usage should be roughly 10,000x yours.
With this benchmark, the merge itself takes about 1.3s, and the entire memory usage for the Ruby process never even touches 1GB. I also tried it with 100000 lines per sheet, and the memory usage went to a little over 8GB, but still no crash.
#!/usr/bin/env ruby
require 'securerandom'
require 'benchmark/ips'
def generate_sheet
Array.new(10_000) {|i| [i, SecureRandom.base64(768)] }.to_h
end
def generate_hash
{ sheet: generate_sheet }
end
a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z =
Array.new(26) { generate_hash }
Benchmark.ips do |bm|
bm.config warmup: 20, time: 50
bm.report do
a[:sheet].merge(
b[:sheet], c[:sheet], d[:sheet], e[:sheet], f[:sheet], g[:sheet],
h[:sheet], i[:sheet], j[:sheet], k[:sheet], l[:sheet], m[:sheet],
n[:sheet], o[:sheet], p[:sheet], q[:sheet], r[:sheet], s[:sheet],
t[:sheet], u[:sheet], v[:sheet], w[:sheet], x[:sheet], y[:sheet],
z[:sheet]
) do |_key, v1, v2| v1 + v2 end
end
end

How to continuously read a binary file in Crystal and get Bytes out of it?

Reading binary files in Crystal is supposed to be done with Bytes.new(size) and File#read, but... what if you don't know how many bytes you'll read in advance, and you want to keep reading chunks at a time?
Here's an example, reading 3 chunks from an imaginary file format that specifies the length of data chunks with an initial byte:
file = File.open "something.bin", "rb"
The following doesn't work, since Bytes can't be concatenated (as it's really a Slice(UInt8), and slices can't be concatenated):
data = Bytes.new(0)
3.times do
bytes_to_read = file.read_byte.not_nil!
chunk = Bytes.new(bytes_to_read)
file.read(chunk)
data += chunk
end
The best thing I've come up with is to use an Array(UInt8) instead of Bytes, and call to_a on all the bytes read:
data = [] of UInt8
3.times do
bytes_to_read = file.read_byte.not_nil!
chunk = Bytes.new(bytes_to_read)
file.read(chunk)
data += chunk.to_a
end
However, there's then seemingly no way to turn that back into Bytes (Array#to_slice was removed), which is needed for many applications and recommended by the authors to be the type of all binary data.
So... how do I keep reading from a file, concatenating to the end of previous data, and get Bytes out of it?
One solution would be to copy the data to a resized Bytes on every iteration. You could also collect the Bytes instances in a container (e.g. Array) and merge them at the end, but that would all mean additional copy operations.
The best solution would probably be to use a buffer that is large enough to fit all data that could possibly be read - or at least be very likely to (resize if necessary).
If the maximum size is just 3 * 255 bytes this is a no-brainer. You can size down at the end if the buffer is too large.
data = Bytes.new 3 * UInt8::MAX
bytes_read = 0
3.times do
bytes_to_read = file.read_byte.not_nil!
file.read_fully(data[bytes_read, bytes_to_read])
bytes_read += bytes_to_read
end
# resize to actual size at the end:
data = data[0, bytes_read]
Note: as the data format tells how many bytes to read, you should use read_fully instead of read, which would silently return fewer bytes if fewer are available.
EDIT: Since the number of chunks and thus the maximum size is not known in advance (per comment), you should use a dynamically resizing buffer. This can be easily implemented using IO::Memory, which will take care of resizing the buffer accordingly if necessary.
io = IO::Memory.new
loop do
bytes_to_read = file.read_byte
break if bytes_to_read.nil?
IO.copy(file, io, bytes_to_read)
end
data = io.to_slice

File with random data but specific size

I am trying to generate a file in ruby that has a specific size. The content doesn't matter.
Here is what I got so far (and it works!):
File.open("done/#{NAME}.txt", 'w') do |f|
contents = "x" * (1024*1024)
SIZE.to_i.times { f.write(contents) }
end
The problem is: once I zip or rar this file, the created archive is only a few KB in size. I guess that's because the repetitive data in the file compressed extremely well.
How do I create data that is more random, as if it were a normal file (for example a movie file)? To be specific: how do I create a file with random data that keeps its size when archived?
You cannot guarantee an exact file size when compressing. However, as you suggest in the question, completely random data does not compress.
You can generate a random String using most random number generators. Even simple ones are capable of making hard-to-compress data, but you would have to write your own string-creation code. Luckily for you, Ruby comes with a built-in library that already has a convenient byte-generating method, and you can use it in a variation of your code:
require 'securerandom'
one_megabyte = 2 ** 20 # or 1024 * 1024, if you prefer
# Note use 'wb' mode to prevent problems with character encoding
File.open("done/#{NAME}.txt", 'wb') do |f|
SIZE.to_i.times { f.write( SecureRandom.random_bytes( one_megabyte ) ) }
end
This file is not going to compress much, if at all. Many compressors will detect that and just store the file as-is (making a .zip or .rar file slightly larger than the original).
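As a rough sanity check of that claim, compressed sizes can be compared with Zlib from the standard library (the exact numbers will vary):
require 'zlib'
require 'securerandom'
repeated = 'x' * (1024 * 1024)
random = SecureRandom.random_bytes(1024 * 1024)
Zlib::Deflate.deflate(repeated).bytesize # => only a few KB; repeated data compresses almost completely
Zlib::Deflate.deflate(random).bytesize # => roughly 1 MB (slightly larger, in fact); random data doesn't compress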
For a given string size N and compression method c (e.g., from the rubyzip, libarchive or seven_zip_ruby gems), you want to find a string str such that:
str.size == c(str).size == N
I'm doubtful that you can be assured of finding such a string, but here's a way that should come close:
Step 0: Select a number m such that m > N.
Step 1: Generate a random string s with m characters.
Step 2: Compute str = c(s). If str.size <= N, increase m and repeat Step 1; else go to Step 3.
Step 3: Return str[0,N].
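A minimal sketch of those steps, using Zlib::Deflate.deflate as a stand-in for the compression method c (any of the gems mentioned above could be substituted):
require 'zlib'
require 'securerandom'
def incompressible_string(n, c = ->(s) { Zlib::Deflate.deflate(s) })
m = n + 1024 # Step 0: start somewhat above N
loop do
s = SecureRandom.random_bytes(m) # Step 1: random string of m bytes
str = c.call(s) # Step 2: compress it
return str[0, n] if str.size > n # Step 3: truncate to N
m *= 2 # otherwise grow m and retry
end
end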

String to BigNum and back again (in Ruby) to allow circular shift

As a personal challenge I'm trying to implement the SIMON block cipher in Ruby. I'm running into some issues finding the best way to work with the data. The full code related to this question is located at: https://github.com/Rami114/Personal/blob/master/Simon/Simon.rb
SIMON requires both XOR, shift and circular shift operations, the last of which is forcing me to work with BigNums so I can perform the left circular shift with math rather than a more complex/slower double loop on byte arrays.
Is there a better way to convert a string to a BigNum and back again?
String -> BigNum (where N is 64 and pt is a string of plaintext)
pt = pt.chars.each_slice(N/8).map {|x| x.join.unpack('b*')[0].to_i(2)}.to_a
So I break the string into individual characters, slice into N-sized arrays (the word size in SIMON) and unpack each set into a BigNum. That appears to work fine and I can convert it back.
Now my SIMON code is currently broken, but that's more the math I think/hope and not the code. The conversion back is (where ct is an array of bignums representing the ciphertext):
ct.map { |x| [x.to_s(2).rjust(128,'0')].pack('b*') }.join
I seem to have to right-justify-pad the string, as Bignums are of undefined width, so I have no leading 0s. Unfortunately, pack requires a defined width to produce sensible output.
Is this a valid method of conversion? Is there a better way? I'm not sure on either count and hoping someone here can help out.
Edit: for #torimus, here is the circular shift implementation I'm using (from the link above):
def self.lcs (bytes, block_size, shift)
((bytes << shift) | (bytes >> (block_size - shift))) & ((1 << block_size) - 1)
end
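As a quick sanity check of that helper (assuming it lives on the Simon class from the linked repo, with a 64-bit block size):
x = 0x8000000000000001 # highest and lowest bits set
Simon.lcs(x, 64, 1).to_s(16) # => "3" (both set bits rotate left by one position)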
If you would be equally happy with unpack('B*') (msb-first binary numbers), which you could well be if all your processing is circular, then you could also use .unpack('Q>') instead of .unpack('B*')[0].to_i(2) when generating pt:
pt = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890!#"
# Your version (with 'B' == msb first) for comparison:
pt_nums = pt.chars.each_slice(N/8).map {|x| x.join.unpack('B*')[0].to_i(2)}.to_a
=> [8176115190769218921, 8030025283835160424, 7668342063789995618, 7957105551900562521,
6145530372635706438, 5136437062280042563, 6215616529169527604, 3834312847369707840]
# unpack to 64-bit unsigned integers directly
pt_nums = pt.unpack('Q>8')
=> [8176115190769218921, 8030025283835160424, 7668342063789995618, 7957105551900562521,
6145530372635706438, 5136437062280042563, 6215616529169527604, 3834312847369707840]
There are no native 128-bit pack/unpack directives for the return direction, but you can use plain integer arithmetic to solve this too:
split128 = 1 << 64
ct = pt # Just to show round-trip
ct.map { |x| [ x / split128, x % split128 ].pack('Q>2') }.join
=> "\x00\x00\x00\x00\x00\x00\x00\x00qwertyui . . . " # truncated
This avoids a lot of the temporary stages in your code, but at the expense of using a different byte coding; I don't know enough about SIMON to say whether this is adaptable to your needs.

Generating an Instagram- or Youtube-like unguessable string ID in ruby/ActiveRecord

Upon creating an instance of a given ActiveRecord model object, I need to generate a shortish (6-8 characters) unique string to use as an identifier in URLs, in the style of Instagram's photo URLs (like http://instagram.com/p/P541i4ErdL/, which I just scrambled to be a 404) or Youtube's video URLs (like http://www.youtube.com/watch?v=oHg5SJYRHA0).
What's the best way to go about doing this? Is it easiest to just create a random string repeatedly until it's unique? Is there a way to hash/shuffle the integer id in such a way that users can't hack the URL by changing one character (like I did with the 404'd Instagram link above) and end up at a new record?
Here's a good collision-free method, already implemented in plpgsql.
First step: consider the pseudo_encrypt function from the PG wiki.
This function takes a 32-bit integer as argument and returns a 32-bit integer that looks random to the human eye but uniquely corresponds to its argument (so that's encryption, not hashing). Inside the function, you may replace the formula (((1366.0 * r1 + 150889) % 714025) / 714025.0) with another function known only to you that produces a result in the [0..1] range (just tweaking the constants will probably be good enough; see below my attempt at doing just that). Refer to the Wikipedia article on the Feistel cipher for more theoretical explanations.
Second step: encode the output number in the alphabet of your choice. Here's a function that does it in base 62 with all alphanumeric characters.
CREATE OR REPLACE FUNCTION stringify_bigint(n bigint) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
alphabet text:='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
base int:=length(alphabet);
_n bigint:=abs(n);
output text:='';
BEGIN
LOOP
output := output || substr(alphabet, 1+(_n%base)::int, 1);
_n := _n / base;
EXIT WHEN _n=0;
END LOOP;
RETURN output;
END $$;
Now here's what we'd get for the first 10 URLs corresponding to a monotonic sequence:
select stringify_bigint(pseudo_encrypt(i)) from generate_series(1,10) as i;
stringify_bigint
------------------
tWJbwb
eDUHNb
0k3W4b
w9dtmc
wWoCi
2hVQz
PyOoR
cjzW8
bIGoqb
A5tDHb
The results look random and are guaranteed to be unique in the entire output space (2^32 or about 4 billion values if you use the entire input space with negative integers as well).
If 4 billion values is not wide enough, you may carefully combine two 32-bit results to get to 64 bits while not losing uniqueness in the outputs. The tricky parts are dealing correctly with the sign bit and avoiding overflows.
About modifying the function to generate your own unique results: let's change the constant from 1366.0 to 1367.0 in the function body, and retry the test above. See how the results are completely different:
NprBxb
sY38Ob
urrF6b
OjKVnc
vdS7j
uEfEB
3zuaT
0fjsab
j7OYrb
PYiwJb
Update: for those who can compile a C extension, a good replacement for pseudo_encrypt() is range_encrypt_element() from the permuteseq extension, which has the following advantages:
works with any output space up to 64 bits, and it doesn't have to be a power of 2.
uses a secret 64-bit key for unguessable sequences.
is much faster, if that matters.
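On the Ruby/ActiveRecord side, a minimal sketch of wiring this up might look like the following; it assumes both pseudo_encrypt and stringify_bigint are installed in the database and that the model (here hypothetically Photo) has a short_id column:
class Photo < ActiveRecord::Base
after_create :assign_short_id
private
def assign_short_id
# Let PostgreSQL compute the obfuscated, base-62 encoded identifier from the integer id.
encoded = self.class.connection.select_value("SELECT stringify_bigint(pseudo_encrypt(#{id.to_i}))")
update_column(:short_id, encoded)
end
end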
You could do something like this:
random_attribute.rb
module RandomAttribute
def generate_unique_random_base64(attribute, n)
until random_is_unique?(attribute)
self.send(:"#{attribute}=", random_base64(n))
end
end
def generate_unique_random_hex(attribute, n)
until random_is_unique?(attribute)
self.send(:"#{attribute}=", SecureRandom.hex(n/2))
end
end
private
def random_is_unique?(attribute)
val = self.send(:"#{attribute}")
val && !self.class.send(:"find_by_#{attribute}", val)
end
def random_base64(n)
val = base64_url
val += base64_url while val.length < n
val.slice(0..(n-1))
end
def base64_url
SecureRandom.base64(60).downcase.gsub(/\W/, '')
end
end
post.rb
class Post < ActiveRecord::Base
include RandomAttribute
before_validation :generate_key, on: :create
private
def generate_key
generate_unique_random_hex(:key, 32)
end
end
You can hash the id:
Digest::MD5.hexdigest('1')[0..9]
=> "c4ca4238a0"
Digest::MD5.hexdigest('2')[0..9]
=> "c81e728d9d"
But somebody can still guess what you're doing and iterate that way. It's probably better to hash the content.
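For example, a slug derived from the record's content rather than its id might look like this (post, title and created_at are just hypothetical names):
require 'digest/md5'
# Harder to enumerate than hashing a sequential id, though still deterministic.
Digest::MD5.hexdigest("#{post.title}-#{post.created_at.to_i}")[0..9]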
