I have a collection of input sources -- strings, files, etc. -- that I want to concatenate and pass to an API that expects to read from a single IO object. The files can be quite large (~10 GB), so reading them into memory and concatenating them into a single string isn't an option. (I also considered using IO.pipe, but spinning up extra threads or processes seems like overkill.)
Is there an existing library class for this in Ruby, cf. Java's SequenceInputStream? If not, is there some other way to do it straightforwardly and idiomatically?
Unfortunately, the API is writing to a socket with IO.copy_stream.
For IO::copy_stream(src, ...) to work, the « IO-like object for src should have readpartial or read method ».
So, let's try to create a class that can read over a sequence of IO objects; here's the spec of IO#read:
read(maxlen = nil) → string or nil
read(maxlen = nil, out_string) → out_string or nil
Reads bytes from the stream (in binary mode):
If maxlen is nil, reads all bytes.
Otherwise reads maxlen bytes, if available.
Otherwise reads all bytes.
Returns a string (either a new string or the given out_string) containing the bytes read. The encoding of the string depends on both maxlen and out_string:
maxlen is nil: uses internal encoding of self (regardless of whether out_string was given).
maxlen not nil:
out_string given: encoding of out_string not modified.
out_string not given: ASCII-8BIT is used.
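To make that contract concrete, here's a quick illustration using StringIO, which follows the same read contract for these cases:

require 'stringio'

io = StringIO.new("abcdef")
io.read(4)   # => "abcd"  (maxlen given: up to 4 bytes)
io.read      # => "ef"    (maxlen nil: the rest of the stream)
io.read(4)   # => nil     (positive maxlen at end of stream)

With that contract in mind, here's the class: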
class ConcatIO
  def initialize(*io)
    @array = io
    @index = 0
  end

  def read(maxlen = nil, out_string = (maxlen.nil? ? "" : String.new))
    out_string.clear
    if maxlen.nil?
      # Read everything that's left, across all remaining IOs.
      if @index < @array.count
        @array[@index..-1].each { |io| out_string.concat(io.read) }
        @index = @array.count
      end
    elsif maxlen >= 0
      # Read up to maxlen bytes, moving to the next IO when one is exhausted.
      while out_string.bytesize < maxlen && @index < @array.count
        bytes = @array[@index].read(maxlen - out_string.bytesize)
        if bytes.nil?
          @index += 1
        else
          out_string.concat(bytes)
        end
      end
      # Per the spec, a positive maxlen at end of stream returns nil.
      return nil if maxlen > 0 && out_string.empty?
    end
    out_string
  end
end
Note: the code is not fully accurate with regard to the encoding part of the spec.
Now let's use this class with IO::copy_stream:
require 'stringio'
io1 = StringIO.new( "1")
io2 = StringIO.new( "22")
io3 = StringIO.new("333")
ioN = StringIO.new( "\n")
catio = ConcatIO.new(io1,io2,io3,ioN)
print catio.read(2), "\n"
IO.copy_stream(catio,STDOUT)
And it works!
12
2333
Aside
In fact there's a multi_io gem for concatenating multiple IO sources into a single IO object; the problem is that its methods don't follow the specs of the IO class; for example, it doesn't work with IO::copy_stream.
Additionally, even if you're able to use ARGF (i.e. you're only handling input files stored in ARGV), you still have to be cautious: there are slight differences between some of ARGF's methods and IO's, so it's not 100% safe to feed ARGF to an API that needs to read from an IO object.
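That said, when the inputs really are just files named on the command line, a sketch like this can work, because ARGF responds to both read and readpartial (the file names here are made up):

# invoked as: ruby concat.rb part1.txt part2.txt
IO.copy_stream(ARGF, STDOUT)   # ARGF presents ARGV's files as one stream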
Conclusion
Because there's no gem nor core class for it, the only sensible work-around is to determine which IO methods the API requires and write a class that implements them. That isn't too much work, as long as you don't have to implement the whole IO interface. Furthermore, you already have a working read method in my answer 😉.
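For example, if the API also calls readpartial, a rough sketch on top of the ConcatIO class above could look like this (simplified, not a complete implementation of the readpartial spec):

class ConcatIO
  def readpartial(maxlen, out_string = String.new)
    # Serve bytes from the current IO; move on when it reaches end of file.
    while @index < @array.count
      begin
        return @array[@index].readpartial(maxlen, out_string)
      rescue EOFError
        @index += 1
      end
    end
    raise EOFError, "end of concatenated streams"
  end
end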
We can try using StringIO to concatenate multiple input sources and pass them to an API that expects to read from a single IO object. Note, however, that this builds the combined content in memory, so it only works for inputs small enough to fit there (not the ~10 GB files from the question):

require 'stringio'

# Wrap the individual input sources
input1 = StringIO.new("First input source")
input2 = StringIO.new("Second input source")

# Concatenate the two input sources into a single IO object
inputs = StringIO.new(input1.string + input2.string)

# Pass the concatenated input source to the API
api.process(inputs)
Related
I am trying to receive custom framed raw bytes via QSerialPort using value 0 as delimiter in asynchronous mode (using signals instead of polling).
The inconvenience is that QSerialPort doesn't seem to have a method that can read serial data until a specified byte value is encountered, e.g. read_until(delimiter_value) in pyserial.
I was wondering if it's possible to reimplement QSerialPort's readLine() function in Python so that it reads until 0 byte value is encountered instead of '\n'. Similarly, it would be handy to reimplement canReadLine() as well.
I know that it is possible to use readAll() method and then parse the data for delimiter value. But this approach likely implies more code and decrease in efficiency. I would like to have the lowest overhead possible when processing the frames (serial baud rate and number of incoming bytes are large). However, if you know a fast approach to do it, I would like to take a look.
I ended up parsing the frames myself, and it seems to work well enough.
Below is a method extract from my script which receives and parses serial data asynchronously. self.serial_buffer is a QByteArray initialized inside a custom class's __init__ method. You can also use a globally declared bytearray, but then you will have to check for your delimiter value in another way.
@pyqtSlot()
def receive(self):
    self.serial_buffer += self.serial.readAll()  # Append all available serial data to the buffer
    start_pos, del_pos = 0, 0
    while True:
        del_pos = self.serial_buffer.indexOf(b'\x00', start_pos)  # b'\x00' is the delimiter byte
        if del_pos == -1:
            break  # Delimiter not found: no complete frame left in the buffer
        frame = self.serial_buffer[start_pos:del_pos]  # Copy data up to the delimiter
        start_pos = del_pos + 1  # Continue the search after the old delimiter
        self.process_frame(frame)  # Process frame
    self.serial_buffer = self.serial_buffer[start_pos:]  # Keep only the data after the last processed frame
Reading binary files in Crystal is supposed to be done with Bytes.new(size) and File#read, but... what if you don't know how many bytes you'll read in advance, and you want to keep reading chunks at a time?
Here's an example, reading 3 chunks from an imaginary file format that specifies the length of data chunks with an initial byte:
file = File.open "something.bin", "rb"
The following doesn't work, since Bytes can't be concatenated (as it's really a Slice(UInt8), and slices can't be concatenated):
data = Bytes.new(0)
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk
end
The best thing I've come up with is to use an Array(UInt8) instead of Bytes, and call to_a on all the bytes read:
data = [] of UInt8
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk.to_a
end
However, there's then seemingly no way to turn that back into Bytes (Array#to_slice was removed), which is needed for many applications and recommended by the authors to be the type of all binary data.
So... how do I keep reading from a file, concatenating to the end of previous data, and get Bytes out of it?
One solution would be to copy the data to a resized Bytes on every iteration. You could also collect the Bytes instances in a container (e.g. Array) and merge them at the end, but that would all mean additional copy operations.
The best solution would probably be to use a buffer that is large enough to fit all data that could possibly be read - or at least be very likely to (resize if necessary).
If the maximum size is just 3 * 255 bytes this is a no-brainer. You can size down at the end if the buffer is too large.
data = Bytes.new 3 * UInt8::MAX
bytes_read = 0

3.times do
  bytes_to_read = file.read_byte.not_nil!
  file.read_fully(data[bytes_read, bytes_to_read])
  bytes_read += bytes_to_read
end

# resize to actual size at the end:
data = data[0, bytes_read]
Note: As the data format tells you how many bytes to read, you should use read_fully instead of read, which would silently ignore the case where there are actually fewer bytes to read.
EDIT: Since the number of chunks and thus the maximum size is not known in advance (per comment), you should use a dynamically resizing buffer. This can be easily implemented using IO::Memory, which will take care of resizing the buffer accordingly if necessary.
io = IO::Memory.new

loop do
  bytes_to_read = file.read_byte
  break if bytes_to_read.nil?
  IO.copy(file, io, bytes_to_read)
end

data = io.to_slice
I'm parsing a resource file and splitting on empty lines, using the following code:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val fooString = source.mkString
val fooParsedSections = fooString.split("\\r\\n[\\f\\t ]*\\r\\n")
I believe this is pulling the input stream into memory as a full string, and then splitting on the regex. This works fine for the relatively small file I'm parsing, but it's not ideal and I'm curious how I could improve it.
Two ideas are:
read the input stream line-by-line and have a buffer of segments that I build up, splitting on empty lines
read the stream character-by-character and parse segments based off of a small finite state machine
However, I'd love to not maintain a mutable buffer if possible.
Any suggestions? This is just for a personal fun project, and I want to learn how to do this in an efficient and functional manner.
You can use the Stream.span method to get the prefix before the empty line, then repeat. Here's a helper function for that:
def sections(lines: Stream[String]): Stream[String] = {
  if (lines.isEmpty) Stream.empty
  else {
    // cutting off the longest `prefix` before an empty line
    val (prefix, suffix) = lines.span { _.trim.nonEmpty }
    // dropping any empty lines (there may be several)
    val rest = suffix.dropWhile { _.trim.isEmpty }
    // grouping back the prefix lines and calling recursion
    prefix.mkString("\n") #:: sections(rest)
  }
}
Note that Stream's #:: method is lazy and doesn't evaluate the right operand until it's needed. Here is how you can apply it to your use case:
val inputStream = getClass.getResourceAsStream("foo.txt")
val source = scala.io.Source.fromInputStream(inputStream)
val parsedSections = sections(source.getLines.toStream)
The Source.getLines method returns an Iterator[String], which we convert to a Stream before applying the helper function. You can also call .toIterator at the end if you process the groups of lines on the fly and don't need to store them. See the Stream docs for details.
EDIT
If you still want to use a regex, you can replace .trim.nonEmpty in the function above with a call to the String matches method.
Upon creating an instance of a given ActiveRecord model object, I need to generate a shortish (6-8 characters) unique string to use as an identifier in URLs, in the style of Instagram's photo URLs (like http://instagram.com/p/P541i4ErdL/, which I just scrambled to be a 404) or Youtube's video URLs (like http://www.youtube.com/watch?v=oHg5SJYRHA0).
What's the best way to go about doing this? Is it easiest to just create a random string repeatedly until it's unique? Is there a way to hash/shuffle the integer id in such a way that users can't hack the URL by changing one character (like I did with the 404'd Instagram link above) and end up at a new record?
Here's a good, collision-free method already implemented in plpgsql.
First step: consider the pseudo_encrypt function from the PG wiki.
This function takes a 32-bit integer as argument and returns a 32-bit integer that looks random to the human eye but uniquely corresponds to its argument (so that's encryption, not hashing). Inside the function, you may replace the formula (((1366.0 * r1 + 150889) % 714025) / 714025.0) with another function known only by you that produces a result in the [0..1] range (just tweaking the constants will probably be good enough; see below my attempt at doing just that). Refer to the Wikipedia article on the Feistel cipher for more theoretical explanations.
Second step: encode the output number in the alphabet of your choice. Here's a function that does it in base 62 with all alphanumeric characters.
CREATE OR REPLACE FUNCTION stringify_bigint(n bigint) RETURNS text
  LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
  alphabet text:='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
  base int:=length(alphabet);
  _n bigint:=abs(n);
  output text:='';
BEGIN
  LOOP
    output := output || substr(alphabet, 1+(_n%base)::int, 1);
    _n := _n / base;
    EXIT WHEN _n=0;
  END LOOP;
  RETURN output;
END $$
Now here's what we'd get for the first 10 URLs corresponding to a monotonic sequence:
select stringify_bigint(pseudo_encrypt(i)) from generate_series(1,10) as i;
stringify_bigint
------------------
tWJbwb
eDUHNb
0k3W4b
w9dtmc
wWoCi
2hVQz
PyOoR
cjzW8
bIGoqb
A5tDHb
The results look random and are guaranteed to be unique in the entire output space (2^32 or about 4 billion values if you use the entire input space with negative integers as well).
If 4 billion values is not wide enough, you may carefully combine two 32-bit results to get to 64 bits without losing uniqueness in the outputs. The tricky parts are dealing correctly with the sign bit and avoiding overflows.
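As a rough illustration of that combining step on the application side, here is a Ruby sketch (hi and lo stand for two pseudo_encrypt outputs, possibly negative):

def combine32(hi, lo)
  # Masking handles the sign bit; shifting keeps the two halves from overlapping.
  ((hi & 0xffffffff) << 32) | (lo & 0xffffffff)
end

combine32(1, 2)   # => 4294967298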
About modifying the function to generate your own unique results: let's change the constant from 1366.0 to 1367.0 in the function body, and retry the test above. See how the results are completely different:
NprBxb
sY38Ob
urrF6b
OjKVnc
vdS7j
uEfEB
3zuaT
0fjsab
j7OYrb
PYiwJb
Update: For those who can compile a C extension, a good replacement for pseudo_encrypt() is range_encrypt_element() from the permuteseq extension, which has the following advantages:
works with any output space up to 64 bits, and it doesn't have to be a power of 2.
uses a secret 64-bit key for unguessable sequences.
is much faster, if that matters.
You could do something like this:
random_attribute.rb
module RandomAttribute
  def generate_unique_random_base64(attribute, n)
    until random_is_unique?(attribute)
      self.send(:"#{attribute}=", random_base64(n))
    end
  end

  def generate_unique_random_hex(attribute, n)
    until random_is_unique?(attribute)
      self.send(:"#{attribute}=", SecureRandom.hex(n/2))
    end
  end

  private

  def random_is_unique?(attribute)
    val = self.send(:"#{attribute}")
    val && !self.class.send(:"find_by_#{attribute}", val)
  end

  def random_base64(n)
    val = base64_url
    val += base64_url while val.length < n
    val.slice(0..(n-1))
  end

  def base64_url
    SecureRandom.base64(60).downcase.gsub(/\W/, '')
  end
end
post.rb
class Post < ActiveRecord::Base
  include RandomAttribute

  before_validation :generate_key, on: :create

  private

  def generate_key
    generate_unique_random_hex(:key, 32)
  end
end
You can hash the id:
Digest::MD5.hexdigest('1')[0..9]
=> "c4ca4238a0"
Digest::MD5.hexdigest('2')[0..9]
=> "c81e728d9d"
But somebody can still guess what you're doing and iterate over ids the same way. It's probably better to hash based on the record's content.
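A minimal sketch of that idea (the salt and the attributes used here are assumptions, not part of the original suggestion):

require 'digest'

# Hypothetical helper: derive a short token from the record's content plus a
# salt, so the token can't be reproduced just by iterating over ids.
def short_token(record, salt = 'some-app-secret')
  Digest::MD5.hexdigest("#{salt}-#{record.id}-#{record.created_at}")[0..9]
end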
I'd like to use Builder to construct a set of XML files based on a table of ActiveRecord models. I have nearly a million rows, so I need to use find_each(batch_size: 5000) to iterate over the records and write an XML file for each batch of them, until the records are exhausted. Something like the following:
filecount = 1
count = 0
xml = ""

Person.find_each(batch_size: 5000) do |person|
  xml += person.to_xml # pretend .to_xml() exists
  count += 1
  if count == MAX_PER_FILE
    File.open("#{filecount}.xml", 'w') { |f| f.write(xml) }
    xml = ""
    filecount += 1
    count = 0
  end
end
This doesn't work well with Builder's interface, as it wants to work in blocks, like so:
xml = builder.person { |p| p.name("Jim") }
Once the block ends, Builder closes its current stanza; you can't keep a reference to p and use it outside of the block (I tried). Basically, Builder wants to "own" the iteration.
So to make this work with builder, I'd have to do something like:
filecount = 0
offset = 0

while offset < Person.count do
  count = 0
  builder = Builder::XmlMarkup.new(indent: 5)
  xml = builder.people do |people|
    Person.limit(MAX_PER_FILE).offset(offset).each do |person|
      people.person { |p| p.name(person.name) }
      count += 1
    end
  end
  File.open("#{filecount}.xml", 'w') { |f| f.write(xml) }
  filecount += 1
  offset += count
end
Is there a way to use Builder without the block syntax? Is there a way to programmatically tell it "close the current stanza" rather than relying on a block?
My suggestion: don't use builder.
XML is a simple format as long as you escape the xml entities correctly.
Batch your db retrieval, then just write out each batch as XML to a file handle. Don't buffer via a string as your example shows; just write to the file handle and let the OS deal with buffering. Files can be of any size, so why the limit?
Also, don't include the indentation spaces; with a million rows, they'd add up.
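A minimal sketch of that approach, assuming a Person model with a name column and manual escaping via encode(xml: :text):

File.open("people.xml", 'w') do |f|
  f.puts '<?xml version="1.0" encoding="UTF-8"?>'
  f.puts '<people>'
  Person.find_each(batch_size: 5000) do |person|
    # Write straight to the file handle; let the OS deal with buffering.
    f.puts "<person>#{person.name.encode(xml: :text)}</person>"
  end
  f.puts '</people>'
end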
Added
When writing xml files, I also include xml comments at the top of the file:
The name of the software and version that generated the xml file
Date / timestamp the file was written
Other useful info. Eg in this case you could say that the file is batch # x of the original data set.
I ended up generating the XML manually, as per Larry K's suggestion. Ruby's built-in XML encoding made this a piece of cake. I'm not sure why this feature is not more widely advertised... I wasted a lot of time Googling and trying various to_xs implementations before I stumbled upon the built-in "foo".encode(xml: :text).
My code now looks like:
def run
  count = 0
  Person.find_each(batch_size: 5000) do |person|
    open_new_file if @current_file.nil?

    # simplified - I actually have many more fields and elements
    @current_file.puts " <person>#{person.name.encode(xml: :text)}</person>"

    count += 1
    if count == MAX_PER_FILE
      close_current_file
      count = 0
    end
  end
  close_current_file
end

def open_new_file
  @file_count += 1
  @current_file = File.open("people#{@file_count}.xml", 'w')
  @current_file.puts "<?xml version='1.0' encoding='UTF-8'?>"
  @current_file.puts " <people>"
end

def close_current_file
  unless @current_file.nil?
    @current_file.puts " </people>"
    @current_file.close
    @current_file = nil
  end
end