How to read a large file into a string - ruby

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in the String and then deserialize, it would work too. The file has basically one big line so I can't even say I'll read it line by line, the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.

Marshal is a binary format, so you need to read and write the file in binary mode. The easiest way is to use IO.binread and IO.binwrite.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
@mat = Marshal.load(mat_dump)
Remember that the Marshal format depends on the Ruby version and is only compatible with other Ruby versions under specific circumstances. From the documentation:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.
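As a minimal sketch of the binary round trip described above, the following writes a marshaled object with File.binwrite and restores it with File.binread. The helper names save_object and load_object are hypothetical (they are not part of Ruby or of the original answer), and a nested Array stands in for the Matrix:

```ruby
require 'tmpdir'

# Hypothetical helper: serialize any marshalable object to path in binary mode.
def save_object(path, object)
  File.binwrite(path, Marshal.dump(object))
end

# Hypothetical helper: read the whole file in binary mode and deserialize it.
def load_object(path)
  Marshal.load(File.binread(path))
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'mat_save')
  rows = Array.new(100) { |i| Array.new(100) { |j| i * j } } # stand-in for the matrix
  save_object(path, rows)
  puts load_object(path) == rows
end
```

Binary mode matters most on Windows, where text mode translates line endings and treats a Ctrl-Z byte as end of file, which would explain a read that stops early.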

Related

Opening file with write throws "No implicit conversion of String into Integer"

It's been quite a while since I last wrote code in Ruby (Ruby 2 was new and wow it's 3 already), so I feel like an idiot.
I have a text file containing only the word:
hello
My ruby file contains the following code:
content = File.read("test_file_str.txt","w")
puts content
When I run it, I get:
`read': no implicit conversion of String into Integer (TypeError)
I've never had this happen before, but it has been quite a while since I wrote code, so clearly PEBKAC.
However, when I run this without ,"w" all is seemingly well. What am I doing wrong?
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x64-mingw32]
As per the docs, the second argument to File.read is the number of bytes to read from the given file, which must be an integer.
Opens the file, optionally seeks to the given offset, then returns length bytes (defaulting to the rest of the file). read ensures the file is closed before returning.
So, in your case the error happens because you're passing a string ("w") where an integer is expected. The docs for File.read don't state this explicitly, but the docs for File#read do:
Reads length bytes from the I/O stream.
length must be a non-negative integer or nil.
If you want to specify the mode, you can use the mode option for that:
File.read("filename", mode: "r") # "r" or any other
# or
File.new("filename", mode: "r").read(1)
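To make the argument order concrete, here is a small sketch of the call forms discussed above. The method name read_demo and the temporary file are assumptions made for illustration only; the positional arguments after the file name are length and then offset, and the mode goes in as a keyword:

```ruby
require 'tempfile'

# Hypothetical demo: writes "hello" to a temp file and reads it back several ways.
def read_demo
  Tempfile.create('read_demo') do |f|
    f.write('hello')
    f.flush
    {
      whole: File.read(f.path),              # rest of the file
      first_three: File.read(f.path, 3),     # length = 3 bytes
      from_offset: File.read(f.path, 2, 3),  # length = 2 bytes, starting at offset 3
      with_mode: File.read(f.path, mode: 'r') # mode passed as a keyword option
    }
  end
end

read_demo.each { |name, value| puts "#{name}: #{value.inspect}" }
```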
Open Files for Reading Don't Accept Write Mode
In general, it doesn't make sense to open a filehandle for reading in write mode. So, you need to refactor your method to something like:
content = File.read("test_file_str.txt")
or perhaps:
content = File.new("test_file_str.txt", "r+").read
depending on exactly what you're trying to do.
See Also: File Permissions in IO.new
The documentation for File in Ruby 3.0.3 points you to IO.new for the available mode strings. You might take a look there if you don't see exactly the options you're looking for.

How to read a large text file line-by-line and append this stream to a file line-by-line in Ruby?

Let's say I want to combine several massive files into one and then uniq! the one (THAT alone might take a hot second)
It's my understanding that File.readlines() loads ALL the lines into memory. Is there a way to read it line by line, sort of like how node.js pipe() system works?
One of the great things about Ruby is that you can do file IO in a block:
File.open("test.txt", "r") do |file|
  file.each_line do |row|
    puts row
  end
end # file closed here
so things get cleaned up automatically. Maybe it doesn't matter on a little script but it's always nice to know you can get it for free.
You aren't operating on the entire file contents at once, and you only need to hold one line in memory at a time if you use readline.
file = File.open("sample.txt", 'r')
until file.eof?
  line = file.readline
  puts line
end
file.close
Large files are best read with streaming methods such as each_line, as shown in the other answer, or foreach, which opens the file and reads it line by line. So unless the process needs the whole file in memory, you should use the streaming methods. With streaming, the required memory does not grow with the file size, unlike non-streaming methods such as readlines.
File.foreach("name.txt") { |line| puts line }
uniq! is defined on Array, so you'll have to read the files into an Array anyway. You cannot process the file line-by-line because you don't want to process a file, you want to process an Array, and an Array is a strict in-memory data structure.
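If uniq! on an Array is not a hard requirement, one alternative is to deduplicate while streaming. The sketch below is not from the answers above: merge_unique_lines is a hypothetical helper that reads each source with File.foreach and appends only lines it has not seen before, tracking them in a Set. It assumes the set of distinct lines fits in memory even when the combined files do not, and it keeps the first occurrence of each line.

```ruby
require 'set'

# Hypothetical helper: append the lines of every file in sources to dest,
# skipping any line that has already been written. Returns the unique count.
def merge_unique_lines(sources, dest)
  seen = Set.new
  File.open(dest, 'w') do |out|
    sources.each do |src|
      File.foreach(src) do |line|
        key = line.chomp                 # ignore a missing final newline
        out.puts(key) if seen.add?(key)  # add? returns nil for duplicates
      end
    end
  end
  seen.size
end
```

A call such as merge_unique_lines(['a.log', 'b.log'], 'merged.log') (file names made up for illustration) reads one line at a time from each input.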

Ruby File.read vs. File.gets

If I want to append the contents of a src file into the end of a dest file in Ruby, is it better to use:
while line = src.gets do
or
while buffer = src.read( 1024 )
I have seen both used and was wondering when should I use each method and why?
One is for reading "lines", one is for reading n bytes.
While byte buffering might be faster, a lot of that may disappear into the OS which likely does buffering anyway. IMO it has more to do with the context of the read--do you want lines, or are you just shuffling chunks of data around?
That said, a performance test in your specific environment may be helpful when deciding.
You have a number of options when reading a file that are tailored to different situations.
Read in the file line-by-line, but only store one line at a time:
while (line = file.gets) do
  # ...
end
Read in all lines of a file at once:
file.readlines.each do |line|
  # ...
end
Read the file in as a series of blocks:
while (data = file.read(block_size))
  # ...
end
Read in the whole file at once:
data = file.read
It really depends on what kind of data you're working with. Generally read is better suited to binary files, or those where you want the contents as one big string. gets and readlines are similar, but readlines is more convenient if you're confident the file will fit in memory. Don't use readlines on multi-gigabyte log files or you'll be in for a world of hurt as your system starts swapping. Use gets for situations like that.
gets will read until the end of the line based on a separator
read will read n bytes at a time
It all depends on what you are trying to read.
It may be more efficient to use read if your src file has unpredictable line lengths.
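For the append case in the question, a block-based copy could look like the sketch below. The method name append_file and the 64 KB default block size are assumptions chosen for illustration; both files are opened in binary mode so the bytes are copied unchanged regardless of line structure.

```ruby
# Hypothetical helper: append the contents of src_path to the end of dest_path,
# copying block_size bytes at a time so memory use stays bounded.
def append_file(src_path, dest_path, block_size = 64 * 1024)
  File.open(src_path, 'rb') do |src|
    File.open(dest_path, 'ab') do |dest|
      # read returns nil at end of file, which ends the loop
      while (buffer = src.read(block_size))
        dest.write(buffer)
      end
    end
  end
end
```

When no per-chunk processing is needed, IO.copy_stream(src, dest) does the same kind of copy without an explicit loop.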

How do I wrap ruby IO with a sliding window filter

I'm using an opaque API in some ruby code which takes a File/IO as a parameter. I want to be able to pass it an IO object that only gives access to a given range of data in the real IO object.
For example, I have an 8GB file, and I want to give the API an IO object that covers a 1GB range in the middle of my real file.
real_file = File.new('my-big-file')
offset = 1 * 2**30 # start 1 GB into it
length = 1 * 2**30 # end 1 GB after start
filter = IOFilter.new(real_file, offset, length)
# The api only sees the 1GB of data in the middle
opaque_api(filter)
The filter_io project looks like it would be the easiest to adapt to do this, but doesn't seem to support this use case directly.
I think you would have to write it yourself, as it seems like a rather specific thing: you would have to implement all of IO's methods (or the subset that you need) using a chunk of the opened file as the data source. An example of the "speciality" would be writing to such a stream: you would have to take care not to cross the boundary of the given segment, i.e. constantly keep track of your current position in the big file. It doesn't seem like a trivial job, and I don't see any shortcuts that could help you there.
Perhaps you can find some OS-based solution, e.g. making a loopback device out of the part of the large file (see man losetup and particularly -o and --sizelimit options, for example).
Variant 2:
If you are ok with keeping the contents of the window in memory all the time, you may wrap StringIO like this (just a sketch, not tested):
require 'stringio'

def sliding_io(filename, offset, length)
  File.open(filename, 'r+') do |f|
    # read the window into a buffer
    f.seek(offset)
    buf = f.read(length)
    # wrap the buffer in a StringIO and pass it to the given block
    StringIO.open(buf) do |buf_io|
      yield(buf_io)
    end
    # write the altered buffer back to the big file
    f.seek(offset)
    f.write(buf[0, length])
  end
end
And use it as you would use the block variant of IO.open.
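If the API only reads, a window that does not hold the data in memory can be sketched as a small wrapper. WindowedReader below is a hypothetical class, not the IOFilter from the question and not part of filter_io; it implements only read, eof?, rewind and pos, and assumes the opaque API needs nothing more than those and that the wrapped object supports seek.

```ruby
# Hypothetical read-only view of length bytes starting at offset in io.
class WindowedReader
  attr_reader :pos # position relative to the start of the window

  def initialize(io, offset, length)
    @io = io
    @offset = offset
    @length = length
    @pos = 0
  end

  def eof?
    @pos >= @length
  end

  def rewind
    @pos = 0
  end

  # Mirrors IO#read: with no length, returns the rest of the window ("" at
  # the end); with a positive length, returns nil once the window is exhausted.
  def read(length = nil)
    remaining = @length - @pos
    if remaining <= 0
      return (length.nil? || length.zero?) ? String.new : nil
    end
    count = length.nil? ? remaining : [length, remaining].min
    @io.seek(@offset + @pos) # reposition every time; the real IO may be shared
    data = @io.read(count)
    @pos += data.bytesize if data
    data
  end
end
```

With the names from the question, this would be used as opaque_api(WindowedReader.new(real_file, offset, length)), provided the file was opened in binary mode.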
I believe the IO object has the functionality you are looking for. I've used it before for MD5 hash summing similarly sized files.
require 'digest'

incr_digest = Digest::MD5.new
File.open(filename, 'rb') do |io|
  while (chunk = io.read(50000))
    incr_digest << chunk
  end
end
This was the block I used, where I was passing the chunk to the MD5 Digest object.
http://www.ruby-doc.org/core/classes/IO.html#M000918

StringScanner scanning IO instead of a string

I've got a parser written using ruby's standard StringScanner. It would be nice if I could use it on streaming files. Is there an equivalent to StringScanner that doesn't require me to load the whole string into memory?
You might have to rework your parser a bit, but you can feed lines from a file to a scanner like this:
require 'strscan'

File.open('filepath.txt', 'r') do |file|
  scanner = StringScanner.new(file.readline)
  until file.eof?
    scanner.scan(/whatever/)
    scanner << file.readline
  end
end
StringScanner was designed to load one big string and move back and forth over it with an internal pointer. If you turn it into a stream, the references to earlier input get lost, so you cannot use unscan, check_until, pre_match or post_match. Well, you can, but for that you need to buffer all the previous input.
If you are concerned about the buffer size, then just load the data in chunks and use a simple regexp or a gem called Parser.
The simplest way is to read a fixed-size chunk of data at a time.
# iterate over fixed length records
File.open("fixed-record-file") do |f|
  while (record = f.read(1024))
    # parse the record here using a regexp or parser
  end
end
[Updated]
Even with this loop you can use StringScanner; you just need to update the string with each new chunk of data:
string=(str)
Changes the string being scanned to str and resets the scanner.
Returns str
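One way to combine chunked reads with StringScanner is sketched below. The method each_word and its whitespace-separated "word" grammar are made up for illustration; a real parser would substitute its own patterns. Each chunk is appended with <<, a token that touches the end of the buffer is put back with unscan because it may continue in the next chunk, and string= drops the input that has already been consumed.

```ruby
require 'strscan'

# Hypothetical tokenizer: yields whitespace-separated words from io while
# keeping only the unconsumed tail of the input in memory.
def each_word(io, chunk_size = 1024)
  scanner = StringScanner.new(String.new)
  while (chunk = io.read(chunk_size))
    scanner << chunk
    loop do
      scanner.skip(/\s+/)
      word = scanner.scan(/\S+/)
      break if word.nil?
      if scanner.eos?
        scanner.unscan # the word may be cut off by the chunk boundary
        break
      end
      yield word
    end
    scanner.string = scanner.rest # discard consumed input, reset the pointer
  end
  # whatever is left after the last chunk is a complete final word
  scanner.skip(/\s+/)
  last = scanner.scan(/\S+/)
  yield last if last
end
```

The trade-off noted above still applies: pre_match and similar methods only see the buffered tail, not the whole stream.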
There is StringIO.
Sorry, I misread your question. Take a look at this; it seems to have streaming options.
