IO read not reading entire file - ruby

I have a very large text file (958 MB), and I have created the following script:
f = IO.read("Playback.xml").encode("utf-8", replace: nil)
separate_files_array = f.scan(/<Bla>.*?<\/Bla>/)
counter=0
separate_files_array.each do |x|
.
.
.
end
The code above only iterates over the first 31 occurrences of that regex, and I have no idea why.
No, there is no way these are all the occurrences. I can see that they are not, and the script runs for only a few seconds, which makes no sense for a file of that size.

As far as I can tell, the problem is that IO.read loaded only part of the file. In the end I used the approach from the following question to solve it:
Regexp search through a very large file
The reason is that File.read does not read the file in buffered chunks by default; it tries to load everything at once, which can cause the program to crash when the file is too big.
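A minimal sketch of that chunked approach, assuming the blocks of interest are delimited by <Bla> and </Bla> as in the question. The helper name each_bla_block and the chunk_size parameter are made up for illustration and are not taken from the linked answer. It reads one fixed-size chunk at a time, yields every complete match, and keeps only the unmatched tail so that a block split across two chunks is still found:

```ruby
require "stringio"

# Hypothetical helper: yields every <Bla>...</Bla> block found in an IO,
# reading it in fixed-size chunks instead of loading it all at once.
def each_bla_block(io, chunk_size: 1 << 20)
  # /m lets "." match newlines, in case a block spans several lines
  pattern = /<Bla>.*?<\/Bla>/m
  buffer = String.new # binary buffer for the not-yet-matched data
  while (chunk = io.read(chunk_size))
    buffer << chunk
    pos = 0
    while (match = pattern.match(buffer, pos))
      yield match[0]
      pos = match.end(0)
    end
    # Keep only the unmatched tail; it may hold the start of a block
    buffer = buffer[pos..-1]
  end
end

# Small in-memory demonstration with a deliberately tiny chunk size
blocks = []
each_bla_block(StringIO.new("x<Bla>1</Bla>y<Bla>2</Bla>z"), chunk_size: 4) do |block|
  blocks << block
end
```

With a real file this might be called as File.open("Playback.xml", "rb") { |f| each_bla_block(f) { |block| puts block.bytesize } }. The buffer keeps growing while no closing tag is seen, so this only helps if individual blocks are small compared with the file.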

Related

How to replace the first few bytes of a file in Ruby without opening the whole file?

I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
  fixed_contents = File.open(file_path).read[123..-1]
  File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage", and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
require 'nokogiri'
File.open(file_path) do |xml_file|
  if xml_file.read(123).include? 'authenticate_response'
    # Header found; the file position is now just past it, so there is nothing to do
  else
    # No header found. We rewind and let Nokogiri parse the whole file
    xml_file.rewind
  end
  xml = Nokogiri::XML.parse(xml_file)
  # Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.
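The read-then-rewind idea can be tried without Nokogiri or a real file. The following is only a sketch: skip_header is a hypothetical helper name, and the 26-byte header used in the demonstration is an invented stand-in for the 123 bytes from the question. After the call, the IO is positioned either just past the header or back at the start:

```ruby
require "stringio"

# Hypothetical helper: leaves io positioned just past a leading header of
# header_size bytes if that header contains marker; otherwise rewinds io
# so that a parser would see the whole content.
def skip_header(io, marker, header_size)
  head = io.read(header_size)
  io.rewind unless head && head.include?(marker)
  io
end

# Demonstration with an in-memory IO and an invented 26-byte header
io = skip_header(StringIO.new("junk authenticate_response<root/>"),
                 "authenticate_response", 26)
remaining = io.read # what a parser reading from io would receive
```

Because only the first header_size bytes are read before the IO is handed on, the rest of the file is consumed exactly once by whatever reads from it next.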

Problems reading large JSON file in Ruby

I have problems reading a large JSON file (2.9GB) in Ruby. I am using this code
json_file = File.read(filename)
results = JSON.parse(json_file)
and when I try to read the file I get the error:
Errno::EINVAL: Invalid argument - <filename>
I have tested the same code with smaller files and it works fine. To verify that the file is written correctly I have tried to read it with python and it works.
Is there a limitation on the size of the file for JSON.parse? If so, could you recommend an alternative?
I have looked into msgpack to reduce the size of the files, but unfortunately I am constrained by the fact that I cannot install gems.
This is a limitation of IO.read.
You may split your file into smaller parts (for example, 1 gigabyte each) and read them separately:
require 'json'
# Split the file into 1 GB parts named <filename>.parts.aa, <filename>.parts.ab, ...
system('split', '-b', '1024m', filename, "#{filename}.parts.")
# Sort the parts so they are concatenated in their original order
parts = Dir["#{filename}.parts.*"].sort
json = ''
parts.each do |partname|
  json << File.read(partname)
  File.delete(partname)
end
results = JSON.parse(json)
Be patient, this could take a while.
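If shelling out to split is not an option, the same idea can be sketched in pure Ruby. This assumes, as the answer suggests, that the error comes from requesting the whole file in a single read call; that has not been verified here. The helper name read_in_chunks and its 64 MB default are illustrative choices only:

```ruby
# Hypothetical helper: reads a file piece by piece so that no single
# read call has to return more than chunk_size bytes.
def read_in_chunks(path, chunk_size: 64 * 1024 * 1024)
  data = String.new
  File.open(path, "rb") do |file|
    while (chunk = file.read(chunk_size))
      data << chunk
    end
  end
  # The chunks were read as raw bytes; JSON text is expected to be UTF-8
  data.force_encoding(Encoding::UTF_8)
end
```

The result could then be passed on as JSON.parse(read_in_chunks(filename)). The whole document still ends up in memory, so this only works around the size of an individual read, not the memory needed to parse 2.9 GB of JSON.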

How to edit each line of a file in Ruby, without using a temp file

Is there a way to edit each line in a file, without involving 2 files? Say, the original file has,
test01
test02
test03
I want to edit it like
test01,a
test02,a
test03,a
I tried something like the code block below, but it replaces some of the characters.
Writing to a temporary file and then replacing the original file works. However, I need to edit the file quite often and therefore prefer to do it within the file itself. Any pointers are appreciated.
Thank you!
File.open('mytest.csv', 'r+') do |file|
  file.each_line do |line|
    file.seek(-line.length, IO::SEEK_CUR)
    file.puts 'a'
  end
end
f = open 'mytest.csv', 'r+'
r = f.readlines.map { |e| e.strip << ',a' }
f.rewind
f.puts r
f.close # you can leave out this line if it's the last one that runs
Here is a one-liner variation. Note that in this case two descriptors are left open until the program exits.
open(F='mytest.csv','r+').puts open(F,'r').readlines.map{|e|e.strip<<',a'}
Writing to a file doesn't insert; it always overwrites. This makes it awkward to modify text in-place, because you have to rewrite the entire rest of the contents of the file every time you add something new.
If the file is small enough to fit in memory, you can read it in, modify it, and write it back out. Otherwise, you really are better off with the temporary file.
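For the small-file case, the read, modify, write-back cycle might look like the sketch below. The helper name append_to_each_line is hypothetical. It uses File.write, which truncates the file before writing, so it would also behave correctly if the new lines were shorter than the old ones:

```ruby
# Hypothetical helper: appends suffix to every line of a small file by
# reading it fully into memory and writing the result back.
def append_to_each_line(path, suffix)
  lines = File.readlines(path, chomp: true) # line endings stripped
  File.write(path, lines.map { |line| "#{line}#{suffix}\n" }.join)
end
```

Calling append_to_each_line('mytest.csv', ',a') on the sample file from the question would turn test01 into test01,a and so on.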

Ruby file writes in windows returning wrong file sizes?

I'm still learning ruby, so I'm sure I'm doing something wrong here, but using ruby 1.9.3 on windows, I'm having a problem writing a file with random ascii garbage to be a specific size. I need to be able to write these files for a test on an application I'm QAing. On Mac and on *nix, the file size is written correctly every time. But on windows, it generates files of random size, generally between 1,024 bytes and 1,031 bytes.
I'm sure the problem is one of the characters that the rstr is generating is counting as two characters but... it seems like this shouldn't happen.
Here is my code:
num = 10
k = 1
for i in 1..num
  fname = "f#{i}.txt"
  f = File.new(fname, "w")
  for k in 1..size
    rstr = "#{(1..1024).map{rand(255).chr}.join}"
    f.write rstr
    print " #{rstr.size} " # this returns 1024 every time.
    rstr = ""
  end
  f.close
end
Also tried:
opts = {}
opts[:encoding] = "UTF-8"
fname = "f#{i}.txt"
f = File.new(fname, "w", opts)
By default, files opened on Windows are opened in text mode, meaning that line endings and other details are adjusted.
If you want the files to be written byte for byte exactly as you specify, you need to open them in binary mode:
File.open("foo", "wb") do |f|
  # ...
end
The b is ignored on POSIX operating systems, so your scripts are now cross-platform compatible.
Note: I used block syntax to manage the file so that it is properly closed and the file handle released once the block has executed. You no longer need to worry about closing the file ;-)
Hope this helps.
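As a rough illustration of the binary-mode advice, the sketch below writes a fixed number of random bytes and reports the resulting size on disk. The helper name write_random_bytes is hypothetical, and Random#bytes is used in place of the rand(255).chr loop from the question:

```ruby
# Hypothetical helper: writes byte_count random bytes to path in binary
# mode and returns the size of the file on disk.
def write_random_bytes(path, byte_count)
  data = Random.new.bytes(byte_count)
  # "wb" disables the newline translation that text mode performs on Windows
  File.open(path, "wb") { |f| f.write(data) }
  File.size(path)
end
```

In binary mode the returned size should equal byte_count on every platform, because a "\n" byte in the random data is no longer expanded to "\r\n".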
There is no ASCII character 255. The values go from 0 to 254.
If you try to printf 255.chr, you'll get a multibyte character.
As Windows does not use UTF-8 by default, you'll get incorrect values. Hence the problem you're facing!
Try adding # coding: utf-8 at the top of your file. It should get things working.

How can I achieve a Unix tail operation in Ruby without using files?

I used Ruby to read an image file and save that into a string.
partial_image100 = File.read("image.tga")
partial_image99 = File.read("image.tga")
partial_image98 = File.read("image.tga")
...
I read those images at one end of a distributed system. In another system I want to do a Tail operation. The system receives just the images.
I have around a 100 partial images. I want to do a Tail operation, like this:
tail -c +19 image100 >> image99
tail -c +19 image99 >> image98
tail -c +19 image97 >> image96
...
Basically it just removes the first 18 bytes of the partial image and appends what is left to the next image.
The problem is that this is slow. Calling 100 Unix commands from Ruby is slow. I want to refactor this so that it happens in the Ruby world. Just in memory. No files.
How can I do this in Ruby?
Thanks
edit:
The images are stored in a hash like this:
{"27"=>"\u0000\u0000\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u000E\u0001\xD0\a\xD0\a\u0018 \xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF\u0000\xFF\xFF...
EDIT:
You have all the relevant code here: https://gist.github.com/989563
There are two files. The code and a hash object encoded in json in a file. When you run the code there will be two image files created at /tmp
/tmp/image-tail-merger.tga – The output from the tail-merge algorithm
/tmp/image-/time/.tga – the output from the in-memory-tail algorithm
Currently the in-memory algorithm fails because the generated image is a Picasso.
If you manage to make the in-memory algorithm generate the same image that the tail-merge algorithm does, then you have succeeded.
EDIT:
I got it right finally!!!
Here is the code
https://gist.github.com/989563
I might look at File::Tail, which is similar to the Perl module.
require 'file/tail'
File.open(filename) do |log|
  log.extend(File::Tail)
  log.interval = 10
  log.backward(10)
  log.tail { |line| puts line }
end
You can also monkey-patch your own File to use File::Tail as well for cleaner usage.
You may want to take a look at String#unpack (and its inverse Array#pack).
In your case, something like this should do what you want:
# '@18' moves to byte offset 18, skipping the first 18 bytes just as tail -c +19 does
trunked = image.unpack('@18c*').pack('c*')
You might try something like this:
image100 = "some image string"
image99 = "some other image string"
# Append everything after the first 18 bytes of image100 to image99
image99 += image100.slice(18..-1)
EDIT: In your specific example you could do this to iterate through all the images:
# Assumes the hash keys are the strings "1" up to image_hash.size
image_hash.size.downto(2) do |i|
  # Here we use slice to select everything *except* the first 18 bytes,
  # which is what tail -c +19 outputs
  # Note: To select just the first 18 bytes we could do slice(0,18)
  # To select just the last 18 bytes we could do slice(-18,18)
  # We then append this result to the next image down the line
  image_hash[(i-1).to_s] += image_hash[i.to_s].slice(18..-1)
end
If you want to remove the "tailed" bits permanently you can use slice! to do an in-place replace.
Maybe a bit cleaner:
# Strip the headers
image_hash.each { |k,v| v.slice!(0,18) }
# Append them together, ordering the string keys numerically
image_hash.keys.sort_by(&:to_i).collect{ |i| image_hash[i] }.join
EDIT: Working code example https://gist.github.com/989563
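For comparison, here is a byte-oriented sketch of the whole merge. It is not the code from the gist. It assumes the hash keys are numeric strings as in the question's dump, that every part starts with an 18-byte header (the bytes that tail -c +19 skips), and that the lowest-numbered image keeps its header, as it does in the chain of tail commands. The name merge_partial_images is hypothetical. It uses byteslice rather than slice so that the offset is counted in bytes even if the strings are tagged as UTF-8:

```ruby
HEADER_SIZE = 18 # tail -c +19 starts at byte 19, i.e. it skips 18 bytes

# Hypothetical helper: concatenates the partial images in numeric key
# order, keeping the header of the first one and dropping all the others.
def merge_partial_images(images, header_size: HEADER_SIZE)
  # String#b returns a binary copy, so the caller's strings are untouched
  ordered = images.keys.sort_by(&:to_i).map { |key| images[key].b }
  first, *rest = ordered
  return String.new if first.nil?
  rest.each do |part|
    # byteslice returns nil for parts shorter than the header; to_s makes that ""
    first << part.byteslice(header_size..-1).to_s
  end
  first
end
```

Building the result in one pass avoids the intermediate copies created by appending image 100 to 99, then 99 to 98, and so on, while producing the same byte sequence under the assumptions above.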

Resources