'Failed to allocate memory' error with large array - ruby

I am trying to import a large text file (approximately 2 million rows of numbers, about 260 MB) into an array, make edits to the array, and then write the results to a new text file:
file_data = File.readlines("massive_file.txt")
file_data.map!(&:strip)
file_data.each do |s|
  s.gsub!(/,.*\z/, "")
end
File.open("smaller_file.txt", 'w') do |f|
  f.write(file_data.map(&:strip).uniq.join("\n"))
end
However, I have received the error failed to allocate memory (NoMemoryError). How can I allocate more memory to complete the task? Or, ideally, is there another method I can use where I can avoid having to re-allocate memory?

You can read the file line by line:
require 'set'
require 'digest/md5'

file_data = File.new('massive_file.txt', 'r')
file_output = File.new('smaller_file.txt', 'w')
unique_lines_set = Set.new

while (line = file_data.gets)
  line.strip!
  line.gsub!(/,.*\z/, "")
  # Check if the line is unique
  line_hash = Digest::MD5.hexdigest(line)
  unless unique_lines_set.include?(line_hash)
    # It is unique, so add its hash to the set
    unique_lines_set.add(line_hash)
    # Write the line to the output file
    file_output.puts(line)
  end
end

file_data.close
file_output.close

You can try reading and writing one line at a time:
new_file = File.open('smaller_file.txt', 'w')
File.open('massive_file.txt', 'r') do |file|
  file.each_line do |line|
    new_file.puts line.strip.gsub(/,.*\z/, "")
  end
end
new_file.close
The only thing still pending is detecting duplicate lines.
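To cover that, here is a minimal sketch (filenames taken from the question) that keeps the line-by-line streaming but tracks already-seen lines in a Set, so duplicates are skipped without loading the whole file:
require 'set'

seen = Set.new
File.open('smaller_file.txt', 'w') do |out|
  File.foreach('massive_file.txt') do |line|
    line = line.strip.gsub(/,.*\z/, "")
    # Set#add? returns nil when the element was already present
    out.puts(line) if seen.add?(line)
  end
end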

Alternatively, you can read the file in chunks, which should be faster than reading it line by line:
FILENAME = "massive_file.txt"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE) # or n * MEGABYTE
    yield read(chunk_size) until eof?
  end
end

filedata = ""
open(FILENAME, "rb") do |f|
  f.each_chunk do |chunk|
    # $ anchors at each line end; \z would delete everything after the first comma in the chunk
    chunk.gsub!(/,.*$/, "")
    filedata += chunk
  end
end
ref: https://stackoverflow.com/a/1682400/3035830
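One caveat with the chunked approach: a chunk boundary can fall in the middle of a line, so a per-line regex can misfire on the split line. A sketch of the usual fix, carrying the trailing partial line into the next chunk (same filename assumed):
MEGABYTE = 1024 * 1024
carry = ""
File.open("massive_file.txt", "rb") do |f|
  until f.eof?
    chunk = carry + f.read(MEGABYTE)
    # Hold back the (possibly incomplete) last line for the next iteration
    carry = chunk.slice!(/[^\n]*\z/)
    chunk.gsub!(/,.*$/, "")
    # ... write chunk to the output here ...
  end
end
# carry still holds the final line if the file doesn't end with a newline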

Related

Ruby: how to read an mp4 file into chunks

I want to be able to read an mp4 file in chunks of 1mb.
I've tried opening the file with the following API's:
video_file = File.open(@video_filename, 'rb')
video_file = IO.binread(@video_filename)
The problem is, video_file is a string afterwards and I cannot use read to get chunks of the file.
chunk = video_file.read(4*1024*1024)
What is the right interface/tools to use in Ruby to open this file, and read it for N bytes at a time?
I suppose I would do:
chunk_size = 4 * 1024 * 1024
File.open(fn, 'rb') do |f|
  until f.eof?
    chunk = f.read(chunk_size)
    # process the chunk
  end
end
Try something like this:
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end

open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end
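If you'd rather not hold a File object open, note that IO.binread also takes a length and an offset, so a sketch like this pulls fixed-size chunks by position (video_filename assumed to hold the path):
chunk_size = 1024 * 1024 # 1 MB, as the question asks
offset = 0
loop do
  chunk = IO.binread(video_filename, chunk_size, offset)
  break if chunk.nil? || chunk.empty?
  offset += chunk.bytesize
  # process the chunk
end
Each call reopens the file, though, so for sequential reads the block form of File.open above is more efficient.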

read file and send data to yml file

I read multiple files and try to write their data into a YAML file, but I don't know why I get nothing in my YAML file.
Do you have an idea where I made a mistake?
a = array.size
i = 0
array.each do |f|
  while i < a
    puts array[i]
    output = File.new('/home/zyriuse/documents/Ruby-On-Rails/script/Api_BK/licence.yml', 'w')
    File.readlines(f).each do |line|
      output.puts line
      output.puts line.to_yaml
      # output.puts YAML::dump(line)
    end
    i += 1
  end
end
There are two problems.
You initialize i to zero too early: when you process the first file f, you process just that first file as many times as there are files in the array, but for every following file i is already >= a, so you do nothing with them.
You call File.new on every iteration of f, which truncates the file and wipes out the previous iteration's output.
This might work better:
output = File.new('licence.yml', 'w')
array.each do |f|
  puts f
  File.readlines(f).each do |line|
    output.puts line
    output.puts line.to_yaml
  end
end
output.close
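One more note: line.to_yaml emits a separate YAML document (its own --- header) for every single line, which is rarely what you want. A sketch (same array variable assumed) that collects all lines and dumps one YAML array instead:
require 'yaml'

lines = array.flat_map { |f| File.readlines(f, chomp: true) }
File.write('licence.yml', lines.to_yaml)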

Parse remote file with FasterCSV

I'm trying to parse the first 5 lines of a remote CSV file. However, when I do, it raises an Errno::ENOENT exception, saying:
No such file or directory - [file contents] (with [file contents] being a dump of the CSV contents)
Here's my code:
def preview
  @csv = []
  n = 0
  open('http://example.com/spreadsheet.csv') do |file|
    CSV.foreach(file.read, :headers => true) do |row|
      n += 1
      @csv << row
      if n == 5
        return @csv
      end
    end
  end
end
The above code is built from what I've seen others use on Stack Overflow, but I can't get it to work.
If I remove the read method from the file, it raises a TypeError exception, saying:
can't convert StringIO into String
Is there something I'm missing?
CSV.foreach expects a filename, not the file's contents. Try CSV.parse(file.read).each instead.
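In other words, something along these lines (a sketch; note it still reads the whole remote body before parsing):
require 'open-uri'
require 'csv'

def preview
  csv = []
  open('http://example.com/spreadsheet.csv') do |file|
    CSV.parse(file.read, :headers => true).each do |row|
      csv << row
      return csv if csv.size == 5
    end
  end
  csv
end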
You could manually pass each line to CSV for parsing:
require 'open-uri'
require 'csv'

def preview(file_url)
  @csv = []
  open(file_url).each_with_index do |line, i|
    next if i == 0 # ignore headers
    @csv << CSV.parse(line)
    if i == 5
      return @csv
    end
  end
end
puts preview('http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt')
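Since a CSV object is Enumerable, another option (a sketch) is to wrap the remote IO in CSV.new and take only the first five rows, without pulling the whole body into memory first:
require 'open-uri'
require 'csv'

def preview(file_url)
  open(file_url) do |io|
    CSV.new(io, :headers => true).first(5)
  end
end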

Read Certain Lines from File

Hi, I'm just getting into Ruby, trying to learn some basic file-reading commands, and I haven't found any solid sources yet.
I am trying to read certain lines from a file through to the end of that file.
So, from the point in the file where it says FILE_SOURCES, I want to read all the sources up to the end of the file and place them in another file.
I have found how to print the whole file and how to replace words in it, but I just want to read certain parts of the file.
Usually you follow a pattern like this if you're trying to extract a section from a file that's delimited somehow:
open(filename) do |f|
  state = nil
  while (line = f.gets)
    case (state)
    when nil
      # Look for the line beginning with "FILE_SOURCES"
      if (line.match(/^FILE_SOURCES/))
        state = :sources
      end
    when :sources
      # Stop printing if you hit something starting with "END"
      if (line.match(/^END/))
        state = nil
      else
        print line
      end
    end
  end
end
You can change from one state to another depending on what part of the file you're in.
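For what it's worth, Ruby's flip-flop operator can express the same two-state machine very compactly, though it is more cryptic (a sketch, using the same FILE_SOURCES/END delimiters):
File.foreach(filename) do |line|
  if (line =~ /^FILE_SOURCES/) .. (line =~ /^END/)
    # The flip-flop is also true on the delimiter lines themselves, so skip those
    print line unless line =~ /^FILE_SOURCES|^END/
  end
end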
I would do it like this (assuming you can read the entire file into memory):
source_lines = IO.readlines('source_file.txt')
start_line = source_lines.index { |line| line =~ /FILE_SOURCES/ } + 1
File.open('other_file.txt', 'w') do |f|
  # readlines keeps each line's newline, so join without a separator
  f << source_lines[start_line..-1].join
end
Relevant methods:
IO.readlines to read the lines into an array
Array#index to find the index of the first line matching a regular expression
File.open to create a new file on disk (and automatically close it when done)
Array#[] to get the subset of lines from the index to the end
If you can't read the entire file into memory, then I'd do a simpler variation on @tadman's state-based one:
started = false
File.open('other_file.txt', 'w') do |output|
  IO.foreach('source_file.txt') do |line|
    if started
      output << line
    elsif line =~ /FILE_SOURCES/
      started = true
    end
  end
end
Welcome to Ruby!
File.open("file_to_read.txt", "r") {|f|
line = f.gets
until line.include?("FILE_SOURCES")
line = f.gets
end
File.open("file_to_write.txt", "w") {|new_file|
f.each_line {|line|
new_file.puts(line)
}
new_file.close
}
f.close
}
IO functions have no idea what "lines" in a file are. There's no straightforward way to skip to a certain line in a file; you have to read it all and ignore the lines you don't need.
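Still, the skipping is easy to express with an enumerator. A sketch that streams the lines, drops everything up to and including the FILE_SOURCES marker, and writes the rest:
File.open('file_to_write.txt', 'w') do |out|
  IO.foreach('file_to_read.txt')
    .lazy
    .drop_while { |line| !line.include?('FILE_SOURCES') }
    .drop(1) # drop the FILE_SOURCES line itself
    .each { |line| out.write(line) }
end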

Read a file in chunks in Ruby

I need to read a file in MB chunks. Is there a cleaner way to do this in Ruby:
FILENAME="d:\\tmp\\file.bin"
MEGABYTE = 1024*1024
size = File.size(FILENAME)
open(FILENAME, "rb") do |io|
read = 0
while read < size
left = (size - read)
cur = left < MEGABYTE ? left : MEGABYTE
data = io.read(cur)
read += data.size
puts "READ #{cur} bytes" #yield data
end
end
Adapted from the Ruby Cookbook page 204:
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024
class File
def each_chunk(chunk_size = MEGABYTE)
yield read(chunk_size) until eof?
end
end
open(FILENAME, "rb") do |f|
f.each_chunk { |chunk| puts chunk }
end
Disclaimer: I'm a ruby newbie and haven't tested this.
Alternatively, if you don't want to monkeypatch File:
until my_file.eof?
  do_something_with(my_file.read(bytes))
end
For example, streaming an uploaded tempfile into a new file:
# tempfile is a File instance
File.open(new_file, 'wb') do |f|
  # Read in small 65k chunks to limit memory usage
  f.write(tempfile.read(2**16)) until tempfile.eof?
end
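And if no per-chunk processing is needed at all, IO.copy_stream does the buffered copy for you (it accepts IO objects or filenames) and is usually the fastest option:
# Copies using internal buffers; no need to choose a chunk size yourself
IO.copy_stream(tempfile, new_file)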
You can use IO#each(sep, limit), and set sep to nil or empty string, for example:
chunk_size = 1024
File.open('/path/to/file.txt').each(nil, chunk_size) do |chunk|
  puts chunk
end
If you check out the ruby docs:
http://ruby-doc.org/core-2.2.2/IO.html
there's a line that goes like this:
IO.foreach("testfile") {|x| print "GOT ", x }
The only caveat: since this process can read the temp file faster than the stream that generates it, IMO a small delay should be thrown in.
IO.foreach("/tmp/streamfile") do |line|
  ParseLine.parse(line)
  sleep 0.3 # pause, as this process will discontinue if it doesn't allow some buffering
end
https://ruby-doc.org/core-3.0.2/IO.html#method-i-read gives an example of iterating over fixed length records with read(length):
# iterate over fixed length records
open("fixed-record-file") do |f|
  while record = f.read(256)
    # ...
  end
end
If length is a positive integer, read tries to read length bytes without any conversion (binary mode). It returns nil if an EOF is encountered before anything can be read. Fewer than length bytes are returned if an EOF is encountered during the read. In the case of an integer length, the resulting string is always in ASCII-8BIT encoding.
FILENAME="d:/tmp/file.bin"
class File
MEGABYTE = 1024*1024
def each_chunk(chunk_size=MEGABYTE)
yield self.read(chunk_size) until self.eof?
end
end
open(FILENAME, "rb") do |f|
f.each_chunk {|chunk| puts chunk }
end
It works, mbarkhau. I just moved the constant definition to the File class and added a couple of "self"s for clarity's sake.
