How to delete several repeating triplets from a big text using Ruby?

class RNAtoAA
  def self.rna_convert(rna)
    rna.slice!("AUG")
  end
end
I tried this to delete "AUG" (I also need to delete 2 more repeating patterns) but it did not produce the desired result. I also tried .gsub("AUG", "UAA")

string = 'AAAAUG'
RNAtoAA.rna_convert(string)
puts string
# AAA
Seems to work as expected
If you want to return the updated string, use this:
class RNAtoAA
  def self.rna_convert(rna)
    rna.slice!("AUG")
    return rna
  end
end
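Note that slice! removes only the first occurrence of "AUG". If the aim is to drop every occurrence of several triplets, one possible sketch is to split the string into triplets first, so that only matches aligned to the reading frame are removed. The method name strip_codons and the default codon list are assumptions made for illustration, and the sketch assumes the string contains no line breaks.

```ruby
# Hypothetical helper (not from the question): remove every codon-aligned
# occurrence of the unwanted triplets and return a new string.
def strip_codons(rna, unwanted = %w[AUG UAA UAG])
  # scan(/.{1,3}/) cuts the string into triplets; the last piece may be shorter.
  rna.scan(/.{1,3}/).reject { |codon| unwanted.include?(codon) }.join
end

puts strip_codons("AAAAUGCCCUAAAUG") # AAA AUG CCC UAA AUG -> AAACCC
```

If the reading frame does not matter, rna.gsub(Regexp.union("AUG", "UAA", "UAG"), "") removes the patterns wherever they occur.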

You need to read the file in chunks so that memory isn't swamped.
This example creates a test file and uses it as the source for a filtered version.
If you wanted to create a gigabyte-sized test file, you would need to write it in chunks as well.
If your file contains linefeeds, this could be simpler by lazily loading the lines, but I'm going to assume it doesn't.
# create a test file
patterns = ["AUG", "UAA", "UAG", "UGA", "AAA", "BBB", "CCC"]
large_string = ""
1_000_000.times { large_string << patterns.sample }
File.write("rna.dat", large_string)
# read the file, remove some patterns and write to a new file
filtered = ["AUG", "UAA", "UAG", "UGA"]
File.open("filtered.dat", "w") do |out_file|
  File.open("rna.dat", "r") do |in_file|
    while chunk = in_file.read(3)
      # Read small chunks of 3 bytes to limit memory usage
      out_file.write chunk unless filtered.include? chunk
    end
  end
end
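Reading three bytes at a time keeps memory low but makes a great many read calls. A possible variation, sketched here under the assumption that the file is single-byte text whose triplets start at offset 0, reads a larger chunk whose size is a multiple of 3 and filters the triplets inside it. The helper name filter_triplets and the default chunk size are hypothetical.

```ruby
require "tmpdir"

# Hypothetical helper: copy in_path to out_path, dropping unwanted triplets.
# chunk_size must be a multiple of 3 so that no triplet straddles two chunks.
def filter_triplets(in_path, out_path, unwanted, chunk_size = 3 * 4096)
  raise ArgumentError, "chunk_size must be a multiple of 3" unless (chunk_size % 3).zero?

  File.open(out_path, "w") do |out_file|
    File.open(in_path, "r") do |in_file|
      while (chunk = in_file.read(chunk_size))
        # /m lets "." match line breaks too, so no byte is silently dropped.
        kept = chunk.scan(/.{1,3}/m).reject { |triplet| unwanted.include?(triplet) }
        out_file.write(kept.join)
      end
    end
  end
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  source = File.join(dir, "rna.dat")
  target = File.join(dir, "filtered.dat")
  File.write(source, "AAAAUGCCCUAABBB")
  filter_triplets(source, target, %w[AUG UAA UAG UGA])
  puts File.read(target) # AAACCCBBB
end
```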

Related

How to write a multidimensional array into separate files and then read from them in order in Ruby

I want to take a file, read the file into my program and split it into characters, split the resulting character array into a multidimensional array of 5,000 characters each, then write each separate array into a file found in the same location.
I have taken a file, read it, and created the multidimensional array. Now I want to write each separate single dimension array into separate files.
The file is obtained via user input. Then I created a chain helper method that stores the file to an array in the first mixin, this is then passed to another method that breaks it down into a multidimensional array, which finally hands it off to the end of the chain which currently is setup to make a new directory for which I will put these files.
require 'Benchmark/ips'
file = "C:\\test.php"
class String
def file_to_array
file = self
return_file = File.open(file) do |line|
line.each_char.to_a
end
return return_file
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
read_file = File.read("I:/file_to_array/tmp.txt")
else
Dir.mkdir("I:\\file_to_array")
end
end
end
class Array
def file_divider
file_to_divide = self
file_to_separate = []
count = 0
while count != file_to_divide.length
separator = count % 5000
if separator == 0
start = count - 5000
stop = count
file_to_separate << file_to_divide[start..stop]
end
count = count + 1
end
return file_to_separate
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
else
Dir.mkdir("I:\\file_to_array")
end
end
end
Benchmark.ips do |result|
result.report { file.file_to_array.file_divider.file_write }
end
Test.php
<?php
echo "hello world"
?>
This untested code is where I'd start to split text into chunks and save it:
str = "I want to take a file"
str_array = str.scan(/.{1,10}/) # => ["I want to ", "take a fil", "e"]
str_array.each.with_index(1) do |str_chunk, i|
File.write("output#{i}", str_chunk)
end
This doesn't honor word-boundaries.
Reading a separate input file is easy; you can use read if you KNOW the input will never exceed the available memory and you don't care about performance.
Thinking about it further, if you want to read a text file and break its contents into smaller files, then read it in chunks:
input = File.open('input.txt', 'r')
i = 1
until input.eof? do
  chunk = input.read(10)
  File.write("output#{i}", chunk)
  i += 1
end
input.close
Or even better because it automatically closes the input:
File.open('input.txt', 'r') do |input|
  i = 1
  until input.eof? do
    chunk = input.read(10) # read from the open handle, not from File
    File.write("output#{i}", chunk)
    i += 1
  end
end
Those are not tested but they look about right.
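For the 5,000-character pieces mentioned in the question, here is a minimal sketch that counts characters rather than bytes (read(n) counts bytes, which can split a multibyte character). The helper name split_file and the part#{i}.txt naming scheme are assumptions for illustration only.

```ruby
require "tmpdir"

# Hypothetical helper: split in_path into files of at most `size` characters,
# written as out_dir/part1.txt, part2.txt, ... Returns the paths in order.
def split_file(in_path, out_dir, size = 5000)
  paths = []
  File.open(in_path, "r") do |input|
    # each_char streams the file; each_slice groups the characters lazily.
    input.each_char.each_slice(size).with_index(1) do |chars, i|
      path = File.join(out_dir, "part#{i}.txt")
      File.write(path, chars.join)
      paths << path
    end
  end
  paths
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  source = File.join(dir, "input.txt")
  File.write(source, "I want to take a file")
  split_file(source, dir, 10).each do |path|
    puts "#{File.basename(path)}: #{File.read(path).inspect}"
  end
end
```

Because the paths are returned in order, reading the pieces back in sequence is just paths.map { |p| File.read(p) }.join.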
Use the standard File API and serialisation. Marshal.dump produces binary data, so write it in binary mode:
File.binwrite('path/to/yourfile.txt', Marshal.dump([1, 2, 3]))

ruby read and write/change the same file

I am trying to change the content of an existing file. I have this piece of code, which works. But I would like to find a better way to do the manipulation in one time of opening file.
File.open(file_name , 'r') do |f|
content = f.read
end
File.open(file_name , 'w') do |f|
content.insert(0, "something ")
f.write(content)
end
Is there a way we can do it only opening once the file?
I have tried using File.open(file_name , 'r+'), which seems only append to the end of the file (not be able to insert thing at the beginning of the file).
[Edit: I misunderstood your question, but my code below can be fixed by simply inserting the line:
text_to_prepend = ''
after
line_out = text_to_prepend + buf.shift
It could be simplified a little (for your question), but I'll leave it as is to show how the same string could be prepended to each line.]
You can open the file but once, and not read the entire file before writing, but it's messy and a bit tricky. Basically, you need to move the file pointer between reading and writing and maintain a buffer that contains lines from the file that will be wholly or partially overwritten when each modified line is written.
At each step, remove the first line from the buffer and modify it in preparation for writing. Before writing, however, you may need to read one or more additional lines into the buffer, in order that the read pointer remains ahead of the write pointer after the modified line is written. After all lines have been read, each remaining line in the buffer is modified and written.
Code
def prepend_file_lines(file_name, text_to_prepend)
  File.open(file_name, 'r+') do |f| # the block form closes the file for us
    next if f.eof?
    write_pos = 0
    line_in = f.readline
    read_pos = line_in.bytesize # file positions are byte offsets
    buf = [line_in]
    last_line_read = f.eof?
    loop do
      break if buf.empty?
      line_out = text_to_prepend + buf.shift
      # keep the read pointer ahead of the write pointer
      while (!last_line_read && read_pos <= write_pos + line_out.bytesize) do
        line_in = f.readline
        buf << line_in
        read_pos += line_in.bytesize
        last_line_read = f.eof?
      end
      f.seek(write_pos, IO::SEEK_SET)
      write_pos += f.write(line_out)
      f.seek(read_pos, IO::SEEK_SET)
    end
  end
end
Example
First, create a test file.
text =<<_
Now is
the time
for all Rubiests
to raise their
glasses to Matz.
_
F_NAME = "sample.txt"
File.write(F_NAME, text)
We can confirm the file was written correctly:
File.readlines(F_NAME).each { |l| puts l }
# Now is
# the time
# for all Rubiests
# to raise their
# glasses to Matz.
Now let's try it:
prepend_file_lines("sample.txt", "Here's to Matz: ")
File.readlines(F_NAME).each { |l| puts l }
# Here's to Matz: Now is
# Here's to Matz: the time
# Here's to Matz: for all Rubiests
# Here's to Matz: to raise their
# Here's to Matz: glasses to Matz.
Note that when testing, it's necessary to write the test file before each call to prepend_file_lines, since the file is being modified.
It looks like you want IO::SEEK_SET with 0 to rewind the file pointer after reading.
file_name = "File.txt"
File.open(file_name, 'r+') do |f|
  content = f.read
  content.insert(0, "something else")
  f.seek(0, IO::SEEK_SET)
  f.write(content)
end
You can do it in the same file, but you risk overwriting its content.
Each file operation moves the file's cursor, and the next operation starts from that position. So if you read 8 bytes, you have to move the cursor back 8 bytes and write exactly 8 bytes to avoid overwriting anything else; if you write fewer bytes, the remaining bytes stay unchanged.
The Ruby File class is a subclass of IO, which is documented at http://www.ruby-doc.org/core-1.9.3/IO.html.
To open a file for read/write operations, use "r+" mode.

Best way of Parsing 2 CSV files and printing the common values in a third file

I am new to Ruby, and I have been struggling with a problem that I suspect has a simple answer. I have two CSV files, one with two columns, and one with a single column. The single column is a subset of values that exist in one column of my first file. Example:
file1.csv:
abc,123
def,456
ghi,789
jkl,012
file2.csv:
def
jkl
All I need to do is look up the column 2 value in file1 for each value in file2 and output the results to a separate file. So in this case, my output file should consist of:
456
012
I’ve got it working this way:
pairs=IO.readlines("file1.csv").map { |columns| columns.split(',') }
f1 =[]
pairs.each do |x| f1.push(x[0]) end
f2 = IO.readlines("file2.csv").map(&:chomp)
collection={}
pairs.each do |x| collection[x[0]]=x[1] end
f=File.open("outputfile.txt","w")
f2.each do |col1,col2| f.puts collection[col1] end
f.close
...but there has to be a better way. If anyone has a more elegant solution, I'd be very appreciative! (I should also note that I will eventually need to run this on files with millions of lines, so speed will be an issue.)
To be as memory efficient as possible, I'd suggest only reading the full file2 (which I gather would be the smaller of the two input files) into memory. I'm using a hash for fast lookups and to store the resulting values, so as you read through file1 you only store the values for those keys you need. You could go one step further and write the output file while reading file1.
require 'csv'
# Read file 2, the smaller file, and store keys in result Hash
result = {}
CSV.foreach("file2.csv") do |row|
  result[row[0]] = false
end
# Read file 1, the larger file, and look for keys in result Hash to set values
CSV.foreach("file1.csv") do |row|
  result[row[0]] = row[1] if result.key? row[0]
end
# Write the results
File.open("outputfile.txt", "w") do |f|
  result.each do |key, value|
    f.puts value if value
  end
end
Tested with Ruby 1.9.3
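The "one step further" idea could look roughly like the following sketch, which keeps only the keys from the smaller file in a Set and writes each match as soon as it is seen in the larger file. The method name write_matches is hypothetical, and the output follows the row order of the larger file rather than the order of the key file.

```ruby
require "csv"
require "set"
require "tmpdir"

# Hypothetical helper: for every row of big_csv whose first column appears
# in keys_csv, write the second column to out_path.
def write_matches(big_csv, keys_csv, out_path)
  wanted = Set.new
  CSV.foreach(keys_csv) { |row| wanted << row[0] }

  File.open(out_path, "w") do |out|
    CSV.foreach(big_csv) do |row|
      out.puts row[1] if wanted.include?(row[0])
    end
  end
end

# Small demonstration with the sample data from the question.
Dir.mktmpdir do |dir|
  file1 = File.join(dir, "file1.csv")
  file2 = File.join(dir, "file2.csv")
  output = File.join(dir, "outputfile.txt")
  File.write(file1, "abc,123\ndef,456\nghi,789\njkl,012\n")
  File.write(file2, "def\njkl\n")
  write_matches(file1, file2, output)
  puts File.read(output)
end
```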
Parsing For File 1
require 'csv'
data_csv_file1 = File.read("file1.csv")
data_csv1 = CSV.parse(data_csv_file1, :headers => true) # :headers => true treats the first row as a header row
Parsing For File 2
data_csv_file2 = File.read("file2.csv")
data_csv2 = CSV.parse(data_csv_file2, :headers => true)
Collection of names
names_from_sheet1 = data_csv1.collect {|data| data[0]} #returns an array of names
names_from_sheet2 = data_csv2.collect {|data| data[0]} #returns an array of names
common_names = names_from_sheet1 & names_from_sheet2 #array with common names
Collecting results to be printed
results = [] #this will store the values to be printed
data_csv1.each {|data| results << data[1] if common_names.include?(data[0]) }
Final output
f = File.open("outputfile.txt","w")
results.each {|result| f.puts result }
f.close

Read, edit, and write a text file line-wise using Ruby

Is there a good way to read, edit, and write files in place in Ruby?
In my online search I've found stuff suggesting to read it all into an array, modify said array, then write everything out. I feel like there should be a better solution, especially if I'm dealing with a very big file.
Something like:
myfile = File.open("path/to/file.txt", "r+")
myfile.each do |line|
myfile.replace_puts('blah') if line =~ /myregex/
end
myfile.close
Where replace_puts would write over the current line, rather than (over)writing the next line as it currently does because the pointer is at the end of the line (after the separator).
So then every line that matches /myregex/ will be replaced with 'blah'. Obviously what I have in mind is a bit more involved than that, as far as processing, and would be done in one line, but the idea is the same - I want to read a file line by line, and edit certain lines, and write out when I'm done.
Maybe there's a way to just say "rewind back to just after the last separator"? Or some way of using each_with_index and write via a line index number? I couldn't find anything of the sort, though.
The best solution I have so far is to read things line-wise, write them out to a new (temp) file line-wise (possibly edited), then overwrite the old file with the new temp file and delete. Again, I feel like there should be a better way - I don't think I should have to create a new 1gig file just to edit some lines in an existing 1GB file.
In general, there's no way to make arbitrary edits in the middle of a file. It's not a deficiency of Ruby. It's a limitation of the file system: Most file systems make it easy and efficient to grow or shrink the file at the end, but not at the beginning or in the middle. So you won't be able to rewrite a line in place unless its size stays the same.
There are two general models for modifying a bunch of lines. If the file is not too large, just read it all into memory, modify it, and write it back out. For example, adding "Kilroy was here" to the beginning of every line of a file:
path = '/tmp/foo'
lines = IO.readlines(path).map do |line|
  'Kilroy was here ' + line
end
File.open(path, 'w') do |file|
  file.puts lines
end
Although simple, this technique has a danger: If the program is interrupted while writing the file, you'll lose part or all of it. It also needs to use memory to hold the entire file. If either of these is a concern, then you may prefer the next technique.
You can, as you note, write to a temporary file. When done, rename the temporary file so that it replaces the input file:
require 'tempfile'
require 'fileutils'
path = '/tmp/foo'
temp_file = Tempfile.new('foo')
begin
  File.open(path, 'r') do |file|
    file.each_line do |line|
      temp_file.puts 'Kilroy was here ' + line
    end
  end
  temp_file.close
  FileUtils.mv(temp_file.path, path)
ensure
  temp_file.close
  temp_file.unlink
end
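Since the rename (FileUtils.mv) is atomic when the temporary file and the target are on the same file system, the rewritten input file will pop into existence all at once. If the program is interrupted, either the file will have been rewritten, or it will not. There's no possibility of it being partially rewritten. If the two are on different file systems, FileUtils.mv falls back to copying the file, which is not atomic.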
The ensure clause is not strictly necessary: The file will be deleted when the Tempfile instance is garbage collected. However, that could take a while. The ensure block makes sure that the tempfile gets cleaned up right away, without having to wait for it to be garbage collected.
If you want to overwrite a file line by line, you'll have to ensure the new line has the same length as the original line. If the new line is longer, part of it will be written over the next line. If the new line is shorter, the remainder of the old line just stays where it is.
The tempfile solution is really much safer. But if you're willing to take a risk:
File.open('test.txt', 'r+') do |f|
  old_pos = 0
  f.each do |line|
    f.pos = old_pos # this is the 'rewind'
    f.print line.gsub('2010', '2011')
    old_pos = f.pos
  end
end
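Since this only works when the replacement leaves the length of each line unchanged, it may be worth guarding against that explicitly. The following is a sketch under that assumption, not a hardened implementation; the name replace_in_place is hypothetical, and the file is opened in binary mode so that positions and string lengths are both counted in bytes.

```ruby
require "tmpdir"

# Hypothetical helper: overwrite every occurrence of `from` with `to`
# in place. Refuses to run unless both have the same byte length.
def replace_in_place(path, from, to)
  from_b = from.b # binary copies, so they compare cleanly with binary lines
  to_b = to.b
  unless from_b.bytesize == to_b.bytesize
    raise ArgumentError, "replacement must have the same byte length"
  end

  File.open(path, "rb+") do |f|
    line_start = 0
    f.each_line do |line|
      line_end = f.pos
      if line.include?(from_b)
        f.pos = line_start # jump back to the start of this line
        f.write(line.gsub(from_b, to_b))
        f.pos = line_end   # continue reading at the next line
      end
      line_start = line_end
    end
  end
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, "test.txt")
  File.write(path, "year 2010\nnothing here\n2010 again\n")
  replace_in_place(path, "2010", "2011")
  puts File.read(path)
end
```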
If the line size does change, this is a possibility:
File.open('test.txt', 'r+') do |f|
  out = ""
  f.each do |line|
    out << line.gsub(/myregex/, 'blah')
  end
  f.pos = 0
  f.print out
  f.truncate(f.pos)
end
Just in case you are using Rails or Facets, or you otherwise depend on Rails' ActiveSupport, you can use the atomic_write extension to File:
File.atomic_write('path/file') do |file|
  file.write('your content')
end
Behind the scenes, this will create a temporary file which it will later move to the desired path, taking care of closing the file for you.
It further clones the file permissions of the existing file or, if there isn't one, of the current directory.
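Without ActiveSupport, the same idea can be approximated with the standard library. This is only a sketch of the general technique, not ActiveSupport's actual implementation: atomic_rewrite is a hypothetical name, the temporary file is created next to the target so that both are on the same file system, and only the permission bits of an existing target are copied.

```ruby
require "tempfile"
require "tmpdir"

# Hypothetical helper: yield a temporary file, then rename it over `path`.
def atomic_rewrite(path)
  temp = Tempfile.new(File.basename(path), File.dirname(path))
  begin
    yield temp
    temp.close
    # Tempfile creates the file with restrictive permissions; reuse the
    # target's permission bits if the target already exists.
    File.chmod(File.stat(path).mode & 0o7777, temp.path) if File.exist?(path)
    File.rename(temp.path, path)
  ensure
    temp.close
    temp.unlink # does nothing harmful if the rename already moved the file
  end
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, "file.txt")
  File.write(path, "old content")
  atomic_rewrite(path) { |file| file.write("your content") }
  puts File.read(path)
end
```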
You can write in the middle of a file, but you have to be careful to keep the length of the string you overwrite the same; otherwise you overwrite some of the following text. Here is an example using File.seek: IO::SEEK_CUR makes the offset relative to the current position of the file pointer, which is at the end of the line that was just read; the +1 is for the CR character at the end of the line.
look_for = "bbb"
replace_with = "xxxxx"
File.open(DATA, 'r+') do |file|
  file.each_line do |line|
    if (line[look_for])
      file.seek(-(line.length + 1), IO::SEEK_CUR)
      file.write line.gsub(look_for, replace_with)
    end
  end
end
__END__
aaabbb
bbbcccddd
dddeee
eee
After execution, the data at the end of the script looks like this, which I assume is not what you had in mind:
aaaxxxxx
bcccddd
dddeee
eee
With that in mind, this technique is much faster than the classic 'read and write to a new file' method.
See these benchmarks on a 1.7 GB file of music data.
For the classic approach I used Wayne's technique.
The benchmark uses the .bmbm method so that file caching doesn't skew the results much. Tests were done with MRI Ruby 2.3.0 on Windows 7.
The strings were actually replaced; I checked both methods.
require 'benchmark'
require 'tempfile'
require 'fileutils'
look_for = "Melissa Etheridge"
replace_with = "Malissa Etheridge"
very_big_file = 'D:\Documents\muziekinfo\all.txt'.gsub('\\','/')
def replace_with file_path, look_for, replace_with
  File.open(file_path, 'r+') do |file|
    file.each_line do |line|
      if (line[look_for])
        file.seek(-(line.length + 1), IO::SEEK_CUR)
        file.write line.gsub(look_for, replace_with)
      end
    end
  end
end
def replace_with_classic path, look_for, replace_with
  temp_file = Tempfile.new('foo')
  File.foreach(path) do |line|
    if (line[look_for])
      temp_file.write line.gsub(look_for, replace_with)
    else
      temp_file.write line
    end
  end
  temp_file.close
  FileUtils.mv(temp_file.path, path)
ensure
  temp_file.close
  temp_file.unlink
end
Benchmark.bmbm do |x|
  x.report("adapt ") { 1.times {replace_with very_big_file, look_for, replace_with}}
  x.report("restore ") { 1.times {replace_with very_big_file, replace_with, look_for}}
  x.report("classic adapt ") { 1.times {replace_with_classic very_big_file, look_for, replace_with}}
  x.report("classic restore") { 1.times {replace_with_classic very_big_file, replace_with, look_for}}
end
Which gave
Rehearsal ---------------------------------------------------
adapt 6.989000 0.811000 7.800000 ( 7.800598)
restore 7.192000 0.562000 7.754000 ( 7.774481)
classic adapt 14.320000 9.438000 23.758000 ( 32.507433)
classic restore 14.259000 9.469000 23.728000 ( 34.128093)
----------------------------------------- total: 63.040000sec
user system total real
adapt 7.114000 0.718000 7.832000 ( 8.639864)
restore 6.942000 0.858000 7.800000 ( 8.117839)
classic adapt 14.430000 9.485000 23.915000 ( 32.195298)
classic restore 14.695000 9.360000 24.055000 ( 33.709054)
So the in-place replacement was about 4 times faster.

Ruby: Deleting last iterated item?

What I'm doing is this: have one file as input, another as output. I chose a random line in the input, put it in the output, and then delete it.
Now, I've iterated over the file and am on the line I want. I've copied it to the output file. Is there a way to delete it? I'm doing something like this:
for i in 0..number_of_lines_to_remove
line = rand(lines_in_file-2) + 1 #not removing the first line
counter = 0
IO.foreach("input.csv", "r") { |current_line|
if counter == line
File.open("output.csv", "a") { |output|
output.write(current_line)
}
end
counter += 1
}
end
So, I have current_line, but I'm not sure how to remove it from the source file.
Array#delete_at might do. Given an index, it removes the object at that index and returns it.
input.csv:
one,1
two,2
three,3
Program:
#!/usr/bin/ruby1.8
lines = File.readlines('/tmp/input.csv')
File.open('/tmp/output.csv', 'a') do |file|
file.write(lines.delete_at(rand(lines.size)))
end
p lines # ["two,2\n", "three,3\n"]
output.csv:
one,1
Here is a randomline class. You create a new randomline object by passing it an input file name and an output file name. You can then call the deleterandom method on that object and pass it a number of lines to delete.
The data is stored internally in arrays as well as being written to file. Currently the output file is opened in append mode, so if you reuse the same file it will just add to the end; you could change the "a" to a "w" if you wanted to start the file fresh each time.
class Randomline
  attr_accessor :inputarray, :outputarray
  def initialize(filein, fileout)
    @filename = filein
    @filein = File.open(filein,"r+")
    @fileoutput = File.open(fileout,"a")
    @inputarray = []
    @outputarray = []
    readin()
  end
  def readin()
    @filein.each do |line|
      @inputarray << line
    end
  end
  def deleterandom(numtodelete)
    numtodelete.times do |num|
      random = rand(@inputarray.size)
      @outputarray << inputarray[random]
      @fileoutput.puts inputarray[random]
      @inputarray.delete_at(random)
    end
    @filein = File.open(@filename,"w")
    @inputarray.each do |line|
      @filein.puts line
    end
  end
end
Here is an example of it being used:
a = Randomline.new("testin.csv","testout.csv")
a.deleterandom(3)
You have to re-write the source-file after removing a line otherwise the modifications won't stick as they're performed on a copy of the data.
Keep in mind that any operation which modifies a file in-place runs the risk of truncating the file if there's an error of any sort and the operation cannot complete.
It would be safer to use some kind of simple database for this kind of thing as libraries like SQLite and BDB have methods for ensuring data integrity, but if that's not an option, you just need to be careful when writing the new input file.
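If a database is not an option, one way to be careful is to build the new input file separately and swap it in with a rename. The following is a rough sketch under a few assumptions: the input fits in memory, every line ends with a newline, and the first line must never be moved (as in the question). The method name move_random_lines is hypothetical.

```ruby
require "tempfile"
require "tmpdir"

# Hypothetical helper: move `count` random lines (never the first one) from
# in_path to the end of out_path. Returns how many lines were moved.
def move_random_lines(in_path, out_path, count)
  lines = File.readlines(in_path)
  # Distinct random indices, skipping index 0 (the first line).
  picked = (1...lines.size).to_a.sample(count).sort
  File.open(out_path, "a") { |out| picked.each { |i| out.write(lines[i]) } }

  remaining = lines.reject.with_index { |_line, i| picked.include?(i) }
  # Write the shortened input next to the original, then rename it into
  # place. Note: the new file gets the temp file's restrictive permissions.
  temp = Tempfile.new(File.basename(in_path), File.dirname(in_path))
  begin
    temp.write(remaining.join)
    temp.close
    File.rename(temp.path, in_path)
  ensure
    temp.close
    temp.unlink
  end
  picked.size
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  input = File.join(dir, "input.csv")
  output = File.join(dir, "output.csv")
  File.write(input, "name,value\none,1\ntwo,2\nthree,3\n")
  moved = move_random_lines(input, output, 2)
  puts "moved #{moved} lines"
  puts File.read(input)
end
```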
