Writing arrays using grep - Ruby

I'm trying to search through a specified string and assign the results to an array.
Opening and writing to the "input.txt" and "ms3.txt" files works fine. Pushing a normal string like reassign << "hello" also works. It's just that when I use line.grep with the regex that follows, nothing is printed to the console or to the ms3 file, and no errors are raised either.
I've also tried a search and replace: reassign << line.gsub(/[abc]/, '£')
Here's the code:
# encoding: utf-8
#!/usr/bin/ruby
file = File.open("input.txt", "w+")
reassign = []
file.each_line do |line|
  reassign << line.grep(/[abc]/)
end
new_file = File.open("ms3.txt", "w+")
new_file.puts(reassign)
new_file.close

Your code can be streamlined a lot to make it more Ruby-like, and to make it behave better:
# encoding: utf-8
#!/usr/bin/ruby
file = File.open("input.txt", "w+")
reassign = []
file.each_line do |line|
  reassign << line.grep(/[abc]/)
end
new_file = File.open("ms3.txt", "w+")
new_file.puts(reassign)
new_file.close
The #! ("shebang") line has to come first, so swap the encoding and shebang lines.
open takes a block, which allows Ruby to automatically close the file when the block exits. This is a powerful and smart thing to do, as it keeps your file I/O environment clean. It's possible, and often a problem as in your code, to open files but never close them; done in a loop, that can exhaust all the file handles on a machine, causing every app on it to fail. Using the block form avoids this.
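A minimal sketch of the block form, using the question's filename:
File.open('input.txt') do |file|
  # work with file here...
end
# the file was closed automatically when the block exited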
The IO class has foreach, which makes it simple to iterate over the lines of a file. Take advantage of it instead of opening the file then using each_line, because it simplifies your code.
Here's how I'd initially write your code:
#!/usr/bin/ruby
# encoding: utf-8
reassign = []
File.foreach("input.txt") do |line|
  reassign << line.chomp if line[/[abc]/]
end
File.write("ms3.txt", reassign.join("\n"))
But, after refactoring it I'd end up with:
#!/usr/bin/ruby
# encoding: utf-8
File.open('ms3.txt', 'w') do |fo|
  fo.puts File.foreach('input.txt').grep(/[abc]/)
end
The open opens the output file using a block to take advantage of automatically closing the file when the block exits.
foreach is an iterator, and normally is used with a block to pass each line read into the block. Instead, I'm letting grep read all the lines found and search for the pattern.
Any lines found by grep that match the pattern are returned as an array to puts, which iterates over them, appending "\n" to any line that doesn't already end with one.
fo.puts directs the output of puts to the output file.
end causes the block to exit, which causes open to close the file.
That's untested but looks correct.

There are several issues with your code:
You open "input.txt" with open mode "w+". According to the documentation, this truncates your file to zero length. An empty file doesn't contain any lines and therefore, file.each_line doesn't invoke the block.
If you want to read from the file, use "r", which is the default:
file = File.open("input.txt")
You don't close file. Use the block form which closes the file automatically:
File.open("input.txt") do |file|
# ...
end
line is a String and there's no String#grep method. But since File includes Enumerable, you can use Enumerable#grep instead:
reassign = file.grep(/[abc]/)
A complete example:
File.open("input.txt") do |file|
reassign = file.grep(/[abc]/)
File.open("ms3.txt", "w+") do |new_file|
new_file.puts(reassign)
end
end
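For a quick sanity check, assume a hypothetical input.txt containing the three lines apple, dog and cat. grep keeps only the lines matching the pattern:
File.open("input.txt") do |file|
  file.grep(/[abc]/) # => ["apple\n", "cat\n"]
end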

Related

How to delete lines from multiple files

I'm trying to read a file (d:\mywork\list.txt) line by line and check whether each string occurs in any of the files (one by one) in a particular directory (d:\new_work).
If a string is present in any of the files (maybe one or more), I want to delete it (for example, car\yrui3,) from those files and save them.
list.txt:
car\yrui3,
dom\09iuo,
id\byt65_d,
rfc\some_one,
desk\aa_tyt_99,
.........
.........
Directory having multiple files: d:\new_work:
Rollcar-access.txt
Mycar-access.txt
Newcar-access.txt
.......
......
My code:
value = File.open('D:\\mywork\\list.txt').read
value.gsub!(/\r\n?/, "\n")
value.each_line do |line|
  line.chomp!
  print "For the string: #{line}"
  Dir.glob("D:/new_work/*-access.txt") do |fn|
    print "checking files:#{fn}\n"
    text = File.read(fn)
    replace = text.gsub(line.strip, "")
    File.open(fn, "w") { |file| file.puts replace }
  end
end
The issue is that the values are not being deleted as expected. Also, text was empty when I tried to print it.
There are a number of things wrong with your code, and you're not safely handling your file changes.
Meditate on this untested code:
ACCESS_FILES = Dir.glob("D:/new_work/*-access.txt")

File.foreach('D:/mywork/list.txt') do |target|
  target = target.strip.sub(/,$/, '')
  ACCESS_FILES.each do |filename|
    new_filename = "#{filename}.new"
    old_filename = "#{filename}.old"
    File.open(new_filename, 'w') do |fileout|
      File.foreach(filename) do |line_in|
        fileout.puts line_in unless line_in[target]
      end
    end
    File.rename(filename, old_filename)
    File.rename(new_filename, filename)
    File.delete(old_filename)
  end
end
In your code you use:
File.open('D:\\mywork\\list.txt').read
Instead, a shorter, more concise and clearer way is to use:
File.read('D:/mywork/list.txt')
Ruby will automatically adjust the pathname separators based on the OS so always use forward slashes for readability. From the IO documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb".
The problem with read is that it isn't scalable. Imagine doing this in a long-term production system after your input file has grown into the TB range. You'd halt the processing on your system until the file could be read. Don't do that.
Instead use foreach to read line-by-line. See "Why is "slurping" a file not a good practice?". That'll remove the need for
value.gsub!(/\r\n?/, "\n")
value.each_line do |line|
  line.chomp!
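With foreach, that all collapses into a single line-at-a-time loop; a minimal sketch using the question's path:
File.foreach('D:/mywork/list.txt') do |line|
  target = line.chomp
  # work with target here...
end
chomp removes "\n", "\r\n" or "\r" endings, so the gsub! and each_line steps disappear.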
While
Dir.glob("D:/new_work/*-access.txt") do |fn|
is fine, its placement isn't. You're running the glob once for every line processed from the input file, wasting CPU. Run it once, store the result, then iterate over that stored list repeatedly.
Again,
text = File.read(fn)
has scalability issues. Using foreach is a better solution. Again.
Replacing the text using gsub is fast, but it doesn't outweigh the potential problems of scalability when line-by-line IO is just as fast and sidesteps the issue completely:
replace = text.gsub(line.strip, "")
Opening and writing to the same file as you were reading is an accident waiting to happen in a production environment:
File.open(fn, "w") { |file| file.puts replace }
A better practice is to write to a separate, new, file, rename the old file to something safe, then rename the new file to the old file's name. This preserves the old file in case the code or machine crashes mid-save. Then, when that's finished it's safe to remove the old file. See "How to search file text for a pattern and replace it with a given value" for more information.
A final recommendation is to strip all the trailing commas from your input file. They're not accomplishing anything and are only making you do extra work to process the file.
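A one-time cleanup could be as simple as this sketch (list_clean.txt is a name I made up, and this assumes list.txt is small enough to read whole):
cleaned = File.readlines('D:/mywork/list.txt').map { |line| line.chomp.chomp(',') }
File.write('D:/mywork/list_clean.txt', cleaned.join("\n") + "\n")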
I just ran your code and it works as expected on my machine. My best guess is that you're not taking the commas at the end of each line in list.txt into account. Try removing them with an extra chomp!:
value = File.open('D:\\mywork\\list.txt').read
value.gsub!(/\r\n?/, "\n")
value.each_line do |line|
  line.chomp!
  line.chomp!(",")
  print "For the string: #{line}"
  Dir.glob("D:/new_work/*-access.txt") do |fn|
    print "checking files:#{fn}\n"
    text = File.read(fn)
    replace = text.gsub(line.strip, "")
    File.open(fn, "w") { |file| file.puts replace }
  end
end
By the way, you shouldn't need the line value.gsub!(/\r\n?/, "\n"), since you're chomping all the newlines away anyway, and chomp recognizes \r\n by default.
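You can check that quickly in irb:
"foo\r\n".chomp           # => "foo"
"foo\n".chomp             # => "foo"
"foo,\n".chomp.chomp(',') # => "foo"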

How to preserve format of content while writing to another file?

I'm reading some content from a file, using a regex with scan to discard a few things, and writing the rest to another file.
When I look at the newly written file, it contains escape characters and literal "\n" instead of actual newlines.
filea.txt is:
test
in run
]
}
end
I'm getting the content between 'test' and 'end' using:
file = File.open('filea.txt', 'r')
result = file.read
regex = /(?<=test) .*?(?=end)/mx
ans = result.scan(regex)
Writing ans to a new file like fileb.txt gives:
in run'\"\n ]\n }
But if I write the entire result instead, the content format in fileb.txt is correct.
Your question isn't clear and needs work, but you're using read in a way that can cause scalability problems.
Here's how to accomplish the same sort of task without using read:
content = []
DATA.each_line do |li|
  marker = li.lstrip
  if marker =~ /^in run/i .. marker =~ /^end of file/i
    content << li
  end
end
content # => ["in run\n", "]\n", "}\n", "end of file\n"]
__END__
test file
in run
]
}
end of file
The .. operator is a multitool in Ruby (and other languages). We use it to define ranges, but it can also flip-flop between logic states. Here I'm using that second form, the "flip-flop".
When Ruby runs the code it checks
marker =~ /^in run/i
If that is false the if fails and the code continues. If
marker =~ /^in run/i
succeeds, Ruby will remember that it succeeded and immediately test for
marker =~ /^end of file/i
If that fails Ruby will fall into the if block and do whatever is inside the block, then continue as normal.
The next loop of each_line will hit the if tests and .. will remember that
marker =~ /^in run/i
succeeded previously and immediately tests the second condition. If it is true, Ruby steps into the block and resets the flip-flop to false again, so that any subsequent loops will fail until
marker =~ /^in run/i
returns true again.
This logic is really powerful and makes it easy to build code that can scan huge files, extracting portions of them.
There are other ways to do it but they generally run into messier logic.
In the example code I'm also using __END__ which has some rarely seen magic to it also. You'll want to read about __END__ and DATA if you don't understand what's happening.
If you're dealing with files in the GB or TB range, with lots of content you're grabbing, it might be smart to not accumulate too much into your data-gathering array content. A minor tweak will keep that from happening:
if marker =~ /^in run/i .. marker =~ /^end of file/i
  content << li
  next
end

unless content.empty?
  # process what was captured, then empty the array:
  # handle_block(content)   # hypothetical handler
  content.clear
end
In this code I'm using DATA.each_line. In real life you'd want to use File.foreach instead.
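A sketch of the same flip-flop against the question's filea.txt, whose block ends at the bare end line:
content = []
File.foreach('filea.txt') do |li|
  marker = li.lstrip
  if marker =~ /^in run/i .. marker =~ /^end/i
    content << li
  end
end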

How to print from specific column range?

I want to grab only the first line of columns 46 to 245 of source.txt and write it to output.txt
source_file.each { |line|
  File.open(output_file, "a+") { |f|
    f.print ???
  }
}
Bonus: I also need to keep a count of the number of characters in this range, as some may be whitespace, e.g. 38 characters and the rest whitespace.
Example:
source_file: (first line only, columns 46 to 245): 13287912721981239854 + 180 blank columns
output_file: 13287912721981239854
count = 20 characters
Update: appending [46..245].delete(' ').size gives me the desired count.
If I am understanding what you are asking correctly, there's no reason to grab the whole file when you only want the first line. If this isn't what you're asking for, then you need to specify what you're trying to pull out of the source file more clearly.
This should grab the data you need (string indexes are zero-based, so columns 46 to 245 are the slice [45..244]):
output_line = source_file.gets[45..244]
If you write:
source_file.each { |line|
  File.open(output_file, "a+") { |f|
    f.print ???
  }
}
You will open, then close, your output file once for each line read from the input file. That is the wrong way to do it, even if you only want to read one line of input.
Instead try something like one of these:
File.open(output_file, 'a') do |fo|
  File.open('path/to/input_file') do |fi|
    fo.puts fi.readline[46..245]
  end
end
This uses IO#readline, which reads a single line from the file. The block falls through afterwards, causing both the input and output files to be closed automatically. Also, it opens the output file with 'a', which is append-only mode. 'a+' is wrong unless you intend to append and read, which is rarely done. From the documentation:
"a+" Read-write, starts at end of file if file exists,
otherwise creates a new file for reading and
writing
Or:
File.open(output_file, 'a') do |fo|
  File.foreach('path/to/input_file') do |li|
    fo.puts li[46..245]
    break
  end
end
foreach is used most often when we're reading a file line-by-line. It's the mainstay for reading files in a scalable manner. It loops over the entire file inside the block, which is why break is there, to stop after the first line.
Or:
File.foreach('path/to/input_file') do |li|
  File.write(output_file, li[46..245], mode: 'a')
  break
end
File.write is useful when you have a blob of text or binary data and want to write it in one chunk, then move on. Passing mode: 'a' overrides the default mode, which would truncate an existing file.
Maybe this will do the job:
File.open("source.txt") do |f|
  line = f.readline # only the first line is read, so no loop is needed
  columns = line.split
  File.open("output.txt", "w") do |out|
    columns[46, (245 - 46 + 1)].each do |column|
      out.puts column
    end
  end
end
I have used 245 - 46 + 1 to indicate that this is the number of columns we are interested in. I have also assumed that columns are separated by whitespace. If that is not the case, you will need to change the delimiter of split.
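For example, if the fields were separated by tabs instead of whitespace, a hypothetical change would be:
columns = line.split("\t")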

How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?

The incoming data file(s) contain malformed CSV data such as non-escaped quotes, as well as (valid) CSV data such as fields containing new lines. If a CSV format error is detected I would like to use an alternative routine on that data.
With the following sample code (abbreviated for simplicity)
FasterCSV.open(file) do |csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...
    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
end
The MalformedCSVError is caused in the csv.shift method. How can I access the data that caused the error from the rescue clause?
require 'csv' # CSV in Ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt', 'r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row # handle row
    end
  rescue CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end
# Output:
# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]
__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid
Just feed the file line by line to FasterCSV and rescue the error.
This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:
class MyFile < File
  attr_accessor :last_gets

  def gets(*args)
    line = super
    (@last_gets ||= '') << $/ << line if line
    line
  end
end
# then...
file = MyFile.open(filename, 'r')
csv = FasterCSV.new(file)
row = true
while row
  begin
    break unless row = csv.shift
    # do things with the good row here...
  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets
    # do something with bad_row here...
    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end
Kinda neat, right? BUT! there are caveats, of course:
I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
This might fail if your CSV file contains newline characters inside quoted fields. The reason is described in the source: basically, if a quoted value contains a newline then shift has to make additional gets calls to read the entire row, and the trick above doesn't account for that. There could be a clever way around this limitation, but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields, then this shouldn't be a worry for you.
Your other option would be to read the file using gets and pass each line in turn to FasterCSV.parse_line, but I'm pretty sure in doing so you'd squander any performance advantage gained from using FasterCSV.
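That said, a minimal sketch of that fallback (assuming no quoted fields span multiple lines):
File.open(file) do |f|
  while line = f.gets
    begin
      row = FasterCSV.parse_line(line)
      # handle the good row here...
    rescue FasterCSV::MalformedCSVError
      # handle the bad line here...
    end
  end
end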
I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,
class MyFile < File
  def gets(*args)
    line = super
    if line != nil
      line.gsub!('\\"', '""') # fix the \" that would otherwise cause a parse error
    end
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)
while row = incsv.shift
  # process each row here
end
This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.

How to read lines of a file in Ruby

I was trying to use the following code to read lines from a file, but the contents all came out on one line:
line_num = 0
File.open('xxx.txt').each do |line|
  print "#{line_num += 1} #{line}"
end
Yet the same code prints each line of another file separately.
I have to use stdin, like ruby my_prog.rb < file.txt, where I can't assume which line-ending character the file uses. How can I handle that?
Ruby does have a method for this:
File.readlines('foo').each do |line|
  puts(line)
end
http://ruby-doc.org/core-1.9.3/IO.html#method-c-readlines
File.foreach(filename).with_index do |line, line_num|
  puts "#{line_num}: #{line}"
end
This will execute the given block for each line in the file without slurping the entire file into memory. See: IO::foreach.
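with_index also accepts an offset if you want the numbering to start at 1:
File.foreach(filename).with_index(1) do |line, line_num|
  puts "#{line_num}: #{line}"
end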
I believe my answer covers your new concern about handling any type of line ending, since both "\r\n" and "\r" are converted to the Linux-standard "\n" before the lines are parsed.
To support the "\r" EOL character along with the regular "\n", and "\r\n" from Windows, here's what I would do:
line_num = 0
text = File.open('xxx.txt').read
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  print "#{line_num += 1} #{line}"
end
Of course this could be a bad idea on very large files since it means loading the whole file into memory.
Your first file has Mac Classic line endings ("\r" instead of the usual "\n"). Open it with
File.open('foo').each("\r") do |line|
to specify the line ending.
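File.foreach accepts the same separator argument if you prefer the iterator form; a quick sketch:
File.foreach('foo', "\r") do |line|
  puts line.chomp
end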
I'm partial to the following approach for files that have headers:
File.open(file, "r") do |fh|
header = fh.readline
# Process the header
while(line = fh.gets) != nil
#do stuff
end
end
This allows you to process a header line (or lines) differently than the content lines.
That happens because of the line endings at the end of each line.
Use Ruby's chomp method to remove the trailing "\n" or "\r".
line_num = 0
File.open('xxx.txt').each do |line|
  print "#{line_num += 1} #{line.chomp}"
end
How about gets?
my_file = File.open("paths_to_file", "r")
while line = my_file.gets
  # do stuff with line
end
Don't forget that if you're concerned about reading a file whose huge lines could swamp your RAM at runtime, you can always read the file piecemeal. See "Why slurping a file is bad".
File.open('file_path', 'rb') do |io|
  while chunk = io.read(16 * 1024) do
    something_with_the chunk
    # like stream it across a network
    # or write it to another file:
    #   other_io.write chunk
  end
end
