Splitting a ruby file at a pattern? - ruby

I have a large ruby file that contains product data. I'm trying to split the file into sections based on a regular expression. I have product headers denoted by the word Product followed by a space and then a number. After that, I have a bunch of lines containing product information.
The format looks like this.
Product 1:
...data
Product 2:
...data
...
Product N:
...data
When reading from the file, I would like to ignore the Product Headers and instead only show the product data. For that, I'm trying to split the file based on a regular expression.
file = File.read('products.txt')
products = file.split(/\Aproduct \d+:\z/i)
This regex works and finds all product headers. The problem is, the file isn't being split into the appropriate sections.
When I run puts products[0], the entire file gets printed out to the console.

\A and \z match the beginning and end of a string, respectively. While what you want is to match the beginning and end of a line. Instead, you should use ^ and $anchors:
/^product \d+:$/i

The file will be readed complete, there's no way to avoid it. But you can iterate over each line and ignore the one that matches the expresion:
File.open("file.txt", "r") do |file|
file.each_line do |line|
if # condition
#fancy stuff
else
# not that fancy
end
end

Related

How to delete specific lines in text file?

Suppose, I have an input.txt file with the following text:
First line
Second line
Third line
Fourth line
I want to delete, for example, the second and fourth lines to get this:
First line
Third line
So far, I've managed to delete only one the second line using this code
require 'fileutils'
File.open('output.txt', 'w') do |out_file|
File.foreach('input.txt') do |line|
out_file.puts line unless line =~ /Second/
end
end
FileUtils.mv('output.txt', 'input.txt')
What is the right way to delete multiple lines in text file in Ruby?
Deleting lines cleanly and efficiently from a text file is "difficult" in the general case, but can be simple if you can constrain the problem somewhat.
Here are some questions from SO that have asked a similar question:
How do I remove lines of data in the middle of a text file with Ruby
Deleting a specific line in a text file?
Deleting a line in a text file
Delete a line of information from a text file
There are numerous others, as well.
In your case, if your input file is relatively small, you can easily afford to use the approach that you're using. Really, the only thing that would need to change to meet your criteria is to modify your input file loop and condition to this:
File.open('output.txt', 'w') do |out_file|
File.foreach('input.txt').with_index do |line,line_number|
out_file.puts line if line_number.even? # <== line numbers start at 0
end
end
The changes are to capture the line number, using the with_index method, which can be used due to the fact that File#foreach returns an Enumerator when called without a block; the block now applies to with_index, and gains the line number as a second block argument. Simply using the line number in your comparison gives you the criteria that you specified.
This approach will scale, even for somewhat large files, whereas solutions that read the entire file into memory have a fairly low upper limit on file size. With this solution, you're more constrained by available disk space and speed at which you can read/write the file; for instance, doing this to space-limited online storage may not work as well as you'd like. Writing to local disk or thumb drive, assuming that you have space available, should be no problem at all.
Use File.readlines to get an array of the lines in your input file.
input_lines = File.readlines('input.txt')
Then select only those with an even index.
output_lines = input_lines.select.with_index { |_, i| i.even? }
Finally, write those in your output file.
File.open('output.txt', 'w') do |f|
output_lines.each do |line|
f.write line
end
end

What does $/ mean in Ruby?

I was reading about Ruby serialization (http://www.skorks.com/2010/04/serializing-and-deserializing-objects-with-ruby/) and came across the following code. What does $/ mean? I assume $ refers to an object?
array = []
$/="\n\n"
File.open("/home/alan/tmp/blah.yaml", "r").each do |object|
array << YAML::load(object)
end
$/ is a pre-defined variable. It's used as the input record separator, and has a default value of "\n".
Functions like gets uses $/ to determine how to separate the input. For example:
$/="\n\n"
str = gets
puts str
So you have to enter ENTER twice to end the input for str.
Reference: Pre-defined variables
This code is trying to read each object into an array element, so you need to tell it where one ends and the next begins. The line $/="\n\n" is setting what ruby uses to to break apart your file into.
$/ is known as the "input record separator" and is the value used to split up your file when you are reading it in. By default this value is set to new line, so when you read in a file, each line will be put into an array. What setting this value, you are telling ruby that one new line is not the end of a break, instead use the string given.
For example, if I have a comma separated file, I can write $/="," then if I do something like your code on a file like this:
foo, bar, magic, space
I would create an array directly, without having to split again:
["foo", " bar", " magic", " space"]
So your line will look for two newline characters, and split on each group of two instead of on every newline. You will only get two newline characters following each other when one line is empty. So this line tells Ruby, when reading files, break on empty lines instead of every line.
I found in this page something probably interesting:
http://www.zenspider.com/Languages/Ruby/QuickRef.html#18
$/ # The input record separator (eg #gets). Defaults to newline.
The $ means it is a global variable.
This one is however special as it is used by Ruby. Ruby uses that variable as a input record separator
For a full list with the special global variables see:
http://www.rubyist.net/~slagell/ruby/globalvars.html

Reading a specific column of data from a text file in Ruby

I have tried Googling, but I can only find solutions for other languages and the ones about Ruby are for CSV files.
I have a text file which looks like this
0.222222 0.333333 0.4444444 this is the first line.
There are many lines in the same format. All of the numbers are floats.
I want to be able to read just the third column of data (0.444444, the values under that) and ignore the rest of the data.How can I accomplish this?
You can still use CSV; just set the column separator to the space character:
require 'csv'
CSV.open('data', :col_sep=>" ").each do |row|
puts row[2].to_f
end
You don't need CSV, however, and if the whitespace separating fields is inconsistent, this is easiest:
File.readlines('data').each do |line|
puts line.split[2].to_f
end
I'd recommend breaking the task down mentally to:
How can I read the lines of a file?
How can I split a string around whitespace?
Those are two problems that are easy to learn how to handle.

Delete first two lines of file with ruby

My script reads in large text files and grabs the first page with a regex. I need to remove the first two lines of each first page or change the regex to match 1 line after the ==Page 1== string. I include the entire script here because I've been asked to in past questions and because I'm new to ruby and don't always know how integrate snippets as answers:
#!/usr/bin/env ruby -wKU
require 'fileutils'
source = File.open('list.txt')
source.readlines.each do |line|
line.strip!
if File.exists? line
file = File.open(line)
end
text = (File.read(line))
match = text.match(/==Page 1(.*)==Page 2==/m)
puts match
end
Now, when You have updated your question, I had to delete a big part of so good answer :-)
I guess the main point of your problem was that you wanted to use match[1] instead of match. The object returned by Regexp.match method (MatchData) can be treated like an array, which holds the whole matched string as the first element, and each subquery in the following elements. So, in your case the variable match (and match[0]) is the whole matched string (together with '==Page..==' marks), but you wanted just the first subexpression which is hidden in match[1].
Now about other, minor problems I sense in your code. Please, don't be offended in case you already know what I say, but maybe others will profit from the warnings.
The first part of your code (if File.exists? line) was checking whether the file exists, but your code just opened the file (without closing it!) and still was trying to open the file few lines later.
You may use this line instead:
next unless File.exists? line
The second thing is that the program should be prepared to handle the situation when the file has no page marks, so it does not match the pattern. (The variable match would then be nil)
The third suggestion is that a little more complicated pattern might be used. The current one (/==Page 1==(.*)==Page 2==/m) would return the page content with the End-Of-Line mark as the first character. If you use this pattern:
/==Page 1==\s*\n(.*)==Page 2==/m
then the subexpression will not contain the white spaces placed in the same line as the '==Page 1==` text. And if you use this pattern:
/==Page 1==\s*\n(.*\n)==Page 2==/m
then you will be sure that the '==Page 2==' mark starts from the beginning of the line.
And the fourth issue is that very often programmers (sometimes including me, of course) tend to forget about closing the file after they opened it. In your case you have opened the 'source' file, but in the code there was no source.close statement after the loop. The most secure way of handling files is by passing a block to the File.open method, so You might use the following form of the first lines of your program:
File.open('list.txt') do |source|
source.readlines.each do |line|
...but in this case it would be cleaner to write just:
File.readlines('list.txt').each do |line|
Taking it all together, the code might look like this (I changed the variable line to fname for better code readability):
#!/usr/bin/env ruby -wKU
require 'fileutils'
File.readlines('list.txt').each do |fname|
fname.strip!
next unless File.exists? fname
text = File.read(fname)
if match = text.match(/==Page 1==\s*\n(.*\n)==Page 2==/m)
# The whole 'page' (String):
puts match[1].inspect
# The 'page' without the first two lines:
# (in case you really wanted to delete lines):
puts match[1].split("\n")[2..-1].inspect
else
# What to do if the file does not match the pattern?
raise "The file #{fname} does NOT include the page separators."
end
end

Ruby: Using a csv as a database

I think I may not have done a good enough job explaining my question the first time.
I want to open a bunch of text, and binary files and scan those files with my regular expression. What I need from the csv is to take the data in the second column, which are the paths to all the files, as the means to point to which file to open.
Once the file is opened and the regexp is scanned thru the file, if it matches anything, it displays to the screen. I am sorry for the confusion and thank you so much for everything! –
Hello,
I am sorry for asking what is probably a simple question. I am new to ruby and will appreciate any guidance.
I am trying to use a csv file as an index to leverage other actions.
In particular, I have a csv file that looks like:
id, file, description, date
1, /dir_a/file1, this is the first file, 02/10/11
2, /dir_b/file2, this is the second file, 02/11/11
I want to open every file defined in the "file" column and search for a regular expression.
I know that you can define the headers in each column with the CSV class
require 'rubygems'
require 'csv'
require 'pp'
index = CSV.read("files.csv", :headers => true)
index.each do |row|
puts row ['file']
end
I know how to create a loop that opens every file and search's for a regexp in each file, and if there is one, displays it:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/
Dir.glob('/home/Bob/**/*').each do |file|
next unless File.file?(file)
File.open(file, "rb") do |f|
f.each_line do |line|
f.each_line do |line|
unless (pattern = line.scan(regex)).empty?
puts "#{pattern}"
end
end
end
end
end
Is there a way I can use the contents of the second column in my csv file as my variable to open each of the files, search the regexp and if there is a match in the file, output the the row in the csv that had the match to a new csv?
Thank you in advance!!!!
At a quick glance it looks like you could reduce it to:
index.each do |row|
File.foreach(row['file']) do |line|
puts "#{pattern}" if (line[regex])
end
end
A CSV file shouldn't be binary, so you can drop the 'rb' when opening the file, letting us reduce the file read to foreach, which iterates over the file, returning it line by line.
The depth of the files in your directory hierarchy is in question based on your sample code. It's not real clear what's going on there.
EDIT:
it tells me that "regex" is an undefined variable
In your question you said:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/
the files I open to do the search on may be a binary.
According to the spec:
Common usage of CSV is US-ASCII, but other character sets defined by IANA for the "text" tree may be used in conjunction with the "charset" parameter.
It goes on to say:
Security considerations:
CSV files contain passive text data that should not pose any
risks. However, it is possible in theory that malicious binary
data may be included in order to exploit potential buffer overruns
in the program processing CSV data. Additionally, private data
may be shared via this format (which of course applies to any text
data).
So, if you're seeing binary data you shouldn't because it's not CSV according to the spec. Unfortunately the spec has been abused over the years, so it's possible you are seeing binary data in the file. If so, continue to use 'rb' as the file mode but do it cautiously.
An important question to ask is whether you can read the file using Ruby's CSV library, which makes a lot of this a moot discussion.

Resources