Parsing flat text file in Ruby Rake task - ruby

OK, I am giving this a shot. There are a gazillion SO questions with answers for my question but none that solve my issue.
I am creating a rake task to parse a flat text file in RoR. The file has a header, but there are no delimiters other than blank space. So I was going to use the blank space as the delimiter, but it will not work. Here is an example of the text file:
Name Birthdate
Bill 12/25/86
John Smith 1/1/87
If I use ' ' as a delimiter, then I get the correct result for the first entry but not the second, as there are two strings before the date and not just one. Here is how I have been trying to do this:
File.open(file, "r").each do |line|
  name, birthdate = line.strip.split(" ")
  user = User.new(user_name: name, birth_date: birthdate)
  user.save
end
I cannot figure out how to deal with the fact that the first "field" may or may not be a single word. Ultimately I would prefer to require csv and then my issue would not exist.
Thanks in advance.

You probably want to change the format if possible, but working with your two examples, you could do something like this:
components = line.split(' ')
date = components.pop
name = components.join(' ')
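Putting that together, a minimal sketch of the whole loop might look like the following. It assumes the first line is the header and that the date is always the last whitespace-separated token; parse_line is a hypothetical helper name, and the User model is left out so the snippet runs on its own.

```ruby
# Hypothetical helper: splits a line into [name, birthdate],
# assuming the date is always the last whitespace-separated token.
def parse_line(line)
  components = line.split(' ')
  birthdate = components.pop
  [components.join(' '), birthdate]
end

text = <<~DATA
  Name Birthdate
  Bill 12/25/86
  John Smith 1/1/87
DATA

# Drop the header line, then parse every remaining non-blank line.
records = text.lines.drop(1).reject { |l| l.strip.empty? }.map { |l| parse_line(l) }
records.each { |name, birthdate| puts "#{name} | #{birthdate}" }
```

In the rake task, each pair would be passed to User.new instead of being printed.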

Related

Splitting a ruby file at a pattern?

I have a large ruby file that contains product data. I'm trying to split the file into sections based on a regular expression. I have product headers denoted by the word Product followed by a space and then a number. After that, I have a bunch of lines containing product information.
The format looks like this.
Product 1:
...data
Product 2:
...data
...
Product N:
...data
When reading from the file, I would like to ignore the Product Headers and instead only show the product data. For that, I'm trying to split the file based on a regular expression.
file = File.read('products.txt')
products = file.split(/\Aproduct \d+:\z/i)
This regex works and finds all product headers. The problem is, the file isn't being split into the appropriate sections.
When I run puts products[0], the entire file gets printed out to the console.
\A and \z match the beginning and end of a string, respectively, whereas what you want is to match the beginning and end of a line. Instead, you should use the ^ and $ anchors:
/^product \d+:$/i
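As a rough illustration of the difference, here is the line-anchored pattern applied to a small in-memory string. The product data is made up for the example, and the strip/reject cleanup is just one way to discard the empty piece that appears before the first header.

```ruby
text = <<~DATA
  Product 1:
  widget, 3.50
  Product 2:
  gadget, 7.25
DATA

# ^ and $ anchor at line boundaries, so each header line is a split point.
sections = text.split(/^product \d+:$/i).map(&:strip).reject(&:empty?)
sections.each { |section| puts section }
```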
The whole file has to be read either way; there's no way to avoid that. But you can iterate over each line and skip the ones that match the expression:
File.open("file.txt", "r") do |file|
  file.each_line do |line|
    if line =~ /^product \d+:$/i
      next # header line: skip it
    else
      puts line # product data
    end
  end
end

Reading a specific column of data from a text file in Ruby

I have tried Googling, but I can only find solutions for other languages and the ones about Ruby are for CSV files.
I have a text file which looks like this
0.222222 0.333333 0.4444444 this is the first line.
There are many lines in the same format. All of the numbers are floats.
I want to be able to read just the third column of data (0.444444 and the values under it) and ignore the rest of the data. How can I accomplish this?
You can still use CSV; just set the column separator to the space character:
require 'csv'
CSV.foreach('data', col_sep: " ") do |row|
  puts row[2].to_f
end
You don't need CSV, however, and if the whitespace separating fields is inconsistent, this is easiest:
File.readlines('data').each do |line|
  puts line.split[2].to_f
end
I'd recommend breaking the task down mentally to:
How can I read the lines of a file?
How can I split a string around whitespace?
Those are two problems that are easy to learn how to handle.
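The two steps can be sketched separately like this. The sample lines are made up and stand in for the file contents, so in practice the array would come from File.readlines instead.

```ruby
# Sample lines standing in for the file contents (made-up data).
lines = [
  "0.222222 0.333333 0.4444444 this is the first line.",
  "0.1 0.2   0.3 note the irregular spacing here."
]

# String#split with no arguments splits on any run of whitespace,
# so index 2 is the third column even when the spacing is uneven.
third_column = lines.map { |line| line.split[2].to_f }
p third_column
```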

Ruby REGEX parser

Can someone have a look at the below code and tell me whether this is truly the correct way to go about parsing text after the ":" sign.
require 'yaml'
the_file = ARGV[0]
f = File.open(the_file)
content = f.read
r = Regexp.new(/((?=:).+)/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
This script parses email addresses from text files to clean out junk. TEXT:email_address.
I'm trying to make my scripts a bit more efficient. All my ruby/regex scripts look the same, only with different regex patterns. I wrote them in ruby by cutting and pasting here and there, and because I have ruby on the majority of my servers, it's easier to run any script anywhere.
Any help would be appreciated.
If you truly just want the text after the first :, I would not use a regex. I would use String#split:
lines = File.readlines(the_file, chomp: true)
emails = lines.map { |line| line.split(':', 2).last }.uniq
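To see what the limit argument does, here is a small sketch on in-memory lines. The addresses are placeholders, and the lines are chomped so the trailing newline does not end up in the result.

```ruby
lines = ["TEXT:alice@example.com\n", "junk:more:bob@example.com\n"]

# The limit of 2 means only the first colon is used as a split point,
# so any later colons stay in the second piece.
after_colon = lines.map { |line| line.chomp.split(':', 2).last }
p after_colon
```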
If you only want valid emails, I would just search for a regexp that captures emails:
require 'yaml'
email_regexp = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}/i
puts YAML.dump(
  File.read(ARGV[0]).scan(email_regexp)
)
If you know the colon is the left delimiter before the email and a close paren is on the right, then you can just use
:([^)]+)
as your regex to extract whatever is in between. There are some very specific email-matching regexen out there though, which may be more appropriate (for when the source text is less 'regular').

Replacing manually written date with a string containing it

I have these 2 things I am working with:
CSV.foreach('datafile.csv','r') {|row| D_Location << row[0]}
puts Date.new(2003,05,02).cwday
In the first line I would like to change the datafile.csv to something like a string so I can change one string and it changes for all of these codes. I have many, each controlling 1 csv column.
In the second one I would like to replace the hard-coded date with a string. This is so that it can be automatic, because the string will be generated based on other criteria.
I trust the mods will ban me if I'm being too much of a noob hehe. Then I'll toughen up and find these answers myself eventually. But so far I've solved a lot, but not this. Thanks in advance!
Make a function which takes in a string representing a weekday, and returns a number. Call this function later in your code:
Date.new(2003, 05, yourfun('Tuesday')).cwday
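One possible shape for such a function is sketched below. The numbering convention is an assumption: it follows Date#cwday, where Monday is 1 and Sunday is 7, and the body is only a hypothetical implementation of yourfun built on the standard Date::DAYNAMES constant.

```ruby
require 'date'

# Hypothetical implementation of yourfun: maps a weekday name to the
# number Date#cwday uses (Monday = 1 ... Sunday = 7).
def yourfun(day_name)
  index = Date::DAYNAMES.index(day_name.capitalize)
  raise ArgumentError, "unknown weekday: #{day_name}" if index.nil?
  index.zero? ? 7 : index
end

puts yourfun('Tuesday')
puts yourfun('sunday')
```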
For the first part of your question, you're already working with a string. I think what you mean is that you want it to be in a variable:
csv_file = 'datafile.csv'
CSV.foreach(csv_file,'r') {|row| D_Location << row[0]}
For the second part of your question, Date.parse() works with strings, but they need to be in a format that it can recognize. If your date strings use commas, you can replace them with hyphens:
date_str = "2003,05,02"
Date.parse(date_str.gsub(",", "-")).cwday # => 5
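If the format of the string is known in advance, Date.strptime is an alternative that avoids rewriting the string first. This is a small sketch assuming the comma-separated year,month,day layout from the question.

```ruby
require 'date'

date_str = "2003,05,02"

# %Y,%m,%d spells out the comma-separated layout, so no gsub is needed.
date = Date.strptime(date_str, '%Y,%m,%d')
puts date.cwday
```

Unlike Date.parse, Date.strptime raises an error when the string does not match the given format, which can make bad input easier to spot.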
It's not clear where your date strings will be coming from or what format they'll be in, but the general concepts you need to understand are that you can use variables, and that you can transform strings.

Ruby: Using a csv as a database

I think I may not have done a good enough job explaining my question the first time.
I want to open a bunch of text and binary files and scan those files with my regular expression. What I need from the csv is to take the data in the second column, which holds the paths to all the files, as the means to point to which file to open.
Once the file is opened and scanned with the regexp, anything that matches is displayed on the screen. I am sorry for the confusion and thank you so much for everything!
Hello,
I am sorry for asking what is probably a simple question. I am new to ruby and will appreciate any guidance.
I am trying to use a csv file as an index to leverage other actions.
In particular, I have a csv file that looks like:
id, file, description, date
1, /dir_a/file1, this is the first file, 02/10/11
2, /dir_b/file2, this is the second file, 02/11/11
I want to open every file defined in the "file" column and search for a regular expression.
I know that you can define the headers in each column with the CSV class
require 'rubygems'
require 'csv'
require 'pp'
index = CSV.read("files.csv", :headers => true)
index.each do |row|
  puts row['file']
end
I know how to create a loop that opens every file and searches for a regexp in each file, and if there is a match, displays it:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/
Dir.glob('/home/Bob/**/*').each do |file|
  next unless File.file?(file)
  File.open(file, "rb") do |f|
    f.each_line do |line|
      unless (pattern = line.scan(regex)).empty?
        puts "#{pattern}"
      end
    end
  end
end
Is there a way I can use the contents of the second column in my csv file as my variable to open each of the files, search for the regexp and, if there is a match in the file, output the row in the csv that had the match to a new csv?
Thank you in advance!!!!
At a quick glance it looks like you could reduce it to:
index.each do |row|
  File.foreach(row['file']) do |line|
    match = line[regex]
    puts match if match
  end
end
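For the part of the question about writing matching rows to a new CSV, a rough end-to-end sketch could look like this. Everything it creates (the temporary directory, file1, file2, files.csv and matches.csv) is made-up test data, and the index has no spaces after the commas; with the spacing shown in the question, the header would be " file" with a leading space and row['file'] would return nil.

```ruby
require 'csv'
require 'tmpdir'

regex = /[0-9A-Za-z]{8}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{12}/
matched_rows = []

Dir.mktmpdir do |dir|
  # Build two throwaway data files and an index CSV (all made-up content).
  file1 = File.join(dir, 'file1')
  file2 = File.join(dir, 'file2')
  File.write(file1, "id: 123e4567-e89b-12d3-a456-426614174000\n")
  File.write(file2, "nothing interesting here\n")
  index_path = File.join(dir, 'files.csv')
  File.write(index_path, "id,file,description\n1,#{file1},first\n2,#{file2},second\n")

  index = CSV.read(index_path, headers: true)
  output_path = File.join(dir, 'matches.csv')

  CSV.open(output_path, 'w') do |out|
    out << index.headers
    index.each do |row|
      # Binary mode avoids encoding errors when a file is not plain text.
      found = File.foreach(row['file'], mode: 'rb').any? { |line| line.match?(regex) }
      out << row.fields if found
    end
  end

  # Read the output back to see which index rows were copied over.
  matched_rows = CSV.read(output_path, headers: true).map { |r| r['id'] }
end

p matched_rows
```

Only the index is parsed as CSV here; the files it points to are read as raw lines, so they can be text or binary.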
A CSV file shouldn't be binary, so you can drop the 'rb' when opening the file, letting us reduce the file read to foreach, which iterates over the file, returning it line by line.
The depth of the files in your directory hierarchy is in question based on your sample code. It's not real clear what's going on there.
EDIT:
it tells me that "regex" is an undefined variable
In your question you said:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/
the files I open to do the search on may be a binary.
According to the spec:
Common usage of CSV is US-ASCII, but other character sets defined by IANA for the "text" tree may be used in conjunction with the "charset" parameter.
It goes on to say:
Security considerations:
CSV files contain passive text data that should not pose any
risks. However, it is possible in theory that malicious binary
data may be included in order to exploit potential buffer overruns
in the program processing CSV data. Additionally, private data
may be shared via this format (which of course applies to any text
data).
So, if you're seeing binary data you shouldn't because it's not CSV according to the spec. Unfortunately the spec has been abused over the years, so it's possible you are seeing binary data in the file. If so, continue to use 'rb' as the file mode but do it cautiously.
An important question to ask is whether you can read the file using Ruby's CSV library, which makes a lot of this a moot discussion.
