Reading a specific column of data from a text file in Ruby

I have tried Googling, but I can only find solutions for other languages and the ones about Ruby are for CSV files.
I have a text file which looks like this
0.222222 0.333333 0.4444444 this is the first line.
There are many lines in the same format. All of the numbers are floats.
I want to be able to read just the third column of data (0.4444444 and the values under it) and ignore the rest of the data. How can I accomplish this?

You can still use CSV; just set the column separator to the space character:
require 'csv'
CSV.open('data', :col_sep => " ").each do |row|
  puts row[2].to_f
end
You don't need CSV, however, and if the whitespace separating fields is inconsistent, this is easiest:
File.readlines('data').each do |line|
  puts line.split[2].to_f
end
I'd recommend breaking the task down mentally to:
How can I read the lines of a file?
How can I split a string around whitespace?
Those are two problems that are easy to learn how to handle.
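Putting those two pieces together, here is a minimal sketch that collects the whole third column into an array (the file name 'data' matches the snippets above; the variable name is just for illustration):
third_column = File.readlines('data').map { |line| line.split[2].to_f }
p third_column # e.g. [0.4444444, ...]
For a very large file, File.foreach with an accumulator avoids holding every line in memory at once.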

Related

How to delete specific lines in a text file?

Suppose, I have an input.txt file with the following text:
First line
Second line
Third line
Fourth line
I want to delete, for example, the second and fourth lines to get this:
First line
Third line
So far, I've managed to delete only the second line, using this code:
require 'fileutils'
File.open('output.txt', 'w') do |out_file|
  File.foreach('input.txt') do |line|
    out_file.puts line unless line =~ /Second/
  end
end
FileUtils.mv('output.txt', 'input.txt')
What is the right way to delete multiple lines in a text file in Ruby?
Deleting lines cleanly and efficiently from a text file is "difficult" in the general case, but can be simple if you can constrain the problem somewhat.
Here are some questions from SO that have asked a similar question:
How do I remove lines of data in the middle of a text file with Ruby
Deleting a specific line in a text file?
Deleting a line in a text file
Delete a line of information from a text file
There are numerous others, as well.
In your case, if your input file is relatively small, you can easily afford to use the approach that you're using. Really, the only thing that would need to change to meet your criteria is to modify your input file loop and condition to this:
File.open('output.txt', 'w') do |out_file|
  File.foreach('input.txt').with_index do |line, line_number|
    out_file.puts line if line_number.even? # <== line numbers start at 0
  end
end
The changes are to capture the line number using the with_index method, which is available because File.foreach returns an Enumerator when called without a block; the block now belongs to with_index and receives the line number as a second argument. Using that line number in your comparison gives you exactly the criteria you specified.
This approach will scale, even for somewhat large files, whereas solutions that read the entire file into memory have a fairly low upper limit on file size. With this solution, you're more constrained by available disk space and speed at which you can read/write the file; for instance, doing this to space-limited online storage may not work as well as you'd like. Writing to local disk or thumb drive, assuming that you have space available, should be no problem at all.
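If, as in the original question, you want to drop specific line numbers (the second and fourth) rather than every other line, the same streaming pattern works; here is a minimal sketch using 1-based line numbers:
require 'fileutils'

lines_to_delete = [2, 4] # 1-based line numbers to remove

File.open('output.txt', 'w') do |out_file|
  # with_index(1) starts counting at 1 so the numbers match the question.
  File.foreach('input.txt').with_index(1) do |line, line_number|
    out_file.puts line unless lines_to_delete.include?(line_number)
  end
end
FileUtils.mv('output.txt', 'input.txt')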
Use File.readlines to get an array of the lines in your input file.
input_lines = File.readlines('input.txt')
Then select only those with an even index.
output_lines = input_lines.select.with_index { |_, i| i.even? }
Finally, write those in your output file.
File.open('output.txt', 'w') do |f|
  output_lines.each do |line|
    f.write line
  end
end

How to clean a csv file where fields contain the csv separator and delimiter

I'm currently struggling to clean automatically generated CSV files whose fields contain the CSV separator and the field delimiter, using sed or awk or via a script.
The source software has no settings to play with to improve the situation.
Format of the csv:
"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";
Fortunately, the CSV is "well" formatted; the exporting software just doesn't escape or replace "forbidden" chars in the fields.
In the last few days I have tried to improve my knowledge of regular expressions and to find an expression to clean the files, but I failed.
What I managed to do so far:
RegEx to find the fields (I wanted to find the fields and perform a replace inside but I didn't find a way to do it)
(?:";"|^")(.*?)(?=";"|";\n)
RegEx that finds a semicolon; it does not work if the semicolon is the last char of the field, and it only finds one per field.
(?:^"|";")(?:.*?)(;)(?:[^"\n].*?)(?=";"|";\n)
RegEx to find the double quotes; it seems to pick only the first double quote of the line in online regex testers.
(?:^"|";")(?:.*?)[^;](")(?:[^;].*?)(?=";"|";\n)
I thought of adding a space between each character in the fields, then searching for lone semicolons and double quotes and removing the single spaces afterwards, but I don't know whether that's even possible, and it seems like a poor solution anyway.
Any standard library should be able to handle it if there is no explicit error in the CSV itself. This is why we have quote-characters and escape characters.
When you create a CSV yourself, you may forget to handle such cases and end up with exactly this situation in your output file. AWK is not a CSV reader but simply a text-processing utility.
This is what your row should look like instead:
"111111";"text";"";"text with \; and \" sometimes \"; or ;\" multiple times";"user";
So if you can still re-fetch the data, find a way to export the CSV either through the database's own functionality or through a CSV library for the language you work with.
In Python, this would look like this:
mywriter = csv.writer(csvfile, delimiter=';', quotechar='"', escapechar="\\")
But if you can't create the CSV again, the only hope is that you can expect some pattern within the fields, as in this question: parse a csv file that contains commas in the fields with awk
But this is rarely true of textual data, especially comments or posts on a webpage. Another idea in such situations would be to use '\t' as the separator.
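For a Ruby equivalent, the standard CSV library quotes and escapes the fields for you when writing, so the output stays parseable; a minimal sketch (the file name and field values here are invented for illustration):
require 'csv'

CSV.open('clean.csv', 'w', col_sep: ';', force_quotes: true) do |csv|
  # CSV escapes embedded quotes by doubling them (RFC 4180 style),
  # and embedded separators stay safely inside the quoted field.
  csv << ['111111', 'text with ; and " inside', 'user']
end

p CSV.read('clean.csv', col_sep: ';')
# => [["111111", "text with ; and \" inside", "user"]]
Note that this produces doubled quotes rather than the backslash-escaped style shown above, but any CSV-aware reader will parse it correctly.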

Line count in csv doesn't match

I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns had an embedded new-line in it, which forces the line to wrap.
If the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data type should be in the last column, then read the next line, join them, and rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
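As a sketch of that, Ruby's CSV parser treats a quoted field with an embedded newline as part of a single row, so counting rows through CSV gives the logical row count (the file name here is a placeholder):
require 'csv'

row_count = 0
# CSV.foreach parses row by row, so a quoted field that wraps across
# physical lines still counts as one row.
CSV.foreach('data.csv') { row_count += 1 }
puts row_count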
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.

replace the first or nth line of file with ruby

How would I replace the first line of a text file or xml file using ruby? I'm having problems replicating a strange xml API and need to edit the document instruction after I create the XML file. It is strange that I have to do this, but in this case it is necessary.
If you are editing XML, use a tool specially designed for the task. sub, gsub and regex are not good choices if the XML being manipulated is not under your control.
Use Nokogiri to parse the XML, locate nodes and change them, then emit the updated XML.
There are many examples on SO showing how to do this, plus the tutorials on the Nokogiri site.
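As a rough sketch of that workflow (the file name, the node name 'title', and the new value are all invented for the example):
require 'nokogiri'

doc = Nokogiri::XML(File.read('doc.xml'))

# Find a node and change its content, then write the document back out.
node = doc.at('title')
node.content = 'New title' if node

File.write('doc.xml', doc.to_xml)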
There are a few different ways you can do this:
Use ARGF (assuming that your ruby program takes a file name as a command line parameter)
ruby -e "puts ARGF.to_a[n]" yourfile.xml
Open the file regularly then read n lines
File.open("yourfile") { |f|
  line = nil
  n.times { line = f.gets }
  puts line
}
This approach is less memory-intensive, as only a single line is considered at a time; it is also the simplest method.
Use IO.readlines() (will only work if the entire file will fit in memory!)
IO.readlines("yourfile")[n]
IO.readlines(...) will read every line from your file into an array.
In all of the above examples, n refers to the line of the file that you want.
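If the goal is to actually replace the first or nth line rather than just read it, one simple approach, assuming the file fits in memory (the file name and replacement text are placeholders):
lines = File.readlines('yourfile.xml')
# Replace the first line (index 0); use lines[n] for the nth line.
lines[0] = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
File.write('yourfile.xml', lines.join)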

Ruby: Using a csv as a database

I think I may not have done a good enough job explaining my question the first time.
I want to open a bunch of text and binary files and scan those files with my regular expression. What I need from the CSV is the data in the second column, which holds the paths to all the files, as the means of pointing to which file to open.
Once the file is opened and the regexp is scanned through the file, if it matches anything, it is displayed on the screen. I am sorry for the confusion and thank you so much for everything!
Hello,
I am sorry for asking what is probably a simple question. I am new to ruby and will appreciate any guidance.
I am trying to use a csv file as an index to leverage other actions.
In particular, I have a csv file that looks like:
id, file, description, date
1, /dir_a/file1, this is the first file, 02/10/11
2, /dir_b/file2, this is the second file, 02/11/11
I want to open every file defined in the "file" column and search for a regular expression.
I know that you can define the headers in each column with the CSV class
require 'rubygems'
require 'csv'
require 'pp'
index = CSV.read("files.csv", :headers => true)
index.each do |row|
  puts row['file']
end
I know how to create a loop that opens every file and searches for a regexp in each file, and, if there is a match, displays it:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/
Dir.glob('/home/Bob/**/*').each do |file|
  next unless File.file?(file)
  File.open(file, "rb") do |f|
    f.each_line do |line|
      unless (pattern = line.scan(regex)).empty?
        puts "#{pattern}"
      end
    end
  end
end
Is there a way I can use the contents of the second column in my CSV file as the variable to open each of the files, search for the regexp, and, if there is a match in the file, output the row in the CSV that had the match to a new CSV?
Thank you in advance!!!!
At a quick glance it looks like you could reduce it to:
index.each do |row|
  File.foreach(row['file']) do |line|
    puts line[regex] if line[regex]
  end
end
A CSV file shouldn't be binary, so you can drop the 'rb' when opening the file, letting us reduce the file read to foreach, which iterates over the file, returning it line by line.
The depth of the files in your directory hierarchy is in question, based on your sample code; it's not really clear what's going on there.
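To also copy the matching CSV rows to a new CSV, as the question asks, here is one sketch built on the same code; the output file name matches.csv is made up:
require 'csv'

CSV.open('matches.csv', 'w') do |out|
  index.each do |row|
    # Write the CSV row once if any line in the referenced file matches.
    out << row if File.foreach(row['file']).any? { |line| line[regex] }
  end
end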
EDIT:
it tells me that "regex" is an undefined variable
In your question you said:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/
the files I open to do the search on may be a binary.
According to the spec:
Common usage of CSV is US-ASCII, but other character sets defined by IANA for the "text" tree may be used in conjunction with the "charset" parameter.
It goes on to say:
Security considerations:
CSV files contain passive text data that should not pose any risks. However, it is possible in theory that malicious binary data may be included in order to exploit potential buffer overruns in the program processing CSV data. Additionally, private data may be shared via this format (which of course applies to any text data).
So, if you're seeing binary data, you shouldn't be, because it's not CSV according to the spec. Unfortunately the spec has been abused over the years, so it's possible you are seeing binary data in the file. If so, continue to use 'rb' as the file mode, but do it cautiously.
An important question to ask is whether you can read the file using Ruby's CSV library, which makes a lot of this a moot discussion.
