How to reformat CSV file to match proper CSV format - ruby

I have a web application that parses user-uploaded CSV files.
Some users upload CSV files that don't match the proper CSV format mentioned here.
For example:
abc,hello mahmoud,this is" description, bad
This should be
abc,hello mahmoud,"this is"" description", bad
When I use the Ruby FasterCSV library to parse the malformed CSV, it fails. However, opening the file in Excel or OpenOffice succeeds.
Is there any Ruby library that can reformat the CSV text to put it in a proper format?

From the docs:
What you don't want to do is feed FasterCSV invalid CSV. Because of
the way the CSV format works, it's common for a parser to need to read
until the end of the file to be sure a field is invalid. This eats a
lot of time and memory.
Luckily, when working with invalid CSV, Ruby's built-in methods will
almost always be superior in every way. For example, parsing
non-quoted fields is as easy as:
data.split(",")
This would give you an array. If you really want valid CSV (e.g. because you rescued the MalformedCSVError), then there is... FasterCSV!
require 'csv'
str = %q{abc,hello mahmoud,this is" description, bad}
puts str.split(',').to_csv
#=> abc,hello mahmoud,"this is"" description", bad
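
And if you need to repair a whole file rather than a single line, here is a minimal sketch of the same fallback, assuming no field legitimately contains a comma ('bad.csv' is a hypothetical filename):
require 'csv'

# Re-emit each malformed line as a properly quoted CSV row.
File.foreach('bad.csv') do |line|
  puts line.chomp.split(',').to_csv
end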

Related

Reading a specific column of data from a text file in Ruby

I have tried Googling, but I can only find solutions for other languages and the ones about Ruby are for CSV files.
I have a text file which looks like this
0.222222 0.333333 0.4444444 this is the first line.
There are many lines in the same format. All of the numbers are floats.
I want to be able to read just the third column of data (0.4444444, and the values under it) and ignore the rest of the data. How can I accomplish this?
You can still use CSV; just set the column separator to the space character:
require 'csv'
CSV.open('data', :col_sep => " ").each do |row|
  puts row[2].to_f
end
You don't need CSV, however, and if the whitespace separating fields is inconsistent, this is easiest:
File.readlines('data').each do |line|
  puts line.split[2].to_f
end
I'd recommend breaking the task down mentally to:
How can I read the lines of a file?
How can I split a string around whitespace?
Those are two problems that are easy to learn how to handle.
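Put together, a small sketch of those two steps (using the same hypothetical 'data' filename as above):
# Collect the third column as an array of floats, one entry per line.
third_column = File.readlines('data').map { |line| line.split[2].to_f }
p third_column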

Figure date format from string in ruby

I am working on a simple data loader for text files and would like to add a feature for correctly loading dates into the tables. The problem I have is that I do not know the date format beforehand, and it will not be my script doing the inserts: it has to generate insert statements for later use.
Date.parse is almost what I need. If there were a way to grab the format it identified in the string, in a form I could use to generate a to_date(...) call (Oracle standard), that would be perfect.
An example:
My input file:
user_name;birth_date
Sue;20130427
Amy;31/4/1984
Should generate:
insert into my_table values ('Sue', to_date('20130427','yyyymmdd'));
insert into my_table values ('Amy', to_date('31/4/1984','dd/mm/yyyy'));
Note that it is important the original string remains unchanged - so I cannot parse it to a standard format used in the inserts (it is a requirement).
At the moment I am just testing a bunch of regexes and doing some validation, but I was wondering if there was a more robust way.
Suppose (using, for example, String#scan) you extracted an array of the date strings from a single file. It might look like:
strings = ["20130427", "20130102", ...]
Prepare in advance an array of all the formats you can think of. It might look like:
Formats = ["%Y%m%d", "%y%m%d", "%y/%m/%d", "%m/%d/%y", "%d/%m/%y", ...]
Then check all formats that can parse all of the strings:
require "date"
formats = Formats.select { |format|
  strings.all? { |s| Date.strptime(s, format) rescue nil }
}
If this array formats contains exactly one element, the strings were unambiguously parsed with that format, and you can go back and parse each string with it.
Otherwise, either you failed to provide the appropriate format within Formats, or the strings remained ambiguous.
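From there, generating the Oracle to_date mask is a matter of translating the strptime directives. A hedged sketch (the mapping table is my own assumption and covers only the three directives used in the example):
# Hypothetical helper: translate strptime directives into an Oracle
# to_date mask. Extend the table for any other directives you use;
# fetch raises a KeyError on anything unmapped.
STRPTIME_TO_ORACLE = { '%Y' => 'yyyy', '%m' => 'mm', '%d' => 'dd' }.freeze

def oracle_mask(format)
  format.gsub(/%[A-Za-z]/) { |directive| STRPTIME_TO_ORACLE.fetch(directive) }
end

puts oracle_mask('%Y%m%d')   #=> yyyymmdd
puts oracle_mask('%d/%m/%Y') #=> dd/mm/yyyy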
I would use the Chronic gem. It will extract dates in most formats.
It has options to resolve the ambiguity in the xx/xx/xxxx format, but you'd have to specify which to prefer when either matches.
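For instance, a hedged sketch using Chronic's endian_precedence option (the sample date is illustrative):
require 'chronic'

# Prefer dd/mm over mm/dd when both readings are plausible,
# so '4/5/1984' parses as 4 May rather than 5 April.
Chronic.parse('4/5/1984', endian_precedence: [:little, :middle])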

Ruby CSV parsing from Excel with multilingual document

I have a CSV document exported from Excel containing both English and non-English (Russian) letters.
I've managed to open it with
CSV.open #tmp, "rb:ISO-8859-1", {col_sep: ";"}
but it reads the Russian characters as \xCE\xF1\xF2\xE0\xEB\xFC\xED\xFB\xE5 \xE7\xE0\xEF\xF7 etc.
I've tried "rb:ISO-8859-1:UTF-8" but get "ArgumentError: invalid byte sequence in UTF-8", the same as when CSV.open is run without a mode.
How can this be fixed? Also, where can I find the options for the 'mode' argument? I couldn't work out from the docs where they are described.
The main environment is an Ubuntu server, if it matters.
try using this format
r:ISO-8859-15:UTF-8
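Note that the \xCE\xF1... bytes in the question decode to readable Russian under Windows-1251, which is the encoding Excel on Windows commonly uses for Cyrillic text, so transcoding from that may be worth trying as well. A hedged sketch (the filename is hypothetical; the separator follows the question):
require 'csv'

# Transcode from Windows-1251 to UTF-8 while reading.
CSV.open('file.csv', 'r:Windows-1251:UTF-8', col_sep: ';').each do |row|
  p row
end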

How to search binary file and replace string with Ruby?

Ruby newbie here. I'm using Ruby version 1.9.2. I work at a military facility, and whenever we need to send support data to our vendors it needs to be scrubbed of identifying IP and hostname info. This is a new role for me, and the task of scrubbing files (both text and binary) now falls on me when handling support issues.
I created the following script to "scrub" plain-text files of IP address info:
File.open("subnet.htm", 'r+') do |f|
text = f.read
text.gsub!(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/, "000.000.000.000")
f.rewind
f.write(text)
end
I need to modify my script to search and replace hostname AND IP address information in text files AND .dat binary files. I'm looking for something really simple like my little script above, and I'd like to keep the processing of txt and dat files as separate scripts; writing one script that does both is a learning exercise I'd like to take up afterwards from the two separate scripts. Right now I'm under time constraints to scrub the support files and send them out.
The priority for me is to scrub my binary .dat trace files, which contain XML data. These are binary performance trace files from our storage arrays, and they need to have the identifying IP address information scrubbed out before being sent off to support for analysis.
I've searched stackoverflow.com fairly extensively and haven't found a question with an answer that addresses my specific need, and I'm simply having a hard time trying to figure out String#unpack.
Thanks.
In general Ruby processes binary files the same as other files, with two caveats:
On Windows reading files normally translates CRLF pairs into just LF. You need to read in binary mode to ensure no conversion:
File.open('foo.bin','rb'){ ... }
In order to ensure that your binary data is not interpreted as text in some other encoding under Ruby 1.9+ you need to specify the ASCII-8BIT encoding:
File.open('foo.bin','r:ASCII-8BIT'){ ... }
However, as noted in this post, setting the 'b' flag as shown above also sets the encoding for you. Thus, just use the first code snippet above.
However, as noted in the comment by #ennuikiller, I suspect that you don't actually have true binary data. If you're really reading text files with a non-ASCII encoding (e.g. UTF-8) there is a small chance that treating them as binary will accidentally find only half of a multi-byte encoding and cause harm in the resulting file.
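For the IP-scrubbing case specifically, a minimal binary-safe sketch might look like this ('trace.dat' is a hypothetical filename; rewriting the whole file in 'wb' mode avoids any leftover bytes if the output is ever shorter):
# Read the whole .dat file in binary mode (which also sets ASCII-8BIT),
# scrub the IPs, then write the result back out in binary mode.
data = File.open('trace.dat', 'rb') { |f| f.read }
data.gsub!(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/, '000.000.000.000')
File.open('trace.dat', 'wb') { |f| f.write(data) }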
Edit: To use Nokogiri on XML files, you might do something like the following:
require 'nokogiri'
File.open("foo.xml", 'r+') do |f|
doc = Nokogiri.XML(f.read)
doc.xpath('//text()').each do |text_node|
# You cannot use gsub! here
text_node.content = text_node.content.gsub /.../, '...'
end
f.rewind
f.write doc.to_xml
end
I've done some binary file parsing, and this is how I read it in and cleaned it up:
data = File.open("file", 'rb') { |io| io.read }.unpack("C*").map do |val|
  val if val == 9 || val == 10 || val == 13 || (val > 31 && val < 127)
end
For me, my binary file didn't have sequential character strings, so I had to do some shifting and filtering before I could read it (hence the .map do |val| ... end). Unpack with the "C" directive (see http://www.ruby-doc.org/core-1.9.2/String.html#method-i-unpack) will give ASCII character codes rather than the letters, so call val.chr if you'd like to use the interpreted character instead.
I'd suggest that you open your files in a binary editor and look through them to determine how to best handle the data parsing. If they are XML, you might consider parsing them with Nokogiri or a similar XML tool.

Ruby: Using a csv as a database

I think I may not have done a good enough job explaining my question the first time.
I want to open a bunch of text and binary files and scan those files with my regular expression. What I need from the CSV is the data in the second column, which contains the paths to all the files, as the means to point to which file to open.
Once a file is opened and the regexp has been scanned through it, any match is displayed on screen. I am sorry for the confusion and thank you so much for everything!
Hello,
I am sorry for asking what is probably a simple question. I am new to Ruby and will appreciate any guidance.
I am trying to use a csv file as an index to leverage other actions.
In particular, I have a csv file that looks like:
id, file, description, date
1, /dir_a/file1, this is the first file, 02/10/11
2, /dir_b/file2, this is the second file, 02/11/11
I want to open every file defined in the "file" column and search for a regular expression.
I know that you can define the headers in each column with the CSV class
require 'rubygems'
require 'csv'
require 'pp'
index = CSV.read("files.csv", :headers => true)

index.each do |row|
  puts row['file']
end
I know how to create a loop that opens every file, searches for a regexp in each one, and displays any match:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/

Dir.glob('/home/Bob/**/*').each do |file|
  next unless File.file?(file)
  File.open(file, "rb") do |f|
    f.each_line do |line|
      unless (pattern = line.scan(regex)).empty?
        puts "#{pattern}"
      end
    end
  end
end
Is there a way I can use the contents of the second column in my CSV file as the variable for opening each of the files, search for the regexp, and, if there is a match in a file, output the row in the CSV that had the match to a new CSV?
Thank you in advance!!!!
At a quick glance it looks like you could reduce it to:
index.each do |row|
  File.foreach(row['file']) do |line|
    pattern = line.scan(regex)
    puts "#{pattern}" unless pattern.empty?
  end
end
A CSV file shouldn't be binary, so you can drop the 'rb' when opening the file, letting us reduce the file read to foreach, which iterates over the file, returning it line by line.
The depth of the files in your directory hierarchy is in question based on your sample code; it's not really clear what's going on there.
EDIT:
it tells me that "regex" is an undefined variable
In your question you said:
regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/
the files I open to do the search on may be a binary.
According to the spec:
Common usage of CSV is US-ASCII, but other character sets defined by IANA for the "text" tree may be used in conjunction with the "charset" parameter.
It goes on to say:
Security considerations:
CSV files contain passive text data that should not pose any
risks. However, it is possible in theory that malicious binary
data may be included in order to exploit potential buffer overruns
in the program processing CSV data. Additionally, private data
may be shared via this format (which of course applies to any text
data).
So, if you're seeing binary data, you shouldn't be, because then it's not CSV according to the spec. Unfortunately the spec has been abused over the years, so it's possible you are seeing binary data in the file. If so, continue to use 'rb' as the file mode, but do it cautiously.
An important question to ask is whether you can read the file using Ruby's CSV library, which makes a lot of this a moot discussion.
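If it can, a hedged sketch of the full round trip might look like this (the output filename 'matches.csv' is an assumption; note the sample CSV has spaces after its commas, hence the strip calls):
require 'csv'

regex = /[0-9A-Za-z]{8,8}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{4,4}-[0-9A-Za-z]{12,12}/

# Write every row of files.csv whose referenced file matches the regex
# out to a new CSV.
CSV.open('matches.csv', 'w') do |out|
  CSV.foreach('files.csv', headers: true, header_converters: ->(h) { h.strip }) do |row|
    path = row['file'].strip
    out << row.fields if File.foreach(path).any? { |line| line =~ regex }
  end
end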
