Read a csv in ruby with UTF-8 literal - ruby

i have this csv file
file data.csv:
data.csv: ASCII text
This file has ~10000 lines with some UTF-8 literal chars.
For example:
1388357672.209253000,48:a2:2d:78:84:10,\xe5\x87\xb6\xe5\xb7\xb4\xe5\xb7\xb4\xe8\x87\xad\xe7\x98\xaa\xe7\x98\xaa\xe7\x9a\x84\xe6\x80\xaa\xe5\x85\xbd\xe5\x87\xba
I iterate over this file in Ruby and save every line in my postgresql db
File.open(filename, "r").each_line do |line|
CSV.parse(line, encoding: 'UTF-8') do |row|
//Save to Postgresql
end
end
I have now the problem that the UTF-8 literal string is saved in the db and not the correct UTF-8 string. I can convert every line with echo -e "line" but this takes much time. Is ther a way that ruby can do this task?

Try this:
CSV.parse(line, encoding: 'UTF-8') do |row|
row = row.map do |elem|
elem.gsub(/\\x../) {|s| [s[2..-1].hex].pack("C")}.force_encoding("UTF-8")
end
//Save to Postgresql
end

Just put each cell in double quotes:
"\xe5\x87\xb6\xe5\xb7\xb4\xe5\xb7\xb4\xe8\x87\xad\xe7\x98\xaa\xe7\x98\xaa\xe7\x9a\x84\xe6\x80\xaa\xe5\x85\xbd\xe5\x87\xba"
=> "凶巴巴臭瘪瘪的怪兽出"

Related

Change Headers for Certain Columns in CSV File

I have a CSV file that I want to change the headers only for certain columns (about 20 of them in my actual file). Here's a sample CSV file:
CSV File
"name","blah_01_blah","foo_1_01_foo","bacon_01_bacon","bacon_02_bacon"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
I have been trying this with a File/IO class, but when it reads the file to do the gsub it removes all of the quotation marks around each string separated by commas. Here's the code I'm using:
Ruby Code
file = 'file.csv'
replacements = {
'blah_01_blah' => 'newblah1',
'foo_01_foo' => 'coolfoo1',
'bacon_01_bacon' => 'goodpig1',
'bacon_01_bacon' => 'goodpig2'
}
matcher = /#{replacements.keys.join('|')}/
outdata = File.read(file).gsub(matcher, replacements)
File.open(file, 'w') do |out|
out << outdata
end
What I end up with is this in the CSV file:
New CSV File
name,blah_01_blah,foo_1_01_foo,bacon_01_bacon,bacon_02_bacon
John,yucky,summer,yum,food
Mary,"","",cool,sundae
It's keeping the quotation marks in fields that are blank, but taking them out around the strings elsewhere. I want to retain those quotation marks in case for some reason a rogue comma ends up in a string somewhere so it doesn't get thrown off. How can I change the headers without losing my quotation marks around the strings?
EDIT - This is what I want the file to look like at the end.
Expected Result CSV File
"name","newblah1","coolfoo1","goodpig1","goodpig2"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
Thanks!
You don’t need to handle CSV at all:
File.write(
file,
File.readlines(file).tap do |lines|
lines.first.gsub!(matcher, replacements)
end.join
)
File#readlines.
The trick here is we actually deal with the first line only, as with plain text.
Let's first create the input CSV file.
text =<<_
"name","blah_01_blah","foo_1_01_foo","bacon_01_bacon","bacon_02_bacon"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
_
file_in = 'file_in.csv'
file_out = 'file_out.csv'
File.write(file_in, text)
#=> 137
Here is the replacements hash, which I simplified slightly.
replacements = {'blah_01_blah'=>'newblah1', 'foo_01_foo'=>'coolfoo1',
'bacon_01_bacon'=>'goodpig1'}
The first task is to modify this hash so that if it has no key k, replacements[k] will return k. For this we use the method Hash#default_proc=.
replacements.default_proc = ->(_,k) { k }
Here are two examples of how this hash is used.
replacements['bacon_01_bacon']
#=> "goodpig1"
replacements['name']
#=> "name"`
The latter follows because replacements has no key 'name'.
The code is as follows.
require 'csv'
f_in = CSV.read(file_in, headers:true)
CSV.open(file_out, 'w') do |csv_out|
csv_out << replacements.values_at(*f_in.headers)
f_in.each { |row| csv_out << row }
end
#=> #<CSV::Table mode:col_or_row row_count:3>
Note that
f_in.headers
#=> ["name", "blah_01_blah", "foo_1_01_foo", "bacon_01_bacon", "bacon_02_bacon"]
Let's look at the output file.
puts File.read(file_out)
prints
name,newblah1,foo_1_01_foo,goodpig1,bacon_02_bacon
John,yucky,summer,yum,food
Mary,"","",cool,sundae

CSV writing Ruby Encoding Error

I am trying to write a UTF-8 character in a CSV file with csv library of Ruby. And I have got an error:
csv ruby write problem ASCII-8BIT (Encoding::CompatibilityError)
#create csv file
CSV.open(CSV_file,"wb",) do |csv|
csv << First_line
rows.each do |r|
csv << r.generate_array
end
end
That's the code where UTF-8 conflicts with ASCII-8BIT.
Example text that fails:
demás
Here is an example of CSV writing and reading with UTF-8:
fn="/tmp/f.csv"
require "csv"
d1=DATA.read.split(/\n/).map {|e| e.split}
CSV.open(fn, "w:utf-8") do |row|
d1.each { |dr| row << dr }
end
d2=[]
CSV.foreach(fn) do |row|
d2 << row
end
puts d1==d2
# true
__END__
privé face à face à un tête-à-tête
Face to face with one-on-one
demás
Without a more detailed example from you, I cannot help further.

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in ruby and split the text line-by-line.
Here is my code:
def file_read(filename)
File.open(filename, 'r').read
end
puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
File.open(filename, 'r').read
end
def line_cutter(file)
file.scan(/\w/)
end
puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found this online for untrusted website and tried to use it for my own code but it's not working. How can I remove this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting it isn't desired or possible then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
It seems to work if you read the file directly from the page, maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

Ruby CSV: How can I read a tab-delimited file?

CSV.open(name, "r").each do |row|
puts row
end
And I get the following error:
CSV::MalformedCSVError Unquoted fields do not allow \r or \n
The name of the file is a .txt tab-delimited file. I made it specifically. I have a .csv file, I went to excel, and saved the file as .txt tab delimited. So it is tab delimited.
Shouldn't CSV.open be able to read tab-delimited files?
Try specifying the field delimiter like this:
CSV.open("name", "r", { :col_sep => "\t" }).each do |row|
puts row
end
And remember to require 'csv' and read the DOCS
By default CSV uses the comma as separator, this comes from the fact that CSV stands for 'Comma Separated Values'.
If you want a different separator (in this case tabs) you need to make it explicit.
Example:
p CSV.new("aaa\tbbb\tccc\nddd\teee", col_sep: "\t").read
Relevant documentation: http://ruby-doc.org/stdlib-2.1.0/libdoc/csv/rdoc/CSV.html#new
as an alternative to CSV, you can also use smarter_csv like this:
require 'smarter_csv'
data = SmarterCSV.process(filename, col_sep: "\t")
If you use smarter_csv >= 1.4.2, you can also do this:
require 'smarter_csv'
data = SmarterCSV.process(filename, col_sep: :auto)
SmarterCSV will return an array of hashes, and can do batch processing

Ruby CSV parsing string with escaped quotes

I have a line in my CSV file that has some escaped quotes:
173,"Yukihiro \"The Ruby Guy\" Matsumoto","Japan"
When I try to parse it the the Ruby CSV parser:
require 'csv'
CSV.foreach('my.csv', headers: true, header_converters: :symbol) do |row|
puts row
end
I get this error:
.../1.9.3-p327/lib/ruby/1.9.1/csv.rb:1914:in `block (2 levels) in shift': Missing or stray quote in line 122 (CSV::MalformedCSVError)
How can I get around this error?
The \" is typical Unix whereas Ruby CSV expects ""
To parse it:
require 'csv'
text = File.read('test.csv').gsub(/\\"/,'""')
CSV.parse(text, headers: true, header_converters: :symbol) do |row|
puts row
end
Note: if your CSV file is very large, it uses a lot of RAM to read the entire file. Consider reading the file one line at a time.
Note: if your CSV file may have slashes in front of slashes, use Andrew Grimm's suggestion below to help:
gsub(/(?<!\\)\\"/,'""')
CSV supports "converters", which we can normally use to massage the content of a field before it's passed back to our code. For instance, that can be used to strip extra spaces on all fields in a row.
Unfortunately, the converters fire off after the line is split into fields, and it's during that step that CSV is getting mad about the embedded quotes, so we have to get between the "line read" step, and the "parse the line into fields" step.
This is my sample CSV file:
ID,Name,Country
173,"Yukihiro \"The Ruby Guy\" Matsumoto","Japan"
Preserving your CSV.foreach method, this is my example code for parsing it without CSV getting mad:
require 'csv'
require 'pp'
header = []
File.foreach('test.csv') do |csv_line|
row = CSV.parse(csv_line.gsub('\"', '""')).first
if header.empty?
header = row.map(&:to_sym)
next
end
row = Hash[header.zip(row)]
pp row
puts row[:Name]
end
And the resulting hash and name value:
{:ID=>"173", :Name=>"Yukihiro \"The Ruby Guy\" Matsumoto", :Country=>"Japan"}
Yukihiro "The Ruby Guy" Matsumoto
I assumed you were wanting a hash back because you specified the :headers flag:
CSV.foreach('my.csv', headers: true, header_converters: :symbol) do |row|
Open the file in MSExcel and save as MS-DOS Comma Separated(.csv)

Resources