Unicode string from CSV: replace \xHex symbols with unicode values

Unicode string from CSV: replace \xHex symbols with unicode values - ruby

I read some Unicode data from a CSV file using standard Ruby 1.9 csv library like this:
def read_csv(file_name, value)
CSV.foreach(file_name) do |row|
if row[0] == value
return row[1]
end
end
end
And I get a string, the Unicode symbols looks okay in debug.
Invitación
But if I put it (or compare with another string) it looks like this:
Invitaci\xC3\xB3n
How to convert those hex symbols to values? Or maybe I read this CSV file wrong somehow?

Actually found this myself.
Just change line
CSV.foreach(file_name) do |row|
on line
CSV.foreach(file_name, encoding: "UTF-8") do |row|
and this work flawless

Related

Ruby UTF-8 string to UCS-2 conversion

I have a UTF-8 string in my Ruby code. Due to limitations I want to convert the UTF-8 characters in that string to either their escaped equivalents (such as \u23) or simply convert the whole string to UCS-2. I need to explicitly do this to export the data to a file
I tried to do the following in IRB:
my_string = '7.0mΩ'
my_string.encoding
my_string.encode!(Encode::UCS_2BE)
my_string.encoding
The output of that is:
=> "7.0mΩ"
=> #<Encoding::UTF-8>
=> "7.0m\u2126"
=> #<Encoding::UTF-16BE>
This seemed to work fine (I got "ohm" as 2126) until I was reading data out of an array (in Rails):
data.each_with_index do |entry, idx|
puts "#{idx} !! #{entry['title']} !! #{entry['value']} !! #{entry['value'].encode!(Encoding::UCS_2BE)}"
end
That results in the error:
incompatible character encodings: UTF-8 and UTF-16BE
I then tried to write a basic file conversion routine:
File.open(target, 'w', encoding: Encoding::UCS_2BE) do |file|
File.open(source, 'r', encoding: Encoding::UTF_8).each_line do |line|
output.puts(line)
end
end
This resulted in all kinds of weird characters in the file.
Not sure what is going wrong.
Is there a better way to approach this problem of converting UTF-8 data to UCS-2 in Ruby? I really wouldn't mind this actually being changed in the string to \u2126 as a literal part of the string rather than the actual value.
Help!
Temporary Workaround
I monkey-patched this to do what I want. It's not very elegant, but it does the job (and yes, I know it's not pretty... it's just a hack to get what I need):
def hacky_encode
encoded = self
unless encoded.ascii_only?
encoded = scan(/./).map do |char|
char.ascii_only? ? char : char.unpack('U*').map { |i| '\\u' + i.to_s(16).rjust(4, '0') }
end.join
end
encoed
end
Which can be used:
"7.0mΩ".hacky_encode

How to convert a whitespace string to CSV string in Ruby?

How can I convert a whitespace-delimited string to CSV string in Ruby? Is there a built-in method that could be used to achieve this?
Code:
#stores = current_user.channels
puts #stores
Current Output:
TMSUS TMSCA
Expected Output:
TMSUS,TMSCA

There is a CSV library in Ruby Here
require 'csv'
stores = 'TMSUS THSCA'
stores.split(' ').to_csv
Don't use gsub to do this. If you had a string with a comma in it, it would break your CSV. The CSV library does escaping for you.

You could use the CSV library:
require 'csv'
string = 'TMSUS THSCA'
CSV.generate do |csv|
csv << string.split
end
# => "TMSUS,THSCA\n"
The advantage to using the CSV library is it properly escapes and quotes values which might require that.

Ruby CSV::Row remove new line

I am opening a CSV file and then converting it to JSON. This is all working fine except the JSON data has \n characters in the string. These are not part of the last element as far as I can tell from printing it and trying to chomp it. When I print the row it does have \n
require 'csv'
require 'json'
def csv_to_json (tmpfile)
JSON_ARRAY = Array.new
CSV.foreach(tmpfile) do |row|
print row[row.length - 1]
if row[row.length - 1].chomp! == nil
print row
end
JSON_ARRAY.push(row)
end
return JSON_ARRAY.to_json
end
The JSON then looks like this when it is returned
["field11,field12\n",
"field21,field22\n"]
How can I remove these new line characters?
EDIT:
These are CSV::Row objects and do not support string operations like chomp or strip
tmpfile is in the format
field11,field21
field21,field22

Set the row_sep to nil.
JSON_ARRAY.push( row.to_s( row_sep: nil ) )
or
JSON_ARRAY.push( row.to_csv( row_sep: nil ) )
As a comment pointed out, CSV::row#to_s is an alias for CSV::row#to_csv, which adds a row separator after each line automatically. To get around this you can just set the row_sep to nil and it will not add \n at the end of each row.
Hope that helps.

The simplest way:
File.read(tmpfile).split("\n")
By the way, if you want to remove the newline from the string, you could use String::strip method.
CSV.foreach(tmpfile) do |row|
# here row should be an array.
p row
end

CSV.foreach(tmpfile) do |row|
print row[row.length - 1]
if row[row.length - 1].chomp! == nil
print row
end
row.map{|cell| cell.strip!}
JSON_ARRAY.push(row)
end
The row doesn't support stripping, but the cells do.

I was able to get it to work using a map! after the fact
json_array.map! { |row| row = row.to_s.chomp! }
You could also do the to_s.chomp! inside of the loop. This wasn't an option for me because I needed the regular objects to do some calculations before returning the json

Ruby: Why is there a `nil` at the end of all the strings in my array?

So I've written some code in Ruby to split a text file up into individual lines, then to group those lines based on a delimiter character. This output is then written to an array, which is passed to a method, which spits out HTML into a text file. I started running into problems when I tried to use gsub in different methods to replace placeholders in a HTML text file with values from the record array - Ruby kept telling me that I was passing in nil values. After trying to debug that part of the program for several hours, I decided to look elsewhere, and I think I'm on to something. A modified version of the program is posted below.
Here is a sample of the input text file:
26188
WHL
1
Delco
B-7101
A-63
208-220/440
3
285 w/o pallet
1495.00
C:/img_converted/26188B.jpg
EDM Machine Part 2 of 3
AC Motor, 3/4 Hp, Frame 182, 1160 RPM
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
Here is a snippet of the code that I've been testing with:
# function to import file as a string
def file_as_string(filename)
data = ''
f = File.open(filename, "r")
f.each_line do |line|
data += line
end
return data
end
Dir.glob("single_listing.jma") do |filename|
content = file_as_string(filename)
content = content.gsub(/\t/, "\n")
database_array = Array.new
database_array = content.split("|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|")
for i in database_array do
record = Array.new
record = i.split("\n")
puts record[0]
puts record[0].class end
end
When that code is run, I get this output:
john#starfire:~/code/ruby/idealm_db_parser$ ruby putsarray.rb
26188
String
nil
NilClass
... which means that each array position in record apparently has data of type String and of type nil. why is this?

Your database_array has more dimensions than you think.
Your end-of-stanza marker, |--|--|...|--| has a newline after it. So, file_as_string returns something like this:
"26188\nWHL...|--|--|\n"
and is then split() on end-of-stanza into something like this:
["26188\nWHL...1160 RPM\n", "\n"] # <---- Note the last element here!
You then split each again, but "\n".split("\n") gives an empty array, the first element of which comes back as nil.

What's the best way to parse a tab-delimited file in Ruby?

What's the best (most efficient) way to parse a tab-delimited file in Ruby?

The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. Something like this would work:
require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")

The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:
require 'csv'
line = 'boogie\ttime\tis "now"'
begin
line = CSV.parse_line(line, col_sep: "\t")
puts "parsed correctly"
rescue CSV::MalformedCSVError
puts "failed to parse line"
end
begin
line = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
puts "failed to parse line with random quote char"
end
#Output:
# failed to parse line
# parsed correctly with random quote char
If you want to use the CSV library you could used a random quote character that you don't expect to see if your file (the example shows this), but you could also use a simpler methodology like the StrictTsv class shown below to get the same effect without having to worry about field quotations.
# The main parse method is mostly borrowed from a tweet by #JEG2
class StrictTsv
attr_reader :filepath
def initialize(filepath)
#filepath = filepath
end
def parse
open(filepath) do |f|
headers = f.gets.strip.split("\t")
f.each do |line|
fields = Hash[headers.zip(line.split("\t"))]
yield fields
end
end
end
end
# Example Usage
tsv = Vendor::StrictTsv.new("your_file.tsv")
tsv.parse do |row|
puts row['named field']
end
The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values

There are actually two different kinds of TSV files.
TSV files that are actually CSV files with a delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines, as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:
use 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t")
TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line-by-line with a simple line.rstrip.split("\t", -1) (note -1, which prevents split from removing empty trailing fields). If you want to use the csv gem, simply set quote_char to nil:
use 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)

I like mmmries answer. HOWEVER, I hate the way that ruby strips off any empty values off of the end of a split. It isn't stripping off the newline at the end of the lines, either.
Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:
def parse
open(filepath) do |f|
headers = f.gets.strip.split("\t")
f.each do |line|
myline=line
while myline.scan(/\t/).count != headers.count-1
myline+=f.gets
end
fields = Hash[headers.zip(myline.chomp.split("\t",headers.count))]
yield fields
end
end
end
This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Unicode string from CSV: replace \xHex symbols with unicode values - ruby

Actually found this myself. Just change line CSV.foreach(file_name) do |row| on line CSV.foreach(file_name, encoding: "UTF-8") do |row| and this work flawless

Related

Ruby UTF-8 string to UCS-2 conversion

How to convert a whitespace string to CSV string in Ruby?

Ruby CSV::Row remove new line

Ruby: Why is there a `nil` at the end of all the strings in my array?

What's the best way to parse a tab-delimited file in Ruby?

Categories

Resources