I am parsing a CSV file and converting each element to UTF-8:
CSV.foreach(@data_source, { :col_sep => ';', quote_char: "\x00", :encoding => "CP850" }) do |row|
row.map! {|x| x.force_encoding('UTF-8') unless x.nil? ; x.scrub!("") unless x.nil? ; x.delete('\u2FEC') unless x.nil? }
end
The script then does a bunch of calculations and saves the data in xlsx format using the axlsx gem.
I added x.delete('\u2FEC') unless x.nil? because I found this strange sequence in the source file, which later causes an "Unreadable content" error in Excel.
I found that it solves the "Unreadable content" issue, but it not only deletes the "\u2FEC" sequence, it deletes every occurrence of the character "2" too.
Do you have any idea how I can get rid of only "\u2FEC" and not of every "2" in my rows?
Thanks.
Single-quoted strings don't support Unicode escapes. (In fact, they don't support any escapes other than \' and \\.)
You need to use either a double-quoted string or enter the character directly into the single-quoted string instead of a Unicode escape sequence.
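To see the difference:

'\u2FEC'.length  #=> 6 (the literal characters \, u, 2, F, E, C)
"\u2FEC".length  #=> 1 (the single character U+2FEC)

String#delete treats its argument as a character set, which is why the single-quoted version strips every "2" (and every "u", "F", "E" and "C") from your data. With a double-quoted escape, the map! becomes:

row.map! { |x| x.nil? ? x : x.delete("\u2FEC") }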
I want to append </tag> to each line where it's missing:
text = '<tag>line 1</tag>
<tag>line2 # no closing tag, append
<tag>line3 # no closing tag, append
line4</tag> # no opening tag, but has a closing tag, so ignore
<tag>line5</tag>'
I tried to create a regular expression to match this, but I know it's wrong:
text.gsub! /.*?(<\/tag>)\Z/, '</tag>'
How can I create a regular expression to conditionally append </tag> to each line?
Here you go:
text.gsub!(%r{(?<!</tag>)$}, "</tag>")
Explanation:
$ means end of line and \z means end of string. \Z means something similar, with complications.
The (?<! and ) pair works together to create a negative lookbehind.
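A quick check on a trimmed-down version of the sample (inline comments removed):

sample = "<tag>line 1</tag>\n<tag>line2\nline4</tag>"
sample.gsub(%r{(?<!</tag>)$}, "</tag>")
#=> "<tag>line 1</tag>\n<tag>line2</tag>\nline4</tag>"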
Given the example provided, I'd just do something like this:
text.split(/<\/?tag>/).
     reject { |t| t.strip.length == 0 }.
     map { |t| "<tag>%s</tag>" % t.strip }.
     join("\n")
You're basically treating either <tag> or </tag> as record delimiters, so you can just split on them, reject any blank records, then construct a new combined string from the extracted values. This works nicely when you can't count on newlines being record delimiters, and it will generally be tolerant of missing tags.
If you're insistent on a pure regex solution, though, and your data format will always match the given format (one record per line), you can use a negative lookbehind:
text.strip.gsub(/(?<!<\/tag>)(\n|$)/, "</tag>\\1")
One that could work is:
/(<tag>[^\n ]+[^>])\s*\n/
This will match any line that starts with <tag> but has no ">" before the newline. Capturing the line's content lets you re-insert it with the closing tag appended, i.e.
text.gsub!( /(<tag>[^\n ]+[^>])\s*\n/ , "\\1</tag>\n")
For more polishing, try http://rubular.com/
text = '<tag>line 1</tag>
<tag>line2
<tag>line3
line4</tag>
<tag>line5</tag>'
result = ""
text.each_line do |line|
  line.rstrip!
  line << "</tag>" unless line.end_with?("</tag>")
  result << line << "\n"
end
puts result
--output:--
<tag>line 1</tag>
<tag>line2</tag>
<tag>line3</tag>
line4</tag>
<tag>line5</tag>
I am retrieving a large hash of results from a database query and writing them to a CSV file. The code block below takes the results and creates the CSV. With the quote_char: option it will replace the quotes with NULL characters, which I need to properly create the tab-delimited file.
However, the NULL characters are getting converted into "" when they are loaded into their destination, so I would like to remove them. If I leave out quote_char:, every field is double-quoted, which causes the same result.
How can I remove the NULL characters?
begin
  CSV.open("#{file_path}" 'file.tab', "wb", col_sep: "\t", quote_char: "\0") do |csv|
    csv << ["Key", "channel"]
    series_1_results.each_hash do |series_1|
      csv << ["#{series_1['key']}", "#{series_1['channel']}"]
    end
  end
end
As stated in the CSV documentation, you have to set quote_char to some character, and this character will always be used to quote empty fields.
It seems the only solution in this case is to remove the used quote_chars from the created CSV file. You can do it like this:
quoted_file = File.read("#{file_path}" 'file.tab')
unquoted_file = quoted_file.gsub("\0", "")
File.open("#{file_path}" 'unquoted_file.tab', "w") { |file| file.puts unquoted_file }
I assume here that NULLs are the only escaped fields. If that's not the case, use the default quote_char: '"' and gsub(',"",', '') which should handle almost all possible cases of fields containing special characters.
But as you note that the results of your query are large, it might be more practical to prepare the file on your own and avoid processing the output twice. You could simply write:
File.open("#{file_path}" 'unquoted_file.tab', "w") do |file|
  file.puts ["Key", "channel"].join("\t")
  series_1_results.each_hash do |series_1|
    file.puts [series_1['key'], series_1['channel']].join("\t")
  end
end
Once more, you might need to handle fields with special characters.
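For example, a minimal sketch (assuming tabs inside values can simply be flattened to spaces rather than quoted):

file.puts [series_1['key'], series_1['channel']].map { |v| v.to_s.tr("\t", " ") }.join("\t")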
From the Ruby CSV docs, setting force_quotes: false in the options seems to work.
CSV.open("#{file_path}"'file.tab', "wb", { col_sep: "\t", force_quotes: false }) do |csv|
The above does the trick. I'd suggest against setting quote_char to \0 since that doesn't work as expected.
There is one thing to note, though. If a field is an empty string "", it will force the quote_char to be printed into the CSV; strangely, a nil value does not. I'd suggest that if you're expecting empty strings in the data, you somehow convert them into nil when writing to the CSV (maybe using the ActiveSupport presence method or something similar).
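A rough sketch of that conversion (presence requires ActiveSupport; a plain-Ruby fallback is shown too):

# with ActiveSupport:
csv << [series_1['key'].presence, series_1['channel'].presence]
# plain Ruby equivalent:
csv << [series_1['key'], series_1['channel']].map { |v| v == "" ? nil : v }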
First, a tab-separated file is "TSV", vs. a comma-separated file which is "CSV".
Wrapping quotes around fields is necessary any time there could be an occurrence of the field delimiter inside a field.
For instance, how are you going to embed this string in a tab-delimited file?
Foo\tbar
The \t is the representation of an embedded Tab.
The same problem occurs when writing a CSV file with a field containing commas. The field has to be wrapped in double-quotes to delimit the field itself.
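For example, the csv library quotes the field for you precisely because it contains the separator:

require 'csv'
CSV.generate(col_sep: "\t") { |csv| csv << ["Foo\tbar", "baz"] }
#=> "\"Foo\tbar\"\tbaz\n"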
If your input contains any data that needs to be escaped (such as the column separator or the quote character), then you do need to quote your data. Otherwise it cannot be parsed correctly later.
CSV.open('test.csv', 'wb', col_sep: "\t") do |csv|
csv << ["test", "'test'", '"test"', nil, "test\ttest"]
end
puts open('test.csv').read
#test 'test' """test""" "test test"
The CSV class won't quote anything unnecessarily (as you can see above). So I'm not sure why you're saying all your fields are being quoted. It could be that force_quotes is somehow getting set to true somewhere.
If you're absolutely certain your data will never contain \t or ", then the default quote_char (") should work just fine. Otherwise, if you want to avoid quoting anything, you'll need to pick another quote character that you're absolutely certain won't occur in your data.
CSV.open('test.csv', 'wb', col_sep: "\t", quote_char: "|") do |csv|
csv << ["test", "'test'", nil, '"test"']
end
puts open('test.csv').read
#test 'test' "test"
I have a problem with saving records in MongoDB using Mongoid when they contain multibyte characters. This is the string:
a="Chris \xA5\xEB\xAE\xDFe\xA5"
I first convert it to BINARY and I then gsub it like this:
a.force_encoding("BINARY").gsub(0xA5.chr,"oo")
...which works fine:
=> "Chris oo\xEB\xAE\xDFeoo"
But it seems that I can not use the chr method if I use Regexp:
a.force_encoding("BINARY").gsub(/0x....?/.chr,"")
NoMethodError: undefined method `chr' for /0x....?/:Regexp
Anybody with the same issue?
Thanks a lot...
You can do that with interpolation:
a.force_encoding("BINARY").gsub(/#{0xA5.chr}/,"")
gives
"Chris \xEB\xAE\xDFe"
EDIT: based on the comments, here is a version that translates the binary-encoded string to an ASCII representation and does a regex replace on that string:
a.unpack('A*').to_s.gsub(/\\x[A-F0-9]{2}/,"")[2..-3] #=>"Chris "
the [2..-3] at the end is to get rid of the leading [" and the trailing "]
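To see what the slice strips:

["Chris "].to_s        #=> "[\"Chris \"]"
"[\"Chris \"]"[2..-3]  #=> "Chris "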
NOTE: to just get rid of the special characters, you could also just use
a.gsub(/\W/,"") #=> "Chris"
The actual string does not contain the literal characters \xA5: that is just how characters that would otherwise be unprintable are displayed (similarly, when a string contains a newline, Ruby shows you \n).
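For example:

s = "\xA5".force_encoding('BINARY')
s.length   #=> 1 (a single byte, shown escaped because it is unprintable)
s.inspect  #=> "\"\\xA5\""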
If you want to change any non-ASCII stuff, you could do this:
a="Chris \xA5\xEB\xAE\xDFe\xA5"
a.force_encoding('BINARY').encode('ASCII', :invalid => :replace, :undef => :replace, :replace => 'oo')
This starts by forcing the string to the binary encoding (you always want to start with a string whose bytes are valid for its encoding; binary is always valid since it can contain arbitrary bytes). Then it converts it to ASCII. Normally this would raise an error, since there are characters it doesn't know what to do with, but the extra options we've passed tell it to replace invalid/undefined sequences with the characters 'oo'.
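Each undefined byte gets its own replacement, which you can see on a smaller example (same options, '?' as the replacement):

"\xFF\xFEok".force_encoding('BINARY').encode('ASCII', :invalid => :replace, :undef => :replace, :replace => '?')
#=> "??ok"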
What's the best (most efficient) way to parse a tab-delimited file in Ruby?
The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. Something like this would work:
require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:
require 'csv'

line = "boogie\ttime\tis \"now\""

begin
  row = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end

begin
  row = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end

#Output:
# failed to parse line
# parsed correctly with random quote char
If you want to use the CSV library, you could use a random quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler methodology like the StrictTsv class shown below to get the same effect without having to worry about field quotations.
# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath

  def initialize(filepath)
    @filepath = filepath
  end

  def parse
    open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))]
        yield fields
      end
    end
  end
end
# Example Usage
tsv = StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end
The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values
There are actually two different kinds of TSV files.
1. TSV files that are actually CSV files with the delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t")
2. TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line-by-line with a simple line.chomp.split("\t", -1) (note the -1, which prevents split from removing empty trailing fields; chomp rather than rstrip, since rstrip would also strip trailing tabs). If you want to use the csv gem, simply set quote_char to nil:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
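To see why the -1 matters:

"a\tb\t\t\n".chomp.split("\t", -1)  #=> ["a", "b", "", ""]
"a\tb\t\t\n".chomp.split("\t")      #=> ["a", "b"]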
I like mmmries' answer. However, I hate the way that Ruby strips any empty values off the end of a split, and it wasn't stripping the newline off the end of the lines, either.
Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:
def parse
  open(filepath) do |f|
    headers = f.gets.strip.split("\t")
    f.each do |line|
      myline = line
      while myline.scan(/\t/).count != headers.count - 1
        myline += f.gets
      end
      fields = Hash[headers.zip(myline.chomp.split("\t", headers.count))]
      yield fields
    end
  end
end
This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).
I have a hash named hsh whose values are UTF-8 encoded. For example:
hsh = {:name => some_utf_8_string, :text => some_other_utf_8_string}
I am currently doing the following:
$KCODE = "UTF8"
File.open("save.tsv", "w") do |file|
  file.puts hsh.values.map { |x| x.to_s.gsub("\t", ' ') }.join("\t")
end
But this croaks randomly, because I think some of the multibyte content sort of matches "\t" and it fails. Is there a recommended string I can use instead of "\t", and is there a better way of doing the above?
If your data is valid UTF-8, there is no way for a tab character to "sort of" match part of a multibyte sequence (this is one of the advantages of UTF-8 over some other multibyte encodings). Can you go into more detail about what you mean by "croak"?
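A quick illustration: in UTF-8, every byte of a multibyte character has its high bit set, so a tab byte (0x09) can never be a fragment of one:

"日本語".bytes.all? { |b| b >= 0x80 }  #=> true
"日本語\t".bytes.count(0x09)           #=> 1 (only the real tab)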