quote_char causing fits in Ruby CSV import

I have a simple CSV file that uses the | (pipe) as a quote character. After upgrading my Rails app from Ruby 1.9.2 to 1.9.3 I'm getting a "CSV::MalformedCSVError: Missing or stray quote in line 1" error.
If I pop open vim and replace the | with regular quotes, single quotes or even "=", the file works fine, but | and * result in the error. Does anyone have any thoughts on what might be causing this? Here's a simple one-liner that reproduces the error:
@csv = CSV.read("public/sample_file.csv", {quote_char: '|', headers: false})
I also reproduced this in Ruby 2.0 and in irb without loading Rails.
Edit: here are some sample lines from the CSV
|076N102 |,|CARD |,| 1|,|NEW|,|PCS |
|07-1801 |,|BASE |,| 18|,|NEW|,|PCS |

I think you've just discovered a bug in CSV ruby module.
From csv.rb :
1587: @re_chars = /#{%"[-][\\.^$?*+{}()|# \r\n\t\f\v]".encode(@encoding)}/
This Regexp is used to escape characters conflicting with special regular expression symbols, including your "pipe" char | .
At first I didn't see any reason for the leading [-], and if you remove it, your example starts to work.
Edit: a hyphen inside a character class (surrounded by brackets []) has to be escaped only when it is not the leading character, so I had to update the fixed Regexp:
1587: @re_chars = /#{%"(?<!\\[)-(?=.*\\])|[\\.^$?*+{}()|# \r\n\t\f\v]".encode(@encoding)}/
CSV.read('sample.csv', {quote_char: '|'})
# [["076N102 ",
# "CARD ",
# " 1", "NEW", "PCS "],
# ["07-1801 ",
# "BASE ",
# " 18", "NEW", "PCS "]]
Since most regex engines, Ruby's included, do not support lookbehind expressions with quantifiers, I had to write it as a negative lookbehind for the left bracket. It also matches hyphens where the left bracket of a pair is missing. If you find a better solution, please leave a comment.
I'd be glad to hear any comments before I file a bug report at ruby-lang.org.
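Until the library itself is fixed, one possible workaround is to swap the pipe for the default double quote before parsing. This is only a sketch, assuming no field ever contains a literal " character; the sample rows are the ones from the question.

```ruby
require "csv"

# Sample rows copied from the question; '|' plays the role of the quote character.
data = <<~CSV
  |076N102 |,|CARD |,| 1|,|NEW|,|PCS |
  |07-1801 |,|BASE |,| 18|,|NEW|,|PCS |
CSV

# Assumption: the data contains no literal double quotes, so the swap is safe.
rows = CSV.parse(data.tr("|", '"'))

rows.each { |row| p row }
```

The trade-off is that the whole input is rewritten before parsing, so this only suits files where " cannot appear as data.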

Related

Why does File.dirname return a period when I expect a path?

I am trying to get the directory of a file on a Windows box using File.dirname. I get the file path ("file1" below) from the Windows box and return it to the Mac OS X box that the script runs on.
The literal assignment below simplifies my example; in practice I get this value from the remote box and return it to my development box:
file1 = "C:\Administrator\proj1\testFile.txt"
path = "#{File.dirname(file1)}"
puts "#{path}"
>> .
I am confused about why it would return '.'. I saw on ruby-doc.org that File.dirname says the following:
"Returns all components of the filename given in file_name except the last one. The filename can be formed using both File::SEPARATOR and File::ALT_SEPARATOR as the separator when File::ALT_SEPARATOR is not nil."
I did a puts on File::SEPARATOR and File::ALT_SEPARATOR and got the following:
File::SEPARATOR >> /
File::ALT_SEPARATOR >>
I assumed it was because "\" wasn't a valid file separator. So I set File::ALT_SEPARATOR to "\". However, even after that, I still got the same value when I puts path.
I tried using File.realdirpath and this was the result:
file1 = "C:\Administrator\proj1\testFile.txt"
path = "#{File.realdirpath(file1)}"
puts "#{path}"
>> /Users/me/myProject/C:\Administrator\proj1\testFile.txt
It seemed to add the path from where I called the Ruby script and appended the full path (including the file name). Seems to be odd behavior.
Any ideas, comments or suggestions would be great.
The problem is that when you declare file1, those backslashes define escape characters. Notice the return:
file1 = "C:\Administrator\proj1\testFile.txt"
=> "C:Administratorproj1\testFile.txt"
If you want to store a filepath in a string, you either need to use forward slashes or double backslashes (to escape the escape character):
file1 = "C:\\Administrator\\proj1\\testFile.txt"
file1 = "C:/Administrator/proj1/testFile.txt"
Okay, I was able to duplicate this problem as well.
As @fbonetti pointed out, you have to enclose your path in single quotes to keep Ruby from interpreting the backslashes as escapes, so start with that...
>> file1='C:\Administrator\proj1\testFile.txt'
=> "C:\\Administrator\\proj1\\testFile.txt"
Then, passing file1 through gsub to 'normalize' the slashes, gives you the results you're expecting.
>> File.dirname(file1.gsub('\\', '/'))
=> "C:/Administrator/proj1"
Of course, you could always reverse the gsub if you needed them to be backslashes again.
>> File.dirname(file1.gsub('\\', '/')).gsub('/', '\\')
=> "C:\\Administrator\\proj1"
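Put together, the steps above might look like the following sketch. The variable names normalized, dir and name are only illustrative, and the path is the one from the question.

```ruby
# Single quotes keep the backslashes literal.
file1 = 'C:\Administrator\proj1\testFile.txt'

# Convert Windows separators to forward slashes before asking for the directory.
normalized = file1.gsub('\\', '/')

dir  = File.dirname(normalized)   # directory part
name = File.basename(normalized)  # last path component

puts dir
puts name
```

Normalizing once and keeping forward slashes from then on avoids having to convert back and forth.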
I figured it out. It was an issue with the version of Ruby I was using. I was using ruby 1.9.3 and then I switched to jruby 1.7.3 and it works correctly now.
Ruby's IO documentation is of great help when dealing with different OS path separators. From the documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb". When specifying a Windows-style filename in a Ruby string, remember to escape the backslashes:
"c:\\gumby\\ruby\\test.rb"
Our examples here will use the Unix-style forward slashes; File::ALT_SEPARATOR can be used to get the platform-specific separator character.
So, in other words, you don't need to hassle with backslashes, and whether you need to use single or double-quotes. Keep it simple, and use forward-slashes and let Ruby worry about it. That way your code is portable across *nix/Mac OS and Windows.
Beyond that, it looks like there's a real need to learn how character escaping works in double-quoted strings vs. single-quoted strings. This is from "Programming Ruby":
Ruby provides a number of mechanisms for creating literal strings. Each generates objects of type String. The different mechanisms vary in terms of how a string is delimited and how much substitution is done on the literal's content.
Single-quoted string literals (' stuff ' and %q/stuff/) undergo the least substitution. Both convert the sequence \\ into a single backslash, and the form with single quotes converts \' into a single quote.
'hello' » hello
'a backslash \'\\\'' » a backslash '\'
%q/simple string/ » simple string
%q(nesting (really) works) » nesting (really) works
%q no_blanks_here ; » no_blanks_here
Double-quoted strings ("stuff", %Q/stuff/, and %/stuff/) undergo additional substitutions, shown in Table 18.2 on page 203.
Substitutions in double-quoted strings
\a  Bell/alert (0x07)     \nnn     Octal nnn
\b  Backspace (0x08)      \xnn     Hex nn
\e  Escape (0x1b)         \cx      Control-x
\f  Formfeed (0x0c)       \C-x     Control-x
\n  Newline (0x0a)        \M-x     Meta-x
\r  Return (0x0d)         \M-\C-x  Meta-control-x
\s  Space (0x20)          \x       x
\t  Tab (0x09)            #{expr}  Value of expr
\v  Vertical tab (0x0b)
a = 123
"\123mile" » Smile
"Say \"Hello\"" » Say "Hello"
%Q!"I said 'nuts'," I said! » "I said 'nuts'," I said
%Q{Try #{a + 1}, not #{a - 1}} » Try 124, not 122
%<Try #{a + 1}, not #{a - 1}> » Try 124, not 122
"Try #{a + 1}, not #{a - 1}" » Try 124, not 122
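Applied to a Windows-style path, the difference between the two quoting styles can be seen in a small sketch like this one. The path is made up for illustration.

```ruby
single      = 'C:\temp\new'     # backslashes stay literal
escaped     = "C:\\temp\\new"   # each \\ becomes one backslash
interpreted = "C:\temp\new"     # \t becomes a tab, \n becomes a newline

puts single == escaped      # both spell the same 11-character path
puts interpreted.inspect    # shows the tab and newline that replaced \t and \n
puts interpreted.length     # shorter, because two escapes collapsed to one char each
```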

Replace "OS agnostic" newlines

I have several different document formats coming in. I'd like to strip out all the newlines and replace them with a " ". How can I account for newlines other than "\n"?
Something like s.gsub("\n", " ")
Most operating systems use \n or \r (or a combination) for newlines.
s.gsub(/[\n\r]+/, " ") should do the trick.
/[\n\r]+/ is known as a regular expression. It matches \n, \r and any combination of the two.
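As a quick check, here is a small sketch of that substitution on strings that mix the three common line-break styles. The sample strings are invented for illustration. Note that the + collapses a run of consecutive breaks (such as a blank line) into a single space; an alternation can be used instead when each break should produce its own space.

```ruby
text = "one\r\ntwo\nthree\rfour"

# Any run of \r and \n characters becomes one space.
collapsed = text.gsub(/[\n\r]+/, " ")
puts collapsed

blank_line = "a\r\n\r\nb"

# Character class with +: the blank line disappears entirely.
one_space = blank_line.gsub(/[\n\r]+/, " ")

# Alternation: every individual line break becomes its own space.
two_spaces = blank_line.gsub(/\r\n|\r|\n/, " ")

puts one_space.inspect
puts two_spaces.inspect
```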
To make your code more readable, however, you could use my gem.
You can install it this way:
gem install linebreak
You can use it this way:
require 'aef/linebreak/string_extension'
"Something\n".linebreak_encode(" ")
# => "Something "
Other examples:
"Something\n".linebreak_encode(:windows)
# => "Something\r\n"
"Something\r\n".linebreak_encode(:unix)
# => "Something\n"
It additionally comes with a commandline tool. Documentation can be found here.

Rubular/Ruby discrepancy in captured text

I've carefully cut and pasted from this Rubular window http://rubular.com/r/YH8Qj2EY9j to my code, yet I get different results. The Rubular match capture is what I want. Yet
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
only gets me the first line, i.e.
<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
I don't think it's my test data, but that's possible. What am I missing?
(ruby 1.9 on Ubuntu 10.10)
Paste your test data into an editor that is able to display control characters and verify your line break characters. Normally it should be only \n on a Linux system as in your regex. (I had unusual linebreaks a few weeks ago and don't know why.)
The other check you can do is to change your brackets and print your capturing groups, so that you can see which part of your regex matches what.
/^<DD>(.*)\n?(.*)\n/
Another idea to get this to work is to change the .*: instead of saying "match any character", say "match anything except \n".
^<DD>([^\n]*\n?[^\n]*)\n
I believe you need the multiline modifier in your code:
/m Multiline mode: dot matches newlines, ^ and $ both match line starts and endings.
The following:
#!/usr/bin/env ruby
desc= '<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
<DT>la la this should not be matched oh good'
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
prints
#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
on my system (Linux, Ruby 1.8.7).
Perhaps your line breaks are really \r\n (Windows style)? What if you try:
desc_pattern = /^<DD>(.*\r?\n?.*)\r?\n/
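If no editor that shows control characters is at hand, String#inspect reveals the line breaks just as well. The sketch below uses made-up data with Windows line endings, normalizes them to \n, and then applies the pattern from the question unchanged.

```ruby
# Hypothetical test data with Windows-style \r\n line endings.
desc = "<DD>first line<br />\r\nsecond line\r\n<DT>this should not be matched"

# inspect prints control characters as visible escapes such as \r and \n.
puts desc.inspect

# Normalize \r\n and lone \r to \n before matching.
normalized = desc.gsub(/\r\n?/, "\n")

desc_pattern = /^<DD>(.*\n?.*)\n/
description = normalized[desc_pattern, 1]   # first capture group, or nil
puts description
```

Normalizing first keeps stray \r characters out of the captured text, which a pattern that merely tolerates \r would let through.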

in Ruby, trying to convert those weird quotes into "regular" quotes

I am trying to parse a text file that has the weird quotes like
“ and ” into "normal" quotes like "
I tried this:
text.gsub!("“",'"')
text.gsub!("”",'"')
but when it's done, they are still there and show up as
\x93 and \x94
so I tried adding that too with no luck:
text.gsub!('\\x93', '"')
text.gsub!('\\x94', '"')
The problem is, when I try to show those weird quotes on a webpage, it makes that weird diamond with a question mark symbol: �
It seems to work:
text = "“foo”"
=> "\342\200\234foo\342\200\235"
irb(main):002:0> text.gsub!("“",'"')
=> "\"foo\342\200\235"
irb(main):003:0> text.gsub!("”",'"')
=> "\"foo\""
You need to use a hex editor to figure out all the character codes involved.
Re: the second question of why the weird quotes show on a web page as the � symbol:
Your problem is that your web page is not in UTF-8 mode. To get it there, see
http://www.w3.org/International/O-HTTP-charset
If you can't change your web server, add a meta line in the head section of your web pages: http://www.utf-8.com/
Larry
Your first gsubs should work. The reason the second set of gsubs doesn't work is that you're using single quotes and a double backslash. Try the other way around:
text.gsub!("\x93", '"')
text.gsub!("\x94", '"')
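You can also do this in one line. Use gsub rather than gsub! when chaining, because gsub! returns nil when it makes no substitution:
text = text.gsub("\x93", '"').gsub("\x94", '"')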
# or
text.gsub!(/(\x93|\x94)/, '"')
Are you sure the encoding of the string is correct?
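If the bytes really are \x93 and \x94, the text is probably Windows-1252 rather than UTF-8. That is an assumption about the input file, not something the question confirms. Under that assumption, a sketch of the conversion could look like this:

```ruby
# Hypothetical input: bytes 0x93 and 0x94 are the curly quotes in Windows-1252.
raw = "\x93foo\x94".b.force_encoding("Windows-1252")

# Transcode to UTF-8 so the quotes become U+201C and U+201D.
utf8 = raw.encode("UTF-8")

# Replace both curly quotes with a plain double quote.
plain = utf8.gsub(/[\u201C\u201D]/, '"')

puts plain
```

Transcoding first also means the remaining text can be served as UTF-8 without producing the � replacement character.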

Ruby RegEx problem text.gsub[^\W-], '') fails

I'm trying to learn RegEx in Ruby, based on what I'm reading in "The Rails Way". But, even this simple example has me stumped. I can't tell if it is a typo or not:
text.gsub(/\s/, "-").gsub([^\W-], '').downcase
It seems to me that this would replace all spaces with -, then anywhere a string starts with a non letter or number followed by a dash, replace that with ''. But, using irb, it fails first on ^:
syntax error, unexpected '^', expecting ']'
If I take out the ^, it fails again on the W.
>> text = "I love spaces"
=> "I love spaces"
>> text.gsub(/\s/, "-").gsub(/[^\W-]/, '').downcase
=> "--"
Missing //
Although this makes a little more sense :-)
>> text.gsub(/\s/, "-").gsub(/([^\W-])/, '\1').downcase
=> "i-love-spaces"
And this is probably what is meant
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
\W means "a non-word character"
\w means "a word character" (a letter, digit or underscore)
The // generate a regexp object
/[^\W-]/.class
=> Regexp
Step 1: Add this to your bookmarks. Whenever I need to look up regexes, it's my first stop
Step 2: Let's walk through your code
text.gsub(/\s/, "-")
You're calling the gsub function, and giving it 2 parameters.
The first parameter is /\s/, which is Ruby for "create a new regexp containing \s" (the // are like special "" for regexes).
The second parameter is the string "-".
This will therefore replace all whitespace characters with hyphens. So far, so good.
.gsub([^\W-], '').downcase
Next you call gsub again, passing it 2 parameters.
The first parameter is [^\W-]. Because we didn't enclose it in forward slashes, Ruby will literally try to run that code. [] creates an array, then it tries to put ^\W- into the array, which is not valid code, so it breaks.
Changing it to /[^\W-]/ gives us a valid regex.
Looking at the regex, the [] says 'match any character in this group', and the leading ^ negates the group. The group contains \W (which means non-word character) and -, so the regex matches any character that is neither a non-word character nor a hyphen, in other words any word character.
As the second thing you pass to gsub is an empty string, it ends up replacing all the word characters with an empty string (thereby stripping them out and leaving only the non-word characters and hyphens).
.downcase
Which just converts the string to lower case.
Hope this helps :-)
You forgot the slashes. It should be /[^\W-]/
Well, .gsub(/[^\W-]/,'') says: replace anything that is neither a non-word character nor a - with nothing. In other words, it removes every word character.
You probably want
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
Lower case \w (\W is just the opposite)
The slashes are to say that the thing between them is a regular expression, much like quotes say the thing between them is a string.
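To see the two character classes side by side, here is a short sketch. The second input string is invented to show punctuation being stripped.

```ruby
text = "I love spaces"
hyphenated = text.gsub(/\s/, "-")

# [^\W-] matches word characters, so everything except the hyphens is removed.
wrong = hyphenated.gsub(/[^\W-]/, "")

# [^\w-] matches anything that is neither a word character nor a hyphen.
right = hyphenated.gsub(/[^\w-]/, "").downcase

puts wrong
puts right

# With punctuation in the input, only the lowercase \w version gives a usable slug.
slug = "Ruby & Rails: 101!".gsub(/\s/, "-").gsub(/[^\w-]/, "").downcase
puts slug
```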

Resources