In a malformed .csv file, there is a row of data with extra double quotes, e.g. the last line:
Name,Comment
"Peter","Nice singer"
"Paul","Love "folk" songs"
How can I replace the double quotes around folk with underscores, so that the file becomes:
Name,Comment
"Peter","Nice singer"
"Paul","Love _folk_ songs"
In Ruby 1.9, the following works:
result = subject.gsub(/(?<!^|,)"(?!,|$)/, '_')
Previous versions don't have lookbehind assertions.
Explanation:
(?<!^|,) # Assert that we're not at the start of the line or right after a comma
" # Match a quote
(?!,|$) # Assert that we're not at the end of the line or right before a comma
Of course this assumes that we won't run into pathological cases like
"Mary",""Oh," she said"
If you're not on Ruby 1.9, or just get tired of regexes sometimes, split the string on commas, strip the first and last quote from each field, replace the remaining " characters with _, re-quote, and join with commas.
(We don't always have to worry about efficiency!)
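The split/strip/replace/re-quote steps above could be sketched roughly as follows. This is only an illustration: requote_line is a made-up name, and the sketch assumes that no field contains a comma, as in the sample data.

```ruby
# Hypothetical helper (not from the original answer): repairs one CSV line
# whose fields contain stray double quotes.
# Assumption: fields never contain commas, so splitting on "," is safe.
def requote_line(line)
  line.chomp.split(',').map do |field|
    # Strip the outer quotes, if present.
    inner = field.sub(/\A"/, '').sub(/"\z/, '')
    # Replace any remaining quotes with underscores, then re-quote.
    '"' + inner.gsub('"', '_') + '"'
  end.join(',')
end

puts requote_line('"Paul","Love "folk" songs"')
# prints "Paul","Love _folk_ songs"
```

Lines that are already well formed pass through unchanged, so the helper can be mapped over every line of the file.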
$str = '"folk"';
$new = str_replace('"', '', $str);
/* now $new is only folk, without " */
Meta-strategy:
The data was probably entered by hand and inconsistently. CSVs get messy when people type either the field delimiter (double quote) or the separator (comma) into the field itself. If you can have the file regenerated, ask for an extremely unlikely field begin/end marker, such as five tildes (~~~~~); then you can split on "~~~~~,~~~~~" and get the correct number of fields every time.
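As a rough sketch of that idea in Ruby, assuming the regenerated file really does wrap every field in ~~~~~ (split_marked is a hypothetical helper name, and delete_prefix/delete_suffix need Ruby 2.5 or later):

```ruby
# Hypothetical helper: splits a line whose fields are wrapped in an
# unlikely marker instead of double quotes.
def split_marked(line, marker = '~~~~~')
  # Drop the marker that opens the first field and closes the last one.
  body = line.chomp.delete_prefix(marker).delete_suffix(marker)
  # Split on "marker,marker"; the -1 limit keeps trailing empty fields.
  body.split("#{marker},#{marker}", -1)
end

fields = split_marked('~~~~~Paul~~~~~,~~~~~Love "folk" songs, really~~~~~')
p fields
# prints ["Paul", "Love \"folk\" songs, really"]
```

Quotes and commas inside a field no longer matter, because neither is part of the separator.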
Unless you have no other choice, get the file regenerated with correct escaping. Any other approach is asking for trouble, because the insertion of unescaped quotes is lossy, and thus cannot be reliably reversed.
If you can't get the file fixed from the source, then Tim Pietzcker's regex is better than nothing, but I strongly recommend that you have your script print all "fixed" lines and check them for errors manually.
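One way to make that manual check easier is to print only the lines that the fix actually changed. The sketch below is an assumption-laden illustration: report_changes is a made-up name, and the block passed to it is a stand-in for whatever fix you really use.

```ruby
# Hypothetical helper: yields each line to a fixer block and returns a
# report entry for every line the fixer changed.
def report_changes(lines)
  lines.each_with_index.map do |line, index|
    fixed = yield(line)
    # Unchanged lines produce nil and are dropped by compact below.
    "#{index + 1}: #{line} => #{fixed}" unless fixed == line
  end.compact
end

sample = ['"Peter","Nice singer"', '"Paul","Love "folk" songs"']
# Stand-in fixer for illustration only; substitute your real repair code.
report = report_changes(sample) { |l| l.sub('"folk"', '_folk_') }
puts report
# prints 2: "Paul","Love "folk" songs" => "Paul","Love _folk_ songs"
```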
Related
I was surprised that the following is a valid Parameter Expansion. Notice there are unescaped double quotes within double quotes:
result="${var1#"$var2"}"
Can someone please parse this for me?
There are double quotes nested in curly brackets which is OK.
But none of them is needed in this case.
result=${var1#$var2}
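works the same even for values containing spaces and newlines. The inner quotes only make a difference when var2 contains glob characters such as * or ?: quoted, they are matched literally; unquoted, they act as a pattern.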
The answer is that they get parsed separately. Let's take a simplified tour of the string.
result="${var1#"$var2"}" doesn't actually need any quotes in this case, but look over the string anyway...
result="...
The Parser says meh, it's an assignment, I know what to do with this, I'll ignore these, they aren't hurting anything, but now I have to find the terminating match. Then it reads the value after the quote, byte by byte, looking for the terminating double-quote. This starts a new context-1.
result="${...
Once it sees the open curly, it knows that the terminating quote cannot happen until it sees the matching closing curly. It starts a new context-2.
result="${var1#"...
Seeing a new double quote in this subcontext makes this one the opening quote of a new internal context-3.
result="${var1#"$var2"...
When it sees this double-quote it matches it to the previous one, closing context-3, dropping back into context-2.
result="${var1#"$var2"}...
This close-curly allows it to close the still-open context-2, dropping back into context-1.
result="${var1#"$var2"}"
And finding this now-closing double-quote allows it to close context-1. The following newline may be used as a terminating character for the entire term, so it can be evaluated and assigned.
Backslash-quoting the internal double quotes, for example, would have made them part of the pattern used for the prefix trim (# removes a match from the start of the value), which would then likely fail to match.
$: var1=aaa
$: var2=a
$: result="${var1#"$var2"}"
$: echo $result # does what you want/expect
aa
$: result="${var1#\"$var2\"}" # does NOT
$: echo $result
aaa
Doing it without the quotes, the parser knows this is an assignment and handles the values a little differently as mentioned in comments, but generally kinda treating them as if they were quoted.
$: result=${var1#$var2}
$: echo $result
aa
This means it doesn't have to deal with context-1 or context-3, and only has the curlies to worry about. The end result is the same.
Better?
This is related to cleaning files before parsing them elsewhere, namely, malformed/ugly CSV. I see plenty of examples for removing/matching all characters between certain strings/characters/delimiters, but I cannot find any for specific strings. Example portion of line would look something like:
","Should now be allowed by rule above "Server - Access" added by Rich"\r
To be clear, this is not the entire line, but the entire line is enclosed in quotes and separated by "," and ends in ^M (Windows newline/carriage return). The 'columns' preceding this would be enclosed at each side by ",". I would probably use this too to remove cruft that appears earlier in the line.
What I am trying to get to is the removal of all double quotes between "," and "\r ("Server - Access" - these ones) without removing the delimiters. Alternatively, I may just find and replace them with \" to delimit them for the Ruby CSV library. So far I have this:
(?<=",").*?(?="\\r)
Which basically matches everything between the delimiters. If I replace .*? with anything, be that a letter, double quotes etc, I get zero matches. What am I doing wrong?
Note: This should be Ruby compatible please.
If I understand you correctly, you can use negative lookahead and lookbehind:
text = '","Should now be allowed by rule above "Server - Access" added by Rich"\r'
puts text.gsub(/(?<!,)"(?![,\\r])/, '\"')
# ","Should now be allowed by rule above \"Server - Access\" added by Rich"\r
Of course, this won't work if the values themselves can contain commas and newlines...
Why does this string not split on each "\n"? (RUBY)
"ADVERTISING [7310]\n\t\tIRS NUMBER:\t\t\t\t061340408\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t0331\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-Q\n\t\tSEC ACT:\t\t1934 Act\n\t".split('\n')
>> ["ADVERTISING [7310]\n\t\tIRS NUMBER:\t\t\t\t061340408\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t0331\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-Q\n\t\tSEC ACT:\t\t1934 Act\n\t"]
You need .split("\n"). Escape sequences such as \n are only interpreted inside double-quoted strings, so the argument has to be written with double quotes to mean an actual newline.
In Ruby, single quotes around a string mean that escape sequences are not interpreted (apart from \\ and \'). This is unlike C, where single quotes denote a single character. In this case '\n' is actually equivalent to "\\n".
So if you want to split on \n you need to change your code to use double quotes.
.split("\n")
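A short illustration of the difference, using a shortened made-up sample string rather than the full text from the question:

```ruby
single = '\n'   # two characters: a backslash and the letter n
double = "\n"   # one character: a newline

puts single.length  # prints 2
puts double.length  # prints 1

text = "FORM TYPE:\t10-Q\nSEC ACT:\t1934 Act"

# The single-quoted separator never occurs in the text, so nothing is split.
p text.split('\n').size  # prints 1

# The double-quoted separator is a real newline.
p text.split("\n")       # prints ["FORM TYPE:\t10-Q", "SEC ACT:\t1934 Act"]
```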
Ruby has the methods String#each_line and String#lines
returns an enum:
http://www.ruby-doc.org/core-1.9.3/String.html#method-i-each_line
returns an array:
http://www.ruby-doc.org/core-2.1.2/String.html#method-i-lines
I didn't test it against your scenario but I bet it will work better than manually choosing the newline chars.
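For what it's worth, a minimal sketch of both methods on a shortened, made-up sample (not the full string from the question):

```ruby
text = "IRS NUMBER:\t061340408\nFISCAL YEAR END:\t0331\n"

# String#lines returns an array; each element keeps its trailing newline.
p text.lines.first            # prints "IRS NUMBER:\t061340408\n"

# Chomp each line to drop the newline.
rows = text.lines.map(&:chomp)
p rows                        # prints ["IRS NUMBER:\t061340408", "FISCAL YEAR END:\t0331"]

# String#each_line without a block returns an Enumerator.
enum = text.each_line
p enum.class                  # prints Enumerator
```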
Or a regular expression
.split(/\n/)
You can't use single quotes for this:
"ADVERTISING [7310]\n\t\tIRS NUMBER:\t\t\t\t061340408\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t0331\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-Q\n\t\tSEC ACT:\t\t1934 Act\n\t".split("\n")
Hey, I'm trying to use a regex to count the number of quotes in a string that are not preceded by a backslash.
for example the following string:
"\"Some text
"\"Some \"text
The code I have was previously using String#count('"'),
which is obviously not good enough.
When I count the quotes in both of these examples, I need the result to be just 1.
I have been searching here for similar questions and I've tried using lookbehinds, but cannot get them to work in Ruby.
I have tried the following regexes on Rubular from this previous question
/[^\\]"/
^"((?<!\\)[^"]+)"
^"([^"]|(?<!\)\\")"
None of them gives me the results I'm after.
Maybe a regex is not the way to do this. Maybe a programmatic approach is the solution.
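How about string.count('"') - string.scan('\\"').size? (String#count takes a set of characters rather than a substring, so it cannot count the two-character sequence \" directly; scan can.)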
result = subject.scan(
/(?: # match either
^ # start-of-string\/line
| # or
\G # the position where the previous match ended
| # or
[^\\] # one non-backslash character
) # then
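(?:\\\\)* # match an even number of backslashes (0 is even, too)
" # match a quote
/x)
gives you an array of all quote characters (each possibly with a preceding non-backslash character and pairs of backslashes) except escaped ones. The backslash group is non-capturing so that scan returns the matches themselves rather than the group's contents.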
The \G anchor is needed to match successive quotes, and the (\\\\)* makes sure that backslashes are only counted as escaping characters if they occur in odd numbers before the quote (to take Amarghosh's correct caveat into account).
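If a programmatic approach is preferred over a regex, the same rule (a backslash escapes whatever character follows it) could be sketched as a small loop. This is an alternative illustration, not the method from the answers above, and count_unescaped_quotes is a made-up name.

```ruby
# Hypothetical helper: counts double quotes that are not escaped by a
# backslash. A backslash escapes the next character, so \\" (an escaped
# backslash followed by a quote) still counts as an unescaped quote.
def count_unescaped_quotes(str)
  count = 0
  escaped = false
  str.each_char do |ch|
    if escaped
      escaped = false      # this character was escaped; skip it
    elsif ch == '\\'
      escaped = true       # the next character is escaped
    elsif ch == '"'
      count += 1
    end
  end
  count
end

# Single-quoted literals keep the backslashes exactly as typed.
puts count_unescaped_quotes('"\"Some text')    # prints 1
puts count_unescaped_quotes('"\"Some \"text')  # prints 1
```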
I'm trying to learn RegEx in Ruby, based on what I'm reading in "The Rails Way". But, even this simple example has me stumped. I can't tell if it is a typo or not:
text.gsub(/\s/, "-").gsub([^\W-], '').downcase
It seems to me that this would replace all spaces with -, then anywhere a string starts with a non letter or number followed by a dash, replace that with ''. But, using irb, it fails first on ^:
syntax error, unexpected '^', expecting ']'
If I take out the ^, it fails again on the W.
>> text = "I love spaces"
=> "I love spaces"
>> text.gsub(/\s/, "-").gsub(/[^\W-]/, '').downcase
=> "--"
Missing //
Although this makes a little more sense :-)
>> text.gsub(/\s/, "-").gsub(/([^\W-])/, '\1').downcase
=> "i-love-spaces"
And this is probably what is meant
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
\W means "non-word character"
\w means "word character" (a letter, digit, or underscore)
The // generate a regexp object
/[^\W-]/.class
=> Regexp
Step 1: Add this to your bookmarks. Whenever I need to look up regexes, it's my first stop
Step 2: Let's walk through your code
text.gsub(/\s/, "-")
You're calling the gsub function, and giving it 2 parameters.
The first parameter is /\s/, which is Ruby for "create a new regexp containing \s" (the // are like special "" for regexes).
The second parameter is the string "-".
This will therefore replace all whitespace characters with hyphens. So far, so good.
.gsub([^\W-], '').downcase
Next you call gsub again, passing it 2 parameters.
The first parameter is [^\W-]. Because we didn't enclose it in forward slashes, Ruby will literally try to run that code. [] creates an array, then it tries to put ^\W- into the array, which is not valid code, so it breaks.
Changing it to /[^\W-]/ gives us a valid regex.
Looking at the regex, the [] says 'match any character in this group', and the leading ^ negates the group. The group contains \W (which means non-word character) and -, so the regex matches any character that is neither a non-word character nor a hyphen, i.e. any word character.
As the second thing you pass to gsub is an empty string, it ends up replacing all the word characters with the empty string (thereby stripping them out and leaving only the hyphens and other non-word characters).
.downcase
Which just converts the string to lower case.
Hope this helps :-)
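To see what each character class actually matches, String#scan can list the matching characters. The sample string here is made up for illustration:

```ruby
sample = "I-love spaces!"

# [^\W-] is "not a non-word character and not a hyphen": the word characters.
p sample.scan(/[^\W-]/).join  # prints "Ilovespaces"

# [^\w-] is "not a word character and not a hyphen": everything else.
p sample.scan(/[^\w-]/)       # prints [" ", "!"]
```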
You forgot the slashes. It should be /[^\W-]/
Well, .gsub(/[^\W-]/,'') says replace anything that's neither a non-word character nor a - with nothing.
You probably want
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
Lower case \w (\W is just the opposite)
The slashes are to say that the thing between them is a regular expression, much like quotes say the thing between them is a string.
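Putting the corrected pieces together, a minimal sketch wrapped in a method (slugify is a made-up name here, not something taken from the book):

```ruby
# Hypothetical helper: whitespace becomes hyphens, everything that is not
# a word character or a hyphen is dropped, and the result is lowercased.
def slugify(text)
  text.gsub(/\s/, '-').gsub(/[^\w-]/, '').downcase
end

puts slugify("I love spaces")   # prints i-love-spaces
puts slugify("Hello, World!")   # prints hello-world
```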