In Ruby, Does Dividing a String with Backslash Represent Implicit Concatenation? - ruby

Consider dividing a String in Ruby as follows:
"Some content "\
"and some more "\
"and the final line."
Does the above create three String objects and concatenate them together? Or does Ruby treat that as one string, which is merely visually split by the backslash character for developer convenience (that is, the backslash is not an operand / method) ?
Also, is there a word / term for dividing a string in this way, or is it simply another way to concatenate?

Automatic Concatenation of String Literals
Technically, you are creating three string literals, which are then automatically concatenated by the interpreter. This is documented here, where is says:
Adjacent string literals are automatically concatenated by the interpreter..Any combination of adjacent single-quote, double-quote, percent strings will be concatenated as long as a percent-string is not last.
In your specific example, escaping the newlines at the end of each line of code makes them "adjacent" for the purposes of automatic concatenation. It ignores the whitespace between strings, so the following are also logically equivalent to your original example:
"Some content "'and some more '"and the final line."
#=> "Some content and some more and the final line."
"Some content " "and some more " "and the final line."
#=> "Some content and some more and the final line."

Related

Unable to substitute escaped characters in string

I have this string:
str = "no,\"contact_last_name\",\"token\""
=> "no,\"contact_last_name\",\"token\""
I want to remove the escaped double quoted string character \". I use gsub:
result = str.gsub('\\"','')
=> "no,\"contact_last_name\",\"token\""
It appears that the string has not substituted the double quote escape characters in the string.
Why am I trying to do this? I have this csv file:
no,"contact_last_name","token",company,urbanization,sec-"property_address","property_address",city-state-zip,ase,oel,presorttrayid,presortdate,imbno,encodedimbno,fca,"property_city","property_state","property_zip"
1,MARIE A JEANTY,1083123,,,,17 SW 6TH AVE,DANIA BEACH FL 33004-3260,Electronic Service Requested,,T00215,12/14/2016,00-314-901373799-105112-33004-3260-17,TATTTADTATTDDDTTFDDFATFTDDDTTFADTTDFAAADDATDAATTFDTDFTTAFFTTATFFF,017,DANIA BEACH,FL, 33004-3260
When I try to open it with CSV, I get the following error:
CSV.foreach(path, headers: true) do |row|
end
CSV::MalformedCSVError: Illegal quoting in line 1.
Once I removed those double quoted strings in the first row (the header), the error went away. So I am trying to remove those double quoted strings before I run it through CSV:
file = File.open "file.csv"
contents = file.read
"no,\"contact_last_name\",\"token\" ... "
contents.gsub!('\\"','')
So again my question is why is gsub not removing the specified characters? Note that this actuall does work:
contents.gsub /"/, ""
as if the string is ignoring the \ character.
There is no escaped double quote in this string:
"no,\"contact_last_name\",\"token\""
The interpreter recognizes the text above as a string because it is enclosed in double quotes. And because of the same reason, the double quotes embedded in the string must be escaped; otherwise they signal the end of the string.
The enclosing double quote characters are part of the language, not part of the string. The use of backslash (\) as an escape character is also the language's way to put inside a string characters that otherwise have special meaning (double quotes f.e.).
The actual string stored in the str variable is:
no,"contact_last_name","token"
You can check this for yourself if you tell the interpreter to put the string on screen (puts str).
To answer the issue from the question's title, all your efforts to substitute escaped characters string were in vain just because the string doesn't contain the character sequences you tried to find and replace.
And the actual problem is that the CSV file is malformed. The 6th value on the first row (sec-"property_address") doesn't follow the format of a correctly encoded CSV file.
It should read either sec-property_address or "sec-property_address"; i.e. the value should be either not enclosed in quotes at all or completely enclosed in quotes. Having it partially enclosed in quotes confuses the Ruby's CSV parser.
The string looks fine; You're not understanding what you're seeing. Meditate on this:
"no,\"contact_last_name\",\"token\"" # => "no,\"contact_last_name\",\"token\""
'no,"contact_last_name","token"' # => "no,\"contact_last_name\",\"token\""
%q[no,"contact_last_name","token"] # => "no,\"contact_last_name\",\"token\""
%Q#no,"contact_last_name","token"# # => "no,\"contact_last_name\",\"token\""
When looking at a string that is delimited by double-quotes, it's necessary to escape certain characters, such as embedded double-quotes. Ruby, along with many other languages, has multiple ways of defining a string to remove that need.

Complex requirements for string split around select commas

TL;DR
I need some help making a regex that will match any commas in a string that are side by side with unlimited white space around them and between them. The commas and their surrounding white space cannot be within matching single quotes or double quotes. I then need to capture the non-whitespace values from around those commas and count how many of those commas there are.
The values captured from around the commas will become their own values in the final array, while the commas that were counted will become nil values that are added to the final array.
Explanation of the problem:
This is a pretty complex problem so any help is greatly appreciated. I'm adding functionality to a library I've been using for a while now. I have this string that contains an array
"['d,og,f:asdf,:hello,",,\",,alsee',,,'ho,la', "-123,4,5.3", true, :good, false,,, "gr\'\'\'true,\',\'ee\"n", ":::testme", true]"
I would like to split this string only around select commas so that I have an array containing the following values
'd,og,f:asdf,:hello,",,\",,alsee'
nil
nil
'ho,la'
"-123,4,5.3"
true
:good
false
nil
nil
"gr\'\'\'true,\',\'ee\"n"
":::testme"
true
Then nil values are coming from the side by side commas that are not contained in any string. I wrote the following regex to split the string above (I already got rid of the start and end brackets):
/(?<=(?:['\"]|false|true|^|,)),(?=(?:\s*(?:(?::[\w]+)|(?:(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*'))|(?:false|true)))\s*(?:,|$)))/
This splits the string so I get these values:
(0) "'d,og,f:asdf,:hello,",,\",,alsee',,"
(1) "'ho,la'"
(2) " "-123,4,5.3""
(3) " true"
(4) " :good, false,,"
(5) " "gr\'\'\'true,\',\'ee\"n""
(6) " ":::testme""
(7) " true"
All the values are strings as can be seen by their surrounding double quotes. They will not all end up that way though. A true or false will be converted to a boolean. The values surrounded by internal quotes will end up as strings. Then a value preceded with a : will end up as a symbol.
There are problems with the values at index 0 and 4. Index 0 should be this:
(0.0) "'d,og,f:asdf,:hello,",,\",,alsee'"
(0.1) nil
(0.2) nil
As you can see, the two commas at the end are gone. They have become the two nil values you see above. Then the string starts at the first single quote and ends at the last single quote, signifying that this value in the array is a string.
Then index 4 (" :good, false,,") should be this:
(4.0) " :good"
(4.1) " false"
(4.2) nil
(4.3) nil
The two commas at the end have become nil. Then " false" is it's own value which will later be converted to a boolean, while " :good" is also it's own value and will later be converted to a symbol.
To fix the problem with index 4 I have all the values run through a second regex. Here it is:
/^(\s*:(?:(?:[\w]+|\"[\s\S]+\"|'[\s\S]+')\s*)),([\s\S]*)$/
Instead of splitting this one I get the capture groups. It ends up returning this array for the value at index 4:
(4.0) " :good"
(4.1) " false,,"
That's what I wanted except for one problem. The value at index 4.1 (" false,,") has the two trailing commas which should be nil values in the array.
I need some help making a regex that will match any commas in a string that are side by side with unlimited white space around them and between them. The commas and their surrounding white space cannot be within matching single quotes or double quotes. I then need to capture the non-whitespace values from around those commas and count how many of those commas there are.
The values captured from around the commas will become their own values in the final array, while the commas that were counted will become nil values that are added to the final array.
"['d,og,f:asdf,:hello,"
,,\
",,alsee',,,'ho,la', "
-123,4,5.3
", true, :good, false,,, "
gr\
'\'
I count 4 strings. 3 in double quotes and the last one in single quotes?
You say this is broken down into smaller strings by your regx. But what about the characters outside the 4 strings?
Sorry, it looks a bit of a mess.
Try putting it all in a here document string and then breaking it down by a regx.
I finally figured it out myself. You can see how it fits in with the rest if you look at the description of the question above.
/^(([\s]*,)*)[\s]*((?::[\w]+)|(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*')|false|true))?(([\s]*,)*)$/

Ruby string sub without regex back references

I'm trying to do a simple string sub in Ruby.
The second argument to sub() is a long piece of minified JavaScript which has regular expressions contained in it. Back references in the regex in this string seem to be effecting the result of sub, because the replaced string (i.e., the first argument) is appearing in the output string.
Example:
input = "string <!--tooreplace--> is here"
output = input.sub("<!--tooreplace-->", "\&")
I want the output to be:
"string \& is here"
Not:
"string & is here"
or if escaping the regex
"string <!--tooreplace--> is here"
Basically, I want some way of doing a string sub that has no regex consequences at all - just a simple string replace.
To avoid having to figure out how to escape the replacement string, use Regex.escape. It's handy when replacements are complicated, or dealing with it is an unnecessary pain. A little helper on String is nice too.
input.sub("<!--toreplace-->", Regexp.escape('\&'))
You can also use block notation to make it simpler (as opposed to Regexp.escape):
=> puts input.sub("<!--tooreplace-->") {'\&'}
string \& is here
Use single quotes and escape the backslash:
output = input.sub("<!--tooreplace-->", '\\\&') #=> "string \\& is here"
Well, since '\\&' (that is, \ followed by &) is being interpreted as a special regex statement, it stands to reason that you need to escape the backslash. In fact, this works:
>> puts 'abc'.sub 'b', '\\\\&'
a\&c
Note that \\\\& represents the literal string \\&.

ruby parametrized regular expression

I have a string like "{some|words|are|here}" or "{another|set|of|words}"
So in general the string consists of an opening curly bracket,words delimited by a pipe and a closing curly bracket.
What is the most efficient way to get the selected word of that string ?
I would like do something like this:
#my_string = "{this|is|a|test|case}"
#my_string.get_column(0) # => "this"
#my_string.get_column(2) # => "is"
#my_string.get_column(4) # => "case"
What should the method get_column contain ?
So this is the solution I like right now:
class String
def get_column(n)
self =~ /\A\{(?:\w*\|){#{n}}(\w*)(?:\|\w*)*\}\Z/ && $1
end
end
We use a regular expression to make sure that the string is of the correct format, while simultaneously grabbing the correct column.
Explanation of regex:
\A is the beginnning of the string and \Z is the end, so this regex matches the enitre string.
Since curly braces have a special meaning we escape them as \{ and \} to match the curly braces at the beginning and end of the string.
next, we want to skip the first n columns - we don't care about them.
A previous column is some number of letters followed by a vertical bar, so we use the standard \w to match a word-like character (includes numbers and underscore, but why not) and * to match any number of them. Vertical bar has a special meaning, so we have to escape it as \|. Since we want to group this, we enclose it all inside non-capturing parens (?:\w*\|) (the ?: makes it non-capturing).
Now we have n of the previous columns, so we tell the regex to match the column pattern n times using the count regex - just put a number in curly braces after a pattern. We use standard string substition, so we just put in {#{n}} to mean "match the previous pattern exactly n times.
the first non skipped column after that is the one we care about, so we put that in capturing parens: (\w*)
then we skip the rest of the columns, if any exist: (?:\|\w*)*.
Capturing the column puts it into $1, so we return that value if the regex matched. If not, we return nil, since this String has no nth column.
In general, if you wanted to have more than just words in your columns (like "{a phrase or two|don't forget about punctuation!|maybe some longer strings that have\na newline or two?}"), then just replace all the \w in the regex with [^|{}] so you can have each column contain anything except a curly-brace or a vertical bar.
Here's my previous solution
class String
def get_column(n)
raise "not a column string" unless self =~ /\A\{\w*(?:\|\w*)*\}\Z/
self[1 .. -2].split('|')[n]
end
end
We use a similar regex to make sure the String contains a set of columns or raise an error. Then we strip the curly braces from the front and back (using self[1 .. -2] to limit to the substring starting at the first character and ending at the next to last), split the columns using the pipe character (using .split('|') to create an array of columns), and then find the n'th column (using standard Array lookup with [n]).
I just figured as long as I was using the regex to verify the string, I might as well use it to capture the column.

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To not alter the original and only return the result, remove the exclamation point, i.e. use sub. In the general case you may have regular expressions that you can and want to match more than one instance of, in that case use gsub! and gsub—without the g only the first match is replaced (as you want here, and in any case the ^ can only match once to the start of the string).
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match from the start of the string.
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
A space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide, for example here.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub /^h3\. /, '' #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and constitutes of:
^ means beginning of the string
h3 is matched literally, so it means h3
\. - a dot normally means any character so we escape it with a backslash
is matched literally

Resources