Ruby Regex and string variable - ruby

I am trying to match a string with a non-breaking space (&nbsp;) given a variable that contains the string with a regular space. The string I am looking for is the text in an HTML link/anchor (note the non-breaking space), and I am using Watir.
<a onclick='DoSomething()' href=''>Some&nbsp;Text</a>
There appears to be a difference between a regex created by // and by Regexp.new.
Interactive Ruby says the following is true (where my_text = 'Some Text'):
/Some Text/ == Regexp.new(my_text)
Yet while this returns True:
browser.link(:text, /Some Text/).exists?
This does not:
browser.link(:text, Regexp.new(my_text)).exists?
Nor does this:
browser.link(:text, /#{my_text}/).exists?
I've also tried the following with no luck:
Regexp.new(my_text.gsub(' ', '[[:space:]]'))
Does anyone know how I can accomplish this match?

Use alternation:
browser.link(:text, / |&nbsp;/).exists?
Also, try upgrading Ruby and your gems. I've heard of weird regex issues in Watir being resolved that way.

A non-breaking space is an HTML entity, and as far as I know a regex does not treat it as a space, so you need to convert one or the other before matching.
my_text = 'Some&nbsp;Text'
In other words, I don't think a regex would ever match a space to "&nbsp;". Change your search string or the source text, whichever is easier.
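One way to make the pattern tolerate either kind of space, sketched in plain Ruby. It assumes the link text reaches the matcher with the entity already decoded to the character U+00A0 (whether that holds depends on the Watir version and driver), and nbsp_tolerant_regexp is a hypothetical helper name, not part of Watir.

```ruby
# Hypothetical helper: build a Regexp from plain text in which each gap
# between words matches exactly one regular space or one non-breaking
# space (U+00A0). The words themselves are escaped so they match literally.
def nbsp_tolerant_regexp(text)
  escaped_words = text.split(' ').map { |word| Regexp.escape(word) }
  Regexp.new(escaped_words.join("[ \u00A0]"))
end

my_text = 'Some Text'
pattern = nbsp_tolerant_regexp(my_text)

puts pattern.match?("Some\u00A0Text") # text containing a non-breaking space
puts pattern.match?('Some Text')      # text containing a regular space
```

The resulting pattern could then be passed in place of the literal regex, for example browser.link(:text, pattern).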

Related

Converting String to Regex string

How can I transform a string into a regex string, properly escaping all regex-specific characters? I am using interpolation to build the regex string to allow users to customize the regex without having to touch the code (or expecting them to know regex)
Example
custom_text = "Hello"
my_regex = /#{custom_text}:\s*(\d+)/i
Which results in the following regex when my code uses it
/Hello:\s*(\d+)/i
This allows users to perhaps provide language localizations without having to worry about figuring out where my regex is used, how it's used, or whether they will break the script if they changed something.
However if they wanted to include things like periods or question marks like Hello?, I would probably need to escape them first.
Use Regexp.escape:
my_regex = /#{Regexp.escape(custom_text)}:\s*(\d+)/i
For example:
>> puts /#{Regexp.escape('Hello?')}/.inspect
/Hello\?/
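A short sketch of why the escaping matters, using a made-up label ('Total (USD)?') and a made-up input line; neither comes from the question.

```ruby
# Made-up user-supplied label containing regex metacharacters.
custom_text = 'Total (USD)?'
line = 'total (usd)?: 42'

# Escaped: the parentheses and question mark are matched literally.
escaped_regex = /#{Regexp.escape(custom_text)}:\s*(\d+)/i
escaped_match = escaped_regex.match(line)
puts escaped_match[1]

# Unescaped: "(USD)?" becomes an optional capture group, so the literal
# parentheses in the input no longer match.
unescaped_regex = /#{custom_text}:\s*(\d+)/i
puts unescaped_regex.match(line).inspect
```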

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen (at least in Ruby 1.8) have no way to specify zero-width lookbehinds. The most generic solution is to group the part of the regexp before the real matching part and the part after it, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
  $1 + real_path($2) + $3
end
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/!(\[.*?\])\((.*?)\)/, '!\1(/folder1/\2)')
p s # => "This is a ![foto](/folder1/foto.jpeg)"
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
  image.gsub(/(?<=\()(.*)(?=\))/) do |link|
    "/a/new/path/" + link
  end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regular expression subject to the following condition. Expressions in lookbehinds have to be fixed width due to limitations of the regex implementation, which means that they can't include repetitions of unknown length or nested alternations whose choices differ in width (top-level alternatives of different lengths, such as (?<=a|bc), are accepted). If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads, though.)
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables
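A single-pass alternative, sketched with \K rather than the lookbehind used above. \K is a different technique: it drops everything matched so far from the reported match, and it is available in the regex engine of Ruby 2.0 and later. The real_path below is a hypothetical stand-in for the question's function.

```ruby
# Hypothetical stand-in for the question's real_path function.
def real_path(path)
  "/a/new/path/#{path}"
end

content = 'This is a ![foto](foto.jpeg), here is another ![foto](foto.png)'

# The part before \K must match but is kept out of the reported match, so
# the block only receives (and only replaces) the path inside the parentheses.
re = /!\[[^\]]*\]\(\K[^)]*(?=\))/
rel_content = content.gsub(re) { |path| real_path(path) }
puts rel_content
```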
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
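To make the difference between the block parameter and $1 concrete, here is a small sketch using the regex from the question. The seen array is only there to record what each call receives.

```ruby
content = 'This is a ![foto](foto.jpeg)'
seen = []

content.gsub(/!\[.*?\]\((.*?)\)/) do |whole_match|
  # whole_match is the entire matched text; $1 is the first capture group.
  seen << [whole_match, $1]
  whole_match # return the match unchanged; we only want to inspect it
end

p seen
```

This is why the original code passed the whole image expression to real_path: the block parameter is always the full match, never a capture group.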
As a side note, some people consider '\1' unsuitable when the match is surrounded by text of unknown length. For example, if you want to modify only the middle of a match, how do you preserve the characters on both sides?
It's easy: capture the surrounding text in groups as well.
For example, I want to rename a-ruby-porgramming-book-531070.png to a-ruby-porgramming-book.png, that is, remove everything between the last "-" and the last ".".
I can use /.*(-.*?)\./ to match -531070. Now how should I replace it? Note that everything else has no definite format.
The answer is to put parentheses around the surrounding text and keep it through backreferences:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want to add something before the matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"

gsub ASCII code characters from a string in ruby

I am using nokogiri to screen scrape some HTML. In some occurrences, I am getting some weird characters back, I have tracked down the ASCII code for these characters with the following code:
@parser.leads[0].phone_numbers[0].each_byte do |c|
  puts "char=#{c}"
end
The characters in question have an ASCII code of 194 and 160.
I want to somehow strip these characters out while parsing.
I have tried the following code but it does not work.
@parser.leads[0].phone_numbers[0].gsub(/160.chr/,'').gsub(/194.chr/,'')
Can anyone tell me how to achieve this?
I found this question while trying to strip out invisible characters when "trimming" a string.
s.strip did not work for me, and I found that the invisible character had the ord number 194.
None of the methods above worked for me, but then I found the question "Convert non-breaking spaces to spaces in Ruby", which says:
Use /\u00a0/ to match non-breaking spaces: s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
So glad I found that! Now I'm using:
s.gsub(/[[:space:]]/,'')
This doesn't answer the question of how to gsub specific character codes, but if you're just trying to remove whitespace it seems to work pretty well.
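For context, 194 and 160 are the two UTF-8 bytes (0xC2 0xA0) of a single character, the non-breaking space U+00A0, which is why matching it as one character works. A minimal sketch with a made-up phone number value:

```ruby
# Made-up sample value; "\u00A0" is the non-breaking space.
raw = "555\u00A01234"

high_bytes = raw.bytes.select { |byte| byte > 127 }
p high_bytes                    # the two bytes seen in the question

p raw.delete("\u00A0")          # remove the character entirely
p raw.gsub(/\u00A0/, ' ')       # or turn it into a regular space
p raw.gsub(/[[:space:]]/, '')   # or remove every kind of whitespace
```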
Your problem is that you want to do a method call but instead you're creating a Regexp. You're searching and replacing strings consisting of the string "160" followed by any character and then the string "chr", and then doing the same except with "160" replaced with "194".
Instead, do gsub(160.chr, '').
Update (2018): This code does not work in current Ruby versions. Please refer to other answers.
You can also try
s.gsub(/\xA0|\xC2/, '')
or
s.delete 160.chr+194.chr
My first thought: should you be using gsub! instead of gsub?
gsub returns a new string, while gsub! performs the substitution in place.
I was getting an "invalid multibyte escape" error while trying the above solution, but for a different situation. Google was returning \xA0 when the number was greater than 999, and I wanted to remove it. So I used return_value.gsub(/[\xA0]/n,"") instead, and it worked perfectly fine for me.

how to convert strings like "this is an example" to "this-is-an-example" under ruby

How do I convert strings like "this is an example" to "this-is-an-example" under ruby?
The simplest version:
"this is an example".tr(" ", "-")
#=> "this-is-an-example"
You could also do something like this, which is slightly more robust and easier to extend by updating the regular expression:
"this is an example".gsub(/\s+/, "-")
#=> "this-is-an-example"
The above will replace all chunks of white space (any combination of multiple spaces, tabs, newlines) to a single dash.
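The difference between the two versions shows up once the input contains repeated spaces or tabs. A quick sketch with a made-up input string:

```ruby
# Made-up input with a double space and a tab.
messy = "this  is\tan example"

with_tr   = messy.tr(' ', '-')      # replaces each space character one by one
with_gsub = messy.gsub(/\s+/, '-')  # replaces each run of whitespace once

p with_tr
p with_gsub
```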
See the String class reference for more details about the methods that can be used to manipulate strings in Ruby.
If you are trying to generate a string that can be used in a URL, you should also consider stripping other non-alphanumeric characters (especially the ones that have special meaning in URLs), or replacing them with an alphanumeric equivalent (example, as suggested by Rob Cameron in his answer).
If you are trying to make something that is a good URL slug, there are lots of ways to do it.
Generally, you want to remove everything that is not a letter or number, and then replace all whitespace characters with dashes.
So:
s = "this is an 'example'"
s = s.gsub(/\W+/, ' ').strip
s = s.gsub(/\s+/,'-')
At the end s will equal "this-is-an-example"
I used the source code from a ruby testing library called contest to get this particular way to do it.
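The same steps can be folded into one small method. This is only a sketch: slugify is a hypothetical name, and it assumes ASCII-only input, so accented or non-Latin letters would be dropped rather than transliterated.

```ruby
# Hypothetical helper, ASCII-only: lowercases the text, turns every run of
# characters other than a-z and 0-9 into a single dash, then trims dashes
# from both ends.
def slugify(text)
  text.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
end

p slugify("This is an 'example'!")
```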
If you're using Rails take a look at parameterize(), it does exactly what you're looking for:
http://api.rubyonrails.org/classes/ActiveSupport/CoreExtensions/String/Inflections.html#M001367
foo = "Hello, world!"
foo.parameterize # => "hello-world"

Locating the node by value containing whitespaces using XPath

I need to locate the node within an xml file by its value using XPath.
The problem arises when the node to find contains a value with whitespace inside.
For example:
<Root>
<Child>value</Child>
<Child>value with spaces</Child>
</Root>
I can not construct the XPath locating the second Child node.
Simple XPath /Root/Child perfectly works for both children, but /Root[Child=value with spaces] returns an empty collection.
I have already tried masking spaces with %20, &#20;, &nbsp; and using single and double quotes.
Still no luck.
Does anybody have an idea?
Depending on your exact situation, there are different XPath expressions that will select the node, whose value contains some whitespace.
First, let us recall that any one of these characters is "whitespace":
&#x9; -- the Tab
&#xA; -- newline
&#xD; -- carriage return
' ' or &#x20; -- the space
If you know the exact value of the node, say it is "Hello World" with a space, then a most direct XPath expression:
/top/aChild[. = 'Hello World']
will select this node.
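Applied to the document from the question, the key point is that the value is written as a quoted string literal inside the predicate. A minimal sketch in Ruby, assuming the REXML library is available (it ships with standard Ruby installations) and using text() in the predicate:

```ruby
require 'rexml/document'

# The XML from the question.
xml = <<~XML
  <Root>
    <Child>value</Child>
    <Child>value with spaces</Child>
  </Root>
XML

doc = REXML::Document.new(xml)

# The value, spaces included, goes inside quotes in the predicate.
matches = REXML::XPath.match(doc, "/Root/Child[text()='value with spaces']")

puts matches.size
puts matches.first.text
```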
The difficulties with specifying a value that contains whitespace, however, come from the fact that we see all whitespace characters just as ... well, whitespace, and don't know whether it is a group of spaces or a single tab.
In XPath 2.0 one may use regular expressions and they provide a simple and convenient solution. Thus we can use an XPath 2.0 expression as the one below:
/*/aChild[matches(., "Hello\sWorld")]
to select any child of the top node, whose value is the string "Hello" followed by whitespace followed by the string "World". Note the use of the matches() function and of the "\s" pattern that matches whitespace.
In XPath 1.0 a convenient test if a given string contains any whitespace characters is:
not(string-length(.) = string-length(translate(., ' &#9;&#xA;&#xD;', '')))
Here we use the translate() function to eliminate any of the four whitespace characters, and compare the length of the resulting string to that of the original string.
So, if in a text editor a node's value is displayed as
"Hello World",
we can safely select this node with the XPath expression:
/*/aChild[translate(., ' &#9;&#xA;&#xD;', '') = 'HelloWorld']
In many cases we can also use the XPath function normalize-space(), which produces from its string argument another string in which leading and trailing whitespace is removed and every run of whitespace within the string is replaced by a single space.
In the above case, we will simply use the following XPath expression:
/*/aChild[normalize-space() = 'Hello World']
Try either this:
/Root/Child[normalize-space(text())='value without spaces']
or
/Root/Child[contains(text(),'value without spaces')]
or (since it looks like your test value may be the issue)
/Root/Child[normalize-space(text())=normalize-space('value with spaces')]
Haven't actually executed any of these so the syntax may be wonky.
Locating the Attribute by value containing whitespaces using XPath
I have an input element whose value contains whitespace.
eg:
<input type="button" value="Import Selected File">
I solved this by using this xpath expression.
//input[contains(@value,'Import') and contains(@value,'Selected') and contains(@value,'File')]
Hope this will help you guys.
"x0020" worked for me on a jackrabbit based CQ5/AEM repository in which the property names had spaces. Below would work for a property "Record ID"-
[(jcr:contains(jcr:content/@Record_x0020_ID, 'test'))]
Did you try #x20?
I found this in the second search result:
try replacing the space with "x0020".
This seemed to work for the person who posted it.
All of the above solutions didn't really work for me.
However, there's a much simpler solution.
When you create the XmlDocument, make sure you set the PreserveWhitespace property to true:
XmlDocument xmldoc = new XmlDocument();
xmldoc.PreserveWhitespace = true;
xmldoc.Load(xmlCollection);

Resources