Regex in xpath? - ruby

I want to find a table cell that contains the link (\d{0,3} )?pieces.
How would I need to write this xpath?
Can I simply insert the xpath directly into the Capybara search? Or do I need to do something special to indicate it is a regex? Or can I not do it at all?

XPath 1.0
XPath 1.0 does not include regular expression support. You should be able to achieve the desired match with the following expression:
//td/a['pieces'=substring(@href, string-length(@href) -
string-length('pieces') + 1) and
'pieces'=translate(@href, '0123456789', '') and
string-length(@href) > 5 and
string-length(@href) < 10]
The first test in the predicate checks that the string ends with pieces. The second test ensures that the entire string equals pieces when all of the digits are removed (i.e. there are no other characters). The final two tests ensure that the entire length of the string is between 6 and 9, which is the length of pieces plus zero to three digits.
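To sanity-check the logic of these four tests outside XPath, they can be mirrored in plain Ruby. This is only a sketch: the method name pieces_href? and the constant PIECES_RE are made up for illustration, and the code works on a bare href string rather than on a parsed document.

```ruby
# Hypothetical helper mirroring the four XPath 1.0 tests on a plain string.
def pieces_href?(href)
  suffix = 'pieces'
  href.end_with?(suffix) &&           # test 1: the string ends with "pieces"
    href.delete('0-9') == suffix &&   # test 2: removing digits leaves "pieces"
    href.length > 5 &&                # test 3: at least 6 characters
    href.length < 10                  # test 4: at most 9 characters
end

# The same condition written as a Ruby regular expression (an assumption
# about what the predicate is meant to encode: 0 to 3 digits, then "pieces").
PIECES_RE = /\A\d{0,3}pieces\z/

%w[pieces 1pieces 123pieces 1234pieces piece pieces1 xpieces].each do |href|
  puts "#{href}: #{pieces_href?(href)} (regex: #{PIECES_RE.match?(href)})"
end
```

For every sample string, both columns should agree, which is a quick way to convince yourself that the predicate and the regular expression describe the same set of strings.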
Test it on the following document:
<table>
<tr>
<td>test0</td>
<td>no match</td>
<td>no match</td>
<td>test1</td>
<td>test2</td>
<td>no match</td>
<td>test3</td>
</tr>
</table>
It should match only the test0, test1, test2, and test3 links.
(Note: The expression may be further complicated by the possibility of other characters preceding the portion you're attempting to match.)
XPath 2.0
Achieving this in XPath 2.0 is trivial with the matches function.

//td/a[
substring-after(concat(@href ,'x') ,'pieces')='x'
and
111>=concat(0 ,translate( substring-before(@href ,'pieces') ,'0123456789 -.' ,'1111111111xxx'))
]
This is another solution, not necessarily better, but, perhaps, interesting.
The first conjunct is true just when @href contains exactly one occurrence
of 'pieces', and it is at the end.
The second conjunct is true just when the part of @href before 'pieces' is empty
or is a numeral made entirely of digits (no '.', '-', or white-space), with at most 3 digits.
The number of 1's in the '111>=' is the maximum number of digits that will match.
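The two conjuncts can be traced step by step in Ruby. The sketch below emulates the XPath 1.0 functions involved, following the rules quoted from the specification; all method names (xpath_number, substring_before, substring_after, matches_predicate?) are hypothetical helpers written for this illustration, not part of any library.

```ruby
# XPath 1.0 string-to-number rule: optional whitespace, optional minus sign,
# a Number, optional whitespace; any other string converts to NaN.
def xpath_number(str)
  str.match?(/\A[ \t\r\n]*-?(?:\d+(?:\.\d*)?|\.\d+)[ \t\r\n]*\z/) ? str.to_f : Float::NAN
end

# XPath semantics: both functions return '' when the needle does not occur.
def substring_before(str, needle)
  str.include?(needle) ? str.partition(needle).first : ''
end

def substring_after(str, needle)
  str.partition(needle).last
end

# Hypothetical helper combining both conjuncts of the predicate.
def matches_predicate?(href)
  # Conjunct 1: after the first 'pieces' only the sentinel 'x' remains.
  first = substring_after(href + 'x', 'pieces') == 'x'
  # Conjunct 2: digits become '1'; space, '-' and '.' become 'x' (forcing NaN).
  prefix = substring_before(href, 'pieces').tr('0123456789 -.', '1111111111xxx')
  second = 111.0 >= xpath_number('0' + prefix) # any comparison with NaN is false
  first && second
end

%w[pieces 12pieces 123pieces 1234pieces piecespieces 1.5pieces].each do |href|
  puts "#{href}: #{matches_predicate?(href)}"
end
```

The leading '0' is what makes an empty prefix convert to the number 0 instead of NaN, so a bare pieces still passes the second conjunct.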
Reference: http://www.w3.org/TR/xpath
The substring-after function returns the substring of the first argument string that follows the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string.
The substring-before function returns the substring of the first argument string that precedes the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string.
... a string that consists of optional whitespace followed by an optional minus sign followed by a Number followed by whitespace is converted to the IEEE 754 number ... any other string is converted to NaN
Number ::= Digits ('.' Digits?)? | '.' Digits
An attribute node has a string-value. The string-value is the normalized value as specified by the XML Recommendation [XML]
The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

Related

What is [0] and [1..-1] in Ruby?

What do [0] and [1..-1] mean in the following code?
def capitalize(string)
puts "#{string[0].upcase}#{string[1..-1]}"
end
string[0] is a new string that contains the first character of string.
It is, in fact, syntactic sugar for string.[](0), i.e. calling the method String#[] on the String object stored in the variable string with argument 0.
The String#[] method also accepts a Range as argument, to extract a substring. In this case, the lower bound of range is the index where the substring starts and the upper bound is the index where the substring ends. Positive values count the characters from the beginning of the string (starting with 0), negative values count the characters from the end of the string (-1 denotes the last character).
The call string[1..-1] (string.[](1..-1)) returns a new string that is initialized with the substring of string that starts with the second character of string (1) and ends with its last character.
Put together, string[0].upcase is the uppercase version of the first character of string, string[1..-1] is the rest of string (everything but the first character).
Read more about different ways to access individual characters and substrings in strings using String#[] method.
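As a quick illustration of the two forms of String#[] described above, here is a small sketch. The method name capitalize_first is made up for this example, and it assumes a non-empty string (on an empty string, string[0] is nil and upcase would fail).

```ruby
word = "ruby"

puts word[0]       # first character
puts word[-1]      # last character
puts word[1..-1]   # everything from the second character to the end
puts word[1..2]    # characters at indices 1 and 2

# Hypothetical variant of the method from the question that returns the
# result instead of printing it. Assumes the string is not empty.
def capitalize_first(string)
  "#{string[0].upcase}#{string[1..-1]}"
end

puts capitalize_first("hello world")
```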

Characters at the end do not match

I need to match all the letters and digits in a string str.
This is my code.
str.match(/^(AB)(\d+)([A-Za-z][0-9])?/)
When str = AB57933A [sic], it matches only AB57933, and not the characters appended after the numbers.
If I try with str = AB57933AbC [sic], it matches only AB57933; it only matches up to the last number, and not the characters after that.
In the way you have written it:
/^(AB)(\d+)([A-Za-z][0-9])/
you require the last group to be exactly one letter followed by exactly one digit between 0 and 9. Depending on your needs, you can replace it with the following if you do not expect digits after the last letter:
/^(AB)(\d+)([A-Za-z]+)/
or with
/^(AB)(\d+)([A-Za-z0-9]+)/
if inputs such as AB57933AbC12 should also be accepted as valid.
Last but not least, if you do not use backreferences, you can omit the parentheses, since you do not need capturing groups.
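To see the difference between the three patterns, they can be run side by side. This is a small sketch; the constant names and the matched_text helper are made up for illustration.

```ruby
ORIGINAL           = /^(AB)(\d+)([A-Za-z][0-9])?/  # pattern from the question
LETTERS_ONLY       = /^(AB)(\d+)([A-Za-z]+)/       # letters after the digits
LETTERS_AND_DIGITS = /^(AB)(\d+)([A-Za-z0-9]+)/    # letters or digits after

# Hypothetical helper: the full matched text, or nil when nothing matches.
def matched_text(pattern, str)
  m = str.match(pattern)
  m && m[0]
end

p matched_text(ORIGINAL, 'AB57933AbC')             # stops after the digits
p matched_text(LETTERS_ONLY, 'AB57933AbC')         # includes the letters
p matched_text(LETTERS_AND_DIGITS, 'AB57933AbC12') # includes letters and digits
```

With the original pattern the optional group simply matches nothing, because "Ab" is not a letter followed by a digit, so the overall match ends at the last digit.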

Performing operations on each line of a string

I have a string named "string" that contains six lines.
I want to remove a "Z" from the end of each line (every line ends with one) and capitalize the first letter in each line (ignoring numbers and white space; e.g., "1. apple" -> "1. Apple").
I have some idea of how to do it, but have no idea how to do it in Ruby. How do I accomplish this? A loop? What would the syntax be?
Using a regular expression (see String#gsub):
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
puts s.gsub(/z$/i, '').gsub(/^([^a-z]*)([a-z])/i) { $1 + $2.upcase }
# /z$/i - matches a trailing `z` (or `Z`) at the end of each line.
# /^([^a-z]*)([a-z])/i - matches the leading non-letters and the first letter
# of each line, captured as group 1 ($1) and group 2 ($2).
output:
1. Apple
2. Banana
3. Cat
4. Dog
5. Elephant
6. Fruit
I would approach this by breaking your problem into smaller steps. After we've solved each of the smaller problems, you can put it all back together for a more elegant solution.
Given the initial string put forth by falsetru:
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
1. Break your string into an array of substrings, separated by the newline.
substrings = s.split(/\n/)
This uses the String class' split method and a regular expression. It searches for all occurrences of newline (backslash-n) and treats this as a delimiter, splitting the string into substrings based on this delimiter. Then it throws all of these substrings into an array, which we've named substrings.
2. Iterate through your array of substrings to do some stuff (details on what stuff later)
substrings.each do |substring|
  # Do stuff to each substring
end
This is one form for how you iterate across an array in Ruby. You call the Array's each method, and you give it a block of code which it will run on each element in the array. In our example, we'll use the variable name substring within our block of code so that we can do stuff to each substring.
3. Remove the z character at the end of each substring
substrings.each do |substring|
substring.gsub!(/z$/, '')
end
Now, as we iterate through the array, the first thing we want to do is remove the z character at the end of each string. You do this with the gsub! method of String, which is a search-and-replace method that modifies the string in place. The first argument for this method is the regular expression of what you're looking for. In this case, we are looking for a z followed by the end of the line (denoted by the dollar sign); since each substring is a single line, that is also the end of the string. The second argument is an empty string, because we want to replace what's been found with nothing (in other words, we just want to remove what's been found).
4. Find the index of the first letter in each substring
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
end
The String class also has a method called index, which returns the index of the first occurrence of a substring that matches the regular expression you provide. In our case, since we want to ignore numbers, symbols, and spaces, we are really just looking for the first letter in the substring. To do this, we use the regular expression /[a-zA-Z]/, which basically says, "Find me anything in the range of lowercase a to lowercase z or uppercase A to uppercase Z." Now we have an index (using our example strings, the index is 3).
5. Capitalize the letter at the index we have found
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
substring[index] = substring[index].capitalize
end
Based on the index value that we found, we want to replace the letter at that index with that same letter, but capitalized.
6. Put our substrings array back together as a single-string separated by newlines.
Now that we've done everything we need to do to each substring, our each iterator block ends, and we have what we need in the substrings array. To put the array back together as a single string, we use the join method of Array class.
result = substrings.join("\n")
With that, we now have a String called result, which should be what you're looking for.
Putting It All Together
Here is what the entire solution looks like, once we put together all of the steps:
substrings = s.split(/\n/)
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
substring[index] = substring[index].capitalize
end
result = substrings.join("\n")
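Once the individual steps are clear, they can be folded into a single method chain. The following is one possible condensed sketch, not the only way to do it; the method name tidy_lines is made up, and unlike the step-by-step version it removes an uppercase Z as well and quietly skips lines that contain no letter.

```ruby
# Hypothetical helper: for each line, drop one trailing z/Z and upcase the
# first letter. Lines are rejoined with newlines (no trailing newline).
def tidy_lines(text)
  text.each_line.map do |line|
    line.chomp
        .sub(/z\z/i, '')                   # remove a trailing z or Z
        .sub(/[a-zA-Z]/) { |c| c.upcase }  # upcase the first letter, if any
  end.join("\n")
end

s = <<EOS
1. applez
2. bananaz
3. catz
EOS

puts tidy_lines(s)
```

Using sub with a block avoids the separate index lookup, so a line without any letters is simply left unchanged instead of raising an error on a nil index.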

How do I match repeated characters?

How do I find repeated characters using a regular expression?
If I have aaabbab, I would like to match only characters which have three repetitions:
aaa
Try string.scan(/((.)\2{2,})/).map(&:first), where string is your string of characters.
The way this works is that it looks for any character and captures it (the dot), then matches repeats of that character (the \2 backreference) 2 or more times (the {2,} range means "anywhere between 2 and infinity times"). Scan will return an array of arrays, so we map the first matches out of it to get the desired results.
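A short sketch of the scan call in action; the wrapper name repeated_runs is made up for illustration.

```ruby
# Hypothetical wrapper around the scan expression from the answer.
def repeated_runs(string)
  string.scan(/((.)\2{2,})/).map(&:first)
end

# Raw scan result: one [whole_run, single_character] pair per match.
p "aaabbab".scan(/((.)\2{2,})/)

p repeated_runs("aaabbab")    # only the run of three a's qualifies
p repeated_runs("aaabbbbcc")  # runs of three or more, whatever the character
p repeated_runs("abab")       # no character repeats three times in a row
```

Because the quantifier is {2,}, runs longer than three characters are matched as a whole; use \2{2} instead if you want at most three characters per match.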

ruby parametrized regular expression

I have a string like "{some|words|are|here}" or "{another|set|of|words}"
So in general the string consists of an opening curly bracket, words delimited by a pipe, and a closing curly bracket.
What is the most efficient way to get the selected word of that string?
I would like to do something like this:
@my_string = "{this|is|a|test|case}"
@my_string.get_column(0) # => "this"
@my_string.get_column(2) # => "a"
@my_string.get_column(4) # => "case"
What should the method get_column contain?
So this is the solution I like right now:
class String
def get_column(n)
self =~ /\A\{(?:\w*\|){#{n}}(\w*)(?:\|\w*)*\}\Z/ && $1
end
end
We use a regular expression to make sure that the string is of the correct format, while simultaneously grabbing the correct column.
Explanation of regex:
\A is the beginning of the string and \Z is the end, so this regex matches the entire string.
Since curly braces have a special meaning we escape them as \{ and \} to match the curly braces at the beginning and end of the string.
Next, we want to skip the first n columns; we don't care about them.
A previous column is some number of letters followed by a vertical bar, so we use the standard \w to match a word-like character (includes numbers and underscore, but why not) and * to match any number of them. Vertical bar has a special meaning, so we have to escape it as \|. Since we want to group this, we enclose it all inside non-capturing parens (?:\w*\|) (the ?: makes it non-capturing).
Now we have n of the previous columns, so we tell the regex to match the column pattern n times using the count quantifier: just put a number in curly braces after a pattern. We use standard string interpolation, so we just put in {#{n}} to mean "match the previous pattern exactly n times".
The first non-skipped column after that is the one we care about, so we put that in capturing parens: (\w*)
Then we skip the rest of the columns, if any exist: (?:\|\w*)*.
Capturing the column puts it into $1, so we return that value if the regex matched. If not, we return nil, since this String has no nth column.
In general, if you wanted to have more than just words in your columns (like "{a phrase or two|don't forget about punctuation!|maybe some longer strings that have\na newline or two?}"), then just replace all the \w in the regex with [^|{}] so you can have each column contain anything except a curly-brace or a vertical bar.
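A sketch of that variant, written as a standalone method instead of reopening String. The method name column_at and the local variable cell are made up for this illustration.

```ruby
# Hypothetical standalone version of get_column in which a column may contain
# anything except a vertical bar or a curly brace.
def column_at(str, n)
  cell = '[^|{}]*'  # one column: any run of characters other than | { }
  pattern = /\A\{(?:#{cell}\|){#{n}}(#{cell})(?:\|#{cell})*\}\Z/
  m = pattern.match(str)
  m && m[1]         # nil when the string has no nth column or is malformed
end

phrases = "{a phrase or two|don't forget about punctuation!|two\nlines}"

p column_at(phrases, 0)
p column_at(phrases, 1)
p column_at(phrases, 2)  # the newline is part of the column
p column_at(phrases, 3)  # no such column
```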
Here's my previous solution
class String
def get_column(n)
raise "not a column string" unless self =~ /\A\{\w*(?:\|\w*)*\}\Z/
self[1 .. -2].split('|')[n]
end
end
We use a similar regex to make sure the String contains a set of columns or raise an error. Then we strip the curly braces from the front and back (using self[1 .. -2] to limit to the substring starting at the first character and ending at the next to last), split the columns using the pipe character (using .split('|') to create an array of columns), and then find the n'th column (using standard Array lookup with [n]).
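One behavioral difference between the two approaches shows up with trailing empty columns, because String#split drops trailing empty strings by default. The sketch below uses a standalone method with a made-up name, column_by_split, instead of reopening String.

```ruby
# Hypothetical standalone version of the split-based approach.
def column_by_split(str, n)
  raise ArgumentError, "not a column string" unless str =~ /\A\{\w*(?:\|\w*)*\}\Z/
  str[1..-2].split('|')[n]
end

p column_by_split("{this|is|a|test|case}", 4)

# "{a|b|}" has an empty third column, but split discards it,
# so the lookup falls off the end of the array.
p "{a|b|}"[1..-2].split('|')
p column_by_split("{a|b|}", 2)

# Passing a negative limit keeps trailing empty fields.
p "{a|b|}"[1..-2].split('|', -1)
```

If empty trailing columns matter for your data, split('|', -1) makes the split-based version agree with the regex-based one, which captures an empty string in that case.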
I just figured as long as I was using the regex to verify the string, I might as well use it to capture the column.

Resources