XPath 1.0: using arithmetic operators - xpath

Let's say we have this:
<a href="www.something/page/1">something</a>
Now is there a way to return the @href like "www.something/page/2"? Basically, I want to return the @href value, but with the substring-after(.,"page/") part incremented by 1. I've been trying something like
//a/@href[number(substring-after(.,"page/"))+1]
but it doesn't work, and I don't think I can use
//a/@href/number(substring-after(.,"page/"))+1
It's not precisely a paging thing where I could use the pagination; I just picked that as an example. The point is just to find a way to increment a value in XPath 1.0. Any help?

What you can do is
concat(
translate(//a/@href, '0123456789', ''),
translate(//a/@href, translate(//a/@href, '0123456789', ''), '') + 1
)
That concatenates the 'href' attribute value with all digits removed and the sum of 1 and the 'href' value with everything but digits removed.
That might suffice if all digits in your URLs occur at the end of the URL. But generally XPath 1.0 is good at selecting nodes in your input but bad at constructing new values based on parts of node values.
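To see why the digits have to sit at the end of the URL, here is a minimal Ruby sketch that mirrors the two translate() calls with String#delete. It does not evaluate XPath; the method name and both sample URLs are assumptions made for illustration.

```ruby
# Mirrors the translate() trick in plain Ruby (not an XPath evaluation).
# delete('0-9')  ~ translate(href, '0123456789', '')  -> href without digits
# delete('^0-9') ~ the nested translate()             -> only the digits of href
def increment_digits(href)
  non_digits = href.delete('0-9')
  digits     = href.delete('^0-9')
  non_digits + (digits.to_i + 1).to_s
end

# Hypothetical sample values, not taken from a real site.
trailing_only = increment_digits('www.something/page/1')
digits_inside = increment_digits('www.site2/page/1')

puts trailing_only
puts digits_inside
```

With the second URL the digit from the host name is glued onto the page number before the addition, which is the limitation described above.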

There is a simpler way to achieve this: take the substring after 'page/', add 1, and then join it all back together.
This XPath assumes the current node is the @href attribute:
concat(substring-before(.,'page/'),
'page/',
substring-after(.,'page/')+1
)

Your order of operations is a little, well, out of order. Use something like this:
substring-after(//a/@href, 'page/') + 1
Note that it is not necessary to explicitly convert the string value to a number. From the spec:
The numeric operators convert their operands to numbers as if by
calling the number function.
Putting it all together:
concat(
substring-before(//a/@href, 'page/'),
'page/',
substring-after(//a/@href, 'page/') + 1)
Result:
www.something/page/2
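The same three steps can be mirrored in Ruby to check the arithmetic outside an XPath engine. This is only a sketch: the href value is assumed from the question, and String#partition stands in for substring-before / substring-after.

```ruby
# Assumed input value, taken from the example in the question.
href = 'www.something/page/1'

# partition splits at the first occurrence of the separator:
# before ~ substring-before(., 'page/'), after ~ substring-after(., 'page/')
before, separator, after = href.partition('page/')

# XPath converts the string operand to a number implicitly;
# Ruby needs the conversion spelled out.
result = before + separator + (Integer(after) + 1).to_s

puts result
```

Integer() raises if the part after 'page/' is not a plain integer, whereas XPath 1.0 would silently produce NaN.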

Related

Ruby regex count matched elements in the array of digits

I have a string:
'my_array1: ["1445","374","1449","378"], my_array2: ["1445","374", "1449","378"]'
I need to match all sets of digits from my_array2: [...] and count how many of them there.
I need to do something like this with regex and ruby MatchData
string = 'my_array1: ["1445","374", "1449","378"], my_array2: ["1445","374", "1449","378"]'
matches = string.match(/my_array2\:\s[\[,]\"(\d+)\"/)
count_matches = matches.size
Expected result should be 4.
What is the correct way of doing it?
If you are guaranteed that the content of my_array2 is always numeric, you could simply use split twice: first split by my_array2: [" and then split by a comma. This should give you the number of items you are after.
If you are not guaranteed that, you could still split by my_array2 and, instead of splitting again, use a pattern such as "\d+" (or "\d+(\.\d+)?" if you have floating point values) and count the matches.
An example of the expression is available here.
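A minimal sketch of the second approach, assuming the list after my_array2: is always enclosed in square brackets. It first isolates that bracketed list and then counts the digit runs with String#scan, which returns every match rather than only the first one.

```ruby
string = 'my_array1: ["1445","374", "1449","378"], my_array2: ["1445","374", "1449","378"]'

# Capture only what sits between the brackets that follow "my_array2:".
list = string[/my_array2:\s*\[([^\]]*)\]/, 1]

# scan returns an array with all runs of digits; to_s guards against a missing list.
numbers = list.to_s.scan(/\d+/)
count_matches = numbers.size

puts count_matches
```

String#match stops at the first match and its MatchData size counts capture groups, which is why the attempt in the question cannot yield the number of elements.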

What's the XPath equivalent to the SQL IN query?

I would like to know what the XPath equivalent to the SQL IN query is. Basically, in SQL I can do this:
select * from tbl1 where Id in (1,2,3,4)
so I want something similar in XPath/XSL:
i.e.
//*[@id= IN('51417','1121','111')]
Please advise
(In XPath 2,) the = operator always works like in.
I.e. you can use
//*[@id = ('51417','1121','111')]
A solution is to write out the options as separate conditions:
//*[(@id = '51417') or (@id = '1121') or (@id = '111')]
Another, slightly less verbose solution that looks a bit like a hack, though, would be to use the contains function:
//*[contains('-51417-1121-111-', concat('-', @id, '-'))]
Literally, this means you're checking whether the value of the id attribute (preceded and followed by a delimiter character) is a substring of -51417-1121-111-. Note that I am using a hyphen (-) as a delimiter of the allowable values; you can replace that with any character that will not appear in the id attribute.
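The role of the delimiter is easiest to see with a value that is only part of an allowed id. The Ruby sketch below mirrors the contains/concat check with String#include?; it is not XPath, and the constant and method names are made up for this illustration.

```ruby
ALLOWED  = %w[51417 1121 111].freeze
DELIM    = '-' # must be a character that never occurs in an id
HAYSTACK = DELIM + ALLOWED.join(DELIM) + DELIM # "-51417-1121-111-"

# Mirrors contains('-51417-1121-111-', concat('-', @id, '-')).
def in_list?(id)
  HAYSTACK.include?("#{DELIM}#{id}#{DELIM}")
end

# Same idea without delimiters, shown only to expose the false positive.
def naive_in_list?(id)
  ALLOWED.join.include?(id)
end

puts in_list?('1121')     # an allowed id
puts in_list?('11')       # only a fragment of allowed ids
puts naive_in_list?('11') # the fragment slips through without delimiters
```

Wrapping both the list and the candidate in delimiters forces whole-value matches, which is what makes the substring test behave like IN.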

XPath 2.0: reference earlier context in another part of the XPath expression

in an XPath I would like to focus on certain elements and analyse them:
...
<field>aaa</field>
...
<field>bbb</field>
...
<field>aaa (1)</field>
...
<field>aaa (2)</field>
...
<field>ccc</field>
...
<field>ddd (7)</field>
I want to find the elements whose text content (apart from a possible enumeration) is unique. In the above example that would be bbb, ccc and ddd.
The following XPath gives me the unique values:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
Now I would like to extend that and perform another XPath on all the distinct values: count how many fields start with each of them and retrieve the ones whose count is greater than 1.
A matching field is one whose content is equal to that particular value, or one that starts with that value followed by " (". The problem is that in the second part of that XPath I would have to refer to the context of that part itself and to the outer context at the same time.
In the following XPath I will - instead of using "." as the context- use c_outer and c_inner:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))[count(//field[(c_inner = c_outer) or starts-with(c_inner, concat(c_outer, ' ('))]) > 1]
I can't use "." for both for obvious reasons. But how could I reference a particular, or the current distinct value from the outer expression within the inner expression?
Would that even be possible?
XQuery can do it e.g.
for $s
in distinct-values(
//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
where count(//field[(. = $s) or starts-with(., concat($s, ' ('))]) > 1
return $s

Regexp, how to limit a match

I have a string:
string = %q{<span class="no">2503</span>read_attribute_before_type_cast(<span class="pc">self</span>.class.primary_key)}
In this example I want to match the word 'class' only where it is not an attribute inside a tag. My regexp for this:
/\bclass[^=]/
But the problem is that it matches the last letter
/\bclass[^=]/.match(string) => 'class.'
I don't want the trailing dot in the result. I've tried this regexp:
/\bclass(?:[^=])/
but still got the same result. How to limit the result to 'class'? Thanks
You are almost correct, but you have an error in your look ahead. Try this:
/\bclass(?!=)/
The regex term (?!=) means the input to the right must not match the character '='
You can take your variable string and extract a subsection using groups:
substring = string[/\b(class)[^=]/, 1]
The brackets around class will set that as the first "group", which is referred to by the 1 as the second parameter in the square brackets.
Assuming your only issue is keeping it from matching span.class.blah, just ignore . as well, so [^=.].
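A short sketch comparing the two suggestions on the string from the question. The negative lookahead checks the next character without consuming it, while the capture group lets the pattern consume the extra character but returns only the grouped part.

```ruby
string = %q{<span class="no">2503</span>read_attribute_before_type_cast(<span class="pc">self</span>.class.primary_key)}

# Negative lookahead: "class" not followed by "=", and the "." is not consumed.
lookahead_match = /\bclass(?!=)/.match(string)

# Capture group: [^=] still consumes the ".", but group 1 holds only "class".
group_match = string[/\b(class)[^=]/, 1]

puts lookahead_match[0]
puts lookahead_match.post_match # what follows the match in the string
puts group_match
```

One difference worth noting: the lookahead version also matches 'class' at the very end of a string, whereas [^=] needs at least one more character to be present.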

RegExp Counting System

I'm trying to create a system where I can convert regex values to integers and vice versa, where zero would be the most basic regex (probably "/./") and any subsequent numbers would be more complex regexes.
My best approach so far was to stick all the possible values that could be contained within a regex into an array:
values = [ "!", ".", "\/", "[", "]", "(", ")", "a", "b", "-", "0", "9", .... ]
and then to take from that array as follows:
def get( integer )
  if( integer.zero? )
    return '';
  end
  integer = integer - 1;
  if( integer < values.length )
    return values[integer]
  end
  get(( integer / values.length ).floor) + get( integer % values.length);
end
sample_regex = /#{get( 100 )}/;
The biggest problem with this approach is that an invalid RegExp can easily be generated.
Is there an already established algorithm to achieve what I'm trying? if not, any suggestions?
Thanx
Steve
Since regular expressions can be formally defined by recursively applying a finite number of elements, this can be done: instead of simply concatenating elements, combine them according to the rules of regular expressions. Because the regular language is also recursively enumerable, this is guaranteed to work.
However, it's quite probably overkill to implement this. What do you need this for? Would a simple dictionary of Number -> RegExp key-value pairs not be better suited to associate regular expressions with unique numbers?
I would say that // is the simplest regex (it matches anything). /./ is fairly complex since it is just shorthand for /[^\n]/, which itself is just shorthand for a much longer expression (what that expression is depends on your character set). The next simplest expression would be /a/ where a is the first character in your character set. That last statement brings up an interesting problem for your enumeration: what character set will you use? Any enumeration will be tied to a given character set. Assuming you start with // as 0, /\x{00}/ (match the nul character) as 1, /\x{01}/ as 2, etc. Then you would start to get into interesting regexes (ones that match more than one string) around 129 if you used the ASCII set, but it would take up to 1114112 for UNICODE 5.0.
All in all, I would say a better solution is treat the number as a sequence of bytes, map those bytes into whatever character set you are using, use a regex compiler to determine if that number is a valid regex, and discard numbers that are not valid.
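A minimal sketch of that idea in Ruby, under stated assumptions: the alphabet below is a deliberately tiny, hypothetical one (a real enumeration would cover the whole character set), numbers are mapped to strings with bijective base-k numbering, and Ruby's own regex compiler decides validity through RegexpError. The method names are invented for this example.

```ruby
# Hypothetical, deliberately small alphabet for illustration only.
ALPHABET = ['a', 'b', '.', '*', '(', ')', '[', ']', '|'].freeze

# Bijective base-k numbering: 0 -> "", 1 -> "a", ..., 9 -> "|", 10 -> "aa", ...
# Every string over ALPHABET corresponds to exactly one integer.
def candidate(n)
  out = +''
  while n > 0
    n -= 1
    out.prepend(ALPHABET[n % ALPHABET.size])
    n /= ALPHABET.size
  end
  out
end

# Let the regex compiler decide whether the candidate is a valid pattern.
def valid_regex?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# Walk the candidates in order and return the index-th valid one (0-based).
def nth_valid_regex(index)
  n = 0
  loop do
    source = candidate(n)
    if valid_regex?(source)
      return Regexp.new(source) if index.zero?
      index -= 1
    end
    n += 1
  end
end

puts nth_valid_regex(2).inspect
```

Skipping invalid candidates keeps the mapping simple at the cost of a linear scan; going from a regex back to its number would need the same walk, so this is a sketch of the approach rather than an efficient codec.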

Resources