So I have elements that look like this:
<li class="attribute "></li> # note the space
<li class="attribute"></li>
Using the XPath //li[@class="attribute"] will get the second element but not the first. How can I get both elements with the same XPath?
This XPath 1.0 expression,
//li[contains(concat(' ', normalize-space(@class), ' '),
' attribute ')]
will select all li elements whose class attribute contains attribute as a whitespace-separated class name, regardless of leading or trailing spaces or of any other class names present.
If you want to match attribute with possible leading and trailing spaces only (no other class names), just use normalize-space():
//li[normalize-space(@class) = 'attribute']
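To see why the two expressions behave differently, here is a minimal Ruby sketch that mimics their string logic on raw class values. The helper names normalize_space, has_class_token? and only_class? are made up for this illustration, and split.join(" ") only approximates XPath's normalize-space().

```ruby
# Approximates XPath normalize-space(): trim and collapse whitespace runs.
def normalize_space(value)
  value.split.join(" ")
end

# Mirrors contains(concat(' ', normalize-space(@class), ' '), ' attribute ').
def has_class_token?(class_value, token)
  " #{normalize_space(class_value)} ".include?(" #{token} ")
end

# Mirrors normalize-space(@class) = 'attribute'.
def only_class?(class_value, token)
  normalize_space(class_value) == token
end

["attribute ", "attribute", "big attribute", "attributes"].each do |value|
  token = has_class_token?(value, "attribute")
  only = only_class?(value, "attribute")
  puts "#{value.inspect}: token match=#{token}, exact match=#{only}"
end
```

Padding both sides with a space is what keeps a value such as "attributes" from matching, which a plain contains(@class, 'attribute') would not prevent.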
I have this link, which I declare like this:
link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
The question is: how could I use a regex to extract only the href value?
Thanks!
If you want to parse HTML, you can use the Nokogiri gem instead of using regular expressions. It's much easier.
Example:
require "nokogiri"
link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
link_data = Nokogiri::HTML(link)
href_value = link_data.at_css("a")[:href]
puts href_value # => https://www.congress.gov/bill/93rd-congress/house-bill/11461
You should be able to use a regular expression like this:
href\s*=\s*"([^"]*)"
See this Rubular example of that expression.
The capture group will give you the URL, e.g.:
link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
match = /href\s*=\s*"([^"]*)"/.match(link)
if match
url = match[1]
end
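As a compact variant of the same idea, Ruby's String#[] accepts a regular expression plus a capture-group index, and String#scan collects every match. This is only a sketch: the sample markup is assumed, and the second URL (example.com) is made up for illustration.

```ruby
href_pattern = /href\s*=\s*"([^"]*)"/

html = '<a href="https://www.congress.gov/bill/93rd-congress/house-bill/11461">H.R.11461</a> ' \
       '<a href = "https://example.com/other">other</a>'

# String#[] with a capture index returns the first href value, or nil if nothing matches.
first_url = html[href_pattern, 1]

# String#scan returns one array per match; flatten leaves just the captured URLs.
all_urls = html.scan(href_pattern).flatten

puts first_url
puts all_urls.inspect
```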
Explanation of the expression:
href matches the href attribute
\s* matches 0 or more whitespace characters (this is optional -- you only need it if the HTML might not be in canonical form).
= matches the equal sign
\s* again allows for optional whitespace
" matches the opening quote of the href URL
( begins a capture group for extraction of whatever is matched within
[^"]* matches 0 or more non-quote characters. Since quotes inside HTML attributes must be escaped, this will match all characters up to the end of the URL.
) ends the capture group
" matches the closing quote of the href attribute's value
To capture just the URL, you can do this:
/(href\s*\=\s*\\\")(.*)(?=\\)/
and use the second capture group.
http://rubular.com/r/qcqyPv3Ww3
I am trying to select the "string b" text node using XPath with the HtmlAgilityPack.
<div>
string a<br/>
string b<br/>
string c<br/>
</div>
I am not sure how to select the text.
This won't work: //div/text(1)
Does anybody have suggestions?
There are two problems with your expression:
XPath starts counting at 1, so you want the second text node
text() is a node filter which does not accept arguments. If you want to limit to the second text node, use the predicate [position() = 2] or the short version [2].
Use this expression:
//div/text()[2]
Selecting text nodes can involve some hassles: whether leading and trailing whitespace is chopped, and whether whitespace-only text nodes are omitted, is implementation-dependent.
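The 1-based numbering and the whitespace caveat can be illustrated without an XPath engine. The following Ruby sketch only simulates how the div's content falls into text nodes by splitting on the br tags. It is not what HtmlAgilityPack does internally, and the variable names are made up.

```ruby
inner = "\nstring a<br/>\nstring b<br/>\nstring c<br/>\n"

# Each <br/> ends one text node, so splitting approximates the div's text() children.
text_nodes = inner.split(%r{<br\s*/>})

# XPath positions start at 1, so text()[2] corresponds to Ruby index 1.
second = text_nodes[1]

puts second.inspect        # the raw node still carries its leading newline
puts second.strip.inspect  # trimmed value
```

A parser that drops or trims whitespace-only nodes may number the nodes differently, which is why it is worth inspecting the raw values first.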
Try:
//div/br[1]/following-sibling::text()[1]
This selects the text node directly following the first br.
I want to get the node:
//script[starts-with(text(), '\r\nvar name')]
but it seems XPath does not recognize the \r\n escape characters. Any ideas how to match them?
Note: I am using the HTML Agility Pack.
Use:
//script[starts-with(., '&#xD;&#xA;var name')]
Most often XML is normalized by the XML parser and there is only a single NL character left -- therefore, if the above expression doesn't select the wanted script elements, try with:
//script[starts-with(., '&#xA;var name')]
Or, this would work in both cases:
//script
[(starts-with(., '&#xD;&#xA;') or starts-with(., '&#xA;'))
and
starts-with(substring-after(., '&#xA;'), 'var name')
]
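The same "either line ending" check can be written in Ruby for comparison, for instance when post-filtering script contents after they have been extracted. This is a sketch under that assumption; starts_with_var_name? is a hypothetical helper, not part of any library.

```ruby
# True when the text begins with an optional CR, then a LF, then "var name",
# covering both the CR+LF case and the normalized LF-only case.
def starts_with_var_name?(script_text)
  script_text.match?(/\A\r?\nvar name/)
end

puts starts_with_var_name?("\r\nvar name = 'x';")
puts starts_with_var_name?("\nvar name = 'x';")
puts starts_with_var_name?("var name = 'x';")
```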
In an XPath expression, I would like to focus on certain elements and analyse them:
...
<field>aaa</field>
...
<field>bbb</field>
...
<field>aaa (1)</field>
...
<field>aaa (2)</field>
...
<field>ccc</field>
...
<field>ddd (7)</field>
I want to find the elements whose text content (apart from a possible enumeration) is unique. In the above example that would be bbb, ccc and ddd.
The following XPath gives me the unique values:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
Now I would like to extend that and perform another XPath on all the distinct values: count how many fields start with each of them and retrieve the ones whose count is bigger than 1.
A matching field either has content equal to that particular value, or starts with that value followed by " (". The problem is that in the second part of that XPath I would have to refer to the context of that part itself and to the former context at the same time.
In the following XPath I will use c_outer and c_inner instead of "." as the context:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))[count(//field[(c_inner = c_outer) or starts-with(c_inner, concat(c_outer, ' ('))]) > 1]
I can't use "." for both for obvious reasons. But how could I reference a particular, or the current distinct value from the outer expression within the inner expression?
Would that even be possible?
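XQuery can do it, e.g.:
(: normalize-space() drops the space that substring-before() leaves before the parenthesis :)
for $s
in distinct-values(
//field[matches(normalize-space(.), ' \([0-9]\)$')]/normalize-space(substring-before(., '(')))
where count(//field[(. = $s) or starts-with(., concat($s, ' ('))]) > 1
return $s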
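For readers without an XQuery processor, the intended logic of the query can be traced in plain Ruby on the sample values from the question. This is a sketch of the same filtering steps, not a replacement for the query, and the variable names are illustrative.

```ruby
fields = ["aaa", "bbb", "aaa (1)", "aaa (2)", "ccc", "ddd (7)"]

# Distinct base names of the fields that end in a one-digit enumeration such as " (1)".
# strip removes the space left in front of the parenthesis.
bases = fields
  .select { |f| f.strip.match?(/ \([0-9]\)\z/) }
  .map { |f| f.split("(").first.strip }
  .uniq

# Keep the bases matched by more than one field, either exactly or as "base (".
repeated = bases.select do |base|
  fields.count { |f| f == base || f.start_with?("#{base} (") } > 1
end

puts bases.inspect
puts repeated.inspect
```

The block variable base plays the role of $s: it names the outer value so the inner test can refer to both contexts at once.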
I'm trying to select an anchor element by first containing the text "To Be Coded", then extracting a number from a string using substring, then using the greater than comparison operator (>0). This is what I have thus far:
/a[number(substring(text(),???,string-length()-1))>0]
An example of the HTML is:
<a class="" href="javascript:submitRequest('getRec','30', '63', 'Z')">
To Be Coded (23)
</a>
My issue right now is I don't know how to find the first occurrence of the open parenthesis. I'm also not sure how to combine what I have with the contains(text(),"To Be Coded") function.
So my criteria for the selection is:
Must be an anchor element
Must include the text "To Be Coded"
Must contain a number greater than 0 in the parentheses
Edit: I suppose I could just "hard code" the starting position for the substring, but I'm not sure what that would be - will XPath count the white space before the text in the element? How would it handle/count the characters?
Try this:
a[contains(., 'To Be Coded') and number(substring-before(substring-after(., '('), ')')) > 0]
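To check the expression's logic on the sample anchor text, here is a Ruby sketch that re-implements substring-after and substring-before by hand. The helper names are made up for this illustration, and Float only roughly stands in for XPath's number(). Because the number is located relative to the parentheses, the whitespace around the text does not matter.

```ruby
# Like XPath substring-after(): text after the first separator, or "" if absent.
def substring_after(text, sep)
  _head, found, tail = text.partition(sep)
  found.empty? ? "" : tail
end

# Like XPath substring-before(): text before the first separator, or "" if absent.
def substring_before(text, sep)
  head, found, _tail = text.partition(sep)
  found.empty? ? "" : head
end

# Mirrors: contains(., 'To Be Coded') and number(...) > 0.
def to_be_coded?(text)
  inside = substring_before(substring_after(text, "("), ")")
  count = Float(inside.strip, exception: false) # nil when not numeric, roughly like NaN
  text.include?("To Be Coded") && !count.nil? && count > 0
end

puts to_be_coded?("\n  To Be Coded (23)\n")
puts to_be_coded?("\n  To Be Coded (0)\n")
puts to_be_coded?("\n  Coded (5)\n")
```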