A better way of selecting multiple elements by attribute name in XPath - xpath

I'm looking to select a collection of elements based on an array of ID names. I'm currently using a giant OR statement essentially:
//*[#id='apple']|//*[#id='orange']|//*[#id='banana']
But building that string manually seems messy. Is there something like a nice SQL-esque "WHERE IN [a,b,c]" operator that I could be using?
I am using the HTTPAgilityPack for ASP.Net which I think equates to XPath1.o (feel free to correct me on that.)
Thanks.

First, you could simplify this by using or. This avoids repeating the //* multiple times although you till specify the #id= part multiple times:
//*[#id='apple' or #id='orange' or #id='banana']
A more elegant solution is to check against a list of acceptable ids. Now if you're using XPath 1.x then you'll have to do a bit of gymnastics to get contains() to do your bidding. Specifically, notice that I've got spaces on both ends of the first string, and then concatenate spaces to each end of #id before looking for a match. This is to prevent an #id of "range" from matching, for example.
//*[contains(' apple orange banana ', concat(' ', #id, ' '))]
If you have are using XPath 2.0 then the way forward is simpler thanks to the addition of sequences to the language:
//*[exists(index-of(('apple', 'orange', 'banana'), #id))]

Use:
//*[contains('|apple|banana|orange|', concat('|',#id, '|'))]
In case some of the id attributes may contain the "|" character, use another instead, that is known not to be present in the value of any of the id attributes.
An XPath 2.0 solution:
//*[#id=('apple', 'orange', 'banana')]

Related

XPath to be precised into one in order to extract text from a web page?

I have a few Xpaths as below:
//*[#id="904735f0-bb82-11ea-a473-6d0f51688222"]/div/p
//*[#id="729c0860-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
//*[#id="2555ab30-bb84-11ea-9e8b-277e7f6208b2"]/div/div/div[1]/div/p
//*[#id="7e100250-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
//*[#id="811727d0-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
All of the above are used to extract text from a single web page since text is located at different view--ports, but I wish to find a single xpath to extract text for all of them. Is it possible to use 'and' and multiple ID's to extract all of it through one xpath?
Any other suggestions would be appreciate.
You can use the or operator for the last four.
And the merge-nodes operator | to add the first one.
So to select all 5 expression in one, use the following expression:
//*[#id="904735f0-bb82-11ea-a473-6d0f51688222"]/div/p | //*[#id="729c0860-a71d-11ea-b994-53a3e91a35c2" or #id="2555ab30-bb84-11ea-9e8b-277e7f6208b2" or #id="7e100250-a71d-11ea-b994-53a3e91a35c2" or #id="811727d0-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
A shorter and more generic solution could be :
(//div/div/div[1]/div/p|//div/p)[parent::*[string-length(#id)=36 and substring(#id,24,1)="-"]]
First part with () is used to specify the end of the path. Since #id attributes have the same length, we use it inside the predicate. We also verify the presence of a - at a specific position with substring.

Is there a short and elegant way to write an XPath 1.0 expression to get all HREF values containing at least one of many search values?

I was just wondering if there is a shorter way of writing an XPath query to find all HREF values containing at least one of many search values?
What I currently have is the following:
//a[contains(#href, 'value1') or contains(#href, 'value2')]
But it seems quite ugly, especially if I were to have more values.
First of all, in many cases you have to live with the "ugliness" or long-windedness of expressions if only XPath 1.0 is at your disposal. Elegance is something introduced with version 2.0, I'd daresay.
But there might be ways to improve your expression: Is there a regularity to the href attributes you'd like to find? For instance, if it is sufficient as a rule to say that the said href attribute values must start with "value", then the expression could be
//a[starts-with(#href,'value')]
I know that "value1" and "value2" are most probably not your actual attribute values but there might be something else that uniquely identifies the group of a elements you're after. Post your HTML input if this is something you want us to help you with.
Personally, I do not find your expression ugly. There is just one or operator and the expression is quite short and readable. I take
if I were to have more values.
to mean that currently, there are only two attribute values you are interested in and that your question therefore is a theoretical one.
In case you're using XPath 2 and would like to have exact matches instead of also matches only containing part of a search value, you can shorten with
//a[#href = ('value1', 'value2')]
For contains() this syntax wouldn't work as the second argument of contains() is only allowed to be 0 or 1 value.
In XPath 2 you could also use
//a[some $s in ('value1', 'value2') satisfies contains(#href, $s)]
or
//a[matches(#href, "value1|value2")]

How to use the "translate" Xpath function on a node-set

I have an XML document that contains items with dashes I'd like to strip
e.g.
<xmlDoc>
<items>
<item>a-b-c</item>
<item>c-d-e</item>
<items>
</xmlDoc>
I know I can find-replace a single item using this xpath
/xmldoc/items/item[1]/translate(text(),'-','')
Which will return
"abc"
however, how do I do this for the entire set?
This doesn't work
/xmldoc/items/item/translate(text(),'-','')
Nor this
translate(/xmldoc/items/item/text(),'-','')
Is there a way at all to achieve that?
I know I can find-replace a single
item using this xpath
/xmldoc/items/item[1]/translate(text(),'-','')
Which will return
"abc"
however, how do I do this for the
entire set?
This cannot be done with a single XPath 1.0 expression.
Use the following XPath 2.0 expression to produce a sequence of strings, each being the result of the application of the translate() function on the string value of the corresponding node:
/xmlDoc/items/item/translate(.,'-', '')
The translate function accepts in input a string and not a node-set. This means that writing something like:
"translate(/xmlDoc/items/item/text(),'-','')"
or
"translate(/xmlDoc/items/item,'-','')"
will result in a function call on the first node only (item[1]).
In XPath 1.0 I think you have no other chances than doing something ugly like:
"concat(translate(/xmlDoc/items/item,'-',''),
translate(/xmlDoc/items/item[2],'-',''))"
Which is privative for a huge list of items and returns just a string.
In XPath 2.0 this can be solved nicely using for expressions:
"for $item in /xmlDoc/items/item
return replace($item,'-','')"
Which returns a sequence type:
abc cde
PS Do no confuse function calls with location paths. They are different kind of expressions, and in XPath 1.0 can not be mixed.
here is yet anther example, running it against chrome developer tools, in prepartion for a selenium test.
$x("//table[#id='sometable_table']//tr[1=1 and ./td[2=2 and position()=2 and .//*[translate(text(), ',', '') ='1001'] ] ]/td[position()=2]")
Essentially the the data sometable_table has a column containing numbers that appear localized. For example 1001 would appear as 1,001. With the above you have somewhat nasty xpath expression.
Where first you select all table rows. Then you focus on the data of the position 2 table data for the row. Then you go deeper into the contents of the position=2 table data expand the data on the cell until you find any node whose text after string replacement is 1001. Finally you ask for the table at position 2 to be returned.
But since all your main filters are at the table row level, you could be doing additional filters at table data columns at other positions as well, if you need to find the appropriate table row that has content (A) on a cell column and content (B) on a different column.
NOTE:
It was actually quite nasty to write this, because intuitively, we all google for XPATH replace string. So I was getting furstrated trying to use xpath replace until i realized chrome supports XPATH 1.0. In xpath 1.0 the string functions that exist are different from xpath 2.0, you need to use this translate function.
See reference:
http://www.edankert.com/xpathfunctions.html

Trouble using Xpath "starts with" to parse xhtml

I'm trying to parse a webpage to get posts from a forum.
The start of each message starts with the following format
<div id="post_message_somenumber">
and I only want to get the first one
I tried xpath='//div[starts-with(#id, '"post_message_')]' in yql without success
I'm still learning this, anyone have suggestions
I think I have a solution that does not require dealing with namespaces.
Here is one that selects all matching div's:
//div[#id[starts-with(.,"post_message")]]
But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:
(//div[#id[starts-with(.,"post_message")]])[1]
These use the dot to represent the id's value within the starts-with() function. You may have to escape special characters in your language.
It works great for me in PowerShell:
# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'
# Run the xpath selection of all matching div's
$xml.selectnodes('//div[#id[starts-with(.,"post_message")]]')
Result:
id
--
post_message_somenumber
post_message_somenumber2
Or, for just the first match:
# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[#id[starts-with(.,"post_message")]])[1]')
Result:
id
--
post_message_somenumber
I tried xpath='//div[starts-with(#id,
'"post_message_')]' in yql without
success I'm still learning this,
anyone have suggestions
If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.
Specifying names of elements that are in a default namespace is the most FAQ in XPath. If you search for "XPath default namespace" in SO or on the internet, you'll find many sources with the correct solution.
Generally, a special method must be called that binds a prefix (say "x:") to the default namespace. Then, in the XPath expression every element name "someName" must be replaced by "x:someName.
Here is a good answer how to do this in C#.
Read the documentation of your language/xpath-engine how something similar should be done in your specific environment.
#FindBy(xpath = "//div[starts-with(#id,'expiredUserDetails') and contains(text(), 'Details')]")
private WebElementFacade ListOfExpiredUsersDetails;
This one gives a list of all elements on the page that share an ID of expiredUserDetails and also contains the text or the element Details

What is the best way to match id's against a regular expression in Hpricot?

Using apricot, it is pretty easy to see how I can extract all elements with a given id or class using a CSS Selector. Is it possible to extract elements from a document based on whether some attribute of those elements matches against some regular expression?
If you mean do something like:
doc.search("//div[#id=/regex/]")
then I don't think it can be done. The alternative is to find all elements and then iterate through the results deleting those that don't match a regex.
result = doc.search("//div")
result.delete_if (|x| x.to_s !~ /regex/)
There are lots of alternative approaches. This thread has two other suggestions: Hpricot and Regular Expression.
Note, depending on exactly what it is you are trying to match you may be able to use the "Supported, but different" syntaxes available on the Hpricot Wiki, e.g:
E[#foo$=“bar”]
Matches an E element whose “foo”
attribute value ends exactly with the
string “bar”

Resources