Xpath query to find elements which contain a certain descendant - xpath

I'm using Html Agility Pack to run xpath queries on a web page. I want to find the rows in a table which contain a certain interesting element. In the example below, I want to fetch the second row.
<table name="important">
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
<tr>
<td>Stuff I'm interested in</td>
<td><interestingtag/></td>
<td>More stuff I'm interested in</td>
</tr>
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
</table>
I'm looking to do something like this:
//table[@name='important']/tr[has a descendant named interestingtag]
Except with valid xpath syntax. ;-)
I suppose I could just find the interesting element itself and then work my way up the parent chain from the node that's returned, but it seemed like there ought to be a way to do this in one step and I'm just being dense.

"has a descendant named interestingtag" is spelled .//interestingtag in XPath, so the expression you are looking for is:
//table[@name='important']/tr[.//interestingtag]

Actually, you need to look for a descendant, not a child:
//table[@name='important']/tr[descendant::interestingtag]
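The difference between a child test and a descendant test can be checked with a short sketch. It assumes Python's standard-library xml.etree.ElementTree rather than Html Agility Pack, and ElementTree supports only a subset of XPath: the child predicate tr[interestingtag] works directly, while tr[.//interestingtag] is emulated here with a list comprehension. The markup is the table from the question, shortened by one row.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<table name="important">
  <tr><td>Stuff I'm NOT interested in</td></tr>
  <tr>
    <td>Stuff I'm interested in</td>
    <td><interestingtag/></td>
    <td>More stuff I'm interested in</td>
  </tr>
  <tr><td>Stuff I'm NOT interested in</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

# tr[interestingtag] only matches rows that have interestingtag as a direct
# child; in the sample it sits inside a <td>, so nothing matches.
child_rows = table.findall("tr[interestingtag]")

# Emulation of tr[.//interestingtag]: keep rows with the tag at any depth.
descendant_rows = [
    tr for tr in table.findall("tr")
    if tr.find(".//interestingtag") is not None
]

print(len(child_rows))             # 0
print(len(descendant_rows))        # 1
print(descendant_rows[0][0].text)  # Stuff I'm interested in
```

With a full XPath 1.0 engine, either expression from the answers above can be passed as a single query string.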

I know this isn't what the OP was asking, but if you wanted to find an element that had a descendant with a particular attribute, you could do something like this:
//table[@name='important']/tr[.//*[@attr='value']]

I know this is a late answer, but why not go the other way around? Find all <interestingtag/> elements and then select their ancestor <tr> elements.
//interestingtag/ancestor::tr
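A rough sketch of the bottom-up approach, again assuming Python's standard-library xml.etree.ElementTree. That module has no ancestor axis and its elements carry no parent pointer, so the hypothetical helper nearest_ancestor walks a child-to-parent map instead. It returns only the closest <tr>, which corresponds to ancestor::tr[1]; with nested tables, ancestor::tr would return every enclosing row.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<table name="important">
  <tr><td>Stuff I'm NOT interested in</td></tr>
  <tr>
    <td>Stuff I'm interested in</td>
    <td><interestingtag/></td>
    <td>More stuff I'm interested in</td>
  </tr>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

# Map every element to its parent so the tree can be walked upwards.
parent_of = {child: parent for parent in table.iter() for child in parent}


def nearest_ancestor(node, tag):
    """Return the closest ancestor of node with the given tag, or None."""
    current = parent_of.get(node)
    while current is not None and current.tag != tag:
        current = parent_of.get(current)
    return current


# Start at each interestingtag and climb to its enclosing row.
rows = [nearest_ancestor(hit, "tr") for hit in table.iter("interestingtag")]

print(len(rows))        # 1
print(rows[0][2].text)  # More stuff I'm interested in
```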

Related

Choosing multiple elements by using a square bracket at the end vs enclosing the whole xpath in braces before sq brackets?

I've come across cases when there are lots of links on the page and the following xpath works for choosing the first one:
//tag[@...]/div/a[1]
There are other cases when the above xpath doesn't work and then I need to use it the following way:
(//tag[@...]/div/a)[1]
As I write lengthier xpaths to code in business logic for which elements to select, this difference starts getting all the more complicated where the same xpath has multiple combinations of both of these.
What is the difference exactly between writing xpaths in these two ways? I've seen that for any particular occasion one of them works and the other doesn't.
Consider this sample HTML:
<table>
<tbody>
<tr>
<td>1.1</td>
<td>1.2</td>
<td>1.3</td>
</tr>
<tr>
<td>2.1</td>
<td>2.2</td>
<td>2.3</td>
</tr>
<tr>
<td>3.1</td>
<td>3.2</td>
<td>3.3</td>
</tr>
</tbody>
</table>
Here //table/tbody/tr/td[index] applies the index to the <td> elements within each <tr>, because a positional predicate is evaluated relative to each parent. Valid indexes are therefore 1, 2, 3, and the expression matches one <td> per row; an API that returns only the first match gives you the cell from the first row.
Use (//table/tbody/tr/td)[index] if you want to index across all <td> elements in the table. The parentheses collect every matching <td> first and the index is then applied to that whole set, so valid indexes are 1, 2, 3, ..., 9.
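The two forms can be compared on the sample table with a small sketch. It assumes Python's standard-library xml.etree.ElementTree, which supports the per-parent positional predicate td[2] but not the parenthesised form, so (…/td)[5] is emulated by indexing the full result list. In standard XPath a positional predicate is evaluated per parent, so td[2] yields one cell per row; a lookup that returns only the first match would show just the first of them.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<table>
  <tbody>
    <tr><td>1.1</td><td>1.2</td><td>1.3</td></tr>
    <tr><td>2.1</td><td>2.2</td><td>2.3</td></tr>
    <tr><td>3.1</td><td>3.2</td><td>3.3</td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

# tbody/tr/td[2]: the second <td> of each row, evaluated per parent <tr>.
per_row = [td.text for td in table.findall("tbody/tr/td[2]")]

# Emulation of (tbody/tr/td)[5]: collect every <td>, then index the whole
# list. XPath positions are 1-based, Python lists are 0-based.
all_cells = table.findall("tbody/tr/td")
fifth = all_cells[5 - 1].text

print(per_row)         # ['1.2', '2.2', '3.2']
print(len(all_cells))  # 9
print(fifth)           # 2.2
```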

Scrapy - Scraping hidden elements

I think what I want to ask is whether it's possible to get around sql:hide (https://learn.microsoft.com/en-us/sql/relational-databases/sqlxml-annotated-xsd-schemas-using/hiding-elements-and-attributes-by-using-sql-hide?view=sql-server-2017), but I've described my actual problem below in case I'm mistaken:
I'm trying to scrape the "foo" urls from a website with a DOM similar to the following:
<html>
<body>
<tbody>
<tr>
...
...
</tr>
</tbody>
<table>
<tbody>
<tr>
...
</tr>
<tr>
...
</tr>
</tbody>
</table>
</body>
</html>
Whenever I try print(response.css('a')) or equivalently print(response.xpath('//a')), I can see the "foo" urls, but not the "bar" urls. Additionally, using XPath I can access up to the table, but print(response.xpath('//table//*')) and print(response.xpath('//table//a')) both output [].
Could it be possible that the elements of table have been hidden from Scrapy somehow? How would one resolve this?
Thanks in advance. This is mainly for interest as the urls have a predictable pattern anyway.
I know that this is just a wild guess, but you can try
//a[starts-with(@href,'foo')]/text()
This should give you the text values of all a tags that have an href attribute whose value starts with the string 'foo'.
However, some parts of the resulting XML/HTML may be loaded by JavaScript at a later time, which would explain your difficulty locating certain elements.
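What the starts-with() predicate selects can be illustrated with a small sketch. It assumes Python's standard-library xml.etree.ElementTree, which has no starts-with() function, so the test is written as a Python string check instead. The markup, link targets, and link texts below are entirely hypothetical, since the question elides the real ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the real links are elided in the question.
SAMPLE = """
<html>
  <body>
    <a href="foo/1">first foo link</a>
    <table>
      <tbody>
        <tr><td><a href="bar/1">a bar link</a></td></tr>
      </tbody>
    </table>
    <a href="foo/2">second foo link</a>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# Emulation of //a[starts-with(@href,'foo')]/text(): visit every <a> in
# document order and keep the text of those whose href begins with "foo".
foo_texts = [
    a.text for a in root.iter("a")
    if a.get("href", "").startswith("foo")
]

print(foo_texts)  # ['first foo link', 'second foo link']
```

This only filters what is present in the parsed markup; content injected later by JavaScript would not appear in the tree at all.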

using xpath to find Href in selenium webdriver

I have to find the following href using Selenium in Java:
<tr>
<td>
<a target="mainFrame" href="reb.php?tiEx=ES"></a>
</td>
</tr>
thanks
There are multiple ways to find the link depending on the elements it is inside and the uniqueness of the element attribute values on the page, but, from what you've provided, you can rely on the target attribute:
//a[@target="mainFrame"]
You can also narrow it down to the scope of its parents:
//tr/td/a[@target="mainFrame"]
Also, you can additionally check the href attribute if it is persistent and reliable:
//tr/td/a[@target="mainFrame" and @href="reb.php?tiEx=ES"]
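The three expressions can be tried outside a browser with a minimal sketch. It assumes Python's standard-library xml.etree.ElementTree instead of Selenium, and wraps the snippet from the question in a <table> root, which is an assumption. ElementTree has no "and" operator, so the last query chains two predicates, which is equivalent for this case.

```python
import xml.etree.ElementTree as ET

# The <table> wrapper is an assumption; the question shows only the <tr>.
SAMPLE = """
<table>
  <tr>
    <td>
      <a target="mainFrame" href="reb.php?tiEx=ES"></a>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())

# //a[@target="mainFrame"]: any <a> with that target, at any depth.
by_target = root.findall(".//a[@target='mainFrame']")

# //tr/td/a[@target="mainFrame"]: the same test, scoped to the parents.
scoped = root.findall("tr/td/a[@target='mainFrame']")

# Both attributes must match; chained predicates stand in for "and".
both = root.findall("tr/td/a[@target='mainFrame'][@href='reb.php?tiEx=ES']")

print(len(by_target), len(scoped), len(both))  # 1 1 1
print(both[0].get("href"))                     # reb.php?tiEx=ES
```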

Nokogiri: Finding all tags in a direct path, not including arbitrary levels of nesting

Say I have an html document like:
<div id='findMe'>
<table>
<tr>
<td>
<p>
bad
</p>
</td>
</tr>
</table>
<p>
This is some text and this is a link
</p>
</div>
I want to capture all links inside the div #findMe that are inside paragraph tags, but not inside the table or any other tags. So, I want the one labeled "good", but not the one labeled "bad". I'm trying:
Nokogiri::HTML(html).css('#findMe p a')
but that's capturing both links. I also tried a more explicit xpath:
Nokogiri::HTML(html).css('#findMe').xpath('//p/a')
But that's doing the same thing. How can I tell Nokogiri to only search a specific path down the tree?
Use > in CSS to select direct children.
Nokogiri::HTML(html).css('#findMe > p > a')
Or use / in xpath:
Nokogiri::HTML(html).xpath("//div[@id='findMe']/p/a")
Figured out a way to do it, but I'm still not too comfortable with xpaths so if this isn't the best way feel free to post the more canonical way to achieve this.
Nokogiri::HTML(html).css('#findMe').xpath('//div/p/a')
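The difference between a direct path and a search at any depth can be seen in a short sketch. It assumes Python's standard-library xml.etree.ElementTree rather than Nokogiri, and the <body> wrapper, link targets, and exact link placement are hypothetical, with link texts taken from the "good" and "bad" labels in the question. A leading // in XPath searches from the document root, which is why chaining it after a narrower selection does not restrict the search.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction: the hrefs and link positions are assumptions.
SAMPLE = """
<body>
  <div id="findMe">
    <table>
      <tr><td><p><a href="#bad">bad</a></p></td></tr>
    </table>
    <p>This is some text and <a href="#good">good</a></p>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE.strip())
div = root.find(".//div[@id='findMe']")

# .//p/a: links in paragraphs at any depth below the div (both links).
any_depth = [a.text for a in div.findall(".//p/a")]

# p/a: only paragraphs that are direct children of the div, like
# '#findMe > p > a' in CSS or //div[@id='findMe']/p/a in XPath.
direct = [a.text for a in div.findall("p/a")]

print(any_depth)  # ['bad', 'good']
print(direct)     # ['good']
```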

Selenium WebDriver and xpath: a more complicated selection

So let's say my structure looks like this at some point:
..........
<td>
[...]
<input value="abcabc">
[...]
</td>
[...]
<td></td>
[...]
<td>
<input id="booboobooboo01">
<div></div> <=======I want to click this!
</td>
.........
I need to click that div, but I need to be sure it's on the same line as the td containing the input with value="abcabc". I also know that the div I need to click (which doesn't have id or any other relevant attribute I can use) is in a td at the same level as the first td, right after the input with id CONTAINING "boo" (dynamically generated, I only know the root part of the id). td's contain nothing relevant I can use.
This is what I tried as far as xpath goes:
//input[@value='abcabc']/../td/input[contains(@id,'boo')]/following-sibling::div
//input[@value='abcabc']/..//td/input[contains(@id,'boo')]/following-sibling::div
None of them worked, of course (element cannot be found).
I want to know if there's a way of selecting that div and how.
EDIT: //input[@value='abcabc']/../../td/input[contains(@id,'boo')]/following-sibling::div is the correct way. This was suggested by the person with the accepted answer. Also note that he offered a slightly different way of doing it. See his answer for details.
Try
//input[@value='abcabc']/ancestor::tr[1]/td/input[contains(@id,'boo')]/following-sibling::div[1]
Note that //input[@value='abcabc']/.. only goes up to the parent <td>, which is why yours did not work.
Another XPath that may work and is a bit simpler:
//input[@id='booboobooboo01']/../div[1]
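The point about ".." reaching only the <td> can be checked with a sketch. It assumes Python's standard-library xml.etree.ElementTree instead of Selenium, and the enclosing <table>, the second row, and the div texts are hypothetical additions used to verify the result. ElementTree lacks contains() and the following-sibling axis, so the hypothetical helper div_after_boo_input emulates that part of the accepted expression in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the table wrapper, second row, and div texts are
# assumptions added so the selection can be verified.
SAMPLE = """
<table>
  <tr>
    <td><input value="abcabc"/></td>
    <td></td>
    <td>
      <input id="booboobooboo01"/>
      <div>click me</div>
    </td>
  </tr>
  <tr>
    <td><input value="other"/></td>
    <td>
      <input id="booboobooboo02"/>
      <div>wrong row</div>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())

# One ".." step from the input only reaches its <td>.
parent = root.find(".//input[@value='abcabc']/..")
# Two steps reach the <tr>, which is what ancestor::tr[1] selects here.
row = root.find(".//input[@value='abcabc']/../..")


def div_after_boo_input(tr):
    """Emulate td/input[contains(@id,'boo')]/following-sibling::div[1]."""
    for cell in tr.findall("td"):
        children = list(cell)
        for i, child in enumerate(children):
            if child.tag == "input" and "boo" in child.get("id", ""):
                # First <div> that follows the input inside the same cell.
                for sibling in children[i + 1:]:
                    if sibling.tag == "div":
                        return sibling
    return None


target = div_after_boo_input(row)
print(parent.tag, row.tag)  # td tr
print(target.text)          # click me
```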
