I have a structure similar to this:
<tr>
<td>I WANT THIS</td>
<td>
<a class="unread">text</a>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td>
<a class="read">text</a>
</td>
<td></td>
</tr>
I need to select the <tr> node that contains an <a> node with the attribute [@class='unread'], so that I can select its inner <td> later.
I tried //tr[a/@class='unread'] and //tr/a[@class='textMsg unread'], but neither worked. How can I get my <tr> node?
The a tag is not a child of the tr tag; it is a descendant. You can try this XPath:
//tr[.//a/@class='unread']
Or
//tr[descendant::a/@class='unread']
To select the wanted td element(s), use:
//tr//td[.//a[@class = 'unread']]
If it is known that the td is a child of the tr and the a is a child of the td, the above may be simplified to:
//tr/td[a[@class = 'unread']]
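To check the child-versus-descendant point on the sample rows, here is a minimal sketch using Python's standard-library ElementTree. This is an assumption on my part, since the question does not say which tool is used. ElementTree supports only a subset of XPath and cannot evaluate //tr[.//a/@class='unread'] directly, so the sketch finds the a element and steps up with ".." instead. That only works when the a is a direct child of the td and the td a direct child of the tr, as in the sample.

```python
import xml.etree.ElementTree as ET

# Sample rows modelled on the question, wrapped in <table> so the XML is well-formed.
doc = ET.fromstring("""
<table>
  <tr>
    <td>I WANT THIS</td>
    <td><a class="unread">text</a></td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td><a class="read">text</a></td>
    <td></td>
  </tr>
</table>
""")

# Find the matching <a>, then step up twice (a -> td -> tr) with "..".
unread_rows = doc.findall(".//a[@class='unread']/../..")

# Text of the first cell of each matching row.
first_cells = [row.find("td").text for row in unread_rows]
print(first_cells)  # ['I WANT THIS']
```

With a full XPath 1.0 engine the expressions from the answer can be used as written.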
I am currently solving a problem with Scrapy and XPath where I need to grab a nested tag. Assume a structure like this:
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
I only want to select the nested tr (<tr><td></td><td></td></tr>). How should I write the XPath for this?
To get all tr elements that have td children but no table grandchildren, use the XPath expression //tr[td][not(td/table)].
//tr/td[2]/..
We select the second td in each tr and then go one level up to select the tr element. This works here because only the nested rows have a second td.
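The condition in //tr[td][not(td/table)] can also be written as a plain filter. The sketch below uses Python's standard-library ElementTree rather than Scrapy (an assumption made to keep it dependency-free; ElementTree has no not() function, so the test is done in Python). The cell texts a1, a2 and so on are made up so the result is visible.

```python
import xml.etree.ElementTree as ET

# Shortened version of the nested tables from the question, with made-up cell text.
doc = ET.fromstring("""
<table><tbody>
  <tr><td>
    <table><tbody>
      <tr><td>a1</td><td>a2</td></tr>
      <tr><td>b1</td><td>b2</td></tr>
    </tbody></table>
  </td></tr>
  <tr><td>
    <table><tbody>
      <tr><td>c1</td><td>c2</td></tr>
    </tbody></table>
  </td></tr>
</tbody></table>
""")

# Keep rows that have a td child but no table directly inside any td child.
inner_rows = [
    tr for tr in doc.iter("tr")
    if tr.find("td") is not None and tr.find("td/table") is None
]

cells = [[td.text for td in tr.findall("td")] for tr in inner_rows]
print(cells)  # [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']]
```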
I'm stumped on why this happens and how to write this query.
My HTML structure is like this (tables nested inside tables):
<root>
<table>
</table>
<table>
<tr>
<td>
<table>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</root>
If I start out my xpath like:
var tables = blah.SelectNodes("//table");
which returns me the 3 parent tables, then I want to select the td's from the 2nd tr like this:
var td = tables[2].SelectNodes("//tr[2]/td");
But, when I do this, it goes back to the parent/root, the "blah" level. Why is this, and how can I keep filtering my search results down?
Note: The example xml structure may not directly match the queries written, just trying to give a general idea...
Just keep extending the XPath.
This one returns the <tr> items (four of them) of the nested table:
/root/table/tr/td/table/tr
This one returns the second <tr> item:
/root/table/tr/td/table/tr[2]
Your best bet, though, is to give individual id attributes to each table, so that you can find it directly using that attribute.
Using something like this:
<root>
<table id="1">
</table>
<table id="2">
<tr>
<td>
<table id="3">
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</root>
You can select the innermost table with:
//table[@id="3"]
You can get an individual <td> item from that innermost table with:
//table/tr/td/table/tr[2]/td[1]
Assigning an id attribute makes it a little easier (note missing /tr/td items after the first table):
//table[@id="3"]/tr[2]/td[1]
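On the "why" part of the question: in XPath, a path that starts with // is always evaluated from the document root, whatever node you call it on. A relative path (a plain step, or one starting with .//) is evaluated from the current node. The sketch below shows a relative query against the nested table using Python's standard-library ElementTree. This is an assumption for illustration, since the question uses a SelectNodes-style API, and the cell texts are made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<root>
  <table id="1"></table>
  <table id="2">
    <tr><td>
      <table id="3">
        <tr><td>r1c1</td><td>r1c2</td></tr>
        <tr><td>r2c1</td><td>r2c2</td></tr>
      </table>
    </td></tr>
  </table>
</root>
""")

tables = doc.findall(".//table")  # all three tables, in document order
inner = tables[2]                 # the nested table (id="3")

# Relative path: evaluated from `inner`, not from the document root.
cells = [td.text for td in inner.findall("tr[2]/td")]
print(cells)  # ['r2c1', 'r2c2']

# Looking the table up by its id attribute reaches the same element.
by_id = doc.find(".//table[@id='3']")
print(by_id is inner)  # True
```

In the API from the question, the equivalent change would be to query the table node with a relative expression such as "tr[2]/td" or ".//tr[2]/td" instead of "//tr[2]/td".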
I have a page with a list of job offers, and every job in the list links to the page for that offer.
I have a problem with Microdata. My question is: which variant is better?
First variant:
<table itemscope itemtype="http://schema.org/JobPosting">
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 1</td>
</tr>
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 2</td>
</tr>
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 3</td>
</tr>
</table>
Second variant:
<table>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 1</a></td>
</tr>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 2</a></td>
</tr>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 3</a></td>
</tr>
</table>
Your first variant means: There is a JobPosting which has three titles. Each of these titles consists of another JobPosting.
Your second variant means: There are three JobPostings, each one has a title.
So you want to go with your second variant.
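To make the "three JobPostings, each with a title" reading concrete, here is a rough sketch that walks the second variant and lists each item with its properties. It is not the real Microdata extraction algorithm, only a naive reader written with Python's standard-library ElementTree: it does not handle nested itemscope elements properly. The helper name read_items is hypothetical, itemscope is written as itemscope="" and the href is a "#" placeholder so that the snippet parses as XML.

```python
import xml.etree.ElementTree as ET

# Second variant from the question; itemscope="" and href="#" are placeholders
# added only so the markup is well-formed XML.
doc = ET.fromstring("""
<table>
  <tr itemscope="" itemtype="http://schema.org/JobPosting">
    <td itemprop="title"><a href="#">job 1</a></td>
  </tr>
  <tr itemscope="" itemtype="http://schema.org/JobPosting">
    <td itemprop="title"><a href="#">job 2</a></td>
  </tr>
  <tr itemscope="" itemtype="http://schema.org/JobPosting">
    <td itemprop="title"><a href="#">job 3</a></td>
  </tr>
</table>
""")

def read_items(root):
    """Naive reader: one item per itemscope element, properties from its descendants."""
    items = []
    for scope in root.iter():
        if "itemscope" not in scope.attrib:
            continue
        props = {}
        for el in scope.iter():
            if el is not scope and "itemprop" in el.attrib:
                text = "".join(el.itertext()).strip()
                props.setdefault(el.get("itemprop"), []).append(text)
        items.append((scope.get("itemtype"), props))
    return items

items = read_items(doc)
for itemtype, props in items:
    print(itemtype, props)
```

Each of the three rows comes out as its own JobPosting with a single title, which matches the reading above.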
Note that you have an error on your current page. Instead of the example contained in your question, on your page you use itemprop="title" on the a element. But then the href value is the title, not the anchor text.
So instead of
<td>
<a itemprop="title" href="…" title="…">…</a>
</td>
<!-- the value of 'href' is the JobPosting title -->
you should use
<td itemprop="title">
<a class="list1" href="…" title="…">…</a>
</td>
<!-- the value of 'a' is the JobPosting title -->
And why not use the url property here?
<td itemprop="title">
<a itemprop="url" href="…" title="…">…</a>
</td>
The second one. The first one describes the table itself as a JobPosting, which it isn't.
What XPath should I use to get the nodes that have a certain number of child nodes of a given tag type?
<table>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<p></p>
</tr>
<tr>
<td></td>
<p></p>
</tr>
</table>
For example, in the markup above, I want to get <tr> tags that have 3 <td> children. The xpath should return the 1st and 3rd <tr>.
You could try a condition based on the count() function, for example:
/table/tr[count(td)=3]
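The same condition can be sanity-checked outside XPath. This is a small sketch with Python's standard-library ElementTree, which has no count() function, so the counting is done with len() instead. The choice of Python is an assumption; the question does not name a tool.

```python
import xml.etree.ElementTree as ET

# The markup from the question, with empty elements written in short form.
doc = ET.fromstring("""
<table>
  <tr><td/><td/><td/></tr>
  <tr><td/></tr>
  <tr><td/><td/><td/><p/></tr>
  <tr><td/><p/></tr>
</table>
""")

rows = doc.findall("tr")

# 1-based positions of the rows with exactly three td children.
matching = [i for i, tr in enumerate(rows, start=1) if len(tr.findall("td")) == 3]
print(matching)  # [1, 3]
```

Only td children are counted, so the extra p in the third row does not affect the result.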
I want to select text() within each row in the following HTML. However, the text I want is either in the td element or the p element, so I have to write two statements to ensure each row is selected.
How do I combine the two statements into one?
XPATH:
//table/tr/td[not(p)]/text() | //table/tr/td/p/text()
With the result desired:
['1', '2', '3', '4']
Original html:
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>
<p>2
</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>
<p>4
</td>
</tr>
</table>
Probably you want something like this:
//table/tbody/tr/td//text()[normalize-space()]
This finds every text node that is not whitespace-only, at any depth below the td elements. Drop the tbody step if your parser does not add one (the markup above has no tbody).
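As a rough cross-check of that idea, here is a sketch using Python's standard-library ElementTree (an assumption; the question does not name a parser). The p elements are closed here so the sample parses as XML. Unlike the XPath expression, which returns the text nodes unchanged, this version also strips the surrounding whitespace.

```python
import xml.etree.ElementTree as ET

# The table from the question, with the <p> elements closed.
doc = ET.fromstring("""
<table>
  <tr><td>1</td></tr>
  <tr><td>
    <p>2
    </p>
  </td></tr>
  <tr><td>3</td></tr>
  <tr><td>
    <p>4
    </p>
  </td></tr>
</table>
""")

# Every text fragment at any depth below a td, skipping whitespace-only ones.
texts = [
    fragment.strip()
    for td in doc.iter("td")
    for fragment in td.itertext()
    if fragment.strip()
]
print(texts)  # ['1', '2', '3', '4']
```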