This is a simplified version of the HTML of the page that I want to analyse:
<table class="class_1">
<tbody>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"> </td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_6"></span>square</td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_7"></span>circle</td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_6"></span>triangle</td>
</tr>
</tbody>
</table>
You can find the page at
https://sabbiobet.netsons.org/test.html
If I use this function in Google Sheets:
=IMPORTXML("https://sabbiobet.netsons.org/test.html";"//td[@class='class_5']")
I obtain:
square
circle
triangle
I need to obtain all the <td> elements with class="class_5" except the ones that contain only a non-breaking space (&nbsp;) or a <span class="class_7">.
In other words I want to obtain only these values:
Square
Triangle
Can somebody help me?
The following XPath expression
//td[@class='class_5' and span and not(span[@class='class_7'])]
selects all td elements having an attribute class with value class_5, having a child element span and not having a child element span where its class attribute has the value class_7.
Note that you could also use
//td[@class='class_5' and span[@class='class_6']]
to get the same result in this case.
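To check the first expression outside Google Sheets, here is a minimal Python sketch. It assumes the standard-library xml.etree.ElementTree module, which supports only a subset of XPath (no `and` or `not()` inside predicates), so the two span conditions are emulated in plain Python. The sample markup is a reduced copy of the table from the question, and `&#160;` is an assumed stand-in for the content of the otherwise empty cell.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the table from the question; &#160; (non-breaking space)
# is an assumed stand-in for the content of the empty cell.
HTML = """
<table class="class_1"><tbody>
<tr class="class_2"><td class="class_5">&#160;</td></tr>
<tr class="class_2"><td class="class_5"><span class="class_6"></span>square</td></tr>
<tr class="class_2"><td class="class_5"><span class="class_7"></span>circle</td></tr>
<tr class="class_2"><td class="class_5"><span class="class_6"></span>triangle</td></tr>
</tbody></table>
"""


def select_cells(root):
    """Emulate //td[@class='class_5' and span and not(span[@class='class_7'])]."""
    cells = []
    for td in root.findall(".//td[@class='class_5']"):
        spans = td.findall("span")  # direct span children only
        has_class_7 = any(s.get("class") == "class_7" for s in spans)
        if spans and not has_class_7:
            cells.append("".join(td.itertext()).strip())
    return cells


root = ET.fromstring(HTML)
result = select_cells(root)
print(result)
```

The cell without any span is dropped by the `spans` check and the circle cell by the `class_7` check, which mirrors the two conditions in the predicate.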
This should work:
//td[@class='class_5'][not(text()=' ')][not(./span[@class='class_7'])]
where [not(text()=' ')] tests not for a regular space but for the non-breaking space character (Unicode U+00A0). On Windows you can type it with Alt+0160, entering the digits on the numeric keypad.
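The difference between the two space characters is easy to miss, so here is a small Python sketch of the same filter. It assumes the standard-library xml.etree.ElementTree module; the helper `text_nodes` is a hypothetical name introduced here to approximate XPath's `text()` step, and the three sample cells are a reduced version of the table from the question.

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # the character written as &nbsp; in HTML, not a regular space

# Reduced sample: three class_5 cells from the question's table.
ROW = ET.fromstring(
    "<tr>"
    "<td class='class_5'>&#160;</td>"
    "<td class='class_5'><span class='class_6'></span>square</td>"
    "<td class='class_5'><span class='class_7'></span>circle</td>"
    "</tr>"
)


def text_nodes(elem):
    """Return the direct text-node children of elem, roughly like XPath text()."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


kept = [
    "".join(td.itertext())
    for td in ROW.findall("td[@class='class_5']")
    if NBSP not in text_nodes(td)                  # [not(text()=NBSP)]
    and td.find("span[@class='class_7']") is None  # [not(./span[@class='class_7'])]
]
print(kept)
```

Comparing against a regular space instead of `NBSP` would leave the first cell in the result, which is why the exact character matters.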
For example, take this piece of code:
<tr>
<td width="270" class="news">
Ukraine
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
USA
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>10.10.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
Germany
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>23.12.15<span>
</td>
</tr>
In the Chrome dev panel, the XPath //tr/td/a[text()="France"] gives me 3 results.
I can't find a way to get the value I need.
For example, how do I get the date of the 2nd result (10.10.15)?
I tried different approaches with position, such as //tr/td/a[text()="France"][position=2]/span/text() and //tr/td/a[position(text()="France")][2]/span/text(), but without success. Maybe "position" is not what I need?
EDITED:
The positions of "France" in the code differ from page to page. In the example above the positions are 2, 4 and 6, but on another page they could be 3, 10, 15, 25 and 27. How do I get the 2nd result and the last one, regardless of where "France" is actually located?
This is one possible way:
(//tr/td[a='France'])[2]/span/text()
The parentheses before the position index are required because a predicate ([]) has higher precedence (priority) than the path operators (// and /) *. Without them, an expression such as //tr/td[a='France'][2]/span/text() looks for the 2nd matching td within a single tr parent, which doesn't exist in the HTML sample you posted.
*: For further explanation on this matter, see: How to select specified node within Xpath node sets by index with Selenium?
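The effect of the parentheses can be sketched in Python. This assumes the standard-library xml.etree.ElementTree module, which cannot index a parenthesised node set, so the indexing is done on the Python list instead. The sample rows are an assumption: they restore the a elements that the XPath in the question refers to and close the span tags.

```python
import xml.etree.ElementTree as ET

# Assumed sample: one td per tr, with the country inside an <a> element.
TABLE = ET.fromstring(
    "<table>"
    "<tr><td><a>Ukraine</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>France</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>USA</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>France</a><span>10.10.15</span></td></tr>"
    "<tr><td><a>Germany</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>France</a><span>23.12.15</span></td></tr>"
    "</table>"
)

# Collect every matching td across the whole document first ...
matches = TABLE.findall(".//tr/td[a='France']")

# ... then index the combined list, like (//tr/td[a='France'])[2]
# and (//tr/td[a='France'])[last()].
second = matches[1].findtext("span")
last = matches[-1].findtext("span")

# Without the parentheses the index is evaluated per parent tr.
# Each tr holds a single td, so nothing can be "the 2nd td" in its row.
per_row = TABLE.findall(".//tr/td[a='France'][2]")

print(second, last, per_row)
```

Because the list is built before indexing, the result does not depend on which rows happen to contain France.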
Your 'xml' was invalid so first I corrected it (I know it isn't valid HTML but good enough to test):
<?xml version="1.0" encoding="utf-8"?>
<form>
<tr>
<td width="270" class="news">
France
<span>17.06.15</span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>10.10.15</span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>23.12.15</span>
</td>
</tr>
</form>
Then, this worked:
//tr[position()=2]/td/a[text()="France"]/../span/text()
I have a feed that outputs HTML. The following segment is part of the output
<div class="leftnav">
<table border="0" cols="2">
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat1 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle1</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink1
</td>
</tr>
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat2 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle2</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink2
</td>
</tr>
</table>
</div>
I want to process the above segment using XPath so that the output looks like this:
Article Cat1
ArticleTitle1
ArticleLink1 Article Cat2
ArticleTitle2
ArticleLink2
What is the optimal XPath that will produce the desired output? I tried //div[@class="leftnav"]/table/tr but this gives all the TR elements. I want to skip the first TR element so that I can get the output in the format I described above.
//div[@class="leftnav"]/table/tr[position() > 1]
Try the above.
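For environments where position() is not available, the same skip can be done after selection. The sketch below assumes Python's standard-library xml.etree.ElementTree module, which does not support `position() > 1`, so a list slice stands in for it. The markup is a shortened, assumed version of the feed segment from the question.

```python
import xml.etree.ElementTree as ET

# Shortened, assumed version of the feed segment (first category only).
BODY = ET.fromstring(
    "<body><div class='leftnav'><table>"
    "<tr><td class='topline'><span>&#160;</span></td></tr>"
    "<tr><td><span class='bold'>Article Cat1 </span></td></tr>"
    "<tr><td class='date'>ArticleTitle1</td></tr>"
    "<tr><td/><td class='date'>ArticleLink1</td></tr>"
    "</table></div></body>"
)

rows = BODY.findall(".//div[@class='leftnav']/table/tr")

# rows[1:] drops the first tr, like tr[position() > 1].
# split()/join collapses runs of whitespace inside each row.
lines = [" ".join("".join(tr.itertext()).split()) for tr in rows[1:]]
print(lines)
```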
Stupid simple way:
substring-after(normalize-space(string(//*:div)), normalize-space(string(//*:div/*:table/*[1])))
Result: "Article Cat1 ArticleTitle1 ArticleLink1 nbsp Article Cat2 ArticleTitle2 ArticleLink2"
I don't know why, but (position() > 1) doesn't work in my environment, so I've used strings instead.
I have a page with a list of job offers, and every job in the list is a link to the page with that job offer.
I have a problem with Microdata, and my question is: which variant is better?
First variant:
<table itemscope itemtype="http://schema.org/JobPosting">
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 1</td>
</tr>
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 2</td>
</tr>
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 3</td>
</tr>
</table>
Second variant:
<table>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 1</a></td>
</tr>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 2</a></td>
</tr>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 3</a></td>
</tr>
</table>
Your first variant means: There is a JobPosting which has three titles. Each of these titles consists of another JobPosting.
Your second variant means: There are three JobPostings, each one has a title.
So you want to go with your second variant.
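The two readings can be checked with a rough Python sketch. It is a simplified approximation of how Microdata items and properties nest, not the full extraction algorithm from the specification. The functions `top_level_items` and `properties` are hypothetical helpers introduced here, the markup is a reduced copy of the two variants (links left out), and the boolean `itemscope` attribute is written as `itemscope=""` so that the standard-library XML parser accepts it.

```python
import xml.etree.ElementTree as ET

JOB = "http://schema.org/JobPosting"

# Reduced copies of the two variants from the question.
FIRST = ET.fromstring(
    f'<table itemscope="" itemtype="{JOB}">'
    + "".join(
        f'<tr><td itemprop="title" itemscope="" itemtype="{JOB}">job {i}</td></tr>'
        for i in (1, 2, 3)
    )
    + "</table>"
)
SECOND = ET.fromstring(
    "<table>"
    + "".join(
        f'<tr itemscope="" itemtype="{JOB}"><td itemprop="title">job {i}</td></tr>'
        for i in (1, 2, 3)
    )
    + "</table>"
)


def top_level_items(root):
    """Elements that start an item and are not a property of another item."""
    return [e for e in root.iter()
            if "itemscope" in e.attrib and "itemprop" not in e.attrib]


def properties(item):
    """Property names of an item, without descending into nested items."""
    names = []

    def walk(elem):
        for child in elem:
            if "itemprop" in child.attrib:
                names.append(child.get("itemprop"))
            if "itemscope" not in child.attrib:
                walk(child)

    walk(item)
    return names


first_items = top_level_items(FIRST)
second_items = top_level_items(SECOND)
print(len(first_items), properties(first_items[0]))
print(len(second_items), [properties(i) for i in second_items])
```

Under these assumptions the first variant yields a single item carrying three title properties, while the second yields three items with one title each.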
Note that you have an error on your current page. Instead of the example contained in your question, on your page you use itemprop="title" on the a element. But then the href value is the title, not the anchor text.
So instead of
<td>
<a itemprop="title" href="…" title="…">…</a>
</td>
<!-- the value of 'href' is the JobPosting title -->
you should use
<td itemprop="title">
<a class="list1" href="…" title="…">…</a>
</td>
<!-- the value of 'a' is the JobPosting title -->
And why not use the url property here?
<td itemprop="title">
<a itemprop="url" href="…" title="…">…</a>
</td>
The second one. The first one describes the table itself as a JobPosting, which it isn't.
I am new to XPath. I tried to write an XPath expression that returns all the valid elements in the table row below (marked as 'Required').
XPath used:
.//*[@id='reportListItemID0']/td[not(@width) and child::node()] | .//*[@id='reportListItemID0']/td[not(@width) and not(child::node())]
<tr style="" class="deatilRow" id="reportListItemID0" nowrap="">
<td width="10px"> </td>
<td>04/04/2013</td><!----------------------------Required-->
<td width="3px" colspan="10"> </td>
<td>Morkel, Rashid</td><!------------------------Required-->
<td width="3px" colspan="10"> </td>
<td>100668041</td><!-----------------------------Required-->
<td width="3px" colspan="10"> </td>
<td></td><!--------------------------------------Required-->
<td width="3px" colspan="10"> </td>
<td>
XA0404181004596<!--Required-->
</td>
<td width="3px" colspan="10"> </td>
<td>$31.00</td><!--------------------------------Required-->
<td width="3px" colspan="10"> </td>
<td>
<span class="workedYesNoColor">N</span><!----Required-->
</td>
<td width="3px" colspan="10"> </td>
</tr>
The XPath used returns the elements listed below:
<td>04/04/2013</td>
<td>Morkel, Rashid</td>
<td>100668041</td>
<td></td>
<td> XA0404181004596 </td>
<td>$31.00</td>
<td> <span class="workedYesNoColor">N</span> </td>
But the expected result is as below (only the leaf nodes are required, not the 'td'):
<td>04/04/2013</td>
<td>Morkel, Rashid</td>
<td>100668041</td>
<td></td>
XA0404181004596
<td>$31.00</td>
<span class="workedYesNoColor">N</span>
Important Note: The position of the 'td' with tags 'a' and 'span' can vary. They are not present at the 5th and 6th position consistently.
Please let me know what I am missing.
Consider the HTML code below:
<table id="supplier_list_data" cellspacing="1" cellpadding="0" class="data">
<tr class="rowLight">
<td class="extraWidthSmall">
Cdata
</td>
<td class="extraWidthSmall">
xyz
</td>
<td class="extraWidthSmall">
ppm
</td>
</tr>
</table>
Now, using XPath, how do I get the value xyz (that is, always the second <td>)? Please give me an idea!
Try //tr/td[2]/data().
//tr/td selects all <td/> elements, [2] the second result inside each <tr/> and data() returns their contents.
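The positional step can be tried in Python as well. This sketch assumes the standard-library xml.etree.ElementTree module, which supports the `td[2]` position predicate but not the XPath 2.0 `data()` function, so the element text is read with `.text` instead.

```python
import xml.etree.ElementTree as ET

# The table from the question, unchanged apart from whitespace.
TABLE = ET.fromstring("""
<table id="supplier_list_data" cellspacing="1" cellpadding="0" class="data">
  <tr class="rowLight">
    <td class="extraWidthSmall">Cdata</td>
    <td class="extraWidthSmall">xyz</td>
    <td class="extraWidthSmall">ppm</td>
  </tr>
</table>
""")

# td[2] picks the second td inside each tr; .text stands in for data().
second_cells = [td.text.strip() for td in TABLE.findall(".//tr/td[2]")]
print(second_cells)
```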