XPath: how to return empty values

I have an XPath like the following:
"//<path to some table>/*/td[1]/text()"
It returns the text values of all non-empty td elements, for example:
<text1>, <text2>, <text3>
The problem is that between the nodes that contain those values there may be some empty td elements.
What I want is a result that contains some marker showing where those empty values are, for example:
<text1>,<>, <>, <text2>, <text3>, <>
or
<text1>,<null>, <null>, <text2>, <text3>, <null>
I tried the following:
"//<path to some table>/*/string(td[1]/text())"
but it returns undefined.
Of course, I could just fetch the whole node and then process it in my code (stripping all the unnecessary information), but maybe there is a better way?
HTML example for this case:
<html>
<body>
<table class="tablesorter">
<tbody>
<tr class="tr_class">
<td>text1</td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td></td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td>text2</td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td>text3</td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td></td>
<td>{some text}</td>
</tr>
</tbody>
</table>
</body>
</html>

Simply select the td elements rather than their text() child nodes. With the path changed to //<path to some table>/*/td[1] (or maybe //<path to some table>/*/td) you get a node-set of td elements, whether they are empty or not. You can then access the string content of each node, either with XPath (evaluate string(.) for each element node) or with a host environment method, e.g. textContent in the W3C DOM or text in the MSXML DOM. That way the empty strings are included.
If you use XPath 2.0 or XQuery, you can directly select //<path to some table>/*/td/string(.) to get a sequence of string values. That approach, with a function call in the last step, is not supported in XPath 1.0; there you select the td element nodes and then access the string value of each in a separate step.
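A minimal sketch of the "select the elements, then take each string value" approach, assuming Python and the standard library's xml.etree.ElementTree (which supports only a subset of XPath 1.0). The markup is modelled on the question's table, and the names HTML and first_cell_values are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Sample markup modelled on the table in the question.
HTML = """
<table class="tablesorter">
  <tbody>
    <tr class="tr_class"><td>text1</td><td>other</td></tr>
    <tr class="tr_class"><td></td><td>other</td></tr>
    <tr class="tr_class"><td>text2</td><td>other</td></tr>
    <tr class="tr_class"><td>text3</td><td>other</td></tr>
    <tr class="tr_class"><td></td><td>other</td></tr>
  </tbody>
</table>
"""

def first_cell_values(markup):
    """Return the string value of the first td of every row, empty cells included."""
    root = ET.fromstring(markup)
    # Select the td elements themselves, not their text() children.
    cells = root.findall("./tbody/tr/td[1]")
    # Joining itertext() plays the role of XPath's string(.) for each element.
    return ["".join(td.itertext()) for td in cells]

values = first_cell_values(HTML)
print(values)
```

Because every td is returned as an element, an empty cell contributes an empty string instead of silently dropping out of the result.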

Do you mean you want only the td[1] elements that contain text, dropping the ones without text? If so, you can use this XPath:
//td[1][string-length(text()) > 0]

Related

correct way to scrape this table (using scrapy / xpath)

Given a table (with an unknown number of <tr> rows but always three <td> cells per row, where the first cell sometimes contains a strikethrough (<s>) that should be captured as an additional item with value 0 or 1):
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Where scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...], I am currently trying something along these lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
In principle it works (e.g. by adding a pipeline after add_xpath()), but I am sure there is a more natural and elegant way to do this.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()
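One possible way to build the [[A1,A2,A3,0],[B1,B2,B3,1]] structure is sketched below. It uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs on its own; the names TABLE and scrape_rows are hypothetical, and the sample contains only the two rows shown in the question. With Scrapy selectors the same idea would presumably be expressed with row.xpath("./td") and .xpath("string(.)").get(), which is not exercised here.

```python
import xml.etree.ElementTree as ET

# The two rows shown in the question; the real table has more rows.
TABLE = """
<table id="my_id">
  <tr><td>A1</td><td>A2</td><td>A3</td></tr>
  <tr><td><s>B1</s></td><td>B2</td><td>B3</td></tr>
</table>
"""

def scrape_rows(markup):
    """Return one list per row: the three cell texts plus a 0/1 strikethrough flag."""
    root = ET.fromstring(markup)
    records = []
    for tr in root.findall("./tr"):
        tds = tr.findall("td")
        # The string value of each cell, so text nested inside <s> is kept.
        texts = ["".join(td.itertext()).strip() for td in tds]
        # Flag is 1 when the first cell has an <s> child, otherwise 0.
        struck = 1 if tds[0].find("s") is not None else 0
        records.append(texts + [struck])
    return records

rows = scrape_rows(TABLE)
print(rows)
```

Iterating over the rows once and deriving all four values from each row avoids re-running the same td query for every column.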

XPath extract value within attribute

This is my HTML code so far:
<tr valign="top">
<td nowrap="x">Citation(s)</td>
<td>
<span class="pubmed_id" id="26472973">
26472973
</span>
</td>
</tr>
I would like to extract the number 26472973, which is a value that changes for each entry in the database.
It is unclear whether you want the value of the @id attribute or the value of the following a element.
For the attribute value, try this XPath:
//tr[@valign='top']/td/span[@class='pubmed_id']/@id
Or, for the a element's value, use this XPath:
//tr[@valign='top']/td/span[@class='pubmed_id']/a/text()
In both cases the result is 26472973.
In case you just want the 'citations', here is another try:
//tr/td[text()='Citation(s)']/following-sibling::td/span/@id
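As a rough illustration of the attribute lookup, here is a sketch using Python's standard xml.etree.ElementTree on the markup exactly as posted in the question. ElementTree's XPath subset cannot select attribute nodes directly, so the span element is located first and its id is read with get(); the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Markup as posted in the question.
ROW = """
<tr valign="top">
  <td nowrap="x">Citation(s)</td>
  <td>
    <span class="pubmed_id" id="26472973">
      26472973
    </span>
  </td>
</tr>
"""

root = ET.fromstring(ROW)
# Locate the span by its class, mirroring span[@class='pubmed_id'].
span = root.find("./td/span[@class='pubmed_id']")
# Read the attribute value (the equivalent of .../@id).
pubmed_id = span.get("id")
# The same number is also available as the span's own text, minus whitespace.
span_text = span.text.strip()
print(pubmed_id, span_text)
```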

XPath: Getting a node by attribute value of subnode

People, could you please help me with this XPath? Let's say I have the following HTML code:
<table>
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
<tr>
<td>value2</td>
<td>text</td>
</tr>
</table>
I need to build an XPath that will pick the <tr> that has a <td> with the value text AND the attribute class equal to clickable.
I tried the following XPath expressions:
//tr[contains(.,'text')][contains(./td/@class,'clickable')]
//tr[contains(.,'text')][contains(td/@class,'clickable')]
but neither of them worked.
Any help is appreciated.
Thanks
You are almost there:
//tr[contains(td/@class,'clickable') and contains(td, 'text')]
Demo using xmllint:
$ xmllint input.xml --xpath "//tr[contains(td/@class,'clickable') and contains(td, 'text')]"
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
If you are looking for a tr with one td having the value text and a td (possibly a different one) with the attribute class equal to clickable, use the answer from @alecxe.
If it is a single td that must meet both conditions, then:
//tr[td[.='text' and @class='clickable']]
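The "single td meeting both conditions" case can be checked with a short sketch. It assumes Python's standard xml.etree.ElementTree and applies the two conditions in Python rather than in one XPath predicate, since ElementTree implements only part of XPath 1.0; the function name rows_with_clickable_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup from the question.
TABLE = """
<table>
  <tr><td class="clickable">text</td><td>value1</td></tr>
  <tr><td>value2</td><td>text</td></tr>
</table>
"""

def rows_with_clickable_text(markup, text="text", css_class="clickable"):
    """Return rows where one and the same td has both the given class and the given text."""
    root = ET.fromstring(markup)
    return [
        tr
        for tr in root.findall("./tr")
        if any(
            td.get("class") == css_class and "".join(td.itertext()) == text
            for td in tr.findall("td")
        )
    ]

matches = rows_with_clickable_text(TABLE)
# Show the cell texts of every matching row.
matched_cells = [[td.text for td in tr.findall("td")] for tr in matches]
print(matched_cells)
```

The second row also contains the text "text", but not in a cell with the clickable class, so only the first row is returned.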

XPath: returning the index of specific tag inside a set of tags with the same type

Here is an excerpt of my xml:
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
I know how to find specific <tr> tag.
Is it possible to determine the index (ordinal number) of a <tr> tag inside the <tbody> tag? I guess it is possible to loop through the table, but the table is quite large and that would take a lot of time.
Is it possible to get this index/ordinal number with a single XPath statement?
I've used the following XPath expression:
//tbody//td[text()='findMe']/../following-sibling::tr
This expression selects the 'tr' nodes located below the row that contains the 'findMe' text. That is useful, because the number of those 'tr' nodes can be obtained from it.
However, a check should be made before using this XPath, because if the 'findMe' string is absent, it would return 0. The following expression works fine as that validation:
//tbody//td[text()='findMe']
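In plain XPath 1.0 the ordinal position is usually obtained by counting preceding siblings, e.g. count(//tbody/tr[td='findMe']/preceding-sibling::tr) + 1. Note that this evaluates to 1 when no row matches, so the existence check described above is still needed. The sketch below shows the same idea in Python with the standard xml.etree.ElementTree, which cannot evaluate count() itself; the cell contents and the name row_index are invented for illustration, since the cells in the question are empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell contents; the question's sample cells are empty.
TABLE = """
<table>
  <tbody>
    <tr><td>a</td><td>b</td><td>c</td></tr>
    <tr><td>d</td><td>findMe</td><td>f</td></tr>
    <tr><td>g</td><td>h</td><td>i</td></tr>
  </tbody>
</table>
"""

def row_index(markup, needle):
    """Return the 1-based position of the first row with a cell equal to needle, or None."""
    root = ET.fromstring(markup)
    for position, tr in enumerate(root.findall("./tbody/tr"), start=1):
        if any(td.text == needle for td in tr.findall("td")):
            return position
    # Explicit "not found" result instead of a misleading number.
    return None

index = row_index(TABLE, "findMe")
missing = row_index(TABLE, "absent")
print(index, missing)
```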

xpath expression to find url and data

I want to get the values of every table row and the href value of every link within the table given below.
Being new to XPath, I am finding it difficult to write the XPath expression.
Understanding what an existing XPath expression does is somewhat easier, however.
The expected output:
http://a.com/ data for a 526735 Z
http://b.com/ data for b 522273 Z
http://c.com/ data for c 513335 Z
<table class = dataTabe>
<tbody>
<tr>
<td>data for a</td>
<td class="numericalColumn">526735</td>
<td class="numericalColumn">Z</td></tr>
<tr>
<td>data for b</td>
<td class="numericalColumn">522273</td>
<td class="numericalColumn">B</td></tr>
<tr>
<td>data for c</td>
<td class="numericalColumn">513335</td>
<td class="numericalColumn">B</td></tr>
</tbody>
</table>
You'll need two things: an XPath query that locates the wanted nodes and a second one that outputs the text the way you want it. Since you don't say which language you're using, I'm putting together some pseudocode:
foreach node in document.select("//table[@class='dataTable']//tr[td/a/@HREF]")
    write node.select("concat(td/a/@HREF,' ',.)")
This site has a great free tool for building XPath Expressions (XPath Builder):
http://www.bubasoft.net/
Use this XPath: //tr/td/a/@HREF | //tr//text()
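A runnable variant of the "locate the rows, then format each one" idea might look like the sketch below, written in Python with the standard xml.etree.ElementTree. The posted sample contains no a elements, so the placement of the links inside the first cell, the lowercase href attribute, and the names TABLE and link_rows are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed structure: each row's first cell wraps its text in a link.
TABLE = """
<table class="dataTable">
  <tbody>
    <tr>
      <td><a href="http://a.com/">data for a</a></td>
      <td class="numericalColumn">526735</td>
      <td class="numericalColumn">Z</td>
    </tr>
    <tr>
      <td><a href="http://b.com/">data for b</a></td>
      <td class="numericalColumn">522273</td>
      <td class="numericalColumn">B</td>
    </tr>
  </tbody>
</table>
"""

def link_rows(markup):
    """Return one 'href cell1 cell2 cell3' line for every row that contains a link."""
    root = ET.fromstring(markup)
    lines = []
    for tr in root.findall("./tbody/tr"):
        link = tr.find("./td/a")
        if link is None:
            # Skip rows without a link, like the tr[td/a/@HREF] filter does.
            continue
        # String value of each cell, joined with single spaces.
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        lines.append(link.get("href") + " " + " ".join(cells))
    return lines

lines = link_rows(TABLE)
print("\n".join(lines))
```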

Resources