Correct way to scrape this table (using Scrapy / XPath)

Given a table with an unknown number of <tr> rows but always three <td> cells per row, where the first cell sometimes contains a strikethrough (<s>) that should be captured as an additional item (with value 0 or 1):
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...]. I am currently trying something along these lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
In principle this works (e.g. by adding a pipeline after add_xpath()), but I am sure there is a more natural and elegant way to do it.

Try contains():
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()
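As a rough illustration of the target output, here is a minimal sketch that builds the [[A1,A2,A3,0],[B1,B2,B3,1]] structure from the question's table. It is an assumption-laden stand-in, not Scrapy code: it uses Python's standard-library xml.etree.ElementTree (which supports only a subset of XPath) so that it runs without extra packages, and the function name scrape_rows is made up for this example. With Scrapy, the same per-row logic would presumably go through response.xpath instead.

```python
import xml.etree.ElementTree as ET

# Sample markup modelled on the question's table (the "..." rows are omitted).
HTML = """
<html><body>
<table id="my_id">
<tr><td>A1</td><td>A2</td><td>A3</td></tr>
<tr><td><s>B1</s></td><td>B2</td><td>B3</td></tr>
</table>
</body></html>
"""

def scrape_rows(markup):
    """Return one [cell1, cell2, cell3, struck] list per table row (hypothetical helper)."""
    root = ET.fromstring(markup.strip())
    table = root.find(".//table[@id='my_id']")
    records = []
    for row in table.findall("tr"):
        cells = row.findall("td")
        # itertext() collects text from nested tags, so <s>B1</s> still gives "B1".
        values = ["".join(cell.itertext()).strip() for cell in cells]
        # The flag is 1 when the first cell contains an <s> element, else 0.
        struck = 1 if cells[0].find(".//s") is not None else 0
        records.append(values + [struck])
    return records

rows = scrape_rows(HTML)
print(rows)
```

Keeping the strikethrough check next to the cell extraction means each row is visited once and no separate pipeline step is needed for the flag.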

Related

Getting a substring of an attribute is not working

This is the HTML code:
<table id="laptop_detail" class="table">
<tbody>
<tr>
<td style="padding-left:36px" class="ha"> Camera Pixels </td>
<td class="val">8 megapixel camera</td>
</tr>
</tbody>
</table>
How do I get only the first character, which is "8", in Chrome? My approach so far is:
$x('//*[@id="laptop_detail"]//tr/td[contains(. ,"Camera")]/following-sibling::td[1]/text()[substring(. , 0, 2)]')
Don't put the function whose output you need into a predicate. Instead, apply it to the node:
substring(//*[@id="laptop_detail"]//tr/td[contains(., "Camera")]/following-sibling::td[1], 1, 1)
Note that in XPath, characters in a string are numbered from 1, not 0.
Also, you don't need to specify text(); substring knows it should operate on strings.
BTW, do you really want to get 1 if the number of megapixels is 10? Maybe
substring-before(..., ' ')
would work better?
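The difference between taking the first character and taking everything before the first space can be sketched in Python. This is only an approximation of the XPath answer under stated assumptions: it uses the standard-library xml.etree.ElementTree, which has no XPath string functions, so the node is selected first and the string operation is applied afterwards. The helper name camera_value is hypothetical.

```python
import xml.etree.ElementTree as ET

# The question's markup, unchanged.
HTML = """
<table id="laptop_detail" class="table">
<tbody>
<tr>
<td style="padding-left:36px" class="ha"> Camera Pixels </td>
<td class="val">8 megapixel camera</td>
</tr>
</tbody>
</table>
"""

def camera_value(markup):
    """Return the text of the cell that follows the 'Camera' label cell (hypothetical helper)."""
    root = ET.fromstring(markup.strip())
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and "Camera" in "".join(cells[0].itertext()):
            return "".join(cells[1].itertext())
    return None

value = camera_value(HTML)
first_char = value[:1]              # roughly substring(..., 1, 1)
first_word = value.partition(" ")[0]  # roughly substring-before(..., ' ')
print(first_char, first_word)
```

For a value such as "10 megapixel camera" the first variant would give "1" while the second would give "10", which is the concern raised in the answer.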

XPath to get siblings between two elements

With the following markup I need to get the middle tr's
<tr class="H03">
<td>Artist</td>
...
<tr class="row_alternate">
<td>LIMP</td>
<td>Orion</td>
...
</tr>
<tr class="row_normal">
<td>SND</td>
<td>Tender Love</td>
...
</tr>
<tr class="report_total">
<td> </td>
<td> </td>
...
</tr>
That is, every sibling tr between <tr class="H03"> and <tr class="report_total">. I'm scraping using mechanize and nokogiri, so I am limited to their XPath support. My best attempt after looking at various StackOverflow questions is
page.search('/*/tr[@class="H03"]/following-sibling::tr[count(. | /*/tr[@class="report_total"]/preceding-sibling::tr)=count(/*/tr[@class="report_total"]/preceding-sibling::tr)]')
which returns an empty array, and is so ridiculously complicated that my limited XPath fu is completely overwhelmed!
You can try the following XPath:
//tr[@class='H03']/following-sibling::tr[following-sibling::tr[@class='report_total']]
The above XPath selects all <tr> elements following tr[@class='H03'] that themselves have a following sibling tr[@class='report_total']. In other words, the selected <tr> elements are located before tr[@class='report_total'].
Mechanize has a few helper methods that are useful here.
Presuming you are doing something like the following:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.website.com')
start_tr = page.at('.H03')
At this point, start_tr will be a Nokogiri XML element for the first tr you list in your question.
You can then step to the next sibling element with:
# next_element skips whitespace text nodes, which next_sibling would return
next_tr = start_tr.next_element
Do this until you hit the tr at which you want to stop:
trs = []
# compare the value of the class attribute, not the attribute's name
until next_tr.nil? || next_tr['class'] == 'report_total'
  trs << next_tr
  next_tr = next_tr.next_element
end
If you want the range to be inclusive of the start and stop trs (H03 and report_total), just tweak the code above to include them in the trs array.
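The sibling walk described above can also be sketched in Python, for readers who want to check the logic outside Ruby. This is a minimal sketch under assumptions: it uses the standard-library xml.etree.ElementTree, the wrapping <table> element is added only so the fragment parses, the elided "..." cells are left out, and rows_between is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Rows from the question, wrapped in an assumed <table> parent.
HTML = """
<table>
<tr class="H03"><td>Artist</td></tr>
<tr class="row_alternate"><td>LIMP</td><td>Orion</td></tr>
<tr class="row_normal"><td>SND</td><td>Tender Love</td></tr>
<tr class="report_total"><td> </td><td> </td></tr>
</table>
"""

def rows_between(markup, start_class="H03", stop_class="report_total"):
    """Collect the tr siblings strictly between the start and stop rows (hypothetical helper)."""
    root = ET.fromstring(markup.strip())
    collected = []
    inside = False
    for row in root.findall("tr"):  # direct children only, i.e. siblings
        css_class = row.get("class")
        if css_class == start_class:
            inside = True           # start collecting after this row
            continue
        if css_class == stop_class:
            break                   # stop before the closing row
        if inside:
            collected.append(row)
    return collected

middle = rows_between(HTML)
print([row.get("class") for row in middle])
```

To make the range inclusive, the start and stop rows would be appended before the continue and break statements respectively.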

XPath: returning the index of specific tag inside a set of tags with the same type

Here is an excerpt of my xml:
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
I know how to find specific <tr> tag.
Is it possible to determine the index (ordinal number) of a <tr> tag inside the <tbody> tag? I guess it's possible to loop through the table, but the table is quite large and that would take a lot of time.
Is it possible to get this index/ordinal number with a single XPath statement?
I've used the following XPath expression:
//tbody//td[text()='findMe']/../following-sibling::tr
This expression selects the 'tr' nodes located after the row that contains the 'findMe' text. It is useful because counting the result gives the number of 'tr' nodes that follow that row.
However, a check should be made before using this XPath, because if the 'findMe' string is absent the XPath also returns 0 nodes. The following expression works fine as that validation:
//tbody//td[text()='findMe']
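In full XPath 1.0 the ordinal number of a row is commonly written as count(.../preceding-sibling::tr) + 1. The sketch below shows the same bookkeeping in Python, including the "is the text present at all" check. It rests on assumptions: the cell values in the sample table are invented (the question's cells are empty), the standard-library xml.etree.ElementTree is used, and row_position is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample table; the cell contents are made up for illustration.
HTML = """
<table>
<tbody>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>findMe</td><td>f</td></tr>
<tr><td>g</td><td>h</td><td>i</td></tr>
</tbody>
</table>
"""

def row_position(markup, needle):
    """Return (1-based row index, number of rows after it), or None if needle is absent."""
    root = ET.fromstring(markup.strip())
    rows = root.findall("./tbody/tr")
    for position, row in enumerate(rows, start=1):
        if any(cell.text == needle for cell in row.findall("td")):
            return position, len(rows) - position
    return None  # plays the role of the validation expression

print(row_position(HTML, "findMe"))
print(row_position(HTML, "absent"))
```

Returning None for a missing value keeps "not found" distinct from "found in the last row", which a bare count of following rows cannot do.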

Xpath: howto return empty values

I have an XPath like the following:
"//<path to some table>/*/td[1]/text()"
and it returns the text values of all non-empty tds, for example:
<text1>, <text2>, <text3>
The problem is that between the nodes that contain these values there can be some empty td elements.
What I want is a result that contains some marker for those empty values, for example:
<text1>,<>, <>, <text2>, <text3>, <>
or
<text1>,<null>, <null>, <text2>, <text3>, <null>
I tried the following:
"//<path to some table>/*/string(td[1]/text())"
but it returns undefined.
Of course, I could just get the whole node and then work with it in my code (cutting out all the unnecessary info), but maybe there is a better way?
HTML example for this case:
<html>
<body>
<table class="tablesorter">
<tbody>
<tr class="tr_class">
<td>text1</td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td></td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td>text2</td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td>text3</td>
<td>{some text}</td>
</tr>
<tr class="tr_class">
<td></td>
<td>{some text}</td>
</tr>
</tbody>
</table>
</body>
</html>
Simply select the td elements, not their text() child nodes. With the path changed to //<path to some table>/*/td[1] (or maybe //<path to some table>/*/td) you will get a node-set of td elements, whether they are empty or not. You can then access the string contents of each node, either with XPath (select string(.) for each element node) or with a host environment method, e.g. textContent in the W3C DOM or text in the MSXML DOM. That way the empty strings will be included.
If you use XPath 2.0 or XQuery, you can directly select //<path to some table>/*/td/string(.) to get a sequence of string values. That approach, with a function call in the last step, is not supported in XPath 1.0; there you select the td element nodes and then access the string value of each in a separate step.
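The "select the elements, then read each string value" approach can be sketched as follows. This is a minimal example under assumptions: it uses Python's standard-library xml.etree.ElementTree as the host environment (the question does not say which one is in use), the markup is the question's table condensed onto fewer lines, and first_cells is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The question's table, condensed.
HTML = """
<table class="tablesorter">
<tbody>
<tr class="tr_class"><td>text1</td><td>{some text}</td></tr>
<tr class="tr_class"><td></td><td>{some text}</td></tr>
<tr class="tr_class"><td>text2</td><td>{some text}</td></tr>
<tr class="tr_class"><td>text3</td><td>{some text}</td></tr>
<tr class="tr_class"><td></td><td>{some text}</td></tr>
</tbody>
</table>
"""

def first_cells(markup):
    """Return the string value of the first td of every row, empty cells included."""
    root = ET.fromstring(markup.strip())
    # td[1] selects the first td child of each row whether or not it has text.
    cells = root.findall(".//tr/td[1]")
    # Joining itertext() yields "" for an empty element instead of dropping it.
    return ["".join(cell.itertext()) for cell in cells]

values = first_cells(HTML)
print(values)
```

Because the list has one entry per row, the positions of the empty strings line up with the rows they came from.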
Do you mean you want only the td[1] elements with text, leaving out the ones without text? If so, you can use this XPath:
//td[1][string-length(text()) > 0]

xpath expression to find url and data

I want to get the values of every row of the table given below, along with the href value for every link within the table.
Being new to XPath, I find it difficult to write XPath expressions, although understanding what a given expression does is somewhat easier.
The expected output:
http://a.com/ data for a 526735 Z
http://b.com/ data for b 522273 Z
http://c.com/ data for c 513335 Z
<table class = dataTabe>
<tbody>
<tr>
<td>data for a</td>
<td class="numericalColumn">526735</td>
<td class="numericalColumn">Z</td></tr>
<tr>
<td>data for b</td>
<td class="numericalColumn">522273</td>
<td class="numericalColumn">B</td></tr>
<tr>
<td>data for c</td>
<td class="numericalColumn">513335</td>
<td class="numericalColumn">B</td></tr>
</tbody>
</table>
You'll need two things: an XPath query that locates the wanted nodes and a second one that outputs the text as you want it. Since you don't give more information about the language you're using, here is some pseudocode:
foreach node in document.select("//table[@class='dataTable']//tr[td/a/@HREF]")
    write node.select("concat(td/a/@HREF,' ',.)")
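The two-step idea in the pseudocode (locate rows that contain a link, then build one output line per row) might look like this in Python. Several things here are assumptions: the question's HTML as shown contains no <a> elements, so the link markup below is hypothetical, as is the second row without a link; the attribute is written in lowercase as href; the standard-library xml.etree.ElementTree is used; and link_rows is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the <a href> wrapper is assumed, not taken from the question.
HTML = """
<table class="dataTable">
<tbody>
<tr><td><a href="http://a.com/">data for a</a></td>
<td class="numericalColumn">526735</td>
<td class="numericalColumn">Z</td></tr>
<tr><td>row without a link</td>
<td class="numericalColumn">0</td>
<td class="numericalColumn">Z</td></tr>
</tbody>
</table>
"""

def link_rows(markup):
    """Return 'href cell1 cell2 cell3' for every row that contains a link."""
    root = ET.fromstring(markup.strip())
    lines = []
    for row in root.iter("tr"):
        link = row.find("td/a[@href]")
        if link is None:
            continue  # step 1: keep only rows that have a link
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        # step 2: put the URL in front of the row's cell texts
        lines.append(" ".join([link.get("href")] + cells))
    return lines

output = link_rows(HTML)
print(output)
```

Joining the cell texts individually avoids the stray whitespace that taking the string value of the whole row would carry over from the markup.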
This site has a great free tool for building XPath Expressions (XPath Builder):
http://www.bubasoft.net/
Use this XPath: //tr/td/a/@HREF | //tr//text()

Resources