XPATH (Scrapy): select text between 2 certain keywords

I'm trying to extract text between 2 keywords 商品詳細 and 支払詳細 in this HTML
<TR>
<TD BGCOLOR=#336600><BR></TD>
<TD COLSPAN=3 BGCOLOR=#FFFFCC><FONT COLOR=#336600 SIZE=4><B>　商品詳細 </B></FONT></TD>
</TR>
<TR>
<TD COLSPAN=4 HEIGHT=10>
<LI STYLE=><SPAN STYLE=>鍵付きで盗難を防止できます。</SPAN>
<LI STYLE=><SPAN STYLE=>商品サイズ：約28*36*12cm</SPAN>
<LI STYLE=><SPAN STYLE=>素材：鉄製</SPAN>
<LI STYLE=><SPAN STYLE=>※柄は、ランダムにて発送なります</SPAN>
<LI STYLE=><SPAN STYLE=></SPAN>
<LI STYLE=>
<SPAN STYLE=></SPAN>
</TD>
</TR>
<TR>
<TD><BR></TD>
<TD COLSPAN=2 ALIGN=left><BR></TD>
<TD><BR></TD>
</TR>
<TR>
<TD COLSPAN=4 HEIGHT=25><BR></TD>
</TR>
<TR>
<TD BGCOLOR=#336600><BR></TD>
<TD COLSPAN=3 BGCOLOR=#FFFFCC>
<FONT COLOR=#336600 SIZE=4><B>　支払詳細 </B></FONT>
</TD>
</TR>
I tried the solutions from these 2 questions, but they didn't work for me:
Scrapy xpath between 2 keywords
Xpath text extraction between 2 keywords
This is the result I get when I run it in the Scrapy shell:
In [21]: response.xpath("//text()[preceding-sibling::*[text()='商品詳細'] and following-sibling::*[text()='支払詳細']]").extract()
Out[21]: []

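With XPath you can navigate the document in any direction, so the usual approach is to find a key node you know something about and navigate from it to the related nodes. Your expression returns nothing because the two keyword elements are not siblings of the text you want: they sit in other table rows, and their text is padded with whitespace, so text()='商品詳細' does not match exactly.
//td[contains(.//text(),'商品詳')]          # find the td that contains the keyword
/../following-sibling::tr//li/span/text()  # then take the li/span text in the rows that follow its parent tr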
I've tried this in a shell:
>[1]: sel.xpath("//td[contains(.//text(),'商品詳')]/../following-sibling::tr//li/span/text()").extract()
<[1]: ['鍵付きで盗難を防止できます。', '商品サイズ：約28*36*12cm', '素材：鉄製', '※柄は、ランダムにて発送なります']
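Note that following-sibling::tr matches every later row, so on a page with more list items after 支払詳細 you would also want to stop at the second keyword. Here is a minimal sketch of that "start at one key row, stop at the next" idea using only Python's standard library. It assumes a simplified, well-formed stand-in for the table (the real page is tag soup and needs an HTML parser such as Scrapy's), and the last row with 銀行振込 is made-up data added only to show that the scan stops.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the question's table (an assumption).
# The final row is hypothetical data that must NOT appear in the result.
DOC = """
<table>
  <tr><td><b>商品詳細</b></td></tr>
  <tr><td><li><span>鍵付きで盗難を防止できます。</span></li>
          <li><span>素材：鉄製</span></li></td></tr>
  <tr><td><b>支払詳細</b></td></tr>
  <tr><td><li><span>銀行振込</span></li></td></tr>
</table>
"""

def text_between(table, start, end):
    """Collect li/span text from rows after the `start` row and before the `end` row."""
    collected, inside = [], False
    for row in table.findall("tr"):
        row_text = "".join(row.itertext())
        if end in row_text:
            break                      # reached the second keyword: stop
        if inside:
            collected.extend(s.text for s in row.findall(".//li/span") if s.text)
        elif start in row_text:
            inside = True              # found the first keyword: start collecting
    return collected

table = ET.fromstring(DOC)
result = text_between(table, "商品詳細", "支払詳細")
print(result)
```

The helper name text_between is only illustrative. In Scrapy itself the same bound could be expressed inside the XPath, but that variant is not shown here.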

Related

Unable to select column with its header through XPath

My HTML
<table id="flex1" cellspacing="0" cellpadding="0" border="0">
<thead>
<tr class="hDiv">
<th width="6%">
<div class="text-left field-sorting asc" rel="IFSC_CODE"> IFSC CODE </div>
</th>
<th width="6%">
<div class="text-left field-sorting " rel="BRANCH_NAME"> BRANCH NAME </div>
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="sorted" width="6%">
<div class="text-left">SACS011151</div>
</td>
<td width="6%">
<div class="text-left">check</div>
</td>
</tr>
<tr class="erow">
<td class="sorted" width="6%">
<div class="text-left">SACS011152</div>
</td>
<td width="6%">
<div class="text-left">Motiram</div>
</td>
</tr>
<tr class="erow">
<td class="sorted" width="6%">
<div class="text-left">SACS011158</div>
</td>
<td width="6%">
<div class="text-left">TESTNAME</div>
</td>
</tr>
</tbody>
</table>
My XPath
//table/tbody/tr/td[count(//table/thead/tr/th[.='BRANCH NAME']/preceding-sibling::th)+4]
The XPath above selects every cell in the column but not its header, 'BRANCH NAME'. I want to select the header together with all the cells in its column. Any idea how to do this?

You can use the XPath union operator (|) to combine the two queries, for example*:
//table/tbody/tr/td[count(//table/thead/tr/th[.='BRANCH NAME']/preceding-sibling::th)+4]
|
//table/thead/tr/th[.='BRANCH NAME']
*: formatted into multiple lines just to make it visible without horizontal scroll
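To see what "header plus its column" means in practice, here is a rough sketch in Python. It assumes a trimmed, well-formed copy of the table, and it uses the standard library's xml.etree.ElementTree, which implements only a subset of XPath (no | and no count()). The union is therefore emulated by running two queries and joining the results; column_with_header is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's table (an assumption: well-formed, two rows).
DOC = """
<table>
  <thead><tr><th><div> IFSC CODE </div></th><th><div> BRANCH NAME </div></th></tr></thead>
  <tbody>
    <tr><td><div>SACS011151</div></td><td><div>check</div></td></tr>
    <tr><td><div>SACS011152</div></td><td><div>Motiram</div></td></tr>
  </tbody>
</table>
"""

def column_with_header(table, header):
    """Return the matching th followed by the td at the same position in each row."""
    headers = table.findall("thead/tr/th")
    # Position of the header among its siblings (the count(preceding-sibling) part).
    index = next(i for i, th in enumerate(headers)
                 if "".join(th.itertext()).strip() == header)
    cells = [row.findall("td")[index] for row in table.findall("tbody/tr")]
    # Joining the two result lists plays the role of the XPath union operator.
    return [headers[index]] + cells

table = ET.fromstring(DOC)
column = ["".join(c.itertext()).strip()
          for c in column_with_header(table, "BRANCH NAME")]
print(column)
```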

Optimal XPath Query for processing the sample HTML fragment

I have a feed that outputs HTML. The following segment is part of the output
<div class="leftnav">
<table border="0" cols="2">
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat1 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle1</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink1
</td>
</tr>
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat2 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle2</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink2
</td>
</tr>
</table>
</div>
I want to process above segment using XPATH so that output looks like this
Article Cat1
ArticleTitle1
ArticleLink1
Article Cat2
ArticleTitle2
ArticleLink2
What is the optimal XPath that will produce the desired output? I tried //div[@class="leftnav"]/table/tr, but this gives all the TR elements. I want to skip the first TR element so that I get the output in the format described above.

//div[@class="leftnav"]/table/tr[position() > 1]
Try the above.
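If your XPath engine does not support position(), the same "skip the first row" step can be done in the host language. Below is a small sketch using Python's standard library, assuming a shortened, well-formed copy of the fragment. ElementTree has no position() > 1 predicate, so a list slice stands in for it.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the feed fragment (an assumption).
DOC = """
<div class="leftnav">
  <table>
    <tr><td class="topline"><span> </span></td></tr>
    <tr><td><span class="bold">Article Cat1 </span></td></tr>
    <tr><td class="date">ArticleTitle1</td></tr>
    <tr><td width="20"></td><td class="date">ArticleLink1</td></tr>
  </table>
</div>
"""

root = ET.fromstring(DOC)            # root is the div.leftnav element
rows = root.findall("table/tr")[1:]  # [1:] drops the first tr, like position() > 1
# Join each row's text nodes and collapse runs of whitespace.
lines = [" ".join("".join(row.itertext()).split()) for row in rows]
print(lines)
```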

Stupid simple way:
substring-after(normalize-space(string(//*:div)), normalize-space(string(//*:div/*:table/*[1])))
Result: "Article Cat1 ArticleTitle1 ArticleLink1 nbsp Article Cat2 ArticleTitle2 ArticleLink2"
I don't know why, but (position() > 1) doesn't work in my environment, so I've used strings instead.

Parsing a table on a webpage without id's or classes - using Nokogiri or xpath

I wish to parse an epinions.com page to gather some statistics about a few companies. Epinions has almost no ids or classes, so the site is quite difficult to parse.
I need to loop through all <tr bgcolor="white"> objects. I have put in 2 samples of this.
From the sample 1, I need to extract:
The alt on this line:
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
The href on this line:
CHUMBO ROCKS!
The author at this line:
<span class="rgr">by whitey436, Jan 18, 2006
Here is sample 1:
<tr bgcolor="white">
<td style="padding:10px 5px" align="right" valign="top" height="100%">
<table cellspacing="4" cellpadding="0" border="0" width=100% height="100%">
<tr valign="top">
<td class="rkr" nowrap>Overall Rating:</td>
<td width=80>
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
</td>
</tr>
<span class="rgr">
<tr>
<td class="rgr" nowrap>Ease of Ordering:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
<tr>
<td class="rgr" nowrap>Customer Service:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
<tr>
<td class="rgr" nowrap>Selection:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
<tr>
<td class="rgr" nowrap>On-Time Delivery:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
</span>
<tr valign="bottom" height="100%">
<td class="rkb" colspan="2">
<div align="center"> </div>
<div align="center"> </div>
</td>
</tr>
</table>
</td>
<td style="padding:10px;" colspan=2 width="100%" align="left" valign="top">
<h2 style="font-family:arial,helvetica,sans-serif; font-size:87%; color:#000000; font-weight:bold; margin-bottom:0px;">
CHUMBO ROCKS!
</h2>
<span style="line-height:110%">
<span class="rgr">by whitey436, Jan 18, 2006
Rated a <span style="color:#000;">Very Helpful Review</span> by the Epinions community</span>
</span>
<span class="rkr">
<div style="padding:5px 0px"> Its just this simple, I tried buying this receiver from another online supplier who had the lowest price only to find they didnt have any of these units and they wanted to sell me extra warranty then tried to sell a different model in stock from Yamaha ...</div>
<b>
Read the full review
</b>
</span>
</td>
</tr>
From the sample 2, I need to extract:
The alt on this line:
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
The href on this line:
Read more
The author at this line:
<span class="rgr">by whitey436, Jan 18, 2006
Rated a <span style="color:#000;">Very Helpful Review</span> by the Epinions community</span>
Here is sample 2:
<tr bgcolor="white">
<td style="padding:10px 5px" align="right" valign="top">
<table cellspacing="4" cellpadding="0" border="0" width=100%>
<tr>
<td class="rkr" nowrap>Overall Rating:</td>
<td width=80>
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
</td>
</tr>
<tr>
<td class='rgr' > </td>
<td>
<img src='http://img.epinions.com/images/epi_images/spacer.gif' width=80 height=11>
</td>
</tr>
</table>
</td>
<td style="padding:10px;" colspan=2 width="100%" align="left" valign="top">
<span class="rgr">Mar 27, 2006 <br>(Not Yet Rated)</span><br>
<span class="rkr"> Very helpful in giving me the information I needed to make a purchase.<br><b>
Read more
</b></span>
</td>
</tr>

Here is some Nokogiri code to print out the information you want using XPath:
xml.xpath("//tr[@bgcolor='white']").each do |el|
  # Get the "Overall rating" tr block from the first td and get (first) img alt
  puts el.at_xpath("td[1]//tr[td/text()='Overall Rating:']//img/@alt")
  # Get the first link from the second td that contains "content" and get href
  puts el.at_xpath("td[2]//a[contains(@href, '/content')][1]/@href")
  # Get the (first) link that has an itemprop author value and get the href
  puts el.at_xpath("td[2]//a[@itemprop='author']/@href")
end

Nokogiri will work fine for this.
To get the alt, select the img tags with the specified src:
imgs = doc.css('img[src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif"]')
To get the href:
links = doc.css('a[href*="/content"]')
To get the author:
authors = doc.css('a[href*="/user"]')

Listing job offers (schema.org’s JobPosting)

I have a page with a list of job offers, and every job in the list is a link to the page for that offer.
I have a problem with Microdata. My question is: which variant is better?
First variant:
<table itemscope itemtype="http://schema.org/JobPosting">
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 1</td>
</tr>
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 2</td>
</tr>
<tr>
<td itemprop="title" itemtype="http://schema.org/JobPosting" itemscope>job 3</td>
</tr>
</table>
Second variant:
<table>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 1</a></td>
</tr>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 2</a></td>
</tr>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a href..>job 3</a></td>
</tr>
</table>

Your first variant means: There is a JobPosting which has three titles. Each of these titles consists of another JobPosting.
Your second variant means: There are three JobPostings, each one has a title.
So you want to go with your second variant.
Note that you have an error on your current page. Instead of the example contained in your question, on your page you use itemprop="title" on the a element. But then the href value is the title, not the anchor text.
So instead of
<td>
<a itemprop="title" href="…" title="…">…</a>
</td>
<!-- the value of 'href' is the JobPosting title -->
you should use
<td itemprop="title">
<a class="list1" href="…" title="…">…</a>
</td>
<!-- the value of 'a' is the JobPosting title -->
And why not use the url property here?
<td itemprop="title">
<a itemprop="url" href="…" title="…">…</a>
</td>
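To check how a consumer might read the second variant with the url property added, here is a minimal sketch of a microdata reader built on Python's standard html.parser. It is an illustration under assumptions, not a full microdata implementation: it handles only flat items without nested itemscope, and the href values /jobs/1 and /jobs/2 are hypothetical. It does follow the rule described above, that an itemprop on an a element takes its value from href.

```python
from html.parser import HTMLParser

# Second variant from the question, with hypothetical href values.
DOC = """
<table>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a itemprop="url" href="/jobs/1">job 1</a></td>
</tr>
<tr itemscope itemtype="http://schema.org/JobPosting">
<td itemprop="title"><a itemprop="url" href="/jobs/2">job 2</a></td>
</tr>
</table>
"""

class JobPostingParser(HTMLParser):
    """Tiny microdata reader: flat items only, no nested itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per itemscope element
        self._prop = None      # itemprop whose text is being collected
        self._prop_tag = None  # tag that opened that itemprop

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append({"type": attrs.get("itemtype"), "props": {}})
        prop = attrs.get("itemprop")
        if prop and self.items:
            if tag == "a":
                # For <a>, the property value is the href, not the link text.
                self.items[-1]["props"][prop] = attrs.get("href", "")
            else:
                self._prop, self._prop_tag = prop, tag
                self.items[-1]["props"][prop] = ""

    def handle_data(self, data):
        if self._prop:
            self.items[-1]["props"][self._prop] += data

    def handle_endtag(self, tag):
        if tag == self._prop_tag:
            self._prop = self._prop_tag = None

parser = JobPostingParser()
parser.feed(DOC)
for item in parser.items:
    print(item["type"], item["props"])
```

Each tr yields its own JobPosting with a title taken from the cell text and a url taken from the link, which matches the reading of the second variant given above.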

The second one. The first one is describing a table as JobPosting which isn't a JobPosting.

Finding table values in watij using xpath

I am using watij to automate my UI testing. I have many tables in a web page, and I need to find the table that has a width of 95%. It contains many rows. I have to find each row by its text, for example "running first UI test on local" as below, and get the td value "Complete". I am not able to get the value; I get the watij address instead. Let me know how I can find this.
<table width=95%>
<tr>
<th align="left">
<span id="lblHeaderComponent" style="font-size:10pt;font-weight:bold;">Component</span>
</th>
<th align="left">
<span id="lblHeaderServer" style="font-size:10pt;font-weight:bold;">Server</span>
</th>
<th align="left">
<span id="lblHeaderStatus" style="font-size:10pt;font-weight:bold;">
</span>
</th>
</tr>
<tr>
<td align="left"
nowrap="nowrap" style="font-size:12px;">running first UI test on local</td>
<td align="left" style="font-size:12px;">Google</td>
<td align="left" style="font-size:12px;">
<a style='color:#336600;'>Complete</a>
</td>
</tr>
<tr>
<td align="left"
style="border-top:1px solid #cfcfcf;border-bottom:1px solid #cfcfcf;"
colspan="3"
style="font-size:12px; color:#ff3300;">
</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" style="font-size:12px;">running second UI test on local</td>
<td align="left" style="font-size:12px;">Google</td>
<td align="left" style="font-size:12px;">
<a style='color:#336600;'>Complete</a>
</td>
</tr>
</table>

You can try an XPath visualizer to help you get the right expression. It lets you see the results visually.
Using XPath on HTML assumes the HTML is XHTML - in other words it must be well-formed XML.
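As a rough illustration of the kind of expression involved, here is a sketch that runs against a well-formed copy of the table. It uses Python's standard xml.etree.ElementTree rather than watij, so it only shows the shape of the path, not watij's API. The table is cut down from the question, the attribute value is quoted as width="95%" to make it valid XML, and status_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed copy of the page (an assumption).
DOC = """
<html><body>
<table width="50%"><tr><td>other table</td></tr></table>
<table width="95%">
  <tr><th>Component</th><th>Server</th><th></th></tr>
  <tr><td>running first UI test on local</td><td>Google</td><td><a>Complete</a></td></tr>
  <tr><td>running second UI test on local</td><td>Google</td><td><a>Complete</a></td></tr>
</table>
</body></html>
"""

def status_for(root, component):
    """Return the link text in the third cell of the row whose td equals `component`."""
    table = root.find(".//table[@width='95%']")  # pick the table by its width
    # Row with a td whose full text equals the component name, then its third td.
    link = table.find(f"tr[td='{component}']/td[3]/a")
    return link.text.strip() if link is not None else None

root = ET.fromstring(DOC)
print(status_for(root, "running first UI test on local"))
```

If the real page is not well-formed, as with the unquoted width=95% in the question, an XML parser like this one will reject it, which is the point made above.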
