Scrapy can't find XPath content - xpath

I'm writing a web crawler with Scrapy to download the text of talk-backs on a certain webpage.
Here is the relevant part of the code behind the webpage, for a specific talkback:
<div id="site_comment_71339" class="site_comment site_comment-even large high-rank">
<div class="talkback-topic">
<a class="show-comment" data-ajax-url="/comments/71339.js?counter=97&num=57" href="/comments/71339?counter=97&num=57">57. talk back title here </a>
</div>
<div class="talkback-message"> blah blah blah talk-back message here </div>
....etc etc etc ......
While writing an XPath to get the the message:
titles = hxs.xpath("//div[#class='site_comment site_comment-even large high-rank']")
and later on:
item["title"] = titles.xpath("div[#class='talkback-message']text()").extract()
There's no bug, but it doesn't work. Any ideas why? I suppose I'm not writing the path correctly, but I can't find the error.
Thank you :)
The whole code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem
class MySpider(BaseSpider):
name = "craig"
allowed_domains = ["tbk.co.il"]
start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]
def parse(self, response):
hxs = Selector(response)
titles = hxs.xpath("//div[#class='site_comment site_comment-even large high-rank']")
items=[]
for titles in titles:
item = CraigslistSampleItem()
item["title"] = titles.xpath("div[#class='talkback-message']text()").extract()
items.append(item)
return items

Here's a snippet of the HTML page for #site_comment_74240
<div class="site_comment site_comment-even small normal-rank" id="site_comment_74240">
<div class="talkback-topic">
144. מדיניות
</div>
<div class="talkback-username">
<table><tr>
<td>קייזרמן פרדי </td>
<td>(01.11.2013)</td>
</tr></table>
</div>
The "talkback-message" div is not in the HTML page when you first fetch it, but rather is fetched asynchronously via some AJAX query when you click on a comment title, so you'll have to fetch it for each comment.
Comment blocks, titles in you code snipper, can be grabbed using an XPath like this: //div[starts-with(#id, "site_comment_"]), i.e. all divs that have an "id" attribute beginning with string ""site_comment_"
You can also use CSS selectors with Selector.css(). In your case, you can grab comment blocks using either the "id" approach (as I've done above using XPath), so:
titles = sel.css("div[id^=site_comment_]")
or using the "site_comment" class without the other "site_comment-even", "site_comment-odd", "small", "normal-rank" or "high-rank" that vary:
titles = sel.css("div.site_comment")
Then you would issue a new Request using the URL that's in ./div[#class="talkback-topic"]/a[#class="show-comment"]/#data-ajax-url inside that comment div. Or using CSS selectors, div.talkback-topic > a.show-comment::attr(data-ajax-url) (by the way, the ::attr(...) is not standard, but is a Scrapy extension to CSS selectors using pseudo elements functions)
What you get from the AJAX call is some Javascript code, and you want to grab the content inside old.after(...)
var old = $("#site_comment_72765");
old.attr('id', old.attr('id') + '_small');
old.hide();
old.after("\n<div class=\"site_comment site_comment-odd large high-rank\" id=\"site_comment_72765\">\n <div class=\"talkback-topic\">\n <a href=\"/comments/72765?counter=42&num=109\" class=\"show-comment\" data-ajax-url=\"/comments/72765.js?counter=42&num=109\">109. ביבי - האדם הנכון בראש ממשלת ישראל(לת)<\/a>\n <\/div>\n \n <div class=\"talkback-message\">\n \n <\/div>\n \n <div class=\"talkback-username\">\n <table><tr>\n <td>ישראל <\/td>\n <td>(11.03.2012)<\/td>\n <\/tr><\/table>\n <\/div>\n <div class=\"rank-controllers\">\n <table><tr>\n \n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=up\"><img alt=\"\" src=\"/images/elements/thumbU.png?1376839523\" /><\/a><\/td>\n <td> | <\/td>\n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=down\"><img alt=\"\" src=\"/images/elements/thumbD.png?1376839523\" /><\/a><\/td>\n \n <td> | <\/td>\n <td>11<\/td>\n \n <\/tr><\/table>\n <\/div>\n \n <div class=\"talkback-links\">\n <a href=\"/comments/new?add_to_root=true&html_id=site_comment_72765&sibling_id=72765\">תגובה חדשה<\/a>\n \n <a href=\"/comments/72765/comments/new?html_id=site_comment_72765\">הגיבו לתגובה<\/a>\n \n <a href=\"/i/offensive?comment_id=72765\" data-noajax=\"true\">דיווח תוכן פוגעני<\/a>\n <\/div>\n \n<\/div>");
var new_comment = $("#site_comment_72765");
This is HTML data that you'll need to parse again using something Selector(text=this_ajax_html_data) and a .//div[#class="talkback-message"]//text() XPath or div.talkback-message ::text CSS selector
Here's a skeleton spider to get you going with these ideas:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from craigslist_sample.items import CraigslistSampleItem
import urlparse
import re
class MySpider(BaseSpider):
name = "craig"
allowed_domains = ["tbk.co.il"]
start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]
def parse(self, response):
sel = Selector(response)
comments = sel.css("div.site_comment")
for comment in comments:
item = CraigslistSampleItem()
# this probably has to be fixed
#item["title"] = comment.xpath("div[#class='talkback-message']text()").extract()
# issue an additional request to fetch the Javascript
# data containing the comment text
# and pass the incomplete item via meta dict
for url in comment.css('div.talkback-topic > a.show-comment::attr(data-ajax-url)').extract():
yield Request(url=urlparse.urljoin(response.url, url),
callback=self.parse_javascript_comment,
meta={"item": item})
break
# the line we are looking for begins with "old.after"
# and we want everythin inside the parentheses
_re_comment_html = re.compile(r'^old\.after\((?P<html>.+)\);$')
def parse_javascript_comment(self, response):
item = response.meta["item"]
# loop on Javascript content lines
for line in response.body.split("\n"):
matching = self._re_comment_html.search(line.strip())
if matching:
# what's inside the parentheses is a Javascript strings
# with escaped double-quotes
# a simple way to decode that into a Python string
# is to use eval()
# then there are these "<\/tag>" we want to remove
html = eval(matching.group("html")).replace(r"<\/", "</")
# once we have the HTML snippet, decode it using Selector()
decoded = Selector(text=html, type="html")
# and save the message text in the item
item["message"] = u''.join(decoded.css('div.talkback-message ::text').extract()).strip()
# and return it
return item
You can try it out using scrapy runspider tbkspider.py.

Related

extract Xpath for string in a div class

I have the below XPath
<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>
I would like to extract "GGRM.JK"from the HTML.
//div[contains(#class, "symbol")]
return element not no text of "GGRM.JK"
Since it seems you are using python, try the following:
import lxml.html as lh
data = """[your html above]"""
doc = lh.fromstring(data)
#version 1
target = doc.xpath('//div[contains(#class, "symbol")]/#class')[0]
print(target.split("'")[1])
#version 2
target2 = doc.xpath('//div[contains(#class, "symbol")]/a/#href')[0]
target2.split('=')[1]
In either case, the output should be
GGRM.JK
The shortest way to get the substing you want with xpath only, without postprocessing, is to use a functions substring-after and substring-before.
Here is an example, how to get 'GGRM.JK' from both class and href attributes.
import lxml.html as lh
htmlText = """<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>"""
htmlDom = lh.fromstring(htmlText)
fromHref = htmlDom.xpath('substring-after(//div/a/#href, "=")')
print(fromHref)
fromClass = htmlDom.xpath('substring-before(substring-after(//div/#class, ": \'"), "\'")')
print(fromClass)

How to set a label to a variable in Watir

I am trying to set some html contents to a variable so i can perform some if statements. But I get this instead:
Fail #<Watir::Browser:0x00000004440b98>
It looks like my variable isnt set to text i want to set.
my html:
<label class="col-lg-12 control-label ng-binding" ng-show="productionReport.Status == 2 && productionReport.ReadyForPublishDate" style="">Text 1</label>
My Watir code:
msgText = 'Text 1'
msgText2 = #browser.label(:xpath, '/html/body/div[1]/div[3]/div/div/div/div/div/form/div/div/div[2]/label')
if (msgText == msgText2)
puts 'Pass' "#{msgText2}"
else
puts 'Fail' "#{msgText2}"
end
The problem is that msgText2 (ie #browser.label) is being set to a Watir::Label element rather than its text.
To get the text of the label, you need to call the text method. For example:
msgText2_element = #browser.label(:xpath, '/html/body/div[1]/div[3]/div/div/div/div/div/form/div/div/div[2]/label')
msgText2 = msgText2_element.text

retrieve text from <p> on landing page using ruby watir

I have to retrieve the text from the web page and put it on console.
I am not able to get the text from this html below. Can anyone please help me on this.
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row">
</div>
I tried b.div(:class => 'twelve columns').exist? on irb and it says true.
I tried this - b.div(:class => 'twelve columns').text, and it returns me the text on the header not in paragraph.
I tried with - b.div(:class => 'twelve columns').p.text, it returned me error - unable to locate element, using {:tag_name=>"p"}
Simply doing this on example you wrote worked for me:
browser.div(:class => 'twelve columns').p.text
Your best bet would be to check your page css for actually having provided elements structure, as well as that they are nested properly.
I slightly fixed you HTML:
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row"></div>
</div>
Let's do a tiny example:
div = b.div(:class => 'twelve columns')
Enumeration of elements as follows:
div.elements.each do |e|
p e
end
Will do something like that:
<Watir::HTMLElement ... # <h1>Your product</h1>
<Watir::HTMLElement ... # <p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<Watir::HTMLElement ... #<div class="row">
If you want to specify child element P from the DIV do this:
p = div.p
or
p = div.element( :tag_name => 'p' )
And when get text of P:
p.text # >> 21598: DECLINE: Decline - Property Type not acceptable under this contract
Or event do with your single string:
b.div(:class => 'twelve columns').p.text
=> "21598: DECLINE: Decline - Property Type not acceptable under this contract"

Get Text between two tags using nokogiri

My HTML structure is
<div class="line">
<h2>Header</h2>
<h3>Mailing Address</h3>
2349 Glorem ipsun lorem ipsum CA 95833<br>
<br>
Phone: 111-111-2111 Fax: 111-511-1111<br>
<a onfocus="blur()" target="_blank"" href="">some text</a><br>
<a onfocus="blur()" target="_blank" href="">some address</a><br>
<div><p></p></div>
<h3>Contact(s)</h3>
</div>
The HTML page contains several <div class=line></div> elements. For each div i need to extract Phone and Fax in a array with other data. I tried using
doc.css("div#ctl00_cphContent_divBrowseByMember").each do |div|
div.css("div.line").each do |line|
line.xpath('//text()[preceding-sibling::br and following-sibling::a]').text.strip
end
end
It returns nothing and returns time out error.
If I try as
line.xpath('//text()[preceding-sibling::br and following-sibling::a]')[0].text.strip
will return same Phone and fax for all other divs. Please suggest any other solution that will help me.
The easy way:
phone, fax = line.text.scan /\d{3}-\d{3}-\d{4}/

How to extract links, text and timestamp from webpage via Html Agility Pack

I am using Html Agility Pack and are trying to extract the links and link text from the following html code. The webpage is fetched from a remote page and the saved locally as a whole. Then from this local webpage I am trying to extract the links and link text. The webpage naturally has other html code like other links text, etc inside its page but is removed here for clarity.
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>
But the above are the most unique content to work from when trying to extract the links and linktext.
This is what I would like to see as the result
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305</link>
<title>Description 1 text here</title>
<pubDate>Wed, 20 Jan 2012 07:35:00 +0100</pubDate>
<link>/some/today.nsf/0/ EC8A39XXXX264X5BC125798B0029E312</link>
<title>Description 2 text here</title>
<pubDate> Wed, 20 Jan 2012 08:35:00 +0100</pubDate>
This is my code so far:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[starts-with(#class, 'Subject2')]")
(lnks.Name == "a" &&
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0)
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
Time = lnks. Attributes["time"].Value
};
foreach (var link in linksOnPage)
{
// Loop through.
Response.Write("<link>" + link.Url + "</link>");
Response.Write("<title>" + link.Text + "</title>");
Response.Write("<pubDate>" + link.Time + "</pubDate>");
}
And its not working, I am getting nothing.
So any suggestions and help would be highly appreciated.
Thanks in advance.
Update: I have managed to get the syntax correct now, in order to select the links from the above examples: With the following code:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[#class='Subject2']//a")
This selects the links nicely with url and text, but how do I go about also getting the time stamp?
That is, select the timestamp out of this:
<span class="time">2012-01-20 09:35</span></a>
which follows each link. And have that output with each link inside the output loop according to the above? Thanks for any help in regards to this.
Your HTML example is malformed, that's why you get unexpected results.
To find your first and second values you'll have to get the <a> inside your <span class='Subject2'> - the first value is a href attribute value, the second is InnerText of the anchor. To get the third value you'll have to get the following sibling of the <span class='Subject2'> tag and get its InnerText.
See, this how you can do it:
var nodes = document.DocumentNode.SelectNodes("//span[#class='Subject2']//a");
foreach (var node in nodes)
{
if (node.Attributes["href"] != null)
{
var link = new XElement("link", node.Attributes["href"].Value);
var description = new XElement("description", node.InnerText);
var timeNode = node.SelectSingleNode(
"..//following-sibling::span[#class='time']");
if (timeNode != null)
{
var time = new XElement("pubDate", timeNode.InnerText);
Response.Write(link);
Response.Write(description);
Response.Write(time);
}
}
}
this outputs something like:
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open</link>
<description>Description 1 text here</description>
<pubDate>2012-01-20 08:35</pubDate>
<link>/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open</link>
<description>Description 2 text here</description>
<pubDate>2012-01-20 09:35</pubDate>

Resources