Extract a string from a div class attribute with XPath - xpath

I have the following HTML:
<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>
I would like to extract "GGRM.JK" from it.
//div[contains(@class, "symbol")]
returns the element, not the text "GGRM.JK".

Since it seems you are using Python, try the following:
import lxml.html as lh

data = """[your html above]"""
doc = lh.fromstring(data)

# version 1: pull the symbol out of the class attribute
target = doc.xpath('//div[contains(@class, "symbol")]/@class')[0]
print(target.split("'")[1])

# version 2: pull it from the link's href
# (assumes the div contains an <a> whose href ends in "=GGRM.JK")
target2 = doc.xpath('//div[contains(@class, "symbol")]/a/@href')[0]
print(target2.split('=')[1])
In either case, the output should be
GGRM.JK

The shortest way to get the substring you want with XPath only, without post-processing, is to use the functions substring-after() and substring-before(). Note that these XPath 1.0 string functions return a plain string, so lxml's .xpath() hands back the value directly rather than a list of nodes.
Here is an example of how to get 'GGRM.JK' from both the class and href attributes (the href version assumes an <a> child that is not shown in the snippet above).
import lxml.html as lh

htmlText = """<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>"""
htmlDom = lh.fromstring(htmlText)

# everything after "=" in the href
fromHref = htmlDom.xpath('substring-after(//div/a/@href, "=")')
print(fromHref)

# everything between ": '" and the closing "'" in the class attribute
fromClass = htmlDom.xpath('substring-before(substring-after(//div/@class, ": \'"), "\'")')
print(fromClass)

Related

XPath problem with multiple OR expressions like (a|b|c) [duplicate]

This question already has an answer here:
Logical OR in XPath? Why isn't | working?
I have simplified HTML:
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
I want to find only one and two, using the conditions that the parent tag is main or support and that there is a span or div under it.
I wonder why this code does not work:
import lxml.html as HTML_PARSER
html = """
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
"""
parent = '//main | //support'
child = '/span | /div'
doc = HTML_PARSER.fromstring(html)
print(doc)
xpath = '(%s)(%s)' % (parent, child)
print(xpath)
parsed = doc.xpath(xpath)
print(parsed)
I get an error Invalid expression. Why?
Both of these XPaths, (//main | //support) and (/span | /div), are correct on their own.
A simple combination like (//main | //support)/span is also correct.
But why is the more complicated combination (//main | //support)(/span | /div) not correct? How can I resolve it?
In my real case //main, //support, /span and /div are really complicated XPaths; I want some general solution like (xpath1 | xpath2)(xpath3 | xpath4).
This will find it; however, I'm not 100% sure it's what you want:
//*[name() = 'main' or name() = 'support']/*[name() = 'span' or name() = 'div']/text()
Your XPath is not valid in XPath version 1 (the one that lxml uses).
Try
xpath = '//div[parent::support]|//span[parent::main]'
or
parent = ['main', 'support']
child = ['span', 'div']
xpath = '//*[self::{0[0]} or self::{0[1]}]/*[self::{1[0]} or self::{1[1]}]'.format(parent, child)
You can use the self:: axis:
(//main | //support)[*[self::div or self::span]]
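A quick way to sanity-check the suggested alternatives is to run them against the simplified HTML from the question (a minimal sketch with lxml; note that its HTML parser wraps the fragment in a body element, which does not affect the // queries):

import lxml.html as lh

html = """
<html>
<main><span>one</span></main>
<not_important><div>skip_me</div></not_important>
<support><div>two</div></support>
</html>
"""
doc = lh.fromstring(html)

# name()-based variant: matches the span/div children directly
print(doc.xpath("//*[name() = 'main' or name() = 'support']"
                "/*[name() = 'span' or name() = 'div']/text()"))
# expected: ['one', 'two']

# parent::-based variant, a plain union of two absolute paths
print([e.text for e in doc.xpath("//div[parent::support] | //span[parent::main]")])
# expected: ['one', 'two']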

xpath could not recognize predicate for a tag

I am trying to use Scrapy's XPath to scrape a page, but it seems it cannot capture tags with predicates when I use a for loop:
# This package will contain the spiders of your Scrapy project
from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json

class CunyfristsectionSpider(scrapy.Spider):
    name = "cunyfirst-section-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]

    def parse(self, response):
        url = response.url
        yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        n = -1
        for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
            print(response.xpath("//a[@name ='MTG_CLASSNAME$10']/text()"))
            n += 1
            class_num = section.xpath('text()').extract_first()
            # print(class_num)
            classname = "MTG_CLASSNAME$" + str(n)
            date = "MTG_DAYTIME$" + str(n)
            instr = "MTG_INSTR$" + str(n)
            print(classname)
            class_name = response.xpath("//a[@name = classname]/text()")
I am looking for <a> tags with name equal to "MTG_CLASSNAME$" + str(n), with n being 0, 1, 2..., and I am getting empty output from my XPath query. Not sure why...
PS.
I am basically trying to scrape course and their info from https://hrsa.cunyfirst.cuny.edu/psc/cnyhcprd/GUEST/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?FolderPath=PORTAL_ROOT_OBJECT.HC_CLASS_SEARCH_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%252cIsFolder&PortalActualURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentProvider=HRMS&PortalCRefLabel=Class%20Search&PortalRegistryName=GUEST&PortalServletURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsp%2fcnyepprd%2f&PortalURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsc%2fcnyepprd%2f&PortalHostNode=ENTP&NoCrumbs=yes
with filter applied: Kingsborough CC, fall 18, BIO
Thanks!
Well... I've visited the website you put in the question description; I used element inspection, searched for "MTG_CLASSNAME", and got 0 matches...
So I will give you some tools:
In your settings.py, set:
LOG_FILE = "log.txt"
LOG_STDOUT=True
Then print the response body (response.body) where appropriate (at the top of the parse_page function in this case) and search for it in log.txt.
Check there whether what you are looking for is present.
If it is, use this https://www.freeformatter.com/xpath-tester.html (or similar) to check your XPath statement.
In addition, change for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
to for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]").extract():, as this will raise an error once you are actually getting the data you are looking for.
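Separately, note that the last line in the question, response.xpath("//a[@name = classname]/text()"), compares @name against a child element called classname, not against the Python variable of that name. Assuming the variable's value was intended, a minimal sketch of the fix is to format the value into the expression inside parse_page:

# classname is the loop variable from the question, e.g. "MTG_CLASSNAME$0";
# quote it and splice it into the XPath string
classname = "MTG_CLASSNAME$" + str(n)
class_name = response.xpath(
    "//a[@name = '{}']/text()".format(classname)
).extract_first()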

Scrapy can't find XPath content

I'm writing a web crawler with Scrapy to download the text of talk-backs on a certain webpage.
Here is the relevant part of the code behind the webpage, for a specific talkback:
<div id="site_comment_71339" class="site_comment site_comment-even large high-rank">
<div class="talkback-topic">
<a class="show-comment" data-ajax-url="/comments/71339.js?counter=97&num=57" href="/comments/71339?counter=97&num=57">57. talk back title here </a>
</div>
<div class="talkback-message"> blah blah blah talk-back message here </div>
....etc etc etc ......
While writing an XPath to get the message:
titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
and later on:
item["title"] = titles.xpath("div[#class='talkback-message']text()").extract()
There's no bug, but it doesn't work. Any ideas why? I suppose I'm not writing the path correctly, but I can't find the error.
Thank you :)
The whole code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.xpath("div[@class='talkback-message']/text()").extract()
            items.append(item)
        return items
Here's a snippet of the HTML page for #site_comment_74240
<div class="site_comment site_comment-even small normal-rank" id="site_comment_74240">
<div class="talkback-topic">
144. מדיניות
</div>
<div class="talkback-username">
<table><tr>
<td>קייזרמן פרדי </td>
<td>(01.11.2013)</td>
</tr></table>
</div>
The "talkback-message" div is not in the HTML page when you first fetch it, but rather is fetched asynchronously via some AJAX query when you click on a comment title, so you'll have to fetch it for each comment.
Comment blocks, titles in your code snippet, can be grabbed using an XPath like this: //div[starts-with(@id, "site_comment_")], i.e. all divs that have an "id" attribute beginning with the string "site_comment_".
You can also use CSS selectors with Selector.css(). In your case, you can grab comment blocks using either the "id" approach (as I've done above using XPath), so:
titles = sel.css("div[id^=site_comment_]")
or using the "site_comment" class without the other "site_comment-even", "site_comment-odd", "small", "normal-rank" or "high-rank" that vary:
titles = sel.css("div.site_comment")
Then you would issue a new Request using the URL that's in ./div[@class="talkback-topic"]/a[@class="show-comment"]/@data-ajax-url inside that comment div. Or, using CSS selectors, div.talkback-topic > a.show-comment::attr(data-ajax-url) (by the way, the ::attr(...) syntax is not standard, but is a Scrapy extension to CSS selectors using pseudo-element functions).
What you get from the AJAX call is some Javascript code, and you want to grab the content inside old.after(...)
var old = $("#site_comment_72765");
old.attr('id', old.attr('id') + '_small');
old.hide();
old.after("\n<div class=\"site_comment site_comment-odd large high-rank\" id=\"site_comment_72765\">\n <div class=\"talkback-topic\">\n <a href=\"/comments/72765?counter=42&num=109\" class=\"show-comment\" data-ajax-url=\"/comments/72765.js?counter=42&num=109\">109. ביבי - האדם הנכון בראש ממשלת ישראל(לת)<\/a>\n <\/div>\n \n <div class=\"talkback-message\">\n \n <\/div>\n \n <div class=\"talkback-username\">\n <table><tr>\n <td>ישראל <\/td>\n <td>(11.03.2012)<\/td>\n <\/tr><\/table>\n <\/div>\n <div class=\"rank-controllers\">\n <table><tr>\n \n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=up\"><img alt=\"\" src=\"/images/elements/thumbU.png?1376839523\" /><\/a><\/td>\n <td> | <\/td>\n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=down\"><img alt=\"\" src=\"/images/elements/thumbD.png?1376839523\" /><\/a><\/td>\n \n <td> | <\/td>\n <td>11<\/td>\n \n <\/tr><\/table>\n <\/div>\n \n <div class=\"talkback-links\">\n <a href=\"/comments/new?add_to_root=true&html_id=site_comment_72765&sibling_id=72765\">תגובה חדשה<\/a>\n \n <a href=\"/comments/72765/comments/new?html_id=site_comment_72765\">הגיבו לתגובה<\/a>\n \n <a href=\"/i/offensive?comment_id=72765\" data-noajax=\"true\">דיווח תוכן פוגעני<\/a>\n <\/div>\n \n<\/div>");
var new_comment = $("#site_comment_72765");
This is HTML data that you'll need to parse again, using something like Selector(text=this_ajax_html_data) and a .//div[@class="talkback-message"]//text() XPath or a div.talkback-message ::text CSS selector.
Here's a skeleton spider to get you going with these ideas:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from craigslist_sample.items import CraigslistSampleItem
import urlparse
import re

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        sel = Selector(response)
        comments = sel.css("div.site_comment")
        for comment in comments:
            item = CraigslistSampleItem()
            # this probably has to be fixed
            #item["title"] = comment.xpath("div[@class='talkback-message']/text()").extract()
            # issue an additional request to fetch the Javascript
            # data containing the comment text
            # and pass the incomplete item via meta dict
            for url in comment.css('div.talkback-topic > a.show-comment::attr(data-ajax-url)').extract():
                yield Request(url=urlparse.urljoin(response.url, url),
                              callback=self.parse_javascript_comment,
                              meta={"item": item})
                break

    # the line we are looking for begins with "old.after"
    # and we want everything inside the parentheses
    _re_comment_html = re.compile(r'^old\.after\((?P<html>.+)\);$')

    def parse_javascript_comment(self, response):
        item = response.meta["item"]
        # loop on Javascript content lines
        for line in response.body.split("\n"):
            matching = self._re_comment_html.search(line.strip())
            if matching:
                # what's inside the parentheses is a Javascript string
                # with escaped double-quotes;
                # a simple way to decode that into a Python string
                # is to use eval(),
                # then there are these "<\/tag>" we want to remove
                html = eval(matching.group("html")).replace(r"<\/", "</")
                # once we have the HTML snippet, decode it using Selector()
                decoded = Selector(text=html, type="html")
                # and save the message text in the item
                item["message"] = u''.join(decoded.css('div.talkback-message ::text').extract()).strip()
                # and return it
                return item
You can try it out using scrapy runspider tbkspider.py.

How to extract links, text and timestamp from webpage via Html Agility Pack

I am using Html Agility Pack and am trying to extract the links and link text from the following HTML code. The webpage is fetched from a remote page and then saved locally as a whole. From this local webpage I am trying to extract the links and link text. The page naturally has other HTML code, like other links and text, inside it, but that has been removed here for clarity.
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>
But the above is the most distinctive content to work from when trying to extract the links and link text.
This is what I would like to see as the result
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305</link>
<title>Description 1 text here</title>
<pubDate>Wed, 20 Jan 2012 07:35:00 +0100</pubDate>
<link>/some/today.nsf/0/ EC8A39XXXX264X5BC125798B0029E312</link>
<title>Description 2 text here</title>
<pubDate> Wed, 20 Jan 2012 08:35:00 +0100</pubDate>
This is my code so far:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[starts-with(@class, 'Subject2')]")
                  where lnks.Name == "a" &&
                        lnks.Attributes["href"] != null &&
                        lnks.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = lnks.Attributes["href"].Value,
                      Text = lnks.InnerText,
                      Time = lnks.Attributes["time"].Value
                  };
foreach (var link in linksOnPage)
{
    // Loop through.
    Response.Write("<link>" + link.Url + "</link>");
    Response.Write("<title>" + link.Text + "</title>");
    Response.Write("<pubDate>" + link.Time + "</pubDate>");
}
And it's not working; I am getting nothing.
So any suggestions and help would be highly appreciated.
Thanks in advance.
Update: I have managed to get the syntax correct now, selecting the links from the above examples with the following code:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[@class='Subject2']//a")
This selects the links nicely, with URL and text, but how do I go about also getting the timestamp?
That is, selecting the timestamp out of this:
<span class="time">2012-01-20 09:35</span></a>
which follows each link, and outputting it with each link inside the output loop as above? Thanks for any help in this regard.
Your HTML example is malformed; that's why you get unexpected results.
To find your first and second values, you'll have to get the <a> inside your <span class='Subject2'>: the first value is the href attribute value, the second is the InnerText of the anchor. To get the third value, take the following sibling of the <span class='Subject2'> tag and get its InnerText.
See, this how you can do it:
var nodes = document.DocumentNode.SelectNodes("//span[@class='Subject2']//a");
foreach (var node in nodes)
{
    if (node.Attributes["href"] != null)
    {
        var link = new XElement("link", node.Attributes["href"].Value);
        var description = new XElement("description", node.InnerText);
        var timeNode = node.SelectSingleNode(
            "..//following-sibling::span[@class='time']");
        if (timeNode != null)
        {
            var time = new XElement("pubDate", timeNode.InnerText);
            Response.Write(link);
            Response.Write(description);
            Response.Write(time);
        }
    }
}
this outputs something like:
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open</link>
<description>Description 1 text here</description>
<pubDate>2012-01-20 08:35</pubDate>
<link>/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open</link>
<description>Description 2 text here</description>
<pubDate>2012-01-20 09:35</pubDate>

XPath Query to select hyperlink

The following is a subset of XML from a Twitter Atom feed:
<entry>
<id>tag:search.twitter.com,2005:18232030105964545</id>
<published>2010-12-24T09:10:29Z</published>
<link type="text/html" rel="alternate" href="http://twitter.com/KTNKenya/statuses/18232030105964545"/>
<title>Synovate Poll: PM Raila Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... http://fb.me/yjmMbmBx</title>
<content type="html">Synovate Poll: PM <b>Raila</b> Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... <a href="http://fb.me/yjmMbmBx">http://fb.me/yjmMbmBx</a></content>
<updated>2010-12-24T09:10:29Z</updated>
<link type="image/png" rel="image" href="http://a3.twimg.com/profile_images/701825859/NEW_KTN_normal.png"/>
<google:location>nairobi, kenya</google:location>
<twitter:geo>
</twitter:geo>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
<twitter:source><a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a></twitter:source>
<twitter:lang>en</twitter:lang>
<author>
<name>KTNKenya (KTN Kenya)</name>
<uri>http://twitter.com/KTNKenya</uri>
</author>
</entry>
From the <title>...</title> element, I need to select the hyperlink http://fb.me/yjmMbmBx via an XPath query. How do I do it? Is it possible?
*I'm an XPath newbie.
Thanks.
You have two options:
Use <title> (xpath: "/entry/title/text()") and get the URL yourself (e.g. using a regex, or finding the last instance of "http://" in the string).
Get the data first:
/entry/content[@type="html"]/text()
Then you need to parse this as HTML and extract any <a> tags, and use the href attribute of those tags. How you do this last part depends on the language/environment you are doing this in.
Update: Added basic example code for option 1 above, as requested:
xmlpp::Element *node = parser.get_document()->get_root_node();
xmlpp::NodeSet results = node->find("/entry/title/text()");
xmlpp::ContentNode* content = dynamic_cast<xmlpp::ContentNode*>(results.front());
std::string text = content->get_content();
std::string link = "";

std::string::size_type res = text.rfind("http://");
if (res == text.npos)
    res = text.rfind("https://");
if (res != text.npos)
    link = text.substr(res);
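For option 2, the last step depends on your environment. As a minimal sketch in Python with lxml, assuming you already have the unescaped text of the <content> element (which is what the text() query returns for type="html" content):

import lxml.html as lh

# the unescaped text of <content type="html"> from the entry above
content = ('Synovate Poll: PM <b>Raila</b> Odinga remains the preffered '
           'presidential candidate at 42% while Uhuru Kenyatta is at 14%... '
           '<a href="http://fb.me/yjmMbmBx">http://fb.me/yjmMbmBx</a>')

# wrap the fragment in a div so it parses as a single element,
# then pull the href of every <a> tag
fragment = lh.fromstring("<div>" + content + "</div>")
print(fragment.xpath("//a/@href"))  # ['http://fb.me/yjmMbmBx']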
With atom prefix bound to http://www.w3.org/2005/Atom namespace URI, use:
/atom:feed/atom:entry/atom:title[contains(.,'http://')]
This selects every atom:title element child of atom:entry, having the string "http://" contained in its string value.
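In lxml, for instance, the prefix binding is supplied through the namespaces argument (a sketch, assuming the whole feed is saved in a hypothetical feed.xml):

import lxml.etree as etree

ns = {"atom": "http://www.w3.org/2005/Atom"}
tree = etree.parse("feed.xml")  # hypothetical file holding the full Atom feed
titles = tree.xpath(
    "/atom:feed/atom:entry/atom:title[contains(., 'http://')]/text()",
    namespaces=ns,
)
print(titles)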
