I've been trying to scrape this without much luck. I have tried the following XPaths:
../@href
parent::a/@href
Here's what I'm trying to scrape:
<a href="https://placeholder.url.com" class="infoclass_3392 classghzb">
<div class="hgs-983hsa" data-testid="Name">Casing NZXT H510i Black Matte or White</div>
<div class="hgs-212gsa" data-testid="Price">Rp1.747.999</div>
</a>
I can scrape the price data, but from the price node I'm trying to access the parent a tag's href.
If your current node is the text(), you need to go up two levels: ../../@href or parent::div/parent::a/@href. The parent of the text() is the div.
Demonstrated in xsh:
open file.xml ;
cd a/div[2]/text() ;
ls ; # Rp1.747.999
echo ../../@href ; # https://placeholder.url.com
echo parent::div/parent::a/@href ; # https://placeholder.url.com
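The same navigation works from Python with lxml, for example. A minimal sketch, assuming the snippet above is saved as file.html (the file name is only for illustration):
from lxml import html

tree = html.parse("file.html")
# grab the price text, then step up to the enclosing <a> for its href
price = tree.xpath('//div[@data-testid="Price"]/text()')[0]            # Rp1.747.999
href = tree.xpath('//div[@data-testid="Price"]/parent::a/@href')[0]    # https://placeholder.url.com
# or, starting from the text() node itself, go up two levels
href2 = tree.xpath('//div[@data-testid="Price"]/text()/../../@href')[0]
print(price, href, href2)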
<div id="t_info" class="tab-pane fade active in tab">
<br><strong>Delivery</strong> <br>
<br><br><br><strong>Model Name</strong> : BP250
<br>
<br>Full HD up-scaling dramatically improves the resolution of any original content to Full HD.
<br>
<br><strong>Barcode</strong> : 8806087225921
<br>
<br><strong>Product Type</strong> : Blu-ray Player<br>
<br>Blu-Ray Disc <br>External <br></div>
I need an XPath to capture the barcode value. The location of the barcode varies depending on the description.
I have tried //*[text()='Barcode'], but I can't capture the value.
In your case you can use the following XPath:
(//div[@id="t_info"]/text())[./preceding::strong[text()='Barcode']][1]
Please note that it is mauvais ton (bad manners)
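To sanity-check that expression, here is a minimal sketch in Python with lxml, assuming the t_info div above is saved as page.html (the file name is only for illustration):
from lxml import html

tree = html.parse("page.html")
# select the first text node of the div that follows the <strong>Barcode</strong> label
raw = tree.xpath('(//div[@id="t_info"]/text())[./preceding::strong[text()="Barcode"]][1]')[0]
# the node comes back as " : 8806087225921", so strip the separator
barcode = raw.split(":")[-1].strip()
print(barcode)  # 8806087225921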
I want to get text "+12345" from this HTML
<p class="Test" ng-repeat="(k, wl) in partnerEditModel.td">
<span id="Test-update-12345" class="ng-binding">
+12345
<span class="err-message ng-binding">Error</span>
</span>
<a id="mibile" class="button" ng-click="remove(k)">
</p>
I have written //p[@class='Test']/span, but it matched "+12345Error" when I wanted only "+12345" (not "Error").
Could you please tell me how to write this XPath?
Try the XPath expression below to get "+12345" only:
normalize-space(//span[@id="Test-update-12345"]/text()[1])
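If you need this from a live browser, note that Selenium WebDriver cannot return a bare text node, so a common workaround is to subtract the nested span's text from the parent's text. A rough Python sketch (the URL is a placeholder; the ids come from the snippet above):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/partner-edit")  # placeholder URL for the page above
outer = driver.find_element_by_id("Test-update-12345")
inner = outer.find_element_by_css_selector("span.err-message")
# outer.text includes both "+12345" and "Error"; remove the child's text to keep only "+12345"
print(outer.text.replace(inner.text, "").strip())
driver.quit()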
Try the below code in Robot Framework:
${var}= | Get Text | xpath=//span[span[@class='err-message ng-binding']]
Log | ${var}
Try using the locator below:
xpath=//p[@class='Test']/span[1]
I'm writing a web crawler with Scrapy to download the text of talk-backs on a certain webpage.
Here is the relevant part of the code behind the webpage, for a specific talkback:
<div id="site_comment_71339" class="site_comment site_comment-even large high-rank">
<div class="talkback-topic">
<a class="show-comment" data-ajax-url="/comments/71339.js?counter=97&num=57" href="/comments/71339?counter=97&num=57">57. talk back title here </a>
</div>
<div class="talkback-message"> blah blah blah talk-back message here </div>
....etc etc etc ......
While writing an XPath to get the message:
titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
and later on:
item["title"] = titles.xpath("div[#class='talkback-message']text()").extract()
There's no error, but it doesn't work. Any ideas why? I suppose I'm not writing the path correctly, but I can't find the mistake.
Thank you :)
The whole code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem
class MySpider(BaseSpider):
name = "craig"
allowed_domains = ["tbk.co.il"]
start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]
def parse(self, response):
hxs = Selector(response)
titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
items=[]
for titles in titles:
item = CraigslistSampleItem()
item["title"] = titles.xpath("div[#class='talkback-message']text()").extract()
items.append(item)
return items
Here's a snippet of the HTML page for #site_comment_74240
<div class="site_comment site_comment-even small normal-rank" id="site_comment_74240">
<div class="talkback-topic">
144. מדיניות
</div>
<div class="talkback-username">
<table><tr>
<td>קייזרמן פרדי </td>
<td>(01.11.2013)</td>
</tr></table>
</div>
The "talkback-message" div is not in the HTML page when you first fetch it, but rather is fetched asynchronously via some AJAX query when you click on a comment title, so you'll have to fetch it for each comment.
Comment blocks (the "titles" in your code snippet) can be grabbed using an XPath like this: //div[starts-with(@id, "site_comment_")], i.e. all divs whose "id" attribute begins with the string "site_comment_".
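As a quick check, that XPath can be used directly on the response. A minimal sketch, where response is the Scrapy response object passed to parse():
from scrapy.selector import Selector

sel = Selector(response)
comments = sel.xpath('//div[starts-with(@id, "site_comment_")]')
print(len(comments))  # number of comment blocks found on the page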
You can also use CSS selectors with Selector.css(). In your case, you can grab comment blocks using either the "id" approach (as I did above with XPath):
titles = sel.css("div[id^=site_comment_]")
or using the "site_comment" class alone, without the other classes ("site_comment-even", "site_comment-odd", "small", "normal-rank", "high-rank") that vary:
titles = sel.css("div.site_comment")
Then you would issue a new Request using the URL that's in ./div[@class="talkback-topic"]/a[@class="show-comment"]/@data-ajax-url inside that comment div. Or, using CSS selectors: div.talkback-topic > a.show-comment::attr(data-ajax-url) (by the way, ::attr(...) is not standard CSS, but a Scrapy extension to CSS selectors using pseudo-element functions).
What you get from the AJAX call is some JavaScript code, and you want to grab the content inside old.after(...):
var old = $("#site_comment_72765");
old.attr('id', old.attr('id') + '_small');
old.hide();
old.after("\n<div class=\"site_comment site_comment-odd large high-rank\" id=\"site_comment_72765\">\n <div class=\"talkback-topic\">\n <a href=\"/comments/72765?counter=42&num=109\" class=\"show-comment\" data-ajax-url=\"/comments/72765.js?counter=42&num=109\">109. ביבי - האדם הנכון בראש ממשלת ישראל(לת)<\/a>\n <\/div>\n \n <div class=\"talkback-message\">\n \n <\/div>\n \n <div class=\"talkback-username\">\n <table><tr>\n <td>ישראל <\/td>\n <td>(11.03.2012)<\/td>\n <\/tr><\/table>\n <\/div>\n <div class=\"rank-controllers\">\n <table><tr>\n \n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=up\"><img alt=\"\" src=\"/images/elements/thumbU.png?1376839523\" /><\/a><\/td>\n <td> | <\/td>\n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=down\"><img alt=\"\" src=\"/images/elements/thumbD.png?1376839523\" /><\/a><\/td>\n \n <td> | <\/td>\n <td>11<\/td>\n \n <\/tr><\/table>\n <\/div>\n \n <div class=\"talkback-links\">\n <a href=\"/comments/new?add_to_root=true&html_id=site_comment_72765&sibling_id=72765\">תגובה חדשה<\/a>\n \n <a href=\"/comments/72765/comments/new?html_id=site_comment_72765\">הגיבו לתגובה<\/a>\n \n <a href=\"/i/offensive?comment_id=72765\" data-noajax=\"true\">דיווח תוכן פוגעני<\/a>\n <\/div>\n \n<\/div>");
var new_comment = $("#site_comment_72765");
This is HTML data that you'll need to parse again, using something like Selector(text=this_ajax_html_data) and a .//div[@class="talkback-message"]//text() XPath or a div.talkback-message ::text CSS selector.
Here's a skeleton spider to get you going with these ideas:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from craigslist_sample.items import CraigslistSampleItem
import urlparse
import re
class MySpider(BaseSpider):
name = "craig"
allowed_domains = ["tbk.co.il"]
start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]
def parse(self, response):
sel = Selector(response)
comments = sel.css("div.site_comment")
for comment in comments:
item = CraigslistSampleItem()
# this probably has to be fixed
#item["title"] = comment.xpath("div[#class='talkback-message']text()").extract()
# issue an additional request to fetch the Javascript
# data containing the comment text
# and pass the incomplete item via meta dict
for url in comment.css('div.talkback-topic > a.show-comment::attr(data-ajax-url)').extract():
yield Request(url=urlparse.urljoin(response.url, url),
callback=self.parse_javascript_comment,
meta={"item": item})
break
# the line we are looking for begins with "old.after"
# and we want everything inside the parentheses
_re_comment_html = re.compile(r'^old\.after\((?P<html>.+)\);$')
def parse_javascript_comment(self, response):
item = response.meta["item"]
# loop on Javascript content lines
for line in response.body.split("\n"):
matching = self._re_comment_html.search(line.strip())
if matching:
# what's inside the parentheses is a JavaScript string
# with escaped double-quotes
# a simple way to decode that into a Python string
# is to use eval()
# then there are these "<\/tag>" we want to remove
html = eval(matching.group("html")).replace(r"<\/", "</")
# once we have the HTML snippet, decode it using Selector()
decoded = Selector(text=html, type="html")
# and save the message text in the item
item["message"] = u''.join(decoded.css('div.talkback-message ::text').extract()).strip()
# and return it
return item
You can try it out using scrapy runspider tbkspider.py.
So I have some content with some href links. The links look like so:
<p>Here you can find
Survival stats
Smoking stats
and Risks
<a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>
And a few more
My desired result is to remove all the ssNODELINK links that you see listed and keep the other links. The result would look like:
Here you can find Survival stats Smoking stats and Risks of recent research Something
I have tried the following lines of code to achieve this:
page_content.gsub!(/(<a href="ssNODELINK/a-zA-Z">)/, '')
And this, which only removes part of it:
page_content.gsub!(/(<a href="ssNODELINK)/, '')
Any suggestions on how to achieve my desired result?
I would do it as below:
require 'nokogiri'
doc = Nokogiri.HTML <<-eot
<p>Here you can find
Survival stats
Smoking stats
and Risks
<a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>
eot
nodesets = doc.css('p > a')
nodesets.each do |nd|
nd.unlink if nd['href'].include? 'ssNODELINK'
end
puts doc.to_html.gsub(/^\s*\n/, "")
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>Here you can find
# >> <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
# >> of recent research</p></body></html>
I am attempting to capture a line of text for an automated WebDriver test to use it in a comparison later on. However, I cannot find an XPath that will work with WebDriver. I have used the text() function before to capture text that is not in a tag, but in this instance that is not working. Here is the HTML, note that this text will never be the same, so I cannot use contains or similar functions.
<div id="content" class="center ui-content" data-role="content" role="main">
<div data-iscroll="scroller">
<div class="ui-corner-all ui-controlgroup ui-controlgroup-vertical" data-role="controlgroup">
<a class="ui-btn ui-corner-top ui-btn-hover-c" style="text-align: left" data-role="button" onclick="onDocumentClicked(21228772, "document.php?loan=********&folderseq=0&itemnum=21228772&pageCount=3&imageTypeName=1003 Application - Final&firstInitial=&lastName=")" href="#" data-corners="true" data-shadow="true" data-iconshadow="true" data-wrapperels="span" data-theme="c">
<span class="ui-btn-inner ui-corner-top">
<span class="ui-btn-text">
<img class="checkMark checkMark21228772 notViewedCompletely" width="15" height="15" title="You have not yet viewed this document." src="../images/white_dot.gif"/>
1003 Application - Final. (Jan 11 2012 5:04PM)
</span>
</span>
</a>
In this example, the text I am attempting to capture is: 1003 Application - Final. (Jan 11 2012 5:04PM)
I have inspected the element with Firebug and I have tried the following XPaths with no success.
html/body/div[1]/div[2]/div/div/a[1]/span/span
html/body/div[1]/div[2]/div/div/a[1]/span/span/text()
The WebDriver test is being written in C#.
You can either use this:
driver.FindElement(By.XPath(".//div[@id='content']/following-sibling::span[@class='ui-btn-text']"));
or
var elem = driver.FindElement(By.Id("content"));
string text = string.Empty;
if(elem!=null) {
var textElem = elem.FindElement(By.XPath(".//following-sibling::span[@class='ui-btn-text']"));
if(textElem != null) text = textElem.Text;
}
I was able to solve this issue by removing the span tags from the XPath.
GetText("html/body/div[3]/div[2]/div/div/a[1]", SelectorType.XPath);
Python WebDriver code looks something like:
driver.find_element_by_xpath("//span[@class='ui-btn-text']").text
But the locator may not be unique, because I can't see all of the code.
P.S. Try to never use locators like html/body/div[1]/div[2]/div/div/a[1]/span/span
Approach:
Find the CSS selector from the given DOM.
Derived CSS: css=#content div.ui-controlgroup > a[onclick*='onDocumentClicked'] > span > span
Use the C# library method to get the text.
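For reference, driving that selector from Python WebDriver looks like the sketch below; in C# the equivalent call would be driver.FindElement(By.CssSelector(...)).Text. The driver here is assumed to be an already-initialized WebDriver instance pointed at the page:
# drop the "css=" prefix when passing the selector to WebDriver
selector = "#content div.ui-controlgroup > a[onclick*='onDocumentClicked'] > span > span"
print(driver.find_element_by_css_selector(selector).text)  # 1003 Application - Final. (Jan 11 2012 5:04PM)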