I am trying to scrape a webpage with Mechanize. The page has the following structure:
<div id="searchResultsBox">
<div class="listings-wrap">
<div class="listings-header">
<div class="listing-cat">Category</div>
<div class="listing-name">Name</div>
</div>
<ul class="listings">
<li class="listing">
<a href="/ShowRatings.jsp?tid=1143052">
<span class="listing-cat">
<span class="icon"></span>
TEXT
</span>
<span class="listing-name">
<span class="main">TEXT</span>
<span class="sub">TEXT</span>
</span>
</a>
</li>
...
I want to navigate to the page behind the <a> HTML element. Right now, I have:
agent = Mechanize.new
page = agent.get("URL")
page = page.at('#searchResultsBox > div.listings-wrap > ul > li:nth-child(1) > a')
but it keeps returning nil (verified by puts page.class).
I also tried adding a sleep to give the page time to load before continuing.
Is there anything I am doing wrong? I thought using the CSS selector would do the trick.
Maybe the website content is loaded dynamically by JavaScript.
Inspect the content of your page variable and check whether it is complete.
If the content is incomplete, the data must come from additional requests to the server. You can find them by opening Chrome DevTools (or a similar tool): in the "Network" tab you will see all the requests made by the website. Look for the one containing the data you need, and then scrape that URL with Mechanize.
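For example, if the listings turn out to come from an XHR endpoint returning JSON, a minimal sketch might look like this (the endpoint URL and the JSON key names below are assumptions; substitute whatever you find in the Network tab):
require 'mechanize'
require 'json'

agent = Mechanize.new
# Hypothetical XHR endpoint discovered in the DevTools "Network" tab
response = agent.get("https://example.com/search/results.json?query=foo")
data = JSON.parse(response.body)

data["listings"].each do |listing|
  # Follow the per-listing URL contained in the JSON payload
  page = agent.get(listing["url"])
  puts page.title
end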
I have a tumblr blog embedded into my website (iframe) and want all clicks that open in a new tab to land on the post detail URL (e.g. https://diestadtgaertner.tumblr.com/post/657405245299818496). I already adapted the template to get this working for most post types by exchanging the respective href variable with "https://diestadtgaertner.tumblr.com/post/{PostID}" and adding target="_blank". However, I can't get this to work for the photoset. Does anyone know how this might work?
Help would be greatly appreciated!
Thanks & best,
Torge
You can edit your template so the photoset gets output into a normal div (I think the default is to load photosets inside an iframe themselves, which could be causing your issues).
This is the block from my tumblr template:
<ul>
...
{block:Photoset}
<li class="post photoset">
{block:Photos}
<img src="{PhotoURL-500}" {block:HighRes}style="display:none"{/block:HighRes} />
{block:HighRes}
<img src="{PhotoURL-HighRes}" class="highres" />
{/block:HighRes}
{/block:Photos}
{block:Caption}
<div class="description">{Caption}</div>
{/block:Caption}
<p>
<span class="icon-link ion-ios-infinite-outline"></span>
{block:Date}{DayOfMonthWithZero}.{MonthNumberWithZero}.{ShortYear}{/block:Date}
</p>
</li>
{/block:Photoset}
</ul>
In any case you could wrap the entire block in the Permalink href. Something like:
<ul>
...
{block:Photoset}
<li class="post photoset">
<a href="{Permalink}"> // this permalink href now wraps the entire content of the post.
{block:Photos}
<img src="{PhotoURL-500}" {block:HighRes}style="display:none"{/block:HighRes} />
{block:HighRes}
<img src="{PhotoURL-HighRes}" class="highres" />
{/block:HighRes}
{/block:Photos}
{block:Caption}
<div class="description">{Caption}</div>
{/block:Caption}
</a>
</li>
{/block:Photoset}
</ul>
The issue now is that the default click links for the images inside this post (if they exist) will no longer function normally.
It is difficult to test this without a link to your site, but updating your tumblr template as above should hopefully give you the result you are after. Of course, I would recommend backing up your code first.
Using Scrapy, I want to extract some data from a well-formed HTML site. With XPath I am able to extract a list of items, but I am not able to extract data from the elements in the list using XPath.
All the XPaths have been tested using XPather. I have also reproduced the issue using a local file that contains the webpage.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[@id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When using the following XPath to extract "The text I want", I don't get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(@class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?
Try the following code. Your original XPath fails because it starts with /div, which is evaluated from the document root instead of relative to the product element; a path prefixed with . is evaluated relative to the current node:
XPATH_PRODUCT_NAME = ".//div[@class='item__name']/a[@class='item__name-link']/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
I have a webpage with a list of pages:
<div class="pager">
<span class="current_page">1</span>
<span class="page" samo:page="2">2</span>
<span class="page" samo:page="3">3</span>
<span class="page" samo:page="4">4</span>
<span class="page" samo:page="5">5</span>
<span class="page" samo:page="6">6</span>
<span class="page" samo:page="7">7</span>
<span class="page" samo:page="8">8</span>
<span class="page" samo:page="9">9</span>
<span class="page" samo:page="10">10</span>
<span class="page" samo:page="11">11</span>
</div>
How can I click on the span using Mechanize?
According to this ASCIIcasts episode, you can search for and find elements:
There are two methods on the page object that we can use to extract
elements from a page using Nokogiri. The first of these is called at
and will return a single element that matches a selector.
agent.page.at(".edit_item")
The second method is search. This is similar, but returns an array of
all of the elements that match.
agent.page.search(".edit_item")
http://asciicasts.com/episodes/191-mechanize
So doing something like:
agent.page.search(".page")
will return the array of spans. Then you will be able to work with them and perform the #click action.
EDITED:
Since the span is a non-interactive element and click is a Link action in Mechanize, you will have to find a workaround:
How to click link in Mechanize and Nokogiri?
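A minimal sketch of such a workaround, assuming the site loads page N through a query parameter (the listing URL and the parameter name below are assumptions; check the real request in your browser's network tools):
require 'mechanize'

agent = Mechanize.new
page = agent.get("http://example.com/list") # hypothetical listing URL

page.search("div.pager span.page").each do |span|
  number = span["samo:page"]
  # The span itself is not clickable, so replicate the request its
  # JavaScript handler would send (the parameter name is an assumption)
  result = agent.get("http://example.com/list?page=#{number}")
  puts result.title
end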
I am trying to get the value of an element attribute from this site via importXML in Google Spreadsheet using XPath.
The attribute value I seek is the content attribute found in the <span> with itemprop="price".
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
...
</div>
I can access <div class="left"> but I can't get to the <span> element.
I tried using:
//span[@class='pret']/@content: I get #N/A;
//span[@itemprop='price']/@content: I get #N/A;
//div[@class='left']/span[@class='pret' and @itemprop='price']/@content: I get #N/A;
//div[@class='left']/span[1]/@content: I get #N/A;
//div[@class='left']/span/text() to get the text node of <span>: I get #N/A;
//div[@class='left']//span/text(): I get the text node of a <span> lower in div.left.
To get the text node of the <span> I have to use //div[@class='left']/text(). But I can't use that text node, because the layout of the span changes if a product is on sale, so I need the attribute.
It's as if the span I'm looking for does not exist, although it appears in Chrome's developer view and in the page source, and all the XPaths work in the console using $x("").
I tried to generate the XPath directly from the developer tools by right-clicking, and I get //*[@id='produs']/div[4]/div[4]/div[1]/span, which does not work. I also tried to generate the XPath with Firefox and with plugins for FF and Chrome, to no avail. The XPaths generated in these ways did not even work on sites I managed to scrape with hand-coded XPaths.
Now, the strangest thing is that on this other site, with an apparently similar code structure, the XPath //span[@itemprop='price']/@content works.
I have struggled with this for four days now. I'm starting to think it has something to do with the unclosed meta tag, but why doesn't this happen on the other site?
Perhaps the following formulas can help you:
=ImportXML("http://...","//div[#class='product-info-price']//div[#class='left']/text()")
Or
=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)
UPDATE
It seems that when ImportXML cannot properly parse the entire document, it fails. An extract of the document, something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
<div class="resealed-info">
» Vezi 1 resigilat din aceasta categorie
</div>
<ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
<li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin BRD</li>
<li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
<li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
</ul>
</div>
<div class="right" style="height: 103px;line-height: 103px;">
<form action="/?a=shopping&sa=addtocart" method="post" id="add_to_cart_form">
<input type="hidden" name="product-183641" value="on"/>
<img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/>
</form>
</div>
</div>
</html>
works with the following XPath query:
"//div[#class='product-info-price']//div[#class='left']//span[#itemprop='price']/#content"
UPDATE
It occurs to me that one option is to use Apps Script to create your own ImportXML function, something like:
/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var html, content = '';
  // Fetch the raw page instead of letting ImportXML parse it as XML
  var response = UrlFetchApp.fetch(url);
  if (response) {
    html = response.getContentText();
    // Pull the content attribute out of the price span with a regex,
    // since the page is not valid XHTML and XML parsing fails on it
    if (html) content = html.match(/<span class="pret" itemprop="price" content="(.*)">/gi)[0].match(/content="(.*)"/i)[1];
  }
  return content;
}
Then you can use as follows:
=MyImportXML("http://...")
At this time, the web page referred to in the first link doesn't include a span tag with itemprop="price", but the following XPath returns 639:
//b[@itemprop='price']
It looks to me like the problem was that the meta tag was not XHTML compliant, but now all the meta tags are properly closed.
Before:
<meta itemprop="currency" content="RON">
Now:
<meta itemprop="priceCurrency" content="RON" />
For web pages that are not XHTML compliant, a solution other than IMPORTXML should be used, such as IMPORTDATA combined with REGEXEXTRACT, or Google Apps Script with the UrlFetch service and the JavaScript match function, among other alternatives.
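As a rough sketch of the IMPORTDATA/REGEXEXTRACT variant (the URL is a placeholder, the regex depends on the exact markup, and IMPORTDATA splits the page on commas and newlines, so the formula may need tweaking):
=REGEXEXTRACT(JOIN(" ",IMPORTDATA("http://...")),"itemprop=""price"" content=""([^""]+)""")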
Try something like this with lxml in Python:
from lxml import html

tree = html.parse("page.html")  # or html.fromstring(page_source)
print('content by key', tree.xpath('//*[@itemprop="price"]')[0].get('content'))
or
# the span is a sibling of the meta tag, not its child
for node in tree.xpath('//div[@class="left"]/span[@itemprop="price"]'):
    print('content =', node.get('content'))
But I haven't tried that.
I want to count the images displayed on a page using Capybara. The HTML code is shown below. I use the following code to return the total count, but it returns 0. My page has more than 100 images.
c = page.all('.thumbnail_select').count
puts c # returns 0
HTML
<a class="thumbnail thumbnail_img_wrap">
<img alt="" src="test.jpg">
<div class="thumbnail_select">
<div class="thumail_selet_backnd"></div>
<div class="thumbil_selt_text">Click to Select</div>
</div>
<p>ucks</p>
<span class="info_icon"><span class="info_icon_img"></span></span>
</a>
<a class="thumbnail thumbnail_img_wrap">
<img alt="" src="test1.jpg">
<div class="thumbnail_select">
<div class="thumail_selet_backnd"></div>
<div class="thumbil_selt_text">Click to Select</div>
</div>
<p>ucks</p>
<span class="info_icon"><span class="info_icon1_img"></span></span>
</a>
.........
.........
How can I count the total number of images?
You have a few options.
Either find all divs with the class thumbnail_select by using all("div[class='thumbnail_select']").count.
But this is an awkward way of doing it, since it looks for the div and not the images.
A better way would be to look for all images using all("img").count, as long as no other images are present on the page.
If neither of these works, the problem might be that the page has not finished loading when you start looking for the images. In that case, simply put a page.should have_content check before the image count to make sure the page is loaded, as in the sketch below.
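Putting that together, a minimal sketch (the page URL and the expected text are assumptions):
visit "/gallery" # hypothetical page URL

# have_content waits until the page has rendered
page.should have_content("Click to Select")

puts page.all("div.thumbnail_select").count
puts page.all("img").count
Alternatively, an expectation such as page.should have_css(".thumbnail_select", count: 100) will wait for the elements to appear and fail with a clear message if the count is wrong.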