Scraping a web page by using XPath

I'm scraping some web pages in order to get some information. I'm using Scrapy and XPath language.
This is an example of a page I'm working with. The page contains many li elements like this one:
<li ckIgnore="false" codmod="3857" ccar="A" area="NEW" versArea="NEW" shorturl="1" modurl="/auto">
<article>
<img width="210" height="158" src="" alt="" modello=>
<img src="" alt="logo" class="logo-listing" width="38">
<div class="hgroup">
<a href="">
<h5>ABARTH</h5>
<h3>500 cabrio</h3>
</a>
</div>
</article>
</li>
I'm using the following syntax to get all the divs that have the hgroup class. Unfortunately, when I try to print out the models variable, it is empty.
def parse(self, response):
    sel = Selector(response)
    models = sel.xpath("//div[@class='hgroup']/a")

It's possible that what Scrapy "sees" is different from what you see in your browser. Try using scrapy shell "http://example.com" and check response.body to see whether what you're looking for is there.
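For instance, a minimal check along these lines (the URL is a placeholder; the XPath uses the corrected @class predicate):
scrapy shell "http://example.com/auto"
>>> response.xpath("//div[@class='hgroup']/a").extract()
[]
If this comes back empty while your browser shows the elements, the content is most likely injected by JavaScript and is simply absent from the HTML that Scrapy downloads.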

Related

How to properly get the value contained inside a section using XPath?

Having the following HTML (a snippet grabbed from the web page I want to scrape):
<div class="ulListContainer">
<section class="stockUpdater">
<ul class="column4">
<li>
<img src="1.png" alt="">
<strong>
Buy*
</strong>
<strong>
Sell*
</strong>
</li>
<li>
<header>
$USD
</header>
<span class="">
20.90
</span>
<span class="">
23.15
</span>
</li>
</ul>
<ul>...</ul>
</section>
</div>
How do I get the value of the 1st span in the 2nd li using XPath? The result should be 20.90.
I have tried //div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1], but I am not getting any values. I should mention this is being used from a Google Sheet with the IMPORTXML function (I am not sure which version of XPath it uses). Can I get some help?
Update
Apparently Google Sheets does not support such a "complex" XPath expression, since a simpler one seems to work fine.
Update 1
As requested, I've shared the Google Sheet I am using to test this; here is the link.
What you need is:
=IMPORTXML(A1;"//li[contains(text(),'USD')]/span[1]")
Removing section from your original XPath will work too:
=IMPORTXML(A1;"//div[@class='ulListContainer']/ul[1]/li[2]/span[1]")
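You can also sanity-check expressions like these locally with lxml before pasting them into Sheets (a quick sketch; snippet.html is a hypothetical file holding the HTML above):
>>> from lxml import html
>>> tree = html.fromstring(open('snippet.html').read())
>>> tree.xpath('//div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1]')[0].text_content().strip()
'20.90'
Since the original expression is valid XPath 1.0 and matches the snippet, the failure is on the IMPORTXML side rather than in the expression itself.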
Try this:
=IMPORTXML("URL","//span[1]")
Change URL to the actual website URL.

Extract text inside anchor tag using xpath

I am trying to ascertain how many pages there are for any search result on a site, so that I can scrape data from all the pages using lxml and XPath.
There is a pagination tab with the following structure:
Page: 1 2 3 ... 7 next
The HTML content for it is something like:
<ul class="ulclass">
<li></li>
<li>
<span> You are on the first page</span>
"1"
</li>
<li>
<a href="link to second page">
<span></span>
"2"
</a>
</li>
<li>
</li>
...
<li>
<a href="link to last page">
<span></span>
"7"
</a>
</li>
My approach is to extract the page numbers 1, 2, 3, 7 so that I can repeat the scraping once for each of the 7 pages, because otherwise only the first page of results gets scraped.
I have written the following XPath, but it does not return the correct page numbers:
xpath('//ul[@class="ulclass"]/li/a/text()')
If I expand your example to form this,
<ul class="ulclass">
<li><span>You are on the first page</span>"1"</li>
<li><span></span>"2"</li>
<li><span></span>"3"</li>
<li><span></span>"4"</li>
<li><span></span>"5"</li>
<li><span></span>"6"</li>
<li><span></span>"7"</li>
</ul>
then using scrapy in Python I can get this:
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('//ul[@class="ulclass"]/li/a/text()').extract()
['"2"', '"3"', '"4"', '"5"', '"6"', '"7"']

Using page.at with CSS selector in Mechanize

I am trying to scrape a webpage with Mechanize, with the following structure:
<div id="searchResultsBox">
<div class="listings-wrap">
<div class="listings-header">
<div class="listing-cat">Category</div>
<div class="listing-name">Name</div>
</div>
<ul class="listings">
<li class="listing">
<a href="/ShowRatings.jsp?tid=1143052">
<span class="listing-cat">
<span class="icon"></span>
TEXT
</span>
<span class="listing-name">
<span class="main">TEXT</span>
<span class="sub">TEXT</span>
</span>
</a>
</li>
...
I want to navigate to the page behind the <a> HTML element. Right now, I have:
agent = Mechanize.new
page = agent.get("URL")
page = page.at('#searchResultsBox > div.listings-wrap > ul > li:nth-child(1) > a')
but it keeps returning nil (verified by puts page.class).
I also tried using sleep to try to ensure that pages have time to load before continuing.
Is there anything I am doing wrong? I thought using the CSS selector would do the trick.
Maybe the website content is loaded dynamically by JavaScript.
Inspect the content of your page variable and see whether the content there is complete.
If the content is incomplete, it means there must be other requests to the server returning that data. You can find them by opening Chrome DevTools (or a similar tool): in the "Network" tab you will see all requests made by the website. Search for the one containing the data you need, and then scrape it with Mechanize.

Using scrapy extract the url of image

I am using scrapy to scrape images. I notice that some image URLs are specified by @src, like the following:
<a href="http://www.wandoujia.com/apps/com.uu">
<img src="http://img.wdjimg.com/mms/icon/v1/5/09/14687d011083dc84036fc68dc3c80095_68_68.png" width="68" height="68" alt="UU电话" class="icon">
</a>
Some are different:
<a href="http://www.wandoujia.com/apps/com.hcsql.shengqiandianhua">
<img data-original="http://img.wdjimg.com/mms/icon/v1/6/44/a27006acfbe8b6aa39bee49c6f004446_68_68.png" alt="省钱电话" class="icon lazy" width="68" height="68" src="http://img.wdjimg.com/mms/icon/v1/6/44/a27006acfbe8b6aa39bee49c6f004446_68_68.png" style="display: block;">
</a>
I use the following code to extract the URL. The result is: 1) if only src occurs, @src is the real link of the image; 2) if data-original occurs, @data-original is the real link and @src is not. So my question is: what should I do to extract the image URL in both cases?
sel.xpath('//a/img/@src').extract()
You can try:
sel.xpath('//a/img[not(@data-original)]/@src | //a/img/@data-original').extract()
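A quick check with Scrapy's Selector against the two snippets from the question (assuming html is a string holding both <a> blocks):
>>> from scrapy.selector import Selector
>>> sel = Selector(text=html)
>>> sel.xpath('//a/img[not(@data-original)]/@src | //a/img/@data-original').extract()
['http://img.wdjimg.com/mms/icon/v1/5/09/14687d011083dc84036fc68dc3c80095_68_68.png',
 'http://img.wdjimg.com/mms/icon/v1/6/44/a27006acfbe8b6aa39bee49c6f004446_68_68.png']
The not(@data-original) guard makes the two branches mutually exclusive, so each img contributes exactly one URL.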

count images in a page using capybara

I want to count the images displayed on a page using Capybara; the HTML is shown below. I use the following code to get the total count, but it returns 0, even though the page has more than 100 images.
c = page.all('.thumbnail_select').count
puts c # returns 0
HTML
<a class="thumbnail thumbnail_img_wrap">
<img alt="" src="test.jpg">
<div class="thumbnail_select">
<div class="thumail_selet_backnd"></div>
<div class="thumbil_selt_text">Click to Select</div>
</div>
<p>ucks</p>
<span class="info_icon"><span class="info_icon_img"></span></span>
</a>
<a class="thumbnail thumbnail_img_wrap">
<img alt="" src="test1.jpg">
<div class="thumbnail_select">
<div class="thumail_selet_backnd"></div>
<div class="thumbil_selt_text">Click to Select</div>
</div>
<p>ucks</p>
<span class="info_icon"><span class="info_icon1_img"></span></span>
</a>
.........
.........
How can I count the total number of images?
You have a few options.
You can find all divs with class thumbnail_select using all("div[class='thumbnail_select']").count.
But this is an awkward way of doing it, since it looks for the divs and not the images.
A better way would be to look for all images using all("img").count, as long as no other images are present on the page.
If neither of these works, the problem might be that the page has not finished loading when you start looking for the images. Simply put a page.should have_content check before the image count to make sure the page is loaded.
