Scrapy XPath cannot find content in web page

I tried to crawl this site: https://www.kbb.com/compare-cars/results/?vehicleids=1 and get some info from this page.
I used the Scrapy shell to test my XPath:
scrapy shell "https://www.kbb.com/compare-cars/results/?vehicleids=1"
Then I tried the following XPath to select a div that definitely exists in the HTML, but Scrapy cannot fetch it; it just returns an empty list:
response.xpath('//div[@id="vehicleCardsContainer"]')
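The answers to the related questions below all point the same way: the container is almost certainly rendered client-side, so it never appears in the HTML that Scrapy receives. A quick way to confirm this from the Scrapy shell (a sketch, using the id from the question):

# Run inside: scrapy shell "https://www.kbb.com/compare-cars/results/?vehicleids=1"
# This searches the raw response body, not the browser's live DOM.
"vehicleCardsContainer" in response.text
# => False means the div is injected by JavaScript after page load,
#    so no XPath against this response can ever match it.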

Related

Web-scraping with Ruby Mechanize

I have tried scraping a web page with Ruby Mechanize, but it is not working. Basically the website has some products, and I need the links to the products.
I tested the code below, expecting it to print links to the products, but the output doesn't show anything.
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://shopee.com.br/Casa-e-Decora%C3%A7%C3%A3o-cat.11059983/')

# Class selectors need leading dots; `search` accepts CSS selectors.
page.search('.col-xs-2-4.shopee-search-item-result__item').each do |product|
  puts product.at_css('a')['href']
end
Part of the page you are trying to parse is rendered client-side, so the HTML Mechanize receives does not contain the links you are looking for.
Luckily for you, the website uses a JSON API, so it is fairly easy to extract the product information.
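For illustration, here is a minimal sketch of that approach (in Python; the same flow works from Ruby with net/http and JSON). The endpoint and parameters below are assumptions, not Shopee's documented API — copy the real request from the browser's network panel while the category page loads:

import requests

# Hypothetical search endpoint; the real path and parameters may differ.
url = "https://shopee.com.br/api/v4/search/search_items"
params = {"by": "relevancy", "match_id": 11059983, "limit": 50, "newest": 0}
data = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"}).json()

for item in data.get("items", []):
    info = item.get("item_basic", {})
    # Product URLs are assumed to follow the /<name>-i.<shopid>.<itemid> pattern.
    print(f"https://shopee.com.br/{info.get('name')}-i.{info.get('shopid')}.{info.get('itemid')}")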

IMPORTXML using XPath returns #N/A error with "Imported content is empty"

I am trying to scrape text using IMPORTXML / XPath, but it returns #N/A with the error "Imported content is empty".
I am trying to scrape "223 3563 ROSS DRIVE" at the top of the page. (In fact, I will eventually scrape all the information on the page.)
I have tried these two formulas in Google Sheets, with no success.
Please help me resolve this.
=index(IMPORTXML("https://bcres.paragonrels.com/publink/default.aspx?GUID=11a3e139-7499-4271-813a-fe2f70ffd304&Report=Yes","//div[@class='mls21']"),1,1)
=index(IMPORTXML("https://bcres.paragonrels.com/publink/default.aspx?GUID=11a3e139-7499-4271-813a-fe2f70ffd304&Report=Yes","//*[@id='divHtmlReport']/div/div[154]"),1,1)
The IMPORT functions can't extract content that loads via JavaScript. You can check this article: https://www.benlcollins.com/spreadsheets/import-social-media-statistics/#notWorking
You can confirm that the website is JavaScript-controlled by clicking the lock icon beside the browser's address bar, selecting 'Site settings', and setting JavaScript to 'Block'. Reload the page; if you no longer see the data you're trying to extract, you will not be able to get it with the IMPORT functions.
An alternative within Google Sheets would be to look for an add-on that can handle JavaScript-rendered content.
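Outside Sheets, the same limitation is easy to demonstrate: fetch the page without a JavaScript engine and search for the target text. A minimal sketch in Python, using the URL and address string from the question:

import requests

# Fetch the report page the way IMPORTXML would: no JavaScript execution.
url = ("https://bcres.paragonrels.com/publink/default.aspx"
       "?GUID=11a3e139-7499-4271-813a-fe2f70ffd304&Report=Yes")
html = requests.get(url, timeout=30).text

# If the text is absent from the raw HTML, it is rendered client-side and
# IMPORTXML will never see it.
print("223 3563 ROSS DRIVE" in html)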

Scrape an HTML page after AJAX calls for elements not in the page source

I'm trying to scrape a web page where the content I want loads after the DOM is complete; the new content is fetched through AJAX calls.
So the fetched content isn't available in the page source, although I can see it when inspecting the page in the browser.
When I fetch the page with cURL, the elements aren't there. What is the best method to get this content?
I'm trying to use PhantomJS for this, but I'm not sure if that can do it either.
Thanks.
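PhantomJS can render JavaScript-driven pages, though it is no longer maintained; today the usual route is a headless browser driven by Selenium. A minimal sketch in Python, with the URL and element id as placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # render without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/page-with-ajax")  # placeholder URL
    # Block until the AJAX-loaded element appears in the live DOM.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "ajax-content"))  # placeholder id
    )
    print(element.text)
finally:
    driver.quit()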

I don't understand why this XPath expression is not working as a Scrapy selector

I am just beginning to learn Scrapy, and I do not understand why the XPath described below is returning zero results.
I am trying to build a spider that crawls http://www.foodsafety.gov/recalls/recent/index.html
Specifically in my testing with the Scrapy shell I was trying to extract the headlines. Using the inspector in Safari's developer console I determined that the XPath for the headline text is //div[@id="recallList"]/h2/a/text(). Using find in the developer console I was able to locate 25 headlines with the above XPath.
However, when I use the Scrapy shell to test the XPath I get an empty list using
>> response.xpath('//div[@id="recallList"]/h2/a/text()').extract()
I am using
>> scrapy shell "http://www.foodsafety.gov/recalls/recent/index.html"
to crawl the site.
The response gives an empty result because the content is loaded through JavaScript, which Scrapy does not execute. If you look in the network panel of the developer console, you will see that another request is made to http://ajax.googleapis.com/ajax/services/feed/load?v=1.0&callback=jsonp1455174771252&q=http://www.fda.gov/AboutFDA/ContactFDA/StayInformed/RSSFeeds/FoodSafety/rss.xml&num=13, which returns JSON. You can use this URL to get all your data.
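A minimal sketch of that approach in Python: request the feed endpoint directly and walk the JSON. The Google Feed API this page relied on has since been shut down, so treat this as an illustration of the technique (find the XHR in the network panel and call it directly) rather than working code; dropping the jsonp callback parameter is assumed to return plain JSON:

import requests

# The AJAX endpoint spotted in the network panel, minus the jsonp callback.
url = "http://ajax.googleapis.com/ajax/services/feed/load"
params = {
    "v": "1.0",
    "q": "http://www.fda.gov/AboutFDA/ContactFDA/StayInformed/RSSFeeds/FoodSafety/rss.xml",
    "num": 13,
}
data = requests.get(url, params=params).json()

# The old Feed API wrapped results as responseData -> feed -> entries.
for entry in data["responseData"]["feed"]["entries"]:
    print(entry["title"])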

How to get the raw HTML source code for a page by using Ruby or Nokogiri?

I'm using Nokogiri (a Ruby XPath/CSS library) to extract content from web pages. I ran into problems with some pages, such as AJAX-driven ones: when I view the source code, I don't see the actual contents, such as <table> elements.
How can I get the HTML code for the actual content?
Don't use Nokogiri at all if you want the raw source of a web page; just fetch the page directly as a string. For example:
require 'open-uri'
html = URI.open('http://phrogz.net').read # URI.open is required on Ruby 3+
puts html.length #=> 8461
puts html #=> ...raw source of the page...
If, on the other hand, you want the post-JavaScript contents of a page (e.g. one where an AJAX library executes JavaScript to fetch new content and modify the page), then you can't use Nokogiri by itself. You need to use Ruby to control a web browser (read up on Selenium or Watir).
