How to scrape a web page with dynamic content added by JavaScript? - ruby

I am trying to scrape this webpage, which lazy-loads more content as you scroll. Using Nokogiri I am able to scrape the initial page, but not the rest of the page, which loads after scrolling.

To get the lazy-loaded content, scrape the following pages:
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=31&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=46&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=61&ajax=true
...
require 'rubygems'
require 'nokogiri'
require 'mechanize'
require 'open-uri'
number = 1
while true
  url = "http://www.flipkart.com/mens-footwear/shoes" +
        "/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&" +
        "sid=osp%2Ccil%2Cnit%2Ce1f&start=#{number}&ajax=true"
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  doc = Nokogiri::HTML(URI.open(url))
  # The product markup is carried as text inside the #ajax element, so parse it again.
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  products = doc.css(".browse-product")
  # Stop once a request returns no more products.
  break if products.size == 0
  products.each do |item|
    title = item.at_css(".fk-display-block,.title").text.strip
    price = (item.at_css(".pu-final").text || '').strip
    link = item.at_xpath(".//a[@class='fk-display-block']/@href")
    image = item.at_xpath(".//div/a/img/@src")
    puts number
    puts "#{title} - #{price}"
    puts "http://www.flipkart.com#{link}"
    puts image
    puts "========================"
    # number doubles as the start offset for the next request.
    number += 1
  end
end

Related

Web Crawling. Not able to crawl the data correctly

require 'open-uri'
require 'nokogiri'
url = "https://grofers.com/cn/grocery-staples/cid/16"
response = open(url).read
parsed_data = Nokogiri::HTML(response)
results = []
content = parsed_data.css('.section-right').css('.products products--grid').each do |row|
  title = row.css('a.product__wrapper').text
  puts title
  results << content
end

Nokogiri example not showing array (Ruby)

When I run this from the terminal I can parse and display the data, but when I type in pets_array = [] I don't see anything.
My code is as follows:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
  post_name = a.text
  pets_array.push(post_name)
end
CSV.open('pets.csv', 'w') do |csv|
  csv << pets_array
end
Pry.start(binding)
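To be more precise, you could select each anchor tag that has both the result-title and hdrlnk classes (.result-title.hdrlnk) inside .result-info, .result-row, .rows and .content. Written with a space, '.result-title hdrlnk' instead looks for an <hdrlnk> element inside .result-title, which matches nothing: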
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, you can assign its return value directly to pets_array; it already holds the text of each iterated element, so there is no need to push.
If you want to write the data stored in the array to a file, you can push the array directly; there is no need to redefine it as an empty array first (which is why you get a blank CSV file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }
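Note that `csv << pets_array` writes all titles into a single CSV row. If one title per row is preferred, a minimal sketch could look like the following. The `titles_to_csv` helper and the hard-coded `titles` array are made up for illustration and stand in for the scraped data; they are not part of the answer above.

```ruby
require 'csv'

# Hypothetical helper: turns a flat array of titles into CSV text,
# one title per row, preceded by a header line.
def titles_to_csv(titles)
  CSV.generate do |csv|
    csv << ['title']                # header row
    titles.each { |t| csv << [t] }  # each title becomes its own row
  end
end

# Stand-in data; with the scraper above this would be pets_array.
titles = ['Mini pig', 'Black Russian Terrier']
csv_text = titles_to_csv(titles)
puts csv_text
```

The resulting string can be saved with `File.write('pets.csv', csv_text)`.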

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse HTML web pages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function: url is a string containing a web address. The line mentioned above works exactly as intended when used outside of the function, but not inside it. When execution reaches that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do|link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'
#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do|link|
    puts link
    explore(link)
  end
end
def eval_page(page)
  puts page.title
end
#Start Crawling
explore(START_URL)
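require 'nokogiri'
require 'open-uri'
#Crawler Params
# Globals instead of constants: reassigning a constant inside a method
# is a syntax error ("dynamic constant assignment").
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  # Ruby has no ++ operator. "COUNTER++" followed by the next line is parsed
  # as COUNTER + (+currentPage = ...), which calls the unary plus (+@) on the
  # Nokogiri document and raises the NoMethodError shown above.
  $CRAWLED_PAGES_COUNTER+=1
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  currentPage = Nokogiri::HTML(URI.open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do|link|
    puts link
    explore(link)
  end
end
def eval_page(page)
  puts page.title
end
#Start Crawling
explore($START_URL)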
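The corrected program above keeps its state in global variables. As an alternative, the page counter could live in a small object that is passed to the crawl function. The following is only a sketch under assumptions: `CrawlBudget`, its method names, the extra `explore` parameters and the toy `graph` hash are all made up for illustration, and no network access takes place.

```ruby
# Hypothetical helper: tracks how many pages a crawler may still visit.
class CrawlBudget
  attr_reader :count

  def initialize(limit)
    @limit = limit
    @count = 0
  end

  # True once the limit has been reached.
  def exhausted?
    @count >= @limit
  end

  # Records one visited page. Ruby has no `++`, so `+= 1` is used.
  def spend!
    @count += 1
  end
end

# Stand-in for explore(url): follows links from a hash instead of
# fetching real pages, and stops when the budget is used up.
def explore(url, budget, links_for)
  return if budget.exhausted?

  budget.spend!
  links_for.fetch(url, []).each { |link| explore(link, budget, links_for) }
end

# Toy link graph (made up) in place of real pages.
graph = { 'a' => %w[b c], 'b' => %w[a c], 'c' => %w[a] }
budget = CrawlBudget.new(5)
explore('a', budget, graph)
puts budget.count
```

Because the limit check and the increment sit in one place, the recursion stops after exactly `limit` visits even though the toy graph contains cycles.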
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'
BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds
urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new
until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)
  puts "Scanning: #{this_uri}"
  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri
  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end
  visited_hosts << this_uri.host
  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }
  last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code for an industrial-strength spider is much more involved. Robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages either via HTTP timeouts or JavaScript redirects is a fun task, dealing with malformed pages is a challenge....
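The visited-page check described above compares URLs as written, so the same page reached with reordered query parameters or a different fragment counts as new. One way to reduce that, sketched here with only Ruby's standard `uri` library, is to normalize each URL before storing it in the visited set. The `normalize_url` helper is hypothetical and not part of the spider above; it also assumes parameter order never matters to the server, which is not true of every site.

```ruby
require 'uri'

# Hypothetical helper: returns a canonical string for a URL so that
# trivially different spellings of the same address compare equal.
def normalize_url(url)
  uri = URI.parse(url).normalize  # lowercases scheme and host
  uri.fragment = nil              # "#section" never reaches the server
  if uri.query
    # Sort the key/value pairs so parameter order no longer matters.
    pairs = URI.decode_www_form(uri.query).sort
    uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  end
  uri.to_s
end

a = normalize_url('http://EXAMPLE.com/a?b=2&a=1#frag')
b = normalize_url('http://example.com/a?a=1&b=2')
puts a
puts a == b
```

In the spider, `visited_urls << normalize_url(this_uri.to_s)` would then replace storing the raw URI. Host aliases, which the answer also mentions, are not handled by this sketch.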

How to use three url to make one url array. Use the same url for nokogiri

I might be crazy, but I have been trying to gather all my favorite news sites and scrape them in one Ruby file. I would like to scrape headlines from these sites and hopefully create a custom page for my site. So far I have been able to scrape the headlines from all three sites individually. I would like to put all three URLs into an array and call Nokogiri just once. Can anyone help me?
require 'nokogiri'
require 'open-uri'
url = 'http://www.engadget.com'
data = Nokogiri::HTML(open(url))
@feeds = data.css('.post')
@feeds.each do |feed|
  puts feed.css('.headline').text.strip
end
url2 = 'http://www.modmyi.com'
data2 = Nokogiri::HTML(open(url2))
@modmyi = data2.css('.title')
@modmyi.each do |mmi|
  puts mmi.css('span').text
end
url3 = 'http://www.cnn.com/specials/last-50-stories'
data3 = Nokogiri::HTML(open(url3))
@cnn = data3.css('.cd__content')
@cnn.each do |cn|
  puts cn.css('.cd__headline').text
end
You might want to extract the loading of the document and the extraction of the titles into its own class:
require 'nokogiri'
require 'open-uri'
class TitleLoader < Struct.new(:url, :outer_css, :inner_css)
  # Returns the stripped title text of every post on the page.
  def titles
    load_posts.map { |post| extract_title(post) }
  end
  private
  def read_document
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
    Nokogiri::HTML(URI.open(url))
  end
  def load_posts
    read_document.css(outer_css)
  end
  def extract_title(post)
    post.css(inner_css).text.strip
  end
end
And then use that class like this:
urls = [
  ['http://www.engadget.com', '.post', '.headline'],
  ['http://www.modmyi.com', '.title', 'span'],
  ['http://www.cnn.com/specials/last-50-stories', '.cd__content', '.cd__headline']
]
urls.map { |args| TitleLoader.new(*args).titles }.flatten

Iterate through pages nokogiri get link address

I'm trying to get images or image addresses from the website below. It works for the one website that I put below: "http://www.1stsourceservall.com/Category/Accessories". However--once it's finished with the page--I want it to then click on the next page link and cycle through all 20+ pages. How would I do that?
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.1stsourceservall.com/Category/Accessories"
while (url) do
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  doc = Nokogiri::HTML(URI.open(url))
  puts doc.css(".productImageMed")
  # Assumes the first pagination link points to the next page.
  link = doc.at_css('.pagination a')
  url = link && link['href'] #=> url is nil if no link is found on the page
end
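A loop like the one above can run forever if a pagination link points back to a page that was already fetched, and it breaks if the `href` is relative. The sketch below separates the page-walking logic from the fetching so it can be tried without network access. The `crawl_pages` helper, its `max_pages` parameter and the `fake_site` hash are assumptions made for illustration; in real use the block would fetch the page with Nokogiri and return the `href` of the next-page link.

```ruby
require 'set'
require 'uri'

# Hypothetical helper: walks pages starting at start_url. The block receives
# the current URL and must return the href of the next page (absolute or
# relative), or nil when there is none. Returns the URLs visited, in order.
def crawl_pages(start_url, max_pages: 50)
  visited = Set.new
  url = start_url
  while url && !visited.include?(url) && visited.size < max_pages
    visited << url
    next_href = yield(url)
    # Resolve relative links such as "/list?page=2" against the current URL.
    url = next_href && URI.join(url, next_href).to_s
  end
  visited.to_a
end

# Made-up site map standing in for real pages and their "next" links.
fake_site = {
  'http://example.com/list'        => '/list?page=2',
  'http://example.com/list?page=2' => '/list?page=3',
  'http://example.com/list?page=3' => nil
}
pages = crawl_pages('http://example.com/list') { |u| fake_site[u] }
puts pages
```

The visited set stops the loop on cycles, and `max_pages` is a second safety net for sites whose pagination never ends.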

Resources