Web Crawling. Not able to crawl the data correctly - ruby

require 'open-uri'
require 'nokogiri'
url = "https://grofers.com/cn/grocery-staples/cid/16"
response = open(url).read
parsed_data = Nokogiri::HTML(response)
results = []
content = parsed_data.css('.section-right').css('.products products--grid').each do |row|
title = row.css('a.product__wrapper').text
puts title
results << content
end
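Two problems stand out in this snippet: '.products products--grid' is not a valid class selector (it looks for a <products--grid> element inside an element with class products, rather than two classes on one node), and results << content pushes the whole node set instead of each title. A minimal corrected sketch, assuming the intended selector is .products.products--grid; note that if the product list is rendered client-side by JavaScript, open-uri will never see it and a headless browser would be needed instead:
require 'open-uri'
require 'nokogiri'

url = "https://grofers.com/cn/grocery-staples/cid/16"
# URI.open for Ruby 3+; plain open(url) still works on older Rubies
parsed_data = Nokogiri::HTML(URI.open(url))

# Match both classes on the same node and collect the titles themselves.
results = parsed_data.css('.section-right .products.products--grid').map do |row|
  row.css('a.product__wrapper').text.strip
end

puts results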

Related

Nokogiri example not showing array (Ruby)

When I try to run this via the terminal I can parse/display the data, but when I type in pets_array = [] I am not seeing anything.
My code is as follows:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
post_name = a.text
pets_array.push(post_name)
end
CSV.open('pets.csv', 'w') do |csv|
csv << pets_array
end
Pry.start(binding)
To be precise, you could access each anchor tag with the classes .result-title.hdrlnk nested inside .result-info, .result-row, .rows and .content:
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, you can assign its result straight to the pets_array variable; there is no need to push each element.
If you then want to write the data stored in the array, you can push it into the CSV directly; there is no need to redefine pets_array as an empty array (which is why you are getting a blank CSV file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }
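Note that csv << pets_array writes all the titles into a single CSV row. If one title per row is what you actually want (an assumption on my part), push each title as its own row:
CSV.open('pets.csv', 'w') do |csv|
  pets_array.each { |title| csv << [title] } # one-element row per title
end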

Using Ruby's Anemone Gem to Scrape All Email Addresses From a Site

I am trying to scrape all the email addresses on a given site using a single-file Ruby script. At the bottom of the file I have a hardcoded test case using a URL that has an email address listed on that specific page (so it should find an email address on the first iteration of the first loop).
For some reason, my regex does not seem to be matching:
#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'
class GetEmails
def initialize
@urlCounter, @anemoneCounter = 0
$allUrls, $emailUrls, $emails = []
end
def has_email?(listingUrl)
hasListing = false
Anemone.crawl(listingUrl) do |anemone|
anemone.on_every_page do |page|
body_text = page.body.to_s
matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
if matchOrNil != nil
$emailUrls[$anemoneCounter] = listingUrl
$emails[$anemoneCounter] = body_text.match
$anemoneCounter += 1
hasListing = true
else
end
end
end
return hasListing
end
end
emailGrab = GetEmails.new()
emailGrab.has_email?("http://genuinestoragesheds.com/contact/")
puts $emails[0]
So here is the working version of the code. It uses a single regex to find a string containing an email and three more to clean it.
#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'
class GetEmails
def initialize
@urlCounter = 0
$anemoneCounter = 0
$allUrls = []
$emailUrls = []
$emails = []
end
def email_clean(email)
email = email.gsub(/(\w+=)/,"")
email = email.gsub(/(\w+:)/, "")
email = email.gsub(/\A"|"\Z/, '') # gsub, not gsub!, which returns nil when nothing matches
return email
end
def has_email?(listingUrl)
hasListing = false
Anemone.crawl(listingUrl) do |anemone|
anemone.on_every_page do |page|
body_text = page.body.to_s
# matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
matchOrNil = body_text.match(/[^@\s]+@[^@\s]+/)
if matchOrNil != nil
$emailUrls[$anemoneCounter] = listingUrl
$emails[$anemoneCounter] = matchOrNil
$anemoneCounter += 1
hasListing = true
else
end
end
end
return hasListing
end
end
emailGrab = GetEmails.new()
found_email = "href=\"mailto:genuinestoragesheds@gmail.com\""
puts emailGrab.email_clean(found_email)
\A and \z in your match are the beginning and end of the string respectively. Obviously that webpage contains more than just an email string, or you wouldn't need the regex test at all.
You can simplify it to just /[^@\s]+@[^@\s]+/, but you would still need to clean up the string to extract the email.
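That loose pattern matches surrounding junk like href="mailto:..." wholesale, which is why the cleanup passes above are needed. A tighter pattern (a common approximation, not full RFC 5322 coverage) can pull bare addresses out directly with String#scan:
body_text = 'href="mailto:genuinestoragesheds@gmail.com" or sales@example.com'
emails = body_text.scan(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/)
p emails
# => ["genuinestoragesheds@gmail.com", "sales@example.com"]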

How to use three url to make one url array. Use the same url for nokogiri

I might be crazy, but I have been trying to gather all my favorite news sites and scrape them in one Ruby file. I would like to use these sites to scrape headlines and hopefully create a custom page for my site. So far I have been able to scrape the headlines from all three sites individually. I am looking to put all three URLs into an array and use Nokogiri just once. Can anyone help me?
require 'nokogiri'
require 'open-uri'
url = 'http://www.engadget.com'
data = Nokogiri::HTML(open(url))
@feeds = data.css('.post')
@feeds.each do |feed|
puts feed.css('.headline').text.strip
end
url2 = 'http://www.modmyi.com'
data2 = Nokogiri::HTML(open(url2))
@modmyi = data2.css('.title')
@modmyi.each do |mmi|
puts mmi.css('span').text
end
url3 = 'http://www.cnn.com/specials/last-50-stories'
data3 = Nokogiri::HTML(open(url3))
@cnn = data3.css('.cd__content')
@cnn.each do |cn|
puts cn.css('.cd__headline').text
end
You might want to extract the loading of the document and the extraction of the titles into a class of its own:
require 'nokogiri'
require 'open-uri'
class TitleLoader < Struct.new(:url, :outer_css, :inner_css)
def titles
load_posts.map { |post| extract_title(post) }
end
private
def read_document
Nokogiri::HTML(open(url))
end
def load_posts
read_document.css(outer_css)
end
def extract_title(post)
post.css(inner_css).text.strip
end
end
And then use that class like this:
urls = [
['http://www.engadget.com', '.post', '.headline'],
['http://www.modmyi.com', '.title', 'span'],
['http://www.cnn.com/specials/last-50-stories', '.cd__content', '.cd__headline']
]
urls.map { |args| TitleLoader.new(*args).titles }.flatten
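That returns one flattened array of titles. If you would rather keep each site's headlines grouped under its URL (a small extension, not part of the original answer), you could build a hash instead:
# The block auto-destructures each [url, outer_css, inner_css] triple.
results = urls.map { |url, outer, inner| [url, TitleLoader.new(url, outer, inner).titles] }.to_h
results.each do |url, titles|
  puts url
  titles.each { |title| puts "  #{title}" }
end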

How do I parse XML nodes from an API request?

How do I save the information from an XML page that I got from an API?
The URL is "http://api.url.com?number=8-6785503" and it returns:
<OperatorDataContract xmlns="http://psgi.pts.se/PTS_Number_Service" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>Tele2 Sverige AB</Name>
<Number>8-6785503</Number>
</OperatorDataContract>
How do I parse out the Name and Number nodes and write them to a file?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://api.url.com?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exporterad.txt", "w") do |file|
doc.xpath("//*").each do |item|
title = item.xpath('//result[group_name="Name"]')
phone = item.xpath("/Number").text.strip
puts "#{title} ; \n"
puts "#{phone} ; \n"
company = " #{title}; #{phone}; \n\n"
file.write(company.gsub(/^\s+/,''))
end
end
Besides the fact that your code isn't valid Ruby, you're making it a lot harder than necessary, at least for a simple scrape and save:
require 'nokogiri'
require 'open-uri'
url = "http://api.pts.se/PTSNumberService/Pts_Number_Service.svc/pox/SearchByNumber?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exported.txt", "w") do |file|
name = doc.at('Name').text
number = doc.at('Number').text
file.puts name
file.puts number
end
Running that results in a file called "exported.txt" that contains:
Tele2 Sverige AB
8-6785503
You can build upon that as necessary.
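One caveat worth knowing: the response declares a default namespace (xmlns="http://psgi.pts.se/PTS_Number_Service"), and Nokogiri selectors can fail to match namespaced nodes depending on how they are written. If at('Name') ever comes back nil, stripping the namespaces first is a common workaround:
doc = Nokogiri::XML(URI.open(url))
doc.remove_namespaces! # plain selectors like 'Name' now match regardless of namespace
name   = doc.at('Name').text
number = doc.at('Number').text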

How to scrape a web page with dynamic content added by JavaScript?

I am trying to scrape this webpage. It lazy-loads: as you scroll, more content is fetched. Using Nokogiri I am able to scrape the initial page, but not the rest of the page, which loads after scrolling.
To get the lazy-loaded content, scrape the pages the scroll handler requests:
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=31&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=46&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=61&ajax=true
...
require 'rubygems'
require 'nokogiri'
require 'mechanize'
require 'open-uri'
number = 1
while true
url = "http://www.flipkart.com/mens-footwear/shoes" +
"/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&" +
"sid=osp%2Ccil%2Cnit%2Ce1f&start=#{number}&ajax=true"
doc = Nokogiri::HTML(open(url))
doc = Nokogiri::HTML(doc.at_css('#ajax').text)
products = doc.css(".browse-product")
break if products.size == 0
products.each do |item|
title = item.at_css(".fk-display-block,.title").text.strip
price = item.at_css(".pu-final")&.text.to_s.strip # &. guards against a missing price node
link = item.at_xpath(".//a[@class='fk-display-block']/@href")
image = item.at_xpath(".//div/a/img/@src")
puts number
puts "#{title} - #{price}"
puts "http://www.flipkart.com#{link}"
puts image
puts "========================"
number += 1
end
end
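Note that the loop reuses the per-product counter number as the start offset; that only lines up with the 31, 46, 61 sequence above because each AJAX page happens to return 15 products. A sketch that makes the paging explicit, stepping start by the page size (page_size = 15 is an assumption read off those URLs):
require 'open-uri'
require 'nokogiri'

start = 1
page_size = 15
loop do
  url = "http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr" +
        "?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=#{start}&ajax=true"
  # The AJAX response wraps escaped HTML in a node with id "ajax", as above.
  doc = Nokogiri::HTML(Nokogiri::HTML(URI.open(url)).at_css('#ajax').text)
  products = doc.css(".browse-product")
  break if products.empty?
  # ... extract title, price, link and image exactly as above ...
  start += page_size
end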
