Hey y'all, I have been trying to learn Scrapy, and am working on my first project right now. I have written this code to try to scrape NFL player news from http://www.rotoworld.com/playernews/nfl/football/?rw=1. I tried to set up a loop to get each container from the site, but when I run the code it isn't scraping anything. The code runs fine, and even puts out a CSV file when I ask it to; it just isn't scraping what I think I am telling it to scrape. Any help would be great! Thanks
import scrapy
from Roto_Player_News.items import NFLNews


class Roto_News_Spider2(scrapy.Spider):
    name = "PlayerNews2"
    allowed_domains = ["rotoworld.com"]
    start_urls = ('http://www.rotoworld.com/playernews/nfl/football/',)

    def parse(self, response):
        containers = response.xpath('//*[@id="cp1_pnlNews"]/div/div[2]')
        for container in containers:
            item = NFLNews()
            item['player'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="report"]/text()').extract()
            item['headline'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="report"]/p/text()').extract()
            item['info'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="impact"]/text()').extract()
            item['date'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="info"]/div[@class="date"]/text()').extract()
            item['source'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="info"]/div[@class="source"]/a/text()').extract()
            yield item
Your defined XPaths do not look right. Try this instead; it should fetch the content you wish to scrape. Just copy and paste.
import scrapy


class Roto_News_Spider2(scrapy.Spider):
    name = "PlayerNews2"
    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first()
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Report": report, "Date": date, "Impact": impact, "Source": source}
I wrote a simple web crawler using Mechanize, and now I'm stuck on how to get the next page recursively; below is the code.
def self.generate_page # generate a Mechanize page object, the first page
  agent = Mechanize.new
  url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
  page = agent.get(url)
  page
end
def self.next_page(n_page) # get the next page recursively by clicking the "next" link shown on each page
  puts n_page
  # if I don't use puts, I get nothing; when using puts, I get
  #<Mechanize::Page:0x007fd341c70fd0>
  #<Mechanize::Page:0x007fd342f2ce08>
  #<Mechanize::Page:0x007fd341d0cf70>
  #<Mechanize::Page:0x007fd3424ff5c0>
  #<Mechanize::Page:0x007fd341e1f660>
  #<Mechanize::Page:0x007fd3425ec618>
  #<Mechanize::Page:0x007fd3433f3e28>
  #<Mechanize::Page:0x007fd3433a2410>
  #<Mechanize::Page:0x007fd342446ca0>
  #<Mechanize::Page:0x007fd343462490>
  #<Mechanize::Page:0x007fd341c2fe18>
  #<Mechanize::Page:0x007fd342d18040>
  #<Mechanize::Page:0x007fd3432c76a8>
  # which are the results I want
  np = Mechanize.new.click(n_page.link_with(:text => /next/)) unless n_page.link_with(:text => /next/).nil?
  result = next_page(np) unless np.nil?
  result # here the value is empty, I don't know what is wrong
end
def self.get_page # trying to pass the result of the next_page method
  puts next_page(generate_page)
  # it seems the result is never passed here
end
I followed these two links, What is recursion and how does it work? and Ruby recursive function, but I still can't figure out what's wrong. Hope someone can help me out. Thanks
There are a few issues with your code:
You shouldn't be calling Mechanize.new more than once; each call creates a fresh agent that throws away the previous one's state.
From a stylistic perspective, you are doing too many nil checks.
Unless you have a preference for recursion, it will probably be easier to do this iteratively.
To have your next_page method return an array containing every link page in the chain, you could write this:
# store a single Mechanize agent in a constant so it is created only once
Agent = Mechanize.new

# a helper method to DRY up the code; returns nil when there is no "next" link
def click_to_next_page(page)
  link = page.link_with(:text => /next/)
  link && Agent.click(link)
end

# repeatedly visits the next page until none exists
# returns all seen pages as an array
def get_all_next_pages(n_page)
  results = []
  np = click_to_next_page(n_page)
  while np
    results.push(np)
    np = click_to_next_page(np)
  end
  results
end
# testing it out (i'm not actually running this)
base_url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
root_page = Agent.get(base_url)
next_pages = get_all_next_pages(root_page)
puts next_pages
I am trying to make a web crawler which finds links on a homepage and then visits the found links again and again. Now I have written code with a parser which shows me the found links and prints the statistics of some tags of the homepage, but I don't get how to visit the new links in a loop and print their statistics too.
@visit = {}
@src = Net::HTTP.start(@url.host, @url.port) do |http|
  http.get(@url.path)
end
@content = @src.body
def govisit
  if @content =~ @commentTag
  end

  cnt = @content.scan(@aTag)
  cnt.each do |link|
    @visit[link] = []
  end

  puts "Links on this site: "
  @visit.each do |links|
    puts links
  end

  if @visit.size >= 500
    exit 0
  end

  printStatistics
end
First of all, you need a function that accepts a link and returns the body output. Then parse all the links out of the body and keep them in a list. Check that list for links you haven't visited yet, remove the visited ones from the new-links list, and call the same function again, repeating the whole process. To stop the crawler at a certain point, you need to build a condition into the while loop.
Based on your code:
require 'net/http'
require 'uri'

@visited_links = []
@new_links = []

def get_body(link)
  @visited_links << link
  uri = URI(link)
  @src = Net::HTTP.start(uri.host, uri.port) { |http| http.get(uri.request_uri) }
  @src.body
end

def get_links(body)
  # parse the links out of the body (a naive href scan; an HTML parser would be more robust)
  # and queue any link we have not visited or queued yet
  body.scan(/href="(http[^"]+)"/).flatten.each do |link|
    @new_links << link unless @visited_links.include?(link) || @new_links.include?(link)
  end
end

start_link_body = get_body("http://www.test.com")
get_links(start_link_body)

while @visited_links.size < 500 && !@new_links.empty?
  body = get_body(@new_links.shift)
  get_links(body)
end
I would like to scrape search results from http://maxdelivery.com, but unfortunately they are using POST instead of GET for their search form. I found this description of how to use Nokogiri and RestClient to fake a POST form submission, but it's not returning any results for me: http://ruby.bastardsbook.com/chapters/web-crawling/
I've worked with Nokogiri before, but not for the results of a POST form submission.
Here's my code right now, only slightly modified from the example at the link above:
require 'rest-client'
require 'nokogiri'

class MaxDeliverySearch
  REQUEST_URL = "http://www.maxdelivery.com/nkz/exec/Search/Display"

  def initialize(search_term)
    @term = search_term
  end

  def search
    if page = RestClient.post(REQUEST_URL, {
        'searchCategory' => '*',
        'searchString' => @term,
        'x' => '0',
        'y' => '0'
      })
      puts "Success finding search term: #{@term}"
      File.open("temp/Display-#{@term}.html", 'w') { |f| f.write page.body }
      npage = Nokogiri::HTML(page)
      rows = npage.css('table tr')
      puts "#{rows.length} rows"
      rows.each do |row|
        puts row.css('td').map { |td| td.text }.join(', ')
      end
    end
  end
end
Now (ignoring the formatting stuff), I would expect if page = RestClient.post(REQUEST_URL, {...}) to return some search results if passed a 'good' search term, but each time I just get the search results page back with no actual results, as if I had pasted the URL into the browser.
Anyone have any idea what I'm missing? Or, just how to get back the results I'm looking for with another gem?
With the class above, I would like to be able to do:
s = MaxDeliverySearch.new("ham")
s.search #=> big block of search results objects to traverse
Mechanize is what you should use to automate a web search form. This should get you started:
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://maxdelivery.com')

form = page.form('SearchForm')
form.searchString = "ham"
page = agent.submit(form)

page.search("div.searchResultItem").each do |item|
  puts item.search(".searchName i").text.strip
end
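One note on why this tends to work where the raw RestClient.post attempt did not: Mechanize fetches the search page first and submits the real form, so any hidden inputs and session cookies the site expects get carried along automatically.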
I am trying to parse the URL shown in the doc variable below. My issue is with the job variable. When I return it, it returns every job title on the page instead of the specific job title for the given review. Does anyone have advice on how to return the specific job title I'm referring to?
require 'nokogiri'
require 'open-uri'

# Fetch the Glassdoor reviews page
doc = Nokogiri::HTML(open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))

reviews = []
current_review = Hash.new

doc.css('.employerReview').each do |item|
  pro = item.parent.css('p:nth-child(1) .notranslate').text
  con = item.parent.css('p:nth-child(2) .notranslate').text
  job = item.parent.css('.review-microdata-heading .i-occ').text
  puts job
  advice = item.parent.css('p:nth-child(3) .notranslate').text
  current_review = {'pro' => pro, 'con' => con, 'advice' => advice}
  reviews << current_review
end
Looks like item.parent is #MainCol in each case, in other words the entire column.
Changing item.parent.css to item.css should solve your problem.
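For clarity, here is a sketch of the loop with that one change applied (the job title is also stored in the hash, since that is the field you are after; everything else is unchanged from your code):

doc.css('.employerReview').each do |item|
  # every selector is now scoped to this single review block
  pro    = item.css('p:nth-child(1) .notranslate').text
  con    = item.css('p:nth-child(2) .notranslate').text
  job    = item.css('.review-microdata-heading .i-occ').text
  advice = item.css('p:nth-child(3) .notranslate').text
  reviews << {'pro' => pro, 'con' => con, 'job' => job, 'advice' => advice}
end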
This is part of a Ruby script. I want to save the results to a text file. I only want the results specified in these two DIVs.
url = browser.html
doc = Nokogiri::HTML(open(url))
price = doc.css("#sectionPrice").text
ship = doc.css("#shippingCharges td").text
How do I save the scraped results? Mind you, the script loading the page is working correctly. In the shell I can see the values of my scrape using XPath as follows.
page_html = Nokogiri::HTML.parse(browser.html)
shipping = puts page_html.xpath(".//*[@id='shippingCharges']").inner_text
price = puts page_html.xpath(".//*[@id='sectionPrice']").inner_text
How do I save this data to a CSV or XML?
// Side question: Is the data returned in the shell saved anywhere? How do I access it outside of the shell?
url = browser.html
doc = Nokogiri::HTML(open(url))
price = doc.css("#sectionPrice").text
ship = doc.css("#shippingCharges td").text

CSV.open("/users/fabio/desktop/ruby/gp.csv", "wb") do |csv|
  csv << [price, ship]
end

It's not creating the CSV file, and nothing is appearing in the directory. What gives?
It is pretty simple to write this to a CSV file. Just add the following:
require 'csv'

CSV.open("file.csv", "wb") do |csv|
  csv << [price, ship]
end
If shipping and price are arrays, you will want to iterate through them, but this is how you create a CSV; see the sketch below.
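For example, if price and ship each held several values, a sketch like the following (assuming price[i] belongs with ship[i]) would write one row per pair:

require 'csv'

CSV.open("file.csv", "wb") do |csv|
  csv << ["price", "shipping"]    # header row
  price.zip(ship).each do |p, s|  # pair each price with its shipping charge
    csv << [p, s]
  end
end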
Hope this gets you on your way.
Cheers!