Why is only the first link fetched? - ruby

I'm trying to fetch news from Hacker News and write a link's title and URL to an HTML file. However, only the first link is getting written and others are not. What am I doing wrong?
require 'httparty'

def fetch(source)
  response = HTTParty.get(source)
  response["items"].each do |item|
    return '<a href="' + item["url"] + '">' + item["title"] + '</a>'
  end
end

links = fetch('http://api.ihackernews.com/page')

File.open("/tmp/news.html", "w") do |f|
  f.puts links
end

You shouldn't use the return keyword in this case. It ends the method prematurely, so only the first link is returned. Use this instead:
require 'httparty'

def fetch(source)
  response = HTTParty.get(source)
  # convert the response['items'] array to an array of strings
  response["items"].map do |item|
    '<a href="' + item["url"] + '">' + item["title"] + '</a>'
  end
end
links = fetch('http://api.ihackernews.com/page')
links.length # => 30
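To see the difference in isolation, here is a tiny illustration of mine (not from the original answer): return inside a block exits the enclosing method on the first element, while map collects one result per element.

# return leaves the whole method as soon as it runs, so only the first
# element is ever processed; map builds and returns an array instead.
def first_only(items)
  items.each { |item| return item.upcase }
end

def all_of_them(items)
  items.map { |item| item.upcase }
end

first_only(%w[foo bar baz])  # => "FOO"
all_of_them(%w[foo bar baz]) # => ["FOO", "BAR", "BAZ"]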

Related

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse HTML web pages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the function's inputs; url is a string containing a web address. The line I previously mentioned works exactly as intended outside of the function, but not inside it. When it reaches that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function containing the problematic line is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'

# Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore(START_URL)
Ruby has no ++ operator. CRAWLED_PAGES_COUNTER++ followed by the next line is parsed as CRAWLED_PAGES_COUNTER + +(...), so Ruby tries to call the unary +@ method on the Nokogiri document produced by the next line, which is exactly the NoMethodError you're seeing. On top of that, ALL_CAPS names are constants, which Ruby warns about reassigning, so switch to global variables and increment with += 1:

require 'nokogiri'
require 'open-uri'

# Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER += 1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore($START_URL)
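To make the parse behavior concrete, here is a minimal reproduction of mine (not part of the original answer); Object.new just stands in for the Nokogiri document:

# "counter++" followed by a newline is parsed as counter + (+next_expression),
# so Ruby calls the unary +@ method on whatever the next line evaluates to.
counter = 0
doc = Object.new   # stand-in for the Nokogiri::HTML::Document

counter++
doc
# raises NoMethodError: undefined method `+@' for the Object instance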
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"

  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end

  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks whether a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries, or queries that are not in a consistent order (see the normalization sketch below).
Checks whether a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks whether a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code to write an industrial-strength spider is much more involved: robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages, whether via HTTP timeouts or JavaScript redirects, is a fun task, and dealing with malformed pages is a challenge...
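A minimal sketch (my addition, not part of the original answer) of one way to make the visited-URL check less naive: sort query parameters and drop fragments before comparing. The normalize_url helper name is made up for illustration.

require 'uri'

# Hypothetical helper: normalize a URL so that query-parameter order and
# "#fragment" anchors don't make the same page look like two different URLs.
def normalize_url(url)
  uri = URI.parse(url)
  if uri.query
    # Sort the key/value pairs so ?b=2&a=1 and ?a=1&b=2 compare as equal.
    uri.query = URI.encode_www_form(URI.decode_www_form(uri.query).sort)
  end
  uri.fragment = nil
  uri.to_s
end

normalize_url('http://example.com/page?b=2&a=1#top')
# => "http://example.com/page?a=1&b=2"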

Finding all links from ten URLs while reading a file

How can I extract the href attribute of every <a> tag on a page, while reading the page URLs in from a file?
If I have a text file that contains the target URLs:
http://mypage.com/1.html
http://mypage.com/2.html
http://mypage.com/3.html
http://mypage.com/4.html
Here's the code I have:
File.open("myfile.txt", "r") do |f|
f.each_line do |line|
# set the page_url to the current line
page = Nokogiri::HTML(open(line))
links = page.css("a")
puts links[0]["href"]
end
end
I'd flip it around. I would first parse the text file and load each line into memory (assuming it's a small enough data set). Then create one instance of Nokogiri for your HTML doc and extract all the href attributes (as you are doing).
Something like this untested code:
links = []
hrefs = []

File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    links << line
  end
end

# html is assumed to already hold the fetched HTML document
page = Nokogiri::HTML(html)
page.css("a").each do |tag|
  hrefs << tag['href']
end

links.each do |link|
  if hrefs.include?(link)
    puts "its here"
  end
end
If all I wanted to do was output the 'href' for each <a>, I'd write something like:
File.foreach('myfile.txt') do |url|
  page = Nokogiri::HTML(open(url))
  puts page.search('a').map{ |link| link['href'] }
end
Of course, <a> tags don't have to have an 'href', but puts won't care (see the selector tweak below if you do care).
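If you did want to skip anchors that lack an href, a CSS attribute selector handles it; a one-line tweak of mine, not from the original answer:

# Only select <a> elements that actually carry an href attribute.
puts page.search('a[href]').map { |link| link['href'] }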

How to post (HTTP POST) the content of a PDF using Ruby?

I am trying to post the (raw) content of a PDF in Ruby using the following block:
require 'pdf/reader'
require 'curb'

reader = PDF::Reader.new('folder/file.pdf')
raw_string = ''
reader.pages.each do |page|
  raw_string = raw_string + page.raw_content.to_s
end

c = Curl::Easy.new('http://0.0.0.0:4567/pdf_upload')
c.http_post(Curl::PostField.content('param1', 'value1'), Curl::PostField.content('param2', 'value2'), c.http_post(Curl::PostField.content('body', raw_string)))
Inside the API implementation, params[:body] seems to be empty all the time (though puts raw_string confirms that the variable has all the values).
Also, is there a better way to post PDF content?
Regarding how you're building raw_string...
Instead of:
reader.pages.each do |page|
  raw_string = raw_string + page.raw_content.to_s
end
You should be able to do something like one of these:
raw_string = reader.pages.map(&:raw_content).join
raw_string = reader.pages.map{ |p| p.raw_content.to_s }.join
I'd also recommend you write your last line spread across several lines, for clarity and readability:
c.http_post(
  Curl::PostField.content('param1', 'value1'),
  Curl::PostField.content('param2', 'value2'),
  c.http_post(Curl::PostField.content('body', raw_string))
)
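One observation of mine, not from the original answer: the third argument nests a second c.http_post call, so a separate POST containing only the body field fires first, and its return value rather than the PDF content is what ends up passed to the outer request. That would explain why params[:body] is empty. A flat call is probably what was intended:

# All three fields go out in a single POST request.
c.http_post(
  Curl::PostField.content('param1', 'value1'),
  Curl::PostField.content('param2', 'value2'),
  Curl::PostField.content('body', raw_string)
)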

Nokogiri and XPath: saving text result of scrape

I would like to save the text results of a scrape in a file. This is my current code:
require "rubygems"
require "open-uri"
require "nokogiri"
class Scrapper
attr_accessor :html, :single
def initialize(url)
download = open(url)
#page = Nokogiri::HTML(download)
#html = #page.xpath('//div[#class = "quoteText"andfollowing-sibling::div[1][#class = "quoteFooter" and .//a[#href and normalize-space() = "hard-work"]]]')
end
def get_quotes
#quotes_array = #html.collect {|node| node.text.strip}
#single = #quotes_array.each do |quote|
quote.gsub(/\s{2,}/, " ")
end
end
end
I know that I can write a file like this:
File.open('text.txt', 'w') do |fo|
  fo.write(content)
end
but I don't know how to incorporate @single, which holds the results of my scrape. The ultimate goal is to insert the information into a database.
I have come across some folks using YAML, but I am finding it hard to follow the step-by-step guides.
Can anyone point me in the right direction?
Thank you.
Just use:
@single = @quotes_array.map do |quote|
  quote.squeeze(' ')
end

File.open('text.txt', 'w') do |fo|
  fo.puts @single
end
Or:
File.open('text.txt', 'w') do |fo|
  fo.puts @quotes_array.map{ |q| q.squeeze(' ') }
end
and don't bother creating @single.
Or:
File.open('text.txt', 'w') do |fo|
  fo.puts @html.collect { |node| node.text.strip.squeeze(' ') }
end
and don't bother creating @single or @quotes_array.
squeeze is part of the String class. This is from the documentation:
" now is the".squeeze(" ") #=> " now is the"

Nest output of recursive function

I have written this piece of code, which outputs a list of job descriptions (in Danish). It works fine; however, I would like to alter the output a bit. The function is recursive because the jobs are nested, but the output does not show the nesting.
How do I configure the function to show an output like this:
Job 1
- Job 1.1
- Job 1.2
-- Job 1.2.1
And so on...
require 'nokogiri'
require 'open-uri'

def crawl(url)
  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each do |txt|
      puts txt.content
    end
    listitem.css('a[href]').each do |link|
      crawl(link['href'])
    end
  end
end

crawl('/Job.aspx')
One way is to pass the current nesting level down as an extra parameter and use it to indent each line:

require 'nokogiri'
require 'open-uri'

def crawl(url, nesting_level = 0)
  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each do |txt|
      puts " " * nesting_level + txt.content
    end
    listitem.css('a[href]').each do |link|
      crawl(link['href'], nesting_level + 1)
    end
  end
end

crawl('/Job.aspx')
I see two options:
Pass an additional argument to the recursive function to indicate the level you are currently on. Initialize the value to 0 and increment it each time you call the function. Something like this:
def crawl(url, level)
  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each do |txt|
      puts txt.content
    end
    listitem.css('a[href]').each do |link|
      crawl(link['href'], level + 1)
    end
  end
end
Make use of the caller array that holds the call stack. Use the size of this array to indicate the recursion level you are at.
def try
  puts caller.inspect
end

try
I would personally stick to the first version as it seems easier to read, even though it requires you to modify the method's interface.
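For completeness, a rough sketch of mine (untested against the site, not from either answer) of the caller-based option. It assumes the usual "file:line:in `method'" frame format, where block frames appear as "block (N levels) in crawl" and therefore don't match the pattern, so only the direct crawl frames get counted:

require 'nokogiri'
require 'open-uri'

def crawl(url)
  # Infer nesting depth from how many crawl frames are already on the stack.
  depth = caller.grep(/in `crawl'/).count
  prefix = depth.zero? ? '' : ('-' * depth) + ' '

  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each do |txt|
      puts prefix + txt.content
    end
    listitem.css('a[href]').each do |link|
      crawl(link['href'])
    end
  end
end

crawl('/Job.aspx')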
