Ruby: Script won't grab URL from youtube correctly - ruby

I found this script on pastebin that is an IRC bot that will find youtube videos for you. I have not touched it at all (Bar the channel settings), it works well however it won't grab the URL to the video that has been searched. This code is not mine! I jsut would like to get it to work as it would be quite useful!
#!/usr/bin/env ruby
require 'rubygems'
require 'cinch'
require 'nokogiri'
require 'open-uri'
require 'cgi'
bot = Cinch::Bot.new do
configure do |c|
c.server = "irc.freenode.net"
c.nick = "YouTubeBot"
c.channels = ["#test"]
end
helpers do
#Grabs the first result and returns the TITLE,LINK,DESCRIPTION
def youtube(query)
doc = Nokogiri::HTML(open("http://www.youtube.com/results?q=#{CGI.escape(query)}"))
result = doc.css('div#search-results div.result-item-main-content')[0]
title = result.at('h3').text
link = "www.youtube.com"+"#{result.at('a')[:href]}"
desc = result.at('p.description').text
rescue
"No results found"
else
CGI.unescape_html "#{title} - #{desc} - #{link}"
end
end
on :channel, /^!youtube (.+)/ do |m, query|
m.reply youtube(query)
end
on :channel, "polkabot quit" do |m|
m.channel.part("bye")
end
end
bot.start
Currently if i use the command
!youtube asdf
I get this returned:
19:25 < YouTubeBot> asdfmovie - Worldwide Store www.cafepress.com ...
asdfmovie cakebomb tomska
epikkufeiru asdf movie ... tomska ... [Baby Giggling] Man: Got your nose! [Baby ...
- www.youtube.com#
As you can see the URL is just www.youtube.com# not the URL of the video.
Thanks a lot!

This is an xpath issue. It looks like the third 'a' that has the href you want so try:
link = "www.youtube.com#{result.css('a')[2][:href]}"

Related

Ruby crawl site, add URL parameter

I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
Tried this, not working though, errors out ("405 method not allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
spider.every_url do |url|
link= url+"?foo=bar"
response = open(link).read
end
end
Instead of relying on Spidr, I just grabbed a CSV of the URLs I needed from Google Analytics, then ran thru those. Got the job done.
require 'csv'
require 'open-uri'
CSV.foreach(File.path("the-links.csv")) do |row|
link = "http://www.example.com"+row[0]+"?foo=bar"
encoded_url = URI.encode(link)
response = open(encoded_url).read
puts encoded_url
puts
end

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse html webpages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function, url is a string with a webaddress. The line I previously mention works exactly as intended when used outside of the function, but not inside. When it gets to that line inside the function the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+#' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//#href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'
#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//#href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore(START_URL)
require 'nokogiri'
require 'open-uri'
#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
return
end
$CRAWLED_PAGES_COUNTER+=1
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//#href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore($START_URL)
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'
BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds
urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new
until urls.empty?
this_uri = URI.join(last_host, urls.shift)
next if visited_urls.include?(this_uri)
puts "Scanning: #{this_uri}"
doc = Nokogiri::HTML(this_uri.open)
visited_urls << this_uri
if visited_hosts.include?(this_uri.host)
puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
sleep SLEEP_TIME
end
visited_hosts << this_uri.host
urls += doc.search('[href]').map { |node|
node['href']
}.select { |url|
extension = File.extname(URI.parse(url).path)
extension[/\.html?$/] || extension.empty?
}
last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code to write an industrial strength spider is much more involved. Robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages either via HTTP timeouts or JavaScript redirects is a fun task, dealing with malformed pages are a challenge....

Why is this ruby code returning a blank page instead of filling it up with user names?

I want to collect the names of users in a particular group, called Nature, in the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'
def getInitUser()
agent1 = Mechanize.new
number = 0
while number<=500
address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
logfile2 = File.new("Fotolog/Users.csv","a")
tryConut = 0
begin
page = agent1.get(address)
rescue
tryConut=tryConut+1
if tryConut<5
retry
end
return
end
arrayUsers= []
# search for the users
page.search("a[class=img_border_radius").map do |opt|
link = opt.attributes['href'].text
link = link.gsub("http://www.fotolog.com/","").gsub("/","")
arrayUsers << link
logfile2.print("#{link}\n")
end
number = number+100
end
return arrayUsers
end
arrayUsers = getInitUser()
arrayUsers.each do |user|
getFriend(user)
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" tag I am using. But from the inspect element, it seems to be the correct class, isn't it? I am just getting started with web crawling, so I apologise if this is a silly query.

Ruby EOFError with open-uri and loop

I'm attempting to build a web crawler and ran into a bit of a snag. Basically what I'm doing is extracting the links from a web page and pushing each link to a queue. Whenever the Ruby interpreter hits this section of code:
links.each do |link|
url_frontier.push(link)
end
I receive the following error:
/home/blah/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock': end of file reached (EOFError)
If I comment out the above block of code I get no errors. Please, any help would be appreciated. Here is the rest of the code:
require 'open-uri'
require 'net/http'
require 'uri'
class WebCrawler
def self.Spider(root)
eNDCHARS = %{.,'?!:;}
num_documents = 0
token_list = []
url_repository = Hash.new
url_frontier = Queue.new
url_frontier.push(root.to_s)
while !url_frontier.empty? && num_documents < 10
url = url_frontier.pop
if !url_repository.has_key?(url)
document = open(url)
html = document.read
# extract url's
links = URI.extract(html, ['http']).collect { |u| eNDCHARS.index(u[-1]) ? u.chop : u }
links.each do |link|
url_frontier.push(link)
end
# tokenize
Tokenizer.tokenize(document).each do |word|
token_list.push(IndexStructures::Term.new(word, url))
end
# add to the repository
url_repository[url] = true
num_documents += 1
end
end
# sort by term (primary) and document id (secondary) in reverse to aid in the construction of the inverted index
return num_documents, token_list.sort_by! { |term| [term.term, term.document_id]}.reverse!
end
end
I encountered the same error but with Watir-webdriver, running firefox in headless mode. What I found out was, if I was running two of my applications in parallel and I destroy "headless" in one of the applications, it automatically kills the other one as well with the exact error you quoted. Though my situation is not the same as yours, I think the issue is related to prematurely closing the file handle externally while your application is still using it. I removed the destroy command from my application and the error disappeared.
Hope this helps.

Ruby - Mechanize: Select link by classname and other questions

At the moment I'm having a look on Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
page_links << ll
end
puts page_links.size
This works. But page_links includes not only the search results. It also includes the google links like Login, Pictures, ...
The result links own a styleclass "1". Is it possible to select only the links with class == 1? How do I achieve this?
Is it possible to modify the "agentalias"? If I own a website, including google analytics or something, what browserclient will I see in ga going with mechanize on my site?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
But this does not work.
in such cases like your I am using Nokogiri DOM search.
Here is your code a little bit rewritten:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
#maybe you better use 'h3.r > a.l' here
page.parser.css("a.l").each do |ll|
#page.parser here is Nokogiri::HTML::Document
page_links << ll
puts ll.text + "=>" + ll["href"]
end
puts page_links.size
Probably this article is a good place to start:
getting-started-with-nokogiri
By the way samples in the article also deal with Google search ;)
You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
cls = ll.attributes.attributes['class']
page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element and ll.attributes.attributes is a Hash containing the attributes on the link, hence the need for ll.attributes.attributes to get at the actual class and the need for the nil check before comparing the value to 'l'
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method for returning a Ruby object's internal id. I'm not sure what the work around for this is. You would have no problem selecting the form by some other attribute (e.g. its action.)
I believe the selector you are looking for is:
:dom_id e.g. in your case:
my_form = page.form_with(:dom_id => 'myformid')

Resources