Removing <head> issues, I need guide/assistance - ruby

So guys, I'm making a web parser. It was going well, but I saw that some words inside <head> are screwing everything up (and <strong> in the body too). My code is this one here, from before Nokogiri, but I'm new to Ruby programming and only started learning about Nokogiri a few hours ago.
I wish someone could help me make this work. I need to .read the URL, remove <head> and everything inside it, and scan for words over the rest of the page.
PS: Is it possible to bring in JUST the body and read it? It would be easier.
PPS: About <strong> tags, is it hard to remove them?
My exercise is to count how many times a specific word appears in the page, not in the source code; that's why I need to grab only the body and eliminate a tag.
Really hope someone can help me >.<
Thanks guys!
Here is my current failing code:
require 'open-uri'
require 'cgi'
require 'nokogiri'
class Counter
  def initialize(url)
    @url = url
  end
  def decapitate
    Nokogiri::HTML(url)
    url.css('head').remove.to_s
  end
  def scan(word)
    url.scan(word)
  end
end
url, word = ARGV
puts "Found #{Counter.new(url).open.decapitate.scan(word).length} maches."

There are many mistakes there.
url in decapitate is an undefined local variable. You need to use @url.
Nokogiri::HTML expects either an IO object or a string, not a URL. You probably wanted to use open(@url) to read the URL contents (I assume so, given that you required open-uri).
Nokogiri::HTML returns a document, but you don't store this return value anywhere.
Consequently, url (or rather @url) would be a string, and strings don't have a css method; you want to apply css to the document instead.
remove returns the node that was removed; as the last expression in the method, that is what gets returned. Thus decapitate would return the text of the head node.
At the end, ...decapitate.scan will invoke the String#scan method, not the method you defined.
You can do what you want as follows:
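require 'open-uri'
require 'nokogiri'
def count(pattern, url)
  # On Ruby 3.0+, call URI.open(url); plain open no longer accepts URLs.
  doc = Nokogiri::HTML(open(url))
  doc.css('head').remove
  doc.text.scan(pattern).size
end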

Related

Regex in Ruby for a URL that is an image

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else, and tried to edit it to work and I'm failing. One of the big problems I'm having is the original Regex I took had a few "#"'s in it, which I don't know if that is a character I can escape, or if Ruby is just going to stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contain an img, you can use the following:
"//a[img][@href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use a regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)$
is a pretty simple one that will match anything beginning with http or https and ending with one of the listed file extensions. You should be able to figure out how to extend it; Rubular.com is good for experimenting with these.
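As a rough sketch of how such a pattern could be written as a Ruby literal: %r{...} avoids escaping every slash, and \A and \z anchor the whole string. The IMAGE_URL and image_links names and the sample links are assumptions for illustration. On the "#" worry from the question: inside a Ruby regex literal, # is only special when it starts an interpolation (#{...}, #$var, #@var) or when the x flag is used, and writing \# is always safe.

```ruby
# %r{...} lets "/" appear unescaped. \A and \z anchor the whole string,
# and the trailing i flag makes the extension match case-insensitive.
IMAGE_URL = %r{\Ahttps?://\S+\.(?:jpe?g|gif|png)\z}i

# Keep only the strings that look like http(s) image URLs.
def image_links(uris)
  uris.grep(IMAGE_URL)
end

links = [
  'http://example.com/a.jpg',
  'https://example.com/gallery/b.PNG',
  'http://example.com/page.html',
  'ftp://example.com/c.gif'
]
p image_links(links)
# prints ["http://example.com/a.jpg", "https://example.com/gallery/b.PNG"]
```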
Regexps are a very powerful tool but, compared to simple string comparisons, they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  # ...
end
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]
def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end
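One caveat with end_with? is that a link such as cat.jpg?size=large does not end in the extension. A possible variation, sketched below with only the standard library, compares the extension of the URI's path instead. The image_uri? name, the extension list, and the sample URLs are assumptions for illustration.

```ruby
require 'uri'

# Extensions to accept; extend the list as needed.
IMAGE_EXTENSIONS = %w[.gif .jpg .jpeg .png].freeze

# True when the path part of the URI ends in a known image extension.
# Query strings are ignored; unparsable input counts as "not an image".
def image_uri?(uri)
  path = URI.parse(uri).path.to_s
  IMAGE_EXTENSIONS.include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  false
end

candidates = [
  'http://example.com/cat.jpg?size=large',
  'http://example.com/DOG.PNG',
  'http://example.com/about.html',
  'http://example.com/notajpg'
]
p candidates.select { |uri| image_uri?(uri) }
# prints ["http://example.com/cat.jpg?size=large", "http://example.com/DOG.PNG"]
```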

Ruby RSS::Parser.to_s silently fails?

I'm using Ruby 1.8.7's RSS::Parser, part of stdlib. I'm new to Ruby.
I want to parse an RSS feed, make some changes to the data, then output it (as RSS).
The docs say I can use '#to_s', and it seems to work with some feeds, but not others.
This works:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://news.ycombinator.com/rss'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns expected output: XML text.
This fails:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://feeds.feedburner.com/devourfeed'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns nothing (empty quotes).
And yet, if I change the last line to:
p rss
I can see that the object is filled with all of the feed data. It's the to_s method that fails.
Why?
How can I get some kind of error output to debug a problem like this?
From what I can tell, the problem isn't in to_s, it's in the parser itself. Stepping way into the parser.rb code showed nothing being returned, so to_s returning an empty string is valid.
I'd recommend looking at something like Feedzirra.
Also, as an FYI, take a look at Ruby's OpenURI module (require 'open-uri') for easy retrieval of web assets, like feeds. OpenURI is simple but adequate for most tasks. Net::HTTP is lower level, so you will need to write a lot more code to replace the functionality of OpenURI.
I had the same problem, so I started debugging the code. I think Ruby's rss library has a few too many required elements. The channel needs to have title, link, and description; if one is missing, to_s will fail.
The second feed in the example above is missing the description, which makes to_s fail.
I believe this is a bug, but I don't really understand the code, and I barely know Ruby, so who knows. It would seem natural to me that to_s would try its best even if some elements are missing.
Either way,
rss.channel.description = "something"
rss.to_s
will "work".
The problem lies in def have_required_elements?
or in
self.class::MODELS

nokogiri doc.xpath('head') returns nil

I am trying to get all the scripts declared in the head section of a given html, but no matter how I try, it always returns nil.
doc = Nokogiri::HTML(open('http://www.walmart.com.br/'))
puts doc.at('body') # returns nil
doc.xpath('//html/head').each # this also will never iterate
Any suggestions?
The page's DOCTYPE isn't valid, so Nokogiri parses the page improperly. A quick, inefficient fix to the problem:
require 'nokogiri'
require 'open-uri'
require 'pp'
# Request the HTML before parsing
html = open("http://www.walmart.com.br/").read
# Replace original DOCTYPE with a valid DOCTYPE
html = html.sub(/^<!DOCTYPE html(.*)$/, '<!DOCTYPE html>')
# Parse
doc = Nokogiri::HTML(html)
# Party.
pp doc.xpath("/html/head")
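To see the DOCTYPE substitution in isolation, here is a small sketch that runs on an inline string rather than the live page. The sample DOCTYPE is made up (it is not the one that site served), and normalize_doctype is a hypothetical helper; it uses a slightly different pattern, anchored at the start of the document and stopping at the first ">".

```ruby
# Made-up page with a long, non-standard DOCTYPE on its first line.
SAMPLE_DOC = <<~HTML
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "bogus">
  <html><head><script src="a.js"></script></head><body></body></html>
HTML

# Replace whatever DOCTYPE the page starts with by the plain HTML5 one.
# Input without a leading DOCTYPE is returned unchanged.
def normalize_doctype(html)
  html.sub(/\A\s*<!DOCTYPE[^>]*>/i, '<!DOCTYPE html>')
end

puts normalize_doctype(SAMPLE_DOC).lines.first # prints <!DOCTYPE html>
```

The cleaned string can then be handed to Nokogiri::HTML as in the answer above.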
Ok, when I tried it in script/console, I could indeed get something useful for:
doc.at('body')
so I'm not sure what's going wrong there for you.
For the html head, I can't get the head element either. html works fine, but head either way doesn't.
I think there's something screwy with that walmart page. I tried doing the same thing for
Nokogiri::HTML(open('http://google.com/'))
and it worked just fine.
So unless you can figure out what they're doing to stop you from accessing parts of the page... then I don't know.
If you can deal with all scripts from the doc, I found that this one works just fine:
doc.xpath('//script')

Need to build a url and work with the returned result

I would like to start with a little script that fetches the examination results of me and my friends from our university website.
I would like to pass it the roll number as the POST parameter and work with the returned data, but I don't know how to create the POST string.
It would be great if someone could tell me where to start and what I need to learn; links to a tutorial would be most appreciated.
I don't want someone to write code for me, just guidance on how to get started.
I've written a solution here just as a reference for whatever you might come up with. There are multiple ways of attacking this.
# fetch_scores.rb
require 'open-uri'
# Define a constant named URL so if the results URL changes we don't
# need to replace a hardcoded URL everywhere.
URL = "http://www.nitt.edu/prm/ShowResult.html?&param="
# Check the count of arguments passed to the script.
# It only takes one, so let's show the user how to use
# the script.
if ARGV.length != 1
  puts "Usage: fetch_scores.rb student_name"
else
  name = ARGV[0] # could drop the ARGV length check and add a default using ||
  # e.g. name = ARGV[0] || 'nikhil'
  results = open(URL + name).read
end
You might look at Nokogiri or Hpricot to better parse and format your results. Ruby has implicit returns: a method returns the value of the last expression it evaluates, which is why you rarely see an explicit return statement. Note that the script itself prints nothing; results just holds the fetched page, so add puts results if you want to see it.
You could have a look at the net/http library, included as part of the standard library. See http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html for details, there are some examples on that page to get you started.
A very simple way to do this is to use the open-uri library and just put the query parameters in the URL query string:
require 'open-uri'
name = 'nikhil'
results = open("http://www.nitt.edu/prm/ShowResult.html?&param=#{name}").read
results now contains the body text fetched from the URL.
If you are looking for something more ambitious, look at net/http and the httparty gem.
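Since the question asks how to build the parameter string, here is a minimal standard-library sketch. The endpoint, the param key, and both method names are placeholders, not the university's real API. URI.encode_www_form escapes the value for you, and Net::HTTP.post_form sends the same pair as a POST body; the POST helper is defined but not called, so the sketch makes no network request.

```ruby
require 'uri'
require 'net/http'

# Hypothetical endpoint and parameter name, for illustration only.
RESULTS_URL = 'http://www.example.edu/results'

# Build a GET URL with a properly escaped query string.
def results_uri(roll_number)
  uri = URI.parse(RESULTS_URL)
  uri.query = URI.encode_www_form('param' => roll_number)
  uri
end

# Send the same parameter as a POST form body and return the response body.
def post_results(roll_number)
  Net::HTTP.post_form(URI.parse(RESULTS_URL), 'param' => roll_number).body
end

puts results_uri('110 107/A')
# prints http://www.example.edu/results?param=110+107%2FA
```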

extracting content of content attribute in meta tag of a website given a specified value for the name attribute with nokogiri in ruby?

My first question here, would be awesome to find an answer. I am new to using nokogiri.
Here is my problem. I have something like this in the HTML head on a target site (here a techcrunch post):
<meta content="During my time at TechCrunch I've seen thousands of startups and written about hundreds of them. I sure as hell don't know all ..." name="description"/>
I would now like to have a script to run through the meta tags, locate the one with the name attribute "description" and get what is in the content attribute.
I have tried something like this
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.techcrunch.com/2009/10/11/the-underutilized-power-of-the-video-demo-to-explain-what-the-hell-you-actually-do/"
doc = Nokogiri::HTML(open(url))
posts = doc.xpath("//meta")
posts.each do |link|
a = link.attributes['name']
b = link.attributes['content']
end
After this I could select the link whose name attribute equals description, but this code returns nil for a and b.
I played around with
posts = doc.xpath("//meta"), posts = doc.xpath("//meta/*"), etc., but still got nil.
The problem is not with the XPath; it seems the document does not parse. You can check that with puts doc: it does not contain the full input. It seems to be a problem with parsing comments (I suspect either invalid HTML or a bug in libxml2).
In your case I would use a regular expression as a workaround. Given that <meta> tags are simple enough, that might work, e.g. /<meta name="([^"]*)" content="([^"]*)"/ (swap the attribute order to match the page; in the tag you quoted, content comes before name).
You should change
doc = Nokogiri::HTML(open(url))
to
doc = Nokogiri::HTML(open(url).read)
Update: or maybe not :) Actually, your code works for me using Ruby 1.8.7 / Nokogiri 1.4.0.

Resources