Reading Several URIs in Ruby

I need to read the contents of a web page several times and extract some information from it, for which I use regular expressions. I am using open-uri to read the contents of the page, and the sample code I have written is as follows:
require 'open-uri'

def getResults(words)
  results = []
  words.each do |word|
    results.push getAResult(word)
  end
  results
end

def getAResult(word)
  file = open("http://www.somapage.com?option=#{word}")
  contents = file.read
  file.close
  contents.match /some-regex-here/
  $1.empty? ? -1 : $1.to_f
end
The problem is that unless I comment out the file.close line, getAResult always returns -1. When I try this code in the console, getAResult returns -1 immediately, but the Ruby process keeps running for another two to three seconds or so.
If I remove the file.close line, getAResult returns the correct result, but then getResults is a bunch of -1s except for the first one. I tried using the curb gem to read the page, but a similar problem appears.
This seems like an issue related to threading. However, I couldn't come up with anything reasonable to search for to find a corresponding solution. What do you think the problem could be?
NOTE: The web page I am trying to read does not return results quickly; it takes some time.

Try Hpricot or Nokogiri. Either one can search an HTML document via XPath (or CSS selectors) instead of regular expressions.
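For example, here is a minimal Nokogiri sketch of the question's method; the //div[@class="result"] selector is a placeholder, since the real regex target isn't shown:
require 'open-uri'
require 'nokogiri'

def getAResult(word)
  # Parse the page instead of running a regex over the raw HTML.
  doc = Nokogiri::HTML(open("http://www.somapage.com?option=#{word}"))
  # Placeholder selector: point this at whatever node actually holds the value.
  node = doc.at_xpath('//div[@class="result"]')
  node ? node.text.to_f : -1
end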

You should grab the match result, like the following:
1.9.3-327 (main):0 > contents.match /div/
=> #<MatchData "div">
1.9.3-327 (main):0 > $1
=> nil
1.9.3-327 (main):0 > contents.match /(div)/
=> #<MatchData "div" 1:"div">
1.9.3-327 (main):0 > $1
=> "div"

If you are worried about thread safety, then you shouldn't use the $n regexp variables. Capture your results directly, like this:
value = contents[/regexp/]
Specifically, here's a more Ruby-like formatting of that method:
def getAResult(word)
  contents = open("http://www.somapage.com?option=#{word}") { |f| f.read }
  value = contents[/some-regex-here/]
  value.to_s.empty? ? -1 : value.to_f # nil (no match) and "" both map to -1
end
The block form of #open (as above) automatically closes the file when you are done with it.

Related

How do I ignore the nil values in the loop with parsed values from Mechanize?

In my text file is a list of URLs. Using Mechanize, I'm parsing out the title and meta description of each page in that list. However, some of those pages don't have a meta description, which stops my script with a nil error:
undefined method `[]' for nil:NilClass (NoMethodError)
I've read up and seen solutions for Rails, but for plain Ruby I've only seen reject and compact as possible ways to ignore nil values. I added compact at the end of the loop, but that doesn't seem to do anything.
require 'rubygems'
require 'mechanize'

File.readlines('parsethis.txt').each do |line|
  page = Mechanize.new.get(line)
  title = page.title
  metadesc = page.at("head meta[name='description']")[:content]
  puts "%s, %s, %s" % [line.chomp, title, metadesc]
end.compact!
It's just a list of URLs in a text file, like this:
http://www.a.com
http://www.b.com
This is what will output in the console for example:
http://www.a.com, Title, This is a description.
If a page in the list of URLs has no description or title, the script throws up the nil error. I don't want it to skip any URLs; I want it to go through the whole list.
Here is one way to do it.
Edit (for the added requirement to not skip any URLs):
metadesc = page.at("head meta[name='description']")
puts "%s, %s, %s" % [line.chomp, title, metadesc ? metadesc[:content] : "N/A"]
This is untested but I'd do something like this:
require 'open-uri'
require 'nokogiri'

page_info = {}
File.foreach('parsethis.txt') { |url|
  page = Nokogiri::HTML(open(url))
  title = page.title
  meta_desc = page.at("head meta[name='description']")
  meta_desc_content = meta_desc ? meta_desc[:content] : nil
  page_info[url] = {:title => title, :meta_desc => meta_desc_content}
}

page_info.each do |url, info|
  puts [
    url,
    info[:title],
    info[:meta_desc]
  ].join(', ')
end
File.foreach iteratively reads a file, returning each line individually.
page.title could return a nil if a page doesn't have a title; titles are optional in pages.
I break down accessing the meta-description into two steps. Meta tags are optional in HTML so they might not exist, at which point a nil would be returned. Trying to access a content= parameter would result in an exception. I think that's what you're seeing.
Instead, in my code, meta_desc_content is conditionally assigned a value if the meta-description tag was found, or nil.
The code populates the page_info hash with key/value pairs of the URL and its associated title and meta-description. I did it this way because a hash-of-hashes, or possibly an array-of-hashes, is a very convenient structure for all sorts of secondary manipulations, such as returning the information as JSON or inserting into a database.
As a second step the code iterates over that hash, retrieving each key/value pair. It then joins the values into a string and prints them.
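For instance, a minimal sketch of the JSON case mentioned above (using the standard json library):
require 'json'

# page_info is the hash built in the code above; to_json serializes the
# whole hash-of-hashes in one call.
puts page_info.to_json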
There are lots of things in your code that are either wrong, or not how I'd do them:
File.readlines('parsethis.txt').each loads the entire file into an array, which you then iterate over. That isn't scalable, nor is it efficient. File.foreach reads the file line by line instead, so get in the habit of using it unless you are absolutely sure you know why you should use readlines.
You use Mechanize for something that Nokogiri and OpenURI can do faster. Mechanize is a great tool if you are working with forms and need to navigate a site, but you're not doing that, so instead you're dragging around additional code-weight that isn't necessary. Don't do that; It leads to slow programs among other things.
page.at("head meta[name='description']")[:content] is an exception in waiting. As I said above, meta-descriptions are not necessarily going to exist in a page. If it doesn't then you're trying to do nil[:content] which will definitely raise an exception. Instead, work your way down to the data you want so you can make sure that the meta-description exists before you try to get at its content.
You can't use compact or compact! the way you were. each returns its receiver (here, the array of lines from readlines), not the values produced by your block, so there is nothing useful for compact or compact! to remove. You could have used map, but the logic would have been messy, and puts inside map is rarely used (probably shouldn't be used is more likely, but that's a different subject). A sketch of the map approach follows below.
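As a sketch of that last point (untested, using the same requires as the Nokogiri example above): collect the formatted lines with map, then print them once at the end. map returns an array, so compact works on it.
lines = File.foreach('parsethis.txt').map do |line|
  page = Nokogiri::HTML(open(line.chomp))
  meta_desc = page.at("head meta[name='description']")
  "%s, %s, %s" % [line.chomp, page.title, meta_desc ? meta_desc[:content] : "N/A"]
end

# compact would drop any nil entries if the block returned nil for skipped pages
puts lines.compact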

How do I avoid EOFError with Ruby script?

I have a Ruby script (1.9.2p290) where I am trying to call a number of URLs and then append information from those URLs to a file. The issue is that I keep getting an end-of-file error (EOFError). An example of what I'm trying to do is:
require "open-uri"

proxy_uri = URI.parse("http://IP:PORT")
somefile = File.open("outputlist.txt", 'a')
output = [] # accumulate the <img> counts

(1..100).each do |num|
  page = open('SOMEURL' + num, :proxy => proxy_uri).read
  pattern = "<img"
  tags = page.scan(pattern)
  output << tags.length
end

somefile.puts output
somefile.close
I don't know why I keep getting this end of file error, or how I can avoid getting the error. I think it might have something to do with the URL that I'm calling (based on some dialogue here: What is an EOFError in Ruby file I/O?), but I'm not sure why that would affect the I/O or cause an end of file error.
Any thoughts on what I might be doing wrong here or how I can get this to work?
Thanks in advance!
The way you are writing your file isn't idiomatic Ruby. This should work better:
output = []
(1..100).each do |num|
  page = open('SOMEURL' + num, :proxy => proxy_uri).read
  pattern = "<img"
  tags = page.scan(pattern)
  output << tags.length
end

File.open("outputlist.txt", 'a') do |fo|
  fo.puts output
end
I suspect that the file is being closed because it's been opened, then not written-to while 100 pages are processed. If that takes a while I can see why they'd close it to avoid apps using up all the file handles. Writing it the Ruby-way automatically closes the file immediately after the write, avoiding holding handles open artificially.
As a secondary thing, rather than use a simple pattern match to try to locate image tags, use a real HTML parser. There will be little difference in processing speed, but potentially more accuracy.
Replace:
page = open('SOMEURL' + num, :proxy => proxy_uri).read
pattern = "<img"
tags = page.scan(pattern)
output << tags.length
with:
require 'nokogiri'
doc = Nokogiri::HTML(open('SOMEURL' + num, :proxy => proxy_uri))
output << doc.search('img').size
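Separately, if the EOFError comes from the remote server dropping connections, a common workaround, not part of the answer above, is to retry the fetch a few times. fetch_with_retries is a hypothetical helper (using open-uri's open from the question) and the retry count is arbitrary:
def fetch_with_retries(url, proxy_uri, attempts = 3)
  open(url, :proxy => proxy_uri).read
rescue EOFError
  attempts -= 1
  retry if attempts > 0
  raise
end

# for example, inside the loop:
# page = fetch_with_retries('SOMEURL' + num.to_s, proxy_uri)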

Ruby and curl: skipping invalid pages

I am building a script to parse multiple page titles. Thanks to another question on Stack Overflow, I now have this working bit:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)[1]
puts simian
but if you try the same thing on a page that has no title, for example
curl = %x(curl http://zales.1.ai)
it dies with undefined method `[]' for nil:NilClass, as the page has no title.
I can't check whether curl is nil, as it is not nil in this case (it contains another line).
Do you have any solution that keeps this working even when the title is not present, and moves on to check the next page? I would appreciate it if we stick to this code, as I did try other solutions with Nokogiri and open-uri (Nokogiri::HTML(open("http:/.....")), but those did not work either, since subdomains like byname_meee.1.ai do not work with the default open-uri. So I am thankful if we can stick to this code that uses curl.
UPDATE
I realize that I probably left out some specific cases that ought to be clarified. This is for parsing 300-400 pages. In the first run I noticed several cases where Nokogiri, Hpricot, and even the more basic open-uri do not work:
1) open-uri simply fails on a domain containing an underscore, like http://levant_alejandro.1.ai. This is a valid domain and works with curl, but not with open-uri, or with Nokogiri using open-uri.
2) The second case is a page with no title, like http://zales.1.ai
3) The third is a page with an image and no valid HTML, like http://voldemortas.1.ai/
4) A fourth case would be a page that returns nothing but an internal server error or a Passenger/Rack error.
The first three cases can be sorted out with this solution (thanks to Havenwood in the #ruby IRC channel):
curl = %x(curl http://voldemortas.1.ai/)
begin
  simian = curl.match(/<title>(.*)<\/title>/)[1]
rescue NoMethodError
  simian = "" # curl was nil?
rescue ArgumentError
  simian = "" # not html?
end
puts simian
Now I am aware that this is neither elegant nor optimal.
REPHRASED QUESTION
Do you have a better way to achieve the same thing with Nokogiri or another gem that covers these cases (no title, no valid HTML, or even a 404 page)? Given that the pages I am parsing have a fairly simple title structure, is the above solution suitable? For the sake of knowledge, it would be useful to know why using an extra gem for the parsing, like Nokogiri, would be the better option (note: I try to have few gem dependencies, as often and over time they tend to break).
You're making it much too hard on yourself.
Nokogiri doesn't care where you get the HTML; it just wants the body of the document. You can use Curb, Open-URI, or a raw Net::HTTP connection, and it'll parse the content returned.
Try Curb:
require 'curb'
require 'nokogiri'

doc = Nokogiri::HTML(Curl.get('http://odin.1.ai').body_str)
doc.at('title').text
=> "Welcome to Dotgeek.org * 1.ai"
If you don't know whether you'll have a <title> tag, then don't try to do it all at once:
title = doc.at('title')
next if (!title)
puts title.text
Take a look at "equivalent of curl for Ruby?" for more ideas.
You just need to check for the match before accessing it. If curl.match is nil, then you can't access the grouping:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)
simian &&= simian[1] # only access the matched group if available
puts simian
Do heed the Tin Man's advice and use Nokogiri. Your regexp is really only suitable for a brittle solution -- it fails when the title element is spread over multiple lines.
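To see why, a small sketch with made-up HTML: the line-based regex misses a title split across lines, while Nokogiri still finds it.
require 'nokogiri'

html = "<html><head><title>Split\nacross lines</title></head><body/></html>"

html.match(/<title>(.*)<\/title>/) # => nil, because . does not match a newline
Nokogiri::HTML(html).at('title').text # => "Split\nacross lines"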
Update
If you really don't want to use an HTML parser and if you promise this is for a quick script, you can use OpenURI (wrapper around net/http) in the standard library. It's at least a little cleaner than parsing curl output.
require 'open-uri'

def extract_title_content(line)
  title = line.match(%r{<title>(.*)</title>})
  title &&= title[1]
end

def extract_title_from(uri)
  title = nil
  open(uri) do |page|
    page.lines.each do |line|
      return title if title = extract_title_content(line)
    end
  end
rescue OpenURI::HTTPError => e
  STDERR.puts "ERROR: Could not download #{uri} (#{e})"
end

puts extract_title_from 'http://odin.1.ai'
What you're really looking for, it seems, is a way to skip non-HTML responses. That's much easier with a curl wrapper like curb, as the Tin Man suggested, than dropping to the shell and using curl there:
1.9.3p125 :001 > require 'curb'
=> true
1.9.3p125 :002 > response = Curl.get('http://odin.1.ai')
=> #<Curl::Easy http://odin.1.ai?>
1.9.3p125 :003 > response.content_type
=> "text/html"
1.9.3p125 :004 > response = Curl.get('http://voldemortas.1.ai')
=> #<Curl::Easy http://voldemortas.1.ai?>
1.9.3p125 :005 > response.content_type
=> "image/png"
1.9.3p125 :006 >
So your code could look something like this:
response = Curl.get(url)
if response.content_type == "text/html" # or more fuzzy: =~ /text/
  match = response.body_str.match(/<title>(.*)<\/title>/)
  title = match && match[1]
  # or use Nokogiri for heavier lifting
end
No more exceptions.

How to find the file path from the open command

I need to get the path of the file in the fo variable so that I can pass the path to the unzip_file function. How do I get the path here?
url = 'http://www.dtniq.com/product/mktsymbols_v2.zip'
open(url, 'r') do |fo|
  puts "unzipfile "
  unzip_file(fo, "c:\\temp11\\")
end
In terms of how to find this out, I would do the following:
1) Find out the class of the object I am dealing with:
ruby-1.9.2-p290 :001 > tmp_file = open('tmp.txt', 'r')
=> #<File:tmp.txt>
ruby-1.9.2-p290 :002 > tmp_file.class
=> File
2) Go look up the documentation for that class. A Google search for "ruby file" returns Class: File at ruby-doc.org (www.ruby-doc.org/core/classes/File.html).
3) Look at the methods. There is one called path, which looks interesting (demonstrated below).
4) If I haven't found an answer by now, continue looking around Google/Stack Overflow for a bit. If I really can't find a solution that matches my problem, it's time to ask a question on here.
Most of the time steps 1-3 should get you what you need. Once you learn to read the documentation you can do things a lot quicker; it's just a matter of overcoming how difficult it is to get into the docs when you first start.
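For example, continuing the IRB session above (the output is assumed for the local file opened there):
ruby-1.9.2-p290 :003 > tmp_file.path
=> "tmp.txt"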
The fo in your block should be a Tempfile so you can use the path method:
url = 'http://www.dtniq.com/product/mktsymbols_v2.zip'
open(url, 'r') do |fo|
  puts "unzipfile "
  unzip_file(fo.path, "c:\\temp11\\")
end
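One caveat not covered above: open-uri only buffers larger downloads to a Tempfile; small responses come back as a StringIO, which has no path. A sketch that handles both cases by spilling to a Tempfile when needed (unzip_file is the question's own function):
require 'open-uri'
require 'tempfile'

url = 'http://www.dtniq.com/product/mktsymbols_v2.zip'
open(url, 'r') do |fo|
  if fo.respond_to?(:path)
    # Large download: open-uri already buffered it in a Tempfile.
    unzip_file(fo.path, "c:\\temp11\\")
  else
    # Small download: fo is a StringIO, so write it out ourselves.
    Tempfile.open(['mktsymbols', '.zip']) do |tmp|
      tmp.binmode
      tmp.write(fo.read)
      tmp.flush
      unzip_file(tmp.path, "c:\\temp11\\")
    end
  end
end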

How do I tell the line number for a node using the Nokogiri reader interface?

I'm trying to write a Nokogiri script that will grep XML for text nodes containing ASCII double-quotes («"»). Since I want grep-like output, I need the line number and the contents of each line. However, I am unable to see how to tell the line number where the element starts. Here is my code:
require 'rubygems'
require 'nokogiri'

ARGV.each do |filename|
  xml_stream = File.open(filename)
  reader = Nokogiri::XML::Reader(xml_stream)
  titles = []
  text = ''
  grab_text = false
  reader.each do |elem|
    if elem.node_type == Nokogiri::XML::Node::TEXT_NODE
      data = elem.value
      lines = data.split(/\n/, -1)
      lines.each_with_index do |line, idx|
        if (line =~ /"/) then
          STDOUT.printf "%s:%d:%s\n", filename, elem.line() + idx, line
        end
      end
    end
  end
end
elem.line() does not work.
XML and parsers don't really have a concept of line numbers. You're talking about the physical layout of the file.
You can play a game with the parser using accessors looking for text nodes containing linefeeds and/or carriage returns but that can be thrown off because XML allows nested nodes.
require 'nokogiri'

xml =<<EOT_XML
<atag>
  <btag>
    <ctag
      id="another_node">
      other text
    </ctag>
  </btag>
  <btag>
    <ctag id="another_node2">yet
another
text</ctag>
  </btag>
  <btag>
    <ctag id="this_node">this text</ctag>
  </btag>
</atag>
EOT_XML

doc = Nokogiri::XML(xml)

# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"

# count how many "lines" there are in the document
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12

# walk the nodes looking for a particular string, counting lines as you go
content_at = []
doc.search('*/text()').each do |n|
  content_at << [n.line, n.text] if (n.text['this text'])
end
content_at # => [[14, "this text"]]
This works because of the parser's ability to figure out what is a text node and cleanly return it, without relying on regex or text matches.
EDIT: I went through some old code, snooped around in Nokogiri's docs some, and came up with the above edited changes. It's working correctly, including working with some pathological cases. Nokogiri FTW!
As of 1.2.0 (released 2009-02-22), Nokogiri supports Node#line, which returns the line number in the source where that node is defined.
It appears to use the libxml2 function xmlGetLineNo().
require 'nokogiri'

doc = Nokogiri::XML(open 'tmpfile.xml')
doc.xpath('//xmlns:package[@arch="x86_64"]').each do |node|
  puts '%4d %s' % [node.line, node['name']]
end
NOTE if you are working with large xml files (> 65535 lines), be sure to use Nokogiri 1.13.0 or newer (released 2022-01-06), or your Node#line results will not be accurate for large line numbers. See PR 2309 for an explanation.
