Foreach loop in XML generator not breaking - ruby

I am trying to generate XML, but the loop isn't breaking. Here is a part of the code:
@key = 0
@cont.each do |pr|
  xml.product {
    @key += 1
    puts @key.to_s
    begin
      @main = Nokogiri::HTML(open(@url+pr['href'], "User-Agent" => "Ruby/#{RUBY_VERSION}","From" => "foo@bar.invalid", "Referer" => "http://www.ruby-lang.org/"))
    rescue
      puts "rescue"
      next
    end
    puts pr['href']
    puts @key.to_s
    break # this break doesn't work
    #something else
  }
end
What's most interesting is that the break did take effect in the final generated XML file: it contains only one product. On the console, however, @key was printed for every item, which means the each loop itself doesn't break.
Could it be a Nokogiri XML-specific error, because of open brackets in the head of the loop?

In general, I think the way you're trying to generate the XML is confused. Don't convolute your code any more than necessary. Instead of starting to generate some XML and then aborting inside the block because you can't find the page you want, grab the pages you want first, then start processing.
I'd move the begin/rescue block outside the XML generation. Having it inside the XML generation block results in poor logic and questionable uses of next and break. Instead I'd recommend something like this untested code:
require 'nokogiri'
require 'open-uri'
@main = []
@cont.each do |pr|
  begin
    # URI.open is the open-uri entry point; Kernel#open no longer opens URLs in Ruby 3.0+
    @main << Nokogiri::HTML(
      URI.open(@url + pr['href'])
    )
  rescue
    puts 'rescue'
    next
  end
end
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root {
    xml.products {
      @main.each do |m|
        xml.product {
          xml.id_ m.at('id').text
          xml.name m.at('name').text
        }
      end
    }
  }
end
puts builder.to_xml
Which makes it easy to see that the code is keying off being able to retrieve a page.
This code is untested because we have no idea what your input values are or what your output should look like. Having valid input, expected output and a working example of your code that demonstrates the problem is essential if you want help debugging a problem with your code.
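As for why the original break seemed to be ignored: in Ruby, break inside a block ends the method call that the block was attached to, which in the question is most likely the inner xml.product { ... } call rather than the surrounding each. The sketch below uses a hypothetical wrapper method as a stand-in for xml.product; it is not Nokogiri code, only an illustration of the language rule.

```ruby
# `wrapper` is a hypothetical stand-in for a builder method such as xml.product
def wrapper
  yield
end

visited = []
[1, 2, 3].each do |n|
  wrapper do
    visited << n
    break # ends the `wrapper` call only; `each` moves on to the next element
  end
end
p visited # → [1, 2, 3]

# To leave the outer loop, the break has to sit directly in the `each` block
stopped = []
[1, 2, 3].each do |n|
  done = false
  wrapper do
    stopped << n
    done = true
  end
  break if done
end
p stopped # → [1]
```

The same applies to next: inside the inner block it only ends that run of the inner block.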
The use of @url + pr['href'] isn't generally a good idea. Instead use the URI class to build up the URL for you. URI handles encoding and ensures the URI is valid.
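A short sketch of that suggestion using URI.join from the standard library. The base URL and href values are made up for illustration; in the question they would come from @url and pr['href'].

```ruby
require 'uri'

base = 'http://example.com/catalog/' # hypothetical base URL
href = 'products/42.html'            # hypothetical relative link

full = URI.join(base, href)
puts full # → http://example.com/catalog/products/42.html

# An absolute-path href replaces the base path instead of being appended
rooted = URI.join(base, '/about.html')
puts rooted # → http://example.com/about.html
```

Plain string concatenation would have produced http://example.com/catalog//about.html for the second case, which is the kind of mistake URI.join avoids.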

Related

How to write a while loop properly

I'm trying to scrape a website, but I cannot seem to get my while loop to break out once it hits a page with no more information:
def scrape_verse_items(keyword)
  pg = 1
  while pg < 1000
    puts "page #{pg}"
    url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
    doc = Nokogiri::HTML(open(url))
    items = doc.css("ul.search-result li.reference")
    error = doc.css('div#noresults')
    until error.any? do
      if keyword != ''
        item_hash = {}
        items.each do |item|
          title = item.css("h3").text.strip
          content = item.css("p").text.strip
          item_hash[title] = content
        end
      else
        puts "Please enter a valid search"
      end
      if error.any?
        break
      end
    end
    pg += 1
  end
  item_hash
end
puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work:
require 'nokogiri'
require 'open-uri'
class MyScrapper
  def initialize;end
  def call(keyword)
    # Print the message first, then return; also reject an empty keyword
    if keyword.nil? || keyword.empty?
      puts "Please enter a valid search"
      return
    end
    scrape({}, keyword, 1)
  end
  private
  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?
    build_new_items(doc).merge(scrape(results, keyword, page+1))
  end
  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    # URI.open is the open-uri entry point; Kernel#open no longer opens URLs in Ruby 3.0+
    Nokogiri::HTML(URI.open(url))
  end
  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end
You call it with MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or to make these class methods, to avoid having to instantiate the class.)
What this does is call a method named scrape with the starting results, the keyword, and the page. It loads the page and, if there are no results, returns the results it has found so far.
Otherwise it builds a hash from the page it loaded, then the method calls itself and merges the results with the new hash it just built. It does this until there are no more results.
If you want to limit the number of pages, you can change this line:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.
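To try the recursion without hitting the network, the page loading can be replaced by a lookup in an in-memory hash. Everything below (PAGES, load_items, collect, collect_in_loop) is a made-up stand-in for illustration, not part of the script above. The loop version is included for comparison, since the question was about breaking out of a loop.

```ruby
# Hypothetical stand-in for the remote site: page number => items on that page
PAGES = {
  1 => { 'title 1' => 'content 1' },
  2 => { 'title 2' => 'content 2' },
  3 => { 'title 3' => 'content 3' }
}.freeze

# Returns the items for a page, or nil once the pages run out
def load_items(page)
  PAGES[page]
end

# Recursive form: merge this page with everything collected from later pages
def collect(page = 1, max_page = 999)
  items = load_items(page)
  return {} if items.nil? || page > max_page
  items.merge(collect(page + 1, max_page))
end

# Iterative form of the same traversal, with break as the single exit
def collect_in_loop(max_page = 999)
  results = {}
  (1..max_page).each do |page|
    items = load_items(page)
    break if items.nil?
    results.merge!(items)
  end
  results
end

p collect.keys               # → ["title 1", "title 2", "title 3"]
p collect(1, 2).keys         # → ["title 1", "title 2"]
p collect_in_loop == collect # → true
```

Both forms stop as soon as a page has nothing to offer, which is the condition the original until loop never re-evaluated because error was not refreshed inside it.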

Can you pass a block of code that returns an error to a method?

I often find myself dealing with these kind of scenarios:
require 'nokogiri'
require "open-uri"
url = "https://www.random_website.com/contains_info_I_want_to_parse"
nokodoc = Nokogiri::HTML(open(url))
# Let's say one of the following line breaks the ruby script
# because the element I'm searching doesn't contain an attribute.
a = nokodoc.search('#element-1').attribute('href').text
b = nokodoc.search('#element-2').attribute('href').text.gsub("a", "A")
c = nokodoc.search('#element-3 h1').attribute('style').text.strip
What happens is that I'll be creating about 30 variables, each searching for a different element in a page, and I'll be looping that code over multiple pages. However, a few of these pages may have an ever-so-slightly different layout and won't have one of those divs. This will break my code (because you can't call .attribute or .gsub on nil, for example). But I can never guess which line beforehand.
My go-to solution is usually to surround each line with:
begin
  line #n
rescue
  puts "line #n caused an error"
end
I'd like to be able to do something like:
url = "https://www.random_website.com/contains_info_I_want_to_parse"
nokodoc = Nokogiri::HTML(open(url))
catch_error(a, nokodoc.search('#element-1').attribute('href').text)
catch_error(b, nokodoc.search('#element-2').attribute('href').text.gsub("a", "A"))
catch_error(c, nokodoc.search('#element-3 h1').attribute('style').text.strip)
def catch_error(variable_name, code)
  begin
    variable_name = code
  rescue
    puts "Code in #{variable_name} caused an error"
  end
  variable_name
end
I know that putting & before each new method works:
nokodoc.search('#element-1')&.attribute('href')&.text
But I want to be able to display the error with a 'puts' in my terminal to see when my code gives an error.
Is it possible?
You can't pass your code as a regular argument to a method because it'll be evaluated (and raise an exception) before it gets passed to your catch_error method. You could pass it as a block--something like
# Define the helper before the first call so the script runs top to bottom
def catch_error(error_description)
  yield
rescue
  puts "#{error_description} caused an error"
end
a = catch_error('element_1 href text') do
  nokodoc.search('#element-1').attribute('href').text
end
Note that you can't pass a to the method as variable_name: it hasn't been defined anywhere before calling that method, so you'll get an undefined local variable or method error. Even if you define a earlier, it won't work correctly. If your code runs without raising an exception, the method will return the right value, but the value won't get stored anywhere outside the method scope. If there is an exception, variable_name will have whatever value a had before the call (nil if you defined it without setting it), so your error message would read something like "Code in  caused an error". That's why I added an error_description parameter.
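The block-based helper can be tried without Nokogiri. In this sketch a plain hash stands in for the parsed page, and the keys and URL are invented for the example. The explicit nil in the rescue branch is an addition, so that a failed lookup yields nil regardless of what the logging call returns.

```ruby
# Runs the block; on failure, reports which lookup failed and returns nil
def catch_error(error_description)
  yield
rescue StandardError => e
  puts "#{error_description} caused an error (#{e.class})"
  nil
end

# Hypothetical data standing in for a parsed page
page = { 'element-1' => 'https://example.com/a' }

a = catch_error('element-1 href') { page.fetch('element-1').upcase }
b = catch_error('element-2 href') { page.fetch('element-2').upcase }

p a # → "HTTPS://EXAMPLE.COM/A"
p b # → nil
```

The second call prints its description because fetch raises KeyError for the missing key, and the script carries on with b set to nil.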
You could also try logging the message and backtrace if you didn't want to have to specify an error description every time.
# Define the helper before the first call so the script runs top to bottom
def catch_error(doc)
  yield doc
rescue => ex
  puts doc.title # Or something else that identifies the document
  puts ex.message
  puts ex.backtrace.join("\n")
end
a = catch_error(nokodoc) do |doc|
  doc.search('#element-1').attribute('href').text
end
I made one additional change here: passing the document in as a parameter so that rescue could easily log something that identifies the document, in case that's important.

Using Ruby to parse and write Puppet node definitions

I am writing a helper API in Ruby to automatically create and manipulate node definitions. My code is working; it can read and write the node defs successfully, however, it is a bit clunky.
Ruby is not my main language, so I'm sure there is a cleaner, and more rubyesque solution. I would appreciate some advice or suggestions.
Each host has its own file in manifests/nodes containing just the node definition. e.g.
node 'testnode' {
class {'firstclass': }
class {'secondclass': enabled => false }
}
The classes are all either enabled (default) or disabled elements. In the Ruby code, I store these as an instance variable hash @elements.
The read method looks like this:
def read()
  data = File.readlines(@filepath)
  for line in data do
    if line.include? 'class'
      element = line[/.*\{'([^\']*)':/, 1]
      if @elements.include? element.to_sym
        if not line.include? 'enabled => false'
          @elements[element.to_sym] = true
        else
          @elements[element.to_sym] = false
        end
      end
    end
  end
end
And the write method looks like this:
def write()
  data = "node #{@hostname} {\n"
  for element in @elements do
    if element[1]
      line = " class {'#{element[0]}': }\n"
    else
      line = " class {'#{element[0]}': enabled => false}\n"
    end
    data += line
  end
  data += "}\n"
  file = File.open(@filepath, 'w')
  file.write(data)
  file.close()
end
One thing to add is that these systems will be isolated from the internet. So I'd prefer to avoid large number of dependency libraries as I'll need to install / maintain them manually.
If your goal is to define your nodes programmatically, there is a much more straightforward way than reading and writing manifests. One of the built-in features of Puppet is "External Node Classifiers" (ENC). The basic idea is that something external to Puppet defines what a node should look like.
In the simplest form, the ENC can be a ruby/python/whatever script that writes out yaml with the list of classes and enabled parameters. Reading and writing yaml from ruby is as simple as it gets.
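A rough sketch of what the YAML-producing part of such a script could look like, using only the standard library. It assumes the ENC output uses the hash form of the classes key (class name mapped to its parameters); the enc_yaml helper and the shape of the elements hash are illustrative choices that mirror the question, not a fixed API.

```ruby
require 'yaml'

# Builds the YAML an ENC script would print for one node.
# `elements` maps class names to true (enabled) or false (disabled).
def enc_yaml(elements)
  classes = elements.each_with_object({}) do |(name, enabled), acc|
    # Disabled classes get an explicit parameter; enabled ones rely on defaults
    acc[name.to_s] = enabled ? {} : { 'enabled' => false }
  end
  { 'classes' => classes }.to_yaml
end

puts enc_yaml({ firstclass: true, secondclass: false })
```

Puppet calls the ENC executable with the node name as its argument and reads the YAML from standard output, so a real script would look up the elements for that node before printing. Since yaml ships with Ruby, this adds no dependencies to install on isolated systems.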
Ruby has some pretty good methods to iterate over data structures. See below for an example of how to rubify your code a little bit. I am by no means an expert on the subject, and have not tested the code. :)
def read
  # File.foreach yields one line at a time (an Array from readlines has no each_line)
  File.foreach(@filepath) do |line|
    element = line[/.*\{'([^\']*)':/, 1]
    next unless element # skip lines without a class declaration
    element = element.to_sym
    if @elements.include?(element)
      @elements[element] = !line.include?('enabled => false')
    end
  end
end
def write
  File.open(@filepath, 'w') do |file|
    file.puts "node #{@hostname} {"
    @elements.each do |name, enabled|
      if enabled
        file.puts " class {'#{name}': }"
      else
        file.puts " class {'#{name}': enabled => false }"
      end
    end
    file.puts '}'
  end
end
Hope this points you in the right direction.
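To check the parsing logic without touching the filesystem, the regular expression from the question can be run over the sample manifest held in a string. The starting hash, including thirdclass, is a made-up example.

```ruby
MANIFEST = <<~PUPPET
  node 'testnode' {
    class {'firstclass': }
    class {'secondclass': enabled => false }
  }
PUPPET

# Hypothetical starting state: every known class defaults to enabled
elements = { firstclass: true, secondclass: true, thirdclass: true }

MANIFEST.each_line do |line|
  name = line[/.*\{'([^\']*)':/, 1]
  next unless name # the node line and the closing brace yield nil
  key = name.to_sym
  next unless elements.key?(key)
  elements[key] = !line.include?('enabled => false')
end

p elements[:firstclass]  # → true
p elements[:secondclass] # → false
p elements[:thirdclass]  # → true
```

Classes that do not appear in the manifest keep their default, which matches the behaviour of the original read method.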

How do I post/upload multiple files at once using HttpClient?

def test_post_with_file(filename = 'test01.xml')
  File.open(filename) do |file|
    response = @http_client.post(url, {'documents'=>file})
  end
end
How do I modify the above method to handle a multi-file-array post/upload?
file_array = ['test01.xml', 'test02.xml']
You mean like this?
def test_post_with_file(file_array=[])
  file_array.each do |filename|
    File.open(filename) do |file|
      response = @http_client.post(url, {'documents'=>file})
    end
  end
end
I was having the same problem and finally figured out how to do it:
def test_post_with_file(file_array)
  form = file_array.map { |n| ['documents[]', File.open(n)] }
  response = @http_client.post(@url, form)
end
You can see in the docs how to pass multiple values: http://rubydoc.info/gems/httpclient/HTTPClient#post_content-instance_method .
In the "body" row, I tried without success to use the 4th example. Somehow HttpClient just decides to apply .to_s to each hash in the array.
Then I tried the 2nd solution, and it wouldn't work either because only the last value is kept by the server. But I discovered after some tinkering that the second solution works if the parameter name includes the square brackets to indicate there are multiple values as an array.
Maybe this is a bug in Sinatra (that's what I'm using), maybe the handling of such data is implementation-dependent, maybe the HttpClient doc is outdated/wrong. Or a combination of these.
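One detail the map-based version leaves open is that the files are never closed. A possible way to handle that is a small helper that builds the same array of pairs, hands it to a block, and closes the files afterwards. The helper name with_upload_form is hypothetical, and the actual post call is only shown as a comment because it needs a live server.

```ruby
# Opens every path, yields [[field, file], ...] to the block, then closes the files.
# Returns whatever the block returns (for example, the HTTP response).
def with_upload_form(paths, field = 'documents[]')
  files = []
  paths.each { |path| files << File.open(path) }
  yield files.map { |file| [field, file] }
ensure
  files.each(&:close)
end

# Intended use with the client from the answer above (not executed here):
# response = with_upload_form(file_array) { |form| @http_client.post(@url, form) }
```

The square brackets in the default field name follow the finding above that the server only kept all values when the parameter name ended in [].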

Ruby TMail size-limit on body?

I have a small application that processes emails downloaded from an IMAP server with fetchmail. The processing consists of finding base64-encoded attachments with an XML file inside.
Here is the code (somewhat stripped):
def extract_data_from_mailfile(mailfile)
  begin
    mail = TMail::Mail.load(mailfile)
  rescue
    return nil
  end
  bodies_found = []
  if mail.multipart? then
    mail.parts.each do |m|
      bodies_found << m.body
    end
  end
  ## Let's parse the parts we found in the mail to see if one of them
  ## looks XML-ish. Hacky but works for now.
  bodies_found.each do |body|
    if body =~ /^<\?XML /i then
      return body
    end
  end
  return nil # Nothing found.
end
This works great, but on large XML-files (typically >600k mailfiles), this breaks.
>> mail.parts[1].body.size
=> 487424 <-- should have been larger - doesn't include the end of the file
Base64-decoding doesn't happen automatically either. But this is when I try to run decode manually:
>> Base64::decode64(mail.parts[1].body)
[...] ll="SMTP"></Sendt><Sendt"
That's part of the XML-file, but it has been clipped.
Any way to get the entire attachment? any tips?
I see your code breaks out of the loop at the first XML fragment it finds. Perhaps the larger messages divide their XML into smaller chunks inside the same multipart MIME message? You would then return an array of bodies and concatenate them:
mail.parts[1].body[0] + mail.parts[1].body[1]
(PS. It's a long shot, I haven't tried this)
