Anemone Ruby spider - create key value array without domain name - ruby

I'm using Anemone to spider a domain and it works fine.
The code to initiate the crawl looks like this:
require 'anemone'
Anemone.crawl("http://www.example.com/") do |anemone|
anemone.on_every_page do |page|
puts page.url
end
end
This very nicely prints out all the page urls for the domain like so:
http://www.example.com/
http://www.example.com/about
http://www.example.com/articles
http://www.example.com/articles/article_01
http://www.example.com/contact
What I would like to do is create an array of key value pairs using the last part of the url for the key, and the url 'minus the domain' for the value.
E.g.
[
['','/'],
['about','/about'],
['articles','/articles'],
['article_01','/articles/article_01']
]
Apologies if this is rudimentary stuff but I'm a Ruby novice.

I would define an array or hash first outside of the block of code and then add your key value pairs to it:
require 'anemone'
path_array = []
crawl_url = "http://www.example.com/"
Anemone.crawl(crawl_url) do |anemone|
anemone.on_every_page do |page|
path_array << page.url.to_s # page.url is a URI object; store it as a string
puts page.url
end
end
From here you can then .map your array into a usable multi-dimensional array:
path_array.map{|x| [x[crawl_url.length..-1], x.gsub("http://www.example.com","")]}
=> [["", "/"], ["about", "/about"], ["articles", "/articles"], ["articles/article_01", "/articles/article_01"], ["contact", "/contact"]]
I'm not sure if it will work in every scenario; however, I think this gives you a good start for how to collect the data and manipulate it. Also, if you want key/value pairs, you should look into Ruby's Hash class for more information on how to create and use hashes in Ruby.

The simplest and possibly least robust way to do this would be to use
page.url.split('/').last
to obtain your 'key'. You would need to test various edge cases to ensure it worked reliably.
edit: this will return 'www.example.com' as the key for 'http://www.example.com/' which is not the result you require
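Putting the two answers together, here's a minimal, self-contained sketch (no crawl; the URL list stands in for what Anemone would yield via page.url.to_s) that produces exactly the pairs asked for, using the stdlib's URI to drop the domain:

```ruby
require 'uri'

# Build [key, path] pairs from full URLs; the key is the last path
# segment ("" for the site root), the value is the path minus the domain.
def path_pairs(urls)
  urls.map do |url|
    path = URI(url).path               # e.g. "/articles/article_01"
    [path.split('/').last.to_s, path]  # "/".split('/') == [], so the root key is ""
  end
end

urls = [
  "http://www.example.com/",
  "http://www.example.com/about",
  "http://www.example.com/articles/article_01"
]
p path_pairs(urls)
# => [["", "/"], ["about", "/about"], ["article_01", "/articles/article_01"]]
```

Note this gives 'article_01' (not 'articles/article_01') as the key for nested paths, matching the example in the question.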

Related

ruby sinatra how to redirect with regex

I am trying to move stuff at root to /en/ directory to make my little service multi-lingual.
So, I want to redirect this url
mysite.com/?year=2018
to
mysite.com/en/?year=2018
My code is like
get %r{^/(\?year\=\d{4})$} do |c|
redirect "/en/#{c}"
end
but it seems like I never get the #{c} part from the url.
Why is that? or are there just better ways to do this?
Thanks!
Sinatra matches route patterns against the path only (request.path_info); the query string is stripped before the pattern is applied, so your regex never sees ?year=2018 and the capture comes back empty. You can use the request.path variable to get the information you're looking for.
For example,
get "/something" do
puts request.path # => "/something"
redirect "/en#{request.path}"
end
However if you are using query parameters (e.g. ?year=2000) you'll have to pass those along to the redirect route yourself.
Kind of non-intuitively, there's a helper method for this in ActiveSupport (which require 'active_record' pulls in):
require 'active_record'
get "/something" do
puts params.to_param
# if params[:year] is 2000, you'll get "year=2000"
redirect "/en#{request.path}?#{params.to_param}"
end
You could alternatively write your own helper method pretty easily:
def hash_to_param_string(hash)
hash.map { |key, val| "#{key}=#{val}" }.join("&")
end
puts hash_to_param_string({key1: "val1", key2: "val2"})
# => "key1=val1&key2=val2"
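If you'd rather not pull in ActiveSupport at all, Ruby's stdlib can build (and properly URL-encode) the query string; a small sketch using URI.encode_www_form:

```ruby
require 'uri'

params = { "year" => "2018", "q" => "a b" }
query  = URI.encode_www_form(params)  # URL-encodes keys and values
puts query                            # => "year=2018&q=a+b"
puts "/en/?#{query}"                  # => "/en/?year=2018&q=a+b"
```

Unlike the hand-rolled helper, this escapes reserved characters (spaces, &, =) correctly.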

Redis-objects Ruby gem, how to retrieve Redis list and iterate?

I'm trying to use redis-objects Ruby gem to store some Redis data in lists.
I am able to create a list by following the example in the documentation.
I am able to find the list from Redis using lrange. Not sure if that is the best way, I couldn't find a method provided by redis-objects.
Initially when I iterate the elements in the list I get the elements in the form of Hashes.
However after I get the list using lrange those are not hashes and I cannot access the data.
What would be the appropriate way to find the list and get the items in hash form?
You can see the code below and the outputs from the console.
@list = Redis::List.new('list_name', :marshal => true)
@list << {:name => "Nate", :city => "San Diego"}
@list.each do |el|
puts el
puts el.class
puts "#{el[:name]} lives in #{el[:city]}"
end
redis = Redis.current
@list = redis.lrange("list_name", 0, -1)
@list.each do |el|
puts el
puts el.class
puts "#{el[:name]} lives in #{el[:city]}"
end
Each of the puts:
{:name=>"Nate", :city=>"San Diego"}
Hash
Nate lives in San Diego
{: nameI" Nate:ET: cityI"San Diego;T
String
Completed 500 Internal Server Error in 349ms
TypeError - no implicit conversion of Symbol into Integer:
Right. The text below, from the gem documentation, explains it:
There is a Ruby class that maps to each Redis type, with methods for
each Redis API command. Note that calling new does not imply it's
actually a "new" value - it just creates a mapping between that Ruby
object and the corresponding Redis data structure, which may already
exist on the redis-server.
So I don't need to use lrange to get to the list. Using Redis::List.new('list_name', :marshal => true) will get me a handle to the list. Then I can iterate, add or remove items from the list.
Reading is helpful...
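For completeness: with :marshal => true, redis-objects stores each element as Marshal.dump(object), which is exactly the unreadable binary string that the raw lrange call returned. If you ever do fetch the raw strings, Marshal.load recovers the hashes; a stdlib-only sketch of what's happening (no Redis server needed):

```ruby
# What redis-objects writes to the list when :marshal => true is set:
raw = Marshal.dump({ :name => "Nate", :city => "San Diego" })
# raw is the binary blob that a plain LRANGE hands back.

# Round-trip it to get the hash again:
el = Marshal.load(raw)
puts "#{el[:name]} lives in #{el[:city]}"  # => "Nate lives in San Diego"
```

This is why iterating through Redis::List works while iterating the lrange result fails: the gem transparently calls Marshal.load on each element for you.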

How do I ignore the nil values in the loop with parsed values from Mechanize?

In my text file are a list of URLs. Using Mechanize I'm using that list to parse out the title and meta description. However, some of those URL pages don't have a meta description which stops my script with a nil error:
undefined method `[]' for nil:NilClass (NoMethodError)
I've read up and seen solutions if I were using Rails, but for Ruby I've only seen reject and compact as possible solutions to ignore nil values. I added compact at the end of the loop, but that doesn't seem to do anything.
require 'rubygems'
require 'mechanize'
File.readlines('parsethis.txt').each do |line|
page = Mechanize.new.get(line)
title = page.title
metadesc = page.at("head meta[name='description']")[:content]
puts "%s, %s, %s" % [line.chomp, title, metadesc]
end.compact!
It's just a list of urls in a text like this:
http://www.a.com
http://www.b.com
This is what will output in the console for example:
http://www.a.com, Title, This is a description.
If within the list of URLs there is no description or title on that particular page, it throws up the nil error. I don't want it to skip any urls, I want it to go through the whole list.
Here is one way to do it.
Edit (for the added requirement to not skip any URLs):
metadesc = page.at("head meta[name='description']")
puts "%s, %s, %s" % [line.chomp, title, metadesc ? metadesc[:content] : "N/A"]
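The same guard generalizes to any optional node. A small self-contained sketch of the pattern (a plain hash stands in for the Nokogiri/Mechanize node here, since both support [...] access; the "N/A" fallback is illustrative):

```ruby
# Fall back to "N/A" whenever a node or its attribute is missing,
# so every URL still produces an output line.
def line_for(url, title, meta_node)
  desc = meta_node ? meta_node[:content] : "N/A"
  "%s, %s, %s" % [url, title || "N/A", desc]
end

puts line_for("http://www.a.com", "Title", { content: "A description." })
# => "http://www.a.com, Title, A description."
puts line_for("http://www.b.com", nil, nil)
# => "http://www.b.com, N/A, N/A"
```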
This is untested but I'd do something like this:
require 'open-uri'
require 'nokogiri'
page_info = {}
File.foreach('parsethis.txt') { |url|
page = Nokogiri::HTML(open(url))
title = page.title
meta_desc = page.at("head meta[name='description']")
meta_desc_content = meta_desc ? meta_desc[:content] : nil
page_info[url] = {:title => title, :meta_desc => meta_desc_content}
}
page_info.each do |url, info|
puts [
url,
info[:title],
info[:meta_desc]
].join(', ')
end
File.foreach iteratively reads a file, returning each line individually.
page.title could return a nil if a page doesn't have a title; titles are optional in pages.
I break down accessing the meta-description into two steps. Meta tags are optional in HTML so they might not exist, at which point a nil would be returned. Trying to access a content= parameter would result in an exception. I think that's what you're seeing.
Instead, in my code, meta_desc_content is conditionally assigned a value if the meta-description tag was found, or nil.
The code populates the page_info hash with key/value pairs of the URL and its associated title and meta-description. I did it this way because a hash-of-hashes, or possibly an array-of-hashes, is a very convenient structure for all sorts of secondary manipulations, such as returning the information as JSON or inserting into a database.
As a second step the code iterates over that hash, retrieving each key/value pair. It then joins the values into a string and prints them.
There are lots of things in your code that are either wrong, or not how I'd do them:
File.readlines('parsethis.txt').each returns an array which you then have to iterate over. That isn't scalable, nor is it efficient. File.foreach is faster than File.readlines(...).each so get in the habit of using it unless you are absolutely sure you know why you should use readlines.
You use Mechanize for something that Nokogiri and OpenURI can do faster. Mechanize is a great tool if you are working with forms and need to navigate a site, but you're not doing that, so instead you're dragging around additional code-weight that isn't necessary. Don't do that; it leads to slow programs, among other things.
page.at("head meta[name='description']")[:content] is an exception in waiting. As I said above, meta-descriptions are not necessarily going to exist in a page. If it doesn't then you're trying to do nil[:content] which will definitely raise an exception. Instead, work your way down to the data you want so you can make sure that the meta-description exists before you try to get at its content.
You can't use compact or compact! the way you were. An each block doesn't return an array, which is the class you need for compact or compact!. You could have used map but the logic would have been messy and puts inside map is rarely used. (Probably shouldn't be used is more likely but that's a different subject.)

How do I read the content of an Excel spreadsheet using Ruby?

I am trying to read an Excel spreadsheet file with Ruby, but it is not reading the content of the file.
This is my script
book = Spreadsheet.open 'myexcel.xls';
sheet1 = book.worksheet 0
sheet1.each do |row|
puts row.inspect ;
puts row.format 2;
puts row[1];
exit;
end
It is giving me the following:
[DEPRECATED] By requiring 'parseexcel', 'parseexcel/parseexcel' and/or
'parseexcel/parser' you are loading a Compatibility layer which
provides a drop-in replacement for the ParseExcel library. This
code makes the reading of Spreadsheet documents less efficient and
will be removed in Spreadsheet version 1.0.0
#<Spreadsheet::Excel::Row:0xffffffdbc3e0d2 #worksheet=#<Spreadsheet::Excel::Worksheet:0xb79b8fe0> #outline_level=0 #idx=0 #hidden=false #height= #default_format= #formats= []>
#<Spreadsheet::Format:0xb79bc8ac>
nil
I need to get the actual content of file. What am I doing wrong?
It looks like row, whose class is Spreadsheet::Excel::Row, is effectively an Excel Range, and that it either includes Enumerable or at least exposes some enumerable behaviours (#each, for example).
So you might rewrite your script something like this:
require 'spreadsheet'
book = Spreadsheet.open('myexcel.xls')
sheet1 = book.worksheet('Sheet1') # can use an index or worksheet name
sheet1.each do |row|
break if row[0].nil? # if first cell empty
puts row.join(',') # looks like it calls "to_s" on each cell's Value
end
Note that I've parenthesised arguments, which is generally advisable these days, and removed the semi-colons, which are not necessary unless you're writing multiple statements on a line (which you should rarely - if ever - do).
It's probably a hangover from a larger script, but I'll point out that in the code given the book and sheet1 variables aren't really needed, and that Spreadsheet#open takes a block, so a more idiomatic Ruby version might be something like this:
require 'spreadsheet'
Spreadsheet.open('MyTestSheet.xls') do |book|
book.worksheet('Sheet1').each do |row|
break if row[0].nil?
puts row.join(',')
end
end
I don't think you need to require parseexcel; just require 'spreadsheet'.
Have you read the guide? It is super easy to follow.
Is it a one line file? If so you need:
puts row[0];
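Each spreadsheet row behaves like an array of cell values, which is why row[0] and row.join work in the answers above. A sketch of the break-on-empty-first-cell pattern, with plain arrays standing in for Spreadsheet::Excel::Row objects (so it runs without the gem or an .xls file):

```ruby
# Rows as arrays of cells; an empty first cell marks the end of the data region.
rows = [
  ["Name", "Age"],
  ["Zara", 87],
  [nil, nil],        # blank row: stop here
  ["ignored", 0]
]

lines = []
rows.each do |row|
  break if row[0].nil?
  lines << row.join(',')
end
puts lines
# => Name,Age
#    Zara,87
```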

How to parse SOAP response from ruby client?

I am learning Ruby and I have written the following code to find out how to consume SOAP services:
require 'soap/wsdlDriver'
wsdl="http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service=SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
weather=service.getTodaysBirthdays('1/26/2010')
The response that I get back is:
#<SOAP::Mapping::Object:0x80ac3714
{http://www.abundanttech.com/webservices/deadoralive} getTodaysBirthdaysResult=#<SOAP::Mapping::Object:0x80ac34a8
{http://www.w3.org/2001/XMLSchema}schema=#<SOAP::Mapping::Object:0x80ac3214
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2f6c
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac2cc4
{http://www.w3.org/2001/XMLSchema}choice=#<SOAP::Mapping::Object:0x80ac2a1c
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2774
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac24cc
{http://www.w3.org/2001/XMLSchema}sequence=#<SOAP::Mapping::Object:0x80ac2224
{http://www.w3.org/2001/XMLSchema}element=[#<SOAP::Mapping::Object:0x80ac1f7c>,
#<SOAP::Mapping::Object:0x80ac13ec>,
#<SOAP::Mapping::Object:0x80ac0a28>,
#<SOAP::Mapping::Object:0x80ac0078>,
#<SOAP::Mapping::Object:0x80abf6c8>,
#<SOAP::Mapping::Object:0x80abed18>]
>>>>>>> {urn:schemas-microsoft-com:xml-diffgram-v1}diffgram=#<SOAP::Mapping::Object:0x80abe6c4
{}NewDataSet=#<SOAP::Mapping::Object:0x80ac1220
{}Table=[#<SOAP::Mapping::Object:0x80ac75e4
{}FullName="Cully, Zara"
{}BirthDate="01/26/1892"
{}DeathDate="02/28/1979"
{}Age="(87)"
{}KnownFor="The Jeffersons"
{}DeadOrAlive="Dead">,
#<SOAP::Mapping::Object:0x80b778f4
{}FullName="Feiffer, Jules"
{}BirthDate="01/26/1929"
{}DeathDate=#<SOAP::Mapping::Object:0x80c7eaf4>
{}Age="81"
{}KnownFor="Cartoonists"
{}DeadOrAlive="Alive">]>>>>
I am having a great deal of difficulty figuring out how to parse and show the returned information in a nice table, or even just how to loop through the records and access each element (i.e. FullName, Age, etc.). I went through the whole "getTodaysBirthdaysResult.methods - Object.new.methods" routine and kept working down to try to figure out how to access the elements, but then I got to the array and got lost.
Any help that can be offered would be appreciated.
If you're going to parse the XML anyway, you might as well skip SOAP4r and go with Handsoap. Disclaimer: I'm one of the authors of Handsoap.
An example implementation:
# wsdl: http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl
DEADORALIVE_SERVICE_ENDPOINT = {
:uri => 'http://www.abundanttech.com/WebServices/DeadOrAlive/DeadOrAlive.asmx',
:version => 1
}
class DeadoraliveService < Handsoap::Service
endpoint DEADORALIVE_SERVICE_ENDPOINT
def on_create_document(doc)
# register namespaces for the request
doc.alias 'tns', 'http://www.abundanttech.com/webservices/deadoralive'
end
def on_response_document(doc)
# register namespaces for the response
doc.add_namespace 'ns', 'http://www.abundanttech.com/webservices/deadoralive'
end
# public methods
def get_todays_birthdays
soap_action = 'http://www.abundanttech.com/webservices/deadoralive/getTodaysBirthdays'
response = invoke('tns:getTodaysBirthdays', soap_action)
(response/"//NewDataSet/Table").map do |table|
{
:full_name => (table/"FullName").to_s,
:birth_date => Date.strptime((table/"BirthDate").to_s, "%m/%d/%Y"),
:death_date => Date.strptime((table/"DeathDate").to_s, "%m/%d/%Y"),
:age => (table/"Age").to_s.gsub(/^\(([\d]+)\)$/, '\1').to_i,
:known_for => (table/"KnownFor").to_s,
:alive? => (table/"DeadOrAlive").to_s == "Alive"
}
end
end
end
Usage:
DeadoraliveService.get_todays_birthdays
SOAP4R always returns a SOAP::Mapping::Object which is sometimes a bit difficult to work with unless you are just getting the hash values that you can access using hash notation like so
weather['fullName']
However, it does not work when you have an array of hashes. A workaround is to get the result in XML format instead of a SOAP::Mapping::Object. To do that I would modify your code as follows:
require 'soap/wsdlDriver'
wsdl="http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service=SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
service.return_response_as_xml = true
weather=service.getTodaysBirthdays('1/26/2010')
Now the above would give you an XML response, which you can parse using Nokogiri or REXML. Here is an example using REXML:
require 'rexml/document'
rexml = REXML::Document.new(weather)
birthdays = nil
rexml.each_recursive {|element| birthdays = element if element.name == 'getTodaysBirthdaysResult'}
birthdays.each_recursive{|element| puts "#{element.name} = #{element.text}" if element.text}
This will print out all elements that have any text.
So once you have created an XML document you can pretty much do anything, depending upon the methods of the library you choose, i.e. REXML or Nokogiri.
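To make the REXML route concrete, here's a self-contained sketch that parses an inline fragment shaped like the NewDataSet/Table rows in the dump above (the field names come from the response; the inline XML itself is illustrative):

```ruby
require 'rexml/document'

xml = <<~XML
  <NewDataSet>
    <Table>
      <FullName>Cully, Zara</FullName>
      <Age>(87)</Age>
      <DeadOrAlive>Dead</DeadOrAlive>
    </Table>
    <Table>
      <FullName>Feiffer, Jules</FullName>
      <Age>81</Age>
      <DeadOrAlive>Alive</DeadOrAlive>
    </Table>
  </NewDataSet>
XML

doc = REXML::Document.new(xml)
rows = doc.get_elements('//Table').map do |table|
  {
    full_name: table.elements['FullName'].text,
    alive:     table.elements['DeadOrAlive'].text == 'Alive'
  }
end
rows.each { |r| puts "#{r[:full_name]} - alive: #{r[:alive]}" }
```

From here, printing a table or building any other structure is just ordinary work on an array of hashes.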
Well, here's my suggestion.
The issue is, you have to snag the right part of the result, one that is something you can actually iterate over. Unfortunately, all the inspecting in the world won't help you, because it's a huge blob of unreadable text.
What I do is this:
File.open('myresult.yaml', 'w') {|f| f.write(result.to_yaml) }
This will be a much more human readable format. What you are probably looking for is something like this:
--- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id024 !ruby/object:XSD::QName
name: ListAddressBooksResult <-- Hash name, so it's result["ListAddressBooksResult"]
namespace: http://apiconnector.com
source:
- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id023 !ruby/object:XSD::QName
name: APIAddressBook <-- this bastard is enumerable :) YAY! so it's result["ListAddressBooksResult"]["APIAddressBook"].each
namespace: http://apiconnector.com
source:
- - !ruby/object:SOAP::Mapping::Object
The above is a result from DotMailer's API, which I spent the last hour trying to figure out how to enumerate over. The above is the technique I used to figure out what the heck is going on. I think it beats using REXML etc.; this way, I could do something like this:
result['ListAddressBooksResult']['APIAddressBook'].each {|book| puts book["Name"]}
Well, I hope this helps anyone else who is looking.
/jason
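The to_yaml dump-and-inspect trick works on any Ruby object with nothing beyond the stdlib. A sketch of the round trip (a plain hash stands in for the SOAP::Mapping::Object; the key names mirror the DotMailer example above):

```ruby
require 'yaml'

# Stand-in for a SOAP result object; dump it to a file you can read by eye.
result = { "ListAddressBooksResult" => { "APIAddressBook" => [{ "Name" => "Main" }] } }
File.write('myresult.yaml', result.to_yaml)

# Once you've spotted the right keys in the YAML, navigate to them directly.
reloaded = YAML.load_file('myresult.yaml')
reloaded["ListAddressBooksResult"]["APIAddressBook"].each { |book| puts book["Name"] }
# => Main
```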
