hpricot: get image from URL and parse element - ruby

i am trying to get the exact URL of an image inside a page and then download it. i haven't yet gotten to the download point, as i am trying to isolate the URL of the image. here is the code:
#!/usr/bin/ruby -w
require 'rubygems'
require 'hpricot'
require 'open-uri'
raw = Hpricot(open("http://www.amazon.com/Weezer/dp/B000003TAW/"))
ele = raw.search("img[@src*=jpg]").first
img = ele.match("(\")(.*?)(\")").captures
puts img[1]
when i run it as it is, i receive:
undefined method `match' for #<Hpricot::Elem:0xb731948c> (NoMethodError)
if i comment out the last 2 lines and add
puts ele
i get:
<img src="http://ecx.images-amazon.com/images/I/51rpVNqXmYL._SL500_AA240_.jpg" style="display:none;" />
which is the correct portion of the page i want to parse. however, the error is when i try to get just the "http://ecx.images-amazon.com/images/I/51rpVNqXmYL._SL500_AA240_.jpg" style="display:none;" part.
i am not totally sure why it can't perform a match, as I understand the search i am running should be getting an array of the image elements and returning the first. so i assumed that i could not run the match on the entire array, so i tried
img = ele[1].match("(\")(.*?)(\")").captures
puts img
and that returns
undefined method `match' for nil:NilClass (NoMethodError)
i am lost. please excuse my ignorance, as i am just beginning to learn ruby. any help is appreciated.

Change this line:
img = ele.match("(\")(.*?)(\")").captures
To:
img = ele[:src]
The reason for the errors is that an Hpricot::Elem isn't a String. Try:
ele.respond_to? :match
and you get false.
However, you could do:
ele.to_s.match("(\")(.*?)(\")").captures[1]
The secret is in the to_s call, which turns the element into its HTML string before matching.
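The difference can be reproduced without installing Hpricot. The following is only a minimal sketch: FakeElem is a hypothetical stand-in that mimics the one relevant behaviour of Hpricot::Elem (it is not a String, but to_s returns the tag's HTML), and extract_src and the example.com URL are made up for illustration.

```ruby
# FakeElem is a hypothetical stand-in for Hpricot::Elem:
# it is not a String, but its to_s returns the element's HTML.
class FakeElem
  def initialize(html)
    @html = html
  end

  def to_s
    @html
  end
end

# Converts the element to a String first, then takes the text
# between the first pair of double quotes (the second capture group).
def extract_src(elem)
  elem.to_s.match(/(")(.*?)(")/).captures[1]
end

ele = FakeElem.new('<img src="http://example.com/cover.jpg" style="display:none;" />')

# Plain objects do not define match, which is what caused the NoMethodError.
puts ele.respond_to?(:match)
puts extract_src(ele)
```

With the real library, ele[:src] remains the simpler choice because it does not depend on the order of the attributes.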

Related

How to scrape the next page in ruby

I am trying to scrape the next page of the website called https://www.jobsatosu.com/postings/search. Because there are many jobs, there are many pages. Our team successfully scraped the first page like this:
def initialize
  @agent_menu = Mechanize.new
  @page = @agent_menu.get(PAGE_URL)
  @form = @page.forms[0]
end
I am working on trying to scrape the next page. Also, we were told to use Nokogiri and Mechanize in Ruby. I just have to scrape the next page and do not have to parse it.
This is what I did:
def next_page
  @page_num += 1
  new_url = "https://www.jobsatosu.com/postings/search?page=#{@page_num}"
  @new_page = @agent_menu.get(new_url)
  @new_form = @new_page.forms[0]
end
I made a single @page_num instance variable for all the methods to share. Each time someone calls the method, it is incremented by 1, the new URL is built, and the fetched page is stored in @new_page.
I haven't tested this out, but any thoughts on this code?
You need to initialize @page_num = 0 before using it.
The first time through, @page_num is nil, so @page_num += 1 raises an exception:
NoMethodError: undefined method '+' for nil:NilClass
Ruby normally doesn't require you to declare a variable before using it, but in this case you need to initialize it first.
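A network-free sketch of the point above, under some assumptions: BrokenCounter and PageCounter are hypothetical classes that only build the URL instead of calling Mechanize, and the starting value of 0 follows the suggestion in this answer (start at 1 if the first page has already been fetched as page 1).

```ruby
# Hypothetical class that never initializes @page_num.
class BrokenCounter
  def bump
    @page_num += 1 # @page_num is nil here, so this calls nil + 1
  end
end

# Hypothetical class that initializes the counter in the constructor.
class PageCounter
  BASE_URL = "https://www.jobsatosu.com/postings/search"

  def initialize(start = 0)
    @page_num = start
  end

  # Advances the counter and returns the URL of the next results page.
  def next_page_url
    @page_num += 1
    "#{BASE_URL}?page=#{@page_num}"
  end
end

begin
  BrokenCounter.new.bump
rescue NoMethodError => e
  puts "uninitialized counter: #{e.class}"
end

counter = PageCounter.new
puts counter.next_page_url
puts counter.next_page_url
```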

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is some example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.
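Since the real markup is not available, here is a small offline sketch of how an XPath of this shape selects a node and what happens when the ID is missing. It makes several assumptions: the markup in SAMPLE is invented, xpath_text is a hypothetical helper, and REXML (shipped with standard Ruby installs) is swapped in for Nokogiri only so that the example runs without extra gems. The same XPath string would be passed to Nokogiri's xpath method.

```ruby
require 'rexml/document'

# Invented, well-formed markup; the IDs on the real page are different.
SAMPLE = <<~XML
  <html><body>
    <div id="item_1">
      <div>first</div>
      <div>second</div>
    </div>
  </body></html>
XML

# Returns the text of the first node matching the XPath, or nil if none matches.
def xpath_text(xml, path)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, path)
  node && node.text
end

# An existing ID yields the second child div; a missing ID yields nil.
puts xpath_text(SAMPLE, "//*[@id='item_1']/div[2]")
puts xpath_text(SAMPLE, "//*[@id='missing']/div[2]").inspect
```

A nil result like the second one is what a query for a nonexistent ID looks like, which is why listing the IDs that are actually on the page is a useful first step.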

Watir-Webdriver: How can I get the size (file size) of an website image

I have the following html code:
I saw that for Watir-webdriver the "Watir::Image.file_size" method is not currently supported.
I found out that the "Watir-Classic/Image.rb" has the same method, and it seems that can be used.
# this method returns the filesize of the image, as an int
def file_size
  assert_exists
  @o.invoke("fileSize").to_i
end
I created a method that should retrieve the image size, but it seems I am not initializing the object correctly. Here is my code from the method:
img_src="/location/on_the_server/image"
chart_image = Watir::Image.new(:src, img_src)
puts chart_image.file_size
The problem is that I receive the following error:
"ArgumentError: invalid argument "/location/on_the_server/image""
I saw that for initialization the object requires (container,specifiers). I tried to change the initialization line to "chart_image = Watir::Image.new(img_src, :src)" but the error keeps appearing.
Could anyone tell me what I am doing wrong?
Is there another way to get the file size of an image from a website?
Thank you.
You should not be initializing Watir::Image directly. Instead, you should use the image() method of a browser or element object.
#Assuming that browser = Watir::Browser that is open
img_src="/location/on_the_server/image"
chart_image = browser.image(:src, img_src)
puts chart_image.file_size
Update - Download Image to Determine File Size:
You could download the image using open-uri (or similar) and then use Ruby's File class to determine the size:
require 'watir-webdriver'
require "open-uri"
#Specify where to save the image
save_file = 'C:\Users\my_user\Desktop\image.png'
#Get the src of the image you want. In this example getting the first image on Google.
browser = Watir::Browser.new
browser.goto('www.google.ca')
image_location = browser.image.src
#Save the file
File.open(save_file, 'wb') do |fo|
fo.write open(image_location, :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE).read
end
#Output the size
puts File.size(save_file)
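The size check itself can be tried without a browser or a network connection. This is a minimal sketch under stated assumptions: saved_file_size is a hypothetical helper, and the placeholder bytes stand in for image data that would normally come from the download step. Note that File.size already returns the size in bytes; calling .size on that result would instead return the byte width of the Integer itself.

```ruby
require 'tempfile'

# Writes the given bytes to a temporary file in binary mode
# and returns the size of that file in bytes.
def saved_file_size(bytes)
  Tempfile.create(['image', '.png']) do |file|
    file.binmode
    file.write(bytes)
    file.flush
    File.size(file.path)
  end
end

# 1024 placeholder bytes, not a real image.
fake_image = "x" * 1024
puts saved_file_size(fake_image)
```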

How to insert a string to a text field using mechanize in ruby?

I know this is a very simple question, but I've been stuck for an hour and I just can't understand how this works.
I need to scrape some stuff from my school's library so I need to insert 'CE' to a text field and then click on a link with text 'Clasificación'. The output is what I am going to use to work. So here is my code.
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
url = 'http://biblio02.eld.edu.mx/janium-bin/busqueda_rapida.pl?Id=20110720161008#'
searchStr = 'CE'
agent = Mechanize.new
page = agent.get(url)
searchForm = page.form_with(:method => 'post')
searchForm['buscar'] = searchStr
clasificacionLink = page.link_with(:href => "javascript:onClick=set_index_and_submit(\'51\');").click
page = agent.submit(searchForm,clasificacionLink)
When I run it, it gives me this error
janium.rb:31: undefined method `[]=' for nil:NilClass (NoMethodError)
Thanks!
I think your problem is actually on line 13, not 31, and I'll even tell you why I think that. Not only does your script not have 31 lines but, from the fine manual:
form_with(criteria)
Find a single form matching criteria.
There are several forms on that page that have method="post". Apparently Mechanize returns nil when it can't exactly match the form_with criteria including the single part mentioned in the documentation; so, if your criteria matches more than one thing, form_with returns nil instead of choosing one of the options and you end up trying to do this:
nil['buscar'] = searchStr
But nil doesn't have a []= method so you get your NoMethodError.
If you use this:
searchForm = page.form_with(:name => 'forma')
you'll get past the first part as there is exactly one form with name="forma" on that page. Then you'll have trouble with this:
clasificacionLink = page.link_with(:href => "javascript:onClick=set_index_and_submit(\'51\');").click
page = agent.submit(searchForm, clasificacionLink)
as Mechanize doesn't know what to do with JavaScript (at least mine doesn't). But if you use just this:
page = agent.submit(searchForm)
you'll get a page and then you can continue building and debugging your script.
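One way to make this kind of failure easier to spot while debugging is to check the lookup result before using it. The sketch below is only an illustration: set_field is a hypothetical helper, and a plain Hash stands in for the Mechanize form so that the example runs without the gem.

```ruby
# Hypothetical helper: fails with a descriptive message when the
# form lookup returned nil, instead of a NoMethodError on []=.
def set_field(form, name, value)
  raise ArgumentError, "no form matched; check the form_with criteria" if form.nil?
  form[name] = value
  form
end

# A Hash stands in for the form object here.
puts set_field({}, 'buscar', 'CE').inspect

begin
  set_field(nil, 'buscar', 'CE')
rescue ArgumentError => e
  puts e.message
end
```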
mu's answer sounds reasonable. I am not sure if this is strictly necessary, but you might also try putting square brackets around searchStr:
searchForm['buscar'] = [searchStr]

Ruby: Problems using Mechanize to access my form!

Just for fun, I wrote a very small rails blog (just a hello world).
Now I want to create a post using mechanize.
So I created a Ruby Prog and started coding.
Here is my problem:
Rails creates my form element including all inputs.
In HTML my inputs look like this:
<input type="text" size="30" name="post[title]" id="post_title">
or
<textarea rows="20" name="post[description]" id="post_description" cols="40"></textarea>
Well...
Here is my Ruby Prog using Mechanize:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://localhost:3000/posts/new')
target_form = page.form_with(:class => 'new_post')
target_form.post[title] = "test"
target_form.post[description] = "test"
page = agent.submit(target_form)
puts "end"
I know where my error is but I don't know how to fix it.
At target_form.post[title] = "test" it crashes because of
undefined method `name' for nil:NilClass (NoMethodError)
I think (please correct me) it's because of the input name, since it is post[title] instead of only post, right?
How can I fix it?
How about
target_form.field_with(:name => "post[title]").value = "test"
target_form.field_with(:name => "post[description]").value = "test"
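The reason this works is that post[title] is simply the literal value of the name attribute. Written as target_form.post[title], Ruby instead parses it as a call to a method named post followed by an index with a variable named title. The sketch below illustrates the idea without Mechanize or a running Rails app. Both helpers are hypothetical: post_fields uses a Hash as a stand-in for the form, and encoded_body shows how such names would appear in a URL-encoded request body.

```ruby
require 'uri'

# A Hash stands in for the form; each key is the literal name attribute.
def post_fields(title, description)
  { "post[title]" => title, "post[description]" => description }
end

# Encodes the fields the way a URL-encoded form body would carry them.
def encoded_body(fields)
  URI.encode_www_form(fields)
end

puts encoded_body(post_fields("test", "test"))
```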

Resources