Extract text by row using Mechanize - ruby

I'm trying to extract some data from a page.
My page is:
<div class="row">
<div class="title">
Orange
</div>
<div class="color">
orange
</div>
</div>
<div class="row">
<div class="title">
Banana
</div>
<div class="color">
yellow
</div>
</div>
I want to extract title and color by row.
This is my script:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get("#{url}")
page.search("div.row").each do |row|
title = row.at("div.title a")
email = row.at("div.color a")
puts title
puts email
end
It works but results are like this:
Orange
orange
Banana
yellow
I would like to extract only texts.
I tried title = row.at("div.title a").text, but I get an error: undefined method 'text' for nil:NilClass (NoMethodError)
Any idea? thx

Related

How to get ID of an element using Watir where the child contains the string i search for

<div class="wrapper">
<div id="minHeightBlock" style="min-height: 430px;">
<div class="borderbox"><div class="standaloneBox">
<div class="sysHeaderContainer clearfix"> … </div>
<div class="notesForGuests"> … </div>
<div class="filterBox clearfix"> … </div>
<div class="resListHeader"> … </div>
<div id="corporaContainer" class="fullList">
<div id="c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8" class="resItem clearfix">
<div class="resTitle">
<span id="filter-empty" class="statBall statFile empty" title="Status: Empty corpus"></span>
<span class="theText">
12321 corpora
</span>
</div>
<div class="resType"> … </div>
<div class="resSize"> … </div>
<div class="resPermission private"> … </div>
<div class="resDomain"> … </div>
<div class="resDescr"> … </div>
<div class="resDetails clearfix" style="display:none;"> … </div>
</div>
<div id="c-b8c0faba-e662-4998-836f-0ee58009b7fa" class="resItem clearfix"> … </div>
<div id="c-9d02b887-4835-4606-ad4b-775b39af9f48" class="resItem clearfix"> … </div>
<div id="c-021d3ba1-db03-4c4e-81a5-294737eb5b54" class="resItem clearfix"> … </div>
This is the code of the webpage I'm trying to script using Watir. All I know is what kind of span text the element should contain. I have many of these elements, and I need to collect all of the element ID values so I can use them in further actions.
I have commented the places in the above code to show what I know and what I need to get.
So far i have tried this code:
@b.div(:id, "pageHeader").link(:text, "Corpora").click
sleep 5
@b.div(:id, "corporaContainer").spans(:text => /TestAuto\s.*/).each do |span|
  puts span.parent.attribute_value("id")
end
But nothing is output. Maybe I'm doing something wrong. Help me crack this nut.
Your attempt was close. The problem is that span.parent only goes up to the <div class="resTitle">. You need to go up one more parent:
@b.div(:id, "corporaContainer").spans(:text => /corpora/).each do |span|
  puts span.parent.parent.attribute_value("id")
end
(Note that I changed the text in the locator of the spans since TestAuto\s.* did not match the sample html.)
Alternatively, I sometimes find it better to find the divs that contain the span. This way you do not have to worry about the number of parents changing:
p @b.divs(:class => 'resItem')
  .find_all { |div| div.span(:text => /corpora/).exists? }
  .collect { |div| div.id }
#=> ["c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8"]
Below is a working example. Note that there are 2 important things:
The list of results is loaded asynchronously. Therefore you need to wait for the list to finish loading before capturing the results. sleep(5) might work, but you are better off using an actual wait method (since it seems to take longer than 5 seconds).
Make sure the search text actually exists on the page. In the below example, there is no "12321 corpora" title that was mentioned in the sample html.
Example:
require 'watir-webdriver'
# Title to search for:
title_text = /UniAdm/
# Go to the Corpora page:
@b = Watir::Browser.new :ff
@b.goto "https://www.letsmt.eu/Corpora.aspx"
# Wait for the results to load:
container = @b.div(:id, "corporaContainer")
container.div(:class => 'resItem').wait_until_present
# Find the matching ids:
p container.divs(:class => 'resItem')
.find_all { |div| div.span(:class => 'theText', :text => title_text).exists? }
.collect { |div| div.id }
#=> ["c-87ee80a9-e529-48b2-92be-bc8d76375478", "c-f139e781-4789-41f9-82e8-914e0e3eff81", "c-e17641d2-9364-4e87-9047-ba35580dc32f"]

Watir: How to retrieve all HTML elements that match an attribute? (class, id, title, etc)

I have a page that is dynamically created and displays a list of products with their prices. Since it's dynamic, the same code is reused to create each product's information, so they share the tags and same classes. For instance:
<div class="product">
<div class="name">Product A</div>
<div class="details">
<span class="description">Description A goes here...</span>
<span class="price">$ 180.00</span>
</div>
</div>
<div class="product">
<div class="name">Product B</div>
<div class="details">
<span class="description">Description B goes here...</span>
<span class="price">$ 43.50</span>
</div>
</div>
<div class="product">
<div class="name">Product C</div>
<div class="details">
<span class="description">Description C goes here...</span>
<span class="price">$ 51.85</span>
</div>
</div>
And so on.
What I need to do with Watir is retrieve all the texts inside the spans with class="price", in this example: $ 180.00, $ 43.50 and $ 51.85.
I've been playing around with something like this:
@browser.span(:class, 'price').each do |row|, but it is not working.
I'm just starting to use loops in Watir. Your help is appreciated. Thank you!
You can use pluralized methods for retrieving collections - use spans instead of span:
@browser.spans(:class => "price")
This returns a span collection object that behaves much like a Ruby array, so you can use Ruby's #each as you tried, but I would use #map in this situation:
texts = @browser.spans(:class => "price").map do |span|
  span.text
end
puts texts
I would use the Symbol#to_proc trick to shorten that code even more:
texts = @browser.spans(:class => "price").map(&:text)
puts texts
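To see what the two forms of map do without starting a browser, here is a plain-Ruby sketch. FakeSpan is a hypothetical stand-in for a Watir span (it only responds to text) and is not part of Watir. The prices are the ones from the sample markup.

```ruby
# FakeSpan is a hypothetical stand-in for a Watir span element.
FakeSpan = Struct.new(:text)

spans = [
  FakeSpan.new('$ 180.00'),
  FakeSpan.new('$ 43.50'),
  FakeSpan.new('$ 51.85')
]

# Block form: call text on each element and collect the results.
texts_block = spans.map { |span| span.text }

# Symbol#to_proc form: &:text builds an equivalent block.
texts_short = spans.map(&:text)

puts texts_short
```

Both forms return the same array of strings. The short form is only a notational convenience.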

Parsing nodes with Nokogiri?

I'm parsing web pages and I want to get the link from the <img src> by finding the <div id="image">.
How do I do this in Nokogiri? I tried walking through the child nodes but it fails.
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
This is my code:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image").each do |node|
node.children().each do |c|
puts c.attr("src")
end
end
Any ideas?
Try this and let me know if it works for you
require 'nokogiri'
source = <<-HTML
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
HTML
doc = Nokogiri::HTML(source)
doc.css('div#image > div > img').each do |image|
puts image.attr('src')
end
Output:
img.jpg
Here is a great resource: http://ruby.bastardsbook.com/chapters/html-parsing/
Modifying an example a bit, I get this:
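require 'nokogiri'
require 'open-uri'
# "site.com" is the question's placeholder; use a full URL including the scheme.
doc = Nokogiri::HTML(URI.open("site.com"))
doc.css("div.image img").each do |img|
  puts img.attr("src")
end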
Where you can, though, use the ID selector, #image, rather than the class selector, .image. An ID should be unique within the page, so it identifies the element unambiguously, and the lookup is often faster.

In ruby when I try mytext.include? (">Model number<") is returning false

In Ruby, mytext.include?(">Model number<") returns false,
but mytext.include?("Model number") returns true.
What is wrong with the first condition?
mytext contains the string "Model number" between ">" and "<".
This is relevant HTML:
<div class="bucket"> <div class="h1"><strong>Product Specifications</strong></div> <div class="content"> <div class="tsSectionHeader">Product Information</div> <div class="tsTable"> <div class="tsRow"><span class="tsLabel">Model number</span><span>516C</span></div> <div class="tsRow"><span class="tsLabel">Maximum weight recommendation</span><span>35 Pounds</span></div> <div class="tsRow"><span class="tsLabel">Material Type</span><span>Wood</span></div> </div> </div> </div>
You have to learn some HTML. The > and < characters are part of the span tags, <span></span>, not part of the text.
This is where the text appears:
<span class="tsLabel">Model number</span>
So the span has the text Model number. You can get that text using Watir with this:
browser.span(:class => "tsLabel").text
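The difference can be sketched in plain Ruby, assuming mytext holds the page's visible text rather than its HTML source (the question does not say which). The regex-based tag stripping below is only a rough illustration; a real page should go through the browser or an HTML parser.

```ruby
# A fragment of the HTML from the question.
html = '<span class="tsLabel">Model number</span><span>516C</span>'

# Rough approximation of the visible text: drop anything that looks like a tag.
# (An assumption for illustration only, not how Watir computes text.)
visible_text = html.gsub(/<[^>]*>/, ' ').squeeze(' ').strip

puts html.include?('>Model number<')          # the source contains the tag brackets
puts visible_text.include?('>Model number<')  # the visible text does not
puts visible_text.include?('Model number')    # the label itself is still there
```

If mytext really is the visible text, searching for "Model number" without the angle brackets is the check that can succeed.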

get div nested in div element using Nokogiri

I want to parse the following HTML with Nokogiri and get the following result.
event_name = "folk concert 2"
event_link = "http://www.douban.com/event/12761580/"
event_date = "20th,11,2010"
I know doc.xpath('//div[@class="nof clearfix"]') gets each div element, but how should I proceed to get each attribute, like event_name, and especially the date?
HTML
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761580/">folk concert 2</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date:25th,11,2010<br/>
</div>
</div>
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761581/">folk concert</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date:10th,11,2010<br/>
</div>
</div>
I don't know XPath; I prefer CSS selectors, which make more sense to me. This tutorial might be useful for you.
require 'rubygems'
require 'nokogiri'
require 'pp'
Event = Struct.new :name , :link , :date
doc = Nokogiri::HTML DATA
events = doc.css("div.nof.clearfix").map do |eventnode|
name = eventnode.at_css("h2 a").text.strip
link = eventnode.at_css("h2 a")['href']
date = eventnode.at_css("div.pl.intro").text.strip
Event.new name , link , date
end
pp events
__END__
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761580/">folk concert 2</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date: 25th,11,2010<br/>
</div>
</div>
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761581/">folk concert</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date: 10th,11,2010<br/>
</div>
</div>
This outputs:
[#<struct Event
name="folk concert 2",
link="http://www.douban.com/event/12761580/",
date="Date: 25th,11,2010">,
#<struct Event
name="folk concert",
link="http://www.douban.com/event/12761581/",
date="Date: 10th,11,2010">]
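The date in this output still carries the "Date:" label, while the question asked for the bare date. A minimal sketch of one way to drop it, in plain Ruby: raw_date stands for the string that eventnode.at_css("div.pl.intro").text.strip returns, and both variable names are illustrative.

```ruby
# The string produced for the first event in the output above.
raw_date = 'Date: 25th,11,2010'

# Remove a leading "Date:" label and any whitespace that follows it.
event_date = raw_date.sub(/\ADate:\s*/, '')

puts event_date
```

Because \s* also matches zero characters, the same substitution should handle the question's markup, where no space follows the colon.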
