Parsing nodes with Nokogiri? - ruby

I'm parsing web pages and I want to get the link from the <img src> by finding the <div id="image">.
How do I do this in Nokogiri? I tried walking through the child nodes but it fails.
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
This is my code:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image").each do |node|
node.children().each do |c|
puts c.attr("src")
end
end
Any ideas?

Try this and let me know if it works for you
require 'nokogiri'
source = <<-HTML
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
HTML
doc = Nokogiri::HTML(source)
doc.css('div#image > div > img').each do |image|
puts image.attr('src')
end
Output:
img.jpg

Here is a great resource: http://ruby.bastardsbook.com/chapters/html-parsing/
Modifying an example a bit, I get this:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image img").each do |img|
puts img.attr("src")
end
That said, prefer the ID selector, #image, over the class selector, .image, when you can. An ID is meant to be unique within the page, so the selector is more precise, and it is often faster as well.

Related

How to use Ruby/Nokogiri to Strip <tr> and <td> tags not enclosed in <table> tag?

I'm dealing with some badly formed HTML where table elements aren't enclosed in a table tag, such as the following:
<div class="row">
<div class="large-12 columns main-content">
<tr>
<td colspan="4"><img src="../img/H006265.jpg"></td>
</tr><tr valign="top">
<td> </td>
</tr>
</div>
</div>
I'd like to get rid of the junk tags and end up with something like this:
<div class="row">
<div class="large-12 columns main-content">
<img src="../img/H006265.jpg">
</div>
</div>
There are legit tables elsewhere in the document, so I'm not able to just strip <tr> and <td> tags altogether, only those not enclosed in a <table> tag.
I've tried having Nokogiri parse it, thinking it would clean up the incorrect HTML, to no avail:
Nokogiri::HTML::DocumentFragment.parse(badly_formed_html_string)
You can use the parsed fragment to clean your HTML:
frag = Nokogiri::HTML::DocumentFragment.parse(badly_formed_html_string)
frag.css('tr').each do |tr|
tr.add_previous_sibling tr.children
tr.remove
end
frag.css('td').each do |td|
td.add_previous_sibling td.children
td.remove
end
puts frag.to_s
# <div class="row">
# <div class="large-12 columns main-content">
# <img src="../img/H006265.jpg">
#
# </div>
# </div>
Thanks to Uri's code for helping me find a good answer. His was close, but this did the trick for me, stripping those tags only when they are not enclosed in the expected parent tag:
def strip_tag_if_not_in_parent(doc, tag, parent)
doc.css(tag).each do |element|
if (element.parent.name != parent)
new_element = Nokogiri::HTML::DocumentFragment.parse(element.inner_html)
element.replace new_element
end
end
doc
end
doc = strip_tag_if_not_in_parent(doc, 'tr', 'table')
doc = strip_tag_if_not_in_parent(doc, 'td', 'tr')

How to get the ID of an element using Watir where the child contains the string I search for

<div class="wrapper">
<div id="minHeightBlock" style="min-height: 430px;">
<div class="borderbox"><div class="standaloneBox">
<div class="sysHeaderContainer clearfix"> … </div>
<div class="notesForGuests"> … </div>
<div class="filterBox clearfix"> … </div>
<div class="resListHeader"> … </div>
<div id="corporaContainer" class="fullList">
<div id="c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8" class="resItem clearfix">
<div class="resTitle">
<span id="filter-empty" class="statBall statFile empty" title="Status: Empty corpus"></span>
<span class="theText">
12321 corpora
</span>
</div>
<div class="resType"> … </div>
<div class="resSize"> … </div>
<div class="resPermission private"> … </div>
<div class="resDomain"> … </div>
<div class="resDescr"> … </div>
<div class="resDetails clearfix" style="display:none;"> … </div>
</div>
<div id="c-b8c0faba-e662-4998-836f-0ee58009b7fa" class="resItem clearfix"> … </div>
<div id="c-9d02b887-4835-4606-ad4b-775b39af9f48" class="resItem clearfix"> … </div>
<div id="c-021d3ba1-db03-4c4e-81a5-294737eb5b54" class="resItem clearfix"> … </div>
This is the HTML of the web page I'm trying to script using Watir. All I know is what kind of span text the element should contain. I have many of these elements, and I need to collect all of their ID values so I can use them in further actions.
I have commented the places in the above code to show what I know and what I need to get.
So far i have tried this code:
@b.div(:id, "pageHeader").link(:text, "Corpora").click
sleep 5
@b.div(:id, "corporaContainer").spans(:text => /TestAuto\s.*/).each do |span|
puts span.parent.attribute_value("id")
end
But nothing is printed. Maybe I'm doing something wrong. Please help me crack this nut.
Your attempt was close. The problem is that span.parent only goes up to the <div class="resTitle">. You need to go up one more parent:
@b.div(:id, "corporaContainer").spans(:text => /corpora/).each do |span|
puts span.parent.parent.attribute_value("id")
end
(Note that I changed the text in the locator of the spans since TestAuto\s.* did not match the sample html.)
Alternatively, I sometimes find it better to find the divs that contain the span. This way you do not have to worry about the number of parents changing:
p @b.divs(:class => 'resItem')
.find_all { |div| div.span(:text => /corpora/).exists? }
.collect { |div| div.id }
#=> ["c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8"]
Below is a working example. Note that there are 2 important things:
The list of results is loaded asynchronously. Therefore you need to wait for the list to finish loading before capturing the results. sleep(5) might work, but you are better off using an actual wait method (since it seems to take longer than 5 seconds).
Make sure the search text actually exists on the page. In the below example, there is no "12321 corpora" title that was mentioned in the sample html.
Example:
require 'watir-webdriver'
# Title to search for:
title_text = /UniAdm/
# Go to the Corpora page:
@b = Watir::Browser.new :ff
@b.goto "https://www.letsmt.eu/Corpora.aspx"
# Wait for the results to load:
container = @b.div(:id, "corporaContainer")
container.div(:class => 'resItem').wait_until_present
# Find the matching ids:
p container.divs(:class => 'resItem')
.find_all { |div| div.span(:class => 'theText', :text => title_text).exists? }
.collect { |div| div.id }
#=> ["c-87ee80a9-e529-48b2-92be-bc8d76375478", "c-f139e781-4789-41f9-82e8-914e0e3eff81", "c-e17641d2-9364-4e87-9047-ba35580dc32f"]

Use nokogiri with xpath

How can I use Nokogiri to fetch an image via XPath? My main problem is that the div may be present without containing an image:
image_node = @get_doc.xpath( '//*[@id="recaptcha_image"]/img/@src').map {|a| a.value }
#binding.pry
if image_node != nil
rec = Net::HTTP.get( URI.parse( "#{image_node['src']}" ) )
end
but I get
in `[]': can't convert String into Integer (TypeError)
What is the correct way to do this?
Part of the HTML:
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img *****>
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field"
style="width: 295px">
I recommend CSS over XPath for most HTML queries, and many XML ones. Using CSS makes this very "visible":
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img src="path_to_image.jpg">
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field" style="width: 295px">
EOT
doc.at('#recaptcha_widget img')['src'] # => "path_to_image.jpg"
How do I check whether I have the div but no image?
How do you check if you didn't have the embedded <img> tag inside the <div>? Break your lookup into two parts, and check for a nil:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img src="path_to_image.jpg">
</div>
<div id="recaptcha_image2">
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field" style="width: 295px">
EOT
img = doc.at('#recaptcha_widget img')
img_src = img['src'] # => "path_to_image.jpg"
If the <img> tag doesn't exist you'll get nil:
img = doc.at('#recaptcha_image2 img') # => nil
From that point you'd continue with a check to see if img was set:
if (img)
# ...do something...
end
Or, use a trailing rescue to catch the NoMethodError raised when [] is called on nil, assign nil to img_src, then test for it:
img_src = doc.at('#recaptcha_widget img')['src'] rescue nil # => "path_to_image.jpg"
img_src = doc.at('#recaptcha_image2 img')['src'] rescue nil # => nil
if (img_src)
# do something
end

Watir: How to retrieve all HTML elements that match an attribute? (class, id, title, etc)

I have a page that is dynamically created and displays a list of products with their prices. Since it's dynamic, the same code is reused to create each product's information, so they share the tags and same classes. For instance:
<div class="product">
<div class="name">Product A</div>
<div class="details">
<span class="description">Description A goes here...</span>
<span class="price">$ 180.00</span>
</div>
</div>
<div class="product">
<div class="name">Product B</div>
<div class="details">
<span class="description">Description B goes here...</span>
<span class="price">$ 43.50</span>
</div>
</div>
<div class="product">
<div class="name">Product C</div>
<div class="details">
<span class="description">Description C goes here...</span>
<span class="price">$ 51.85</span>
</div>
</div>
And so on.
What I need to do with Watir is retrieve the text of every span with class="price"; in this example: $ 180.00, $ 43.50 and $ 51.85.
I've been playing around with something like this:
@browser.span(:class, 'price').each do |row|, but it is not working.
I'm just starting to use loops in Watir. Your help is appreciated. Thank you!
You can use pluralized methods for retrieving collections - use spans instead of span:
@browser.spans(:class => "price")
This returns a span collection that behaves much like a Ruby array, so you can use Ruby's #each as you tried, but I would use #map in this situation:
texts = @browser.spans(:class => "price").map do |span|
span.text
end
puts texts
I would use the Symbol#to_proc trick to shorten that code even more:
texts = @browser.spans(:class => "price").map(&:text)
puts texts
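To see that the two forms are equivalent without starting a browser, here is a small pure-Ruby sketch. The Span struct is a hypothetical stand-in for a Watir span element; it only needs to respond to text.

```ruby
# Hypothetical stand-in for a Watir span; only the text reader matters here.
Span = Struct.new(:text)

spans = [Span.new('$ 180.00'), Span.new('$ 43.50'), Span.new('$ 51.85')]

# Explicit block form.
long_form = spans.map { |span| span.text }

# Symbol#to_proc form: &:text builds the same block from the method name.
short_form = spans.map(&:text)

puts short_form
```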

Ruby/Nokogiri inspect reveals more than class. I need the extra item inspect shows

In the following code:
page = Nokogiri::HTML($browser.html)
page_links = page.css("a").select
page_links.each do |link|
if not link.nil?
if not link['href'].nil? and !!link['href']["/about"]
puts link.class
puts link.inspect
end
end
end
the link.class outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623d3c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb623c7e name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb623c74 name="class" value="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623c6a name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb623c60 name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Text:0x..fdb623792 "PetSmart Winchester">]>
And link.inspect outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623666 name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb6235a8 name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb62359e name="class" value="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623594 name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb62358a name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Element:0x..fdb6230bc name="div" attributes=[#<Nokogiri::XML::Attr:0x..fdb62304e name="style" value="height:110px; width:110px;">] children=[#<Nokogiri::XML::Element:0x..fdb622e1e name="img" attributes=[#<Nokogiri::XML::Attr:0x..fdb622db0 name="style" value=" height: 110px; width: 110px;">, #<Nokogiri::XML::Attr:0x..fdb622da6 name="class" value="mja">, #<Nokogiri::XML::Attr:0x..fdb622d9c name="src" value="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">]>]>]>
In Nokogiri I can access the link text by link.content and the link url by link['href'] . Yet neither of these methods work for image source from the inspect results.
How can I get the img src within this example code that inspect is revealing?
UPDATE: HERE IS THE HTML CODE
<div class="HWb">
<div class="erb">
<div class="ubb">
<div role="button" class="a-f-e c-b c-b-T c-b-Oe c-b-H-ra L0a X9" tabindex="0"
data-placeid="6817440171144926830" data-source="lo-gp" data-inline="true"
data-tooltip-delay="600" data-tooltip-align="b,l" data-oid="104882190640970316938"
data-size="small">
<span class="TIa c-b-fa"></span>
</div>
</div>
<h3 class="drb">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP"
action-type="8">PetSmart Winchester</a>
</h3>
</div>
<div class="Qbb">
<span class="vqb SIa">Pet Store</span>
<span class="lja SIa">
<a href="//www.google.com/url?sa=D&oi=plus&q=https://maps.google.com/maps?q%3DPetsmart%2Bloc:22601%26numal%3D1%26hl%3Den-US%26gl%3DUS%26mix%3D2%26opth%3Dplatter_request:2%26ie%3DUTF8%26cid%3D6817440171144926830%26iwloc%3DA"
target="_blank" class="a-n uqb">2310 Legge Boulevard, Winchester, VA</a>
</span>
<span class="SIa">(540) 662-5544</span>
</div>
<div class="crb">
<div class="Pbb a-f-e">
<div class="Fbb">
<div class="cca">
<div class="tob">
<div class="xob">“Do not bother with the grooming salon, the staff are unusually stupid.
Otherwise the store is a typical petsmart.”</div>
</div>
</div>
</div>
</div>
<div class="dWa">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP"
action-type="8"><div style="height:110px; width:110px;"><img src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg" class="mja" style=" height: 110px; width: 110px;"></div></a>
</div>
</div>
Without the HTML you're making it a lot harder, but after some digging into the inspect output, I think I have a reasonable HTML snippet.
This is how I'd go about getting to the <img src="..."> tag:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a action-type="8" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP" target="_top" href="./104882190640970316938/about">
<div style="height:110px; width:110px;">
<img style=" height: 110px; width: 110px;" class="mja" src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">
</div>
</a>
EOT
doc.at('img')['src'] # => "https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg"
You'll need to take the time to improve your question and provide more detail if that doesn't work.
If you are not sure whether you will have 0, 1 or 1+ instances of a tag, use search because it returns a NodeSet, which acts like an Array, making it easy to deal with no, single or multiple occurrences:
doc.search('img').map{ |img| img['src'] }
will return all the <img src="..."> values in the document in an array. You can iterate over those easily or use empty? to see if there are no hits:
doc.search('img').map{ |img| img['src'] }.each do |src|
# do something with src if any are found.
end
If it's possible you'll have <img> tags without the src="..." attribute, use compact to filter them out before iterating:
doc.search('img').map{ |img| img['src'] }.compact.each do |src|
# do something with src if any are found.
end
If you only expect 0 or 1 occurrence, try:
src = doc.at('img') && doc.at('img')['src']
as in:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img src="blah">
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> "blah"
or, without the src attribute:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
or missing the <img> tag entirely:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
If you want to continue to use an if block:
if doc.at('img')
puts doc.at('img')['src']
end
will accomplish what your:
if not doc.at('img').nil?
puts doc.at('img')['src']
end
accomplishes, but in a more straightforward and concise manner, while maintaining readability.
The downside to doing two at lookups is it can be costly in big documents, especially inside a loop. You could get all Perlish and use:
if (img = doc.at('img'))
puts img['src']
end
but that's not really the Ruby way. For clarity and long-term maintenance, I'd probably use:
img = doc.at('img')
if (img)
puts img['src']
end
but that exposes the img variable, cluttering up things. It's programmer's choice at that point.
Your two outputs look like they come from two different links (i.e., each one shows both the link.class and the link.inspect of a single link).
Assuming we are talking about getting the image source in the second output, it looks like the HTML is something like:
<div><img src="image_src" /></div>
Assuming that is true, then you need to do:
puts link.at_css("img")['src']
I have found if you take the results from link.inspect, since they are a string, and use regex you can grab the image URL.
link.inspect[/http.*com.*"/].chop # Since all other urls are relative ./
I don't believe this is the best method. I will try working with the other answers first.
