How do I use Nokogiri to scrape text from an image tag? - ruby

I need to get text from a list of image tags that are formatted like this:
<img src="/images/TextImage.ashx?text=Richmond" style="border-width:0px;" class="">
When I enter the XPath into Nokogiri, I get:
[#<Nokogiri::XML::Element:0x80513954 name="img" attributes=[#<Nokogiri::XML::Attr:0x805138dc name="src" value="/images/TextImage.ashx?text=Richmond">, #<Nokogiri::XML::Attr:0x805138b4 name="style" value="border-width:0px;">]>]
Is there any way that I can tell Nokogiri to return "Richmond"? I'm looking for a method that will return the text after a certain string. If there is not a way to get only "Richmond", how do I get it to return the value?

You can extract the src attribute with an xpath expression like
src = doc.at_xpath '//img/#src'
After that, you’ll need to extract the name from the attribute, probably with a regex.
For example (this may need to be more involved, depending on what formats are possible in the src attribute in your HTML page):
/\?text=(.*)/ =~ src
puts $1

Related

Excluding contents of <span> from text using Waitr

Watir
mytext =browser.element(:xpath => '//*[#id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have a more complicated HTML, I find it can be easier to deal with this using Nokogiri as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[#id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
This is useful for top level text.
If there is only one h1, you can ommit id
#b.h1.text.remove(#b.h1.children.collect(&:text).join(' '))
Or specify it if there are more
#b.h1(id: 'gold').text.remove(#b.h1.children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(#b.h1) to get it
def get_top_text(el)
el.text.chomp(#b.h1.children.collect(&:text).join(' '))
end

How to scarpe the href using Nokogiri

I have a variable e which stores a Nokogiri::XML::Element object.
when I execute puts e I get on the screen the following:
<h3 class="fixed-recipe-card__h3">
<a href="https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/" data-content-provider-id="" data-internal-referrer-link="hub recipe" class="fixed-recipe-card__title-link">
<span class="fixed-recipe-card__title-link">Chocolate Covered Strawberries</span>
</a>
</h3>
I would like to scrape this part https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/
How can I do this using Nokogiri
If you want to extract the link, you can use:
e.at_css("a").attributes["href"].value
.at_css returns the first element matching the CSS selector (another Nokogiri::XML::Element). To get a list of all matching elements, use .css instead.
.attributes gives you a hash mapping attribute name to Nokogiri::XML::Attr. Once you look up the desired attribute in this hash (href), you can call .value to get the actual text value.

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.price element and exclude the text from the inner span.sale one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('sale')
price.search('.sale-text').remove
price.text
However Capybara navigates rather than removes nodes. I knew something like price.text would grab text from all sub elements, so I tried to use xpath to be more specific. p.find(:xpath, "//span[#class='sale']", :match => :first).text. However this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single statement way to do this with Capybara since the DOMs concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore just subtract that text from the whole text
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL)
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment

select a word in a text blob in ruby based on a pattern

I have a text blob and I would like to select URL's based on whether they have .png or .jpg. I would like to select the entire word based on a pattern.
For example in this blob:
width='17'></a> <a href='http://click.e.groupon.com/? qs=94bee0ddf93da5b3903921bfbe17116f859915d3a978c042430abbcd51be55d8df40eceba3b1c44e' style=\"text-decoration: none;\">\n<img alt='Facebook' border='0' height='18' src='http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg' style='display: i
I'd like to select the image:
http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg
Can I use nokogiri on an html text blob?
Using Nokogiri and XPath:
frag = Nokogiri::HTML.fragment(str) # Don't construct an entire HTML document
images = frag.xpath('.//img/#src').map(&:text).grep /\.(png|jpg|jpeg)\z/
The XPath says:
.// — anywhere in this fragment
img — find all the <img> elements
/#src — now find the src attribute of each
Then we:
map(&:text) – convert all the Nokogiri::XML::Attr to the value of the attribute.
grep - find only those strings in the array that end with the appropriate text.
Yes, you can use nokogiri, and you should!
Here's a simple snippet:
require "nokogiri"
str = "....your blob"
html_doc = Nokogiri::HTML(str)
html_doc.css("a").collect{|e| e.attributes["href"].value}.select{|e| e.index(".png") || e.index(".jpeg") }
If you only want to find urls ending in .jpg or .png a pattern like this should do it.
https?:\/\/.*?\.(?:jpg|png)

Parsing inner tags using Nokogiri

I'm stuck not being able to parse irregularly embedded html tags. Is there a way to remove all html tags from a node and retain all text?
I'm using the code:
rows = doc.search('//table[#id="table_1"]/tbody/tr')
details = rows.collect do |row|
detail = {}
[
[:word, 'td[1]/text()'],
[:meaning, 'td[6]/font'],
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
Using Xpath:
[:meaning, 'td[6]/font']
generates
:meaning: ! '<font size="3">asking for information specifying <font
color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
/what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>
On the other hand, using Xpath:
'td/font/text()'
generates
:meaning: asking for information specifying
thus ignoring all children of the node. What I want to achieve is this
:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you
This depends on what you need to extract. If you want all text in font elements, you can do it with the following xpath:
'td/font//text()'
It extracts all text nodes in font tags. If you want all text nodes in the cell, then:
'td//text()'
You can also call the text method on a Nokogiri node:
row.at_xpath(xpath).text
I added an answer for this same sort of question the other day. It's a very easy process.
Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby

Resources