How do I replace a specific string with another string? - ruby

I have some content read from an XML file:
page_content = doc.xpath("/somenode/body").inner_text
This content holds some data:
<p> Hello World, ""How are you today""
Hello
etc.
</p>
As you can see, some of the content is wrapped with two pairs of double quotes.
My desired result is to replace the two pairs of double quotes with a single pair:
<p> Hello World, "How are you today"
Hello
etc.
</p>
What I have tried is:
page_content.gsub!(/[""]/, '"')
page_content.gsub!("\"\"", '"')
This does not seem to do the job. Any suggestions on how I can obtain my desired result?

It's important to understand how a parser like Nokogiri works.
To help you, it tries to fix up damaged or malformed HTML or XML. Your HTML is malformed, so it's going to be fixed as Nokogiri parses it. However, that process can make Nokogiri mangle the HTML further. To avoid that, we sometimes have to preprocess the content before we hand it to Nokogiri, or we have to unravel it afterwards by replacing nodes.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p> Hello World, ""How are you today""
<a href=""www.hello.comm"">Hello</a>
etc.
</p>
EOT
That parses the HTML into a DOM.
doc.at('p').to_html
# => "<p> Hello World, \"\"How are you today\"\"\n<a href=\"\" www.hello.comm>Hello</a>\netc.\n</p>"
The text ""How are you today"" was processed without any mangling because it's a text node:
doc.at('p').child.class # => Nokogiri::XML::Text
doc.at('p').child.content # => " Hello World, \"\"How are you today\"\"\n"
That's easily fixed after parsing:
doc.at('p').child.content = doc.at('p').child.content.gsub('""', '"')
# => " Hello World, \"How are you today\"\n"
Trying to fix the attributes of the <a> tag is an entirely different story because, by that point, Nokogiri has already "fixed" the doubled quotes, leaving the markup wrong:
doc.at('a').to_html
# => "<a href=\"\" www.hello.comm>Hello</a>"
Notice that www.hello.comm has been promoted outside its containing quotes.
Fixing this requires either preprocessing the HTML before handing it to Nokogiri, or repairing the node and replacing the damaged one with the fixed one.
Here's the basis for preprocessing the <a> tag:
html = <<EOT
<p> Hello World, ""How are you today""
<a href=""www.hello.comm"">Hello</a>
etc.
</p>
EOT
html.gsub(/href=""([^"]+)""/, 'href="\1"')
# => "<p> Hello World, \"\"How are you today\"\"\n<a href=\"www.hello.comm\">Hello</a>\netc.\n</p>\n"
If you go that route, don't get fancy. Write small, atomic changes, to avoid your pattern breaking if the HTML changes.
A more robust way (where "robust" is somewhat less than we'd normally get using a parser) is:
bad_a = doc.at('a')
fixed_a = bad_a.to_html.gsub(/""\s([^>]+)>/, '"\1">')
bad_a.replace(fixed_a)
doc.at('p')
# => #(Element:0x3fe4ce9de9e4 {
# name = "p",
# children = [
# #(Text " Hello World, \"How are you today\"\n"),
# #(Element:0x3fe4ce9e0fdc {
# name = "a",
# attributes = [
# #(Attr:0x3fe4ce9e0fa0 {
# name = "href",
# value = "www.hello.comm"
# })],
# children = [ #(Text "Hello")]
# }),
# #(Text "\netc.\n")]
# })
doc.at('p').to_html
# => "<p> Hello World, \"How are you today\"\n<a href=\"www.hello.comm\">Hello</a>\netc.\n</p>"
It's possible to use a blanket gsub to massage the text, but that's got a high risk of collateral damage in large/complicated documents. Imagine what would happen to a document if
html.gsub('""', '"')
was used when there are many tags containing empty strings like:
<input value="" name="foo"><input value="" name="bar">
The result of the search/replace would be:
<input value=" name="foo"><input value=" name="bar">
That hardly improves things, and instead would have horribly mangled the document further.
Instead, it's better to surgically fix the problem. Back in the dark, early, pioneer days of the web, we used to see a huge amount of malformed content, and processing it with regular expressions was the normal plan of attack. Now, with parsers, we can usually avoid that, isolate the problem, and selectively fix exactly what we want. Looking at the code necessary to do so shows it doesn't take a lot to do it right.
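To make the collateral damage concrete, here is a small comparison of the blanket substitution with a narrower pattern. It is only a heuristic sketch: `collapse_wrapping_quotes` is a hypothetical helper, and its pattern assumes the doubled quotes wrap non-empty text and are not directly preceded by an equals sign.

```ruby
# Hypothetical helper: collapse doubled quotes only when they wrap non-empty
# text and are not directly preceded by "=", so empty attributes survive.
def collapse_wrapping_quotes(html)
  html.gsub(/(?<!=)""([^"<>]+)""/, '"\1"')
end

html = '<p>Hello World, ""How are you today""</p><input value="" name="foo">'

# Blanket replacement: fixes the text but breaks the empty attribute.
puts html.gsub('""', '"')
# <p>Hello World, "How are you today"</p><input value=" name="foo">

# Targeted replacement: fixes the text and leaves value="" intact.
puts collapse_wrapping_quotes(html)
# <p>Hello World, "How are you today"</p><input value="" name="foo">
```

Even the narrower pattern is still a regular expression working on raw markup, so it is best kept for preprocessing known, specific damage.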

page_content.gsub!('""', '"')

page_content.gsub!(/"{2}/, '"')
rubular.com

a='<p> Hello World, ""How are you today""
Hello
etc.
</p>'
a.gsub! '""', '"'
[19] pry(main)> puts a
<p> Hello World, "How are you today"
Hello
etc.
</p>

Related

Nokogiri children method

I have the following XML here:
<listing>
<seller_info>
<payment_types>Visa, Mastercard, , , , 0, Discover, American Express </payment_types>
<shipping_info>siteonly, Buyer Pays Shipping Costs </shipping_info>
<buyer_protection_info/>
<auction_info>
<bid_history>
<item_info>
</listing>
The following code works fine for displaying first child of the first //listing node:
require 'nokogiri'
require 'open-uri'
html_data = open('http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.xml')
nokogiri_object = Nokogiri::XML(html_data)
listing_elements = nokogiri_object.xpath("//listing")
puts listing_elements[0].children[1]
This also works:
puts listing_elements[0].children[3]
I tried to access the second node <payment_types> with the following code:
puts listing_elements[0].children[2]
but a blank line was displayed. Looking through Firebug, it is clearly the 2nd child of the listing node. In general, only odd numbers work with the children method.
Is this a bug in Nokogiri? Any thoughts?
It's not a bug; it's the whitespace text nodes created while parsing strings that contain "\n" (or empty nodes). You can use the noblanks option to avoid them:
nokogiri_object = Nokogiri::XML(html_data) { |conf| conf.noblanks }
Use that and you will have no blanks in your array.
The problem is you are not parsing the document correctly. children returns more than you think, and its use is painting you into a corner.
Here's a simplified example of how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(DATA.read)
auctions = doc.search('listing').map do |listing|
  seller_info = listing.at('seller_info')
  auction_info = listing.at('auction_info')
  hash = [:seller_name, :seller_rating].each_with_object({}) do |s, h|
    h[s] = seller_info.at(s.to_s).text.strip
  end
  [:current_bid, :time_left].each do |s|
    hash[s] = auction_info.at(s.to_s).text.strip
  end
  hash
end
__END__
<?xml version='1.0' ?>
<!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.dtd">
<root>
<listing>
<seller_info>
<seller_name>537_sb_3 </seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $839.93</current_bid>
<time_left> 1 Day, 6 Hrs</time_left>
</auction_info>
</listing>
<listing>
<seller_info>
<seller_name> lapro8</seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $210.00</current_bid>
<time_left> 4 Days, 21 Hrs</time_left>
</auction_info>
</listing>
</root>
After running, auctions will be:
auctions
# => [{:seller_name=>"537_sb_3",
# :seller_rating=>"0",
# :current_bid=>"$839.93",
# :time_left=>"1 Day, 6 Hrs"},
# {:seller_name=>"lapro8",
# :seller_rating=>"0",
# :current_bid=>"$210.00",
# :time_left=>"4 Days, 21 Hrs"}]
Notice there are no empty text nodes to deal with because I told Nokogiri exactly which nodes to grab text from. You should be able to extend the code to grab any information you want easily.
A typically formatted XML or HTML document that displays nesting or indentation uses text nodes to provide that indenting:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
Here's what your code is seeing:
doc.at('body').children.map(&:to_html)
# => ["\n" +
# " ", "<p>foo</p>", "\n" +
# " "]
The Text nodes are what are confusing you:
doc.at('body').children.first.class # => Nokogiri::XML::Text
doc.at('body').children.first.text # => "\n "
If you don't drill down far enough you will pick up the Text nodes and have to clean up the results:
doc.at('body')
.text # => "\n foo\n "
.strip # => "foo"
Instead, explicitly find the node you want and extract the information:
doc.at('body p').text # => "foo"
In the suggested code above I used strip because the incoming XML had spaces surrounding some text:
h[s] = seller_info.at(s.to_s).text.strip
which is the result of the original XML creation code not cleaning the lines prior to generating the XML. So sometimes we have to clean up their mess, but the proper accessing of the node can reduce that a lot.
The problem is that children includes text nodes, such as the whitespace between elements. If you use element_children instead, you get just the child elements (i.e. the tags themselves, not the surrounding whitespace).

How to remove characters in string after email

I'm using this code to list email addresses from a HTML page.
require 'nokogiri'
selector = "//a[starts-with(@href, \"mailto:\")]/@href"
doc = Nokogiri::HTML.parse File.read 'in.rb'
nodes = doc.xpath selector
addresses = nodes.collect {|n| n.value[7..-1]}
puts addresses
This is sample code I'm parsing:
<a href="mailto:joe@example.com?subject=My Business Is Dying">
But I'm getting more than just the email address. I'm getting this in my results:
joe@example.com?subject=My Business Is Dying
How do I drop off everything after the question mark so it's only the email address?
You could always chop off anything after the ? character:
addresses.map! do |address|
address.sub(/\?.*/, '')
end
I'd probably use one of these two:
str = 'joe@example.com?subject=My Business Is Dying'
str.split('?').first # => "joe@example.com"
str[/^[^?]+/] # => "joe@example.com"
The second is a simple regular expression embedded in String's [] (slice) method. The pattern basically says "start at the beginning and grab everything up until a question mark."
They're equivalent as far as speed goes. I'd probably use the first because it's easier to read.
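Both steps, dropping the mailto: prefix and dropping the query string, can be combined in one small method that works directly on the href value. This is a sketch under the assumption that each input is a mailto: link; `address_from_mailto` is a hypothetical name, not something from the answers above.

```ruby
# Hypothetical helper: strip the mailto: scheme and any ?query part.
def address_from_mailto(href)
  href.sub(/\Amailto:/i, '')[/\A[^?]*/]
end

hrefs = [
  'mailto:joe@example.com?subject=My Business Is Dying',
  'mailto:ann@example.com'
]

p hrefs.map { |href| address_from_mailto(href) }
# ["joe@example.com", "ann@example.com"]
```

Anchoring the prefix with \A avoids relying on the fixed offset of 7 characters, so a link without the prefix is returned unchanged instead of being truncated.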

How to parse HTML contents of a page using Nokogiri

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(open url)
I am trying to fetch the basic set of information like:
event_name
categories
sponsor
venue
event_location
cost
For example, for event_name I have this xpath:
"/html/body/div[2]/div[2]/div[1]/h3/a/span"
And use it like:
puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"
This returns nil for event_name.
If I save the URL contents locally, the above XPath works.
I need the other information listed above as well. I checked the other XPaths too, but the results are blank.
Here's how I'd go about doing this:
require 'nokogiri'
doc = Nokogiri::XML(open('/Users/gferguson/smithsonian-events.xml'))
namespaces = doc.collect_namespaces
entries = doc.search('entry').map { |entry|
  entry_title = entry.at('title').text
  entry_time_start, entry_time_end = ['startTime', 'endTime'].map{ |p|
    entry.at('gd|when', namespaces)[p]
  }
  entry_notes = entry.at('gc|notes', namespaces).text
  {
    title: entry_title,
    start_time: entry_time_start,
    end_time: entry_time_end,
    notes: entry_notes
  }
}
Which, when run, results in entries being an array of hashes:
require 'awesome_print'
ap entries[0, 3]
# >> [
# >> [0] {
# >> :title => "Conservation Clinics",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T17:00:00Z",
# >> :notes => "Have questions about the condition of a painting, frame, drawing,\n print, or object that you own? Our conservators are available by\n appointment to consult with you about the preservation of your art.\n \n To request an appointment or to learn more,\n e-mail DWRCLunder@si.edu and specify CLINIC in the subject line."
# >> },
# >> [1] {
# >> :title => "Castle Highlights Tour",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T14:45:00Z",
# >> :notes => "Did you know that the Castle is the Smithsonian’s first and oldest building? Join us as one of our dynamic volunteer docents takes you on a tour to explore the highlights of the Smithsonian Castle. Come learn about the founding and early history of the Smithsonian; its original benefactor, James Smithson; and the incredible history and architecture of the Castle. Here is your opportunity to discover the treasured stories revealed within James Smithson's crypt, the Gre...
# >> },
# >> [2] {
# >> :title => "Exhibition Interpreters/Navigators (throughout the day)",
# >> :start_time => "2016-11-09T15:00:00Z",
# >> :end_time => "2016-11-09T15:00:00Z",
# >> :notes => "Museum volunteer interpreters welcome visitors, answer questions, and help visitors navigate exhibitions. Interpreters may be stationed in several of the following exhibitions at various times throughout the day, subject to volunteer interpreter availability. <ul> \t<li><em>The David H. Koch Hall of Human Origins: What Does it Mean to be Human?</em></li> \t<li><em>The Sant Ocean Hall</em></li> </ul>"
# >> }
# >> ]
I didn't try to gather the specific information you asked for because event_name doesn't exist and what you're doing is very generic and easily done once you understand a few rules.
XML is generally very repetitive because it represents tables of data. The "cells" of the table might vary but there's repetition you can use to help you. In this code
doc.search('entry')
loops over the <entry> nodes. Then it's easy to look inside them to find the information needed.
The XML uses namespaces to help avoid tag-name collisions. At first those seem really hard, but Nokogiri provides the collect_namespaces method for the document, which returns a hash of all namespaces in the document. If you're looking for a namespaced tag, pass that hash as the second parameter.
Nokogiri allows us to use XPath and CSS for selectors. I almost always go with CSS for readability. ns|tag is the format to tell Nokogiri to use a CSS-based namespaced tag. Again, pass it the hash of namespaces in the document and Nokogiri will do the rest.
If you're familiar with working with Nokogiri you'll see the above code is very similar to normal code used to pull the content of <td> cells inside <tr> rows in an HTML <table>.
You should be able to modify that code to gather the data you need without risking namespace collisions.
The provided link contains XML, so your XPath expressions should work with XML structure.
The key thing is that the document has namespaces. As I understand it, all XPath expressions have to take that into account and specify the namespaces too.
To simplify the XPath expressions, you can use the remove_namespaces! method:
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(open(url)); nil # nil is used to avoid huge output
doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event
event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"
Most likely you would like an array of all event hashes.
You can build it like this:
doc.xpath('//feed/entry').reduce([]) do |memo, event|
  event_hash = {
    title: event.xpath('./title').text,
    categories: event.xpath('./categories').text
    # all other attributes you need ...
  }
  memo << event_hash
end
It will give you an array like:
[
{:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"},
{:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"},
...
]

How does Nokogiri handle unclosed HTML tags like <br>?

When parsing an HTML document, how does Nokogiri handle <br> tags? Suppose we have a document that looks like this one:
<div>
Hi <br>
How are you? <br>
</div>
Does Nokogiri know that <br> tags are something special, not just regular XML tags, and ignore them when parsing a node's content? I think Nokogiri is that smart, but I want to make sure before I accept this project, which involves scraping a site written in HTML4. You know what I mean ("How are you?" is not the content of the first <br>, as it would be in XML).
Here's how Nokogiri behaves when parsing (malformed) XML:
require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>
Here's how Nokogiri behaves when parsing HTML:
require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>
p doc.at('div').text
#=> "HelloWorld"
I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br> is not something special, and so appropriately Nokogiri does not treat it differently than any other element.
If you want it to be treated as a newline, you can do this:
doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"
Similarly, if you wanted a space instead:
doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"
You must parse this fragment using the HTML parser, as this is obviously not valid XML. With the HTML parser, Nokogiri behaves as you'd expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<-EOS
<div>
Hi <br>
How are you? <br>
</div>
EOS
)
doc.xpath("//br").each{ |e| puts e }
prints
<br>
<br>
Mechanize is based on Nokogiri for doing web scraping, so it is quite appropriate for the task.
As far as I can remember from doing some HTML parsing last year, it'll view them as separate.
EDIT: My bad. I've just had someone send me the code and retested it; we ended up dealing with some things, including <br>, separately.

How to replace every occurrence of a pattern in a string using Ruby?

I have an XML file which is too big. To make it smaller, I want to replace all tags and attribute names with shorter versions of the same thing.
So, I implemented this:
string.gsub!(/<(\w+) /) do |match|
  case match
  when 'Image' then 'Img'
  when 'Text' then 'Txt'
  end
end
puts string
which deletes all opening tags but does not do much else.
What am I doing wrong here?
Here's another way:
class String
  def minimize_tags!
    {"image" => "img", "text" => "txt"}.each do |from,to|
      gsub!(/<#{from}\b/i,"<#{to}")
      gsub!(/<\/#{from}>/i,"<\/#{to}>")
    end
    self
  end
end
This will probably be a little easier to maintain, since the replacement patterns are all in one place. And on strings of any significant size, it may be a lot faster than Kevin's way. I did a quick speed test of these two methods using the HTML source of this Stack Overflow page itself as the test string, and my way was about 6x faster...
Here's the beauty of using a parser such as Nokogiri:
This lets you manipulate selected tags (nodes) and their attributes:
require 'nokogiri'
xml = <<EOT
<xml>
<Image ImagePath="path/to/image">image comment</Image>
<Text TextFont="courier" TextSize="9">this is the text</Text>
</xml>
EOT
doc = Nokogiri::XML(xml)
doc.search('Image').each do |n|
  n.name = 'img'
  n.attributes['ImagePath'].name = 'path'
end
doc.search('Text').each do |n|
  n.name = 'txt'
  n.attributes['TextFont'].name = 'font'
  n.attributes['TextSize'].name = 'size'
end
print doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <img path="path/to/image">image comment</img>
# >> <txt font="courier" size="9">this is the text</txt>
# >> </xml>
If you need to iterate through every node, maybe to do a universal transformation on the tag-name, you can use doc.search('*').each. That would be slower than searching for individual tags, but might result in less code if you need to change every tag.
The nice thing about using a parser is it'll work even if the layout of the XML changes since it doesn't care about whitespace, and will work even if attribute order changes, making your code more robust.
Try this:
string.gsub!(/(<\/?)(\w+)/) do |match|
  tag_mark = $1
  case $2
  when /^image$/i
    "#{tag_mark}Img"
  when /^text$/i
    "#{tag_mark}Txt"
  else
    match
  end
end
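The reason the code in the question deletes the opening tags is that the block passed to gsub! receives the entire matched text (for example "<Image "), not the capture group. None of the when branches match, so the case expression returns nil and the match is replaced with an empty string. A compact variant of the fix, using a lookup hash and the $1 and $2 captures, might look like this; `TAG_MAP` and `shorten_tags` are hypothetical names.

```ruby
# Hypothetical mapping of long tag names to short ones.
TAG_MAP = { 'Image' => 'Img', 'Text' => 'Txt' }.freeze

# Shorten tag names listed in TAG_MAP; any other tag is left unchanged.
# $1 is "<" or "</", $2 is the tag name captured by the pattern.
def shorten_tags(xml)
  xml.gsub(%r{(</?)(\w+)}) { "#{$1}#{TAG_MAP.fetch($2, $2)}" }
end

xml = '<Image ImagePath="a.png">comment</Image><Text TextSize="9">body</Text><Other/>'
puts shorten_tags(xml)
# <Img ImagePath="a.png">comment</Img><Txt TextSize="9">body</Txt><Other/>
```

Using fetch with the original name as the default plays the same role as the else branch above: unknown tags pass through untouched instead of being replaced with nothing.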
