How to use a regex search phrase in HTTP response body - ruby

I am trying to search for a phrase like this in HTTP response body:
>> myvar
<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>
When I do this, I do not get any result:
>> myvar.scan(/<HEAD> <TITLE>TestExample [Date]<\/TITLE><\/HEAD>/)
[]
Here, [Date] is a dynamic variable that gets its value via loop iteration.
What should I add/change in the regex?
I am using Nokogiri to scan for keyword in HTTP response body.

Please do not parse markup such as HTML with regular expressions. It is much more maintainable to feed it into a proper SAX or DOM parser and extract what you want that way. No matter how cleverly you formulate your regex, there will always be corner cases you forgot.
require 'nokogiri'
response = "<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>"
doc = Nokogiri::HTML( response )
doc.css( "title" ).text

This will work
<HEAD> <TITLE>TestExample (.*?)<\/TITLE><\/HEAD>
http://rubular.com/r/latepMqrjx
You probably don't need something as specific as <HEAD> <TITLE> as I doubt that there will be more than one title. Case sensitivity and newlines may also be an issue. I'd probably use
/<title>TestExample (.*?)<\//im
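The original pattern returns nothing because an unescaped [Date] is a character class (one of the letters D, a, t, e), not the literal text. Here is a minimal sketch of two fixes, assuming the date arrives as a string from the loop. The sample body and the date value are made up for illustration.

```ruby
require 'date'

# Hypothetical response body; the real one comes from the HTTP call.
body = "<HTML>\n<HEAD> <TITLE>TestExample 2020-03-18</TITLE></HEAD>\n</HTML>"

# Hypothetical dynamic value; in the question it is set on each loop iteration.
date = Date.new(2020, 3, 18).to_s

# Interpolate the value, letting Regexp.escape neutralise any metacharacters.
pattern = /<title>TestExample #{Regexp.escape(date)}<\/title>/i
matches = body.scan(pattern)

# To match the literal text "[Date]" instead, escape the brackets.
literal = '<TITLE>TestExample [Date]</TITLE>'.scan(/TestExample \[Date\]/)

puts matches.inspect
puts literal.inspect
```

With no capture groups, scan returns the full matched strings, so an empty array means the page did not contain that title.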

You're making it much too hard. Using Nokogiri, you can easily parse and search HTML and/or XML.
To get the <title> text simply use Nokogiri's HTML::Document#title method:
require 'nokogiri'
doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
doc.title # => "TestExample [Date]"
There's no regex to write or maintain, and this will work as long as the HTML is reasonably valid.
Since you're trying to get what looks like a template for a date, you'll probably want to rewrite that string, which Nokogiri also makes easy using title=:
require 'date'
require 'nokogiri'
doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
title = doc.title
title['[Date]'] = Date.today.to_s
doc.title = title
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <title>TestExample 2020-03-18</title>
# >> </head> </html>

Related

How to replace outer tags using Nokogiri

Using Nokogiri, I'm trying to replace the outer tags of a HTML node where the most reliable way to detect it is through one of its children.
Before:
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
After:
<blockquote>
Words of wisdom
</blockquote>
The following code snippet detects the element I'm after, but I'm not sure how to go on from there:
doc = Nokogiri::HTML(html)
if doc.at('div.smallfont:contains("Quote:")') != nil
q = doc.parent
# replace tags of q
# remove first_sibling
end
Does it work ok?
doc = Nokogiri::HTML(html)
if quote = doc.at('div.smallfont:contains("Quote:")')
text = quote.next # gets the ' Words of wisdom'
quote.remove # removes div.smallfont
puts text.parent.replace("<blockquote>#{text}</blockquote>") # replaces wrapping div with blockquote block
end
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
smallfont_div = doc.at('.smallfont')
smallfont_div.parent.name = 'blockquote'
smallfont_div.remove
puts doc.to_html
__END__
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>
Words of wisdom
</blockquote>
</body></html>
The whitespace inside <blockquote> will be gobbled up by the browser when it's displayed, so it's usually not an issue, but some browsers will still show a leading space and/or trailing space.
If you want to clean up the text node containing "Words of wisdom" then I'd do this instead:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_div.remove
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent.text.strip
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>Words of wisdom</blockquote>
</body></html>
Alternately, this will generate the same result:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_parent_content = smallfont_div.next_sibling.text
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent_content.strip
What the code is doing should be pretty easy to figure out as Nokogiri's methods are pretty self-explanatory.

Extracting the text from the child elements of a div with a class?

I have a small Sinatra app:
app.rb:
get '/' do
# the first two lines are lifted directly from our previous script
url = "http://www.nba.com/"
data = Nokogiri::HTML(open(url))
# this line has only been adjusted slightly with the inclusion of an at sign (@)
# before the variable name. This creates an instance variable that can be referenced
# in our display logic (view).
@headlines = data.css('#nbaAssistSkip')
@top_stories = data.css('#nbaAssistSkip')
# this tells sinatra to render the Embedded Ruby template /views/shows.erb
erb :shows
end
shows.erb:
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Nokogiri App</title>
</head>
<body>
<div>
<h2><%= @headlines %></h2>
<p><%= @top_stories %></p>
</div>
</body>
</html>
I'm new to Nokogiri, and I was wondering how I can extract the text from the links within the .nbaBreakingNews div (e.g. Live on NBA...) and display them in my template.
(Right now, I only know how to extract text from html tags with classes and IDs).
The a elements in those sections would be:
data.css('.nbaBreakingNewscv a')
That means any a element that descends from an element with class nbaBreakingNewscv. To show the text of those a elements you would do:
data.css('.nbaBreakingNewscv a').each do |a|
puts a.text
end

Nokogiri loses the value of my attribute named 'multiple'

Here's the code:
require 'nokogiri'
doc = Nokogiri::HTML("<!DOCTYPE html><html><input multiple='false' id='test' some='2'/><div multiple='false'></div></html>")
puts doc.errors
doc.css("input").each do |el|
puts el.attributes['multiple']
end
puts doc.to_html
And here's the output:
false
<!DOCTYPE html>
<html><body>
<input multiple id="test" some="2"><div multiple></div>
</body></html>
[Finished in 2.0s]
Where did the two ='false' values go?
EDIT
Plus, is there a way to turn off the default correction? (Using to_xhtml keeps the ='false' but adds CDATA to the script tag.)
In my opinion, to_xhtml seems to work more strictly, so why does to_xhtml keep the multiple='false'?
EDIT2
Here's my temporary workaround: gsub(/multiple=/, 'blahhhhh') before parsing and gsub(/blahhhhh/, 'multiple=') back after parsing
Replace to_html with to_xhtml and you will get the multiple attribute values back again.
require 'nokogiri'
doc = Nokogiri::HTML("<!DOCTYPE html><html><input multiple='false' id='test' some='2'/><div multiple='true'></div></html>")
puts doc.to_xhtml
will output
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<input multiple="false" id="test" some="2" />
<div multiple="true"></div>
</body>
</html>
Update: This happens because in HTML the multiple attribute (like other boolean attributes such as disabled or selected) does not require a value, so Nokogiri drops the value to clean up the output.
Update 2
why does to_xhtml keep the multiple='false' instead?
Because XHTML does not allow attribute values to be omitted, so Nokogiri keeps them.
The best thing you can do, I think, is to feed Nokogiri proper HTML in the first place, i.e. omit the multiple attribute entirely instead of writing multiple="false".

Nokogiri equivalent of Hpricot's html method

Hpricot's html method spits out just the HTML in the document:
> Hpricot('<p>a</p>').html
=> "<p>a</p>"
By contrast, the closest I can come with Nokogiri is the inner_html method, which wraps its output in <html> and <body> tags:
> Nokogiri.HTML('<p>a</p>').inner_html
=> "<html><body><p>a</p></body></html>"
How can I get the behavior of Hpricot's html method with Nokogiri? I.e., I want this:
> Nokogiri.HTML('<p>a</p>').some_method_i_dont_know_about
=> "<p>a</p>"
How about:
require 'nokogiri'
puts Nokogiri.HTML('<p>a</p>').to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>a</p></body></html>
If you don't want Nokogiri to create an HTML document, then you can tell it to parse the input as a document fragment:
puts Nokogiri::HTML::DocumentFragment.parse('<p>a</p>').to_html
# >> <p>a</p>
In either case, the to_html method returns the HTML version of the document.
> Nokogiri.HTML('<p>a</p>').xpath('/html/body').inner_html
=> "<p>a</p>"

Preserve structure of an HTML page, removing all text nodes

I want to remove all text from html page that I load with nokogiri. For example, if a page has the following:
<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>
I want to process it with Nokogiri and return HTML like the following, with the text stripped:
<body><script>var x = 10;</script><div></div><div><h1></h1></div></body>
(That is, remove the actual h1 text, text between divs, text in p elements etc, but keep the tags. Also, don't remove text in the script tags.)
require 'nokogiri'
html = "<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>"
hdoc = Nokogiri::HTML(html)
hdoc.xpath( '//*[text()]' ).each do |el|
el.content='' unless el.name=="script"
end
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
Warning: As you did not specify how to handle a case like <div>foo<h1>bar</h1></div> the above may or may not do what you expect. Alternatively, the following may match your needs:
hdoc.xpath( '//text()' ).each do |el|
el.remove unless el.parent.name=="script"
end
Update
Here's a more elegant solution using a single xpath to select all text nodes not part of a <script> element. I've added more text nodes to show how it handles them.
require 'nokogiri'
hdoc = Nokogiri::HTML <<ENDHTML
<body>
<script>var x = 10;</script>
<div>Hello</div>
<div>foo<h1>Hi</h1>bar</div>
</body>
ENDHTML
hdoc.xpath( '//text()[not(parent::script)]' ).each{ |text| text.remove }
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
In Ruby 1.9 and later, the core of it can be written more simply:
hdoc.xpath( '//text()[not(parent::script)]' ).each(&:remove)
