Nokogiri loses the value of my 'multiple' attribute - ruby

Here's the code:
require 'nokogiri'
doc = Nokogiri::HTML("<!DOCTYPE html><html><input multiple='false' id='test' some='2'/><div multiple='false'></div></html>")
puts doc.errors
doc.css("input").each do |el|
  puts el.attributes['multiple']
end
puts doc.to_html
And here's the output:
false
<!DOCTYPE html>
<html><body>
<input multiple id="test" some="2"><div multiple></div>
</body></html>
[Finished in 2.0s]
Where did the two ='false' values go?
EDIT
Plus, is there a way to turn off the default correction? (Using to_xhtml keeps the ='false', but it also adds CDATA to the script tag.)
In my opinion, to_xhtml seems to work more strictly, so why does to_xhtml keep the multiple='false'?
EDIT2
Here's my temporary workaround: gsub(/multiple=/, 'blahhhhh') before parsing and gsub(/blahhhhh/, 'multiple=') back after parsing

Replace to_html with to_xhtml and you will get the values of the multiple attributes back.
require 'nokogiri'
doc = Nokogiri::HTML("<!DOCTYPE html><html><input multiple='false' id='test' some='2'/><div multiple='true'></div></html>")
puts doc.to_xhtml
will output
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<input multiple="false" id="test" some="2" />
<div multiple="true"></div>
</body>
</html>
Update: This happens because multiple is a boolean attribute in HTML (like disabled or selected). It doesn't need a value, and its mere presence turns it on, so Nokogiri writes it in minimized form when serializing as HTML. The value is not lost at parse time, as the false printed by el.attributes['multiple'] shows.
Update 2
why to_xhtml keep the multiple='false' instead?
Because XHTML doesn't allow attribute values to be omitted, so Nokogiri keeps them.
The best thing you can do, I think, is to feed Nokogiri proper HTML in the first place, i.e. omit the multiple attribute entirely instead of writing multiple="false".

Related

How to use Nokogiri to get the full HTML without any text content

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.
I tried this:
require 'nokogiri'
x = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s
This outputs:
<div class="example"></div>
I've also tried running it without the children.remove part:
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s
But then I get:
<div class="example"><span>Hello</span></div>
But what I actually want is:
<html><body><div class='example'><span></span></div></body></html>
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

Ruby: regex to remove tags if attributes don't have allowed values

I have such a text:
click here!blah-blah-some-text-here-blahclick here!
What's the correct way to remove all <a></a> tags (and everything inside them) if <a href= does NOT have some-good-website?
A possible solution using Nokogiri:
require 'nokogiri'
TEST = 'click here!blah-blah-some-text-here-blahclick here!'
page = Nokogiri::HTML(TEST)
links = page.css("a") # select all <a></a> elements from the content
links.each do |link|
  if link['href'] =~ /http:\/\/www\.i-am-hacker\.com\/blah/
    link.remove
  end
end
puts page # output content for debugging
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>blah-blah-some-text-here-blahclick here!
</body></html>
Useful resource: http://ruby.bastardsbook.com/chapters/html-parsing/
This site helped me a lot in understanding how to use Nokogiri.
If you need to install Nokogiri, you can do that with the following command:
gem install nokogiri

Extracting the text from the child elements of a div with a class?

I have a small Sinatra app:
app.rb:
get '/' do
  # the first two lines are lifted directly from our previous script
  url = "http://www.nba.com/"
  data = Nokogiri::HTML(open(url))
  # this line has only been adjusted slightly with the inclusion of an at sign (@)
  # before concerts. This creates an instance variable that can be referenced
  # in our display logic (view).
  @headlines = data.css('#nbaAssistSkip')
  @top_stories = data.css('#nbaAssistSkip')
  # this tells sinatra to render the Embedded Ruby template /views/shows.erb
  erb :shows
end
show.erb:
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Nokogiri App</title>
</head>
<body>
<div>
<h2><%= @headlines %></h2>
<p><%= @top_stories %></p>
</div>
</body>
</html>
I'm new to Nokogiri, and I was wondering how I can extract the text from the links within .nbaBreakingNews div (e.g. Live on NBA...):
And display them in my template.
(Right now, I only know how to extract text from html tags with classes and IDs).
The a elements in those sections would be:
data.css('.nbaBreakingNewscv a')
That means any a element that descends from an element with class nbaBreakingNewscv. To show the text of those a elements you would do:
data.css('.nbaBreakingNewscv a').each do |a|
  puts a.text
end

How to use a regex search phrase in HTTP response body

I am trying to search for a phrase like this in HTTP response body:
>> myvar1
<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>
When I do this, I do not get any result:
>> myvar.scan(/<HEAD> <TITLE>TestExample [Date]<\/TITLE><\/HEAD>/)
[]
Here, [Date] is a dynamic variable that gets its value via loop iteration.
What should I add/change in the regex?
I am using Nokogiri to scan for keyword in HTTP response body.
Please do not parse any markup like HTML with regular expressions. For such purposes it is much more maintainable to feed it into a proper SAX or DOM parser and just extract what you want that way. The reason for this is that no matter how cleverly you formulate your regex, there will always be corner cases you probably forgot.
require 'nokogiri'
response = "<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>"
doc = Nokogiri::HTML( response )
doc.css( "title" ).text
This will work
<HEAD> <TITLE>TestExample (.*?)<\/TITLE><\/HEAD>
http://rubular.com/r/latepMqrjx
You probably don't need something as specific as <HEAD> <TITLE> as I doubt that there will be more than one title. Case sensitivity and newlines may also be an issue. I'd probably use
/<title>TestExample (.*?)<\//im
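As for why the original pattern returns nothing: inside a regex, [Date] is a character class that matches a single character (D, a, t, or e), not the literal text. If you do stay with a regex and the date comes from a loop variable, interpolating it through Regexp.escape is one way to build the pattern. The date value and the body string below are made-up examples.

```ruby
require 'date'

body = "<HTML>\n<HEAD> <TITLE>TestExample 2020-03-18</TITLE></HEAD>\n</HTML>"
date = Date.new(2020, 3, 18).to_s # stand-in for the loop variable

# Regexp.escape makes sure characters in the interpolated value are
# matched literally.
pattern = /<title>TestExample #{Regexp.escape(date)}<\/title>/i
matches = body.scan(pattern)

# Unescaped brackets form a character class, so the literal text fails.
literal_fails = 'TestExample [Date]'.match?(/TestExample [Date]/)
literal_works = 'TestExample [Date]'.match?(/TestExample \[Date\]/)

puts matches.inspect
```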
You're making it much too hard. Using Nokogiri, you can easily parse and search HTML and/or XML.
To get the <title> text simply use Nokogiri's HTML::Document#title method:
require 'nokogiri'
doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
doc.title # => "TestExample [Date]"
There's no regex to write or maintain, and this will work as long as the HTML is reasonably valid.
Since you're trying to get what looks like a template for a date, you'll probably want to rewrite that string, which Nokogiri also makes easy using title=:
require 'date'
require 'nokogiri'
doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
title = doc.title
title['[Date]'] = Date.today.to_s
doc.title = title
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <title>TestExample 2020-03-18</title>
# >> </head> </html>

Truncate String from html string

I need to truncate some data received from a URI.parse request. It is full of HTML codes and data; the result at the end is what I need.
Here's the string (abbreviated) ' junk"Result">Q8:0;junk
What's the best way to truncate the extra stuff in the string so that I can split the data I need into variables?
Thanks in advance,
Philip
pabbott#cpak.com
I would recommend using Nokogiri to extract your value from the Result span:
require 'nokogiri'
response = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="w3.org/1999/xhtml"><head><title>;
</title></head><body>
<form name="form1" method="post" action="tenHSServer.aspx?t=34&f=DeviceValue&d=R10" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTkzNDcxNzcwM2RkM4AHUDZdWZytDdspzLq7+FOXRfQ=" />
</div>
<span id="Result">R10:100;</span>
</form></body>
</html>'
doc = Nokogiri::HTML(response)
# at_css returns the first matching node, or nil when nothing matches
span = doc.at_css('#Result')
result = span.text if span
puts result
#=> R10:100;
However, if you cannot or do not want to install Nokogiri, use this regexp instead (the character class ["'] accepts either quote style):
result = response.scan(/id=["']Result["']>([^<]*)<\//m).flatten.first
puts result
#=> R10:100;
Remove everything up to and including <span id=\"Result\"> with the first call to sub().
Then remove everything from </span> onward from what's left with the second call to sub().
The m flag lets . match newlines, which matters when the response spans several lines.
Assume you store your HTML in the variable mystring:
result = mystring.sub(/.*<span id=\"Result\">/m,'').sub(/<\/span>.*/m,'')
If you can't always rely on the elements being spans, you could use the following:
result = mystring.sub(/.*id=\"Result\">/m,'').sub(/<\/.*/m,'')
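Once the value is extracted, splitting it into variables could look like the sketch below. It assumes the payload always has the form name:value; with a numeric value, as in the R10:100; sample above. The variable names device and level are made up for illustration.

```ruby
result = 'R10:100;' # value extracted by any of the approaches above

# Drop the trailing semicolon, then split on the first colon only.
device, value = result.chomp(';').split(':', 2)

# Integer() raises on non-numeric input instead of silently returning 0.
level = Integer(value)

puts "#{device} => #{level}"
```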

Resources