Truncate String from html string - ruby

I need to truncate some data received via URI.parse. It is full of HTML markup and other data; the result near the end is the part I need.
Here's the string (abbreviated) ' junk"Result">Q8:0;junk
What is the best way to strip the extra content from the string so that I can split the data I need into variables?
Thanks in advance,
Philip
pabbott#cpak.com

I would recommend using Nokogiri to extract the value from the Result span:
require 'nokogiri'
response = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="w3.org/1999/xhtml"><head><title>;
</title></head><body>
<form name="form1" method="post" action="tenHSServer.aspx?t=34&f=DeviceValue&d=R10" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTkzNDcxNzcwM2RkM4AHUDZdWZytDdspzLq7+FOXRfQ=" />
</div>
<span id="Result">R10:100;</span>
</form></body>
</html>'
result = nil
if doc = Nokogiri::HTML(response) rescue nil
  # at_css returns the first matching node or nil; css returns a NodeSet, which is always truthy
  if span = doc.at_css('#Result')
    result = span.text
  end
end
puts result
#=> R10:100;
However, if you cannot or do not want to install Nokogiri, use this regexp instead:
result = response.scan(/id=["']Result["']>([^<]*)<\//m).flatten.first
puts result
#=> R10:100;
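As a variation on the regexp approach, String#[] accepts a regexp plus a capture index and returns nil when nothing matches, which avoids the scan/flatten/first chain. This is a minimal sketch; the shortened response string is a made-up stand-in for the real page.

```ruby
# Hypothetical, shortened response body; only the Result span matters here.
response = %(<form id="form1">\n<span id="Result">R10:100;</span>\n</form>)

# String#[] with a regexp and a capture index returns that capture group
# from the first match, or nil when the pattern does not match at all.
result = response[/id=["']Result["']>([^<]*)</, 1]

puts result.inspect
```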

The first call to sub() removes everything up to and including <span id="Result">.
The second call to sub() then removes </span> and everything after it from what is left.
Assume your HTML is stored in the variable mystring. The m flag lets . match newlines, which a multi-line HTML string needs:
result = mystring.sub(/.*<span id="Result">/m, '').sub(/<\/span>.*/m, '')
If you can't always rely on the element being a span, you could use the following:
result = mystring.sub(/.*id="Result">/m, '').sub(/<\/.*/m, '')
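Once the value has been isolated, it still has to be split into variables, as the question asks. The sketch below assumes the payload always has the shape NAME:INTEGER; (as in R10:100; and Q8:0;). That shape is inferred from the two samples in this thread, and parse_result is a hypothetical helper name.

```ruby
# Hypothetical helper: assumes the payload always looks like NAME:INTEGER;
# Returns [name, integer_value], or nil when the string has another shape.
def parse_result(result)
  match = /\A\s*(\w+):(-?\d+);?\s*\z/.match(result)
  return nil unless match

  [match[1], Integer(match[2], 10)]
end

device, value = parse_result('R10:100;')
puts device
puts value
```

Returning nil for unexpected input keeps the caller from silently working with half-parsed data.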

Related

How to use Nokogiri to get the full HTML without any text content

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.
I tried this:
require 'nokogiri'
x = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s
This outputs:
<div class="example"></div>
I've also tried running it without the children.remove part:
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s
But then I get:
<div class="example"><span>Hello</span></div>
But what I actually want is:
<html><body><div class='example'><span></span></div></body></html>
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

How to replace outer tags using Nokogiri

Using Nokogiri, I'm trying to replace the outer tags of an HTML node where the most reliable way to detect it is through one of its children.
Before:
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
After:
<blockquote>
Words of wisdom
</blockquote>
The following code snippet detects the element I'm after, but I'm not sure how to go on from there:
doc = Nokogiri::HTML(html)
if doc.at('div.smallfont:contains("Quote:")') != nil
q = doc.parent
# replace tags of q
# remove first_sibling
end
Does this work for you?
doc = Nokogiri::HTML(html)
if quote = doc.at('div.smallfont:contains("Quote:")')
text = quote.next # gets the ' Words of wisdom'
quote.remove # removes div.smallfont
puts text.parent.replace("<blockquote>#{text}</blockquote>") # replaces wrapping div with blockquote block
end
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
smallfont_div = doc.at('.smallfont')
smallfont_div.parent.name = 'blockquote'
smallfont_div.remove
puts doc.to_html
__END__
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>
Words of wisdom
</blockquote>
</body></html>
The whitespace inside <blockquote> will be gobbled up by the browser when it's displayed, so it's usually not an issue, but some browsers will still show a leading space and/or trailing space.
If you want to clean up the text node containing "Words of wisdom" then I'd do this instead:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_div.remove
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent.text.strip
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>Words of wisdom</blockquote>
</body></html>
Alternately, this will generate the same result:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_parent_content = smallfont_div.next_sibling.text
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent_content.strip
What the code is doing should be pretty easy to figure out as Nokogiri's methods are pretty self-explanatory.

Nokogiri loses the value of my attribute named 'multiple'

Here's the code:
require 'nokogiri'
doc = Nokogiri::HTML("<!DOCTYPE html><html><input multiple='false' id='test' some='2'/><div multiple='false'></div></html>")
puts doc.errors
doc.css("input").each do |el|
puts el.attributes['multiple']
end
puts doc.to_html
And here's the output:
false
<!DOCTYPE html>
<html><body>
<input multiple id="test" some="2"><div multiple></div>
</body></html>
[Finished in 2.0s]
Where did the two ='false' values go?
EDIT
Also, is there a way to turn off the default correction? (Using to_xhtml keeps the ='false', but it adds CDATA to script tags.)
In my opinion, to_xhtml seems to work more strictly, so why does to_xhtml keep multiple='false'?
EDIT2
Here's my temporary workaround: gsub(/multiple=/, 'blahhhhh') before parsing and gsub(/blahhhhh/, 'multiple=') back after parsing
Replace to_html with to_xhtml and you will get the values of the multiple attributes back.
require 'nokogiri'
doc = Nokogiri::HTML("<!DOCTYPE html><html><input multiple='false' id='test' some='2'/><div multiple='true'></div></html>")
puts doc.to_xhtml
will output
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<input multiple="false" id="test" some="2" />
<div multiple="true"></div>
</body>
</html>
Update: This happens because in HTML the multiple attribute (like other attributes such as disabled or selected) does not need a value, so Nokogiri strips it to clean up the output.
Update 2
Why does to_xhtml keep multiple='false'?
Because XHTML does not allow attribute values to be omitted, so Nokogiri keeps them.
The best thing you can do, I think, is to feed Nokogiri proper HTML in the first place, i.e. omit the multiple attribute entirely instead of writing multiple="false".

Parsing webpage with some html tags using Nokogiri

For example:
content=Nokogiri::HTML(open(url)).at_css(".appwindow").text
This example extracts only the text from .appwindow.
How can I get this text together with its <p> tags?
I think you want to find either the full HTML of the first element that has an appwindow class, or perhaps the inner HTML. If so:
require 'nokogiri'
html = Nokogiri::HTML <<ENDHTML
<div id='menu'>menu</div>
<div class='appwindow'><p>Hello <b>World</b>!</p></div>
ENDHTML
puts html.at_css('.appwindow').text
#=> Hello World!
puts html.at_css('.appwindow').to_html
#=> <div class="appwindow"><p>Hello <b>World</b>!</p></div>
puts html.at_css('.appwindow').inner_html
#=> <p>Hello <b>World</b>!</p>
See the list of methods on Nokogiri::XML::Node for other options available to you.

How do I select IDs using xpath in Nokogiri?

Using this code:
doc = Nokogiri::HTML(open("text.html"))
doc.xpath("//span[@id='startsWith_']").remove
I would like to select every span whose id starts with 'startsWith_' and remove it. I tried searching, but failed.
Here's an example:
require 'nokogiri'
html = '
<html>
<body>
<span id="doesnt_start_with">foo</span>
<span id="startsWith_bar">bar</span>
</body>
</html>'
doc = Nokogiri::HTML(html)
p doc.search('//span[starts-with(@id, "startsWith_")]').to_xml
That's how to select them.
doc.search('//span[starts-with(@id, "startsWith_")]').each do |n|
  n.remove
end
That's how to remove them.
p doc.to_xml
# >> "<span id=\"startsWith_bar\">bar</span>"
# >> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n <span id=\"doesnt_start_with\">foo</span>\n \n</body></html>\n"
The page "XPath, XQuery, and XSLT Functions" has a list of the available functions.
Try this xpath expression:
//span[starts-with(@id, 'startsWith_')]
