How to add a comment into an XML file - ruby

How do I add a comment into an XML file using Nokogiri?
For example, I have an existing HTML file and I want to add <!--doc--> to it. How should I do it so I get:
...
<body>
<!--doc-->
</body>
...

I'd use:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body></body></html>')
doc.at('body') << '<!-- foo -->'
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><!-- foo --></body></html>
Or you can use slightly longer code:
doc.at('body').add_child('<!-- foo -->')
Which results in the same thing.
It gets a little more interesting/complicated if <body> has more nodes, and you care where the comment goes, but that's still basically locating where you want the comment to be inserted, and then doing one of the above.

I used the following code:
require 'nokogiri'
d = Nokogiri::HTML(%Q(<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
</body>
</html>
))
d.css('body')[0].add_child(Nokogiri::XML::Comment.new(d, "doc"))
puts d.to_s

Related

How to search within a nodeset and delete a node from that same nodeset

I have the following xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document mc:Ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<w:body>
<w:p w14:paraId="56037BEC" w14:textId="1188FA30" w:rsidR="001665B3" w:rsidRDefault="008B4AC6">
<w:r>
<w:t xml:space="preserve">This is the story of a man who </w:t>
</w:r>
<w:ins w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="0">
<w:r w:rsidR="003566BF">
<w:t>went</w:t>
</w:r>
</w:ins>
<w:del w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="1">
<w:r w:rsidDel="003566BF">
<w:delText>goes</w:delText>
</w:r>
</w:del>
...
I use Nokogiri to parse the xml as follows:
zip = Zip::File.open("test.docx")
doc = zip.find_entry("word/document.xml")
file = Nokogiri::XML.parse(doc.get_input_stream)
I have a 'deletions' nodeset that contains all of the w:del elements:
@deletions = file.xpath("//w:del")
I search inside of this nodeset to see if an element exists as follows:
my_node_set = @deletions.search("//w:del[@w:id='1']" && "//w:del/w:r[@w:rsidDel='003566BF']")
If it exists I want to remove it from the deletions nodeset. I do this with the following:
deletions.delete(my_node_set.first)
Which seems to work as no errors are returned and it displays the deleted nodeset in the terminal.
However, when I check my @deletions nodeset it seems the item is still there:
@deletions.search("//w:del[@w:id='1']" && "//w:del/w:r[@w:rsidDel='003566BF']")
I'm just getting my head around Nokogiri so I'm obviously not searching for the element properly inside of my @deletions nodeset and am instead searching the entire document.
How can I search inside of the @deletions nodeset for the element and then delete it from the nodeset?
Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo"><p>foo</p></div>
<div id="bar"><p>bar</p></div>
</body>
</html>
EOT
divs contains the div tags, which are a NodeSet:
divs = doc.css('div')
divs.class # => Nokogiri::XML::NodeSet
And contains:
divs.to_html # => "<div id=\"foo\"><p>foo</p></div><div id=\"bar\"><p>bar</p></div>"
You can search a NodeSet using at to find the first match:
divs.at('#foo').to_html # => "<div id=\"foo\"><p>foo</p></div>"
And you can easily remove it:
divs.at('#foo').remove
Which removes it from the document itself:
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >>
# >> <div id="bar"><p>bar</p></div>
# >> </body>
# >> </html>
That doesn't delete it from the NodeSet, but that doesn't matter: the NodeSet is just a list of pointers to nodes in the document itself, used to say what to delete.
If you then want an updated NodeSet after deleting certain nodes, rescan the document and rebuild the NodeSet:
divs = doc.css('div')
divs.to_html # => "<div id=\"bar\"><p>bar</p></div>"
If your goal is to remove all the nodes in the NodeSet, instead of searching through that list you can simply use:
divs.remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >>
# >>
# >> </body>
# >> </html>
When I'm deleting nodes I don't gather an intermediate NodeSet. Instead I do it on the fly using something like:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo"><p>foo</p></div>
<div id="bar"><p>bar</p></div>
</body>
</html>
EOT
doc.at('div#bar p').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <div id="foo"><p>foo</p></div>
# >> <div id="bar"></div>
# >> </body>
# >> </html>
This deletes the embedded <p> tag in #bar. By relaxing the selector and changing from at to search I can remove them en masse:
doc.search('div p').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <div id="foo"></div>
# >> <div id="bar"></div>
# >> </body>
# >> </html>
If you insist on walking through the NodeSet, remember that they are like arrays, and you can treat them as such. Here's an example of using reject to skip a particular node:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo"><p>foo</p></div>
<div id="bar"><p>bar</p></div>
</body>
</html>
EOT
divs = doc.search('div').reject{ |d| d['id'] == 'foo' }
divs.map(&:to_html) # => ["<div id=\"bar\"><p>bar</p></div>"]
You won't receive a NodeSet though. You'll get an Array:
divs.class # => Array
While you can do that, you're better off using a specific selector to reduce the set rather than rely on Ruby to select or reject elements.
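One detail in the question's search call is worth spelling out. In Ruby, "a" && "b" evaluates to "b", so only the second XPath string ever reaches search, and a path that starts with // is evaluated against the whole document rather than just the nodes in the set. The sketch below uses the two expressions from the question; the combined expression at the end is one possible rewrite and is an assumption, not something from the original post:

```ruby
by_id   = "//w:del[@w:id='1']"
by_rsid = "//w:del/w:r[@w:rsidDel='003566BF']"

# && returns its right operand whenever the left one is truthy, and every
# String is truthy, so the first expression is silently dropped.
effective = by_id && by_rsid
puts effective

# One way to require both conditions in a single XPath: stack two predicates
# on the same w:del step. This is still a document-wide query because it
# starts with "//".
combined = "//w:del[@w:id='1'][w:r/@w:rsidDel='003566BF']"
puts combined
```

With a single expression like this, the matching node can be found with at or xpath on the document and removed directly, as in the examples above.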

String#gsub messing up the replacement?

I'm trying to replace a part of a page with external content on the fly.
Here is the source.html:
<!DOCTYPE html>
<html>
<head>
<%= foobar %>
</head>
<body>
This is body
</body>
</html>
And a replacement string inject.js:
var REGEXP = /^\'$/i; var foo = 1;
And Ruby code that outputs a file by combining both:
pageContent = File.read('./source.html')
jsContent = File.read('./inject.js');
output = pageContent.gsub("<%= foobar %>", jsContent)
File.open('./dest.html', "w+") do |f|
f.write(output)
end
However, I get a messed-up dest.html, which happens because of the \' in inject.js:
<!DOCTYPE html>
<html>
<head>
var REGEXP = /^
</head>
<body>
This is body
</body>
</html>$/i; var foo = 1;
</head>
<body>
This is body
</body>
</html>
How do I get rid of this issue?
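Try using the block form of gsub:
output = pageContent.gsub("<%= foobar %>") { jsContent }
With a block, the return value is inserted as-is, so sequences such as \' in the replacement are not interpreted as back-references.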
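A minimal sketch that reproduces the problem and the fix with inline strings instead of files (the variable names here are illustrative, not from the question):

```ruby
template = "<head>\n<%= foobar %>\n</head>"

# Same content as inject.js: a backslash followed by a single quote.
js = "var REGEXP = /^\\'$/i; var foo = 1;"

# String replacement: gsub treats \' in the replacement as a back-reference
# to "everything after the match", so the tail of the template is pasted in.
mangled = template.gsub("<%= foobar %>", js)

# Block form: the block's return value is used literally.
intact = template.gsub("<%= foobar %>") { js }

puts mangled
puts "---"
puts intact
```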
Could you please try something like %q{jsContent}?

Ruby Haml - cuts out tag (&& ||) attributes

Haml is behaving strangely: it cuts out a tag along with its attributes.
For example, I wrote it two ways.
First, with the head inside the layout:
!!!
%html{ lang: I18n.locale }
  %head{ 'data-hook' => 'inside_head' }
    %title= "sample title"
    %meta{ content: 'text/html; charset=UTF-8', 'http-equiv' => 'Content-Type' }
This produces the following code:
<!DOCTYPE html>
<html lang="ru">
<head data-hook="inside_head">
<title> sample title
</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
Ignoring the indentation, all is fine. But if I write the head in a partial and render it, Haml cuts out the head tag while still passing through the content of the partial!
My second and preferred way is:
!!!
%html{ lang: I18n.locale }
  = render 'shared/head', title: "sample app"
and the partial in shared/head.haml:
%head{ 'data-hook' => 'inside_head' }
  %title= title
  %meta{ content: 'text/html; charset=UTF-8', 'http-equiv' => 'Content-Type' }
But Haml produces the following strange code, in which the head tag is missing:
<!DOCTYPE html>
<html lang="ru">
<body>
<title> sample app
</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
What am I doing wrong? Or is Haml buggy?
Try renaming _head.haml to _head.html.haml. Works for me.
So the final usage will be the same:
!!!
%html
  = render 'shared/head'
And by the way, Haml has a better way to pass attributes to tags:
%head(data-hook='inside_head')

Can I put HTML into a variable?

Using the Sinatra library, I'm trying to condense two functions that display HTML code into a single function. Both these functions differ by only a small amount of HTML.
Here's an example.
def make_start_page()
<<EOS
<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
<p> Hello </p>
<img src="..." />
</body>
</html>
EOS
end
def make_guess_page()
<<EOS
<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
<p> Something different </p>
<a href="..." >1</a>
</body>
</html>
EOS
end
In the Ruby function that will call these two functions, I was wondering if it is possible to take the small portion of HTML that differs and pass it to a single, condensed version of these two functions that will display the page.
def handle()
if 1
# single quotes so the double quotes inside the HTML need no escaping
var = '<p> Hello </p>
<img src="..." />'
elsif 2
var = '<p> Something different </p>
<a href="..." >1</a>'
end
make_start_guess_page(var)
end
You can interpolate variables in a heredoc:
def make_start_page(var)
<<EOS
<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
#{var}
</body>
</html>
EOS
end
For example.
There is no reason why you could not do that. However, if you want to print it, you'll probably have to use something like String#html_safe in Rails, or != in Haml.
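To tie the interpolation approach back to the question's handle function, here is a minimal sketch. The names make_page and FRAGMENTS and the keys :start and :guess are hypothetical, chosen only for illustration. The page skeleton lives in one method and the caller picks the fragment that differs:

```ruby
# Hypothetical helper: wraps any HTML fragment in the shared page skeleton.
def make_page(body_html)
  <<~EOS
    <!DOCTYPE html>
    <html lang="en">
    <head>
    </head>
    <body>
    #{body_html}
    </body>
    </html>
  EOS
end

# The two fragments from the question. %( ) is used so the double quotes
# inside the HTML attributes need no escaping.
FRAGMENTS = {
  start: %(<p> Hello </p>\n<img src="..." />),
  guess: %(<p> Something different </p>\n<a href="..." >1</a>)
}.freeze

# Picks the fragment for the requested page and renders the full document.
def handle(kind)
  make_page(FRAGMENTS.fetch(kind))
end

start_page = handle(:start)
guess_page = handle(:guess)
puts start_page
```

Using fetch rather than [] means an unknown page key raises a KeyError instead of rendering an empty body.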

Nokogiri -- preserve doctype and meta tags

I'm using Nokogiri to open an existing HTML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Foo</title>
</head>
<body>
<!-- stuff -->
</body>
</html>
Then I change the contents of the body tag like this:
html_file = Nokogiri::HTML("path/to/html/file")
html_file.css('body').first.inner_html = "new body content"
Then I write this new document to a file like this:
File.open("path/to/new/html/file", 'w') {|f| f.write html_file}
And this is my resulting html file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
new body content
</body></html>
My question is whether it's possible to tell Nokogiri to preserve the original HTML file's doctype and meta tags, since it appears they are being lost or changed when I open the document with Nokogiri and attempt to write it to a file.
Any help would be much appreciated. Thanks!
Finally figured it out:
I just changed the line:
html_file = Nokogiri::HTML("path/to/html/file")
to
html_file = Nokogiri::HTML(File.open("path/to/html/file").read)
and now it works like I'm expecting it to. Seems kind of inconsistent, but I'm sure there's a good reason for it.
Thanks for all of the suggestions @ezkl!
