Is there a ruby gem that does diff between HTML documents? - ruby

Doing a diff of two different html documents turns out to be an entirely different problem than simply doing a diff of plain text. For example, if I do a naive LCS diff between:
Google</p>
and
Google</a></p>
the diff result is NOT:
</a>
but
/a></
I've tried most of the gems that claim to do HTML diffing, but all of them seem to implement only a text-based LCS diff. Is there a gem that does a diff while taking HTML tags into account?

Try Samy diffy or rubygems html-diff

After much searching for a gem to do this for me, I discovered that I can simply do a string compare between two parsed Nokogiri documents:
require 'nokogiri'
# Parse both inputs, then compare the re-serialized documents.
def should_match_html(html_text1, html_text2)
  dom1 = Nokogiri::HTML(html_text1)
  dom2 = Nokogiri::HTML(html_text2)
  dom1.to_s.should == dom2.to_s
end
Then you can simply add this in your spec:
should_match_html expected_html, actual_html
The best part is that the built-in rspec matcher will automatically provide you a line-by-line diff result of the mismatched lines.
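For the tag-aware behaviour the question asks about, one option is to split each document into whole tags and text runs before running LCS, so that a tag can only be inserted or deleted as a unit. What follows is a minimal sketch in plain Ruby, not the behaviour of any particular gem: tokenize_html and token_diff are hypothetical names, the sample markup is made up for illustration, and the regex tokenizer assumes well-formed tags with no > inside attribute values.

```ruby
# Split markup into whole tags and the text runs between them.
def tokenize_html(html)
  html.scan(/<[^>]*>|[^<]+/)
end

# Build an LCS table over tokens, then walk it to emit [operation, token] pairs.
def token_diff(old_html, new_html)
  a = tokenize_html(old_html)
  b = tokenize_html(new_html)

  # lcs[i][j] holds the LCS length of a[i..-1] and b[j..-1].
  lcs = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (a.size - 1).downto(0) do |i|
    (b.size - 1).downto(0) do |j|
      lcs[i][j] = if a[i] == b[j]
                    lcs[i + 1][j + 1] + 1
                  else
                    [lcs[i + 1][j], lcs[i][j + 1]].max
                  end
    end
  end

  ops = []
  i = j = 0
  while i < a.size && j < b.size
    if a[i] == b[j]
      ops << [:same, a[i]]
      i += 1
      j += 1
    elsif lcs[i + 1][j] >= lcs[i][j + 1]
      ops << [:delete, a[i]]
      i += 1
    else
      ops << [:insert, b[j]]
      j += 1
    end
  end
  # Whatever is left on either side is a pure deletion or insertion.
  ops.concat(a[i..-1].map { |token| [:delete, token] })
  ops.concat(b[j..-1].map { |token| [:insert, token] })
  ops
end

old_html = '<p><a href="#">Google</p>'
new_html = '<p><a href="#">Google</a></p>'
changes = token_diff(old_html, new_html).reject { |op, _| op == :same }
changes.each { |op, token| puts "#{op}: #{token}" }
# prints "insert: </a>"
```

Because the closing tag is a single token, the diff reports it whole instead of a character run that straddles two tags.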

Related

Nokogiri XML Parser with Bad Attribute Values

I can't find any good documentation on the difference between how Nokogiri (or by implication libxml) handles attribute values in XML vs. HTML. One of our projects was still using the now-defunct Hpricot gem, mostly because of its lax acceptance of attributes.
The crux of the problem seems to be that our XML input has both unquoted and missing attribute values. I'm not a spec lawyer, but I gather that most of the HTML variants allow these attribute patterns and XML does not.
If Nokogiri (or libxml) is going to be strict, shouldn't there be an option to make it less strict on attributes? If I could get the HTML parser not to strip the namespaces, I could maybe use that.
We can't be the only team that has XMLish formats that aren't exactly fish or fowl but something in between. If we could fix it at the source we might do that, but in the meantime we have to handle the format as it is.
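This is my hack to fix the attributes before sending the data to Nokogiri:
require 'nokogiri'
# One attribute: a name, optionally followed by = and an unquoted or quoted value.
ATTR_RE = /[^\s=>]+\s*(?:=(?:[^\s'">]+|\s*"[^"]*"|\s*'[^']*'))?/mo
# An opening tag: tag name, any number of attributes, closing bracket.
ELEMENT_RE = /(<\s*[:\w]+)((?:\s+#{ATTR_RE})*)(\s*>)/mo
Nokogiri::XML(
  data.gsub(ELEMENT_RE) do |m|
    open, close = $1, $3
    ([open] +
      $2.scan(ATTR_RE).map do |atr|
        if atr =~ /=[ '"]/
          atr                               # = followed by a quote or space: keep as is
        elsif atr =~ /=/
          "#{$`.strip}=\"#{$'.strip}\""     # unquoted value: wrap it in quotes
        else
          "#{atr.strip}=\"#{atr.strip}\""   # missing value: reuse the name as the value
        end
      end
    ) * ' ' + close
  end
)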

Prevent double escaping with CodeRay and RDiscount

I am looking for the most straightforward way to have code syntax highlighting in markdown, using Ruby (without Rails).
I have tried some things with Kramdown and Rouge and could not make them work, so I am now working with RDiscount and CodeRay.
Most things work as I expect, with one small and one big issue:
Small Issue: The only way I found to make CodeRay work with RDiscount is to apply the highlighting to the HTML rather than to the markdown document. This seems a little off to me and prone to errors. Is there another way?
Big Issue: I am now facing a double HTML escaping issue, and was unable to find any html_escape: false option in the CodeRay documentation.
Code
require 'rdiscount'
require 'coderay'
markdown = <<EOF
```ruby
A > B
```
EOF
def coderay(text)
  text.gsub(/\<code class="(.+?)"\>(.+?)\<\/code\>/m) do
    CodeRay.scan($2, $1).html
  end
end
html = RDiscount.new(markdown).to_html
html = coderay(html)
puts html
Output
Notice the double escaping on the greater-than sign:
<pre>
<span class="constant">A</span> &amp;gt; <span class="constant">B</span>
</pre>
I have found this related question, but it's old and has no solution for this case.
The only approach I can come up with is to unescape the HTML before passing it through CodeRay, but this does not feel right to me. Nevertheless, here is a working alternative:
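require 'cgi'
def coderay(text)
  text.gsub(/\<code class="(.+?)"\>(.+?)\<\/code\>/m) do
    lang, code = $1, $2
    # RDiscount has already escaped the code once; undo that before CodeRay escapes it again.
    code = CGI.unescapeHTML code
    CodeRay.scan(code, lang).html
  end
end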
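To see why unescaping between the two stages helps, without installing either gem, here is a small sketch that uses only the standard library. CGI.escapeHTML stands in for the escaping that each HTML-producing stage performs; render_stage is a hypothetical name, not a CodeRay or RDiscount call.

```ruby
require 'cgi'

# Stand-in for any stage that emits HTML: it escapes whatever text it receives.
def render_stage(text)
  CGI.escapeHTML(text)
end

source      = 'A > B'
after_first = render_stage(source)                         # first stage escapes once
double      = render_stage(after_first)                    # second stage escapes the escape
fixed       = render_stage(CGI.unescapeHTML(after_first))  # unescape between the stages

puts after_first  # A &gt; B
puts double       # A &amp;gt; B
puts fixed        # A &gt; B
```

The second stage has no way of knowing its input was already escaped, so the unescape step has to happen on the caller's side.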

Better diffs with minitest

I'm looking at my test case results, and it's far too difficult to see where the one small failure in my test is coming from.
I'm dealing with reasonably sized data structures, and I don't want to change the to_s method just to make the minitest diff slightly better.
I've looked at the reporters but they don't seem to have anything like what I'm looking for. (I'm using ruby 1.9.3)
Is there any way that minitest or some library for minitest could highlight the part of the string that is different between two results?
Or is there something I'm missing that allows you to visually look at the diff more easily?
Edit: Example
Minitest::Assertion:
--- expected
+++ actual
@@ -1 +1 @@
-#<struct MyModule::Swipe id=0, lat=37.62996, lng=-122.42115, route=#<struct MyModule::Route id=17, bus_name="test_name", stops=[#<struct MyModule::Stop id=29, name="Cool Stop">]>, date_time="2015-10-29T11:05:02+00:00">
+#<struct MyModule::Swipe id=0, lat=37.62996, lng=-122.42115, route=#<struct MyModule::Route id=17, bus_name="test_name", stops=[#<struct MyModule::Stop id="29", name="Cool Stop">]>, date_time="2015-10-29T11:05:02+00:00">
Instead, it could show the line and highlight in another colour only the id="29" vs. the id=29. Minitest seems to compute the diff line by line.
pretty-diff
I had the same problem, and in the case of invisible blank characters this gem is still not good enough for debugging. I ended up adding .inspect to both strings that I passed to assert_equal in my minitest test case.
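If no reporter does this for you, the differing region of two long single-line values can be isolated by trimming their common prefix and suffix. A minimal sketch in plain Ruby follows; diff_region is a hypothetical helper, not part of minitest or any gem, it finds only one contiguous region, and the struct strings are shortened from the example in the question.

```ruby
# Return what is left of each string after removing the prefix and suffix
# the two strings have in common.
def diff_region(expected, actual)
  max = [expected.size, actual.size].min

  prefix = 0
  prefix += 1 while prefix < max && expected[prefix] == actual[prefix]

  suffix = 0
  suffix += 1 while suffix < max - prefix &&
                    expected[-1 - suffix] == actual[-1 - suffix]

  [expected[prefix...(expected.size - suffix)],
   actual[prefix...(actual.size - suffix)]]
end

expected = '#<struct Stop id=29, name="Cool Stop">'
actual   = '#<struct Stop id="29", name="Cool Stop">'
exp_part, act_part = diff_region(expected, actual)

# inspect makes quotes and invisible whitespace visible in the output.
puts "expected: #{exp_part.inspect}"
puts "actual:   #{act_part.inspect}"
```

The two parts could be passed as the failure message argument of assert_equal so that they appear next to the regular diff.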

Ruby Nokogiri - How to prevent Nokogiri from printing HTML character entities

I have an HTML file which I parse with Nokogiri, modify, and then write back out as HTML, like this:
require 'nokogiri'
html_text = File.read('input.html')
h_doc = Nokogiri::HTML(html_text)
# ... modify h_doc ...
File.open('output.html', 'w') do |file|
  file.write(h_doc.to_html)
end
The question is how to prevent Nokogiri from printing HTML character entities (&lt;, &gt;, &amp;) in the final generated HTML file.
Instead of the entities, I want it to print the actual characters (<, >, &).
As an example, it is printing the HTML like
<title>&lt;%= ("/emailclient=sometext") %&gt;</title>
and I want it to output like this
<title><%= ("/emailclient=sometext")%></title>
So... you want Nokogiri to output incorrect or invalid XML/HTML?
Best suggestion I have: replace those sequences with something else beforehand, cut it up with Nokogiri, then replace them back. Your input is not XML/HTML, so there is no point expecting Nokogiri to know how to handle it correctly. Because look:
<div>To write "&amp;", you need to write "&amp;amp;".</div>
This renders:
To write "&", you need to write "&amp;".
If you had your way, you'd get this HTML:
<div>To write "&", you need to write "&amp;".</div>
which would render as:
To write "&", you need to write "&".
Even worse in this scenario, say, in XHTML:
<div>Use the &lt;script&gt; tag for JavaScript</div>
if you replace the entities, you get an undisplayable file, due to the unclosed <script> tag:
<div>Use the <script> tag for JavaScript</div>
EDIT I still think you're trying to get Nokogiri to do something it is not designed to do: handle template HTML. I'd rather assume that your documents normally don't contain those sequences, and post-correct them:
doc.traverse do |node|
  if node.text?
    node.content = node.content.gsub(/^(\s*)(\S.+?)(\s*)$/,
                                     "\\1<%= \\2 %>\\3")
  end
end
puts doc.to_html.gsub('&lt;%=', '<%=').gsub('%&gt;', '%>')
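The "replace beforehand, then replace back" suggestion from earlier in this answer could be sketched roughly as follows. This is an illustration under assumptions, not something Nokogiri provides: protect_erb and restore_erb are hypothetical helpers, and the tokens are plain alphanumerics on the assumption that an HTML serializer has no reason to escape them. The Nokogiri step in the middle is left as a comment so the sketch runs with the standard library alone.

```ruby
require 'securerandom'

# Swap every ERB tag for an alphanumeric token and remember the mapping.
def protect_erb(source)
  tags = {}
  prefix = "ERB#{SecureRandom.hex(4)}X"
  protected_source = source.gsub(/<%.*?%>/m) do |tag|
    token = "#{prefix}#{tags.size}X"
    tags[token] = tag
    token
  end
  [protected_source, tags]
end

# Put the original ERB tags back in place of their tokens.
def restore_erb(html, tags)
  # Block form of gsub, so backslashes inside a tag are not treated as escapes.
  tags.reduce(html) { |out, (token, tag)| out.gsub(token) { tag } }
end

template = '<title><%= ("/emailclient=sometext") %></title>'
safe, tags = protect_erb(template)
# ... parse `safe` with an HTML parser, modify the document, serialize it ...
restored = restore_erb(safe, tags)
puts restored == template
# prints "true"
```

The trailing X in each token keeps token 1 from matching inside token 10 during restoration.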
You absolutely can prevent Nokogiri from transforming your entities. It's a built-in function, even; no voodoo or hacking needed. Be warned: I'm not a Nokogiri guru, and I've only got this to work when acting directly on a node inside a document, but I'm sure a little digging can show you how to do it with a standalone node too.
When you create or load your document, you need to include the NOENT option. That's it. You're done; you can now add entities to your heart's content.
It is important to note that there are about half a dozen ways to load a doc with options; below is my personal favorite.
require 'nokogiri'
noko_doc = File.open('<my/doc/path>') { |f| Nokogiri.<XML_or_HTML>(f, &:noent)}
xpath = '<selector_for_element>'
noko_doc.at_<css_or_xpath>(xpath).set_attribute('I_can_now_safely_add_preformatted_entities!', '&&&&&')
puts noko_doc.at_xpath(xpath).attributes['I_can_now_safely_add_preformatted_entities!']
>>> &&&&&
As for the usefulness of this feature: I find it incredibly useful. There are plenty of cases where you are dealing with preformatted data that you do not control, and it would be a serious pain to have to manage incoming entities just so Nokogiri could put them back the way they were.

What would be the best way to take a string of html, chop it up, and put each piece into an array?

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to divide it into logical pieces, and then put those pieces into an array so I end up with a result like this
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use regex I'd use the nokogiri gem (a gem for parsing html written by Aaron Patterson - contributor to Rails and Ruby). Here's a sample of how to use it:
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}
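The one-liner above relies on String#split keeping the separators in the result when the pattern contains a capture group. Here it is wrapped in a small method and run on the string from the question; html_pieces is a hypothetical name, and the approach assumes simple markup with no > inside attribute values or comments, which is where a real parser is the safer choice.

```ruby
# Split on tags; the capture group makes split return the tags as well.
# Empty strings appear where two tags are adjacent, so drop them.
def html_pieces(html)
  html.split(/(<[^>]*>)/).reject(&:empty?)
end

some_html = '<div><b>This is some BOLD text</b></div>'
html_array = html_pieces(some_html)
p html_array
# => ["<div>", "<b>", "This is some BOLD text", "</b>", "</div>"]
```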
