Parse HTML string into array - ruby

I'm developing a wiki-like diff feature for bodies of HTML produced by TinyMCE. diff-lcs is a diff gem that accepts arrays or objects. Most diff tasks operate on code and simply compare lines. A diff of HTML-laden text is more complex. If I just plug in the bodies of text, I get a character-by-character comparison. The output would be correct, but it would look like garbage.
seq1 = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"
seq2 = seq1.gsub(/[.!?]/, '\0|').split('|')
=> ["<p>Here is a paragraph.", " A sentence with <strong>bold text</strong>.", "</p><p>The second paragraph.", "</p>"]
If someone changes the second paragraph, the diff output involves the previous paragraph's end tag. I can't just use strip_tags because I'd like to keep formatting on the compare view. The ideal comparison is one based on complete sentences, with HTML separated out.
seq2.NokogiriMagic
=> ["<p>", "Here is a paragraph.", " A sentence with ", "<strong>", "bold text", "</strong>", ".", "</p>", "<p>", "The second paragraph.", "</p>"]
I found plenty of neat Nokogiri methods but nothing I've found does the above.

Here's how you could do it with a SAX parser:
require 'nokogiri'
html = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"
class ArraySplitParser < Nokogiri::XML::SAX::Document
  attr_reader :array
  def initialize; @array = []; end
  def start_element(name, attrs=[])
    tag = "<" + name
    attrs.each { |k,v| tag += " #{k}=\"#{v}\"" }
    @array << tag + ">"
  end
  def end_element(name); @array << "</#{name}>"; end
  def characters(str); @array += str.gsub(/\s/, '\0|').split('|'); end
end
parser = ArraySplitParser.new
Nokogiri::XML::SAX::Parser.new(parser).parse(html)
puts parser.array.inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>"]
Note that you'll have to wrap your HTML in a root element so that the XML parser doesn't miss the second paragraph in your example. Something like this should work:
# ...
Nokogiri::XML::SAX::Parser.new(parser).parse('<x>' + html + '</x>')
# ...
puts parser.array[1..-2].inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>", "<p>", "The ", "second ", "paragraph.", "</p>"]
[Edit] Updated to demonstrate how to retain element attributes in the "start_element" method.
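For comparison, here is a minimal stdlib-only sketch that produces the exact array asked for in the question (tags separated out, text split into sentences). It does not use Nokogiri at all: the two regular expressions are illustrative assumptions, and they will misbehave on input such as comments, CDATA, or a literal `>` inside an attribute value, so treat it as a starting point rather than a parser.

```ruby
# Sketch only: regex tokenizer, not a real HTML parser.
html = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p>" \
       "<p>The second paragraph.</p>"

# First pass: alternate between tags (<...>) and runs of text.
# Second pass: split each text run after sentence-ending punctuation.
tokens = html.scan(/<[^>]+>|[^<]+/).flat_map do |piece|
  if piece.start_with?('<')
    [piece]
  else
    piece.scan(/[^.!?]+[.!?]*|[.!?]+/)
  end
end

p tokens
# => ["<p>", "Here is a paragraph.", " A sentence with ", "<strong>", "bold text",
#     "</strong>", ".", "</p>", "<p>", "The second paragraph.", "</p>"]
```

The resulting flat array can be passed straight to Diff::LCS.sdiff, since every element is either one whole tag or one sentence fragment.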

Your code isn't idiomatic Ruby. We don't use mixed upper/lower case in variable names, and, as in programming generally, it's a good idea to use mnemonic variable names for clarity. Refactoring your code to be closer to how I'd write it:
tags = %w[p ol ul li h6 h5 h4 h3 h2 h1 em strong i b table thead tbody th tr td]
# Deconstruct HTML body 1
doc = Nokogiri::HTML.fragment(@versionOne.body)
nodes = doc.css(tags.join(', '))
# Reconstruct HTML body 1 into comparable array
output = []
nodes.each do |node|
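output << [
"<#{ node.name }",
# node.attributes is a Hash of name => attribute; the leading space separates each attribute from the tag name
node.attributes.map { |name, attr| ' %s="%s"' % [name, attr.value] }.join,
'>'
].join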
output << node.children.to_s.gsub(/[\s.!?]/, '|\0|').split('|').flatten
output << "</#{ node.name }>"
end
# Same deal for nokoOutput2
sdiff = Diff::LCS.sdiff(nokoOutput2.flatten, output.flatten)
The line:
tag | " #{ param.name }=\"#{ param.value }\" "
in your code isn't Ruby at all because String doesn't have a | operator. Did you add the | operator to your code and not show that definition?
A problem I see is:
output << node.children.to_s.gsub(/[\s.!?]/, '|\0|').split('|').flatten
Many of the tags you are looking for can contain other tags in your list:
<html>
<body>
<table><tr><td>
<table><tr><td>
foo
</td></tr></table>
</td></tr></table>
</body>
</html>
Creating a recursive method that handles:
node.attributes.map { |name, attr| ' %s="%s"' % [name, attr.value] }.join,
would probably improve your output. This is untested but is the general idea:
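def dump_node(node)
  # Text nodes have no tag of their own; emit their content as a single element.
  return [node.to_s] if node.text?
  output = [
    [
      "<#{ node.name }",
      node.attributes.map { |name, attr| ' %s="%s"' % [name, attr.value] }.join,
      '>'
    ].join
  ]
  # flat_map keeps the result a flat array of strings at every nesting level.
  output += node.children.flat_map { |n| dump_node(n) }
  output << "</#{ node.name }>"
end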

Related

ruby multiline scan between ; and negate?

I'm trying to match text between ;-.
I used:
inputx.scan(/;-.+?\n[^\n]*;-/)
but it doesn't work.
My text is:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
mmm;3
fly;-1
cat;4
bird;4
dragon;6
mor;-1
I need to separate the text between ;-.
For example, this is the first element of the resulting array:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
And this is second:
fly;-1
cat;4
bird;4
dragon;6
mor;-1
You may use a regex that will match any line that ends with - and 1 or more digits, and then matches any text up to the first line that ends with - and 1 or more digits:
/.*-\d+$(?m:.*?-\d+$)/
See the Rubular demo
Details:
.*-\d+$ - any 0+ chars other than line breaks, followed with - and 1+ digits
(?m:.*?-\d+$) - a modifier group where . matches line breaks matching:
.*? - any 0+ chars, as few as possible
- - a hyphen
\d+ - 1 or more digits
$ - end of line.
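As a quick check of the pattern described above, the following sketch runs it with String#scan on the sample data from the question. The variable name inputx comes from the question. The heredoc is an assumption standing in for however the text is actually loaded.

```ruby
inputx = <<~TEXT
  baseball;-1
  norm;4
  dad;3
  soda;1
  robot;-8
  mmm;3
  fly;-1
  cat;4
  bird;4
  dragon;6
  mor;-1
TEXT

# Outside the (?m:...) group, "." stops at line breaks; inside it, "." crosses them.
blocks = inputx.scan(/.*-\d+$(?m:.*?-\d+$)/)

p blocks
# => ["baseball;-1\nnorm;4\ndad;3\nsoda;1\nrobot;-8",
#     "fly;-1\ncat;4\nbird;4\ndragon;6\nmor;-1"]
```

The line mmm;3 falls between the two blocks and is skipped, because no match can start on a line that does not end in a hyphen followed by digits.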
You can use Array#split twice, the first to split by lines, and the second to split based on the presence of either ; or ;- (using the pattern /;-?/)
The pattern /;-?/ matches a semicolon followed by an optional -.
inputx.split("\n").map{|s| s.split(/;-?/)}
#=> [[" baseball", "1"], [" norm", "4"], [" dad", "3"], [" soda", "1"], [" robot", "8"], [" mmm", "3"], [" fly", "1"], [" cat", "4"], [" bird", "4"], [" dragon", "6"], [" mor", "1"]]
A pattern with scan or split results in a regex that is needlessly complicated because it's not the best tool in the box for the problem.
I'd use something like this:
text = <<EOT
baseball;-1
norm;4
dad;3
soda;1
robot;-8
mmm;3
fly;-1
cat;4
bird;4
dragon;6
mor;-1
EOT
ary = [[]]
text.lines.each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [[" baseball;-1\n",
# " norm;4\n",
# " dad;3\n",
# " soda;1\n",
# " robot;-8\n"],
# [" fly;-1\n",
# " cat;4\n",
# " bird;4\n",
# " dragon;6\n",
# " mor;-1\n"]]
If you don't want trailing new-lines:
ary = [[]]
text.lines.map(&:chomp).each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [[" baseball;-1", " norm;4", " dad;3", " soda;1", " robot;-8"],
# [" fly;-1", " cat;4", " bird;4", " dragon;6", " mor;-1"]]
If you don't want the whitespace surrounding each element:
ary = [[]]
text.lines.map(&:strip).each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [["baseball;-1", "norm;4", "dad;3", "soda;1", "robot;-8"],
# ["fly;-1", "cat;4", "bird;4", "dragon;6", "mor;-1"]]
How does this work? The .. and ... operators change meaning depending on whether they're used in the context of a Range or in an if condition. In a condition, .. is the "flip-flop" operator, which changes state when the first condition is met. It begins returning true at that point and continues to do so until the second condition is met, after which it returns false again. That makes it easy to look for something, then act on subsequent lines until the second condition occurs.
Normally we'd use different conditions, such as searching for "begin" and "end" in a block of lines in a file. In this case, though, the start and end conditions are the same, so we need it not to toggle off immediately, which is where ... comes in. It waits one iteration before testing the second condition, allowing the code to keep collecting lines until the "closing" ';-'.
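To make the difference between the two forms concrete, here is a small sketch on made-up data (the lines array is an assumption chosen only for illustration). With identical start and end conditions, the two-dot form switches on and off on the same element, while the three-dot form stays on until the next matching element.

```ruby
lines = %w[a;-1 b;2 c;-3 d;4 e;-5 f;6 g;-7]

# Two dots: the end condition is tested on the same element that switched it on,
# so each ";-" line opens and closes its own one-line range.
two_dots = lines.select { |l| true if l.include?(';-') .. l.include?(';-') }

# Three dots: the end condition is first tested on the following element,
# so the range runs from one ";-" line to the next one.
three_dots = lines.select { |l| true if l.include?(';-') ... l.include?(';-') }

p two_dots   # => ["a;-1", "c;-3", "e;-5", "g;-7"]
p three_dots # => ["a;-1", "b;2", "c;-3", "e;-5", "f;6", "g;-7"]
```

Note that d;4 is dropped in the three-dot case: it sits between the line that closed one range (c;-3) and the line that opens the next (e;-5).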
I have to say, this data set is one of the weirdest I've ever seen. (The weirdest was some binary data for the address book out of an old email program years ago). I'd be concerned about the process that's generating it, and if that generation was under my control I'd change it to use something more standard.
We can use Enumerable#chunk and Ruby's flip-flop operator. This does not require the use of a regular expression. str is the string given by the OP.
arr = str.lines.chunk do |line|
true if line.include?('-') ... line.include?('-')
end.select(&:first).map { |_,a| a.join }
#=> ["baseball;-1\nnorm;4\ndad;3\nsoda;1\nrobot;-8\n",
# "fly;-1\ncat;4\nbird;4\ndragon;6\nmor;-1\n"]
arr.each { |s| puts "\n"; puts s }
baseball;-1
norm;4
dad;3
soda;1
robot;-8
fly;-1
cat;4
bird;4
dragon;6
mor;-1
It is necessary to use three (not two) dots in the flip-flop expression (search for "three dot" in the reference given above).

Basic Pig Latin | Troubles with .each

I'm solving a basic training exercise and I got stuck. I have to move the first letter of each word to the end of it, then add 'ay' to the end of the word. I've been googling and came up with this code:
def pig_it translate_pig_latin
move_letters = text.split(' ')
.each do {|x| x[1..-1] << x.[0] << 'ay' }
move_letters.join(' ')
end
But for some reason it gives me this error
-e:4: syntax error, unexpected '|', expecting '}'
.each do {|x| x[1..-1] << x.[0] << 'ay' }
I know it's a problem with the .each method, but after reading the documentation and googling around I can't figure out what's wrong with it.
def translate_pig_latin(text)
  move_letters = text.split(' ')
    .map { |x| x[1..-1] << x[0] << 'ay' }
  move_letters.join(' ')
end
Some notes -
As another user stated, don't mix and match do/end and {}
Also, when using bracket notation to retrieve an item from an array, don't use a . like you have in x.[0]
Your .each block computes the right string for each word (once you follow the notes above), but each discards the block's result and returns the original array, so move_letters never changes. Using map, which collects each block result into a new array, makes the code work as above. An explicit return inside the block is not the fix: it would exit the method after the first word.
A more drawn out method if this helps you understand what's happening better
def translate_pig_latin(text)
# create array to contain piglatinified phrase
new_phrase = []
# each word of the original phrase do
text.split(' ').each do |x|
# grab the characters after the first character
new_word = x[1..-1]
# add the first character plus 'ay' to the end of the string
new_word << x[0] + 'ay'
# add the newly piglatinified string to the phrase
new_phrase << new_word
end
# turn the phrase into a space separated string
new_phrase.join(' ')
end
Use either do...end or {...}. Don't mix them with do { as you did. That line should look like this:
.each { |x| x[1..-1] << x[0] << 'ay' }
or
.each do |x| x[1..-1] << x[0] << 'ay' end
From a style perspective, most Rubyists prefer to use {...} for single-line blocks and reserve do...end for blocks that span multiple lines of code.
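Fixing the block syntax removes the syntax error, but each still returns the original array unchanged. A compact sketch of a working version, assuming the simple rule from the exercise (every word, no special handling of vowels or punctuation), uses map and non-mutating string concatenation:

```ruby
def translate_pig_latin(text)
  # map builds a new array from the value of each block call
  text.split(' ').map { |word| word[1..-1] + word[0] + 'ay' }.join(' ')
end

puts translate_pig_latin('hello world')
# prints: ellohay orldway
```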

Ruby regex to get text blocks including delimiters

When using scan in Ruby, we are searching for a block within a text file.
Sample file:
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
We want the following result in an array:
["begin\nsometext\nend","begin\nsometext2\nend"]
With this scan method:
textfile.scan(/begin\s.(.*?)end/m)
we get:
["sometext","sometext2"]
We want the begin and end still in the output, not cut off.
Any suggestions?
You may remove the capturing group completely:
textfile.scan(/begin\s.*?end/m)
See the IDEONE demo
The String#scan method returns captured values only if you have capturing groups defined inside the pattern, thus a non-capturing one should fix the issue.
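The effect of a capturing group on String#scan can be seen side by side in this sketch. The sample text mirrors the file from the question, and the first pattern is a simplified variant of the question's regex, written here only to show the behaviour.

```ruby
textfile = <<~TEXT
  sometextbefore
  begin
  sometext
  end
  sometextafter
  begin
  sometext2
  end
  sometextafter2
TEXT

# With a capturing group, scan returns one array of captures per match.
with_group = textfile.scan(/begin\s(.*?)\send/m)

# Without any group, scan returns each whole match as a string.
without_group = textfile.scan(/begin\s.*?end/m)

p with_group    # => [["sometext"], ["sometext2"]]
p without_group # => ["begin\nsometext\nend", "begin\nsometext2\nend"]
```

A non-capturing group, (?:...), behaves like the second pattern, so grouping can be kept for alternation or quantifiers without changing what scan returns.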
UPDATE
If the lines inside the blocks must be trimmed from leading/trailing whitespace, you can just use a gsub against each matched block of text to remove all the horizontal whitespace (with the help of \p{Zs} Unicode category/property class):
.scan(/begin\s.*?end/m).map { |s| s.gsub(/^\p{Zs}+|\p{Zs}+$/, "") }
Here, each match is passed to a block where /^\p{Zs}+|\p{Zs}+$/ matches either the start of a line with 1+ horizontal whitespace(s) (see ^\p{Zs}+), or 1+ horizontal whitespace(s) at the end of the line (see \p{Zs}+$).
See another IDEONE demo
Here's another approach, using Ruby's flip-flop operator. I cannot say I would recommend this approach, but Rubyists should understand how the flip-flop operator works.
First let's create a file.
str =<<_
some
text
at beginning
begin
some
text
1
end
some text
between
begin
some
text
2
end
some text at end
_
#=> "some\ntext\nat beginning\nbegin\n some\n text\n 1\nend\n...at end\n"
FName = "text"
File.write(FName, str)
Now read the file line-by-line into the array lines:
lines = File.readlines(FName)
#=> ["some\n", "text\n", "at beginning\n", "begin\n", " some\n", " text\n",
# " 1\n", "end\n", "some text\n", "between\n", "begin\n", " some\n",
# " text\n", " 2\n", "end\n", "some text at end\n"]
We can obtain the desired result as follows.
lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.
map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
The two steps are as follows.
First, select and group the lines of interest, using Enumerable#chunk with the flip-flop operator.
a = lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }
#=> #<Enumerator: #<Enumerator::Generator:0x007ff62b981510>:each>
We can see the objects that will be generated by this enumerator by converting it to an array.
a.to_a
#=> [[true, ["begin\n", " some\n", " text\n", " 1\n", "end\n"]],
# [true, ["begin\n", " some\n", " text\n", " 2\n", "end\n"]]]
Note that the flip-flop operator is distinguished from a range definition by making it part of a logical expression. For that reason we cannot write
lines.chunk { |line| line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.to_a
#=> ArgumentError: bad value for range
The second step is the following:
b = a.map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
Ruby has some great methods in Enumerable. slice_before and slice_after can help with this sort of problem:
string = <<EOT
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
EOT
ary = string.split # => ["sometextbefore", "begin", "sometext", "end", "sometextafter", "begin", "sometext2", "end", "sometextafter2"]
.slice_after(/^end/) # => #<Enumerator: #<Enumerator::Generator:0x007fb1e20b42a8>:each>
.map{ |a| a.shift; a } # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"], []]
ary.pop # => []
ary # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"]]
If you want the resulting sub-arrays joined then that's an easy step:
ary.map{ |a| a.join("\n") } # => ["begin\nsometext\nend", "begin\nsometext2\nend"]

Convert array of Ruby hashes to JSON (pretty) not using stdlib?

As per the question, just wondering how to do this without the use of the Ruby stdlib 'JSON' module (and thus the JSON.pretty_generate method).
So I have an array of hashes that looks like:
[{"h1"=>"a", "h2"=>"b", "h3"=>"c"}, {"h1"=>"d", "h2"=>"e", "h3"=>"f"}]
and I'd like to convert it so that it looks like the following:
[
{
"h1": "a",
"h2": "b",
"h3": "c",
},
{
"h1": "d",
"h2": "e",
"h3": "f",
}
]
I can get the hash-rockets replaced with colon spaces using a simple gsub (array_of_hashes.to_s.gsub!(/=>/, ": ")), but I'm not sure how to generate output that looks like the above example. I had originally thought of doing this using a here-doc approach, but I'm not sure this is the best way, plus I haven't managed to get it working yet either. I'm new to Ruby so apologies if this is obvious! :-)
def to_json_pretty
json_pretty = <<-EOM
[
{
"#{array_of_hashes.each { |hash| puts hash } }"
},
]
EOM
json_pretty
end
In general, working with JSON well without using a library is going to take more than just a few lines of code. That being said, the best way of JSON-ifying things is generally to do it recursively, for example:
def pretty_json(obj)
case obj
when Array
contents = obj.map {|x| pretty_json(x).gsub(/^/, " ") }.join(",\n")
"[\n#{contents}\n]"
when Hash
contents = obj.map {|k, v| "#{pretty_json(k.to_s)}: #{pretty_json(v)}".gsub(/^/, " ") }.join(",\n")
"{\n#{contents}\n}"
else
obj.inspect
end
end
This should work well if your input is exactly in the format you presented and not nested:
a = [{"h1"=>"a", "h2"=>"b", "h3"=>"c"}, {"h1"=>"d", "h2"=>"e", "h3"=>"f"}]
hstart = 0
astart = 0
a.each do |b|
puts "[" if astart == 0
astart+=1
b.each do |key, value|
puts " {" if hstart == 0
hstart += 1
puts " " + key.to_s + ' : ' + value
if hstart % b.size == 0
if hstart == a.collect(&:size).reduce(:+)
puts " }"
else
puts " },\n {"
end
end
end
puts "]" if astart == a.size
end
Output:
[
{
h1 : a
h2 : b
h3 : c
},
{
h1 : d
h2 : e
h3 : f
}
]
You can take a look at my NeatJSON gem for how I did it. Specifically, look at neatjson.rb, which uses a recursive solution (via a proc).
My code has a lot of variation based on what formatting options you supply, so it obviously does not have to be as complex as this. But the general pattern is to test the type of object supplied to your method/proc, serialize it if it's simple, or (if it's an Array or Hash) re-call the method/proc for each value inside.
Here's a far-simplified version (no indentation, no line wrapping, hard-coded spacing):
def simple_json(object)
js = ->(o) do
case o
when String then o.inspect
when Symbol then o.to_s.inspect
when Numeric then o.to_s
when TrueClass,FalseClass then o.to_s
when NilClass then "null"
when Array then "[ #{o.map{ |v| js[v] }.join ', '} ]"
when Hash then "{ #{o.map{ |k,v| [js[k],js[v]].join ":"}.join ', '} }"
else
raise "I don't know how to deal with #{o.inspect}"
end
end
js[object]
end
puts simple_json({a:1,b:[2,3,4],c:3})
#=> { "a":1, "b":[ 2, 3, 4 ], "c":3 }
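Putting the two ideas from the answers together (recursion with per-line indentation, plus explicit handling of each value type), a pretty-printer might look like the sketch below. The name pretty_json_sketch and the two-space default indent are assumptions made for this example. String#inspect is used for quoting, which is close to JSON escaping but not identical for every input (for example, it escapes "#{" and some non-ASCII characters differently).

```ruby
def pretty_json_sketch(obj, indent = '  ')
  case obj
  when Hash
    return '{}' if obj.empty?
    body = obj.map { |k, v| "#{k.to_s.inspect}: #{pretty_json_sketch(v, indent)}" }
    # gsub(/^/, indent) prefixes every line, so nested structures shift as a unit
    "{\n#{body.join(",\n").gsub(/^/, indent)}\n}"
  when Array
    return '[]' if obj.empty?
    body = obj.map { |v| pretty_json_sketch(v, indent) }
    "[\n#{body.join(",\n").gsub(/^/, indent)}\n]"
  when String, Symbol then obj.to_s.inspect
  when nil then 'null'
  when true, false, Numeric then obj.to_s
  else raise ArgumentError, "cannot serialize #{obj.inspect}"
  end
end

data = [{ "h1" => "a", "h2" => "b" }, { "h1" => "d", "flag" => nil }]
result = pretty_json_sketch(data)
puts result
```

Unlike the target output shown in the question, this sketch emits no comma after the last pair in each object, because trailing commas are not valid JSON.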

Better way to parse "Description (tag)" to "Description, tag"

I have a text file with many 1000s of lines like this, which are category descriptions with the keyword enclosed in parentheses
Chemicals (chem)
Electrical (elec)
I need to convert these lines to comma separated values like so:
Chemicals, chem
Electrical, elec
What I am using is this:
lines = line.gsub!('(', ',').gsub!(')', '').split(',')
I would like to know if there is a better way to do this.
For posterity, this is the full code (based on the answers):
require 'rubygems'
require 'csv'
csvfile = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
f.readlines.each do |line|
(desc, cat) = line.split('(')
desc.strip!
cat.strip!
csvfile << [desc, cat[0,cat.length-1]]
end
end
Try something like this:
line.sub!(/ \((\w+)\)$/, ', \1')
The \1 will be replaced with the first match of the given regexp (in this case it will be always the category keyword). So it will basically change the (chem) with , chem.
Let's create an example using a text file:
lines = []
File.open('categories.txt', 'r') do |file|
while line = file.gets
lines << line.sub(/ \((\w+)\)$/, ', \1')
end
end
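The backreference substitution described above can be tried without a file. In this sketch the sample lines are assumptions held in an array, and chomp removes the trailing newline so that only the converted text remains. The regular expression is the one from the answer.

```ruby
lines = ["Chemicals (chem)\n", "Electrical (elec)\n", "Dyes & Intermediates (dyes)\n"]

# \1 in the replacement refers to the keyword captured inside the parentheses.
converted = lines.map { |line| line.chomp.sub(/ \((\w+)\)$/, ', \1') }

p converted
# => ["Chemicals, chem", "Electrical, elec", "Dyes & Intermediates, dyes"]
```

Because the pattern is anchored to the trailing "(keyword)", descriptions that contain spaces are left intact, and a line without a parenthesised keyword passes through unchanged.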
Based on the question updates I can propose this:
require 'csv'
csv_file = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
f.each_line {|c| csv_file << c.scan(/^(.+) \((\w+)\)$/).flatten} # scan returns [[desc, tag]]; flatten yields one CSV row
end
csv_file.close
Starting with Ruby 1.9, you can do it in one method call:
str = "Chemicals (chem)\n"
mapping = { ' (' => ', ',
')' => ''}
str.gsub(/ \(|\)/, mapping) #=> "Chemicals, chem\n"
In Ruby, a cleaner, more efficient way to do it would be:
description, tag = line.split(' ', 2) # split(' ', 2) returns a two-element array:
# all characters up to the first space, and all characters after it. Multiple
# assignment then puts each element in its own local variable. Note that this
# assumes the description itself contains no spaces.
tag = tag[1, (tag.length - 1) - 1] # extract the inner characters (dropping the first and last) of the string
new_line = description << ", " << tag # rejoin the parts into a new string
This should be computationally faster (if you have a lot of rows) because it uses direct string operations instead of regular expressions.
No need to manipulate the string. Just grab the data and output it to the CSV file.
Assuming that you have something like this in the data:
Chemicals (chem)
Electrical (elec)
Dyes & Intermediates (dyes)
This should work:
File.open('categories.txt', 'r') do |file|
file.each_line do |line|
csvfile << line.match(/^(.+)\s\((.+)\)$/) { |m| [m[1], m[2]] }
end
end
Benchmarks relevant to discussion in @hundredwatt's answer:
require 'benchmark'
line = "Chemicals (chem)"
# @hundredwatt
puts Benchmark.measure {
100000.times do
description, tag = line.split(' ', 2)
tag = tag[1, (tag.length - 1) - 1]
new_line = description << ", " << tag
end
} # => 0.18
# NeX
puts Benchmark.measure {
100000.times do
line.sub!(/ \((\w+)\)$/, ', \1')
end
} # => 0.08
# steenslag
mapping = { ' (' => ', ',
')' => ''}
puts Benchmark.measure {
100000.times do
line.gsub(/ \(|\)/, mapping)
end
} # => 0.08
I know nothing about Ruby, but it is easy in PHP:
preg_match('~(.+?)\s*\((.+)\)~', 'Chemicals (chem)', $m);
$result = $m[1] . ', ' . $m[2];

Resources