How can I read a string into a Ruby dictionary? - ruby

I currently want to replace my WordPress blog with a Jekyll blog. To do so, I have to find an alternative to WordPress caption tags:
[caption id="attachment_76716" align="aligncenter" width="500"]<img src="http://martin-thoma.com/wp-content/uploads/2013/11/WER-calculation.png" alt="WER calculation" width="500" height="494" class="size-full wp-image-76716" /> WER calculation[/caption]
I thought it would be nice, if I could use them like that in my posts:
{% caption align="aligncenter" width="500" alt="WER calculation" text="WER calculation" url="../images/2013/11/WER-calculation.png" %}
While it should get rendered to:
<div style="width: 510px" class="wp-caption aligncenter">
<a href="../images/2013/11/WER-calculation.png">
<img src="../images/2013/11/WER-calculation.png" alt="WER calculation" width="500" height="494" class="size-full">
</a>
<p class="wp-caption-text">WER calculation</p>
</div>
So I've written some Python code that does the replacement (once), and I wanted to write a Ruby / Liquid / Jekyll plugin that does the rendering. But I don't know how to read
align="aligncenter" width="500" alt="WER calculation" text="WER calculation" url="../images/2013/11/WER-calculation.png"
into a Ruby dictionary (they seem to be called "Hash"?).
Here is my plugin:
# Title: Caption tag
# Author: Martin Thoma, http://martin-thoma.com
module Jekyll
  class CaptionTag < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
      @tokens = tokens
    end
    def render(context)
      @hash = Hash.new
      @array = @text.split(" ")
      @array.each do |element|
        key, value = element.split("=")
        @hash[key] = value
      end
      #"#{@text} #{@tokens}"
      "<div style=\"width: #{@hash['width']}px\" class=\"#{@hash['alignment']}\">" +
      "<a href=\"../images/#{@hash['url']}\">" +
      "<img src=\"../images/#{@hash['url']}\" alt=\"#{@hash['text']}\" width=\"#{@hash['width']}\" height=\"#{@hash['height']}\" class=\"#{@hash['class']}\">" +
      "</a>" +
      "<p class=\"wp-caption-text\">#{@hash['text']}</p>" +
      "</div>"
    end
  end
end
Liquid::Template.register_tag('caption', Jekyll::CaptionTag)
In Python, I would use the CSV module and set delimiter to space and quotechar to ". But I'm new to Ruby.
I've just seen that Ruby also has a CSV-module. But it doesn't work, as the quoting isn't correct. So I need some html-parsing.
A Python solution
def parse(text):
    splitpoints = []
    # parse: record every space that is outside a quoted value
    isOpen = False
    for i, char in enumerate(text):
        if char == '"':
            isOpen = not isOpen
        if char == " " and not isOpen:
            splitpoints.append(i)
    # the end of the string closes the final key="value" pair
    splitpoints.append(len(text))
    # build data structure
    dictionary = {}
    last = 0
    for i in splitpoints:
        key, value = text[last:i].split('=', 1)
        last = i+1
        dictionary[key] = value[1:-1] # remove delimiter
    return dictionary
print(parse('align="aligncenter" width="500" alt="WER calculation" text="WER calculation" url="../images/2013/11/WER-calculation.png"'))

If you set the row separator to a space, the column separator to '=' and the quote character to '"', you can easily parse your string into a Hash with Ruby's CSV class:
require 'csv'
def parse_attrs(input)
  options = { col_sep: '=', row_sep: ' ', quote_char: '"' }
  # Pass the options as keyword arguments (required on Ruby 3+)
  csv = CSV.new(input, **options)
  csv.each_with_object({}) do |row, attrs|
    attr, value = row
    value ||= true # bare attributes such as `required` become true
    attrs[attr] = value
  end
end
Example:
irb(main):031:0> input = 'align="aligncenter" width="500" alt="WER calculation" text="WER calculation" url="../images/2013/11/WER-calculation.png" required'
=> "align=\"aligncenter\" width=\"500\" alt=\"WER calculation\" text=\"WER calculation\" url=\"../images/2013/11/WER-calculation.png\" required"
irb(main):032:0> parse_attrs input
=> {"align"=>"aligncenter", "width"=>"500", "alt"=>"WER calculation", "text"=>"WER calculation", "url"=>"../images/2013/11/WER-calculation.png", "required"=>true}
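If pulling in the CSV library feels heavy for a Liquid tag, the same key="value" parsing can be sketched with String#scan. This is only an illustration, not part of the answer above: the method name parse_attrs_with_scan is made up here, and it assumes keys consist of word characters and values never contain a double quote.

```ruby
# Hypothetical helper: parse key="value" pairs and bare flags with a regex.
# Assumes keys match \w+ and quoted values contain no embedded double quotes.
def parse_attrs_with_scan(input)
  # Each match is either key="quoted value" or a bare word such as `required`.
  input.scan(/(\w+)(?:="([^"]*)")?/).each_with_object({}) do |(key, value), attrs|
    # A bare word has no captured value, so store true for it.
    attrs[key] = value.nil? ? true : value
  end
end

attrs = parse_attrs_with_scan('align="aligncenter" width="500" alt="WER calculation" required')
puts attrs.inspect
```

Inside the plugin's render method, the resulting Hash could then be read with lookups such as attrs['width'].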

Related

Nokogiri-- To remove unwanted spaces between tags having no text

I have HTML content like this:
html = "<table id=\"soa_table\" class=\"table table-striped table-bordered table-condensed soa-table\"><thead><tr><th>SoA</th><th id=\"423\" class=\"soa-column text-center\">V1</th><th id=\"424\" class=\"soa-column text-center\">V2</th></tr></thead><tbody><tr><td class=\"soa-row\" id=\"631\">Label 1</td><td class=\"soa-element text-center\" form_id=\"631\" visit_id=\"423\" id=\"484\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td><td class=\"soa-element\" form_id=\"631\" visit_id=\"424\" id=\"0\"> </td></tr><tr><td class=\"soa-row\" id=\"632\">Label 2</td><td class=\"soa-element text-center\" form_id=\"632\" visit_id=\"423\" id=\"485\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td><td class=\"soa-element\" form_id=\"632\" visit_id=\"424\" id=\"0\"> </td></tr><tr><td class=\"soa-row\" id=\"633\">Label 3</td><td class=\"soa-element\" form_id=\"633\" visit_id=\"423\" id=\"0\"> </td><td class=\"soa-element text-center\" form_id=\"633\" visit_id=\"424\" id=\"486\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td></tr></tbody></table>"
I parsed it with Nokogiri and tried to gsub the spaces like this:
Nokogiri::HTML(html).at('table').to_html.gsub(/>\s+</, "><")
But it doesn't work.
remove unwanted spaces between tags having no text
I assume you mean this kind of space:
<td class="soa-element" form_id="631" visit_id="424" id="0"> </td>
^
That's a text node containing a single space.
Let's use a smaller example:
html = '<foo>value</foo><bar> </bar>'
doc = Nokogiri::HTML.fragment(html)
You can use PP to inspect the parsed document structure:
require 'pp'
pp doc
Output:
#(DocumentFragment:0x3fe819894018 {
name = "#document-fragment",
children = [
#(Element:0x3fe819891b9c { name = "foo", children = [ #(Text "value")] }),
#(Element:0x3fe819891ae8 { name = "bar", children = [ #(Text " ")] })]
})
The document contains two text nodes, one with "value" the other one with " ".
In order to remove the latter, we can traverse the document and remove all text nodes containing just whitespace:
doc.traverse { |node| node.remove if node.text? && node.text !~ /\S/ }
pp doc
Output:
#(DocumentFragment:0x3fe819894018 {
name = "#document-fragment",
children = [
#(Element:0x3fe819891b9c { name = "foo", children = [ #(Text "value")] }),
#(Element:0x3fe819891ae8 { name = "bar" })]
})
Finally, we can serialize the document:
doc.to_html
#=> "<foo>value</foo><bar></bar>"
gsub does not substitute into the source object. gsub! does. Also, you don't need Nokogiri at all.
require 'nokogiri'
puts 'Needlessly using Nokogiri'
html = "<p> </p>"
new_html = Nokogiri::HTML(html).at('p').to_html.gsub(/>\s+</, '><')
puts html
puts new_html
puts '-' * 20
puts 'Solution #1'
html = "<p> </p>"
new_html = html.gsub(/>\s+</, '><')
puts html
puts new_html
puts '-' * 20
puts 'Solution #2'
html = "<p> </p>"
puts html
html.gsub!(/>\s+</,'><')
puts html
The output of this program is:
Needlessly using Nokogiri
<p> </p>
<p></p>
--------------------
Solution #1
<p> </p>
<p></p>
--------------------
Solution #2
<p> </p>
<p></p>
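The difference between gsub and gsub! described above can be checked in isolation. The following is a minimal sketch using only core String methods; the variable names are chosen here for illustration.

```ruby
# .dup gives a mutable copy, so the example also works when string literals are frozen.
html = "<p> </p>".dup

# gsub returns a new string and leaves the receiver untouched.
copy = html.gsub(/>\s+</, '><')
puts html   # the original still contains the space
puts copy   # the copy does not

# gsub! changes the receiver in place and returns it when something was replaced...
first = html.gsub!(/>\s+</, '><')
# ...but returns nil when there was nothing left to replace.
second = html.gsub!(/>\s+</, '><')
puts html
puts second.inspect
```

Because gsub! can return nil, chaining further calls onto its result is fragile; assigning the result of gsub is usually the safer form.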
Remove whitespace-only text nodes:
doc.search('//text()[normalize-space()=""]').remove
Update with example:
Nokogiri::HTML('<b></b> <b></b>').search('//text()[normalize-space()=""]').remove
#=> [#<Nokogiri::XML::Text:0x197ad78 " ">]

Parse HTML string into array

I'm developing a wiki-like difference functionality for bodies of HTML produced by TinyMCE. diff-lcs is a difference gem that accepts arrays or objects. Most difference tasks are on code and just compare lines. A difference on bodies of HTML-ridden text is more complex. If I just plug in the bodies of text, I get a character-by-character comparison. Although the output would be correct, it would look like garbage.
seq1 = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"
seq2 = seq1.gsub(/[.!?]/, '\0|').split('|')
=> ["<p>Here is a paragraph.", " A sentence with <strong>bold text</strong>.", "</p><p>The second paragraph.", "</p>"]
If someone changes the second paragraph, the difference output involves the previous paragraph's end tag. I can't just use strip_tags because I'd like to keep formatting on the compare view. The ideal comparison is one based on complete sentences, with HTML separated out.
seq2.NokogiriMagic
=> ["<p>", "Here is a paragraph.", " A sentence with ", "<strong>", "bold text", "</strong>", ".", "</p>", "<p>", "The second paragraph.", "</p>"]
I found plenty of neat Nokogiri methods but nothing I've found does the above.
Here's how you could do it with a SAX parser:
require 'nokogiri'
html = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"
class ArraySplitParser < Nokogiri::XML::SAX::Document
  attr_reader :array
  def initialize; @array = []; end
  def start_element(name, attrs=[])
    tag = "<" + name
    attrs.each { |k,v| tag += " #{k}=\"#{v}\"" }
    @array << tag + ">"
  end
  def end_element(name); @array << "</#{name}>"; end
  def characters(str); @array += str.gsub(/\s/, '\0|').split('|'); end
end
parser = ArraySplitParser.new
Nokogiri::XML::SAX::Parser.new(parser).parse(html)
puts parser.array.inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>"]
Note that you'll have to wrap your HTML in a root element so that the XML parser doesn't miss the second paragraph in your example. Something like this should work:
# ...
Nokogiri::XML::SAX::Parser.new(parser).parse('<x>' + html + '</x>')
# ...
puts parser.array[1..-2].inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>", "<p>", "The ", "second ", "paragraph.", "</p>"]
[Edit] Updated to demonstrate how to retain element attributes in the "start_element" method.
You're not writing your code in idiomatic Ruby. We don't use mixed upper/lower case in variable names. Also, in programming in general, it's a good idea to use mnemonic variable names for clarity. Refactoring your code to be closer to how I'd write it:
tags = %w[p ol ul li h6 h5 h4 h3 h2 h1 em strong i b table thead tbody th tr td]
# Deconstruct HTML body 1
doc = Nokogiri::HTML.fragment(@versionOne.body)
nodes = doc.css(tags.join(', '))
# Reconstruct HTML body 1 into comparable array
output = []
nodes.each do |node|
  output << [
    "<#{ node.name }",
    # attribute_nodes yields Attr objects; the leading space separates each one from the tag name
    node.attribute_nodes.map { |attr| ' %s="%s"' % [attr.name, attr.value] }.join,
    '>'
  ].join
  output << node.children.to_s.gsub(/[\s.!?]/, '|\0|').split('|').flatten
  output << "</#{ node.name }>"
end
# Same deal for nokoOutput2
sdiff = Diff::LCS.sdiff(nokoOutput2.flatten, output.flatten)
The line:
tag | " #{ param.name }=\"#{ param.value }\" "
in your code isn't Ruby at all because String doesn't have a | operator. Did you add the | operator to your code and not show that definition?
A problem I see is:
output << node.children.to_s.gsub(/[\s.!?]/, '|\0|').split('|').flatten
Many of the tags you are looking for can contain other tags in your list:
<html>
<body>
<table><tr><td>
<table><tr><td>
foo
</td></tr></table>
</td></tr></table>
</body>
</html>
Creating a recursive method that builds the opening tag and then descends into node.children would probably improve your output. This is untested but is the general idea:
def dump_node(node)
  # text nodes have no tag of their own; return their content as-is
  return [node.to_s] if node.text?
  attrs = node.attribute_nodes.map { |attr| ' %s="%s"' % [attr.name, attr.value] }.join
  output = ["<#{ node.name }#{ attrs }>"]
  # recurse so nested tags end up as separate array entries
  node.children.each { |child| output.concat(dump_node(child)) }
  output << "</#{ node.name }>"
end

How do I check property of next item in each/do loop in Ruby?

Using RoR, I would like a helper to write a table of contents menu where root sections are dropdown menus for their subsections. In an each/do loop I would need to check if a section has subsections before outputting class="dropdown" on li and class="dropdown-toggle" data-toggle="dropdown" on the link.
Is there a way to check the properties of the next item (if any) in an each/do loop? Or do I need to switch to a loop with an index?
Here's my table of contents helper as is.
def showToc(standard)
  html = ''
  fetch_all_sections(standard).each do |section|
    html << "<li>" << link_to("<i class=\"icon-chevron-right\"></i>".html_safe + raw(section[:sortlabel]) + " " + raw(section[:title]), '#s' + section[:id].to_s) << "</li>"
  end
  return html.html_safe
end
You can use the abstraction Enumerable#each_cons. An example:
>> xs = [:a, :b, :c]
>> (xs + [nil]).each_cons(2) { |x, xnext| p [x, xnext] }
[:a, :b]
[:b, :c]
[:c, nil]
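Applied to the table-of-contents case, the each_cons idea might look like the sketch below. It is only an illustration: the method name with_subsection_flags is made up, and it assumes each section is a Hash with :title and :depth keys, where a subsection has a greater depth than its parent.

```ruby
# Hypothetical helper: pair every section with a flag telling whether the
# following section is nested below it. Assumes :depth grows for subsections.
def with_subsection_flags(sections)
  # Appending nil gives the last section a "next" partner, as in the example above.
  (sections + [nil]).each_cons(2).map do |section, following|
    has_children = !following.nil? && following[:depth] > section[:depth]
    [section[:title], has_children]
  end
end

sections = [
  { title: "Intro", depth: 0 },
  { title: "Scope", depth: 1 },
  { title: "Terms", depth: 0 }
]

with_subsection_flags(sections).each do |title, has_children|
  puts "#{title}: #{has_children ? 'dropdown' : 'plain'}"
end
```

The flag could then decide whether the helper emits class="dropdown" on the li for that section.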
That said, note that your code is full of unidiomatic Ruby; you should probably post it to https://codereview.stackexchange.com/ for review.
If I'm reading your question correctly, and fetch_all_sections(standard) returns an enumerable such as an Array, you could add a custom iterator to get what you want:
class Array
  # yields |current, next|
  def each_and_next
    @index ||= 0
    yield [self[@index], self[@index += 1]] until (@index == self.size)
    @index = 0
  end
end
P.S. I like @tokland's inline answer.
a = [1,2,3,4]
a.each_and_next { |x,y| puts "#{x},#{y}" }
produces:
1,2
2,3
3,4
4,
I found a way to have class="dropdown" on the <li> and class="dropdown-toggle" data-toggle="dropdown" on the link not affect the anchor tag. Therefore, in this case, I can just check if section depth is 0 and act accordingly. The other answers are probably more relevant to most people but here's what worked for me.
def showToc(standard, page_type, section = nil, nav2section = false, title = nil, wtf=nil)
html = ''
new_root = true
fetch_all_sections(standard).each do |section|
if section[:depth] == 0
if !new_root
# end subsection ul and root section li
html << "</li>\n</ul>"
new_root = true
end
html << "<li class=\"dropdown\">" << link_to("<i class=\"icon-chevron-right\"></i>".html_safe + raw(section[:sortlabel]) + " " + raw(section[:title]), '#s' + section[:id].to_s, :class => "dropdown-toggle", :data => {:toggle=>"dropdown"})
else
# write ul if new root
if new_root
new_root = false
html << "<ul class=\"dropdown-menu\">\n" << "<li>" << link_to(raw(section[:sortlabel]) + " " + raw(section[:title]), '#s' + section[:id].to_s) << "</li>"
else
html << "<li>" << link_to(raw(section[:sortlabel]) + " " + raw(section[:title]), '#s' + section[:id].to_s) << "</li>"
end
end
end
return html.html_safe
end

Nokogiri replace inner text with <span>ed words

Here's an example HTML fragment:
<p class="stanza">Thus grew the tale of Wonderland:<br/>
Thus slowly, one by one,<br/>
Its quaint events were hammered out -<br/>
And now the tale is done,<br/>
And home we steer, a merry crew,<br/>
Beneath the setting sun.<br/></p>
I need to surround each word with a <span id="w0">Thus </span> like this:
<span id='w1'>Anon,</span> <span id='w2'>to</span> <span id='w3'>sudden</span>
<span id='w4'>silence</span> <span id='w5'>won,</span> ....
I've written this, which creates the new fragment. How do I swap the new fragment in for the old one?
def callchildren(n)
n.children.each do |n| # call recursively until arrive at a node w/o children
callchildren(n)
end
if n.node_type == 3 && n.to_s.strip.empty? != true
new_node = ""
n.to_s.split.each { |w|
new_node = new_node + "<span id='w#{$word_number}'>#{w}</span> "
$word_number += 1
}
# puts new_node
# HELP? How do I get new_node swapped in?
end
end
My attempt to provide a solution for your problem:
require 'nokogiri'
Inf = 1.0/0.0
def number_words(node, counter = nil)
# define infinite counter (Ruby >= 1.8.7)
counter ||= (1..Inf).each
doc = node.document
unless node.is_a?(Nokogiri::XML::Text)
# recurse for children and collect all the returned
# nodes into an array
children = node.children.inject([]) { |acc, child|
acc += number_words(child, counter)
}
# replace the node's children
node.children = Nokogiri::XML::NodeSet.new(doc, children)
return [node]
end
# for text nodes, we generate a list of span nodes
# and return it (this is more secure than OP's original
# approach that is vulnerable to HTML injection)
node.to_s.strip.split.inject([]) { |acc, word|
span = Nokogiri::XML::Node.new("span", node)
span.content = word
span["id"] = "w#{counter.next}"
# add a space if we are not at the beginning
acc << Nokogiri::XML::Text.new(" ", doc) unless acc.empty?
# add our new span to the collection
acc << span
}
end
# demo
if __FILE__ == $0
h = <<-HTML
<p class="stanza">Thus grew the tale of Wonderland:<br/>
Thus slowly, one by one,<br/>
Its quaint events were hammered out -<br/>
And now the tale is done,<br/>
And home we steer, a merry crew,<br/>
Beneath the setting sun.<br/></p>
HTML
doc = Nokogiri::HTML.parse(h)
number_words(doc)
p doc.to_xml
end
Given a Nokogiri::HTML::Document in doc, you could do something like this:
i = 0
doc.search('//p[@class="stanza"]/text()').each do |n|
spans = n.content.scan(/\S+/).map do |s|
"<span id=\"w#{i += 1}\">" + s + '</span>'
end
n.replace(spans.join(' '))
end
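The word-splitting and numbering step used above can be tried on a plain string without Nokogiri. This is a minimal sketch under stated assumptions: wrap_words is a made-up name, and the input is assumed to contain no characters that need HTML escaping.

```ruby
# Hypothetical helper: wrap every whitespace-separated word in a numbered span.
# Assumes the text holds no HTML special characters (no escaping is done here).
def wrap_words(text, first_id = 1)
  text.scan(/\S+/).each_with_index.map do |word, offset|
    %(<span id="w#{first_id + offset}">#{word}</span>)
  end.join(' ')
end

puts wrap_words("Thus grew the tale")
```

When several text nodes are processed in turn, first_id would have to continue from the last number used, which is what the shared counter i does in the snippet above.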

Better way to parse "Description (tag)" to "Description, tag"

I have a text file with many thousands of lines like this, which are category descriptions with the keyword enclosed in parentheses:
Chemicals (chem)
Electrical (elec)
I need to convert these lines to comma separated values like so:
Chemicals, chem
Electrical, elec
What I am using is this:
lines = line.gsub!('(', ',').gsub!(')', '').split(',')
I would like to know if there is a better way to do this.
For posterity, this is the full code (based on the answers):
require 'rubygems'
require 'csv'
csvfile = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
f.readlines.each do |line|
(desc, cat) = line.split('(')
desc.strip!
cat.strip!
csvfile << [desc, cat[0,cat.length-1]]
end
end
Try something like this:
line.sub!(/ \((\w+)\)$/, ', \1')
The \1 will be replaced with the first capture group of the given regexp (in this case it will always be the category keyword). So it basically replaces " (chem)" with ", chem".
Let's create an example using a text file:
lines = []
File.open('categories.txt', 'r') do |file|
while line = file.gets
lines << line.sub(/ \((\w+)\)$/, ', \1')
end
end
Based on the question updates I can propose this:
require 'csv'
csv_file = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
  # scan returns an array of capture arrays; flatten it into a single CSV row
  f.each_line {|c| csv_file << c.scan(/^(.+) \((\w+)\)$/).flatten}
end
csv_file.close
Starting with Ruby 1.9, you can do it in one method call:
str = "Chemicals (chem)\n"
mapping = { ' (' => ', ',
')' => ''}
str.gsub(/ \(|\)/, mapping) #=> "Chemicals, chem\n"
In Ruby, a cleaner way to do it, assuming single-word descriptions and lines without a trailing newline, would be:
description, tag = line.split(' ', 2) # split(' ', 2) returns a 2-element array:
# all characters up to the first space, and all characters after it. We can then use
# multiple assignment to put each array element in its own local variable
tag = tag[1, (tag.length - 1) - 1] # extract the inner characters (not the first or last) of the string
new_line = description << ", " << tag # rejoin the parts into a new string
The intent is for this to be computationally faster (if you have a lot of rows) because it uses direct string operations instead of regular expressions, but it is worth benchmarking before relying on that.
No need to manipulate the string. Just grab the data and output it to the CSV file.
Assuming that you have something like this in the data:
Chemicals (chem)
Electrical (elec)
Dyes & Intermediates (dyes)
This should work:
File.open('categories.txt', 'r') do |file|
file.each_line do |line|
csvfile << line.match(/^(.+)\s\((.+)\)$/) { |m| [m[1], m[2]] }
end
end
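The match-based approach can be exercised without any file or CSV object. The sketch below is an illustration only: LINE_PATTERN and split_category are names made up here, and lines that do not fit the "Description (tag)" shape are assumed to be skippable.

```ruby
# Hypothetical pattern: a description, then a tag in parentheses at the end of
# the line. Named captures keep the two parts readable.
LINE_PATTERN = /\A(?<description>.+?)\s*\((?<tag>[^()]+)\)\s*\z/

# Returns [description, tag], or nil when the line does not match.
def split_category(line)
  m = LINE_PATTERN.match(line)
  m && [m[:description], m[:tag]]
end

lines = ["Chemicals (chem)\n", "Dyes & Intermediates (dyes)\n", "no tag here\n"]
lines.each do |line|
  pair = split_category(line)
  puts pair.join(', ') if pair # skip lines without a parenthesised tag
end
```

Returning nil for non-matching lines lets the caller decide whether to skip them or report them, instead of pushing nil into the CSV writer.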
Benchmarks relevant to discussion in @hundredwatt's answer:
require 'benchmark'
line = "Chemicals (chem)"
# @hundredwatt
puts Benchmark.measure {
100000.times do
description, tag = line.split(' ', 2)
tag = tag[1, (tag.length - 1) - 1]
new_line = description << ", " << tag
end
} # => 0.18
# NeX
puts Benchmark.measure {
100000.times do
line.sub!(/ \((\w+)\)$/, ', \1')
end
} # => 0.08
# steenslag
mapping = { ' (' => ', ',
')' => ''}
puts Benchmark.measure {
100000.times do
line.gsub(/ \(|\)/, mapping)
end
} # => 0.08
I know nothing about Ruby, but it is easy in PHP:
preg_match('~(.+) \((.+)\)~', 'Chemicals (chem)', $m);
$result = $m[1] . ', ' . $m[2];
