How to tidy up malformed XML in Ruby

I'm having issues tidying up malformed XML I'm getting back from the SEC's EDGAR database.
For some reason they have horribly formed XML: tags that contain any sort of string aren't closed, and a document can actually contain other XML or HTML documents nested inside other tags. Normally I'd hand this off to Tidy, but that isn't being maintained.
I've tried using Nokogiri::XML::SAX::Parser, but that seems to choke because the tags aren't closed. It seems to work alright until it hits the first ending tag, and then it doesn't fire on any more of them. But it is spitting out the right characters.
require 'nokogiri'

class Filing < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    puts "starting: #{name}"
  end

  def characters str
    puts "chars: #{str}"
  end

  def end_element name
    puts "ending: #{name}"
  end
end
It seems like this would be the best option because I could simply have it ignore the embedded XML or HTML documents. It also makes the most sense because some of these documents can get quite large, so storing the whole DOM in memory would probably not work.
Here are some example files: 1 2 3
I'm starting to think I'll just have to write my own custom parser.

Nokogiri's normal DOM mode is able to automatically fix up the XML so it is syntactically correct, or a reasonable facsimile of that. It sometimes gets confused and will shift closing tags around, but you can preprocess the file to give it a nudge in the right direction if need be.
I saved the XML #1 out to a document and loaded it:
require 'nokogiri'
doc = ''
File.open('./test.xml') do |fi|
  doc = Nokogiri::XML(fi)
end
puts doc.to_xml
After parsing, you can check the Nokogiri::XML::Document instance's errors method to see what errors were generated, for perverse pleasure.
doc.errors
If using Nokogiri's DOM model isn't good enough, have you considered using XMLLint to preprocess and clean the data, emitting clean XML so the SAX parser will work? Its --recover option might be of use.
xmllint --recover test.xml
It will output errors on stderr and the recovered XML on stdout, so you can easily redirect it to another file.
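If you go that route, here is a minimal sketch of wiring the two steps together, assuming the Filing SAX class from the question and that xmllint is on your PATH:

require 'nokogiri'
require 'open3'

# Run xmllint --recover and feed its cleaned stdout to the SAX parser.
cleaned, errors, _status = Open3.capture3('xmllint', '--recover', 'test.xml')
warn errors unless errors.empty?

Nokogiri::XML::SAX::Parser.new(Filing.new).parse(cleaned)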
As for writing your own parser... why? You have other options available to you, and reinventing a nicely implemented wheel is not a good use of time.

Related

Ruby Nokogiri - How to prevent Nokogiri from printing HTML character entities

I have an HTML file which I am parsing using Nokogiri, and then generating another HTML file out of it, like this:
htext = File.open('input.html').read
h_doc = Nokogiri::HTML(htext)
# ... modify h_doc here ...
File.open('output.html', 'w+') do |file|
  file.write(h_doc)
end
The question is how to prevent Nokogiri from printing HTML character entities (&lt; &gt; &amp;) in the final generated HTML file.
Instead of HTML character entities (&lt; &gt; &amp;) I want to print the actual characters (<, > etc.).
As an example, it is printing the HTML like
<title>&lt;%= ("/emailclient=sometext") %&gt;</title>
and I want it to output like this
<title><%= ("/emailclient=sometext") %></title>
So... you want Nokogiri to output incorrect or invalid XML/HTML?
The best suggestion I have: replace those sequences with something else beforehand, cut the document up with Nokogiri, then replace them back. Your input is not XML/HTML, so there is no point expecting Nokogiri to know how to handle it correctly. Because look:
<div>To write "&amp;", you need to write "&amp;amp;".</div>
This renders:
To write "&", you need to write "&amp;".
If you had your way, you'd get this HTML:
<div>To write "&", you need to write "&amp;".</div>
which would render as:
To write "&", you need to write "&".
It is even worse in this scenario, say, in XHTML:
<div>Use the &lt;script&gt; tag for JavaScript</div>
If you replace the entities, you get an undisplayable file, due to the unclosed <script> tag:
<div>Use the <script> tag for JavaScript</div>
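A rough sketch of that shield-then-restore idea (the {{ERB_OPEN}}/{{ERB_CLOSE}} markers are made up; pick anything that cannot occur in your documents):

require 'nokogiri'

# Shield the ERB-like sequences so Nokogiri never sees them...
html = File.read('input.html')
shielded = html.gsub('<%=', '{{ERB_OPEN}}').gsub('%>', '{{ERB_CLOSE}}')
doc = Nokogiri::HTML(shielded)
# ... modify doc here ...
# ...then restore them after serialization.
output = doc.to_html.gsub('{{ERB_OPEN}}', '<%=').gsub('{{ERB_CLOSE}}', '%>')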
EDIT: I still think you're trying to get Nokogiri to do something it is not designed to do: handle template HTML. I'd rather assume that your documents normally don't contain those sequences, and post-correct them:
doc.traverse do |node|
  if node.text?
    node.content = node.content.gsub(/^(\s*)(\S.+?)(\s*)$/,
                                     "\\1<%= \\2 %>\\3")
  end
end
puts doc.to_html.gsub('&lt;%=', '<%=').gsub('%&gt;', '%>')
You absolutely can prevent Nokogiri from transforming your entities. It's a built-in function, even; no voodoo or hacking needed. Be warned: I'm not a Nokogiri guru, and I've only gotten this to work when I'm acting directly on a node inside a document, but I'm sure a little digging can show you how to do it with a standalone node too.
When you create or load your document, you need to include the NOENT option. That's it. You're done; you can now add entities to your heart's content.
It is important to note that there are about half a dozen ways to call a doc with options; below is my personal favorite method.
require 'nokogiri'
noko_doc = File.open('<my/doc/path>') { |f| Nokogiri.<XML_or_HTML>(f, &:noent)}
xpath = '<selector_for_element>'
noko_doc.at_<css_or_xpath>(xpath).set_attribute('I_can_now_safely_add_preformatted_entities!', '&&&&&')
puts noko_doc.at_xpath(xpath).attributes['I_can_now_safely_add_preformatted_entities!']
>>> &&&&&
As for the usefulness of this feature... I find it incredibly useful. There are plenty of cases where you are dealing with preformatted data that you do not control, and it would be a serious pain to have to manage incoming entities just so Nokogiri could put them back the way they were.
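To make the pattern above concrete, here is a sketch with stand-in values (the file name and selector are hypothetical, not from the original post):

require 'nokogiri'

# 'page.html' and '#target' are made-up stand-ins.
doc = File.open('page.html') { |f| Nokogiri::HTML(f, &:noent) }
node = doc.at_css('#target')
node.set_attribute('data-entities', '&&&&&')
puts node['data-entities']
# >>> &&&&&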

writing a short script to process markdown links and handling multiple scans

I'd like to process just links written in Markdown. I've looked at Redcarpet, which I'd be OK with using, but I really want to support just links, and it doesn't look like you can use it that way. So I think I'm going to write a little method using a regex, but...
assuming I have something like this:
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
tmp=str.scan(/\[.*\]\(.*\)/)
or if there is some way I could just gsub in place [hope](http://www.github.com) -> <a href='http://www.github.com'>hope</a>
How would I get an array of the matched phrases? I was thinking once I get an array, I could just do a replace on the original string. Are there better / easier ways of achieving the same result?
I would actually stick with Redcarpet. It includes a StripDown render class that will eliminate any Markdown markup (essentially, rendering Markdown as plain text). You can subclass it to reactivate the link method:
require 'redcarpet'
require 'redcarpet/render_strip'

module Redcarpet
  module Render
    class LinksOnly < StripDown
      def link(link, title, content)
        %{<a href="#{link}">#{content}</a>}
      end
    end
  end
end
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
md = Redcarpet::Markdown.new(Redcarpet::Render::LinksOnly)
puts md.render(str)
# => here is my thing <a href="http://www.github.com">hope</a> and ...
This has the added benefits of being able to easily implement a few additional tags (say, if you decide you want paragraph tags to be inserted for line breaks).
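For instance, reinstating paragraphs could plausibly look like this (a sketch; the class name and tag choice are mine, not from Redcarpet):

# Sketch: also wrap runs of plain text in <p> tags via the standard
# paragraph render callback.
class LinksAndParagraphs < Redcarpet::Render::LinksOnly
  def paragraph(text)
    "<p>#{text}</p>\n"
  end
end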
You could just do a replace.
Match this:
\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)
Replace with:
<a href="\2">\1</a>
In Ruby it should be something like:
str.gsub(/\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)/, '<a href="\2">\1</a>')
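And to get the array of matched phrases the question asked about, the same pattern works with String#scan, whose capture groups yield [text, url] pairs (a quick sketch):

str = "here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
pairs = str.scan(/\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)/)
# => [["hope", "http://www.github.com"], ["hxxx", "http://www.some.com"]]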

Parsing huge (~100mb) kml (xml) file taking *hours* without any sign of actual parsing

I'm currently trying to parse a very large KML (XML) file with Ruby (Nokogiri) and am having a little bit of trouble.
The parsing code is good; in fact, I'll share it just for the heck of it, even though this code doesn't have much to do with my problem:
# Projected factory matching the KML's coordinate system (SRID 3361).
geofactory = RGeo::Geographic.projected_factory(:projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs", :projection_srid => 3361)
f = File.open("horry_parcels.kml")
kmldoc = Nokogiri::XML(f)
kmldoc.css("//Placemark").each_with_index do |placemark, i|
  puts i
  # Pull the <td> cells out of the HTML table embedded in <description>.
  tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
  h = HorryParcel.new
  h.owner_name = tds.shift.text
  tds.shift
  tds.each_slice(2) do |k, v|
    col = k.text.downcase
    eval("h.#{col} = v.text")
  end
  # Split the matching MultiGeometry's coordinate string into lon/lat pairs.
  coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map {|x| x.split(",")}
  points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
  geo_shape = geofactory.polygon(geofactory.linear_ring(points))
  proj_shape = geo_shape.projection
  h.geo_shape = geo_shape
  h.proj_shape = proj_shape
  h.save
end
Anyway, I've tested this code with a much, much smaller sample of kml and it works.
However, when I load the real thing, Ruby simply waits, as if it is processing something. This "processing", however, has now spanned several hours while I've been doing other things. As you might have noticed, I have a counter (each_with_index) on the array of Placemarks, and during this multi-hour period not a single i value has been printed to the command line. Oddly enough it hasn't timed out yet, but even if this works, there has got to be a better way to do this.
I know I could open up the KML file in Google Earth (Google Earth Pro here) and save the data in smaller, more manageable KML files, but the way things appear to be set up, this would be a very manual, unprofessional process.
Here's a sample of the kml (w/ just one placemark) if that helps.
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
  <name>justone.kml</name>
  <Style id="PolyStyle00">
    <LabelStyle>
      <color>00000000</color>
      <scale>0</scale>
    </LabelStyle>
    <LineStyle>
      <color>ff0000ff</color>
    </LineStyle>
    <PolyStyle>
      <color>00f0f0f0</color>
    </PolyStyle>
  </Style>
  <Folder>
    <name>justone</name>
    <open>1</open>
    <Placemark id="ID_010161">
      <name>STUART CHARLES A JR</name>
      <Snippet maxLines="0"></Snippet>
      <description>""</description>
      <styleUrl>#PolyStyle00</styleUrl>
      <MultiGeometry>
        <Polygon>
          <outerBoundaryIs>
            <LinearRing>
              <coordinates>
                -78.941896,33.867893,0 -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0
              </coordinates>
            </LinearRing>
          </outerBoundaryIs>
        </Polygon>
      </MultiGeometry>
    </Placemark>
  </Folder>
</Document>
</kml>
EDIT:
99.9% of the data I work with is in *.shp format, so I've just ignored this problem for the past week. But I'm going to get this process running on my desktop computer (off of my laptop) and run it until it either times out or finishes.
class ClassName
  attr_reader :before, :after

  def go
    @before = Time.now
    run_actual_code
    @after = Time.now
    puts "process took #{@after - @before} seconds to complete"
  end

  def run_actual_code
    # ...
  end
end
The above code should tell me how long it took. From that (if it does actually finish) we should be able to compute a rough rule of thumb for how long you should expect your (otherwise PERFECT) code to run without SAX parsing or "atomization" of the document's text components.
For a huge XML file, you should not use Nokogiri's default XML parser, because it parses into a DOM. A much better parsing strategy for large XML files is SAX. Luckily for us, Nokogiri supports SAX.
The downside is that with a SAX parser, all logic has to be done through callbacks. The idea is simple: the SAX parser starts to read the file and lets you know whenever it finds something interesting, for example an opening tag, a closing tag, or text. You can bind callbacks to these events and extract whatever you need.
Of course you don't want to use a SAX parser to load the whole file into memory and work with it there; that is exactly what SAX is meant to avoid. You will need to do whatever you want with this file part by part.
So you'll basically need to rewrite your parsing logic with callbacks. To learn more about DOM vs. SAX parsers, you might want to check this FAQ from cs.nmsu.edu.
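As an illustration (not the original poster's logic, just a minimal shape for it), here is a handler that counts Placemarks and streams out the coordinate text without ever building a DOM:

require 'nokogiri'

class PlacemarkCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    @count = 0
    @in_coordinates = false
  end

  def start_element(name, attrs = [])
    @count += 1 if name == 'Placemark'
    @in_coordinates = true if name == 'coordinates'
  end

  def characters(str)
    # Text arrives chunk by chunk; process it here instead of storing it all.
    print str if @in_coordinates
  end

  def end_element(name)
    @in_coordinates = false if name == 'coordinates'
  end
end

parser = Nokogiri::XML::SAX::Parser.new(PlacemarkCounter.new)
parser.parse(File.open('horry_parcels.kml'))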
I actually ended up getting a copy of the data from a more accessible source, but I'm back here because I wanted to present a possible solution to the general problem: less. less was built a long time ago and is part of Unix by default in most cases.
http://en.wikipedia.org/wiki/Less_%28Unix%29
Not related to the stylesheet language ("LESS"), less is a text viewer (it cannot edit files, only read them) that does not load the entire document it is reading until you have scanned through the entire thing yourself. That is, it loads the first "page", so to speak, and waits for you to call for the next one.
If a Ruby script could somehow pipe "pages" of text into... oh wait... the XML structure wouldn't allow it, because a chunk wouldn't have the closing delimiters from the end of the undigested text file. So what you would have to do is some custom work on the front end: cut out those first couple of parent brackets so that you can pluck out the XML children one by one, and let the last closing parent brackets break the script, because the parser will think it is finished and then come across another closing bracket, I guess.
I haven't tried this and don't have anything to try it on. But if I did, I'd probably try piping n-line blocks of text into Ruby (or Python, etc.) via less or something similar to it. Perhaps something more primitive than less; I'm not sure.

How to build an Object from XML, modify, then write to File in Ruby

I'm having an awful time trying to use a library to parse an XML file into a hash-like object, modify it, then print it back out to another XML file in Ruby. For a class I'm taking, we're supposed to use a Java JAXB-like library where we convert XML into an object. We've already done SAX and DOM methods, so we can't use those methods of XML deserialization. Nokogiri helped me with both of these in Ruby.
The only problem is that, despite the SIMPLE modifications I'm making to the objects, when I write to file it has drastic differences. Is there a Ruby library meant for doing just this? I've tried ROXML, XML::Mapping, and ActiveSupport::CoreExt. The only one I can even get to run is ActiveSupport, and even then it starts putting element attributes as child elements in the output XML.
I'm willing to try out XmlSimple, but I'm curious: has anyone actually had to do this before / run into the same problems? Again, I can't read in lines one at a time like SAX or build a tree-like structure like DOM; it needs to be a hash-like object.
Any help is much appreciated!
You should have a look at Nokogiri: http://nokogiri.org/
Then you can parse the XML like this:
xml_file = "some_path"
@xml = Nokogiri::XML(File.open(xml_file))
@xml.xpath('//listing').each do |node|
  style = node.search("style").text
end
With XPath, you can perform queries in the XML:
@xml.xpath("//listing[name='John']").first(10)
OK, I got it working. After looking at ActiveSupport::CoreExt's source code, I found it just uses a gem called xml-simple. What's obnoxious is that the gem name, the library name in the require statement, and the class name are a mixture of hyphenated and non-hyphenated spellings. For future reference, here's what I did:
# gem install xml-simple
# ^ all lowercase, hyphenated
require 'xmlsimple'
# ^ all lowercase, not hyphenated
doc = XmlSimple.xml_in 'hw3.xml', 'KeepRoot' => true
# ^ Camel cased (it's a class), not hyphenated
# doc.class => Hash
# manipulate doc as a hash
File.open('HW3a.xml', 'w') do |file|
  file.write("<?xml version='1.0' encoding='utf-8'?>\n")
  file.write(XmlSimple.xml_out(doc, 'KeepRoot' => true))
end
I hope this helps someone. Also make sure you pay attention to case and hyphens with this gem!!!
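In case the hash shape is unclear, here is a tiny self-contained roundtrip with made-up XML (xml-simple wraps child elements in arrays by default):

require 'xmlsimple'

doc = XmlSimple.xml_in('<root><item id="1">a</item></root>', 'KeepRoot' => true)
doc['root'][0]['item'][0]['id'] = '2'  # manipulate it as nested hashes and arrays
puts XmlSimple.xml_out(doc, 'KeepRoot' => true)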

Ruby RSS::Parser.to_s silently fails?

I'm using Ruby 1.8.7's RSS::Parser, part of stdlib. I'm new to Ruby.
I want to parse an RSS feed, make some changes to the data, then output it (as RSS).
The docs say I can use #to_s, and it seems to work with some feeds, but not others.
This works:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://news.ycombinator.com/rss'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns expected output: XML text.
This fails:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://feeds.feedburner.com/devourfeed'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns nothing (empty quotes).
And yet, if I change the last line to:
p rss
I can see that the object is filled with all of the feed data. It's the to_s method that fails.
Why?
How can I get some kind of error output to debug a problem like this?
From what I can tell, the problem isn't in to_s, it's in the parser itself. Stepping way into the parser.rb code showed nothing being returned, so to_s returning an empty string is valid.
I'd recommend looking at something like Feedzirra.
Also, as an FYI, take a look at Ruby's OpenURI module (open-uri) for easy retrieval of web assets, like feeds. Open-URI is simple but adequate for most tasks. Net::HTTP is lower level, which will require you to type a lot more code to replace the functionality of Open-URI.
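For instance, the fetch from the question could plausibly shrink to something like this (a sketch; open-uri patches Kernel#open, which was the idiomatic form on 1.8.7):

require 'rss'
require 'open-uri'

feed = open('http://news.ycombinator.com/rss').read
rss = RSS::Parser.parse(feed, false, true)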
I had the same problem, so I started debugging the code. I think the Ruby RSS library has a few too many required elements. The channel needs to have a title, link, and description; if one is missing, to_s will fail.
The second feed in the example above is missing the description, which makes to_s fail...
I believe this is a bug, but I really don't understand the code (and barely know Ruby), so who knows. It would seem natural to me that to_s would try its best even if some elements are missing.
Either way,
rss.channel.description = "something"
rss.to_s
will "work".
The problem lies in def have_required_elements?, or in self.class::MODELS.
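Putting that workaround into the failing example from the question (a sketch, untested against the live feed):

require 'rss'
require 'net/http'

url = 'http://feeds.feedburner.com/devourfeed'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)

rss.channel.description ||= 'no description'  # supply the missing required element
puts rss.to_s  # now emits XML instead of an empty string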
