Apologies for the length of this question, it's more of a "is this possible" than "how do I do this".
My objective is to remove everything but plain text from Wikipedia markup -- tables, templates, formatting. Whether these are in wikitext markup (e.g. ''bold text'') or HTML (<b>bold text</b>).
Wikipedia text is a mix of custom tags: templates {{ ... }}, tables {| ... |}, links [[ ... ]] and HTML elements. Parsing it is kind of a nightmare. You can't use regular expressions because the tags can be nested, and it can contain HTML so almost anything is possible. Some of the text within the HTML I'd want to keep (stuff within bold text), but other things like tables would need to be stripped entirely.
I thought about re-purposing an XML parser like Nokogiri, adding {{/}} as alternatives to <x>/</x>.
Does anyone who knows Nokogiri (or another Ruby XML parser) know if this is possible or even a good idea?
My alternative is to repurpose an existing parser like WikiCloth for the wiki markup, and then try to remove any leftover HTML via another method.
This sounds like a good idea. However, it would not be possible for you to 'patch' Nokogiri, "adding {{/}} as alternatives to <x>/</x>". This is because the bulk of the work done by Nokogiri—parsing and XPath and generating the string representation of a DOM—is actually done by libxml2 in the back end. You'd have to patch and recompile libxml2 (and then rebuild Nokogiri against your new version)…but at that point I have no idea how Nokogiri would behave.
You might have better luck trying to patch REXML, since that is written in pure Ruby.
Related
I have a piece of HTML that I would like to parse with Nokogiri, but I do not know whether it is a full HTML document (with DOCTYPE, etc) or a fragment (e.g. just a div with some elements in it).
This makes a difference for Nokogiri, because it should use #fragment for parsing fragments but #parse for parsing full documents.
Is there a way to determine whether a given piece of text is a fragment or a full HTML document?
Denis
Depends on how trashed your page is, but
/^(?:\s*<!DOCTYPE)|(?:\s*<html)/
should work in most cases.
The simplest way would be to look for the mandatory <html> tag, using for instance a regular expression /<html[\s>])/ (allowing attributes).
Is this sufficient to solve your problem?
I have to work with some really ugly looking markup and I am running it through Tidy on ruby. For the most part it works great except for the fact that it lumps a ton of hidden inputs that are in the markup on to one line. I know there is a setting for a column wrap but it would be nicer if it just put sibling inputs on separate lines. It is important because it would simplify debugging when looking at the markup and seeing the info quickly in those hidden inputs.
I have yet to find a tool that does this. So is there anything out there or am I being foolish?
I should also add that a lot of the issues stem from the bad markup I get initially and there is nothing I can do to clean it up before it gets to me. I tried Nokogiri-pretty to clean it up and it was so close to being perfect but it turned script tags in to self closing tags which is no good.
Right now I am settling with Tidying the source and then (I know this is terrible) gsub(/<input[^>]*>/, '\0'+"\n"). I love the fact that I had to concat the capture with the newline.
Tidy tends to be problematic in Ruby. It has been reported to leak memory, it isn't 1.9 compatible, etc. However, you may be able to skip Tidy altogether by using Nokogiri and the nokogiri-pretty gem.
Assuming you have a Nokogiri doc:
require 'nokogiri-pretty'
puts doc.human
In addition to other tidying, all <input> tags will be on their own line and properly indented.
Nokogiri can do that easy enough:
doc.css('input').each{|input| input.before "\n"}
Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".
What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?
First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.
That said, here's what I'd do using Nokogiri:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))
topic = doc.at('span a.topic_name span').content
puts topic
Running that outputs:
Common Misconceptions
The code is taking a couple shortcuts, that should work consistently:
Using Ruby's OpenURI allows easy accessing of Internet resources. It's my go-to for most simple to average apps. There are more powerful tools but none as convenient.
doc.at tells Nokogiri to traverse the document, and find the first occurrence of the CSS accessor 'span a.topic_name span', which should be consistent in that page as the first entry.
Note that Nokogiri supports some variants of searching for a node: at vs. search. at and % and things like css_at find the first occurrence and return a Node, which is an individual tag or text or comment. search, /, and those variants return a NodeSet which is like an array of Nodes. You'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have said doc.search(...).first to get the node I wanted.
Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.
I am trying to find a way of finding repeat patterns in webpages so that i can extract the content into my database.
EDIT : I don't know what the repeat pattern is before hand so i can't just search for a given pattern via a regex or something.
For example if you have 10 sites selling cars but the sites are all different, looking on each site the cars are listed in html in a repeated way down the page for this site.
The other sites will be listed in a different way but each with a repeated pattern.
Does anyone know how, or have any experience of this sort of thing?
i love ruby so was hoping to do it in ruby if any one has or knows of any libs / gems that may help me out ?
Rick, machine pattern matching is a complicated topic, and not something that you'll find a good library for out of the box on Ruby.
Kyle's answer was a start, once you get the page with Ruby, the typical techology for this would be xpath or "The XML Path Language".
Using Xpath you can write simple selectors that will extract every item matching a pattern, for instance, every link on an HTML document might be //a, every h1 would be //h1, and every image directly inside a div, where the image has the class "car" would be something like: //div/image[class="car"].
The result of the XPath is an enumerable list of each item, you can then query for sub-elements, get the content() of the elements, and build relationships to extract the data you need.
The go-to library for Ruby is called Nokogiri, and is avaiable as a gem - the direct documentation is a little weak, but it's all covered there if you know what to look for.
Some libraries for Ruby combine the crawling, with an easy way to access the underlying HTML/XML as a Nokogiri document, one such example is Anemone which is a "framework for building web spiders in Ruby" - and I can recomment it very highly.
In Ruby, if you want to get the text of a webpage all you have to do is use the Net::HTTP namespace. The get method returns a string representation of the webpage.
Net::HTTP.get 'http://www.target-site.com', '/target-page.html'
You're probably going to want to use some sort of XML Parser after that to make a model of the page and navigate over it. I've heard good things about Hpricot.
I am reading some data from an XML webservice with Ruby, something like this:
<phrases>
<phrase language="en_US">¡I'm highly annoyed with character references!</phrase>
</phrases>
I'm parsing the XML and grabbing an array of phrases. As you can see, the phrase text contains some XML character entity references. I'd like to replace them with the actual character being referenced. This is simple enough with the numeric references, but nasty with the XML and HTML ones. I'd like to avoid having a big hash in my code that holds the character for each XML or HTML character reference, i.e. http://www.java2s.com/Code/Java/XML/Resolvesanentityreferenceorcharacterreferencetoitsvalue.htm
Surely there's a library for this out there, right?
Update
Yes, there is a library out there, and it's called HTMLEntities:
: jmglov#laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov#laurana; irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "¡I'm highly annoyed with character references!"
=> "¡I'm highly annoyed with character references!"
REXML can do it, though it won't handle "¡" or " ". The list of predefined XML entities (aside from Unicode numeric entities) is actually quite small. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Given this input XML:
<phrases>
<phrase language="en_US">"I'm highly annoyed with character references!©</phrase>
</phrases>
you can parse the XML and the embedded entities like this (for example):
require 'rexml/document'
doc = REXML::Document.new(File.open('/tmp/foo.xml').readlines.join(''))
phrase = REXML::XPath.first(doc, '//phrases/phrase')
text = phrase.first # Type is REXML::Text
puts(text.value)
Obviously, that example assumes that the XML is in file /tmp/foo.xml. You can just as easily pass a string of XML. On my Mac and Ubuntu systems, running it produces:
$ ruby /tmp/foo.rb
"I'm highly annoyed with character references!©
This isn't an attempt to provide a solution, it's to relate some of my own experiences dealing with XML from the wild. I was using Perl at first, then later using Ruby, and the experiences are something you can encounter easily if you grab enough XML or RDF/RSS/Atom feeds.
I've often seen XML CDATA contain HTML, both encoded and unencoded. The encoded HTML was probably the result of someone doing things the right way, via some API or library to generate XML. The unencoded HTML was probably someone using a script to wrap the HTML with tags, resulting in invalid XML, but I had to deal with it anyway.
I've also seen XML CDATA containing HTML that had been encoded multiple times, requiring me to unencode everything, even after the XML engine had done its thing. Sometimes during an intermediate pass I'd suddenly have non-UTF8 characters in the string along with encoded ones, as a result of someone appending comments or joining multiple HTML streams together that were from different character-sets. For whatever the reason, it was really ugly and caused XML parsing to break or emit a lot of warnings. I'd have to loop over the content, decoding and checking to see if the previous pass was the same as the current decoding pass, and bailing if nothing had changed. There was no guarantee I'd have a string in a valid character-set at the time though, so I'd have to tell iconv to convert it to UTF8 and throw away characters that wouldn't convert cleanly.
Nokogiri can decode the content of a node various ways, by creative use of the to_xml and to_html methods. You can also look at the HTMLEntities gem, Loofah, and others to go after the CDATA contents. Loofah is nice because it's designed to whitelist/blacklist tags you might encounter.
The XML spec is supposed to protect us from such shenanigans, but, as one of my co-workers used to tell me, "We can make it fool-proof, but not damn-fool-proof". People are SO inventive and the specs mean nothing to someone who didn't bother to read them or doesn't care.