I want to extract ruby code snippets out of plain text.
The yomu gem (https://github.com/Erol/yomu) makes it possible to extract the text of a PDF document. Now I want to get just the well-formed Ruby code out of, for instance, a Ruby programming book.
Any idea what a regex for multi-line matches of Ruby methods and classes could look like?
I tried many different expressions, but did not get the results that I expected.
Try this:
1. Go through the file line by line and try to parse each line as Ruby code.
2. If a line parses as Ruby, start adding more lines to it until they don't parse as Ruby code anymore.
3. Voilà, here is your first code snippet.
4. Maybe apply some filter to exclude trivial snippets like single words.
5. Repeat.
This is the common best practice for extracting source code from unstructured text like emails and whatnot. It has been used to scan millions of emails for research projects.
Use Ripper, which ships with Ruby's standard library, to parse Ruby code.
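The steps above can be sketched with Ripper roughly as follows. This is a variant rather than a literal transcription: a line such as `def greet(name)` does not parse on its own, so the sketch grows a window from each start line and keeps the longest window that parses. The names `valid_ruby?`, `trivial?`, `extract_snippets` and the `max_lines` limit are made up for illustration.

```ruby
require 'ripper'

# True when the text parses as a complete Ruby program.
# Ripper.sexp returns nil when it hits a syntax error.
def valid_ruby?(source)
  !Ripper.sexp(source).nil?
end

# Hypothetical filter: treat blank text and bare single words as trivial.
def trivial?(snippet)
  snippet.strip.match?(/\A\w*\z/)
end

# From each start line, try windows of 1..max_lines lines and keep the
# longest one that parses. max_lines is an arbitrary cap for this sketch.
def extract_snippets(text, max_lines: 50)
  lines = text.lines
  snippets = []
  i = 0
  while i < lines.length
    best = nil
    (1..[max_lines, lines.length - i].min).each do |len|
      best = len if valid_ruby?(lines[i, len].join)
    end
    if best
      snippet = lines[i, best].join.strip
      snippets << snippet unless trivial?(snippet)
      i += best
    else
      i += 1
    end
  end
  snippets
end

# Sample input; the prose lines are chosen so that they cannot parse as Ruby.
sample = <<~'TEXT'
  1) The greeting method:
  def greet(name)
    "Hello, #{name}"
  end
  2) That is all.
TEXT

puts extract_snippets(sample)
```

Be aware that ordinary prose frequently happens to be valid Ruby (`hello world` parses as a method call), so on real book text the filter step will need to be considerably stricter than the single-word check used here.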
This is the general goal I am trying to achieve:
My VB.NET program will generate some Lists that may contain booleans, integers, strings, or more lists. I want the program to output a "file" which basically contains such data. It is important that the file cannot be read by humans. (Okay, actually, fine: human-readable data wouldn't be bad.)
Afterward, I want my Ruby program to take such a file and read the contents. The Lists become arrays, and integers, booleans and strings should be read correctly by Ruby. I just want to be able to read the file; I might not need to write it using Ruby.
In .Net you'd use a BinaryWriter, if you're using IronRuby you'd then use a BinaryReader. If you're not using IronRuby, then perhaps...
contents = open(path_to_binary_file, "rb") {|io| io.read }
Why do you not want it to be human readable? I hope it's not for security reasons...
Use JSON. On the .NET side you can use the Json.NET NuGet package.
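On the Ruby side, reading such a file takes only the standard library's json module. A minimal sketch, assuming the VB.NET program writes a JSON array; the helper name `parse_lists` and the sample payload are invented for illustration:

```ruby
require 'json'

# Hypothetical helper: turn JSON text into nested Ruby arrays and scalars.
def parse_lists(json_text)
  JSON.parse(json_text)
end

# Assumed sample of what the VB.NET side might emit.
sample = '[true, 42, "hello", [1, 2, ["nested", false]]]'
data = parse_lists(sample)

# Lists arrive as Arrays; booleans, integers and strings keep their types.
p data
```

For a real file, something like `parse_lists(File.read(path))` would do the same job.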
I am reading some data from an XML webservice with Ruby, something like this:
<phrases>
<phrase language="en_US">&iexcl;I&apos;m highly annoyed with character references!</phrase>
</phrases>
I'm parsing the XML and grabbing an array of phrases. As you can see, the phrase text contains some XML character entity references. I'd like to replace them with the actual character being referenced. This is simple enough with the numeric references, but nasty with the XML and HTML ones. I'd like to avoid having a big hash in my code that holds the character for each XML or HTML character reference, i.e. http://www.java2s.com/Code/Java/XML/Resolvesanentityreferenceorcharacterreferencetoitsvalue.htm
Surely there's a library for this out there, right?
Update
Yes, there is a library out there, and it's called HTMLEntities:
: jmglov#laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov#laurana; irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "&iexcl;I&apos;m highly annoyed with character references!"
=> "¡I'm highly annoyed with character references!"
REXML can do it, though it won't handle "&iexcl;" or "&nbsp;". The list of predefined XML entities (aside from Unicode numeric entities) is actually quite small. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
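Because the predefined XML list is so short, decoding it together with numeric references by hand is feasible. A rough sketch, not a replacement for a real parser; the constant `XML_ENTITIES` and the method `decode_xml_references` are names invented here, and HTML-only names such as `nbsp` are deliberately left untouched:

```ruby
# The five entities predefined by XML.
XML_ENTITIES = {
  'lt' => '<', 'gt' => '>', 'amp' => '&', 'quot' => '"', 'apos' => "'"
}.freeze

# Hypothetical helper: resolve numeric references and the five predefined
# XML entities in a single pass; unknown names are left as they are.
def decode_xml_references(text)
  text.gsub(/&(?:#x(\h+)|#(\d+)|(\w+));/) do |match|
    if $1
      [$1.hex].pack('U')            # hexadecimal reference, e.g. &#xA9;
    elsif $2
      [$2.to_i].pack('U')           # decimal reference, e.g. &#169;
    else
      XML_ENTITIES.fetch($3, match) # named entity, or unchanged if unknown
    end
  end
end

puts decode_xml_references('&quot;I&apos;m annoyed&#169; &nbsp;')
```

Doing everything in one pass means a doubly escaped `&amp;lt;` comes out as `&lt;` rather than `<`, which is what a single decoding step should produce.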
Given this input XML:
<phrases>
<phrase language="en_US">&quot;I&apos;m highly annoyed with character references!&#169;</phrase>
</phrases>
you can parse the XML and the embedded entities like this (for example):
require 'rexml/document'

# File.read returns the whole file and closes the handle afterwards.
doc = REXML::Document.new(File.read('/tmp/foo.xml'))
phrase = REXML::XPath.first(doc, '//phrases/phrase')
text = phrase.first # Type is REXML::Text
puts(text.value)    # value returns the text with entities resolved
Obviously, that example assumes that the XML is in file /tmp/foo.xml. You can just as easily pass a string of XML. On my Mac and Ubuntu systems, running it produces:
$ ruby /tmp/foo.rb
"I'm highly annoyed with character references!©
This isn't an attempt to provide a solution, it's to relate some of my own experiences dealing with XML from the wild. I was using Perl at first, then later using Ruby, and the experiences are something you can encounter easily if you grab enough XML or RDF/RSS/Atom feeds.
I've often seen XML CDATA contain HTML, both encoded and unencoded. The encoded HTML was probably the result of someone doing things the right way, via some API or library to generate XML. The unencoded HTML was probably someone using a script to wrap the HTML with tags, resulting in invalid XML, but I had to deal with it anyway.
I've also seen XML CDATA containing HTML that had been encoded multiple times, requiring me to unencode everything, even after the XML engine had done its thing. Sometimes during an intermediate pass I'd suddenly have non-UTF-8 characters in the string along with encoded ones, as a result of someone appending comments or joining multiple HTML streams together that were from different character sets. Whatever the reason, it was really ugly and caused XML parsing to break or emit a lot of warnings. I'd have to loop over the content, decoding and checking whether the current decoding pass gave the same result as the previous one, and bailing once nothing had changed. There was no guarantee I'd have a string in a valid character set at that point though, so I'd have to tell iconv to convert it to UTF-8 and throw away characters that wouldn't convert cleanly.
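That decode-until-stable loop might look something like the sketch below. It is not the code I used back then: it swaps iconv (no longer part of Ruby's standard library) for String#scrub and uses CGI.unescapeHTML as the decoder. The method name `fully_unescape` and the `max_passes` safety limit are invented for the example.

```ruby
require 'cgi'

# Hypothetical helper: unescape repeatedly until a pass changes nothing.
# Invalid byte sequences are dropped with String#scrub before and after
# each pass; max_passes guards against pathological input.
def fully_unescape(text, max_passes: 10)
  current = text.scrub('')
  max_passes.times do
    decoded = CGI.unescapeHTML(current).scrub('')
    break if decoded == current # nothing changed, so we are done
    current = decoded
  end
  current
end

puts fully_unescape('&amp;amp;lt;b&amp;amp;gt;bold&amp;amp;lt;/b&amp;amp;gt;')
```

Note that this is lossy by design: text that was legitimately meant to contain a literal `&amp;` will be decoded as well, which was an acceptable trade-off for the feeds I was cleaning up.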
Nokogiri can decode the content of a node various ways, by creative use of the to_xml and to_html methods. You can also look at the HTMLEntities gem, Loofah, and others to go after the CDATA contents. Loofah is nice because it's designed to whitelist/blacklist tags you might encounter.
The XML spec is supposed to protect us from such shenanigans, but, as one of my co-workers used to tell me, "We can make it fool-proof, but not damn-fool-proof". People are SO inventive and the specs mean nothing to someone who didn't bother to read them or doesn't care.
CSV sample:
Date,128,440,1024,Mixed
6/30/2010,342,-0.26%,-0.91%,1.51%,-0.97%
6/24/2010,0.23%,0.50%,-1.34%,0.67%
I want to render this data in a multi-line graph.
Well, you first need to parse the CSV. I suggest FasterCSV (on Ruby 1.9 and later it is the standard library's CSV class); the RDoc explains pretty much everything you need to know.
You'll need to have ImageMagick and RMagick installed, then you can use Gruff. Or if you've got an Internet connection on the machine you are running the script on, you can use Google Charts with this Ruby plugin. Or if you want to get back SVG, consider Scruffy.
The page about Gruff has a code sample showing how to create a multi-line graph. Basically, you need to collect all the data you want in each line into an array, so the main work is array manipulation.
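The array collection step could look roughly like this with the CSV class. It is only a sketch: the method name `series_from_csv` is made up, and the sample drops the extra "342" field from the first data row on the assumption that it is a typo.

```ruby
require 'csv'

# Sample rows modeled on the question's data. Assumption: the stray "342"
# field in the first data row is dropped so each row matches the header.
CSV_TEXT = <<~DATA
  Date,128,440,1024,Mixed
  6/30/2010,-0.26%,-0.91%,1.51%,-0.97%
  6/24/2010,0.23%,0.50%,-1.34%,0.67%
DATA

# Hypothetical helper: returns [dates, series], where series maps each
# column name to an array of floats, one value per row.
def series_from_csv(text)
  rows = CSV.parse(text, headers: true)
  dates = rows.map { |row| row['Date'] }
  names = rows.headers - ['Date']
  series = names.to_h do |name|
    [name, rows.map { |row| row[name].delete('%').to_f }]
  end
  [dates, series]
end

dates, series = series_from_csv(CSV_TEXT)
p dates
series.each { |name, values| puts "#{name}: #{values.inspect}" }
```

Each name and array pair is then one line of the graph, and the dates can serve as the x-axis labels in whichever charting library you pick.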
I'm currently using BlueCloth to process Markdown in Ruby and show it as HTML, but in one location I need it as plain text (without some of the Markdown). Is there a way to achieve that?
Is there a markdown-to-plain-text method? Is there an html-to-plain-text method that I could feed the result of BlueCloth?
The Redcarpet gem has a Redcarpet::Render::StripDown renderer which "turns Markdown into plaintext".
Copy and modify it to suit your needs.
Or use it like this (after require 'redcarpet' and require 'redcarpet/render_strip'):
Redcarpet::Markdown.new(Redcarpet::Render::StripDown).render(markdown)
Converting HTML to plain text with Ruby is not a problem, but of course you'll lose all markup. If you only want to get rid of some of the Markdown syntax, it probably won't yield the result you're looking for.
The bottom line is that unrendered Markdown is intended to be used as plain text, therefore converting it to plain text doesn't really make sense. All Ruby implementations that I have seen follow the same interface, which does not offer a way to strip syntax (only including to_html, and text, which returns the original Markdown text).
It's not ruby, but one of the formats Pandoc now writes is 'plain'. Here's some arbitrary markdown:
# My Great Work
## First Section
Here we discuss my difficulties with [Markdown](http://wikipedia.org/Markdown)
## Second Section
We begin with a quote:
> We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's *all*.
Here's the associated 'plain':
My Great Work
=============
First Section
-------------
Here we discuss my difficulties with Markdown
Second Section
--------------
We begin with a quote:
We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's all.
You can get an idea what it does with the different elements it parses out of documents from the definition of plainify in pandoc/blob/master/src/Text/Pandoc/Writers/Markdown.hs in the Github repository; there is also a tutorial that shows how easy it is to modify the behavior.
I am generating some XML using a builder, and would like to compare the results to some file contents. However, since the strings are so long, the output is hard to read when the strings differ.
I know there are a number of libraries for diffing strings in Ruby, but is there a built-in facility in RSpec for generating multi-line string comparison failures that are easier to read?
Okay, got it. You need to use the --diff option with the following:
actual_multiline_string.should == expected_multiline_string
NOT
actual_multiline_string.should eql(expected_multiline_string)