How to diff multiline strings with RSpec?

I am generating some XML using a builder, and would like to compare the results to some file contents. However, since the strings are so long, the output is hard to read when the strings differ.
I know there are a number of libraries for diffing strings in Ruby, but is there a built-in facility in RSpec for generating multiline string comparison failures that are easier to read?

Okay, got it. You need to run RSpec with the --diff option and compare with ==:
actual_multiline_string.should == expected_multiline_string
and not with eql:
actual_multiline_string.should eql(expected_multiline_string)

Related

Regex for Ruby code extraction out of plain text?

I want to extract ruby code snippets out of plain text.
The gem https://github.com/Erol/yomu makes it possible to extract the text of a PDF document. Now I want to get just the well-formed Ruby code out of, for instance, a Ruby programming book.
Any idea what a regex for multi-line matches of Ruby methods and classes could look like?
I tried many different expressions, but did not get the results that I expected.
Try this:
1. Go through the file line by line and try to parse each line as Ruby code.
2. If a line parses as Ruby, start adding more lines to it until they no longer parse as Ruby code.
3. Voilà, here is your first code snippet.
4. Maybe apply some filter to exclude trivial snippets like single words.
5. Repeat.
This is a common practice for extracting source code from unstructured text such as emails, and it has been used to scan millions of emails for research projects.
Use the ripper core library to parse Ruby code.
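The steps above can be sketched with Ripper roughly as follows. This is only a sketch under some assumptions: `extract_ruby_snippets`, `max_lookahead`, and the single-word filter are names and choices made up for illustration. The loop looks ahead a bounded number of lines instead of stopping at the first failure, because an opening line such as `def greet(name)` does not parse on its own. Ordinary prose can also parse as valid Ruby (a run of bare words reads as nested method calls), so expect false positives on real book text.

```ruby
require 'ripper'

# True when the text is non-blank and Ripper finds no syntax error in it.
# Ripper.sexp returns nil when the source does not parse.
def parses_as_ruby?(source)
  !source.strip.empty? && !Ripper.sexp(source).nil?
end

# Hypothetical filter: treat a single bare word as too trivial to keep.
def trivial?(snippet)
  snippet.match?(/\A\w+\z/)
end

# Returns the Ruby-looking snippets found in text, in order of appearance.
def extract_ruby_snippets(text, max_lookahead: 30)
  lines = text.lines.map(&:chomp)
  snippets = []
  i = 0
  while i < lines.length
    if lines[i].strip.empty?
      i += 1
      next
    end
    # Find the longest span starting at line i that parses as a whole.
    last_good = nil
    upper = [i + max_lookahead, lines.length - 1].min
    (i..upper).each do |j|
      last_good = j if parses_as_ruby?(lines[i..j].join("\n"))
    end
    if last_good
      snippet = lines[i..last_good].join("\n").strip
      snippets << snippet unless trivial?(snippet)
      i = last_good + 1 # continue after the snippet
    else
      i += 1 # nothing parseable starts here; move on
    end
  end
  snippets
end

sample = <<~'TEXT'
  First, we define a small method.
  def greet(name)
    "Hello, #{name}"
  end
  Next, we call it and print the result.
  puts greet("world")
  Finally, that is all.
TEXT

puts extract_ruby_snippets(sample)
```

For this sample the prose lines are rejected and the method definition and the `puts` call come back as two separate snippets.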

Ruby reading VB.NET generated data

This is the general goal I am trying to achieve:
My VB.NET program will generate some Lists that may contain booleans, integers, strings, or more lists. I want the program to output a "file" which basically contains such data. I first thought it important that the file not be readable by humans, but actually, human-readable data wouldn't be bad.
Afterward, I want my Ruby program to take such file and read the contents. The Lists become arrays, and integers, booleans and strings are read alright with Ruby. I just want to be able to read the file, I might not need to write it using Ruby.
In .NET you'd use a BinaryWriter; if you're using IronRuby you'd then use a BinaryReader. If you're not using IronRuby, then perhaps...
contents = open(path_to_binary_file, "rb") {|io| io.read }
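Once the bytes are in memory, plain Ruby can decode them by hand. The sketch below assumes BinaryWriter's default layout (an Int32 as 4 little-endian bytes, a Boolean as one byte, a String as UTF-8 bytes preceded by a 7-bit-encoded length) and assumes the reader knows the order in which the values were written. The helper names are made up for illustration, and the sample bytes are built in Ruby to stand in for the file contents.

```ruby
require 'stringio'

# Reads the variable-length (7 bits per byte) integer that precedes a string.
def read_7bit_int(io)
  result = 0
  shift = 0
  loop do
    byte = io.readbyte
    result |= (byte & 0x7F) << shift
    return result if (byte & 0x80).zero? # high bit clear: last byte
    shift += 7
  end
end

# 4-byte signed little-endian integer.
def read_int32(io)
  io.read(4).unpack1('l<')
end

# One byte, non-zero means true.
def read_bool(io)
  io.readbyte != 0
end

# Length prefix followed by that many UTF-8 bytes.
def read_string(io)
  io.read(read_7bit_int(io)).force_encoding('UTF-8')
end

# Stand-in for the file contents: Int32 42, Boolean true, String "hello".
bytes = [42].pack('l<') + [1].pack('C') + [5].pack('C') + 'hello'.b
io = StringIO.new(bytes)

number = read_int32(io)
flag   = read_bool(io)
label  = read_string(io)
p [number, flag, label]
```

Nested lists would need a convention of their own (for example a count written before the items), which the writer and the reader have to agree on.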
Why do you not want it to be human readable? I hope it's not for security reasons...
Use JSON. On the .NET side you can use the Json.NET NuGet package.
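On the Ruby side, reading such a file takes only the json library that ships with Ruby. A minimal sketch, assuming the VB.NET program serialized a nested list to JSON; the literal string below stands in for the file contents:

```ruby
require 'json'

# Stand-in for what the VB.NET program would write to the file.
json_text = '[true, 42, "some text", [1, 2, [false, "nested"]]]'

# JSON arrays become Ruby Arrays; booleans, integers and strings map directly.
data = JSON.parse(json_text)
p data

# With a real file this would be JSON.parse(File.read(path)),
# where path is whatever location the two programs agree on.
```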

How can I render XML character entity references in Ruby?

I am reading some data from an XML webservice with Ruby, something like this:
<phrases>
<phrase language="en_US">&iexcl;I&apos;m highly annoyed with character references!</phrase>
</phrases>
I'm parsing the XML and grabbing an array of phrases. As you can see, the phrase text contains some XML character entity references. I'd like to replace them with the actual character being referenced. This is simple enough with the numeric references, but nasty with the XML and HTML ones. I'd like to avoid having a big hash in my code that holds the character for each XML or HTML character reference, i.e. http://www.java2s.com/Code/Java/XML/Resolvesanentityreferenceorcharacterreferencetoitsvalue.htm
Surely there's a library for this out there, right?
Update
Yes, there is a library out there, and it's called HTMLEntities:
: jmglov@laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov@laurana; irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "&iexcl;I&apos;m highly annoyed with character references!"
=> "¡I'm highly annoyed with character references!"
REXML can do it, though it won't handle "&iexcl;" or "&nbsp;". The list of predefined XML entities (aside from Unicode numeric entities) is actually quite small. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Given this input XML:
<phrases>
<phrase language="en_US">&quot;I&apos;m highly annoyed with character references!&#169;</phrase>
</phrases>
you can parse the XML and the embedded entities like this (for example):
require 'rexml/document'
doc = REXML::Document.new(File.read('/tmp/foo.xml')) # read the whole file and parse it
phrase = REXML::XPath.first(doc, '//phrases/phrase')
text = phrase.first # Type is REXML::Text
puts(text.value)
Obviously, that example assumes that the XML is in file /tmp/foo.xml. You can just as easily pass a string of XML. On my Mac and Ubuntu systems, running it produces:
$ ruby /tmp/foo.rb
"I'm highly annoyed with character references!©
This isn't an attempt to provide a solution; it's to relate some of my own experiences dealing with XML from the wild. I was using Perl at first, then later Ruby, and these are experiences you can easily run into if you grab enough XML or RDF/RSS/Atom feeds.
I've often seen XML CDATA contain HTML, both encoded and unencoded. The encoded HTML was probably the result of someone doing things the right way, via some API or library to generate XML. The unencoded HTML was probably someone using a script to wrap the HTML with tags, resulting in invalid XML, but I had to deal with it anyway.
I've also seen XML CDATA containing HTML that had been encoded multiple times, requiring me to unencode everything, even after the XML engine had done its thing. Sometimes during an intermediate pass I'd suddenly have non-UTF-8 characters in the string along with encoded ones, as a result of someone appending comments or joining multiple HTML streams that came from different character sets. Whatever the reason, it was really ugly and caused XML parsing to break or emit a lot of warnings. I'd have to loop over the content, decoding and checking whether the current decoding pass gave the same result as the previous one, and bailing once nothing had changed. There was no guarantee I'd have a string in a valid character set at that point though, so I'd have to tell iconv to convert it to UTF-8 and throw away characters that wouldn't convert cleanly.
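The decode-until-stable loop can be sketched in plain Ruby as below. This is an illustration, not the code I actually ran: the method names and `max_passes` are made up, the decoder only knows numeric references and the five predefined XML entities (a library such as HTMLEntities covers the HTML ones), and `String#scrub` stands in for the iconv step, since iconv is no longer part of Ruby's standard library.

```ruby
# The five entities predefined by XML; other named entities are left untouched.
PREDEFINED_ENTITIES = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'"
}.freeze

ENTITY_PATTERN = /&(?:#(\d+)|#x(\h+)|(\w+));/

# Turns a code point into a character, or returns fallback if it is not valid.
def codepoint_to_s(codepoint, fallback)
  valid = codepoint <= 0x10FFFF && !(0xD800..0xDFFF).cover?(codepoint)
  valid ? [codepoint].pack('U') : fallback
end

# One decoding pass over decimal, hexadecimal and predefined references.
def decode_once(text)
  text.gsub(ENTITY_PATTERN) do
    match = Regexp.last_match
    if match[1]
      codepoint_to_s(match[1].to_i, match[0])
    elsif match[2]
      codepoint_to_s(match[2].hex, match[0])
    else
      PREDEFINED_ENTITIES.fetch(match[3], match[0])
    end
  end
end

# Drops invalid bytes, then decodes repeatedly until a pass changes nothing.
def decode_fully(text, max_passes: 10)
  current = text.scrub('')
  max_passes.times do
    decoded = decode_once(current)
    return decoded if decoded == current
    current = decoded
  end
  current
end

puts decode_fully('&amp;amp;lt;b&amp;amp;gt;')
```

The `max_passes` cap is just a safety net so that unexpected input cannot keep the loop running for long.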
Nokogiri can decode the content of a node in various ways, by creative use of the to_xml and to_html methods. You can also look at the HTMLEntities gem, Loofah, and others to go after the CDATA contents. Loofah is nice because it's designed to whitelist/blacklist tags you might encounter.
The XML spec is supposed to protect us from such shenanigans, but, as one of my co-workers used to tell me, "We can make it fool-proof, but not damn-fool-proof". People are SO inventive and the specs mean nothing to someone who didn't bother to read them or doesn't care.

Extracting strings from PE files

I'm writing some code (Python, but that really isn't important) that analyzes strings inside PE files. I'm looking for a command line tool I could invoke that will return the complete list of strings inside the PE file.
I know PEDUMP, but it seems to give incomplete strings.
Also, it is very important that this tool be able to handle different types of strings, such as C strings (NULL-terminated), Pascal strings (length-prefixed), etc.
I found "string extractor" here, but it costs money and I'm not sure if it can handle different types of strings.
Do you know of any tool that answers my requirements?
There's the classic Unix program strings, which does exactly this.
Although strings isn't specifically designed to handle Pascal-style strings, it will dump them out anyway because they will appear to be textual data.
Some implementations of strings can handle Unicode (UTF-8 and UTF-16) strings too.
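To give a feel for what strings does, here is a rough Ruby sketch of the idea, not the tool's actual implementation: scan the raw bytes for runs of at least four printable ASCII characters, and do the same for UTF-16LE, where each character is followed by a zero byte. The method names and the `min_length` parameter are illustrative only.

```ruby
# Runs of min_length or more printable ASCII bytes.
def ascii_strings(data, min_length: 4)
  data.b.scan(/[\x20-\x7E]{#{min_length},}/n)
end

# Same idea for UTF-16LE text: a printable byte followed by a zero byte.
def utf16le_strings(data, min_length: 4)
  data.b.scan(/(?:[\x20-\x7E]\x00){#{min_length},}/n).map do |raw|
    raw.force_encoding('UTF-16LE').encode('UTF-8')
  end
end

# Stand-in for File.binread(path): a C string, a short run that is ignored,
# a Pascal string with length byte 6, and a UTF-16LE string.
sample = "\x00\x01Hello\x00ab\x00\x06World!\xFF".b + "W\x00i\x00d\x00e\x00".b

p ascii_strings(sample)
p utf16le_strings(sample)
```

The Pascal string is found because its length byte (6) is not printable and is simply skipped; a length byte of 32 or more would be printable and would end up glued to the front of the string.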
http://pedump.me can show all strings from your PE

Parsing text files and sorting in Ruby?

I would like to write a Ruby program which can parse three separate text files, each containing different delimiters, then sort them according to certain criteria.
Can someone please point me in the right direction?
It is not clear what the data format in your files is, or what criteria you want to sort by, so I am not able to give you an accurate answer.
However, basically, you might need something like this:
File.read("file_name").split(",").sort_by { |x| x.length }
This:
1. Reads the whole file into a string with File.read. You can also read the file line by line using File.foreach.
2. Splits the string using split. The delimiter used here is a comma.
3. Uses sort_by to sort the pieces according to the criterion in the block (here, their length).
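Extending that to several files with different delimiters might look like the sketch below. Since the question doesn't say what the data looks like, the record layout (a name and an age per line), the delimiters, and the sort order are all assumptions, and the inline strings stand in for the three files.

```ruby
# Hypothetical inputs standing in for three files; with real files,
# each :text would come from File.read(path).
inputs = [
  { text: "carol,35\nalice,30\n", delimiter: "," },
  { text: "bob|25\n",             delimiter: "|" },
  { text: "dave\t41\n",           delimiter: "\t" }
]

# Split every line on its file's delimiter and collect [name, age] records.
records = inputs.flat_map do |input|
  input[:text].each_line.map do |line|
    name, age = line.chomp.split(input[:delimiter])
    [name, age.to_i]
  end
end

# Sort by age, then by name to break ties.
sorted = records.sort_by { |name, age| [age, name] }
sorted.each { |name, age| puts "#{name} #{age}" }
```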
Enumerable#sort_by lets you sort an array (or other enumerable object) by the value a block returns for each element.
If by "text files with delimiters" you mean CSV files (character-separated values), then you can use the csv library, which is part of the standard library, to parse them. CSV gives you objects that look and feel like Ruby Hashes and Arrays, so you can use all the standard Ruby methods for sorting, filtering and iterating, including the aforementioned Enumerable#sort_by.
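A short sketch of the csv approach, assuming a semicolon-delimited file with a header row; the column names and data are made up, and the string stands in for the file contents. The `col_sep` option is what changes from file to file.

```ruby
require 'csv'

# Stand-in for File.read(path) on a semicolon-delimited file.
text = "name;age\ncarol;35\nalice;30\nbob;25\n"

# headers: true turns each line into a row addressable by column name.
table = CSV.parse(text, col_sep: ';', headers: true)

# CSV::Table is Enumerable, so sort_by works on its rows directly.
sorted = table.sort_by { |row| row['age'].to_i }
sorted.each { |row| puts "#{row['name']} #{row['age']}" }
```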
