How to parse mailbox file in Ruby?

How to parse mailbox file in Ruby? - ruby

The Ruby gem rmail has methods to parse a mailbox file on local disk. Unfortunately this gem has broken (in Ruby 2.0.0). It might not get fixed, because folks are migrating to the gem mail.
Gem mail has method Mail.read('filename.txt'), but that parses only the first message in a mailbox.
That gem, and builtin Net::IMAP, have flooded the net with tutorials on accessing mailboxes through imap.
So, is there still a way to parse a plain old file, without imap?
As the lone rubyist in my group I'd rather not embarrass myself by resorting to http://docs.python.org/2/library/mailbox.html.
Or, worse yet, PHP's imap_open('/var/mail/www-data', ...) -- if only Net::IMAP.new accepted filenames like that.

The good news is the Mbox format is really dead simple, though it's simplicity is why it was eventually replaced. Parsing a large mailbox file to extract a single message is not specially efficient.
If you can split apart the mailbox file into separate strings, you can pass these strings to the Mail library for parsing.
An example starting point:
def parse_message(message)
Mail.new(message)
do_other_stuff!
end
message = nil
while (line = STDIN.gets)
if (line.match(/\AFrom /))
parse_message(message) if (message)
message = ''
else
message << line.sub(/^\>From/, 'From')
end
end
The key is that each message starts with "From " where the space after it is key. Headers will be defined as From: and any line that starts with ">From" is to be treated as
actually being "From". It's things like this that make this encoding method really inadequate, but if Maildir isn't an option, this is what you've got to do.

You can use tmail parsing email boxes, but it was replaced by mail, but I can't really find a class that substitutes it. So you might want to keep along with tmail.
EDIT: as #tadman pointed out, it should not be working with ruby 1.9. However you can port this class (and put it on github for everyone else use :-) )

The mbox format is about as simple as you can get. It's simply the concatenation of all the messages, separated by a blank line. The first line of each message starts with the five characters "From "; when messages are added to the file, any line which starts "From" has a > prefixed, so you can reliably use the fact that a line starts with "From" as an indicator that it is the start of a message.
Of course, since this is an old format and it was never standardized, there are a number of variants. One variant uses the Content-Length header to determine the length of a message, and some implementations of this variant fail to insert the '>'. However, I think this is rare in practice.
A big problem with mbox format is that the file needs to be modified in place by mail agents; consequently, every implementation has some locking procedure. Of course, there is no standardization there, so you need to watch out for other processes modifying the mailbox while you are reading it. In practice, many mail systems solved this problem by using maildir format instead, in which a mailbox is actually a directory and every message is a single file.
Other things you might want to do include MIME decoding, but you should be able to find utilities which do that.

Related

Writing hash information to file and reloading it automatically on program startup?

I wrote a little program that creates a hash called movies. Then I can add, update, delete, and display all current movies in the hash by typing the title.
Instead of having it start a new hash each time and save anything added to a file, and, when updated or deleted, update or delete the key, value pair from the file, I want the program to auto-load the file on startup and create it if it doesn't exist.
I have no idea how to go about doing this.
After reading a lot of the comments I have decided that maybe I should do this with SQL instead, seems like a much better approach!

You can't store Ruby objects directly on the disk; you will first need to convert them to some sequence of bytes (i.e. a string). This is called serialization, and there are several different ways to do it and several different formats the data could be in. I think I would recommend JSON, but you might also want to try YAML or Marshal.
Any of those libraries will allow you to convert your hash into a string and allow you to convert that same string back into a hash. Then you can use Ruby's File class to save and load that string from the disk.
This should get you pointed in the right direction. From here you can search for more specific things like "how do I convert a hash to JSON" or "how do I write a string to a file".

You have the ability to marshal your code in a few ways.
YAML if you would like to use a gem, or JSON. There is also a built in Marshal
RI tells us:
Marshal
(from ruby site)
----------------------------------------------------------------------------- The marshaling library converts collections of Ruby objects into a
byte stream, allowing them to be stored outside the currently active
script. This data may subsequently be read and the original objects
reconstituted.
Marshaled data has major and minor version numbers stored along with
the object information. In normal use, marshaling can only load data
written with the same major version number and an equal or lower minor
version number. If Ruby's ``verbose'' flag is set (normally using -d,
-v, -w, or --verbose) the major and minor numbers must match exactly. Marshal versioning is independent of Ruby's version numbers. You can
extract the version by reading the first two bytes of marshaled data.
And I will leave it at that for Marshal. But there is a bit more documentation there.
You can also use IO#puts to write to a file, and then modify that file to load later, which I use sometimes for config settings. Why use YAML or another external source, when Ruby is easy enough to have a user modify? You use YAML when it needs to be more generally accessible, as the Tin Man points out.
For example this file is the sample file, but is intended for interactive editing (with constraints, of course) but it is simply valid Ruby. And it gets read by a Ruby program, and is a valid object (in this case a Hash stored in a constant.)

How can I render XML character entity references in Ruby?

I am reading some data from an XML webservice with Ruby, something like this:
<phrases>
<phrase language="en_US">¡I'm highly annoyed with character references!</phrase>
</phrases>
I'm parsing the XML and grabbing an array of phrases. As you can see, the phrase text contains some XML character entity references. I'd like to replace them with the actual character being referenced. This is simple enough with the numeric references, but nasty with the XML and HTML ones. I'd like to avoid having a big hash in my code that holds the character for each XML or HTML character reference, i.e. http://www.java2s.com/Code/Java/XML/Resolvesanentityreferenceorcharacterreferencetoitsvalue.htm
Surely there's a library for this out there, right?
Update
Yes, there is a library out there, and it's called HTMLEntities:
: jmglov#laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov#laurana; irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "¡I'm highly annoyed with character references!"
=> "¡I'm highly annoyed with character references!"

REXML can do it, though it won't handle "¡" or " ". The list of predefined XML entities (aside from Unicode numeric entities) is actually quite small. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Given this input XML:
<phrases>
<phrase language="en_US">"I'm highly annoyed with character references!©</phrase>
</phrases>
you can parse the XML and the embedded entities like this (for example):
require 'rexml/document'
doc = REXML::Document.new(File.open('/tmp/foo.xml').readlines.join(''))
phrase = REXML::XPath.first(doc, '//phrases/phrase')
text = phrase.first # Type is REXML::Text
puts(text.value)
Obviously, that example assumes that the XML is in file /tmp/foo.xml. You can just as easily pass a string of XML. On my Mac and Ubuntu systems, running it produces:
$ ruby /tmp/foo.rb
"I'm highly annoyed with character references!©

This isn't an attempt to provide a solution, it's to relate some of my own experiences dealing with XML from the wild. I was using Perl at first, then later using Ruby, and the experiences are something you can encounter easily if you grab enough XML or RDF/RSS/Atom feeds.
I've often seen XML CDATA contain HTML, both encoded and unencoded. The encoded HTML was probably the result of someone doing things the right way, via some API or library to generate XML. The unencoded HTML was probably someone using a script to wrap the HTML with tags, resulting in invalid XML, but I had to deal with it anyway.
I've also seen XML CDATA containing HTML that had been encoded multiple times, requiring me to unencode everything, even after the XML engine had done its thing. Sometimes during an intermediate pass I'd suddenly have non-UTF8 characters in the string along with encoded ones, as a result of someone appending comments or joining multiple HTML streams together that were from different character-sets. For whatever the reason, it was really ugly and caused XML parsing to break or emit a lot of warnings. I'd have to loop over the content, decoding and checking to see if the previous pass was the same as the current decoding pass, and bailing if nothing had changed. There was no guarantee I'd have a string in a valid character-set at the time though, so I'd have to tell iconv to convert it to UTF8 and throw away characters that wouldn't convert cleanly.
Nokogiri can decode the content of a node various ways, by creative use of the to_xml and to_html methods. You can also look at the HTMLEntities gem, Loofah, and others to go after the CDATA contents. Loofah is nice because it's designed to whitelist/blacklist tags you might encounter.
The XML spec is supposed to protect us from such shenanigans, but, as one of my co-workers used to tell me, "We can make it fool-proof, but not damn-fool-proof". People are SO inventive and the specs mean nothing to someone who didn't bother to read them or doesn't care.

Remove Signature from Received Message

I have a python script that receives text messages from users, and processes them as a query. However, some users have signatures automatically appended to their messages, and the script incorrectly treats them as actual content. What's the best programmatic way to recognize and remove these signatures?
(I'd prefer in python, but am fine with any other language too, as well as just saying it in pseudocode)

If the signatures are appended to the body of the message such that they're actually part of the body text, then there are only two ways to remove them:
Heuristics, such as "anything following three dashes must be a signature". These may be effective if you spend some time tuning them.
A classifier. This is a lot of work to set up, and requires that you "train" it by marking some message parts as signatures. These can also be very effective, but like heuristics will never work 100% of the time.

If the signature always follows a specific pattern, you should be able to just use a regular expression to trim it off.
However, if the user can setup their signature any way they wish, and there is no leading characters (ie: -- at the beginning), this is going to be very difficult. The only reliable way to do this would be to know the content of the signature for each user in advance so you can strip it out. Imagine a worst-case scenario: Somebody could always send a blank message, with a signature that was a fully valid "query". There'd be no way for the script to differentiate that from a "query" message with no signature.

URI encoding in Yahoo mail compose link

I have link generating web app. I'd like to make it easy for users to email the links they create to others using gmail, yahoo mail, etc. Yahoo mail has a particular quirk that I need a workaround for.
If you have a Yahoo mail account, please follow this link:
http://compose.mail.yahoo.com/?body=http%3A%2F%2Flocalhost%3A8000%2Fpath%23anchor
Notice that yahoo redirects to a specific mail server (e.g. http://us.mc431.mail.yahoo.com/mc/compose). As it does, it decodes the hex codes. One of them, %23, is a hash symbol which is not legal in a query string parameter value. All info after %23 is lost.
All my links are broken, and just using another character is not an option.
Calling us.mc431.yahoo.com directly works for me, but probably not for all users, depending on their location.
I've tried setting html=true|false, putting the URL in a html tag. Nothing works. Anyone got a reliable workaround for this particular quirk?
Note: any server-based workaround is a non-starter for me. This has to be a link that's just between Yahoo and the end-user.
Thanks

Here is how i do it:
run a window.escape on those chars: & ' " # > < \
run a encodeURIComponent on the full string
it works for most of my case. though newline (\n) is still an issue, but I replace \n with space in my case and it worked fine.

I have been dealing with the same problem the last couple of hours and I found a workaround!
If you double-encode the anchor it will be interpreted correctly by Yahoo. That means change %23 to %2523 (the percent-sign is %25 encoded).
So your URI will be:
http://compose.mail.yahoo.com/?body=http%3A%2F%2Flocalhost%3A8000%2Fpath%2523anchor
The same workaround can be used for ampersand. If you only encode that as %26, then Yahoo will convert that to "&" which will discard the rest of message. Same procedure as above - change %26 to %2526.
I still haven't found a solution to the newline-problem though (%0D and %0A).

For the newline, add the newline as < BR > and double encode it also, it is interpreted successfully as new line in the new message

I think you're at the mercy of what Yahoo's server does when it issues the HTTP redirect. It seems like it should preserve the URL escaping on the redirect, but isn't. However, without knowledge of their underlying application, it's hard to say why it wouldn't. Perhaps, it's just an unintended side effect (or bug), or perhaps some of the Javascript features on that page require them to do some finagling with the hash tag.

passing rather huge arguments to ruby script, problems?

ruby somescript.rb somehugelonglistoftextforprocessing
is this a bad idea? rather should i create a separate flat file containig the somehugelonglistoftextforprocessing, and let somescript.rb read it ?
does it matter if the script argument is very very long text(1KB~300KB) ? what are some problems that can arise if any.

As long as the limits of your command-line handling code (e.g., bash or ruby itself) are not exceeded, you should have no technical problems in doing this.
Whether it's a good idea is another matter. Do you really want to have to type in a couple of hundred kilobytes every single time you run your program? Do you want to have to remember to put quotes around your data if it contains spaces?
There are a number of ways I've seen this handled which you may want to consider (this list is by no means exhaustive):
Change your code so that, if there's no arguments, read the information from standard input - this will allow you to do either
ruby somescript.rb myData
or
ruby somescript.rb <myFile.txt.
Use a special character to indicate file input (I've seen # used in this way). So,
ruby somescript.rb myData
would use the data supplied on the command line whilst
ruby somescript.rb #myFile.txt
would get the data from the file.
My advice would be to use the file-based method for that size of data and allow an argument to be used if specified. This covers both possible scenarios:
Lots of data, put it in a file so you won't have to retype it every time you want to run your command.
Not much data, allow it to be passed as an argument so that you don't have to create a file for something that's easier to type in on the command line.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio