Parsing structured text in Ruby

There are several questions on SO about parsing structured text in Ruby, but none of them apply to my case.
I'm the author of the Ruby Whois library. The library includes several parsers to parse a WHOIS response and extract the properties from the content.
So far, I used two approaches:
Regular expressions for base parsers (e.g. whois.aero)
StringScanner for advanced parsers (e.g. whois.nic.it)
Regular expressions are not efficient because if I need to extract 15 properties, I need to scan the same response at least 15 times.
StringScanner is a nice library, but creating an efficient scanner is not that simple.
I was wondering whether there are other Ruby tools you would suggest for implementing a WHOIS record parser. I was reading about Treetop, but because WHOIS records lack a specification, I believe Treetop is not the right solution.
Any suggestion?

The obvious one is Ragel. WHOIS records are pretty simple and have a limited set of key terms, so describing them should be straightforward. And Ragel parsers have proven very efficient.
Update As promised.
Okay, so why use Ragel? Basically, anything that can be described as a finite state machine can be described in Ragel, which then generates code for a highly efficient parser. The generated parser is typically much faster than a general-purpose regular expression engine, simply because code specialized for one grammar is a simpler program than a general matcher.
Now, you could take this further, for example by using the ABNF Generator here. Then, your description to start with could be as simple as something like
WHOIS ::= RECORD*
RECORD ::= FIELDNAME ':' FIELDVALUE
FIELDVALUE ::= NAMESTRING | IPADDRESS | DOMAINNAME
(I make no claim that's particularly ABNF syntax, just a rough BNF.) The point is that you describe the parser in a more or less intuitive form, and let the generator make the exciting code part.
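This is not Ragel, but to make the rough grammar above concrete, here is a minimal hand-written sketch in plain Ruby that walks the response once with StringScanner instead of running one regular expression per property. The name parse_whois is hypothetical, and the sketch assumes every field sits on a single "Name: value" line, which real WHOIS servers do not guarantee.

```ruby
require "strscan"

# Hypothetical helper (not part of the Whois library): collects every
# "FIELDNAME: FIELDVALUE" line in a single pass over the response.
def parse_whois(text)
  scanner = StringScanner.new(text)
  record = {}
  until scanner.eos?
    if scanner.scan(/([^:\n]+):[ \t]*([^\n]*)\n?/)
      key = scanner[1].strip
      value = scanner[2].strip
      # A field may repeat (e.g. several nameservers), so keep a list.
      (record[key] ||= []) << value
    else
      # Blank line or a line without a field name: skip it.
      scanner.scan(/[^\n]*\n?/)
    end
  end
  record
end
```

All properties come out of one scan, so extracting 15 of them costs one pass over the text rather than 15.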

Related

is there a generic query language for arbitrary sets independent from a programming language?

I'm looking for a way to define queries on sets independently from a programming language or the kind of sets.
In detail this would be a language definition and implementations for common languages like Java, C++, Python etc.
As commented I'm not looking for a database or any implementation of a set-representation but only a way to define a query for elements from e.g. a std::set/vector a Python set() or any linear structure which can be seen as a set.
A close example would be something like jLinq but without being tied to JSON or javascript and with a well defined string representation.
Of course without knowing the kind of data structure you would have to implement any conditional filter for every problem and every programming language, but the way you construct query strings and how you evaluate them would be clear and you would not have to write parsers.
So what I'd like to write in Java or C++ is something like
q = query()
.created_after("14.03.2010")
.and(contains("hello")
.or(contains("hallo")))
.sort("caption")
or written as a string:
"(created_after("14.03.2010") and ( contains("hello") or contains("hallo"))) sort("caption")"
(this is not thought through - just to show what an interface could look like)
A good example for a different problem would be JSON or XML: clear language definition and parsers/tools for any platform or programming language.
I know this is an old question, but I think I know what you mean and I was actually looking for something similar. What you need is a "search query parser".
I found search-query-parser for nodejs (I'm not the author). I haven't tried it yet, but it looks promising. The example in the docs is very illustrative: you would receive an input string from the UI
from:hi@retrace.io,foo@gmail.com to:me subject:vacations date:1/10/2013-15/04/2014 photos
And the library would parse it to a structured json object
{
from: ['hi@retrace.io', 'foo@gmail.com'],
to: 'me',
subject: 'vacations',
date: {
from: '1/10/2013',
to: '15/04/2014'
},
text: 'photos'
}
And then from that object you could construct and issue a query command to your database. As you can see, it handles lists and ranges. Right away I can't see any boolean operators (AND, OR), but I guess they could be easily implemented.
Hope this helps.
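For readers who want the same idea outside nodejs, here is a rough Ruby sketch of what such a search query parser does. It is not the search-query-parser API: the method name parse_search_query and its keywords and ranges options are made up for illustration, and it assumes values contain no spaces.

```ruby
# Hypothetical sketch: splits "key:value" tokens into a hash, treating
# comma-separated values as lists and "a-b" values of range keys as ranges.
def parse_search_query(input, keywords: %w[from to subject], ranges: %w[date])
  result = {}
  free_text = []
  input.split.each do |token|
    key, value = token.split(":", 2)
    if value && keywords.include?(key)
      items = value.split(",")
      result[key.to_sym] = items.size == 1 ? items.first : items
    elsif value && ranges.include?(key)
      from, to = value.split("-", 2)
      result[key.to_sym] = to ? { from: from, to: to } : { from: from }
    else
      # Anything that is not a known key:value pair is plain search text.
      free_text << token
    end
  end
  result[:text] = free_text.join(" ") unless free_text.empty?
  result
end
```

Calling it on the input string from the docs example gives a hash with the same shape as the object shown there, which you could then translate into whatever filter your data structure supports.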
RSQL is a good option these days. There are plenty of parsers available and the queries are URL friendly.

Recursive logic for parsing string into complex boolean?

I'm sure this has been done before, I just can't find it.
I need to turn something like, "((A OR B) AND C) OR D" into a database query for an attribute. Specifically I'm using Ruby Sequel. Can anyone point me at an example or utility or something that will keep me from reinventing the wheel?
You can define a grammar using ANTLR and automatically generate a Ruby parser for this type of string. ANTLR is a parser generator: it lets you define a grammar for a language (such as the boolean language that you described).
After parsing, you can specify what actions need to be taken to build the desired data structure (in your case a tree data structure that captures the structure of the query).
This is not particularly a Ruby problem as ANTLR can also generate parsers for other languages. In your case it would produce a Ruby parser that you can integrate into your application to parse the strings and to produce the data structure that you need.
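If pulling in a parser generator feels heavy for a language this small, a hand-written recursive descent parser is a possible stand-in. The sketch below is not ANTLR output; the class name BooleanParser is hypothetical, and it assumes AND binds tighter than OR and that operands are plain words.

```ruby
# Hypothetical hand-written parser for expressions such as
# "((A OR B) AND C) OR D". It returns a nested-array tree.
class BooleanParser
  def initialize(text)
    @tokens = text.scan(/\(|\)|\w+/)
    @pos = 0
  end

  def parse
    tree = parse_or
    raise ArgumentError, "unexpected token #{peek.inspect}" if peek
    tree
  end

  private

  def peek
    @tokens[@pos]
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end

  # expr := term ("OR" term)*
  def parse_or
    node = parse_and
    while peek == "OR"
      advance
      node = [:or, node, parse_and]
    end
    node
  end

  # term := atom ("AND" atom)*
  def parse_and
    node = parse_atom
    while peek == "AND"
      advance
      node = [:and, node, parse_atom]
    end
    node
  end

  # atom := WORD | "(" expr ")"
  def parse_atom
    token = advance
    case token
    when "("
      node = parse_or
      raise ArgumentError, "expected )" unless advance == ")"
      node
    when nil, ")", "AND", "OR"
      raise ArgumentError, "unexpected token #{token.inspect}"
    else
      token
    end
  end
end
```

The resulting tree can then be walked recursively to build the database condition, for example by combining Sequel expressions at each :and or :or node; that mapping is left out here because it depends on what A, B, C and D mean in your schema.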

How to build AST by S-expression in Ruby?

I have no idea how to build S-exp.
I want to do it because I need to build an AST for my language.
At the beginning I used RubyParser to parse the source into a sexp and then generated code from it.
But I think that only works if my language is a subset of Ruby, so I can't define the language I want.
Now I need to implement a parser for my language.
So could anyone recommend a Ruby tool for building an AST from S-expressions?
Thanks!
It is not very clear from your question what exactly you need, but a simple Google search gives some interesting links to check. If these links do not answer your question, you could edit it to make it more precise and concrete.
http://thingsaaronmade.com/blog/writing-an-s-expression-parser-in-ruby.html
https://github.com/aarongough/sexpistol
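To give an idea of how little is needed, here is a minimal sketch of an S-expression reader in Ruby. It is not taken from either link above; the name parse_sexp is hypothetical, and it only handles integers, double-quoted strings (escapes are kept as written), symbols and nested lists.

```ruby
require "strscan"

# Hypothetical minimal reader: turns S-expression text into nested Ruby
# arrays, which can serve as a simple AST.
def parse_sexp(source)
  scanner = StringScanner.new(source)
  stack = [[]] # stack.first collects the top-level expressions
  until scanner.eos?
    if scanner.skip(/\s+/)
      next
    elsif scanner.scan(/\(/)
      stack.push([])
    elsif scanner.scan(/\)/)
      raise ArgumentError, "unbalanced )" if stack.size == 1
      list = stack.pop
      stack.last << list
    elsif scanner.scan(/-?\d+(?=[\s()]|\z)/)
      stack.last << scanner.matched.to_i
    elsif scanner.scan(/"((?:[^"\\]|\\.)*)"/)
      stack.last << scanner[1]
    elsif scanner.scan(/[^\s()"]+/)
      stack.last << scanner.matched.to_sym
    else
      raise ArgumentError, "unexpected character at offset #{scanner.pos}"
    end
  end
  raise ArgumentError, "missing )" unless stack.size == 1
  stack.first
end
```

Each list becomes an array whose first element can be treated as the node type, so a code generator can dispatch on it.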
You might try the sxp-ruby gem at http://github.com/bendiken/sxp-ruby. I use it for SPARQL S-Expressions (SSE) and similar methods for managing Abstract Syntax Trees in Ruby.
Maybe you could have a look at this gem named Astrapi.
This is just an experiment:
describe your language elements (concepts) in a "mm" file (abstract syntax)
run astrapi on this file
astrapi generates a parser that can fill in your AST from input source expressed as s-expressions (the concrete syntax of your concepts).
I have put some modest documentation here.

Parsing XML, how is this actually done? [duplicate]

So, just as a fun project, I decided I'd write my own XML parser. No, not to parse a specific document, and no, not using an XML parser library. I mean writing code to parse out any XML document into a usable data structure. Just because I like the challenge. :-)
With that said, so far it's proved to be... interesting. It's not as easy to parse (especially when you start taking into account special characters, CDATA, empty tags, comments, etc.) as it initially looked.
Are there any well documented XML parsing algorithms or explanations anywhere that anyone knows of? It seems like there are well-documented Queue and Stack and BTree and etc. etc. etc. implementations everywhere, but I'm not sure I've ever seen a simple, well-documented XML parser algorithm...
I repeat: I am not looking for a pre-built parser library! I am looking for information on how to create my own pre-built parser library! Do not tell me "use expat" or "use SAX" or whatever. That's not what I'm asking for.
Antlr offers a tutorial on parsing XML. It breaks the process down into phases: lexing, parsing, tree parsing, etc. Looks pretty interesting.
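To illustrate the lexing and tree-building phases without any library, here is a rough Ruby sketch, not taken from the ANTLR tutorial. It recognizes one token at a time and keeps a stack of open elements. The names XmlNode and parse_xml are hypothetical, and it covers only a small subset: no entity decoding, no DOCTYPE, no namespaces handling, and attribute values must be double-quoted.

```ruby
require "strscan"

# Hypothetical tree node: element name, attribute hash, child nodes/text.
XmlNode = Struct.new(:name, :attributes, :children)

def parse_xml(source)
  scanner = StringScanner.new(source)
  root = XmlNode.new(:document, {}, [])
  stack = [root] # elements that are currently open
  until scanner.eos?
    if scanner.scan(/<!--.*?-->|<\?.*?\?>/m)
      # Comments and processing instructions are dropped.
    elsif scanner.scan(/<!\[CDATA\[(.*?)\]\]>/m)
      stack.last.children << scanner[1]
    elsif scanner.scan(%r{</([\w:.-]+)\s*>})
      unless stack.size > 1 && stack.last.name == scanner[1]
        raise ArgumentError, "mismatched </#{scanner[1]}>"
      end
      stack.pop
    elsif scanner.scan(%r{<([\w:.-]+)((?:\s+[\w:.-]+\s*=\s*"[^"]*")*)\s*(/?)>})
      name, raw_attrs, self_closing = scanner[1], scanner[2], scanner[3]
      attrs = raw_attrs.scan(/([\w:.-]+)\s*=\s*"([^"]*)"/).to_h
      node = XmlNode.new(name, attrs, [])
      stack.last.children << node
      stack.push(node) if self_closing.empty? # empty tags never stay open
    elsif scanner.scan(/[^<]+/)
      text = scanner.matched
      stack.last.children << text unless text.strip.empty?
    else
      raise ArgumentError, "cannot parse at offset #{scanner.pos}"
    end
  end
  raise ArgumentError, "unclosed <#{stack.last.name}>" unless stack.size == 1
  root
end
```

The stack is what makes nesting work: a start tag pushes, an end tag must match the top and pops, and everything else is appended to whatever element is currently on top.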
I don't know if it would be "cheating" in your book, but you could try parsing your XML with a ready-built all-purpose language parser like ANTLR. The result would be a list of tokens (if you just use the lexer) or a parse tree (if you include the parser) and you could then re-build the parse tree almost 1:1 into an XML structure.
Maybe. I haven't thought about the ways in which XML might be different from "normal" ANTLR fodder like programming languages, and whether you would be able to define a suitable grammar.
VTD-XML is probably the simplest parsing technique possible...
http://expat.sourceforge.net/
Expat is an XML parser library written in C. It is a stream-oriented parser in which an application registers handlers for things the parser might find in the XML document (like start tags). An introductory article on using Expat is available on xml.com.

What Ruby parser would you suggest to parse Ruby sources?

A parser I'm looking for should:
be Ruby parsing friendly,
be elegant by rule design,
produce user friendly parsing errors,
have user documentation that goes beyond a calculator example,
UPD: allow optional whitespace to be omitted when writing a grammar.
Fast parsing is not an important feature.
I tried Citrus but the lack of documentation and need to specify every space in rules just turned me away from it.
Treetop
Ragel
Or in case you want to parse Ruby itself:
parse_tree and ruby_parser
Edit:
I just saw your last comment about needing a subset of Ruby for your project, in that case I'd also recommend having a look at tinyrb.
