So, just as a fun project, I decided I'd write my own XML parser. No, not to parse a specific document, and no, not using an XML parser library. I mean writing code to parse out any XML document into a usable data structure. Just because I like the challenge. :-)
With that said, so far it's proved to be... interesting. It's not as easy to parse (especially when you start taking into account special characters, CDATA, empty tags, comments, etc.) as it initially looked.
Are there any well documented XML parsing algorithms or explanations anywhere that anyone knows of? It seems like there are well-documented Queue and Stack and BTree and etc. etc. etc. implementations everywhere, but I'm not sure I've ever seen a simple, well-documented XML parser algorithm...
I repeat: I am not looking for a pre-built parser library! I am looking for information on how to create my own pre-built parser library! Do not tell me "use expat" or "use SAX" or whatever. That's not what I'm asking for.
Antlr offers a tutorial on parsing XML. It breaks the process down into phases: lexing, parsing, tree parsing, etc. Looks pretty interesting.
I don't know if it would be "cheating" in your book, but you could try parsing your XML with a ready-built all-purpose language parser like ANTLR. The result would be a list of tokens (if you just use the lexer) or a parse tree (if you include the parser) and you could then re-build the parse tree almost 1:1 into an XML structure.
Maybe. I haven't thought about the ways in which XML might be different from "normal" ANTLR fodder like programming languages, and whether you would be able to define a suitable grammar.
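For what it's worth, the lexing and parsing phases mentioned above can also be sketched by hand, without a parser generator. The following is only a rough illustration under heavy assumptions: it covers a small XML subset (tags, attributes, text, comments, CDATA, empty tags, the five predefined entities) and ignores DOCTYPE, processing instructions, numeric character references, and `>` inside attribute values. The names `lex` and `parse` and the dict-based node shape are made up for this sketch.

```python
import re

# One regex alternative per token kind in this deliberately small XML subset.
TOKEN = re.compile(
    r"<!--(?P<comment>.*?)-->"
    r"|<!\[CDATA\[(?P<cdata>.*?)\]\]>"
    r"|<(?P<close>/)?(?P<name>[A-Za-z_:][\w:.\-]*)(?P<attrs>[^<>]*?)(?P<empty>/)?>"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)
ATTR = re.compile(r"""([A-Za-z_:][\w:.\-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def unescape(s):
    # Replace only the five predefined entities; anything else is left as-is.
    return re.sub(r"&(\w+);", lambda m: ENTITIES.get(m.group(1), m.group(0)), s)


def lex(source):
    """Phase 1: turn the raw string into a flat stream of tokens."""
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group("comment") is not None:
            continue  # comments are dropped
        if m.group("cdata") is not None:
            yield ("text", m.group("cdata"))  # CDATA is taken literally
        elif m.group("text") is not None:
            yield ("text", unescape(m.group("text")))
        elif m.group("close"):
            yield ("close", m.group("name"))
        else:
            attrs = {k: unescape(a or b) for k, a, b in ATTR.findall(m.group("attrs"))}
            yield ("open", m.group("name"), attrs)
            if m.group("empty"):  # <b/> behaves like <b></b>
                yield ("close", m.group("name"))


def parse(source):
    """Phase 2: build a tree from the tokens using a stack of open elements."""
    root = {"name": None, "attrs": {}, "children": []}
    stack = [root]
    for tok in lex(source):
        if tok[0] == "text":
            if tok[1].strip():  # skip whitespace-only text
                stack[-1]["children"].append(tok[1])
        elif tok[0] == "open":
            node = {"name": tok[1], "attrs": tok[2], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            if len(stack) == 1 or stack[-1]["name"] != tok[1]:
                raise ValueError(f"mismatched closing tag </{tok[1]}>")
            stack.pop()
    if len(stack) != 1:
        raise ValueError(f"unclosed tag <{stack[-1]['name']}>")
    return root["children"]
```

Calling `parse('<a x="1"><b/>hi</a>')` gives a list with one node dict for `a`, whose children are the node for `b` and the string `"hi"`. A real parser would also have to check well-formedness rules that this sketch skips.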
VTD-XML is probably the simplest parsing technique possible...
http://expat.sourceforge.net/
Expat is an XML parser library written in C. It is a stream-oriented parser in which an application registers handlers for things the parser might find in the XML document (like start tags). An introductory article on using Expat is available on xml.com.
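To make the handler-registration idea concrete, here is a small sketch. It does not use Expat's C API; it assumes Python's standard-library binding to Expat, `xml.parsers.expat`. The function name `collect_events` and the tuple format of the events are invented for the illustration.

```python
import xml.parsers.expat


def collect_events(xml_text):
    """Feed a document to Expat and record the callbacks it fires, in order."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one call

    def on_start(name, attrs):
        events.append(("start", name, attrs))

    def on_end(name):
        events.append(("end", name))

    def on_text(data):
        if data.strip():  # ignore whitespace-only text
            events.append(("text", data))

    # Registering handlers: the parser calls these as it streams through the input.
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text

    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return events
```

The application never sees a tree here. It only receives callbacks, so building any data structure from them is up to the handlers.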
The official YAML documentation, like most YAML documentation, is written in YAML itself. That is a nice demonstration of the language's power, and I understand that this is the main point of the documentation. But documenting a language in that same, still unfamiliar language is like solving a puzzle, at least for me. Searching this style of documentation is really hard, for example for "which operators can I use for string indentation?" Traditional documentation would have a chapter called, say, "String indentation". Here, nice demo though it is, you need to read and understand all of it, which is especially painful if you don't work with YAML daily. The YAML language spec is great if you want to practice context (free?) grammar definitions, but it is poorly suited to quick lookups of basic questions.
My question is: is there YAML documentation with a traditional structure that documents most features, not just a few? One HTML page, with sections and paragraphs? I cannot find one, and I always waste a lot of time trying to find something in this style of documentation. And every time I read anything, I feel I'm missing a lot of information that isn't shown, constantly learning X through constructs that haven't been explained yet.
In a Node.js project I wrote a query parser in ANTLR4 (JS target). The user queries have a simplified SQL-like grammar and are then processed into full SQL on the server. The query structure can be arbitrarily nested.
I am now porting this app to Go. There is, at the moment, no ANTLR4 target for Go. I started exploring Ragel, but according to the documentation it expects a regular grammar and does not handle recursion, except for really simple tasks like balancing parentheses.
Another solution is to use my ANTLR4 grammar with the C++ target and then link the C++ classes to Go with SWIG (or something), which feels kind of hairy and like a last-resort solution.
Yet another solution is to do the parsing on the client side, but this would explode the amount of JS the client needs to download. It also feels a bit desperate.
So my questions are:
1) Are there any parser libraries able to handle recursive grammars usable from Go?
2) I am completely unfamiliar with Ragel, and as it seems quite a complicated tool I want to get this straight before investing time in learning it: is there any way to handle some recursion (say, up to a certain level) in Ragel if the grammar is simple enough?
I'm working on writing a parser for a specific XML-based document format, which has a lot of rules and a complicated interface.
I was going to write the parser in Ruby and have it output JSON. Then I realized that a lot of other people, who use different languages, would like to use it too. So I'm thinking of somehow creating a central rule system that each language can wrap to create its own parser.
Any idea how to go about it?
It's unlikely to be productive for you to write your own XML parser from scratch.
As you anticipated, there has indeed been a need for parsing XML in every major language. You can likely find libraries that implement multiple parsing models in any language you need. Be aware of tree-based models such as DOM, stream-based models such as SAX, and pull-based models such as StAX. Also consider XML processing models above the parsing level: declarative transformations (e.g., XSLT) and databinding (e.g., JAXB).
The "central rule system" you envision has also already been realized in schemas (e.g., XSD, RelaxNG, Schematron, ...).
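As a rough illustration of how the tree-based, stream-based, and pull-based models differ in practice, here is the same small task done three ways with Python's standard library. The choice of Python, and of `xml.dom.minidom`, `xml.sax`, and `xml.dom.pulldom` as stand-ins for DOM, SAX, and a StAX-like pull API, is an assumption of this sketch; the sample document and function names are invented.

```python
import xml.sax
from xml.dom import minidom, pulldom

DOC = "<order><item>pen</item><item>ink</item></order>"


def text_of(element):
    # Concatenate the text children of a DOM element (assumes text-only content).
    return "".join(child.data for child in element.childNodes)


def items_tree(text):
    """Tree model: load the whole document, then query it."""
    doc = minidom.parseString(text)
    return [text_of(el) for el in doc.getElementsByTagName("item")]


class ItemHandler(xml.sax.ContentHandler):
    """Stream model: the parser pushes events into these callbacks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buf = None  # collects text while inside an <item>

    def startElement(self, name, attrs):
        if name == "item":
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buf))
            self._buf = None


def items_stream(text):
    handler = ItemHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.items


def items_pull(text):
    """Pull model: the application asks for the next event when it wants one."""
    items = []
    events = pulldom.parseString(text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            events.expandNode(node)  # materialize just this subtree
            items.append(text_of(node))
    return items
```

All three return the same list for `DOC`. The difference is who drives: the tree version holds the full document in memory, the stream version reacts to callbacks, and the pull version loops over events at its own pace.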
I googled, but I can't find a satisfactory answer. This SO question is related but kinda old as well as the exact opposite of what I am looking for: a way to do screen-scraping using XPath, not CSS selectors.
I've used enlive for some basic screen-scraping but sometimes one needs the power of XPath selectors. So here it is:
Is there any equivalent to Nokogiri or lxml for Clojure (Java)? What is the state of the "pure Java Nokogiri"? Is there any way to use the library from Clojure? Are there any better alternatives than this hack?
There are a couple of possibilities here.
Several of these require semi-well-formed XML to work. If you don't have it, I would pair clj-tagsoup with hiccup to produce the XML (parse with clj-tagsoup, which produces a form that hiccup can write out as XML) and work with that.
First, just use the native JDK capabilities. Assuming the document is well formed enough, try using clj-xpath which provides a wrapper around the native JDK parsing.
If that doesn't suffice, consider taking a more Clojure data structure based route. A simpler path could just use the output of TagSoup and a combination of maps, filters, and nths.
If you need something more advanced, consider using zippers to provide structure around the data, making it easier to manipulate. Use clojure.xml/parse and clojure.zip/xml-zip to produce the zipper, and go from there. An example can be found at http://techbehindtech.com/2010/06/25/parsing-xml-in-clojure/.
Using the native structures is my preferred route for anything complicated, as you can bring the full power of the language to bear.
If you provide a sample of why you need XPath, I can provide some sample code.
I have no idea how to build S-expressions.
I want to do it because I need to build an AST for my language.
At the beginning I used RubyParser to parse the source into S-expressions and then did code generation.
But then my language has to be a subset of Ruby, I think. I can't define the language I want.
Now I need to implement a parser for my language.
So could anyone recommend a Ruby tool that builds an AST as S-expressions?
Thanks!
It is not very clear from your question what exactly you need, but a simple Google search gives some interesting links to check. If these links do not answer your question, maybe you can edit the question to make it more precise and concrete.
http://thingsaaronmade.com/blog/writing-an-s-expression-parser-in-ruby.html
https://github.com/aarongough/sexpistol
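In case it helps to see the general shape of such a parser, here is a minimal sketch. It is not taken from the linked article or from sexpistol, and it is written in Python rather than Ruby purely as an illustration; the names `Symbol`, `tokenize`, `read`, and `parse_sexp` are invented. It assumes strings without escape sequences and does not report unterminated strings.

```python
import re


class Symbol(str):
    """Marks a bare identifier so it can be told apart from a quoted string."""


def tokenize(src):
    # Parentheses, double-quoted strings (no escapes), or runs of other characters.
    return re.findall(r'[()]|"[^"]*"|[^\s()"]+', src)


def atom(tok):
    """Convert a single token into a string, number, or symbol."""
    if tok.startswith('"'):
        return tok[1:-1]
    for cast in (int, float):
        try:
            return cast(tok)
        except ValueError:
            pass
    return Symbol(tok)


def read(tokens, pos=0):
    """Recursive descent: return (node, position after the node)."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        items = []
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            node, pos = read(tokens, pos)  # nested lists recurse here
            items.append(node)
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        return items, pos + 1
    if tok == ")":
        raise ValueError("unexpected ')'")
    return atom(tok), pos + 1


def parse_sexp(src):
    tokens = tokenize(src)
    node, pos = read(tokens)
    if pos != len(tokens):
        raise ValueError("trailing input after first expression")
    return node
```

The nested lists that come out of `parse_sexp` are already a simple AST: the first element of each list can be read as the node type and the rest as its children.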
You might try the sxp-ruby gem at http://github.com/bendiken/sxp-ruby. I use it for SPARQL S-Expressions (SSE) and similar methods for managing Abstract Syntax Trees in Ruby.
Maybe you could have a look at this gem named Astrapi.
This is just an experiment:
1) Describe your language elements (concepts) in an "mm" file (abstract syntax).
2) Run astrapi on this file.
3) astrapi generates a parser that is able to fill up your AST from your input source, expressed as S-expressions (the concrete syntax of your concepts).
I have put some modest documentation here.