I'm writing a parser for a specific XML-based document format, which has a lot of rules and a complicated interface.
I was going to write the parser in Ruby and have it output JSON. Then I realized that a lot of other people, who use different languages, would like to use it too. So I'm thinking of somehow creating a central rule system that each language can wrap to create its own parser.
Any idea how to go about it?
It's unlikely to be productive for you to write your own XML parser from scratch.
As you anticipated, there has indeed been a need for parsing XML in every major language. You can likely find libraries that implement multiple parsing models in any language you need. Be aware of tree-based models such as DOM, stream-based models such as SAX, and pull-based models such as StAX. Also consider XML processing models above the parsing level: declarative transformations (e.g., XSLT) and data binding (e.g., JAXB).
The "central rule system" you envision has also already been realized in schema languages (e.g., XSD, RelaxNG, Schematron, ...).
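To make the difference between the tree and stream/pull models concrete, here is a minimal sketch using only Python's standard library. The sample document and the function names are made up for illustration, and `iterparse` is only roughly comparable to a StAX-style pull parser.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = "<library><book id='1'>Dune</book><book id='2'>Emma</book></library>"

def titles_tree(xml_text):
    """Tree model: build the whole document in memory, then query it."""
    root = ET.fromstring(xml_text)
    return [book.text for book in root.iter("book")]

def titles_pull(xml_text):
    """Stream/pull model: handle elements one at a time as they complete."""
    titles = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            titles.append(elem.text)
            elem.clear()  # drop the element's content once it has been handled
    return titles

print(titles_tree(SAMPLE))
print(titles_pull(SAMPLE))
```

The tree version is usually simpler to write; the incremental version tends to matter once documents no longer fit comfortably in memory.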
Related
Are there any standards for defining data transformations in a tool and format agnostic manner?
There are some obvious candidates like XPath transformations, but they're specific to XML. There are hundreds of ETL tools on the market, but they use proprietary syntaxes and often rely on low-code/no-code WYSIWYG formats.
Have there been any attempts to define an agnostic data transformation standard/format?
So, just as a fun project, I decided I'd write my own XML parser. No, not to parse a specific document, and no, not using an XML parser library. I mean writing code to parse out any XML document into a usable data structure. Just because I like the challenge. :-)
With that said, so far it's proved to be... interesting. It's not as easy to parse (especially when you start taking into account special characters, CDATA, empty tags, comments, etc.) as it initially looked.
Are there any well documented XML parsing algorithms or explanations anywhere that anyone knows of? It seems like there are well-documented Queue and Stack and BTree and etc. etc. etc. implementations everywhere, but I'm not sure I've ever seen a simple, well-documented XML parser algorithm...
I repeat: I am not looking for a pre-built parser library! I am looking for information on how to create my own pre-built parser library! Do not tell me "use expat" or "use SAX" or whatever. That's not what I'm asking for.
ANTLR offers a tutorial on parsing XML. It breaks the process down into phases: lexing, parsing, tree parsing, etc. Looks pretty interesting.
I don't know if it would be "cheating" in your book, but you could try parsing your XML with a ready-built all-purpose language parser like ANTLR. The result would be a list of tokens (if you just use the lexer) or a parse tree (if you include the parser) and you could then re-build the parse tree almost 1:1 into an XML structure.
Maybe. I haven't thought about the ways in which XML might be different from "normal" ANTLR fodder like programming languages, and whether you would be able to define a suitable grammar.
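For a feel of what the lexing and parsing phases look like when written by hand, here is a rough sketch: a regular expression splits the input into tokens, and a stack turns the tokens into a tree. This is an illustration only and not what ANTLR generates. It assumes well-behaved input, skips comments and processing instructions, and ignores DOCTYPE declarations, namespaces, character references, and attribute values that contain `>`.

```python
import re

# One alternative per token kind; comments, CDATA, and PIs come before tags.
TOKEN_RE = re.compile(
    r"<!--(?P<comment>.*?)-->"
    r"|<!\[CDATA\[(?P<cdata>.*?)\]\]>"
    r"|<\?(?P<pi>.*?)\?>"
    r"|</(?P<close>[\w:.-]+)\s*>"
    r"|<(?P<open>[\w:.-]+)(?P<attrs>[^<>]*?)(?P<empty>/)?>"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)
ATTR_RE = re.compile(r"([\w:.-]+)\s*=\s*(\"[^\"]*\"|'[^']*')")
# &amp; is decoded last so that "&amp;lt;" becomes "&lt;", not "<".
ENTITIES = (("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&apos;", "'"), ("&amp;", "&"))

def decode_entities(text):
    for name, char in ENTITIES:
        text = text.replace(name, char)
    return text

def parse(xml_text):
    """Return the top-level nodes as dicts; text children are plain strings."""
    root = {"tag": "#document", "attrs": {}, "children": []}
    stack = [root]  # the innermost open element is always stack[-1]
    pos = 0
    while pos < len(xml_text):
        m = TOKEN_RE.match(xml_text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group("comment") is not None or m.group("pi") is not None:
            continue  # ignored in this sketch
        if m.group("cdata") is not None:
            stack[-1]["children"].append(m.group("cdata"))  # literal text, no decoding
        elif m.group("text") is not None:
            if m.group("text").strip():
                stack[-1]["children"].append(decode_entities(m.group("text")))
        elif m.group("open") is not None:
            attrs = {k: decode_entities(v[1:-1])
                     for k, v in ATTR_RE.findall(m.group("attrs"))}
            node = {"tag": m.group("open"), "attrs": attrs, "children": []}
            stack[-1]["children"].append(node)
            if m.group("empty") is None:  # <tag/> has no matching close tag
                stack.append(node)
        else:  # closing tag
            if len(stack) == 1 or stack[-1]["tag"] != m.group("close"):
                raise ValueError(f"mismatched closing tag </{m.group('close')}>")
            stack.pop()
    if len(stack) != 1:
        raise ValueError(f"unclosed element <{stack[-1]['tag']}>")
    return root["children"]
```

Calling `parse('<a x="1"><b/>hi</a>')` yields a single node for `a` whose children are the empty `b` element and the string `"hi"`. A real parser would also have to follow the well-formedness rules of the XML specification, which is where most of the effort goes.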
VTD-XML is probably the simplest parsing technique possible...
http://expat.sourceforge.net/
Expat is an XML parser library written in C. It is a stream-oriented parser in which an application registers handlers for things the parser might find in the XML document (like start tags). An introductory article on using Expat is available on xml.com.
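The handler-registration style described here can be tried without writing C, because Python's standard library ships an Expat binding (`xml.parsers.expat`). The sketch below is only meant to show the shape of the model; the `collect_events` helper is a made-up name.

```python
import xml.parsers.expat

def collect_events(xml_text):
    """Register handlers, feed the document, and record what the parser reports."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data as one chunk
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return events

for event in collect_events('<a id="1"><b>hi</b></a>'):
    print(event)
```

The application never sees a tree here; it only reacts to callbacks, which is what makes the stream-oriented model cheap on memory.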
I have one large project with components in multiple languages that each depend on some of the same enum values. What solutions have you come up with to unify enums across multiple arbitrary languages? I can think of a few, but I'm looking for the best solution.
(In my implementation, I'm using PHP, Java, JavaScript, and SQL.)
You can put all of the enums in a text file, then use a code generator to write out the appropriate syntax for each language from that common file so that each component has the enums. Make that text file the authoritative source of information.
You can express the text file in XML but I'd think a tab-delimited flat file would work just fine.
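A minimal sketch of such a generator, written in Python, might look like the following. The three-column layout (enum name, member, value), the sample data, and the emitted code shapes are all assumptions chosen for illustration; a real project would pick its own conventions and write the results to files.

```python
# Hypothetical tab-delimited source: enum name, member name, integer value.
SOURCE = "Color\tRED\t1\nColor\tGREEN\t2\nStatus\tACTIVE\t10\n"

def load_enums(text):
    """Parse the authoritative file into {enum_name: [(member, value), ...]}."""
    enums = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        enum_name, member, value = line.split("\t")
        enums.setdefault(enum_name, []).append((member, int(value)))
    return enums

def to_java(name, members):
    body = ", ".join(f"{m}({v})" for m, v in members)
    return (f"public enum {name} {{ {body}; "
            f"public final int value; {name}(int v) {{ value = v; }} }}")

def to_javascript(name, members):
    body = ", ".join(f"{m}: {v}" for m, v in members)
    return f"const {name} = Object.freeze({{ {body} }});"

def to_php(name, members):
    body = " ".join(f"const {m} = {v};" for m, v in members)
    return f"final class {name} {{ {body} }}"

for enum_name, members in load_enums(SOURCE).items():
    print(to_java(enum_name, members))
    print(to_javascript(enum_name, members))
    print(to_php(enum_name, members))
```

Adding another target language then means adding one more small emitter function, while the text file stays the single source of truth.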
Make them in a format that every language can understand or has a library for. I am using JSON for this at the moment.
Then you can include it in two ways:
For development: Load it from a file/URL at runtime
good for small changes you want to see immediately
slow
For productive usage: Include it in the files
using a build script
fast
no instant feedback
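As a rough illustration of the runtime-loading variant, here is a Python sketch that turns a JSON definition into enum classes. The JSON layout and the names are assumptions; in practice the text would come from the shared file or URL rather than from an inline string.

```python
import json
from enum import IntEnum

# Hypothetical shared definition; normally read from a file or URL.
ENUMS_JSON = '{"Color": {"RED": 1, "GREEN": 2}, "Status": {"ACTIVE": 10}}'

def load_enums(json_text):
    """Build one IntEnum class per entry in the shared JSON definition."""
    spec = json.loads(json_text)
    return {name: IntEnum(name, members) for name, members in spec.items()}

enums = load_enums(ENUMS_JSON)
Color = enums["Color"]
print(Color.GREEN, int(Color.GREEN), Color(1).name)
```

For the build-script variant, the same JSON would instead be fed to a generator that writes source files ahead of time.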
I would apply the DRY principle and use a code generator. That way you can easily add a new language, even one that has no native enum type.
I'm working on a project which will do some complicated analyzing on some user-supplied input. There will be 3 parts of the code:
1) Input supplied by user, such as keywords
2) Rules, such as if keyword 1 is repeated 3 times in keyword 5, do this, etc.
3) And the analyzing itself which executes the rules and processes the user input, and generates the output necessary based on the processing.
Naturally this will lead to a lot of spaghetti code and many, many if statements in the processing code. I want to avoid that, and keep the rules (i.e. the if statements) separately from the code which loops through the user input and generates the output.
How can I do that, i.e. what is the best way?
If you have enough rules that you want to externalize them, you could try using a business rules engine, like Drools in Java.
A business rules engine is a software system that executes one or more business rules in a runtime production environment. The rules might come from legal regulation ("An employee can be fired for any reason or no reason but not for an illegal reason"), company policy ("All customers that spend more than $100 at one time will receive a 10% discount"), or other sources. (Wikipedia)
It could be a bit of overhead, depending on what you're trying to do. In my company we use this kind of tool for our quality analysis tool.
Store it in XML. Easy to parse and update.
I once designed a code generator that was controlled from an XML file.
For each command I had an entry in the XML. I processed the node to generate the opcode for that command. The node itself contained the actions I needed to perform to get the opcode. For some commands I had to look things up in a database; all of that was described in the XML file as well.
Well, I doubt that it is necessary to have huge if statements if polymorphism is applied correctly.
Actually, you need a proper domain model for your rules. This goes somewhat in the direction of the command pattern, depending on the complexity of your code, maybe in combination with the state machine pattern.
Once you have your model, defining rules is just a matter of instantiating them correctly.
This could be done by having an xml definition, which is parsed and transformed into your model. But the new modern and even more fancy way would be using DSLs. If you program in Java and have a certain freedom about your libraries, this would be a proper use case for Embedded DSLs with Groovy. Basically you would need a Builder which constructs your model, that's all.
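As a minimal sketch of what such a rule model could look like (in Python rather than Groovy, and with made-up rule names and thresholds), each rule is an object with a condition and an action, and the processing loop knows nothing about individual rules:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Rule:
    """One rule: a condition on the keywords and an action producing output."""
    name: str
    condition: Callable[[List[str]], bool]
    action: Callable[[List[str]], str]

def occurs_at_least(keyword, times):
    """Small builder for a common kind of condition (hypothetical helper)."""
    return lambda keywords: keywords.count(keyword) >= times

# The rules live in one place, separate from the processing loop below.
RULES = [
    Rule("repeated-sale", occurs_at_least("sale", 3),
         lambda kws: "flag: 'sale' repeated"),
    Rule("too-many", lambda kws: len(kws) > 5,
         lambda kws: f"flag: {len(kws)} keywords"),
]

def analyze(keywords, rules=RULES):
    """Run every rule against the input and collect the resulting output."""
    return [rule.action(keywords) for rule in rules if rule.condition(keywords)]

print(analyze(["sale", "sale", "sale"]))
```

The `RULES` list plays the role that an XML definition or an embedded DSL would play in a larger system: it is the only part that changes when rules are added or removed.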
You can always implement a factory that creates the appropriate strategy according to the parameters passed in. You then use those strategies in your code without any if statements.
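A small sketch of that idea in Python, assuming strategies are plain functions registered under hypothetical names:

```python
# Registry mapping a parameter value to the strategy that handles it.
# The strategy names and behaviours are invented for illustration.
STRATEGIES = {
    "upper": str.upper,
    "reverse": lambda s: s[::-1],
    "length": lambda s: str(len(s)),
}

def make_strategy(kind):
    """Factory: look up the strategy for the given parameter."""
    try:
        return STRATEGIES[kind]
    except KeyError:
        raise ValueError(f"unknown strategy: {kind!r}") from None

def process(inputs, kind):
    """Processing code: applies whatever strategy it was given, with no branching."""
    strategy = make_strategy(kind)
    return [strategy(item) for item in inputs]

print(process(["abc", "de"], "reverse"))
```

The only lookup happens once, inside the factory; supporting a new case means registering another entry rather than adding another branch.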
If it's just detecting keywords, use a finite state machine or similar. If it's doing more, look at other pattern matching systems, such as rules engines.
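For the keyword-detection case, a finite state machine can be sketched as follows. The state is the number of keyword characters matched so far, and a transition table says where to go on each input character. The table is built by brute force here for clarity; this is an illustration, not a tuned implementation.

```python
def build_dfa(keyword):
    """Transition table: table[state][ch] is the next state.

    State k means "the last k characters read equal the first k of keyword".
    """
    if not keyword:
        raise ValueError("keyword must not be empty")
    alphabet = set(keyword)
    table = []
    for state in range(len(keyword) + 1):
        row = {}
        for ch in alphabet:
            seen = keyword[:state] + ch
            # Longest prefix of keyword that is also a suffix of what was seen.
            k = min(len(keyword), len(seen))
            while k > 0 and not seen.endswith(keyword[:k]):
                k -= 1
            row[ch] = k
        table.append(row)
    return table

def count_matches(text, keyword):
    """Feed text through the machine; count arrivals at the accepting state."""
    table = build_dfa(keyword)
    state = 0
    hits = 0
    for ch in text:
        state = table[state].get(ch, 0)  # characters outside the keyword reset
        if state == len(keyword):
            hits += 1
    return hits

print(count_matches("big sale, final sale", "sale"))
```

Each input character is examined exactly once, and overlapping occurrences are counted, so `count_matches("aaaa", "aa")` reports three matches.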
Adding an embedded scripting language to your application might help. The rules would then be expressed in scripts, which the application executes during processing.
The idea is that scripts are easy to change and contain the high-level logic, while your application takes care of the details.
There are a lot of scripting languages available for this: Lua, Python, Falcon, Squirrel, AngelScript, etc.
Have a look at rule engines!
The approach from Lars may also be arguable.
I am currently using a CMS which uses an ORM with its own bespoke query language (i.e. with select/where/orderby like statements). I refer to this mini-language as a DSL, but I might have the terminology wrong.
We are writing controls for this CMS, but I would prefer not to couple the controls to the CMS, because we have some doubts about whether we want to continue with this CMS in the longer term.
We can decouple our controls from the CMS fairly easily, by using our own DAL/abstraction layer or what not.
Then I remembered that on most of the CMS controls, they provide a property (which is design-time editable) where users can type in a query to control what gets populated in the data source. Nice feature - the question is how can I abstract this feature?
It then occurred to me that maybe a DSL framework existed out there that could provide me with a simple query language that could be turned into a LINQ expression at runtime. Thus decoupling me from the CMS' query DSL.
Does such a thing exist? Am I wasting my time? (probably the latter)
Thanks
This isn't going to answer your question fully, but there is an extension for LINQ called Dynamic LINQ that allows you to specify predicates for LINQ queries as strings. So if you want to store the conditions in some string-based format, you could probably build your language on top of this. You'd still need to find a way to represent the different clauses (where/orderby/etc.), but for the predicates passed as arguments to these, you could use Dynamic LINQ.
Note that Dynamic LINQ allows you to parse the string, but AFAIK doesn't have any way to turn an existing Expression tree back into that string... so there would be some work needed to do that.
(But I'm not sure if I fully understand the question, so maybe I'm totally off. :-))