I am wondering whether there exists any declarative language for arbitrarily describing the format and semantics of a data structure, that can be compiled to a specific implementation of that structure in any of a set of target languages. That is, something like a generic data definition language but geared toward describing arbitrary data structures such as vectors, lists, trees, etc., and the semantics of operations on those structures. I ask because I had an idea for a feasible implementation of this concept, and I'm just wondering whether it's worth it, and, consequently, whether it's been done before.
Another, slightly more abstract question: is there any real difference between the normative specification of a data structure (what it does) and its implementation (how it does it)? More specifically, should separate implementations of the same requirements be considered different structures?
If you felt like it, a combination of XML with XSLT could describe a data structure and provide a matching definition in essentially any language of your choice. I've never tried to prove it formally, but my first guess would be that S-expressions are a superset of XML (modulo syntactic differences).
At least in theory, yes, there are (or at least can be) differences between a description of what a data structure does and how it does it. For an obvious example, you could describe a generic mapping from keys to values, which could use an implementation based on hash tables, skip lists, binary search trees, etc. It's mostly a question of describing it at a high enough level of abstraction to allow differences in the implementation. If you include too many requirements (complexity, ordering, etc.), you can pretty quickly rule out many implementations.
You may be interested in messaging specification / data serialization languages such as Google's Protocol Buffers as well as ASN.1. It's a slightly different slant than you're looking for, but in the same vein.
Both are ways of declaring generic messages for communication. Protocol Buffers message specs "compile" to different languages, but the central protocol stays consistent. ASN.1 likewise has multiple compilation utilities, as well as different encoding rules that remain logically consistent while varying in their literal representation; compare XER and PER with BER, for example.
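For a flavour of what such a message spec looks like, here is a minimal Protocol Buffers sketch (the message and field names are purely illustrative); protoc compiles this single definition to matching classes in Java, C++, Python and other languages while the wire format stays the same:

syntax = "proto3";

// Illustrative message definition; the generated code differs per
// target language, but the serialized form is consistent.
message Person {
  string name = 1;
  int32 id = 2;
  repeated string email = 3;
}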
I'd love a specification language that would just focus on simple packed binary layout against a logical memory structure. It may be that plain C structs are the simplest common way of expressing this. I had hoped ASN.1 had some way of getting to that, but after looking at it for a bit, ASN.1 PER is close, but not quite it.
Edit: Apache Thrift and Cap'n Proto may also be interesting.
There are approaches to that sort of thing in dynamic logics, which attempt to capture the semantics of programs. However, meaning in a dynamic logic is given in terms of preconditions and postconditions, and is agnostic with regard to the actual implementation of the structure.
These data structures are inherently tied to implementation, as the only difference between a linked list and an array is specifically how it is laid out in memory.
For this, there already is a generic data definition language: any high-level programming language (C, C++, Java) that specifies it. Any of them is as generic as the others, since in this context any of them could be compiled to the others.
Cozy is “a tool that synthesizes data structure implementations from very high-level specifications” and seems to be essentially the language I was actually looking for (or considering writing) when I asked this question.
It can automatically generate an implementation (in Java or C++, at the time of this writing) from a data structure specification written in its own specification language. A specification describes the abstract state, update operations, and query operations of a data structure, as well as invariants that must be maintained and assumptions that the solver can use to optimise the implementation. For example, here is a partial specification for a graph data structure:
Graph:
    handletype Node = { id : Int }
    handletype Edge = { src : Int, dst : Int }
    state nodes : Bag<Node>
    state edges : Bag<Edge>
    // Invariant: disallow self-edges.
    invariant (sum [ 1 | e <- edges, e.val.src == e.val.dst ]) == 0;
    op addNode(n : Node)
        nodes.add(n);
    op addEdge(e : Edge)
        assume e.val.src != e.val.dst;
        edges.add(e);
    query out_degree(nodeId : Int)
        sum [ 1 | e <- edges, e.val.src == nodeId ]
    // …
Its implementation is described in Fast Synthesis of Fast Collections and Generalized Data Structure Synthesis by Calvin Loncaric, Emina Torlak, and Michael D. Ernst.
Related
I never took a formal GA course, so this question might be vague: I'm trying to see whether I'm approaching this problem well.
Usually a genome is represented as a sequence of homogeneous elements, such as binary numbers, logic gates, elementary functions, etc., which can then be assembled into a homogeneous structure like a syntax-tree for a computer program or a 3D object or whatever.
My problem involves evolving a graph of components, let's say X, Y and Z: the graph can have N nodes, and each node is an instance of either X, Y or Z. Encoding such a graph structure in a genome is rather straightforward; however, I also need to attach additional information describing what X, Y and Z themselves do, which is actually the main object of the GA.
So it seems like my genome should code for a heterogeneous entity: an entity which is composed both of a structure graph and a functionality specification. It is not impossible to subsume the elements (genes) which code for the structure and those that code for functionality under a single parent "gene", and then simply separate them when the entity is being assembled, but this doesn't feel like the right approach.
Is this a common problem in GA? Am I supposed to find a "lower-level" representation / genome encoding in this situation? What are the relevant considerations?
Yes, you can do that with a GA, but strictly speaking you will be using Genetic Programming (GP) as opposed to Genetic Algorithms. GP is considered a special case of GA where the genome representation is heterogeneous. This means your individual is a "computer program" instead of just "raw data", and you can get really creative about what this "computer program" means and how to represent and handle it.
Regarding the additional information, it should be fine as long as all your genetic operators take this representation into account. For instance, your crossover could exchange half of the tree and half of the additional information from each parent; if for some reason the additional information cannot be divided, the crossover may instead clone it from one of the parents.
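As a purely illustrative sketch (the types and field names below are made up, not from any GA/GP framework), the genome can simply pair a structural part with a functionality part, and each genetic operator handles both halves:

-- Hypothetical heterogeneous genome: a structure part plus a functionality part.
data NodeKind = X | Y | Z deriving (Show, Eq)

data Genome = Genome
  { structureGenes :: [(Int, Int, NodeKind)]  -- edge list of the component graph (illustrative encoding)
  , functionGenes  :: [Double]                -- parameters describing what X, Y and Z do (illustrative)
  } deriving Show

-- One-point crossover applied to each half of the genome separately.
crossover :: Int -> Int -> Genome -> Genome -> Genome
crossover cutS cutF a b = Genome
  { structureGenes = take cutS (structureGenes a) ++ drop cutS (structureGenes b)
  , functionGenes  = take cutF (functionGenes a)  ++ drop cutF (functionGenes b)
  }

If the functionality part cannot be split meaningfully, the second half of the crossover can simply copy functionGenes from one parent instead.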
The main disadvantage of this highly tuned approach is that you probably can't use high level GA/GP frameworks out there (I'm just assuming, I don't know much about them).
I have created a big ontology (.owl) and I'm now at the reasoning step. The problem is how to ensure scalable reasoning over my ontology. I have searched the literature and found that Big Data techniques could be an adequate solution. Unfortunately, I found that MapReduce cannot accept an OWL file as input, and that semantic languages such as SWRL and SPARQL cannot be used directly.
My questions are:
Should I convert the OWL file to another format?
How can I transform rules (SWRL, for example) into a format acceptable to MapReduce?
Thanks
"Big data can be an adequate solution to that" is too simple a statement for this problem.
Ensuring scalability of OWL ontologies is a very complex issue. The main variables involved are the number of axioms and the expressivity of the ontology; however, these are not always the most important characteristics. A lot depends also on the API used and, for APIs where the reasoning step is separate from parsing, on which reasoner is being used.
SWRL rules add another level of complexity, as they can be of (almost) arbitrary complexity, so it is not possible to guarantee scalability in general. For specific ontologies and specific sets of rules, better estimates are possible.
A translation to a MapReduce format might help, but there is no standard transformation as far as I'm aware, and it would be quite complex to guarantee that the transformation preserves the semantics of the ontology and of the rule entailments. So the task would amount to rewriting the data in a way that allows you to answer the queries you need to run, and this might prove impossible, depending on the specific ontology.
On the other hand, what is the size of this ontology and the amount of memory you allocated to the task?
I hear many people refer to trees as a data structure. But trees are mostly implemented using linked lists or arrays. So does that make them an abstract data type?
Given a type of structure, how can we determine whether it is a data structure or an abstract data type?
If you are talking about a general tree without specifying its implementation or any underlying data structure, it is itself an abstract data type (ADT). An ADT is any data type that doesn't specify its implementation.
But once you start talking about a concrete tree with a specific implementation using linked lists or arrays, then that kind of concrete tree is a data structure.
With the above out of the way, the following may help clear up other confusions related to your question. Correct me if I'm wrong!
Data Type
The definition of data type from Wikipedia:
A data type or simply type is a classification identifying one of various types of data.
A data type is only a classification of data. It says nothing about how that data is implemented. IMHO, a data type is a purely theoretical concept.
For example, any real number can be of the data type real. But along with integers, they can both be classified as a numeric data type, say number.
As I just pointed out, an ADT is one kind of data type. But can string and int be considered ADTs?
The answer is both yes and no.
Yes, because programming languages can implement string and int in many ways; but only on the condition that, across all programming languages, these data types share consistent properties.
No, because these primitive data types are not as abstract as stacks or queues. Since these data types seldom share consistent properties across programming languages, their users must know about underlying issues like arithmetic overflow. Two languages may both have an int data type, but in one it ranges up to infinity and in the other only up to about 2^32. Having to know this kind of technical detail is not what ADTs promise. Look at stacks instead: in every programming language, a stack promises consistent operations like push and pop, and you don't need to know any implementation-level details; you just use it the same way everywhere.
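To make the distinction concrete, here is a small Haskell sketch (the names are chosen for illustration only): the type class plays the role of the ADT, and each instance is a concrete data structure implementing it:

-- The ADT: anything supporting these operations with the usual stack behaviour.
class Stack s where
  empty :: s a
  push  :: a -> s a -> s a
  pop   :: s a -> Maybe (a, s a)

-- One concrete data structure implementing the ADT, backed by a plain list.
newtype ListStack a = ListStack [a]

instance Stack ListStack where
  empty                    = ListStack []
  push x (ListStack xs)    = ListStack (x : xs)
  pop (ListStack [])       = Nothing
  pop (ListStack (x : xs)) = Just (x, ListStack xs)

An array-backed instance would implement the same interface differently; code written against the Stack class cannot tell the two apart.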
Data Structure
Let's see the definition of data structure from Wiki:
A data structure is a particular way of organizing data in a computer so that it can be used efficiently.
As you can see, a data structure is all about implementation. It is not conceptual but concrete. In my opinion, every piece of data in a program can, by definition, be considered a data structure. A string can. An int can. And a whole bunch of other things, like a LinkedList_Stack or an Array_Stack, are all data structures.
Some of you might argue: why is int a data structure? It is a data structure at a lower level, from a programming language author's point of view, because languages can store an int in a computer in many ways: the most common is two's complement; alternatives are offset binary, ones' complement, etc. However, from a user's point of view, we see int as a primitive data type that the language offers out of the box, and we don't care about its implementation; it's just a building block of the language. So for us users, any data constructed from these building blocks (primitive data types) is more like a data structure, while for the authors of programming languages the building blocks are lower-level machine representations, so for them int is definitely a data structure.
Put simply, whether one thing is a data structure or not really depends on how we look at it.
Via google:
In computer science, an abstract data type (ADT) is a mathematical model for a certain class of data structures that have similar behavior.
So clearly, it is both.
From my limited knowledge of Haskell, it seems that Maps (from Data.Map) are supposed to be used much like a dictionary or hashtable in other languages, and yet are implemented as self-balancing binary search trees.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1) and requires that the elements be in Ord. Certainly there is a good reason, so what are the advantages of using a binary tree?
Also:
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Hash tables can't be implemented efficiently without mutable state, because they're based on array lookup. The key is hashed and the hash determines the index into an array of buckets. Without mutable state, inserting elements into the hashtable becomes O(n) because the entire array must be copied (alternative non-copying implementations, like DiffArray, introduce a significant performance penalty). Binary-tree implementations can share most of their structure so only a couple pointers need to be copied on inserts.
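A minimal, unbalanced sketch of that structural sharing (Data.Map uses a balanced variant of the same idea): inserting rebuilds only the nodes along the search path, and every untouched subtree is shared between the old and the new version:

data Tree k v = Leaf | Node (Tree k v) k v (Tree k v)

insert :: Ord k => k -> v -> Tree k v -> Tree k v
insert k v Leaf = Node Leaf k v Leaf
insert k v (Node l k' v' r)
  | k < k'    = Node (insert k v l) k' v' r   -- r is shared with the old tree, not copied
  | k > k'    = Node l k' v' (insert k v r)   -- l is shared with the old tree, not copied
  | otherwise = Node l k v r                  -- replace the value at an existing key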
Haskell certainly can support traditional hash tables, provided that the updates are in a suitable monad. The hashtables package is probably the most widely used implementation.
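A small usage sketch of that package (based on its documented IO interface; double-check the exact API against the version you install):

import qualified Data.HashTable.IO as H

main :: IO ()
main = do
  ht <- H.new :: IO (H.BasicHashTable String Int)  -- a mutable hash table living in IO
  H.insert ht "answer" 42
  H.lookup ht "answer" >>= print                   -- prints Just 42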
One advantage of binary trees and other non-mutating structures is that they're persistent: it's possible to keep older copies of data around with no extra book-keeping. This might be useful in some sort of transaction algorithm for example. They're also automatically thread-safe (although updates won't be visible in other threads).
Traditional hashtables rely on memory mutation in their implementation. Mutable memory and referential transparency are at odds, so hashtable implementations are relegated to the IO or ST monads. Trees can be implemented persistently and efficiently by leaving old leaves in memory and returning new root nodes which point to the updated trees. This lets us have pure Maps.
The quintessential reference is Chris Okasaki's Purely Functional Data Structures.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1)
Lookup is only one of the operations; insertion/modification may be more important in many cases; there are also memory considerations. The main reason the tree representation was chosen is probably that it is more suited for a pure functional language. As "Real World Haskell" puts it:
Maps give us the same capabilities as hash tables do in other languages. Internally, a map is implemented as a balanced binary tree. Compared to a hash table, this is a much more efficient representation in a language with immutable data. This is the most visible example of how deeply pure functional programming affects how we write code: we choose data structures and algorithms that we can express cleanly and that perform efficiently, but our choices for specific tasks are often different from their counterparts in imperative languages.
This:
and requires that the elements be in Ord.
does not seem like a big disadvantage. After all, with a hash map you need keys to be Hashable, which seems to be more restrictive.
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Unfortunately, I cannot provide an extensive comparative analysis, but there is a hash map package, and you can check out its implementation details and performance figures in this blog post and decide for yourself.
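Assuming the package referred to is unordered-containers (an assumption on my part; it provides a persistent hash map based on hash array mapped tries), basic usage looks like this:

import qualified Data.HashMap.Strict as HM

main :: IO ()
main = do
  let m  = HM.fromList [("apples", 3 :: Int), ("pears", 5)]
      m' = HM.insert "plums" 7 m       -- persistent: m itself is left untouched
  print (HM.lookup "plums" m')         -- Just 7
  print (HM.lookup "plums" m)          -- Nothing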
My answer to what the advantage of using binary trees is would be: range queries. They require, semantically, a total preorder, and profit algorithmically from a balanced search tree organization. For simple lookup, I'm afraid there may only be good Haskell-specific answers, but not good answers per se: lookup (and indeed hashing) requires only a setoid (equality/equivalence on its key type), which would support efficient hashing on pointers (which, for good reasons, are not ordered in Haskell). Like various forms of tries (e.g. ternary tries for elementwise update, others for bulk updates), hashing into arrays (open or closed) is typically considerably more efficient than elementwise searching in binary trees, both space- and time-wise. Hashing and tries can be defined generically, though that has to be done by hand; GHC doesn't derive it (yet?). Data structures such as Data.Map tend to be fine for prototyping and for code outside of hotspots, but where they are hot they easily become a performance bottleneck. Luckily, Haskell programmers need not be concerned about performance, only their managers. (For some reason I presently can't find a way to access the key redeeming feature of search trees amongst the 80+ Data.Map functions: a range query interface. Am I looking in the wrong place?)
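For what it's worth, here is one way to express such a range query with functions from recent versions of containers (a sketch; takeWhileAntitone and dropWhileAntitone appeared around containers 0.5.8, and older versions can use split instead):

import qualified Data.Map.Strict as M

-- All entries whose keys fall in the inclusive range [lo, hi].
range :: Ord k => k -> k -> M.Map k v -> M.Map k v
range lo hi = M.takeWhileAntitone (<= hi) . M.dropWhileAntitone (< lo)

main :: IO ()
main = print (range 3 7 (M.fromList [(k, k * k) | k <- [1 .. 10 :: Int]]))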
I am interested in persisting individual directed graphs. This question is not asking for a full-scale graph database solution, but for a document format that I can use to save an individual, arbitrary directed graph. I don't know what notation and file format would be the smartest choice.
My primary concerns are:
Expressiveness/Flexibility - I want the ability to express graphs of different types. While the standard use case would be a simple directed graph, it should be possible to express trees, cyclical graphs, multi-graphs. As a bare minimum, I would expect support for labeling and weighting of edges and nodes. Notations for describing higraphs and edge composition/hyper-edges would also be highly desirable, although I am aware that such solutions may not exist.
Type System-Independence - I am interested in representing the structural qualities of graphs. Some solutions include an extensible type system for typed edges and nodes (e.g. RDF/OWL). I would only be interested in such a representation, if there were a clearly defined canonical decomposition of typed elements into primitives (nodes/edges/attributes). What I am trying to avoid here is the ability for multiple representations of equivalent graphs, where the equivalence is not discernible.
Canonical Representation - There should be a mechanism that allows the graph to be represented canonically (in such a way that lexical equivalence of canonical-representations could be used to determine equivalence).
Presentation Independent - I would prefer a notation that is not dependent upon the presentation of the graph. This would include spatial orientation, colors, font, etc. I am only interested in representing the data. One of the features I don't like about DOT language, DGML or SVG (at least for this particular purpose) is the focus on visual representation.
Standardized / Open / Compatible - The less implementation work that I have to do, the better. If the format is standardized and reliable tools already exist for working with the format, then it is more preferable. Accompanying this requirement is another, that the format should be highly-compatible. The proprietary nature of Microsoft's DGML is a reason for my aversion, despite the Visual Studio tooling and the fact that I work primarily with .NET (now). The fact that W3C publishes RDF standards is a motivation for considering a limited subset of RDF as a representational tool. I also appreciate GXL and GraphML, because they have well documented xml schemas, thereby facilitating the ability to integrate their data with any xml-compatible software package.
Simplicity / Readability - I appreciate human-readable syntax and ease of interpretation. I also appreciate representation that simplifies parsing. For this reason, I like GML, but I am concerned it is not mainstream enough to be a realistic choice. I would also consider JSON or YAML for readability, if they were not so limited in their respective abilities to represent complex (non-DAG) structures.
Efficiency / Concise Representation - It's worth considering that whatever format I end up choosing will inevitably have to be persisted and transferred over some network. Therefore, file size is a relevant consideration.
Overview
I recognize that I will most likely be unable to find a solution that satisfies every criteria on my wishlist. I am simply asking for the file format that is closest to what I want and that doesn't limit extensibility for unsupported use cases.
ObWindyPreamble: in the RDF world there are a gazillion different surface syntax formats to choose from. RDF itself is an abstract metamodel for data, not directly a "graph syntax". You can of course directly represent a graph in RDF (since RDF models are graphs), but given that you want to represent different kinds of graphs you may end up having to abstract away and actually create an RDF vocabulary for representing different types of graphs.
All in all, I'm not convinced that RDF is the best way to go for you, but if you'd choose one, I'd say that RDF's Turtle syntax is something worth looking into. It certainly ticks the readability and simplicity boxes, as well as being a standard (well, almost... W3C is working on standardizing it) and having wide (open-source) tool support.
RDF models roughly follow set semantics, which means that a canonical syntax representation cannot really be enforced: two files can list information in a different order without it affecting the actual model, or can even contain duplicate information. However, if you enforce a simple sorting algorithm when producing files (something most RDF parsers/writers support), you should be able to get away with doing line-based comparisons and determining graph equivalence based on surface syntax.
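A sketch of that line-based comparison (assuming both files follow the same serialization conventions and contain no blank nodes, which would require proper canonicalization):

import qualified Data.Set as Set

-- Treat a line-based RDF document as a set of triple lines: order and duplicates are ignored.
tripleSet :: String -> Set.Set String
tripleSet = Set.fromList . filter (not . null) . lines

sameGraph :: String -> String -> Bool
sameGraph a b = tripleSet a == tripleSet b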
Just as a simple example, let's assume we have a very simple, directed, labeled graph:
A ---r1---> B ---r2---> C
You could represent this directly in RDF, as follows (using Turtle syntax):
@prefix : <http://example.org/> .
:A :r1 :B .
:B :r2 :C .
In a more abstract modeling, you could do something like this:
@prefix g: <http://example.org/graph-model/> .
@prefix : <http://example.org/> .
:A a g:Vertex .
:B a g:Vertex .
:C a g:Vertex .
:r1 a g:DirectedEdge ;
    g:from :A ;
    g:to :B .
:r2 a g:DirectedEdge ;
    g:from :B ;
    g:to :C .
The above is just a simplistic example of course, but hopefully it illustrates that this potentially meets quite a few of the things on your wish list.
By the way, if you want something even simpler, N-Triples is also an RDF syntax; it is line-based and therefore easy to process in a streaming fashion. It's slightly more verbose than Turtle, but it may make file comparison easier.
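For comparison, the same A/B/C graph from the Turtle example above, written as N-Triples, would look like this:

<http://example.org/A> <http://example.org/r1> <http://example.org/B> .
<http://example.org/B> <http://example.org/r2> <http://example.org/C> .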
My thoughts:
What I'm missing is your particular practical purpose/domain.
You mention the generic JSON format next to specific formats (e.g. GraphML, which is an application of XML). So I'm left wondering whether or not you are considering creating your own format.
Wouldn't having a 'canonical representation that can be used to determine equivalence' solve the graph isomorphism problem?
GraphML seems to cover a lot of your theoretical requirements, so I'd suggest you create a JSON version of it (a possible shape is sketched after this list). This would then also cover requirement 6.
Then, you could create a converter between the JSON format and GraphML (and possibly other formats).
For your requirement 7 it again all depends on the practical graph sizes. I mean, nowadays sending up to a few MB to a friggin mobile device is not considered much. A graph of a few MB in (about) any format you mention, is already a relatively large beast with tens of thousands of nodes & edges.
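To illustrate the JSON suggestion above, one possible shape (hypothetical, not a standard) mirroring GraphML's node/edge structure could be:

{
  "directed": true,
  "nodes": [ { "id": "A" }, { "id": "B" }, { "id": "C" } ],
  "edges": [
    { "source": "A", "target": "B", "label": "r1", "weight": 1.0 },
    { "source": "B", "target": "C", "label": "r2", "weight": 2.0 }
  ]
}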
What about the Trivial Graph Format?
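It is about as bare-bones as it gets: node lines (id and label), a single # separator, then edge lines (from-id, to-id, label). The A/B/C example graph from earlier would look like this:

1 A
2 B
3 C
#
1 2 r1
2 3 r2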