apply graph analysis on networked data represented with RDF - algorithm

I want to run some analysis on networked data that has multiple modes (i.e. multiple types of network nodes) and multiplex relations (i.e. multiple types of network edges).
The analysis will probably involve SNA or algorithms from graph theory, e.g. tie strength, centrality, betweenness, node distance, blocks, clusters, etc.
The source data is rather unstructured, so I first need to think about how to represent, store, and retrieve it.
Following are some ideas. I would appreciate any feedback or further suggestions. :)
I know that there are already some great NoSQL databases for this kind of application, for example Neo4j and InfoGrid. But for extensibility reasons (e.g. licensing, web standards...) I would prefer to use RDF to store and represent my data. The tools would be Sesame or Jena.
The idea of representing network/graph data with RDF is straightforward.
For example:
Network/Graph data
*Alice* ----lend 100USD----> *Bob* ----- likes ----> *Skiing*
Represented with RDF:
*Alice* --src--> *lend_relation* <---target--- *Bob* ---likes---> *Skiing*
                        |
                    has_value
                        v
                    *100USD*
[Alice src lend_relation]
[Bob target lend_relation]
[lend_relation has_value 100USD]
[Bob likes Skiing]
However, the problem is that RDF, and SPARQL as well, lacks the perspective of a graph model.
It is not efficient to traverse between nodes or find the (shortest) distance between them with RDF queries.
That must be done with extra analysis tools, for example JUNG or JGraphT,
and I must first construct a subgraph by querying the RDF store and then convert it into the data model used by JUNG or JGraphT. If I want separate visualization (from a toolkit other than JUNG or JGraphT), then I must construct yet another data model for the visualization toolkit.
I don't know whether that is a clean or efficient integration.
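To make that integration concrete, here is the kind of glue code I have in mind (a rough sketch assuming a recent Apache Jena and JGraphT; the file name and URIs are placeholders taken from the example above):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.riot.RDFDataMgr;
import org.jgrapht.Graph;
import org.jgrapht.alg.shortestpath.DijkstraShortestPath;
import org.jgrapht.graph.DefaultDirectedGraph;
import org.jgrapht.graph.DefaultEdge;

public class RdfToJGraphT {
    public static void main(String[] args) {
        // Load the RDF data into a Jena model ("network.ttl" is a placeholder file name).
        Model model = RDFDataMgr.loadModel("network.ttl");

        // Copy every resource-to-resource statement into a JGraphT directed graph,
        // using the subject/object URIs as vertices (literals such as "100USD" are skipped).
        Graph<String, DefaultEdge> graph = new DefaultDirectedGraph<>(DefaultEdge.class);
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement st = it.nextStatement();
            if (st.getSubject().isURIResource() && st.getObject().isURIResource()) {
                String s = st.getSubject().getURI();
                String o = st.getObject().asResource().getURI();
                graph.addVertex(s);
                graph.addVertex(o);
                graph.addEdge(s, o);
            }
        }

        // Once the subgraph is in JGraphT, the usual algorithms are available,
        // e.g. the shortest path between two nodes.
        DijkstraShortestPath<String, DefaultEdge> dsp = new DijkstraShortestPath<>(graph);
        System.out.println(dsp.getPath("http://example.org/Alice", "http://example.org/Skiing"));
    }
}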
Thanks again for any suggestions!

If you want to do network analysis of your RDF data with SPARQL, you can have a look at SPARQL 1.1 Property Paths. I believe this has already been implemented in Jena/ARQ (see ARQ - Property Paths).
Property Paths, from the new SPARQL spec, allow you to query the RDF data model with graph patterns that are a bit more complex than the ones you could define in SPARQL 1.0.
With this feature plus some logic at the application level you might be able to implement some interesting network analysis over your data.
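For instance, with Jena/ARQ a property path query can be run straight from Java. This is just a sketch (the data file, prefix and :likes property are made-up placeholders); ":likes+" means "one or more hops over :likes":

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class PropertyPathExample {
    public static void main(String[] args) {
        Model model = RDFDataMgr.loadModel("network.ttl");   // placeholder data file

        // ":likes+" is a SPARQL 1.1 property path: reachable via one or more :likes edges.
        String query =
            "PREFIX : <http://example.org/> " +
            "SELECT ?thing WHERE { :Alice :likes+ ?thing }";

        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().get("thing"));
            }
        }
    }
}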

Related

Do graph databases have problems with aggregation operations?

I came across multiple opinions that graph databases tend to have problems with aggregation operations. For example, if you have a set of users and want to get the maximum age, an RDBMS will outperform a graph database. Is this true, and if it is, what is the reason behind it? As far as I understand, the key difference between relational and graph databases is that each graph database node somehow includes references to the nodes it is connected to. How does that impact a "get max age"-like query?
Disclaimer: most of what I have read was about Neo4j, but I suppose if these limitations exist, they should apply to any graph db.
The use of graph databases like Neo4j is recommended when dealing with connected data and complex queries.
The book Learning Neo4j by Rik Van Bruggen states that you should not use graph databases when dealing with simple, aggregate-oriented queries:
From the book:
(...) simple queries, where write patterns and read patterns align to the aggregates that we are trying to store, are typically served quite inefficiently in a graph, and would be more efficiently handled by an aggregate-oriented Key-Value or Document store. If complexity is low, the advantage of using a graph database system will be lower too.
The reason behind this is closely related to the nature of the persistence model. It is easier to compute a sum, max, or avg over tabular data than over data stored as a graph.
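A toy illustration of that difference in plain Java (hypothetical data, not tied to any particular database): in an aggregate-oriented layout the ages already sit next to each other, while in a graph-style layout the same question means visiting every node record and pulling the property out of it.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxAgeSketch {
    public static void main(String[] args) {
        // Aggregate/column-style layout: the ages already sit in one array,
        // so the aggregation is a single tight scan.
        int[] ages = {25, 31, 42, 37};
        int maxFromColumn = Arrays.stream(ages).max().getAsInt();

        // Graph-style layout: every user is a separate node record; to answer the
        // same question the engine has to touch each node and read its property.
        List<Map<String, Object>> nodes = new ArrayList<>();
        for (int age : ages) {
            Map<String, Object> node = new HashMap<>();
            node.put("label", "User");
            node.put("age", age);
            nodes.add(node);
        }
        int maxFromGraph = Integer.MIN_VALUE;
        for (Map<String, Object> node : nodes) {
            maxFromGraph = Math.max(maxFromGraph, (Integer) node.get("age"));
        }

        System.out.println(maxFromColumn + " " + maxFromGraph);   // both print 42
    }
}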

How to build a knowledge graph?

I prototyped a tiny search engine with PageRank that worked on my computer. I am interested in building a Knowledge Graph on top of it, so that it returns only queried webpages that are within the right context, similar to how Google finds relevant answers to search questions. I have seen a lot of publicity around Knowledge Graphs, but not a lot of literature and almost no pseudocode-like guidelines for building one. Does anyone know of good references on how such Knowledge Graphs work internally, so that I don't have to come up with my own models of a KG?
"Knowledge graph" is a buzzword. It is a sum of models and technologies put together to achieve a result.
The first stop on your journey is natural language processing, ontologies, and text mining. It is a wide field of artificial intelligence; go here for a research survey of the field.
Before building your own models, I suggest you try different standard algorithms using dedicated toolboxes such as gensim. You will learn about tf-idf, LDA, document feature vectors, etc.
I am assuming you want to work with text data; image search using other images is a different matter, and the same goes for audio.
Building models is only the first step; the most difficult part of Google's Knowledge Graph is actually scaling to billions of requests each day...
A good processing pipeline can be built "easily" on top of Apache Spark, "the current-gen Hadoop". It provides resilient distributed datasets (RDDs), which are mandatory if you want to scale.
If you want to keep your data as a graph, as in graph theory (like PageRank), for live querying, I suggest you use Bulbs, a framework that is "like an ORM for graphs, but instead of SQL, you use the graph-traversal language Gremlin to query the database". You can switch the backend from Neo4j to OpenRDF (useful if you work with ontologies), for instance.
For graph analytics you can use Spark's GraphX module or GraphLab.
Hope it helps.
I know I'm really late, but first to clarify some terminology: Knowledge Graph and Ontology are similar (I'm talking in the Semantic Web paradigm). In the Semantic Web stack the foundation is RDF, which is a language for defining graphs as triples (Subject, Predicate, Object). RDFS is a layer on top of RDF. It defines a meta-model, e.g., predicates such as rdf:type and nodes such as rdfs:Class. Although RDFS provides a meta-model, there is no logical foundation for it, so there are no reasoners that can validate the model or do further reasoning on it. The layer on top of RDFS is OWL (Web Ontology Language). It has a formal semantics defined by Description Logic, which is a decidable subset of First Order Logic. It has more predefined nodes and links such as owl:Class, owl:ObjectProperty, etc. So when people use the term ontology they typically mean an OWL model. When they use the term Knowledge Graph, it may refer to an ontology defined in OWL (because an OWL model is still ultimately an RDF graph), or it may mean just a graph in RDF/RDFS.
I said that because IMO the best way to build a knowledge graph is to define an ontology and then use various Semantic Web tools to load data (e.g., from spreadsheets) into the ontology. The best tool to start with IMO is the Protege ontology editor from Stanford. It's free and, for a free open source tool, very reliable and intuitive. There is a good tutorial on how to use Protege and learn OWL, as well as other Semantic Web tools such as SPARQL and SHACL. That tutorial can be found here: New Protege Pizza Tutorial (disclosure: that links to my site; I wrote the tutorial). If you want to get into the lower levels of the graph, you probably want to check out a triplestore: a graph database designed for OWL and RDF models. The free version of Franz Inc's AllegroGraph triplestore is easy to use and supports 5M triples. Another good triplestore that is free and open source is part of the Apache Jena framework.
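As a tiny example of the lower-level route: once you have an ontology file (e.g. exported from Protege), a few lines of Apache Jena will load it and list its classes. This is just a sketch assuming a recent Jena; the file name is a placeholder.

import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.util.iterator.ExtendedIterator;

public class ListOntologyClasses {
    public static void main(String[] args) {
        // Create an ontology-aware model and read the OWL file produced by Protege.
        OntModel model = ModelFactory.createOntologyModel();
        model.read("my-ontology.owl");   // placeholder file name

        // Walk over every class declared in the ontology.
        ExtendedIterator<OntClass> classes = model.listClasses();
        while (classes.hasNext()) {
            OntClass c = classes.next();
            if (c.getURI() != null) {            // skip anonymous class expressions
                System.out.println(c.getURI());
            }
        }
    }
}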

What is the most standard file format and notation for persisting expressive directed graphs?

I am interested in persisting individual directed graphs. This question is not asking for a full-scale graph database solution, but for a document format that I can use to save an individual arbitrary directed graph. I don't know which notation and file format would be the smartest choice.
My primary concerns are:
1. Expressiveness/Flexibility - I want the ability to express graphs of different types. While the standard use case would be a simple directed graph, it should be possible to express trees, cyclical graphs, and multi-graphs. As a bare minimum, I would expect support for labeling and weighting of edges and nodes. Notations for describing higraphs and edge composition/hyper-edges would also be highly desirable, although I am aware that such solutions may not exist.
2. Type System-Independence - I am interested in representing the structural qualities of graphs. Some solutions include an extensible type system for typed edges and nodes (e.g. RDF/OWL). I would only be interested in such a representation if there were a clearly defined canonical decomposition of typed elements into primitives (nodes/edges/attributes). What I am trying to avoid here is the possibility of multiple representations of equivalent graphs, where the equivalence is not discernible.
3. Canonical Representation - There should be a mechanism that allows the graph to be represented canonically (in such a way that lexical equivalence of canonical representations could be used to determine equivalence).
4. Presentation Independence - I would prefer a notation that is not dependent upon the presentation of the graph. This would include spatial orientation, colors, font, etc. I am only interested in representing the data. One of the features I don't like about the DOT language, DGML, or SVG (at least for this particular purpose) is the focus on visual representation.
5. Standardized / Open / Compatible - The less implementation work I have to do, the better. If the format is standardized and reliable tools already exist for working with it, then it is preferable. Accompanying this requirement is another: the format should be highly compatible. The proprietary nature of Microsoft's DGML is a reason for my aversion, despite the Visual Studio tooling and the fact that I work primarily with .NET (now). The fact that the W3C publishes RDF standards is a motivation for considering a limited subset of RDF as a representational tool. I also appreciate GXL and GraphML, because they have well-documented XML schemas, thereby facilitating the ability to integrate their data with any XML-compatible software package.
6. Simplicity / Readability - I appreciate human-readable syntax and ease of interpretation. I also appreciate representations that simplify parsing. For this reason, I like GML, but I am concerned it is not mainstream enough to be a realistic choice. I would also consider JSON or YAML for readability, if they were not so limited in their respective abilities to represent complex (non-DAG) structures.
7. Efficiency / Concise Representation - It's worth considering that whatever format I end up choosing will inevitably have to be persisted and transferred over some network. Therefore, file size is a relevant consideration.
Overview
I recognize that I will most likely be unable to find a solution that satisfies every criteria on my wishlist. I am simply asking for the file format that is closest to what I want and that doesn't limit extensibility for unsupported use cases.
ObWindyPreamble: in the RDF world, there are a gazillion different surface syntax formats to choose from. RDF itself is an abstract metamodel for data, not directly a "graph syntax". You can of course directly represent a graph in RDF (since RDF models are graphs), but given that you want to represent different kinds of graphs, you may end up having to abstract things away and actually create an RDF vocabulary for representing different types of graphs.
All in all, I'm not convinced that RDF is the best way to go for you, but if you were to choose one, I'd say that RDF's Turtle syntax is worth looking into. It certainly ticks the readability and simplicity boxes, as well as being a standard (well, almost... the W3C is working on standardizing it) and having wide (open-source) tool support.
RDF models roughly follow set semantics, which means that a canonical syntax representation cannot really be enforced: two files can list the same information in a different order without it affecting the actual model, or can even contain duplicate statements. However, if you enforce a simple sorting algorithm when producing files (something for which most RDF parsers/writers have support), you should be able to get away with doing line-based comparisons and determining graph equivalence based on the surface syntax.
Just as a simple example, let's assume we have a very simple, directed, labeled graph:
A ---r1---> B ---r2---> C
You could represent this directly in RDF, as follows (using Turtle syntax):
@prefix : <http://example.org/> .
:A :r1 :B .
:B :r2 :C .
In a more abstract modeling, you could do something like this:
@prefix g: <http://example.org/graph-model/> .
@prefix : <http://example.org/> .
:A a g:Vertex .
:B a g:Vertex .
:C a g:Vertex .
:r1 a g:DirectedEdge ;
g:from :A ;
g:to :B .
:r2 a g:DirectedEdge ;
g:from :B ;
g:to :C .
The above is just a simplistic example of course, but hopefully it illustrates that this potentially meets quite a few of the things on your wish list.
By the way, if you want even simpler, N-Triples is also an RDF syntax, which is line-based and therefore easy to process in a streaming fashion. It's slightly more verbose than Turtle but it may make file comparison easier.
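For instance, the first Turtle example above would come out roughly like this in N-Triples (one full triple per line, no prefixes); if you sort the lines, a plain textual diff already gives a usable equality check for graphs without blank nodes:
<http://example.org/A> <http://example.org/r1> <http://example.org/B> .
<http://example.org/B> <http://example.org/r2> <http://example.org/C> .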
My thoughts:
What I'm missing is your particular practical purpose/domain.
You mention the generic JSON format next to specific formats (e.g. GraphML, which is an application of XML). So I'm left wondering whether or not you would consider creating your own format.
Wouldn't having a 'canonical representation that can be used to determine equivalence' solve the graph isomorphism problem?
GraphML seems to cover a lot of your theoretical requirements, so I'd suggest you create a JSON version of this. This would then also cover requirement 6.
Then, you could create a converter between the JSON format and GraphML (and possibly other formats).
For your requirement 7, it again all depends on the practical graph sizes. I mean, nowadays sending up to a few MB to a friggin mobile device is not considered much. A graph of a few MB in (about) any format you mention is already a relatively large beast with tens of thousands of nodes & edges.
What about the Trivial Graph Format?
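It is about as small as a graph format gets: node ids with optional labels, a "#" separator line, then one edge per line with an optional label. The A/B/C example from above would look something like this:
1 A
2 B
3 C
#
1 2 r1
2 3 r2
Of course it only covers plain labeled directed graphs, so it falls short of the more ambitious requirements (typed attributes, hyper-edges, a canonical form).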

Efficient traversal/search algorithm to fetch data from RDF?

I have my data as an RDF graph in a DB, and I am retrieving the data using SPARQL. Now the number of nodes (objects) in the graph has become huge and traversal/search has gotten much slower.
a. Can anyone suggest an efficient traversal/search algorithm to fetch the data?
As a next step, I have federated data, i.e. data from external applications like SAP. In this case, the search becomes even slower.
b. What efficient search algorithm should I use in this case?
This seems like a common issue in large enterprise systems, and any input on how these problems have been solved in such systems would also be helpful.
I had a similar problem. I was doing a lot of graph traversal using SPARQL property paths and it was too slow with an RDF-based repository. I was using Jena TDB, which is supposed to be fast, but it was still too slow!
As @Mikos suggested, I tried Neo4j. Things then got much faster. As Mark Watson says in this blog entry:
RDF data stores support SPARQL queries: good for matching patterns in data. Neo4j supports arbitrary graph structures and seems best for exploring a neighborhood of a graph: start at a node and explore the connected nodes (graph traversal).
I used Neo4j, but you can try any tool built for graph traversal. I have read that AllegroGraph 4 is RDF-based and has good graph traversal speed.
Now I'm using Neo4j, but I haven't given up on RDF. I still use URIs as identifiers and try to reuse the popular RDF vocabularies and relations. Later I'll add a feature to render my graphs as RDF. I know that with Neo4j you can also use TinkerPop to render RDF, but I haven't tried it myself.
Graph traversal and efficient querying are a wide-ranging problem, and the approach to use depends on your situation. I would suggest looking at a data store like Neo4j and complementing it with a tool like Lucene.

How can I build an incremental directed acyclic word graph to store and search strings?

I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.
A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).
I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html. It looks extremely, extremely complex and I am not sure I am capable of writing it.
I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.
I don't think I am advanced enough to be able to combine both of these algorithms myself. Is there already documentation of an algorithm that features both of these, or an alternative algorithm with good memory use and speed?
I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.
The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.
As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.
My advice is to use a standard Trie, with an array index stored in each node. Yes, it is going to seem simplistic, but in a trie each END_OF_WORD node represents only one word; the ADTDAWG is a solution to the problem that, in a traditional DAWG, each END_OF_WORD node represents many, many words.
Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.
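To sketch that advice in plain Java (names made up, and the "array index" generalised to an arbitrary payload): a trie node keeps a child map plus an optional value, so inserts are incremental and a lookup returns whatever data was stored with the word.

import java.util.HashMap;
import java.util.Map;

public class PayloadTrie<V> {

    private static final class Node<V> {
        final Map<Character, Node<V>> children = new HashMap<>();
        V value;                 // only meaningful on END_OF_WORD nodes
        boolean endOfWord;
    }

    private final Node<V> root = new Node<>();

    // Incremental insert: just extend the existing structure, no rebuild needed.
    public void put(String word, V value) {
        Node<V> node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node<>());
        }
        node.endOfWord = true;
        node.value = value;
    }

    // Returns the data stored with the word, or null if the word is absent.
    public V get(String word) {
        Node<V> node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node.endOfWord ? node.value : null;
    }

    public static void main(String[] args) {
        PayloadTrie<Integer> trie = new PayloadTrie<>();
        trie.put("cat", 7);
        trie.put("car", 42);
        System.out.println(trie.get("car"));   // 42
        System.out.println(trie.get("ca"));    // null (prefix only, not a stored word)
    }
}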
I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.
Java
For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.
They have some good examples to get you going quickly and there's usually example code to get you started with most problems.
They have a DAG example with a link at the bottom to the full source code.
C++
If you're using C++, a common solution to graph building/analysis is to use the Boost graph library. To persist your graph you could maintain a file based version of the graph in GraphML (for example) and read and write to that file as your graph changes.
You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.
I'm suggesting this for a few reasons:
I really don't have a full understanding of your result.
Definitely incremental to build.
Leaf nodes can contain any data you wish.
Subjectively, a simple algorithm.

Resources