Efficient traversal/search algorithm to fetch data from RDF?

I have my data as an RDF graph in a database, and I am retrieving it with SPARQL. Now that the number of nodes (objects) in the graph has grown huge, traversal/search has become much slower.
a. Can anyone suggest an efficient traversal/search algorithm to fetch the data?
As a next step, I have federated data, i.e. data coming from external applications like SAP. In this case, the search becomes slower still.
b. What efficient search algorithm should I use in this case?
This seems like a common issue in large enterprise systems, and any input on how these problems have been solved in such systems would also be helpful.

I had a similar problem. I was doing a lot of graph traversal using SPARQL property paths, and it was too slow with an RDF-based repository. I was using Jena TDB, which is supposed to be fast, but it was still too slow!
As @Mikos suggested, I tried Neo4j, and it then got much faster. As Mark Watson says in this blog entry:
RDF data stores support SPARQL queries: good for matching patterns in data. Neo4j supports arbitrary graph structures and seems best for exploring a neighborhood of a graph: start at a node and explore the connected nodes (graph traversal).
I used Neo4j, but you can try any tool that is built for graph traversal. I have read that AllegroGraph 4 is RDF-based and has good graph traversal speed.
I am using Neo4j now, but I didn't give up on RDF. I still use URIs as identifiers and try to reuse the popular RDF vocabularies and relations. Later I'll add a feature to render my graphs as RDF. I know that with Neo4j you can also use TinkerPop to render RDF, but I haven't tried it myself.
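To make the "start at a node and explore the connected nodes" pattern concrete, here is a minimal sketch using the Neo4j embedded Java API (3.x-era traversal framework). The KNOWS relationship type and the uri property are illustrative assumptions, not anything prescribed above:

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Path;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.traversal.Evaluators;
    import org.neo4j.graphdb.traversal.TraversalDescription;

    public class NeighborhoodExample {

        // Breadth-first exploration of the 2-hop neighborhood of a start node,
        // following KNOWS relationships in either direction.
        public static void printNeighborhood(GraphDatabaseService graphDb, long startNodeId) {
            try (Transaction tx = graphDb.beginTx()) {
                Node start = graphDb.getNodeById(startNodeId);
                TraversalDescription td = graphDb.traversalDescription()
                        .breadthFirst()
                        .relationships(RelationshipType.withName("KNOWS"), Direction.BOTH)
                        .evaluator(Evaluators.toDepth(2));
                for (Path path : td.traverse(start)) {
                    // Nodes keep their RDF-style URI as a plain property.
                    System.out.println(path.endNode().getProperty("uri", "<no uri>"));
                }
                tx.success();
            }
        }
    }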

Graph traversal and efficient querying is a wide-ranging problem, and the right approach depends on your situation. I would suggest looking at a data store like Neo4j and complementing it with a tool like Lucene.

Related

Do graph databases have problems with aggregation operations?

I have come across multiple opinions that graph databases tend to have problems with aggregation operations. For example, if you have a set of users and want to get the maximum age, an RDBMS will outperform a graph database. Is this true, and if it is, what is the reason behind it? As far as I understand, the key difference between a relational and a graph database is that each graph database node somehow includes references to the nodes it is connected to. How does that affect a "get max age"-style query?
Disclaimer: most of what I have read was about Neo4j, but I suppose if these limitations exist, they should apply to any graph db.
The use of graph databases like Neo4j is recommended when dealing with connected data and complex queries.
The book Learning Neo4j by Rik Van Bruggen states that you should not use graph databases when dealing with simple, aggregate-oriented queries:
From the book:
(...) simple queries, where write patterns and read patterns align to the aggregates that we are trying to store, are typically served quite inefficiently in a graph, and would be more efficiently handled by an aggregate-oriented Key-Value or Document store. If complexity is low, the advantage of using a graph database system will be lower too.
The reason behind this is closely tied to the nature of the persistence model: it is easier to compute a sum, max, or average over tabular data than over data stored as a graph.
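To see why, consider what a "get max age" query has to do in a graph store: there is no age column to scan, so every User node must be visited one by one, whereas an RDBMS can run SELECT MAX(age) over a single column, often straight off an index. A minimal sketch, assuming the Neo4j 3.x embedded Java API and a hypothetical User label with an age property:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.ResourceIterator;
    import org.neo4j.graphdb.Transaction;

    public class MaxAgeExample {

        // Aggregation over a graph store: a full scan of all User nodes.
        public static long maxAge(GraphDatabaseService graphDb) {
            long max = Long.MIN_VALUE;
            try (Transaction tx = graphDb.beginTx();
                 ResourceIterator<Node> users = graphDb.findNodes(Label.label("User"))) {
                while (users.hasNext()) {
                    Node user = users.next();
                    max = Math.max(max, ((Number) user.getProperty("age")).longValue());
                }
                tx.success();
            }
            return max;
        }
    }

Index-free adjacency speeds up hop-by-hop traversal, but it buys nothing for a scan-and-aggregate workload like this.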

ArangoDB graph feature vs GraphQL

I am trying to design a product catalog which contains several features for the product (e.g., size, color, producer, etc).
I am also using ArangoDB as my database. ArangoDB provides a powerful graph traversal feature. In this regard, I would like to know whether it would be better to take advantage of the Arango graph traversal feature or to use GraphQL.
I have been researching GraphQL and I honestly can't see how it is more beneficial than the ArangoDB graph traversal capabilities.
The graph traversal of ArangoDB is much more performant than GraphQL. The only reason to pick GraphQL is that it is a little bit easier to implement.
ArangoDB has recently created a GraphQL AQL generator; maybe this could be interesting for you:
https://www.arangodb.com/2017/10/auto-generate-graphql-arangodb/
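As a rough sketch of what the AQL side looks like from Java, here is a 1-to-2-hop traversal issued through the official arangodb-java-driver (signatures as in the 6.x driver); the database name catalog, the named graph productGraph, and the start document products/p1 are all hypothetical:

    import com.arangodb.ArangoCursor;
    import com.arangodb.ArangoDB;
    import com.arangodb.ArangoDatabase;
    import com.arangodb.entity.BaseDocument;

    import java.util.Collections;
    import java.util.Map;

    public class CatalogTraversal {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder().build(); // localhost:8529 by default
            ArangoDatabase db = arango.db("catalog");

            // Walk 1..2 OUTBOUND edges from a product through the named graph.
            String aql = "FOR v IN 1..2 OUTBOUND @start GRAPH 'productGraph' RETURN v";
            Map<String, Object> bindVars = Collections.singletonMap("start", "products/p1");

            ArangoCursor<BaseDocument> cursor = db.query(aql, bindVars, BaseDocument.class);
            cursor.forEachRemaining(doc -> System.out.println(doc.getKey()));
            arango.shutdown();
        }
    }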

What are the different graph processing alternatives for Hadoop/Spark

Giraph, GraphX, and Neo4j are some of the solutions I am aware of today. As this is an area all the tech giants are working on, an updated list would be much appreciated. A good comparison of the tools listed above is also nowhere to be found.
Firstly, I should mention that Giraph and GraphX are for graph processing, while Neo4j is a graph database. If you are going to store your graph and query it like "give me some nodes that have content 'X' with a two-distance neighbor having content 'Y'", solutions like Neo4j (a graph database) should be applied. Otherwise, Giraph or GraphX can play the graph-processing role.
Unfortunately, although GraphX offers very nice APIs, it fails for large graphs when the available distributed memory is not enough. This is mostly observed when the intermediate data does not fit in the available memory.
In addition, as reported in the literature, Giraph often comes last in performance, but it is more stable than GraphX.
There are other solutions, like GraphLab and Titan, for distributed graph processing that are worth investigating.
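To give a flavor of the "graph processing" side, here is a sketch of single-source shortest paths in the Pregel/Giraph model, closely following Giraph's well-known SimpleShortestPathsComputation example; the hard-coded source vertex id is an assumption:

    import org.apache.giraph.edge.Edge;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    // Each superstep, a vertex adopts the smallest distance it has seen so far
    // and, if that improved, pushes updated distances to its neighbors.
    public class ShortestPaths extends
            BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

        private static final long SOURCE_ID = 1; // assumed source vertex

        @Override
        public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                            Iterable<DoubleWritable> messages) {
            if (getSuperstep() == 0) {
                vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
            }
            double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
            for (DoubleWritable message : messages) {
                minDist = Math.min(minDist, message.get());
            }
            if (minDist < vertex.getValue().get()) {
                vertex.setValue(new DoubleWritable(minDist));
                for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                    sendMessage(edge.getTargetVertexId(),
                            new DoubleWritable(minDist + edge.getValue().get()));
                }
            }
            vertex.voteToHalt(); // reawakened only by incoming messages
        }
    }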

Apply graph analysis on networked data represented with RDF

I want to run some analysis on networked data that has multiple modes (i.e. multiple types of network nodes) and multiplex relations (i.e. multiple types of network edges).
The analysis is probably about SNA or applying some algorithm from graph theory, e.g. tie strength, centrality, betweenness, node distance, blocks, clusters, etc.
The source data is rather unstructured, so I should first think about how to represent, store, and retrieve it.
Following are some ideas. I would appreciate any feedback or further suggestions. :)
I know that there are already some great NoSQL databases, for example Neo4j and InfoGrid, for this kind of application. But for some extensibility reasons (e.g. licence, web standards...) I would prefer to use RDF to store and represent my data. The tools to use would be Sesame or Jena.
The idea of representing network/graph data with RDF is straightforward.
For example:
Network/Graph data
*Alice* ----lend 100USD----> *Bob* ----- likes ----> *Skiing*
represented with RDF
*Alice* --src--> *lend_relation* <--target-- *Bob* --likes--> *Skiing*
                       |
                   has_value
                       v
                   *100USD*
[Alice src lend_relation]
[Bob target lend_relation]
[lend_relation has_value 100USD]
[Bob likes Skiing]
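As a sketch, those four triples could be materialized with Apache Jena like this (the http://example.org/ namespace is a placeholder):

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public class LendGraph {
        public static Model build() {
            Model m = ModelFactory.createDefaultModel();
            String ns = "http://example.org/"; // placeholder namespace

            Resource alice = m.createResource(ns + "Alice");
            Resource bob = m.createResource(ns + "Bob");
            Resource skiing = m.createResource(ns + "Skiing");
            Resource lend = m.createResource(ns + "lend_relation");

            Property src = m.createProperty(ns, "src");
            Property target = m.createProperty(ns, "target");
            Property hasValue = m.createProperty(ns, "has_value");
            Property likes = m.createProperty(ns, "likes");

            alice.addProperty(src, lend);         // [Alice src lend_relation]
            bob.addProperty(target, lend);        // [Bob target lend_relation]
            lend.addProperty(hasValue, "100USD"); // [lend_relation has_value 100USD]
            bob.addProperty(likes, skiing);       // [Bob likes Skiing]
            return m;
        }
    }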
However, the problem is that RDF, as well as SPARQL, lacks a graph-model perspective.
It is not efficient to traverse between nodes or to find the (shortest) distance between them with an RDF query.
That must be done with extra analysis tools, for example JUNG or JGraphT, and I must first construct a subgraph by querying the RDF storage and then convert it into the data model used by JUNG or JGraphT. If I want extra visualization (from neither JUNG nor JGraphT), then I must construct yet another data model for the visualization toolkit.
I don't know whether that is a clean or efficient integration.
Thanks again for any suggestions!
If you want to do network analysis of your RDF data with SPARQL, you can have a look at SPARQL 1.1 Property Paths. I believe it has already been implemented in Jena/ARQ (see ARQ - Property Paths).
Property Paths, from the new SPARQL spec, let you query the RDF data model with graph patterns that are a bit more complex than the ones you could define in SPARQL 1.0.
With this feature plus some logic at the application level, you might be able to implement some interesting network analysis over your data.
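As a rough illustration, a property path query can be issued from Java with Jena ARQ as follows; the ex: namespace and the ex:likes+ path reuse the lend/likes example from the question and are assumptions:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;

    public class PropertyPathExample {

        // Everything reachable from ex:Bob over one or more ex:likes links --
        // a transitive closure expressed as a SPARQL 1.1 property path.
        public static void reachableFromBob(Model model) {
            String sparql =
                    "PREFIX ex: <http://example.org/> " +
                    "SELECT ?thing WHERE { ex:Bob ex:likes+ ?thing }";
            Query query = QueryFactory.create(sparql);
            try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.getResource("thing"));
                }
            }
        }
    }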

How can I build an incremental directed acyclic word graph to store and search strings?

I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.
A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).
I have found information on a modification of the DAWG for tracking string data here: http://www.pathcom.com/~vadco/adtdawg.html
It looks extremely complex, and I am not sure I am capable of writing it.
I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.
I don't think I am advanced enough to combine both of these algorithms myself. Is there an already-documented algorithm that offers both features, or an alternative algorithm with good memory use and speed?
I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than four arrays of unsigned integer types. It was designed to be immutable, for total CPU-cache inclusion and minimal multi-thread access complexity.
The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.
As published, it supports up to 18 characters; including all 26 English characters would require further augmentation.
My advice is to use a standard Trie with an array index stored in each node (see the sketch below). Yes, it is going to seem infantile, but each END_OF_WORD node then represents only one word, whereas the ADTDAWG exists to address the fact that each END_OF_WORD node in a traditional DAWG represents many, many words.
Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.
I am looking for something else to work on, or a job, so contact me and I'll do what I can. For now, all I can say is that it is unrealistic to apply heavy optimization to a structure that changes frequently.
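As a minimal sketch of the "standard Trie with data at each end-of-word node" advice, here is a generic, incrementally buildable trie; it stores an arbitrary payload per word rather than a raw array index, but the idea is the same:

    import java.util.HashMap;
    import java.util.Map;

    // A plain trie: words can be added at any time, and each end-of-word
    // node carries the data associated with that word.
    public class Trie<V> {

        private static final class TrieNode<V> {
            final Map<Character, TrieNode<V>> children = new HashMap<>();
            V value;            // meaningful only when endOfWord is true
            boolean endOfWord;
        }

        private final TrieNode<V> root = new TrieNode<>();

        // Insert (or overwrite) a word and its associated data.
        public void put(String word, V value) {
            TrieNode<V> node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new TrieNode<>());
            }
            node.endOfWord = true;
            node.value = value;
        }

        // Returns the stored data, or null if the word is absent.
        public V get(String word) {
            TrieNode<V> node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) {
                    return null;
                }
            }
            return node.endOfWord ? node.value : null;
        }
    }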
Java
For graph problems which require persistence, I'd take a look at the Neo4j graph database project. Neo4j is designed to store large graphs and allows incremental building and modification of the data, which seems to meet the criteria you describe.
They have some good examples to get you going quickly and there's usually example code to get you started with most problems.
They have a DAG example with a link at the bottom to the full source code.
C++
If you're using C++, a common solution for graph building/analysis is the Boost Graph Library. To persist your graph, you could maintain a file-based version in GraphML (for example) and read and write that file as your graph changes.
You may also want to look at a trie structure for this (potentially building a radix tree). It seems like a decent, simple alternative structure.
I'm suggesting this for a few reasons:
- I really don't have a full understanding of your requirements.
- It is definitely incremental to build.
- Leaf nodes can contain any data you wish.
- Subjectively, it is a simple algorithm.
