I need to migrate from an RDBMS to a graph database and have decided to use Neo4j with Gremlin. However, PHP is the only server-side language available to me. What are the steps to implement Neo4j (and Gremlin?) in a CodeIgniter environment? Maybe this question is too general, but I'm sure many people have the same problem.
In general (I'm not sure about PHP frameworks), you will want to follow this process.
For all your "objects tables," loop through the rows and create a respective vertex. For the columns of those rows (e.g. id, name, age), add them as properties of the vertex. For instance, if you have a Person-table, then SELECT * FROM Person. Each row is a vertex with properties.
For all your "relational tables" (or anything that is relational via a join), loop through the rows and link your vertices. For instance, SELECT personId, companyID FROM WorksFor. Each row is an edge that links a person vertex with a company vertex.
Adding vertices/edges via Gremlin is simple. The complicated part of the process is the workflow you go through to create your mapping.
https://github.com/tinkerpop/gremlin/wiki/Updating-a-Graph
Finally, be sure to be smart about transaction handling so you don't blow your heap. You will want to commit your transaction every so often to have the data persisted to disk.
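The process above can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP, and it makes several assumptions: `InMemoryGraph` is a hypothetical stand-in for a graph client (it is not the Neo4j or Gremlin API), the `Person`, `Company` and `WorksFor` tables and their rows are made up, and `batch_size` is an arbitrary value you would tune to your heap.

```python
import sqlite3


class InMemoryGraph:
    """Hypothetical stand-in for a graph client; not a Neo4j or Gremlin API."""

    def __init__(self):
        self.vertices = {}  # (label, id) -> property dict
        self.edges = []     # (out_vertex_key, edge_label, in_vertex_key)
        self.commits = 0

    def add_vertex(self, label, vertex_id, properties):
        self.vertices[(label, vertex_id)] = properties

    def add_edge(self, out_key, label, in_key):
        # Refuse dangling edges: both endpoints must have been loaded already.
        if out_key not in self.vertices or in_key not in self.vertices:
            raise KeyError(f"missing endpoint for edge {out_key} -> {in_key}")
        self.edges.append((out_key, label, in_key))

    def commit(self):
        # A real client would flush the open transaction to disk here.
        self.commits += 1


def migrate(conn, graph, batch_size=1000):
    """Copy object tables to vertices and the join table to edges."""
    conn.row_factory = sqlite3.Row
    pending = 0

    def count_and_maybe_commit():
        nonlocal pending
        pending += 1
        if pending >= batch_size:
            graph.commit()
            pending = 0

    # Step 1: every row of an object table becomes a vertex with properties.
    for table in ("Person", "Company"):
        for row in conn.execute(f"SELECT * FROM {table}"):
            properties = dict(row)
            graph.add_vertex(table, properties["id"], properties)
            count_and_maybe_commit()

    # Step 2: every row of the join table becomes an edge between two vertices.
    for row in conn.execute("SELECT personId, companyID FROM WorksFor"):
        graph.add_edge(("Person", row["personId"]), "worksFor",
                       ("Company", row["companyID"]))
        count_and_maybe_commit()

    if pending:
        graph.commit()  # persist the final partial batch


def build_sample_db():
    """Create a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
        CREATE TABLE Company (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE WorksFor (personId INTEGER, companyID INTEGER);
        INSERT INTO Person VALUES (1, 'Ada', 36), (2, 'Linus', 28);
        INSERT INTO Company VALUES (10, 'Acme');
        INSERT INTO WorksFor VALUES (1, 10), (2, 10);
    """)
    return conn
```

Loading all vertices before any edges keeps the edge step simple, because every endpoint is guaranteed to exist by the time the join table is read.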
Not sure, but since Cypher (http://docs.neo4j.org/chunked/snapshot/cypher-query-lang.html) is the native Neo4j query language and very SQL-like, it might be an easier option.
Using this library should work in CodeIgniter too:
https://github.com/jadell/Neo4jPHP/wiki
Related
I'm endeavoring to develop an application that uses Oracle as the database back-end. The application will calculate several statistics from the various tables in the database. The front-end will most likely be a web application that displays various charts and calculated statistics.
I imagine that it would be more efficient to perform the calculations in the database rather than in the service layer, because the calculations would otherwise need to be performed on every web request. That being the case, I'm not sure which mechanism to use (e.g. stored procedure, function, view).
To illustrate what I'm going for, suppose I want to keep statistics of student grades for many students. I would like to have a web interface that lets me view those statistics on a student-by-student basis and also on an all-inclusive basis. Some of the stats depend on aggregates (e.g. average, min, max) of all of the student grades, and some depend only on an individual student. In this situation, every time a record is added or updated, the aggregates would have to be recalculated.
So I am speculating that if I had a special table that held all of the calculated values I need, and one or more triggers to recalculate everything when a record is added or updated, then all the service layer would need to do for a web request is pull the desired values from this special table. I'm just not sure whether this is the best way to go, so I am asking the community for any input/advice.
Note: Although I'm using Oracle, I'm open to using PostgreSQL or MySQL.
Thanks in advance
The scenario you are describing would be ideal for using materialized views. They can be designed to refresh automatically (and incrementally) every time the source data is updated by your application. The calculations would be built into the view definition. No triggers are required, and likely no stored procedures unless your calculations involve multiple steps. Check here: https://oracle-base.com/articles/misc/materialized-views and here: https://medium.com/oracledevs/lightning-fast-sql-with-real-time-materialized-views-12-things-developers-will-love-about-oracle-54bcc9eac358 for more info.
I was recently reading through this section in the ElasticSearch documentation (or the guide, to be more precise). It says that you should try to use a non-relational database the intended way, meaning you should avoid joins between different tables because such databases are not designed to handle them well. This also reminds me of the section in the DynamoDB docs stating that most well-designed DynamoDB backends only require one table.
Let's take as an example a recipes database where each recipe is using several ingredients. Every ingredient can be used in many different recipes.
Option 1: The obvious way to me to model this in AppSync and DynamoDB would be to start with an ingredients table which has one item per ingredient storing all the ingredient data, with the ingredient id as partition key. Then I have another recipes table with the recipe id as partition key and an ingredients field storing all the ingredient ids in an array. In AppSync I could then query a recipe by doing a GetItem request by recipe id and then resolving the ingredients field with a BatchGetItem on the ingredients table. Let's say a recipe contains 10 ingredients on average, so this would mean 11 GetItem requests sent to the DynamoDB tables.
Option 2: I would consider this a "join like" operation which is apparently not the ideal way to use non-relational databases. So, alternatively I could do the following: Make "redundant copies" of all the ingredient data on the recipes table and not only save the ingredient id there, but also all the other data from the ingredients table. This could drastically increase disk space usage, but apparently disk space is cheap and the increase in performance by only doing 1 GetItem request (instead of 11) could be worth it. As discussed later in the ElasticSearch guide this would also require some extra work to ensure concurrency when ingredient data is updated. So I would probably have to use a DynamoDB stream to update all the data in the recipes table as well when an ingredient is updated. This would require an expensive Scan to find all the recipes using the updated ingredient and a BatchWrite to update all these items. (An ingredient update might be rare though, so the increase in read performance might be worth that.)
I would be interested in hearing your thoughts on this:
Which option would you choose and why?
The second "more non-relational way" to do this seems painful and I am worried that with more levels/relations appearing (for example if users can create menus out of recipes), the resulting complexity could get out of hand quickly when I have to save "redundant copies" of the same data multiple times. I don't know much about relational databases, but these things seem much simpler there when every data has its unique location and that's it (I guess that's what "normalization" means).
Is getRecipe in the Option 1 really 11 times more expensive (performance and cost wise) than in Option 2? Or do I misunderstand something?
Would Option 1 be a cheaper operation in a relational database (e.g. MySQL) than in DynamoDB? Even though it's a join if I understand correctly, it's also just 11 ("NoSQL intended way") GetItem operations. Could this still be faster than 1 SQL query?
If I have a very relational data structure could a non-relational database like DynamoDB be a bad choice? Or is AppSync/GraphQL a way to still make it a viable choice (by allowing Option 1 which is really easy to build)? I read some opinions that constantly working around the missing join capability when querying NoSQL databases and having to do this on the application side is the main reason why it's not a good fit. But AppSync might be a way to solve this problem. Other opinions (including the DynamoDB docs) mention performance issues as the main reason why you should always query just one table.
This is quite late, I know, but might help someone down the road.
Start with an entity relationship diagram as this will help determine your options. Even in NoSQL, there are standard ways of modeling relationships.
Next, define your access patterns. Go through all the CRUDL operations and make sure that for each operation, you can access the specific data for that operation. For example, in your option 1 where ingredients are stored in an array in a field: think through an access pattern where you might need to delete an ingredient in a recipe. To do this, you need to know the index of the item in the array. Therefore, you have to obtain the entire array, find the index of the item, and then issue another call to update the array, taking into account possible race conditions.
Doing this in your application, while possible, is not efficient. You can also code this up in your resolver, but attempting to do so using Velocity Template Language (VTL) is not worth the headache, trust me.
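To make the read-modify-write cycle and its race condition concrete, here is a minimal sketch in Python. It assumes an optimistic-locking scheme with a version number on each recipe item; `RecipeStore`, `update_if_version` and `ConditionFailed` are hypothetical names for an in-memory stand-in, not DynamoDB or AppSync calls.

```python
class ConditionFailed(Exception):
    """Raised when the stored version differs from the one the caller read."""


class RecipeStore:
    """Hypothetical in-memory stand-in for a recipes table."""

    def __init__(self):
        self._items = {}

    def put(self, recipe_id, ingredient_ids):
        self._items[recipe_id] = {"ingredients": list(ingredient_ids), "version": 1}

    def get(self, recipe_id):
        item = self._items[recipe_id]
        # Return a copy so callers cannot mutate stored state directly.
        return {"ingredients": list(item["ingredients"]), "version": item["version"]}

    def update_if_version(self, recipe_id, ingredient_ids, expected_version):
        item = self._items[recipe_id]
        if item["version"] != expected_version:
            raise ConditionFailed(recipe_id)  # someone else wrote in between
        item["ingredients"] = list(ingredient_ids)
        item["version"] += 1


def remove_ingredient(store, recipe_id, ingredient_id, max_attempts=3):
    """Read the whole array, drop one id, write it back; retry on conflict."""
    for _ in range(max_attempts):
        snapshot = store.get(recipe_id)
        if ingredient_id not in snapshot["ingredients"]:
            return False  # nothing to remove
        remaining = [i for i in snapshot["ingredients"] if i != ingredient_id]
        try:
            store.update_if_version(recipe_id, remaining, snapshot["version"])
            return True
        except ConditionFailed:
            continue  # re-read and try again
    raise ConditionFailed(recipe_id)
```

Even in this toy form, deleting one ingredient costs a full read, a full write and a retry loop, which is the overhead the access-pattern review is meant to surface early.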
The TL;DR is to model your entire application's entity relationship diagram, and think through all the access patterns. If the relationship is one-to-many, you can either denormalize the data, use a composite sort key, or use secondary indexes. If many-to-many, you start getting into adjacency lists and other advanced strategies. Alex DeBrie has some great resources here and here.
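As a rough illustration of the adjacency-list idea for the many-to-many recipe/ingredient case, the sketch below keeps everything in one table and adds an inverted index that swaps the two keys. It runs entirely in memory: `SingleTable`, the `RECIPE#`/`INGREDIENT#` key prefixes and the sample rows are assumptions made for the example, not DynamoDB API calls.

```python
class SingleTable:
    """In-memory stand-in for one table plus an inverted index (sk -> pk)."""

    def __init__(self):
        self._by_pk = {}  # pk -> {sk: item}
        self._by_sk = {}  # sk -> {pk: item}, plays the role of a secondary index

    def put(self, pk, sk, **attributes):
        item = {"pk": pk, "sk": sk, **attributes}
        self._by_pk.setdefault(pk, {})[sk] = item
        self._by_sk.setdefault(sk, {})[pk] = item

    def query(self, pk, sk_prefix=""):
        rows = self._by_pk.get(pk, {})
        return [rows[sk] for sk in sorted(rows) if sk.startswith(sk_prefix)]

    def query_inverted(self, sk, pk_prefix=""):
        rows = self._by_sk.get(sk, {})
        return [rows[pk] for pk in sorted(rows) if pk.startswith(pk_prefix)]


def add_recipe(table, recipe_id, title):
    key = f"RECIPE#{recipe_id}"
    table.put(key, key, title=title)  # the entity's own item


def add_ingredient(table, ingredient_id, name):
    key = f"INGREDIENT#{ingredient_id}"
    table.put(key, key, name=name)


def link(table, recipe_id, ingredient_id, quantity):
    # The edge item lives in the recipe's partition, keyed by the ingredient.
    table.put(f"RECIPE#{recipe_id}", f"INGREDIENT#{ingredient_id}", quantity=quantity)


def build_sample_table():
    """Made-up data: two recipes sharing one ingredient."""
    table = SingleTable()
    add_recipe(table, "r1", "Pancakes")
    add_recipe(table, "r2", "Omelette")
    add_ingredient(table, "egg", "Egg")
    add_ingredient(table, "flour", "Flour")
    link(table, "r1", "egg", "2")
    link(table, "r1", "flour", "200 g")
    link(table, "r2", "egg", "3")
    return table
```

With this layout, `table.query("RECIPE#r1")` returns the recipe and its ingredient edges in a single query, and `table.query_inverted("INGREDIENT#egg", "RECIPE#")` answers the reverse question of which recipes use an ingredient without a scan.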
My team’s been thrown into the deep end and has been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individual (with no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool and then push the result into a database, which is then queried to populate an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database after the first flow:-

Then I would run https://github.com/dedupeio/dedupe over this database table, which will add cluster ids to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the result into an Elasticsearch instance, which the application's API would query, using the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here. The databases will be constantly updated, which I'd need to handle, so I'm really interested to hear if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing you would write a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.
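Whichever route is taken, the step between the dedupe output and Elasticsearch amounts to grouping rows by their cluster id. A minimal sketch in Python, assuming each merged row is a dict with hypothetical `cluster_id`, `source` and `name` fields (the real column names are not shown in the question) and leaving the actual indexing call out:

```python
from collections import defaultdict


def group_by_cluster(records):
    """Build one document per cluster id from deduplicated rows."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record["cluster_id"]].append(record)

    documents = []
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        documents.append({
            "cluster_id": cluster_id,
            # Which source datasets contributed a record to this customer.
            "sources": sorted({member["source"] for member in members}),
            "records": members,
        })
    return documents


def sample_records():
    """Made-up rows standing in for the result table after dedupe has run."""
    return [
        {"cluster_id": 1, "source": "crm", "name": "J. Smith"},
        {"cluster_id": 2, "source": "crm", "name": "A. Jones"},
        {"cluster_id": 1, "source": "billing", "name": "John Smith"},
    ]
```

Each returned document could then be indexed under its cluster id, so the application's API retrieves all linked records for a customer with one lookup.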
I'm trying to think out loud here to understand if GraphQL is a likely candidate for my need.
We have a home-grown, web-based, self-service report creation tool. It starts with the user selecting a particular report type.
The report type in itself is a base SQL query. In subsequent screens, one can select the required columns, filters, etc. The output of all these steps is a SQL query, which is then run on an Oracle database.
As you can see, there are a lot of cons with this tool. It is tightly coupled to the Oracle OLTP tables, and there are hundreds of tables.
Given the current data model, and the presence of many tables, I'm wondering if GraphQL would be the right approach to design a UI that could act like a "data explorer". If I could combine some of the closely related tables and abstract them via GraphQL into logical groups, I'm wondering if I could create a report out of them.
**Logical Group 1**
Table1
Table2
Table3
Table4
Table5
**Logical Group 2**
Table6
Table7
Table8
Table9
Table10
and so on..
Let's say, I want 2 columns from tables in Logical group 1 and 4 Columns from Logical Group 2, is this something that could be defined as a GraphQL object and retrieved to be either rendered on a screen or written to a file?
I think I'm trying to write a data modelling UI via GraphQL. Is this even a good candidate for such a need?
We have also been evaluating Looker as a possible data modelling layer. However, it seems like there could be some
Thanks.
Without understanding your data better, it is hard to say for certain, but at first glance, this does not seem like a problem that is well suited to GraphQL.
GraphQL's strength is its ability to model + traverse a graph of data. It sounds to me like you are not so much traversing a continuous graph of data as cherry picking tables from a DB. It certainly is possible, but there may be a good deal of friction since this was not its intended design.
The litmus test I would use is the following two questions:
Can you imagine your problem mapping well to a REST API?
Does your API get consumed by performance sensitive clients?
If so, then GraphQL may serve your needs well; if not, you may want to look at something like https://grpc.io/
In our project we have a requirement to create dynamic notifications that "pop" in our site when a relevant rule applies.
We are based on oracle exadata as our main database.
This feature is supposed to allow users to create dynamic rules that will be checked periodically.
These rules may check specific fields in certain types, and may also check these fields relative to the field data of other types.
For example, if our program has a table of cars, with a location column, and another table of streets, with location column (no direct relation between those two tables), we might need to notify the users if a car is in a certain street.
Is there a good platform that can help us calculate the kind of "rules" that we want to check?
We started looking at Elasticsearch and Neo4j (we have a specific module that involves graph-like relations), but we aren't sure that they would be the right solution.
Any idea would be appreciated :)
Neo4j could help you to express your rules, but it sounds as if your disconnected data is rather queried by SQL style joins?
So if you want to express and manage your rules in predicates in the graph you can do that easily and then get a list of applicable rules to trigger queries in other databases.
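Independent of where the rules are stored, the idea of a rule as a predicate over two otherwise unrelated types can be sketched as below. This is an illustrative Python sketch only: the `Rule` class, the `evaluate` function and the sample cars and streets are hypothetical, locations are simplified to plain strings, and the nested loop stands in for the SQL-style join that would normally run in the database.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A named predicate over one row of each of two types."""
    name: str
    left_type: str
    right_type: str
    predicate: Callable[[dict, dict], bool]


def evaluate(rule, data):
    """Return every (left, right) pair for which the rule applies."""
    matches = []
    for left in data.get(rule.left_type, []):
        for right in data.get(rule.right_type, []):
            if rule.predicate(left, right):
                matches.append((left, right))
    return matches


def sample_data():
    """Made-up rows for two tables with no direct relation."""
    return {
        "cars": [
            {"id": "car-1", "location": "A"},
            {"id": "car-2", "location": "B"},
        ],
        "streets": [
            {"name": "Main Street", "location": "A"},
            {"name": "Side Street", "location": "B"},
        ],
    }


def car_on_main_street_rule():
    return Rule(
        name="car on Main Street",
        left_type="cars",
        right_type="streets",
        predicate=lambda car, street: (
            street["name"] == "Main Street"
            and car["location"] == street["location"]
        ),
    )
```

Calling `evaluate(car_on_main_street_rule(), sample_data())` yields the pairs that should trigger a notification; a scheduler would run each applicable rule at the chosen interval.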