Faster retrieval of relationship properties in Neo4J graph database - performance

In my data model I take a user's statement with hashtags, turn each hashtag into a node, and model their co-occurrence as a relationship between them. For each relationship I need to take into account:
the user who created it - rel.user property
the time it was created - rel.timestamp property
the context it was created in - rel.context property
the statement it was made in - rel.statement property
Now, Neo4j doesn't allow relationship property indexing, so when I do a search that requires me to retrieve and evaluate those properties, it takes a very long time. Specifically, when I issue a Cypher request of the kind:
MERGE (hashtag1)-[rel:TO {
        context: "deb659c0-a18d-11e3-ace9-1fa4c6cf2894",
        statement: "824acc80-aaa6-11e3-88e3-453baabaa7ed",
        user: "b9745f70-a13f-11e3-98c5-476729c16049"
    }]->(hashtag2)
ON CREATE SET
    rel.uid = "824f6061-aaa6-11e3-88e3-453baabaa7ed",
    rel.timestamp = "13947117878770000";
This request first checks whether a relationship with those properties exists; if it does, nothing happens, and if it doesn't, a new one is added (with a unique ID and timestamp). Because each relationship has to be evaluated – and they are not indexed – the request takes a very long time to go through. This is becoming a problem because I'm dealing with about 100 nodes and 300 relationships in one query (the one above is only one type; there are a few others in the query as well, but those above are the most expensive ones).
Therefore the actual question:
Does anybody know of a good way to store those relationship properties so that they can be retrieved and evaluated faster when needed? Or do you think I should use a different type of request (if so, which one)?
Thank you!

This almost looks to me as if your relationship should actually be a node, which would then be connected to other nodes:
context
user
statement
tag1
tag2
tagN
Then you can have sensible merge options (e.g. merge on UID), as sketched below.
Currently you lose the power of the graph model for your relationships.
This is also discussed in the graph-databases book in the chapter with the email domain.
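For illustration, a minimal Cypher sketch of that reified model, merging on the generated UID as suggested above. The labels, relationship types, and the assumption that the surrounding nodes already exist are mine, not taken from the original schema:

// Index the single merge key once (Neo4j 2.x schema index syntax):
CREATE INDEX ON :CoOccurrence(uid);

// Then reify each co-occurrence as a node and MERGE on that one indexed property:
MATCH (hashtag1:Hashtag {name: "tag1"}), (hashtag2:Hashtag {name: "tag2"}),
      (user:User {uid: "b9745f70-a13f-11e3-98c5-476729c16049"}),
      (ctx:Context {uid: "deb659c0-a18d-11e3-ace9-1fa4c6cf2894"}),
      (stmt:Statement {uid: "824acc80-aaa6-11e3-88e3-453baabaa7ed"})
MERGE (c:CoOccurrence {uid: "824f6061-aaa6-11e3-88e3-453baabaa7ed"})
ON CREATE SET c.timestamp = "13947117878770000"
MERGE (c)-[:TAGS]->(hashtag1)
MERGE (c)-[:TAGS]->(hashtag2)
MERGE (c)-[:BY]->(user)
MERGE (c)-[:IN_CONTEXT]->(ctx)
MERGE (c)-[:IN_STATEMENT]->(stmt);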

Do you already have your hashtag1 and hashtag2 nodes available?
And if so, how many rels already exist between these?
What Cypher has to do for this to work is go over each of those relationships and compare all three properties (which I'm not sure will fit into short-string storage), so they have to be loaded if they are not in the cache. You can check your store files: if you have a large string-store file, those UIDs might not fit into the property records and have to be loaded separately.
What is the memory config of your server (heap and mmio)?
All that adds up.

Related

DynamoDB Throughput vs Search time

I've just figured out a big mistake I made while creating the DynamoDB structure.
I've created 11 tables, one of which is the table most referred to while the others are complementary tables.
For example, I have a table where I hold names (together with other info) called "Names" and another table called "NamesMappings" holding all the names added to the "Names" table. Each time a user wants to add a name to the "Names" table, he first tries to put the name in "NamesMappings", and only if that succeeds (meaning the name doesn't exist yet) can he add the name to the "Names" table. This procedure helps because the name is not unique and is not the primary key in the "Names" table; with this technique I don't have to search the "Names" table for the name, but can instead try to add it to the "NamesMappings" table, and only if that succeeds do I know it is a unique name.
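A minimal sketch of that put-if-absent step using the AWS SDK for .NET (the table layout, with name as the hash key of "NamesMappings", and the sample value are my assumptions):

using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using System.Collections.Generic;

var client = new AmazonDynamoDBClient();
try
{
    // The write succeeds only if no item with this name exists yet.
    await client.PutItemAsync(new PutItemRequest
    {
        TableName = "NamesMappings",
        Item = new Dictionary<string, AttributeValue>
        {
            ["name"] = new AttributeValue { S = "some name" }
        },
        ConditionExpression = "attribute_not_exists(#n)",
        ExpressionAttributeNames = new Dictionary<string, string> { ["#n"] = "name" }
    });
    // Safe to insert into the "Names" table now.
}
catch (ConditionalCheckFailedException)
{
    // The name already exists.
}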
First of all, I would like to ask whether this is a common approach or there is a better one.
Next, I realized that with this design I soon reached 11 tables, each with 5 provisioned capacity units for reads and writes, which adds up to 55 provisioned read and write units under the free tier. Then I understood why I get all these charges each month: as the number of tables grows and I leave the provisioned capacity at the default (both read and write capacity at 5), I accumulate more and more provisioned capacity.
So, what should my conclusion be? Should I try to reduce the number of tables even if it takes more effort to perform scans and queries inside a table? Or should I split the tables as I do now but reduce the capacity of the mapping tables that are used only to indicate whether an item exists in another table?
If I understand your problem correctly, you're missing the whole concept of NoSQL databases.
Your Names table should have a Hash Key (which is similar to a Primary Key) containing a uniformly generated identifier (a UUID is a great candidate). This automatically makes the table queryable by this unique identifier. You said, however, that you don't know the ID but only the Name. This leads me to think you could create a Global Secondary Index (GSI) on the Name attribute of the Names table so you can also query by Name. Up to this point, your table structure should look like this:
id | name
Both of them are independently queryable, which gives you a lot of flexibility already.
Now, let's say you want to add the NameMapping attribute (which I don't know what it looks like): you can simply add it to the Names table, getting rid of the NamesMappings table and greatly reducing the number of WCUs and RCUs across your account. Your table structure should now look like this:
id | name | mappings
where mappings is, let's say, a JSON object.
Since you can only query on top-level attributes in DynamoDB, you can now perform a query against the name attribute, which has a GSI configured. If the query returns nothing, then name is unique. But let's say you still need some data inside the mappings object; then you could query by name and, in your code, apply a map/filter/reduce operation on the mappings attribute and decide what to do next.
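As a rough sketch of that uniqueness check with the AWS SDK for .NET (the GSI name name-index and the sample value are assumptions):

using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using System.Collections.Generic;

var client = new AmazonDynamoDBClient();
var response = await client.QueryAsync(new QueryRequest
{
    TableName = "Names",
    IndexName = "name-index",                 // assumed GSI on the name attribute
    KeyConditionExpression = "#n = :name",
    ExpressionAttributeNames = new Dictionary<string, string> { ["#n"] = "name" },
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":name"] = new AttributeValue { S = "some name" }
    }
});
bool nameIsUnique = response.Count == 0;      // nothing returned -> name not taken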
Remember that duplication is just fine in a NoSQL world. This may look scary if you come from a purely SQL background, but in NoSQL databases data should be stored in such a way that you can fetch all the needed information in one go, thereby avoiding "joins" (joins are still possible, but since there are no strong relationships between entities, you have to perform them manually at the code level). To give you some real context, imagine you have an Orders table where you keep track of the ordered Products and the Store that the Order belongs to: you'd save both the Products and the Store objects (and not their IDs, as you would in SQL) inside the Order object. If you want to query for a given OrderId in the future, you won't need to make extra calls (aka "joins") to the Product/Store tables, since everything is already stored inside the Order object.

What would be the most appropriate data structure given these requirements?

We are building a Search API at our company for some of our entities - events, leagues and sports - each of which has a name property, and we are having difficulties implementing the business requirements.
TL;DR: What data structure will address these business requirements better than a basic red-black tree does?
What are the business requirements?
1. The data structure needs to be sorted, so the following requirements are easier to implement; insertion should not break this property.
2. The data structure needs to hold information about its entities: the node key (the entity's name property) is used for searching, but the node needs to hold all the entities whose name property starts with the node key value.
3. The data structure needs to support deletion by id. Id is also a property of all entities.
4. It needs to support index search (up to 3 characters), so if someone searches for "aaa", every node with a key between "aaaa.." and "aaaz" should appear (e.g. query = "aaa", index = "aaa", "aaab", "aaaab", "aaaz"; the result should be "aaa", "aaab", "aaaab").
5. We need to search by localized node key.
What have we done so far?
In our first iteration we used the built-in red-black tree (SortedSet in C#), and for nodes we had a structure holding the name property of the entity and all events related to that name. With one helper method we satisfied business requirements (1), (2) and (4).
In our second iteration we had to support deletion, so we created a map (Dictionary) from entity IDs to references to the entity objects in the SortedSet. We did that because deletion requests come only by id and we cannot recreate an entity from its id, so on addition we need to maintain such a map (maybe augmentation can help?). With this we covered requirement (3).
Now we need to support (5); however, with every iteration (business requirement we receive) it is getting harder and harder to implement, and I almost feel we need to change our data structure to address the business criteria better.
What's the problem with the localization?
We could create a new SortedSet and reuse the implementation, but this comes with a huge trade-off. Let me elaborate.
We have about 100 clients, each of which has 7-8 supported languages. Languages in our system are unique per client, so translations for one customer do not interfere with another's (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that, we have base languages (global for every client) which are basically the default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (say, English) is the same as the base one. Having said all that, if we want accurate search for each client and locale individually, we need an index for each client and locale individually, which in turn introduces massive amounts of duplication.
What have I thought of so far?
I am not an expert in data structures myself, but I really want to get this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some binary tree (AVL, red-black, 2-3-4, etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This would hopefully solve a lot of the issues and workarounds we have had to make so far and, as I said, address future requirements better, so implementation is faster and more accurate. However, I am not an expert in data structures, and sadly I am unable to map these business requirements to a data structure in the time frame I have. So, without further ado, do you have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, and the value is the product data. That gives you very quick insertion, and removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
class IndexEntry
{
    string ProductName;
    string ProductId;   // or int, if ProductId is an integer
}
Since you allow customer-specific names, you'll have to add all those customer names to this index. Not a problem, but when you remove something by ID, you'll also have to remove the associated items from the other data structure. This will require a sequential search of the name index data structure to ensure that you get all the names associated with a particular product. That could be expensive, even if you use a tree structure.
To speed things up, you could have a "deleted" flag for those index entries, and then rebuild the structure periodically to remove the deleted items. That way, a deletion just requires a sequential scan. That's less than ideal, but if insertions and deletions are infrequent, quite acceptable.
The key, though, is to have your primary data structure, the one that holds the product information, indexed by product id. You can then build secondary indexes any way you want.
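A minimal C# sketch of this arrangement, reusing the IndexEntry above with a tombstone flag added (the sorting and search details are illustrative, not prescriptive; a production version would binary-search the sorted index instead of scanning):

using System;
using System.Collections.Generic;
using System.Linq;

class Product
{
    public string Id = "";
    public string Name = "";
}

class IndexEntry
{
    public string ProductName = "";
    public string ProductId = "";
    public bool Deleted;                      // tombstone, purged on rebuild
}

class Catalog
{
    // Primary structure: O(1) insertion and removal by product id.
    private readonly Dictionary<string, Product> products = new();

    // Secondary structure: name index kept sorted for prefix queries.
    private List<IndexEntry> nameIndex = new();

    public void Add(Product p, params string[] customerNames)
    {
        products[p.Id] = p;
        foreach (var name in customerNames.Prepend(p.Name))
            nameIndex.Add(new IndexEntry { ProductName = name, ProductId = p.Id });
        nameIndex.Sort((a, b) => string.CompareOrdinal(a.ProductName, b.ProductName));
    }

    public void Remove(string id)
    {
        products.Remove(id);
        // Sequential scan: tombstone every name entry for this id.
        foreach (var e in nameIndex)
            if (e.ProductId == id) e.Deleted = true;
    }

    public IEnumerable<Product> SearchByPrefix(string prefix) =>
        nameIndex
            .Where(e => !e.Deleted &&
                        e.ProductName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .Select(e => products[e.ProductId])
            .Distinct();

    // Periodic rebuild: drop the tombstoned entries in one pass.
    public void Compact() => nameIndex = nameIndex.Where(e => !e.Deleted).ToList();
}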

Core Data add index to relationships?

Is there a way to add an index to a relationship in an Entity? I can see that attributes can be indexed, but not relationships.
I have a large dataset and need to check, in a predicate, which objects actually have (or do not have) a relationship object:
[NSPredicate predicateWithFormat:@"relationshipField = nil"]
So I thought it would be a good idea to index that relationship.
So my question is: can this be done? And if not, is there a performance strategy for this scenario?
You can't do this manually but you don't need to because — empirically speaking — Core Data does it automatically where it can. So that means that 'to one' relationships are automatically given indexes. Check your SQLite store for names like ZENTITY_ZRELATIONSHIP_INDEX to see the proof.
There is no meaningful way to add an index for a 'to many' relationship so Core Data couldn't do that automatically or expose the option. SQLite's format for indices is just a list of rows sorted by the thing being indexed, into which it performs a binary search. It has no way of building such a structure for a multivalued column, and multivalued columns are how Core Data stores to-many links.
Disclaimer: of course, these are all implementation-specific observations that appear in my subjective experience to apply to the current implementation of Core Data. But as a general rule I think you could safely say that Core Data will optimise for relationship lookups as best as it can.

How to Program a Spring with Hibernate web app?

I am working on a web application where I have 90 fields for a Person class, divided into family details, education details, personal details, etc.
I want a separate form for each, e.g. the family details form has father name, mother name, siblings, etc., and so on for the others.
I want a separate table for each set of details, with a common reference id across all tables.
My question is: how many bean classes should I write? Can I map from multiple forms to multiple tables with one bean class?
class PersonRegister {
    private Long id;
    private String emailId;
    private String password;
    // ...
} // for registration
Once logged in, I need to maintain his/her details.
Either
class Person {
}
or
class PersonFamilyDetails {}
class PersonEducationDetails {}
etc.
Which approach do software development standards recommend?
Don't go overboard. I believe that in your case a single but very wide (i.e. with a lot of columns) table would be most efficient and simplest from a maintenance perspective. The only thing to keep in mind is to query only for the necessary subset of columns/fields when loading lots of rows; otherwise you'll be fetching kilobytes of unnecessary data not needed for the particular use case.
Unfortunately Hibernate doesn't have direct support for that: when designing a mapping for Person you'll end up with a huge class, and even worse, Hibernate will always fetch all simple columns (and many-to-one relationships). You can, however, overcome this problem either by creating several views in the database containing only a subset of columns, or by having several Java classes mapping to the same table but only to a subset of columns, as sketched below.
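A minimal sketch of the second option, several classes mapped to the same table (the entity and column names are illustrative assumptions, not from the question):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// PersonCredentials.java - narrow view of PERSON for login screens
@Entity
@Table(name = "PERSON")
public class PersonCredentials {
    @Id private Long id;

    @Column(name = "EMAIL_ID") private String emailId;
    @Column(name = "PASSWORD") private String password;
    // getters/setters omitted
}

// PersonFamilyDetails.java - another view of the same table
@Entity
@Table(name = "PERSON")
public class PersonFamilyDetails {
    @Id private Long id;

    @Column(name = "FATHER_NAME") private String fatherName;
    @Column(name = "MOTHER_NAME") private String motherName;
    // getters/setters omitted
}

Note that since both classes map to the same rows, writes should go through only one of the mappings to avoid the entities overwriting each other's columns.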
Splitting your database model into several tables is beneficial only if your schema is not normalized. E.g. when storing siblings first name and last name you may wish to have a separate Sibling table and next time some other family member is entered, you can reuse the same row. This makes database smaller and might be faster when searching by sibling.
Your question comes down to database normalization, as described in-depth by Boyce and Codd, see
http://en.wikipedia.org/wiki/Database_normalization.
The main advantage of database normalization is avoiding modification anomalies. In your case, if you have one table with, for each person, e.g. father-firstname and father-lastname, and you have multiple people with the same father, this data will be duplicated; when you discover a typo in father-lastname, you could modify it for one sibling and not for the next.
In this simplified case, database design best practices would call for a first normalization into a separate table with father-id, father-firstname and father-lastname, with a many-to-one relation from your person table to it.
For one-to-one relations, e.g. person->personeducationdetails, there's some debate. In the original definition of First Normal Form, every optional field would be normalized by putting it in its own table. This was later weakened by the introduction of 'null' in relational databases, see http://en.wikipedia.org/wiki/First_normal_form#cite_note-CoddRule-12. Still, if a whole set of columns can be null at the same time, you should put them in a separate table with a one-to-one relation.
E.g. if you don't know a person's education details, all of the related fields are null, so you had better split them off into a separate table and simply not have a personeducationdetails record for that person.
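A sketch of that normalized layout in SQL (table names and column types are illustrative):

-- Shared parent table: one row per father, reused across siblings.
CREATE TABLE father (
    father_id  BIGINT PRIMARY KEY,
    firstname  VARCHAR(100),
    lastname   VARCHAR(100)
);

CREATE TABLE person (
    person_id  BIGINT PRIMARY KEY,
    email_id   VARCHAR(255),
    father_id  BIGINT REFERENCES father (father_id)   -- many persons, one father
);

-- Optional one-to-one split: a row exists only when the details are known.
CREATE TABLE person_education_details (
    person_id  BIGINT PRIMARY KEY REFERENCES person (person_id),
    degree     VARCHAR(100),
    university VARCHAR(255)
);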

Performance benefit? Passing Class instances or just IDs. (ASP.net MVC3)

The small web application I am working on is getting bigger and bigger. I've noticed that, when posting forms or calling other functions, I've been passing parameters that are sometimes IDs and sometimes whole instances of a Model class.
From a performance standpoint, is it better to pass the whole Model object (filled with values), or should I pass the ID and retrieve the object from the database?
Thanks!
For performance benefits you can do a lot of things; common ones are:
1) Fetch only as many records as are needed, e.g. customized paging (in LINQ, use the Skip and Take methods)
2) Use data caching in controllers and cache dependencies for lists that are bound to the view
3) Use compiled queries to fetch records (see the sketch after this list)
Apply all these and you'll see a marked improvement in page load speed.
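As a sketch of point 3, a compiled LINQ-to-SQL query (the MyDataContext and Customer types are assumptions for illustration):

using System;
using System.Data.Linq;
using System.Linq;

static class CustomerQueries
{
    // Compiled once and reused on every call, skipping repeated query translation.
    public static readonly Func<MyDataContext, int, Customer> ById =
        CompiledQuery.Compile((MyDataContext db, int id) =>
            db.Customers.Single(c => c.Id == id));
}

// usage: var customer = CustomerQueries.ById(db, 42);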
EDIT: Regarding the ID recommendation: in this question, the performance impact will be the same whether you pass only the ID and fetch the rest of the model from the database, or pass the filled model.
Do not solve problems which do not exist yet. Use a tool to measure the performance problem and then try to solve.
It is always best to consider these from the use case.
For example, if I want to get an item by ID, then I pass the ID, not the whole object with the ID filled out.
I use WCF services to host my BLL and interface to my DAL, so passing data around is a costly exercise, and I do it sparingly.
If I need to update an object, I pass the object; if I just want to perform an action on an object, such as delete or get, I use the ID.
Si
