Core Data add index to relationships? - cocoa

Is there a way to add an index to a relationship in an Entity? I can see that attributes can be indexed, but not relationships.
I have a large dataset and need to check which objects do or do not have a related object, using a predicate such as:
[NSPredicate predicateWithFormat:@"relationshipField = nil"]
So I thought it would be a good idea to index that relationship.
So my question is: can a relationship be indexed? And if not, is there a performance strategy for this scenario?

You can't do this manually but you don't need to because — empirically speaking — Core Data does it automatically where it can. So that means that 'to one' relationships are automatically given indexes. Check your SQLite store for names like ZENTITY_ZRELATIONSHIP_INDEX to see the proof.
There is no meaningful way to add an index for a 'to many' relationship so Core Data couldn't do that automatically or expose the option. SQLite's format for indices is just a list of rows sorted by the thing being indexed, into which it performs a binary search. It has no way of building such a structure for a multivalued column, and multivalued columns are how Core Data stores to-many links.
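To make the "sorted list plus binary search" picture concrete, here is a minimal Python sketch. It is not Core Data or SQLite code: the names `build_index` and `rows_matching` are made up for illustration, and a missing (nil) target is modelled as the key 0 purely as an assumption so that all keys stay comparable. It models an index on a to-one foreign key as (key, rowid) pairs sorted by key.

```python
from bisect import bisect_left, bisect_right

NO_TARGET = 0  # assumption: stands in for a NULL / nil foreign key


def build_index(rows):
    """Return (key, rowid) pairs sorted by key, like a single-column index."""
    return sorted((fk, rowid) for rowid, fk in rows.items())


def rows_matching(index, key):
    """Binary-search the sorted index for every rowid whose key equals `key`."""
    lo = bisect_left(index, (key, float("-inf")))
    hi = bisect_right(index, (key, float("inf")))
    return [rowid for _, rowid in index[lo:hi]]


# rowid -> foreign key of the to-one relationship (hypothetical data)
rows = {1: 7, 2: NO_TARGET, 3: 7, 4: 9, 5: NO_TARGET}
index = build_index(rows)
unlinked = rows_matching(index, NO_TARGET)   # rows with no related object
linked_to_7 = rows_matching(index, 7)        # rows pointing at object 7
```

A single key per row is what makes this ordering possible; a row holding several target ids at once has no single position in such a list, which is the limitation described above for to-many links.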
Disclaimer: of course, these are all implementation-specific observations that appear in my subjective experience to apply to the current implementation of Core Data. But as a general rule I think you could safely say that Core Data will optimise for relationship lookups as best as it can.


Are Doctrine relations affecting application performance?

I am working on a Symfony project with a new team, and they have decided to avoid Doctrine relations as much as they can because of performance issues.
For instance, I have to store the id of my "relation" instead of using a ManyToOne relation.
But I am wondering whether it is a real problem.
The thing is, it changes the way we have to code to retrieve information, and so on.
The performance issue most likely comes from the fact that the queries are not optimised.
If you let Doctrine (the ORM Symfony uses to handle queries) build the queries itself (by using findBy(), findAll(), findOneBy(), etc.), it will first fetch what you asked for, then run additional queries whenever it needs data from other tables.
Let's take the most common example, a library.
One Book has one Author, but one Author can have many Books (Book <= ManyToOne => Author)
One Book is stored in one Shelf (Book <= OneToOne => Shelf)
Now if you query a Book, Doctrine will also fetch the Shelf, as it's a OneToOne relation.
But it won't fetch the Author. In your object, you will only have access to the author's id, as this information is stored in the Book itself.
Thus, if your Twig view outputs another property of the author inside {{ }}, that information wasn't fetched by the initial query, so Doctrine will run an extra query to fetch the data about the author of the book.
To prevent this, you have to customise your query so that it gets the required data in one go, like this:
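public function getBookFullData(Book $book) {
    // Assumed reconstruction: builder setup and the 'book.author' path are inferred.
    $qb = $this->createQueryBuilder('book')
        ->addSelect('shelf', 'author')
        ->join('book.shelf', 'shelf')
        ->join('book.author', 'author');
    return $qb->getQuery()->getResult();
}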
With this custom query, you can get all the data of one book in one go, thus, Doctrine won't have to do an extra query.
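The difference in query count can be reproduced outside Doctrine. The following is a rough Python sketch using the standard `sqlite3` module, not Doctrine itself; the table layout, the `run` helper and the sample rows are assumptions made for illustration. It lists books with their author's name twice: once with one extra query per book, and once with a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 'A1', 1), (2, 'A2', 1), (3, 'B1', 2);
""")

stats = {"queries": 0}


def run(sql, params=()):
    """Execute one SQL statement, count it, and return all rows."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def lazy_listing():
    """One query for the books, then one more per book for its author."""
    books = run("SELECT title, author_id FROM book ORDER BY id")
    return [(title, run("SELECT name FROM author WHERE id = ?", (author_id,))[0][0])
            for title, author_id in books]


def joined_listing():
    """A single query that joins the author in, like the custom query above."""
    return run("SELECT book.title, author.name FROM book "
               "JOIN author ON author.id = book.author_id ORDER BY book.id")


lazy = lazy_listing()
lazy_queries = stats["queries"]      # 1 query for the books + 1 per book
stats["queries"] = 0
joined = joined_listing()
joined_queries = stats["queries"]    # a single joined query
```

Both listings return the same rows; only the number of round trips differs, and the lazy variant grows with the number of books.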
So, while the example is rather simple, I'm sure you can understand that in big projects, giving Doctrine free rein will just increase the number of extra queries.
One of my projects, before optimisation, reached 1500 queries per page load...
On the other hand, it's not good to ignore relations in a database.
In fact, a database generally answers queries faster with foreign keys backed by indexes than without them.
If you want your app to be as fast as possible, you have to use relations to optimise your database query speed, and optimise your Doctrine queries to avoid an excessive number of extra queries.
Lastly, I will say that order matters.
Using ORDER BY to fetch parents before children will also greatly reduce the number of queries Doctrine might run on its own.
You can also change the fetch mode in your entity annotations to "optimise" Doctrine's pre-made queries.
But it's not a smart mechanism, and it often doesn't provide what we really need.
Thus, custom queries are the best choice.

What would be the most appropriate data structure given these requirements?

We are building a Search API in our company for some of our entities (events, leagues and sports), each of which has a name property, and we are having difficulties implementing the business requirements.
TL;DR: Which data structure would address these business requirements better than a basic red-black tree does?
What are the business requirements?
1. The data structure needs to be sorted, so that the following requirements are easier to implement; insertion must therefore not break this property.
2. The data structure needs to hold information about its entities: the node key (the entity's name property) will be used for searching, but the node needs to hold all the entities whose name property starts with the node key value.
3. The data structure needs to support deletion by id. Id is also a property of all entities.
4. It needs to support index search (up to 3 characters), so if someone searches for "aaa", every node with a key between "aaaa.." and "aaaz" should appear (ex. query = "aaa", index = "aaa", "aaab", "aaaab", "aaaz", result should be "aaa", "aaab", "aaaab").
5. We need to search by localized node key.
What have we done so far?
We started our first iteration using the built-in red-black tree (SortedSet in C#), and for nodes we had a structure that holds the name property of the entity and all events related to that name. With one helper method we satisfied business requirements (1), (2) and (4).
In our second iteration we had to support deletion, so we created a map (Dictionary) from entity ids to references to the entity objects put into the SortedSet. We do that because a deletion request contains only the id and we cannot recreate the entity from its id, so we need to build this map at insertion time (maybe augmentation can help?). With this we covered requirement (3).
Now we need to support (5). However, with every iteration (every business requirement we receive) it is getting harder and harder to implement, and I almost feel like we need to change our data structure in order to address the business criteria better.
What's the problem with the localization?
We can create a new SortedSet and re-use the implementation, but this comes with a huge trade-off. Let me elaborate.
We have 100s of clients, each of which has around 7-8 supported languages. Languages in our system are unique per client, so translations for one customer do not interfere with another's (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that, we have base languages (global for every client), which are basically default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (let's say English) is the same as the base one. Having said all of that, if we want accurate search for each client and locale individually, we need an index for each client and locale individually, which in turn introduces massive amounts of duplication.
What have I thought of so far?
I am not an expert in data structures myself, but I really want to get this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some binary tree (could be AVL, red-black, 2-3-4, etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This would hopefully solve a lot of the issues and workarounds we have had so far and, as I said, address future requirements better, so that implementation is faster and more accurate. However, I am not an expert in data structures and sadly I am unable to map these business requirements to a data structure in the time frame I have, so without further ado, do you have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, and the value is the product data. That gives you very quick insertion, and removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
class IndexEntry
{
    string ProductName;
    string ProductId; // or int, if ProductId is an integer
}
Since you allow customer-specific names, you'll have to add all those customer names to this index. Not a problem, but when you remove something by ID, you'll also have to remove the associated items from the other data structure. This will require a sequential search of the name index data structure to ensure that you get all the names associated with a particular product. That could be expensive, even if you use a tree structure.
To speed things up, you could have a "deleted" flag for those index entries, and then rebuild the structure periodically to remove the deleted items. That way, a deletion just requires a sequential scan. That's less than ideal, but if insertions and deletions are infrequent, quite acceptable.
The key, though, is to make your primary data structure that holds the product information indexed by product id. You can then build secondary indexes any way you want.
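As a rough illustration of this layout, here is a Python sketch (the answer's own snippet is C#; Python is used here only to keep the example short). The `Catalog` class and its method names are hypothetical. The secondary index is a sorted list of (name, id) pairs searched with `bisect`, and the "deleted" flag is kept as a set of ids rather than a per-entry flag, which is a variation on the suggestion above that avoids scanning the index on deletion.

```python
from bisect import bisect_left, insort


class Catalog:
    def __init__(self):
        self.by_id = {}        # primary structure: id -> entity data
        self.name_index = []   # secondary index: sorted (name, id) pairs
        self.deleted = set()   # ids flagged as deleted, purged on rebuild

    def add(self, entity_id, names):
        """Store the entity and index every name (e.g. one per client locale)."""
        if entity_id in self.deleted:
            self.rebuild()     # purge stale entries before reusing the id
        self.by_id[entity_id] = {"names": list(names)}
        for name in names:
            insort(self.name_index, (name.lower(), entity_id))

    def remove(self, entity_id):
        """Delete by id; index entries are only flagged, not removed."""
        del self.by_id[entity_id]
        self.deleted.add(entity_id)

    def search(self, prefix):
        """Return ids whose indexed name starts with `prefix`, in name order."""
        prefix = prefix.lower()
        start = bisect_left(self.name_index, (prefix,))
        found = []
        for name, entity_id in self.name_index[start:]:
            if not name.startswith(prefix):
                break
            if entity_id not in self.deleted and entity_id not in found:
                found.append(entity_id)
        return found

    def rebuild(self):
        """Periodic cleanup: drop index entries of deleted entities."""
        self.name_index = [e for e in self.name_index if e[1] not in self.deleted]
        self.deleted.clear()


catalog = Catalog()
catalog.add(1, ["Football", "Soccer"])   # two names for the same entity
catalog.add(2, ["Foosball"])
catalog.add(3, ["Tennis"])
hits = catalog.search("foo")
catalog.remove(2)
hits_after = catalog.search("foo")
```

Because all names of an entity point at the same id, client-specific names add index entries without duplicating the entity data itself.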

Modeling many-to-many relationship in data warehouse

I have to design data warehouse model and ETL process for class at my University. My data warehouse has to store opinions / comments about a product, each record should consist of:
comment text (String)
product score ({0, 0.5, … , 4.5, 5})
comment author (String)
comment date (Date)
product recommendation ({Yes, No})
comment up votes (Int)
comment down votes (Int)
product pros (many Strings, e.g {price, design, durability, … }) and its count
product cons (many Strings, e.g {too loud, too heavy, price, … }) and its count
In addition data warehouse should store information about product:
product category
product brand
product model
I want to create the data warehouse model first, but I have a problem with storing product pros and cons, as it is a many-to-many relationship. In a normal relational database I would simply create an associative table, but here I am not sure how to proceed; after all, I don't want to normalize the fact table.
I am considering 3 approaches. The first is the one I presented in the diagram below. I used the bridge table method (though I don't know if correctly) to get rid of the many-to-many relationship. I don't know how it will impact query performance.
The second approach I may use is the boolean column method. In the PROS and CONS tables I can create a column for each possible value, but there can be up to 100 different pros or cons. Also, the number of possible pros or cons is not constant over time. Authors can list new pros or cons in their comments (that's how it works in the data source), but I can't add new columns (I shouldn't change data in the data warehouse).
The third approach I am considering is to keep the pros in the PROS table but in 1 column, where values are separated by commas or some other delimiter, e.g. "price, design, color". It keeps things simple but is hard to analyze or slice & dice.
Which approach should I use in this situation? Which is better for loading data into the data warehouse, given that I will get all the comments from the data source and want to load only the comments that are new since the last load?
I think your first option, slightly modified, would be the best.
In the image you have provided, the Pros_Bridge_Detail table is fine. The rest needs to be changed.
You can remove the Pros_Bridge table that holds just the count. You can add that column to your COMMENT fact table instead. That would be more efficient and makes queries easier than querying across many tables.
You said you have many kinds of pros, like price, design, durability, etc. Put those into a separate dimension.
Add a new column to your Pros_Bridge_Detail table to hold the ID of the newly created dimension that holds the product pro types (design, durability, etc.).
Now, once you add a product pro, the Pros_Bridge_Detail table will hold the pros the user gave, and the ID of the new dimension identifies which pro each one is.
Also don't forget to store the Comment ID in the Pros_Bridge_Detail table, as that will be your link (FK) to the COMMENT fact table.
The same can be done for Cons as well.
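A small sketch of that layout, using Python's built-in `sqlite3` module as a stand-in for the warehouse. The table and column names (`comment_fact`, `pro_dim`, `pros_bridge_detail`, `pros_count`) and the `load_comment` helper are assumptions chosen for illustration, not names taken from the diagram. New pros become new rows in the dimension, so no columns need to be added when authors invent new ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment_fact (
        comment_id INTEGER PRIMARY KEY,
        score REAL,
        pros_count INTEGER          -- count kept on the fact row
    );
    CREATE TABLE pro_dim (
        pro_id INTEGER PRIMARY KEY,
        pro_name TEXT UNIQUE        -- e.g. price, design, durability
    );
    CREATE TABLE pros_bridge_detail (
        comment_id INTEGER REFERENCES comment_fact(comment_id),
        pro_id INTEGER REFERENCES pro_dim(pro_id),
        PRIMARY KEY (comment_id, pro_id)
    );
""")


def load_comment(comment_id, score, pros):
    """Insert one comment and link it to its pros, adding unseen pros on the fly."""
    conn.execute("INSERT INTO comment_fact VALUES (?, ?, ?)",
                 (comment_id, score, len(pros)))
    for name in pros:
        conn.execute("INSERT OR IGNORE INTO pro_dim (pro_name) VALUES (?)", (name,))
        (pro_id,) = conn.execute(
            "SELECT pro_id FROM pro_dim WHERE pro_name = ?", (name,)).fetchone()
        conn.execute("INSERT INTO pros_bridge_detail VALUES (?, ?)",
                     (comment_id, pro_id))


load_comment(1, 4.5, ["price", "design"])
load_comment(2, 3.0, ["price"])

# Slice by pro: number of comments and average score per pro.
per_pro = conn.execute("""
    SELECT d.pro_name, COUNT(*), AVG(f.score)
    FROM pros_bridge_detail b
    JOIN pro_dim d ON d.pro_id = b.pro_id
    JOIN comment_fact f ON f.comment_id = b.comment_id
    GROUP BY d.pro_name ORDER BY d.pro_name
""").fetchall()
```

An incremental load then only needs to call something like `load_comment` for comments newer than the last load.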
I hope this explanation makes sense and helps. Let me know if you have any issues.

Faster retrieval of relationship properties in Neo4J graph database

In my data model I take a statement of a user with hashtags; each hashtag is turned into a node and their co-occurrence is the relationship between them. For each relationship I need to take into account:
the user who created it rel.user property
the time it was created - rel.timestamp property
the context it was created in - rel.context property
the statement it was made in - rel.statement property
Now, Neo4J doesn't allow relationship property indexing and so when I do the search that requires me to retrieve and evaluate those properties, it takes a very long time. Specifically, when I do a Cypher request of the kind:
MERGE hashtag1-[rel:TO
This request first checks if there is a relationship with those properties and if there is, it won't do anything, but if there is none, it will add a new one (with a unique ID and timestamp). So then because evaluation of each relationship has to take place – and they are not indexed –  it takes a very long time for this request to go through. Now I'm having a problem with such request because I'm dealing with about 100 nodes and 300 relations at one query (the one above is only 1 type, there are also a few others added to the query but those above are the most expensive ones).
Therefore the actual question:
Does anybody know of a good way to keep those relationship properties and to somehow make them work faster, so they can be retrieved and evaluated when needed faster? Or do you think I should use a different type of request (if yes, which?)
Thank you!
This almost looks to me as if your relationship should actually be a node, which then would be connected to nodes:
Then you can have sensible merge options (e.g. merge on UID).
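The idea can be sketched in plain Python. This is an in-memory model, not Neo4j or Cypher, and the `Graph` class, its method names and the sample data are all hypothetical. The statement becomes a node stored under its uid, so a merge is one keyed lookup instead of a comparison against every existing relationship between two hashtags.

```python
class Graph:
    def __init__(self):
        self.statements = {}   # uid -> statement node (stands in for a unique index)
        self.links = []        # (statement uid, hashtag) edges

    def merge_statement(self, uid, user, context, timestamp, hashtags):
        """Create the statement node once; merging the same uid again is a no-op."""
        if uid in self.statements:        # single keyed lookup, no relationship scan
            return False
        self.statements[uid] = {"user": user, "context": context,
                                "timestamp": timestamp}
        self.links.extend((uid, tag) for tag in hashtags)
        return True

    def co_occurring(self, hashtag):
        """Hashtags that share at least one statement with `hashtag`."""
        uids = {uid for uid, tag in self.links if tag == hashtag}
        return sorted({tag for uid, tag in self.links
                       if uid in uids and tag != hashtag})


g = Graph()
created = g.merge_statement("s1", "u1", "c1", 100, ["#a", "#b"])
repeated = g.merge_statement("s1", "u1", "c1", 100, ["#a", "#b"])
g.merge_statement("s2", "u1", "c1", 101, ["#b", "#c"])
neighbours = g.co_occurring("#b")
```

The user, context and timestamp live on the statement node, and co-occurrence is derived by walking through shared statements rather than stored on each hashtag-to-hashtag relationship.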
Currently you lose the power of the graph model for your relationships.
This is also discussed in the graph-databases book in the chapter with the email domain.
Do you already have your hashtag1 and hashtag2 nodes available?
And if so, how many rels already exist between these?
What Cypher has to do for this to work, is to go over each of those relationships and compare all 3 properties (which I'm not sure will fit into shortstring storage) so they have to be loaded if they are not in the cache. You can check your store files, if you have a large string-store file then those uid's might not fit into the property records and have to be loaded separately.
What is the memory config of your server (heap and mmio)?
All that adds up.

Hbase Schema Nested Entity

Does anyone have an example on how to create an Hbase table with a nested entity?
UserName (string)
SSN (string)
+ Books (collection)
The books collection would look like this for example
I cannot find a single example of how to create a table like this. I see many people talk about it, and how it is a best practice in certain scenarios, but I cannot find an example of how to do it anywhere.
Nested entities aren't an official feature of HBase; it's just a way some people talk about one usage pattern. In this pattern, you use the fact that "columns" in HBase are really just a big map (a bunch of key/value pairs) to let you model a dimension of cardinality inside the row by adding one column per "row" of the nested entity.
Schema-wise, you don't need to do much on the table itself; when you create a table in HBase, you just specify the name & column family (and associated properties), like so (in hbase shell):
hbase:001:0> create 'UsersWithBooks', 'cf1'
Then, it's up to you what you put in it, column wise. You could insert values like:
hbase:002:0> put 'UsersWithBooks', 'userid1234', 'cf1:username', 'my username'
hbase:003:0> put 'UsersWithBooks', 'userid1234', 'cf1:ssn', 'my ssn'
hbase:004:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_12345', '<isbn>12345</isbn><title>mary had a little lamb</title>'
hbase:005:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_67890', '<isbn>67890</isbn><title>the importance of being earnest</title>'
The column names are totally up to you, and there's no limit to how many you can have (within reason: see the HBase Reference Guide for more on this). Of course, doing this, you have to do your own legwork re: putting in and getting out values (and you'd probably do it with the java client in a more sophisticated way than I'm doing with these shell commands, they're just for explanatory purposes). And while you can efficiently scan just a portion of the columns in a table by key (using a column pagination filter), you can't do much with the contents of the cells other than pull them and parse them elsewhere.
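The "pull and parse elsewhere" step can be pictured with a plain Python dictionary standing in for one HBase row. This is only a model of the data layout, not HBase client code: the `nested_books` helper is hypothetical, and the cell values are encoded as JSON here instead of the XML-like strings in the shell commands, simply because the standard library parses JSON directly.

```python
import json

# One HBase-style row: column qualifier -> cell value. Real HBase stores bytes;
# plain strings are used here for readability.
row = {
    "cf1:username": "my username",
    "cf1:ssn": "my ssn",
    "cf1:book_id_12345": json.dumps(
        {"isbn": "12345", "title": "mary had a little lamb"}),
    "cf1:book_id_67890": json.dumps(
        {"isbn": "67890", "title": "the importance of being earnest"}),
}

BOOK_PREFIX = "cf1:book_id_"


def nested_books(row):
    """Collect the nested entities: every column whose qualifier has the book prefix."""
    return {
        qualifier[len(BOOK_PREFIX):]: json.loads(value)
        for qualifier, value in sorted(row.items())
        if qualifier.startswith(BOOK_PREFIX)
    }


books = nested_books(row)
```

The qualifier suffix acts as the nested entity's key, and everything else about the book has to be decoded from the cell by the application.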
Why would you do this? Probably only if you wanted atomicity around all the nested rows for one parent row. It's not very common; your best bet is probably to start by modeling them as separate tables, and only move to this approach if you really understand the tradeoffs.
There are some limitations to this. First, this technique only works to one level deep: your nested entities can’t themselves have nested entities. You can still have multiple different nested child entities in a single parent, and the column qualifier is their identifying attributes.
Second, it’s not as efficient to access an individual value stored as a nested column qualifier inside a row, as compared to accessing a row in another table, as you learned earlier in the chapter.
Still, there are compelling cases where this kind of schema design is appropriate. If the only way you get at the child entities is via the parent entity, and you’d like to have transactional protection around all children of a parent, this can be the right way to go.
