We are building a Search API in our company for some of our entities (events, leagues and sports), each of which has a name property, and we are having difficulty implementing the business requirements.
TL;DR: Which data structure would address these business requirements better than a basic red-black tree does?
What are the business requirements?
(1) The data structure needs to be sorted so that the following requirements are easier to implement; insertion must therefore preserve the ordering.
(2) The data structure needs to hold information about its entities: the node key (the entity's name property) is used for searching, but the node needs to hold all the entities whose name property starts with the node key value.
(3) The data structure needs to support deletion by id. Id is also a property of all entities.
(4) It needs to support index search (up to 3 characters), so if someone searches for "aaa", every node with a key between "aaaa.." and "aaaz" should appear (e.g. query = "aaa", index = "aaa", "aaab", "aaaab", "aaaz", result should be "aaa", "aaab", "aaaab").
(5) We need to search by localized node key.
What have we done so far?
We started our first iteration using the built-in red-black tree (SortedSet in C#), and for nodes we had a structure that holds the name property of the entity and all events related to that name property. With one helper method we satisfied business requirements (1), (2) and (4).
In our second iteration we had to support deletion, so we created a map (Dictionary) from entity ids to references to the entity objects stored in the SortedSet. We do that because a deletion request carries only the id, and we cannot recreate the entity from the id, so we need to build this map at insertion time. (Maybe augmentation can help?) With this we covered requirement (3).
Now we need to support (5). However, with every iteration (every business requirement we receive) it is getting harder and harder to implement, and I almost feel like we need to change our data structure in order to address the business criteria better.
What's the problem with the localization?
We can create a new SortedSet and reuse the implementation, but this comes with a huge trade-off. Let me elaborate.
We have hundreds of clients, each of which has around 7-8 supported languages. Languages in our system are unique per client, so translations for one customer do not interfere with another's (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that, we have base languages (global for every client), which are basically the default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (let's say English) is the same as the base one. Having said all that, if we want accurate search for each client and locale individually, we need an index for each client and locale individually, which in turn introduces massive amounts of duplication.
What have I thought of so far?
I am not an expert in data structures myself, but I really want to get this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some balanced search tree (could be AVL, red-black, 2-3-4 etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This would hopefully remove a lot of the issues and workarounds we have had so far and, as I said, address future requirements better, so that implementation is faster and more accurate. However, I am not an expert in data structures and, sadly, I am unable to map these business requirements to a data structure in the time frame I have. So, without further ado, do you have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, and the value is the product data. That gives you very quick insertion, and removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
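class IndexEntry
{
    // One entry per searchable name; several entries may point at the same product.
    public string ProductName { get; set; }
    public string ProductId { get; set; } // or int, if ProductId is an integer
}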
Since you allow customer-specific names, you'll have to add all those customer names to this index. Not a problem, but when you remove something by ID, you'll also have to remove the associated items from the other data structure. This will require a sequential search of the name index data structure to ensure that you get all the names associated with a particular product. That could be expensive, even if you use a tree structure.
To speed things up, you could have a "deleted" flag for those index entries, and then rebuild the structure periodically to remove the deleted items. That way, a deletion just requires a sequential scan. That's less than ideal, but if insertions and deletions are infrequent, quite acceptable.
The key, though, is to make your primary data structure that holds the product information indexed by product id. You can then build secondary indexes any way you want.
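As a rough illustration of the layout described above, here is a minimal Python sketch (Python rather than C# purely for brevity). The names Catalog, name_index, search and rebuild are made up for this example, and a sorted list with binary search stands in for whatever tree structure you actually use.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class IndexEntry:
    name: str       # searchable name; one entry per customer-specific name
    entity_id: str  # id of the entity this name points at
    deleted: bool = field(default=False, compare=False)  # tombstone, ignored when sorting


class Catalog:
    def __init__(self):
        self.by_id = {}       # primary structure: entity id -> entity data
        self.name_index = []  # secondary index: IndexEntry objects sorted by (name, id)

    def add(self, entity_id, data, names):
        self.by_id[entity_id] = data
        for name in names:
            bisect.insort(self.name_index, IndexEntry(name, entity_id))

    def remove(self, entity_id):
        # Constant-time removal from the primary structure, then one
        # sequential pass over the index that only flips flags.
        self.by_id.pop(entity_id, None)
        for entry in self.name_index:
            if entry.entity_id == entity_id:
                entry.deleted = True

    def search(self, prefix):
        # Binary search to the first name >= prefix, then walk forward
        # while names still start with the prefix.
        start = bisect.bisect_left(self.name_index, IndexEntry(prefix, ""))
        hits = []
        for i in range(start, len(self.name_index)):
            entry = self.name_index[i]
            if not entry.name.startswith(prefix):
                break
            if not entry.deleted:
                hits.append((entry.name, entry.entity_id))
        return hits

    def rebuild(self):
        # Periodic compaction: drop flagged entries; the order is preserved.
        self.name_index = [e for e in self.name_index if not e.deleted]
```

Adding an entity with id "1" under the names "Football" and "Soccer", then removing id "1", leaves both index entries in place but flagged, so search("Fo") and search("So") skip them until rebuild() drops them.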
I'm working on a syncing process between offline-first databases and a central server. As a simple example, there are items and departments and an item belongs to a department. Each client can modify any of the entities.
I know for text documents there are algorithms/technology for handling conflicts like OT and CRDT:
https://en.wikipedia.org/wiki/Operational_transformation
https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
Differences between OT and CRDT
But I'm wondering if you can use either of these for more complex structures like those you might have in a database. In my case, let's keep it simple and say you have:
items - id, name, department_id
departments - id, name
Changes in properties like "name" in individual elements are manageable (maybe using a version, delta, or timestamp). Deletes are a little trickier, but you might just discard the name change because the element is deleted.
And it's even trickier when there are relations. What happens when one client moves items to a department and another deletes the department?
At a certain level, some of these conflicts are similar to those that could happen in text using OT. Someone changes a title and someone else deletes it. Or someone adds an element to a bulleted list and someone else moves the list to a different part of the document.
My question is, can you use OT or CRDT for relational data and if so, how would you do it? If not, are there other similar algorithms or techniques to handle conflicts in relational data?
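To make the per-property idea mentioned above concrete, here is a minimal Python sketch of a last-writer-wins merge for one row. It is only an illustration under stated assumptions: every field value is stored as a (timestamp, client_id, value) triple, timestamps are comparable across clients, and a delete is a flag that always wins. The replica layout and the merge function are invented for this example.

```python
def merge(a, b):
    """Merge two replicas of the same row.

    Each replica looks like:
        {"fields": {field_name: (timestamp, client_id, value)}, "deleted": bool}
    """
    merged = dict(a["fields"])
    for name, stamped in b["fields"].items():
        # Last writer wins per field; (timestamp, client_id) breaks ties
        # deterministically, so the result is the same on every client.
        if name not in merged or stamped[:2] > merged[name][:2]:
            merged[name] = stamped
    # Delete wins: once any replica has deleted the row it stays deleted,
    # and later name changes are effectively discarded.
    return {"fields": merged, "deleted": a["deleted"] or b["deleted"]}
```

The merge is symmetric, so replicas can exchange changes in any order. It does not address the relational case (an item moved into a department that another client deleted); that still needs an explicit policy.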
Is there a way to add an index to a relationship in an Entity? I can see that attributes can be indexed, but not on relationships.
I have a large dataset and need to check which ones actually have or not a relationship object in a predicate:
[NSPredicate predicateWithFormat:@"relationshipField = nil"]
So I thought that it would be a good idea to index that relationship.
So my question is, can a relationship be indexed? And if not, is there a performance strategy for this scenario?
You can't do this manually but you don't need to because — empirically speaking — Core Data does it automatically where it can. So that means that 'to one' relationships are automatically given indexes. Check your SQLite store for names like ZENTITY_ZRELATIONSHIP_INDEX to see the proof.
There is no meaningful way to add an index for a 'to many' relationship so Core Data couldn't do that automatically or expose the option. SQLite's format for indices is just a list of rows sorted by the thing being indexed, into which it performs a binary search. It has no way of building such a structure for a multivalued column, and multivalued columns are how Core Data stores to-many links.
Disclaimer: of course, these are all implementation-specific observations that appear in my subjective experience to apply to the current implementation of Core Data. But as a general rule I think you could safely say that Core Data will optimise for relationship lookups as best as it can.
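If you want to check your own store for such index names, a short script along these lines would list them. This is a sketch only: list_indexes and the path you pass in are placeholders, while sqlite_master is SQLite's own schema table. It is safer to run it against a copy of the store than against the live file.

```python
import sqlite3


def list_indexes(db_path):
    """Return (index_name, table_name) pairs recorded in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name, tbl_name FROM sqlite_master "
            "WHERE type = 'index' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Any index created for a to-one relationship should show up in the returned list next to the table it belongs to.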
In my data model I take a user's statement containing hashtags; each hashtag is turned into a node, and their co-occurrence is the relationship between them. For each relationship I need to take into account:
the user who created it - rel.user property
the time it was created - rel.timestamp property
the context it was created in - rel.context property
the statement it was made in - rel.statement property
Now, Neo4J doesn't allow relationship property indexing and so when I do the search that requires me to retrieve and evaluate those properties, it takes a very long time. Specifically, when I do a Cypher request of the kind:
MERGE hashtag1-[rel:TO
{context:"deb659c0-a18d-11e3-ace9-1fa4c6cf2894",
statement:"824acc80-aaa6-11e3-88e3-453baabaa7ed",
user:"b9745f70-a13f-11e3-98c5-476729c16049"}]->hashtag2
ON CREATE
SET
rel.uid="824f6061-aaa6-11e3-88e3-453baabaa7ed",
rel.timestamp="13947117878770000";
This request first checks if there is a relationship with those properties; if there is, it does nothing, and if there is none, it adds a new one (with a unique ID and timestamp). Because each relationship has to be evaluated, and they are not indexed, the request takes a very long time to go through. This is a problem because I'm dealing with about 100 nodes and 300 relations in one query (the one above is only one type; a few others are added to the query, but those above are the most expensive ones).
Therefore the actual question:
Does anybody know of a good way to keep those relationship properties and to somehow make them work faster, so they can be retrieved and evaluated when needed faster? Or do you think I should use a different type of request (if yes, which?)
Thank you!
This almost looks to me as if your relationship should actually be a node, which would then be connected to these nodes:
context
user
statement
tag1
tag2
tagN
Then you can have sensible merge options (e.g. merge on UID).
Currently you lose the power of the graph model for your relationships.
This is also discussed in the graph-databases book in the chapter with the email domain.
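To show the shape of that model without a database, here is a toy in-memory sketch in Python. It is not Neo4j code, and the labels (Cooccurrence, Context, User, Statement, Hashtag) and relationship types are invented for the illustration. The point is that merging on a single unique key becomes one keyed lookup instead of a comparison of several properties on every existing relationship.

```python
class TinyGraph:
    """In-memory toy graph; not a Neo4j client."""

    def __init__(self):
        self.nodes = {}     # (label, key) -> properties
        self.edges = set()  # (from_node, relationship_type, to_node)

    def merge_node(self, label, key, **properties):
        # Like a merge on a unique key: one lookup, create only if absent.
        node = (label, key)
        created = node not in self.nodes
        if created:
            self.nodes[node] = dict(properties)
        return node, created


def merge_cooccurrence(graph, uid, timestamp, context, user, statement, tags):
    """Store one co-occurrence as its own node, linked to everything it involves."""
    node, created = graph.merge_node("Cooccurrence", uid, timestamp=timestamp)
    if created:
        links = [("Context", context, "IN_CONTEXT"),
                 ("User", user, "CREATED_BY"),
                 ("Statement", statement, "MADE_IN")]
        links += [("Hashtag", tag, "TAGS") for tag in tags]
        for label, key, rel_type in links:
            other, _ = graph.merge_node(label, key)
            graph.edges.add((node, rel_type, other))
    return created
```

Calling merge_cooccurrence twice with the same uid creates the node and its links only once; the second call is a no-op.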
Do you already have your hashtag1 and hashtag2 nodes available?
And if so, how many rels already exist between these?
What Cypher has to do for this to work, is to go over each of those relationships and compare all 3 properties (which I'm not sure will fit into shortstring storage) so they have to be loaded if they are not in the cache. You can check your store files, if you have a large string-store file then those uid's might not fit into the property records and have to be loaded separately.
What is the memory config of your server (heap and mmio)?
All that adds up.
Does anyone have an example of how to create an HBase table with a nested entity?
Example
UserName (string)
SSN (string)
+ Books (collection)
The books collection would look like this for example
Books
isbn
title
etc...
I cannot find a single example of how to create a table like this. I see many people talk about it, and how it is a best practice in certain scenarios, but I cannot find an example of how to do it anywhere.
Thanks...
Nested entities aren't an official feature of HBase; it's just a way some people talk about one usage pattern. In this pattern, you use the fact that "columns" in HBase are really just a big map (a bunch of key/value pairs) to let you model a dimension of cardinality inside the row by adding one column per "row" of the nested entity.
Schema-wise, you don't need to do much on the table itself; when you create a table in HBase, you just specify the name & column family (and associated properties), like so (in hbase shell):
hbase:001:0> create 'UsersWithBooks', 'cf1'
Then, it's up to you what you put in it, column wise. You could insert values like:
hbase:002:0> put 'UsersWithBooks', 'userid1234', 'cf1:username', 'my username'
hbase:003:0> put 'UsersWithBooks', 'userid1234', 'cf1:ssn', 'my ssn'
hbase:004:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_12345', '<isbn>12345</isbn><title>mary had a little lamb</title>'
hbase:005:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_67890', '<isbn>67890</isbn><title>the importance of being earnest</title>'
The column names are totally up to you, and there's no limit to how many you can have (within reason: see the HBase Reference Guide for more on this). Of course, doing this, you have to do your own legwork re: putting in and getting out values (and you'd probably do it with the java client in a more sophisticated way than I'm doing with these shell commands, they're just for explanatory purposes). And while you can efficiently scan just a portion of the columns in a table by key (using a column pagination filter), you can't do much with the contents of the cells other than pull them and parse them elsewhere.
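To make the "big map" picture concrete, here is a small Python model of the same data. It is a plain in-memory dictionary, not the HBase client API, and put and columns_with_prefix are helper names made up for this sketch. The nested books are simply the columns of the parent row whose names share a prefix.

```python
def put(table, row_key, column, value):
    """Mimic a put: one cell per (row key, 'family:qualifier')."""
    table.setdefault(row_key, {})[column] = value


def columns_with_prefix(table, row_key, prefix):
    """Return the cells of one row whose column name starts with prefix, sorted by column."""
    row = table.get(row_key, {})
    return sorted((col, val) for col, val in row.items() if col.startswith(prefix))


# Same data as the shell session above.
users_with_books = {}
put(users_with_books, "userid1234", "cf1:username", "my username")
put(users_with_books, "userid1234", "cf1:ssn", "my ssn")
put(users_with_books, "userid1234", "cf1:book_id_12345",
    "<isbn>12345</isbn><title>mary had a little lamb</title>")
put(users_with_books, "userid1234", "cf1:book_id_67890",
    "<isbn>67890</isbn><title>the importance of being earnest</title>")

# All "nested" books of one user: the cells whose column starts with the book prefix.
books = columns_with_prefix(users_with_books, "userid1234", "cf1:book_id_")
```

Here books holds the two book cells in column order; the cell contents are still opaque strings that the caller has to parse.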
Why would you do this? Probably just if you wanted atomicity around all the nested rows for one parent row. It's not very common; your best bet is probably to start by modeling them as separate tables, and only move to this approach if you really understand the tradeoffs.
There are some limitations to this. First, this technique only works to one level deep: your nested entities can’t themselves have nested entities. You can still have multiple different nested child entities in a single parent, and the column qualifier is their identifying attributes.
Second, it’s not as efficient to access an individual value stored as a nested column qualifier inside a row, as compared to accessing a row in another table, as you learned earlier in the chapter.
Still, there are compelling cases where this kind of schema design is appropriate. If the only way you get at the child entities is via the parent entity, and you’d like to have transactional protection around all children of a parent, this can be the right way to go.
So just getting started with Azure tables- haven't played with them before so wanted to check it out.
My understanding is that I should be thinking of this as object storage, rather than a database, which is cool. But I'm a bit confused on a couple points...
First, if I have one to many object relationships, what should the partitionkey of the root object look like? For example, let's say I have a University object, which is one to many to Student objects, and say Student objects are one to many to Classes. For a new student, should its partitionkey be 'universityId'? Or 'universityId + studentId'? I read in the msdn docs that the RowKey is supposed to be an id specific to the item I am adding, which also sounds like studentId.
And then would both the partitionkey and rowkey for a new University just be universityId?
I also read that Azure Tables are not for storing lists- I take it that does not refer to storing an object that contains a List...?
And anyone have any links to code samples using asp mvc 3 or 4 and razor with azure tables? This is my end goal, would be cool to see what someone who actually knows what they are doing does :)
Thanks!
You're definitely right that Azure Tables is closer to an object store than a database. You do have some ability to query on non-key columns, and to do logic in queries. But you shouldn't plan on using those features for anything performance critical.
Because queries are only fast if you specify at least a PartitionKey (and preferably a RowKey or a range of RowKeys), that heavily influences how you lay out your tables. The decisions you make at the beginning will have big performance implications later. As a rough analogy, I like to think about them like a SQL Server table with the primary key as (PartitionKey + RowKey), that can never have another index. That's not completely accurate, but it'll get you thinking in the right direction.
First, if I have one to many object relationships, what should the partitionkey of the root object look like?
I would probably use the UniversityId as the PartitionKey. That's generally a safe place to start.
For a new student, should its partitionkey be 'universityId'? Or 'universityId + studentId'?
How do you plan to query the students? If you're always going to have their UniversityId & StudentId I would probably make them the PartitionKey and RowKey, respectively. If you're mostly going to query based on StudentId, I would use that as the PartitionKey instead.
would both the partitionkey and rowkey for a new University just be universityId?
That's a viable choice. You can also use a constant value (eg "UNIVERSITY") for the RowKey, if you've really got nothing else to put there.
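To illustrate why the key choice matters, here is a toy Python model of a table addressed by (PartitionKey, RowKey). It is not the Azure SDK; KeyedTable and its methods are invented for this sketch, and the example assumes students are stored with PartitionKey = universityId and RowKey = studentId.

```python
class KeyedTable:
    """Toy stand-in for a table addressed by (PartitionKey, RowKey)."""

    def __init__(self):
        self.partitions = {}  # PartitionKey -> {RowKey -> entity dict}

    def insert(self, partition_key, row_key, **properties):
        entity = {"PartitionKey": partition_key, "RowKey": row_key, **properties}
        self.partitions.setdefault(partition_key, {})[row_key] = entity
        return entity

    def get(self, partition_key, row_key):
        # Point lookup: both keys known, nothing is scanned.
        return self.partitions.get(partition_key, {}).get(row_key)

    def query_partition(self, partition_key):
        # Partition query: touches one partition only, returned in RowKey order.
        rows = self.partitions.get(partition_key, {})
        return [rows[key] for key in sorted(rows)]

    def scan(self, predicate):
        # Filtering on a non-key property has to visit every entity.
        return [e for rows in self.partitions.values()
                for e in rows.values() if predicate(e)]
```

With this layout, "all students of university X" and "student Y of university X" are cheap, while "all students named Al" falls back to scan and visits everything.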
I also read that Azure Tables are not for storing lists- I take it that does not refer to storing an object that contains a List...?
I'm not entirely sure what that means. Clearly you can store a collection of objects in a table; that's what they're for. You can't directly store a list in an entity property. So if your Student has a property of type List, that can't be stored directly. But you could serialize it to XML or binary, and store that.
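A minimal sketch of that serialization idea, in Python and using JSON instead of XML or binary only because it is the shortest to show. The entity is a plain dictionary, and the property name ClassesJson and both helper functions are made up for the example.

```python
import json


def to_entity(university_id, student_id, name, classes):
    """Flatten a student into scalar properties; the list becomes one string."""
    return {
        "PartitionKey": university_id,
        "RowKey": student_id,
        "Name": name,
        "ClassesJson": json.dumps(classes),  # hypothetical property name
    }


def classes_of(entity):
    """Recover the original list from the stored string."""
    return json.loads(entity["ClassesJson"])
```

The trade-off is that the list is opaque to the store: you cannot filter on its contents, only read the whole string back and parse it.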
I don't have any code samples handy, unfortunately. This may be a good time to abstract your data logic into its own layer, rather than putting it in your MVC controllers. We've found that a well-abstracted data layer can make unit testing your logic very easy. If you create some interfaces for your tables, it's very easy to create mock objects using just a List and some LINQ.
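As a rough picture of what such an abstraction could look like (in Python rather than C#, and not taken from a real project): the logic depends only on a small store interface, and a list-backed fake satisfies that interface in tests. InMemoryStudentStore, for_university and student_names are all hypothetical names.

```python
class InMemoryStudentStore:
    """Test double exposing the same methods a table-backed store would."""

    def __init__(self, students=()):
        self._students = list(students)

    def add(self, student):
        self._students.append(student)

    def for_university(self, university_id):
        return [s for s in self._students if s["PartitionKey"] == university_id]


def student_names(store, university_id):
    """Example of logic that depends only on the store's interface."""
    return sorted(s["Name"] for s in store.for_university(university_id))
```

A unit test can then build an InMemoryStudentStore, add a few dictionaries and call student_names without touching any storage account.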