I've been doing a fair amount of work with CouchDB in my spare time recently and really enjoy using it. I find it to be much more flexible than using a relational database, but it's not without its disadvantages.
One big disadvantage is the lack of dynamic queries / view generation. You have to do a fair amount of work planning and justifying your views, because you can't put that logic into your application code the way you might with SQL.
For example, I wrote a login scheme based on a JSON document template that looked a little bit like this:
{
  "_id": "blah",
  "type": "user",
  "name": "Bob",
  "email": "bob@theaquarium.com",
  "password": "blah"
}
To prevent the creation of duplicate accounts, I wrote a very basic view to generate a list of user names to look up as keys:
function (doc) { if (doc.type === "user") emit(doc.name, null); }
This seemed reasonably efficient to me. I think it's way better than dragging out an entire list of documents (or even just a reduced number of fields for each document). So I did exactly the same thing to generate a list of email addresses:
function (doc) { if (doc.type === "user") emit(doc.email, null); }
Can you see where I'm going with this question?
In a relational database (with SQL) one would simply make two queries against the same table. Would this technique (of equating a view to the product of an SQL query) be in some way analogous?
Then there's the performance / efficiency issue... Should those two views really be just one? Or is the use of a CouchDB view with keys and no associated value an effective practice? Considering the example above, both of those views would have uses outside of a login scheme... If I ever need to generate a list of user names, I can retrieve them without additional overhead.
What do you think?
First, you certainly can put the view logic into your application code - all you need is an appropriate build or deploy system that extracts the views from the application and adds them to a design document. What is missing is the ability to generate new queries on the fly.
Your emit(doc.field,null) approach certainly isn't surprising or unusual. In fact, it is the usual pattern for "find document by field" queries, where the document is extracted using include_docs=true. There is also no need to mix the two views into one, the only performance-related decision is whether the two views should be placed in the same design document: all views in a design document are updated when any of them is accessed.
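For illustration, here is a minimal sketch of what such a design document could look like with both lookup views side by side, keeping in mind the update behavior just mentioned (the design document and view names are my own, not from the question):

{
  "_id": "_design/users",
  "views": {
    "by_name": {
      "map": "function (doc) { if (doc.type === 'user') emit(doc.name, null); }"
    },
    "by_email": {
      "map": "function (doc) { if (doc.type === 'user') emit(doc.email, null); }"
    }
  }
}

A lookup is then a key query such as GET /db/_design/users/_view/by_email?key="bob@theaquarium.com"&include_docs=true, which returns the matching document in the same round-trip.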
Of course, your approach does not actually guarantee that the e-mails are unique, even if your application tries really hard. Imagine the following circumstances with two client applications A and B:
A: queries view, determines that `test@email.com` does not exist.
B: queries view, determines that `test@email.com` does not exist.
A: creates account with `test@email.com`
B: creates account with `test@email.com`
This is a rare occurrence, but nonetheless possible. A better approach is to keep documents that use the email address as the key, because access to single documents is transactional (it's impossible to create two documents with the same key). Typical example:
{
  "_id": "test@email.com",
  "type": "email",
  "user": "000000001"
}

{
  "_id": "000000001",
  "type": "user",
  "email": "test@email.com",
  "firstname": "Test",
  ...
}
EDIT: a reservation pattern only works if two clients attempting to create an account for a given e-mail will reliably try to access the same document. If you randomly generate a new identifier, then client A will create and reserve document XXXX while client B will create and reserve document YYYY, and you will end up with two different documents that have the same e-mail.
Again, the only way to perform a transactional "check if it exists, create if it does not" operation is to have all clients alter a single document.
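To make the transactional flavor of this concrete, here is a minimal sketch against CouchDB's plain HTTP API using fetch (the database name and error handling are my own assumptions): whichever client arrives second gets a 409 Conflict and knows the address is taken.

// Atomically reserve an e-mail address by using it as the _id.
// CouchDB rejects the second PUT to the same _id with 409 Conflict.
async function reserveEmail(email, userId) {
  const res = await fetch(
    'http://localhost:5984/mydb/' + encodeURIComponent(email),
    {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ type: 'email', user: userId })
    }
  );
  if (res.status === 409) {
    throw new Error('e-mail address already taken');
  }
  return res.json(); // { ok: true, id: ..., rev: ... }
}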
Related
We are building a Search API at our company for some of our entities (events, leagues, and sports), each of which has a name property, and we are having difficulty implementing the business requirements.
TL;DR: What data structure would address these business requirements better than a basic red-black tree does?
What are the business requirements?
1. The data structure needs to be sorted, so that the following requirements are easier to implement; insertion should not break this property.
2. The data structure needs to hold information about its entities: the node key (the entity's name property) is used for searching, but the node needs to hold all the entities whose name property starts with the node key value.
3. It needs to support deletion by id, which is also a property of all entities.
4. It needs to support index search (up to 3 characters), so if someone searches for "aaa", every node with a key between "aaa.." and "aaaz" should appear (e.g. query = "aaa", index = "aaa", "aaab", "aaaab", "aaaz", result = "aaa", "aaab", "aaaab").
5. We need to search by localized node key.
What have we done so far?
We started our first iteration using the built-in red-black tree (SortedSet in C#); for nodes we had a structure that holds the name property of the entity and all events related to that name. With one helper method we satisfied business requirements (1), (2) and (4).
For our second iteration we had to support deletion, so we created a map (Dictionary) from entity ids to references to the entity objects stored in the SortedSet. We did that because deletion requests come in by id only, and we cannot recreate an entity from its id, so we need to build the map at insertion time. (Maybe augmentation can help?) With this we satisfied requirement (3).
Now we need to support (5); however, with every iteration (each new business requirement we receive) it is getting harder and harder to implement, and I am starting to feel we need to change our data structure to address the business criteria better.
What's the problem with localization?
We can create a new SortedSet and re-use the implementation, but this comes with a huge trade-off. Let me elaborate.
We have around 100 clients, each of which has 7-8 supported languages. Languages in our system are unique per client, so translations for one customer do not interfere with another's (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that, we have base languages (global for every client), which are basically default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (say, English) is the same as the base one. Having said all of that, if we want accurate search for each client and locale individually, we need an index for each client and locale individually, which introduces massive amounts of duplication.
What have I thought of so far?
I am not an expert in data structures myself, but I really want to get this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some binary search tree (AVL, red-black, 2-3-4, etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This would hopefully resolve a lot of the issues and workarounds we have had to make so far and, as I said, address future requirements better, so implementation is faster and more accurate. However, I am not an expert in data structures, and sadly I am unable to map these business requirements to a data structure within the time frame I have. So, without further ado, do you have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, with the product data as the value. That gives you very quick insertion, as well as removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
class IndexEntry
{
    public string ProductName { get; set; }
    public string ProductId { get; set; } // or int, if ProductId is an integer
}
Since you allow customer-specific names, you'll have to add all those customer names to this index. That's not a problem, but when you remove something by id, you'll also have to remove the associated entries from this second data structure. That requires a sequential search of the name index to ensure that you get all the names associated with a particular product, which could be expensive even if you use a tree structure.
To speed things up, you could add a "deleted" flag to those index entries and then rebuild the structure periodically to purge the flagged items. That way, a deletion requires only a sequential scan to set the flags, with no restructuring. That's less than ideal, but if insertions and deletions are infrequent, it's quite acceptable.
The key, though, is to make your primary data structure that holds the product information indexed by product id. You can then build secondary indexes any way you want.
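To make the shape of this concrete, here is a small sketch in JavaScript (all names are mine, and the sorted array stands in for whatever tree the final design uses): a Map keyed by id as the primary store, a name index kept sorted for prefix search, and tombstone-style deletion with periodic compaction.

// Primary store: id -> entity data.
const products = new Map();

// Secondary index: { name, id, deleted } entries kept sorted by name.
let nameIndex = [];

// First position whose name is >= key (binary search).
function lowerBound(arr, key) {
  let lo = 0, hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (arr[mid].name < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

function insert(product) {
  products.set(product.id, product);
  for (const name of product.names) { // all localized / client-specific names
    const entry = { name, id: product.id, deleted: false };
    nameIndex.splice(lowerBound(nameIndex, name), 0, entry); // keep sorted
  }
}

// Prefix search: walk forward from the first candidate.
// Note: a product may appear once per matching name.
function searchByPrefix(prefix) {
  const results = [];
  for (let i = lowerBound(nameIndex, prefix); i < nameIndex.length; i++) {
    if (!nameIndex[i].name.startsWith(prefix)) break; // past the prefix range
    if (!nameIndex[i].deleted) results.push(products.get(nameIndex[i].id));
  }
  return results;
}

// Deletion by id: flag matching entries now, purge later.
function removeById(id) {
  products.delete(id);
  for (const e of nameIndex) if (e.id === id) e.deleted = true;
}

// Periodic compaction that rebuilds the index without tombstones.
function compact() {
  nameIndex = nameIndex.filter(e => !e.deleted);
}

The array insert is O(n); a balanced tree or skip list would bring that down to O(log n), but the division of labor (primary map plus secondary sorted name index) stays the same.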
Suppose you have a GraphQL layer, written in Node.js using graphql-js, that communicates with a SQL database. Suppose you have the following simple types and fields:
Store
A single brick-and-mortar location for a chain of grocery stores.
Fields:
id: GraphQLID
region: StoreRegion
employees: GraphQLList(Employee)
StoreRegion
A GraphQLEnumType containing the list of regions into which the chain divides its stores.
Values:
NORTHEAST
MIDATLANTIC
SOUTHEAST
...
Employee
Represents a single employee working at a store.
Fields:
id: GraphQLID
name: GraphQLString
salary: GraphQLFloat
Suppose the API exposes a store query that accepts a Region and returns a list of Store objects. Now suppose the client sends this query:
{
  store(region: NORTHEAST) {
    employees {
      name
      salary
    }
  }
}
Hopefully this is all pretty straightforward so far.
So here's my question, and I hope (expect, really) that it's something that has a common solution and I'm just having trouble finding it because my Google-Fu is weak today: is there a good way that I can write the resolvers for these types such that I can wrap up all the requested fields for all the employees from all the returned stores into a single SQL query statement, resulting in one round-trip to the database of the form:
SELECT name,salary FROM employees WHERE id IN (1111, 1133, 2177, ...)
rather than making one request per employee or even one request per store?
This is really a concrete instance of a more general question: is there a good way to combine resolvers to avoid making multiple requests in cases where they could be easily combined?
I'm asking this question in terms of graphql-js because that's what I'm hoping to work with, and I figure that allows for more specific answers, but if there's a more implementation-agnostic answer, that would be cool too.
So, basically you are wondering how you can combine multiple resolvers into fewer database queries. This is what's known as the N+1 query problem. Here are at least two ways you can solve it:
DataLoader: This is the more general solution, created by Facebook. It lets you batch many single-item lookups of one type into a single query that fetches multiple items of that type at once. In your example, you would batch all the employee lookups into one DB query while still issuing a separate query for the store (see the sketch after this list). Here's a video by Ben Awad that explains DataLoader.
JoinMonster: This one is specifically for SQL. It generates JOINs to make one SQL query per GraphQL query. Here's a video by Ben Awad explaining JoinMonster.
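As a rough sketch of how the DataLoader option looks with graphql-js (the db.query helper, table name, and store.employeeIds field are placeholders, not from the question):

const DataLoader = require('dataloader');

// Batch function: receives every employee id requested during one tick
// and must resolve to results in the same order as the input keys.
const employeeLoader = new DataLoader(async (ids) => {
  // db.query is a placeholder for your SQL client; parameter syntax varies.
  const rows = await db.query(
    'SELECT id, name, salary FROM employees WHERE id IN (?)', [ids]
  );
  const byId = new Map(rows.map((r) => [r.id, r]));
  return ids.map((id) => byId.get(id));
});

// Resolver for Store.employees: one load() per employee, but DataLoader
// coalesces all of them into a single SQL query behind the scenes.
const storeResolvers = {
  employees: (store) =>
    Promise.all(store.employeeIds.map((id) => employeeLoader.load(id)))
};

One caveat: in production you would normally create a fresh DataLoader per request (for example, in the GraphQL context) so its cache doesn't leak data between users.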
I just ran into an interesting situation regarding relationships and databases. I am writing a Ruby app, and for my database I am using PostgreSQL. I have a parent object "user" and a related object "thingies", where a user can have one or more thingies. What would be the advantage of using a separate table vs. just embedding data in a field on the parent table?
Example from ActiveRecord:
using a related table:
def change
  create_table :users do |i|
    i.text :name
  end
  create_table :thingies do |i|
    i.integer :thingie
    i.text :description
    i.belongs_to :user # foreign key column linking each thingie back to its user
  end
end
class User < ActiveRecord::Base
has_many :thingies
end
class Thingie < ActiveRecord::Base
belongs_to :user
end
using an embedded data structure (multidimensional array) method:
def change
  create_table :users do |i|
    i.text :name
    i.text :thingies, array: true # example contents: [[thingie, description], [thingie, description]]
  end
end
class User < ActiveRecord::Base
end
Relevant Information
I am using Heroku with Heroku Postgres as my database, on their free plan, which limits me to 10,000 rows. This makes me lean towards the multidimensional array approach, but I don't really know.
Embedding a data structure in a field can work for simple cases but it prevents you from taking advantage of relational databases. Relational databases are designed to find, update, delete and protect your data. With an embedded field containing its own wad-o-data (array, JSON, xml etc), you wind up writing all the code to do this yourself.
There are cases where the embedded field might be more suitable, but to answer this question I will use an example that highlights the advantages of the related-table approach.
Imagine a User and Post example for a blog.
For an embedded post solution, you would have a table something like this (Postgres DDL):
create table users (
  id serial primary key,
  name varchar(200),
  posts text[][]
);
With related tables, you would do something like
create table users (
  id serial primary key,
  name varchar(200)
);

create table posts (
  id serial primary key,
  user_id int references users(id),
  content text
);
Object Relational Mapping (ORM) tools: With the embedded post, you will be writing the code manually to add posts to a user, navigate through existing posts, validate them, delete them etc. With the separate table design, you can leverage the ActiveRecord (or whatever object relational system you are using) tools for this which should keep your code much simpler.
Flexibility: Imagine you want to add a date field to the post. You can do it with an embedded field, but you will have to write code to parse your array, validate the fields, update the existing embedded posts, etc. With the separate table, this is much simpler. In addition, let's say you want to add an Editor to your system who approves all the posts. With the relational example this is easy. For example, to find all posts edited by 'Bob' with ActiveRecord, you would just need:
Editor.find_by(name: 'Bob').posts
For the embedded side, you would have to write code to walk through every user in the database, parse every one of their posts and look for 'Bob' in the editor field.
Performance: Imagine that you have 10,000 users with an average of 100 posts each. Now you want to find all posts made on a certain date. With the embedded field, you must loop through every record, parse the entire array of all posts, extract the dates and check them against the one you want. This will chew up both CPU and disk I/O. With the database, you can simply index the date field and pull out the exact records you need without parsing every post from every user.
Standards: Using a vendor-specific data structure means that moving your application to another database could be a pain. Postgres appears to have a rich set of data types, but they are not the same as MySQL's, Oracle's, SQL Server's, etc. If you stick with standard data types, you will have a much easier time swapping backends.
These are the main issues I see off the top of my head. I have made this mistake and paid the price for it, so unless there is a super-compelling reason to do otherwise, I would use the separate table.
What if users John and Ann have the same thingies? The records will be duplicated, and if you decide to change the name of a thingie, you will have to change two or more records. If the thingie is stored in a separate table, you only have to change one record. FYI: https://en.wikipedia.org/wiki/Database_normalization
Benefits of one to many:
Easier ORM (Object Relational Mapping) integration. You can use it either way, but with the embedded approach you have to define your tables with native SQL; with distinct tables you can make use of auto-generated mappings.
Your space limitation of 10,000 rows will go further with the one to many relationship in the case that 2 or more people can have the same "thingies."
Handle users and thingies separately. In some cases, you might only care about users or thingies, not their relationship with each other. Some examples: updating a username or thingy description, or getting a list of all thingies (or all users). Selecting from the single table can make this harder to work with.
Maintenance and manipulation are easier. If a user or a thingy is updated (name change, email address update, etc.), you only need to update one record in its own table instead of writing update statements with "where user_id=?".
Enforceable database constraints. What if a thingy is not owned by anyone? Is the user column now nullable? It would have to be in the single-table case, so you could not enforce a simple non-null username, for example.
There are a lot of reasons, of course. If you are using a relational database, you should make use of the one-to-many relationship by separating your objects (users and thingies) into distinct tables. Considering your limit on the number of records and the small size of your dataset (under 10,000 rows), you shouldn't feel the downside of normalized data.
The short truth is that there are benefits to both. You could, for example, get faster read times from the single-table approach because you don't need complicated joins.
Here is a good reference with the pros and cons of both (normalized is the multiple-table approach; denormalized is the single-table approach):
http://www.ovaistariq.net/199/databases-normalization-or-denormalization-which-is-the-better-technique/
Besides the benefits others have mentioned, there is also the matter of standards. If you are working on this app alone, that's not a problem, but if someone else ever needs to change something, the nightmare starts.
It may take that person a lot of time just to understand how it works, and modifying something like this will take even longer. Simple improvements can become really time-consuming that way. And at some point, you will be working with other people. So always code as if the person who ends up maintaining your code is a brutal psychopath who knows where you live.
I have a newsfeed with comments, which I'm storing in MongoDB. The newsfeed could grow very large in the future, so I need high speed.
comments: [
{user_id: 34, user_name: "John", text: "..."}
]
As you can see, I'm storing info about the user as well, because Mongo's docs say "when you need speed, use embeds".
But a user can change his name at any time, and in that case the user's name under each of his comments in the newsfeed would be wrong.
Should I use references (DBRef) to the "User" collection by _id instead of embedding? How much slower would that be, and is the slowdown big enough to worry about?
I'm also wondering how the big social networks handle this: when I change my user name, it instantly updates in all my posts in the newsfeed.
Storing DBRefs won't gain you anything over storing simple user ids; it's basically the same id, only with a collection name attached.
If you want quick efficient reads - embed.
When a user changes his name, you can record that fact and then run a nightly job that updates his cached name in all comments.
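A minimal sketch of such a sync job with the Node.js MongoDB driver (the posts collection name is my assumption; the comment fields follow the structure above). arrayFilters lets $set target only the matching elements of each embedded comments array:

// Rewrite the cached user_name inside every embedded comment by this user.
async function syncUserName(db, userId, newName) {
  await db.collection('posts').updateMany(
    { 'comments.user_id': userId },
    { $set: { 'comments.$[c].user_name': newName } },
    { arrayFilters: [{ 'c.user_id': userId }] }
  );
}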
If you want instantaneous name updates, you should reference. But in that case you're paying with more complex code and more queries to the database.
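For instance, reading comments with fresh names when you store only user_id could be done with a join at read time via $lookup (a sketch, assuming a users collection keyed by _id with a name field):

// Flatten each post's comments and join in the current author name.
async function commentsWithFreshNames(db) {
  return db.collection('posts').aggregate([
    { $unwind: '$comments' },
    { $lookup: {
        from: 'users',
        localField: 'comments.user_id',
        foreignField: '_id',
        as: 'author'
    } },
    { $unwind: '$author' },
    { $project: { text: '$comments.text', user_name: '$author.name' } }
  ]).toArray();
}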
Users of our system are able to submit un-validated contact data. For example:
Forename: null
Surname: 231
TelephoneNumber: not sure
etc
This data is stored in a PendingContacts table.
I have another table - ApprovedContacts. This table has a variety of constraints to improve consistency and integrity. This table shouldn't contain any dirty or incomplete data.
I need a process to move data from one table to the other. The structure of the two tables is nearly identical; however, one table has the constraints and the other doesn't.
I have two states, Pending and Approved, and gut feeling tells me that I should use the state pattern (details here). In theory this should allow me to change a contact's state from Pending to Approved, depending on whether the contact has been successfully validated. The problem is that I don't see how this is going to work.
Am I going in the right direction, or should I be looking at something completely different?
The presentation layer is in MVC 3, so I have view models for pending and approved contacts, as well as domain models for both. My view models are generally DTOs with some validation routines, but now my view models represent state too, which doesn't seem right.
For example, all contacts must have a state, and they can be saved and removed, so I need an interface for that:
public interface IContactViewModelState
{
void Save(ContactViewModel item);
}
I then add an implementation for saving pending contacts into the PendingContacts table:
public class PendingContactViewModelState: IContactViewModelState
{
public void Save(ContactViewModel item)
{
// Save to the pending contacts table
// I don't like this as my view model now references data access layer
}
}
Short answer: no, because you only have two states. You'd use a state pattern to help deal with complex situations involving many states and rules. The only reason you might want to go with a full-blown state pattern implementation is if there's a very high chance such a situation is imminent.
If the result of a successful transition to Approved is the record ending up in the approved table, then you really just need to decide where you want to enforce the constraints. This decision can be based on many factors, including the likely frequency of change (to the constraints) and where other logic resides.
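For example, if you let the constraints on ApprovedContacts do the validating, the whole "transition" is just a transactional move between the two tables. A minimal sketch (in JavaScript with node-postgres purely for illustration; the column names are assumed from the question):

// Assumes a connected node-postgres Client is passed in.
const { Client } = require('pg');

// Promote one pending contact. The INSERT fails if the data violates any
// ApprovedContacts constraint, and the rollback leaves the pending row intact.
async function approveContact(client, contactId) {
  try {
    await client.query('BEGIN');
    await client.query(
      `INSERT INTO ApprovedContacts (forename, surname, telephone_number)
       SELECT forename, surname, telephone_number
       FROM PendingContacts WHERE id = $1`,
      [contactId]
    );
    await client.query('DELETE FROM PendingContacts WHERE id = $1', [contactId]);
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // validation failed: the contact stays pending
  }
}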
A lot of patterns (but not all) deal with how to structure an application, but here I think it's just a case of deciding where and how to implement some logic. In other words, you might just be accidentally over-analyzing the problem - it's easily done :)