Technology for database access system - protocol-buffers

I am currently designing a system that should provide access to a database. The assumptions are as follows:
The database should have an access layer. The access layer should provide objects that represent database tables (this would be done using some ORM framework).
A client that wants to get data from the database should first obtain objects from the access layer and then get the data using those objects.
Clients could use Python, Java or C++.
The access layer is based on Java.
There won't be too many clients, but they will be operating on large amounts of data.
The hard question for me is which technology should be used for passing objects between the access layer and the clients. I am considering ZeroC Ice, Apache Thrift and Google Protocol Buffers.
Does anyone have an opinion on which one is worth using?
This is my research for Protocol Buffers:
Advantages:
simple to use and easy to start
well documented
highly optimized
object data structures are defined in a Java-like definition language
implementations of setters, getters and builder methods are generated automatically for Python, Java and C++
open-source bindings for other languages
objects can be extended without breaking old versions of an application
there are many open-source RpcChannel and RpcController implementations (not tested)
Disadvantages:
the object transfer (transport) has to be implemented by ourselves
object structures have to be defined before use, so we can't add fields on the fly (update: there are possibilities to do that, see the comments)
if we need to read a single field of an object, we still have to parse the whole message (in contrast, in XML we could skip unneeded tags)
if we want to use RPC to invoke object methods, we need to define services and provide RpcChannel and RpcController implementations (a minimal schema sketch follows this list)
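For illustration, here is a minimal sketch of what such a definition might look like (the message, field and service names are invented for this example, not part of the original question; the generated Java classes would then expose builders, getters and parseFrom/writeTo methods):

    // Hypothetical schema in proto2 syntax for the access layer.
    message Record {
      required int64 id = 1;        // primary key of the table row
      optional string name = 2;     // optional field: costs nothing on the wire when unset
      repeated string tags = 3;     // repeated field, becomes a list in Java/Python/C++
    }

    message Query {
      required string table = 1;
      optional string filter = 2;
    }

    message RecordSet {
      repeated Record records = 1;
    }

    // Only needed if we want RPC: we would still have to supply
    // RpcChannel/RpcController implementations ourselves.
    service AccessLayer {
      rpc Fetch (Query) returns (RecordSet);
    }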
This is my research for Apache Thrift:
Advantages:
provides a compiler that generates source code for the supported languages (classes and everything else that is needed)
allows defining optional fields in structures (when we do not set a value on a field, the size of the transferred data is smaller)
allows marking some methods as "oneway" (they return nothing, and after invocation the client does not wait for the server to confirm that it finished processing the request)
supports serialization and deserialization of collections (maps, lists, sets), objects and primitives, as well as constants, enumerations and exceptions (see the IDL sketch after this list)
most problems and errors have already been solved and explained somewhere
provides different serialization protocols (TBinaryProtocol, ...) and different ways of exchanging data (TBufferedTransport, TZlibTransport, ...)
the compiler produces classes (structures) for each language that we can extend by adding new methods
it is possible to add fields to the protocol (on the server as well as the client) and remove others; old code and new code can still interact properly (following some update rules)
enable asynchronous calls
easy to use
Disadvantages:
documentation: it contains some errors, so sometimes it is really hard to find out the source of a problem
problems are not always well tagged (when we look for a solution on the internet)
no support for overloading of service methods
tutorials cover only simple examples of thrift usage
hard to start
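To make some of the points above concrete, here is a minimal Thrift IDL sketch (struct, field and service names are invented for this example) showing optional fields, collections, exceptions and a oneway method:

    // Hypothetical .thrift definition.
    struct Record {
      1: required i64 id,
      2: optional string name,              // optional: not serialized when unset
      3: optional map<string, string> tags  // collections are supported directly
    }

    exception AccessError {
      1: string reason
    }

    service AccessLayer {
      list<Record> fetch(1: string query) throws (1: AccessError err),
      oneway void log(1: string message)    // fire-and-forget: the client does not wait
    }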
ZeroC Ice:
It is better than Protocol Buffers in that I wouldn't need to implement object passing myself via e.g. sockets. Ice also provides ServantLocators, which can take care of connection management.
The question is whether Ice is much slower and less efficient than Protocol Buffers.

Related

protocol buffers in web application architecture -- when are they not worth the trouble?

I am new to web development, and I've seen many sites preaching the benefits of using protocol buffers -- for example: https://codeclimate.com/blog/choose-protocol-buffers/.
I'm not sure if some of the benefits apply to my use case:
Having a unified schema out of the .proto file: If I validate my data in the front and back-end, which I should do anyway, a unified schema is enforced explicitly. I don't see any added benefit in this regard from using protocol buffers.
Auto-generating the setters and getters from the .proto file: This looks like a nice selling point, but I wouldn't need any setters and getters if I didn't use protocol buffers in the first place. I found them really cumbersome to work with:
They remove capitalization, which alters the original variable names
They are unnatural to work with. For example, in C++ I would want to work with just a plain old data structure, but instead I have to do something like ptr_message->shouldBeStruct1().shouldBeStructArray(20).shouldBeInt();
Easy Language Interoperability: I really doubt it is good practice to design my data consuming code so that it works for a protobuf message rather than a struct. So, I would need to parse the protobuf into a plain data struct first.
The only potential benefit I see is the reduced data size when transmitting on the wire. But, does this really justify the overhead of additional middleware to work with protocol buffers? What am I missing?

What is the difference between Falcor and GraphQL?

GraphQL consists of a type system, query language and execution
semantics, static validation, and type introspection, each outlined
below. To guide you through each of these components, we've written an
example designed to illustrate the various pieces of GraphQL.
- https://github.com/facebook/graphql
Falcor lets you represent all your remote data sources as a single
domain model via a virtual JSON graph. You code the same way no matter
where the data is, whether in memory on the client or over the network
on the server.
- http://netflix.github.io/falcor/
What is the difference between Falcor and GraphQL (in the context of Relay)?
I have viewed the Angular Air Episode 26: FalcorJS and Angular 2 where Jafar Husain answers how GraphQL compares to FalcorJS. This is the summary (paraphrasing):
FalcorJS and GraphQL are tackling the same problem (querying data, managing data).
The important distinction is that GraphQL is a query language and FalcorJS is not.
When you are asking FalcorJS for resources, you are very explicitly asking for a finite series of values. FalcorJS does support things like ranges, e.g. genres[0..10], but it does not support open-ended queries, e.g. genres[0..*].
GraphQL is set-based: give me all records where this condition is true, ordered by that, etc. In this sense, the GraphQL query language is more powerful than FalcorJS. (A small query sketch follows this summary.)
With GraphQL you have a powerful query language, but you have to interpret that query language on the server.
Jafar argues that in most applications, the queries that go from client to server share the same shape. Therefore, having specific and predictable operations like get and set exposes more opportunities to leverage caching. Furthermore, many developers are familiar with mapping requests using a simple router in a REST architecture.
The discussion ultimately revolves around whether the power that comes with GraphQL outweighs the added complexity.
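As a rough illustration of that distinction (the field names below are invented and would depend entirely on the schema the server exposes): with FalcorJS you ask for explicit paths and ranges such as genres[0..10].name, whereas a GraphQL query describes the shape of the result you want and lets the server resolve it:

    {
      genres(first: 10) {
        name
        titles(first: 3) {
          name
          releaseYear
        }
      }
    }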
I have now written apps with both libraries, and I can agree with everything in Gajus' post, but I found some other things to be most important in my own use of the frameworks.
Probably the biggest practical difference is that most of the examples, and presumably most of the work done up to this point on GraphQL, have been concentrated on integrating GraphQL with Relay - Facebook's system for integrating ReactJS widgets with their data requirements. FalcorJS, on the other hand, tends to act separately from the widget system, which means both that it may be easier to integrate into a non-React/Relay client and that it will do less for you automatically in terms of matching widget data dependencies with widgets.
The flip side of FalcorJS being flexible in client-side integrations is that it can be very opinionated about how the server needs to act. FalcorJS actually does have a straight-up "call this query over HTTP" capability - although Jafar Husain doesn't seem to talk about it very much - and once you include that, the way the client libraries react to server information is quite similar, except that GraphQL/Relay adds a layer of configuration. In FalcorJS, if you are returning a value for movie, your return value had better say 'movie', whereas in GraphQL you can declare that even though the query returns 'film', it should be put in the client-side datastore as 'movie'. This is part of the power vs complexity trade-off that Gajus mentioned.
On a practical basis, GraphQL and Relay seem to be more developed. Jafar Husain has mentioned that the next version of the Netflix frontend will be running at least in part on FalcorJS, whereas the Facebook team has mentioned that they've been using some version of the GraphQL/Relay stack in production for over 3 years.
The open-source developer community around GraphQL and Relay seems to be thriving. There are a large number of well-attended supporting projects around GraphQL and Relay, whereas I have personally found very few around FalcorJS. Also, the base GitHub repository for Relay (https://github.com/facebook/relay/pulse) is significantly more active than the GitHub repository for FalcorJS (https://github.com/netflix/falcor/pulse). When I first pulled the Facebook repo, the examples were broken. I opened a GitHub issue and it was fixed within hours. On the other hand, the GitHub issue I opened on FalcorJS has had no official response in two weeks.
Lee Byron, one of the engineers behind GraphQL, did an AMA on Hashnode; here is his answer when asked this question:
Falcor returns Observables, GraphQL just values. For how Netflix wanted to use Falcor, this makes a lot of sense for them. They make multiple requests and present data as it's ready, but it also means that the client developer has to work with the Observables directly. GraphQL is a request/response model, and returns back JSON, which is trivially easy to then use. Relay adds back in some of the dynamism that Falcor presents while still using only plain values.
Type system. GraphQL is defined in terms of a type system, and that has allowed us to build lots of interesting tools like GraphiQL, code generators, error detection, etc. Falcor is much more dynamic, which is valuable in its own right but limits the ability to do this kind of thing.
Network usage. GraphQL was originally designed for operating Facebook's news feed on low end devices on even lower end networks, so it goes to great lengths to allow you to declare everything you need in a single network request in order to minimize latency. Falcor, on the other hand, often performs multiple round trips to collect additional data. This is really just a tradeoff between the simplicity of the system and the control of the network. For Netflix, they also deal with very low end devices (e.g. Roku stick) but the assumption is the network will be good enough to stream video.
Edit: Falcor can indeed batch requests, making the comment about the network usage inaccurate. Thanks to #PrzeoR
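As a small, hedged illustration of the type-system point (the schema below is invented, not taken from the answer): a GraphQL server declares its types up front, and it is this static schema that tools like GraphiQL and code generators can introspect:

    type User {
      id: ID!
      name: String
      friends: [User]
    }

    type Query {
      user(id: ID!): User
    }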
UPDATE: I've found a very useful comment under my post that I want to share with you as a complement to the main content:
Regarding the lack of examples, you may find the awesome-falcorjs repo useful; it contains different examples of Falcor CRUD usage:
https://github.com/przeor/awesome-falcorjs ... Second, there is a book called "Mastering Full Stack React Development" which covers Falcor as well (a good way to learn how to use it):
ORIGINAL POST BELOW:
FalcorJS (https://www.facebook.com/groups/falcorjs/) is much simpler to become productive with than Relay/GraphQL.
The learning curve for GraphQL+Relay is HUGE:
My short summary: go for Falcor. Use Falcor in your next project; once YOU have a large budget and a lot of learning time for your team, then use RELAY+GRAPHQL.
GraphQL+Relay has a huge API that you must become proficient in. Falcor has a small API and is very easy to grasp for any front-end developer who is familiar with JSON.
If you have an AGILE project with limited resources -> then go for FalcorJS!
MY SUBJECTIVE opinion: with FalcorJS it is 500%+ easier to be productive in full-stack JavaScript.
I have also published some FalcorJS starter kits (plus more full-stack Falcor example projects): https://www.github.com/przeor
To go into more technical detail:
1) When you are using Falcor, you can use it on both the front end and the back end:
import falcor from 'falcor';
and then build your model upon it.
You also need two libraries, which are simple to use on the back end:
a) falcor-express - you use it once (ex. app.use('/model.json', FalcorServer.dataSourceRoute(() => new NamesRouter()))). Source: https://github.com/przeor/falcor-netflix-shopping-cart-example/blob/master/server/index.js
b) falcor-router - there you define SIMPLE routes (ex. route: '_view.length'). Source:
https://github.com/przeor/falcor-netflix-shopping-cart-example/blob/master/server/router.js
Falcor is a piece of cake in terms of learning curve.
You can also look at the documentation, which is much simpler than that of FB's library, and check the article "Why you should care about FalcorJS (Netflix Falcor)".
2) Relay/GraphQL is more like a huge enterprise-grade tool.
For example, you have two different sets of documentation that separately cover:
a) Relay: https://facebook.github.io/relay/docs/tutorial.html
- Containers
- Routes
- Root Container
- Ready State
- Mutations
- Network Layer
- Babel Relay Plugin
- GRAPHQL
  - GraphQL Relay Specification
  - Object Identification
  - Connection
  - Mutations
  - Further Reading
- API REFERENCE
  - Relay
  - RelayContainer
  - Relay.Route
  - Relay.RootContainer
  - Relay.QL
  - Relay.Mutation
  - Relay.PropTypes
  - Relay.Store
- INTERFACES
  - RelayNetworkLayer
  - RelayMutationRequest
  - RelayQueryRequest
b) GraphQL: https://facebook.github.io/graphql/
2 Language
2.1 Source Text
2.1.1 Unicode
2.1.2 White Space
2.1.3 Line Terminators
2.1.4 Comments
2.1.5 Insignificant Commas
2.1.6 Lexical Tokens
2.1.7 Ignored Tokens
2.1.8 Punctuators
2.1.9 Names
2.2 Query Document
2.2.1 Operations
2.2.2 Selection Sets
2.2.3 Fields
2.2.4 Arguments
2.2.5 Field Alias
2.2.6 Fragments
2.2.6.1 Type Conditions
2.2.6.2 Inline Fragments
2.2.7 Input Values
2.2.7.1 Int Value
2.2.7.2 Float Value
2.2.7.3 Boolean Value
2.2.7.4 String Value
2.2.7.5 Enum Value
2.2.7.6 List Value
2.2.7.7 Input Object Values
2.2.8 Variables
2.2.8.1 Variable use within Fragments
2.2.9 Input Types
2.2.10 Directives
2.2.10.1 Fragment Directives
3 Type System
3.1 Types
3.1.1 Scalars
3.1.1.1 Built-in Scalars
3.1.1.1.1 Int
3.1.1.1.2 Float
3.1.1.1.3 String
3.1.1.1.4 Boolean
3.1.1.1.5 ID
3.1.2 Objects
3.1.2.1 Object Field Arguments
3.1.2.2 Object Field deprecation
3.1.2.3 Object type validation
3.1.3 Interfaces
3.1.3.1 Interface type validation
3.1.4 Unions
3.1.4.1 Union type validation
3.1.5 Enums
3.1.6 Input Objects
3.1.7 Lists
3.1.8 Non-Null
3.2 Directives
3.2.1 @skip
3.2.2 @include
3.3 Starting types
4 Introspection
4.1 General Principles
4.1.1 Naming conventions
4.1.2 Documentation
4.1.3 Deprecation
4.1.4 Type Name Introspection
4.2 Schema Introspection
4.2.1 The "__Type" Type
4.2.2 Type Kinds
4.2.2.1 Scalar
4.2.2.2 Object
4.2.2.3 Union
4.2.2.4 Interface
4.2.2.5 Enum
4.2.2.6 Input Object
4.2.2.7 List
4.2.2.8 Non-null
4.2.2.9 Combining List and Non-Null
4.2.3 The __Field Type
4.2.4 The __InputValue Type
5 Validation
5.1 Operations
5.1.1 Named Operation Definitions
5.1.1.1 Operation Name Uniqueness
5.1.2 Anonymous Operation Definitions
5.1.2.1 Lone Anonymous Operation
5.2 Fields
5.2.1 Field Selections on Objects, Interfaces, and Unions Types
5.2.2 Field Selection Merging
5.2.3 Leaf Field Selections
5.3 Arguments
5.3.1 Argument Names
5.3.2 Argument Uniqueness
5.3.3 Argument Values Type Correctness
5.3.3.1 Compatible Values
5.3.3.2 Required Arguments
5.4 Fragments
5.4.1 Fragment Declarations
5.4.1.1 Fragment Name Uniqueness
5.4.1.2 Fragment Spread Type Existence
5.4.1.3 Fragments On Composite Types
5.4.1.4 Fragments Must Be Used
5.4.2 Fragment Spreads
5.4.2.1 Fragment spread target defined
5.4.2.2 Fragment spreads must not form cycles
5.4.2.3 Fragment spread is possible
5.4.2.3.1 Object Spreads In Object Scope
5.4.2.3.2 Abstract Spreads in Object Scope
5.4.2.3.3 Object Spreads In Abstract Scope
5.4.2.3.4 Abstract Spreads in Abstract Scope
5.5 Values
5.5.1 Input Object Field Uniqueness
5.6 Directives
5.6.1 Directives Are Defined
5.7 Variables
5.7.1 Variable Uniqueness
5.7.2 Variable Default Values Are Correctly Typed
5.7.3 Variables Are Input Types
5.7.4 All Variable Uses Defined
5.7.5 All Variables Used
5.7.6 All Variable Usages are Allowed
6 Execution
6.1 Evaluating requests
6.2 Coercing Variables
6.3 Evaluating operations
6.4 Evaluating selection sets
6.5 Evaluating a grouped field set
6.5.1 Field entries
6.5.2 Normal evaluation
6.5.3 Serial execution
6.5.4 Error handling
6.5.5 Nullability
7 Response
7.1 Serialization Format
7.1.1 JSON Serialization
7.2 Response Format
7.2.1 Data
7.2.2 Errors
A Appendix: Notation Conventions
A.1 Context-Free Grammar
A.2 Lexical and Syntactical Grammar
A.3 Grammar Notation
A.4 Grammar Semantics
A.5 Algorithms
B Appendix: Grammar Summary
B.1 Ignored Tokens
B.2 Lexical Tokens
B.3 Query Document
It's your choice:
Simple, sweet and briefly documented FalcorJS VERSUS a huge enterprise-grade tool with long and advanced documentation such as GraphQL & Relay.
As I said before, if you are a front-end dev who grasps the idea of using JSON, then the JSON Graph implementation from the Falcor team is the best way to do your full-stack dev project.
In short, Falcor, GraphQL and REST all solve the same problem: they provide a tool to query/manipulate data effectively.
How they differ is in how they present their data:
Falcor wants you to think of your data as one very big virtual JSON tree, and uses get, set and call to read and write data.
GraphQL wants you to think of your data as a group of predefined typed objects, and uses queries and mutations to read and write data.
REST wants you to think of your data as a group of resources, and uses HTTP verbs to read and write data.
Whenever we need to provide data to a user, we end up with something like: client -> query -> {a layer that translates the query into data operations} -> data.
After struggling with GraphQL, Falcor and JSON API (and even OData), I wrote my own data query layer. It's simpler, easier to learn, and largely equivalent to GraphQL.
Check it out at:
https://github.com/giapnguyen74/nextql
It also integrates with feathersjs for real-time query/mutation.
https://github.com/giapnguyen74/nextql-feathers
OK, let's start from a simple but important difference: GraphQL is query-based while Falcor is not!
But how do they help you?
Basically, they both help us manage and query data, but GraphQL has a request/response model and returns the data as JSON. The basic idea in GraphQL is to get all your data with a single request, in one go... You also get an exact response by making an exact request, so it works well on low-speed internet and mobile devices, like 3G networks... So if you have many mobile users, or if for some reason you'd like to have fewer requests and faster responses, use GraphQL... Falcor is not too far from this, so read on...
On the other hand, Falcor by Netflix usually needs extra requests (usually more than one) to retrieve all your data, even though they are trying to improve it towards a single request... Falcor is also more limited in its queries and doesn't have many pre-defined query helpers beyond things like ranges...
But for more clarification, let's see how each of them introduces itself:
GraphQL, A query language for your API
GraphQL is a query language for APIs and a runtime for fulfilling
those queries with your existing data. GraphQL provides a complete and
understandable description of the data in your API, gives clients the
power to ask for exactly what they need and nothing more, makes it
easier to evolve APIs over time, and enables powerful developer tools.
Send a GraphQL query to your API and get exactly what you need,
nothing more and nothing less. GraphQL queries always return
predictable results. Apps using GraphQL are fast and stable because
they control the data they get, not the server.
GraphQL queries access not just the properties of one resource but
also smoothly follow references between them. While typical REST APIs
require loading from multiple URLs, GraphQL APIs get all the data your
app needs in a single request. Apps using GraphQL can be quick even on
slow mobile network connections.
GraphQL APIs are organized in terms of types and fields, not
endpoints. Access the full capabilities of your data from a single
endpoint. GraphQL uses types to ensure Apps only ask for what’s
possible and provide clear and helpful errors. Apps can use types to
avoid writing manual parsing code.
Falcor, a JavaScript library for efficient data fetching
Falcor lets you represent all your remote data sources as a single
domain model via a virtual JSON graph. You code the same way no matter
where the data is, whether in memory on the client or over the network
on the server.
A JavaScript-like path syntax makes it easy to access as much or as
little data as you want, when you want it. You retrieve your data
using familiar JavaScript operations like get, set, and call. If you
know your data, you know your API.
Falcor automatically traverses references in your graph and makes
requests as needed. Falcor transparently handles all network
communications, opportunistically batching and de-duping requests.

gson vs protocol buffer

What are the pros and cons of protocol buffer (protobuf) over GSON?
In what situations is protobuf more appropriate than GSON?
I am sorry for a very generic question.
Both JSON (via the gson library) and protobuf are portable between platforms, but:
protobuf is smaller (bandwidth) and cheaper (CPU) to read/write
json is human readable / editable (protobuf is binary; hard to parse without library support)
merging protobuf fragments is trivial - just concatenate them
json is easily passed to web page clients
the main Java version of protobuf needs a contract definition (.proto) and code generation; gson allows arbitrary POJO usage (there are protobuf implementations that work on such objects, but not for Java AFAIK) - see the sketch after this list
If performance is key: protobuf
For use with a web page (JavaScript), or human readable: json (perhaps via gson)
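As a small sketch of the workflow difference (the Person class and message below are invented for illustration): with gson you serialize an arbitrary POJO directly, while with protobuf you first write a .proto contract and then work with the generated builder classes:

    import com.google.gson.Gson;

    // Plain POJO: gson needs no contract and no code generation.
    class Person {
        String name;
        int age;
        Person() {}                                   // no-arg constructor for deserialization
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    public class GsonDemo {
        public static void main(String[] args) {
            Gson gson = new Gson();
            String json = gson.toJson(new Person("Ada", 36));   // {"name":"Ada","age":36}, human readable
            Person back = gson.fromJson(json, Person.class);
            System.out.println(json + " -> " + back.name);
        }
    }

    // Protobuf, by contrast, starts from a contract such as
    //   message Person { required string name = 1; optional int32 age = 2; }
    // and the generated Java class is used via builders and a compact binary wire format:
    //   Person p = Person.newBuilder().setName("Ada").setAge(36).build();
    //   byte[] wire = p.toByteArray();
    //   Person back = Person.parseFrom(wire);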
If you want efficiency and cross-platform support, you should send raw messages between applications containing the information that is necessary and nothing more or less.
Serialising classes via Java's own mechanisms, gson, protobufs or whatever creates data that contains not only the information you wish to send, but also information about the logical structures/hierarchies of the data structures that were used to represent the data inside your application.
This gives those classes and data mappings a dual purpose: one, to represent the data internally in the application, and two, to be transmitted to another application. Those two roles can conflict, and there is an onus on the developer to remember that the classes, collections and data layout he is working with at any time will also be serialised.

Google protocol buffers and stl vectors, maps and boost shared pointers

Does Google Protocol Buffers support STL vectors, maps and Boost shared pointers? I have some objects that make heavy use of STL containers like maps and vectors, and also of boost::shared_ptr. I want to use Google Protocol Buffers to serialize these objects across the network to different machines.
I want to know whether Google protobuf supports these containers. Also, would it be better if I used Apache Thrift instead? I only need to serialize/deserialize data and don't need the network transport that Apache Thrift offers. Also, Apache Thrift not having proper documentation puts me off.
Protocol Buffers directly handles an intentionally small number of constructs; vectors map nicely to the "repeated" element type, but in C++ this is exposed via "add" methods - you don't (AFAIK) just hand it a vector. See "Repeated Embedded Message Fields" in the documentation for more info.
Regarding maps: there is no built-in mechanism for that, but a key/value pair is easily represented in .proto (typically key = 1, value = 2) and the pairs are then handled via "repeated" (a sketch follows).
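A hedged sketch of that approach (message and field names are invented), using proto2 syntax since built-in map fields did not exist at the time of this answer:

    message Entry {                  // one key/value pair standing in for a map entry
      required string key = 1;
      required string value = 2;
    }

    message Snapshot {
      repeated int32 samples = 1;    // roughly corresponds to std::vector<int32_t>
      repeated Entry table = 2;      // roughly corresponds to std::map<std::string, std::string>
    }

On the C++ side the generated class then exposes add_samples()/add_table() calls rather than accepting an std::vector or std::map directly, as noted above.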
A shared_ptr itself would seem to have little meaning in a serialized file. But the object may be handled (presumably) as a message.
Note that in the google C++ version the DTO layer is generated, so you may need to map between them and any existing object model. This is usually pretty trivial.
For some languages/platforms there are protobuf variants that work against existing object models.
(sorry, I can't comment on thrift - I'm not familiar with it)

Repository pattern with "modern" data access strategies

So I was searching the web looking for best practices when implementing the repository pattern with multiple data stores when I found my entire way of looking at the problem turned upside down. Here's what I have...
My application is a BI tool pulling data from (as of now) four different databases. Due to internal constraints, I am currently using LINQ-to-SQL for data access but require a design that will allow me to change to Entity Framework or NHibernate or the next data access du jour. I also hold steadfast to decoupled layers in my apps using an IoC framework (Castle Windsor in this case).
As such, I've used the Repository pattern to abstract the actual data access code from my business layer. As a result, my business object is coded against some I<Entity>Repository interface and the IoC Container is used to manage the actual implementation. In this case, I would expect to have a concrete Linq<Entity>Repository that implements the interface using LINQ-to-SQL to do the work. Later I could replace this with an EF<Entity>Repository with no changes required to my business layer.
Also, because I'm coding against the interface, I can easily mock the repository for unit testing purposes.
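To show the shape of this arrangement, here is a minimal, language-agnostic sketch (written in Java purely for brevity; the question's stack is .NET, so every name here is illustrative rather than taken from the question):

    import java.util.Collections;
    import java.util.List;

    // The business layer depends only on this abstraction.
    interface CustomerRepository {
        Customer getByKey(long id);
        List<Customer> findByRegion(String region);
        void add(Customer customer);
        void remove(Customer customer);
    }

    // The IoC container decides which concrete data-access implementation
    // (LINQ-to-SQL, EF, NHibernate, ...) gets wired in; unit tests can
    // substitute a mock of the same interface instead.
    class OrmCustomerRepository implements CustomerRepository {
        public Customer getByKey(long id) { /* query the data context here */ return null; }
        public List<Customer> findByRegion(String region) { return Collections.emptyList(); }
        public void add(Customer customer) { /* insert via the ORM */ }
        public void remove(Customer customer) { /* delete via the ORM */ }
    }

    class Customer {
        long id;
        String region;
    }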
So the first question that I have as I begin coding the application is whether I should have one repository per DataContext or per entity (as I've typically done)? Let's say one database contains Customers and Sales with the expected relationship. Should I have a single OrderTrackingRepository with methods that work with both entities or have a separate CustomerRepository and a different SalesRepository?
Next, as a BI tool, the primary interface is for reporting, charting, etc and often will require a "mashup" of data across multiple sources. For instance, the reality is that one database contains customer information while another handles sales information and a third holds other financial information but one of my requirements is to display aggregated information that spans all three. Plus, I have to support dynamic filtering in the UI. Obviously working directly against the LINQ-to-SQL or EF DataContext objects (Table<Entity>, for instance) will allow me to pretty much do anything. What's the best approach to expose that same functionality to my business logic when abstracting the DAL with a repository interface?
This article indicates that EF4 has turned this approach around and that the repository is nothing more than an IQueryable returned from the EF DataContext, which brings up a whole other set of questions.
But, I think I've rambled on enough...
UPDATE (Thanks, Steven!)
Okay, let me put a more tangible (for me, at least) example on the table and clarify a few points that will hopefully lead to an approach I can better wrap my head around.
While I understand what Steven has proposed, I have a team of developers I have to consider when implementing such things and I'm afraid they will get lost in the complexity (yes, a real problem here!).
So, let's remove any direct tie-in with LINQ-to-SQL, because I don't want a solution that is dependent upon the way L2S works - or even EF, for that matter. My intent has been to abstract away the data access technology being used so that I can change it as needed without requiring collateral changes to the consuming code in my business layer. I've accomplished this in the past by presenting the business layer with IRepository interfaces to work against. Perhaps these should have been named IUnitOfWork or, more to my liking, IDataService, but the goal is the same. These interfaces typically exposed methods such as Add, Remove, Contains and GetByKey, for example.
Here's my situation. I have three databases to work with. One is DB2 and contains all of the business information for a customer (franchise), such as their info and their Products, Orders, etc. Another, a SQL Server database, contains their financial history, while a third SQL Server database contains application-specific information. The first two databases are shared by multiple applications.
Through my application, the customer may enter/upload their financial information for a given time period. When entered, I have to perform the following steps:
1. Validate the entered data against a set of static rules. For example, the data must contain a legitimate customer ID value (in the case of an upload). This requires a lookup in the DB2 database to verify that the supplied customer ID exists and is current.
2. Next, I have to validate the data against a set of dynamic rules which are contained in the third (SQL Server) database. An example may be that a given value cannot exceed a certain percentage of another value.
3. Once validated, I persist the data to the second SQL Server database containing the financial data.
All the while, my code must have loosely-coupled dependencies so I may mock them in my unit tests.
As part of the analysis, I know that I have three distinct data stores to work with and about a half-dozen or so entities (at this time) that I am working with. In generic terms, I presume that I would have three DataContexts in my application, one per data store, with the entities exposed by the appropriate data context.
I could then create a separate I{repository|unit of work|service} for each entity that would be consumed by my business logic, with a concrete implementation that knows which data context to use. But this seems to be a risky proposition: as the number of entities increases, so does the number of individual repository/UoW/service types.
Then, take the case of my validation logic which works with multiple entities and, thereby, multiple data contexts. I'm not sure this is the most efficient way to do this.
The other requirement that I have yet to mention is on the reporting side where I will need to execute some complex queries on the data stores. As of right now, these queries will be limited to a single data store at a time, but the possibility is there that I might need to have the ability to mash data together from multiple sources.
Finally, I am considering the idea of pulling out all of the data access stuff for the first two (shared) databases into their own project and have been looking at WCF Data Services as a possible approach. This would give me the basis for a consistent approach for any application making use of this data.
How does this change your thinking?
In your case I would recommend returning IEnumerables from the repository's data queries. I usually aggregate calls from multiple repositories through a service class that represents the domain problem and encapsulates my business logic. To keep it clean, I try to keep my repositories focused on the domain problem. I liken my DataContext to a repository and extract an interface using a T4 template to make mocking easier. But there is nothing stopping you from using a traditional repository that encapsulates your calls. Doing it this way will allow you to switch ORMs at any stage.
EDIT: IQueryable IS NOT THE ANSWER! :-)
I have also done a lot of work in this area, and INITIALLY came to the same conclusion; however, it is NOT a good solution. The point of the repository is to abstract queries into discrete chunks of work. Exposing IQueryable is too ad hoc and raises some issues later down the line. You lose your ability to scale. You lose your ability to optimize queries (let's say I want to move to a highly optimized stored proc). You lose your ability to use IoC for the repository to switch out data access layers (switch the project from SQL to Mongo). You lose your ability to provide effective data caching in the repository (which is a major strength of the repository pattern). I would recommend taking a CLOSE look at WHY we have the repository pattern. It isn't simply an "ORM" mapping layer. What made this really clear to me was the CQRS pattern.
Further to this, the ad-hoc nature of IQueryable opens you up to ill-fitting reuse of queries. It is GENERALLY not a good idea to reuse queries, since from query to query you see slight deviations, which ends up with two by-products: queries become too broad and inefficient, and queries become riddled with unmaintainable IF/THEN statements to cater for the deviations.
IQueryable is easy, but opens you up to an unmaintainable mess.
Look at this SO answer. I think it shows a simplified model of what you want. IQueryable<T> is indeed our new Repository :-). DataContext and ObjectContext are our Unit of Work.
UPDATE 2:
Here is a blog post that describes the model you might be looking for.
UPDATE 3
It would be wise to hide the shared databases behind a service. This will solve several problems:
This will make the database private to the service, which makes it much easier to change the implementation when needed.
You can put the needed validation logic (for database 1) in that service and can create tests for that validation logic in that project.
Clients accessing that service can assume correctness of the service, and its validation logic.
The result of this is that your application will send data to the service to validate it, call the service to fetch data, query its own private database (database 3), and join the data from the three data sources together locally. I've never been a fan of using cross-database or even cross-server (in your situation) database calls and letting the database join everything together. Transactions will be promoted to distributed transactions, and it's hard to predict how much data the servers will exchange.
When you abstract the shared databases behind the service, things get easier (at least from your application's point of view). Your application calls services it trusts which limits the amount of code in that application and the amount of tests. You still want to mock the calls to such a service, but that would be pretty easy. It should also solve the problem of validating over multiple data sources.
Validation is always a hard part. I'm very familiar with the Validation Application Block, and love it for its flexibility. It isn't an easy framework, however, but you might take a peek at what you can do with it. For instance, I've written several articles about integration with O/RM tools and how to 'embed' a context (context as in DataContext/Unit of Work) in the Validation Application Block.
Please have a look at my IRepository pattern implementation using EF 4.0.
My solution has the following features:
Supports connections to multiple databases
One repository per entity
Support for execution of queries
Unit of Work pattern implementation
Support for validating entities using VAB guidance
Common operations are kept at the base-class level. Heavy use of OOP techniques for code reusability and ease of maintenance.

Resources