Separate Data Access Layers for Distributed Compute - spring

Overview
Currently my product maintains a DAL that is separated from business logic and exposed via a set of services, where each service generally corresponds to an entity, e.g. Car objects are accessed via the CarService. These services are powered by Spring Data Repositories and access data (models) stored in both PostgreSQL and Elasticsearch.
We are now processing more and more data (documents in, our models out or documents in, clustering, models out) and have realized that computation has become a bottleneck. To overcome this we are evaluating Spark or Apache Beam to distribute the computation horizontally which would solve the problem.
Problem
After looking into the Spark (and Beam) frameworks I have found that they generally provide their own integration (or plugin) for reading/writing from/to datasources, which in and of itself is great. The problem for me is that I can't find any way for these frameworks to support distributed reading/writing through our current set of services. Spark requires an RDD and Beam requires a PCollection, and I'd rather not support two methods of reading/writing from our datastores to accommodate them.
My Question
Has anyone encountered this before? What was your strategy?
Did you go ahead and support 2 types of DAL? If so, were there any caveats with this especially with regards to the ongoing maintenance of the code?

In software engineering, multi-tier architecture is a client-server architecture in which the presentation, the application processing and the data management are logically separate processes. This separation of concerns helps with performance, scalability and maintenance.
Keep in mind that tiers are logical levels, which means they may or may not correspond to separate physical layers.
If you are going with Image 1, there is no need for new DAO layers. With Image 2, I suggest creating a separate project and using an EAI pattern to communicate between the two projects.
Image 1:
In Image 1 you can process the data, store it in the database, and use the same DAO layer to read it back.
Image 2:
In Image 2 you can create a separate layer where you submit your jobs and collect the results directly in your Spring code.
https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
Apache Spark and big data systems in general come with different architecture styles; please read the following links.
http://lambda-architecture.net/
http://milinda.pathirage.org/kappa-architecture.com/
https://mapr.com/solutions/zeta-enterprise-architecture/
What are the differences between kappa-architecture and lambda-architecture

Related

Use Cases of NIFI

I have a question about Nifi and its capabilities as well as the appropriate use case for it.
I've read that Nifi is really aiming to create a space which allows for flow-based processing. After playing around with Nifi a bit, what I've also come to realize is its capability to model/shape the data in a way that is useful for me. Is it fair to say that Nifi can also be used for data modeling?
Thanks!
Data modeling is a bit of an overloaded term, but in the context of your desire to model/shape the data in a way that is useful for you, it sounds like it could be a viable approach. The rest of this is under that assumption.
While NiFi employs dataflow through principles and design closely related to flow based programming (FBP) as a means, the function is a matter of getting data from point A to B (and possibly back again). Of course, systems aren't inherently talking in the same protocols, formats, or schemas, so there needs to be something to shape the data into what the consumer is anticipating from what the producer is supplying. This gets into common enterprise integration patterns (EIP) [1] such as mediation and routing. In a broader sense though, it is simply getting the data to those that need it (systems, users, etc) when and how they need it.
Joe Witt, one of the creators of NiFi, gave a great talk at a Meetup that may be in line with this idea of data shaping in the context of Data Science. The slides are available [2].
If you have any additional questions, I would point you to the community mailing lists [3], where you can dig in more and get a broader perspective.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Patterns
[2] http://files.meetup.com/6195792/ApacheNiFi-MD_DataScience_MeetupApr2016.pdf
[3] http://nifi.apache.org/mailing_lists.html
Data modeling might well mean many things to many folks, so I'll be careful about using that term here. What I think is very clear from your question is that Apache NiFi is a great system to help mold the data into the right format, schema and content you need for your follow-on analytics and processing. NiFi has an extensible model, so you can add processors that do this, or you can use the existing processors in many cases. You can even use the ExecuteScript processors to write scripts on the fly to manipulate the data.

Hadoop use-case scenario

I would like to have some expert views on the use of a Big Data platform like Hadoop in one of my project scenarios. I am a complete novice in this technology although I understand databases like MySQL well.
We are creating a product which would be used to analyse data from social media. So the input data would be a large volume of tweets, facebook posts, user profiles, YouTube data and data from blogs etc. On top of this I would be having a web application to help me view and analyse this data. As the requirement makes it clear, I would be needing a sort of real time system. So if I have a tweet coming in, I would like to have it available to my web app readily for processing. Batch data processing may not be a suitable choice for my application.
My questions are:
Is a Hadoop engine a good choice for me?
What are the parameters I should base my decision on?
Is it also a good option to use a Multi Cluster MySQL engine as opposed to Hadoop?
Is there any benchmarking in terms of Size and velocity of data in which Hadoop becomes a good choice?
Hadoop is not appropriate for near real time / interactive analysis. Hadoop was designed for big batch processing of, say, a few hours of data or more. I used to use Hadoop to process any dataset that was around 10 GB or more (which is still a bit overkill); once it gets to 100 GB you definitely want something like Hadoop.
Now my recommendation would be Spark, as it is much more modern, much faster, more flexible, more powerful, and has a Spark Streaming module for getting closer to real time analysis. Read all about it! https://spark.apache.org/
In this case I prefer the Lambda Architecture.
With Lambda Architecture you have two routes: a fast route with a NoSQL database for the current information, and a batch route with Hadoop HDFS for the archived data. With a merge component you can combine the two data sources in one query, so you receive the whole data set in near real time.
http://lambda-architecture.net/
Image about lambda architecture: http://i.stack.imgur.com/eofRW.png
We created a PoC project with Lambda Architecture (also for Twitter analysis), and it's working fine.
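The merge step of the Lambda Architecture described above can be sketched roughly as follows. This is only a minimal illustration, assuming both routes expose per-hashtag counts as plain dictionaries; the names `merged_counts`, `batch_view` and `speed_view` are made up for the example and do not come from any framework.

```python
from collections import Counter


def merged_counts(batch_view, speed_view):
    """Combine the precomputed batch view with the recent real-time view."""
    total = Counter(batch_view)   # counts computed by the (slow) batch route
    total.update(speed_view)      # add counts that arrived since the last batch run
    return dict(total)


# Hypothetical data: hashtag counts from each route.
batch_view = {"#spark": 120, "#hadoop": 80}
speed_view = {"#spark": 3, "#nifi": 1}

result = merged_counts(batch_view, speed_view)
```

In a real system the batch view would typically be recomputed periodically and the speed view discarded up to that point, so the two never count the same event twice.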
Spark will be the best solution for your problem. You can also look at other in-memory databases.

How to best represent database views/summary info in "3-Tiered" application

This is basically asking the same question as in How to handle views in a multilayer-application. However, that post didn't receive much feedback.
Here's the problem: we have built a 3-tiered web application with the following tiers:
-Data Access (using repositories)
-Service
-UI (MVC 3)
DTOs are passed between the UI (Controller) Layer and Service Layer. Heavier Domain Models, containing a lot of domain-level logic, are passed between the Service and Data Access Layers. Everything is decoupled using IoC and the app follows SOLID principles (or tries to) --a big happy decoupled family!
Currently the DTO->Domain Model and Domain Model->DTO conversion happens all in the service layer.
So, finally to my question:
We are going to need to start displaying more complex read-only subsets of information (e.g. summary views joining multiple entities, doing rollup totals, etc). So what is the best practice for representing this type of read-only data in the n-tiered system? Having to map read-only Domain Model types to DTO types in this case doesn't make sense to me. In most cases, there would be no difference between the two types anyway. My thought would be to "break" the layering boundaries for these read-only types, having the Data Access Layer serve up the DTOs directly and pass those through to the Service Layer and on to the UI.
Can anyone point me in the right direction?
Much Thanks!
Your thought on breaking the layering for reading and then displaying values makes complete sense. After all, the architecture/design of the system should help you and not the other way around.
Displaying report-like data to the user should be queried simply from the database and pushed to the view; no domain/dto conversion, especially if you're in a web app. You will save yourself a lot of trouble by doing this.
Personally, I had some attempts to go through these mappings just to display some read only data and it worked poorly; the performance, the unnecessary mappings, the odd things I had to do just to display some kind of report-like views. In this case, you'll likely have your domain model and a read model. You can look up CQRS pattern, it might guide you away from thinking that you want to use the same data model for both writes and reads.
So, to answer your question, I believe that in this case the best way would be to skip layering and read DTOs directly from the database through a thin layer.
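To make the "thin read layer" idea concrete, here is a rough sketch in Python using the standard `sqlite3` module. The table layout, the `CustomerOrderSummary` DTO and the `fetch_order_summaries` function are all assumptions invented for the illustration; the point is only that the summary rows go straight from a query into a read-only DTO without passing through the domain model.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerOrderSummary:
    """Read-only DTO for a report-like view (hypothetical shape)."""
    customer_name: str
    order_count: int
    total_amount: float


def fetch_order_summaries(conn):
    """Thin read layer: one query, mapped directly to DTOs."""
    rows = conn.execute(
        """
        SELECT c.name, COUNT(o.id), COALESCE(SUM(o.amount), 0)
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.name
        """
    ).fetchall()
    return [CustomerOrderSummary(*row) for row in rows]


# Hypothetical schema and data, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.5), (3, 2, 7.0);
    """
)
summaries = fetch_order_summaries(conn)
```

Writes would still go through the domain model and its repositories; only the read side takes this shortcut.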

CF Project getting too big, what shall one do?

A simple billing system (on top of ColdBox MVC) is ballooning into a semi-enterprisey inventory + provisioning + issue-tracking + profit tracking app. They seem to be doing their own thing yet they share many things including a common pool of Clients and Staff (logins), and other intermingled data & business logic.
How do you keep such a system modular from a maintenance, testability & re-usability standpoint?
single monolithic app? (i.e. new package for the base app)
ColdBox module? not sure how to make it 'installable' and what benefits does it bring yet.
Java Portlet? no idea, just thinking outside the box
SOA architecture? through webservice API calls?
Any idea and/or experience you'd like to share?
I would recommend you break the app into modular pieces using ColdBox Modules. You can also investigate separating business logic into a RESTful ColdBox layer and joining the system that way. Again, it all depends on your requirements and needs at the moment.
Modules are designed to break monolithic applications into more manageable parts that can be standalone or coupled together.
Stop thinking about technology (e.g. Java Portals, ColdBox modules, etc...) and focus on architecture. By this I mean imagining how you can explain your system to an observer. Start by drawing a set of boxes on a whiteboard that represent each piece - inventory, clients, issue tracking, etc... - and then use lines to show interactions between those systems. This focuses you on a separation of concerns, that is grouping together like functionality. To start don't worry about the UI, instead focus on algorithms and data.
If you were talking about MVC, that step is focusing on the model. With that activity complete comes the hard part: modifying code to conform to that diagram (i.e. the model). To really understand what this model should look like I suggest reading Domain Driven Design by Eric Evans. The goal is arriving at a model whose relationships are manageable via dependency injection. Presumably this leaves you with a set of high level CFCs - services if you will - with underlying business entities and persistence management. Their relationships are best managed by some sort of bean container / service locator, of which I believe ColdBox has its own; another example is ColdSpring.
The upshot of this effort is a model that's unit testable, independent of the user interface. If all of this is confusing I'd suggest taking a look at Working Effectively with Legacy Code for some ideas on how to make this transition.
Once you have this in place it's now possible to think about a controller (e.g. ColdBox) and linking the model to views through it. However, study whatever controller carefully and choose it because of some capability it brings to the table that your application needs (caching is an example that comes to mind). Your views will likely need to be reimagined as well to interact with this new design, but what you should have is a system where the algorithms are now divorced from the UI, making the views' job easy.
Realistically, the way you tackle this problem is iteratively. Find one system that can easily be teased out in the fashion I describe, get it under unit tests, validate with people as well, and continue to the next system. While a tedious process, I can assure you it's much less work than trying to rewrite everything, which invites disaster unless you have a very good set of automated validation ahead of time.
Update
To reiterate, the tech is not going to solve your problem. Continued iteration toward more cohesive objects will.
Now as far as coupled data, with an ORM you've made a tradeoff, and monolithic systems do have their benefits. Another approach would be giving one stateful entity a reference to another's service object via DI, such that you retrieve it through that. This would enable you to mock it for the purpose of unit testing and replace it with a similar service object and corresponding entity to facilitate reuse in other contexts.
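As a rough illustration of handing one entity a reference to another's service object via DI, here is a small Python sketch (Python rather than CFML, purely to keep the example short). The `Invoice` class, the `get_client` method and the data shape are hypothetical; the mock stands in for whatever real client service the application would inject.

```python
from unittest.mock import Mock


class Invoice:
    """Billing entity that reaches client data only through an injected service."""

    def __init__(self, client_id, amount, client_service):
        self.client_id = client_id
        self.amount = amount
        self._client_service = client_service  # injected, never constructed here

    def billing_name(self):
        # Retrieve the related entity through the other module's service object.
        return self._client_service.get_client(self.client_id)["name"]


# In a unit test the real service is replaced by a mock.
fake_clients = Mock()
fake_clients.get_client.return_value = {"id": 7, "name": "Acme Ltd"}

invoice = Invoice(client_id=7, amount=99.0, client_service=fake_clients)
name = invoice.billing_name()
```

Because `Invoice` only depends on something that has a `get_client` method, the same entity could be reused in another app with a different client service behind it.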
In terms of solving business problems (e.g. accounting) reuse is an emergent property where you write multiple systems that do roughly the same thing and then figure out how to generalize. Rarely if ever in my experience do you start out writing something to solve some business problem that becomes a reusable component.
I'd suggest you invest some time in looking at Modules. It will help with partitioning your code into logical features whilst retaining the integration with the Model.
Being ColdBox, there are loads of docs and examples...
http://wiki.coldbox.org/wiki/Modules.cfm
http://experts.adobeconnect.com/p21086674/
You need to get rid of the MVC and replace it with an SOA architecture; that way the only thing joining the two halves is the service requests.
So on the server side you have the DAO and FACADE layers. And the client side can be MVC or whatever architecture you want to use, sitting somewhere else. You can even have an individual client for each distinct business.
Even for the server side you can break the project down into multiple servers: what's common between all businesses and then what's distinct between all of them.
The problem we're facing here luckily isn't unique.
The issue here seems not to be the code itself, or how to break it apart, but rather to understand that you're now into ERP design and development.
Knowing how best to develop and grow an ERP which manages the details of this organization in a logical manner is the deeper question I think you're trying to get at. The design and architecture itself of how to code from this flows from an understanding of the core functional areas you need.
Luckily we can study some existing ERP systems you can get hold of to see how they tackled some of the problems. There are a few good open source ERPs, and what brought this tip to my mind is a full cycle install of SAP Business One I oversaw (a small-mid size ERP that bypasses the challenges of the big SAP).
What you're looking for is seeing how others are solving the same ERP architecture you're facing. At the very least you'll get an idea of the tradeoffs between modularization, where to draw the line between modules and why.
Typically an ERP system handles everything from the quote, to production (if required), to billing, shipping, and the resulting accounting work all the way through out.
ERPs handle two main worlds:
Production of goods
Delivery of service
Some businesses are widget factories, others are service businesses. A full featured out of the box ERP will have one continuous chain/lifecycle of an "order" which gets serviced by a number of steps.
If we read a rough list of the steps an ERP can cover, you'll see the ones that apply to you. Those are probably the modules you have or should be breaking your app into. Imagine the following steps where each is a different document, all connected to the previous one in the chain.
Lead Generation --> Sales Opportunities
Sales Opportunities --> Quote/Estimate
Quote/Estimate --> Sales Order
Sales Order --> Production Order (Build it, or schedule someone to do the work)
Production Order --> Purchase Orders (Order required materials or specialists to arrive when needed)
Production Order --> Production Scheduling (What will be built, when, or Who will get this done, when?)
Production Schedule --> Produce! (Do the work)
Produced Service/Good --> Inventory Adjustments - Convert any raw inventory to finished goods if needed, or get it ready to ship
Finished Good/Service --> Packing Slip
Packing Slip items --> Invoice
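The chain above, where each document is connected to the previous one, could be modelled along these lines. This is just a sketch to show the idea; the `Document` class and its `lineage` method are invented for the example and are not taken from any ERP product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """One step in the order lifecycle, linked to the document it came from."""
    kind: str
    source: Optional["Document"] = None

    def lineage(self):
        """Walk back through the chain and return the kinds, oldest first."""
        chain, doc = [], self
        while doc is not None:
            chain.append(doc.kind)
            doc = doc.source
        return list(reversed(chain))


# A service business might skip production and inventory entirely.
quote = Document("Quote")
order = Document("Sales Order", source=quote)
packing_slip = Document("Packing Slip", source=order)
invoice = Document("Invoice", source=packing_slip)

steps = invoice.lineage()
```

Skipping unused steps is then just a matter of which documents get created; each module only needs to know about the document type it consumes and the one it produces.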
Where system integrators come in is using the steps required, and skipping over the ones that aren't used. This leads to one thing for your growing app:
Get a solid data security strategy in place. Make sure you're comfortable that everyone can only see what they should. Assuming that is in place, it's a good idea to break the app apart into its major sections. Modules are our friends. The order in which you break them up, however, will likely have a larger effect on what you do than anything.
See which sections are general, (reporting, etc) that could be re-used between multiple apps, and which are more specialized to the application itself. The features that are tied to the application itself will likely be more tightly coupled already and you may have to work around that.
For an ERP, I have always preferred a transactional "core" module, which all the other transaction providers (billing, etc.) build on, pushing the process along once it is defined.
When I converted a Lotus Notes ERP from the 90's to the SAP ERP, the Lotus Notes app was excellent; it handled everything as it should. There were some mini-apps built on the side that weren't integrated as modules, which was the main reason to get rid of it.
If you re-wrote the app today, with today's requirements, how would you have done it differently? See if there's any major differences from what you have. Let the app fight for your attention to decide what needs overhauling / modularization first. ColdBox is wonderful for modularization, whether you're using plugin type modules or just using well separated code you won't go wrong with it, it's just a function of developer time and money available to get it done.
The first modules I'd build / automate unit testing on are the most complex programmatically. Chances are if you're a decent dev, you don't need end to end unit testing as of yesterday. Start with the most complex, move on to the core parts of the app, and then spread into any other areas that may keep you up at night.
Hope that helped! Share what you end up doing if you don't mind, if anything I mentioned needs further explanation hit me up on here or twitter :)
#JasPanesar

Repository pattern with "modern" data access strategies

So I was searching the web looking for best practices when implementing the repository pattern with multiple data stores when I found my entire way of looking at the problem turned upside down. Here's what I have...
My application is a BI tool pulling data from (as of now) four different databases. Due to internal constraints, I am currently using LINQ-to-SQL for data access but require a design that will allow me to change to Entity Framework or NHibernate or the next data access du jour. I also hold steadfast to decoupled layers in my apps using an IoC framework (Castle Windsor in this case).
As such, I've used the Repository pattern to abstract the actual data access code from my business layer. As a result, my business object is coded against some I<Entity>Repository interface and the IoC Container is used to manage the actual implementation. In this case, I would expect to have a concrete Linq<Entity>Repository that implements the interface using LINQ-to-SQL to do the work. Later I could replace this with an EF<Entity>Repository with no changes required to my business layer.
Also, because I'm coding against the interface, I can easily mock the repository for unit testing purposes.
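For readers who want to see the shape of this, here is a language-neutral sketch written in Python rather than C#. The names (`CustomerRepository`, `InMemoryCustomerRepository`, `CustomerGreeter`) are hypothetical and only mirror the I<Entity>Repository idea described above: the business object sees the interface, and the concrete implementation can be a LINQ-to-SQL, EF, or in-memory one.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Customer:
    id: int
    name: str


class CustomerRepository(Protocol):
    """The interface the business layer is coded against."""

    def get_by_key(self, key: int) -> Optional[Customer]: ...
    def add(self, customer: Customer) -> None: ...


class InMemoryCustomerRepository:
    """Test double; a database-backed class would offer the same methods."""

    def __init__(self):
        self._rows = {}

    def get_by_key(self, key):
        return self._rows.get(key)

    def add(self, customer):
        self._rows[customer.id] = customer


class CustomerGreeter:
    """Business object that only knows about the repository interface."""

    def __init__(self, repo: CustomerRepository):
        self._repo = repo

    def greeting(self, key):
        customer = self._repo.get_by_key(key)
        return f"Hello, {customer.name}" if customer else "Unknown customer"


repo = InMemoryCustomerRepository()
repo.add(Customer(1, "Ada"))
greeter = CustomerGreeter(repo)
```

In the C# setup described here, the IoC container (Castle Windsor) would do the wiring that the last three lines do by hand.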
So the first question that I have as I begin coding the application is whether I should have one repository per DataContext or per entity (as I've typically done)? Let's say one database contains Customers and Sales with the expected relationship. Should I have a single OrderTrackingRepository with methods that work with both entities or have a separate CustomerRepository and a different SalesRepository?
Next, as a BI tool, the primary interface is for reporting, charting, etc and often will require a "mashup" of data across multiple sources. For instance, the reality is that one database contains customer information while another handles sales information and a third holds other financial information but one of my requirements is to display aggregated information that spans all three. Plus, I have to support dynamic filtering in the UI. Obviously working directly against the LINQ-to-SQL or EF DataContext objects (Table<Entity>, for instance) will allow me to pretty much do anything. What's the best approach to expose that same functionality to my business logic when abstracting the DAL with a repository interface?
This article: link text indicates that EF4 has turned this approach around and that the repository is nothing more than an IQueryable returned from the EF DataContext which brings up a whole other set of questions.
But, I think I've rambled on enough...
UPDATE (Thanks, Steven!)
Okay, let me put a more tangible (for me, at least) example on the table and clarify a few points that will hopefully lead to an approach I can better wrap my head around.
While I understand what Steven has proposed, I have a team of developers I have to consider when implementing such things and I'm afraid they will get lost in the complexity (yes, a real problem here!).
So, let's remove any direct tie-in with Linq-to-Sql because I don't want a solution that is dependent upon the way L2S works - or even EF, for that matter. My intent has been to abstract away the data access technology being used so that I can change it as needed without requiring collateral changes to the consuming code in my business layer. I've accomplished this in the past by presenting the business layer with IRepository interfaces to work against. Perhaps these should have been named IUnitOfWork or, more to my liking, IDataService, but the goal is the same. These interfaces typically exposed methods such as Add, Remove, Contains and GetByKey, for example.
Here's my situation. I have three databases to work with. One is DB2 and contains all of the business information for a customer (franchise) such as their info and their Products, Orders, etc. Another, a SQL Server database, contains their financial history, while a third SQL Server database contains application-specific information. The first two databases are shared by multiple applications.
Through my application, the customer may enter/upload their financial information for a given time period. When entered, I have to perform the following steps:
1. Validate the entered data against a set of static rules. For example, the data must contain a legitimate customer ID value (in the case of an upload). This requires a lookup in the DB2 database to verify that the supplied customer ID exists and is current.
2. Next I have to validate the data against a set of dynamic rules which are contained in the third (SQL Server) database. An example may be that a given value cannot exceed a certain percentage of another value.
3. Once validated, I persist the data to the second SQL Server database containing the financial data.
All the while, my code must have loosely-coupled dependencies so I may mock them in my unit tests.
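The three steps above, with each data store behind its own injectable dependency, might look something like the following. This is a Python sketch of the structure only, not the C# implementation: `FinancialUploadService`, the stub classes, the field names and the 50% rule are all assumptions made up for illustration.

```python
class FinancialUploadService:
    """Coordinates the three data stores through injected dependencies."""

    def __init__(self, customers, rules, financials):
        self._customers = customers    # stands in for the DB2 lookup
        self._rules = rules            # stands in for the dynamic-rules database
        self._financials = financials  # stands in for the financial-history database

    def submit(self, entry):
        errors = []
        # Step 1: static rule, the customer ID must exist.
        if not self._customers.exists(entry["customer_id"]):
            errors.append("unknown customer")
        # Step 2: dynamic rules loaded from the application database.
        for rule in self._rules.rules_for(entry["customer_id"]):
            message = rule(entry)
            if message:
                errors.append(message)
        # Step 3: persist only when everything passed.
        if not errors:
            self._financials.save(entry)
        return errors


# Hand-written stubs, as a unit test might use them.
class StubCustomers:
    def exists(self, customer_id):
        return customer_id == "C1"


def max_ratio_rule(entry):
    # Hypothetical dynamic rule: one value may not exceed 50% of another.
    if entry["marketing"] > 0.5 * entry["revenue"]:
        return "marketing exceeds 50% of revenue"
    return None


class StubRules:
    def rules_for(self, customer_id):
        return [max_ratio_rule]


class StubFinancials:
    def __init__(self):
        self.saved = []

    def save(self, entry):
        self.saved.append(entry)


store = StubFinancials()
service = FinancialUploadService(StubCustomers(), StubRules(), store)
ok_errors = service.submit({"customer_id": "C1", "revenue": 100, "marketing": 20})
bad_errors = service.submit({"customer_id": "X", "revenue": 100, "marketing": 80})
```

The service itself never knows which database sits behind each dependency, which is what keeps it mockable.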
As part of the analysis, I know that I have three distinct data stores to work with and about a half-dozen or so entities (at this time) that I am working with. In generic terms, I presume that I would have three DataContexts in my application, one per data store, with the entities exposed by the appropriate data context.
I could then create a separate I{repository|unit of work|service} for each entity that would be consumed by my business logic, with a concrete implementation that knows which data context to use. But this seems to be a risky proposition: as the number of entities increases, so does the number of individual repository|UoW|service types.
Then, take the case of my validation logic which works with multiple entities and, thereby, multiple data contexts. I'm not sure this is the most efficient way to do this.
The other requirement that I have yet to mention is on the reporting side where I will need to execute some complex queries on the data stores. As of right now, these queries will be limited to a single data store at a time, but the possibility is there that I might need to have the ability to mash data together from multiple sources.
Finally, I am considering the idea of pulling out all of the data access stuff for the first two (shared) databases into their own project and have been looking at WCF Data Services as a possible approach. This would give me the basis for a consistent approach for any application making use of this data.
How does this change your thinking?
In your case I would recommend returning IEnumerables for your data queries from the repo. I usually aggregate calls from multiple repos through a service class that represents the domain problem and encapsulates my business logic. To keep it clean I try to keep my repos focused on the domain problem. I liken my DataContext to a repo, and extract an interface using a T4 template to make life easier for mocking. But there is nothing stopping you from using a traditional repo that encapsulates your calls. Doing it this way will allow you to switch ORMs at any stage.
EDIT: IQueryable IS NOT THE ANSWER! :-)
I have also done a lot of work in this area, and INITIALLY came to the same conclusion; however, it is NOT a good solution. The point of the Repo is to abstract queries into discrete chunks of work. Exposing IQueryable is too ad hoc and raises some issues later down the line. You lose your ability to scale. You lose your ability to optimize queries (let's say I want to move to a highly optimized stored proc). You lose your ability to use IoC for the repo to switch out data access layers (switch the project from SQL to Mongo). You lose your ability to provide effective data caching in the Repo (which is a major strength of the Repo pattern). I would recommend taking a CLOSE look at WHY we have a Repo pattern. It isn't simply an "ORM" mapping layer. What made this really clear to me was the CQRS pattern.
Further to this, allowing the ad hoc nature of IQueryable opens you up to ill-fitting reuse of queries. It is GENERALLY not a good idea to reuse queries, since from query to query you see slight deviations, which ends up with two byproducts: queries become too broad and inefficient, and queries become riddled with unmaintainable IF THEN statements to cater for the deviations.
IQueryable is easy, but opens you up to an unmaintainable mess.
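To show what "discrete chunks of work" buys you, here is a small Python sketch of a repository that exposes one named query and caches its result. It is an illustration only: `SalesRepository`, `totals_for_region` and the injected `fetch_totals` callable are hypothetical, and the callable stands in for whatever actually runs the query (LINQ, a stored proc, another store).

```python
class SalesRepository:
    """Exposes a named query instead of an open-ended queryable."""

    def __init__(self, fetch_totals):
        self._fetch_totals = fetch_totals  # swappable data access strategy
        self._cache = {}

    def totals_for_region(self, region):
        # Caching is possible because the repository owns the whole query.
        if region not in self._cache:
            self._cache[region] = self._fetch_totals(region)
        return self._cache[region]


calls = []


def fake_fetch(region):
    """Stand-in for the real data access; records how often it is hit."""
    calls.append(region)
    return {"region": region, "total": 1000}


repo = SalesRepository(fake_fetch)
first = repo.totals_for_region("north")
second = repo.totals_for_region("north")
```

With a queryable handed out to callers, the repository would not know which filters end up being applied, so it could neither cache the result nor swap in an optimized implementation behind the same method.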
Look at this SO answer. I think it shows a simplified model of what you want. IQueryable<T> is indeed our new Repository :-). DataContext and ObjectContext are our Unit of Work.
UPDATE 2:
Here is a blog post that describes the model you might be looking for.
UPDATE 3
It would be wise to hide the shared databases behind a service. This will solve several problems:
This will make the database private to the service, which makes it much easier to change the implementation when needed.
You can put the needed validation logic (for database 1) in that service and can create tests for that validation logic in that project.
Clients accessing that service can assume correctness of the service, and its validation logic.
The result of this is that your application will send data to the service to validate it, call the service to fetch data, query its own private database (database 3), and join the data of the three data sources together locally. I've never been a fan of using cross-database or even cross-server (in your situation) database calls and letting the database join everything together. Transactions will be promoted to distributed transactions and it's hard to predict how much data the servers will exchange.
When you abstract the shared databases behind the service, things get easier (at least from your application's point of view). Your application calls services it trusts which limits the amount of code in that application and the amount of tests. You still want to mock the calls to such a service, but that would be pretty easy. It should also solve the problem of validating over multiple data sources.
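Joining the data locally, rather than with cross-database calls, can be as simple as the following Python sketch. The record shapes and the `join_locally` helper are assumptions for illustration; in practice the customer list would come from the service in front of the shared databases and the other rows from the application's own database.

```python
def join_locally(customers, financials):
    """Join rows fetched from two separate sources on customer id, in memory."""
    by_id = {c["id"]: c for c in customers}
    return [
        {**row, "customer_name": by_id[row["customer_id"]]["name"]}
        for row in financials
        if row["customer_id"] in by_id  # drop rows without a matching customer
    ]


# Hypothetical results: one list from the shared-data service, one from the local DB.
customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
financials = [
    {"customer_id": 1, "period": "2010-Q1", "revenue": 500},
    {"customer_id": 3, "period": "2010-Q1", "revenue": 900},
]

joined = join_locally(customers, financials)
```

This keeps each transaction inside a single database, at the cost of pulling the rows into the application first.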
Validation is always a hard part. I'm very familiar with Validation Application Block, and love it for its flexibility. It isn't an easy framework, however, but you might take a peek at what you can do with it. For instance, I've written several articles about integration with O/RM tools and how to 'embed' a context (context as in DataContext/Unit of Work) in Validation Application Block.
Please have a look at my IRepository pattern implementation using EF 4.0.
My solution has the following features:
supports connections to multiple dbs
One repository per entity
Support for execution of queries
Unit of work pattern implementation
Support for validating entities using VAB guidance
Common operations are kept at base class level. High use of OOP techniques for code re-usability and ease of maintenance.

Resources