How to overcome data mismatch across several databases - algorithm

In my system I have more than one project, and each project connects to its own individual DB. When an insert transaction occurs in any project, the record is inserted into all of the DBs, but when an update occurs in any project, the update is applied only to that project's DB and does not affect the other projects' DBs. That is how my system works. After this process continues for a while, the data becomes different in each DB. Without changing this process, what can I do to overcome this data mismatch problem?
Suppose the following transaction activity on system-1:
Transaction --> Update --> Modification occurs only in the system-1 DB, not in the system-2 or system-3 DBs
Any type of suggestion is acceptable; if you have any questions, please ask. Thanks in advance.

I'm currently working on almost the same project architecture. Our solution is to create an Orchestration module that manages a Single_entry_point module. The latter is responsible for unifying the information from the upstream (a cluster of different databases and service systems) and then uploading/distributing it to the downstream (a single data warehouse). By doing so you can guarantee that all of your information is current at every moment. The Orchestrator communicates via service messages when dealing with all other modules.
This design is based on the Pipes and Filters pattern.
I think that in your case you can simply add logic for the update path and reuse everything you have at this point, if you spend some time on such a Single_entry_point module so that it deals not only with inserts but with transaction updates too.
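A minimal sketch of that single-entry-point idea, assuming plain JDBC; the list of connection URLs, the records table and its columns are placeholders, not taken from the question. The point is only that an update is fanned out to every project database instead of stopping at the local one.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    // Hypothetical "single entry point": every update is applied to all
    // project databases instead of only the local one.
    public class SingleEntryPoint {

        private final List<String> jdbcUrls; // one JDBC URL per project DB

        public SingleEntryPoint(List<String> jdbcUrls) {
            this.jdbcUrls = jdbcUrls;
        }

        /** Apply the same update to every project database. */
        public void updateEverywhere(long recordId, String newValue) throws Exception {
            for (String url : jdbcUrls) {
                try (Connection con = DriverManager.getConnection(url);
                     PreparedStatement ps = con.prepareStatement(
                             "UPDATE records SET value = ? WHERE id = ?")) {
                    ps.setString(1, newValue);
                    ps.setLong(2, recordId);
                    ps.executeUpdate();
                }
            }
        }
    }

In a real system you would want this fan-out to be queued or retried, so that one unreachable database does not silently make the data diverge again.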
When it comes to "eyeballing" database validation (done by SQL scripting), you should definitely consider using Informatica, more specifically for validating data as it is being moved into production systems. The data in your production systems has to be right in order to support your business decision making. The Informatica Data Validation Option provides the ETL testing automation and management capabilities to ensure that your production systems are not compromised by the data update process.
If you find that this option doesn't suit your needs, here are some resources I found on this topic:
database-synchronization-an-overview-of-approaches
MSDN Synchronizing Databases
how-to-synchronize-databases-in-different-servers-in-sql-server-2008
sql-comparison-sdk-synchronizing-databases

Related

Microservice architecture - is database shared across all instances of the service?

I understand that microservice architecture suggests that each service should have its own private database. But when such a service is scaled, is it one DB per service instance or one DB shared by all service instances?
Your first statement may be misleading to some: "each service should have its own private database."
Your architecture should be careful about sharing a single set of tables across multiple services-- that sharing frequently leads to a shared schema dependency, which creates a tight coupling that makes it difficult to update the schema without updating many of the services that share that schema at the same time.
However, sharing a single database instance (or database cluster) doesn't mean your services are accessing the same tables or even the same schema within the database. And if they aren't accessing the same tables, they aren't coupled. (Relying on the same database instance isn't coupling any more than relying on the same network. Don't confuse coupling with shared infrastructure.)
Frequently, multiple instances of the same service share the same database. In my opinion, there is nothing inherently wrong with this, but there are some things to be aware of. If you go this route, you need to be very careful when making changes to the data schema. Because multiple versions of that service may be accessing the data at the same time during updates, any schema change needs to be compatible with at least any two adjacent versions. If you add a column or table, that's fine: the older version won't attempt to use it, so there will be no problem. (Note, too, that the older version won't populate it either.) Removing a column or table is another problem entirely, and to make that kind of breaking change you will likely need to do it in several smaller steps to ensure that the older version of the service isn't broken. It can be done, it's just tougher.
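To make "several smaller steps" concrete, here is a hypothetical expand/contract sequence sketched with plain JDBC; the connection URL, table and column names are invented for the example, and only the final step actually runs any SQL.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Illustrative staged removal of a column while two service versions
    // share the database. Names are made up for the example.
    public class DropColumnInSteps {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:postgresql://db/orders");
                 Statement st = con.createStatement()) {
                // Release N:   application code stops reading legacy_status,
                //              but the column stays so version N-1 keeps working.
                // Release N+1: application code stops writing legacy_status.
                // Release N+2: only now is it safe to physically drop the column:
                st.execute("ALTER TABLE orders DROP COLUMN legacy_status");
            }
        }
    }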
A general rule of microservice development is that each microservice should manage its own data. In an ideal world, the data managed by each service would be completely independent. There would be no need to propagate data changes made in one service to other services.
In the real world, however, complete data independence is impossible. There will always be overlaps between the data used in different services. Consequently, as an architect, you need to think carefully about sharing data and managing data consistency. You need to think about the microservices as an interacting system rather than as individual units.
This means:
You should isolate data within each system service, with as little data sharing as possible.
If data sharing is unavoidable, you should design microservices so that most sharing is read-only, with a minimal number of services responsible for data updates.
If services are replicated in your system, you must include a mechanism that can keep the database copies used by replica services consistent.
Good question indeed. I would answer it like this: "at least one database per microservice (not per instance)".
A concern is the scalability of the database itself, i.e. can service instances outscale the database?
If so, you could opt for e.g. an in-memory database or a sidecar database for your microservice. The database would be ephemeral and you would need to populate it after the pod/container (re)starts, so the state does not really live in the database.
Apache Kafka is a tool that fits this spot, as it allows you to populate the database after the service comes up and also provides the tooling to synchronize state for all currently running and future instances. Successfully implementing event sourcing with Kafka is not a trivial task, but you could come to the conclusion that you don't need databases at all.
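A rough sketch of that rebuild step using the plain Kafka consumer API; the topic name, string key/value types and "last write wins" logic are all assumptions made for the example.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import java.util.UUID;

    // On (re)start, replay an event topic from the beginning to rebuild the
    // instance-local, ephemeral state instead of reading it from a shared DB.
    public class StateRebuilder {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Unique group id per instance so every replica gets the full stream.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "replica-" + UUID.randomUUID());
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            Map<String, String> state = new HashMap<>(); // the "database" lives in memory

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("entity-events")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        state.put(record.key(), record.value()); // last write wins per key
                    }
                }
            }
        }
    }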
So the question remains, can service instances really outscale the database?
The answer would be "no" more often than not.
So having a database instance per microservice (physically or logically) already gives you a lot in terms of loose coupling and cohesive behaviour, as you don't share databases.
Another concern is breaking changes to the database between versions of the microservice. If things go wrong, you could find yourself unable to roll back. An ephemeral database could sync itself up in a compatible way.
Some say they change database technologies throughout the lifetime of a microservice. I never had the necessity to do so, but an in-memory/sidecar approach would fit here very well.
I presume you share one database with all instances of one microservice, so that an update is available to every instance of the same microservice immediately. You may use one database instance per microservice instance to avoid the database being a single point of failure, but you would have to keep every database in sync, which seems like unnecessary overhead for the database and the application. I assume the database is able to keep a group of DB instances in sync (every insert, update and delete is properly propagated).

INFA SFDC Connector refresh staging table

I have a problem in INFA where the metadata definitions of the source/target tables get out of sync with the actual definitions in the underlying database. I'm working with the SFDC connector in INFA, and when the SFDC object is changed, my integration fails. Is it possible to script an update of my source/target table metadata before processing?
No, this is not possible, I'm afraid. It would also require refreshing the mappings and sessions. By design, Informatica source/target definitions are not tied to any particular implementation. This could cause a lot of issues if you used one mapping in multiple sessions to access many different servers: if one of them changed, it would impact every other one.
I guess you could have an additional session that performs a simple read operation to make sure the structure has not been modified before actually starting the main ETL part.
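The question is about the SFDC connector, but for a relational source that pre-check could be as simple as a few lines of JDBC metadata inspection; the connection details, schema, table and expected column names below are all placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Compare the columns a mapping expects with what the source actually exposes,
    // and fail fast before the main ETL run if they have drifted apart.
    public class StructurePreCheck {
        public static void main(String[] args) throws Exception {
            Set<String> expected = new HashSet<>(Arrays.asList("ID", "NAME", "STATUS")); // hypothetical
            Set<String> actual = new HashSet<>();

            try (Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//staging-host:1521/STG", "etl_user", "secret")) {
                try (ResultSet cols = con.getMetaData().getColumns(null, "STG", "ACCOUNT_STAGE", null)) {
                    while (cols.next()) {
                        actual.add(cols.getString("COLUMN_NAME").toUpperCase());
                    }
                }
            }

            if (!actual.containsAll(expected)) {
                expected.removeAll(actual);
                throw new IllegalStateException("Source structure changed, missing columns: " + expected);
            }
        }
    }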
Well, to be honest, another solution would be to have a tool that checks the source and target definitions, generates the whole mapping/session/workflow, performs the import into the repository with conflict resolution, and runs the workflow afterwards. This is possible, but... very complex.

Multiple programs updating the same database

I have a website developed with ASP.NET MVC, Entity Framework Code First and SQL Server.
The website has entities that each have a history of statuses that we defined (NEW, PACKED, SHIPPED etc.)
The DB contains a table in which a completely separate system inserts parcel tracking data.
I have to read this tracking data and, following certain business rules, add to the existing status history of my entities.
The best way I can think of is to write an independent Windows service to poll the tracking data every so often and update my entity statuses from that. However, that makes me concerned about DB concurrency issues.
Please could someone advise me on the best strategy for this scenario?
Many thanks
There are different ways to do it; it also depends on the response time you need. If you need to update your system as soon as the tracking system updates the record, then a trigger is the preferred way. The alternative is to schedule a job that runs every 15-30 minutes and syncs the two systems.
As for the concurrency issue, you can use a concurrency token field. Entity Framework has support for this.
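The question's stack is Entity Framework rather than Java, but the same optimistic-concurrency idea can be sketched in JPA/Hibernate terms: a version column acts as the concurrency token, so a conflicting save fails instead of silently overwriting the other writer. The entity and field names here are invented.

    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;
    import javax.persistence.Version;

    // Illustrative entity: the @Version field acts as the concurrency token.
    // If the website and the polling service load the same row and both try to
    // save it, the second UPDATE sees a stale version and throws an
    // OptimisticLockException instead of overwriting the other change.
    @Entity
    public class ParcelStatusEntry {

        @Id
        @GeneratedValue
        private Long id;

        private String status;   // e.g. NEW, PACKED, SHIPPED

        @Version
        private long version;    // incremented automatically on every update

        // getters/setters omitted for brevity
    }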

Performance problems with external data dependencies

I have an application that talks to several internal and external sources using SOAP, REST services or plain database stored procedures. Obviously, performance and stability are major issues that I am dealing with. Even when the endpoints are performing at their best, for large sets of data I easily see calls that take tens of seconds.
So I am trying to improve the performance of my application by prefetching the data and storing it locally, so that at least the read operations are fast.
While my application is the major consumer and producer of the data, some of the data can also change from outside my application, which I have no control over. If I used caching, I would never know when to invalidate the cache when such data changes from outside my application.
So I think my only option is to have a job scheduler running that consistently updates the database. I could prioritize the users based on how often they log in and use the application.
I am talking about 50 thousand users and at least 10 endpoints that are terribly slow and can sometimes take a minute for a single call. Would something like Quartz give me the scale I need? And how would I get around the scheduler becoming a single point of failure?
I am just looking for something that doesn't require high maintenance and speeds up at least some of the less complicated subsystems, if not most of them. Any suggestions?
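For reference, the scheduler part of the question could start out as a single Quartz job. This is only a wiring sketch; the job names, the 30-minute interval and the idea of doing the endpoint calls inside execute() are assumptions, not something the answer below prescribes.

    import org.quartz.Job;
    import org.quartz.JobBuilder;
    import org.quartz.JobDetail;
    import org.quartz.JobExecutionContext;
    import org.quartz.Scheduler;
    import org.quartz.SimpleScheduleBuilder;
    import org.quartz.Trigger;
    import org.quartz.TriggerBuilder;
    import org.quartz.impl.StdSchedulerFactory;

    // Periodically prefetch data from the slow endpoints into the local store.
    public class PrefetchJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // Call the slow SOAP/REST endpoints here and write the results
            // into the local database; user prioritisation logic would also live here.
        }

        public static void main(String[] args) throws Exception {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
            scheduler.start();

            JobDetail job = JobBuilder.newJob(PrefetchJob.class)
                    .withIdentity("prefetch-endpoints")
                    .build();

            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("prefetch-every-30-min")
                    .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                            .withIntervalInMinutes(30)
                            .repeatForever())
                    .build();

            scheduler.scheduleJob(job, trigger);
        }
    }

Quartz can also be run with a clustered JDBC job store, which is one common way to keep the scheduler itself from being a single point of failure.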
This does sound like you might need a data warehouse. You would update the data warehouse from the various sources, on whatever schedule was necessary. However, all the read-only transactions would come from the data warehouse, and would not require immediate calls to the various external sources.
This assumes you don't need realtime access to the most up to date data. Even if you needed data accurate to within the past hour from a particular source, that only means you would need to update from that source every hour.
You haven't said what platforms you're using. If you were using SQL Server 2005 or later, I would recommend SQL Server Integration Services (SSIS) for updating the data warehouse. It's made for just this sort of thing.
Of course, depending on your platform choices, there may be alternatives that are more appropriate.
Here are some resources on SSIS and data warehouses. I know you've stated you will not be using Microsoft products. I include these links as a point of reference: these are the products I was talking about above.
SSIS Overview
Typical Uses of Integration Services
SSIS Documentation Portal
Best Practices for Data Warehousing with SQL Server 2008

How do you manage schema upgrades to a production database?

This seems to be an overlooked area that could really use some insight. What are your best practices for:
making an upgrade procedure
backing out in case of errors
syncing code and database changes
testing prior to deployment
mechanics of modifying the table
etc...
Liquibase
liquibase.org:
it understands hibernate definitions.
it generates better schema update sql than hibernate
it logs which upgrades have been made to a database
it handles two-step changes (i.e. delete a column "foo" and then rename a different column to "foo")
it handles the concept of conditional upgrades
the developer actually listens to the community (with hibernate if you are not in the "in" crowd or a newbie -- you are basically ignored.)
http://www.liquibase.org
opinion
The application should never handle a schema update. This is a disaster waiting to happen. Data outlasts the applications, and as soon as multiple applications try to work with the same data (the production app plus a reporting app, for example), chances are they will both use the same underlying company libraries... and then both programs decide to do their own DB upgrade... have fun with that mess.
I am a big fan of Red Gate products that help creating SQL packages to update database schemas. The database scripts can be added to source control to help with versioning and rollback.
In general my rule is: "The application should manage its own schema."
This means schema upgrade scripts are part of any upgrade package for the application and run automatically when the application starts. In case of errors, the application fails to start and the upgrade script transaction is not committed. The downside is that the application has to have full modification access to the schema (this annoys DBAs).
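A bare-bones sketch of that start-up hook, hand-rolled rather than tied to any particular migration library; the schema_version table, the script ordering and the SQL dialect are assumptions.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.List;

    // On application start-up: apply every upgrade script that has not yet been
    // recorded in schema_version, inside a transaction, and refuse to start if
    // anything fails.
    public class StartupSchemaUpgrader {

        public void upgrade(Connection con, List<Path> orderedScripts) throws Exception {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS schema_version (script VARCHAR(255) PRIMARY KEY)");
                for (Path script : orderedScripts) {
                    if (alreadyApplied(con, script.getFileName().toString())) {
                        continue;
                    }
                    st.execute(Files.readString(script));
                    record(con, script.getFileName().toString());
                }
                con.commit();                 // all or nothing
            } catch (Exception e) {
                con.rollback();               // upgrade transaction is not committed...
                throw e;                      // ...and the application fails to start
            }
        }

        private boolean alreadyApplied(Connection con, String name) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT 1 FROM schema_version WHERE script = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next();
                }
            }
        }

        private void record(Connection con, String name) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO schema_version (script) VALUES (?)")) {
                ps.setString(1, name);
                ps.executeUpdate();
            }
        }
    }

Note that whether DDL really rolls back with the transaction depends on the database: PostgreSQL supports transactional DDL, while MySQL and Oracle mostly auto-commit DDL statements, so there the "not committed" guarantee only covers data changes.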
I've had great success using Hibernate's SchemaUpdate feature to manage the table structures, leaving the upgrade scripts to handle only actual data initialization and the occasional removal of columns (SchemaUpdate doesn't do that).
Regarding testing, since the upgrades are part of the application, testing them becomes part of the test cycle for the application.
Afterthought: taking on board some of the criticism in other posts here, note the rule says "its own". It only really applies where the application owns the schema, as is generally the case with software sold as a product. If your software is sharing a database with other software, use other methods.
That's a great question. (There is a high chance this is going to end up as a normalised versus denormalised database debate, which I am not going to start... okay, now for some input.)
Some off-the-top-of-my-head things I have done (will add more when I have some more time or need a break):
Client design - this is where the VB method of inline SQL (even with prepared statements) gets you into trouble. You can spend AGES just finding those statements. If you use something like Hibernate and put as much SQL as possible into named queries, you have a single place for most of the SQL (nothing is worse than trying to test SQL that sits inside some IF statement when your testing just doesn't hit the "trigger" criteria for that IF statement). Before using Hibernate (or other ORMs), when I did SQL directly in JDBC or ODBC, I would put all the SQL statements either as public fields of an object (with a naming convention) or in a property file (also with a naming convention for the values, say PREP_STMT_xxxx), and use reflection or iterate over the values at startup in a) test cases and b) the startup of the application. Some RDBMSs let you pre-compile prepared statements before execution, so on startup, post login, I would pre-compile the prepared statements to make the application self-testing. Even for hundreds of statements on a good RDBMS that's only a few seconds, and only once. And it has saved my butt a lot. On one project the DBAs wouldn't communicate (a different team, in a different country) and the schema seemed to change NIGHTLY, for no reason. And each morning we got a list of exactly where it broke the application, on startup.
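A compact version of that startup self-test might look like the sketch below; the Queries class, the PREP_STMT_ naming convention and the assumption that the driver validates SQL at prepare time are all illustrative.

    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Hypothetical catalogue of all SQL used by the client, as public constants.
    class Queries {
        public static final String PREP_STMT_FIND_ORDER =
                "SELECT id, status FROM orders WHERE id = ?";
        public static final String PREP_STMT_UPDATE_STATUS =
                "UPDATE orders SET status = ? WHERE id = ?";
    }

    // At startup (or in a test case), prepare every statement once. If the DBAs
    // changed the schema overnight, this reports exactly which statement broke.
    public class SqlSelfTest {
        public static void check(Connection con) throws Exception {
            for (Field field : Queries.class.getFields()) {
                if (Modifier.isStatic(field.getModifiers()) && field.getType() == String.class
                        && field.getName().startsWith("PREP_STMT_")) {
                    String sql = (String) field.get(null);
                    try (PreparedStatement ps = con.prepareStatement(sql)) {
                        // On drivers/databases that parse at prepare time, preparing is
                        // enough to validate the SQL against the current schema without
                        // actually executing it.
                    }
                }
            }
        }
    }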
If you need ad hoc functionality, put it in a well-named class (again, a naming convention helps with automated testing) that acts as some sort of factory for your query (i.e. it builds the query). You are going to have to write the equivalent code anyway, so put it in a place where you can test it. You can even write some basic test methods on the same object or in a separate class.
If you can, also try to use stored procedures. They are a bit harder to test, as above. Some DBs also don't pre-validate the SQL in stored procs against the schema at compile time, only at run time. Testing usually involves taking a copy of the schema structure (no data) and then creating all the stored procs against this copy (in case the DB team making the changes didn't validate correctly); that way the structure can be checked. But as a point of change management, stored procs are great: on a change, everyone gets it, especially when the DB changes are a result of business process changes. And all languages (Java, VB, etc.) get the change.
I usually also set up a table I call system_setting or similar. In this table we keep a VERSION identifier, so that client libraries can connect and validate whether they are valid for this version of the schema. Depending on the changes to your schema, you don't want to allow clients to connect if they could corrupt your schema (i.e. you don't have a lot of referential rules in the DB, but in the client). It depends on whether you are also going to have multiple client versions (which does happen in non-web apps, i.e. they are running the wrong binary). You could also have batch tools, etc. Another approach I have used is to define a set of schema-to-operation versions in some sort of property file or, again, in a system_info table. This table is loaded on login and then used by each "manager" (I usually have some sort of client-side API to do most DB stuff) to validate, for that operation, whether it is the right version. Thus most operations can succeed, but you can also fail (throw some exception) on out-of-date methods, and it tells you WHY.
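In code, the login-time check against such a system_setting table is only a few lines; the table layout, setting name and expected version number are just examples of the convention described above.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Refuse to work against a schema version the client was not built for.
    public class SchemaVersionGuard {

        private static final int EXPECTED_SCHEMA_VERSION = 42; // baked into this client build

        public static void verify(Connection con) throws Exception {
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT int_value FROM system_setting WHERE name = 'SCHEMA_VERSION'")) {
                if (!rs.next() || rs.getInt(1) != EXPECTED_SCHEMA_VERSION) {
                    throw new IllegalStateException(
                            "Client built for schema version " + EXPECTED_SCHEMA_VERSION
                            + " but the database reports a different version; refusing to connect.");
                }
            }
        }
    }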
Managing the change to the schema: do you update the table or add 1-1 relationships to new tables? I have seen a lot of shops which always access data via a view for this reason. This allows table names, columns, etc. to change. I have played with the idea of actually treating views like interfaces in COM, i.e. you add a new VIEW for new functionality/versions. Often, what gets you here is that you can have a lot of reports (especially end-user custom reports) that assume table formats. The views allow you to deploy a new table format but support existing client apps (remember all those pesky ad hoc reports).
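A tiny illustration of that "views as interfaces" idea, with the DDL embedded in JDBC only to stay in one language; the table, view and column names are invented, and an in-memory H2 database is assumed purely for the demo.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // The physical table evolves; each client/report generation keeps its own view.
    public class VersionedViews {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = con.createStatement()) {
                st.execute("CREATE TABLE customer_tbl (id INT, full_name VARCHAR(100), region_code VARCHAR(10))");
                // v1 view: the shape the 1.0 reports were written against.
                st.execute("CREATE VIEW customer_v1 AS SELECT id, full_name AS name FROM customer_tbl");
                // v2 view: exposes the new column without breaking v1 consumers.
                st.execute("CREATE VIEW customer_v2 AS SELECT id, full_name, region_code FROM customer_tbl");
            }
        }
    }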
Also, you need to write update and rollback scripts. And again: TEST, TEST, TEST...
------------ OKAY - THIS IS A BIT RANDOM DISCUSSION TIME --------------
I actually worked on a large commercial project (i.e. a software shop) where we had the same problem. The architecture was 2-tier and they were using a product a bit like PHP, but pre-PHP. Same thing, different name. Anyway, I came in at version 2...
It was costing A LOT OF MONEY to do upgrades. A lot. As in giving away weeks of free consulting time on site.
And it was getting to the point of wanting to either add new features or optimize the code. Some of the existing code used stored procedures, so we had common points where we could manage code, but other areas were this embedded SQL markup in HTML, which was great for getting to market quickly, but with each iteration of new features the cost to test and maintain at least doubled. So when we were looking at pulling out the PHP-type code, putting in data layers (this was 2001-2002, pre any ORMs, etc.) and adding a lot of new features (customer feedback), we looked at this issue of how to engineer UPGRADES into the system. Which is a big deal, as upgrades cost a lot of money to do correctly. Now, most patterns and all the other stuff people discuss with a degree of energy deal with OO code that is running, but what about the fact that a) your data has to integrate with this logic, and b) the meaning and also the structure of the data can change over time? And often, due to the way data works, you end up with a lot of sub-processes/applications in your client's organisation that need that data: ad hoc reporting or any complex custom reporting, as well as batch jobs built for custom data feeds, etc.
With this in mind I started playing with something a bit left of field. It also has a few assumptions: a) data is read far more than it is written, and b) updates do happen, but not at bank levels, i.e. one or two a second, say.
The idea was to apply a COM/interface-style view to how data was accessed by clients over a set of CONCRETE tables (which varied with schema changes). You could create a separate view for each type of operation: update, delete, insert and read. This is important. The views would either map directly to a table, or allow you to trigger off a dummy table that does the real updates or inserts, etc. What I actually wanted was some sort of trappable level of indirection that could still be used by Crystal Reports, etc. NOTE: for inserts, updates and deletes you could also use stored procs. And you had a version for each version of the product. That way your version 1.0 had its version of the schema, and if the tables changed, you would still have the version 1.0 VIEWS, but with NEW backend logic to map to the new tables as needed, while you also had version 2.0 views that would support new fields, etc. This was really just to support ad hoc reporting, which, if you're a BUSINESS person and not a coder, is probably the whole point of why you have the product. (Your product can be crap, but if you have the best reporting in the world you can still win; the reverse is also true: your product can be the best feature-wise, but if it's the worst at reporting you can very easily lose.)
okay, hope some of those ideas help.
These are all weighty topics, but here is my recommendation for updating.
You did not specify your platform, but for NAnt build environments I use Tarantino. For every database update you are ready to commit, you make a change script (using Red Gate or another tool). When you build to production, Tarantino checks whether the script has been run on the database (it adds a table to your database to keep track); if not, the script is run. It takes all the manual work (read: human error) out of managing database versions.
I've heard good things about iBATIS 3 Schema Migrations System:
User Guide: http://svn.apache.org/repos/asf/ibatis/java/ibatis-3/trunk/doc/en/iBATIS-3-Migrations.pdf
As Pat said, use Liquibase. Especially when you have several developers with their own dev databases making changes that will become part of the production database.
If there's only one dev, as on one project I'm on now (ha), I just commit the schema changes as SQL text files into a CVS repo, which I check out in batches on the production server when the code changes go in.
But liquibase is better organized than that!
