Syncing data between services using Kafka JDBC Connector - microservices

I have a system with a microservice architecture. It has two services, Service A and Service B, each with its own database, as in the following diagram.
As far as I understand, having a separate database for each service is the better approach. In this design each service is the owner of its data and is responsible for creating, updating, deleting and enforcing constraints.
In order to have Service A's data in Database B I was thinking of using the Kafka JDBC Connector, but I am not sure whether Table1 and Table2 in Database B should enforce the constraints from Database A.
If a constraint, like the foreign key from Table2 to Table1, should exist in Database B, is there a way to have the connector know about this?
What are other common or better ways to sync data or solve this problem?
The easiest solution seems to be syncing per table without any constraints in Database B. That would make things easier, but it could also lead to a situation where Service A's data in Database B is inconsistent, for example entries in Table2 that point to a non-existing entry in Table1.

If a constraint, like the foreign key from Table2 to Table1, should exist in Database B, is there a way to have the connector know about this?
No, unfortunately the Kafka JDBC Connector does not know about constraints.
Based on your question I assume that Table1 and Table2 in Database B are duplicates of tables that exist in Database A, and that Database A has constraints which you are not sure you should also add in Database B.
If that is the case, then I am not sure that using the Kafka JDBC Connector to sync data is the best choice.
You have a couple of options:
1. Enforce constraints like foreign keys in Database B, but update it from your application level and not through the Kafka JDBC Connector. For this option you cannot use the Kafka JDBC Connector; you would need to write a small service/worker that reads the data from the Kafka topic and populates your database tables (see the sketch after this list). This way you control what is saved to the database and you can validate the constraints even before trying to save. But the question here is: do you really need the constraints? They are important in micro-service-A, but do you really need them in micro-service-B, which only holds a copy of the data?
2. Do not use constraints and allow temporary inconsistency. This is common in the microservices world. When working with distributed systems you always have to think about the CAP theorem. You accept that some data might at some point be inconsistent, but you have to make sure that you eventually bring it back to a consistent state. This means you need to develop, at the application level, some cleanup/healing mechanism which recognizes such data and corrects it. So database constraints do not necessarily have to be enforced on data which the micro-service does not own and which is external to that micro-service's domain.
3. Rethink your design. Usually we duplicate data from micro-service-A into micro-service-B in order to avoid coupling between the services, so that micro-service-B can live and operate even when micro-service-A is down or not running for some reason. We also do it to avoid a call from micro-service-B to micro-service-A for every operation which needs data from Table1 and Table2. Table1 and Table2 are owned by micro-service-A, and micro-service-A is the only source of truth for this data. Micro-service-B uses a duplicate of that data for its operations.
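For option 1, here is a minimal sketch of what the worker could do when it receives a Table2 record from the topic. It assumes copies named table1_copy and table2_copy in Database B and PostgreSQL-style SQL; all names are invented and the :parameters are placeholders the worker binds:
-- insert the incoming row only if the referenced Table1 row already exists in Database B
INSERT INTO table2_copy (id, table1_id, payload)
SELECT :id, :table1_id, :payload
WHERE EXISTS (SELECT 1 FROM table1_copy WHERE id = :table1_id);
-- if no row was inserted, the worker can park the message and retry it later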
Looking at your database design, the following questions might help you figure out what would be the best option for your system:
Is it necessary to duplicate the data in micro-service-B?
If I duplicate the data, do I need both tables, and do I need all of their columns/data in micro-service-B? Usually you store/duplicate only the subset of the entity/table that you need.
Do I need the same table structure in micro-service-B as in micro-service-A? You have to decide this based on your domain, but very often you denormalize the tables and change them to fit the needs of micro-service-B's operations (see the sketch below). As usual, all these design decisions depend on your application domain and use case.
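To illustrate that last point, the duplicated data in micro-service-B often ends up as one narrow, denormalized table rather than one-to-one copies of Table1 and Table2. A sketch with invented names:
-- a denormalized read model in Database B holding only the fields micro-service-B needs
CREATE TABLE customer_summary (
    customer_id   BIGINT PRIMARY KEY,   -- business key owned by micro-service-A
    customer_name VARCHAR(200) NOT NULL,
    last_order_at TIMESTAMP             -- flattened from what used to be Table2
);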

Related

Hive Managed vs External tables maintainability

Which one is better (performance-wise and operationally in the long run) for maintaining loaded data: managed or external?
And by maintaining, I mean that these tables will frequently have the following operations on a daily basis:
Select using partitions most of the time, but for some queries partitions are not used.
Delete specific records, not a whole partition (for example, a problem is found in some columns and the rows need to be deleted and inserted again). I am not sure if this is supported for normal tables unless they are transactional.
Most important, the need to merge files frequently, maybe twice a day, to combine small files and get fewer mappers. I know CONCATENATE is available on managed tables and INSERT OVERWRITE on external ones; which one costs less?
It depends on your use case. External tables are recommended when the data is shared across multiple applications, for example when Pig or other tools process the data alongside Hive; they are mainly used when you are mostly reading the data.
With managed tables, on the other hand, Hive has complete control over the data. Note that you can convert an external table to managed and vice versa:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');   -- managed -> external
alter table table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');  -- external -> managed
In your case you are doing frequent modifications to the data, so it is better for Hive to have total control over it. In this scenario it is recommended to use managed tables.
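Regarding the frequent file merges mentioned in the question: on a managed table stored as ORC or RCFile you can compact the small files of a partition with CONCATENATE. A hypothetical example (table and partition names are invented):
-- merge the small files of one partition of a managed ORC table
ALTER TABLE sales PARTITION (load_date='2020-01-01') CONCATENATE;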
Apart from that, managed tables are more secure than external tables, because external tables can be accessed by anyone who can reach the underlying files. With managed tables you can implement Hive-level security, which gives better control, whereas with external tables you have to implement HDFS-level security.
You can refer to the link below, which gives a few more pointers to consider:
External Vs Managed tables comparison

Datavault: How to get hashes for foreign key relationships (populating link tables)

I've read the Data Vault book end to end, but I'm still trying to resolve one specific thing related to how you'd populate the link tables (how to get all the hashes for that). The Scalefree blog post on massively parallel processing demonstrates that satellites and hubs can be loaded in fully parallel fashion, but it doesn't go into a lot of detail about the link tables.
Links require hash keys, and thus in some way the 'business keys' from multiple tables, to establish the relationships; that's what they do, they record relations between hubs. There aren't very good or in-depth explanations of how you would retrieve the business keys of the related entities when populating these link tables.
For a specific table like 'customer' things are easy for the hub and satellite: just convert the business key to a hash and load both of them in parallel.
But a customer details table or a transaction table from an OLTP system needs some kind of join in order to look up the business key for the customer, or to look up all the related entities in the transaction (product, customer, store, etc.), because those tables do not typically store (all) business key(s) as attributes.
If I assume that staging is loaded incrementally and truncated, then staging doesn't necessarily have all the entities loaded to be able to perform joins there. How do I resolve this dilemma and create a design that works?
Join on tables in the source OLTP systems to generate the business keys there and propagate them as hashes from there? (This ends up wrong if the business key was chosen incorrectly.)
Use a persistent staging area, so never truncate? (Then it's always possible to join on any table in there to resolve the keys.)
Use some kind of index from surrogate keys to business keys and perform a lookup from there? (This minimizes I/O a bit further and is a mix between incremental staging and persistent staging.)
Some other method?
Essentially, what is the best practice for generating the hashes for all foreign key relations of your OLTP systems?
I talked to an expert about this and this is the answer I accepted from him:
The only two sensible ways to produce hashes for tables that do not have all the columns necessary to produce a business key for that table are:
In the case where you have a full load of all the tables that carry the business keys (even if the load is incremental for the link table), join to the relevant source tables carrying the business keys in staging. This is fine, because you can guarantee that you have all the data in staging at that moment.
In the case where you have incremental loads for the tables carrying the business keys, you must use a persistent staging area (PSA) to do this for you.
It is considered bad practice to join tables in source system queries in order to generate the business keys, because the data warehouse should have as little operational impact on the source systems as possible.
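To make the PSA case concrete, here is a rough sketch of a link load where the staged sale rows only carry a surrogate customer_id, so the customer business key is resolved from the PSA before hashing. All schema, table and column names are invented, and it assumes a SQL dialect with MD5 and CONCAT_WS (e.g. Hive or MySQL):
-- resolve the customer business key from the PSA, then hash it into the link table
INSERT INTO rawvault.link_customer_sale (link_hash, hub_customer_hash, hub_sale_hash, load_dts, record_source)
SELECT
    MD5(CONCAT_WS('||', c.customer_number, s.sale_number)),  -- link hash over both business keys
    MD5(c.customer_number),                                   -- hub hash for customer
    MD5(s.sale_number),                                       -- hub hash for sale
    CURRENT_TIMESTAMP,
    'oltp_sales'
FROM staging.sale s
JOIN psa.customer c ON c.customer_id = s.customer_id;         -- surrogate-key join resolved in the PSA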

GraphQL as an abstraction for a data modelling tool

I'm trying to think out loud here to understand if GraphQL is a likely candidate for my need.
We have a home-grown, self-service report creation tool. It is web-based and starts with the user selecting a particular report type.
The report type in itself is a base SQL query. In subsequent screens, one can select the required columns, filters, etc. The output of all these steps is a SQL query, which is then run on an Oracle database.
As you can see, there are a lot of cons with this tool. It is tightly coupled with the Oracle OLTP tables, and there are hundreds of tables.
Given the current data model and the presence of so many tables, I'm wondering if GraphQL would be the right approach to design a UI that could act like a "data explorer". If I could combine some of the closely related tables and abstract them via GraphQL into logical groups, I'm wondering if I could create a report out of them.
**Logical Group 1**
Table1
Table2
Table3
Table4
Table5
**Logical Group 2**
Table6
Table7
Table8
Table9
Table10
and so on..
Let's say I want 2 columns from tables in Logical Group 1 and 4 columns from Logical Group 2: is this something that could be defined as a GraphQL object and retrieved to be either rendered on a screen or written to a file?
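(For context, what the current tool would generate for such a selection is roughly a query like the following; table and column names are invented.)
-- 2 columns from a Logical Group 1 table, 4 columns from a Logical Group 2 table
SELECT t1.col_a, t1.col_b, t6.col_c, t6.col_d, t6.col_e, t6.col_f
FROM   table1 t1
JOIN   table6 t6 ON t6.table1_id = t1.id;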
I think I'm trying to build a data modelling UI via GraphQL. Is this even a good candidate for such a need?
We have also been evaluating Looker as a possible data modelling layer. However, it seems like there could be some
Thanks.
Without understanding your data better, it is hard to say for certain, but at first glance, this does not seem like a problem that is well suited to GraphQL.
GraphQL's strength is its ability to model and traverse a graph of data. It sounds to me like you are not so much traversing a continuous graph of data as cherry-picking tables from a DB. It is certainly possible, but there may be a good deal of friction, since this was not its intended design.
The litmus test I would use is the following two questions:
Can you imagine your problem mapping well to a REST API?
Does your API get consumed by performance-sensitive clients?
If so, then GraphQL may serve your needs well; if not, you may want to look at something like https://grpc.io/

How to implement an ETL Process

I would like to implement synchronization between a source SQL database and a target triple store.
However, for the sake of simplicity, let's say there are simply two databases. I wonder what approaches to use to have every change in the source database replicated in the target database. More specifically, I would like that, each time some row changes in the source database, this can be seen by a process that will read the changes and populate the target database accordingly, while applying some transformation in the middle.
I have seen suggestions around the notification mechanisms that can be available in the database, or building tables such that changes can be tracked (meaning doing it manually) and having a process poll them at different intervals, or the usage of logs (change data capture, etc.).
I'm seriously puzzled by all of this. I wonder if anyone could give some guidance and explanation about the different approaches with respect to my objective, meaning: the names of the methods and where to look.
My organization mostly uses Postgres and Oracle databases.
I have to take relational data and transform it into RDF so as to store it in a triplestore, and keep that triplestore constantly synchronized with the data in the SQL store.
Many thanks.
PS:
A clarification of the difference between ETL and replication techniques such as Change Data Capture, with respect to my overall objective, would be appreciated.
Again, I need to make sense of the subject and know what the methods are, so I can start digging further myself. So far I have understood that CDC is the way to go nowadays.
Assuming you can't use replication and you need to use some kind of ETL process to actually extract, transform and load all changes into the destination database, you could use insert, update and delete triggers to fill a (manually created) audit table. Give it columns such as GeneratedId, TableName, RowId, Action (insert, update, delete) and a boolean that records whether your ETL process has already handled the change. Use that table to get all the changed rows in your database and transport them to the destination database, then delete the processed rows from the audit table so that it doesn't grow too big. How often you have to run the ETL process depends on the amount of changes occurring in the source database.
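A minimal sketch of that audit-table idea in PostgreSQL (you mentioned Postgres and Oracle); the table, column and trigger names are illustrative and it assumes a source table called customer with an id column:
-- manually created audit table filled by the trigger below
CREATE TABLE change_audit (
    generated_id BIGSERIAL PRIMARY KEY,
    table_name   TEXT    NOT NULL,
    row_id       BIGINT  NOT NULL,
    action       TEXT    NOT NULL,              -- 'INSERT', 'UPDATE' or 'DELETE'
    processed    BOOLEAN NOT NULL DEFAULT FALSE -- set to TRUE (or delete the row) once the ETL has handled it
);

CREATE OR REPLACE FUNCTION audit_customer() RETURNS trigger AS $$
BEGIN
    INSERT INTO change_audit (table_name, row_id, action)
    VALUES ('customer', COALESCE(NEW.id, OLD.id), TG_OP);  -- OLD is used for deletes
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_audit
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW EXECUTE FUNCTION audit_customer();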

Disadvantages of consolidating databases?

In an organization that has two applications each with its own Oracle database instance, what are the disadvantages of consolidating the two databases into one database with two schemas?
Backups and database replication will probably be bigger and slower. What else?
Some background:
The two databases are the "gold source" for their respective data. Each is critical to the operation of the organization and each is actually used by several applications, tools, and reports (but each database is principally "owned" by one application). The need to join data across the databases, to relate entities in one to entities in the other, comes up frequently. For this reason there are DB links connecting the two and some cross-database materialized views to help with performance. There is an effort underway to reduce data duplication, and these materialized views are under discussion. Some in the organization want to phase out DB links and materialized views and introduce more web services to make the data available across applications. My concern is that there are too many situations requiring complex joins of data across the two databases, so services that expose the data won't perform well. Another approach to reducing DB links and materialized views is to consolidate the schemas into one database, but I want to make sure I'm not forgetting any critical disadvantages of that approach.
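To give an idea, the kind of cross-database join I am worried about looks roughly like this today (schema, table and DB link names are invented):
-- today: a join across the two databases through a DB link
SELECT o.order_id, c.customer_name
FROM   app_a_owner.orders o
JOIN   customers@db_b_link c ON c.customer_id = o.customer_id;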
In a single consolidated database, you will lose some flexibility from a DBA point of view:
A database obviously can have only one version (10.2.0.5, for example), which means that upgrades and patches will affect all schemas; this may be a bad thing if multiple vendor applications have mismatched requirements.
Similarly, some administrative tasks (such as restoring database A to a point in time t) may be more complicated with a single database.
Overall, you will have fewer administration tasks (a single backup, single patching...), but each task will be more critical since it will have a global effect.
On the development side, beware of namespace collisions: some features are global across a single database, for example:
directories,
public synonyms,
DB links,
schemas.
This means that you will have some work to do if you want to consolidate two databases that each have a public synonym with the same name pointing to two different things.
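For instance, if both original databases defined a public synonym called customers, only one of the two statements below can succeed in the consolidated database (schema names are invented):
CREATE PUBLIC SYNONYM customers FOR schema_a.customers;
CREATE PUBLIC SYNONYM customers FOR schema_b.customers;  -- fails: ORA-00955, name is already used by an existing object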
It could also have something to do with licence costs: scaling up vs. scaling out.
The biggest concern I would have is that all your code will need to be rewritten, or at least reviewed, to account for the new database and schemas. This could introduce new bugs. I don't know how Oracle handles references to different databases, so I'll use an example of what I mean using SQL Server syntax. If I were joining two tables on the same server in different databases, my select would be something like this:
SELECT a.field1, b.field2
FROM database1.dbo.table1 a
JOIN database2.dbo.table2 b
ON a.myid = b.myFK
To go with your new consolidated design, you would instead write:
SELECT a.field1, b.field2
FROM schema1.table1 a
JOIN schema2.table2 b
ON a.myid = b.myFK
You will need to be especially careful with any tables that have the same name in both databases; this could cause some sneaky bugs.
Note these are not difficult changes, but all SQL hitting your database will have to be examined to see if it still works, and adjusted if not.
I'm not sure if just putting them in the same database would do it either. You might need to consolidate some tables to avoid duplication across applications. (In that case, add fields to reference the old id numbers for things people are used to looking up by id, like a person_id that may appear on old paperwork, so they can still be researched.) This is a fairly major rewrite, with all the attendant possibilities of making things worse through new bugs.
If you go down this path, I highly recommend that you read a book on refactoring databases before you decide on a design.
It's hard to tell from just the information provided. Big in the database world would be 100 GB or more, so two such databases would be 200 GB. If both databases are not bigger than 100 GB, then size should not be a huge factor in the decision: replication and sync can be done on changes only, and backups should not be very different (again, this depends on specifics such as when backups are done, whether downtime is possible, or whether backups run during non-peak times).
Other than that, other factors are:
Naming collisions in database objects such as keys, foreign key names, table names, etc.; some renaming of tables and stored procedure names may be needed too.
