Data model migration on Kafka connectors

I'm using Debezium and the JDBCSinkConnector to copy data from multiple databases into another DB. I would like to be able to upgrade some of the data models from time to time, and not for all DBs at once: first upgrade the sink DB, and only some time later the source DBs. Let's say I have a version column in the tables, or an environment variable, that I can use to reconfigure the connectors. I've considered writing a set of SMTs, one for upgrading from each version, and running them conditionally based on the source version using predicates (a sketch of the kind of configuration I have in mind is below). But I'm not sure this is good practice, or whether it will work at all, and I haven't been able to find another solution for this.
What is the best way to implement the required migrations (such as added/removed columns, value manipulations, etc.) "on the fly" via Kafka connectors?
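To make this concrete, here is a rough sketch of the kind of sink connector configuration I have in mind (standalone .properties form). Only the transforms/predicates wiring is a standard Kafka Connect feature (predicates exist since Kafka 2.6); the UpgradeV1ToV2Transform SMT and the SourceVersionIsPredicate class are hypothetical and would have to be written by me:

    # Hypothetical JDBC sink connector config; names and classes below are placeholders
    name=jdbc-sink-customers
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    topics=customers
    connection.url=jdbc:postgresql://sink-db:5432/target

    # Custom SMT (to be written) that reshapes a v1 record into the v2 model:
    # add/drop columns, remap values, etc.
    transforms=upgradeV1ToV2
    transforms.upgradeV1ToV2.type=com.example.smt.UpgradeV1ToV2Transform

    # Custom predicate (to be written) that checks the record's version marker
    predicates=isV1
    predicates.isV1.type=com.example.smt.SourceVersionIsPredicate
    predicates.isV1.version=1

    # Standard Kafka Connect wiring: apply the SMT only when the predicate matches
    transforms.upgradeV1ToV2.predicate=isV1

One SMT/predicate pair per source version could then be chained in the transforms list and dropped once all the source DBs have been upgraded.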

Related

How do I integrate Liquibase within an existing CI/CD pipeline in large organization?

We are working in a very big organization: many databases (of many types), many schemas, many users.
Does LB have to work with some source control (for locking the files when many users in the organization are working against the same DB, same schema, etc.)?
What is the best practice for working with LB in a very big organization with many concurrent users?
Can SQLcl generate the SQL format type, or just the XML format type?
Is there some integration with SQL Developer? I mean, suppose a user changes an object via SQL Developer, what happens then?
We get this type of question all the time; after folks get a handle on how to automate DB changes, the next step is typically to add it into an existing CI/CD workflow.
Yes, Liquibase works with any source control. Most users are using Git, but you can use Git, TFS, SVN, CVS... Once you are up and running with Liquibase, you just need to make sure that your scripts are in source control and you are good to go.
Besides 3rd party source control tools, Liquibase has a tracking table called DATABASECHANGELOG that keeps track of the changes applied to your database when using Liquibase deployments.
Here is some more information about getting started and How Liquibase Works. https://www.liquibase.org/get_started/how-lb-works.html
Liquibase has one more table that it uses internally, called DATABASECHANGELOGLOCK.
This table was designed to prevent multiple Liquibase users from running deployments concurrently, which could potentially leave the database in a bad state. Once the Liquibase deployment (the liquibase update command) is done, the DATABASECHANGELOGLOCK table will allow the next Liquibase user to deploy.
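For illustration, assuming the default table names and the columns recent Liquibase versions create, you can inspect both tables with plain SQL:

    -- What has already been deployed to this database, in execution order
    SELECT id, author, filename, dateexecuted, exectype, md5sum
    FROM databasechangelog
    ORDER BY orderexecuted;

    -- Who (if anyone) currently holds the deployment lock
    SELECT id, locked, lockgranted, lockedby
    FROM databasechangeloglock;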
You can use both SQL and XML formats (or even JSON and YAML formats).
When using SQL, you have a few options:
The best option is to use formatted SQL changeLogs (a minimal example is sketched after these links) https://www.liquibase.org/documentation/sql_format.html
https://www.liquibase.org/get_started/quickstart_sql.html
You can use plain raw SQL files referenced from an XML changeLog
https://www.liquibase.org/documentation/changes/sql_file.html
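A minimal formatted SQL changeLog might look like the sketch below (author, ids and table are placeholders); the comment markers are what Liquibase parses:

    --liquibase formatted sql

    --changeset jane.doe:create-customer-table
    CREATE TABLE customer (
        id   INT          NOT NULL PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );
    --rollback DROP TABLE customer;

    --changeset jane.doe:add-customer-email
    ALTER TABLE customer ADD email VARCHAR(255);
    --rollback ALTER TABLE customer DROP COLUMN email;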
When using XML, you can find all the available change types (used inside changeSets) on the following page (listed on the left side of the page)
https://www.liquibase.org/documentation/changes/
XML changeLogs are more database-agnostic and can sometimes be reused across different database platforms when doing migrations. Also, many of the change types in XML can be rolled back automatically. The reason this is possible with XML is that Liquibase uses its own built-in functions to figure out the inverse statements, e.g. the inverse of "create table" is "drop table".
For each of those change types you can find out whether it is eligible for auto rollback (at the bottom of the page). For example, the createTable changeSet will show Auto Rollback = yes.
https://www.liquibase.org/documentation/changes/create_table.html
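As a sketch, here is a minimal XML changeLog with a createTable changeSet (ids, authors and names are placeholders) that Liquibase can roll back automatically, since it knows the inverse is a drop table:

    <databaseChangeLog
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
            http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.8.xsd">

        <!-- Auto Rollback = yes: Liquibase derives the DROP TABLE itself -->
        <changeSet id="create-customer-table" author="jane.doe">
            <createTable tableName="customer">
                <column name="id" type="int">
                    <constraints primaryKey="true" nullable="false"/>
                </column>
                <column name="name" type="varchar(255)"/>
            </createTable>
        </changeSet>
    </databaseChangeLog>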

How to design Greenplum database structure

I am working on designing the structure of a Greenplum database.
We have many clients that we need to store data for.
There are two ways to design the database structure: we build one database with a different schema in it for each client, or we build a different database for each client. Which way is better?
What is more, we need to migrate databases or schemas from the dev environment to the production environment.
Thanks
William
William,
Either way will work. If you are keeping multi-tenants in Greenplum and there is no data sharing, you might be better off keeping them in separate databases - easier for security and for backups. If there is a requirement that they share some common data, then using multiple schemas in one database is the better option.
I am not sure what version of Greenplum you are on, but you should be able to back up a schema from dev and restore it with gprestore, using the --redirect option to put it in the database you want it to be in.
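Roughly along these lines (database and schema names are placeholders, and flag spellings can differ between gpbackup/gprestore versions, so check --help on your install):

    # On the dev cluster: back up just the one schema
    gpbackup --dbname dev_db --include-schema client_a

    # On the target cluster: restore that backup into a different database
    # (--redirect-db is the current gprestore spelling of the redirect option)
    gprestore --timestamp 20240101120000 --redirect-db prod_db --create-db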
Jim McCann
Pivotal

ETL tool migration: Best Practices in Parallel Run

I am new to ETL migration. I have worked with Talend, but have not yet faced the task of migrating a large ETL project from one tool to another (IBM Data Manager to Informatica PowerCenter or Informatica Developer).
I am looking for general guidelines for migrating jobs from one tool to another, and of course for my specific case.
I will be more clear:
The database sources and targets will stay the same; what I have to migrate is the ETL part itself.
The approach will be the parallel run, as suggested in this blog:
Parallel Run
In my case I do not have to migrate the whole DWH, only the ETL, as the old software will become a legacy one and the new one is from another vendor (luckily both of them can export XML).
I am looking for a practical approach to the parallel run; I have been advised to copy the source and target tables into the original database schema, but that does not look to me like the best way to go (and is not even practical when a schema has many tables).
The DWH I am working on has several DB instances in Oracle and some in SQL Server, a test server and a production one, and for each of them a staging, storage and data mart area.
Based on this related question and its answer, I am thinking of copying each schema as I go, for each project.
Staging in ETL: Best Practices
I am looking for guideline references, but my specific case is the migration from IBM Data Manager to Informatica PowerCenter.
The approach depends on various criteria and personal preferences. Either way you will need to duplicate parts or all of the source and destination systems. At one extreme you can use two instances of the entire system. If you have complex upstream processes that are part of the test, or you have massive numbers of tables and processes, and you have the bandwidth and resources to duplicate your system, then this approach may be optimal.
At the other extreme, if any complex processes occur within the ETL tool itself, or you are simply loading tables and need to check that they are loaded correctly, then making copies of the tables and pointing your new or old tool at the table copies may be the way to go. This method is very simple and easy to set up.
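As a rough illustration of that second approach (schema and table names are made up), you copy the target tables, point the new tool at the copies, and then diff the two loads; on Oracle a simple MINUS in both directions will do (EXCEPT on SQL Server):

    -- One-off, structure-only copy of a target table for the new tool to load into
    CREATE TABLE dm_parallel.fact_sales AS
    SELECT * FROM dm.fact_sales WHERE 1 = 0;

    -- After both tools have run: rows produced by the old tool but not the new one...
    SELECT * FROM dm.fact_sales
    MINUS
    SELECT * FROM dm_parallel.fact_sales;

    -- ...and rows produced by the new tool but not the old one
    SELECT * FROM dm_parallel.fact_sales
    MINUS
    SELECT * FROM dm.fact_sales;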
Keep in mind this forum is not meant to replace blogs and in-depth tech articles on those techniques.

Using Liquibase to version table definitions, not change sets

I'd like to version only the latest table definition in my repository (no change sets), and have Liquibase figure out which changes are needed when patching my databases. Please take note that I have a very big database schema (1000+ tables) installed at hundreds of customer sites, each with a different version, and I really don't know which objects each version has.
How can I make a liquibase-based installer for my application, given my set of table definitions, and hundreds of databases with about 12 different versions of objects on each one?
To be more specific, I'd like liquibase to compare my table definitions with the production database, and emit the alter table statements required to make the database current with my latest version.
I could contribute code if necessary in order to get this done.
Liquibase and tools like it (for example Flyway) are primarily designed to support database migrations. A migration is where every change to the DB is tracked so that it can be replayed on target environments, thereby keeping them in sync with development (although time-shifted). It's all about keeping your schema under revision control.
Your use case is a little different. If I understand correctly you're trying to retrofit Liquibase onto a series of environments that you are not 100% certain match your application's current schema?
I would only recommend migration tools like liquibase if you intend to use them going forwards. If all you want is a DB diff tool, I would suggest you look elsewhere.
To perform an initial sync, I would suggest you investigate the diffChangeLog command, coupled with the changelogSync command to initialize Liquibase on the target DB.
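A rough sketch of that initial sync (URLs, credentials and file names are placeholders, and the exact flag spelling varies a bit between Liquibase versions):

    # 1. Generate a changeLog describing how the customer DB differs from the reference schema
    liquibase --changeLogFile=upgrade-to-current.xml \
              --url="jdbc:oracle:thin:@customer-host:1521/PROD" --username=app --password=secret \
              --referenceUrl="jdbc:oracle:thin:@build-host:1521/REFERENCE" \
              --referenceUsername=app --referencePassword=secret \
              diffChangeLog

    # 2. Review/edit the generated changeSets, then apply them to the customer DB
    liquibase --changeLogFile=upgrade-to-current.xml \
              --url="jdbc:oracle:thin:@customer-host:1521/PROD" --username=app --password=secret \
              update

    # Or, if a DB already matches the current definition, mark the changeSets as
    # executed without running them
    liquibase --changeLogFile=upgrade-to-current.xml \
              --url="jdbc:oracle:thin:@customer-host:1521/PROD" --username=app --password=secret \
              changelogSync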
comparing databases and generating sql scripts using liquibase

Database Crawling in GSA

I can see there are two ways to index database records in GSA.
Content Sources > Databases
Using DB connector
As per my understanding, Content Sources > Databases does not support automatic recrawl. We have to manually sync after any changes occur in DB records. Is that correct?
Also, would using DB connectors help with automatic recrawl?
I would like to check the DB every 15 minutes for changes and update the index accordingly. Please suggest a viable approach to achieve this.
Thanks in advance.
You are correct that Content Sources > Databases does not support any sort of automated recrawl.
Using either the 3.x Connector or the 4.x Adaptor for Databases supports automatic recrawls. If you are looking to index only the rows of databases, and are not using it to feed a list of URLs to index, then I would go with the 4.x Database Adaptor as it is newer.
The Content Sources > Databases approach is good for data that doesn't change often where a manual sync is acceptable. That said though, it's easy enough to write a simple client that logs in to the admin console and hits the 'Sync' link periodically.
However, if you want frequent updates like every 15 minutes, I'd definitely go with the 4.x plexi-based adaptor, not because it's newer but because it's better. Older versions of the 3.x connector were a bit flaky (although the most recent versions are much better).
What flavour DB are you looking to index?
