I have specified a list of tables to be replicated in Debezium, using the "table.include.list" configuration.
However, when new tables that have not been selected for replication are created in the source DB, they are in fact being replicated.
How can I change this behaviour so that Debezium only replicates the tables specified?
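For reference, a minimal sketch of the relevant part of the connector configuration (the connector class, database name and table names below are placeholders for illustration, not my real setup):

connector.class=io.debezium.connector.postgresql.PostgresConnector
database.dbname=appdb
# entries in table.include.list are regular expressions matched against
# fully-qualified table identifiers (schema.table); only matching tables
# should be captured
table.include.list=public.customers,public.orders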
I have multiple Fivetran destination tables in Snowflake. Those tables were created by Fivetran itself, and Fivetran currently writes data into them. Now I would like to stop syncing data into one of the tables and start writing to that table from a different source. Would I experience any trouble with this? Should I do anything else to make it possible?
What you describe is not possible because of how Fivetran works. Connector sources write to one destination schema, and one only. Swapping destination tables between connectors is not a feature as of now.
We are trying to migrate Oracle tables to Hive and process them.
Currently the tables in Oracle have primary key, foreign key and unique key constraints.
Can we replicate the same in Hive?
We are doing some analysis on how to implement it.
Hive indexing was introduced in Hive 0.7.0 (HIVE-417) and removed in Hive 3.0 (HIVE-18448); please read the comments in that Jira. The feature was effectively useless in Hive: these indexes were too expensive for big data. RIP.
As of Hive 2.1.0 (HIVE-13290), Hive includes support for non-validated primary and foreign key constraints. These constraints are not validated; an upstream system needs to ensure data integrity before it is loaded into Hive. They are useful for tools that generate ER diagrams and queries, and such non-validated constraints are also useful as self-documentation: you can easily find out what is supposed to be the PK if the table has such a constraint.
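For example, a minimal sketch of such declarations (table and column names here are made up), assuming Hive 2.1.0 or later:

-- metadata-only constraints: Hive stores them but does not enforce them
CREATE TABLE customers (
  id   BIGINT,
  name STRING,
  PRIMARY KEY (id) DISABLE NOVALIDATE
);

CREATE TABLE orders (
  id          BIGINT,
  customer_id BIGINT,
  PRIMARY KEY (id) DISABLE NOVALIDATE,
  CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
    REFERENCES customers (id) DISABLE NOVALIDATE
);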
In an Oracle database, unique, PK and FK constraints are backed by indexes, so they work fast and are really useful. But that is not how Hive works, nor what it was designed for.
A quite normal scenario is that you have loaded a very big file with semi-structured data into HDFS. Building an index on it is too expensive, and without an index the only way to check for a PK violation is to scan all the data. And normally you cannot enforce constraints in big data. An upstream process can take care of data integrity and consistency, but this does not guarantee that you will never end up with a PK violation in Hive in some big table loaded from different sources.
Some file storage formats like ORC have internal lightweight "indexes" to speed up filtering and enable predicate push-down (PPD), but no PK or FK constraints are implemented using such indexes. This cannot be done because you can normally have many such files belonging to the same table in Hive, and the files can even have different schemas. Hive was created for petabytes, and you can process petabytes in a single run; data can be semi-structured and files can have different schemas. Hadoop does not support random writes, and this adds more complication and cost if you want to rebuild indexes.
I have a Kafka server that works fine for syncing a table between servers. My DB is PostgreSQL and I'm using the JDBC sink/source connectors.
Now my question is: how can I read data from two tables on the source side and insert the data into four different tables on the sink side?
Example:
Source tables: Users, Roles
Sink tables: Workers, Managers, Employers, ...
On the parent server all users are available in the Users table and have a relation to the Roles table. On the other side I want to insert the data into a specific table according to each user's role.
For the JDBC sink you need one topic per target table. Thus you need four topics, one per target table, populated with the joined data. This join needs to happen at some point in the pipeline. The options would be:
As part of the JDBC source, using the query option of the connector. Build four connectors, each with the query needed to populate its target topic, with the join done on the Postgres side in SQL (see the sketch after this list).
As a streaming application, e.g. in Kafka Streams or KSQL. The JDBC source would pull in the source Users and Roles tables and you'd perform the join as each record flowed through.
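To illustrate the first option, here is a sketch of the kind of statement that could go in one connector's query setting, one connector per target topic (the join column role_id and the role name 'Worker' are assumptions about your schema):

-- hypothetical query for the connector that feeds the Workers topic
SELECT u.*
FROM users u
JOIN roles r ON r.id = u.role_id
WHERE r.name = 'Worker'

Note that in incrementing or timestamp mode the connector appends its own WHERE clause to the query, so a query like this may need to be wrapped in a subquery; with mode=bulk it can be used as is.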
I want an overall better way to copy all the tables and their data from a production database schema into a dev database schema on a different database in a different subnet, using bash for the unloads and loads.
It is important that the schema name on the dev database can be, and is, different.
The table structure for both schemas is the same; only the database name, schema name and data change.
It is important that the solution requires minimal manual manipulation. Copying files across manually is acceptable, but editing file contents to change data is not, unless this can be scripted to happen automatically.
Currently we run a very long series of scripted exports, one per table, to IXF with LOB files, followed by a very long series of carefully placed scripted loads, being careful to load the data in order, parents before children.
Unload example:
export to CLIENT.ixf of ixf lobs to $LOCATION lobfile CLIENT_lobs modified by lobsinfile select * from CLIENT;
Load example:
load from CLIENT.ixf of ixf lobs from $LOCATION modified by lobsinfile replace into CLIENT statistics no copy no indexing mode autoselect allow no access check pending cascade deferred;
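Since the dev schema name is different, I assume the load target can be qualified explicitly with the dev schema (DEVSCHEMA below is just a placeholder), e.g.:

load from CLIENT.ixf of ixf lobs from $LOCATION modified by lobsinfile replace into DEVSCHEMA.CLIENT statistics no copy no indexing mode autoselect allow no access check pending cascade deferred;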
I have looked at db2move, but I cannot find how to specify the database and schema name during the load, as this appears to be supported only for the unload/export.
db2look looks promising, but does this export the data too or just the table names?