Airflow <--> Greenplum - greenplum

Is it possible to establish a connection from Airflow to Greenplum? Keeping in mind that Greenplum is based on PostgreSQL, would it be possible to establish a connection to the Greenplum master server?

Andrea,
I think you can use Airflow to run ETLs on your analytic data within Greenplum.
The "no" answer that Jon provided was apparently in regard to using Greenplum as your backend metadata store, used internally by Airflow for keeping track of its DAGs and tasks. The code that Jon used as an example is how Airflow creates tables it uses for its backend metadata store, which has nothing to do with the contents of your Greenplum data warehouse you want to manage.
I suspect you are instead interested in Greenplum for your high-volume analytic data, not for the Airflow backend. So the answer is almost certainly yes!
You might even get by using the standard PostgreSQL hook and operator.
I say this since it appears that Greenplum can use the standard PostgreSQL Python API:
https://gpdb.docs.pivotal.io/4330/admin_guide/managing/access_db.html
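As a rough sketch of what that could look like: Airflow can read connections from environment variables named AIRFLOW_CONN_<CONN_ID> in URI form, so the Greenplum master can be registered like an ordinary Postgres connection. The host, database, credentials, and the connection ID greenplum_default below are placeholders, not anything taken from the question.

```python
import os
from urllib.parse import quote


def greenplum_conn_uri(host, database, user, password, port=5432):
    """Build a Postgres-style connection URI pointing at the Greenplum master."""
    # Percent-encode credentials so characters such as '@' or '/' stay valid.
    return "postgres://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, database
    )


# Hypothetical values; replace with your own master host and credentials.
uri = greenplum_conn_uri("gp-master.example.com", "analytics", "etl_user", "s3cr@t")

# Airflow picks up connections from AIRFLOW_CONN_<CONN_ID> environment variables.
os.environ["AIRFLOW_CONN_GREENPLUM_DEFAULT"] = uri
print(uri)
```

A task could then point the Postgres hook or operator at this connection through its postgres_conn_id parameter. Whether every statement you need behaves the same on Greenplum as on PostgreSQL is something to verify on your own cluster.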
If Airflow's standard PostgreSQL hook and operator do not work for you, it is easy to create your own. See, for example, my work with the Snowflake data warehouse; it was a simple matter to integrate the Snowflake Python connector into Airflow.
https://github.com/aberdave/airflow-snowflake
Failing all that, you should be able to use ODBC or JDBC to manage data in Greenplum via Airflow. I went with Python for my work with Snowflake, since it was so easy to customize their Python connector.

No. A quick look at the Airflow GitHub repo shows that its tables combine a primary key constraint with a unique constraint on a different column, a combination that Greenplum doesn't support.
For example:
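    # Imports as used in Alembic migration scripts
    from alembic import op
    import sqlalchemy as sa

    # Creates the "user" table with a primary key on id
    # and a separate unique constraint on username
    op.create_table(
        'user',
        sa.Column('id', sa.Integer(), nullable=False),
        sa.Column('username', sa.String(length=250), nullable=True),
        sa.Column('email', sa.String(length=500), nullable=True),
        sa.PrimaryKeyConstraint('id'),
        sa.UniqueConstraint('username')
    )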
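You can't have a primary key on (id) and another unique constraint on (username) in Greenplum, because every primary key or unique constraint there must include all of the table's distribution key columns, and no distribution key can be part of both.
Their GitHub repo also doesn't mention other MPP database platforms such as Netezza and Teradata. Maybe Airflow is meant for small-data data science, but that sounds like an oxymoron.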
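To make the constraint issue concrete, here is a small check written under the rule that, on a hash-distributed Greenplum table, each primary key or unique constraint has to contain every distribution key column. The helper name is hypothetical and the function only models that one rule; it does not talk to a database.

```python
def possible_distribution_columns(constraints):
    """Return the columns that could be used in a distribution key.

    Assumes each unique/primary-key constraint must contain every
    distribution key column, so the key has to be drawn from the
    intersection of all constraint column sets.
    Expects at least one constraint.
    """
    column_sets = [set(cols) for cols in constraints]
    return set.intersection(*column_sets)


# Constraints from the migration: PRIMARY KEY (id) and UNIQUE (username).
user_table = [("id",), ("username",)]
print(possible_distribution_columns(user_table))  # empty: no valid key exists

# A single composite constraint would leave candidate columns.
print(possible_distribution_columns([("id", "username")]))
```

An empty result means no hash distribution key can satisfy both constraints, which is the situation described for the user table.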

Related

Data model migration on Kafka connectors

I'm using Debezium and JDBCSinkConnector to copy data from multiple databases into another DB. I would like to be able to upgrade some of the models from time to time, but not in all DBs at once: first the sink DB and, some time later, the source DBs. Assume I have a version column in the tables, or an environment variable, that I can use to reconfigure the connectors. I've considered creating a set of SMTs, one for upgrading from each version, and running them depending on the source version by means of predicates. But I'm not sure whether this is good practice, or whether it would work at all. I haven't been able to find another solution for this.
What is the best way to implement the required migrations (such as added or removed columns, value manipulations, etc.) "on the fly" via Kafka connectors?
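One way the SMT-plus-predicate idea could be written down, as a sketch rather than a recommendation: Kafka Connect's built-in predicates (available from Kafka 2.6) match on topic name, header key, or tombstone records, not on a field value such as a version column. This example therefore assumes each source schema version is published to its own topic under a made-up naming scheme (v1.customers, v2.customers). The connector name, connection URL, and column names are placeholders, and other settings a real sink needs are left out.

```python
import json


def sink_config_with_v1_upgrade():
    """Sink connector config that renames a column only for 'v1' topics."""
    return {
        "name": "jdbc-sink-example",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "connection.url": "jdbc:postgresql://sink.example.com:5432/target",
            "topics.regex": "v[0-9]+\\.customers",
            # Predicate: true only for records that arrive on v1 topics.
            "predicates": "isV1",
            "predicates.isV1.type":
                "org.apache.kafka.connect.transforms.predicates.TopicNameMatches",
            "predicates.isV1.pattern": "v1\\..*",
            # Transform that is applied only when the predicate matches.
            "transforms": "upgradeV1",
            "transforms.upgradeV1.type":
                "org.apache.kafka.connect.transforms.ReplaceField$Value",
            "transforms.upgradeV1.renames": "fullname:full_name",
            "transforms.upgradeV1.predicate": "isV1",
        },
    }


print(json.dumps(sink_config_with_v1_upgrade(), indent=2))
```

Each further source version would need its own predicate and transform chain, so the configuration grows with every migration step; that growth is worth weighing before committing to this approach.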

Take data from Oracle to Cassandra every day

We want to copy tables from Oracle to Cassandra every day, because the tables are updated in Oracle daily. When I searched for this, I found these options:
Extract the Oracle tables to a file, then write the file to Cassandra
Use Sqoop to get the tables from Oracle, write a MapReduce job, and insert into Cassandra
I am not sure which approach is appropriate. Are there any other options?
Thank you.
Option 1
Extracting Oracle tables to a file and then writing it to Cassandra manually every day can be a tiresome process unless you schedule it as a cron job. I have tried this before; if the process fails, logging the failure might be an issue. If you use this process, exporting to CSV and then writing to Cassandra, I would suggest using the Cassandra bulk loader (https://github.com/brianmhess/cassandra-loader).
Option 2
I haven't worked with this, so can't speak about this.
Option 3 (I use this)
I use an open source tool, Pentaho Data Integration (Spoon) (https://community.hitachivantara.com/docs/DOC-1009855-data-integration-kettle), to solve this problem. It's a fairly simple process in Spoon. You can automate it by using a Carte server (the Spoon server), which has logging capabilities as well as automatic restarting if the process fails partway through.
Let me know if you found any other solution that worked for you.
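For Option 1, the export step could be scripted along these lines. This is only a sketch: it accepts any Python DB-API connection, and uses an in-memory SQLite database as a stand-in for Oracle so that it runs anywhere. The table name, query, and output file name are invented.

```python
import csv
import os
import sqlite3
import tempfile


def dump_query_to_csv(conn, query, path, batch_size=1000):
    """Write the result of `query` to a CSV file with a header row.

    Returns the number of data rows written, which is handy for logging.
    """
    cur = conn.cursor()
    cur.execute(query)
    header = [col[0] for col in cur.description]
    count = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while True:
            batch = cur.fetchmany(batch_size)  # avoid loading the whole table
            if not batch:
                break
            writer.writerows(batch)
            count += len(batch)
    return count


# Stand-in source database; with Oracle you would pass an Oracle connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "ann"), (2, "bob")])

out_path = os.path.join(tempfile.gettempdir(), "orders.csv")
rows_written = dump_query_to_csv(conn, "SELECT id, customer FROM orders", out_path)
print(rows_written, "rows written to", out_path)
```

The resulting file can then be handed to a bulk loader, and the script itself run from cron. Logging the returned row count gives at least a basic record of each run.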

How to load oracle table data into kafka topic?

How can I load Oracle table data into a Kafka topic? I did some research and learned that I should use a CDC tool, but all the CDC tools I found are paid. Can anyone suggest how to achieve this?
You'll find this article useful: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
It details all of your options and currently-available tools. In short, you can do bulk (or query-based CDC) with the Kafka Connect JDBC Connector, or you can use a log-based CDC approach with one of several CDC tools that support Oracle as a source, including Attunity, GoldenGate, SQ Data, and IBM's IIDR.
You'll generally find that if you've paid for your database (e.g. Oracle, DB2, etc) you're going to have to pay for a log-based CDC tool. Open source CDC tools are available for open source databases. For example, Debezium is open source and works great with MongoDB, MySQL, and PostgreSQL.
You might be interested in the Debezium project, which provides open-source CDC connectors for a variety of databases. Amongst others, we provide one for Oracle DB. Note that this connector currently is based on the XStream API of Oracle, which itself requires a separate license, but we hope to add a fully free alternative soon.
Disclaimer: I'm the lead of Debezium
Please refer to the Kafka JDBC source connector. The link is below:
https://docs.confluent.io/current/connect/connect-jdbc/docs/index.html
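A minimal configuration sketch for that connector, built here as the JSON payload one would submit to the Kafka Connect REST API. The connection URL, credentials, table, and column names are placeholders; the property names are the ones I know from the connector's documentation, but check them against the version you run.

```python
import json

connector = {
    "name": "oracle-jdbc-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        # Placeholder connection details
        "connection.url": "jdbc:oracle:thin:@//db.example.com:1521/ORCLPDB1",
        "connection.user": "kafka_reader",
        "connection.password": "change-me",
        "table.whitelist": "ORDERS",
        # Detect new and changed rows through an ID and a timestamp column
        "mode": "timestamp+incrementing",
        "timestamp.column.name": "UPDATED_AT",
        "incrementing.column.name": "ID",
        # Records from ORDERS end up in the topic "oracle-ORDERS"
        "topic.prefix": "oracle-",
        "poll.interval.ms": "5000",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

The Oracle JDBC driver has to be available to the Connect worker for this to start; that part is outside the sketch.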
You don't need a Change Data Capture (CDC) tool in order to load data from an Oracle table into a Kafka topic.
You can use Confluent's JDBC Source Connector for Kafka Connect to load the data.
However, if you need to capture deletes and updates you must use a CDC tool for which you need to pay a licence. Confluent has certified the following CDC tools (Source connectors):
Attunity
Dbvisit
Striim
Oracle GoldenGate
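Why query-based loading misses deletes can be seen in a small simulation. This sketch polls a table for rows whose timestamp column is newer than the last value seen, which is roughly the idea behind a timestamp-based JDBC source and not the connector's actual implementation. SQLite stands in for Oracle, and the table and column names are invented.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Return rows modified after `last_seen` and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)", [(1, "ann", 10), (2, "bob", 11)]
)

first, mark = poll_changes(conn, 0)      # both inserted rows are returned

conn.execute("UPDATE customers SET name = 'anne', updated_at = 12 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

second, mark = poll_changes(conn, mark)  # the update shows up, the delete does not
print(first)
print(second)
```

The deleted row simply stops appearing in query results, so nothing is emitted for it. Updates are only visible here because the table maintains an updated_at column; without such a column they would be missed as well.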
As others have mentioned, CDC requires paid products. If you'd just like to try something out, Striim is available for free for the first 30 days.
https://www.striim.com/instant-download/
There are 'free' options, which include JDBC, but you would be introducing a significant load on your database if you actually want to use triggers to capture changes.
Disclaimer: I work at Striim
There's a custom Kafka source connector for Oracle Database, based on LogMiner, here:
https://github.com/erdemcer/kafka-connect-oracle
This project is in development.
You might be interested in OpenLogReplicator. It is an open source GPL-licensed tool written completely in C++. It reads binary format of Oracle Redo logs and sends them to Kafka.
It is very fast - you can achieve low latency without much effort, since it operates fully in memory. It supports all Oracle database versions since 11.2.0.1 and requires no additional licensing.
It can work on the database host, but you can also configure it to read the redo logs using sshfs from another host - with minimal load of the database.
disclaimer: I am the author of this solution

Oracle -> Postgresql Log-Based replication

(I do not code on my own, to make things clear)
I am looking for a solution that would allow me to replicate data between a master Oracle 11g DB and a new PostgreSQL DB. These are two different applications, but they need to exchange data in real time. There are some trigger-based approaches, but there is a serious concern that they could affect the performance of the master DB, which we can't allow.
I have also come across some log-based solutions, like HVR, but the cost is way too high for 500MB of data to be replicated.
Maybe one of you has had a similar issue and found a way to deal with it?
Any tips or help would be really appreciated, as I am quite short on time.
Oracle archive logs have a different format from Postgres write-ahead logs. Despite the general conceptual similarity of Oracle Streams, SQL log shipping, Postgres streaming replication, etc., transaction logs <> redo logs <> xlogs, and you can't use one vendor's logs to roll forward another vendor's engine.
Moreover, you can't even apply logs across different versions of the same vendor's DB, because the binary format differs.
You can get something like logical replication with Postgres Logical Decoding, Oracle GoldenGate, Heterogeneous Database Replication, or AWS DMS. But none of the above gives you "log-based replication" between different DB vendors.
You can use a product that specializes in change data capture based data integration. Striim, GoldenGate, Attunity allow you to do CDC from Oracle. Striim also allows you to do CDC from PostgreSQL and write to Oracle as well.
https://striim.com
https://attunity.com

Is there a way for the Oracle Data Integrator to extract data from MongoDB

I'm trying to move snapshots of data from our MongoDB into our Oracle BI data store.
From the BI team I've been asked to make the data available for ODI, but I haven't been able to find an example of that being done.
Is it possible and what do I need to implement it?
If there is a more generic way of getting MongoDB data into Oracle then I'm happy to propose that as well.
Versions
MongoDB: 2.0.1
ODI: 11.1.1.5
Oracle: 11.2g
Edit:
This is something that will be queried once a day, maybe twice but at this stage the BI report granularity is daily
In ODI, under the Topology tab and Physical Architecture sub-tab, you can see all technologies that are supported out of the box. MongoDB is not one of them. There are also no Knowledge Modules available for importing/exporting from/to MongoDB.
ODI supports implementing your own technologies and your own Knowledge Modules.
This manual will get you started with developing your own Knowledge Module, and in one of the other manuals I'm sure you can find an explanation of how to implement your own technologies. (Ctrl-F for "Data integrator")
If you're lucky, you might find someone else who has already implemented it. Your best places to look would be The Oracle Technology Network Forum, or a forum related to MongoDB.
Instead of creating a direct link, you could also take an easier workaround: export the data from MongoDB to a format that ODI supports and that MongoDB can export to, such as CSV or perhaps XML. Then load the data through ODI into the Oracle database. I think that will be the best option, unless you have to do this frequently.
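The export-then-load workaround might look like this on the MongoDB side. As a sketch, it flattens documents that have already been fetched (plain dicts here, so it runs without a MongoDB driver) into CSV rows that a file-based ODI load could pick up. The field names and the dotted-column naming convention are assumptions.

```python
import csv
import io


def flatten(doc, prefix=""):
    """Flatten nested dicts into a single level using dotted keys."""
    flat = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def documents_to_csv(docs, columns):
    """Render documents as CSV text; missing fields become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for doc in docs:
        writer.writerow(flatten(doc))
    return buf.getvalue()


# Example documents with an optional nested field
docs = [
    {"_id": 1, "name": "ann", "address": {"city": "Oslo"}},
    {"_id": 2, "name": "bob"},
]
csv_text = documents_to_csv(docs, ["_id", "name", "address.city"])
print(csv_text)
```

Fixing the column list up front keeps the file layout stable from one daily snapshot to the next, even when some documents lack a field. MongoDB's own mongoexport tool can also write CSV directly, which may be enough if the documents are flat.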
Look at the blog post below for an option:
https://blogs.oracle.com/dataintegration/entry/odi_mongodb_and_a_java
Cheers
David

Resources