We want to copy tables from Oracle to Cassandra every day, because the tables are updated in Oracle daily. When I searched for this, I found these options:
Extract the Oracle tables to a file, then write them to Cassandra
Use Sqoop to pull the tables from Oracle, write a MapReduce job, and insert into Cassandra
I am not sure which way is appropriate. Are there other options?
Thank you.
Option 1
Extracting Oracle tables to a file and then writing to Cassandra manually every day can be a tiresome process unless you schedule it as a cron job. I have tried this before, but if the process fails, logging it can be an issue. If you go this route and export to CSV before writing to Cassandra, I would suggest using the Cassandra bulk loader (https://github.com/brianmhess/cassandra-loader).
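If you would rather script the CSV load yourself instead of using the bulk loader, a minimal sketch with the DataStax Python driver could look like the following; the contact point, keyspace, table, and columns are placeholders you would need to adapt:

# Sketch: load a CSV export into Cassandra with the DataStax Python driver.
# The contact point, keyspace, table, and column names are placeholders.
import csv
from cassandra.cluster import Cluster

cluster = Cluster(['cassandra-host-1'])          # contact point(s) for your cluster
session = cluster.connect('mykeyspace')          # placeholder keyspace

insert = session.prepare(
    "INSERT INTO accounts (id, name, balance) VALUES (?, ?, ?)"
)

with open('oracle_export.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                                 # skip the header row
    for row in reader:
        session.execute(insert, (int(row[0]), row[1], float(row[2])))

cluster.shutdown()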
Option 2
I haven't worked with this, so I can't speak to it.
Option 3 (I use this)
I use an open source tool, Pentaho Data Integration (Spoon) (https://community.hitachivantara.com/docs/DOC-1009855-data-integration-kettle), to solve this problem. It's a fairly simple process in Spoon. You can automate this process by using a Carte server (Spoon server), which has logging capabilities as well as automatic restarting if the process fails partway through.
Let me know if you find any other solution that works for you.
Related
In my last project we were working on a requirement where a huge amount of data (40 million rows) needed to be read, and for each row we needed to trigger a process. As part of the design we used multithreading, where each thread fetched the data for a given partition using a JDBC cursor with a configurable fetch size. However, when we ran the job in the prod environment, we observed that it was slow because most of the time was spent querying the data from the database.
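For context, the fetch-size approach looked roughly like the sketch below. The actual job used Java and JDBC; this is just an analogous Python/cx_Oracle illustration, with placeholder connection details and query:

# Illustration only: chunked reads with a configurable fetch size.
# The real job used Java/JDBC; connection details and query are placeholders.
import cx_Oracle

conn = cx_Oracle.connect("app_user", "secret", "db-host/ORCLPDB1")
cur = conn.cursor()
cur.arraysize = 10000                     # fetch size: rows pulled per round trip
cur.execute("SELECT id, payload FROM big_table WHERE partition_key = :p", p=42)

while True:
    rows = cur.fetchmany()                # fetches cur.arraysize rows at a time
    if not rows:
        break
    for row in rows:
        pass                              # trigger the per-row process here

cur.close()
conn.close()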
As we had a very tight timeline for completing the job, we came up with a workaround: the data is exported from SQL Developer in CSV format and split into small files, which are then fed to the job. This improved the job's performance significantly and helped us complete it on time.
As mentioned above, we used a manual step to export the data to the file. If we need to automate this step, for instance by executing the export from the Java app, which of the options below (suggested on the web) will be faster?
sqlplus (Java making native call to sqlplus)
sqlcl parallel spool
pl/sql procedure with utl_file and dbms_parallel_execute
The link below gives some details on the above options but does not have stats.
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:9536328100346697722
Please note that I currently don't have access to this Oracle environment, so I could not test from my side. Also, I am an application developer and don't have much expertise on the DB side.
So I am looking for advice from someone who has worked on a similar use case or has relevant expertise.
Thanks in advance.
Is it possible to establish a connection from Airflow to Greenplum? Keeping in mind that Greenplum is based on PostgreSQL, would it be possible to establish a connection to the Greenplum master server?
Andrea,
I think you can use Airflow to run ETLs on your analytic data within Greenplum.
The "no" answer that Jon provided was apparently in regard to using Greenplum as your backend metadata store, used internally by Airflow for keeping track of its DAGs and tasks. The code that Jon used as an example is how Airflow creates tables it uses for its backend metadata store, which has nothing to do with the contents of your Greenplum data warehouse you want to manage.
I suspect you are instead interested in Greenplum for your high-volume analytic data, not for the Airflow backend. So the answer is almost certainly yes!
You might even get by using the standard PostgreSQL hook and operator (see the sketch below).
I say this since it appears that Greenplum can use the standard PostgreSQL Python API:
https://gpdb.docs.pivotal.io/4330/admin_guide/managing/access_db.html
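As a rough, untested sketch of what the operator route could look like (Airflow 1.x import paths; the connection id, schedule, and SQL are placeholders):

# Sketch: pointing Airflow's standard Postgres operator at a Greenplum master.
# Untested against Greenplum; conn id, schedule, and SQL are placeholders, and
# the import paths below are the Airflow 1.x style.
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    dag_id="greenplum_example",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

load_summary = PostgresOperator(
    task_id="load_daily_summary",
    postgres_conn_id="greenplum_master",   # Airflow connection pointing at the GP master
    sql="INSERT INTO analytics.daily_summary SELECT * FROM staging.events_today;",
    dag=dag,
)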
If Airflow's standard PostgreSQL hook & operator do not work for you, it is easy to create your own. See for example my work with the Snowflake data warehouse; it was a simple matter to integrate the Snowflake Python connector into Airflow.
https://github.com/aberdave/airflow-snowflake
Failing all that, you should be able to use ODBC or JDBC to manage data in Greenplum via Airflow. I went with Python for my work with Snowflake, since it was so easy to customize their Python connector.
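And if you end up bypassing the hook entirely, a plain psycopg2 callable (which you could wrap in an Airflow PythonOperator) is another option. This is only a sketch, with placeholder host, credentials, and SQL, and untested against a real Greenplum cluster:

# Sketch: talking to Greenplum through the plain PostgreSQL Python API (psycopg2),
# e.g. from a PythonOperator callable. Host, credentials, and SQL are placeholders.
import psycopg2

def refresh_greenplum():
    """Callable suitable for wrapping in an Airflow PythonOperator."""
    conn = psycopg2.connect(
        host="gp-master.example.com",   # the Greenplum master host (placeholder)
        port=5432,
        dbname="analytics",
        user="gpadmin",
        password="secret",
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM staging.events_today;")
            print(cur.fetchone()[0])
        conn.commit()
    finally:
        conn.close()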
No. A quick look at the Airflow GitHub repo shows that they are using primary key constraints plus an additional column with a unique constraint, which isn't supported in Greenplum.
For example:
op.create_table(
    'user',
    sa.Column('id', sa.Integer(), nullable=False),
    sa.Column('username', sa.String(length=250), nullable=True),
    sa.Column('email', sa.String(length=500), nullable=True),
    sa.PrimaryKeyConstraint('id'),
    sa.UniqueConstraint('username')
)
You can't have a primary key on (id) and another unique constraint on (username) in Greenplum.
Their GitHub repo also doesn't have any mention of other MPP database platforms like Netezza and Teradata. Maybe Airflow is meant for small data and data science, but that sounds like an oxymoron.
I'm trying to load data from an Oracle table into a Cassandra table using Pentaho Data Integration 5.1 (Community Edition), but I can't tell whether a connection has been established between Oracle and Cassandra. I'm using Cassandra 2.2.3 and Oracle 11gR2.
I've added the following JARs to the lib folder of data-integration:
--cassandra-thrift-1.0.0
--apache-cassandra-cql-1.0.0
--libthrift-0.6.jar
--guava-r08.jar
--cassandra_driver.jar
Can anyone please help me figure out how to check whether the connection has been established in Pentaho?
There are a few ways to debug whether a connection to a database has been established. I don't know if all of them are valid for Cassandra, but I'll add one specific to it.
1) The test button
By simply clicking the test button on the connection edit screen.
2) Logs with high detail may help
Another way to test is to run your transformation with a high-detail log level:
sh pan.sh -file=my_cassandra_transformation.ktr -level=Rowlevel
3) The input preview
For Cassandra specifically, I would try creating a simple read operation using the Cassandra Input step and clicking the 'Preview' button.
4) The controlled output test
Or you can try a simpler transformation first, to make sure it's running fine, e.g. one that just reads a few rows with the Cassandra Input step and writes them to a text file output.
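Additionally, if you want to rule out the Cassandra cluster itself, independently of Pentaho, a quick check with the DataStax Python driver can confirm the node is reachable; the contact point below is a placeholder:

# Quick connectivity check against Cassandra, outside of Pentaho.
# The contact point is a placeholder for your environment.
from cassandra.cluster import Cluster

cluster = Cluster(['cassandra-host-1'])
session = cluster.connect()

# release_version lives in the system keyspace, so any reachable node answers this
row = session.execute("SELECT release_version FROM system.local").one()
print("Connected, Cassandra version:", row.release_version)

cluster.shutdown()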
Is there any utility to sync data between an Oracle and a Neo4j database? I want to use Neo4j in read-only mode, and all writes will happen to the Oracle DB.
I think this depends on how often you want to have the data synced. Are you looking for a periodic sync/ETL process (say hourly or daily), or are you looking for live updates into Neo4j?
I'm not aware of tools designed for this, but it's not terribly difficult to script yourself.
A periodic sync is obviously easiest. You can do that directly using the Java API, connecting to Oracle via JDBC. You could also just dump the data from Oracle as a CSV and import it into Neo4j. This would be done similarly to how data is imported from PostgreSQL in this article: http://neo4j.com/developer/guide-importing-data-and-etl/
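To give an idea of the scripting involved, here is a rough sketch of a periodic sync in Python with cx_Oracle and the official Neo4j driver (the Java API route above is the same idea); all connection details, the table, and the node label are placeholders:

# Sketch of a periodic Oracle -> Neo4j sync. Connection details, the table,
# and the node label are all placeholders.
import cx_Oracle
from neo4j import GraphDatabase

oracle = cx_Oracle.connect("app_user", "secret", "oracle-host/ORCLPDB1")
neo4j_driver = GraphDatabase.driver("bolt://neo4j-host:7687", auth=("neo4j", "secret"))

with neo4j_driver.session() as session, oracle.cursor() as cur:
    cur.execute("SELECT customer_id, name FROM customers")
    for customer_id, name in cur:
        # MERGE keeps the sync idempotent: existing nodes are updated, new ones created
        session.run(
            "MERGE (c:Customer {id: $id}) SET c.name = $name",
            id=customer_id,
            name=name,
        )

neo4j_driver.close()
oracle.close()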
There is a SO response for exporting data from Oracle using sqlplus/spool:
How do I spool to a CSV formatted file using SQLPLUS?
If you're looking for live syncing, you'd probably do this either through monitoring the transaction log or by adding triggers onto your tables, depending on the complexity of your data.
I have installed CDH 4.4. The Hive client is working properly, and I am able to create and display all the Hive tables.
But when I use tools like Talend, I get error 10001, table not found.
Can anybody tell me where I am going wrong?
This problem is due to the fact that Talend searches the default database.
Hence, give database.tablename in the table field. This will solve the problem.
Regards,
Nagaraj