Move data from Oracle to Cassandra and/or MongoDB

At work we are thinking of moving from Oracle to a NoSQL database, so I have to run some tests on Cassandra and MongoDB. I have to move a lot of tables to the NoSQL database; the idea is to keep the data synchronized between the two platforms.
So I wrote a simple procedure that selects from the Oracle DB and inserts into Mongo. Some of my colleagues pointed out that there may be an easier (and more professional) way to do it.
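Roughly, the procedure does something like this (a minimal sketch, assuming the cx_Oracle and pymongo drivers; connection details, table and collection names are placeholders):

# Row-by-row copy of an Oracle table into a MongoDB collection.
import cx_Oracle
from pymongo import MongoClient

ora = cx_Oracle.connect("scott", "tiger", "dbhost/ORCLPDB1")
target = MongoClient("mongodb://localhost:27017")["mydb"]["customers"]

cursor = ora.cursor()
cursor.execute("SELECT customer_id, name, email FROM customers")
columns = [d[0].lower() for d in cursor.description]

batch = []
for row in cursor:
    batch.append(dict(zip(columns, row)))  # one document per row
    if len(batch) >= 1000:
        target.insert_many(batch)
        batch = []
if batch:
    target.insert_many(batch)

cursor.close()
ora.close()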
Has anybody had this problem before? How did you solve it?

If your goal is to copy your existing structure from Oracle to a NoSQL database, then you should probably reconsider the move in the first place. By doing that you lose most of the benefits of going to a non-relational data store.
A good first step would be to take a long look at your existing structure and determine how it can be changed to have a positive impact on your application. Also consider a hybrid system at the same time: Cassandra is great for a lot of things, but if you need a relational system and are already using a lot of Oracle functionality, it likely makes sense for most of your database to stay in Oracle, while moving the pieces that require frequent writes and would benefit from a different structure to Mongo or Cassandra.
Once you've made the decisions about your structure, I would suggest writing scripts or programs, or adding a module to your existing app, to write the data in the new format to the new data store. That gives you the most fine-grained control over every step of the process, which is something I would want to have in a large, system-wide architectural change.
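For example, if the remodeling step means collapsing a parent/child pair of Oracle tables into one document per parent, the loader might look roughly like this (a sketch only; the orders/order_items schema, connection details and collection names are invented for illustration, assuming cx_Oracle and pymongo):

# Denormalize orders + order_items from Oracle into one MongoDB document per order.
import cx_Oracle
from pymongo import MongoClient

ora = cx_Oracle.connect("scott", "tiger", "dbhost/ORCLPDB1")
orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

cur = ora.cursor()
cur.execute("""
    SELECT o.order_id, o.customer_id, o.order_date,
           i.product_id, i.quantity, i.price
    FROM   orders o JOIN order_items i ON i.order_id = o.order_id
    ORDER  BY o.order_id""")

doc = None
for order_id, customer_id, order_date, product_id, qty, price in cur:
    if doc is None or doc["_id"] != order_id:
        if doc is not None:
            orders.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        doc = {"_id": order_id, "customer_id": customer_id,
               "order_date": order_date, "items": []}
    doc["items"].append({"product_id": product_id, "quantity": qty, "price": price})
if doc is not None:
    orders.replace_one({"_id": doc["_id"]}, doc, upsert=True)

The upserts make the script safe to re-run, which helps while the target data model is still changing.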

You can also consider using components of the Hadoop ecosystem to perform this kind of ETL task. For that you need to model your Cassandra DB according to your requirements.
The steps could be to migrate your Oracle table data to HDFS (preferably using Sqoop) and then write a Map-Reduce job to transform that data and insert it into your Cassandra data model.
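As a rough illustration of the transform-and-load step, here is a sketch written as a plain Python script with the DataStax driver rather than a full Map-Reduce job; the keyspace, table and CSV layout are assumptions:

# Read rows exported from Oracle as CSV (e.g. by Sqoop) and write them
# into a Cassandra table shaped around the chosen partition key.
import csv
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")  # target keyspace

insert = session.prepare(
    "INSERT INTO orders_by_customer (customer_id, order_id, total) "
    "VALUES (?, ?, ?)")

with open("orders_export.csv", newline="") as f:
    for customer_id, order_id, total in csv.reader(f):
        # Transform: cast types and fit the row to the Cassandra data model.
        session.execute(insert, (int(customer_id), int(order_id), float(total)))

cluster.shutdown()

The same transform logic would sit in the reducer if you do write it as a Map-Reduce job.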

Related

Oracle schema sharing, is it possible?

Trying to understand whether there is any such concept in Oracle Database.
Let's say I have two databases, Database_A and Database_B.
Database_A has schema_A; is there a way I can attach this schema to Database_B?
What I mean is: if there is a job populating TABLE_A in schema_A, I want to be able to see a read-only view of it in Database_B. We are trying to split a big Oracle database into two smaller databases, we have a vast amount of PL/SQL code, and we are trying to minimize the refactoring here.
Sharding might be what you're looking for. The schemas and tables will still logically exist on all databases, but you can arrange the data to be physically stored in specific databases. There might be a way to set up shardspaces, tablespaces, and user default tablespaces so that each schema's data is automatically stored in a specific database.
But I haven't actually used sharding. From what I've read, it seems to be designed for massive distributed OLTP systems, and it is likely complicated to administer. I'd guess this feature isn't worth the hassle unless you have petabytes of data.

storing data in secondary database

Our application (Java, Spring, Hibernate) uses Postgres to store data.
We are looking to add an analysis engine to the application. I want to explore using a NoSQL DB to run the analysis on. This is partly an attempt to learn a bit of NoSQL and partly to free the main application activity from the performance penalty (as much as possible).
So I want the data changes to also sync to the NoSQL DB (in addition to Postgres). However, any sync mechanism will affect the performance of the main data/transaction activity.
Is it a good idea to push the data changes to a message bus and free the main transaction as early as possible? Can anyone point me to frameworks/technologies/ideas that address this issue of the same data going to two different data stores?
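A minimal sketch of what I have in mind, assuming RabbitMQ via pika as the bus (the queue name and event shape are just placeholders):

# Publish a change event to a message bus right after the Postgres commit;
# a separate consumer would write it to the analytics store.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="entity-changes", durable=True)

def publish_change(entity, entity_id, action, payload):
    event = {"entity": entity, "id": entity_id, "action": action, "data": payload}
    channel.basic_publish(
        exchange="",
        routing_key="entity-changes",
        body=json.dumps(event, default=str),
        properties=pika.BasicProperties(delivery_mode=2))  # persist the message

# called by the application after the main transaction succeeds
publish_change("customer", 42, "update", {"email": "new@example.com"})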
The simplest solution would be sending data to a Postgres read replica and running your analytics queries on that. The performance impact is minimal and this would save a lot of time compared to alternative approaches.
Unless you really know what you are doing, I would avoid NoSQL for this kind of application. If your dataset is too big for a Postgres read replica, you might want to use Redshift, which is a columnar datastore optimized for the types of analytics queries typically performed.

How to retrieve the data from database without using apache jackrabbit datastore?

I have integrated Jackrabbit with an Oracle database and I am storing the data using Jackrabbit. If I don't want to retrieve the data using Jackrabbit, how can I get at the data? In the database the data is stored as a BLOB type.
The way Jackrabbit stores the data in the DB is an implementation detail, and it does not magically map this into a "nice" DB schema, if that's what you mean. (The hierarchical nature and all the JCR features make this impossible.) It's a bit like having a Unix file system and then asking how to read the low-level inodes etc. from the file system implementation: you really should not.
Last but not least, note that while Jackrabbit is running, nothing else (except for a Jackrabbit cluster setup) must write to the DB (the tables used by Jackrabbit), as this will easily lead to data corruption.
As @TedTrippin already mentioned above, an ORM framework would make things much easier. But if you really want to do it manually in Oracle, the approach would be:
Study the code of the OCM http://jackrabbit.apache.org/jcr/object-content-mapping.html, then get the content according to the logic of associations and relations from Oracle, probably not in one but in multiple queries per document; possibly with user-defined functions, which are supported in Oracle and might make things easier.
It would be interesting to know the background of your question. You tagged it with "Spring" and "CMS". I don't see any reason why you would want to access the data directly from Oracle; it's tedious. In case you want to provide an API for the content to an external system, or in case you have lost a CMS that was once in front of the repository and just used Jackrabbit as a content store, you could still use such an ORM/OCM framework standalone to make it easier to access the data.

How to implement an ETL Process

I would like to implement synchronization between a source SQL-based database and a target triplestore.
However, for the sake of simplicity, let's say simply two databases. I wonder what approaches to use to have every change in the source database replicated in the target database. More specifically, I would like that each time some row changes in the source database, this can be seen by a process that reads the changes and populates the target database accordingly, while applying some transformation in the middle.
I have seen suggestions around notification mechanisms that may be available in the database, building tables so that changes can be tracked (i.e. doing it manually) and having a process poll them at intervals, or the use of logs (change data capture, etc.).
I'm seriously puzzled about all of this. I wonder if anyone could give some guidance and explanation about the different approaches with respect to my objective, i.e. the names of the methods and where to look.
My organization mostly uses Postgres and Oracle databases.
I have to take relational data and transform it into RDF so as to store it in a triplestore, and keep that triplestore constantly synchronized with the data in the SQL store.
Please,
Many thanks
PS:
A clarification of the difference between ETL and replication techniques such as change data capture, with respect to my overall objective, would be appreciated.
Again, I need to make sense of the subject and know what the methods are, so I can start digging further for myself. So far I have understood that CDC is the modern way to go.
Assuming you can't use replication and you need some kind of ETL process to actually extract, transform and load all changes into the destination database, you could use insert, update and delete triggers to fill a (manually created) audit table. Give it columns such as GeneratedId, TableName, RowId, Action (insert, update, delete) and a boolean flag indicating whether your ETL process has already handled the change. Use that table to get all the changed rows in your database and transport them to the destination database, then delete the processed rows from the audit table so that it doesn't grow too big. How often you have to run the ETL process depends on the amount of change occurring in the source database.
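As a rough sketch of the polling side of such a process (assuming Postgres via psycopg2 and an audit table named change_audit with the columns described above; the transform/load step is only stubbed out):

# Poll the trigger-filled audit table, hand each change to a transform/load
# step, and delete the rows once they have been processed.
import time
import psycopg2

def load_into_target(table_name, row_id, action):
    # Placeholder: fetch the row from the source, transform it and
    # write it to the destination database (e.g. as RDF triples).
    print(f"{action} on {table_name} row {row_id}")

conn = psycopg2.connect("dbname=source user=etl")

while True:
    with conn, conn.cursor() as cur:
        cur.execute("""SELECT generated_id, table_name, row_id, action
                       FROM change_audit
                       WHERE processed = false
                       ORDER BY generated_id
                       LIMIT 500""")
        changes = cur.fetchall()
        for generated_id, table_name, row_id, action in changes:
            load_into_target(table_name, row_id, action)
        if changes:
            cur.execute("DELETE FROM change_audit WHERE generated_id = ANY(%s)",
                        ([c[0] for c in changes],))
    time.sleep(30)  # the polling interval depends on the rate of change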

Big Data transfer between different systems

We have different sets of data in different systems like Hadoop, Cassandra and MongoDB, but our analytics team wants to get stitched data from those systems. For example, customer information with demographics will be in one system and their transactions will be in another. Analytics should be able to query for data such as the volume of transactions from US users. We need to develop an application that provides an easy way to interact with the different systems. What is the best way to do this?
Another requirement:
We want to provide them a custom workspace in a system like MongoDB so they can easily play with it. What is the best strategy to pull data from one system to another on demand?
Any pointers or common architectures used to solve this kind of problem would be really helpful.
I see two questions here:
How can I consolidate data from different systems into one system?
How can I create some data in Mongo for people to experiment with?
Here we go ... =)
I would pick one system and target that for consolidation. In other words, between Hadoop, Cassandra and MongoDB, which one does your team have the most experience with? Which one do you find easiest to query with? Which one do you have set up to scale well?
Each one has pros and cons for scale, storage and queryability.
I would pick one and then pump all data to that system. At a recent job, that ended up being MongoDB. It was easy to move data to Mongo and it had by far the best query language. It also had a great community and setting up nodes was easier than Hadoop, etc.
Once you have solved (1), you can trim your data set and create a scaled down sandbox for people to run ad-hoc queries against. That would be my approach. You don't want to support the entire data set, because it would likely be too expensive and complicated.
If you were doing this in a relational database, I would say just run a
select top 1000 * from [table]
query on each table and use that data for people to play with.
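In MongoDB, the rough equivalent (a sketch; the database and collection names are assumptions) would be to sample each collection into a sandbox collection:

# Copy a random sample of each collection into sandbox_* collections
# for ad-hoc queries.
from pymongo import MongoClient

source = MongoClient("mongodb://localhost:27017")["production"]

for name in source.list_collection_names():
    if name.startswith("sandbox_"):
        continue
    source[name].aggregate([
        {"$sample": {"size": 1000}},  # random subset of documents
        {"$out": "sandbox_" + name},  # materialize it as sandbox_<name>
    ])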
