We have an Oracle Database that resides tables. We would like to implement a new project as I mentioned in title; Oracle to Cassandra real-time replication.
But this new Cassandra environment would be as a reporting service. From the application (in-house), datas is inserted to Oracle production environment. Then our custom service (or what ever) will read delta and insert to Cassandra (this would be like Goldengate may be).
Briefly, does the Cassandra will answer our needs for this scenario?
In our case, we have 20 oracle DBs in different locations (these 20 dbs has similar implementation) 1 central report DB that is daily refresh from these 20 DBs. We use "outdated" snapshot technology, every night our central single report DB (REPORTDB) with fast refresh option, we gather the daily delta from these 20 dbs within oracle ss. we need a structure that reads data from 20 dbs and real-time injection to new cassandra database just like REPORDB
These days you can run spark jobs on Cassandra, thanks to Datastax so yes it can be used as a reporting tool. It's best utilized as a key value store if your number of writes are high compared to your reads.
Reading delta is not real time so you should try using Oracle's AQs. I've been doing real time replication of Oracle to Cassandra using Oracle's AQ and Apache Storm for almost 4 years now and it's running flawlessly.
I don't understand this Oracle/Cassandra architecture running alongside.
Either Oracle suits your needs then you should stick with it. Or it doesn't and you need scalability/high availability then switch to Cassandra.
Can you elaborate on the reasons that make you choose Cassandra for the reporting service ?
Related
Now we have java project with PostgreSQL database on spring boot 2 with Spring Data JPA (Hibernate).
Requirements to new architecture:
On N computers we have workplace. Each workplace use the same program with different configuration (configured client for redistributed database).
Computers count is not big - amount 10/20 PCs. Database must be scalable (a lot of data can be stored at the disk ~1/2 Tb).
Every day up to 1 million rows can be inserted into database from one workplace.
Each workplace works with redistributed database - it means, that each node must be able to read/write data, modified by each other. And make some decision based on data, modified by another workplace at runtime(Transactional).
Datastore(disk database archive) must be able to archived and copied as backup snapshot.
Project must be portable to new architecture with Spring Data JPA 2 and database backups with liquibase. Works on windows/ Linux.
The quick overview shows me that the most popular redistributed FREE database at now are:
1) Redis
2) Apache Ignite
3) Hazelcast
I need help in understanding way to architect described system.
First of all, I'm tried to use redis and ignite. Redis start easily - but it works like simple IMDG(in memory data grid). But I need to store all the data in persistent database(at disk, like ignite persistence). There is a way to use redis with existing PostgreSQL database? Postgres synchronized with all nodes and Redis use in memory cache with fresh data, produced by each workplace. Each 10 minutes data flushed at disk.
1) This is possible? How?
Also I'm tried to use Ignite - but my project works on spring boot 2. Spring data 2. And Ignite last released version is 2.6 and spring data 2 support will appears only in apache ignite 2.7!
2) I have to download 2.7 version nightly build, but how can I use it in my project? (need to install to local Maven repository?)
3) And after all, what will be the best architecture in that case? Datastore provider stores persistent data at disk, synchronized with each workspace In-memory cache and persist in-memory data to disk by timeout?
What will be the best solution and which database I should to choose?
(may be something works with existing PostgreSQL?)
Thx)
Your use case sounds like a common one with Hazelcast. You can store your data in memory (i.e. in an Hazelcast IMap), use a MapStore/MapLoader to persist changes to your database, or read from database. Persisting changes can be done in a write-through or write-behind manner based on your configuration. Also there is spring boot and spring-jpa integration available.
Also the amount of data you want to store is pretty big for 10-20 machines, so you might want to look into hazelcast High-Density Memory Store option to be able to store large amounts of data in commodity hardware without having GC problems.
Following links should give you further idea:
https://opencredo.com/spring-booting-hazelcast/
https://docs.hazelcast.org//docs/3.11/manual/html-single/index.html#loading-and-storing-persistent-data
https://hazelcast.com/products/high-density-memory-store/
Ignite is not suitable for that options, because JPA 1 supports only.
Redis isn't supports SQL queries.
Our choiсe is plain PostgreSQL master with slave replication. May be cockroachDB applies also.
Thx for help))
We are switching from Oracle to Hadoop due to slow performance with Oracle DB, built a universe with Cloudera Simba ODBC connections scheduled a report expecting a faster performance compare to Oracle DB but the report took more than 2 hours, took the same query and ran in HUE SQL editor the result got back in less than 2 mins
We tested in DEV, TEST, & PROD, & also tried switching to JDBC connection no difference, we feel its the network's latency issue and opened a Case with SAP
Point to note here that our Hadoop servers and BO servers are in two different locations NCAL and SCAL, we have 3.5 million records to pull
I am looking for some tested advice here on this issue if anyone has already faced such issue
I am pretty new to Oracle Golden Gate, wanted to understand if it possible to create a bidirectional sync between Oracle 12x and Cassandra(DSE) using Oracle Golden Gate? Searched several places in internet but most examples are replicating data between Oracle databases. I started wondering if it is even possible to do so. Can anyone help me with any documentation?
There is a separate module called Oracle GoldenGate for BigData. It supports many NoSQL replication targets.
One of the supported BigData databases is also Apache Cassandra.
There is a separate manual explaining how to use it.
There is no separate module that allows you to connect Apache Cassandra as the source of your replication. If you need such replication you need to provide some intermediate step. The source of replication for Oracle GoldenGate can only be a database (Oracle, TimesTen, DB2, Informix, MySQL, MS SQL Server, NonStop SQL/MX, SAP/Sybase ASE, Teradata) or a JMS queue.
I have a option of using Sqoop or Informatica Big Data edition to source data into HDFS. The source systems are Tearadata, Oracle.
I would like to know which one is better and any reason behind the same.
Note:
My current utility is able to pull data using sqoop into HDFS , Create Hive staging table and archive external table.
Informatica is the ETL tool used in the organization.
Regards
Sanjeeb
Sqoop
Sqoop is capable of performing full and incremental loading from Oracle/Teradata.
Sqoop does parallel copy of data from source systems.
Sqoop scripts can be custom genrated and scheduled by Oozie.
Open source solution for any size cluster. No license cost.
Informatica
Best Interface in ETL Industry to manage mappings.
Does not provide parallel copy options. Provides Hive mode for parallel processing. Basically converts transformation into Hive queries for execution. Also supports push downs to generate MR code.
Licensing cost per node. If you plan 500 Hadoop nodes for future data storage you need to pay 10 times as compared with 50 node cluster when you scale cluster.
Informatica BDE is relatively new product in market. INFA Developer will be usefull for working on Big data. There are challenges in supporting all latest Hadoop platform features on Informatica, also traditional RDBMS features like Sequence generation, Stateful mapping,Sessions, Lookup Transformation in Informatica BDE.
Informatica MDM does not support Hadoop.
If price is criteria for decision making, go for Sqoop. If you want to leverage flexibility of switching Hadoop plaftorm tools, use Sqoop(Sqoop project is also thinking of moving over Spark).
If you are tied to Informatica for some reason, go for Informatica. But most Informatica developers want to move to Hadoop technologies.
Although this was asked an year ago, sharing new features in Informatica
Informatica BDM version 10.1 supports Sqoop connectivity i.e. you can use Sqoop to read the data from RDBMS and load it into Hadoop/Hive
Also, there are many new features in BDM version 10.2, especially the parameterization support in the developer tool and dynamic mappings.
Tool versus handcoding was always there.
Informatica tool gives enterprise level solution which is easier to maintain.
BDM 10.1.1 supports sqoop with spark engine. Spark 2.0.1 is supported in this version so performance its pretty good.
BDM 10.2 is just released with new features like stateful variable support which was missing in earlier versions.
SQOOP must be used for the Data exchange. You have lot of options with which you can have an optimal performance. Also if you are trying to exchange the data between RDBMS(Teradata / Oracle) <-> Informatica <-> Hadoop cluster then the data would first need to be brought to the Informatica Server which may involve additional I/O.
If the data processing must be done within hive Informatica BDE must be used.
I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.
What's the best way to achieve this on Hadoop?
The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are:
What is the target for the data and what are you going to do with it?
just store plain HDFS files and access for adhoc queries with something like Impala?
store in HBase for use in other apps?
use in a CEP solution like Storm?
...
What tools is your team familiar with
Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
or do you prefer a Data integration tool like Informatica?
Coming back to CDC, there are three different approaches to it:
Easy: if you don't need true real-time and have a way to identify new data with an SQL query that executes fast enough for the required data latency. Then you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools)
Complicated: Roll your own CDC solution: download the database logs, parse them into series of inserts/updates/deletes, ingest these to Hadoop.
Expensive: buy a CDC solution, that does this for you (like GoldenGate or Attunity)
Expanding a bit on what #Nickolay mentioned, there are a few options, but the best would be too opinion based to state.
Tungsten (open source)
Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB, and datawarehouse stores such as Vertica, Hadoop, and Amazon rDS.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of source tables
Apache Sqoop is a data transfer tool to transfer bulk data from any RDBMS with JDBC connectivity(supports Oracle also) to hadoop HDFS.