We currently have a Java project with a PostgreSQL database, built on Spring Boot 2 with Spring Data JPA (Hibernate).
Requirements for the new architecture:
On N computers we have workplaces. Each workplace runs the same program with a different configuration (a configured client for the distributed database).
The number of computers is not large, around 10-20 PCs. The database must be scalable (a lot of data can be stored on disk, roughly 1-2 TB).
Up to 1 million rows per day can be inserted into the database from a single workplace.
Each workplace works with the distributed database, meaning every node must be able to read and write data modified by the others, and make decisions at runtime based on data modified by another workplace (transactionally).
The datastore (on-disk database files) must be able to be archived and copied as a backup snapshot.
The project must be portable to the new architecture with Spring Data JPA 2, and database migrations must be managed with Liquibase. It must work on Windows and Linux.
A quick overview shows that the most popular free distributed databases right now are:
1) Redis
2) Apache Ignite
3) Hazelcast
I need help understanding how to architect the described system.
First of all, I tried Redis and Ignite. Redis starts easily, but it works like a simple IMDG (in-memory data grid), whereas I need to store all the data in a persistent database on disk (like Ignite's native persistence). Is there a way to use Redis together with the existing PostgreSQL database? PostgreSQL would be synchronized with all nodes, Redis would act as an in-memory cache holding the fresh data produced by each workplace, and every 10 minutes the data would be flushed to disk.
1) Is this possible? How?
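To make question 1 more concrete, something like this cache-aside sketch is what I have in mind: PostgreSQL stays the source of truth through Spring Data JPA, and Redis only holds the hot entries via Spring's cache abstraction (Measurement and MeasurementRepository are just placeholder names, and it assumes spring-boot-starter-data-redis plus @EnableCaching):

```java
import org.springframework.cache.annotation.CachePut;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class MeasurementService {

    private final MeasurementRepository repository;   // plain Spring Data JPA repository (PostgreSQL)

    public MeasurementService(MeasurementRepository repository) {
        this.repository = repository;
    }

    // Read-through: check the Redis cache first, fall back to PostgreSQL on a miss.
    @Cacheable(cacheNames = "measurements", key = "#id")
    public Measurement findById(Long id) {
        return repository.findById(id).orElse(null);
    }

    // Write to PostgreSQL and refresh the cached entry so other nodes see fresh data.
    @CachePut(cacheNames = "measurements", key = "#result.id")
    public Measurement save(Measurement measurement) {
        return repository.save(measurement);
    }
}
```

As far as I can tell, the "flush every 10 minutes" part would still need a separate write-behind mechanism, which plain Spring caching doesn't give me out of the box.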
I also tried Ignite, but my project runs on Spring Boot 2 / Spring Data 2, the latest released Ignite version is 2.6, and Spring Data 2 support will only appear in Apache Ignite 2.7!
2) I would have to download a 2.7 nightly build, but how can I use it in my project? (Do I need to install it into my local Maven repository?)
3) Finally, what would be the best architecture in this case? A datastore provider that keeps persistent data on disk, synchronized with each workplace's in-memory cache, and persists the in-memory data to disk on a timeout?
What would be the best solution, and which database should I choose?
(Maybe something that works with the existing PostgreSQL?)
Thanks.
Your use case sounds like a common one for Hazelcast. You can store your data in memory (i.e. in a Hazelcast IMap) and use a MapStore/MapLoader to persist changes to your database or read from it. Persisting changes can be done in a write-through or write-behind manner, depending on your configuration. There is also Spring Boot and Spring Data JPA integration available.
The amount of data you want to store is also quite large for 10-20 machines, so you might want to look into Hazelcast's High-Density Memory Store option, which lets you store large amounts of data on commodity hardware without GC problems.
The following links should give you a further idea:
https://opencredo.com/spring-booting-hazelcast/
https://docs.hazelcast.org//docs/3.11/manual/html-single/index.html#loading-and-storing-persistent-data
https://hazelcast.com/products/high-density-memory-store/
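To make the MapStore/MapLoader idea concrete, here is a minimal sketch that backs a Hazelcast IMap with a Spring Data JPA repository (SensorReading and SensorReadingRepository are hypothetical placeholders for your own entity and repository; error handling and the Hazelcast wiring are omitted):

```java
import com.hazelcast.core.MapStore;   // Hazelcast 3.x package

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class SensorReadingMapStore implements MapStore<Long, SensorReading> {

    private final SensorReadingRepository repository;   // Spring Data JPA repository over PostgreSQL

    public SensorReadingMapStore(SensorReadingRepository repository) {
        this.repository = repository;
    }

    @Override
    public void store(Long key, SensorReading value) {
        repository.save(value);                          // write-through / write-behind target
    }

    @Override
    public void storeAll(Map<Long, SensorReading> map) {
        repository.saveAll(map.values());                // batched automatically in write-behind mode
    }

    @Override
    public void delete(Long key) {
        repository.deleteById(key);
    }

    @Override
    public void deleteAll(Collection<Long> keys) {
        keys.forEach(repository::deleteById);
    }

    @Override
    public SensorReading load(Long key) {
        return repository.findById(key).orElse(null);    // cache miss: read from PostgreSQL
    }

    @Override
    public Map<Long, SensorReading> loadAll(Collection<Long> keys) {
        Map<Long, SensorReading> result = new HashMap<>();
        repository.findAllById(keys).forEach(r -> result.put(r.getId(), r));
        return result;
    }

    @Override
    public Iterable<Long> loadAllKeys() {
        return null;                                     // skip eager pre-loading of the whole table
    }
}
```

In the map configuration, a MapStoreConfig with write-delay-seconds set to 0 gives write-through behaviour, while a positive value (e.g. 600) batches the writes to PostgreSQL in write-behind mode, which matches the "flush every 10 minutes" requirement.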
Ignite is not suitable for these requirements, because it only supports Spring Data JPA 1.
Redis doesn't support SQL queries.
Our choice is plain PostgreSQL with master-slave replication. CockroachDB might also be applicable.
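On the application side, reads could be routed to a replica while writes go to the master with something like the following AbstractRoutingDataSource sketch (the MASTER/REPLICA keys and the ThreadLocal switch are simplified placeholders; in practice the switch would be flipped by an interceptor around read-only transactions):

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

import javax.sql.DataSource;
import java.util.HashMap;
import java.util.Map;

public class ReplicationRoutingDataSource extends AbstractRoutingDataSource {

    // Simple per-thread switch; an AOP aspect around @Transactional(readOnly = true)
    // methods would typically set and clear it.
    private static final ThreadLocal<Boolean> READ_ONLY = ThreadLocal.withInitial(() -> false);

    public static void setReadOnly(boolean readOnly) {
        READ_ONLY.set(readOnly);
    }

    public ReplicationRoutingDataSource(DataSource master, DataSource replica) {
        Map<Object, Object> targets = new HashMap<>();
        targets.put("MASTER", master);
        targets.put("REPLICA", replica);
        setTargetDataSources(targets);
        setDefaultTargetDataSource(master);   // safe default: send work to the master
    }

    @Override
    protected Object determineCurrentLookupKey() {
        return READ_ONLY.get() ? "REPLICA" : "MASTER";
    }
}
```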
Thanks for the help!
I am using Ehcache in our application, and it works on a single server with data persisted to disk. As we move to production, we have two different servers that are clustered in our application.
If the first request comes to server A, the data is cached on server A's disk and everything works fine. But if a later request goes to server B, the application cannot find the cached data, because the cached object lives on server A's disk. How do we replicate across both disks in our ehcache-config.xml?
I recommend looking into clustering support with Terracotta.
See the documentation for this:
Ehcache 3.x line
Ehcache 2.x line
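For the Ehcache 3.x line, the cache is clustered against a Terracotta server rather than by replicating the two local disk stores. A minimal programmatic sketch of the idea (host, resource names and sizes are placeholders; it assumes the ehcache-clustered module on the classpath and a running Terracotta server):

```java
import org.ehcache.PersistentCacheManager;
import org.ehcache.clustered.client.config.builders.ClusteredResourcePoolBuilder;
import org.ehcache.clustered.client.config.builders.ClusteringServiceConfigurationBuilder;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.EntryUnit;
import org.ehcache.config.units.MemoryUnit;

import java.net.URI;

public class ClusteredCacheExample {
    public static void main(String[] args) {
        // Both server A and server B point at the same Terracotta server, so they
        // share the cached entries instead of each keeping a private disk store.
        PersistentCacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
                .with(ClusteringServiceConfigurationBuilder
                        .cluster(URI.create("terracotta://terracotta-host:9410/my-application"))
                        .autoCreate())
                .withCache("shared-cache",
                        CacheConfigurationBuilder.newCacheConfigurationBuilder(
                                Long.class, String.class,
                                ResourcePoolsBuilder.newResourcePoolsBuilder()
                                        .heap(100, EntryUnit.ENTRIES)   // small near cache on each server
                                        .with(ClusteredResourcePoolBuilder.clusteredDedicated(
                                                "primary-server-resource", 32, MemoryUnit.MB))))
                .build(true);

        cacheManager.getCache("shared-cache", Long.class, String.class)
                .put(1L, "visible from both servers");
        cacheManager.close();
    }
}
```

The equivalent can also be expressed in XML configuration if you prefer to stay with an ehcache-config style file; see the clustering section of the documentation linked above.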
I would like to expose a data table from my Oracle database into Apache Kafka. Is it technically possible?
I also need to stream data changes from my Oracle table and publish them to Kafka.
Do you know of good documentation for this use case?
Thanks.
You need Kafka Connect JDBC source connector to load data from your Oracle database. There is an open source bundled connector from Confluent. It has been packaged and tested with the rest of the Confluent Platform, including the schema registry. Using this connector is as easy as writing a simple connector configuration and starting a standalone Kafka Connect process or making a REST request to a Kafka Connect cluster. Documentation for this connector can be found here
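As an illustration of the "REST request to a Kafka Connect cluster" route, here is a small Java sketch that registers a hypothetical JDBC source connector with a Connect worker (connector name, hosts, credentials, table and column names are all placeholders to adapt to your environment):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSourceConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition: poll the Oracle table and publish new/updated rows to Kafka.
        String body = """
                {
                  "name": "oracle-jdbc-source",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url": "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1",
                    "connection.user": "kafka_user",
                    "connection.password": "secret",
                    "table.whitelist": "MY_SCHEMA.MY_TABLE",
                    "mode": "timestamp+incrementing",
                    "incrementing.column.name": "ID",
                    "timestamp.column.name": "UPDATED_AT",
                    "topic.prefix": "oracle-",
                    "poll.interval.ms": "5000"
                  }
                }
                """;

        // POST the definition to the Kafka Connect REST API (default port 8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect-host:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Note that the JDBC source connector polls with queries, so it captures new and updated rows (given suitable incrementing/timestamp columns) but not deletes; for full change capture see the CDC options below.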
To move change data in real time from Oracle transactional databases to Kafka, you first need a proprietary Change Data Capture (CDC) tool that requires purchasing a commercial license, such as Oracle's GoldenGate, Attunity Replicate, Dbvisit Replicate, or Striim. Then you can leverage the Kafka Connect connectors that they all provide. They are all listed here.
Debezium, an open-source CDC tool from Red Hat, is planning to work on a connector that does not rely on an Oracle GoldenGate license. The related JIRA is here.
You can use Kafka Connect for data import/export to Kafka. Using Kafka Connect is quite simple, because there is no need to write code. You just need to configure your connector.
You would only need to write code if no connector is available and you want to provide your own. There are already 50+ connectors available.
There is a connector ("Golden Gate") for Oracle from Confluent Inc: https://www.confluent.io/product/connectors/
On the surface this is technically feasible. However, understand that the question has implications for downstream applications.
So to comprehensively address the original question regarding technical feasibility, bear in mind the following:
Are ordering/commit semantics important? Particularly across tables.
Do table changes need to continue to be captured across crashes of the Kafka/CDC components?
When the table definition changes, do you expect the application to continue working, or will you resort to planned change control?
Will you want to move partial subsets of data?
What datatypes need to be supported? e.g. Nested table support etc.
Will you need to handle compressed logical changes - e.g. on update/delete operations? How will you address this on the consumer side?
You can also consider using OpenLogReplicator. This is a new open-source tool that reads Oracle database redo logs and sends messages to Kafka. Since it is written in C++, it has very low latency (around 10 ms) while still achieving relatively high throughput.
It is at an early stage of development, but there is already a working version. You can build a POC and check for yourself how it works.
We have an Oracle database containing our tables. We would like to implement a new project, as mentioned in the title: real-time replication from Oracle to Cassandra.
This new Cassandra environment would serve as a reporting service. Data is inserted from our in-house application into the Oracle production environment, and then a custom service (or whatever tool fits) would read the delta and insert it into Cassandra (something like GoldenGate, perhaps).
Briefly, will Cassandra answer our needs in this scenario?
In our case, we have 20 Oracle DBs in different locations (all 20 with a similar implementation) and 1 central report DB (REPORTDB) that is refreshed daily from these 20 DBs. We use "outdated" snapshot technology: every night, using the fast refresh option, we gather the daily delta from these 20 DBs into the central report DB via Oracle snapshots. We need a structure that reads data from the 20 DBs and injects it in real time into a new Cassandra database, just like REPORTDB.
These days you can run Spark jobs against Cassandra, thanks to DataStax, so yes, it can be used as a reporting tool. It is best utilized as a key-value store when your number of writes is high compared to your reads.
Reading deltas is not real-time, so you should try using Oracle AQ. I've been doing real-time replication from Oracle to Cassandra using Oracle AQ and Apache Storm for almost 4 years now, and it runs flawlessly.
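To illustrate the Spark-on-Cassandra reporting idea from the first paragraph, here is a minimal Java sketch (keyspace, table, column and host names are placeholders; it assumes the DataStax spark-cassandra-connector is on the classpath and that the job is submitted with spark-submit, which supplies the master URL):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraReportJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cassandra-reporting")
                .config("spark.cassandra.connection.host", "cassandra-host")
                .getOrCreate();

        // Read a replicated table straight from Cassandra.
        Dataset<Row> orders = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "reporting")
                .option("table", "orders")
                .load();

        // Example report: daily order counts.
        orders.groupBy("order_date").count().show();

        spark.stop();
    }
}
```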
I don't understand this architecture of Oracle and Cassandra running side by side.
Either Oracle suits your needs, in which case you should stick with it, or it doesn't and you need scalability/high availability, in which case you should switch to Cassandra.
Can you elaborate on the reasons that make you choose Cassandra for the reporting service?
I have a requirement to ingest data from an Oracle database into Hadoop in real time.
What's the best way to achieve this on Hadoop?
The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are:
What is the target for the data and what are you going to do with it?
just store plain HDFS files and access them for ad-hoc queries with something like Impala?
store in HBase for use in other apps?
use in a CEP solution like Storm?
...
What tools is your team familiar with?
Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
Or do you prefer a data integration tool like Informatica?
Coming back to CDC, there are three different approaches to it:
Easy: if you don't need true real time and you have a way to identify new data with a SQL query that executes fast enough for the required data latency, you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools); see the sketch after this list.
Complicated: roll your own CDC solution: download the database logs, parse them into a series of inserts/updates/deletes, and ingest these into Hadoop.
Expensive: buy a CDC solution that does this for you (like GoldenGate or Attunity).
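Here is a minimal sketch of the "easy" approach mentioned above: poll the source table for rows newer than the last ingested ID and hand each batch to whatever sink you choose (an HDFS writer, a Kafka producer, etc.). The table and column names, connection details and the ingest() method are placeholders, and offset persistence and error handling are omitted:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IncrementalPoller {
    public static void main(String[] args) throws Exception {
        long lastSeenId = 0L;  // in practice, persist this offset between runs
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1", "user", "password")) {
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id")) {
                    ps.setLong(1, lastSeenId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastSeenId = rs.getLong("id");
                            ingest(rs.getString("payload"));   // hand the row to the chosen target
                        }
                    }
                }
                Thread.sleep(10_000);   // the poll interval bounds the achievable latency
            }
        }
    }

    private static void ingest(String payload) {
        // Placeholder: write to HDFS, produce to Kafka, load into HBase, ...
        System.out.println("ingesting: " + payload);
    }
}
```

Note that this only captures new rows with a monotonically increasing key (or timestamp); it does not see updates or deletes, which is exactly why it is the easy option rather than true CDC.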
Expanding a bit on what @Nickolay mentioned, there are a few options, but the best one would be too opinion-based to state.
Tungsten (open source)
Tungsten Replicator is an open-source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle, and Amazon RDS, and applied to transactional stores (including MySQL, Oracle, and Amazon RDS), NoSQL stores such as MongoDB, and data warehouse stores such as Vertica, Hadoop, and Amazon RDS.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near-real-time copy of source tables.
Apache Sqoop is a data transfer tool for moving bulk data from any RDBMS with JDBC connectivity (Oracle is supported as well) into Hadoop HDFS.
I am developing the Django backend of an iOS app. I will use cached sessions with Redis. Once a user logs in, I will save their session in the Redis cache (backed by MySQL). I just want to know, for the long run: can I use Redis replication to keep a copy of the cached session in case I scale the Redis server into a master-slave setup in the future, or should I always read the cached value from one particular Redis server?
It makes sense to keep a copy via Redis replication in a master/slave setup, since there isn't a sharding option for Redis like there is for MongoDB yet (AFAIK). So you have to read your session from one particular Redis server, unless you want to manage several Redis servers manually.