How can we create an Azure Data Factory pipeline with Neo4j (graph) as the data sink? (data source being Oracle DB)

In the Azure environment, I have to copy data from an Oracle DB using a copy activity, with a Neo4j graph as the sink. Using Azure Data Factory, I need to insert/update data from the Oracle DB into the graph database.
My thinking is that I first need to transform the data to JSON and then insert it into the graph database. How can I do this?
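As far as I know, Data Factory does not ship a native Neo4j sink, so one common workaround is to let the copy activity stage the Oracle rows as JSON files in Blob Storage or ADLS, and then have a custom step (an Azure Function, Databricks notebook, or custom activity) push those rows into Neo4j. A minimal sketch of that last step, assuming line-delimited JSON and a hypothetical Customer node keyed on customer_id:

```python
# Sketch: push JSON rows (staged by an ADF copy activity) into Neo4j.
# Assumptions: a line-delimited JSON file produced by the copy activity, and a
# hypothetical Customer node keyed on "customer_id" -- adjust to your own model.
import json
from neo4j import GraphDatabase  # official Neo4j Python driver

NEO4J_URI = "neo4j+s://<your-instance>.databases.neo4j.io"  # placeholder
NEO4J_AUTH = ("neo4j", "<password>")                        # placeholder

MERGE_CYPHER = """
UNWIND $rows AS row
MERGE (c:Customer {customer_id: row.customer_id})
SET c.name = row.name, c.city = row.city
"""

def load_rows(path, batch_size=1000):
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    driver = GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH)
    try:
        with driver.session() as session:
            # MERGE makes the load idempotent, covering both inserts and updates.
            for i in range(0, len(rows), batch_size):
                session.run(MERGE_CYPHER, rows=rows[i:i + batch_size])
    finally:
        driver.close()

if __name__ == "__main__":
    load_rows("customers.json")  # file staged by the copy activity
```

MERGE keeps the load idempotent, which covers the insert/update requirement; the batching is only there to keep transactions bounded.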

Related

Choosing between DynamoDB Streams vs. Kinesis Streams for IoT Sensor data

I have a fleet of 250 Wi-Fi-enabled IoT sensors streaming weight data. Each device samples once per second. I am requesting help choosing between AWS DynamoDB Streams and AWS Kinesis Streams to store and process this data in real time. Here are some additional requirements:
I need to keep all raw data in a SQL-accessible table.
I also need to clean the raw stream data with Python's Pandas library to recognize device-level events based on weight changes (e.g. if the weight at sensor #1 increases, record it as "sensor #1 increased by x lbs at XX:XX PM"; if there is no change, do nothing).
I need that change-event data (interpreted from the raw data streams with the library) to be accessible in a real-time dashboard (e.g. device #1's weight just went to zero, prompting an employee to refill container #1).
Either DDB Streams or Kinesis Streams can support Lambda functions, which is what I'll use for the data cleaning, but I've read the documentation and comparison articles and can't distinguish which is best for my use case. Cost is not a key consideration. Thanks in advance!!
Unfortunately, I think you will need a few pieces of infrastructure for a full solution.
I think you could use Kinesis and Firehose to write to a database so that the raw data can be queried with SQL.
For the data-cleaning step, I think you will need a stateful stream processor like Flink or Bytewax; the transformed data can then be written to a real-time database, or back to Kinesis, so that it can be consumed in a dashboard.
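A minimal sketch of the raw-data leg, assuming a Firehose delivery stream (hypothetical name raw-sensor-weights) that has already been configured to deliver to S3 or Redshift so the raw readings stay SQL-queryable, e.g. via Athena:

```python
# Sketch: push raw sensor readings into a Kinesis Data Firehose delivery stream
# that is configured (outside this code) to deliver to S3 or Redshift, keeping
# the raw data SQL-queryable. The stream name and field names are hypothetical.
import json
import boto3

firehose = boto3.client("firehose")

def publish_reading(sensor_id, weight_lbs, ts):
    record = {"sensor_id": sensor_id, "weight_lbs": weight_lbs, "ts": ts}
    firehose.put_record(
        DeliveryStreamName="raw-sensor-weights",  # hypothetical stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```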
DynamoDB Streams works with DynamoDB: it streams row changes to be picked up by downstream services like Lambda. You mentioned that you want the data stored in a SQL database, and DynamoDB is a NoSQL database, so you can exclude that service.
I'm not sure why you want the data in a SQL database. If it is time-series data, you would probably store it in a time-series DB like Timestream.
If you are using AWS IoT Core to send data over MQTT to AWS, you can forward those messages to a Kinesis Data Stream (or SQS). Then you can have a Lambda triggered on the messages received in Kinesis; this Lambda can process the data and store it in whatever DB you want.
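A rough sketch of such a Lambda, using Pandas as the question intends, assuming JSON records with hypothetical sensor_id, weight_lbs and ts fields; persisting the detected events (and any cross-batch state) is left as a TODO:

```python
# Sketch: Lambda triggered by a Kinesis Data Stream, decoding the records and
# using Pandas to detect per-sensor weight changes within the batch.
# Field names are assumptions; the write-out step is not implemented.
import base64
import json
import pandas as pd

def handler(event, context):
    rows = [
        json.loads(base64.b64decode(r["kinesis"]["data"]))
        for r in event["Records"]
    ]
    df = pd.DataFrame(rows).sort_values(["sensor_id", "ts"])
    # Weight delta per sensor within this batch; cross-batch state would need
    # an external store such as DynamoDB or ElastiCache.
    df["delta"] = df.groupby("sensor_id")["weight_lbs"].diff()
    events = df[df["delta"].abs() > 0]
    for _, row in events.iterrows():
        print(f"sensor {row.sensor_id} changed by {row.delta:+.1f} lbs at {row.ts}")
        # TODO: write the event to the datastore backing the dashboard
    return {"processed": len(rows), "events": len(events)}
```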

Managing primary keys between data lake and oracle db

I have set up an ingest pipeline for streaming JSON data, which gets ingested into an AWS S3 data lake through PySpark and also into an Oracle DB.
In the Oracle DB, primary keys are created using sequences.
What is the most efficient way to design the data lake ingestion so that the primary keys are always in sync with the Oracle DB?
Both ingestions happen in parallel.

How to ingest CDC events produced by Oracle CDC Source Connector into Snowflake

Our current pipeline follows a structure similar to the one outlined here, except we are pulling events from Oracle and pushing them to Snowflake. The flow goes something like this:
Confluent Oracle CDC Source Connector mining the Oracle transaction log
Pushing these change events to a Kafka topic
Snowflake Sink Connector reading off the Kafka topic and pulling the raw messages into a Snowflake table.
In the end I have a table with record_metadata and record_content fields that contain the raw Kafka messages.
I'm having to build a set of procedures that handle the merge/upsert logic, operating on a stream on top of the raw table. The tables I'm trying to replicate in Snowflake are very wide and there are around 100 of them, so writing the SQL MERGE statements by hand is infeasible.
Is there a better way to ingest the Kafka topic containing all of the CDC events generated from the Oracle connector straight into Snowflake, handling auto-creating nonexistent tables, auto-updating/deleting/etc as events come across the stream?
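For context, the per-table statement those procedures end up running is roughly a MERGE from a (parsed) stream over the raw table into the typed target table, and generating it from column metadata is one way to cope with ~100 wide tables. A sketch with hypothetical table, stream and column names (CDC deletes would need an extra WHEN MATCHED ... THEN DELETE branch keyed on the change-event type):

```python
# Sketch: generate the per-table MERGE from a column list so ~100 wide tables
# don't have to be written out by hand. All names below are hypothetical; in
# practice the column lists would come from INFORMATION_SCHEMA.COLUMNS (or the
# connector's schema registry subjects).
def build_merge(target, stream, key_cols, cols):
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    updates = ", ".join(f"t.{c} = s.{c}" for c in cols if c not in key_cols)
    insert_cols = ", ".join(cols)
    insert_vals = ", ".join(f"s.{c}" for c in cols)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {stream} s\n"
        f"   ON {on}\n"
        f"WHEN MATCHED THEN UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )

print(build_merge(
    target="ANALYTICS.CUSTOMERS",              # typed target table
    stream="STAGING.CUSTOMERS_STREAM_PARSED",  # stream/view that parses record_content
    key_cols=["CUSTOMER_ID"],
    cols=["CUSTOMER_ID", "NAME", "CITY", "UPDATED_AT"],
))
```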

Scalable elasticsearch module with spring data elasticsearch possible?

I am working on designing a scalable service (Spring Boot) which will be used to index data into Elasticsearch.
Use case:
My application uses 6 databases (MySQL) with the same schema; each database caters to a specific region. I have a microservice that connects to all these DBs and indexes data from specific tables into an Elasticsearch server (v6.8.8) in a similar fashion, with 6 Elasticsearch indexes, one per DB.
Quartz jobs and the RestHighLevelClient are employed for this purpose. There are also delta jobs running every second that look for changes using audit tables and indexes.
Current problem:
The current design is not scalable - one service does all the work (data loading, mapping, bulk upserts). Because indexing is done through Quartz jobs, scaling the service (running multiple instances) would run the same job multiple times.
No failover - I am looking at distributed Elasticsearch nodes and indexing data to both nodes. How can this be done efficiently?
I am considering Spring Data Elasticsearch to index data at the same time it is persisted to the DB.
Does it offer all the features? I use:
Elasticsearch right from installing templates to creating/deleting indexes and aliases.
Blue/green deployment - index into the non-active side and switch the aliases (a rough sketch of the alias flip follows this list).
Bulk upserts, querying, aggregations, etc.
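What the blue/green alias flip amounts to, sketched with the Python Elasticsearch client for brevity (index and alias names are hypothetical, the parameter style varies slightly across client versions, and the equivalent calls exist in the Java REST high-level client used today):

```python
# Sketch: bulk-index into the inactive physical index, then atomically repoint
# the alias so readers switch with no downtime. Names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

ALIAS = "orders"                        # the name queries use
ACTIVE, INACTIVE = "orders_v1", "orders_v2"

def flip_alias(alias, old_index, new_index):
    # A single update_aliases call removes and adds in one step, so there is
    # never a moment with no index behind the alias.
    es.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    })

# 1. bulk-index the fresh data into INACTIVE
# 2. verify document counts / spot-check a few records
# 3. flip_alias(ALIAS, ACTIVE, INACTIVE)
```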
Any other solutions are welcome. Thanks for your time.
One of your use cases is moving data from the DB (MySQL) to ES in a scalable manner. That is basically a CDC (change data capture) pipeline.
You can use the Kafka Connect framework for this.
The flow would look like this:
Read the MySQL transaction logs => publish the data to Kafka (this can be accomplished using the Debezium source connector)
Consume the data from Kafka => push it to Elasticsearch (this can be accomplished using the Elasticsearch sink connector)
Why use the framework?
Using the Connect framework, data can be read directly from the MySQL transaction logs without writing code.
The Connect framework is a distributed and scalable system.
It reduces the load on your database, since you no longer need to query the database to detect changes.
It is easy to set up.
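A hedged sketch of registering the Debezium MySQL source connector (the first arrow in the flow above) through the Kafka Connect REST API. Host names and credentials are placeholders, and Debezium property names differ between major versions, so check the documentation for the version you deploy:

```python
# Sketch: register a Debezium MySQL source connector via the Kafka Connect
# REST API. Values are placeholders; some required properties (e.g. the schema
# history topic settings for MySQL) are omitted and depend on your Debezium version.
import requests

connector = {
    "name": "mysql-region-eu-source",  # hypothetical name, one per regional DB
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-eu.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",
        "topic.prefix": "eu",                              # "database.server.name" on older Debezium
        "table.include.list": "app.orders,app.customers",  # the tables you index today
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
```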

Joining Oracle Table Data with MongoDB Collection

I have a reporting framework to build and generate reports (tabular-format reports). Until now I have written SQL queries that fetch data from Oracle. Now I have an interesting challenge where half of the data will come from Oracle and the remaining data will come from MongoDB, based on the output of the Oracle data. The tabular data fetched from Oracle will have one additional column containing the key used to fetch data from MongoDB. With this I will have two tabular data sets, one from Oracle and one from MongoDB. Based on one common column I need to merge both data sets and produce a single result set for the report.
I can write logic in Java code to merge the two tables (say, data in 2D-array format). But instead of doing this on my own, I am thinking of utilizing an RDBMS in-memory data concept, for example the H2 database, where I can create two tables in memory on the fly and execute H2 queries to merge them. Or, I believe, there could be something in Oracle too, like a global temporary table. Could someone please suggest a better approach to join Oracle table data with a MongoDB collection?
I think you can try using Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can create a Kafka broker and create a topic. Then change the existing services where you are saving to Oracle and MongoDB: create two Kafka producers (one for Oracle and another for Mongo) to write the data as streams to the Kafka topic. Then create a consumer group to receive the streams from Kafka. You can aggregate the real-time streams using a Spark cluster (see the Spark Streaming API for Kafka) and save the results back to MongoDB (using the MongoDB Spark Connector) or any other distributed database. You can then do data visualization/reporting on the results stored in MongoDB.
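A rough sketch of that idea, using the newer Structured Streaming API rather than the DStream-based Spark Streaming API mentioned above; topic names, the record schema and the console sink are all assumptions (a production stream-stream join would also need watermarks to bound state):

```python
# Sketch: consume two Kafka topics (one fed from Oracle, one from MongoDB),
# join them on the shared key, and write the combined rows out for reporting.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("oracle-mongo-join").getOrCreate()

# Hypothetical record shape: a join key plus an opaque payload column.
schema = StructType().add("report_key", StringType()).add("payload", StringType())

def read_topic(topic):
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", topic)
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("v"))
        .select("v.*")
    )

oracle_df = read_topic("oracle-rows")
mongo_df = read_topic("mongo-docs").withColumnRenamed("payload", "mongo_payload")

joined = oracle_df.join(mongo_df, on="report_key")

query = joined.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```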
Another suggestion would be to use Apache Drill: https://drill.apache.org
With the MongoDB and JDBC storage plugins configured, you can join Oracle tables and Mongo collections together.
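A hedged sketch of what that could look like, assuming Drill's MongoDB storage plugin plus a JDBC storage plugin pointing at Oracle have been configured (plugin, schema and column names are made up), with the query submitted through Drill's REST API purely for illustration:

```python
# Sketch: one SQL statement joining an Oracle table (via a JDBC storage plugin
# named "ora") with a MongoDB collection (via the "mongo" plugin), sent to a
# drillbit's REST endpoint. All names and the host are hypothetical.
import requests

SQL = """
SELECT o.report_id, o.amount, m.details
FROM   ora.hr.report_rows o
JOIN   mongo.reporting.report_details m
  ON   o.mongo_key = m.doc_key
"""

resp = requests.post(
    "http://drillbit-host:8047/query.json",
    json={"queryType": "SQL", "query": SQL},
    timeout=60,
)
resp.raise_for_status()
rows = resp.json().get("rows", [])
print(rows[:5])
```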
