I have set up an ingestion pipeline for streaming JSON data that is ingested into an AWS S3 data lake through PySpark and also into an Oracle DB.
In the Oracle DB, primary keys are created using sequences.
What is the most efficient way to design the data lake ingestion so that the primary keys are always in sync with the Oracle DB?
Both ingestions happen in parallel.
Related
We are currently using SQL Server in AWS and are looking at ways to create a data warehouse from the data in SQL Server.
It seems like the easiest way would be to use the AWS DMS tool to send the data to Redshift and have it constantly sync, but Redshift is pretty expensive, so we are looking at other ways of doing it.
I have been working with EMR. Currently I am using Sqoop to take data from SQL Server and put it into Hive, and I am using the HDFS volume to store the data; I have not used S3 for that yet.
Our database has many tables with millions of rows each.
What is the best way to update this data every day? Does Sqoop support updating data? If not, what other tool is used for something like this?
Any help would be great.
My suggestion: go for a Hadoop cluster (EMR) if the processing is complex and time-consuming; otherwise Redshift is the better fit.
Choose the right tool. If it is for a data warehouse, then go for Redshift.
And why DMS? Are you going to sync in real time? You want a daily sync, so there is no need for DMS.
Better solution:
1. Make sure you have a primary key column and a column that tells you when a row was last updated, such as updated_at or modified_at.
2. Run BCP to export the data in bulk from SQL Server to CSV files.
3. Upload the CSV files to S3, then import them into Redshift.
4. Use Glue to fetch the incremental data (based on the primary key column and the updated_at column) and export it to S3 (see the sketch right after this list).
5. Import the files from S3 into Redshift staging tables.
6. Run an upsert (update + insert) to merge the staging table into the main table (see the sketch at the end of this answer).
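To make step 4 concrete, here is a minimal sketch using plain PySpark JDBC reads rather than the Glue-specific APIs. The server, table, column, and bucket names are placeholders, and it assumes you track the last extracted updated_at value yourself (Glue job bookmarks can do that for you):

```python
# Sketch of step 4: pull only rows changed since the last run and land them in S3.
# Server address, credentials, table, and bucket names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-extract").getOrCreate()

# Last successfully extracted updated_at value, tracked by you
# (Glue job bookmarks can track this if you use the Glue APIs instead).
last_value = "2024-01-01 00:00:00"

# Push the incremental filter down to SQL Server instead of scanning the whole table.
incremental_query = f"""
    (SELECT id, customer_id, amount, updated_at
     FROM dbo.orders
     WHERE updated_at > '{last_value}') AS incr
"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://my-sql-host:1433;databaseName=sales")
      .option("dbtable", incremental_query)
      .option("user", "etl_user")
      .option("password", "***")
      .load())

# Land the increment as CSV so it can be COPYed into a Redshift staging table (step 5).
(df.write.mode("overwrite")
   .option("header", "true")
   .csv("s3://my-data-lake/staging/orders/"))
```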
If you feel running Glue is a bit expensive, then use SSIS or a PowerShell script to do steps 1 to 4, and then the psql command to import the files from S3 into Redshift and do steps 5 and 6.
This will handle the inserts and updates in your SQL Server tables, but deletes will not be part of it. If you need all CRUD operations, then go for a CDC approach with DMS or Debezium and push the changes to S3 and Redshift.
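For steps 5 and 6, a hedged sketch of the COPY into the staging table followed by the merge. Redshift has no single upsert statement, so the usual pattern is delete-then-insert in one transaction; the cluster endpoint, table names, S3 path, and IAM role below are assumptions, and the same SQL can be run through psql if you go the script route:

```python
# Sketch of steps 5 and 6: COPY the increment into a staging table, then delete-and-insert.
# Cluster endpoint, credentials, table names, S3 path, and IAM role are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="etl_user", password="***")
cur = conn.cursor()

# Step 5: load the incremental CSV files from S3 into the staging table.
cur.execute("""
    COPY staging.orders
    FROM 's3://my-data-lake/staging/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS CSV IGNOREHEADER 1
""")

# Step 6: Redshift has no single UPSERT statement, so delete the rows that are
# about to be replaced and insert everything from staging.
cur.execute("""
    DELETE FROM public.orders
    USING staging.orders s
    WHERE public.orders.id = s.id
""")
cur.execute("INSERT INTO public.orders SELECT * FROM staging.orders")

conn.commit()  # all statements commit together as one transaction
cur.close()
conn.close()
```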
We are using NiFi to ingest data into HDFS. Can the same data be ingested into Oracle or any other database at the same time using NiFi?
I need to publish the same data to two places (HDFS and an Oracle database) and do not want to write two subscriber programs.
NiFi has processors to get data from an RDBMS such as Oracle (QueryDatabaseTable, ExecuteSQL, etc.) and from HDFS (ListHDFS, FetchHDFS, etc.). It also has processors to put data into an RDBMS (PutDatabaseRecord, PutSQL, etc.) or into HDFS (PutHDFS). So you can get your data from multiple sources and send it to multiple targets with NiFi.
Is Cassandra a good alternative to Hadoop as a data warehouse where data is append-only and updates in the source databases should not overwrite the existing rows in the data warehouse but get appended? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (final data storage). It depends more on what you want to do with the data.
You may even need both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide a complex aggregation report to the user.
First, you need to save the data as fast as possible (as new portions arrive very often), so you use Cassandra here. As Cassandra is limited in aggregation features, you load the data into HDFS and do the processing via HQL (Hive) scripts (assume you're not very good at coding but great at complicated SQL). Then you move the report results from HDFS into Cassandra, into a dedicated reports table partitioned by user id.
So when a user wants an aggregation report about their activity in the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (as it is a simple key-value lookup).
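As a minimal sketch of that last step with the DataStax Python driver (the keyspace, table, and column names are made up), this shows how partitioning the reports table by user id turns the monthly report into a single-partition read:

```python
# Hypothetical reports table and lookup; keyspace, table, and column names are made up.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # contact point is an assumption
session = cluster.connect("analytics")  # hypothetical keyspace, assumed to exist

# Partitioning by user_id means "the report for this user" is a single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS monthly_reports (
        user_id     uuid,
        month       text,      -- e.g. '2024-05'
        metric_name text,
        value       double,
        PRIMARY KEY ((user_id), month, metric_name)
    )
""")

# The "simple key-value search" from the answer: fetch one user's report rows.
active_user_id = uuid.uuid4()  # in practice, the id of the logged-in user
rows = session.execute(
    "SELECT month, metric_name, value FROM monthly_reports WHERE user_id = %s",
    (active_user_id,))
for row in rows:
    print(row.month, row.metric_name, row.value)

cluster.shutdown()
```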
So to answer your question: yes, it could be an alternative, but the choice depends on your data types and your application's business cases.
I have a reporting framework to build and generate reports (tabular-format reports). Until now I would write a SQL query and it would fetch data from Oracle. Now I have an interesting challenge where half of the data will come from Oracle and the remaining data will come from MongoDB, based on the output of the Oracle data. The tabular data fetched from Oracle will have one additional column containing the key to fetch the corresponding data from MongoDB. With this I will have two data sets in tabular format, one from Oracle and one from MongoDB. Based on one common column I need to merge both tables and produce a single data set for the report.
I can write logic in Java code to merge the two tables (say, data in 2D array format). But instead of doing this on my own, I am thinking of utilizing some RDBMS in-memory concept, for example the H2 database, where I can create two tables in memory on the fly and execute H2 queries to merge them. Or, I believe, there could be something in Oracle too, like a global temporary table. Could someone please suggest a better approach to join Oracle table data with a MongoDB collection?
I think you can try Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can create a Kafka broker and create a topic. Then change the existing services where you are saving to Oracle and MongoDB: create two Kafka producers (one for Oracle and another for Mongo) to write the data as streams to the Kafka topic. Then create a consumer group to receive the streams from Kafka. You can then aggregate the real-time streams using a Spark cluster (see the Spark Streaming API for Kafka) and save the results back to MongoDB (using the Spark Connector for MongoDB) or any other distributed database, and do your data visualizations/reporting on the results stored there.
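A rough sketch of that idea with Spark Structured Streaming follows. The topic names, message schemas, and join key are assumptions, and the MongoDB write is only indicated as a comment because the exact options depend on the connector version you run:

```python
# Rough sketch only; topic names, schemas, and the join key are assumptions.
# Requires the spark-sql-kafka package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("oracle-mongo-merge").getOrCreate()

oracle_schema = StructType([
    StructField("report_key", StringType()),
    StructField("oracle_value", DoubleType()),
])
mongo_schema = StructType([
    StructField("report_key", StringType()),
    StructField("mongo_value", DoubleType()),
])

def read_topic(topic, schema):
    # Each Kafka message value is assumed to be a JSON document.
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", topic)
           .load())
    return (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("j"))
               .select("j.*"))

oracle_rows = read_topic("oracle-rows", oracle_schema)
mongo_rows = read_topic("mongo-rows", mongo_schema)

# Inner join of the two streams on the shared key produced by the Oracle side.
joined = oracle_rows.join(mongo_rows, "report_key")

def write_batch(batch_df, batch_id):
    # Replace this with a write to MongoDB (or another store) using whatever
    # connector/version you have; show() is just for illustration.
    batch_df.show(truncate=False)

query = (joined.writeStream
         .foreachBatch(write_batch)
         .outputMode("append")
         .start())
query.awaitTermination()
```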
Another suggestion would be to use Apache Drill: https://drill.apache.org
You can use the MongoDB and JDBC storage plugins and then join Oracle tables and Mongo collections together.
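To illustrate, here is a hedged example of what such a join could look like when submitted through Drill's REST API; the storage plugin names (oracle, mongo), schema paths, and join column are assumptions that depend entirely on how you configure the plugins:

```python
# Hedged illustration; the plugin names, schema paths, and join column are assumptions
# and depend on how the Oracle (JDBC) and MongoDB storage plugins are configured.
import requests

drill_sql = """
    SELECT o.report_id, o.customer_name, m.payload
    FROM oracle.hr.report_rows o
    JOIN mongo.reportdb.report_docs m
      ON o.mongo_key = m.doc_key
"""

resp = requests.post(
    "http://localhost:8047/query.json",          # Drill's default web port
    json={"queryType": "SQL", "query": drill_sql},
    timeout=60)
resp.raise_for_status()

for row in resp.json().get("rows", []):
    print(row)
```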
I'm trying to figure out the difference (in terms of tools/services/programs) between a data warehouse, clustered data processing, and the tools/infrastructure for querying a data warehouse.
So let's say I have the following setup to perform some data processing for a certain use case:
A Hadoop cluster for distributed data processing
Hive for providing the infrastructure and functions for querying data from a data warehouse
My data sitting in an RDBMS or a NoSQL database
In the above example, what exactly is the data warehouse? My naive brain thinks that the RDBMS or the NoSQL database is the data warehouse in this context. But by definition, isn't a data warehouse a database used for reporting and data analysis? (Definition shamelessly stolen from Wikipedia.) So can I call a traditional RDBMS/NoSQL database a data warehouse? Thanks.
You cannot call every relational database system a data warehouse, since one of a data warehouse's main features is to aggregate data from multiple databases (with different schemas). This is usually done with a "star schema", which allows combining multiple dimensions and multiple granularities.
Because NoSQL database systems (graph-based or map-reduce-based) are schema-less, they can indeed store data from different schemas. Moreover, MapReduce can be used to aggregate data at different granularities (e.g. aggregating daily data to compare it with monthly data).
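As a toy illustration of that last point (rolling daily data up to a monthly granularity), here is a small PySpark sketch with made-up column names; the same rollup could equally be expressed as a MapReduce job or a Hive query:

```python
# Toy example of rolling daily data up to a monthly granularity; names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-to-monthly").getOrCreate()

daily = spark.createDataFrame(
    [("2024-05-01", "shop_a", 120.0),
     ("2024-05-02", "shop_a", 80.0),
     ("2024-06-01", "shop_a", 200.0)],
    ["day", "store", "revenue"])

monthly = (daily
           .withColumn("month", F.date_format(F.to_date("day"), "yyyy-MM"))
           .groupBy("store", "month")
           .agg(F.sum("revenue").alias("monthly_revenue")))

monthly.show()  # one row per store and month, ready to compare against the daily rows
```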