We need to replicate all tables from an Oracle database into HDFS (Hive), detecting changes from Oracle in real time.
We understand that Oracle GoldenGate is the right tool for this, but looking at these docs:
https://www.oracle.com/technetwork/middleware/goldengate/documentation/gg-java-adp-bigdata-hdfs-adp-2601138.pdf
it seems that the adapter for HDFS only stores the transaction log in HDFS as a list of operations done on the database, whereas we need the entire database replicated.
Is there a way to achieve that? We are still evaluating how to do it, so any suggestions are welcome.
GoldenGate is a "change data capture" replication solution. That means it only replicates the "changed data", i.e., the delta generated by DML against your tables.
If you want to dump the entire database, assuming you mean replicating each table and all the data in it, you should use an initial load Extract to generate the trail file, which turns each record into an insert operation, and then use the GG BigData HDFS adapter to dump that data to HDFS. However, that is only static data. Almost like an ETL tool.
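To make the distinction concrete, here is a minimal Python sketch (not GoldenGate code) of how a consumer might rebuild the current state of a table from what lands in HDFS: initial-load records arrive as inserts, and later changes arrive as insert/update/delete operations. The record layout `(op, key, row)` and the name `apply_operations` are assumptions made for illustration; the actual adapter output format depends on your configuration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record layout: (operation, primary key, row values or None).
Operation = Tuple[str, int, Optional[dict]]


def apply_operations(ops: Iterable[Operation]) -> Dict[int, dict]:
    """Replay insert/update/delete operations, in order, into a key -> row map."""
    table: Dict[int, dict] = {}
    for op, key, row in ops:
        if op == "I":
            table[key] = dict(row)
        elif op == "U":
            # Merge the changed columns into the existing row (or create it).
            table.setdefault(key, {}).update(row)
        elif op == "D":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation type: {op!r}")
    return table


# Initial load: every existing record shows up as an insert.
initial_load = [("I", 1, {"name": "alice"}), ("I", 2, {"name": "bob"})]
# Later change data captured from DML on the source table.
changes = [("U", 1, {"name": "alicia"}), ("D", 2, None), ("I", 3, {"name": "carol"})]

current_state = apply_operations(initial_load + changes)
print(current_state)
```

The point of the sketch is that the operation list alone is not the replicated table; something downstream has to replay it on top of the initial load.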
I'm new to Hive and have read about it online, but I still have doubts that are not cleared up.
For Hive external tables, Hive keeps the table's metadata within HDFS, but not in its warehouse, which is also in HDFS. Correct?
Whether it's an internal or external table, in both cases the table's data will be available in HDFS only and NOWHERE else. I mean, data can be taken from anywhere but has to be loaded into HDFS, because Hive uses Hadoop's processing engine to process data. Correct?
For an internal table, the table's metadata and data will both be available in Hive's data warehouse, and this data warehouse will be nowhere else but in HDFS. Correct?
For an external table, the table's metadata and data will both be NOT in Hive's data warehouse but elsewhere in HDFS. But Hive must be keeping some info about where the table's metadata and data are located in HDFS. Correct?
Can anyone share feedback on the above understanding?
Thanks
Hive uses a relational database such as MySQL, MariaDB, PostgreSQL, Oracle or Derby (for embedded deployment only) to store metadata (database and table definitions, statistics, grants, etc.). See deployment modes and database requirements. It does not matter whether the table is internal or external: the metadata is stored in the relational database.
Yes, the data is stored in HDFS, but Hive also supports integration with external databases using the JDBC storage handler. Such a table looks like a normal Hive table, but the data is stored in some database, your queries are executed in that database, predicate push-down works, and you can use native Hive tables together with storage handler tables in a single query. An HBase storage handler, a Kafka storage handler, etc. are also available, and you can write your own storage handler.
Depending on your Hive version/vendor, it is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS. Cloudera, though, prefers to keep managed tables in a dedicated HDFS location (see https://stackoverflow.com/a/67073849/2700344) and by default does not allow specifying a location for managed tables outside the warehouse root. Read about the difference between managed and external tables here.
Everything seems correct except the last one. When you create an external table, the table metadata is still stored in Hive; otherwise you could not query it through Hive. With an external table, HDFS itself keeps control of your data, while with an internal table Hive is responsible for it. Dropping an internal table drops both your data and metadata, but dropping an external table only drops the metadata from Hive; your data remains in your file system. That's why we change table types a lot as a workaround when one of our external connections is not compatible with our Hive version.
I am looking for a way to record DML and DDL changes made to specified Oracle schemas or tables dynamically, meaning that the schemas and tables being monitored can be changed at application run time.
In short, I want to build an Oracle database probe, not synchronize databases.
Updated
For example, I set a monitor on table test in database db. I want to retrieve all changes made to test, such as dropping/adding/modifying a column or inserting/updating/deleting records, and then analyze and send all changes to a blockchain (for example, "table test added a column field1"). That's why I want to get all executed SQL for the monitored tables.
I have read the Oracle docs about Data Guard and Streams.
The Data Guard doc says:
SQL Apply (logical standby databases only)
Reconstitutes SQL statements from the redo received from the primary database and executes the SQL statements against the logical standby database.
Logical standby databases can be opened in read/write mode, but the target tables being maintained by the logical standby database are opened in read-only mode for reporting purposes (providing the database guard was set appropriately). SQL Apply enables you to use the logical standby database for reporting activities, even while SQL statements are being applied.
The Streams doc says:
Oracle Streams provides two ways to capture database changes implicitly: capture processes and synchronous captures. A capture process can capture DML changes made to tables, schemas, or an entire database, and DDL changes. A synchronous capture can capture DML changes made to tables. Rules determine which changes are captured by a capture process or synchronous capture.
Before this, I had already tried getting the SQL changes by analyzing the redo log with Oracle LogMiner, and eventually got it working.
Oracle Streams seems to be the most appropriate way of achieving my purpose, but its implementation steps are too complicated and manual. There is in fact an open-source project for MySQL published by Alibaba, named canal: canal pretends to be a slave so that MySQL dumps its binlog and pushes it to the canal service, and canal then reconstitutes the original SQL from the binlog.
I think an Oracle standby database is like a MySQL slave, so the probe could be implemented in a similar way. So I want to go the Data Guard way, but I don't want to analyze the redo log myself, since that needs root privileges to shut down the database and enable some functions, whereas in production I only have a read-only user. I want to use a logical standby database, but the problem is that I can't see how to get the reconstituted SQL statements described above.
So, can any experts make some suggestions?
Anyway, thanks a lot.
I have this environment:
Hadoop environment (1 master, 4 slaves) with several applications: ambari, hue, hive, sqoop, hdfs ...
Server in production (separate from Hadoop) with a MySQL database.
My goal is:
Optimize the queries made on this mysql server that are slow to
execute today.
What did I do:
I imported the mysql data to HDFS using Sqoop.
My doubts:
I can not make selects direct in HDFS using Hive?
Do I have to load the data into Hive and make the queries?
If new data is entered into the mysql database, what is the best way
to get this data and insert it into HDFS and then insert it into
Hive again? (Maybe in real time)
Thank you in advance
I can not make selects direct in HDFS using Hive?
You can. Create an external table in Hive, specifying your HDFS location. Then you can run any HQL over it.
Do I have to load the data into Hive and make the queries?
In the case of an external table, you don't need to load the data into Hive; your data stays in the same HDFS directory.
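As a rough illustration of the external-table approach, the Python sketch below only assembles the HiveQL text; it does not connect to Hive. The table name, columns and HDFS path are hypothetical, and the comma delimiter assumes the Sqoop import wrote plain text files with its default field separator. You would then run the printed statement through whatever Hive client you use.

```python
from typing import List, Tuple


def external_table_ddl(table: str, columns: List[Tuple[str, str]],
                       location: str, delimiter: str = ",") -> str:
    """Build a CREATE EXTERNAL TABLE statement for delimited text files."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table and path; point LOCATION at the Sqoop target directory.
ddl = external_table_ddl(
    "customers",
    [("id", "INT"), ("name", "STRING")],
    "/user/hive/imports/customers",
)
print(ddl)
```

Because the table is external, the files stay where Sqoop put them; Hive only records the schema and the location.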
If new data is entered into the mysql database, what is the best way to get this data.
You can use Sqoop Incremental Import for this. It will fetch only newly added/updated data (depending upon incremental mode). You can create a sqoop job and schedule it as per your need.
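A minimal sketch of what such an incremental import command could look like, built as an argument list in Python so it can be handed to a scheduler or `subprocess`. The JDBC URL, table, check column and target directory are hypothetical, and credential options are left out on purpose. It assumes append mode with a monotonically increasing `id` column; for updated rows you would use `lastmodified` mode with a timestamp column instead.

```python
import shlex
from typing import List


def sqoop_incremental_args(jdbc_url: str, table: str, target_dir: str,
                           check_column: str, last_value: str) -> List[str]:
    """Assemble a `sqoop import` command that fetches only rows added since last_value."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", "append",       # only rows with check_column > last_value
        "--check-column", check_column,
        "--last-value", last_value,
    ]


# Hypothetical connection details, for illustration only.
args = sqoop_incremental_args(
    "jdbc:mysql://dbhost/shop", "orders", "/user/hive/imports/orders", "id", "0"
)
command = " ".join(shlex.quote(a) for a in args)
print(command)
```

If you wrap the import in a saved job (`sqoop job --create ...`), Sqoop keeps track of the last imported value between runs, so you do not have to pass `--last-value` yourself each time.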
You can try Impala, which is much faster than Hive for SQL queries. You need to define tables, most probably specifying a delimiter, a storage format and where the data is stored on HDFS (I don't know what kind of data you are storing). Then you can write SQL queries which will read the data from HDFS.
I have no experience with real-time data ingestion from relational databases; however, you can try scheduling Sqoop jobs with cron.
I recently did an integration between Hive and HBase. I created a Hive table with the HBase SerDe, and when I insert records into the Hive table they get loaded into the HBase table. I am trying to understand what happens if the insert into the Hive-HBase table fails partway through (HBase service failure / network issue). I assume the records that have already been loaded into HBase will stay there, and when I rerun the operation I will have two copies of the data with different timestamps (say the failure occurred after 10K of 20K records were inserted).
What is the best way to insert records into HBase?
Can Hive provide some kind of check to see whether the data is already there?
Is MapReduce the best shot for scenarios like these? I would write a MapReduce program that reads data from Hive and checks record by record in HBase before the insertion. This makes sure there are no duplicate writes.
Any help on this would be greatly appreciated.
Yes, you will have 2 versions of the data when you rerun the load operation. But that's OK, since the older version will get cleaned up on the next major compaction (assuming the column family is configured to keep only one version). As long as your inserts are idempotent (which they most likely are), you won't have a problem.
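To see why a rerun is harmless, here is a toy Python model of timestamped cell versions. It is only an in-memory stand-in, not the HBase client API; the class and method names are made up for illustration. The first load "fails" after two of three records, and the rerun writes all three again with a newer timestamp.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class VersionedStore:
    """Toy stand-in for an HBase table: each cell keeps timestamped versions."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], List[Tuple[int, str]]] = defaultdict(list)

    def put(self, row_key: str, column: str, value: str, timestamp: int) -> None:
        self._cells[(row_key, column)].append((timestamp, value))

    def get(self, row_key: str, column: str) -> str:
        """Return the value with the newest timestamp, like a default read."""
        versions = self._cells[(row_key, column)]
        return max(versions, key=lambda v: v[0])[1]

    def version_count(self, row_key: str, column: str) -> int:
        return len(self._cells[(row_key, column)])


records = [("r1", "a"), ("r2", "b"), ("r3", "c")]
store = VersionedStore()

# First attempt stops after two records (simulated failure).
for row_key, value in records[:2]:
    store.put(row_key, "cf:val", value, timestamp=1)

# Rerun writes everything again with a newer timestamp.
for row_key, value in records:
    store.put(row_key, "cf:val", value, timestamp=2)

latest = {row_key: store.get(row_key, "cf:val") for row_key, _ in records}
print(latest)
```

Rows written twice carry two versions, but a read still returns a single, identical value per row, which is what makes the reload idempotent from the reader's point of view.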
At Lithium+Klout, we use a custom-built HBaseSerDe which writes HFiles instead of using Puts to insert the data. So we generate the HFiles and use the bulk load tool to load all of the data after the job has completed. That's another way you can integrate Hive and HBase.
I am new to Hadoop and learning Hive.
In Hadoop: The Definitive Guide, 3rd edition, page 428, last paragraph,
I don't understand the paragraph below regarding external tables in Hive.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody briefly explain what the above passage says?
Usually the data in the initial dataset is not constructed in the optimal way for queries.
You may want to modify the data (e.g. modify some columns, add columns, compute aggregations) and store it in a specific way (partitioned / bucketed / sorted, etc.) so that the queries benefit from these optimizations.
The key difference between external and managed tables in Hive is that data in an external table is not managed by Hive.
When you create an external table you define an HDFS directory for that table, and Hive simply "looks" in it and can read data from it, but Hive does not own the files in that folder. When you drop an external table, Hive only deletes the metadata from its metastore, and the data in HDFS remains unchanged.
A managed table is basically a directory in HDFS that is created and managed by Hive. Moreover, all operations that remove or change partitions, raw data or the table itself MUST be done through Hive; otherwise the metadata in the Hive metastore may become incorrect (e.g. you manually delete a partition from HDFS but the Hive metastore still says the partition exists).
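The drop behaviour described above can be modelled in a few lines of Python. This is a toy model, not Hive's implementation: `ToyHive`, its `metastore` dict and its `hdfs` dict are invented names standing in for the metastore database and the file system.

```python
from typing import Dict, List


class ToyHive:
    """Toy model: a metastore of table entries plus a fake HDFS directory tree."""

    def __init__(self) -> None:
        self.metastore: Dict[str, dict] = {}   # table name -> location and type
        self.hdfs: Dict[str, List[str]] = {}   # directory -> file names

    def create_table(self, name: str, location: str, external: bool) -> None:
        self.metastore[name] = {"location": location, "external": external}
        self.hdfs.setdefault(location, [])

    def drop_table(self, name: str) -> None:
        entry = self.metastore.pop(name)        # metadata is always removed
        if not entry["external"]:
            # Only managed tables take their data directory with them.
            self.hdfs.pop(entry["location"], None)


hive = ToyHive()
hive.create_table("raw_events", "/data/raw_events", external=True)
hive.hdfs["/data/raw_events"].append("part-00000")
hive.create_table("events", "/warehouse/events", external=False)
hive.hdfs["/warehouse/events"].append("000000_0")

hive.drop_table("raw_events")
hive.drop_table("events")
print(sorted(hive.hdfs))
```

After both drops the metastore is empty, the external table's directory and file are still there, and the managed table's directory is gone.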
In Hadoop: The Definitive Guide, I think the author meant that it is common practice to write an MR job that produces some raw data and keeps it in some folder. Then you create a Hive external table that looks into that folder, and then you can safely run queries without the risk of losing the data if the table is dropped, etc.
In other words, you can run an MR job that produces some generic data and then use a Hive external table as the data source for inserts into managed tables. It helps you avoid writing boring, similar MR jobs by delegating this task to Hive queries: you write a query that takes data from the external table, aggregates/processes it however you want and puts the result into managed tables.
Another use of external tables is as a source for data from remote servers, e.g. in CSV format.
There is no reason to move a table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in the future, so it is better to check the current documentation. Currently these features are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
DROP deletes data for managed tables while it only deletes metadata for external ones
ACID/Transactional only works for managed tables
Query Results Caching only works for managed tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
Data structure has nothing to do with the external/managed table type. If you want to change the structure, you do not necessarily need to change the managed/external type of the table.
It is also mentioned in the book.
When your table is an external table,
you can use other technologies like Pig, Cascading or MapReduce to process it.
You can also use multiple schemas for that dataset,
and you can also create the data lazily if it is an external table.
When you decide that the dataset should be used only by Hive, make it a Hive managed table.