Sqoop unload from oracle table while the table is getting loaded - oracle

I am able to unload data from an Oracle database using Sqoop, but sometimes my job kicks off while the upstream load is still going on. I don't have a dependency set up, as the upstream jobs are outside my environment.
Sqoop query:
select * from PFSIEBEL.${TBL_NM} where trunc(last_upd) >= to_date('${ODATE}','YYYYMMDD')
This is the query I am using to pull delta records from the table.
I would like to know what Sqoop does while pulling data from the RDBMS.
What happens when a transaction is in progress on the records in the RDBMS?

Sqoop doesn't do dirty reads, meaning it doesn't import data that has not yet been committed and is still being modified.
The --relaxed-isolation option can be used to instruct Sqoop to use the read-uncommitted isolation level, but it is not supported for Oracle.
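For reference, a free-form query like this is normally passed to sqoop import through --query; Sqoop requires the literal $CONDITIONS token in the WHERE clause and a --split-by column (or a single mapper). A minimal sketch, assuming a placeholder JDBC URL, credentials, target directory and ROW_ID as the split column:
# Hypothetical connection details and split column; substitute your own.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --query "SELECT * FROM PFSIEBEL.${TBL_NM} WHERE TRUNC(last_upd) >= TO_DATE('${ODATE}','YYYYMMDD') AND \$CONDITIONS" \
  --split-by ROW_ID \
  --num-mappers 4 \
  --target-dir /data/staging/${TBL_NM}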

Related

Questions about Hive

I have this environment:
A Hadoop environment (1 master, 4 slaves) with several applications: Ambari, Hue, Hive, Sqoop, HDFS ... and a production server (separate from Hadoop) with a MySQL database.
My goal is:
To optimize the queries made on this MySQL server, which are slow to execute today.
What I did:
I imported the MySQL data to HDFS using Sqoop.
My doubts:
Can't I run selects directly on the data in HDFS using Hive?
Do I have to load the data into Hive and run the queries?
If new data is entered into the MySQL database, what is the best way to get this data, insert it into HDFS and then into Hive again? (Maybe in real time)
Thank you in advance
Can't I run selects directly on the data in HDFS using Hive?
You can. Create an external table in Hive specifying your HDFS location; then you can run any HQL over it.
Do I have to load the data into Hive and run the queries?
In the case of an external table, you don't need to load the data into Hive; the data stays in the same HDFS directory.
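A minimal sketch of such an external table, run through hive -e (the table name, columns, delimiter and HDFS path are made up for illustration; point LOCATION at the directory Sqoop wrote to):
# Expose Sqoop's HDFS output to Hive without moving or copying it.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS customers_ext (
  id      INT,
  name    STRING,
  updated TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/sqoop/customers';
"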
If new data is entered into the MySQL database, what is the best way to get this data?
You can use Sqoop incremental import for this. It will fetch only newly added/updated data (depending on the incremental mode). You can create a sqoop job and schedule it as per your needs.
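A rough sketch of such a saved incremental-import job (the JDBC URL, password file, table, check column, key and directories are placeholders):
# Create the job once; Sqoop stores and updates --last-value in its job metastore.
sqoop job --create customers_incr -- import \
  --connect jdbc:mysql://dbhost/sales \
  --username repl --password-file /user/etl/.mysql.pwd \
  --table customers \
  --incremental lastmodified \
  --check-column updated \
  --last-value '2017-01-01 00:00:00' \
  --merge-key id \
  --target-dir /user/sqoop/customers

# Each run then picks up only rows changed since the stored last value.
sqoop job --exec customers_incr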
You can try Impala, which is much faster than Hive for SQL queries. You need to define tables, most probably specifying a delimiter, the storage format and where the data is stored on HDFS (I don't know what kind of data you are storing). Then you can write SQL queries that read the data from HDFS.
I have no experience with real-time data ingestion from relational databases; however, you can try scheduling Sqoop jobs with cron, for example as shown below.
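If near real time is not required, the saved job sketched above can simply be rerun from cron, e.g. every 15 minutes (the job name, schedule and log path are arbitrary):
# crontab entry
*/15 * * * * /usr/bin/sqoop job --exec customers_incr >> /var/log/sqoop/customers_incr.log 2>&1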

How to transfer data & metadata from Hive to RDBMS

There are more than 300 tables in my hive environment.
I want to export all the tables from Hive to Oracle/MySql including metadata.
My Oracle database doesn't have any tables corresponding to these Hive tables.
Sqoop import from Oracle to Hive creates the table in Hive if it doesn't exist, but Sqoop export from Hive to Oracle doesn't create the table if it doesn't exist and fails with an exception.
Is there any option in Sqoop to export metadata also? or
Is there any other Hadoop tool through which I can achieve this?
Thanks in advance
The feature you're asking for isn't in Sqoop. Unfortunately, I don't know of a current Hadoop tool which can do what you're asking either. A potential workaround is using the "SHOW CREATE TABLE mytable" statement in Hive. It returns the CREATE TABLE statement, which you can parse manually or programmatically via awk, collect the statements in a file, adjust them, and then run that file against your Oracle DB. From there, you can use sqoop to populate the tables.
It won't be fun.
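As a rough illustration of that workaround (the file paths are arbitrary, and the dumped Hive DDL still has to be translated into Oracle types by hand or with your own awk/sed rules before it can be run against Oracle):
# Dump the CREATE TABLE statement of every Hive table into one file.
hive -e 'SHOW TABLES;' > /tmp/hive_tables.txt

while read -r tbl; do
  echo "-- ${tbl}"
  hive -e "SHOW CREATE TABLE ${tbl};"
done < /tmp/hive_tables.txt > /tmp/hive_ddl.sql

# Convert /tmp/hive_ddl.sql to Oracle DDL, create the tables, then use
# sqoop export to populate them table by table.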
Sqoop can't copy metadata or create a table in the RDBMS on the basis of a Hive table.
The table must already exist in the RDBMS to perform a sqoop export.
Why is it so?
Mapping from an RDBMS to Hive is easy because Hive has only a few datatypes (10-15), so mapping the many RDBMS datatypes to Hive datatypes is easily achievable. The reverse is not that easy: a typical RDBMS has hundreds of datatypes (and they differ from one RDBMS to another).
Also, sqoop export is a newly added feature; this capability may come in the future.

Can a single sqoop job be used for multiple tables and be running at the same time

I just started with hands-on Sqoop. I have a question: let's say I have 300 tables in a database and I want to perform an incremental load on those tables. I understand I can do incremental imports with either append mode or last-modified mode.
But do I have to create 300 jobs, if the only things in the job that vary are the table name, the CDC column and the last value/updated value?
Has anyone tried using the same job and passing these things as parameters, read from a text file in a loop, executing the same job for all the tables in parallel?
What is the industry standard and what are the recommendations?
Also, is there a way to truncate and reload the Hadoop tables that are very small, instead of performing CDC and merging the tables later?
There is import-all-tables ("Import tables from a database to HDFS"); however, it does not provide a way to change the CDC column for each table.
Also see sqoop import multiple tables.
There is no truncate as such, but the same can be achieved with the following:
--delete-target-dir "Delete the import target directory if it exists"
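One common pattern, sketched here under the assumption of a whitespace-separated parameter file and an Oracle source (connection details, password file, key column and directories are placeholders), is to drive a generic sqoop import from a loop, with the table name, CDC column and last value read from the file:
# tables.txt holds one line per table:  <table> <cdc_column> <last_value>
# e.g.:  ORDERS  LAST_UPD  2017-01-01
while read -r TBL CDC_COL LAST_VAL; do
  sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username scott --password-file /user/etl/.ora.pwd \
    --table "${TBL}" \
    --incremental lastmodified \
    --check-column "${CDC_COL}" \
    --last-value "${LAST_VAL}" \
    --merge-key ROW_ID \
    --target-dir "/data/stage/${TBL}" < /dev/null &   # run imports in parallel; throttle as needed
done < tables.txt
wait

# For very small tables, skip CDC and simply reload them in full:
#   sqoop import ... --table SMALL_TBL --delete-target-dir --target-dir /data/stage/SMALL_TBL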

Hive HBase Integration behavior in the event of a failure

I recently did an integration between Hive and HBase. I created a Hive table with the HBase SerDe, and when I insert records into the Hive table they get loaded into the HBase table. I am trying to understand what happens if the insert into the Hive-HBase table fails partway through (HBase service failure / network issue). I assume the records that have already been loaded into HBase will still be there, and when I rerun the operation I will have two copies of the data with different timestamps (assuming that out of 20K records, 10K were inserted before the failure occurred).
What is the best way to insert records into HBase?
Can Hive provide a check to see if the data is already there?
Is MapReduce the best option for scenarios like these? I could write a MapReduce program that reads data from Hive and checks record by record in HBase before inserting, which makes sure there are no duplicate writes.
Any help on this would be greatly appreciated.
Yes, you will have 2 versions of data when you rerun the load operation. But that's ok since the 2nd version will get cleaned up on the next compaction. As long as your inserts are idempotent (which they most likely are), you won't have a problem.
At Lithium+Klout, we use a custom-built HBaseSerDe which writes HFiles instead of using Puts to insert the data. We generate the HFiles and use the bulk load tool to load all of the data after the job has completed. That's another way you can integrate Hive and HBase.
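For context, a Hive-over-HBase table of the kind described in the question is typically declared with the HBase storage handler, roughly as below (the table, column family and column names are made up). Because Hive writes such rows as HBase Puts keyed by the row key, rerunning the load after a partial failure overwrites the same cells with newer versions rather than producing duplicate rows:
hive -e "
CREATE TABLE hbase_events (
  rowkey  STRING,
  payload STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:payload')
TBLPROPERTIES ('hbase.table.name' = 'events');
"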

Impala can't access all Hive tables

I'm trying to query HBase data through Hive (I'm using Cloudera). I created a few Hive external tables pointing to HBase, but Cloudera's Impala doesn't have access to all of those tables. All the Hive external tables appear in the metastore manager, but when I do a simple "show tables" in Impala, I see that 3 tables are missing.
Could it be a privileges problem? I see in the metastore manager that the 3 missing tables are readable by everybody, so...
Run the query 'invalidate metadata' in Impala and your tables will show up.
Though the INVALIDATE METADATA command in Impala works, it is documented to be expensive. In recent versions it is possible to invalidate the metadata of just one table, which has less impact:
INVALIDATE METADATA mynewtable
Alternatively, if you use HUE, there is also a less expensive option available, which may be convenient if you have added multiple new tables. Below is the explanation from HUE's '?' online help:
Missing some tables? In order to update the list of tables/metadata seen by Impala, execute one of these queries:
"invalidate metadata" invalidates the entire catalog metadata. All table metadata will be reloaded on the next access.
"invalidate metadata <table>" invalidates the metadata, load on the next access
"refresh <table>" refreshes the metadata immediately. It is a faster, incremental refresh.