How to sqoop a big table from an Oracle DB to HDFS?

One of my Oracle tables contains 265 million records. I need to push that table from the Oracle database to HDFS, but the table doesn't have any primary key or unique column, so I can't use multiple mappers: if I use multiple mappers, I have to specify a split-by column.
What's the best way to sqoop the table?
Any leads are appreciated.

In order to use multiple mappers, you will need a --split-by parameter. The best column to choose is one that is not null in all 265M rows and evenly distributed. A primary key meets those criteria because it is sequential and present in all rows.
Any column that is evenly distributed across the data set could be a good --split-by choice. The link #yammanuruarun posted includes the --boundary-query argument to help limit the work the RDBMS has to do to return those rows. I suggest stepping the number of mappers up along a Fibonacci sequence, i.e. -m 1, 2, 3, 5, 8.
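As a rough sketch (the connection string, table name, and the split column ACCOUNT_ID are placeholders, not taken from the question), such an import might look like:
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser --password-file /user/me/.oracle_pass \
  --table BIG_TABLE \
  --split-by ACCOUNT_ID \
  --boundary-query "SELECT MIN(account_id), MAX(account_id) FROM big_table" \
  --num-mappers 8 \
  --target-dir /data/big_table
The --boundary-query lets you supply your own (possibly cheaper) query for the split boundaries instead of Sqoop's default SELECT MIN(col), MAX(col) over the whole table.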
Also, check out:
How to find optimal number of mappers when running Sqoop import and export?

Related

Oracle table incremental import to HDFS

I have an Oracle table of 520 GB, and insert, update, and delete operations are performed on it frequently. The table is partitioned on an ID column, however there is no primary key defined and no timestamp column available.
Can you please let me know the best way to perform an incremental import of this table to HDFS?
This totally depends on what your "ID" column is. If it is generated by an ordered sequence, that's easy: just load the table with --incremental append --check-column ID.
If the ID column is generated by a NOORDER sequence, allow for some overlap and filter it on the Hadoop side.
If the ID is not unique, your only choice is a CDC tool: Oracle GoldenGate, Informatica PWX, and so on. There are no open-source/free solutions that I'm aware of.
Also, you don't need any index to perform an incremental load with Sqoop, but an index will definitely help, as its absence will lead to full scan(s) of the (possibly very big) source table.
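A minimal sketch of the straightforward (ordered sequence) case; the connection details, table name, and starting --last-value are placeholders:
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser --password-file /user/me/.oracle_pass \
  --table MY_BIG_TABLE \
  --incremental append \
  --check-column ID \
  --last-value 123456789 \
  --split-by ID \
  --num-mappers 4 \
  --target-dir /data/my_big_table_increments
For a NOORDER sequence you would lower --last-value by a safety margin and deduplicate the overlap on the Hadoop side afterwards.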
Your problem is not that hard to solve; just look for some key things in your DB.
1. Check whether your ID column is populated for every row (i.e. a filter like "id IS NOT NULL AND 1=1" still returns all rows); if so, you can use Sqoop for your task,
using the following Sqoop options:
--incremental append/lastmodified --check-column [column id]
--split-by [column id] // this is useful when there is no primary key and lets you run multiple mappers; without a primary key or split column you have to specify -m 1 for a single mapper only.
The preferred way is to do this task as a saved Sqoop job, created with sqoop job --create (see the sketch below).
For more information check https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_6
Hope this helps!
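A hedged sketch of the saved-job variant (the job name and arguments are illustrative, taking the same shape as the import sketch above):
sqoop job --create incr_my_big_table -- import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser --password-file /user/me/.oracle_pass \
  --table MY_BIG_TABLE \
  --incremental append --check-column ID --last-value 0 \
  --target-dir /data/my_big_table_increments

sqoop job --exec incr_my_big_table   # each run stores and reuses the new last-value automatically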

Can I directly use the partition columns of my source (Teradata) tables as the Hive partition columns?

Can I directly use the partition columns of my source (Teradata) tables as the Hive partition columns, or do I have to consider other parameters when deciding the Hive partitioning columns? Please help.
This is not best practice. If you create data in this manner, a person trying to access the HDFS data directly will not find the partition columns inside each partition's files. For example, say the Teradata table is partitioned by a date column; if the Hive table is also partitioned by date, then the files under an HDFS partition such as 2016-08-06 will not contain the date field. So, to make it easy for the end user, partition by a dummy column, say date_d, which holds exactly the same values as the date column.
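A small hedged sketch of that workaround (table and column names are made up): the partition column only lives in the directory name (.../sale_date=2016-08-06/), so a duplicate column keeps the value visible to anyone reading the files directly.
CREATE TABLE sales (
  txn_id      BIGINT,
  amount      DECIMAL(10,2),
  sale_date_d STRING   -- duplicate of the partition value, stored inside the files
)
PARTITIONED BY (sale_date STRING);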
Abstractly, partitioning in Teradata and Hive is similar. To begin with, you can probably use the same columns as in your source to partition the tables.
If your data size is huge within each single partition, then consider partitioning it further to improve performance. The multilevel partitioning would mostly depend on the number of filters you apply in your queries.

What is the purpose of --split-by <column> and --target-dir in Sqoop?

What happens internally when we write --split-by in sqoop?
Example:
sqoop import --connect jdbc:mysql://localhost/test --username root --password training123 --query 'select * from transaction where $CONDITIONS' --split-by Txnid --target-dir input/transaction
Hadoop MapReduce is all about divide and conquer.
In order to partition the data into multiple independent slices that will be transferred in parallel, Sqoop needs to find the minimum and maximum value of the column specified in the --split-by parameter.
When using the --split-by option, you should choose a column which contains values that are uniformly distributed.
In the query above, we are telling Sqoop that the data is evenly distributed on the column 'Txnid' and that it should use that column for making splits.
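A rough illustration of what Sqoop does under the hood, assuming 4 mappers and Txnid values spread evenly from 1 to 1,000,000 (the numbers and exact boundaries are made up):
SELECT MIN(Txnid), MAX(Txnid) FROM transaction   -- boundary query, run once
-- each mapper then receives one slice, substituted into $CONDITIONS:
-- mapper 1: select * from transaction where Txnid >= 1      AND Txnid < 250001
-- mapper 2: select * from transaction where Txnid >= 250001 AND Txnid < 500001
-- mapper 3: select * from transaction where Txnid >= 500001 AND Txnid < 750001
-- mapper 4: select * from transaction where Txnid >= 750001 AND Txnid <= 1000000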
--split-by: It is used to specify the column of the table used to generate splits for imports. That is, it specifies which column will be used to create the splits while importing the data into your cluster, and it can be used to enhance import performance by achieving greater parallelism. Sqoop creates splits based on the values in the column that the user specifies with --split-by in the import command. If it is not given, the primary key of the input table is used to create the splits.
Reason to use: sometimes the primary key doesn't have an even distribution of values between its min and max (which are used to create the splits when --split-by is not given). In such a situation you can specify some other column which has a proper distribution of data to create splits for efficient imports.
--split-by <column-name> - Column of the table used to split work units
Reference: Sqoop User Guide
It specifies which column will be used to create the split while importing the data into your cluster. It can be used to enhance the import performance by achieving greater parallelism.
Sqoop creates splits based on the values in a particular column of the table, which is specified by the user with --split-by in the import command. If it is not available, the primary key of the input table is used to create the splits. We can choose, through --split-by, the column which results in the best splitting, and thus increase parallelism and performance.
split-by in Sqoop is used to create input splits for the mappers. It is very useful for parallelism, as splitting the work across mappers makes the job run faster.

Import to HDFS or Hive (directly)?

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1
The source is an MS SQL database of around 1.6 TB with around 25 tables
The ultimate objective is to check whether the existing queries can run faster on HDP
There isn't the luxury of time and availability to import the data several times; hence, the import has to be done once, and the Hive tables, queries, etc. need to be experimented with afterwards. For example, first create a normal, partitioned table in ORC; if that doesn't suffice, try indexes, and so on. Possibly, we will also evaluate the Parquet format, and so on.
As a solution to the above point, I decided to first import the tables onto HDFS in Avro format, for example:
sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword --as-avrodatafile --num-mappers 8 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose
Now I plan to create a Hive table, but I have some questions, mentioned here.
My question is: given all the points above, what is the safest approach (in terms of time and NOT messing up HDFS, etc.): to first bring the data onto HDFS, create the Hive tables, and experiment, or to import directly into Hive? (I don't know whether, if I later delete these tables and wish to start afresh, I will have to re-import the data.)
For loading, you can try these options:
1) You can export the database to a CSV file that will be stored on your Linux file system as a backup, then do a distcp to HDFS.
2) As mentioned, you can do a Sqoop import and load the data into a Hive table (parent_table).
For checking the performance of different formats and of a partitioned table, you can use CTAS (Create Table As Select) queries, where you create new tables from the base table (parent_table). In CTAS you can specify a format like Parquet or Avro, and partitioning options are available as well.
Even if you delete the new tables created by CTAS, the base table will still be there.
Based on my experience, Parquet + partitioning will give the best performance, but it also depends on your data.
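A minimal CTAS sketch (table names are placeholders) for producing an ORC copy of the Avro-backed base table to compare query timings against:
CREATE TABLE tablename_orc
STORED AS ORC
AS SELECT * FROM parent_table;
Note that on older Hive releases (such as the one shipped with HDP 2.3), CTAS cannot create a partitioned table directly; a partitioned copy needs a separate CREATE TABLE ... PARTITIONED BY followed by an INSERT ... SELECT.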
I see that the connection and settings are all correct, but I didn't see --fetch-size in the query. By default, --fetch-size is 1000, which would take forever in your case. If the number of columns is small, I would recommend increasing --fetch-size to 10000. I have gone up to 50000 when the number of columns is less than 50, and maybe 20000 if you have 100 columns. I would recommend checking the size of the data per row and then deciding: if there is one column that holds more than 1 MB of data, then I would not recommend anything above 1000.
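For example, the import command from the question could simply add the flag (the value 10000 is illustrative; tune it to your row width):
sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword \
  --as-avrodatafile --num-mappers 8 --fetch-size 10000 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose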

Few Hive Interview Questions

I have some questions which I faced recently in an interview with a company. As I am a newbie in Hadoop, can anyone please tell me the right answers?
Questions:
Difference between "Sort By" and "Group by" in Hive. How they work?
If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
How to optimize Hive Performance?
Difference between "Internal Table" and "External Table"
What is the main difference between Hive and SQL
Please provide me a few useful resources, so that I can learn in a better way. Thanks
PFB the answers:
1. Difference between "Sort By" and "Group By" in Hive. How do they work?
Ans. SORT BY sorts the data per reducer; it provides ordering of the rows within a reducer. If there is more than one reducer, "sort by" may give partially ordered final results.
Whereas GROUP BY aggregates records by the specified columns, which allows you to apply aggregation functions (such as SUM, COUNT, AVG, etc.) to the non-grouped columns.
2. If we use "LIMIT 1" in any SQL query in Hive, will a reducer run or not?
Ans. I think the reducer will run, because as per the Hive documentation:
Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random.
SELECT * FROM t1 LIMIT 5
Since it has to pick rows at random, it needs the complete result output from the reducer.
3. How to optimize Hive performance?
Ans. These links should answer this
5 WAYS TO MAKE YOUR HIVE QUERIES RUN FASTER
5 Tips for efficient Hive queries with Hive Query Language
4. Difference between "Internal Table" and "External Table"
Ans. "Internal Table" also known as Managed Table, is the one that is managed by Hive. When you point data in HDFS to such table, the data is moved to Hive default location /ust/hive/warehouse/. And, then if such internal table is dropped, the data is deleted along with.
"External table" on the other hand is user managed, and data is not moved to hive default directory after loading i.e, any custom location can be specified. Consecutively, when you drop such table, no data is deleted, only table schema is dropped.
5. What is the main difference between Hive and SQL?
Ans. Hive is a data warehousing layer on top of Hadoop that provides an SQL-like table interface to users for analyzing the underlying data. It employs the HiveQL (HQL) language for this, which is loosely based on the SQL-92 standard.
SQL is a standard RDBMS language for accessing and manipulating databases.
I am new to Hadoop and Hive as well, so I can't give you a complete answer.
From what I've read in the book "Hadoop: The Definitive Guide", the key difference between Hive and SQL is that Hive (HiveQL) was created with MapReduce in mind. Hive's SQL dialect is supposed to make it easier for people to interact with Hadoop without needing to know a lot about Java (and SQL is well known by data professionals anyway).
As time has gone on, Hive has become more compliant with the SQL standard. It blends a mix of MySQL's and Oracle's SQL dialects with SQL-92.
The Main Difference
From what I've read, the biggest difference is that an RDBMS typically uses schema on write: data needs to conform to the schema when you load it into the database. Hive, by contrast, uses schema on read, because it doesn't verify the data when it is loaded.
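A small hedged illustration (file, table, and column names are hypothetical): Hive's LOAD DATA just moves the file into the table's directory without validating it, and malformed values only surface as NULLs when you query.
CREATE TABLE events (id INT, label STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/tmp/events.csv' INTO TABLE events;   -- no schema check happens here

SELECT * FROM events;   -- a row whose first field isn't numeric simply comes back with id = NULL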
Information obtained from Hadoop The Definitive Guide
Really good book and gives a good overview of all the technologies involved.
EDIT:
For external and internal tables, check out this response:
Difference between Hive internal tables and external tables?
Information regarding Sort By and Group By
Sort By:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
Difference between Sort By and Order By
(Taken from the link provided maybe this will help with the difference between Group By and Sort By)
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified.
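A tiny hedged illustration (table and columns are hypothetical) of the practical difference:
SELECT user_id, action_ts
FROM user_actions
SORT BY action_ts;    -- each reducer's output is sorted; global order across files is not guaranteed

SELECT user_id, action_ts
FROM user_actions
ORDER BY action_ts;   -- total order, typically funneled through a single reducer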
Group By:
GROUP BY is used for aggregation. It works pretty much the same way as in any other SQL dialect.
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
This query selects pv_users.gender and counts the distinct user IDs from the users table. In order to count the users of each gender, you first have to group together all the users who are of that gender. (Query taken from the GROUP BY link below.)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy
Information on Optimizing Hive Performance
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Optimizing Joins
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919/
General Hive Performance Tips
https://streever.atlassian.net/wiki/display/HADOOP/Hive+Performance+Tips
Some extra resources
SQL to Hive Cheat Sheet
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Hive LIMIT Documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
Best of luck in your interview!
From Hive 0.10.0, a simple SELECT statement such as SELECT column_name FROM table_name LIMIT n can avoid MapReduce entirely if the fetch task conversion setting hive.fetch.task.conversion=more is set.
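For example (table and column names are placeholders):
SET hive.fetch.task.conversion=more;
SELECT column_name FROM table_name LIMIT 5;   -- served by a simple fetch task, no MapReduce job is launched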
1. Difference between "Sort By" and "Group By" in Hive. How do they work?
SORT BY: It sorts the result within each reducer defined for the MapReduce job. The overall output is not necessarily in sorted order, but the output coming from each individual reducer is in order.
GROUP BY: It helps in aggregation of the data. sum(), count(), avg(), max(), min(), collect_list(), collect_set() are all used with GROUP BY. It is like clubbing the results based on a shared feature. Example: there is a state column and a population column, and we aggregate on the basis of state; then there would be 29 distinct values, each with its sum(population).
2. If we use "LIMIT 1" in any SQL query in Hive, will a reducer run or not?
select * from db.table limit 1: this statement never involves reducers; you can check by using the EXPLAIN statement.
select * from db.table order by column: this uses reducers, as does any query with an aggregation.
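A quick way to verify this yourself (db.table and column are placeholders):
EXPLAIN select * from db.table limit 1;           -- the plan shows only a fetch / map-side stage
EXPLAIN select * from db.table order by column;   -- the plan includes a reduce stage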
3. How to optimize Hive Performance?
Using the Tez execution engine
Using bucketing and partitioning
Using the ORC file format
Using vectorization
Using CBO (cost-based optimization)
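A hedged sketch of typical session-level settings behind those points (the values shown are common choices, not universal defaults; bucketing, partitioning, and ORC are table-design choices made in the DDL rather than settings):
SET hive.execution.engine=tez;               -- run on Tez instead of MapReduce
SET hive.vectorized.execution.enabled=true;  -- vectorized execution
SET hive.cbo.enable=true;                    -- cost-based optimizer
SET hive.compute.query.using.stats=true;     -- answer simple aggregates from table statistics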
4. Difference between "Internal Table" and "External Table"
Internal table: Both the metadata and the data are stored in Hive. If one deletes the table, the entire schema and the data are deleted automatically.
External table: Only the metadata is handled by Hive; the data is handled by the user. If one deletes the table, only the schema will be deleted; the data remains intact. To create an external table, one needs to use the EXTERNAL keyword in the CREATE statement and also needs to specify the location where the data is kept.
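A brief hedged sketch (the path and columns are made up):
CREATE EXTERNAL TABLE web_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/web_logs';

-- DROP TABLE web_logs;  -- removes only the metadata; the files under /data/raw/web_logs remain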
5. What is the main difference between Hive and SQL?
Hive is a data warehouse tool designed to process structured data on Hadoop, while SQL is used to process structured data in an RDBMS.
A reducer will not run if we use LIMIT in a simple SELECT clause:
select * from table_name limit 5;
