Perform incremental Sqoop on a table that contains joins? - hadoop

I have some very large tables that I am trying to Sqoop from a source-system data warehouse into HDFS, but I have limited bandwidth to do so. I would like to pull only the columns I need and minimize the run-time for getting the tables stood up.
The sqoop currently pulls something like this:
SELECT
ColumnA,
ColumnB,
....
ColumnN
FROM
TABLE_A
LEFT JOIN
TABLE_B
ON
...
LEFT JOIN
TABLE_N
....
Is it possible to perform an incremental Sqoop, given that the data is stored in a star-schema format and the dimensions could update independently of the facts?
Or, is the only solution to sqoop the entire table, for the columns that I need, incrementally, and perform the joins on the HDFS side?

For incremental imports you need to use the --incremental flag. Please refer to the link below for more info:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
You need to specify --incremental to tell Sqoop that you want an incremental load, --check-column to specify which column is used for incremental sqooping, and --last-value to say from which value you want to start sqooping the next load.
This is just half the picture; there are more ways to do this. For example, you can use the --query option, and your query would be like SELECT * FROM table WHERE column > 123. This is basically the same thing: you would need to record the last/max value for the selected column and use it for the next import.
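A minimal sketch of that second approach against the joined query in the question (the JDBC URL, join key, check column, and last value are all hypothetical; note that Sqoop free-form queries must contain the literal $CONDITIONS token and, for parallel imports, a --split-by column):

sqoop import \
  --connect jdbc:oracle:thin:@//dw-host:1521/DW \
  --username etl_user -P \
  --query "SELECT a.ColumnA, b.ColumnB FROM TABLE_A a LEFT JOIN TABLE_B b ON a.id = b.id WHERE a.row_version > 123 AND \$CONDITIONS" \
  --split-by a.id \
  --target-dir /data/staging/table_a

After each run, record max(row_version) from the imported data and substitute it for 123 in the next run. Because each dimension can change independently of the facts, it is usually safer to import each table incrementally on its own check column and re-do the joins on the HDFS side, as the question suspects.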


Does Sqoop preserve the order of imported rows as in the database?

I am sqooping a table from an Oracle database to AWS S3 and then creating a Hive table over it.
After importing the data, is the order of records in the database preserved in the Hive table?
I want to fetch a few hundred rows from the database as well as from Hive using Java JDBC, then compare each row present in the ResultSet. Assuming I don't have a primary key, can I compare the rows from both ResultSets as they appear (sequentially, using resultSet.next()), or does the order get changed due to the parallel import?
If order isn't preserved, is ORDER BY a good option?
Order is not preserved during import, and order is also not deterministic when selecting without ORDER BY or DISTRIBUTE BY + SORT BY, because of parallel select processing.
You need to specify ORDER BY when selecting the data; it does not matter how it was inserted.
ORDER BY orders all the data and will run on a single reducer; DISTRIBUTE BY + SORT BY orders per reducer and works in distributed mode.
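A quick sketch of the two options, using a hypothetical events table and key column:

-- total order: every row globally sorted, but forced through a single reducer
SELECT * FROM events ORDER BY event_id;

-- per-reducer order: rows with the same key land on the same reducer and each
-- reducer's output is sorted; runs fully in parallel
SELECT * FROM events DISTRIBUTE BY event_id SORT BY event_id;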
Also see this answer https://stackoverflow.com/a/40264715/2700344

Loading more records than actual in Hive

While inserting from one Hive table into another Hive table, it is loading more records than the actual record count. Can anyone help with this weird behaviour of Hive?
My query would be looking like this:
INSERT OVERWRITE TABLE table_a
SELECT col1, col2, col3, ... FROM table_b;
My table_b consists of 6405465 records.
After inserting from table_b into table_a, I found that the total record count in table_a is 6406565.
Can anyone please help here?
If hive.compute.query.using.stats=true, then the optimizer answers some queries (such as count(*)) from statistics instead of querying the table data. This is much faster because the metastore is a fast database such as MySQL and the query does not require map-reduce. But the statistics can be stale if the table was loaded by means other than INSERT OVERWRITE, or if the configuration parameter hive.stats.autogather, which is responsible for automatic statistics gathering, was set to false. Statistics will also be stale after loading files directly or after using third-party tools (for example, after a Sqoop load): those files were never analyzed, so the statistics in the metastore do not reflect how the data changed. It is therefore good practice to gather statistics for the table or partition after such loads using ANALYZE TABLE ... COMPUTE STATISTICS.
If it is impossible to gather statistics automatically (which works for INSERT OVERWRITE) or by running an ANALYZE statement, it is better to switch off the hive.compute.query.using.stats parameter; Hive will then query the data instead of using statistics.
See this for reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive
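A short sketch of both fixes, reusing the table name from the question:

-- option 1: refresh the statistics after any load that bypassed INSERT OVERWRITE
ANALYZE TABLE table_a COMPUTE STATISTICS;

-- option 2: make the count read the data instead of (possibly stale) statistics
SET hive.compute.query.using.stats=false;
SELECT count(*) FROM table_a;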

Set ORC file name

I'm currently implementing an ETL (Talend) of monitoring data into HDFS and a Hive table.
I am now facing concerns about duplicates. More specifically, if we need to run one ETL job twice with the same input, we will end up with duplicates in our Hive table.
The solution to that in an RDBMS would have been to store the input file name and to "DELETE WHERE file_name = ..." before sending the data. But Hive is not an RDBMS and does not support deletes.
I would like to have your advice on how to handle this. I envisage two solutions:
Currently, the ETL is putting CSV files into HDFS, which are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that with this operation I'm losing the file name, and the ORC file is named 00000. Is it possible to specify the file name of this created ORC file? If yes, I would be able to search the data by its file name and delete it before launching the ETL.
I'm not familiar with Hive's ACID capability (a feature of Hive 0.14+). Would you recommend enabling ACID with Hive? Will I be able to "DELETE WHERE" with it?
Feel free to propose any other solution to this.
Best,
Orlando
If the data volume in the target table is not too large, I would advise:
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
 (SELECT 1
  FROM trg x
  WHERE x.key = src.key
    AND <<additional filter on target to reduce data volume>>
 )
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys in the target table into a Java HashMap and filtering the source rows on the fly. As long as the HashMap can fit in the RAM available for the mappers' heap size (check your default conf files, and increase it with a set command in the Hive script if necessary), the performance will be sub-optimal, but you can be pretty sure that you will not have any duplicates.
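If the hash table outgrows the default mapper heap, these are the usual knobs to raise it (the values here are hypothetical; size them to your cluster):

SET mapreduce.map.memory.mb=4096;        -- YARN container size for each mapper
SET mapreduce.map.java.opts=-Xmx3686m;   -- JVM heap inside that container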
And in your actual use case you don't have to check each key, but only a "batch ID", or more precisely the original file name; the way I did it in my previous job was:
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
 (SELECT DISTINCT 1
  FROM trg x
  WHERE x.original_file_name = src.INPUT__FILE__NAME
    AND <<additional filter on target to reduce data volume>>
 )
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters, so the overhead would stay low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would do this automatically at execution time, but Hive does not (not yet), so you have to force it. Note also that the "1" is just a dummy value required by "SELECT" semantics; again, a mature DBMS would allow a dummy "null", but some versions of Hive would crash (e.g. with Tez in V0.14), so "1" or "'A'" are safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering myself. I found a solution:
I partitioned my table with (date, input_file_name) (note: I can get the input file name with SELECT INPUT__FILE__NAME in Hive).
Once I did this, before running the ETL I can send Hive an ALTER TABLE ... DROP IF EXISTS PARTITION (file_name=...), so that the folder containing the input data is deleted if this input file has already been sent to the ORC table.
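A minimal sketch of that approach, with hypothetical table and column names (note that INPUT__FILE__NAME returns a full URI, so in practice you may want to strip the directory part first):

-- target ORC table partitioned by load date and source file name
CREATE TABLE monitoring_orc (metric STRING, value DOUBLE)
PARTITIONED BY (load_date STRING, file_name STRING)
STORED AS ORC;

-- before re-running the ETL for a given file, drop its partition if present
ALTER TABLE monitoring_orc
DROP IF EXISTS PARTITION (load_date='2017-01-31', file_name='metrics_20170131.csv');

-- reload from the CSV staging table, deriving the partition values dynamically
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE monitoring_orc PARTITION (load_date, file_name)
SELECT metric, value, load_date, INPUT__FILE__NAME
FROM monitoring_csv;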
Thank you everyone for your help.
Cheers,
Orlando

A Few Hive Interview Questions

I have some questions which I faced recently in an interview with a company. As I am a newbie to Hadoop, can anyone please tell me the right answers?
Questions:
Difference between "Sort By" and "Group by" in Hive. How they work?
If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
How to optimize Hive Performance?
Difference between "Internal Table" and "External Table"
What is the main difference between Hive and SQL
Please provide me few useful resources, so that I can learn in the better way. Thanks
Please find below the answers:
1. Difference between "Sort By" and "Group By" in Hive. How do they work?
Ans. SORT BY sorts the data per reducer; it provides ordering of the rows within each reducer. If there is more than one reducer, "sort by" may give a partially ordered final result.
GROUP BY, on the other hand, aggregates records by the specified columns, which allows you to apply aggregation functions (such as SUM, COUNT, AVG, etc.) to the non-grouped columns.
2. If we use "LIMIT 1" in any SQL query in Hive, will the reducer work or not?
Ans. I think the reducer will work, because as per the Hive documentation:
Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random.
SELECT * FROM t1 LIMIT 5
Since the rows are picked at random, it needs the complete result output from the reducer.
3. How to optimize Hive performance?
Ans. These links should answer this:
5 WAYS TO MAKE YOUR HIVE QUERIES RUN FASTER
5 Tips for efficient Hive queries with Hive Query Language
4. Difference between "Internal Table" and "External Table"?
Ans. An "internal table", also known as a managed table, is one that is managed by Hive. When you load data in HDFS into such a table, the data is moved to Hive's default location, /user/hive/warehouse/. If such an internal table is then dropped, the data is deleted along with it.
An "external table", on the other hand, is user-managed, and the data is not moved to the Hive default directory after loading, i.e. any custom location can be specified. Consequently, when you drop such a table, no data is deleted; only the table schema is dropped.
5. What is the main difference between Hive and SQL?
Ans. Hive is a data-warehousing layer on top of Hadoop that provides an SQL-like row/table interface to users for analyzing the underlying data. It employs the HiveQL (HQL) language for this, which is loosely based on the SQL-92 standard.
SQL is a standard RDBMS language for accessing and manipulating databases.
I am new to Hadoop and Hive as well, so I can't give you a complete answer.
From what I've read in the book "Hadoop: The Definitive Guide", the key difference between Hive and SQL is that Hive (HiveQL) was created with MapReduce in mind. Hive's SQL dialect is supposed to make it easier for people to interact with Hadoop without needing to know a lot of Java (and SQL is well known by data professionals anyway).
As time has gone on, Hive has become more compliant with the SQL standard. It blends a mix of MySQL's and Oracle's SQL dialects with SQL-92.
The Main Difference
From what I've read, the biggest difference is that RDBMSs typically use schema on write, meaning the data needs to conform to the schema when you load it into the database. Hive uses schema on read, because it doesn't verify the data when it is loaded.
Information obtained from Hadoop: The Definitive Guide.
It is a really good book and gives a good overview of all the technologies involved.
EDIT:
For external and internal tables, check out this response:
Difference between Hive internal tables and external tables?
Information regarding Sort By and Group By
Sort By:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
Difference between Sort By and Order By
(Taken from the link provided; maybe this will help with the difference between Group By and Sort By.)
Hive supports SORT BY, which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output, while the latter only guarantees ordering of the rows within a reducer. If there is more than one reducer, "sort by" may give a partially ordered final result.
Note: it may be confusing as to the difference between SORT BY alone on a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field, while SORT BY alone, if there are multiple reducers, partitions randomly in order to distribute the data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified.
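A short illustration with a hypothetical sales table:

-- each reducer's output is sorted by amount, but rows are spread randomly
SELECT name, amount FROM sales SORT BY amount DESC;

-- CLUSTER BY name is shorthand for DISTRIBUTE BY name SORT BY name
SELECT name, amount FROM sales CLUSTER BY name;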
Group By:
Group By is done using aggregation. It works pretty much the same as it does in any other SQL dialect.
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
This query selects pv_users.gender and counts the distinct user ids from the users table. In order to count the users of a given gender, you first have to group together all the users of that gender. (Query taken from the group-by link below.)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy
Information on Optimizing Hive Performance
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Optimizing Joins
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919/
General Hive Performance Tips
https://streever.atlassian.net/wiki/display/HADOOP/Hive+Performance+Tips
Some extra resources
SQL to Hive Cheat Sheet
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Hive LIMIT Documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
Best of luck in your interview!
From Hive 0.10.0, a simple select statement such as SELECT column_name FROM table_name LIMIT n can avoid MapReduce entirely if fetch task conversion is enabled by setting hive.fetch.task.conversion=more.
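For instance (table and column names hypothetical):

SET hive.fetch.task.conversion=more;
SELECT column_name FROM table_name LIMIT 5;   -- runs as a plain fetch task; no MapReduce job is launched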
1. Difference between "Sort By" and "Group By" in Hive. How do they work?
SORT BY: it sorts the result within each reducer defined for the MapReduce job. The overall output is not necessarily in sorted order, but the output coming from each reducer is in order.
GROUP BY: it helps in aggregating the data. sum(), count(), avg(), max(), min(), collect_list(), and collect_set() all use GROUP BY. It is like clubbing the results based on shared features. Example: if there is a state column and a population column and we aggregate on the basis of state, there would be 29 distinct state values, each with its sum(population), as in the query below.
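The state/population example as an actual query (the census table name is hypothetical):

SELECT state, sum(population) AS total_population
FROM census
GROUP BY state;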
2. If we use "LIMIT 1" in any SQL query in Hive, will the reducer work or not?
select * from db.table limit 1: this statement never involves reducers; you can verify that with an EXPLAIN statement.
select * from db.table order by column: this uses reducers, as does any query with an aggregation.
3. How to optimize Hive performance?
Using the Tez execution engine
Using bucketing and partitioning
Using the ORC file format
Using vectorization
Using the cost-based optimizer (CBO)
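A sketch of how these options are switched on in a Hive session (the values and the DDL are illustrative, not a tuned configuration):

SET hive.execution.engine=tez;               -- run on Tez instead of MapReduce
SET hive.vectorized.execution.enabled=true;  -- process rows in batches
SET hive.cbo.enable=true;                    -- enable the cost-based optimizer

-- ORC storage combined with partitioning and bucketing (hypothetical table)
CREATE TABLE sales_orc (id BIGINT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC;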
4. Difference between "Internal Table" and "External Table"
Internal table: both the metadata and the data are managed by Hive. If one deletes the table, the entire schema and the data are automatically deleted.
External table: only the metadata is handled by Hive; the data is handled by the user. If one deletes the table, only the schema is deleted and the data remains intact. To create an external table, one needs to use the EXTERNAL keyword in the CREATE statement and also specify the location where the data is kept.
5. What is the main difference between Hive and SQL?
Hive is a data warehouse tool designed to process structured data on Hadoop, while SQL is used to process structured data on an RDBMS.
The reducer will not run if we use LIMIT in the select clause:
select * from table_name limit 5;
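One way to check this yourself (table name hypothetical): EXPLAIN prints the query plan, and a fetch-only plan contains no reduce stage:

EXPLAIN SELECT * FROM table_name LIMIT 5;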

How can I know all the columns in an HBase table?

In the HBase shell I use describe 'table_name', but only the column families are returned. How can I get to know all the columns in each column family?
As @zsxwing said, you need to scan all the rows, since in HBase each row can have a completely different schema (that's part of the power of Hadoop: the ability to store poly-structured data). You can look at the HFile file structure and see that HBase doesn't track the columns.
Thus the column families and their settings are in fact the schema of the HBase table, and that's what you get when you describe it.
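A minimal way to see which columns actually occur is to scan from the HBase shell (this samples a few rows; a complete answer requires scanning every row):

scan 'table_name', {LIMIT => 10}   # each cell prints as column=family:qualifier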
