I am sqooping a table from oracle database to AWS S3 & then creating a hive table over it.
After importing the data, is the order of records present in database preserved in hive table?
I want to fetch few hundred rows from database as well as hive using java JDBC then compare each row present in ResultSet. Assuming I don't have a primary key, can I compare the rows from both ResultSets as they appear(sequentially, using resultSet.next()) or does the order gets changed due to parallel import?
If order isn't preserved whether ORDER BY is a good option?
Order is not preserved during import, also order is not determined when selecting without ORDER BY or DISTRIBUTE+SORT due to parallel select processing.
You need to specify order by when selecting data, does not matter how it was inserted.
ORDER BY orders all data, will work on single reducer, DISTRIBUTE BY + SORT orders per reducer and works in distributed mode.
Also see this answer https://stackoverflow.com/a/40264715/2700344
Related
We are stuck with a problem where-in we are trying to do a near real time sync between a RDBMS(Source) and hive (Target). Basically the source is pushing the changes (inserts, updates and deletes) into HDFS as avro files. These are loaded into external tables (with avro schema), into the Hive. There is also a base table in ORC, which has all the records that came in before the Source pushed in the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the Source).
We are now performing a de-dupe using rank() over partitioned keys on the union of external table and base table. And then the result is then pushed into a new table, swap the names. This is taking a lot of time.
We tried using merges, acid transactions, but rank over partition and then filtering out all the rows has given us the best possible time at this moment.
Is there a better way of doing this? Any suggestions on improving the process altogether? We are having quite a few tables, so we do not have any partitions or buckets at this moment.
You can try with storing all the transactional data into Hbase table.
Storing data into Hbase table using Primary key of RDBMS table as Row Key:-
Once you pull all the data from RDBMS with NiFi processors(executesql,Querydatabasetable..etc) we are going to have output from the processors in Avro format.
You can use ConvertAvroToJson processor and then use SplitJson Processor to split each record from array of json records.
Store all the records in Hbase table having Rowkey as the Primary key in the RDBMS table.
As when we get incremental load based on Last Modified Date field we are going to have updated records and newly added records from the RDBMS table.
If we got update for the existing rowkey then Hbase will overwrite the existing data for that record, for newly added records Hbase will add them as a new record in the table.
Then by using Hive-Hbase integration you can get the Hbase table data exposed using Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
By using this method we are going to have Hbase table that will take care of all the upsert operations and we cannot expect same performance from hive-hbase table vs native hive table will perform faster,as hbase tables are not meant for sql kind of queries, hbase table is most efficient if you are accessing data based on Rowkey,
if we are going to have millions of records then we need to do some tuning to the hive queries
Tuning Hive Queries That Uses Underlying HBase Table
I have some very large tables that I am trying to sqoop from a Source System Data Warehouse into HDFS, but limited bandwidth to do so. I would like to only pull the columns I need, and minimize the run-time for getting the tables stood up.
The sqoop currently pulls something like this:
SELECT
ColumnA,
ColumnB,
....
ColumnN
FROM
TABLE_A
LEFT JOIN
TABLE_B
ON
...
LEFT JOIN
TABLE_N
....
Is It possible to perform an incremental sqoop, given that the data is stored in a star-schema format, and the dimensions could update independently of the facts?
Or, is the only solution to sqoop the entire table, for the columns that I need, incrementally, and perform the joins on the HDFS side?
For incremental imports you need to use --incremental flag. Please refer to below link for more info :-
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
you need to specify —incremental to tell sqoop that you want an incremental load —check-column to specify which column is used for incremental sqooping and —last-value to say from which value you want to start sqooping the next load.
This is just half the picture. There are more ways to do this.for eg. you can use —query option and your query would be like Select * from table where column > 123. This is basically the same thing. You would need to record the last/max value for the selected column and use it for next import.
Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables.
Or to deal with redundancy option if the values are already available in Hive Tables?
If your data has a unique identifier and you are running incremental imports you can specify it on the -mergeKey value of the import. This will merge the values that where already on the table with the newest one. The newer will override the oldest.
If you are not running incremental imports you can use sqoop merge to unify data.
From sqoop docs :
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
The important is that you do have a single unique primary key for each record. Otherwise you might generate one when importing the data. To do so you could generate the import with the --query and generate the new column with the unique key on the select of the data concatenating existing columns until you get a unique combination.
--query "SELECT CONVERT(VARCHAR(128), [colum1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey ,* FROM [dbo].[tableName] WHERE \$CONDITIONS" \
There is no direct option from sqoop that will provide the solution that you are looking for. You will have to set up EDW kind of process to achieve your goal:
import data in staging table(hive - create staging database for this purpose) - this should be copy of target table, but data type may vary as per your transformations requirements.
load data from staging database table(hive) to target database table(hive) by doing transformations. in your case:
Insert into table trgt.table
select * from stg.table stg_tbl
where stg_tbl.col1 not in (select col1 from trgt.table);
here trgt is target database, stg is staging database - both are in hive.
I wanna know how hive partitioning works I know the concept but I am trying to understand how its working and store the in exact partition.
Let say I have a table and I have created partition on year its dynamic, ingested data from 2013 so how hive create partition and store the exact data in exact partition.
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned(eg. by year) data are stored separately in different directories. Each directory is corresponding to one year.
For a non-partitioned table, when you want to fetch the data of year=2010, hive have to scan the whole table to find out the 2010-records. If the table is partitioned, hive just go to the year=2010 directory. More faster and IO efficient
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
Using partition, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of hash function of some column of a table.
Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time.
I have some questions which I faced recently in the interview with a company. As I am a newbie in Hadoop, can anyone please tell me the right answers?
Questions:
Difference between "Sort By" and "Group by" in Hive. How they work?
If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
How to optimize Hive Performance?
Difference between "Internal Table" and "External Table"
What is the main difference between Hive and SQL
Please provide me few useful resources, so that I can learn in the better way. Thanks
PFB the answers:
1. Difference between "Sort By" and "Group by" in Hive. How they work?
Ans. SORT BY sorts the data per reducer, it provides ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
Whereas GROUP BY aggregate records by the specified columns which allows you to perform aggregation functions on non-grouped columns (such as SUM, COUNT, AVG, etc).
2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
Ans. I think Reducer will work, because as per Hive documentation --
Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random.
SELECT * FROM t1 LIMIT 5
Having to randomly pick, it has to have complete result output from Reducer.
- How to optimize Hive Performance?
Ans. These links should answer this
5 WAYS TO MAKE YOUR HIVE QUERIES RUN FASTER
5 Tips for efficient Hive queries with Hive Query Language
- Difference between "Internal Table" and "External Table"
Ans. "Internal Table" also known as Managed Table, is the one that is managed by Hive. When you point data in HDFS to such table, the data is moved to Hive default location /ust/hive/warehouse/. And, then if such internal table is dropped, the data is deleted along with.
"External table" on the other hand is user managed, and data is not moved to hive default directory after loading i.e, any custom location can be specified. Consecutively, when you drop such table, no data is deleted, only table schema is dropped.
- What is the main difference between Hive and SQL
Ans. Hive is a Datawarehousing layer on top of hadoop that provides SQL like row table interface to users for analyzing underlying data. It employs HiveQL (HQL) language for this which is loosely based on SQL-92 standards.
SQL is a standard RDBMS language for accessing and manipulating databases.
I am new to Hadoop and Hive as well so I can't give you a complete answer.
From what I've read in the book "Hadoop The Definitive Guide" the key difference between Hive and SQL is that Hive (HiveQL) was created with MapReduce in mind. Hive's SQL dialect is supposed to make it easier for people to interact with Hadoop without needing to know a lot about Java (and SQL is well known by data professionals anyway).
As time has went on, Hive has become more compliant to the SQL standard. It blends a mix of MySQL and Oracle's SQL dialects with SQL-92.
The Main Difference
From what I've read, the biggest difference is that RDBMS have schema's that are typically schema on write. This means that data needs to conform to the schema when you load it in the database. In Hive, it uses schema on read because it doesn't verify the data when it is loaded.
Information obtained from Hadoop The Definitive Guide
Really good book and gives a good overview of all the technologies involved.
EDIT:
For external and internal tables, check out this response:
Difference between Hive internal tables and external tables?
Information regarding Sort By and Group By
Sort By:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
Difference between Sort By and Order By
(Taken from the link provided maybe this will help with the difference between Group By and Sort By)
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified.
Group By:
Group By is done using aggregation. It is pretty much done the same as you would normally in any other SQL dialect.
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
This query selects pv_users.gender and counts the distinct user_ids from the users table. In order to do count the users in a gender, you would first have to group all the users who are a certain gender together. (Query taken from the group by link below)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy
Information on Optimizing Hive Performance
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Optimizing Joins
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919/
General Hive Performance Tips
https://streever.atlassian.net/wiki/display/HADOOP/Hive+Performance+Tips
Some extra resources
SQL to Hive Cheat Sheet
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Hive LIMIT Documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
Best of luck in your interview!
From Hive 0.10.0 the simple select statement, such as select column_name from table name LIMIT n,can avoid map reduce if task conversation hive.fetch.task.conversion=more is set
1. Difference between "Sort By" and "Group by" in Hive. How they work?
SORT BY : It sorts the result within each reducers defined for the Map reduce job. It's not necessary that the output would be in a sorted order but the output coming from each reducer would be in order. Check example below! I ran it in 11 node cluster.
GROUP BY : It helps in aggregation of the data. sum() , count() , avg() , max() , min() , collect_list() , collect_set() all uses group by. It's like clubbing the result based on same features. Example : There is a state column and population column and we are aggregating on the basis of states , then there would be 29 distinct values with sum(population).
2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
select * from db.table limit 1 : statement never includes reducers , you can check by using explain statement.
select * from db.table order by column : uses reducers or whenever there is an aggregation. Check below screenshot.
3. How to optimize Hive Performance?
Using Tez session
Using bucketing and Partitioning
Using Orc file format
Using vectorisation
Using CBO
4. Difference between "Internal Table" and "External Table"
Internal table : Both metadata and data stored in the hive. If one deletes the table, automatically entire schema and data would be deleted.
External table : Only metadata is handled by hive. Data is handled by user. If one deletes the table , only schema will be deleted, data remains intact. For creation of external table , one needs to use external keyword in create statement and also needs to specify the location where data is put.
5. What is the main difference between Hive and SQL
Hive is a data warehouse tool designed to process structured data on hadoop while SQL is used process structured data on RDBMS.
Reducer will not run if we use limit in select clause.
select * from table_name limit 5;