Does Impala make effective use of Buckets in a Hive Bucketed table? - hadoop

I'm in the process of improving the performance of a table.
Say this table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
STORED AS PARQUET;
I'm planning to apply bucketing by user_id, as the queries usually filter on user_id.
Like this:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
CLUSTERED BY(user_id) INTO 256 BUCKETS
STORED AS PARQUET;
This table will be created and loaded with Hive, and queried from Impala...
What I wanted to know is whether bucketing this table will improve the performance of Impala queries - I'm not sure how Impala works with buckets.

I tried creating a bucketed and a non-bucketed version of the table (about 6 GB in size) through Hive.
I benchmarked the results from both; there is little to no difference.
I also analyzed the profiles of both queries, which didn't show much difference either.
So the answer is: Impala doesn't know whether a table is bucketed or not, so it doesn't take advantage of it (IMPALA-1990). The only way it becomes aware of the partitions and files in the table is through COMPUTE STATS.
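For example, a minimal sketch run from impala-shell after loading the table through Hive (both statements are standard Impala SQL; the table name matches the example above):
COMPUTE STATS user_info_bucketed;
SHOW TABLE STATS user_info_bucketed;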
By the way, bucketing the tables used by Impala is not wasteful:
if we have to limit the number of small files in the table, we can bucket it and switch on Hive transactions (available from Hive 0.13.0).
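For example, a rough sketch of such a table (the name user_info_txn is made up; ORC storage and the transactional table property are what Hive's ACID support requires):
CREATE TABLE user_info_txn(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A transactional, bucketed copy of user_info'
PARTITIONED BY(year INT, month INT)
CLUSTERED BY(user_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES('transactional'='true');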

Related

Rules to be followed before creating a Hive partitioned table

As part of my requirement, I have to create a new Hive table and insert into it programmatically. To do that, I have the following DDL to create a Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName String,
ssn String,
hiveCount String,
sapCount String,
countDifference String,
percentDifference String,
sap_UpdTms String,
hive_UpdTms String)
COMMENT 'This table contains record count of corresponding tables of all the source systems present on Hive & SAP'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '';
Inserting data into a partition of a Hive table is something I can handle with an insert query from the program. In the DDL above I haven't added a "PARTITIONED BY" column yet, as I am not totally clear on the rules for partitioning a Hive table. A couple of rules I know are:
When inserting data from a query, the partition column should be the last one.
The PARTITIONED BY column shouldn't be an existing column in the table.
Could anyone let me know if there are any other rules for partitioning a Hive table?
Also, in my case we run the program twice a day to insert data into the table, and every time it runs there could be 8k to 10k records. I am thinking of adding a PARTITIONED BY column for the current date (just "mm/dd/yyyy") and inserting it from the code.
Is there a better way to implement the partition idea for my requirement, if adding a date (String format) is not recommended?
What you mentioned is fine, but I would recommend the yyyyMMdd format because it sorts better and is less ambiguous than seeing 03/05 and not knowing which is the day and which is the month.
If you want to run it twice a day, and you care about the time the job runs, then do PARTITIONED BY (dt STRING, hour STRING)
Also, don't use STORED AS TEXTFILE. Use Parquet or ORC instead.
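A minimal sketch of both suggestions applied to the table above (dt and hour are the new partition columns; the staging table staging_countData is made up):
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName STRING,
ssn STRING,
hiveCount STRING,
sapCount STRING,
countDifference STRING,
percentDifference STRING,
sap_UpdTms STRING,
hive_UpdTms STRING)
PARTITIONED BY (dt STRING, hour STRING)
STORED AS ORC;

-- partition columns go last in the SELECT when inserting dynamically
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE countData PARTITION (dt, hour)
SELECT tableName, ssn, hiveCount, sapCount, countDifference,
percentDifference, sap_UpdTms, hive_UpdTms, dt, hour
FROM staging_countData;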

Bucketing not optimized in Hive

I have a table bucketed on the column flightnum (10 buckets); the data size is approximately 700 MB and bucketing is enforced as well.
When I execute the query:
select count(flightnum) from flight_buck where flightnum=10;
I get the response in approximately 46 s. The total number of mappers was 27.
When executing the same query on a non-bucketed table with the same data:
select count(flightnum) from flight_temp where flightnum=10;
I get the response in approximately 47 s. The total number of mappers used was 30.
Why am I getting the response in the same amount of time?
Bucketing helps joins run faster; to speed up a simple SELECT you have to use partitioned tables.
Try partitioning the table by flightnum and running the selects again, as in the sketch below.
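A minimal sketch of that suggestion (origin and dest stand in for the real flight columns, which aren't shown in the question):
CREATE TABLE flight_part (origin STRING, dest STRING)
PARTITIONED BY (flightnum INT)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE flight_part PARTITION (flightnum)
SELECT origin, dest, flightnum FROM flight_temp;

-- partition pruning now limits the scan to the flightnum=10 directory
SELECT count(*) FROM flight_part WHERE flightnum = 10;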
Why does this happen?
Let's create a bucketed but not partitioned table like this:
create table `t1b`(
`exchange` string,
`stock_symbol` string,
`date` string,
`stock_price_open` float,
`stock_price_high` float,
`stock_price_low` float,
`stock_price_close` float,
`stock_volume` int,
`stock_price_adj_close` float)
clustered by ( `stock_symbol` ) sorted by ( `date` ) into 306 buckets;
And let's fill it with data... There are as many reducers as there are buckets, because each reducer processes only records with the same key and stores the data in its own file, sorted the way you asked for, in this case by date.
Let's look at HDFS...
Please note what we got: 306 files (buckets)...
and inside each of them there are records which share the same clustering key...
But all the files sit in the same folder, and when SELECTing with Hive there is no way to tell which files hold the value we are looking for, so bucketing without partitioning does not speed up a SELECT: there is no information about where the data we want lives.
What does bucketing do, then? When you are JOINing data, a whole bucket can be loaded into RAM, and we get a fast join in the MAP phase instead of a slow join in the REDUCE phase.
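Where bucketing does pay off is in joins on the clustering key. A minimal sketch, assuming a second table t2b bucketed the same way on stock_symbol (the table name is made up):
SET hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(b) */ a.`stock_symbol`, a.`date`, a.`stock_price_close`, b.`stock_volume`
FROM t1b a
JOIN t2b b ON a.`stock_symbol` = b.`stock_symbol`;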

Hive - dynamic partition insert fails with less than one hundred partitions

I am trying to insert records in Hive from one table (not partitioned) into another using dynamic partitioning. I've set these Hive properties as suggested in a few other questions:
hive.exec.dynamic.partition=True
hive.exec.dynamic.partition.pernode=5000
hive.exec.dynamic.partition.pernode=2048
hive.exec.dynamic.partition.mode=nonstrict
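(As a side note, in stock Hive the partition limits are governed by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode, the latter defaulting to 100; a minimal sketch of the usual session settings:)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=5000;
SET hive.exec.max.dynamic.partitions.pernode=2048;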
Here you can find the table definitions together with the insert query I am running:
Non-partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS dataretention.non_partitioned(
recordType STRING,
potentialDuplicate STRING,
...
partDate STRING,
partHour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/tmp/temp-tables/csv/not-partitioned'
Partitioned table
CREATE TABLE IF NOT EXISTS dataretention.partitioned(
recordType STRING,
potentialDuplicate STRING,
...)
PARTITIONED BY (partDate STRING, partHour STRING) STORED AS ORC LOCATION '/tmp/temp-tables/partitioned'
Insert query
INSERT INTO dataretention.partitioned PARTITION(partDate, partHour) SELECT recordType, ... partDate, partHour FROM dataretention.non_partitioned;
The file is 1000 records long and the tables have 158 fields. I tested this first on the Hortonworks HDP 2.4 sandbox. Since I suspected a resource problem, I moved to a 4-machine m4.xlarge (4 cores, 16 GB RAM each) cluster on AWS. I was not able to go over 9 different partitions on the sandbox, and 99 different ones on the cluster. I get a failed vertex from Tez four times, and after that it quits the job.
Any help or suggestion is really appreciated. Thank you.

How Hive Partition works

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it actually works and stores the data in the exact partition.
Let's say I have a table and I have created a dynamic partition on year, and I ingested data from 2013. How does Hive create the partition and store the data in the exact partition?
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned (e.g. by year), the data is stored separately in different directories, each directory corresponding to one year.
For a non-partitioned table, when you want to fetch the data of year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just goes to the year=2010 directory. Much faster and more IO-efficient.
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
Using partition, it is easy to query a portion of the data.
Tables or partitions can be sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
Suppose you need to retrieve the details of all employees who joined in 2012. Without partitioning, a query searches the whole table for the required information. However, if you partition the employee data by year and store each year separately, the query processing time is reduced.
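A minimal sketch of that example (the employee table and its columns are made up):
CREATE TABLE employee (emp_id BIGINT, name STRING)
PARTITIONED BY (joining_year INT)
STORED AS ORC;

-- each partition gets its own directory under the warehouse, e.g.
--   .../employee/joining_year=2011/
--   .../employee/joining_year=2012/

-- this filter touches only the joining_year=2012 directory
SELECT * FROM employee WHERE joining_year = 2012;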

Can I insert data multiple times into a bucketed Hive table?

I have a bucketed Hive table. It has 4 buckets.
CREATE TABLE user(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
CLUSTERED BY(user_id) INTO 4 BUCKETS;
Initially I inserted some records into this table using the following query.
set hive.enforce.bucketing = true;
insert into user
select * from second_user;
After this operation, in HDFS I see that 4 files have been created under this table's directory.
Then I needed to insert another set of data into the user table, so I ran the query below.
set hive.enforce.bucketing = true;
insert into user
select * from third_user;
Now another 4 files have been created under the user table's directory, so it has 8 files in total.
Is it fine to do this kind of multiple insert into a bucketed table?
Does it affect the bucketing of the table?
I figured it out!!
Actually, if you do multiple inserts on a bucketed Hive table, Hive won't complain as such.
All Hive queries will work fine.
Having said that, such an operation spoils the bucketing concept of the table: after multiple inserts into a bucketed table, sampling fails.
TABLESAMPLE doesn't work properly after multiple inserts.
Even the sort-merge bucket map join stops working after such an operation.
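For example, this is the kind of bucket sampling that breaks (standard Hive TABLESAMPLE syntax, applied to the user table from the question):
SELECT * FROM user TABLESAMPLE(BUCKET 1 OUT OF 4 ON user_id);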
I don't think that should be an issue, because you have declared that you want bucketing on user_id, so every time you insert it will create 4 more files.
Bucketing is used for faster query processing, so if it is making 4 more files every time, it will be making your query processing even faster.
