Load Hive partitioned table into a Spark DataFrame

I am using Spark version 1.4.1. I am trying to load a partitioned Hive table into a DataFrame, where the Hive table is partitioned by the year_week number; in one scenario I might have 104 partitions.
But I can see that the DataFrame is being loaded with the data spread across 200 partitions, and I understand that this is due to spark.sql.shuffle.partitions being set to 200 by default.
I would like to know if there is a good way to load my Hive table into a DataFrame with 104 partitions, making sure that the DataFrame is partitioned by the year_week number at load time.
The reason for my expectation is that I will be doing a few joins with huge-volume tables, all of which are partitioned by the year_week number. So having the DataFrame partitioned by the year_week number and loaded accordingly would save me a lot of time by avoiding re-partitioning them by the year_week number.
Please let me know if you have any suggestions.
Thanks.

Use hiveContext.sql("SELECT * FROM tableName WHERE pt = '2012.07.28.10'"),
where pt is the partition key, which in your case will be year_week,
together with the corresponding value for it.
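For example, a minimal sketch of the statement to pass to hiveContext.sql, assuming an illustrative table my_table partitioned by year_week:

-- Prune at load time to just the partition you need; Hive reads only that directory.
SELECT * FROM my_table WHERE year_week = '2015_32'

-- Or pull a range of year_week partitions in one statement.
SELECT * FROM my_table WHERE year_week BETWEEN '2015_01' AND '2015_52'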

Related

HBase to Hive mapped table is not showing complete data

We have an HBase table with 1 column family that holds 1.5 billion records.
The HBase row count was retrieved using the command
count '<tablename>', {CACHE => 1000000}
and the HBase-to-Hive mapping was done with the command below.
create external table stagingdata(
rowkey String,
col1 String,
col2 String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key,n:col1,n:col2'
)
TBLPROPERTIES('hbase.table.name' = 'hbase_staging_data');
But when we retrieve the Hive row count using the command below,
select count(*) from stagingdata;
it shows only 140 million rows in the Hive-mapped table.
We tried a similar approach on a smaller HBase table with 100 million records, and the complete set of records showed up in the Hive-mapped table.
My question is: why are the complete 1.5 billion records not showing up in Hive?
Are we missing anything here?
Your immediate answer would be highly appreciated.
Thanks,
Madhu.
What you see in Hive is the latest version per key, not all the versions of the key. As the Hive HBase Integration documentation puts it:
there is currently no way to access the HBase timestamp attribute, and
queries always access data with the latest timestamp.

Bucketing not optimized in Hive

I have a table bucketed on the column flightnum (10 buckets); the data size is approximately 700 MB, and bucketing is enforced as well.
When I execute the query:
select count(flightnum) from flight_buck where flightnum=10;
I get the response in approximately 46 s; the total number of mappers was 27.
When I execute the same query on a non-bucketed table with the same data:
select count(flightnum) from flight_temp where flightnum=10;
I get the response in approximately 47 s; the total number of mappers used was 30.
Why am I getting the response in the same amount of time?
Bucketing helps joins run faster; to speed up a simple SELECT you have to use partitioned tables.
Try partitioning the table by flightnum and run the selects again, as in the sketch below.
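A minimal sketch of that change, assuming flight_temp also has the illustrative columns carrier, origin, and dest alongside flightnum:

-- Partitioned variant: each flightnum value becomes its own HDFS directory.
CREATE TABLE flight_part (
  carrier STRING,
  origin STRING,
  dest STRING
)
PARTITIONED BY (flightnum INT);

-- Load from the existing table with dynamic partitioning.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE flight_part PARTITION (flightnum)
SELECT carrier, origin, dest, flightnum FROM flight_temp;

-- The predicate now prunes to a single directory instead of scanning all files.
select count(flightnum) from flight_part where flightnum=10;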
Why does this happen?
Let's create a bucketed but not partitioned table like this:
create table `t1b`(
`exchange` string,
`stock_symbol` string,
`date` string,
`stock_price_open` float,
`stock_price_high` float,
`stock_price_low` float,
`stock_price_close` float,
`stock_volume` int,
`stock_price_adj_close` float)
clustered by ( `stock_symbol` ) sorted by ( `date` ) into 306 buckets;
And let's fill it with data (a hedged sketch follows). There are as many reducers as buckets, because each reducer processes only the records with the same keys and stores the data in its own file using the sort order you chose, in this case by date.
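A hedged sketch of the load, assuming an illustrative unbucketed source table stocks with the same columns:

-- Make Hive plan one reducer per bucket when writing.
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE t1b
SELECT * FROM stocks;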
Let's look at HDFS...
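From the Hive CLI, a listing like the following shows the bucket files (the default warehouse path is an assumption):

dfs -ls /user/hive/warehouse/t1b;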
Please note what we got: 306 files (buckets),
and inside each of them there are records which share the same clustering key.
But all the files live in the same folder, and when SELECTing with Hive there is no way to know which files hold the value we are looking for. Bucketing without partitioning therefore does not speed up a SELECT, because there is no information about where the data we are looking for lives.
So what does bucketing do? When you are JOINing data, a whole bucket can be loaded into RAM, and we get a fast join in MAP instead of a slow join in REDUCE; a sketch follows.
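For instance, a hedged sketch of a bucket map join, assuming a second illustrative table t2b clustered by stock_symbol into a compatible bucket count:

-- Let each mapper load only the matching bucket of the smaller table into memory.
SET hive.optimize.bucketmapjoin=true;
SELECT /*+ MAPJOIN(b) */ a.stock_symbol, a.stock_price_close, b.stock_volume
FROM t1b a
JOIN t2b b ON a.stock_symbol = b.stock_symbol;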

Hive - dynamic partition insert fails with fewer than one hundred partitions

I am trying to insert records in Hive from one table (not partitioned) into another using dynamic partitioning. I've set these Hive properties, as suggested in a few other questions:
hive.exec.dynamic.partition=true
hive.exec.max.dynamic.partitions=5000
hive.exec.max.dynamic.partitions.pernode=2048
hive.exec.dynamic.partition.mode=nonstrict
Here you can find the table definitions together with the insert query I am running:
Non-partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS dataretention.non_partitioned(
recordType STRING,
potentialDuplicate STRING,
...
partDate STRING,
partHour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/tmp/temp-tables/csv/not-partitioned'
Partitioned table
CREATE TABLE IF NOT EXISTS dataretention.partitioned(
recordType STRING,
potentialDuplicate STRING,
...)
PARTITIONED BY (partDate STRING, partHour STRING) STORED AS ORC LOCATION '/tmp/temp-tables/partitioned'
Insert query
INSERT INTO dataretention.partitioned PARTITION(partDate, partHour) SELECT recordType, ... partDate, partHour FROM dataretention.non_partitioned;
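For reference, these settings take effect per session, so a minimal sketch is to issue them in the same session right before the insert (the hive.exec.max.* names are the ones Hive recognizes):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Raise the caps; the defaults are 1000 partitions in total and 100 per node.
SET hive.exec.max.dynamic.partitions=5000;
SET hive.exec.max.dynamic.partitions.pernode=2048;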
The file is 1,000 records long and the tables have 158 fields. I tested this first using the Hortonworks HDP 2.4 sandbox. Since I suspected a resource problem, I moved to a 4-machine m4.xlarge (4 cores, 16 GB RAM each) cluster on AWS. I was not able to go over 9 distinct partitions on the sandbox, or 99 distinct ones on the cluster. I get a vertex failed error from Tez four times, and after that it quits the job.
Any help or suggestion is really appreciated. Thank you.

Can I directly consider the Hive partition columns similar to the partition columns present in source (Teradata) tables?

Can I directly consider the Hive partition columns to be similar to the partition columns present in my source (Teradata) tables, or do I have to consider any other parameters when deciding on the Hive partitioning columns? Please help.
This is not best practice. If you create data in this manner, then a person trying to access the HDFS data directly will not find the partition columns in each partition. For example, say the Teradata table is partitioned by a date column; if the Hive table is also partitioned by date, then an HDFS partition, say 2016-08-06, will not have the date field in its files. So, to make it easy for the end user, partition by a dummy column, say date_d, which holds exactly the same values as the date column; a sketch follows.
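A hedged sketch of that workaround (table and column names, including the staging source sales_stage, are illustrative):

-- Keep the real date in the data files and partition by a copy of it.
CREATE TABLE sales (
  txn_id STRING,
  sale_date STRING
)
PARTITIONED BY (date_d STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (date_d)
SELECT txn_id, sale_date, sale_date AS date_d FROM sales_stage;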
Abstractly, partitioning in Teradata and Hive is similar. To begin with, you can probably use the same columns as in your source to partition the tables.
If your data size is huge within each single partition, then consider partitioning it further to improve performance. The multilevel partitioning will mostly depend on the number of filters you apply in your queries; see the sketch below.
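For example, a sketch of such multilevel partitioning (all names are illustrative):

CREATE TABLE events (
  event_id STRING,
  payload STRING
)
PARTITIONED BY (load_date STRING, region STRING);

-- A query filtering on both partition columns prunes to a single leaf directory.
SELECT count(*) FROM events WHERE load_date = '2016-08-06' AND region = 'EU';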

Hive index rebuild too slow compared with PostgreSQL

I am trying to compare the same functionality on my PostgreSQL data warehouse and a newly created Hive data warehouse, on the same box with the same data and the same table structure. I am trying to understand Hive's benefits, but... despite the fact that the data load into PostgreSQL runs 3 times slower, index creation/rebuild on PostgreSQL is 20 times faster, and the index doesn't need to be rebuilt every time like in Hive.
My question is: what am I missing in the Hive configuration?
My setup is:
CREATE TABLE mytable
(
aa int,
bb string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/spaces/hadoop/hadoopfs';
LOAD DATA LOCAL INPATH '/data/Informix94/spaces/postgres/myfile_big' OVERWRITE INTO TABLE mytable;
CREATE INDEX mytable_indx ON TABLE mytable(aa) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD LOCATION '/data/spaces/hadoop/hadoopfs';
set hive.optimize.autoindex=true;
set hive.optimize.index.filter=true;
alter index mytable_indx ON mytable rebuild;
My box is a VM with 3 GB of RAM, with PostgreSQL running on it and taking ~1 GB of RAM; it serves as the metadata store. I am using the most recent stable versions of CentOS, Hadoop, and Hive, and didn't change the Hive default settings except for the metastore location and disabling statistics.
The result:
the index rebuild takes 4,798 seconds on 260,000,000 rows, or 80 seconds on 5,000,000 rows.
Hive only works well when your data no longer fits on a single machine, so the results you are seeing are expected. Once you've collected terabytes or petabytes of data, you'll be much happier with Hive. In the use case you describe, PostgreSQL would be a much better match.
