I have a table bucketed on column flightnum (10 buckets); the data size is approximately 700 MB and bucketing enforcement is enabled.
When I execute the query:
select count(flightnum) from flight_buck where flightnum=10;
I get the response in approximately 46 s, with 27 mappers in total.
When executing the same query on a non-bucketed table with the same data:
select count(flightnum) from flight_temp where flightnum=10;
I get the response in approximately 47 s, with 30 mappers in total.
Why am I getting the response in the same amount of time?
Bucketing helps joins run faster; to speed up a simple SELECT you have to use partitioned tables.
Try partitioning the table by flightnum and run the selects again.
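A minimal sketch of that suggestion, reusing the question's flight_temp table as the source; the carrier column is only a placeholder for whatever other columns the table actually has:
create table flight_part (
  carrier string  -- placeholder for the remaining flight columns
)
partitioned by (flightnum int);

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert overwrite table flight_part partition (flightnum)
select carrier, flightnum
from flight_temp;

-- with partitioning, the filter only scans the flightnum=10 partition directory
select count(flightnum) from flight_part where flightnum=10;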
Why does this happen?
Let's create a bucketed, non-partitioned table like this:
create table `t1b`(
`exchange` string,
`stock_symbol` string,
`date` string,
`stock_price_open` float,
`stock_price_high` float,
`stock_price_low` float,
`stock_price_close` float,
`stock_volume` int,
`stock_price_adj_close` float)
clustered by ( `stock_symbol` ) sorted by ( `date` ) into 306 buckets;
And let's fill it with data. There are as many reducers as there are buckets, because each reducer processes only the records with the same clustering key and stores them into its own file, sorted the way you asked for, in this case by date.
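For example (a sketch; t1b_source is a hypothetical staging table, and hive.enforce.bucketing is the pre-Hive-2.0 switch that makes Hive start one reducer per bucket):
set hive.enforce.bucketing = true;

insert overwrite table t1b
select `exchange`, `stock_symbol`, `date`,
       `stock_price_open`, `stock_price_high`, `stock_price_low`,
       `stock_price_close`, `stock_volume`, `stock_price_adj_close`
from t1b_source;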
Let's look at HDFS. Note what we got: 306 files (buckets), and inside each of them are records that share the same clustering key.
But all the files sit in the same folder, and when SELECTing with Hive there is no way to tell which files hold the value we are looking for. So bucketing without partitioning does not speed up a SELECT, because there is no information about where the data we need lives.
What does bucketing do? When you JOIN data, a whole bucket can be loaded into RAM, giving a fast map-side join instead of a slow reduce-side join.
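A sketch of the kind of join that benefits, assuming a second table t2b bucketed into 306 buckets on the same stock_symbol key (the second table and the selected columns are illustrative; hive.optimize.bucketmapjoin is the relevant switch):
set hive.optimize.bucketmapjoin = true;

select a.`stock_symbol`, a.`date`, a.`stock_price_close`, b.`stock_price_close`
from t1b a
join t2b b
  on a.`stock_symbol` = b.`stock_symbol`;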
Related
I have a dataset like this: a person, identified by an ID, uses some object, identified by another ID, for some amount of time. I want to know the top 20 items most heavily used by each person. The amount of data is very large, over 100 million rows, and each person ID can have about 200 objects they may use.
So the first thing I did was create a projection table, clustered and sorted the way the data will be consumed, so that everything for a person sits in one place on a node and the mapper will find the data locally.
CREATE TABLE person_objectid_dwell (person string, objectid string, sum_dwell bigint)
CLUSTERED BY (person) SORTED BY (sum_dwell desc, objectid asc) INTO 100 BUCKETS STORED AS ORC;
Once that was done, I inserted the data from the feeder table like this:
insert into person_objectid_dwell select person, objectid, sum_dwell from person_objectid_dwell distribute by person sort by sum_dwell desc, objectid asc;
And then queried it using windowing, with a table creation:
create table person_top20_objectsdwell as
select * from (
  select person, objectid, sum_dwell,
         rank() over (partition by person order by sum_dwell desc) as rank
  from person_objectid_dwell
) t
where rank < 21;
The problem is that I am not getting the performance I think I should get; I have set the number of reducers and so on. The program runs with 3000+ mappers and 1000+ reducers, and the map phase never finishes.
How do you efficiently design a Hive/Impala table considering the following facts?
The table receives about 100 million rows of tool data every day. The date on which the data is received is stored in a column of the table, along with the tool id.
Each tool produces about 500 runs per day, identified by a run id column. Each run id covers approximately 1 MB of data.
The default block size is 64 MB.
The table can be searched by date, tool id and run id, in that order.
If you are doing analytics on this data, then a solid choice with Impala is the Parquet format. What has worked well for our users is to partition the data by year, month and day, based on a date value in the record.
So, for example:
CREATE TABLE foo (tool_id int, eff_dt timestamp)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET;
When loading the data into this table we use something like this to create dynamic partitions:
INSERT INTO foo partition (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;
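If the load runs through Hive rather than Impala, dynamic partitioning usually has to be enabled first; a minimal sketch of the standard settings (whether they are needed depends on your configuration):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;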
Then you train your users: if they want the best performance, they should add YEAR, MONTH and DAY to their WHERE clause so the query hits the partitions. Have them also include eff_dt in the SELECT statement so they get a date value in the format they like to see in their final results.
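For example, a query against the foo table above that prunes down to a single day's partition (the date values are illustrative):
SELECT tool_id, eff_dt
FROM foo
WHERE year = 2016 AND month = 7 AND day = 28;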
In CDH, Parquet stores data in 256 MB chunks by default (this is configurable). Here is how to configure it: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html
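For instance, in Impala the chunk size used by subsequent INSERTs can be changed per session with the PARQUET_FILE_SIZE query option (the value shown is an illustrative 128 MB, given in bytes):
SET PARQUET_FILE_SIZE=134217728;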
I am using Spark version 1.4.1. I am trying to load a partitioned Hive table into a DataFrame, where the Hive table is partitioned by a year_week number; in one scenario I might have 104 partitions.
But I can see that the DataFrame is being loaded into 200 partitions, and I understand this is due to spark.sql.shuffle.partitions being set to 200 by default.
I would like to know if there is a good way to load my Hive table into a Spark DataFrame with 104 partitions, making sure that the DataFrame is partitioned by the year_week number during the load itself.
The reason for my expectation is that I will be doing a few joins with huge tables, all of which are partitioned by the year_week number. So having the DataFrame partitioned by year_week and loaded accordingly will save me a lot of time re-partitioning them by year_week.
Please let me know if you have any suggestions.
Thanks.
Use hiveContext.sql("SELECT * FROM tableName WHERE pt = '2012.07.28.10'"),
where pt is the partition key; in your case it will be year_week
and the corresponding value.
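Applied to the question's table, the same pattern would look like this (the table name and the year_week value are illustrative):
hiveContext.sql("SELECT * FROM tableName WHERE year_week = 201607")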
I have a large table in Hive with 1.5 billion+ rows. One of the columns is category_id, which has ~20 distinct values. I want to sample the table so that I have 1 million rows for each category.
I checked out "Random sample table with Hive, but including matching rows" and "Hive: Creating smaller table from big table", and I figured out how to get a random sample from the entire table, but I'm still unable to figure out how to get a sample for each category_id.
I understand you want to sample your table into multiple files. You might want to look at Hive bucketing or dynamic partitions to balance your records across multiple folders/files.
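A minimal sketch of one direct way to take a fixed-size random sample per category in HiveQL, assuming the table is called big_table (the table name is a placeholder; category_id comes from the question):
SELECT *
FROM (
  SELECT x.*,
         row_number() OVER (PARTITION BY category_id ORDER BY r) AS rn
  FROM (SELECT t.*, rand() AS r FROM big_table t) x
) ranked
WHERE rn <= 1000000;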
I'm in the process of improving the performance of a table.
Say I have this table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
STORED AS PARQUET;
I'm planning to apply bucketing by user_id, as the queries usually filter on user_id,
like this:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
CLUSTERED BY(user_id) INTO 256 BUCKETS
STORED AS PARQUET;
This table will be created and loaded with Hive, and queried from Impala.
What I wanted to know is whether bucketing this table will improve the performance of Impala queries; I'm not sure how Impala works with buckets.
I tried creating a bucketed and a non-bucketed table through Hive (each about 6 GB in size)
and benchmarked the results from both. There is little to no difference.
I also analyzed the profiles of both queries, which didn't show much difference.
So the answer is: Impala doesn't know whether a table is bucketed or not, so it doesn't take advantage of it (IMPALA-1990). The only way it becomes aware of the partitions and files in the table is with COMPUTE STATS.
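For example, run this in Impala after loading, so the planner learns about the table's partitions, files and statistics:
COMPUTE STATS user_info_bucketed;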
By the way, bucketing the tables used by Impala is not wasted effort.
If we have to limit the number of small files in the table, we can bucket it and switch on Hive transactions (available from Hive 0.13.0).
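A sketch of what that can look like, assuming an ORC copy of the table (Hive transactions require ORC storage plus bucketing, so the Parquet table above would need an ORC counterpart; the table name is illustrative):
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE user_info_txn (user_id BIGINT, firstname STRING, lastname STRING)
PARTITIONED BY (year int, month int)
CLUSTERED BY (user_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');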