I was curious if someone could provide a little more clarification on how to configure the bucketing property on a Hive table. I see that it helps with joins, and I believe I read that it's good to put it on a column that you will use to join. That could be wrong. I am also curious about how to determine the number of buckets to choose.
If anyone could give a brief explanation and some documentation on how to determine all of these things that would be great.
Thanks in advance for the assistance.
Craig
If you want to implement bucketing in your table, you should first set the property
set hive.enforce.bucketing=true;
which enforces bucketing when data is inserted.
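For example, a bucketed table might be created and loaded like this (a minimal sketch; the table and column names are hypothetical):
set hive.enforce.bucketing=true;

CREATE TABLE user_events (
  user_id INT,
  event   STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- with enforcement on, Hive runs one reducer per bucket during the insert
INSERT OVERWRITE TABLE user_events
SELECT user_id, event FROM staging_events;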
Cardinality: the number of possible values for a column.
If you're implementing bucketing using the CLUSTERED BY clause, your bucketing column should have high cardinality; then you will get better performance.
If you're implementing partitioning using the PARTITIONED BY clause, your partitioned column should have low cardinality; then you will get better performance.
Depending on the use case you can choose the number of buckets. A common rule of thumb is to pick the count so that each bucket file ends up around your HDFS block size, and to make it a power of 2.
Bucketing always creates files, not directories (partitioning creates directories).
The following are a few suggestions to consider while designing buckets.
Buckets are generally created on the most critical columns, either a single column or a set of columns, so these columns are usually the primary columns in your join conditions. The idea behind bucketing is to hash this set of columns and store the data in such a way that it is quickly accessible from HDFS, so retrieval is fast. It's advisable not to bucket on all the join columns, only the critical ones that you think will improve performance.
The number of buckets should be a power of 2. The bucket count determines the number of reducers to be run, and that in turn determines the final number of files in which the data is stored. So the number of buckets has to be chosen keeping in mind the size of the data being handled, avoiding both a large number of small files and a small number of huge files in HDFS; this improves Hive query retrieval speed and optimization. For example, roughly 10 GB of data with a 128 MB block size works out to about 80 blocks, suggesting 64 or 128 buckets.
Maybe this question is too generic but I think it is worth a try.
I am working with a table that has 270 fields. It is partitioned by date (like dt=20180101). However, when we hit this table with queries we are essentially doing a whole table scan, because we use fields in the WHERE clause other than dt. I was wondering what the right approach is to enable bucketing for this table. I could pick one of the WHERE clause fields and enable bucketing on it. For example:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class
)
INTO 16 BUCKETS
Another approach is to use more than 1 field for bucketing:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class, other_field, other_field_2
)
INTO 128 BUCKETS
Is it worth it to bucket by multiple fields? I guess it will only speed up queries when those exact fields are present in the select.
Another question: is it worth it to at least sort by multiple fields, so that when the file is read it is a sequential read? Like this:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class
)
SORTED BY (
other_field, other_field_2
)
INTO 16 BUCKETS
First, if you don't usually query on date and your queries span many dates, then you might want to change your partitioning strategy.
It's not necessary that you always query for only one or a few dates, but if your queries are usually not related to date filtering at all, then you should change that!
Second, bucketing basically splits your data based on the hash of your bucketing columns. So it helps you split your data into roughly equally sized files in the file system, and helps the MapReduce jobs running over it manage the data efficiently. But bucketing into a large number of buckets can also have negative effects, as all of that metadata is stored in the Hive metastore. The metadata is read first when you execute a query, and based on the result of the metadata lookup, the actual data (or the relevant part of it) is read from the file system.
So in practice there's no specific rule for bucketing, as to how many buckets there should be or which columns you should bucket on.
So you should look into your queries and plan accordingly!
Third, sorting does help at query time, as it is easy for the engine to push down filtering and sorting criteria. But when you enable sorting on a table, ingestion actually becomes a little slower than when sorting isn't enabled! In a query-heavy system, though, it is bound to give you good benefits.
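As a sketch, reusing the hypothetical columns from the question, a sorted bucketed table plus the enforcement flags would look roughly like this:
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

CREATE TABLE events (
  class         STRING,
  other_field   STRING,
  other_field_2 STRING
  -- ...the remaining columns...
)
PARTITIONED BY (dt INT)
CLUSTERED BY (class)
SORTED BY (other_field, other_field_2)
INTO 16 BUCKETS;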
So all in all, these are three optimization techniques, and none of them comes with hard rules for its application. It purely depends on your use case!
Hope this helps!!
When should we not use bucketing in Hive? What is the bottleneck of this technique?
I guess you don't have to use bucketing when you can't benefit from it. As far as I know, the main benefits of bucketing are more efficient sampling and map-side joins (see below). So if your table is small, or you don't need fast sampling and map-side joins, just don't use it, because you will need to remember to bucket your data before insertion, either manually or by using set hive.enforce.bucketing=true;. There is no bottleneck; it's just one of the possible data layouts, which lets you take advantage of it in some situations.
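For example, sampling then only has to read the chosen bucket rather than the whole table (a sketch; the table and column are hypothetical, assumed bucketed on user_id into 4 buckets):
SELECT * FROM user_events TABLESAMPLE(BUCKET 1 OUT OF 4 ON user_id);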
Hive map-side join example (see more here):
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter:
set hive.optimize.bucketmapjoin = true
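For completeness, the two tables in that example would have been declared with matching bucket definitions, roughly like this (a sketch):
CREATE TABLE a (key INT, value STRING)
CLUSTERED BY (key) INTO 4 BUCKETS;

CREATE TABLE b (key INT, value STRING)
CLUSTERED BY (key) INTO 4 BUCKETS;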
Update: considering data skew when bucketing.
The bucket number is calculated as hash_function(bucketing_column) mod num_buckets. If your bucketing column is of int type, then hash_int(i) == i (see more here). So if you have skewed values in that column, for example one value that appears much more often than the others, then many more rows will be placed in the corresponding bucket and you will have disproportionate buckets; this harms query speed. Hive has built-in tools to overcome data skew (see Skewed Tables), but I don't think you should use a column with skewed data for bucketing in the first place.
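For reference, the Skewed Tables feature mentioned above is declared at table-creation time, roughly like this (the names and the skewed value are hypothetical):
CREATE TABLE orders (
  customer_id INT,
  amount      DOUBLE
)
SKEWED BY (customer_id) ON (0)  -- 0 stands in for the overrepresented value
STORED AS DIRECTORIES;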
Bucketing is a method by which we distribute the data into files, which would otherwise be unevenly distributed.
When to use bucketing: when we know that queries will use a column such as customer_id, which is sequential or evenly distributed.
When not to use bucketing: when we know that most use cases of the table involve reading a subset of the data.
For example: although we keep historical data, we only process the last 2 weeks of data to determine something. In this scenario we would partition by weekno.
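A sketch of that layout (all names and week values are hypothetical):
CREATE TABLE sales_history (
  customer_id INT,
  amount      DOUBLE
)
PARTITIONED BY (weekno INT);

-- processing only the last 2 weeks then touches just two partitions
SELECT customer_id, SUM(amount)
FROM sales_history
WHERE weekno IN (202451, 202452)
GROUP BY customer_id;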
You should not prefer bucketing when the cardinality of the field is low; in that case partitioning is more beneficial.
Also, partitioning can be done on multiple fields, with an order like (country, city, state). Note that bucketing can likewise take more than one column, as the CLUSTERED BY examples above show.
What happens if I
create table X (...) clustered by (date) sorted by (time)
but insert without sorting:
insert into x select * from raw
Will the data be sorted, after being fetched from raw, before it is inserted?
If unsorted data is inserted,
what does "sorted by" in the create table statement do?
Does it work just as a hint for later select queries?
The documentation explains:
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
I think it is clear that you want to insert the data sorted if you are using that option.
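A sketch of such a manual insert, using DISTRIBUTE BY plus SORT BY (the equivalent of CLUSTER BY when the sort column differs from the distribution column), and assuming for illustration that x was created with 16 buckets:
set mapred.reduce.tasks = 16;  -- must equal the number of buckets on x

INSERT OVERWRITE TABLE x
SELECT * FROM raw
DISTRIBUTE BY date
SORT BY time;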
No, the data will not be sorted.
As another answer explains, the SORTED BY and CLUSTERED BY options do not change how data will be returned from queries. While the documentation is technically accurate, the purpose of CLUSTER BY is to write the underlying data to HDFS in a way that will make subsequent queries faster in some cases. Clustering (bucketing) is similar to partitioning in that it allows the query processor to skip reading rows, if the clustering key is chosen wisely. A common use of buckets is sampling data, where you explicitly include only certain buckets, thereby avoiding reads against the excluded ones.
We have a huge table which currently holds 144 million rows and grows by 1 million rows each day.
I would like to create a partitioned table on an Oracle 11g server, but I am not familiar with the techniques. So I have two questions:
Is it possible to create a partitioned table from a table that doesn't have a PK?
What is your suggestion for partitioning a table with this many records?
Yes, but keep in mind that the partition key must be a part of the PK.
Avoid global indexes.
Choose the right partitioning key - plan it for future maintenance (cutting off the oldest or unnecessary partitions, placing them in separate tablespaces, etc.).
There are too many things to consider.
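As a starting point, an interval-partitioned table on 11g might look roughly like this (all names and the date key are hypothetical; note that no PK is required):
CREATE TABLE big_history (
  id         NUMBER,
  created_at DATE,
  payload    VARCHAR2(4000)
)
PARTITION BY RANGE (created_at)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
  PARTITION p_initial VALUES LESS THAN (DATE '2014-01-01')
);
Oracle then creates a new monthly partition automatically as rows arrive, which also makes it easy to drop the oldest partitions later.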
"There are several non-unique index on the table. But, the performance
is realy terrible! Just simple count function was return result after
5 minutes."
Partitioning is not necessarily a performance enhancer. The partition key will allow certain queries to benefit from partition pruning, i.e. those queries which filter on the partition key in the WHERE clause. Other queries may perform worse, if their WHERE clause runs against the grain of the partition key.
It is difficult to give specific advice because the details you've posted are so vague. But here are some other possible ways of speeding up queries on big tables (sketched after this list):
index compression
parallel query
better, probably compound, indexes.
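Rough sketches of the first two options (the table, column, and index names are made up):
-- index compression: prefix-compress the leading column of a compound index
CREATE INDEX big_t_ix ON big_t (status, created_at) COMPRESS 1;

-- parallel query: request a parallel full scan via a hint
SELECT /*+ PARALLEL(t, 8) */ COUNT(*) FROM big_t t;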
I'm interested to find out if there is a performance benefit to partitioning a numeric column that is often the target of a query. Currently I have a materialized view that contains ~50 million records. When using a regular b-tree index and searching by this numeric column I get a cost of 7 and query results in about 0.8 seconds (with non-primed cache). After adding a global hash partition (with 64 partitions) for that column I get a cost of 6 and query results in about 0.2 seconds (again with non-primed cache).
My first reaction is that the partitioned index has improved the performance of my query. However, I realize that this may just be a coincidence and could be totally dependent on the values being searched for, or other factors I'm not aware of. So my question is: is there a performance benefit to adding a global hash partition to a numeric column on a large table, or is the cost of determining which index partitions to scan outweighed by the cost of just doing a full range scan on a non-partitioned index?
I'm sure this, like many Oracle questions, can be answered with an "it depends." :) I'm interested in learning what factors I should consider to determine the benefits of each approach.
Thanks!
I'm pretty sure you have found this reference in your research - Partitioned Tables and Indexes. I'll give the link anyway in case somebody else is interested; it is very good material about partitioning.
Straight to the point - a partitioned index just decomposes the index into pieces (64 in your situation) and spreads the data depending on the hashed partitioning key. When you want to use it, Oracle "calculates" the hash of the key and determines in which piece to continue searching.
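For reference, a global hash-partitioned index like the one you describe might be created as follows (the index, table, and column names are hypothetical; 64 matches your partition count):
CREATE INDEX mv_num_ix ON my_mview (num_col)
GLOBAL PARTITION BY HASH (num_col)
PARTITIONS 64;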
Knowing how index searching works, on really huge data I think it is better to choose the partitioned index, in order to shrink the index tree you have to traverse compared to a regular index. It really depends on the data in the table (how the regular index tree is composed) and on whether hashing plus a direct jump to a lower node is faster than a regular tree traversal from the root node.
Finally, trust your test results. If one technique gives better results on your exact data than another, don't be afraid to implement it.