When should we not use bucketing in Hive? - hadoop

When should we not use bucketing in Hive? What is the bottleneck of this technique?

I guess you don't have to use bucketing when you can't benefit from it. As far as I know, the main benefits of bucketing are more efficient sampling and map-side joins (see below). So if your table is small, or you don't need fast sampling and map-side joins, just don't use it, because you will have to remember to bucket your data before insertion, either manually or by using set hive.enforce.bucketing = true; There is no bottleneck; it's just one of the possible data layouts which lets you take advantage of it in some situations.
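For illustration, a minimal sketch of that workflow (the table and column names are invented):
create table users_bucketed (user_id bigint, name string)
clustered by (user_id) into 32 buckets
stored as orc;

set hive.enforce.bucketing = true;   -- pre-Hive 2.0; newer versions always enforce bucketing
insert into table users_bucketed
select user_id, name from users_raw;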
Hive map-side join example (see more here):
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for
each mapper of A, only the required buckets are fetched. For the query
above, the mapper processing bucket 1 for A will only fetch bucket 1
of B. It is not the default behavior, and is governed by the following
parameter
set hive.optimize.bucketmapjoin = true
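For illustration, a minimal sketch of the layout and settings that example assumes (the column types and storage format are my own assumptions):
create table a (key int, value string) clustered by (key) into 4 buckets stored as orc;
create table b (key int, value string) clustered by (key) into 4 buckets stored as orc;

set hive.optimize.bucketmapjoin = true;
select a.key, a.value from a join b on a.key = b.key;  -- joined bucket-by-bucket on the mappers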
Update: considering data skew when bucketing.
The bucket number is calculated as hash_function(bucketing_column) mod num_buckets. If your bucketing column is of int type, then hash_int(i) == i (see more here). So if you have skewed values in that column, for example one value that appears much more often than the others, then many more rows will be placed in the corresponding bucket, you will have disproportionate buckets, and this harms query speed. Hive has built-in tools to overcome data skew (see Skewed Tables), but I don't think you should use a column with skewed data for bucketing in the first place.
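A small worked example (the 4-bucket count and the 80% figure are invented for illustration): with 4 buckets and an int bucketing column, hash_int(7) = 7 and 7 mod 4 = 3, so every row whose column value is 7 lands in bucket 3; if 80% of the rows carry the value 7, bucket 3 ends up holding 80% of the table while the other three buckets stay almost empty.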

Bucketing is a method by which we distribute the data into files which would otherwise be unevenly distributed.
When to use bucketing: when we know that queries will use a column such as customer_id, which is sequential or evenly distributed.
When not to use bucketing: when we know that most use cases of the table involve reading only a subset of the data.
For example: although we keep historical data, we only process the last 2 weeks of data to determine something. In this scenario we would partition by week number.
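A hedged sketch of the two layouts this answer contrasts (table, column, and bucket-count choices are invented):
-- read only the last 2 weeks by pruning partitions
create table orders_hist (customer_id bigint, amount double)
partitioned by (weekno int);

-- evenly distributed key, useful for joins and sampling
create table orders_bucketed (customer_id bigint, amount double)
clustered by (customer_id) into 64 buckets;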

You should not prefer bucketing when the cardinality of the field is low; in that case partitioning is more beneficial.
Also, partitioning on multiple fields creates an ordered directory hierarchy, like (country, state, city), whereas bucketing hashes the bucketing column(s) into a fixed number of files with no such hierarchy.
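For illustration, a multi-field partition spec (the table and columns are invented) produces a nested directory layout on HDFS:
create table sales (order_id int, amount double)
partitioned by (country string, state string, city string);
-- data lands in nested directories such as .../sales/country=US/state=CA/city=SF/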

Related

How to bucket a Hive table with ORC for a complex query?

Maybe this question is too generic but I think it is worth a try.
I am working with a table that has 270 fields. It is partitioned by date (like dt=20180101). However, when we hit this table with queries we are essentially doing a whole table scan, because the fields we use in the where clause are not dt. I was wondering what the right approach is for enabling bucketing on this table. I could pick one of the where-clause fields and enable bucketing on it. For example:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class
)
INTO 16 BUCKETS
Another approach is to use more than 1 field for bucketing:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class, other_field, other_field_2
)
INTO 128 BUCKETS
Is it worth bucketing by multiple fields? I guess it will only speed up queries where those exact fields are present in the query.
Another question: is it worth at least sorting by multiple fields, so that reading the file is a sequential read? Like this:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class
)
SORTED BY (
other_field, other_field_2
)
INTO 16 BUCKETS
First, if you don't usually query on date and your queries span many dates, then you might want to change your partitioning strategy.
It's not necessary that you always query only one or a few dates, but if your queries are usually not related to date filtering at all, then you should change that!
Second, bucketing basically splits your data based on the hash of your bucketing columns. It helps you split your data into roughly equally sized files in the file system and helps the MapReduce program running over it manage the splits in an efficient way. But bucketing into a large number of buckets can also have negative effects, because all of that metadata is stored in the Hive metastore: the metadata is read first when you execute a query, and based on the result of that lookup the actual data (or the relevant part of it) is read from the file system.
So in practice there's no specific rule for bucketing, as to how many buckets there should be or on which columns you should bucket.
You should look at your queries and plan accordingly!
Third, sorting does help at query time, since it is easy for the engine to push down filtering and sorting criteria. But when you enable sorting on a table, ingestion of data becomes a little slower than when sorting isn't enabled. In a query-heavy system, though, it is bound to give you good benefits.
All in all, these are three optimization techniques that don't come with any hard rules for their application; it purely depends on your use case!
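Putting the question's fragments together, a minimal sketch of the single-bucket-column option with ORC storage (the table name and the placeholder column list are assumptions):
create table my_table (
  class string,
  other_field string,
  other_field_2 string
  -- ... the remaining columns of the 270-column table
)
partitioned by (dt int)
clustered by (class) sorted by (other_field, other_field_2) into 16 buckets
stored as orc;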
Hope this helps!!

Hive table sorted but inserted without sort

What happens if I create a table like
create table X (...) clustered by (date) sorted by (time)
but insert into it without sorting:
insert into x select * from raw
Will the data be sorted after it is fetched from raw, before being inserted?
If unsorted data is inserted,
what does "sorted by" do in the create table statement?
Does it just work as a hint for later select queries?
The documentation explains:
The CLUSTERED BY and SORTED BY creation commands do not affect how
data is inserted into a table – only how it is read. This means that
users must be careful to insert data correctly by specifying the
number of reducers to be equal to the number of buckets, and using
CLUSTER BY and SORT BY commands in their query.
I think it is clear that you want to insert the data sorted if you are using that option.
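Following the quoted documentation, a hedged sketch of an insert that actually respects the declared layout (the 4-bucket count is an assumption; on pre-2.0 Hive you could instead set hive.enforce.bucketing and hive.enforce.sorting to true):
set mapred.reduce.tasks = 4;          -- one reducer per bucket, as the docs require
insert into table X
select * from raw
distribute by date                    -- route each date value to its bucket's reducer
sort by time;                         -- sort within each bucket, matching SORTED BY (time)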
No, the data will not be sorted.
As another answer explains, the SORTED BY and CLUSTERED BY options do not change how data will be returned from queries. While the documentation is technically accurate, the purpose of CLUSTER BY is to write the underlying data to HDFS in a way that will make subsequent queries faster in some cases. Clustering (bucketing) is similar to partitioning in that it allows the query processor to skip reading rows, if the clustering key is chosen wisely. A common use of buckets is sampling data, where you explicitly include only certain buckets, thereby avoiding reads of the excluded ones.
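A hedged sampling sketch (assumes X was bucketed into 4 buckets on the date column, as above):
-- scan only bucket 1 of 4, i.e. roughly a quarter of the data
select * from X tablesample (bucket 1 out of 4 on date);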

Determining Bucketing Configuration on Hive Table

I was curious if someone could provide a little more clarification on how to configure the bucketing property on a Hive table. I see that it helps with joins, and I believe I read that it's good to put it on a column that you will use to join. That could be wrong. I am also curious about how to determine the number of buckets to choose.
If anyone could give a brief explanation and some documentation on how to determine all of these things that would be great.
Thanks in advance for the assistance.
Craig
If you want to implement bucketing in your table, first you should set the property
set hive.enforce.bucketing=true;
which will enforce the bucketing.
Cardinality: the number of possible values for a column.
If you are implementing bucketing using the CLUSTERED BY clause, your bucketing column should have high cardinality; then you will get better performance.
If you are implementing partitioning using the PARTITIONED BY clause, your partition column should have low cardinality; then you will get better performance.
Depending on the use case you can choose the number of buckets. A good rule of thumb is to pick the number of buckets so that each bucket file is not smaller than your HDFS block size, and to make it a power of 2.
Bucketing always creates files, not directories.
The following are a few suggestions to consider while designing buckets.
Buckets are generally created on the most critical columns, either a single column or a set of columns, which implies that these columns are the primary columns used in various join conditions. The idea of bucketing is to hash this set of columns and store the data in such a way that it is quickly accessible from HDFS, so retrieval speed is fast. It is advisable not to bucket on all the join columns, only the critical ones that we think will improve performance.
The number of buckets should be a power of 2. The number of buckets determines the number of reducers to be run, and that determines the final number of files in which the data is stored. So the number of buckets has to be chosen keeping in mind the size of the data we are handling, thereby avoiding a large number of small files (or a small number of very big files) in HDFS and improving Hive query retrieval speed and optimization.
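A hedged sizing example (the data volume and block size are assumptions): for roughly 50 GB of data per partition and a 128 MB HDFS block size, 50 GB / 128 MB ≈ 400; rounding down to a power of 2 gives 256 buckets, leaving each bucket file at about 200 MB, comfortably above one block and well away from the many-small-files problem.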

Vertica query optimization

I want to optimize a query in vertica database. I have table like this
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions)
I fetch some data using this query
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slow. The query plan shows something like this
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I perform explain on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan over part of the table. In other databases I can create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have a concept of indexes. You would want to create a query specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
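For example (assuming the table lives in the public schema):
SELECT ANALYZE_STATISTICS('public.data');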
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
Creating good projections is the "secret sauce" of making Vertica perform well. Projection design is a bit of an art form, but there are three fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data on a column which is frequently used in queries, such as transaction_date, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing them to affect only one small ROS container instead of the entire file.
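A hedged sketch of a projection that applies points 1 and 2 to the example table (the projection name and the choice of segmentation expression are assumptions):
CREATE PROJECTION data_ab_p
AS SELECT a, b, c FROM data
ORDER BY a, b                        -- matches the WHERE a = 1 AND b IN (...) filters
SEGMENTED BY HASH(b, c) ALL NODES;   -- higher-cardinality expression to avoid node skew
After creating it, refresh the table's projections (for example with SELECT REFRESH('data');) so the new projection is populated and the optimizer can use it.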

Skewed tables in Hive

I am learning Hive and came across skewed tables. Help me understand them.
What are skewed tables in Hive?
How do we create skewed tables?
How does it effect performance?
What are skewed tables in Hive?
A skewed table is a special type of table where the values that appear very often (heavy skew) are split out into separate files and the rest of the values go to some other file.
How do we create skewed tables?
create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as DIRECTORIES];
Example :
create table T (c1 string, c2 string) skewed by (c1) on ('x1')
How does it affect performance?
By specifying the skewed values, Hive will split those out into separate files automatically and take this fact into account during queries, so that it can skip (or include) whole files if possible, thus enhancing performance.
EDIT :
x1 is actually the value on which column c1 is skewed. You can have multiple such values for multiple columns. For example,
create table T (c1 string, c2 string) skewed by (c1) on ('x1', 'x2', 'x3')
The advantage of such a setup is that values that appear more frequently than other values get split out into separate files (or separate directories if we are using the STORED AS DIRECTORIES clause), and this information is used by the execution engine during query execution to make processing more efficient.
In skewed tables, a separate split (file or directory) is created for the column value that has many records, and the rest of the data is moved to another split. Hence the number of splits, mappers, and intermediate files is reduced.
For example: out of 100 patients, 90 have high BP and the other 10 have fever, cold, cancer, etc. So one split will be created for the 90 patients and another for the other 10.
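A hedged sketch of that example as DDL (the table and column names and the skewed value literal are invented):
create table patients (patient_id int, diagnosis string)
skewed by (diagnosis) on ('high_bp')
stored as directories;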
I hope this will answer your question.
