Skewed tables in Hive - hadoop

I am learning hive and came across skewed tables. Help me understanding it.
What are skewed tables in Hive?
How do we create skewed tables?
How does it effect performance?

What are skewed tables in Hive?
A skewed table is a special type of table where the values that appear very often (heavy skew) are split out into separate files and rest of the values go to some other file..
How do we create skewed tables?
create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as DIRECTORIES];
Example :
create table T (c1 string, c2 string) skewed by (c1) on ('x1')
How does it affect performance?
By specifying the skewed values Hive will split those out into separate files automatically and take this fact into account during queries so that it can skip (or include) whole files if possible thus enhancing the performance.
EDIT :
x1 is actually the value on which column c1 is skewed. You can have multiple such values for multiple columns. For example,
create table T (c1 string, c2 string) skewed by (c1) on ('x1', 'x2', 'x3')
Advantage of having such a setup is that for the values that appear more frequently than other values get split out into separate files(or separate directories if we are using STORED AS DIRECTORIES clause). And this information is used by the execution engine during query execution to make processing more efficient.

In Skewed Tables, partition will be created for the column value which has many records and rest of the data will be moved to another partition. Hence number of partitions, number of mappers and number of intermediate files will be reduced.
For ex: out of 100 patients, 90 patients have high BP and other 10 patients have fever, cold, cancer etc. So one partition will be created for 90 patients and one partition will be created for other 10 patients.
I hope this will answer your question.

Related

Does order of partitioning columns matter in Hive?

Lets say I have a partitioned table with multiple columns as partition keys e.g.
partitioned by (department string,year int, month int,day int)
So does this specific order really matter? All the online resources refer to advantage of scanning only specific sub-directories for search. But ultimately everything is a file in big data, directories seem to be more like logical grouping. And when one specifies a filter on partitioned column, hive just needs to know which files are involved and where they are located, not sure how directory is going to be useful -- it's not like as if directories are loaded in memory -- files are loaded in memory -- and the directory path is more like a label for a given file. If that's the case, no matter which order we specify for partitioning , it shouldn't matter. This is especially more evident in HDInsight where the underlying file system (BLOBs) has no concept of directory.
Although you're right about directories being logical constructs, if you consider the amount of metadata your HiveServer2 has to get and sift through in order to execute an average query, the order does matter. If a query contains ...WHERE department='IT'..., and the partitions are laid out as you show, given 100 departments total, the partition pruning mechanism will be able to eliminate 99 subdirectories from the tree right away. But if the order of partition columns is reversed, the same query will need to retrieve metadata for (30 days x 12 month x N years) partitions from Hive MetaStore, just to figure out whether partition /department=IT actually exists in all of them. So the order of partitions can be decided by analyzing predominant query patterns.
Another common factor to consider is devops/maintenance related, especially if data is loaded into a table incrementally. If one needs to backoff/recover from unsuccessful load, will he need to drop a partition (day=08) in each department subtree individually, or can all department data be cleared at once by dropping partition (day=08)?

How to bucket a Hive table with ORC for a complex query?

Maybe this question is too generic but I think it is worth a try.
I am working with a table that has 270 fields. It is partitioned by the date (like dt=20180101). However when we are hitting this table with queries we are essentially doing a whole table scan because we use fields in the where clause that are not dt. I was wondering what is the right approach for enable bucketing for this table. I could pick one of the where clause fields and enable bucketing for that. For example:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class
)
INTO 16 BUCKETS
Another approach is to use more than 1 field for bucketing:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class, other_field, other_field_2
)
INTO 128 BUCKETS
Is it worth to bucker by multiple field? I guess it will only speed up queries when the same exact fields are present in the select.
Another question, is it worth at least sort by multiple fields so when the file is read it is sequential read? Like this:
PARTITIONED BY (
dt INT
)
CLUSTERED BY (
class
)
SORTED BY (
other_field, other_field_2
)
INTO 16 BUCKETS
First, if you dont usually query on date and your queries span over many dates, then you might want to change your partitioning strategy.
Its not necessary that you will always query only for 1 or few dates but if your queries are usually totally NOT related to 'date' filtering then you should change that!
Second, bucketing basically splits your data based on hash of your bucketing columns. So it helps you to split your data into equally sized folders in file system and helps mapReduce program runnig over it manage the partitions in an efficient way. But, bucketing into large number of buckets can also have negative effects as all such metadata is also stored in Hive metastore. So, this metadata is read first when you execute some query and based on the result from metadata query, actual data (part of actual data) is read from file system.
So in actual there's no specific rule for bucketing; as to how many buckets should be there and on what all columns you should bucket.
So you should look into your queries and plan accordingly!
Third, sorting does help at the time of querying, as its easy for the engine to push down filtering and sorting criteria. But when you enable sorting on a table, ingestion of data actually becomes a little slower than the case where sorting isnt enabled! But definitely in high queries system it is bound to get you good benefits.
So all in all, these three are all optimization techniques and dont hold any particular rules for their application. It purely depends on your use case!
Hope this helps!!

When we should not use bucketing in hive?

When we should not use bucketing in hive? What is the bottleneck of this technique?
I guess you don't have to use bucketing when you can't benefit from it. As far as I know among main benefits of bucketing: more efficient sampling and map-side joins(see bellow). So if your table is small or you don't need fast sampling and map-side joins just don't use it because you will need to remember that you have to bucket you data before insertion, manually or using set hive.enforce.bucketing = true; There is no bottleneck, it's just one of possible data layouts which allow you to take advantage in some situations.
Hive map-side join example (see more here):
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for
each mapper of A, only the required buckets are fetched. For the query
above, the mapper processing bucket 1 for A will only fetch bucket 1
of B. It is not the default behavior, and is governed by the following
parameter
set hive.optimize.bucketmapjoin = true
Update Considering the data skew when bucketing.
Bucket number calculated using hash_function(bucketing_column) mod num_buckets. If your bucketing column is of int type then hash_int(i) == i(see more here). So if you have skewed values in that column, one value appears much more often then the others for example, then many more rows will be placed in a corresponding bucket, you will have disproportional buckets, this harms the query speed. Hive have build-in tools to overcome data skewness(see Skewed Tables) but I don't think you should use a column with skewed data for bucketing in the first place.
Bucketing is method by which we distribute the data into files. which would otherwise be unevenly distributed.
When to use Bucketing: When we know that query will use column such as "customer_id" which is sequencial or evenly distributed.
When Not to use Bucketing: We would not use bucketing when we know that most use case of the table involve reading subset of data.
For Example: although we keep historical data, we only process last 2 weeks data to determine something. In this scenario we would use partition by weekno.
You should not prefer bucketing when cardinality of partitioning field is not too high. In that case partitioning is more beneficial.
And bucketing can only be done on one field whereas partitioning can be done on multiple fields , with an order like(country, city, state).

what is skewed column in Oracle

I found some bottleneck of my query which select data from only single table then require time and i used non unique key index on two column and with column used in where clause.
select name ,isComplete from Student where year='2015' and isComplete='F'
Now i found some concept from internet like skewed column so what is it?
have an idea then plz help me?
and how to resolve problem of skewed column?
and how skewed column affect performance of the Query?
Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
Full table scan and Index Scan both are depend on the Skewed column.
and Skewed column is nothing but your spread like gender column contain 60 male and 40 female.

oracle partitioning on columns frequently used in joins and where conditions

The customer table contains 9.5 million records. The customer_id column is the primary key. The database is Oracle.
Questions:
1) Should the table contain main partitions or sub-partitions? How do I decide?
Also, I don't think indexing columnA or columnB will help here because of the type of data.
TableA.columnA (varchar) has more than 80% of the records for columnA values 5,6,7. The columnA has values from 1 to 7 only.
TableA.columnB (varchar) has 90% of the records for columnB value = 102. The columnB has values from 1 to 999.
Moreover, the typical queries are (in no particular order):
Query1: where tableA.columnA = values
Query2: where tableA.columnB = values
Query3: where tableA.columnA = values AND/OR tableA.columnB = values
2) When we create sub-partitions, what happens if the query only contains a where clause for sub-partition column? Does the query execution go directly to sub-partition or through main partition?
3) the join contains tableA.partitioned_column = tableB.indexed_column
(eg. customer_Table.branch_code = branch_table.branch_code)
Does partitioning help in the case of JOIN? Will it improve performance?
1) It's very difficult to answer not knowing table structure, the way it's usually used etc. But generally for big tables partitioning is very often necessity.
2) If you will not specify partition then Oracle will have to browse through all partitions to find where the subpartition is (which is not very slow). And then use partition pruning on subpartition. It will be still significantly faster then not having subpartitions at all. But the best situation is to refer in WHERE to partition and subpartition.
3) For 99% I think it will help, because Oracle can use partition pruning to get at once needed rows from tableA. You will be 100% sure if you check query plan. But the best situation is when both column are partition keys.
If 80-90% of these columns have the same values and they are the most often queried values, then partitioning will help some. You would be pruning 10-20% of the data during these queries but you probably want to find another way for Oracle to hone in on the data your query needs (dates, perhaps?)
The value distribution in your two columns also brings up the point of statistics and making sure they are being gathered properly (with histograms to describe the skew in these columns).
As #psur points out, without knowing the details of your system it's hard give concrete suggestions.

Resources