When does it make sense to create multiple tables as opposed to a single table with a large number of columns. I understand that typically tables have only a few column families (1-2) and that each column family can support 1000+ columns.
When does it make sense to create separate tables when HBase seems to perform well with a potentially large number of columns within a single table?
Before answering the question itself, let me first state some of the major factors that come into play. I am going to assume that the file system in use is HDFS.
A table is divided into non-overlapping partitions of the keyspace called regions.
The key-range -> region mapping is stored in a special single region table called meta.
The data in one HBase column family for a region is stored in a single HDFS directory. It's usually several files but for all intents and purposes, we can assume that a region's data for a column family is stored in a single file on HDFS called a StoreFile / HFile.
A StoreFile is essentially a sorted file containing KeyValues. A KeyValue logically represents the following in order: (RowLength, RowKey, FamilyLength, FamilyName, Qualifier, Timestamp, Type). For example, if you have only two KVs in your region for a CF where the key is same but values in two columns, this is how the StoreFile will look like (except that it's actually byte encoded, and metadata like length etc. is also stored as I mentioned above):
Key1:Family1:Qualifier1:Timestamp1:Value1:Put
Key1:Family1:Qualifier2:Timestamp2:Value2:Put
The StoreFile is divided into blocks (default 64KB) and the key range contained in each data block is indexed by multi-level indexes. A random lookup inside a single block can be done using index + binary search. However, the scans have to go serially through a particular block after locating the starting position in the first block needed for scan.
HBase is a LSM-tree based database which means that it has an in-memory log (called Memstore) that is periodically flushed to the filesystem creating the StoreFiles. The Memstore is shared for all columns inside a single region for a particular column family.
There are several optimizations involved while dealing with reading/writing data from/to HBase, but the information given above holds true conceptually. Given the above statements, the following are the pros of having several columns vs several tables over the other approach:
Single Table with multiple columns
Better on-disk compression due to prefix encoding since all data for a Key is stored together rather than on multiple files across tables. This also results in reduced disk activity due to smaller data size.
Lesser load on meta table because the total number regions is going to be smaller. You'll have N number of regions for just one table rather than N*M regions for M tables. This means faster region lookup and low contention on meta table, which is a concern for large clusters.
Faster reads and low IO amplification (causing less disk activity) when you need to read several columns for a single row key.
You get advantage of row level transactions, batching and other performance optimizations when writing to multiple columns for a single row key.
When to use this:
If you want to perform row level transactions across multiple columns, you have to put them in a single table.
Even when you don't need row level transactions, but you often write to or query from multiple columns for the same row key. A good rule for thumb is that if on an average, more than 20% for your columns have values for a single row, you should try to put them together in a single table.
When you have too many columns.
Multiple Tables
Faster scans for each table and low IO amplification if the scans are mostly concerned only with one column (remember sequential look-ups in scans will unnecessarily read columns they don't need).
Good logical separation of data, especially when you don't need to share row keys across columns. Have one table for one type of row keys.
When to use:
When there is a clear logical separation of data. For example, if your row key schema differs across different sets of columns, put those sets of columns in separate tables.
When only a small percentage of columns have values for a row key (Look below for a better approach).
You want to have different storage configs for different sets of columns. E.g. TTL, compaction rate, blocking file counts, memstore size etc. (Look below for a better approach in this use case).
An alternative of sorts: Multiple CFs in single table
As you can see from above, there are pros of both the approaches. The choice becomes really difficult in cases where you have same structure of row key for several columns (so, you want to share row key for storage efficiency or need transactions across columns) but the data is very sparse (which means you write/read only small percentage of columns for a row key).
It seems like you need the best of both worlds in this case. That's where column families come in. If you can partition your column set into logical subsets where you mostly access/read/write only to a single subset, or you need storage level configs per subset (like TTL, Storage class, write heavy compaction schedule etc.), then you can make each subset a column family.
Since data for a particular column family is stored in single file (set of files), you get better locality while reading a subset of columns without slowing down the scans.
However, there is a catch:
Do not try to unnecessarily use column families. There is a cost associated with them, and HBase does not do well with 10+ CFs due to how region level write locks, monitoring etc. work in HBase. Use CFs only if you have a logical relationship between columns across CFs but you don't generally perform operations across CFs or need to have different storage configs for different CFs.
It's perfectly fine to use only a single CF containing all your columns if you share row key schema across them, unless you have a very sparse data set, in which case you might need different CFs or different tables based on above mentioned points.
I couldn't find anywhere how repartition is performed on a RDD internally? I understand that you can call repartition method on a RDD to increase the number of partition but how it is performed internally?
Assuming, initially there were 5 partition and they had -
1st partition - 100 elements
2nd partition - 200 elements
3rd partition - 500 elements
4th partition - 5000 elements
5th partition - 200 elements
Some of the partitions are skewed because they were loaded from HBase and data was not correctly salted in HBase which caused some of the region servers to have too many entries.
In this case, when we do repartition to 10, will it load all the partition first and then do the shuffling to create 10 partition? What if the full data cant be loaded into memory i.e. all partitions cant be loaded into memory at once? If Spark does not load all partition into memory then how does it know the count and how does it makes sure that data is correctly partitioned into 10 partitions.
From what I have understood, repartition will certainly trigger shuffle. From Job Logical Plan document following can be said about repartition
- for each partition, every record is assigned a key which is an increasing number.
- hash(key) leads to a uniform records distribution on all different partitions.
If Spark can't load all data into memory then memory issue will be thrown. So default processing of Spark is all done in memory i.e. there should always be sufficient memory for your data.
Persist option can be used to tell spark to spill your data in disk if there is not enough memory.
Jacek Laskowski also explains about repartitions.
Understanding your Apache Spark Application Through Visualization should be sufficient for you to test and know by yourself.
In Hbase, I have configured hbase.hregion.max.filesize as 10GB. If the Single row exceeds the 10GB size, then the row will not into 2 regions as Hbase splits are done based on row key
For example, if I have a row which has 1000 columns, and each column varies between 25MB to 40 MB. So there is chance to exceed the defined region size. If this is the case, how will it affect the performance while reading data using rowkey alone or row-key with column qualifier?
First thing is Hbase is NOT for storing that much big data 10GB in a single row(its quite hypothetical).
I hope your have not saved 10GB in a single row (just thinking of saving that)
It will adversely affect the performance. You consider other ways like storing this much data in hdfs in a partitioned structure.
In general, these are the tips for generally applicable batch clients like Mapreduce Hbase jobs
Scan scan = new Scan();
scan.setCaching(500); //1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
Can have look at Performance
Cassandra stores column-keys of a row-key in sorted order physically.
My doubt is, how can it be sorted when the same row-key is present in multiple SSTables ?
I think this explanation of how compaction merges data sorted by partition key within each SSTable and then consolidates SSTables into a single file is what you want to know: http://www.datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_write_path_c.html?scroll=concept_ds_wt3_32w_zj__dml-compaction
I know both is performed on a column in the table but how is each operation different.
Partitioning data is often used for distributing load horizontally, this has performance benefit, and helps in organizing data in a logical fashion. Example: if we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department . For a faster query response Hive table can be PARTITIONED BY (country STRING, DEPT STRING). Partitioning tables changes how Hive structures the data storage and Hive will now create subdirectories reflecting the partitioning structure like
.../employees/country=ABC/DEPT=XYZ.
If query limits for employee from country=ABC, it will only scan the contents of one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning feature is very useful in Hive, however, a design that creates too many partitions may optimize some queries, but be detrimental for other important queries. Other drawback is having too many partitions is the large number of Hadoop files and directories that are created unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.
Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed by a user-defined number into buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of employee_id is much greater than the number of buckets, each bucket will have many employee_id. While creating table you can specify like CLUSTERED BY (employee_id) INTO XX BUCKETS; where XX is the number of buckets . Bucketing has several advantages. The number of buckets is fixed so it does not fluctuate with data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins etc.
There are a few details missing from the previous explanations.
To better understand how partitioning and bucketing works, you should look at how data is stored in hive.
Let's say you have a table
CREATE TABLE mytable (
name string,
city string,
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
then hive will store data in a directory hierarchy like
/user/hive/warehouse/mytable/y=2015/m=12/d=02
So, you have to be careful when partitioning, because if you for instance partition by employee_id and you have millions of employees, you'll end up having millions of directories in your file system.
The term 'cardinality' refers to the number of possible value a field can have. For instance, if you have a 'country' field, the countries in the world are about 300, so cardinality would be ~300. For a field like 'timestamp_ms', which changes every millisecond, cardinality can be billions. In general, when choosing a field for partitioning, it should not have a high cardinality, because you'll end up with way too many directories in your file system.
Clustering aka bucketing on the other hand, will result with a fixed number of files, since you do specify the number of buckets. What hive will do is to take the field, calculate a hash and assign a record to that bucket.
But what happens if you use let's say 256 buckets and the field you're bucketing on has a low cardinality (for instance, it's a US state, so can be only 50 different values) ? You'll have 50 buckets with data, and 206 buckets with no data.
Someone already mentioned how partitions can dramatically cut the amount of data you're querying. So in my example table, if you want to query only from a certain date forward, the partitioning by year/month/day is going to dramatically cut the amount of IO.
I think that somebody also mentioned how bucketing can speed up joins with other tables that have exactly the same bucketing, so in my example, if you're joining two tables on the same employee_id, hive can do the join bucket by bucket (even better if they're already sorted by employee_id since it's going to mergesort parts that are already sorted, which works in linear time aka O(n) ).
So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
Also, you can partition on multiple fields, with an order (year/month/day is a good example), while you can bucket on only one field.
Before going into Bucketing, we need to understand what Partitioning is. Let us take the below table as an example. Note that I have given only 12 records in the below example for beginner level understanding. In real-time scenarios you might have millions of records.
PARTITIONING
---------------------
Partitioning is used to obtain performance while querying the data. For example, in the above table, if we write the below sql, it need to scan all the records in the table which reduces the performance and increases the overhead.
select * from sales_table where product_id='P1'
To avoid full table scan and to read only the records related to product_id='P1' we can partition (split hive table's files) into multiple files based on the product_id column. By this the hive table's file will be split into two files one with product_id='P1' and other with product_id='P2'. Now when we execute the above query, it will scan only the product_id='P1' file.
../hive/warehouse/sales_table/product_id=P1
../hive/warehouse/sales_table/product_id=P2
The syntax for creating the partition is given below. Note that we should not use the product_id column definition along with the non-partitioned columns in the below syntax. This should be only in the partitioned by clause.
create table sales_table(sales_id int,trans_date date, amount int)
partitioned by (product_id varchar(10))
Cons : We should be very careful while partitioning. That is, it should not be used for the columns where number of repeating values are very less (especially primary key columns) as it increases the number of partitioned files and increases the overhead for the Name node.
BUCKETING
------------------
Bucketing is used to overcome the cons that I mentioned in the partitioning section. This should be used when there are very few repeating values in a column (example - primary key column). This is similar to the concept of index on primary key column in the RDBMS. In our table, we can take Sales_Id column for bucketing. It will be useful when we need to query the sales_id column.
Below is the syntax for bucketing.
create table sales_table(sales_id int,trans_date date, amount int)
partitioned by (product_id varchar(10)) Clustered by(Sales_Id) into 3 buckets
Here we will further split the data into few more files on top of partitions.
Since we have specified 3 buckets, it is split into 3 files each for each product_id. It internally uses modulo operator to determine in which bucket each sales_id should be stored. For example, for the product_id='P1', the sales_id=1 will be stored in 000001_0 file (ie, 1%3=1), sales_id=2 will be stored in 000002_0 file (ie, 2%3=2),sales_id=3 will be stored in 000000_0 file (ie, 3%3=0) etc.
I think I am late in answering this question, but it keep coming up in my feed.
Navneet has provided excellent answer. Adding to it visually.
Partitioning helps in elimination of data, if used in WHERE clause, where as bucketing helps in organizing data in each partition into multiple files, so as same set of data is always written in same bucket. Helps a lot in joining of columns.
Suppose, you have a table with five columns, name, server_date, some_col3, some_col4 and some_col5. Suppose, you have partitioned the table on server_date and bucketed on name column in 10 buckets, your file structure will look something like below.
server_date=xyz
00000_0
00001_0
00002_0
........
00010_0
Here server_date=xyz is the partition and 000 files are the buckets in each partition. Buckets are calculated based on some hash functions, so rows with name=Sandy will always go in same bucket.
Hive Partitioning:
Partition divides large amount of data into multiple slices based on value of a table column(s).
Assume that you are storing information of people in entire world spread across 196+ countries spanning around 500 crores of entries. If you want to query people from a particular country (Vatican city), in absence of partitioning, you have to scan all 500 crores of entries even to fetch thousand entries of a country. If you partition the table based on country, you can fine tune querying process by just checking the data for only one country partition. Hive partition creates a separate directory for a column(s) value.
Pros:
Distribute execution load horizontally
Faster execution of queries in case of partition with low volume of data. e.g. Get the population from "Vatican city" returns very fast instead of searching entire population of world.
Cons:
Possibility of too many small partition creations - too many directories.
Effective for low volume data for a given partition. But some queries like group by on high volume of data still take long time to execute. e.g. Grouping of population of China will take long time compared to grouping of population in Vatican city. Partition is not solving responsiveness problem in case of data skewing towards a particular partition value.
Hive Bucketing:
Bucketing decomposes data into more manageable or equal parts.
With partitioning, there is a possibility that you can create multiple small partitions based on column values. If you go for bucketing, you are restricting number of buckets to store the data. This number is defined during table creation scripts.
Pros
Due to equal volumes of data in each partition, joins at Map side will be quicker.
Faster query response like partitioning
Cons
You can define number of buckets during table creation but loading of equal volume of data has to be done manually by programmers.
The difference is bucketing divides the files by Column Name, and partitioning divides the files under By a particular value inside table
Hopefully I defined it correctly
There are great responses here. I would like to keep it short to memorize the difference between partition & buckets.
You generally partition on a less unique column. And bucketing on most unique column.
Example if you consider World population with country, person name and their bio-metric id as an example. As you can guess, country field would be the less unique column and bio-metric id would be the most unique column. So ideally you would need to partition the table by country and bucket it by bio-metric id.
Using Partitions in Hive table is highly recommended for below reason -
Insert into Hive table should be faster ( as it uses multiple threads
to write data to partitions )
Query from Hive table should be efficient with low latency.
Example :-
Assume that Input File (100 GB) is loaded into temp-hive-table and it contains bank data from across different geographies.
Hive table without Partition
Insert into Hive table Select * from temp-hive-table
/hive-table-path/part-00000-1 (part size ~ hdfs block size)
/hive-table-path/part-00000-2
....
/hive-table-path/part-00000-n
Problem with this approach is - It will scan whole data for any query you run on this table. Response time will be high compare to other approaches where partitioning and Bucketing are used.
Hive table with Partition
Insert into Hive table partition(country) Select * from temp-hive-table
/hive-table-path/country=US/part-00000-1 (file size ~ 10 GB)
/hive-table-path/country=Canada/part-00000-2 (file size ~ 20 GB)
....
/hive-table-path/country=UK/part-00000-n (file size ~ 5 GB)
Pros - Here one can access data faster when it comes to querying data for specific geography transactions.
Cons - Inserting/querying data can further be improved by splitting data within each partition. See Bucketing option below.
Hive table with Partition and Bucketing
Note: Create hive table ..... with "CLUSTERED BY(Partiton_Column) into 5 buckets
Insert into Hive table partition(country) Select * from temp-hive-table
/hive-table-path/country=US/part-00000-1 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-2 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-3 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-4 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-5 (file size ~ 2 GB)
/hive-table-path/country=Canada/part-00000-1 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-2 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-3 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-4 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-5 (file size ~ 4 GB)
....
/hive-table-path/country=UK/part-00000-1 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-2 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-3 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-4 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-5 (file size ~ 1 GB)
Pros - Faster Insert. Faster Query.
Cons - Bucketing will creating more files. There could be issue with many small files in some specific cases
Hope this will help !!