Can't achieve 100% data locality with HBase - hadoop

I have 10 nodes running Hadoop and HBase.
I create a table with 100 regions and predefined split keys: 001, 002 ... 099.
I assign regions 001-009 to the first node, regions 010-019 to the second node and so on, using the hbase move command. It worked; I see the correct distribution in the HBase UI.
I add 1M rows to the table. Each row has a key prefix from 000 to 099 based on a hash of the key, so I can be sure I have an almost equal number of rows per region: roughly 10k rows per region, 1M rows total.
The data is populated from another table with MapReduce using the HTable API.
If I check region sizes using hdfs dfs -du -h /hbase/data/default/my_table, I see they are almost the same (+/- 5%).
The problem is that most regions have 100% data locality, but ~10 regions have low locality (20-60%). They are not all on the same node, e.g. 3 bad regions on the first node, 4 on the second, 3 on the third.
Why does this happen? How can I achieve 100% data locality?
p.s. major compaction does not fix it
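For reference, the key scheme described in this question could be generated roughly as below. This is only a sketch: MD5 is an assumed stand-in for the unspecified key hash, and split_keys and prefixed_key are hypothetical helper names, not part of any HBase API.

```python
import hashlib

NUM_REGIONS = 100  # from the question: 100 regions

def split_keys(num_regions=NUM_REGIONS):
    # 99 split points ("001" .. "099") delimit 100 regions.
    return [f"{i:03d}" for i in range(1, num_regions)]

def prefixed_key(row_id, num_regions=NUM_REGIONS):
    # Assumption: MD5 stands in for whatever hash the question actually uses.
    digest = hashlib.md5(row_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_regions
    # Three-digit prefix "000" .. "099" followed by the original id.
    return f"{bucket:03d}{row_id}"

keys = split_keys()
print(keys[0], keys[-1], len(keys))
print(prefixed_key("user42"))
```

With a well-mixed hash, each prefix receives roughly the same number of rows, which matches the near-equal region sizes reported above.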

Related

Distribution of Key value pairs in Jboss Data Grid

I am loading 20 million non-expiring entries into JBoss Data Grid using Hot Rod clients. The Hot Rod clients run on 5 different machines to load the data. The entries were added successfully. We have configured a replication factor of 2, so there are 40 million entries in the grid in total. We found a variation of more than 10% in the number of entries on each node. For example, one node has 7.8 million entries while another has 12 million.
So I was wondering why the entries are not equally distributed; ideally each node should have about 10 million entries. The objective of the test was to check whether the load/requests are equally distributed across all the nodes.
Any pointers on how the key/value pairs are distributed in JDG would be appreciated.
In Infinispan the hash space is divided into segments which then get mapped to the nodes in the cluster.
Entries are hashed by their keys by applying the MurmurHash3 function to them. This determines the segment which owns the key. It could be possible that your keys are causing a somewhat uneven distribution. You could try increasing the number of segments in your configuration. With your cluster, use at least 100 segments.
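To see why more segments can help, here is a toy model. It is not Infinispan's actual consistent hash: it assumes equally sized segments handed out round-robin to 4 nodes (the node count implied by 40 million entries at about 10 million each), and segments_per_node and imbalance are hypothetical names.

```python
def segments_per_node(num_segments, num_nodes):
    # Hypothetical round-robin assignment: segment s goes to node s % num_nodes.
    counts = [0] * num_nodes
    for s in range(num_segments):
        counts[s % num_nodes] += 1
    return counts

def imbalance(num_segments, num_nodes):
    # Ratio between the most and least loaded node, in segments owned.
    counts = segments_per_node(num_segments, num_nodes)
    return max(counts) / min(counts)

for n in (10, 30, 102):
    print(n, segments_per_node(n, 4), round(imbalance(n, 4), 2))
# 10 [3, 3, 2, 2] 1.5
# 30 [8, 8, 7, 7] 1.14
# 102 [26, 26, 25, 25] 1.04
```

A node can only own whole segments, so the fewer segments there are, the coarser the possible split. Uneven key hashing would add to this effect.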
Also I had to lookup the meaning of "crore" and "lakh", as I had no idea what they were. You should probably use the 10M and 100K notations instead to make it easier to understand.

HAWQ distribute randomly and disk space for each segment is almost full. add new datanode

I am reading the HAWQ specification and have a question. Suppose I create a table (such as 'table_random') with random distribution in a cluster of 3 data nodes, and the disk space for each segment is almost exhausted. I then add a new data node to the cluster and insert data into 'table_random'.
- Will HAWQ distribute data to the old data nodes? What will actually happen?
- Will HAWQ redistribute the data of 'table_random' across the whole cluster?
thanks
You can create tables in HAWQ that are specified with a hash distribution key or randomly. With HAWQ 2.0, you should use random distribution but first, let's talk about how hash distribution works in HAWQ.
create table foo (id int, bar text) distributed by (id);
HAWQ has a concept of buckets for hash distributed tables. Basically, there is a file in hdfs that corresponds to each bucket. With a partitioned table, there is a file per partition and per bucket but let's just focus on my foo table above.
When you init your database, the GUC default_hash_table_bucket_number gets set. It is calculated as the number of nodes * 6. (For clusters with 85 - 102 nodes it is 5 * the number of nodes, and so on.) So a 10 node cluster will have default_hash_table_bucket_number=60. Therefore, there will be 60 files in HDFS for my foo table.
When you execute a query against foo, there will be 60 virtual segments (one for each file) for that one table.
When you expand your cluster, the number of buckets for my table stays fixed. The 60 buckets will still work, but they will be spread over all of the nodes.
After an expansion, if you use hash distribution, you should adjust default_hash_table_bucket_number based on the number of nodes in the cluster and then recreate the hash distributed tables so that they have the correct number of buckets.
You can also specify the number of buckets for a table like this:
create table foo (id int, bar text) with (bucketnum=10) distributed by (id);
Now I'm forcing the database to have 10 buckets for my table rather than using the value from default_hash_table_bucket_number.
But randomly distributed tables are recommended over hash. Why? Because of elasticity.
create table foo_random (id int, bar text) distributed randomly;
Now this table will only create a single file in hdfs. The number of vsegs is determined at runtime based on the query optimizer. For a small table, the optimizer may only execute a single virtual segment while a very large table may use 6 virtual segments per host.
When you expand your cluster, you will not need to redistribute the data. The database will automatically increase the total number of virtual segments if needed too.
hawq_rm_nvseg_perquery_perseg_limit is the GUC that determines how many possible virtual segments will be created per query per segment. By default, this is set to 6 but you can increase or decrease it. hawq_rm_nvseg_perquery_limit is another GUC that is important here. It defaults to 512 and controls the total number of virtual segments that can execute for a query cluster wide.
So in summary, in HAWQ with Random Distribution:
- Recommended storage technique
- Adding nodes doesn't require redistribution of the data
- Removing nodes doesn't require redistribution of the data
- hawq_rm_nvseg_perquery_perseg_limit can be increased from 6 to higher values to increase parallelism.
- hawq_rm_nvseg_perquery_limit may need to be increased from 512 to a higher value. It specifies the total number of virtual segments across the entire cluster per query.
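How the two GUCs interact can be sketched as a simple upper bound. This is my reading of the description above, not HAWQ's actual resource manager logic; max_virtual_segments is a hypothetical helper, and the optimizer may well choose fewer virtual segments than this.

```python
def max_virtual_segments(num_hosts, perseg_limit=6, perquery_limit=512):
    # perseg_limit mirrors hawq_rm_nvseg_perquery_perseg_limit (default 6).
    # perquery_limit mirrors hawq_rm_nvseg_perquery_limit (default 512).
    # Assumption: the per-host cap and the cluster-wide cap both apply,
    # so the smaller of the two wins.
    return min(num_hosts * perseg_limit, perquery_limit)

print(max_virtual_segments(10))   # 60
print(max_virtual_segments(100))  # 512, the cluster-wide cap applies
```

On large clusters the cluster-wide limit becomes the binding one, which is why it may need to be raised along with the per-segment limit.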

How to split data vertically instead of horizontally?

I want to cluster and split (using Hadoop) a dataset with some 60K features (dimensions a.k.a. columns). This dataset has very few instances -- about 100 rows. Instead of splitting data horizontally, I want to split according to feature clusters. For instance, if I get 3 clusters, I want each cluster to have 20K columns and 100 rows, to run on 3 different nodes.
How to achieve this kind of split? Failing that, can you provide any suggestions for a framework other than Hadoop to facilitate this split?
First of all: with this tiny data set (60k*100 that is a few megabytes), MapReduce is a very bad choice. You get massive overhead, at zero benefit. Don't use Hadoop if your data fits into main memory! Even Excel will be faster.
Apart from that, you can obviously convert from row storage to column storage by switching your row and your column identifiers in the map step:
def map_row(key, row):
    # Re-key every cell by its column id so the shuffle groups cells by column.
    for column, value in row:
        yield column, (key, value)
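To make the effect of that re-keying concrete, here is a small in-memory sketch with no Hadoop involved. The transpose helper and the sample data are made up for illustration, and a plain dict plays the role of the shuffle.

```python
from collections import defaultdict

def map_row(key, row):
    # Emit each cell keyed by its column id instead of its row id.
    for column, value in row:
        yield column, (key, value)

def transpose(rows):
    # Stand-in for the shuffle phase: group the emitted cells by column.
    columns = defaultdict(dict)
    for row_id, row in rows.items():
        for column, (rid, value) in map_row(row_id, row):
            columns[column][rid] = value
    return dict(columns)

rows = {
    "r1": [("a", 1), ("b", 2)],
    "r2": [("a", 3), ("b", 4)],
}
print(transpose(rows))
# {'a': {'r1': 1, 'r2': 3}, 'b': {'r1': 2, 'r2': 4}}
```

After the transposition each key is a feature (column), so features can be grouped and sent to different nodes.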

Hbase table duplication

Is there a way to duplicate table data on every node of a cluster?
I need to do a performance test with the maximum grade of locality of the data.
By default, HBase distributes data on a small fraction of the cluster nodes (on 1 or 2 nodes), maybe because my data isn't very big-data ( ~ 2 GB ).
I know that Hbase is designed for much larger data sets, but in this case, it is a requirement for me.
There are a lot of nice reads* about it (see the end of the post) but I'll try to explain it with my own words ;)
HBase is not responsible for data replication; Hadoop HDFS is. By default HDFS is configured with a replication factor of 3, which means all data is stored on 3 nodes.
Data locality is a key aspect of good performance, and achieving maximum data locality is easy: you only need to colocate your HBase RegionServers (RS) with the Hadoop DataNodes (DN), so every DN should also have the RS role. Once you have that, HBase will automatically move the data to where it's needed (on major compactions) to achieve data locality. As long as each RS holds the data of the regions it serves locally, you'll have data locality.
Even when the data is replicated to multiple DN, each region (and the rows it contains) is served by just one RS; it doesn't matter whether you have a replication factor of 3, 10 or 100. Reading a row belonging to region #1 will always hit the same RS, the one that hosts the region (which reads the data locally from HDFS because of data locality). If the RS hosting that region goes down, the region is automatically assigned to another RS (which is possible because the data is also replicated to other DN).
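If it helps to picture what a locality percentage means, it can be thought of as the share of a region's bytes that have a replica on the host serving the region. The sketch below is only an illustration of that idea under my own assumptions: region_locality and the block tuples are hypothetical, and this is not how you would query HBase.

```python
def region_locality(blocks, regionserver_host):
    # blocks: list of (size_bytes, hosts_holding_a_replica) for a region's files.
    total = sum(size for size, _ in blocks)
    if total == 0:
        return 0.0  # sketch choice: report 0 for a region without data
    local = sum(size for size, hosts in blocks if regionserver_host in hosts)
    return local / total

blocks = [
    (128, {"n1", "n2", "n3"}),  # has a replica on n1
    (128, {"n2", "n3", "n4"}),  # no replica on n1
]
print(region_locality(blocks, "n1"))  # 0.5
print(region_locality(blocks, "n2"))  # 1.0
```

With a replication factor of 3 only three hosts hold each block, so a region served from any other host reads those blocks over the network.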
What you can do is split your table so that each RS has even buckets of rows (regions) assigned to it, so that as many different RS as possible work simultaneously when you read or write data. This increases your overall throughput as long as you don't always hit the same regions (called regionserver hotspotting**).
Therefore, you should always start by ensuring that all the regions of your table are assigned to different RS and that they receive the same volume of R/W requests. Once you've done that, you can split your table into more regions until you have an even number of regions on all the RS of your cluster (you may need to assign them manually if you're not happy with the load balancer).
Just remember that even when you seem to have a perfect distribution of regions you can still have poor performance if your data access pattern is not right (or is uneven) and doesn't reach all regions evenly. In the end it all depends on your application.
(*) Recommended reads:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
(**) To avoid RS hotspotting we always design our tables to have non-monotonically increasing row keys, so rows 1, 2, 3 ... N are hosted in different regions. The common approach is to use MD5(id) + id as the rowkey. This approach has its own set of drawbacks: you can no longer scan the first 10 rows in id order, because the salted keys are scattered across regions.
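A minimal sketch of the MD5(id) + id rowkey mentioned in the footnote. Using the hex digest as the prefix is my assumption (the raw 16 bytes would work as well), and salted_rowkey is a hypothetical helper name.

```python
import hashlib

def salted_rowkey(row_id):
    # Prefix the id with its own MD5 hex digest so that consecutive ids
    # no longer sort next to each other.
    prefix = hashlib.md5(row_id.encode("utf-8")).hexdigest()
    return prefix + row_id

for row_id in ("1", "2", "3"):
    print(salted_rowkey(row_id))
```

A point lookup still works because the client can recompute the prefix from the id, but a range scan over consecutive ids is no longer possible.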

What is the difference between partitioning and bucketing a table in Hive ?

I know both are performed on a column in the table, but how does each operation differ?
Partitioning is often used to distribute load horizontally; it improves performance and helps organize data logically. For example, suppose we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For a faster query response, the Hive table can be PARTITIONED BY (country STRING, DEPT STRING). Partitioning changes how Hive structures the data storage: Hive will now create subdirectories reflecting the partitioning structure, like
.../employees/country=ABC/DEPT=XYZ.
If a query is limited to employees from country=ABC, it will only scan the contents of the one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning is very useful in Hive; however, a design that creates too many partitions may optimize some queries but be detrimental to other important queries. Another drawback of having too many partitions is the large number of Hadoop files and directories that are created unnecessarily, and the overhead on the NameNode, since it must keep all metadata for the file system in memory.
Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. If we instead bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of distinct employee_id values is much greater than the number of buckets, each bucket will hold many employee_id values. When creating the table you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS; where XX is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids efficient map-side joins.
There are a few details missing from the previous explanations.
To better understand how partitioning and bucketing work, you should look at how data is stored in Hive.
Let's say you have a table
CREATE TABLE mytable (
name string,
city string,
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
then hive will store data in a directory hierarchy like
/user/hive/warehouse/mytable/year=2015/month=12/day=02
So, you have to be careful when partitioning, because if you for instance partition by employee_id and you have millions of employees, you'll end up having millions of directories in your file system.
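Hive names each partition directory column=value, nested in the order the partition columns are declared. A small sketch of how such a path is formed and how quickly the directory count grows; partition_path and leaf_directory_count are hypothetical helpers, not Hive functions.

```python
def partition_path(table_dir, partition_values):
    # partition_values: ordered (column, value) pairs, outermost partition first.
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)

def leaf_directory_count(cardinalities):
    # Upper bound on leaf directories: product of the distinct values per column.
    count = 1
    for c in cardinalities:
        count *= c
    return count

print(partition_path("/user/hive/warehouse/mytable",
                     [("year", "2015"), ("month", "12"), ("day", "02")]))
# /user/hive/warehouse/mytable/year=2015/month=12/day=02
print(leaf_directory_count([1, 12, 31]))  # 372: at most this many per year
```

The second helper shows why a high-cardinality column hurts: partitioning by a column with millions of distinct values multiplies the directory count by millions.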
The term 'cardinality' refers to the number of possible values a field can have. For instance, if you have a 'country' field, there are about 200 countries in the world, so the cardinality would be ~200. For a field like 'timestamp_ms', which changes every millisecond, cardinality can be in the billions. In general, when choosing a field for partitioning, it should not have a high cardinality, because you'll end up with way too many directories in your file system.
Clustering, aka bucketing, on the other hand results in a fixed number of files, since you specify the number of buckets. What Hive does is take the field, calculate a hash and assign the record to the corresponding bucket.
But what happens if you use, let's say, 256 buckets and the field you're bucketing on has a low cardinality (for instance, it's a US state, so there can be only 50 different values)? You'll have at most 50 buckets with data, and at least 206 buckets with no data.
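The low-cardinality case can be checked with a quick simulation. MD5 here is an assumed stand-in for Hive's own hash function, and the 50 values are made-up placeholders, so the exact count will differ from what Hive would produce.

```python
import hashlib

def bucket_of(value, num_buckets):
    # Assumption: MD5 stands in for Hive's hash; only "hash mod buckets" matters.
    h = int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")
    return h % num_buckets

def non_empty_buckets(values, num_buckets):
    # Distinct bucket numbers actually hit by the given values.
    return len({bucket_of(v, num_buckets) for v in values})

states = [f"state_{i}" for i in range(50)]  # 50 hypothetical distinct values
used = non_empty_buckets(states, 256)
print(used, 256 - used)  # never more than 50 used, never fewer than 206 empty
```

Two values can also collide in the same bucket, so the number of non-empty buckets may be lower than the number of distinct values.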
Someone already mentioned how partitions can dramatically cut the amount of data you're querying. So in my example table, if you want to query only from a certain date forward, the partitioning by year/month/day is going to dramatically cut the amount of IO.
I think that somebody also mentioned how bucketing can speed up joins with other tables that have exactly the same bucketing, so in my example, if you're joining two tables on the same employee_id, hive can do the join bucket by bucket (even better if they're already sorted by employee_id since it's going to mergesort parts that are already sorted, which works in linear time aka O(n) ).
So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
Also, you can partition on multiple fields, with an order (year/month/day is a good example). You can bucket on several fields too, but they are hashed together into a single bucket number, so bucketing has no such hierarchy.
Before going into Bucketing, we need to understand what Partitioning is. Let us take the below table as an example. Note that I have given only 12 records in the below example for beginner level understanding. In real-world scenarios you might have millions of records.
PARTITIONING
---------------------
Partitioning is used to improve query performance. For example, if we run the SQL below against the above table, it needs to scan all the records in the table, which reduces performance and increases overhead.
select * from sales_table where product_id='P1'
To avoid a full table scan and read only the records with product_id='P1', we can partition the table (split the Hive table's files) based on the product_id column. The table's data is then split into two directories, one for product_id='P1' and the other for product_id='P2'. Now when we execute the above query, it scans only the product_id='P1' directory.
../hive/warehouse/sales_table/product_id=P1
../hive/warehouse/sales_table/product_id=P2
The syntax for creating the partition is given below. Note that we should not use the product_id column definition along with the non-partitioned columns in the below syntax. This should be only in the partitioned by clause.
create table sales_table(sales_id int,trans_date date, amount int)
partitioned by (product_id varchar(10))
Cons: We should be very careful while partitioning. It should not be used on columns with very few repeating values, i.e. many distinct values (especially primary key columns), as that increases the number of partition directories and the overhead on the NameNode.
BUCKETING
------------------
Bucketing is used to overcome the cons mentioned in the partitioning section. It should be used when a column has very few repeating values (for example, a primary key column). This is similar to the concept of an index on a primary key column in an RDBMS. In our table, we can use the Sales_Id column for bucketing. It is useful when we need to query on the sales_id column.
Below is the syntax for bucketing.
create table sales_table(sales_id int,trans_date date, amount int)
partitioned by (product_id varchar(10)) Clustered by(Sales_Id) into 3 buckets
Here we further split the data into a few more files on top of the partitions.
Since we have specified 3 buckets, the data for each product_id is split into 3 files. Hive internally uses the modulo operator to determine which bucket each sales_id is stored in. For example, for product_id='P1', sales_id=1 is stored in the 000001_0 file (i.e. 1%3=1), sales_id=2 in the 000002_0 file (i.e. 2%3=2), sales_id=3 in the 000000_0 file (i.e. 3%3=0), and so on.
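The modulo rule above can be written out as a tiny sketch. It assumes non-negative integer ids and the file naming used in this answer; Hive's real hash depends on the column type and Hive version, and bucket_file is a hypothetical helper.

```python
def bucket_file(sales_id, num_buckets):
    # For a non-negative integer id the bucket is simply id modulo bucket count.
    bucket = sales_id % num_buckets
    # Bucket files are named with a zero-padded bucket number, e.g. 000001_0.
    return f"{bucket:06d}_0"

for sales_id in (1, 2, 3, 4):
    print(sales_id, bucket_file(sales_id, 3))
# 1 000001_0
# 2 000002_0
# 3 000000_0
# 4 000001_0
```

Every partition directory gets its own set of these three files, so the same sales_id always lands in the same bucket number within its partition.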
I think I am late in answering this question, but it keeps coming up in my feed.
Navneet has provided an excellent answer. I am adding to it visually.
Partitioning helps eliminate data when the partition column is used in the WHERE clause, whereas bucketing organizes the data in each partition into multiple files, so that the same set of data is always written to the same bucket. This helps a lot when joining on the bucketed column.
Suppose you have a table with five columns: name, server_date, some_col3, some_col4 and some_col5. Suppose you have partitioned the table on server_date and bucketed it on the name column into 10 buckets. Your file structure will then look something like this:
server_date=xyz
000000_0
000001_0
000002_0
........
000009_0
Here server_date=xyz is the partition and the 00000N_0 files are the buckets in that partition. Buckets are computed with a hash function, so rows with name=Sandy will always go to the same bucket.
Hive Partitioning:
Partition divides large amount of data into multiple slices based on value of a table column(s).
Assume that you are storing information about people all over the world, spread across 196+ countries and spanning around 500 crore (5 billion) entries. If you want to query people from a particular country (Vatican City), then in the absence of partitioning you have to scan all 5 billion entries even to fetch the thousand entries of one country. If you partition the table by country, you can fine-tune the query by checking the data for only one country partition. Hive creates a separate directory for each value of the partition column(s).
Pros:
Distribute execution load horizontally
Faster execution of queries in case of partition with low volume of data. e.g. Get the population from "Vatican city" returns very fast instead of searching entire population of world.
Cons:
Possibility of too many small partition creations - too many directories.
Effective for low volume data for a given partition. But some queries like group by on high volume of data still take long time to execute. e.g. Grouping of population of China will take long time compared to grouping of population in Vatican city. Partition is not solving responsiveness problem in case of data skewing towards a particular partition value.
Hive Bucketing:
Bucketing decomposes data into more manageable or equal parts.
With partitioning, you can end up creating many small partitions based on column values. With bucketing, you restrict the number of buckets that store the data. This number is defined in the table creation script.
Pros
Due to equal volumes of data in each bucket, joins at Map side will be quicker.
Faster query response like partitioning
Cons
You can define number of buckets during table creation but loading of equal volume of data has to be done manually by programmers.
The difference is that partitioning divides the data into directories by the value of a particular column, while bucketing divides it into a fixed number of files by the hash of a column.
Hopefully I defined it correctly
There are great responses here. I would like to keep it short to memorize the difference between partition & buckets.
You generally partition on a less unique column. And bucketing on most unique column.
Example if you consider World population with country, person name and their bio-metric id as an example. As you can guess, country field would be the less unique column and bio-metric id would be the most unique column. So ideally you would need to partition the table by country and bucket it by bio-metric id.
Using partitions in a Hive table is highly recommended for the reasons below:
- Inserts into the Hive table should be faster (as multiple threads write data to the partitions)
- Queries on the Hive table should be efficient, with low latency.
Example :-
Assume that Input File (100 GB) is loaded into temp-hive-table and it contains bank data from across different geographies.
Hive table without Partition
Insert into Hive table Select * from temp-hive-table
/hive-table-path/part-00000-1 (part size ~ hdfs block size)
/hive-table-path/part-00000-2
....
/hive-table-path/part-00000-n
The problem with this approach is that it will scan the whole data set for any query you run on this table. Response time will be high compared to the other approaches, where partitioning and bucketing are used.
Hive table with Partition
Insert into Hive table partition(country) Select * from temp-hive-table
/hive-table-path/country=US/part-00000-1 (file size ~ 10 GB)
/hive-table-path/country=Canada/part-00000-2 (file size ~ 20 GB)
....
/hive-table-path/country=UK/part-00000-n (file size ~ 5 GB)
Pros - Here one can access data faster when it comes to querying data for specific geography transactions.
Cons - Inserting/querying data can further be improved by splitting data within each partition. See Bucketing option below.
Hive table with Partition and Bucketing
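Note: Create hive table ..... with "CLUSTERED BY (some_column) INTO 5 BUCKETS", where some_column is a regular (non-partition) column; Hive does not allow bucketing on a partition column.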
Insert into Hive table partition(country) Select * from temp-hive-table
/hive-table-path/country=US/part-00000-1 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-2 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-3 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-4 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-5 (file size ~ 2 GB)
/hive-table-path/country=Canada/part-00000-1 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-2 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-3 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-4 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-5 (file size ~ 4 GB)
....
/hive-table-path/country=UK/part-00000-1 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-2 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-3 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-4 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-5 (file size ~ 1 GB)
Pros - Faster Insert. Faster Query.
Cons - Bucketing will create more files. There could be an issue with many small files in some specific cases.
Hope this will help !!
