Realizing different distribution models in hdfs? - hadoop

As far as I understand from the Hadoop tutorial, HDFS takes the overall size of the input files, divides them into blocks/chunks, and then replicates these blocks on different nodes. However, I want to realize a data distribution model according to the requirements below:
(a) Case one: each file is partitioned equally across the nodes in the cluster,
so that each map task accesses its own partition of the table. Is this possible?
(b) Case two: each file is fully replicated on two or more nodes, but not on all nodes,
so that each map task accesses some part of the table on each node. Is this possible?

HDFS does not store tables, it stores files. Higher-level projects such as Hive offer 'relational tables'. Hive does allow you to partition a table stored on HDFS; see the Hive Tutorial.
That being said, you should not tie partitioning to the number of nodes in the cluster. Nodes come and go; clusters grow and shrink. Partitioned relational tables partition/bucket by natural boundaries, without relation to cluster size. Import, export, and daily operations all play a role in partitioning (and usually a much bigger role than cluster size). Even a single table (file) can be spread across every node of the cluster.
If you want to tune an MR job for optimal split size/location, there are plenty of ways to do that. But you still have a lot to read, and you are optimizing too early.
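To make the point about a single file spreading over the cluster concrete, here is a minimal sketch. It assumes a 128 MiB block size and a replication factor of 3, and it uses a toy round-robin placement; real HDFS placement is rack-aware and is not round-robin. The function names are hypothetical and are not part of any Hadoop API.

```python
def split_into_blocks(file_size, block_size):
    """Return the block lengths for one file: full blocks plus a shorter tail."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, nodes, replication):
    """Toy placement (assumption): block i goes to `replication` consecutive nodes."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    return {
        block: [nodes[(block + k) % len(nodes)] for k in range(replication)]
        for block in range(num_blocks)
    }


MIB = 1024 * 1024
blocks = split_into_blocks(1024 * MIB, 128 * MIB)  # a 1 GiB file
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(len(blocks), nodes, replication=3)
nodes_used = {node for replicas in placement.values() for node in replicas}

print(len(blocks), "blocks on", len(nodes_used), "nodes")
```

Even in this simplified model a single file ends up with blocks on every node, which is why the partitioning of a table does not need to mirror the node count.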

Related

Try to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some processing on my HBase dataset, but I'm pretty new to the HBase and Hadoop ecosystem.
I would like to get some feedback from this community to see if my understanding of HBase and MapReduce operations on it is correct.
Some background:
1. We have an HBase table that is about 1 TB and exceeds 100 million records.
2. It has 3 region servers, and each region server contains about 80 regions, for a total of 240 regions.
3. The records in the table should be distributed fairly uniformly across the regions, from what I know.
And what I'm trying to achieve is that I could filter out rows based on some column values, and export those rows to HDFS filesystem or something like that.
For example, we have a column named "type" and it might contain value 1 or 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that have records of type 1, 2, 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experiments, and could get the result I want. But I'm not sure I understand the behavior of HBase TableMapper and Scan, and that is crucial for our code's performance, as our dataset is really large.
To simplify the issue, I will take the official RowCounter implementation as an example, and I would like to confirm that my understanding is correct.
So my questions about HBase with MapReduce are:
In the simplest form of RowCounter (without any optional argument), it is actually a full table scan. HBase iterates over all records in the table, and emits the row to the map method in RowCounterMapper. Is this correct?
The TableMapper will divide the work based on how many regions the table has. For example, if our HBase table has only 1 region, there will be only 1 map task, which is effectively a single thread and does not use any of the parallel processing of our Hadoop cluster?
If the above is correct, is it possible to configure HBase to spawn multiple tasks for a region? For example, when we run RowCounter on a table that has only 1 region, could it still have 10 or 20 tasks counting the rows in parallel?
Since TableMapper also depends on Scan operation, I would also like to confirm my understanding about the Scan operation and performance.
If I use setStartRow / setStopRow to limit the scope of my dataset, then since the rowkey is indexed, performance is not hurt, because this does not trigger a full table scan.
In our case, we might need to filter our data based on modification time, so we might use scan.setTimeRange() to limit the scope of our dataset. My question is: since HBase does not index the timestamp, will this scan become a full table scan, with no advantage over simply filtering in our MapReduce job itself?
Finally, we have had some discussion on how we should do this export. We have the following two approaches, but are not sure which one is better.
1. Use the MapReduce approach described above. But we are not sure whether the parallelism will be bound by how many regions a table has, i.e., the concurrency never exceeds the region count, and we cannot improve performance unless we increase the number of regions.
2. Maintain a rowkey list in a separate place (perhaps on HDFS), use Spark to read the file, and then fetch each record with a simple Get operation. All the concurrency occurs on the Spark / Hadoop side.
I would like to have some suggestions about which solution is better from this community, it will be really helpful. Thanks.
It seems you have a very small cluster. Scalability also depends on the number of region servers (RS), so merely increasing the number of regions in the table without adding region servers won't really speed up the job. I think 80 regions per RS for that table is decent enough.
I am assuming you are going to use TableInputFormat. It runs one mapper per region and applies server-side filtering based on the Scan object. I agree that scanning with TableInputFormat is the best approach for exporting a large amount of data from HBase, but scalability and performance are not simply proportional to the number of regions. Many other factors matter, such as the number of RS, the RAM and disk on each RS, and how uniformly the data is distributed.
In general, I would go with #1, since you just need to prepare a Scan object and HBase will take care of the rest.
#2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
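As a rough illustration of the "one mapper per region" behaviour, here is a toy model of how a start/stop row range narrows the set of regions a scan touches, and therefore the number of map tasks. This is not HBase code. The region boundaries and the function name are made up for the example, and an empty key stands for an open-ended boundary.

```python
def regions_for_scan(regions, start_row=b"", stop_row=b""):
    """Return the regions whose key range overlaps the scan range [start_row, stop_row).

    `regions` is a list of (start_key, end_key) byte strings; b"" means open-ended.
    """
    selected = []
    for region_start, region_end in regions:
        # Skip regions that end at or before the first row of the scan.
        if region_end != b"" and region_end <= start_row:
            continue
        # Skip regions that begin at or after the (exclusive) stop row.
        if stop_row != b"" and region_start >= stop_row:
            continue
        selected.append((region_start, region_end))
    return selected


# Hypothetical table with four regions covering the whole key space.
regions = [(b"", b"g"), (b"g", b"n"), (b"n", b"t"), (b"t", b"")]

full_scan = regions_for_scan(regions)
ranged_scan = regions_for_scan(regions, start_row=b"h", stop_row=b"p")

print("map tasks for full scan:", len(full_scan))
print("map tasks for ranged scan:", len(ranged_scan))
```

Under this model a full scan gets one task per region, while a key range only produces tasks for the regions it overlaps. A time-range restriction does not prune regions this way, because regions are split by rowkey and not by timestamp.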

Strange replication in Cassandra

I have configured 3 nodes locally in one 'Test Cluster' of Cassandra. When I run them and create a keyspace or table, the keyspace or table appears on all three nodes.
The problem I'm dealing with is that when I import millions of rows from CSV into the table I already built, the whole data set suddenly appears on all three nodes. I have the same data replicated over the three nodes.
As far as I know, the data I'm importing should be distributed over the nodes only partially: one partition on the first node, the second on the third, the third on the second node, the fourth on the first node again, and so on.
Am I right, or am I missing something big?
Also, my write speed locally is about 10k rows / second for the multi-node cluster. Isn't that a little bit too low?
I want to start a discussion so I can maybe learn something from your experience and see where I'm going wrong.
Thank you!
The number of nodes that data is written to in your cluster is determined by the replication factor for that keyspace. If you have 3 nodes and the data is being written to all of them, then this setting must be set to 3. If you only want the data to be replicated to two nodes, you'd set this value to two.
Your write speed will be affected by the consistency level you are specifying on the write. If you have it set to ALL then you have to wait until all the nodes that are going to write the data have written the data (in your case all 3 nodes based on your replication factor). Dropping your consistency level on the write will probably net you faster write times. There is a balance between your replication factor, write consistency level, and read consistency level that you can research further.
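Here is a small sketch of why a replication factor of 3 on a 3-node cluster puts every row on every node. It is a simplified model, not Cassandra's actual partitioner: the token is an MD5 hash of the key taken modulo the node count, and replicas go to the next nodes in ring order. All names are hypothetical.

```python
import hashlib


def replicas_for(partition_key, nodes, replication_factor):
    """Toy ring placement: hash the key to a primary node, then walk the ring."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    token = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    primary = token % len(nodes)
    return [nodes[(primary + k) % len(nodes)] for k in range(replication_factor)]


def rows_per_node(keys, nodes, replication_factor):
    """Count how many rows each node ends up storing."""
    counts = {node: 0 for node in nodes}
    for key in keys:
        for node in replicas_for(key, nodes, replication_factor):
            counts[node] += 1
    return counts


nodes = ["node1", "node2", "node3"]
keys = [f"row-{i}" for i in range(1000)]

for rf in (1, 2, 3):
    print("RF =", rf, rows_per_node(keys, nodes, rf))
```

With a replication factor of 1 the rows are divided among the nodes; with 3 each node holds all 1000 rows, which matches what the question describes.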

Is there a relation between the number of partitions/Buckets in a hive table and the number of map tasks it launches for any operation on this data?

I know that the number of map tasks is the same as the number of input splits given by the input format. When performing an operation on a partitioned or bucketed hive table how does the InputFormat class calculate input Splits as the data is in the form of files in a directory for partitioned or bucketed data? Is there any relation between the input splits(number of map tasks) and the number of partitions or buckets?
The short answer is 'sort of'. Hive delegates the splits to Hadoop, and Hadoop only cares about how much data is in the partitions, not really about partitions and buckets. The amount of data depends indirectly on the number of partitions, so to answer your question more precisely: the number of splits does not depend directly on the number of partitions.
When executing a query, Hive by default uses CombineHiveInputFormat to make the splits,
which is essentially a wrapper around Hadoop's CombineFileInputFormat.
So Hive actually delegates to Hadoop how to make the splits.
Hadoop's CombineFileInputFormat groups smaller files together, so the splits follow the configured split size settings.
Note that it belongs to Hadoop, not Hive, and therefore has no knowledge of buckets; it just groups files based on their size and locality (node, rack, etc.), since it is better if a split sits entirely on the same node or at least on the same rack.
You can have a look at how the splits are created in the function getSplits here.
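To illustrate the idea of grouping small files into splits by size, here is a toy greedy packer. It is only a sketch of the principle: the real CombineFileInputFormat also takes node and rack locality into account and can cut large files at block boundaries, neither of which is modelled here. The function name and the 256 MB limit are assumptions for the example.

```python
def combine_into_splits(file_sizes, max_split_size):
    """Greedily pack files into splits without exceeding max_split_size per split."""
    splits = []
    current, current_size = [], 0
    for size in file_sizes:
        # Close the current split if adding this file would push it over the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


MAX_SPLIT_MB = 256  # hypothetical configured limit, in MB

many_small_buckets = [8] * 32  # 32 bucket files of 8 MB each
few_large_buckets = [64] * 4   # 4 bucket files of 64 MB each

print(len(combine_into_splits(many_small_buckets, MAX_SPLIT_MB)))
print(len(combine_into_splits(few_large_buckets, MAX_SPLIT_MB)))
```

Both layouts hold the same 256 MB, and in this model both produce the same number of splits, so the number of map tasks follows the data volume and not the bucket count.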

Hbase table duplication

Is there a way to duplicate table data on every node of a cluster?
I need to do a performance test with the maximum degree of data locality.
By default, HBase distributes data on a small fraction of the cluster nodes (1 or 2 nodes), maybe because my data set isn't very big (~2 GB).
I know that Hbase is designed for much larger data sets, but in this case, it is a requirement for me.
There are a lot of nice reads* about it (see the end of the post) but I'll try to explain it with my own words ;)
HBase is not responsible for data replication; HDFS is. By default HDFS is configured with a replication factor of 3, which means all data will be stored on at least 3 nodes.
Data locality is a key aspect of good performance, but achieving maximum data locality is easy: you only need to colocate your HBase RegionServers (RS) with the Hadoop DataNodes (DN), so every DN should also have the RS role. Once you have that, HBase will automatically move the data to where it's needed (on major compactions) to achieve data locality. As long as each RS has the data of the regions it serves locally, you'll have data locality.
Even when the data is replicated to multiple DNs, each region (and the rows it contains) is served by just one RS; it doesn't matter whether you have a replication factor of 3, 10 or 100. Reading a row belonging to region #1 will always hit the same RS, the one that hosts the region (which will read the data locally from HDFS because of data locality). If the RS hosting that region goes down, the region will be assigned to another RS automatically (because the data is also replicated to other DNs).
What you can do is split your table so that each RS has even buckets of rows (regions) assigned to it, so that as many different RS as possible work simultaneously when you read or write data. This increases your overall throughput as long as you don't always hit the same regions (called regionserver hotspotting**).
Therefore, you should always start by ensuring that all the regions of your table are assigned to different RS and that they receive the same volume of R/W requests. Once you've done that, you can split your table into more regions until you have an even number of regions on all the RS of your cluster (you may need to assign them manually if you're not happy with the load balancer).
Just remember that even when you seem to have a perfect distribution of regions, you can still have poor performance if your data access pattern is not right (or is uneven) and doesn't reach all regions evenly. In the end it all depends on your application.
(*) Recommended reads:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
(**) To avoid RS hotspotting we always design our tables to have non-monotonically increasing row keys, so that rows 1, 2, 3 ... N are hosted in different regions. The common approach is to use MD5(id) + id as the rowkey. This approach has its own set of drawbacks: you cannot scan the first 10 rows because they're salted.
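A minimal sketch of the MD5(id) + id rowkey could look like this. It assumes the hex digest is used as the prefix (the raw 16 digest bytes would work as well), and the function name is hypothetical.

```python
import hashlib


def salted_rowkey(record_id: str) -> bytes:
    """Build a rowkey as MD5(id) + id so that consecutive ids do not sort together."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()  # 32 hex characters
    return (digest + record_id).encode("utf-8")


ids = [f"user-{n:04d}" for n in range(1, 6)]
for rowkey in sorted(salted_rowkey(record_id) for record_id in ids):
    # Keys are ordered by digest; the original id is still readable after the prefix.
    print(rowkey.decode("utf-8"))
```

Because the key is derived from the id alone, a point lookup can recompute it; what is lost is the ability to range-scan rows in id order, which is the drawback mentioned above.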

Is a collocated join (a-la-netezza) theoretically possible in hive?

When you join tables that are distributed on the same key and use these key columns in the join condition, each SPU (machine) in Netezza works 100% independently of the others (see nz-interview).
In Hive there is the bucketed map join, but distributing the files that represent the tables to datanodes is the responsibility of HDFS; it is not done according to the Hive CLUSTERED BY key!
So suppose I have 2 tables, CLUSTERED BY the same key, and I join by that key. Can Hive get a guarantee from HDFS that matching buckets will sit on the same node? Or will it always have to move the matching bucket of the small table to the datanode containing the big table's bucket?
Thanks, ido
(note: this is a better phrasing of my previous question: How does hive/hadoop assures that each mapper works on data that is local for it?)
I think it is not possible to tell HDFS where to store blocks of data.
I can think of the following trick, which will do for small clusters: increase the replication factor for one of the tables to a number close or equal to the number of nodes in the cluster.
As a result, during the join the appropriate data will almost always (or always) be present on the required node.
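A back-of-the-envelope way to see "almost always (or always)": assume, as a simplification, that each block's replicas land on distinct nodes chosen uniformly at random (real HDFS placement is rack-aware, so this is only an approximation). Then the chance that a given node holds a replica of a given block is replication / nodes. The function name is hypothetical.

```python
from fractions import Fraction


def local_replica_probability(num_nodes, replication):
    """Probability that a given node holds a replica, under uniform random placement."""
    return Fraction(min(replication, num_nodes), num_nodes)


for replication in (3, 6, 8):
    p = local_replica_probability(8, replication)
    print(f"8 nodes, replication {replication}: {float(p):.2f}")
```

Only when the replication factor equals the node count does the probability reach 1, which is why the trick is practical only for small clusters and small tables.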
