Is it possible to get data from both Hadoop clusters? - hadoop

I have two Hadoop clusters and I am moving one big table from one cluster to the other.
But I do not have enough space to move the whole table and then open it to users. So while I am moving the table to the other cluster, I need to serve the table's data from both Hadoop clusters.
I know I can use Presto for this, but the table is too big and my Presto cluster does not have enough server RAM.
So is there any way to do this?
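One possible sketch of an answer (not from the thread; table, partition, and cluster names below are placeholders): copy the table partition by partition with `hadoop distcp`, and during the migration expose the not-yet-moved partitions through an external Hive table whose partition locations still point at the old cluster, so queries see the whole table without doubling the storage:

```sql
-- Hypothetical sketch: while files are copied with `hadoop distcp`,
-- an external table on the new cluster can point not-yet-moved
-- partitions back at the old cluster.
CREATE EXTERNAL TABLE big_table (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 'hdfs://new-cluster:8020/warehouse/big_table';

-- Partition already copied to the new cluster:
ALTER TABLE big_table ADD PARTITION (dt='2023-01-01')
LOCATION 'hdfs://new-cluster:8020/warehouse/big_table/dt=2023-01-01';

-- Partition still on the old cluster:
ALTER TABLE big_table ADD PARTITION (dt='2023-01-02')
LOCATION 'hdfs://old-cluster:8020/warehouse/big_table/dt=2023-01-02';
```

As each partition finishes copying, its location can be switched to the new cluster with ALTER TABLE ... SET LOCATION and the old files deleted.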

Related

HDFS vs HIVE partitioning

This may be a simple thing, but I'm struggling to find the answer. When data is loaded into HDFS, it is split into blocks and distributed across multiple nodes.
Hive has a separate option to PARTITION the data. I'm pretty sure that even if you don't use the PARTITION option, the data will still be split and distributed across the nodes of the cluster when you load a Hive table. What additional benefit does this option give in that case?
Summarizing the comments, for Hadoop v1-v2.x:

- Logical partitioning, e.g. by a date or a string field as written in the comments above, is only possible in Hive, HCatalog, or another SQL or parallel engine working on top of Hadoop, using a file format that supports partitioning (Parquet, ORC, and CSV are fine, but e.g. XML is hard or nearly impossible to partition).
- Logical partitioning (as in Hive or HCatalog) can serve as a substitute for indexes, which HDFS does not have.
- 'Partitioning of HDFS storage' on local or distributed nodes is possible by defining the partitions during setup of HDFS; see https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cluster-planning/content/ch_partitioning_chapter.html
- HDFS is able to 'balance' or 'distribute' blocks over nodes.
- Natively, HDFS cannot split blocks and distribute them to folders according to their content; a block can only be moved as a whole to another node.
- Blocks (not files!) are replicated in the HDFS cluster according to the HDFS replication factor, which you can inspect with:

$ hdfs fsck /

(Thanks David and Kris for your discussion above, which also explains most of this; please take this post as a summary.)
HDFS partitioning: mainly deals with how files are stored on the nodes. For fault tolerance, file blocks are replicated across the cluster (according to the replication factor).
Hive partitioning: an optimization technique in Hive.
Inside a Hive database, we partition tables for better query performance.
Partitioning describes how the data is stored in Hive and how to read it.
Hive partitioning is controlled at the column level of the table data.
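To make the Hive side concrete (a minimal sketch; the table and column names are made up, not from the answer): a partitioned table stores each partition in its own HDFS directory, so a query filtering on the partition column only reads the matching directories instead of the whole table:

```sql
-- Each dt value gets its own directory under the table's location,
-- e.g. .../sales/dt=2023-01-01/
CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Only the dt=2023-01-01 directory is scanned (partition pruning);
-- without partitioning, every block of the table would be read.
SELECT SUM(amount) FROM sales WHERE dt = '2023-01-01';
```

This is the additional benefit the question asks about: HDFS distributes blocks physically, while Hive partitioning lets the query engine skip irrelevant data entirely.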

If you store something in HBase, can it be accessed directly from HDFS?

I was told HBase is a database that sits on top of HDFS.
But let's say you put some information into HBase and then use Hadoop.
Can you still access that information with MapReduce?
You can read data from HBase tables using MapReduce programs, Hive queries, or Pig scripts.
Here is an example for MapReduce.
Here is an example for Hive. Once you create the Hive table, you can run SELECT queries on top of the HBase table, and the data will be processed using MapReduce.
You can also easily integrate HBase tables with other Hadoop ecosystem tools such as Pig.
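A minimal sketch of the Hive route (the table name and column mapping are assumptions for illustration): Hive's HBase storage handler maps an existing HBase table onto a Hive table, after which ordinary SELECT queries run as MapReduce jobs over the HBase data:

```sql
-- Hypothetical mapping of an HBase table 'users' with column family 'cf'
CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING, age INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,cf:name,cf:age'
)
TBLPROPERTIES ('hbase.table.name' = 'users');

-- Executes as a MapReduce job that scans the HBase table:
SELECT name, age FROM hbase_users WHERE age > 30;
```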
Yes, HBase is a column-oriented database that sits on top of HDFS.
HBase is a database that stores its data in a distributed filesystem. The filesystem of choice is typically HDFS, owing to the tight integration between HBase and HDFS. Having said that, this doesn't mean HBase can't work on other filesystems; it just hasn't been proven in production and at scale with anything except HDFS.
HBase provides you with the following:

- Low-latency access to small amounts of data from within a large data set: you can quickly access single rows from a billion-row table.
- A flexible data model, with data indexed by the row key.
- Fast scans across tables.
- Scalability in terms of writes as well as total volume of data.

Realizing different distribution models in hdfs?

As far as I understand from the Hadoop tutorial, HDFS takes the input files, divides them into blocks/chunks, and replicates these blocks on different nodes. However, I want to realize a data distribution model according to the requirements below:
(a) Case one: each file is partitioned equally across the nodes in the cluster, so that each map task gets its partition of the table to access. Is that possible?
(b) Case two: each file is fully replicated on two or more nodes, but not on all nodes, so that each map task accesses some part of the table on each node. Is that possible?
HDFS does not store tables; it stores files. Higher-level projects such as Hive offer relational tables. Hive does allow you to partition a table stored on HDFS; see the Hive Tutorial.
That being said, you should not tie partitioning to the number of nodes in the cluster. Nodes come and go, and clusters grow and shrink. Partitioned relational tables partition/bucket by natural boundaries without regard to cluster size. Import, export, and daily operations all play a role in partitioning (and usually a much bigger role than cluster size). Even a single table (file) can spread well across every node of the cluster.
If you want to tune an MR job for optimal split size/location, there are plenty of ways to do that. You still have a lot to read; you are optimizing too early.
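For what it's worth, the closest native knobs HDFS offers (illustrative commands only; the paths are placeholders, and neither fully implements case (a) or (b)) are per-file replication factor and block size:

```
$ hdfs dfs -setrep 3 /data/myfile                            # keep 3 copies of each block
$ hdfs dfs -D dfs.blocksize=268435456 -put bigfile /data/    # write with 256 MB blocks
$ hdfs fsck /data/myfile -files -blocks -locations           # see where the blocks landed
```

Setting the replication factor to the number of copies you want approximates case (b) at the block level, but which nodes receive the replicas remains up to HDFS block placement.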

Does an HBase cluster need multiple data nodes?

According to the official HBase tutorial, when configuring a distributed HBase cluster you need to set the hbase.rootdir property in the hbase-site.xml file to point to the HDFS cluster address, and all HBase data is then saved on HDFS. In this case, does the HBase cluster need multiple data nodes?
HBase does not require any particular number of DataNodes. But the greater the number of DataNodes, the better the performance and availability you get, because replication actually takes place at the HDFS level. So if you have just one node, all your HBase data will go there, and if that server is down you are in the middle of nowhere. There are several advantages to having multiple DataNodes, though, such as a lighter and balanced load on the machines (versus all the load on a single machine), higher parallelism, high availability, and so on.
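For reference, the hbase.rootdir setting the question mentions looks like this in hbase-site.xml (the NameNode address and port are placeholders):

```xml
<!-- hbase-site.xml: point HBase at the HDFS cluster -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.com:8020/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
```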

How does HBase distribute new regions from MapReduce across the cluster?

My situation is the following: I have a 20-node Hadoop/HBase cluster with 3 ZooKeepers. I do a lot of processing of data from HBase tables to other HBase tables via MapReduce.
Now, if I create a new table, and tell any job to use that table as an output sink, all of its data goes onto the same regionserver. This wouldn't surprise me if there are only a few regions. A particular table I have has about 450 regions and now comes the problem: Most of those regions (about 80%) are on the same region server!
I was wondering how HBase distributes the assignment of new regions throughout the cluster, and whether this behaviour is normal/desired or a bug. Unfortunately, I don't know where to start looking for a bug in my code.
The reason I ask is that this makes jobs incredibly slow. Only when the jobs are completely finished does the table get balanced across the cluster, but that does not explain this behaviour. Shouldn't HBase distribute new regions to different servers at the moment of creation?
Thanks for your input!
I believe that this is a known issue. Currently HBase distributes regions across the cluster as a whole without regard for which table they belong to.
Consult the HBase book for background:
http://hbase.apache.org/book/regions.arch.html
It could be that you are on an older version of HBase:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/19155
See the following for a discussion of load balancing and region moving
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/12549
By default, the balancer just balances regions across each RegionServer without taking the table into account.
You can set hbase.master.loadbalance.bytable to change that.
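That setting goes in hbase-site.xml, along these lines (a sketch; check your HBase version's documentation, since the default and behaviour have varied across releases):

```xml
<!-- Balance regions per table rather than across the cluster as a whole -->
<property>
  <name>hbase.master.loadbalance.bytable</name>
  <value>true</value>
</property>
```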
