Hive Tables in multiple nodes - Processing - hadoop

I have a conceptual doubt about Hive. I know that Hive is a data warehouse tool that runs on top of Hadoop, and that Hadoop has a distributed file system, HDFS.
Suppose I have one master and three slaves, and I have created a table, employees, in HiveQL. The table is so huge that it can't be stored on one machine, so it must be spread across all four machines. How can I load such data? Should it be done manually, or can I type "LOAD DATA ... " on the master and have it automatically distributed among all the machines?

Hive uses HDFS as its warehouse to store the data, so the HDFS storage model is what applies to your table's data.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Please refer to the HDFS architecture documentation for more detail.
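To make the block splitting described above concrete, here is a minimal Python sketch. It is a toy model, not HDFS code: the round-robin placement, the node names (slave1 to slave3) and the function names are assumptions made for illustration, and real HDFS placement is decided by the NameNode and is rack-aware.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(block_sizes, datanodes, replication=3):
    """Toy round-robin placement (assumption); maps block index -> list of nodes."""
    replication = min(replication, len(datanodes))
    placement = {}
    for i in range(len(block_sizes)):
        placement[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement


# A 300 MB file with a 128 MB block size becomes three blocks.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_blocks(blocks, ["slave1", "slave2", "slave3"], replication=2)

for index, nodes in placement.items():
    print(f"block {index}: {blocks[index] // MB} MB on {nodes}")
```

The point for the original question is that the client never decides where blocks go; loading the data through Hive or HDFS is enough for it to be spread over the DataNodes.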

Related

Why can't the metadata be stored in HDFS

Why can't the metadata be stored in HDFS with a replication factor of 3? Why is it stored on the local disk?
Because the NameNode would take more time on resource allocation due to the extra I/O operations. So it's better to keep the metadata in the NameNode's memory.
There are multiple reasons:
1. If it were stored on HDFS, there would be network I/O, which would be slower.
2. The NameNode would depend on the DataNodes for its metadata.
3. The NameNode would in turn need metadata about the metadata, so that it could identify where the metadata is on HDFS.
Metadata is data about the data, such as which rack a block is stored on, so that the block can be located. If the metadata were stored in HDFS and those DataNodes failed, you would lose all your data, because you would no longer know how to access the blocks where your data was stored.
Even if you raise the replication factor, every change on the DataNodes has to be made in the DataNode replicas as well as in the NameNode's edit log.
If the NameNode metadata also had 3 replicas, every change on a DataNode would first have to be applied in:
1. Its own replica blocks.
2. The NameNode and the replicas of the NameNode (the edit log is edited 3 times).
This would mean writing more data than before. But data storage is not the only or the main problem; the main problem is the time required to do all these operations.
Therefore the NameNode's metadata is backed up to a remote disk, so that even if the whole cluster fails (which is unlikely), you can still recover your data.
To protect against NameNode failure, Hadoop comes with:
Primary NameNode -> consists of the namespace image and edit logs.
Secondary NameNode -> merges the namespace image and edit logs so that the edit logs don't become too large.
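The merge of the namespace image and the edit log mentioned above can be sketched roughly as follows. This is a simplified Python model, not the real on-disk format: the operation names ("create", "delete", "rename") and the dictionary layout are assumptions chosen for illustration.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> block ids)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[args[0]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Replay the edit log onto a copy of the image; return (new image, empty log)."""
    new_image = dict(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/a": [1, 2]}
edit_log = [
    ("create", "/b", [3]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]

new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image, new_log)
```

After the checkpoint the log starts empty again, which is why the edit log does not grow without bound.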

Understanding how HBase uses HDFS

I’m trying to understand how HBase uses HDFS.
Here is what I understand (please correct me if I'm wrong):
I know that HBase uses HDFS to store data, that the data is split into regions, and that each region server may serve many regions. So I guess that one region server (exclusively) may communicate with many DataNodes to get and put data. If that is correct, then if that region server fails, the data stored on those DataNodes will not be accessible anymore.
Thank you in advance :)
In general, a Regionserver runs on a datanode.
Due to how HDFS works, the Regionserver will perform its reads and writes to the local datanode when possible, and then HDFS will ensure that the data is replicated onto two other random datanodes. So at all times, the data written by that regionserver is stored on 3 nodes in HDFS.
While a regionserver is serving a region, only it will read and write the data for that region, but if the regionserver process crashes, the HBase master will select another regionserver to serve that region. The data will be unavailable for a few minutes, but HBase will recover quickly.
If the entire host fails, then because HDFS ensured the data was written onto two other nodes, the scenario is the same: the master will select a new regionserver to open the failed region and the data will not be lost.
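As a rough illustration of the failover described above, here is a small Python sketch in which a "master" hands the regions of a failed regionserver to the remaining ones. The least-loaded policy, the function name and the server names are assumptions for the example; this is not how the HBase master's balancer is actually implemented.

```python
def reassign_regions(assignments, failed_server, live_servers):
    """Return a new region -> server map with the failed server's regions moved."""
    candidates = [s for s in live_servers if s != failed_server]
    if not candidates:
        raise RuntimeError("no live region servers left")

    # Count how many regions each surviving server already holds.
    load = {s: 0 for s in candidates}
    for server in assignments.values():
        if server in load:
            load[server] += 1

    new_assignments = dict(assignments)
    for region in sorted(assignments):
        if assignments[region] == failed_server:
            # Assumed policy: pick the least-loaded server, ties broken by name.
            target = min(candidates, key=lambda s: (load[s], s))
            new_assignments[region] = target
            load[target] += 1
    return new_assignments


assignments = {"r1": "rs1", "r2": "rs1", "r3": "rs2", "r4": "rs3"}
after = reassign_regions(assignments, "rs1", ["rs1", "rs2", "rs3"])
print(after)
```

Only the serving assignment changes; the region's files stay where they are in HDFS, which is why no data is lost.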

Copy files/chunks from HDFS to local file system of slave nodes

In Hadoop, I understand that the master node (NameNode) is responsible for storing the blocks of data on the slave machines (DataNodes).
When we use -copyToLocal or -get from the master, the files can be copied from HDFS to the local storage of the master node. Is there any way the slaves can copy the blocks (data) that are stored on them to their own local file system?
For example, a file of 128 MB could be split among 2 slave nodes storing 64 MB each. Is there any way for a slave to identify this chunk of data and load it into its local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case also? Please help.
Short Answer: No
The data/files cannot be copied directly from the DataNodes. The reason is that DataNodes store the data but don't have any metadata about the stored files. To them, the data is just blocks of bits and bytes. The metadata of the files is stored in the NameNode. This metadata contains all the information about the files (name, size, etc.). Along with this, the NameNode keeps track of which blocks of a file are stored on which DataNodes. The DataNodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
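A toy Python model of this division of labour may help. The dictionaries below are stand-ins invented for illustration (block ids, paths and node names are hypothetical): the "NameNode" holds the ordered block list per file, while each "DataNode" only holds anonymous block contents.

```python
def reassemble(path, namenode, datanodes):
    """Read a file's blocks in the order the NameNode records and join them."""
    data = b""
    for block_id, node in namenode[path]:
        data += datanodes[node][block_id]
    return data


# NameNode side: file path -> ordered list of (block id, node holding it).
namenode = {
    "/user/demo/file.txt": [("blk_1", "slave1"), ("blk_2", "slave2"), ("blk_3", "slave1")],
}

# DataNode side: each node only knows block id -> raw bytes, with no file
# names and no ordering information.
datanodes = {
    "slave1": {"blk_3": b"!", "blk_1": b"hello "},
    "slave2": {"blk_2": b"world"},
}

content = reassemble("/user/demo/file.txt", namenode, datanodes)
print(content)
```

Looking only at slave1's blocks, there is no way to tell which file they belong to or in which order they go, which is the reason the answer gives for going through the NameNode.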
Can the commands -copyToLocal or -get be used in this case also?
Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
What it doesn't do is a "short-circuit" copy, in which it would just copy the raw blocks between directories. There is also no guarantee it will read the blocks from the local machine at all, as your commandline client doesn't know its location.
HDFS blocks are stored only on the slaves' local file systems. You can dig into the directory defined by the property "dfs.datanode.data.dir" ("dfs.data.dir" in older Hadoop releases).
But you won't get any benefit from reading blocks directly (without the HDFS API). Also, reading and editing block files directly can corrupt the file on HDFS.
If you want to store data on a particular slave's local disk, you will have to implement your own logic for maintaining block metadata (which the NameNode already does for you).
Can you elaborate more why you want to distribute blocks by yourself when Hadoop takes care of all challenges faced in distributed data?
You can copy a particular file or directory from one slave to another slave by using distcp.
Usage: hadoop distcp slave1address slave2address

Localizing HFile blocks in HDFS

We use Mapreduce to bulk create HFiles that are then incrementally/bulk loaded into HBase. Something I have noticed is that the load is simply an HDFS move command (which does not physically move the blocks of the files).
Since we do a lot of HBase table scans and we have short circuit reading enabled, it would be beneficial to have these HFiles localized to their respective region's node.
I know that a major compaction can accomplish this, but those are inefficient when the HFiles are small compared to the region size.
HBase uses HDFS as its file system. HBase does not control the data locality of HDFS blocks.
When the HBase API is used to write data to HBase, the HBase RegionServer becomes a client of HDFS, and in HDFS, if the client node is also a DataNode, a local copy of each block is created. Hence, localityIndex is high when the HBase API is used for writes.
When bulk load is used, the HFiles are already present in HDFS, so HBase just makes those HFiles part of the regions. In this case data locality is not guaranteed.
If you really need high data locality, then rather than bulk load I would recommend using the HBase API for writes.
I have been using the HBase API to write to HBase from my MR job and it has worked well so far.
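To illustrate what a locality index measures, here is a small Python sketch. It assumes the index is the fraction of a region's HFile bytes that have a replica on the RegionServer's own host; the function name, block ids, host names and sizes are made up for the example and are not taken from HBase's code.

```python
def locality_index(block_sizes, block_locations, host):
    """Fraction of bytes whose block has at least one replica on `host`."""
    total = sum(block_sizes.values())
    if total == 0:
        return 0.0
    local = sum(
        size
        for block_id, size in block_sizes.items()
        if host in block_locations.get(block_id, ())
    )
    return local / total


# Sizes in MB; rs1 is the host running the RegionServer (hypothetical names).
block_sizes = {"b1": 64, "b2": 64, "b3": 32}

# Written through the HBase API: first replica lands on the writer's host.
api_write = {"b1": ["rs1", "dn2", "dn3"], "b2": ["rs1", "dn3", "dn4"], "b3": ["rs1", "dn2", "dn4"]}

# Bulk loaded: blocks sit wherever the MapReduce job happened to write them.
bulk_load = {"b1": ["rs1", "dn2", "dn3"], "b2": ["dn2", "dn3", "dn4"], "b3": ["dn2", "dn3", "dn4"]}

print(locality_index(block_sizes, api_write, "rs1"))
print(locality_index(block_sizes, bulk_load, "rs1"))
```

In this made-up layout the API-written files are fully local while the bulk-loaded ones are only partly local, which is the difference the answer describes.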

Hadoop HDFS dependency

In the Hadoop MapReduce programming model, when we are processing files, is it mandatory to keep the files in HDFS, or can I keep the files in other file systems and still have the benefit of the MapReduce programming model?
Mappers read input data from an implementation of InputFormat. Most implementations descend from FileInputFormat, which reads data from the local machine or HDFS (by default, data is read from HDFS and the results of the MapReduce job are stored in HDFS as well). You can write a custom InputFormat when you want your data to be read from an alternative data source other than HDFS.
TableInputFormat would read data records directly from HBase and DBInputFormat would access data from relational databases. You could also imagine a system where data is streamed to each machine over the network on a particular port; the InputFormat reads data from the port and parses it into individual records for mapping.
However, in your case, you have data in an ext4 filesystem on a single server or on multiple servers. In order to conveniently access this data within Hadoop you'd have to copy it into HDFS first. This way you will benefit from data locality when the file chunks are processed in parallel.
I strongly suggest reading the tutorial from Yahoo! on this topic for detailed information. For collecting log files for mapreduce processing also take a look at Flume.
You can keep the files elsewhere but you'd lose the data locality advantage.
For example, if you're using AWS, you can store your files on S3 and access them directly from MapReduce code, Pig, Hive, etc.
In order to use Apache Hadoop you must have your files in HDFS, the Hadoop file system. Though there are different abstract types of HDFS, like AWS S3, these are all at their basic level HDFS storage.
The data needs to be in HDFS because HDFS distributes the data across your cluster. During the mapping phase each Mapper goes through the data stored on its node and then sends it to the proper node running the reducer code for the given chunk.
You can't have Hadoop MapReduce without using HDFS.

Resources