Localizing HFile blocks in HDFS - hadoop

We use MapReduce to bulk-create HFiles that are then incrementally/bulk loaded into HBase. Something I have noticed is that the load is simply an HDFS move command, which does not physically move the blocks of the files.
Since we do a lot of HBase table scans and we have short-circuit reads enabled, it would be beneficial to have these HFiles localized to their respective region's node.
I know that a major compaction can accomplish this, but it is inefficient when the HFiles are small compared to the region size.

HBase uses HDFS as its file system; HBase does not control the data locality of HDFS blocks.
When the HBase API is used to write data, the HBase RegionServer acts as an HDFS client, and if the client node is also a datanode, HDFS places one replica of each block locally. Hence, the locality index is high when the HBase API is used for writes.
When bulk load is used, the HFiles are already present in HDFS, so HBase simply makes those HFiles part of the regions. In this case data locality is not guaranteed.
If you really need high data locality, then rather than bulk loading I would recommend using the HBase API for writes.
I have been using the HBase API to write to HBase from my MR jobs and it has worked well so far.
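For reference, here is a minimal sketch of writing through the HBase client API (the table name, column family, and values are made-up placeholders); when the RegionServer's node is also a datanode, these writes produce locally placed blocks:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseApiWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Writes go through the RegionServer, so HDFS can place one replica
            // of each new block on the RegionServer's own datanode.
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) {   // hypothetical table
                Put put = new Put(Bytes.toBytes("row-0001"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
                table.put(put);
            }
        }
    }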

Related

Can you use HDFS as your principal storage?

Is it reliable to save your data in Hadoop and consume it using Spark/Hive etc.?
What are the advantages of using HDFS as your main storage?
HDFS is only as reliable as the Namenode(s) that maintain the file metadata. You should set up Namenode HA and take frequent snapshots of the metadata, storing them externally, away from HDFS.
If all Namenodes are unavailable, or their metadata storage is corrupted, you will be unable to read the HDFS datanode data, even though the files themselves are fine and highly available.
Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).
Hive:
HDFS is a filesystem that supports fail-over and HA. HDFS replicates the data across several datanodes based on the replication factor you have chosen. Hive is built on top of Hadoop and therefore stores its data in HDFS as well, inheriting the HA properties of HDFS.
Hive utilizes predicate pushdown, which provides large performance benefits. Hive can also be combined with modern file formats such as Parquet and ORC, improving performance even more (again via predicate pushdown).
Hive provides very easy access to data via HQL (Hive Query Language), an SQL-like language.
Hive works very well with Spark, and you can combine the two, i.e., read Hive data into DataFrames and save DataFrames back into Hive.
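As a rough illustration of that last point, a minimal Spark-on-Java sketch of reading a Hive table into a DataFrame and writing an aggregate back (the table and column names are made-up placeholders, and Hive support is assumed to be available in the Spark build):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HiveSparkExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hive-spark-example")
                    .enableHiveSupport()            // lets Spark read and write Hive tables
                    .getOrCreate();

            // Read a Hive table into a DataFrame (hypothetical table name).
            Dataset<Row> events = spark.sql(
                    "SELECT user_id, action FROM events WHERE dt = '2024-01-01'");

            // Save an aggregated DataFrame back into Hive as a new table.
            events.groupBy("action").count()
                  .write().mode("overwrite").saveAsTable("events_by_action");

            spark.stop();
        }
    }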
HDFS/HBase:
Hive is a warehouse system used for data analysis, so Hive CRUD operations are relatively slower than direct access to HDFS files (or to HBase, which is built for fast CRUD operations). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than saving it through Hive. If you need fast storage (or fast insert queries) and you don't do analysis over large datasets, you should prefer HDFS/HBase over Hive.
If performance is crucial for your application, you may prefer to skip the extra layer of Hive and access the HDFS files directly.
Or the team simply decides not to use SQL.
Related post:
When to use Hadoop, HBase, Hive and Pig?

Does Hadoop HBase support self healing data blocks?

HDFS supports a mechanism often called 'self-healing'. As far as I understand, this means that when a file (or rather a data block) is written to HDFS, the block is replicated across a cluster of datanodes. HDFS verifies the consistency of the data blocks across all nodes and, when it detects a corrupted or missing replica, automatically replicates the data again into a new block. This is a feature I am looking for.
Now, HBase is based on HDFS. As far as I understand, HBase is optimized for random access to 'smaller' datasets (only a few MB). HBase also supports primary keys and a query language. This is also what I am looking for.
My question is: does HBase still benefit from the 'self-healing' feature of HDFS, or is this lost because of its different, relational-database-like approach?

Hadoop HDFS dependency

In the Hadoop MapReduce programming model, when we are processing files, is it mandatory to keep the files in HDFS, or can I keep the files in other file systems and still get the benefit of the MapReduce programming model?
Mappers read input data from an implementation of InputFormat. Most implementations descend from FileInputFormat, which reads data from the local machine or HDFS (by default, data is read from HDFS, and the results of the MapReduce job are stored in HDFS as well). You can write a custom InputFormat when you want your data to be read from an alternative data source instead of HDFS.
For example, TableInputFormat reads data records directly from HBase, and DBInputFormat accesses data from relational databases. You could also imagine a system where data is streamed to each machine over the network on a particular port; an InputFormat could read data from that port and parse it into individual records for mapping.
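As a rough sketch of the HBase case, a MapReduce driver can wire in TableInputFormat through TableMapReduceUtil so the mappers receive HBase rows instead of HDFS file splits (the table name is a made-up placeholder):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HBaseSourceJob {

        // Mapper that receives one HBase row (Result) per call instead of a line of text.
        static class RowMapper extends TableMapper<Text, LongWritable> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                    throws IOException, InterruptedException {
                context.write(new Text(Bytes.toString(rowKey.get())), new LongWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "read-from-hbase");
            job.setJarByClass(HBaseSourceJob.class);

            Scan scan = new Scan();          // full-table scan; restrict it in real jobs
            scan.setCaching(500);
            scan.setCacheBlocks(false);      // recommended for MapReduce scans

            // Wires TableInputFormat, the mapper, and its output types into the job.
            TableMapReduceUtil.initTableMapperJob(
                    "my_table", scan, RowMapper.class, Text.class, LongWritable.class, job);

            job.setNumReduceTasks(0);        // map-only example
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }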
However, in your case you have the data in an ext4 filesystem on one or more servers. In order to conveniently access that data within Hadoop, you'd have to copy it into HDFS first. This way you benefit from data locality when the file chunks are processed in parallel.
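Copying local files into HDFS can be done with the hdfs dfs -put shell command or programmatically; here is a minimal sketch with the Java FileSystem API (the paths are made-up placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyIntoHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            // Copy a local directory into HDFS; the blocks are then replicated across datanodes.
            fs.copyFromLocalFile(new Path("file:///data/logs"),      // hypothetical local path
                                 new Path("/user/hadoop/logs"));     // hypothetical HDFS path
            fs.close();
        }
    }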
I strongly suggest reading the Yahoo! tutorial on this topic for detailed information. For collecting log files for MapReduce processing, also take a look at Flume.
You can keep the files elsewhere but you'd lose the data locality advantage.
For example, if you're using AWS, you can store your files on S3 and access them directly from MapReduce code, Pig, Hive, etc.
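With the S3A connector configured, a job driver can simply point at S3 paths instead of HDFS paths; a minimal sketch (the bucket name is a made-up placeholder, and credentials could equally come from instance roles or core-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class S3InputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.s3a.access.key", "...");   // placeholder credentials
            conf.set("fs.s3a.secret.key", "...");

            Job job = Job.getInstance(conf, "s3-input-example");
            job.setJarByClass(S3InputJob.class);
            job.setMapperClass(Mapper.class);       // identity mapper, map-only job
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Read from and write back to S3 instead of HDFS (hypothetical bucket).
            FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/input/"));
            FileOutputFormat.setOutputPath(job, new Path("s3a://my-bucket/output/"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }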
In order to use Apache Hadoop you must have your files in HDFS, the Hadoop file system. Though there are alternative storage backends such as AWS S3, at their basic level these are all implementations of the Hadoop filesystem interface.
The data needs to be in HDFS because HDFS distributes the data across your cluster. During the mapping phase each mapper goes through the data stored on its node and then sends it to the proper node running the reducer code for the given chunk.
You can't have Hadoop MapReduce without using HDFS.

Please clarify my understanding of Hadoop/HBase

I have been reading white papers and watching YouTube videos for half the day now, and I believe I have a proper understanding of the technology, but before I start my project I want to make sure it's right.
So with that, here's what I think I know.
As I understand the architecture of Hadoop and HBase, they pretty much model out like this:
-----------------------------------------
|               MapReduce               |
-----------------------------------------
|  Hadoop  | <-- hbase export --| HBase |
|          |  -- apache pig --> |       |
-----------------------------------------
|                  HDFS                 |
-----------------------------------------
In a nutshell, HBase is a completely different DB engine tuned for real-time updates and queries that happens to run on HDFS and is compatible with MapReduce.
Now, assuming the above is correct, here is what else I think I know.
Hadoop is designed for big data from start to finish. The engine uses a distributed, append-only system, which means you cannot delete data once it's inserted. To access the data you can use MapReduce, or the HDFS shell and HDFS API.
Hadoop does not like small files and was never intended to be a real-time system. You would not want to store a single person and address per file; you would instead store a million people and addresses per file and insert that one large file.
HBase, on the other hand, is a fairly typical NoSQL database engine that in spirit compares to CouchDB, RavenDB, etc. The notable difference is that it's built on Hadoop's HDFS, allowing it to scale reliably to sizes limited only by your wallet.
Hadoop is a collection of a file system (HDFS) and Java APIs to perform computation on HDFS. HBase is a NoSQL database engine that uses HDFS to efficiently store data across a cluster.
To build a MapReduce job that accesses data from both Hadoop and HBase, one would be best off using HBase export to push the HBase data into Hadoop and writing the job to process the data there, but MapReduce can access both systems one at a time.
You must be very careful when designing your HBase tables, as HBase does not natively support indexing arbitrary fields; it only indexes the row key. Many tips and tricks help work around this fact.
Ok, so if I'm still accurate to this point, this would be a valid use case.
You build the site with HBase. You use HBase the same as you would any other NoSQL or RDBMS to build out your functionality. Once that's done, you put your metrics-logging points in the code to record your metrics in, say, log4j. You create a new appender in log4j with rules that say: when the log file reaches 1 GB in size, push it to the Hadoop cluster, delete it, create a new file, and go on with life.
Later, a MapReduce developer can write a routine that uses HBase export to grab a data set from HBase, say a list of user IDs, and then go to the logs stored in Hadoop and find the breadcrumb trail for each user through the system for a given timespan.
Ok, with that all said, now for the specific question. Are statements 1 - 6 accurate?
Edit 1: I have updated my beliefs above based on the answers received.
You can access the files in HDFS directly via the HDFS shell or the HDFS API.
Correct.
I am not familiar with CouchDB or RavenDB, but in HBase you cannot have secondary indexes, so you must carefully design your row key to speed up your queries. There are a lot of HBase schema-design tips on the internet that you can google.
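To illustrate the row-key point, here is a rough sketch (the entities and key layout are made up for this example) of a composite key of user ID plus a reversed timestamp, so that a prefix scan returns one user's newest events first:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyDesign {

        // Composite key: <userId>:<Long.MAX_VALUE - timestamp> so newer events sort first.
        static byte[] eventKey(String userId, long timestampMillis) {
            return Bytes.toBytes(userId + ":" + (Long.MAX_VALUE - timestampMillis));
        }

        public static void main(String[] args) {
            byte[] key = eventKey("user42", System.currentTimeMillis());
            Put put = new Put(key);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("action"), Bytes.toBytes("login"));
            // In real code the Put would go through a Table obtained from a Connection.

            // Reading all events for one user is then a cheap prefix scan on the row key.
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("user42:"));
        }
    }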
I think it is more appropriate to say Hadoop is a computing engine rather than a database engine. If you want to import HDFS data into HBase, you can use Apache Pig as described in this post. If you want to export HBase data to HDFS, you can use the export utility.
MapReduce is a component of the Hadoop framework, and it does not sit on top of HBase. You can access HBase data in a MapReduce job because HBase uses HDFS for its storage. I don't think you want to access the HFiles directly from a MapReduce job: the raw files are encoded in a special format, they are not easy to parse, and the format might change in future releases.
Since HBase and Hadoop are different database engines, one cannot access the data in the other directly. For HBase to get something out of Hadoop, it must go through MapReduce, and vice versa.
This is not true, since Hadoop is not a database engine. Hadoop is a collection of a file system (HDFS) and Java APIs to perform computation on HDFS.
Furthermore, MapReduce is not a technology; it is a model for working on HDFS data in parallel.

Why are there no locking issues when doing context.write in a Hadoop reducer

I'm confused as to how context.write works in a Hadoop reducer.
Why are there no locking issues in Hadoop reducers (if there is more than one reducer) if all of them are writing to the same file in HDFS?
Normally, if we wrote to the same file ourselves in a Hadoop mapper/reducer, we would get locking errors because we can't write to the same file concurrently.
If your MapReduce program runs on a multi-node cluster, then separate map and reduce tasks will be running on each node.
The reducer in MapReduce doesn't directly write to a file itself. It delegates this task to an OutputFormat, which is responsible for sinking the data. That could be to a file, a database table, or any other location: FileOutputFormat sinks to a location in the Hadoop Distributed File System (HDFS), and DBOutputFormat sinks to a database table (read this post). Moreover, each reducer writes to its own output file (part-r-00000, part-r-00001, and so on), so reducers never contend for the same file.
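A minimal sketch of that arrangement (the job and path names are made-up placeholders): the driver only names an output directory, and the configured OutputFormat gives every reducer its own part file inside it.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PerReducerOutput {

        static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                // context.write goes to this reducer's own RecordWriter, not to a shared file.
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "per-reducer-output");
            job.setJarByClass(PerReducerOutput.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setNumReduceTasks(3);   // yields part-r-00000, part-r-00001, part-r-00002
            FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/out"));  // hypothetical path
            // (Mapper and input path omitted; this only sketches the output side.)
        }
    }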
For your question about file locks, please have a look at this post on the Yahoo Developer Network.
