Why does Hadoop use HBase even though HDFS is available for storage?
We can also store table data as blocks in HDFS.
Is the data stored in HBase? If so, what role does HDFS serve?
HDFS is a distributed file system that is well suited for storing large files. It’s designed to support batch processing of data but doesn’t provide fast individual record lookups.
HBase is built on top of HDFS (the data is actually stored on HDFS) and is designed to provide access to single rows of data in large tables.
Overall, the differences between HDFS and HBase are
HDFS –
Is suited for high-latency batch processing operations
Data is primarily accessed through MapReduce
Is designed for batch processing and hence doesn’t have a concept of random reads/writes
HBase –
Is built for Low Latency operations
Provides access to single rows from billions of records (see the sketch after this list)
Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift
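To make the single-row access point concrete, here is a minimal sketch using the standard HBase Java client; the table 'users', the column family 'info', and the row key are assumptions for illustration only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SingleRowLookup {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.)
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table
                // Fetch exactly one row by its row key; no scan over the underlying HDFS files
                Get get = new Get(Bytes.toBytes("user#42"));
                get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
            }
        }
    }

The same lookup against raw files in HDFS would mean reading through whole blocks, which is exactly the difference the list above describes.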
Hadoop can use HDFS as well as HBase. You need to see the difference between a filesystem (HDFS) and a database (HBase), which offers many features a plain filesystem does not (e.g. random access to the data).
You will need HDFS running in both cases, because HBase is built on top of the HDFS filesystem.
Related question:
I know that HBase is a columnar database that stores structured table data in HDFS by column instead of by row. I know that Spark can read from and write to HDFS, and that there is an HBase connector for Spark that can now also read/write HBase tables.
Questions:
1) What are the added capabilities brought by layering Spark on top of HBase instead of using HBase alone? Does it depend only on programmer capabilities, or is there a performance reason to do it? Are there things Spark can do that HBase alone can't?
2) Stemming from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?
1) What are the added capabilities brought by layering Spark on top of HBase instead of using HBase alone? Does it depend only on programmer capabilities, or is there a performance reason to do it? Are there things Spark can do that HBase alone can't?
At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine, and Spark provides a competent execution engine on top of HBase (intermediate results, relational algebra, etc.). HBase is an MVCC storage structure and Spark is an execution engine. They are natural complements to one another.
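To make the "execution engine on top of HBase" idea concrete, here is a minimal sketch (not Splice Machine's implementation) that hands an HBase table to Spark through the stock Hadoop TableInputFormat and runs a cluster-wide aggregation that HBase alone does not execute for you; the table name 'events' is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkOverHBase {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("spark-over-hbase"));

            // Point the Hadoop InputFormat at the HBase table to read
            Configuration hbaseConf = HBaseConfiguration.create();
            hbaseConf.set(TableInputFormat.INPUT_TABLE, "events"); // hypothetical table

            // Each record is (row key, Result); Spark supplies the execution engine
            JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                    hbaseConf, TableInputFormat.class,
                    ImmutableBytesWritable.class, Result.class);

            // A cluster-wide aggregation over the whole table
            System.out.println("rows in 'events': " + rows.count());

            sc.stop();
        }
    }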
2) Stemming from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?
Small reads, concurrent write/read patterns, incremental updates (most ETL).
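As an illustration of the small-write / incremental-update pattern that favors HBase over raw HDFS files, here is a hedged sketch using the standard HBase client; the table 'metrics', the column family 'd', and the column names are assumptions:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IncrementalUpdate {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("metrics"))) { // hypothetical table

                // Upsert a single cell - a pattern plain HDFS files do not support in place
                Put put = new Put(Bytes.toBytes("page#/home"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("last_seen"),
                        Bytes.toBytes(System.currentTimeMillis()));
                table.put(put);

                // Atomic server-side counter increment, safe under concurrent writers
                Increment inc = new Increment(Bytes.toBytes("page#/home"));
                inc.addColumn(Bytes.toBytes("d"), Bytes.toBytes("hits"), 1L);
                table.increment(inc);
            }
        }
    }

Neither single-cell operation has an equivalent on raw HDFS files, which are write-once/append-only.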
Good luck...
I'd say that using distributed computing engines like Apache Hadoop or Apache Spark basically implies a full scan of any data source. That's the whole point of processing the data all at once.
HBase is good at cherry-picking particular records, while HDFS is certainly much more performant with full scans.
When you do a write to HBase from Hadoop or Spark, you won't write it to the database as usual - that's hugely slow! Instead, you want to write the data to HFiles directly and then bulk-load them into HBase.
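A rough sketch of that bulk-load path with the HBase MapReduce utilities follows; the table name, the output path, and the (omitted) mapper are assumptions, and the exact class locations (e.g. LoadIncrementalHFiles) vary between HBase releases:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                TableName tableName = TableName.valueOf("events"); // hypothetical table
                Table table = conn.getTable(tableName);

                Job job = Job.getInstance(conf, "hfile-bulk-load");
                job.setJarByClass(BulkLoadSketch.class);
                // A mapper (not shown) must emit (ImmutableBytesWritable rowKey, Put) pairs
                job.setMapOutputKeyClass(ImmutableBytesWritable.class);

                // Configure the job to write region-aligned, sorted HFiles instead of Puts
                HFileOutputFormat2.configureIncrementalLoad(
                        job, table, conn.getRegionLocator(tableName));
                FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles")); // hypothetical path

                if (job.waitForCompletion(true)) {
                    // Hand the finished HFiles over to the region servers
                    new LoadIncrementalHFiles(conf).doBulkLoad(
                            new Path("/tmp/hfiles"), conn.getAdmin(), table,
                            conn.getRegionLocator(tableName));
                }
            }
        }
    }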
The reason people invented SQL databases is that HDDs were very slow at the time. It took the most clever people decades to invent different kinds of indexes to make clever use of the bottleneck resource (the disk). Now people invent NoSQL - we like associative arrays and we need them to be distributed (that is essentially what NoSQL is) - they're very simple and very convenient. But in today's world, with SSDs being cheap, no one needs a database - a file system is good enough in most cases. The one thing, though, is that it has to be distributed to keep up with the distributed computations.
Answering original questions:
These are two different tools for completely different problems.
I think if you use Apache Spark for data analysis, you have to avoid HBase (Cassandra or any other database). They can be useful for keeping aggregated data to build reports, or for picking specific records about users or items, but that happens after the processing.
HBase is a NoSQL database that works well for fetching your data quickly. Though it is a database, it uses a large number of HFiles (similar to HDFS files) to store your data and provide low-latency access.
So use HBase when your requirement is that the data be accessible to other big data applications with low latency.
Spark, on the other hand, is an in-memory distributed computing engine with connectivity to HDFS, HBase, Hive, PostgreSQL, JSON files, Parquet files, etc.
There is no considerable performance difference between reading from an HDFS file and reading from HBase up to a few GBs. Beyond that, HBase connectivity becomes faster.
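For example, the same Spark session can read Parquet or JSON straight off HDFS with the DataFrame API; this is only a sketch and the paths are hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkConnectivity {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("spark-connectivity")
                    .getOrCreate();

            // Read columnar Parquet data straight from HDFS (path is hypothetical)
            Dataset<Row> parquet = spark.read().parquet("hdfs:///data/books_parquet");

            // The same API also reads JSON, JDBC sources (e.g. PostgreSQL), Hive tables, etc.
            Dataset<Row> json = spark.read().json("hdfs:///data/books_json");

            System.out.println("parquet rows: " + parquet.count());
            System.out.println("json rows: " + json.count());

            spark.stop();
        }
    }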
I want to understand the design principle behind using an RDBMS for Hive metadata rather than the filesystem.
From my perspective, an RDBMS provides -
Concurrency control
ACID properties
Sub-second latency etc.
Filesystem could have provided -
Replication of data
Concurrency could have been achieved using Zookeeper
Is there any other consideration that impacted this decision during the design of Hive?
You can find the reason why Hive uses an RDBMS in the paper "Hive: a warehousing solution over a map-reduce framework".
It describes it as follows:
"The storage system for the metastore should be optimized
for online transactions with random accesses and updates.
A file system like HDFS is not suited since it is optimized
for sequential scans and not for random access. So, the
metastore uses either a traditional relational database (like
MySQL, Oracle) or file system (like local, NFS, AFS) and
not HDFS. As a result, HiveQL statements which only access
metadata objects are executed with very low latency. However,
Hive has to explicitly maintain consistency between
metadata and data."
As far as I know, they chose the approach of storing the meta information of Hive tables in an RDBMS, as opposed to storing it in HDFS, because they need the metastore (schema, partitions, and other information) to be very low latency.
Reasons to use an RDBMS for storing metadata (a short sketch follows this list):
CRUD operations are not possible on HDFS,
Editing files or data already present in HDFS is not allowed,
The metastore therefore uses an RDBMS to provide low query latency,
HDFS read/write operations are comparatively time-consuming.
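As a hedged illustration of the latency point, a metadata-only statement such as SHOW TABLES is answered from the RDBMS-backed metastore without launching any job over the data in HDFS. The HiveServer2 URL below is an assumption, and the Hive JDBC driver must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MetastoreLatency {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL - host, port and database are assumptions
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement()) {

                long start = System.currentTimeMillis();
                // Metadata-only statement: answered from the RDBMS metastore,
                // no MapReduce/Tez job and no HDFS scan is launched
                try (ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
                System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }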
I was told HBase is a DB that sits on top of HDFS.
But let's say you are using Hadoop after you have put some information into HBase.
Can you still access the information with MapReduce?
You can read data from HBase tables using MapReduce programs, Hive queries, or Pig scripts.
Here is an example of reading an HBase table from MapReduce.
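The following is a minimal sketch using TableMapReduceUtil from the HBase MapReduce package; the table name 'books' is hypothetical and the job does nothing more than count rows:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class HBaseRowCounter {

        // TableMapper fixes the input types to (row key, Result)
        static class CountMapper extends TableMapper<Text, IntWritable> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                    throws IOException, InterruptedException {
                // Count every row the scan hands us
                context.getCounter("stats", "rows").increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hbase-row-counter");
            job.setJarByClass(HBaseRowCounter.class);

            Scan scan = new Scan();
            scan.setCaching(500);       // batch rows per RPC for scan throughput
            scan.setCacheBlocks(false); // don't pollute the block cache from MR scans

            // Wire the HBase table in as the MapReduce input ("books" is hypothetical)
            TableMapReduceUtil.initTableMapperJob(
                    "books", scan, CountMapper.class, Text.class, IntWritable.class, job);

            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }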
Here is how it works with Hive: once you create a Hive table mapped to the HBase table, you can run SELECT queries on top of HBase tables, and they will process the data using MapReduce.
You can easily integrate HBase tables with other Hadoop ecosystem tools such as Pig as well.
Yes, HBase is a column-oriented database that sits on top of HDFS.
HBase is a database that stores its data in a distributed filesystem. The filesystem of choice is typically HDFS, owing to the tight integration between HBase and HDFS. Having said that, it doesn't mean that HBase can't work on another filesystem; it's just not proven in production, and at scale, with anything except HDFS.
HBase provides you with the following (a short code sketch follows this list):
Low latency access to small amounts of data from within a large data set. You can access single rows quickly from a billion row table.
Flexible data model to work with and data is indexed by the row key.
Fast scans across tables.
Scale in terms of writes as well as total volume of data.
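For instance, because rows are stored sorted by row key, a bounded scan touches only a small key range rather than the whole table. The sketch below assumes the HBase 2.x client API and a hypothetical 'users' table:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScan {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

                // Row keys are stored sorted, so a start/stop key turns this into a
                // short range scan instead of a full-table read
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("user#1000"))
                        .withStopRow(Bytes.toBytes("user#2000"));

                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }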
Though I have understood the architecture of Hadoop a bit, I have a gap in my understanding of where exactly the data is situated.
My question is: suppose I have a large amount of data from some random books. Is the book data stored on multiple nodes beforehand using HDFS, and do we then perform MapReduce on each node and get the result back on our system?
'OR'
Do we store the data somewhere in a large database and, whenever we want to perform the MapReduce operation, take the chunks and store them on multiple nodes to perform the operation?
Either is possible; it really depends on your use case and needs. However, Hadoop MapReduce generally runs against data stored in HDFS. The system is designed around data locality, which requires the data to be in HDFS: the Map tasks run on the same piece of hardware where the data is stored, in order to improve performance.
That said, if for some reason your data must be stored outside of HDFS and then processed using MapReduce, it can be done, but it is a bit more work and is not as efficient as processing data locally in HDFS.
So let's take two use cases, starting with log files. Log files as they are are not particularly accessible. They just need to be stuck somewhere and stored for later analysis. HDFS is perfect for this. If you really need a log back out you can get it, but generally people will be looking for the output of the analytics. So store your logs in HDFS and process them normally.
However, data in the format ideal for HDFS and Hadoop MapReduce (many records in a single large flat file) is not what I would consider highly accessible. Hadoop MapReduce expects input files that are multiple megabytes in size with many records per file. The more you diverge from this case, the more your performance will decline. Sometimes your data is needed online at all times, and HDFS is not ideal for this. Take your book example: if these books are used in an application that needs the content accessible in an online fashion, i.e. editing and annotating, you may choose to store them in a database. Then, when you need to run batch analytics, you use a custom InputFormat to retrieve the records from the database and process them in MapReduce.
I am currently doing this with a web crawler that stores the web pages individually in Amazon S3. Web pages are too small to serve as a single efficient input to MapReduce, so I have a custom InputFormat that feeds each mapper several files. The output of this MapReduce job is eventually written back to S3, and because I am using Amazon EMR, the Hadoop cluster goes away.
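The custom S3 InputFormat itself isn't reproduced here, but the stock CombineTextInputFormat gives the same "several small files per mapper" effect for plain-text inputs; the sketch below is hedged and the bucket paths are made up:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesJob {

        static class PageMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Each value is one line from one of the many small files packed
                // into this split; a real job would parse the page here
                context.write(new Text("page-line"), line);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files");
            job.setJarByClass(SmallFilesJob.class);
            job.setMapperClass(PageMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Pack many small inputs into splits of up to ~128 MB each,
            // so one mapper handles many files instead of one file per mapper
            job.setInputFormatClass(CombineTextInputFormat.class);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path("s3://my-bucket/pages/"));    // hypothetical
            FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/output/")); // hypothetical
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }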
I would like to avoid Impala nodes unnecessarily requesting data from other nodes over the network in cases where the ideal data locality or layout is known at table creation time. This would be helpful with 'non-additive' operations where all records from a partition are needed in the same place (node) anyway (for example, percentiles).
Is it possible to tell Impala that all data in a partition should always be co-located on a single node for any HDFS replica?
In Impala SQL, I am not sure whether the "PARTITIONED BY" clause provides this feature. In my understanding, Impala chunks its partitions into separate files on HDFS, but HDFS does not guarantee the co-location of related files or blocks by default (rather, it tries to achieve the opposite).
I found some information about Impala's impact on HDFS development, but it is not clear whether these features are already implemented or still planned:
http://www.slideshare.net/deview/aaron-myers-hdfs-impala
(slides 23-24)
Thank you in advance for all.
About the slides you mention ("Co-located block replicas") - it's about an HDFS feature (HDFS-2576) implemented in Hadoop 2.1. It provides a Java API to give hints to HDFS as to where the blocks should be placed.
It's not used in Impala as of 2014, but it definitely looks like groundwork for that, as it would give Impala the performance equivalent of specifying a distribution key in traditional MPP databases.
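For reference, the hint shows up in HDFS as a create() overload on DistributedFileSystem that accepts an array of "favored nodes". The sketch below is only an approximation (the exact overload and its parameters differ between Hadoop releases), and the hostnames and path are made up:

    import java.net.InetSocketAddress;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class FavoredNodesWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

            // Hint that the blocks (and replicas) of this file should land on these
            // datanodes; hostnames, port and path are made up for the sketch
            InetSocketAddress[] favoredNodes = new InetSocketAddress[] {
                    new InetSocketAddress("datanode-07.example.com", 50010),
                    new InetSocketAddress("datanode-12.example.com", 50010)
            };

            try (FSDataOutputStream out = dfs.create(
                    new Path("/warehouse/part-00000"),
                    FsPermission.getFileDefault(),
                    true,                                     // overwrite
                    conf.getInt("io.file.buffer.size", 4096),
                    (short) 3,                                // replication
                    dfs.getDefaultBlockSize(),
                    null,                                     // no progress callback
                    favoredNodes)) {
                out.writeBytes("co-located data\n");
            }
        }
    }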
No, that completely defeats the purpose of having a distributed file system and MPP computing. It also creates a single point of failure and a bottleneck, especially if you're talking about a 250 GB table that is joined to itself. Those are exactly the kinds of problems Hadoop was designed to solve. Partitioning data creates sub-directories in HDFS on the namenode, and that data is then replicated across the datanodes in the cluster.