HDFS vs HIVE partitioning - hadoop

This may be a simple thing, but I'm struggling to find the answer. When data is loaded into HDFS, it is split up and distributed across multiple nodes, so the data is already partitioned and distributed.
For Hive there is a separate option to PARTITION the data. I'm pretty sure that even if you don't use the PARTITION option, the data will still be split and distributed to different nodes in the cluster when a Hive table is loaded. What additional benefit does this option give in that case?

Summarizing the comments, for Hadoop v1-v2.x:
Logical partitioning, e.g. by a date or by a field in a string, as described in the comments above, is only possible in Hive, HCatalog, or another SQL or parallel engine working on top of Hadoop, using a file format that supports partitioning (Parquet, ORC and CSV are fine, but XML, for example, is hard or nearly impossible to partition).
Logical partitioning (as in Hive or HCatalog) can be used as a substitute for indexes, which are otherwise missing.
'Partitioning of HDFS storage' on local or distributed nodes is possible by defining the partitions when HDFS is set up; see https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cluster-planning/content/ch_partitioning_chapter.html
HDFS is able to "balance" or "distribute" blocks across nodes.
Natively, HDFS cannot split blocks and distribute them to folders according to their content; a block can only be moved as a whole to another node.
blocks (not files!) are replicated in the HDFS cluster according to the HDFS replication factor:
$ hdfs fsck /
(Thanks to David and Kris for the discussion above, which also explains most of this; please take this post as a summary.)
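To make the block-level view concrete, here is a rough sketch of the arithmetic behind blocks and replicas. The 128 MB block size and replication factor of 3 are the usual Hadoop 2.x defaults, but they are assumptions here and your cluster may be configured differently; the function name is made up for illustration.

```python
def hdfs_block_layout(file_size_bytes, block_size_bytes=128 * 1024 * 1024, replication=3):
    """Return (number of blocks, number of block replicas, raw bytes stored).

    Assumed defaults: 128 MB blocks, replication factor 3.
    """
    # Integer ceiling division: a partly filled last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    # The last block only occupies as much space as it holds,
    # so raw usage is simply file size times replication.
    return blocks, blocks * replication, file_size_bytes * replication

one_gib = 1024 ** 3
blocks, replicas, raw_bytes = hdfs_block_layout(one_gib)
print(blocks, replicas, raw_bytes)
```

Note that nothing in this calculation looks at the content of the file: HDFS cuts by byte offset, which is why it cannot group rows by date or by any other field on its own.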

HDFS partitioning: mainly deals with how files are stored on the nodes. For fault tolerance, file blocks are replicated across the cluster (according to the replication factor).
Hive partitioning: an optimization technique in Hive.
Inside a Hive database, we partition tables to get better query performance.
Partitioning tells Hive how the data is laid out in storage and which parts of it a query needs to read.
Hive partitioning is controlled at the column level of the table data.
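As a rough illustration of why this helps queries: Hive stores each partition as a `column=value` sub-directory under the table directory, so a filter on the partition column lets it skip whole directories. The sketch below only imitates that idea in plain Python; the paths, the table name and the helper functions are hypothetical, and this is not how Hive itself is implemented.

```python
# Hypothetical file listing for a table partitioned by a `dt` column.
files = [
    "/warehouse/sales/dt=2020-01-01/part-00000",
    "/warehouse/sales/dt=2020-01-01/part-00001",
    "/warehouse/sales/dt=2020-01-02/part-00000",
    "/warehouse/sales/dt=2020-01-03/part-00000",
]

def partition_values(path):
    """Extract {column: value} pairs from `col=value` directory names."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **wanted):
    """Keep only files whose partition directories match every requested value."""
    return [p for p in paths
            if all(partition_values(p).get(k) == v for k, v in wanted.items())]

# Roughly what happens for: SELECT ... FROM sales WHERE dt = '2020-01-02'
selected = prune(files, dt="2020-01-02")
print(selected)
```

Without the partition column, the same query would have to read every file of the table, no matter how the blocks happen to be spread over the datanodes.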

Related

Is it possible to convert from HBase to a Spark RDD efficiently?

I have a large dataset of items in HBase that I want to load into a Spark RDD for processing. My understanding is that HBase is optimized for low-latency single-item lookups on Hadoop, so I am wondering whether it's possible to efficiently query for 100 million items in HBase (~10 TB in size)?
Here is some general advice on making Spark and HBase work together.
Data colocation and partitioning
Spark avoids shuffling: if your Spark workers and HBase regions are located on the same machines, Spark will create partitions according to regions.
A good region split in HBase will map to a good partitioning in Spark.
If possible, consider working on your rowkeys and region splits.
Operations in Spark vs operations in HBase
Rule of thumb: use HBase scans only, and do everything else with Spark.
To avoid shuffling in your Spark operations, you can consider working on your partitions. For example, you can join two Spark RDDs built from HBase scans on their rowkey or rowkey prefix without any shuffling.
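To show why a join on the rowkey needs no shuffle when both sides are split the same way, here is a small plain-Python sketch. It does not use Spark or HBase at all; the split points, function names and sample data are invented for illustration. The point is only that when both datasets share the same region boundaries, partition i on one side can be joined with partition i on the other side locally.

```python
from bisect import bisect_right

# Hypothetical region split points (rowkey boundaries); both tables share them.
SPLITS = ["g", "n", "t"]

def region_of(rowkey):
    """Index of the region (and hence the partition) that holds this rowkey."""
    return bisect_right(SPLITS, rowkey)

def partition(rows):
    """Group (rowkey, value) pairs into one list per region."""
    parts = [[] for _ in range(len(SPLITS) + 1)]
    for key, value in rows:
        parts[region_of(key)].append((key, value))
    return parts

def local_join(left_parts, right_parts):
    """Join partition i of the left side only with partition i of the right side."""
    out = []
    for left, right in zip(left_parts, right_parts):
        lookup = dict(right)
        out.extend((k, v, lookup[k]) for k, v in left if k in lookup)
    return out

users = partition([("alice", 1), ("mike", 2), ("zoe", 3)])
orders = partition([("alice", "book"), ("zoe", "pen"), ("bob", "ink")])
joined = local_join(users, orders)
print(joined)
```

In Spark terms this corresponds to both RDDs having the same partitioner, in which case the join can be computed per partition instead of moving rows between executors.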
HBase configuration tweaks
This discussion is a bit old (some configurations are not up to date) but still interesting: http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97
And the link below has also some leads:
http://blog.asquareb.com/blog/2015/01/01/configuration-parameters-that-can-influence-hbase-performance/
You might find multiple sources (including the ones above) suggesting that you change the scanner cache config, but this applies only to HBase < 1.x.
We had this exact question at Splice Machine. We found the following based on our tests.
HBase had performance challenges if you attempted to perform remote scans from Spark/MapReduce.
The large scans hurt performance of ongoing small scans by forcing garbage collection.
There was not a clear resource management dividing line between OLTP and OLAP queries and resources.
We ended up writing a custom reader that reads the HFiles directly from HDFS and applies incremental deltas from the memstore during scans. With this, Spark could perform quickly enough for most OLAP applications. We also separated the resource management so that the OLAP resources were allocated via YARN (on premise) or Mesos (cloud) and would not disturb normal OLTP apps.
I wish you luck on your endeavor. Splice Machine is open source and you are welcome to check out our code and approach.
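To give an idea of what "read the HFiles and apply memstore deltas during the scan" can mean, here is a heavily simplified sketch. It is not Splice Machine's reader and not HBase's internals: real HBase works with timestamps, versions and tombstones per cell. The function name, the use of None as a delete marker and the sample rows are all assumptions made for this example.

```python
import heapq
from itertools import groupby

def merged_scan(hfile_rows, memstore_rows):
    """Yield (rowkey, value) in key order; memstore entries override HFile entries.

    Both inputs must already be sorted by rowkey. A memstore value of None
    is treated as a delete marker (an assumption of this sketch).
    """
    # Tag each source so that, for equal keys, the memstore entry sorts last.
    tagged = heapq.merge(
        ((key, 0, value) for key, value in hfile_rows),
        ((key, 1, value) for key, value in memstore_rows),
        key=lambda entry: (entry[0], entry[1]),
    )
    for key, group in groupby(tagged, key=lambda entry: entry[0]):
        value = list(group)[-1][2]  # last entry for a key is the newest one
        if value is not None:
            yield key, value

hfile = [("a", 1), ("b", 2), ("d", 4)]          # flushed, immutable data
memstore = [("b", 20), ("c", 3), ("d", None)]   # recent writes not yet flushed
print(list(merged_scan(hfile, memstore)))
```

Because both inputs are sorted, the merge is a single streaming pass, which is what makes this kind of scan friendly to a bulk engine like Spark.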

Spark with HBASE vs Spark with HDFS

I know that HBASE is a columnar database that stores structured data of tables into HDFS by column instead of by row. I know that Spark can read/write from HDFS and that there is some HBASE-connector for Spark that can now also read-write HBASE tables.
Questions:
1) What capabilities are added by layering Spark on top of HBASE instead of using HBASE alone? Does it depend only on the programmer's skills, or is there a performance reason to do that? Are there things Spark can do that HBASE alone can't?
2) Following on from the previous question, when should you add HBASE between HDFS and Spark instead of using HDFS directly?
1) What capabilities are added by layering Spark on top of HBASE instead of using HBASE alone? Does it depend only on the programmer's skills, or is there a performance reason to do that? Are there things Spark can do that HBASE alone can't?
At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine, and Spark provides a competent execution engine on top of HBase (intermediate results, relational algebra, etc.). HBase is an MVCC storage structure and Spark is an execution engine. They are natural complements to one another.
2) Following on from the previous question, when should you add HBASE between HDFS and Spark instead of using HDFS directly?
Small reads, concurrent write/read patterns, incremental updates (most ETL).
Good luck...
I'd say that using a distributed computing engine like Apache Hadoop or Apache Spark basically implies a full scan of any data source. That's the whole point of processing the data all at once.
HBase is good at cherry-picking particular records, while HDFS is certainly much more performant for full scans.
When you write to HBase from Hadoop or Spark, you don't write to the database the usual way, because that is hugely slow! Instead, you write the data to HFiles directly and then bulk import them into HBase.
The reason people invented SQL databases is that HDDs were very slow at the time. It took very clever people decades to invent different kinds of indexes that make clever use of the bottleneck resource (disk). Now people are inventing NoSQL: we like associative arrays and we need them to be distributed (that is essentially what NoSQL is), and they are very simple and very convenient. But in today's world, with SSDs being cheap, no one needs databases; a file system is good enough in most cases. The one catch is that it has to be distributed to keep up with distributed computations.
Answering original questions:
These are two different tools for completely different problems.
I think that if you use Apache Spark for data analysis, you should avoid HBase (or Cassandra, or any other database). They can be useful for keeping aggregated data to build reports or for picking specific records about users or items, but that happens after the processing.
HBase is a NoSQL database that works well for fetching your data quickly. Though it is a database, it uses a large number of HFiles (similar to HDFS files) to store your data and provide low-latency access.
So use HBase when your requirement is that the data needs to be accessed by other big data applications.
Spark, on the other hand, is an in-memory distributed computing engine that has connectivity to HDFS, HBase, Hive, PostgreSQL, JSON files, Parquet files, etc.
There is no considerable performance difference between reading from an HDFS file and reading from HBase up to a few GB. Beyond that, HBase connectivity becomes faster.

If you store something in HBase, can it be accessed directly from HDFS?

I was told HBase is a DB that sits on top of HDFS.
But let's say you are using Hadoop after you put some information into HBase.
Can you still access the information with MapReduce?
You can read data from HBase tables using MapReduce programs, Hive queries, or Pig scripts.
Here is the example for MapReduce
Here is the example for Hive. Once you create the Hive table, you can run select queries on top of HBase tables, which will process the data using MapReduce.
You can easily integrate HBase tables with other Hadoop ecosystem tools such as Pig.
Yes, HBase is a column-oriented database that sits on top of HDFS.
HBase is a database that stores its data in a distributed filesystem. The filesystem of choice is typically HDFS, owing to the tight integration between HBase and HDFS. Having said that, it doesn't mean that HBase can't work on any other filesystem. It's just not proven in production and at scale to work with anything except HDFS.
HBase provides you with the following:
Low latency access to small amounts of data from within a large data set. You can access single rows quickly from a billion row table.
Flexible data model to work with and data is indexed by the row key.
Fast scans across tables.
Scale in terms of writes as well as total volume of data.
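The first and third points both follow from the data being kept sorted by row key. Here is a toy sketch of that idea in plain Python; it is not the HBase client API, and the class name, key format and data are made up. A point lookup becomes a binary search, and a range scan becomes a contiguous slice.

```python
from bisect import bisect_left

class SortedStore:
    """Toy stand-in for a rowkey-sorted store (not the HBase API)."""

    def __init__(self, rows):
        self.rows = sorted(rows)
        self.keys = [key for key, _ in self.rows]

    def get(self, rowkey):
        """Point lookup by binary search instead of reading every row."""
        i = bisect_left(self.keys, rowkey)
        if i < len(self.keys) and self.keys[i] == rowkey:
            return self.rows[i][1]
        return None

    def scan(self, start, stop):
        """Range scan over rowkeys in [start, stop)."""
        lo = bisect_left(self.keys, start)
        hi = bisect_left(self.keys, stop)
        return self.rows[lo:hi]

store = SortedStore((f"user{n:07d}", n * n) for n in range(100_000))
print(store.get("user0000042"))
print(store.scan("user0000010", "user0000013"))
```

A plain file in HDFS gives you no such index: finding one record means reading through the file, which is fine for batch jobs but not for low-latency lookups.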

Why HBase even though HDFS is present

Why does Hadoop use HBase even though HDFS is available for storage?
We can also store table data as blocks in HDFS.
Is the data stored in HBase? If so, what role does HDFS serve?
HDFS is a distributed file system that is well suited for storing large files. It’s designed to support batch processing of data but doesn’t provide fast individual record lookups.
HBase is built on top of HDFS (the data is actually stored on HDFS) and is designed to provide access to single rows of data in large tables.
Overall, the differences between HDFS and HBase are
HDFS –
Is suited for high-latency batch processing operations
Data is primarily accessed through MapReduce
Is designed for batch processing and hence doesn’t have a concept of random reads/writes
HBase –
Is built for Low Latency operations
Provides access to single rows from billions of records
Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift
Hadoop can use HDFS as well as HBase. You need to see the difference between a filesystem (HDFS) and a database (HBase), which offers many features that a plain filesystem does not (e.g. random access to the data).
You will need HDFS running in both cases, because HBase is built on top of the HDFS filesystem.

Control data locality in Impala by partitioning

I would like to avoid Impala nodes unnecessarily requesting data from other nodes over the network in cases where the ideal data locality or layout is known at table creation time. This would be helpful with 'non-additive' operations, where all records from a partition are needed in the same place (node) anyway (e.g. percentiles).
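To illustrate what 'non-additive' means here, a small plain-Python sketch with made-up numbers spread over three hypothetical nodes: partial sums can be combined exactly, but the median of per-node medians is generally not the global median, so an exact percentile needs all values of the group in one place.

```python
from statistics import median

# Hypothetical values of one partition, spread over three nodes.
node_a = [1, 2, 3]
node_b = [4, 5, 6, 7, 8]
node_c = [100]
nodes = (node_a, node_b, node_c)

all_values = node_a + node_b + node_c

# Additive: partial sums computed per node combine exactly.
sum_of_partials = sum(sum(part) for part in nodes)

# Non-additive: combining per-node medians does not give the global median.
median_of_medians = median(median(part) for part in nodes)
global_median = median(all_values)

print(sum_of_partials == sum(all_values))
print(median_of_medians, global_median)
```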
Is it possible to tell Impala that all data in a partition should always be co-located on a single node for any HDFS replica?
In Impala SQL, I am not sure whether the "PARTITIONED BY" clause provides this feature. As I understand it, Impala splits its partitions into separate files on HDFS, but by default HDFS does not guarantee the co-location of related files or blocks (rather, it tries to achieve the opposite).
Found some information about Impala's impact on HDFS development but not clear if these are already implemented or still in plans:
http://www.slideshare.net/deview/aaron-myers-hdfs-impala
(slides 23-24)
Thank you in advance for all.
About the slides you mention ("Co-located block replicas") - it's about an HDFS feature (HDFS-2576) implemented in Hadoop 2.1. It provides a Java API to give hints to HDFS as to where the blocks should be placed.
It's not used in Impala as of 2014, but it definitely seems like building some groundwork for that - as it would give Impala a performance equivalent of specifying distribution key in traditional MPP databases.
No, that completely defeats the purpose of having a distributed file system and MPP computing. It also creates a single point of failure and a bottleneck especially if you're talking about a 250GB table that is joined to itself. Exactly the kind of problems that Hadoop was designed to solve. Partitioning data creates sub-directories in HDFS on the namenode and that data is then replicated throughout the datanodes in the cluster.
