HBase MapReduce interaction - Hadoop

I have a program that uses HBase and MapReduce.
I store data in HDFS; the size of this file is 100 GB. Now I put this data into HBase.
Scanning this file with MapReduce takes 5 minutes, but scanning the HBase table takes 30 minutes.
How can I increase the speed when using HBase with MapReduce?
Thanks.

I am assuming you have a single-node HDFS. If your 100 GB file were on a multi-node HDFS cluster, it would be much faster for both MapReduce and Hive.
You could try increasing the number of mappers and reducers in your MapReduce job to gain some performance; have a look at this post.
Hive is essentially a data warehousing tool built on top of HDFS, and every query underneath is a MapReduce job itself, so the post above would answer this problem as well.
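Beyond the mapper and reducer counts, a common way to speed up a MapReduce scan over an HBase table is to tune the Scan that drives the job: raise the scanner caching so each RPC returns more rows, turn off block caching for a one-off full-table scan, and restrict the scan to the column families you actually need. A minimal sketch; the table name "my_table", the column family "cf", and the caching value are placeholders rather than anything from the question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class TunedHBaseScanJob {

        // Trivial mapper that just counts rows; replace the body with your real per-row logic.
        static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
            @Override
            protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
                ctx.getCounter("scan", "rows").increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "tuned-hbase-scan");
            job.setJarByClass(TunedHBaseScanJob.class);

            Scan scan = new Scan();
            scan.setCaching(500);                 // rows fetched per RPC; the default is small, so big scans benefit from more
            scan.setCacheBlocks(false);           // don't flush the region servers' block cache with a one-off full scan
            scan.addFamily(Bytes.toBytes("cf"));  // read only the column family you need ("cf" is a placeholder)

            TableMapReduceUtil.initTableMapperJob(
                    "my_table",                   // placeholder table name
                    scan, RowCountMapper.class,
                    NullWritable.class, NullWritable.class, job);

            job.setNumReduceTasks(0);             // map-only: one mapper per region, no shuffle needed
            job.setOutputFormatClass(NullOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }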

Related

Are mappers and reducers involved while uploading/inserting data to HDFS?

I am quite confused here. When we upload/insert/put data into Hadoop HDFS, it is known that the data is stored in chunks based on the block size and replication factor, and MapReduce only comes into play when processing the data.
I'm using MRv2, and when I insert data into one of my tables I can see a MapReduce progress bar. So what is the exact picture here? Are mappers and reducers really involved while inserting/uploading data to HDFS?
The need for MapReduce depends on the type of write operation.
Operations like hdfs dfs -put or -copyFromLocal do not use MapReduce when writing data from the local filesystem to HDFS, whereas DistCp, which performs inter/intra-cluster HDFS data copying, uses mappers. Similarly, Sqoop uses mappers to import data into HDFS. Hive's LOAD statements do not, while its INSERTs do.
And those are mapper-only MapReduce jobs.
I'm using MRv2, and when I insert data into one of my tables
I assume you are inserting data into a Hive table. INSERT statements in Hive use mappers.
are mappers and reducers involved while inserting/uploading data to HDFS?
Not always. Whether mappers are involved depends on the write operation.
The HDFS client writes directly to the datanodes after consulting with the namenode for block locations. No mappers or reducers are required.
Ref: Architecture of HDFS Read and Write
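To illustrate the direct write path, a plain HDFS client write goes through the FileSystem API (the same path hdfs dfs -put takes), and no job is submitted to YARN. A minimal sketch, with the local and HDFS paths as placeholder values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DirectHdfsWrite {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath for the namenode address.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Equivalent to `hdfs dfs -put /tmp/local.csv /data/local.csv`:
            // the client streams blocks straight to the datanodes after asking
            // the namenode where to place them. No mappers or reducers involved.
            fs.copyFromLocalFile(new Path("/tmp/local.csv"), new Path("/data/local.csv"));

            fs.close();
        }
    }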
Just because there's a progress bar doesn't mean it's a MapReduce process.
If every file written to HDFS were a MapReduce job, the YARN ResourceManager UI would log it all; if you don't believe me, check there.
MapReduce is not used when you copy data from the local filesystem or put data into HDFS.

If you store something in HBase, can it be accessed directly from HDFS?

I was told HBase is a DB that sits on top of HDFS.
But let's say you are using Hadoop after you put some information into HBase.
Can you still access the information with MapReduce?
You can read data from HBase tables using MapReduce programs, Hive queries, or Pig scripts.
Here is the example for MapReduce.
Here is the example for Hive. Once you create the Hive table, you can run SELECT queries on top of HBase tables, which will process the data using MapReduce.
You can easily integrate HBase tables with other Hadoop ecosystem tools such as Pig as well.
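Since the linked Hive example is not reproduced here, the following is only a hedged sketch of what mapping an existing HBase table into Hive usually looks like, using the HBase storage handler through the Hive JDBC driver; the HiveServer2 host, table names, and column names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveOverHBaseExample {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver (older hive-jdbc versions need this explicitly).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Placeholder HiveServer2 endpoint and credentials.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {

                // Map an existing HBase table ("my_table", family "cf", qualifier "val") into Hive.
                stmt.execute(
                    "CREATE EXTERNAL TABLE IF NOT EXISTS hbase_mapped (rowkey STRING, val STRING) " +
                    "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
                    "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val') " +
                    "TBLPROPERTIES ('hbase.table.name' = 'my_table')");

                // This SELECT is compiled into a MapReduce (or Tez/Spark) job that scans the HBase table.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT rowkey, val FROM hbase_mapped LIMIT 10")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                    }
                }
            }
        }
    }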
Yes, HBase is a column-oriented database that sits on top of HDFS.
HBase is a database that stores its data in a distributed filesystem. The filesystem of choice is typically HDFS, owing to the tight integration between HBase and HDFS. That said, it doesn't mean HBase can't work on any other filesystem; it's just not proven in production and at scale with anything except HDFS.
HBase provides you with the following:
Low-latency access to small amounts of data within a large data set; you can access single rows quickly from a billion-row table.
A flexible data model to work with, where data is indexed by the row key.
Fast scans across tables.
Scalability in terms of writes as well as total data volume.
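As an illustration of the low-latency single-row access mentioned in the first point above, fetching one row through the HBase client API looks roughly like this; the table, row key, and column names are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SingleRowLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) {  // placeholder table

                // Point lookup by row key: served by the region holding that key,
                // without scanning the rest of the table's files on HDFS.
                Get get = new Get(Bytes.toBytes("row-42"));                     // placeholder row key
                get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("val"));       // placeholder column

                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("val"))));
            }
        }
    }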

Spark Fundamentals

I am new to Spark... there are some basic things I am not clear on when going through the fundamentals:
Query 1. For distributed processing, can Spark work without HDFS (the Hadoop file system) on a cluster, e.g. by creating its own distributed file system, or does it require some base distributed file system such as HDFS or GPFS as a prerequisite?
Query 2. If we already have a file loaded in HDFS (as distributed blocks), will Spark convert it into blocks again and redistribute it at its own level (for distributed processing), or will it just use the block distribution of the Hadoop HDFS cluster?
Query 3. Other than defining a DAG, does Spark also create partitions like MapReduce does and shuffle them to the reducer nodes for further computation?
I am confused about this. Up to DAG creation it's clear that a Spark executor running on each worker node loads data blocks as RDDs in memory and applies the computation according to the DAG... but where does the part come in that partitions the data by key and moves it to the other nodes where the reducer tasks run (just like MapReduce)? How is that done in memory?
These would be better asked as separate questions, and question 3 is hard to understand. Anyway:
No, Spark does not require a distributed file system.
By default Spark will create one partition per HDFS block, and will co-locate computation with the data if possible.
You're asking about shuffle. The shuffle creates blocks on the mappers that the reducers then fetch. The spark.shuffle.memoryFraction parameter controls how much memory to allocate to shuffle block files (20% by default), and the spark.shuffle.spill parameter controls whether to spill shuffle blocks to local disk when memory runs out.
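To make the partition-per-block point concrete, here is a minimal sketch that reads a file from HDFS and prints how many partitions Spark created for it; the path is a placeholder. For a 100 GB file with the default 128 MB block size you would expect roughly 800 partitions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PartitionCheck {
        public static void main(String[] args) {
            // Master/deploy settings are supplied by spark-submit.
            SparkConf conf = new SparkConf().setAppName("partition-check");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // textFile() uses the Hadoop input format underneath, so the existing
            // HDFS block layout drives the split/partition boundaries.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/bigfile.txt");  // placeholder path

            // For a 100 GB file with 128 MB blocks, expect roughly 800 partitions.
            System.out.println("partitions: " + lines.getNumPartitions());

            sc.stop();
        }
    }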
Query 1. For distributed processing, can Spark work without HDFS?
For distributed processing, Spark does not require HDFS, but it may read/write data from/to HDFS. For some use cases it may write data to HDFS; the world-record TeraSort program, for example, used HDFS for sorting the data instead of keeping it all in memory.
Spark doesn't provide distributed storage. Integration with HDFS is one option for storage, but Spark can use other storage systems such as Cassandra. Have a look at this article for more details: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
Query 2. If we already have a file loaded in HDFS (as distributed blocks), will Spark convert it into blocks again and redistribute it at its own level?
I agree with Daniel Darabos's response: Spark will create one partition per HDFS block.
Query 3: on shuffle
Depending on the size of the data, the shuffle will be done in memory, or it may use disk (e.g. TeraSort), or both. Have a look at this excellent article on Spark shuffle.
Fine with this. What if you don’t have enough memory to store the whole “map” output? You might need to spill intermediate data to disk. The parameter spark.shuffle.spill is responsible for enabling/disabling spilling, and by default spilling is enabled.
The amount of memory that can be used for storing “map” outputs before spilling them to disk is “JVM Heap Size” * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction, with default values it is “JVM Heap Size” * 0.2 * 0.8 = “JVM Heap Size” * 0.16.
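For completeness, these are ordinary configuration keys, so they can be set on the SparkConf (or with --conf on spark-submit). Note that spark.shuffle.spill and spark.shuffle.memoryFraction apply to the older Spark 1.x shuffle; later releases replaced them with unified memory management (spark.memory.fraction). A sketch under that assumption:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ShuffleTuning {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("shuffle-tuning")
                    // Legacy (Spark 1.x) shuffle settings discussed above:
                    .set("spark.shuffle.spill", "true")            // allow spilling map output to local disk
                    .set("spark.shuffle.memoryFraction", "0.3");   // raise the shuffle buffer above the 0.2 default

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build RDDs and run jobs as usual ...
            sc.stop();
        }
    }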
Query 1.
Yes, it can work with other filesystems as well. Spark works with RDDs; if you have a corresponding RDD implementation, that's it. When you create an RDD by opening a file in HDFS, it inherently creates a HadoopRDD, which has an implementation for understanding HDFS. If you write your own distributed file system, you can write your own RDD implementation for it, instantiate the class, and you're done. Writing the connector RDD for your own DFS is the challenge, though. For more, you can look at the RDD interface in the Spark code.
Query 2. It won't re-create the blocks; instead, by means of the Hadoop/HDFS RDD connector, it knows where the blocks are. It will also try to use the same YARN nodes to run the JVM tasks that do the processing.
Query 3. Not sure about this.
Query 1: Simply put, Spark provides distributed processing thanks to the RDD abstraction (resilient distributed dataset), but without HDFS it can't provide distributed storage.
Query 2: No, it won't re-create them. Spark exposes every block as a partition (i.e. a reference to that block), so it launches YARN tasks on the same blocks.
Query 3: No idea.

Spark vs MapReduce: why is Spark faster than MR, in principle?

As far as I know, Spark preloads the data from every node's disk (HDFS) into every node's RDDs to compute. But, as I understand it, MapReduce must also load the data from HDFS into memory and then compute on it in memory. So... why is Spark faster?
Is it just because MapReduce loads the data into memory every time it wants to compute, while Spark preloads the data? Thank you very much.
Spark uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk when needed.
On the other hand, in MapReduce, after the map and reduce tasks the data is shuffled, sorted (a synchronisation barrier), and written to disk.
In Spark there is no such synchronisation barrier to slow things down, and the use of memory makes the execution engine really fast.
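A minimal sketch of the caching behaviour described above: persist() keeps an RDD in memory across actions (optionally spilling to disk), so repeated use skips the re-read from HDFS that a chain of MapReduce jobs would pay. The path and the computation are placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.StorageLevels;

    public class RddReuse {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rdd-reuse"));

            // Keep the filtered data in memory, spilling partitions to disk if they don't fit.
            JavaRDD<String> errors = sc.textFile("hdfs:///data/events.log")   // placeholder path
                                       .filter(line -> line.contains("ERROR"))
                                       .persist(StorageLevels.MEMORY_AND_DISK);

            // Both actions reuse the cached RDD; only the first one reads from HDFS.
            System.out.println("errors:   " + errors.count());
            System.out.println("distinct: " + errors.distinct().count());

            sc.stop();
        }
    }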
Hadoop MapReduce:
1. Hadoop MapReduce is batch processing.
2. High latency, since it works on HDFS.
Here is a full explanation of Hadoop MapReduce and Spark:
http://commandstech.com/basic-difference-between-spark-and-map-reduce-with-examples/
Spark:
1. Spark is streaming processing.
2. Low latency because of RDDs.

Hive over HBase vs Hive over HDFS

My data does not need to be loaded in real time, so I don't have to use HBase, but I was wondering whether there are any performance benefits to using HBase in MR jobs; shouldn't the joins be faster due to the indexed data?
Does anybody have any benchmarks?
Generally speaking, Hive/HDFS will be significantly faster than HBase. HBase sits on top of HDFS, so it adds another layer. HBase would be faster if you are looking up individual records, but you wouldn't use an MR job for that.
Performance of HBase vs. Hive:
Based on the results for HBase, Hive, and Hive on HBase, it appears that the performance of either approach is comparable.
Hive on HBase Performance
Respectfully :) I want to tell you that if your data is not real-time and you are also considering MapReduce jobs, then just go with Hive over HDFS, as weblogs can be processed by a Hadoop MapReduce program and stored in HDFS. Meanwhile, Hive supports fast reading of the data at the HDFS location, basic SQL, joins, and batch data loads into the Hive database.
Hive also gives us bulk processing (and real-time where possible), an SQL-like interface, built-in optimized MapReduce, and partitioning of large data, which is more compatible with HDFS and avoids the extra HBase layer; if you add HBase here, its features would be redundant for you :)
