What does Spark's API newHadoopRDD really do?

I know that internally it uses MapReduce to get input from Hadoop, but can anyone explain this in more detail?
Thanks.

What you are thinking is right. HadoopRDD is an RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3) using the old MapReduce API (org.apache.hadoop.mapred).
It uses HadoopPartition, one partition per input split.
When a HadoopRDD is computed you can see "Input split" entries in the logs, for example:
INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:0+1784
The following Hadoop properties are set when a partition is executed:
mapred.tip.id - task id of this task's attempt
mapred.task.id - task attempt's id
mapred.task.is.map - true
mapred.task.partition - split id
mapred.job.id
HadoopRDD does nothing when checkpoint() is called.
You can read the comments in HadoopRDD.scala; each of these properties is pretty self-explanatory.
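To see this in action, here is a minimal sketch with the Spark Java API (local master, and the README path taken from the example log line above) where sc.hadoopFile builds a HadoopRDD backed by the old org.apache.hadoop.mapred input formats; running it with INFO logging enabled should show the same kind of "Input split" lines:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class OldApiHadoopRddExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("hadoopFile-example").setMaster("local[*]"));

        // hadoopFile uses the old MapReduce API (org.apache.hadoop.mapred),
        // so the resulting RDD is a HadoopRDD with one HadoopPartition per input split.
        JavaPairRDD<LongWritable, Text> readme = sc.hadoopFile(
                "file:///Users/jacek/dev/oss/spark/README.md",
                TextInputFormat.class,
                LongWritable.class,
                Text.class);

        // Forcing computation makes each partition log its "Input split: ..." line.
        System.out.println("lines: " + readme.count());
        sc.stop();
    }
}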

NewHadoopRDD provides the core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3) using the new MapReduce API (org.apache.hadoop.mapreduce).
It also provides various methods for finding out configuration details about the partitions, input splits, etc.
You can visit the documentation for a more detailed overview:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/NewHadoopRDD.html
Hope this solves your query.
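For completeness, a minimal hedged sketch with the Spark Java API (the input path is hypothetical) of how sc.newAPIHadoopFile produces a NewHadoopRDD from a new-API (org.apache.hadoop.mapreduce) input format:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NewApiHadoopRddExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("newAPIHadoopFile-example").setMaster("local[*]"));

        // newAPIHadoopFile uses the new MapReduce API (org.apache.hadoop.mapreduce),
        // so Spark builds a NewHadoopRDD with one partition per input split.
        JavaPairRDD<LongWritable, Text> lines = sc.newAPIHadoopFile(
                "hdfs:///data/input.txt",      // hypothetical path
                TextInputFormat.class,         // new-API input format
                LongWritable.class,            // key: byte offset of the line
                Text.class,                    // value: the line itself
                new Configuration());

        System.out.println("partitions (input splits): " + lines.partitions().size());
        sc.stop();
    }
}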

Related

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single node Hadoop cluster using Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (sorry if the question is naive). The information available on the internet is overwhelming.
You can use HCatalog or Impala for faster querying.
From your explanation you have time-series data. Hadoop with HDFS by itself is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend filesystem; it is good for random access.
Also, for your need of parsing and rearranging the data, you can make use of Hadoop's MapReduce. HBase has built-in support for this: it can be used as the input and/or output of a MapReduce job.
You can get basic information from here. For a better understanding, try the HBase: The Definitive Guide and HBase in Action books.
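To give a concrete flavor of the parse-and-store step, here is a hedged sketch (HBase 1.x-style client API; the table name, column family, and sample record are all hypothetical) that stores one parsed log record under a module-plus-timestamp row key, so a range scan returns a module's records in time order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LogRecordWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("logs"))) {

            // One line of the Flume output: timestamp req-id level module-name message
            String[] fields = "1369862400000 42 ERROR auth login failed".split(" ", 5);

            // Row key = module-name + timestamp, so a scan over "auth|..." returns
            // that module's records sorted by time.
            byte[] rowKey = Bytes.toBytes(fields[3] + "|" + fields[0]);

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("req_id"), Bytes.toBytes(fields[1]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("level"), Bytes.toBytes(fields[2]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("message"), Bytes.toBytes(fields[4]));
            table.put(put);
        }
    }
}

The same parsing logic can also live inside a MapReduce mapper when the volume calls for it.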

Please clarify my understanding of Hadoop/HBase

I have been reading white papers and watching YouTube videos for half the day now and believe I have a proper understanding of the technology, but before I start my project I want to make sure it's right.
So with that, here's what I think I know.
As I understand the architecture of Hadoop and HBase, they pretty much model out like this:
-----------------------------------------
| Mapreduce |
-----------------------------------------
| Hadoop | <-- hbase export--| HBase |
| | --apache pig --> | |
-----------------------------------------
| HDFS |
----------------------------------------
In a nutshell HBase is a completely different DB engine tuned for real time updates and queries that happens to run on the HDFS and is compatible with Mapreduce.
Now, assuming the above is correct, here is what else I think I know.
Hadoop is designed for big data from start to finish. The engine uses a distributed append-only system, which means you cannot delete data once it's inserted. To access the data you can use MapReduce, or the HDFS shell and HDFS API.
Hadoop does not like small chunks and it was never intended to be a real-time system. You would not want to store a single person and address per file; you would in fact store a million people and addresses per file and insert the large file.
HBase on the other hand is a pretty typical NoSQL database engine that in spirit compares to CouchDB, RavenDB, etc. The notable difference is that it's built on Hadoop's HDFS, allowing it to scale reliably to sizes limited only by your wallet.
Hadoop is a collection of a file system (HDFS) and Java APIs to perform computation on HDFS. HBase is a NoSQL database engine that uses HDFS to efficiently store data across a cluster.
To build a MapReduce job that accesses data from both Hadoop and HBase, one would be best off using HBase export to push the HBase data into Hadoop and writing the job to process that data, but MapReduce can access both systems one at a time.
You must be very careful when designing your HBase tables, as HBase does not natively support indexing fields within a row; HBase only indexes the primary key (the row key). Many tips and tricks help work around this fact.
OK, so if I'm still accurate to this point, this would be a valid use case.
You build the site with HBase. You use HBase the same as you would any other NoSQL or RDBMS to build out your functionality. Once that's done, you put your metrics logging points in the code to record your metrics in, say, log4j. You create a new appender in log4j with rules that say when the log file reaches 1 GB in size, push it to the Hadoop cluster, delete it, create a new file, and go on with life.
Later, a MapReduce developer can write a routine that uses HBase export to grab a data set from HBase, say a list of user IDs, then go to the logs that are stored in Hadoop and find the breadcrumb trail for each user through the system for a given timespan.
OK, with all that said, now for the specific question. Are statements 1-6 accurate?
EDIT: I have updated my beliefs above based on the answers received.
You can access the file in HDFS directly via HDFS shell or HDFS API.
Correct.
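For example, a minimal sketch with the HDFS FileSystem Java API (the path is hypothetical), doing programmatically what hdfs dfs -cat does from the shell:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // picks up core-site.xml / hdfs-site.xml
        Path path = new Path("hdfs:///user/demo/people.csv"); // hypothetical file
        try (FileSystem fs = FileSystem.get(path.toUri(), conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}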
I am not familiar with CouchDB or RavenDB, but in HBase you cannot have secondary indexes, so you must carefully design your row key to speed up your queries. There are a lot of HBase schema design tips on the internet you can google for.
I think it is more appropriate to call Hadoop a computing engine than a database engine. If you want to import HDFS data into HBase, you can use Apache Pig as stated in this post. If you want to export HBase data to HDFS, you can use the export utility.
MapReduce is a component of the Hadoop framework and it does not sit on top of HBase. You can access HBase data in a MapReduce job because HBase uses HDFS for its storage. I don't think you want to access the HFiles directly from a MapReduce job, because the raw files are encoded in a special format; they are not easy to parse and the format might change in future releases.
Since HBase and Hadoop are different database engines, one cannot access the data in the other directly. For HBase to get something out of Hadoop, it must go through MapReduce, and vice versa.
This is not true, since Hadoop is not a database engine. Hadoop is a collection of a file system (HDFS) and Java APIs to perform computation on HDFS.
Furthermore, MapReduce is not a technology; it is a model for working in parallel on HDFS data.
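To make the MapReduce-over-HBase point concrete, here is a hedged sketch (the table name and output path are hypothetical) of a map-only job that uses TableMapReduceUtil so its input is an HBase table, scanned through TableInputFormat, rather than HDFS files:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HBaseScanJob {

    // The mapper receives one HBase row (row key + Result) per call, courtesy of TableInputFormat.
    public static class RowKeyMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(Bytes.toString(rowKey.get())), new IntWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-scan-example");
        job.setJarByClass(HBaseScanJob.class);

        // Wires TableInputFormat so the job scans the "users" table instead of HDFS files.
        TableMapReduceUtil.initTableMapperJob(
                "users", new Scan(), RowKeyMapper.class, Text.class, IntWritable.class, job);

        job.setNumReduceTasks(0);                       // map-only job
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase-scan-out")); // hypothetical path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}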

Understanding more about Hadoop/HDFS Data Loading

I'm researching Hadoop and MapReduce (I'm a beginner!) and have a simple question regarding HDFS. I'm a little confused about how HDFS and MapReduce work together.
Let's say I have logs from System A, tweets, and a stack of documents from System B. When this is loaded into Hadoop/HDFS, is it all thrown into one big HDFS bucket, or would there be 3 areas (for want of a better word)? If so, what is the correct terminology?
The question stems from understanding how to execute a MapReduce job. If I only wanted to concentrate on the logs, for example, can this be done, or are all jobs executed on the entire content stored on the cluster?
Thanks for your guidance!
TM
HDFS is a file system. As in your local filesystem, you can organize all your logs and documents into multiple files and directories. When you run MapReduce jobs, you usually specify a directory with your input files. Thus it is possible to execute a job only on the logs from System A or only on the documents from System B.
However, the input for your mappers is specified by the InputFormat. Most implementations originate from FileInputFormat, which reads files, but it is also possible to implement custom InputFormats in order to read data from other sources. You can find an explanation of input and output formats in this Hadoop tutorial.
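As a hedged sketch (directory names hypothetical), the driver below only adds the System A log directory as input, so the tweets and System B documents are simply not part of the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogsOnlyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "system-a-logs-only");
        job.setJarByClass(LogsOnlyJobDriver.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Only the System A log directory is read; /data/tweets and /data/systemB
        // are simply not part of this job's input.
        FileInputFormat.addInputPath(job, new Path("/data/systemA/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out/systemA-copy"));

        // No mapper or reducer set: the identity Mapper/Reducer run, so this job just
        // copies the selected input, which is enough to show how input scope works.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}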

Load data into Hbase table using HBASE MAP REDUCE API

I am very new to HBase and the MapReduce API, and I am confused about the MapReduce concepts. I need to load a text file into an HBase table using the MapReduce API. I googled some examples, but in those I could find a mapper() but no reducer() method. I am confused about when to use a mapper and when to use a reducer. I am thinking along these lines:
To write data to HBase we use a mapper.
To read data from HBase we use a mapper and a reducer.
Can anyone please clear this up for me with a detailed explanation?
I am trying to load data from a text file into an HBase table. I googled and tried some code, but I don't know how to load the text file and read it in the HBase MapReduce API.
I would really be thankful for any help.
With regard to your questions:
The Mapper receives splits of data and emits <key, value> pairs.
The Reducer receives the Mapper's output grouped by key, i.e. <key, set of values>, and generates <key, value> pairs.
Generally it will be your Reducer task that writes results (to the filesystem or to HBase), but the Mapper can do that too. There are MapReduce jobs which don't require a Reducer. With regard to reading from HBase, it's the Mapper class that has the configuration for which table to read from, but there's nothing that says the Mapper is a reader and the Reducer a writer. The article "HBase MapReduce Examples" provides good examples of how to read from and write into HBase using MapReduce.
In any case, if what you need is to bulk import some .csv files into HBase, you don't really need to do it with a MapReduce job. You can do it directly with the HBase API. In pseudocode:
table = hbase.createTable(tablename, fields);
foreach (File file : dir) {
    content = readfile(file);
    hbase.insert(table, content);
}
I wrote an importer of .mbox files into HBase. Take a look at the code; it may give you some ideas.
Once your data is imported into HBase, you would then code a MapReduce job to operate on that data.
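If you do want to do the load with MapReduce itself, here is a hedged sketch (the table name, column family, and the tab-separated input layout are all hypothetical) of a map-only job that parses each text line into a Put and writes it to HBase through TableOutputFormat; no reducer is needed:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class TextToHBaseJob {

    // Parses lines like "rowkey<TAB>value" and turns each one into an HBase Put.
    public static class TextToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length < 2) {
                return;                                    // skip malformed lines
            }
            Put put = new Put(Bytes.toBytes(parts[0]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(parts[1]));
            context.write(new ImmutableBytesWritable(Bytes.toBytes(parts[0])), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "text-to-hbase");
        job.setJarByClass(TextToHBaseJob.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input.txt"));   // hypothetical input
        job.setMapperClass(TextToPutMapper.class);

        // Wires TableOutputFormat so the Puts emitted by the mapper land in "mytable";
        // passing null for the reducer keeps this a map-only write.
        TableMapReduceUtil.initTableReducerJob("mytable", null, job);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}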
Using HFileOutputFormat with CompleteBulkLoad is the best and fastest way to load data into HBase.
You will find sample code here
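A hedged sketch of that approach (HBase 1.x-era API; the table name and paths are hypothetical, and it reuses a Put-emitting mapper like TextToPutMapper from the sketch above, assumed to be on the classpath): the job writes HFiles with HFileOutputFormat2, then LoadIncrementalHFiles (the completebulkload step) moves them into the table's regions, bypassing the normal write path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("mytable");
        Path input = new Path("/data/input.txt");       // hypothetical input
        Path hfiles = new Path("/tmp/hfiles-staging");  // hypothetical staging dir

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin();
             Table table = connection.getTable(tableName);
             RegionLocator locator = connection.getRegionLocator(tableName)) {

            Job job = Job.getInstance(conf, "hfile-bulk-load");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(TextToHBaseJob.TextToPutMapper.class); // Put-emitting mapper from the previous sketch
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, hfiles);

            // Sorts and partitions the mapper's Puts so the HFiles line up with the table's regions.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            if (job.waitForCompletion(true)) {
                // completebulkload step: hand the finished HFiles over to the region servers.
                new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, admin, table, locator);
            }
        }
    }
}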
Here are a couple responses of mine that address loading data into HBASE.
What is the fastest way to bulk load data into HBase programmatically?
Writing to HBase in MapReduce using MultipleOutputs
EDIT: Adding additional link based on comment
This link might help make the file available for processing.
Import external libraries in an Hadoop MapReduce script

Can Hadoop MapReduce run over other filesystems?

I heard that for MapReduce jobs the input need not be in HDFS; it can be on another file system. Can someone please give me more details on this?
I am a little confused about this. In standalone mode, the data can be on the local file system, but in cluster mode how can we point MapReduce jobs to some other file system?
No, it does not need to be in HDFS. For instance, jobs which target HBase using its TableInputFormat pull records over the network from HBase nodes as input to their map tasks. The DBInputFormat can be used to pull data from a SQL database into a job. You could also build an input format that does something like read data off an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if you can have your data locally on the nodes where the job is being run since Disk Throughput > Network Throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a couple of InputFormats. It's not difficult to write a custom InputFormat either, say to provide a proprietary format as input to a job.
Along the same lines, Hadoop provides a couple of OutputFormats, and it shouldn't be difficult to write a custom OutputFormat either.
Here is a nice article on the DBInputFormat.
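As a small hedged sketch (the bucket and mount paths are hypothetical), the driver only needs input URIs whose scheme a configured FileSystem implementation understands; nothing ties the job itself to hdfs://:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NonHdfsInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "non-hdfs-input");
        job.setJarByClass(NonHdfsInputDriver.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Input can live outside HDFS: an S3 bucket (s3n:// or s3a:// depending on
        // your Hadoop version) and a locally mounted NFS export via the local fs.
        FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/logs/"));
        FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/export/data/"));

        // Output can still go to HDFS (or any other supported filesystem).
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/out/non-hdfs-input"));

        // Identity mapper/reducer are enough here; the point is only that the input
        // location is a property of the InputFormat and paths, not of MapReduce itself.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}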
Another way to achieve it is to put files into HDFS that contain information about where the real data is. The mapper will read this information and pull the real data in for processing.
For example, we can have several files with the URLs of the data to be processed.
What we lose in this case is data locality; otherwise it is fine.
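A hedged sketch of that indirection pattern (the URL list and the per-URL processing are hypothetical): each HDFS input line is just a URL, and the mapper fetches the real data itself, at the cost of data locality:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input records are URLs stored in HDFS; the real data is pulled over the network here.
public class UrlFetchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text urlLine, Context context)
            throws IOException, InterruptedException {
        String url = urlLine.toString().trim();
        long lines = 0;
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {
            while (reader.readLine() != null) {
                lines++;                        // stand-in for the real processing
            }
        }
        // Emit (url, number of lines fetched) as a trivial result.
        context.write(new Text(url), new LongWritable(lines));
    }
}

Wire it into a job with TextInputFormat pointing at the URL list, as in the earlier driver sketches.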
