Hadoop Map/Reduce with database

I am very new to Hadoop. I have learned a bit about its map/reduce functionality and I understand the word-count demo, but I do not get the actual use of Hadoop map/reduce for database-specific computations. In other words, I cannot see how map/reduce would help me with computations or database-specific processing. Can anyone provide a link or guide that shows the best uses and the scenarios I could implement to better understand the Hadoop map/reduce part?

Hadoop ships with a couple of input and output formats. The base InputFormat and OutputFormat classes can be extended for customized input/output formats.
DBInputFormat/DBOutputFormat come with Hadoop. Here is the documentation from Cloudera on using MapReduce with a database.
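To make that concrete, here is a rough sketch of a map-only job wired up with DBInputFormat (new MapReduce API). The JDBC driver, connection URL, credentials, and the "employees" table with its id/name columns are placeholders for illustration, not anything from the original question.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DbToHdfsJob {

  // Each database row is deserialized into one of these by DBInputFormat.
  public static class EmployeeRecord implements Writable, DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      name = rs.getString("name");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(name);
    }
  }

  // Map-only step: turn each row into an "id <tab> name" text line on HDFS.
  public static class DbMapper
      extends Mapper<LongWritable, EmployeeRecord, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, EmployeeRecord row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new LongWritable(row.id), new Text(row.name));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JDBC driver, connection URL and credentials are placeholders.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");

    Job job = Job.getInstance(conf, "db-to-hdfs");
    job.setJarByClass(DbToHdfsJob.class);
    job.setMapperClass(DbMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(DBInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Read the id and name columns of the hypothetical "employees" table.
    DBInputFormat.setInput(job, EmployeeRecord.class,
        "employees", null /* conditions */, "id" /* orderBy */, "id", "name");
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}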

Related

What does Spark's API newHadoopRDD really do?

I know that internally it uses MapReduce to get input from Hadoop, but can someone explain this in more detail?
Thanks.
What you are thinking is right. HadoopRDD is an RDD that provides the core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3).
It uses HadoopPartition for its partitions.
When a HadoopRDD is computed you can see an "Input split" log line, for example:
INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:0+1784
The following Hadoop properties are set when a partition is executed:
mapred.tip.id: the task id of this task's attempt
mapred.task.id: the task attempt's id
mapred.task.is.map: true
mapred.task.partition: the split id
mapred.job.id: the job id
HadoopRDD does nothing when checkpoint() is called.
You can look at the comments in HadoopRDD.scala; each of these properties is explained there.
NewHadoopRDD provides the core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3) using the new MapReduce API (org.apache.hadoop.mapreduce).
It also provides various other methods for finding out configuration details about the partitions, input splits, etc.
You can visit the documentation for a more detailed overview:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/NewHadoopRDD.html
Hope this resolves your query.
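For illustration, here is a minimal Java sketch of creating a pair RDD through the new Hadoop MapReduce API, which is what NewHadoopRDD backs under the covers; the input path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NewHadoopRddExample {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("new-hadoop-rdd"));

    // Each record is (byte offset, line of text); Spark creates one
    // partition per Hadoop input split, as described above.
    JavaPairRDD<LongWritable, Text> lines = sc.newAPIHadoopFile(
        "hdfs:///data/README.md",      // placeholder path
        TextInputFormat.class,         // new-API InputFormat
        LongWritable.class,            // key class
        Text.class,                    // value class
        new Configuration());

    System.out.println("partitions (one per input split): "
        + lines.partitions().size());

    sc.stop();
  }
}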

Difference between Hive, Pig, and map-reduce use cases

Difference between map-reduce, Hive, and Pig:
Pig: it is a data flow language; it can work on any data and is basically used to convert semi-structured or unstructured data into structured data, which can then be used in Hive for advanced analytics (windowing functions, etc.).
Hive: works on structured data and provides an SQL-type query language.
I know that at the back end both Pig and Hive use map-reduce.
I also know map-reduce can be a good tool for a programmer, and Hive or Pig for SQL people.
I just want to know whether there are specific use cases where we go for Hive, Pig, or map-reduce; basically, how do we decide to use Pig here, Hive there, or that we must use map-reduce?
Map-Reduce: has better performance than Pig or Hive but requires more development time.
Pig: less development time, but poorer performance compared to map-reduce.
Hive: an SQL-type language with some good features like partitioning and bucketing to improve read performance. Also, Hive enforces schema on read.
Pig is used to format your unstructured/semi-structured data. Let's say you have a timestamp in your data that is not in the Hive timestamp format; you can convert it using a Pig UDF and reformat your data. This is just one example to explain the idea; you can do many more things with Pig.
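For example, a rough sketch of such a UDF in Java (the input pattern "dd/MMM/yyyy:HH:mm:ss" is just an assumed example, not anything from the question):

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Reformat a timestamp string into Hive's "yyyy-MM-dd HH:mm:ss" layout.
public class ToHiveTimestamp extends EvalFunc<String> {
  private final SimpleDateFormat in = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss");
  private final SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    try {
      Date d = in.parse((String) input.get(0));
      return out.format(d);
    } catch (java.text.ParseException e) {
      // Returning null lets bad records flow through instead of failing the job.
      return null;
    }
  }
}

In the Pig script you would REGISTER the jar and call the UDF inside a FOREACH ... GENERATE before storing the cleaned data where Hive can read it.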
Hive is basically used for structured data and may not work well with unstructured data. It takes more time to execute because it converts queries into MapReduce jobs. I suggest you use Impala, which is much faster than Hive.
Pig is a data flow language. This means that you cannot use if statements or loops.
If you need to do a lot of repetition, it would be preferable to learn mapreduce.
You can get around this by embedding Pig in a Python script, but that would take even longer since it would have to load all the jar files with every iteration of the loop.
Basically it boils down to how much time you spend prototyping versus how much production work you have.
If you are a data scientist or an analyst, most of your work is new projects that require a lot of prototyping. This means that you care about getting results fast, so you would prefer Pig or Hive.
If you are in a development team, you want to build robust code based on an agreed-upon methodology that does not need to be tested over and over, and then you would prefer mapreduce.
There are companies like Cloudera that provide a package of Pig, Hive, and other Hadoop tools, so you wouldn't have to choose between them.
MapReduce is a core component of Hadoop, whereas Pig and Hive are Hadoop ecosystem tools, meaning they run on top of Hadoop. The purpose of all three, MapReduce, Pig, and Hive, is to process vast amounts of data in different ways.
MapReduce: implemented by Apache. Highly recommended for processing the entire data set, but it is time-consuming and requires programming skills such as Java (most common), Python, Ruby, or other languages. Data is aggregated and sorted using mapper and reducer functions. Hadoop uses it by default.
Hive: implemented by Facebook. Most analysts, especially big-data analysts, use this tool to analyze data, particularly structured data. At the back end, Hive uses MapReduce for processing. Hive uses a special language called HQL, a subset of SQL, so whoever is comfortable with SQL can go with Hive. It is highly recommended for data-warehouse-oriented projects. It is much more difficult to process unstructured, especially schema-less, data with it.
Pig:
Pig is a scripting language, implemented by Yahoo. The main difference between Pig and Hive is that Pig can process any type of data, structured or unstructured, which means it is highly recommended for streaming data like satellite-generated data, live events, schema-less data, etc. Pig first loads the data, and then the programmer writes a program depending on the data to make it structured. Those who are experts in programming languages will choose this Hadoop ecosystem tool.

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single node Hadoop cluster using Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (sorry if the question is naive). The information available on the internet is overwhelming.
You can use HCatalog or Impala for faster querying.
From your explanation, you have time-series data. Hadoop with HDFS alone is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend filesystem; it is good for random access.
Also, for your need to parse and rearrange the data, you can make use of Hadoop's MapReduce. HBase has built-in support for this: HBase can be used for the input/output of a MapReduce job.
You can get basic information from here. For a better understanding, try the HBase: The Definitive Guide or HBase in Action books.
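As an illustration of the MapReduce parsing step, here is a minimal sketch of a mapper that splits the log records described above into tab-separated fields; the whitespace-delimited layout is an assumption based on the sample record.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogParseMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Split into at most 5 pieces so spaces inside the message are preserved:
    // timestamp, req-id, level, module-name, message.
    String[] f = line.toString().split("\\s+", 5);
    if (f.length < 5) {
      return; // skip malformed records
    }
    String parsed = f[0] + "\t" + f[1] + "\t" + f[2] + "\t" + f[3] + "\t" + f[4];
    ctx.write(NullWritable.get(), new Text(parsed));
  }
}

You would run it as a map-only job (setNumReduceTasks(0)) writing back to HDFS; the tab-separated output can then be exposed as an external Hive table or loaded into HBase for searching.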

Can Hadoop MapReduce run over other filesystems?

I heard that for MapReduce jobs the input need not be in HDFS; it can be on another file system. Can someone please give me more details on this?
I am a little confused: in standalone mode the data can be on the local file system, but in cluster mode how can we point MapReduce jobs to some other file system?
No, it does not need to be in HDFS. For instance, jobs which target HBase using its TableInputFormat pull records over the network from HBase nodes as input to their map tasks. DBInputFormat can be used to pull data from a SQL database into a job. You could also build an InputFormat that does something like read data off an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if the data is local to the nodes where the job runs, since disk throughput > network throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a couple of InputFormats, and it's not difficult to write a custom InputFormat either, say to feed a proprietary format into a job.
Along the same lines, Hadoop provides a couple of OutputFormats, and it shouldn't be difficult to write a custom OutputFormat either.
Here is a nice article on the DBInputFormat.
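To make the filesystem point concrete, here is a quick sketch showing that which filesystem a job reads from is chosen by the URI scheme of its input paths; the host, bucket, and directory names are placeholders.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputPathSchemes {
  public static void configurePaths(Job job) throws IOException {
    // Any of these work as long as the corresponding FileSystem
    // implementation is on the classpath and configured.
    FileInputFormat.addInputPath(job, new Path("hdfs://namenode:8020/logs/"));
    FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/logs/"));   // Amazon S3
    FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/logs/"));   // local/NFS mount

    FileOutputFormat.setOutputPath(job, new Path("hdfs:///output/parsed-logs"));
  }
}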
Another way to achieve this is to put files into HDFS that contain information about where the real data is. The mapper reads this information and pulls the real data for processing.
For example, we can have several files containing the URLs of the data to be processed.
What we lose in this case is data locality; otherwise it works fine.
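A rough sketch of what such a URL-fetching mapper could look like (plain HTTP here, one URL per line; all names are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The HDFS input contains one URL per line; each mapper fetches the real
// data itself. Note that data locality is lost, as mentioned above.
public class UrlFetchMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void map(LongWritable offset, Text urlLine, Context ctx)
      throws IOException, InterruptedException {
    URL url = new URL(urlLine.toString().trim());
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(url.openStream()))) {
      String record;
      while ((record = reader.readLine()) != null) {
        // Emit each fetched record for downstream processing.
        ctx.write(NullWritable.get(), new Text(record));
      }
    }
  }
}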

Hector's batch Mutation vs. using Hadoop jobs to load data into Cassandra?

Can someone highlight the pros and cons for Hector's batch Mutation and using Hadoop jobs to load data into Cassandra?
I know in Hector you can do something like the following:
mutator.addInsertion(...);
mutator.execute();
And in Hadoop you can use MR jobs to load data into Cassandra.
I'm looking for the reasons to use or not to use each of them. Thanks!
If the data source is not currently in Hadoop (or HBase), I would recommend just a multi-threaded loader using Mutator, as above, to keep down the number of moving parts.
This gist is dated, but the approach would be similar:
https://gist.github.com/397574
Let me know if you want more details.
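For reference, here is a rough single-threaded sketch of that Mutator-based loader; the cluster, keyspace, and column family names are hypothetical, and a real loader would shard the input across several threads.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class SimpleLoader {
  public static void main(String[] args) {
    Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
    Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

    int batchSize = 500;   // flush every N insertions to bound memory use
    int pending = 0;
    for (String[] row : loadRowsSomehow()) {      // e.g. {rowKey, columnName, value}
      mutator.addInsertion(row[0], "MyColumnFamily",
          HFactory.createStringColumn(row[1], row[2]));
      if (++pending % batchSize == 0) {
        mutator.execute();                        // send the accumulated batch
      }
    }
    mutator.execute();                            // flush the remainder
  }

  // Placeholder for whatever source the data actually comes from.
  private static Iterable<String[]> loadRowsSomehow() {
    return java.util.Collections.emptyList();
  }
}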
