Time taken by MapReduce jobs - hadoop

I am new to Hadoop and MapReduce, and I have a problem running my data through Hadoop MapReduce: I want the results to be returned within milliseconds. Is there any way that I can execute my MapReduce jobs in milliseconds?
If not, then what is the minimum time Hadoop MapReduce can take on a fully distributed cluster (5-6 nodes)?
The file size to be analyzed in Hadoop MapReduce is around 50-100 MB.
The program is written in Pig. Any suggestions?

For ad-hoc, real-time querying of data, use Impala or Apache Drill (still a work in progress). Drill is based on Google Dremel.
Hive jobs get converted into MapReduce, so Hive is also batch-oriented in nature and not real-time. A lot of work is going on to improve the performance of Hive, though.

It's not possible (as far as I know). Hadoop is not meant for real-time work in the first place; it is best suited for batch jobs. The MapReduce framework needs some time to accept and set up the job, which you can't avoid, and I don't think it's a wise decision to buy ultra-high-end machines just to set up a Hadoop cluster. The framework also has to do a few things before actually starting the job, such as creating the logical splits of your data.

Related

MapReduce vs Spark vs Storm vs Drill - for small files

I know Spark does in-memory computation and is much faster than MapReduce.
I was wondering how well Spark works for, say, fewer than 10,000 records per file.
I have a huge number of files (each with around 10,000 records and roughly 100 columns) coming into my Hadoop data platform, and I need to perform some data quality checks before I load them into HBase.
I do the data quality checks in Hive, which uses MapReduce at the back end. Each file takes about 8 minutes, and that's pretty bad for me.
Will Spark give me better performance, let's say 2-3 minutes?
I know I have to do some benchmarking, but I was trying to understand the basics before I really get going with Spark.
As I recall, creating an RDD for the first time incurs some overhead, and since I have to create a new RDD for each incoming file, that is going to cost me a bit.
I am confused about which would be the best approach for me: Spark, Drill, Storm, or MapReduce itself?
I have been exploring the performance of Drill vs Spark vs Hive over a few million records. Drill and Spark are both around 5-10 times faster in my case (I did not run a performance test on a cluster with significant RAM; I only tested on a single node). The reason for the fast computation is that both of them perform in-memory computation.
The performance of Drill and Spark is almost comparable in my case, so I can't say which one is better; you need to try this on your end.
Testing Drill will not take much time: download the latest Drill, install it on your MapR Hadoop cluster, add the Hive storage plugin, and run the query.
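For illustration only, here is a minimal Spark (Scala) sketch of the kind of per-file quality check discussed in the question. The file path, the 100-column assumption, and the two rules are hypothetical placeholders, not a recommendation of one engine over another.

import org.apache.spark.{SparkConf, SparkContext}

object FileQualityCheck {
  def main(args: Array[String]): Unit = {
    // Local master for a quick test; point it at YARN for a real cluster run.
    val sc = new SparkContext(new SparkConf().setAppName("FileQualityCheck").setMaster("local[*]"))

    // Hypothetical input: one delimited file of ~10,000 records and ~100 columns.
    val rows = sc.textFile("hdfs:///incoming/batch-0001.csv").map(_.split(",", -1)).cache()

    // Two illustrative checks: expected column count and a non-empty first column.
    val badColumnCount = rows.filter(_.length != 100).count()
    val missingKey = rows.filter(r => r.headOption.forall(_.trim.isEmpty)).count()

    println(s"rows with wrong column count: $badColumnCount, rows missing key: $missingKey")
    sc.stop()
  }
}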

Pig on a single machine

Imagine that I have a file with 100 million records, and I want to use Pig to wrangle it.
I don't have a cluster, but I still want to use Pig for productivity reasons. Could I use Pig on a single machine, or will it have poor performance?
Will Pig simulate an MR job on a single machine, or will it use its own backend engine to execute the process?
A single machine processing 100 million records through Hadoop surely won't give you good performance.
For development/testing purposes you can use a single machine with a small or moderate amount of data, but not in production.
Hadoop scales its performance linearly as you add more nodes to the cluster.
A single machine can also act as a cluster.
Pig can run in two modes, local and mapreduce (chosen with pig -x local or pig -x mapreduce).
In local mode there are no Hadoop daemons and no HDFS; everything runs against the local filesystem.
In mapreduce mode, your Pig script is converted into MR jobs and then executed on the cluster.
Hope it helps!

What is the Hadoop ecosystem and how does Apache Spark fit in?

I'm having a lot of trouble grasping what exactly a 'Hadoop ecosystem' is conceptually. I understand that you have some data processing tasks that you want to run, and so you use MapReduce to split the job up into smaller pieces, but I'm unsure about what people mean when they say 'Hadoop ecosystem'. I'm also unclear about what the benefits of Apache Spark are and why it is seen as so revolutionary. If it's all in-memory computation, wouldn't that just mean that you would need machines with more RAM to run Spark jobs? How is Spark different from writing some parallelized Python code or something of that nature?
Your question is rather broad - the Hadoop ecosystem is a wide range of technologies that either support Hadoop MapReduce, make it easier to apply, or otherwise interact with it to get stuff done.
Examples:
The Hadoop Distributed Filesystem (HDFS) stores data to be processed by MapReduce jobs, in a scalable redundant distributed fashion.
Apache Pig provides a language, Pig Latin, for expressing data flows that are compiled down into MapReduce jobs.
Apache Hive provides an SQL-like language for querying huge datasets stored in HDFS.
There are many, many others - see for example https://hadoopecosystemtable.github.io/
Spark is not all in-memory; it can perform calculations in-memory if enough RAM is available, and can spill data over to disk when required.
It is particularly suitable for iterative algorithms, because data from the previous iteration can remain in memory. It provides a very different (and much more concise) programming interface, compared to plain Hadoop. It can provide some performance advantages even when the work is mostly done on disk rather than in-memory. It supports streaming as well as batch jobs. It can be used interactively, unlike Hadoop.
Spark is relatively easy to install and play with, compared to Hadoop, so I suggest you give it a try to understand it better - for experimentation it can run off a normal filesystem and does not require HDFS to be installed. See the documentation.
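As a rough sketch of both points (running off a normal filesystem with no HDFS, and spilling to disk when memory is short), something like the following works in a plain local Spark installation; the input path is just a placeholder.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LocalSparkSketch {
  def main(args: Array[String]): Unit = {
    // A local master needs neither HDFS nor a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("LocalSparkSketch").setMaster("local[*]"))

    // Read straight from the ordinary local filesystem (placeholder path).
    val words = sc.textFile("file:///tmp/sample.txt").flatMap(_.split("\\s+"))

    // MEMORY_AND_DISK keeps partitions in RAM when they fit and spills the rest to disk.
    words.persist(StorageLevel.MEMORY_AND_DISK)

    println(s"total words: ${words.count()}, distinct words: ${words.distinct().count()}")
    sc.stop()
  }
}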

MapReduce or Spark for Batch processing on Hadoop?

I know that MapReduce is a great framework for batch processing on Hadoop. But Spark can also be used as a batch framework on Hadoop, offering scalability, fault tolerance, and higher performance compared to MapReduce. Cloudera, Hortonworks, and MapR have started supporting Spark on Hadoop with YARN as well.
However, a lot of companies are still using the MapReduce framework on Hadoop for batch processing instead of Spark.
So I am trying to understand what the current challenges are for Spark to be used as a batch-processing framework on Hadoop.
Any thoughts?
Spark is an order of magnitude faster than MapReduce for iterative algorithms, since it gets a significant speedup from keeping intermediate data cached in the local JVM.
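A toy sketch of that pattern in Scala follows; the dataset path and the update step are illustrative stand-ins for a real iterative algorithm such as gradient descent.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

    // Parse once and cache; every iteration below reuses the cached RDD
    // instead of re-reading it from disk, which is where the speedup
    // over MapReduce comes from.
    val points = sc.textFile("data/points.txt").map(_.toDouble).cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // Placeholder update step standing in for, e.g., a gradient computation.
      estimate = points.map(p => p - estimate).mean()
    }
    println(s"final estimate: $estimate")
    sc.stop()
  }
}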
Spark 1.1 primarily includes a new shuffle implementation (sort-based instead of hash-based), a new network module (based on Netty instead of using the block manager for sending shuffle data), and a new external shuffle service. These made Spark perform the fastest petabyte sort (on 190 nodes with 46 TB of RAM) as well as a terabyte sort that broke Hadoop's old record.
Spark can easily handle datasets that are an order of magnitude larger than the cluster's aggregate memory. So my thought is that Spark is heading in the right direction and will eventually get even better.
For reference, this blog post explains how Databricks performed the petabyte sort.
I'm assuming when you say Hadoop you mean HDFS.
There are a number of benefits to using Spark over Hadoop MR.
Performance: Spark is at least as fast as Hadoop MR. For iterative algorithms (which need to make a number of passes over the same dataset) it can be a few orders of magnitude faster, because MapReduce writes the output of each stage to HDFS.
1.1. Spark can cache these intermediate results (depending on the available memory) and therefore reduce the latency due to disk IO.
1.2. Spark operations are lazy. This means Spark can perform certain optimizations before it starts processing the data, because it can reorder operations that have not yet been executed.
1.3. Spark keeps a lineage of operations and, in case of failure, recreates the lost partitions based on this lineage.
Unified Ecosystem: Spark provides a unified programming model for various types of analysis: batch (spark-core), interactive (the REPL), streaming (Spark Streaming), machine learning (MLlib), graph processing (GraphX), and SQL queries (Spark SQL).
Richer and Simpler API: Spark's API is richer and simpler. Richer because it supports many more operations (e.g., groupBy, filter ...). Simpler because of the expressiveness of these functional constructs. Spark's API supports Java, Scala and Python (for most APIs). There is experimental support for R.
Multiple Datastore Support: Spark supports many data stores out of the box. You can use Spark to analyze data in a normal or distributed file system, HDFS, Amazon S3, Apache Cassandra, Apache Hive, and Elasticsearch, to name a few. I'm sure support for many other popular data stores is coming soon. Essentially, if you want to adopt Spark, you don't have to move your data around.
For example, here is what the code for word count looks like in Spark (Scala).
val textFile = sc.textFile("some file on HDFS")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
I'm sure you would have to write quite a few more lines if you were using standard Hadoop MR.
Here are some common misconceptions about Spark.
"Spark is just an in-memory cluster computing framework." This is not true. Spark excels when your data fits in memory, because memory access latency is lower, but you can make it work even when your dataset doesn't completely fit in memory.
"You need to learn Scala to use Spark." Spark is written in Scala and runs on the JVM, but it provides most of its common APIs in Java and Python as well, so you can easily get started with Spark without knowing Scala.
"Spark does not scale: it is for small datasets (GBs) only and doesn't scale to a large number of machines or TBs of data." This is also not true; it has been used successfully to sort petabytes of data.
Finally, if you do not have a legacy codebase in Hadoop MR, it makes perfect sense to adopt Spark; the simple reason is that all major Hadoop vendors are moving towards Spark, and for good reason.
Apache Spark runs in memory, making it much faster than MapReduce.
Spark started as a research project at Berkeley.
MapReduce uses disk extensively (for the external sort, shuffle, etc.).
Since the input size for a Hadoop job is typically on the order of terabytes, Spark's memory requirements will be higher than those of traditional Hadoop.
So basically, for smaller jobs, and when your cluster has plenty of memory, Spark wins; and that is not practically the case for most clusters.
Refer to spark.apache.org for more details on Spark.

Performance comparison: Hive & MapReduce

Hive provides an abstraction layer over Java MapReduce jobs, so it should have performance issues when compared to Java MapReduce jobs.
Do we have any benchmarks comparing the performance of Hive queries and Java MapReduce jobs?
Actual use-case scenarios with runtime data would be a real help.
Thanks
Your premise that "it should have performance issues when compared to Java MapReduce jobs" is wrong.
Hive (and Pig and Crunch and other map/reduce abstractions) would be slower than a fully tuned, hand-written map/reduce job.
However, unless you're experienced with Hadoop and map/reduce, the chances are that the map/reduce code you'd write would be slower on non-trivial queries than what Hive et al. will do.
I did some small tests in a VM some time back and I couldn't really notice any difference. Maybe Hive was a few seconds slower sometimes, but I can't really tell whether that was Hive's performance or my VM hanging due to low memory. One thing to keep in mind is that Hive will always try to determine the fastest way to do a MapReduce job. Now, when you write small MapReduce jobs, you'll probably be able to find the fastest way yourself. But with large, complex jobs (with joins, etc.), will you always be able to compete with Hive?
Also, writing a MapReduce job with multiple classes and methods seems to take ages in comparison with writing a HiveQL query.
On the other hand, I had the feeling that when I wrote the job myself it was easier to know what was going on.
If you have a small dataset on your machine and want to process it using Apache Hive, execution of the job on that small dataset will be slower than processing the same dataset with Hadoop MapReduce; Hive's performance degrades slightly on small datasets. For large datasets, however, Apache Hive's performance would be better compared to MapReduce.
When processing datasets in MapReduce, the data is stored in HDFS. MapReduce has no database of its own, whereas Hive has a metastore. Through Hive's metastore, data can be shared with Impala, Beeline, and JDBC and ODBC drivers.
