Pig local vs mapreduce mode performance comparison - hadoop

I have set up a 3-node Hadoop cluster with Cloudera Manager (CDH4). When I ran a Pig job in mapreduce mode, it took double the time of local mode for the same data set. Is that expected behavior?
Also, is there any documentation available on performance-tuning options for MapReduce jobs?
Thanks much for any help!

This is probably because you are using a toy dataset, and the overhead of MapReduce is larger than the benefit of parallelization.

A good start for performance tuning is the "Making Pig Fly" chapter from the "Programming Pig" book.

Another reason is that when you run in -x local mode, Pig does not do the same jar compilation it does in mapreduce mode. With small data sets and a complex Pig script, the jar compilation time becomes noticeable.
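As a rough illustration, you can time the same script in both modes; the file name and schema below are made up:

    -- Run the same script in both modes and compare wall-clock times:
    --   pig -x local filter.pig       (single JVM, local filesystem, no job jar shipped)
    --   pig -x mapreduce filter.pig   (compiled into MR jobs, job jar built and submitted)
    lines  = LOAD 'input/sample.txt' AS (line:chararray);
    errors = FILTER lines BY line MATCHES '.*error.*';
    DUMP errors;

On a toy input, the jar compilation and job setup in the second invocation can easily dominate the total runtime.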

Related

Performance of Pig in local mode vs mapreduce mode

I have a Hadoop cluster with 3 nodes and 12 GB of data (about 1.5 million records). I understand that Pig can be run in local mode (for development purposes) and in mapreduce mode.
For a little research project I am comparing processing times of running Pig in local and mapreduce mode.
When doing performance measurements, the processing time in local mode is much faster than in mapreduce mode. (My code consists of loading the data file using JsonLoader with a schema, filtering, and dumping the result, roughly as in the sketch below.)
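A minimal version of that pipeline, with hypothetical path and field names, would look something like:

    -- JsonLoader with an explicit schema, then a filter and a dump.
    -- The path and fields here are placeholders.
    events   = LOAD 'data/events.json'
               USING JsonLoader('id:long, type:chararray, amount:double');
    filtered = FILTER events BY type == 'purchase';
    DUMP filtered;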
Is there a rule of thumb for when mapreduce mode is faster than local mode?
Thank you!
It's not clear how you've tuned the YARN cluster to accommodate your workload, or how large the files you're reading actually are.
In general, 12 GB is not enough data to warrant the use of Hadoop/MapReduce, assuming Pig can do multi-processing on its own.
However, if the files are split amongst the datanodes, and you have allocated enough resources to each of those 3 machines, then the job should complete faster than on just one machine.
You could further improve runtimes by running Pig on the Tez or Spark execution engines.
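Switching engines is a launch-time flag, assuming a Pig release new enough to include them (Tez support arrived in Pig 0.14, Spark in Pig 0.17):

    pig -x mapreduce script.pig   # classic MapReduce engine
    pig -x tez script.pig         # Tez DAG engine (Pig 0.14+)
    pig -x spark script.pig       # Spark engine (Pig 0.17+)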

Pig on a single machine

Imagine that I have a file with 100 MM records, and I want to use Pig to wrangle it.
I don't have a cluster, but I still want to use Pig for productivity reasons. Could I use Pig on a single machine, or would it perform poorly?
Will Pig simulate an MR job on a single machine, or will it use its own backend engine to execute the process?
A single machine processing 100 MM records through Hadoop certainly won't give you good performance.
For development/testing purposes you can use a single machine with a small-to-moderate amount of data, but not in production.
Hadoop scales its performance linearly as you add more nodes to the cluster.
A single machine can also act as a cluster (pseudo-distributed mode).
Pig can run in 2 modes: local and mapreduce.
In local mode there are no Hadoop daemons and no HDFS; everything runs in a single JVM against the local filesystem.
In mapreduce mode, your Pig script is converted into MR jobs, which are then executed.
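For instance, the same LOAD resolves against different filesystems depending on the launch mode (the path below is hypothetical):

    -- pig -x local      reads 'data/records.txt' from the local filesystem
    -- pig -x mapreduce  reads 'data/records.txt' from HDFS
    records = LOAD 'data/records.txt' AS (id:long, payload:chararray);
    grouped = GROUP records ALL;
    total   = FOREACH grouped GENERATE COUNT(records);
    DUMP total;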
Hope it helps!

Time taken by MapReduce jobs

I am new to Hadoop and MapReduce. I have a problem running my data through Hadoop MapReduce: I want the results to be returned in milliseconds. Is there any way I can execute my MapReduce jobs in milliseconds?
If not, what is the minimum time Hadoop MapReduce can take on a fully distributed multi-node cluster (5-6 nodes)?
The file size to be analyzed in Hadoop MapReduce is around 50-100 MB.
The program is written in Pig. Any suggestions?
For ad-hoc, real-time querying of data, use Impala or Apache Drill (WIP). Drill is based on Google Dremel.
Hive jobs get converted into MapReduce, so Hive is also batch-oriented in nature and not real time, though a lot of work is going on to improve Hive's performance.
It's not possible (AFAIK). Hadoop is not meant for real-time work in the first place; it is best suited for batch jobs. The MapReduce framework needs some time to accept and set up the job, which you can't avoid. And I don't think it's a wise decision to get ultra-high-end machines to set up a Hadoop cluster. Also, the framework has to do a few things before actually starting the job, creating the logical splits of your data, for instance.

Performance comparison: Hive & MapReduce

Hive provides an abstraction layer over Java MapReduce jobs, so it should have performance issues compared to Java MapReduce jobs.
Is there any benchmark comparing the performance of Hive queries and Java MapReduce jobs?
Actual use-case scenarios with runtime data would be a real help.
Thanks
Your premise that it "should have performance issues when compared to Java MapReduce jobs" is wrong.
Hive (and Pig, Crunch, and other map/reduce abstractions) would be slower than a fully tuned, hand-written map/reduce job.
However, unless you're experienced with Hadoop and map/reduce, the chances are that the map/reduce code you'd write would be slower on non-trivial queries than what Hive et al. produce.
I did some small tests in a VM some time back and I couldn't really notice any difference. Maybe Hive was a few seconds slower sometimes, but I can't really tell whether that was Hive's performance or my VM hanging due to low memory. One thing to keep in mind is that Hive will always determine the fastest way to do a MapReduce job. When you write small MapReduce jobs, you'll probably be able to find the fastest way yourself. But with large, complex jobs (with joins, etc.), will you always be able to compete with Hive?
Also, writing a MapReduce job with multiple classes and methods seems to take ages compared to writing a HiveQL query.
On the other hand, I had the feeling that when I wrote the job myself it was easier to know what was going on.
If you have a small dataset on your machine and want to process it using Apache Hive, execution will be slow compared to processing the same dataset with Hadoop MapReduce: Hive's performance degrades slightly on small datasets. For large datasets, though, Apache Hive's performance would be better compared to MapReduce.
While processing datasets in MapReduce, the dataset is stored in HDFS. MapReduce has no database of its own, whereas Hive has a metastore; from Hive's metastore, data can be shared with Impala, Beeline, and the JDBC and ODBC drivers.

Does Apache Pig have any limitations on the input data size?

When working with terabytes of data, for a typical data-filtering problem, is Apache Pig the right choice? Or is it better to have custom MapReduce code doing the job?
Apache Pig does not serve as a storage layer. Pig is a scripting language that simplifies the creation of code that can run on Hadoop. A Pig script is compiled into a set of Hadoop MapReduce jobs that are submitted to Hadoop and run in the same way as any other MapReduce job.
Hadoop does the data storage, not Pig.
To answer your question: no, there are no limitations on the size of the input data, as long as the input data can be parsed by Pig load functions and is splittable by the Hadoop InputFormats.
Pig scripts are easier and faster to write than standard Java Hadoop jobs, and Pig has a lot of clever optimizations, like multi-query execution, which can make your complex queries execute quicker (see the sketch below).
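As a rough sketch of multi-query execution (file names and fields are made up): both STORE statements below share one LOAD, so Pig can serve them from a single scan of the input rather than running two separate jobs.

    -- Two outputs derived from one LOAD; Pig's multi-query optimization
    -- combines them so the input is scanned only once.
    logs   = LOAD 'input/events.tsv' AS (user:chararray, action:chararray, bytes:long);
    clicks = FILTER logs BY action == 'click';
    views  = FILTER logs BY action == 'view';
    STORE clicks INTO 'out/clicks';
    STORE views INTO 'out/views';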
