Pig: Slow Group By operator - hadoop

After benchmarking Hive and Pig, I found that the Group By operator in Pig is drastically slower than Hive's. I was wondering whether anybody has experienced the same? And whether people may have any tips for improving the performance of this operation? (Adding a DISTINCT as suggested by an earlier post on here doesn't help. I am currently re-running the benchmark with LZO compression enabled.)

It seems that you are looking at this the wrong way. Group By just groups the data in some way; what you do afterwards matters a great deal. When trying to analyze performance in Pig, you should keep these things in mind:
1) Several statements can be merged into a single MR job, so don't look at the statements, look at the performance of the generated MR jobs.
2) There should be a reason for a drastic difference in performance. This may be:
2.1 A different input format or other differing circumstances when benchmarking Pig vs. Hive.
2.2 Combiner being disabled for some reason:
http://pig.apache.org/docs/r0.9.1/perf.html#When+the+Combiner+is+Used
This happens to be the bottleneck for me in most cases.
In my experience there is no drastic difference between Pig and Hive performance.

Related

How to join big dataframes in Spark SQL? (best practices, stability, performance)

I'm getting the same error as in Missing an output location for shuffle when joining big dataframes in Spark SQL. The recommendation there is to set MEMORY_AND_DISK and/or spark.shuffle.memoryFraction to 0. However, spark.shuffle.memoryFraction is deprecated in Spark >= 1.6.0 and setting MEMORY_AND_DISK shouldn't help if I'm not caching any RDD or Dataframe, right? Also I'm getting lots of other WARN logs and task retries that lead me to think that the job is not stable.
Therefore, my question is:
What are best practices to join huge dataframes in Spark SQL >= 1.6.0?
More specific questions are:
How to tune the number of executors and spark.sql.shuffle.partitions to achieve better stability/performance?
How to find the right balance between the level of parallelism (number of executors/cores) and the number of partitions? I've found that increasing the number of executors is not always the solution, as it may generate I/O read timeout exceptions because of network traffic.
Is there any other relevant parameter to be tuned for this purpose?
My understanding is that joining data stored as ORC or Parquet offers better performance than text or Avro for join operations. Is there a significant difference between Parquet and ORC?
Is there an advantage of SQLContext vs HiveContext regarding stability/performance for join operations?
Is there a difference regarding performance/stability when the dataframes involved in the join have previously been registered with registerTempTable() or saved with saveAsTable()?
So far I'm using this answer and this chapter as a starting point. And there are a few more Stack Overflow pages related to this subject. Yet I haven't found a comprehensive answer to this popular issue.
Thanks in advance.
Those are a lot of questions. Allow me to answer them one by one:
In a production environment, the number of executors is variable most of the time; it depends on the available resources. The number of partitions is important when you are performing shuffles. Assuming that your data is not skewed, you can lower the load per task by increasing the number of partitions.
A task should ideally take a couple of minutes. If a task takes longer, it is possible that your container gets pre-empted and the work is lost. If a task takes only a few milliseconds, the overhead of starting the task becomes dominant.
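To make that concrete, here is a minimal Scala sketch (assuming Spark 2.x); the paths, the join key and the partition count of 800 are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch, assuming Spark 2.x. Paths, join key and partition count
    // are placeholders; pick the count so that each shuffle task runs for a
    // couple of minutes rather than milliseconds.
    val spark = SparkSession.builder()
      .appName("large-join-sketch")
      .config("spark.sql.shuffle.partitions", "800") // default is 200
      .getOrCreate()

    val orders    = spark.read.parquet("/data/orders")    // placeholder path
    val customers = spark.read.parquet("/data/customers") // placeholder path

    // The join shuffles both sides; more partitions means smaller, shorter tasks.
    val joined = orders.join(customers, Seq("customer_id"))
    joined.write.mode("overwrite").parquet("/data/orders_enriched")

A common starting point is to size partitions so that each task reads on the order of an HDFS block, then adjust until task durations land in the range described above.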
For the level of parallelism and tuning your executor sizes, I would refer you to the excellent guide by Cloudera: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
ORC and Parquet only encode the data at rest; when doing the actual join, the data is in Spark's in-memory format. Parquet is getting more popular since Netflix and Facebook adopted it and put a lot of effort into it. Parquet allows you to store the data more efficiently and has some optimisations (such as predicate pushdown) that Spark uses.
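As a hedged illustration of the predicate-pushdown point (the path and column name below are made up), the pushed filter is visible in the physical plan:

    import org.apache.spark.sql.SparkSession

    // Sketch: a filter on a Parquet column can be pushed into the scan, so
    // row groups whose statistics rule out the predicate are skipped instead
    // of being read and filtered afterwards. Path and column are placeholders.
    val spark  = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
    val events = spark.read.parquet("/data/events")
    val recent = events.filter(events("event_date") >= "2017-01-01")

    // Look for "PushedFilters" in the scan node of the printed plan.
    recent.explain()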
You should use the SQLContext instead of the HiveContext, since the HiveContext is deprecated. The SQLContext is more general and doesn't only work with Hive.
When you call registerTempTable, nothing is written; what is stored within the SparkSession is only the execution plan, which gets invoked when an action is performed (for example saveAsTable). This doesn't affect the execution of the join. When performing a saveAsTable, the data gets stored on the distributed file system.
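A small sketch of that difference, assuming Spark 2.x where createOrReplaceTempView supersedes registerTempTable; the table and path names are placeholders:

    import org.apache.spark.sql.SparkSession

    // Sketch, assuming Spark 2.x; names and paths are placeholders.
    val spark = SparkSession.builder().appName("view-vs-table-sketch").getOrCreate()
    val df    = spark.read.parquet("/data/orders")

    // Registers only a name for the DataFrame's plan; nothing is written yet.
    df.createOrReplaceTempView("orders_view")
    spark.sql("SELECT COUNT(*) FROM orders_view").show() // plan executed here

    // Materialises the data through the metastore onto the file system.
    df.write.mode("overwrite").saveAsTable("orders_table")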
Hope this helps. I would also suggest watching our talk at the Spark Summit about doing joins: https://www.youtube.com/watch?v=6zg7NTw-kTQ. This might provide you some insights.
Cheers, Fokko

What are the disadvantages of MapReduce?

What are the disadvantages of MapReduce? There are lots of advantages of MapReduce, but I would like to know the disadvantages too.
I would rather ask when MapReduce is not a suitable choice. I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where MapReduce is not a suitable choice:
Real-time processing.
It's not always very easy to implement each and everything as an MR program.
When your intermediate processes need to talk to each other (jobs run in isolation).
When your processing requires a lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases, but the important thing here is how well you are using it. For example, you can't expect an MR job to give you the result in a couple of milliseconds; you can't count that as a disadvantage either. It's just that you are using it in the wrong place. And that holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some use cases where MapReduce does not work very well:
When you need a fast response, e.g. < a few seconds (use stream processing, CEP etc. instead)
Processing graphs
Complex algorithms, e.g. some machine learning algorithms like SVM; also see the 13 dwarfs (The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process data again and again, e.g. KMeans (use Spark; see the sketch below)
When the map phase generates too many keys; then sorting takes forever
Joining two large data sets with complex conditions (the equality case can be handled via hashing etc.)
Stateful operations - e.g. evaluating a state machine
Cascading tasks one after the other - using Hive or Pig might help, but there is a lot of overhead in rereading and parsing data
You need to rethink/rewrite trivial operations like joins and filters to fit the map/reduce key/value patterns
MapReduce assumes that the job can be parallelized, but that may not be the case for all data processing jobs.
It is closely tied to Java; of course you have Pig and Hive to the rescue, but you lose flexibility.
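To illustrate the iteration point from the list above, here is a hedged Spark sketch in Scala; the input path and the toy update rule are placeholders, not a real KMeans implementation:

    import org.apache.spark.sql.SparkSession

    // Sketch of iterative processing: the working set is cached once and
    // reused across passes, whereas a chain of MapReduce jobs would re-read
    // the input from HDFS on every pass. Path and update rule are placeholders.
    object IterationSketch {
      def main(args: Array[String]): Unit = {
        val spark  = SparkSession.builder().appName("iteration-sketch").getOrCreate()
        val values = spark.sparkContext
          .textFile("/data/values") // placeholder: one number per line
          .map(_.toDouble)
          .cache()

        var estimate = 0.0
        for (_ <- 1 to 10) {
          val mean = values.sum() / values.count() // each pass hits the cached RDD
          estimate = (estimate + mean) / 2.0       // toy update rule
        }
        println(s"final estimate: $estimate")
        spark.stop()
      }
    }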
First of all, MapReduce streams the map output to disk; if it were possible to keep it in memory, this would be more efficient. I originally deployed my algorithm using MPI, but when I scaled up some nodes started swapping; that's why I made the transition.
The NameNode keeps track of the metadata of all files in your distributed file system. I am reading a Hadoop book (Hadoop in Action) and it mentioned that Yahoo estimated the metadata to be approximately 600 bytes per file. This implies that if you have too many files, your NameNode could experience problems.
If you do not want to use the streaming API, you have to write your program in Java. I, for example, did a translation from C++. This has some side effects; for example, Java has a large string overhead compared to C. Since my software is all about strings, this is something of a drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Pig vs Hive vs Native Map Reduce

I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that require Hive, Pig, or native MapReduce.
I went through a few articles which basically point out that Hive is for structured processing and Pig is for unstructured processing. When do we need native MapReduce? Can you point out a few scenarios that can't be solved using Pig or Hive but only with native MapReduce?
Complex branching logic with a lot of nested if .. else .. structures is easier and quicker to implement in standard MapReduce. For processing structured data you could use Pangool, which also simplifies things like JOIN. Standard MapReduce also gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. But it requires more time to code and to introduce changes.
Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key); it is simpler to implement things like:
Get top N elements for each group;
Calculate a total per group and then put that total against each row in the group;
Use Bloom filters for JOIN optimisations;
Multiquery support (this is when Pig tries to minimise the number of MapReduce jobs by doing more work in a single job)
Hive is better suited for ad-hoc queries, but its main advantage is that it has an engine that stores and partitions data. Its tables can also be read from Pig or standard MapReduce.
One more thing, Hive and Pig are not well suited to work with hierarchical data.
Short answer - we need MapReduce when we need a very deep level of fine-grained control over the way we want to process our data. Sometimes it is not very convenient to express what we need exactly in terms of Pig and Hive queries.
It should not be totally impossible to do with Pig or Hive what you can do using MapReduce. With the level of flexibility provided by Pig and Hive you can somehow manage to achieve your goal, but it might not be that smooth. You could write UDFs or do something similar to achieve it.
There is no clear distinction as such among the usage of these tools. It totally depends on your particular use case. Based on your data and the kind of processing, you need to decide which tool fits your requirements better.
Edit :
Some time ago I had a use case wherein I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird: part of the data was EBCDIC encoded, while the rest was in binary format. It was basically a flat binary file with no delimiters like \n. I had a tough time finding a way to process these files using Pig or Hive. As a result I had to settle for MR. It took time initially, but gradually it became smoother, as MR is really swift once you have the basic template ready.
So, as I said earlier, it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig (just a foreach), but what if you need foreach n? So, when you need "that" level of control over the way you process your data, MR is more suitable.
Another situation might be when your data is hierarchical rather than row-based, or when your data is highly unstructured.
Metapattern problems involving job chaining and job merging are easier to solve using MR directly rather than using Pig/Hive.
And sometimes it is very convenient to accomplish a particular task using some other tool as compared to doing it using Pig/Hive; IMHO, MR turns out to be better in such situations as well. For example, if you need to do some statistical analysis on your BigData, R used with Hadoop Streaming is probably the best option to go with.
HTH
MapReduce:
Strengths:
works both on structured and unstructured data.
good for writing complex business logic.
Weaknesses:
long development time.
hard to achieve join functionality.
Hive:
Strengths:
less development time.
suitable for ad-hoc analysis.
easy for joins.
Weaknesses:
not easy for complex business logic.
deals only with structured data.
Pig:
Strengths:
handles structured and unstructured data.
joins are easily written.
Weaknesses:
a new language to learn.
gets converted into MapReduce.
Hive
Pros:
SQL-like; database people love it.
Good support for structured data.
Currently supports database schemas and view-like structures.
Supports concurrent multi-user, multi-session scenarios.
Bigger community support: Hive, HiveServer, HiveServer2, Impala, Sentry already exist.
Cons:
Performance degrades as the data grows bigger; memory overflow issues; not much you can do about it.
Hierarchical data is a challenge.
Unstructured data requires a UDF-like component.
Combining multiple techniques could be a nightmare (dynamic portions with UDTFs in the case of big data, etc.).
Pig:
Pros:
Great script-based data-flow language.
Cons:
Unstructured data requires a UDF-like component.
Not as big a community.
MapReduce:
Pros:
Don't agree with "hard to achieve join functionality": if you understand what kind of join you want to implement, you can implement it with a few lines of code.
Most of the time MR yields better performance.
MR support for hierarchical data is great, especially for implementing tree-like structures.
Better control over partitioning / indexing the data.
Job chaining.
Cons:
Need to know the API very well to get better performance, etc.
More to code, debug and maintain.
Scenarios where Hadoop MapReduce is preferred to Hive or Pig:
When you need definite driver-program control
Whenever the job requires implementing a custom Partitioner
If there already exists a pre-defined library of Java Mappers or Reducers for a job
If you require a good amount of testability when combining lots of large data sets
If the application demands legacy code requirements that command physical structure
If the job requires optimization at a particular stage of processing by making the best use of tricks like in-mapper combining (see the sketch after this list)
If the job has some tricky usage of the distributed cache (replicated join), cross products, groupings or joins
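As a hedged illustration of the in-mapper combining trick mentioned in the list above, here is a word-count-flavoured mapper written in Scala against the Hadoop Mapper API; the class and field names are illustrative only:

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper
    import scala.collection.mutable

    // Sketch of in-mapper combining: counts are accumulated in a per-task map
    // and emitted once in cleanup(), shrinking the intermediate data shuffled
    // to the reducers. Names are illustrative, not from any particular job.
    class InMapperCombiningMapper
        extends Mapper[LongWritable, Text, Text, IntWritable] {

      private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

      override def map(
          key: LongWritable,
          value: Text,
          context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { word =>
          counts(word) += 1 // combine locally instead of emitting (word, 1)
        }
      }

      override def cleanup(
          context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
        counts.foreach { case (word, n) =>
          context.write(new Text(word), new IntWritable(n)) // one record per word
        }
      }
    }

The trade-off is that the per-task map has to fit in memory; if the key space is very large, it needs to be flushed periodically.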
Pros of Pig/Hive:
Hadoop MapReduce requires more development effort than Pig and Hive.
Pig and Hive coding approaches are slower than a fully tuned Hadoop MapReduce program.
When using Pig and Hive for executing jobs, Hadoop developers need not worry about any version mismatch.
There is very limited possibility for the developer to write Java-level bugs when coding in Pig or Hive.
Have a look at this post for Pig Vs Hive comparison.
All the things which we can do using Pig and Hive can be achieved using MR (sometimes it will be time-consuming though). Pig and Hive use MR/Spark/Tez underneath. So all the things which MR can do may or may not be possible in Hive and Pig.
Here is a great comparison.
It covers all the use-case scenarios.

What approximate amount of semistructured data is enough for setting up Hadoop cluster?

I know Hadoop is not the only alternative for semi-structured data processing in general — I can do many things with plain tab-separated data and a bunch of Unix tools (cut, grep, sed, ...) and hand-written Python scripts. But sometimes I get really big amounts of data and processing time goes up to 20-30 minutes. That's unacceptable to me, because I want to experiment with the dataset dynamically, running some semi-ad-hoc queries, etc.
So, what amount of data do you consider enough to justify setting up a Hadoop cluster, in terms of the costs and results of this approach?
Without knowing exactly what you're doing, here are my suggestions:
If you want to run ad-hoc queries on the data, Hadoop is not the best way to go. Have you tried loading your data into a database and running queries on that?
If you want to experiment with using Hadoop without the cost of setting up a cluster, try using Amazon's Elastic MapReduce offering http://aws.amazon.com/elasticmapreduce/
I've personally seen people get pretty far using shell scripting for these kinds of tasks. Have you tried distributing your work over machines using SSH? GNU Parallel makes this pretty easy: http://www.gnu.org/software/parallel/
I think this issue has several aspects. The first one is what you can achieve with the usual SQL technologies like MySQL/Oracle etc. If you can get a solution with them, I think it will be the better solution.
It should also be pointed out that Hadoop processing of tabular data will be much slower than a conventional DBMS. That brings me to the second aspect: are you ready to build a Hadoop cluster with more than 4 machines? I think 4-6 machines is the bare minimum to feel any gains.
The third aspect is: are you ready to wait for data to load into a database? It can take time, but then queries will be fast. So if you make only a few queries against each dataset, Hadoop has the advantage.
Returning to the original question: I think that you need at least 100-200 GB of data before Hadoop processing makes sense. 2 TB, I think, is a clear indication that Hadoop might be a good choice.

MongoDB: What's the point of using MapReduce without parallelism?

Quoting http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Parallelism
As of right now, MapReduce jobs on a single mongod process are single threaded. This is due to a design limitation in current JavaScript engines. We are looking into alternatives to solve this issue, but for now if you want to parallelize your MapReduce jobs, you will need to either use sharding or do the aggregation client-side in your code.
Without parallelism, what are the benefits of MapReduce compared to simpler or more traditional methods for queries and data aggregation?
To avoid confusion: the question is NOT "what are the benefits of document-oriented DB over traditional relational DB"
The main reason to use MapReduce over simpler or more traditional queries is that it simply can do things (i.e., aggregation) that simple queries cannot.
Once you need aggregation, there are two options using MongoDB: MapReduce and the group command. The group command is analogous to SQL's "group by" and is limited in that it has to return all its results in a single database response. That means group can only be used when you have less than 4MB of results. MapReduce, on the other hand, can do anything a "group by" can, but outputs results to a new collection so results can be as large as needed.
Also, parallelism is coming, so it's good to have some practice :)
M/R is already parallel in MongoDB if you're running a sharded cluster. This is the main point of M/R anyway - to put the computation on the same node as the data.
Super-fast map/reduce is on the roadmap; it will not be in the 1.6 release (the summer release), so late this year is likely.
