Performance tuning based on number of part files created in spark - hadoop

The following are the phases of my job:
Phase 1 - Do some computation and persist the temp data to files. Multiple temp DataFrames are persisted and read back during the flow.
Phase 2 - Read the temp data, do some other computation, and store it in the final data file.
NOTE: I am persisting multiple temp files because I cannot hold them in memory, since the data is huge (84 million rows, about 2 million distinct primary-key-like values).
I use coalesce(n) or repartition(n), where n is a large number, e.g. 200. This leads to 200 files being created in the output for each temp dataset I persist. I know coalesce/repartition is costly for write performance, but I do get better parallelism with n=200 than with n=50. This is all with respect to writes.
Now, this temp data is going to be read by the next processes, so will n=200 be better, or n=50?
Also, I am aware that the parent partition count (n) will be the base for the next write operation, and so on.
Questions:
When should I use coalesce (no shuffle) and when repartition (shuffle)?
What partition value should be used, and why?
What strategy should I follow to get better performance?

1) Use coalesce when the sizes of the output files are unlikely to be skewed (e.g. one file at 2GB and the rest near 0). Repartitioning is most useful when you want to balance the work among executors so that each partition is similarly sized.
2) Set the number of output partitions by weighing write time against read time. For example, use a large number of partitions (smaller files) when the data is written once and read once (e.g. intermediate output), but use fewer partitions (larger files) when writing once and reading many times (write once, read many - e.g. Parquet for analytics). The more partitions, the more tasks that can run concurrently.
3) Try the different approaches if you can, measure the write and read times, and pick the trade-off that best suits your use case.
It's a lot like compression algorithms, where some compress fast (e.g. LZO), others store with a minimal footprint (e.g. BZip2), and others decompress fast (e.g. Snappy).
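To make the trade-off concrete, here is a minimal Spark (Scala) sketch of the two write paths; tempDf and tempPath are placeholder names, Parquet is used only for illustration, and the partition counts are the values discussed above:

// repartition(200) does a full shuffle and balances the data into 200 similarly sized files
tempDf.repartition(200).write.parquet(tempPath + "/wide")
// coalesce(50) merges existing partitions without a shuffle: cheaper to write,
// but the resulting files can be skewed in size
tempDf.coalesce(50).write.parquet(tempPath + "/narrow")
// whichever file count you write becomes the starting parallelism when phase 2 reads it back
val phase2Input = spark.read.parquet(tempPath + "/wide")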

Related

Spark not ignoring empty partitions

I am trying to read a subset of a dataset by using a pushdown predicate.
My input dataset consists of 1.2TB spread over 43436 Parquet files stored on S3. With the pushdown predicate I am supposed to read 1/4 of the data.
Looking at the Spark UI, I see that the job actually reads 1/4 of the data (300GB), but there are still 43436 partitions in the first stage of the job. However, only 1/4 of these partitions have data; the other 3/4 are empty (see the median input data in the attached screenshots).
I was expecting Spark to create partitions only for the non-empty ones. I am seeing a 20% performance overhead when reading the whole dataset with the pushdown predicate compared to another job that directly reads the pre-filtered dataset (1/4 of the data). I suspect that this overhead is due to the huge number of empty partitions/tasks in my first stage, so I have two questions:
Is there any workaround to avoid these empty partitions?
Can you think of any other reason for the overhead? Maybe the pushdown filter execution is just naturally a little slow?
Thank you in advance
Using S3 Select, you can retrieve only a subset of data.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
Otherwise, S3 acts as an object store, in which case an entire object has to be read. In your case you have to read all content from all files and filter it on the client side.
There is actually a very similar question, where by testing you can see that:
The input size was always the same as the Spark job that processed all of the data
You can also see this question about optimizing reads of Parquet files from S3.
It seems your files are rather small: 1.2TB / 43436 ≈ 30MB. So you may want to look at increasing spark.sql.files.maxPartitionBytes to see if it reduces the total number of partitions; a sketch of setting it follows the quoted description below. I don't have much experience with S3, so I'm not sure whether it's going to help, given this note in its description:
The maximum number of bytes to pack into a single partition when
reading files. This configuration is effective only when using
file-based sources such as Parquet, JSON and ORC.
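If it helps, a hedged sketch of raising that setting from a Spark session; the 256MB value below is only an example, not a recommendation:

// pack more of the small Parquet files into each input partition (value in bytes)
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
// the per-file open cost (default 4MB) also influences how files are packed
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)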
Empty partitions: It seems that Spark (2.4.5) really tries to build partitions of size ≈ spark.sql.files.maxPartitionBytes (default 128MB) by packing many files into one partition (source code here).
However, it does this work before running the job, so it cannot know that 3/4 of the files will produce no data once the pushed-down predicate is applied. For the partitions that happened to be packed only with files whose rows all get filtered out, I ended up with empty partitions. This also explains why my max partition size is 44MB and not 128MB: by chance, no partition was packed only with files that passed the pushdown filter.
20% overhead: Finally, this is not due to the empty partitions. I managed to get far fewer empty partitions by setting spark.sql.files.maxPartitionBytes to 1GB, but it did not improve reading. I think the overhead is due to opening many files and reading their metadata.
Spark estimates that opening a file is equivalent to reading 4MB (spark.sql.files.openCostInBytes). So opening many files, even ones whose contents the filter means will never actually be read, is not negligible.

joins vs distributed cache in hadoop

What is the difference between joins and the distributed cache in Hadoop? I am really confused about map-side joins and reduce-side joins and how they work, and about how the distributed cache differs when processing data in a MapReduce job. Please share an example.
Regards,
Ravi
Let's say you have 2 files of data with the following records:
word -> frequency
Same words can be present in both files.
Your task is to merge these files, compute total frequency for each term, and produce the aggregated file.
Map-side joins.
Useful when your data on both sides of the join is already presorted by key. In that case the join is a simple merge of two streams with linear complexity; a small sketch of the merge follows the pros and cons below. In our example, the word-frequency data would have to be pre-sorted alphabetically by word in both files.
Pros: works with virtually unlimited input data (does not have to fit in memory).
Does not require a reducer, thus it is very efficient.
Cons: requires your input data to be pre-sorted (for example, as a result of a previous map/reduce job)
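As a conceptual sketch of that linear merge (plain Scala rather than actual Hadoop mapper code, assuming each file lists every word at most once and both lists are already sorted):

// merge two sorted word-frequency lists, summing frequencies for words present in both
def mergeFrequencies(a: List[(String, Int)], b: List[(String, Int)]): List[(String, Int)] =
  (a, b) match {
    case (Nil, ys) => ys
    case (xs, Nil) => xs
    case ((wa, fa) :: xs, (wb, fb) :: ys) =>
      if (wa == wb)     (wa, fa + fb) :: mergeFrequencies(xs, ys) // same word in both files
      else if (wa < wb) (wa, fa) :: mergeFrequencies(xs, b)       // advance the left stream
      else              (wb, fb) :: mergeFrequencies(a, ys)       // advance the right stream
  }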
Reduce-side joins.
Useful when the files are not yet sorted and are too large to fit in memory, so you have to merge them using a distributed sort with reducer(s).
Pros: works with virtually unlimited input data (does not have to fit in memory).
Cons: requires reduce phase
Distributed cache.
Useful when the input word-frequency files are NOT sorted, and one of the two files is small enough to fit in memory. In this case you can put it in the distributed cache and load it into memory as a hash table (Map<String, Integer>). Each mapper will then stream the larger input file as key-value pairs and look up the values from the smaller file in the hash map (sketched below).
Pros: Efficient, with linear complexity based on the size of the larger input. Does not require a reducer.
Cons: Requires one of the inputs to fit in memory.
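A minimal sketch of the same idea in plain Scala (not the Hadoop DistributedCache API; small.txt and large.txt are placeholder names for tab-separated word/frequency files):

import scala.io.Source
// load the small file fully into memory as a hash map, as the distributed cache allows
val small: Map[String, Int] =
  Source.fromFile("small.txt").getLines()
    .map { line => val Array(w, f) = line.split("\t"); (w, f.toInt) }
    .toMap
// stream the large file record by record, adding the cached frequency when the word matches
val merged = Source.fromFile("large.txt").getLines().map { line =>
  val Array(w, f) = line.split("\t")
  (w, f.toInt + small.getOrElse(w, 0))
}
// words that appear only in the small file would need one extra pass (omitted here)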

Hadoop - Reducers spending a lot of time writing data (multiple outputs)

So I am using MultipleOutputs from the package org.apache.hadoop.mapreduce.lib.output.
I have a reducer that is doing a join of 2 data sources and emitting 3 different outputs.
55 reduce tasks were invoked and on an average each of them took about 6 minutes to emit data. There were outliers that took about 11 minutes.
So I observed that if I comment out the pieces where the actual output happens, i.e. the calls to mos.write() (multiple outputs), the average time drops to seconds and the whole job completes in about 2 minutes.
I do have a lot of data to emit (approximately 40-50GB).
What can I do to speed things up, both with and without considering compression?
Details: I am using TextOutputFormat and giving a hdfs path/uri.
Further clarifications:
I have small input data to my reducers; however, the reducers perform a reduce-side join and hence emit a large amount of data. Since an outlier reducer already takes about 11 minutes, reducing the number of reducers would increase that time, and hence the overall time of my job, so it won't solve my problem.
Input to the reducer comes from 2 mappers.
Mapper 1 -> Emits about 10,000 records. (Key Id)
Mapper 2 -> Emits about 15M records. (Key Id, Key Id2, Key Id3)
In reducer I get everything belonging to Key Id, sorted by Key Id, KeyId2 and KeyId3.
So I know I have an iterator which is like:
Mapper1 output and then Mapper2 output.
Here I store Mapper1 output in an ArrayList and start streaming Mapper2's output.
For every Mapper2 record I do a mos.write(...).
I conditionally store a part of this record in memory (in a HashSet).
Every time KeyId2 changes, I do an extra mos.write(...).
In the close method of my reducer, I emit whatever I stored in the conditional step, so a third mos.write(...).
I have gone through the article http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
as mentioned:
Tip 1: Configuring my cluster correctly - this is beyond my control.
Tip 2: Use LZO compression - or compression in general. This is something I am trying alongside (a sketch of enabling output compression follows the answer below).
Tip 3: Tune the number of mappers and reducers - my mappers finish really fast (on the order of seconds), probably because they are almost identity mappers. The reducers take some time, as mentioned above (this is the time I'm trying to reduce), so increasing the number of reducers will probably help me - but then there will be resource contention and some reducers will have to wait. This is more of a trial-and-error experiment for me.
Tip 4: Write a combiner. Does not apply to my case (reduce-side joins).
Tip 5: Use the appropriate Writable - I need to use Text for now. All three outputs go into directories that have a Hive schema sitting on top of them. Later, when I figure out how to emit Parquet files from multiple outputs, I might change this and the tables' storage format.
Tip 6: Reuse Writables. Okay, this is something I have not considered so far, but I still believe that it's the disk writes that take the time, not processing or the Java heap. Anyway, I'll give it a shot.
Tip 7: Use poor man's profiling. I have kind of already done that and figured out that it's actually the mos.write steps that take most of the time.
DO:
1. Reduce the number of reducers.
A good rule of thumb for the reducer count (for a general ETL operation) is around 1GB of data per reducer.
Here, your input data size (in GB) is itself smaller than your number of reducers.
2. Code optimization can be done. Share the code, or refer to http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ to optimize it.
3. If this does not help, then look at your data - it might be skewed. If you don't know what skew is, the skewed join in Pig will help.
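On the compression point (Tip 2 above), here is a hedged sketch of what enabling output compression might look like on the driver side; the Job setup shown is illustrative rather than the asker's actual code, and MultipleOutputs should pick up the job-level output compression settings:

import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.io.compress.GzipCodec
val job = Job.getInstance()
// compress the reducer output written through TextOutputFormat / MultipleOutputs
FileOutputFormat.setCompressOutput(job, true)
FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])
// compressing the intermediate map output also cuts shuffle I/O
job.getConfiguration.setBoolean("mapreduce.map.output.compress", true)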

How to decide on the number of partitions required for input data size and cluster resources?

My use case as mentioned below.
Read input data from local file system using sparkContext.textFile(input path).
Partition the input data (80 million records) using RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer function. Without coalesce() or repartition() on the input data, Spark executes really slowly and fails with an out-of-memory exception.
The issue I am facing is deciding the number of partitions to apply to the input data. The input data size varies every time, and hard-coding a particular value is not an option. Spark performs really well only when a certain optimum number of partitions is used, which I have had to find through lots of iteration (trial and error) - not an option in a production environment.
My question: Is there a rule of thumb to decide the number of partitions required depending on the input data size and the cluster resources available (executors, cores, etc.)? If yes, please point me in that direction. Any help is much appreciated.
I am using Spark 1.0 on YARN.
Thanks,
AG
Two notes from Tuning Spark in the Spark official documentation:
1- In general, we recommend 2-3 tasks per CPU core in your cluster.
2- Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
These are two rules of thumb that help you estimate the number and size of partitions. So, it's better to have small tasks (that can be completed within a few hundred ms).
Determining the number of partitions is a bit tricky. Spark will by default try to infer a sensible number of partitions. Note: if you are using the textFile method with compressed text, then Spark will disable splitting and you will need to re-partition (it sounds like this might be what's happening?). With non-compressed data, when loading with sc.textFile you can also specify a minimum number of partitions (e.g. sc.textFile(path, minPartitions)).
The coalesce function is only used to reduce the number of partitions, so you should consider using the repartition() function instead.
As far as choosing a "good" number, you generally want at least as many partitions as executors for parallelism. There already exists some logic to try to determine a "good" amount of parallelism, and you can get this value by calling sc.defaultParallelism.
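A hedged sketch of turning those rules of thumb into code; the multiplier of 3 is just the upper end of the 2-3 tasks-per-core guideline, and <input path> is a placeholder:

// derive a partition count from the cluster's default parallelism rather than hard-coding it
val targetPartitions = sc.defaultParallelism * 3
// for splittable (non-compressed) text, a minimum can be requested at load time
val rdd = sc.textFile("<input path>", targetPartitions)
// repartition() (not coalesce) is what is needed if the partition count must be increased later
val balanced = rdd.repartition(targetPartitions)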
I assume you know the size of the cluster going in; you can then essentially try to partition the data into some multiple of that and use a RangePartitioner to partition the data roughly equally. Dynamic partitions are created based on the number of blocks on the filesystem, and the scheduling overhead of so many tasks mostly kills the performance.
import org.apache.spark.RangePartitioner
// key each line so it can be range-partitioned (the value 1 is just a placeholder)
val file = sc.textFile("<my local path>")
val partitionedFile = file.map(x => (x, 1))
// redistribute into 3 roughly equal partitions by key range
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))

Writing RCFile - how many reducers?

I have a MapReduce implementation for processing certain logfiles directly into GZip Compressed RCFile, for easy loading into Hive (via external table projections).
In any event, I have code that successfully and correctly runs, emitting data as BytesRefArrayWritable into RCFileOutputFormat.
Currently, I am running this as a Map-only job, meaning that for N input splits, I get N output files. For example, for 50 input splits, I will get 50 files of .rc extension. Hive can interpret these files together without issue, but my question is as follows:
Is it optimal to have 50 (or N, as it were) RCFile in a single directory, or is it optimal to have a single RCFile containing all the data? I know that RCFile is a columnar format, so IO is optimized for queries such as filtering on a particular column's value.
In the example I mentioned above with 50 input splits, in the first case, MapReduce will need to open 50 files and seek to the location of a column in question. It will also be able to parallelize this operation, given that these 50 files will be spread across HDFS. In the second case (all data in one RCFile), I would imagine MapReduce would sequentially stream the column values in the single RCFile and not have to stitch together 50 different results...
Is there a good way to reason about this? Is it a function of HDFS blocksize and the aggregate size of the Hive table?
Please let me know if I can clarify anything -- thanks in advance
Is it a function of HDFS blocksize
Primarily, yes. Adjust the number of reducers so as not to create partitions smaller than a block; I would consider this the main driving factor (a back-of-the-envelope sketch follows below).
Other than that, a smaller number of files is healthier for the NameNode. You also get some administrative benefit from not having 50x more partitions than you really need on a Hive table (think of operations like removing obsolete partitions).
And I must reiterate the point of trying to move to the arguably superior ORC format.
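A hedged back-of-the-envelope sketch of that sizing rule (all figures are illustrative placeholders, not taken from the question):

// aim for output files that are at least one HDFS block each
val totalOutputBytes = 50L * 1024 * 1024 * 1024 // e.g. ~50GB of table data (placeholder)
val blockSize        = 128L * 1024 * 1024       // HDFS block size, e.g. 128MB
val reducers = math.max(1L, totalOutputBytes / blockSize).toInt
// job.setNumReduceTasks(reducers) would then apply it to the MapReduce job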

Resources