Spark internals - Does repartition load all partitions into memory? - performance

I couldn't find anywhere how repartition is performed on an RDD internally. I understand that you can call the repartition method on an RDD to increase the number of partitions, but how is it performed internally?
Assume there were initially 5 partitions with the following element counts:
1st partition - 100 elements
2nd partition - 200 elements
3rd partition - 500 elements
4th partition - 5000 elements
5th partition - 200 elements
Some of the partitions are skewed because they were loaded from HBase and the data was not correctly salted in HBase, which caused some of the region servers to have too many entries.
In this case, when we repartition to 10, will Spark load all the partitions first and then shuffle them to create 10 partitions? What if the full data can't be loaded into memory, i.e. all partitions can't be loaded into memory at once? If Spark does not load all partitions into memory, then how does it know the count, and how does it make sure that data is correctly partitioned into 10 partitions?

From what I have understood, repartition will certainly trigger a shuffle. According to the Job Logical Plan document, the following can be said about repartition:
- within each partition, every record is assigned a key, which is an increasing number.
- hash(key) leads to a uniform distribution of records across the new partitions.
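As a sanity check, here is a minimal PySpark sketch (the data is made up to mimic the skew above; it assumes a PySpark shell where sc is defined) that inspects per-partition counts before and after repartition:
# Stand-in for the skewed HBase load: 6000 elements across 5 uneven partitions.
skewed = sc.parallelize(range(100), 1) \
    .union(sc.parallelize(range(200), 1)) \
    .union(sc.parallelize(range(500), 1)) \
    .union(sc.parallelize(range(5000), 1)) \
    .union(sc.parallelize(range(200), 1))
print(skewed.glom().map(len).collect())    # [100, 200, 500, 5000, 200]
balanced = skewed.repartition(10)          # full shuffle into 10 partitions
print(balanced.glom().map(len).collect())  # roughly 600 elements in each partition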
If Spark can't load all the data into memory, an out-of-memory error will be thrown. By default, Spark does all its processing in memory, i.e. there should always be sufficient memory for your data.
The persist option can be used to tell Spark to spill your data to disk if there is not enough memory.
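For example, a minimal sketch of asking Spark to spill to disk before repartitioning (the input path is made up; assumes a PySpark shell where sc is defined):
from pyspark import StorageLevel

rdd = sc.textFile("hdfs:///big/dataset")           # hypothetical input path
rdd = rdd.persist(StorageLevel.MEMORY_AND_DISK)    # partitions that don't fit in RAM spill to disk
repartitioned = rdd.repartition(10)                # full shuffle into 10 partitions
repartitioned.count()                              # triggers the read, the persist and the shuffle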
Jacek Laskowski also explains repartitioning.
Understanding Your Apache Spark Application Through Visualization should be sufficient for you to test and verify this yourself.

Related

hbase skip region server to read rows directly from hfile

I am attempting to dump over 10 billion records into HBase, which will grow on average by 10 million per day, and then attempt a full table scan over the records. I understand that a full scan over HDFS will be faster than over HBase.
HBase is being used to order the disparate data on HDFS. The application is being built using Spark.
The data is bulk-loaded into HBase. Because of the various 2 GB limits, the region size was reduced to 1.2 GB from an initial test of 3 GB (this still requires more detailed investigation).
Scan caching is 1000 and block caching is off.
Total HBase size is in the 6 TB range, yielding several thousand regions across 5 region servers (nodes); the recommendation is in the low hundreds.
The Spark job essentially runs across each row and then computes something based on columns within a range.
Using spark-on-hbase, which internally uses TableInputFormat, the job ran in about 7.5 hours.
In order to bypass the region servers, I created a snapshot and used TableSnapshotInputFormat instead. The job completed in about 5.5 hours.
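For reference, a rough PySpark sketch of the region-server read path via TableInputFormat (the table name is made up, and the converter classes come from the Spark examples jar, so treat this as an illustration rather than a drop-in):
conf = {"hbase.mapreduce.inputtable": "my_table"}   # hypothetical table name

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)

print(hbase_rdd.getNumPartitions())   # one Spark partition per HBase region
The snapshot variant swaps in TableSnapshotInputFormat, which needs additional job-level setup (snapshot name and a restore directory), so it is usually wired up on the JVM side.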
Questions
When reading from HBase into Spark, the regions seem to dictate the Spark partitions and thus the 2 GB limit, hence the problems with caching. Does this imply that the region size needs to be small?
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits by region, so it would still run into the region-size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other utility which can read a row directly from an HFile (specifically, from a snapshot-referenced HFile)?
Are there any other pointers, say configurations, that may help boost performance? For instance, the HDFS block size? The main use case is, for the most part, a full table scan.
As it turns out, this was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations for an IP address: InetAddress took a significant amount of time to resolve an IP address. We switched to using the raw bytes to extract whatever we needed. This alone made the job finish in about 2.5 hours.
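The original job was on the JVM, but the idea is simply to stay on the raw bytes instead of going through InetAddress; a rough Python equivalent of that idea (the byte string is made up):
import socket
import struct

raw = b"\xc0\xa8\x00\x01"            # 4 bytes as they might come out of an HBase cell
print(socket.inet_ntoa(raw))         # '192.168.0.1' - plain formatting, no resolver involved
print(struct.unpack("!I", raw)[0])   # 3232235521 - or keep it as an integer key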
Modelling the problem as a MapReduce job and running it on MR2 with the same change showed that it could finish in about 1 hour 20 minutes.
The iterative nature and smaller memory footprint helped MR2 achieve more parallelism, and hence it was much faster.

Spark RDD - are partitions always in RAM?

We all know Spark does its computation in memory. I am just curious about the following.
If I create 10 RDDs in my PySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in the Spark workers' memory?
If I do not delete an RDD, will it be in memory forever?
If my dataset (file) size exceeds the available RAM, where will the data be stored?
If I create 10 RDDs in my PySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in Spark memory?
Yes, all 10 RDDs' data will be spread across the Spark worker machines' RAM, but it is not necessary for every machine to have a partition of each RDD. Of course, an RDD will hold data in memory only if an action has been performed on it, since RDDs are lazily evaluated.
If I do not delete an RDD, will it be in memory forever?
Spark automatically unpersists an RDD or DataFrame if it is no longer used. To find out whether an RDD or DataFrame is cached, go to the Spark UI -> Storage tab and check the memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove the DataFrame or table from memory.
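A minimal PySpark sketch of that workflow (the input path is made up; assumes a shell where spark and sqlContext are defined):
df = spark.read.parquet("hdfs:///data/events")   # hypothetical input path
df.cache()
df.count()                        # materialises the cache; it now shows up under Spark UI -> Storage
df.unpersist()                    # explicitly drop the DataFrame from memory again

# Table-level equivalent, matching the sqlContext call mentioned above
df.createOrReplaceTempView("sparktable")
sqlContext.cacheTable("sparktable")
sqlContext.uncacheTable("sparktable")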
If my dataset size exceeds the available RAM, where will the data be stored?
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
If we are saying the RDD is already in RAM, meaning it is in memory, what is the need to persist()? (as per a comment)
To answer your question: when an action is triggered on an RDD and that action cannot find enough memory, Spark can evict uncached/unpersisted RDDs.
In general, we persist RDDs which need a lot of computation and/or shuffling (by default Spark persists shuffled RDDs to avoid costly network I/O), so that when an action is performed on a persisted RDD, it simply performs that action rather than computing everything again from the start according to the lineage graph; check the RDD persistence levels here.
If I create 10 RDDs in my PySpark shell, does it mean all these 10 RDDs' data will reside in Spark memory?
Answer: An RDD only contains the "lineage graph" (the applied transformations). So, an RDD is not data! Whenever we perform an action on an RDD, all the transformations are applied before the action. So if the RDD is not explicitly cached (of course there are some optimisations which cache implicitly), each time an action is performed the whole chain of transformations and the action are executed again.
E.g. if you create an RDD from HDFS, apply some transformations, and perform 2 actions on the transformed RDD, the HDFS read and the transformations will be executed twice, as sketched below.
So, if you want to avoid the re-computation, you have to persist the RDD. For persisting, you can choose a combination of one or more of on-heap memory, off-heap memory, and disk.
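A small sketch of that re-computation point (the path is made up; assumes a PySpark shell where sc is defined):
lines = sc.textFile("hdfs:///data/input.txt")      # hypothetical input
parsed = lines.map(lambda line: line.split(","))

parsed.count()     # action 1: reads HDFS and runs the map
parsed.first()     # action 2: reads HDFS and runs the map again

parsed.persist()   # default storage level is MEMORY_ONLY for RDDs
parsed.count()     # computes once more and stores the partitions
parsed.first()     # now served from the persisted partitions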
If I do not delete an RDD, will it be in memory forever?
Answer: Considering an RDD is just a "lineage graph", it will follow the same scope and lifetime rules as the hosting language. But if you have already persisted the computed result, you can unpersist it.
If my dataset size exceeds the available RAM, where will the data be stored?
Answer: Assuming you have actually persisted/cached the RDD in memory, it will be stored in memory, and LRU is used to evict data. Refer to the Spark memory management documentation for more information on how this is done.

Performance Issue when Single row in Hbase exceeds hbase.hregion.max.filesize

In HBase, I have configured hbase.hregion.max.filesize as 10 GB. If a single row exceeds 10 GB, the row will not be split into 2 regions, since HBase splits are done based on the row key.
For example, suppose I have a row with 1000 columns, and each column varies between 25 MB and 40 MB, so there is a chance of exceeding the defined region size. If this is the case, how will it affect performance when reading data using the row key alone, or the row key with a column qualifier?
First of all, HBase is NOT meant for storing data as big as 10 GB in a single row (that's quite hypothetical).
I hope you have not actually saved 10 GB in a single row (and are just thinking of doing so).
It will adversely affect performance. Consider other approaches, like storing this much data in HDFS in a partitioned structure, as sketched below.
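A sketch of that partitioned-HDFS alternative, using Spark since the application already runs on it (column and path names are made up):
df = spark.read.parquet("hdfs:///staging/raw")   # hypothetical source data

(df.write
   .partitionBy("day")                           # hypothetical partition column
   .parquet("hdfs:///warehouse/events"))         # one directory per day, scanned in parallel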
In general, these are the tips for batch clients like MapReduce HBase jobs:
Scan scan = new Scan();
scan.setCaching(500); //1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
You can also have a look at the Performance chapter of the HBase reference guide.

Number of partitions in RDD and performance in Spark

In PySpark, I can create an RDD from a list and decide how many partitions it will have:
sc = SparkContext()
sc.parallelize(xrange(0, 10), 4)
How does the number of partitions I decide to split my RDD into influence performance?
And how does this depend on the number of cores my machine has?
The primary effect comes from specifying too few partitions or far too many partitions.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Between the two, the first one has a far greater impact on performance. Scheduling too many small tasks is a relatively minor issue for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
To add to javadba's excellent answer, I recall that the docs recommend setting your number of partitions to 3 or 4 times the number of CPU cores in your cluster so that the work gets distributed more evenly among the available cores. Meaning, if you only have 1 partition per CPU core in the cluster, you will have to wait for the single longest-running task to complete; but if you had broken that down further, the workload would be more evenly balanced, with fast- and slow-running tasks evening out.
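A rough sketch of that rule of thumb (the input path is made up; assumes a PySpark shell where sc is defined):
cores = sc.defaultParallelism            # total cores Spark sees across the cluster
rdd = sc.textFile("hdfs:///data/big")    # hypothetical input

rdd = rdd.repartition(cores * 3)         # roughly 3 partitions per core
print(rdd.getNumPartitions())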
The number of partitions has a high impact on Spark code performance.
Ideally, the partition count determines how much data is shuffled per partition. Normally you should base this parameter on your shuffle size (shuffle read/write) and aim for roughly 128 to 256 MB per partition to get the best performance.
You can set the shuffle partition count in your Spark SQL code by setting the property:
spark.sql.shuffle.partitions
or, when using a DataFrame, you can set it as below:
df.repartition(numOfPartitions)
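For instance, a small PySpark sketch of both knobs (the values are illustrative; assumes a shell where spark is defined):
spark.conf.set("spark.sql.shuffle.partitions", 200)   # partitions produced by joins/aggregations

df = spark.range(0, 10000000)
df = df.repartition(64)                               # explicit repartition of an existing DataFrame
print(df.rdd.getNumPartitions())                      # 64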

Hive index mapreduce memory errors

I am new to Hive and Hadoop and have just created a table (ORC file format) in Hive. I am now trying to create an index on my Hive table (a bitmap index). Every time I run the index build query, Hive starts a MapReduce job to build the index. At some point my MapReduce job just hangs and one of my nodes fails (a randomly different one across multiple retries, so it's probably not the node). I tried increasing mapreduce.child.java.opts to 2048 MB, but that gave me errors about using more memory than available, so I increased mapreduce.map.memory.mb and mapreduce.reduce.memory.mb to 8 GB. All other configurations are left at their defaults.
Any help with what configurations I am missing would be really appreciated.
Just for context, I am trying to index a table with 2.4 billion rows, which is 450 GB in size and has 3 partitions.
First, please confirm whether the indexing works for data at a small scale. Assuming it does, the way the MapReduce jobs are run by Hive depends on several factors:
1. The type of query (using count(*) or just SELECT *).
2. The amount of memory a reducer is allocated during the execution phase (this is controlled by the hive.exec.reducers.bytes.per.reducer property).
In your case it is likely the second point.
Given the scale at which you are running your program, please calculate the memory requirements accordingly. This post has more information. Happy learning and coding!
