I am fetching data from HDFS and storing it in a Spark RDD. Spark creates the number of partitions based on the number of HDFS blocks. This leads to a large number of empty partitions which also get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but there is no guarantee that all the empty partitions will be removed.
Is there any other way to go about this?
There isn't an easy way to simply delete the empty partitions from an RDD.
coalesce doesn't guarantee that the empty partitions will be deleted. If you have an RDD with 40 blank partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45).
The repartition method splits the data evenly over all the partitions, so there won't be any empty partitions. If you have an RDD with 50 blank partitions and 10 partitions with data and run rdd.repartition(20), the data will be evenly split across the 20 partitions.
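To see the difference in practice, here is a minimal sketch (the data, partition counts, and app name are made up for illustration) that builds an RDD where most partitions are empty and compares coalesce with repartition:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("empty-partitions").setMaster("local[*]")
val sc = new SparkContext(conf)

// Hypothetical RDD: 10 elements spread across 50 partitions, so most partitions are empty.
val rdd = sc.parallelize(1 to 10, numSlices = 50)

// Count how many partitions actually hold data.
val nonEmpty = rdd.mapPartitions(it => Iterator(if (it.hasNext) 1 else 0)).sum()
println(s"non-empty partitions before: $nonEmpty of ${rdd.getNumPartitions}")

// coalesce merges existing partitions without a shuffle, so some of the 20 may still be empty.
val coalesced = rdd.coalesce(20)

// repartition shuffles the data evenly, so none of the 20 partitions should end up empty.
val repartitioned = rdd.repartition(20)

Counting the non-empty partitions again on coalesced and repartitioned (with the same mapPartitions trick) shows the difference.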
Related
I couldn't find anywhere how repartition is performed on an RDD internally. I understand that you can call the repartition method on an RDD to increase the number of partitions, but how is it performed internally?
Assume there were initially 5 partitions and they held:
1st partition - 100 elements
2nd partition - 200 elements
3rd partition - 500 elements
4th partition - 5000 elements
5th partition - 200 elements
Some of the partitions are skewed because they were loaded from HBase and the data was not correctly salted in HBase, which caused some of the region servers to have too many entries.
In this case, when we repartition to 10, will Spark load all the partitions first and then do the shuffling to create 10 partitions? What if the full data can't be loaded into memory, i.e. all partitions can't be loaded into memory at once? If Spark does not load all partitions into memory, then how does it know the count, and how does it make sure that the data is correctly partitioned into 10 partitions?
From what I have understood, repartition will certainly trigger a shuffle. From the Job Logical Plan document, the following can be said about repartition (a rough sketch follows the list):
- for each partition, every record is assigned a key which is an increasing number.
- hash(key) leads to a uniform distribution of records across all the partitions.
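This is not Spark's actual implementation, just an illustration of those two points using public RDD operations (the function name and partition count are made up): tag every record with an increasing key inside its partition, then hash-partition on that key.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Rough illustration only: assign an increasing key per record within each
// partition, then let a HashPartitioner spread the records by hash(key).
def toyRepartition[T: ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] =
  rdd
    .mapPartitions { iter =>
      iter.zipWithIndex.map { case (record, i) => (i, record) } // increasing key per partition
    }
    .partitionBy(new HashPartitioner(numPartitions)) // hash(key) picks the target partition
    .values

Calling toyRepartition(rdd, 10) on the 5 skewed partitions above would give 10 roughly even partitions, which is essentially what rdd.repartition(10) produces.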
If Spark can't load all the data into memory, a memory issue will be thrown. So the default processing in Spark is all done in memory, i.e. there should always be sufficient memory for your data.
The persist option can be used to tell Spark to spill your data to disk if there is not enough memory.
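For example (a minimal sketch; rdd is assumed to be the skewed RDD from the question), a storage level that allows disk lets Spark spill cached partitions instead of keeping everything in memory:

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps cached partitions in memory when they fit and spills
// the rest to disk instead of dropping and recomputing them.
val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)
persisted.count()                              // first action materialises and caches the data
val evenlySpread = persisted.repartition(10)   // later operations reuse the cached data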
Jacek Laskowski also explains repartition.
Understanding your Apache Spark Application Through Visualization should be sufficient for you to test and find out for yourself.
In HBase, I have configured hbase.hregion.max.filesize as 10GB. If a single row exceeds the 10GB size, the row will not be split into 2 regions, as HBase splits are done based on row key.
For example, if I have a row with 1000 columns, and each column varies between 25 MB and 40 MB, there is a chance of exceeding the defined region size. If this is the case, how will it affect performance when reading data using the row key alone or the row key with a column qualifier?
First of all, HBase is NOT meant for storing that much data, 10GB, in a single row (that is quite hypothetical).
I hope you have not actually saved 10GB in a single row (and are only thinking of doing so).
It will adversely affect performance. Consider other approaches, like storing this much data in HDFS in a partitioned structure.
In general, these are the tips applicable to batch clients like MapReduce HBase jobs:
Scan scan = new Scan();
scan.setCaching(500); //1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
You can have a look at Performance.
I am having a hard time understanding the difference between the RDD partitions and the HDFS Input Splits. So essentially when you submit a Spark application:
When the Spark application wants to read from HDFS, that file on HDFS will have input splits (of, let's say, 64 MB each, and each of these input splits is present on a different data node).
Now let's say the Spark application wants to load that file from HDFS using sc.textFile(PATH_IN_HDFS). The file is about 256 MB and has 4 input splits, where 2 of the splits are on data node 1 and the other 2 splits are on data node 2.
Now when Spark loads this 256 MB into its RDD abstraction, will it load each of the input splits (64 MB) into 4 separate RDDs (where you would have 2 RDDs with 64 MB of data on data node 1 and the other two RDDs with 64 MB of data on data node 2)? Or will the RDD further partition those input splits on Hadoop? Also, how will these partitions be redistributed then? I do not understand whether there is a correlation between the RDD partitions and the HDFS input splits.
I'm pretty new to Spark, but splits are strictly related to MapReduce jobs. Spark loads the data in memory in a distributed fashion, and which machines load the data can depend on where the data is (read: it somewhat depends on where the data blocks are, and this is very close to the split idea).
Spark's APIs allow you to think in terms of RDDs and no longer splits.
You will work on RDDs; how the data is distributed across the RDD is no longer the programmer's problem.
Under Spark, your whole dataset is called an RDD.
Hope the answer below helps you.
When Spark reads a file from HDFS, it creates a single partition for a single input split.
If you have a 30GB text file stored on HDFS, then with the default HDFS block size setting (128MB) it would be stored in 235 blocks (roughly 30,000 MB / 128 MB, rounded up), which means that the RDD you read from this file would have 235 partitions.
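For instance (a small sketch; the path is hypothetical and sc is an existing SparkContext), you can confirm this by checking the partition count right after reading the file:

// Hypothetical path; with a 30 GB file and 128 MB HDFS blocks you would
// expect one partition per input split, i.e. roughly 235 partitions.
val lines = sc.textFile("hdfs:///data/big_30gb_file.txt")
println(lines.getNumPartitions)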
I know that the number of map tasks is the same as the number of input splits given by the input format. When performing an operation on a partitioned or bucketed Hive table, how does the InputFormat class calculate input splits, given that the data is in the form of files in a directory for partitioned or bucketed data? Is there any relation between the input splits (number of map tasks) and the number of partitions or buckets?
The short answer is 'sort of'. Hive delegates the splits to Hadoop, and Hadoop only cares about how much data is in those partitions, not really about partitions and buckets. The amount of data depends indirectly on the number of partitions, so to answer your question more precisely, it does not directly depend on the number of partitions.
When executing a query, to make the splits, Hive by default uses CombineHiveInputFormat, which is actually only a wrapper around Hadoop's CombineFileInputFormat. So in practice Hive delegates to Hadoop how to make the splits.
Hadoop's CombineFileInputFormat will group smaller files together, so the splits will be sized according to what is configured as the minimum split size.
Note that it belongs to Hadoop, not Hive, and therefore has no knowledge of buckets; it will just group files based on their size and locality (rack, etc.), since it's better if a split is entirely on the same node, or at least on the same rack.
You can have a look at how the splits are created in the function getSplits here.
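As an illustration (just a sketch; the values are arbitrary examples, not recommendations), the grouping is driven by the standard Hadoop split-size properties, which can be set on a Hadoop Configuration like this:

import org.apache.hadoop.conf.Configuration

// Standard Hadoop properties consulted when computing splits; exactly how they
// are applied depends on the input format in use.
val conf = new Configuration()
conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024) // 128 MB minimum split
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024) // 256 MB maximum split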
I have read that Cassandra columns are physically sorted. I feel this is correct only if a single row for a key is present on a node (in a single SSTable). If the same key is in multiple SSTables with different or the same columns, the node itself has to sort the data after reading from each SSTable. If this is correct, how can the wide-row concept of Cassandra, which is used for column sorting/ordering, remain efficient?
You are right that Cassandra keeps rows sorted on disk based on Clustering Columns. This reduces the seeks on disk to satisfy a query.
You are also right that a partition can exist in multiple SSTables on disk. Each SSTable is sorted on disk, but when the node reads a partition it merges, in memory, the values from each SSTable plus any values for that partition in the memtable.
Compaction is designed to minimise the number of SSTables that exist, to keep the number of disk seeks down. Disk access is likely to be slower than merging sorted data.
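As a loose illustration (plain Scala, not Cassandra code; the column names, values, and fragments are made up), merging a few already-sorted fragments of the same partition in memory is cheap compared to extra disk seeks:

// Each sequence stands for the sorted columns of one partition as seen in a
// different SSTable (or the memtable); the later fragment stands in for the
// newer write when a column appears twice.
val sstable1 = Seq("col01" -> "a", "col05" -> "b")
val sstable2 = Seq("col02" -> "c", "col05" -> "d") // newer write to col05
val memtable = Seq("col03" -> "e")

val merged = (sstable1 ++ sstable2 ++ memtable)
  .groupBy(_._1)                                            // gather all versions of each column
  .map { case (col, versions) => col -> versions.last._2 }  // keep the newest value
  .toSeq
  .sortBy(_._1)                                             // keep the column order sorted

println(merged) // (col01,a), (col02,c), (col03,e), (col05,d)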