Not able to repartition the DStream - spark-streaming

val sparkConf = new SparkConf().setMaster("yarn-cluster")
.setAppName("SparkJob")
.set("spark.executor.memory","2G")
.set("spark.dynamicAllocation.executorIdleTimeout","5")
val streamingContext = new StreamingContext(sparkConf, Minutes(1))
var historyRdd: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
var historyRdd_2: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
val stream_1 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_1))
val stream_2 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_2))
val dstream_1 = stream_1.map((r: Tuple2[String, GenericData.Record]) =>
{
//some mapping (analogous block assumed for stream_1; omitted in the original post)
})
val dstream_2 = stream_2.map((r: Tuple2[String, GenericData.Record]) =>
{
//some mapping
})
dstream_1.foreachRDD(r => r.repartition(500))
val historyDStream = dstream_1.transform(rdd => rdd.union(historyRdd))
dstream_2.foreachRDD(r => r.repartition(500))
val historyDStream_2 = dstream_2.transform(rdd => rdd.union(historyRdd_2))
val fullJoinResult = historyDStream.fullOuterJoin(historyDStream_2)
val filtered = fullJoinResult.filter(r => r._2._1.isEmpty)
filtered.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._2.get))
historyRdd_2.unpersist(false) // unpersist the 'old' history RDD
historyRdd_2 = formatted // assign the new history
historyRdd_2.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
val filteredStream = fullJoinResult.filter(r => r._2._2.isEmpty)
filteredStream.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._1.get))
historyRdd.unpersist(false) // unpersist the 'old' history RDD
historyRdd = formatted // assign the new history
historyRdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
streamingContext.start()
streamingContext.awaitTermination()
}
}
I am not able to repartition the DStream using the above code. I get 128 partitions for my input, which is the number of Kafka partitions, and because of the join I need to shuffle-read and shuffle-write data, so I wanted to increase the parallelism by increasing the number of partitions. But the partition count stays the same. Why is that?

Just like map or filter, repartition is a transformation in Spark, meaning 3 things:
it returns another immutable RDD
it's lazy
it needs to be materialized by some action
Considering this code:
dstream_1.foreachRDD(r => r.repartition(500))
Using repartition as a side effect within a foreachRDD does nothing: the repartitioned RDD it returns is never used, so the repartitioning is never materialized, and the original dstream_1 keeps its 128 partitions downstream.
We should 'chain' this transformation with the other operations in the job. In this context, a simple way to achieve this would be to use transform instead:
val repartitionedDStream = dstream_1.transform(rdd => rdd.repartition(500))
... use repartitionedDStream further on ...
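Applied to the job in the question, a minimal sketch could look like the following (names and the 500-partition figure are taken from the question; note that fullOuterJoin also accepts a partition count directly, which is an alternative):

val repartitioned_1 = dstream_1.transform(rdd => rdd.repartition(500))
val repartitioned_2 = dstream_2.transform(rdd => rdd.repartition(500))
// The repartitioning is now part of the lineage the join actually consumes.
val historyDStream = repartitioned_1.transform(rdd => rdd.union(historyRdd))
val historyDStream_2 = repartitioned_2.transform(rdd => rdd.union(historyRdd_2))
// Alternatively, pass the desired parallelism to the join itself:
val fullJoinResult = historyDStream.fullOuterJoin(historyDStream_2, 500)

Either way, the repartitioning is evaluated as part of the same lineage as the join, so the shuffle runs with the requested parallelism.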

Related

Checkpointing of stateful operators in spark streaming

I am developing a streaming application which uses a mapWithState function internally.
I need to set the checkpointing interval of my checkpointed data manually.
This is my sample code:
var newContextCreated = false // Flag to detect whether new context was created or not
// Function to create a new StreamingContext and set it up
def creatingFunc(): StreamingContext = {
// Create a StreamingContext
val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
// Create a stream that generates 1000 lines per second
val stream = ssc.receiverStream(new DummySource(eventsPerSecond))
// Split the lines into words, and create a paired (key-value) dstream
val wordStream = stream.flatMap { _.split(" ") }.map(word => (word, 1))
// This represents the emitted stream from the trackStateFunc. Since we emit every input record with the updated value,
// this stream will contain the same # of records as the input dstream.
val wordCountStateStream = wordStream.mapWithState(stateSpec)
wordCountStateStream.print()
// A snapshot of the state for the current batch. This dstream contains one entry per key.
val stateSnapshotStream = wordCountStateStream.stateSnapshots()
stateSnapshotStream.foreachRDD { rdd =>
rdd.toDF("word", "count").registerTempTable("batch_word_count")
}
ssc.remember(Minutes(1)) // To make sure data is not deleted by the time we query it interactively
ssc.checkpoint("dbfs:/streaming/trackstate/100")
println("Creating function called to create new StreamingContext")
newContextCreated = true
ssc
}
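For reference, one way to set that interval manually is DStream.checkpoint(interval), called on the stateful stream inside creatingFunc() before the context is started; a minimal sketch (the 5x multiplier is only an illustrative assumption):

// Inside creatingFunc(), after building the stateful stream:
wordCountStateStream.checkpoint(Seconds(batchIntervalSeconds * 5)) // assumed interval; must be a multiple of the batch interval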

Merging inputs in distributed application

INTRODUCTION
I have to write a distributed application which counts the maximum number of unique values for 3 records. I have no experience in this area and don't know the frameworks at all. My input could look as follows:
u1: u2,u3,u4,u5,u6
u2: u1,u4,u6,u7,u8
u3: u1,u4,u5,u9
u4: u1,u2,u3,u6
...
Then the beginning of the results should be:
(u1,u2,u3), u4,u5,u6,u7,u8,u9 => count=6
(u1,u2,u4), u3,u5,u6,u7,u8 => count=5
(u1,u3,u4), u2,u5,u6,u9 => count=4
(u2,u3,u4), u1,u5,u6,u7,u8,u9 => count=6
...
So my approach is to first merge each pair of records, and then merge each merged pair with each single record.
QUESTION
Can I do an operation like this, working on (merging) more than one input row at the same time, in frameworks like Hadoop/Spark? Or maybe my approach is incorrect and I should do it a different way?
Any advice will be appreciated.
Can I do an operation like this, working on (merging) more than one input row at the same time, in frameworks like Hadoop/Spark?
Yes, you can.
Or maybe my approach is incorrect and I should do it a different way?
It depends on the size of the data. If your data is small, it's faster and easier to do it locally. If your data is huge, at least hundreds of GBs, the common strategy is to save the data to HDFS (a distributed file system) and do the analysis using MapReduce/Spark.
An example Spark application written in Scala:
import java.util

import org.apache.spark.{SparkConf, SparkContext}

object MyCounter {
val sparkConf = new SparkConf().setAppName("My Counter")
val sc = new SparkContext(sparkConf)
def main(args: Array[String]) {
val inputFile = sc.textFile("hdfs:///inputfile.txt")
val keys = inputFile.map(line => line.substring(0, 2)) // get "u1" from "u1: u2,u3,u4,u5,u6"
val triplets = keys.cartesian(keys).cartesian(keys)
.map(z => (z._1._1, z._1._2, z._2))
.filter(z => !z._1.equals(z._2) && !z._1.equals(z._3) && !z._2.equals(z._3)) // get "(u1,u2,u3)" triplets
// If you have small numbers of (u1,u2,u3) triplets, it's better prepare them locally.
val res = triplets.cartesian(inputFile).filter(z => {
z._2.startsWith(z._1._1) || z._2.startsWith(z._1._2) || z._2.startsWith(z._1._3)
}) // (u1,u2,u3) only matches line starts with u1,u2,u3, for example "u1: u2,u3,u4,u5,u6"
.reduceByKey((a, b) => a + "," + b) // merge the three matching lines, comma-separated so the later split works
.map(z => {
val line = z._2
val values = line.split(",")
//count unique values using set
val set = new util.HashSet[String]()
for (value <- values) {
set.add(value)
}
"key=" + z._1 + ", count=" + set.size() // the result from one mapper is a string
}).collect()
for (line <- res) {
println(line)
}
}
}
The code is not tested and is not efficient. It could be optimized further (for example, by removing unnecessary map-reduce steps).
You can rewrite the same program in Python or Java.
You can also implement the same logic with Hadoop MapReduce.

How can I improve the performance when completing a table with statistical methods in Apache-Spark?

I have a dataset with 10 fields and 5000 rows. I want to complete this dataset with some statistical methods in Spark with Scala. I fill the empty cells in a field with the mean value of that field if it consists of continuous values, and with the most frequent value if it consists of discrete values. Here is my code:
for(col <- cols){
val datacount = table.select(col).rdd.map(r => r(0)).filter(_ == null).count()
if(datacount > 0)
{
if (continuous_lst contains col) // put mean of data to null values
{
var avg = table.select(mean(col)).first()(0).asInstanceOf[Double]
df = df.na.fill(avg, Seq(col))
}
else if(discrete_lst contains col) // put most frequent categorical value to null values
{
val group_df = df.groupBy(col).count()
val sorted = group_df.orderBy(desc("count")).take(1)
val most_frequent = sorted.map(t => t(0))
val most_frequent_ = most_frequent(0).toString.toDouble.toInt
val type__ = ctype.filter(t => t._1 == col)
val type_ = type__.map(t => t._2)
df = df.na.fill(most_frequent_, Seq(col))
}
}
}
The problem is that this code runs very slowly on this data. I use spark-submit with 8 GB of executor memory, and I call repartition(4) before passing the data to this function.
I will have to work with bigger datasets, so how can I speed up this code?
Thanks for your help.
Here is a suggestion:
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions._
def most_frequent(df: DataFrame, col: Column) = {
df.select(col).map { case Row(colVal) => (colVal, 1) }
.reduceByKey(_ + _)
.reduce({case ((val1, cnt1), (val2, cnt2)) => if (cnt1 > cnt2) (val1, cnt1) else (val2, cnt2)})._1
}
val new_continuous_cols = continuous_lst.map {
col => coalesce(col, mean(col)).as(col.toString)
}.toArray
val new_discrete_cols = discrete_lst.map {
col => coalesce(col, lit(most_frequent(table, col))).as(col.toString)
}.toArray
val all_new_cols = new_continuous_cols ++ new_discrete_cols
val newDF = table.select(all_new_cols: _*)
Considerations:
I assumed that continuous_lst and discrete_lst are lists of Column. If they are lists of String the idea is the same, but some adjustments are necessary (see the sketch after these considerations);
Note that I used map and reduce to calculate the most frequent value of a column. That can be better than grouping by and aggregating in some cases. (Maybe there is room for improvement here, by calculating the most frequent values for all discrete columns at once?);
Additionally, I used coalesce to replace null values, instead of fill. This may result in some improvement as well. (More info about the coalesce function in the scaladoc API);
I cannot test at the moment, so there may be something missing that I didn't see.
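For the adjustment mentioned in the first consideration, if continuous_lst and discrete_lst hold column names (String) rather than Column objects, a hedged sketch could compute each mean up front as a scalar and build the expressions with col():

import org.apache.spark.sql.functions.{coalesce, col, lit, mean}

val new_continuous_cols = continuous_lst.map { name =>
  // Compute the mean once as a scalar, then use it as the fallback value.
  val avg = table.select(mean(col(name))).first().getDouble(0)
  coalesce(col(name), lit(avg)).as(name)
}
val new_discrete_cols = discrete_lst.map { name =>
  coalesce(col(name), lit(most_frequent(table, col(name)))).as(name)
}
val newDF = table.select((new_continuous_cols ++ new_discrete_cols): _*)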

duplicated rows in RDD

I encountered the following problem in spark:
...
while(...){
key = filtersIterator.next()
pricesOverLeadTimesKeyFiltered = pricesOverLeadTimesFilteredMap_cached
.filter(x => x._1.equals(key))
.values
resultSRB = processLinearBreakevens(pricesOverLeadTimesKeyFiltered)
resultsSRB = resultsSRB.union(resultSRB)
}
....
This way, I accumulate the same resultSRB in resultsSRB over and over.
But here are "some" tricks that allow me to add a different/correct resultSRB for each iteration:
call resultSRB.collect() or resultSRB.foreach(println) or println(resultSRB.count) after each processLinearBreakevens(..) call
perform the same operation on pricesOverLeadTimesKeyFiltered at the beginning of processLinearBreakevens(..)
It seems I need to ensure that all operations are "flushed" before performing the union. I have already tried doing the union through a temporary variable, persisting resultSRB, and persisting pricesOverLeadTimesKeyFiltered, but the problem remains.
Could you help me please?
Michael
If my assumption is correct, that all of these are var, then the problem is closures. key needs to be a val because it is captured lazily in your filter, so when the RDD is finally acted on, the filtering always uses the last state of key.
My example:
def process(filtered : RDD[Int]) = filtered.map(x=> x+1)
var i = 1
var key = 1
var filtered = sc.parallelize(List[Int]())
var result = sc.parallelize(List[Int]())
var results = sc.parallelize(List[Int]())
val cached = sc.parallelize(1 to 1000).map(x=>(x, x)).persist
while(i <= 3){
key = i * 10
val filtered = cached
.filter(x => x._1.equals(key))
.values
val result = process(filtered)
results = results.union(result)
i = i + 1
}
results.collect
//expect 11,21,31 but get 31, 31, 31
To fix it, make key a val inside the while loop and you will get the expected 11, 21, 31, as in the sketch below.
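A minimal sketch of the corrected loop (same setup as above, with key now a val declared inside the loop body so each closure captures that iteration's value):

while (i <= 3) {
  val key = i * 10 // val: the filter closure now captures this iteration's value
  val filtered = cached
    .filter(x => x._1.equals(key))
    .values
  results = results.union(process(filtered))
  i = i + 1
}
results.collect // 11, 21, 31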

Writing to hadoop distributed file system multiple times with Spark

I've created a Spark job that reads in a text file every day from my HDFS and extracts unique keys from each line in the file. There are roughly 50000 keys in each text file. The same data is then filtered by the extracted key and saved to HDFS.
I want to create a directory in my HDFS with the structure hdfs://.../date/key that contains the filtered data. The problem is that writing to HDFS takes a very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile("hdfs://...", 2)
val keys = extractKey(inputData) //keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) //cleaned data is an RDD of strings
keys.map(key => {
val filteredData = cleanedData.filter(line => line.contains(key))
filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)
// needs: scala.collection.mutable, java.io.{BufferedWriter, OutputStreamWriter},
//        org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path}
def writeLines(iterator: Iterator[(String, String)]) = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val item = iterator.next()
      val key = item._1
      val line = item._2
      val writer = writers.get(key) match {
        case Some(writer) => writer
        case None =>
          val path = args(1) + key // base output path taken from the application arguments
          val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
          val newWriter = new BufferedWriter(new OutputStreamWriter(outputStream))
          writers.put(key, newWriter) // remember the writer for this key
          newWriter
      }
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}
val inputData = sc.textFile()
val keyValue = inputData.map(line => (key, line)) // derive `key` from each line here
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)
class MyPartitioner(partitions: Int) extends Partitioner {
override def numPartitions: Int = partitions
override def getPartition(key: Any): Int = {
// make sure lines with the same key in the same partition
(key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
}
}
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The partition number has nothing to do with the directory number. To implement it, you may need to override generateFileNameForKeyValue with your customized version to save to different directories, as sketched below.
Regarding scalability, it is not an issue of Spark but of HDFS. No matter how you implement it, as long as the requirement does not change, it is unavoidable. But I think HDFS is probably OK with 50,000 file handles.
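A hedged sketch of that idea, following the pattern from the linked answer (the class name and output path here are placeholders): subclass Hadoop's MultipleTextOutputFormat and derive the output location from the key.

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Write only the value into the file, not the key.
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  // Place each record under a directory named after its key, e.g. <outputPath>/<key>/part-00000.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

keyValue
  .partitionBy(new MyPartitioner(10))
  .saveAsHadoopFile("hdfs://.../date", classOf[String], classOf[String],
    classOf[RDDMultipleTextOutputFormat])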
You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories, but that is really what is causing the bottleneck here. Is there perhaps another way to architect this that simply lets you read (key, value) results?
